
Monitoring Elasticsearch OutOfMemory errors with Nagios (NRPE)

Posted at 2015-03-10

Elasticsearch was throwing OutOfMemoryError frequently, so I decided it was time to start monitoring elasticsearch.log. These are my notes from setting that up in a Nagios environment.

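A nice plugin called check_log3 makes it easy to monitor a file for specific strings.
Note: I first considered doing this with fluentd, but the fluentd aggregation server and Elasticsearch share the same machine, so this time I used Nagios, which runs in a separate environment.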

check_log3 options

Basically, you just assemble a command line from the options below, which is pleasantly intuitive.

# Grab options from command line
GetOptions (
        "l|logfile=s"           => \$log_file,
        "m|log-pattern=s"       => \$log_pattern,
        "t|log-select=s"        => \$log_select,
        "s|seekfile=s"          => \$seek_file,
        "p|pattern=s"           => \@patterns,
        "P|patternfile=s"       => \$pattern_file,
        "n|negpattern=s"        => \@negpatterns,
        "f|negpatternfile=s"    => \$negpatternfile,
        "w|warning=s"           => \$warning,
        "c|critical=s"          => \$critical,
        "i|case-insensitive"    => \$case_insensitive,
        "d|nodiff-warn"         => \$diff_warn,
        "D|nodiff-crit"         => \$diff_crit,
        "e|parse=s"             => \$parse_pattern,
        "E|parsefile=s"         => \$parse_file,
        "a|output-all"          => \$output_all,
        "C|context=s"           => \$context,
        "1|stop-first-match"    => \$stop_first_match,
        "report-first-only"     => \$report_first_only,
        "negate"                => \$negate,
        "ok"                    => \$always_ok,
        "missing=s"             => \$missing,
        "missing-ok"            => \$missing_ok,
        "missing-msg=s"         => \$missing_msg,
        "no-timeout"            => \$no_timeout,
        "timestamp=s"           => \$timestamp,
        "quiet"                 => \$quiet,
        "v|version"             => \$version,
        "h|help"                => \$help,
        "debug"                 => \$debug,
);

Example: detecting the string OutOfMemory in elasticsearch.log

Client side

Define the command name and the command line to execute.

/usr/local/nagios/etc/nrpe.cfg

command[check_es_outofmem]=/usr/local/nagios/libexec/check_log3 -l /var/log/elasticsearch/elasticsearch.log -P /usr/local/nagios/etc/elasticsearch.txt -c 1 -n nrpe -s /tmp/es_outofmem_seek

Define the strings to monitor.
Note: any string added to this file is picked up from the next check onward.

/usr/local/nagios/etc/elasticsearch.txt

OutOfMemory
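
To make the roles of the pattern file (-P), the seek file (-s) and the critical threshold (-c) more concrete, here is a minimal Python sketch of the general idea: remember the byte offset reached on the previous check, scan only what has been appended since, and report CRITICAL once the number of matching lines reaches the threshold. This is not check_log3's actual implementation (the plugin itself is written in Perl). The names `load_patterns` and `check_log` are hypothetical, and restarting from offset 0 when the log has shrunk is an assumption about how rotation could be handled.

```python
import os
import re

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def load_patterns(pattern_file):
    """Read one regex per line, skipping blank lines.

    The file is re-read on every check, so a newly added string
    takes effect on the next run.
    """
    with open(pattern_file, encoding="utf-8") as f:
        return [re.compile(line.strip()) for line in f if line.strip()]


def check_log(log_file, pattern_file, seek_file, critical=1, neg_patterns=()):
    """Scan only the part of log_file appended since the previous call."""
    patterns = load_patterns(pattern_file)
    negs = [re.compile(p) for p in neg_patterns]

    # Offset saved by the previous run; 0 on the very first run.
    offset = 0
    if os.path.exists(seek_file):
        with open(seek_file) as f:
            offset = int(f.read().strip() or 0)

    # Assumption: if the log is now shorter than the saved offset
    # (for example after rotation), start again from the beginning.
    if os.path.getsize(log_file) < offset:
        offset = 0

    matches = []
    with open(log_file, "rb") as f:
        f.seek(offset)
        for raw in f:
            line = raw.decode("utf-8", errors="replace").rstrip("\n")
            if any(n.search(line) for n in negs):
                continue  # excluded by a negative pattern
            if any(p.search(line) for p in patterns):
                matches.append(line)
        new_offset = f.tell()

    # Persist the new offset for the next check.
    with open(seek_file, "w") as f:
        f.write(str(new_offset))

    status = CRITICAL if len(matches) >= critical else OK
    return status, matches
```

With the settings from the nrpe.cfg line above, the equivalent call would be roughly `check_log("/var/log/elasticsearch/elasticsearch.log", "/usr/local/nagios/etc/elasticsearch.txt", "/tmp/es_outofmem_seek", critical=1, neg_patterns=["nrpe"])`, reading -n as an exclusion pattern per the negpattern option in the list.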

Monitoring server side

/usr/local/nagios/etc/servers/mgmt.cfg

define service{
        use                     XXX
        host_name               XXX
        service_description     RES_ES_OUTOFMEM
        check_command           check_nrpe!check_es_outofmem
        }

How to apply

Nagios server side

$ sudo service nagios reload

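Note: the client side was applied automatically in my setup (monitoring starts the next time the monitoring server runs the check). This depends on how NRPE is started: when it runs as a standalone daemon, changes to nrpe.cfg typically require restarting or reloading NRPE.

When an error is detected, an email notification is sent. (A bit legacy, but let's leave that aside.)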

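An error I ran into

The following is copied from a very clear explanation of the error.

The check returns UNKNOWN with "NRPE: Unable to read output".

Cause: NRPE tried to run the command line defined in nrpe.cfg on the remote server, but the command did not exist, so execution failed with an error at the shell level on the remote server.
Fix: open nrpe.cfg on the remote server and review the command definition; if necessary, run the command line directly in a shell.

In my case, I had assembled the command in nrpe.cfg incorrectly.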
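
Since this error comes down to a command line in nrpe.cfg that cannot be run, a quick pre-check can catch the simplest cases before Nagios does. The following is a minimal sketch and not part of NRPE or Nagios: the helper name `find_broken_commands` is hypothetical, and it only checks that the first token of each `command[...]` definition is an existing, executable file. It does not validate the plugin's options, so a mistake in the options (as in my case) would still need to be found by running the command by hand.

```python
import os
import re
import shlex

# Matches lines such as: command[check_es_outofmem]=/path/to/plugin -l ...
# Commented-out lines start with '#', so they never match.
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+)$")


def find_broken_commands(nrpe_cfg_text):
    """Return {command_name: reason} for definitions whose executable
    is missing or not executable."""
    broken = {}
    for line in nrpe_cfg_text.splitlines():
        m = COMMAND_RE.match(line)
        if not m:
            continue
        name, cmdline = m.group(1), m.group(2)
        try:
            argv = shlex.split(cmdline)
        except ValueError as e:  # e.g. unbalanced quotes
            broken[name] = f"cannot parse command line: {e}"
            continue
        if not argv:
            broken[name] = "empty command line"
        elif not os.path.isfile(argv[0]):
            broken[name] = f"no such file: {argv[0]}"
        elif not os.access(argv[0], os.X_OK):
            broken[name] = f"not executable: {argv[0]}"
    return broken
```

Passing the contents of /usr/local/nagios/etc/nrpe.cfg to this function on the remote server returns an empty dict when every defined executable is present. Permissions are checked for the user running the script, so it is most meaningful when run as the same user NRPE runs as.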

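So far I have only been doing process monitoring for Elasticsearch, so I want to use this as an opportunity to monitor it properly.
That said, I haven't run into many errors yet, so I wonder whether there are other strings I should be watching for.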
