
Sending application logs to S3 with Fluentd

Fluentd

Last updated at 2016-06-09, posted at 2015-10-21

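Purpose

Prevent log loss and build a log collection platform with high integrity.

Method

This time, without an Aggregator or Processor, a single node uploads arbitrary application logs to S3 once a day.

Environment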

Amazon Linux AMI 2015.09 (HVM)
td-agent-2.2.1-0.el2015.x86_64
- fluent-plugin-s3
- fluent-plugin-forest
rubygems-2.0-0.3.amzn1.noarch

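Setup

Prerequisites

You have IAM credentials that are allowed to upload to S3.
The S3 layout is Bucket/Folder1/Folder2/Folder3/Object.
- Bucket is arbitrary; Folder1 is an arbitrary application name, created in advance
- Folder2 is the log type
- Folder3 is the node name, registered automatically
- Object is named filename_yyyymmdd_index.gz
  - index is a numeric suffix: when the data exceeds buffer_chunk_limit and cannot be uploaded in one go, it is incremented as 0, 1, 2 and appended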
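
As a rough illustration of the layout above, the object key can be assembled like this. The helper name s3_object_key and its argument order are hypothetical, chosen only for this sketch; Fluentd builds the real key from the path and time_slice_format settings shown later.

```shell
# Hypothetical helper: assemble the object key described in the prerequisites.
# Arguments: app name, log type, node name, file name, date (yyyymmdd), index.
s3_object_key() {
  printf '%s/%s/%s/%s_%s_%s.gz\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

s3_object_key httpd access_log host-name access_log 20151014 0
# prints httpd/access_log/host-name/access_log_20151014_0.gz
```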

File descriptors

[From the Fluentd documentation]

If your console shows 1024, it is insufficient. Please add following lines to your /etc/security/limits.conf file and reboot your machine.

root soft nofile 65536
root hard nofile 65536
* soft nofile 65536
* hard nofile 65536

From td-agent 1.1.20 onward the init.d script sets this limit dynamically, so no separate setting is needed for the daemon itself.

/opt/td-agent/etc/init.d/td-agent

(snip)
#
# Function that starts the daemon/service
#
do_start() {
  # Set Max number of file descriptors for the safety sake
  # see http://docs.fluentd.org/en/articles/before-install
  ulimit -n 65536 1>/dev/null 2>&1 || true
  local RETVAL=0
  daemon --pidfile="${TD_AGENT_PID_FILE}" ${START_STOP_DAEMON_ARGS} "${TD_AGENT_RUBY}" ${TD_AGENT_ARGS} || RETVAL="$?"
  [ $RETVAL -eq 0 ] && touch "${TD_AGENT_LOCK_FILE}"
  return $RETVAL
}
(snip)

Check

$ ulimit -n
1024

Configure

$ cd /etc/security
$ sudo cp limits.conf limits.conf.origin
$ sudo vim limits.conf
# Append
root soft nofile 65536
root hard nofile 65536
* soft nofile 65536
* hard nofile 65536

# Apply the change
$ sudo shutdown -r now

Verify

$ ulimit -n
65536
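
The before/after check can also be scripted. This is a minimal sketch; the function name check_nofile is made up for illustration, and the 65536 threshold is the value recommended in the quote above.

```shell
# Hypothetical helper: compare a soft limit against a required minimum.
# $1 = current limit as reported by "ulimit -n", $2 = required minimum.
check_nofile() {
  if [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]; then
    echo ok
  else
    echo insufficient
  fi
}

# Check the limit of the current shell.
check_nofile "$(ulimit -n)" 65536
```

Note that this reports the limit of the shell that runs it; the daemon's own limit comes from the ulimit call in its init script.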

Network kernel parameter tuning

[From the Fluentd documentation]

For high load environments consisting of many Fluentd instances, please add these parameters to your /etc/sysctl.conf file.

net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535

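Configure

Enabling net.ipv4.tcp_tw_recycle has been reported to cause connection failures, and the load can be handled by scaling up instead, so I do not enable these settings.
The Fluentd documentation also carries a caveat: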

If your environment doesn’t have a problem with TCP_WAIT, then these changes are not needed.
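
To judge whether TIME_WAIT is actually a problem on a node, one option is to count the sockets in that state. The sketch below assumes the ss command from iproute2 is installed and that its first output column is the socket state; count_time_wait is a hypothetical helper name.

```shell
# Hypothetical helper: count lines whose first column is TIME-WAIT.
# Reads "ss -tan" style output on stdin.
count_time_wait() {
  awk '$1 == "TIME-WAIT" { n++ } END { print n + 0 }'
}

# Only run against the live system when ss is available.
if command -v ss >/dev/null 2>&1; then
  ss -tan | count_time_wait
fi
```

A count that stays small suggests the sysctl changes above are not needed.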

Installation

rubygems

$ sudo yum install rubygems

Fluentd

$ curl -L https://toolbelt.treasuredata.com/sh/install-redhat-td-agent2.sh | sh

Start and enable the service

$ sudo service td-agent start
Starting td-agent:                                         [  OK  ]
$ sudo chkconfig td-agent on

Verify

$ ps aux | grep td-agent
td-agent  2485  0.0  2.1 216404 22384 ?        Sl   13:03   0:00 /opt/td-agent/embedded/bin/ruby /usr/sbin/td-agent --log /var/log/td-agent/td-agent.log --use-v1-config --group td-agent --daemon /var/run/td-agent/td-agent.pid
td-agent  2488  0.2  3.4 242124 35492 ?        Sl   13:03   0:00 /opt/td-agent/embedded/bin/ruby /usr/sbin/td-agent --log /var/log/td-agent/td-agent.log --use-v1-config --group td-agent --daemon /var/run/td-agent/td-agent.pid

$ sudo chkconfig --list | grep td-agent
td-agent       	0:off	1:off	2:on	3:on	4:on	5:on	6:off

Plugin

fluent-plugin-s3
fluent-plugin-forest
- An excellent plugin that lets you configure other plugins dynamically using placeholders such as the tag

# Upload to S3
$ sudo /usr/sbin/td-agent-gem install fluent-plugin-s3
# Placeholder support
$ sudo /usr/sbin/td-agent-gem install fluent-plugin-forest

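Configuration

Run the td-agent daemon as root so that it can read log files owned by root, which the default td-agent user cannot.
- Changing the permissions of the log files is another option

Write the following in /etc/init.d/td-agent: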

# TD_AGENT_USER=td-agent
TD_AGENT_USER=root

Config File Location
/etc/td-agent/td-agent.conf
- The file is re-read by the LSB reload action

$ sudo /etc/init.d/td-agent reload

tail Input Plugin

The in_tail Input plugin allows Fluentd to read events from the tail of text files. Its behavior is similar to the tail -F command.

The "*" wildcard can be used in path:

<source>
path /var/log/app/*.log
</source>

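When there are several logs under /var/log/app/, a wildcard path like the one above follows all of them.
- A strftime-style pattern such as %Y%m%d%H%M can also be used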

<source>
path /var/log/app/%Y%m%d%H%M.log
</source>
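
To see which file name a strftime-style path points at right now, the same format characters can be passed to date. This is only a sketch for checking a pattern by hand; expand_tail_path is a hypothetical helper, not something Fluentd provides.

```shell
# Hypothetical helper: expand strftime characters in a path using the current time.
expand_tail_path() {
  date "+$1"
}

expand_tail_path '/var/log/app/%Y%m%d%H%M.log'
```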

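system directive
- Output logs at info level and above (info, warn, error, fatal)
- The following log output is not suppressed:
  - Repeated output of the same error
  - Output of the same error within a specified interval
  - The dump of the configuration file to the log at startup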

<system>
log_level info
</system>

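Use a disk (file) buffer rather than a memory buffer
- Even if the process dies unexpectedly, processing resumes after a restart
- Set flush_at_shutdown true so that queued chunks are forcibly flushed when the td-agent process stops
Use the pos_file option to prevent log loss
- The inode number and offset are saved, so processing continues from where it left off after Fluentd restarts
Use tail as the Input plugin
- The parsers preset in the tail plugin are not used (format none), for these reasons:
  - If an application has customized its log format, lines do not match the preset pattern and are not collected
  - It reduces the parsing load at collection time
    - The official Fluentd guidance is to parse not on the collecting node but to gather logs on an Aggregator node and parse them on a Processor node
    - Skipping the parse is faster than parsing, which in turn lowers CPU cost

Collect the Apache access_log and error_log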
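
The saved position can be inspected by hand. The sketch below assumes each pos_file line holds the log path, the offset, and the inode number separated by tabs, with the two numbers written in hexadecimal; that layout is an assumption about in_tail's file format, and decode_pos_file is a hypothetical helper name.

```shell
# Hypothetical helper: print pos_file entries with decimal offset and inode.
# Assumed line format: <path> TAB <offset in hex> TAB <inode in hex>
decode_pos_file() {
  while IFS="$(printf '\t')" read -r log_path offset inode; do
    printf '%s offset=%d inode=%d\n' "$log_path" "0x$offset" "0x$inode"
  done
}

# Path taken from the configuration in this article.
pos_file=/var/log/td-agent/httpd.access.pos
if [ -r "$pos_file" ]; then
  decode_pos_file < "$pos_file"
fi
```

Comparing the printed inode with the output of ls -i on the log file shows whether Fluentd is still following the current file.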

td-agent.conf

# Apache
<source>
 type tail
 format none
 path /var/log/httpd/access_log
 pos_file /var/log/td-agent/httpd.access.pos
 tag td.httpd.access
</source>

<source>
 type tail
 format none
 path /var/log/httpd/error_log
 pos_file /var/log/td-agent/httpd.error.pos
 tag td.httpd.error
</source>

<match td.httpd.**>
 type forest
 subtype s3
 <template>
  aws_key_id ***************
  aws_sec_key ***************
  s3_bucket ******
  s3_region ap-northeast-1
  s3_endpoint s3-ap-northeast-1.amazonaws.com
 </template>
  <case *.httpd.access>
   buffer_type file
   buffer_path /var/log/td-agent/buffer/httpd.access
   path httpd/access_log/${hostname}/access_log_
   time_slice_format %Y%m%d
   time_slice_wait 10m
   flush_at_shutdown true
  </case>
  <case *.httpd.error>
   buffer_type file
   buffer_path /var/log/td-agent/buffer/httpd.error
   path httpd/error_log/${hostname}/error_log_
   time_slice_format %Y%m%d
   time_slice_wait 10m
   flush_at_shutdown true
  </case>
</match>

On S3 the object is written as bucket/httpd/access_log/host-name/access_log_20151014_0.gz.
To put the collection date into the folder hierarchy, as in bucket/20151014/httpd/access_log/host-name/access_log_20151014_0.gz, write the path as follows:

<match>
 path %Y%m%d/httpd/access_log/${hostname}/access_log_
</match>

Some sources use time_slice_format to build the date folder, but time_slice_format is meant for naming the file, so for readability I do not use it for the folder.

Verify

$ ll /var/log/td-agent | grep httpd.*.pos
-rw-r--r-- 1 td-agent td-agent      httpd.access.pos
-rw-r--r-- 1 td-agent td-agent      httpd.error.pos

$ ll /var/log/td-agent/buffer/ | grep httpd.*
-rw-r--r-- 1 root     td-agent      httpd.access.*************.log
-rw-r--r-- 1 root     td-agent      httpd.error.*************.log

$ aws s3 ls s3://******/httpd/access_log/host-name/
2015-10-15 00:10:17        access_log_20151014_0.gz

$ aws s3 ls s3://******/httpd/error_log/host-name/
2015-10-15 00:10:16        error_log_20151014_0.gz

Supplementary notes

logrotate
- By default td-agent.log is rotated daily and 30 days of history are kept

/etc/logrotate.d/td-agent

/var/log/td-agent/td-agent.log {
  daily
  rotate 30
  compress
  delaycompress
  notifempty
  create 640 td-agent td-agent
  sharedscripts
  postrotate
    pid=/var/run/td-agent/td-agent.pid
    test -s $pid && kill -USR1 "$(cat $pid)"
  endscript
}

Why is the output timestamped 00:10:16?
time_slice_format

The time format used as part of the file name. The following characters are replaced with actual values when the file is created:
%Y: year including the century (at least 4 digits)
%m: month of the year (01..12)
%d: Day of the month (01..31)
%H: Hour of the day, 24-hour clock (00..23)
%M: Minute of the hour (00..59)
%S: Second of the minute (00..60)
The default format is %Y%m%d%H, which creates one file per hour.

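time_slice_format %Y%m%d
- The match directive defines time-based buffer management with daily slices, so the slice for a given day closes at 00:00.
time_slice_wait 10m
- The wait time before a closed slice is written out, to allow for late logs
  (logs stamped after 00:00 go into the new buffer; this covers logs stamped before 00:00 that arrive after 00:00)
- The chunk is therefore uploaded about 10 minutes after midnight, which is why the objects are timestamped around 00:10.
buffer_queue_limit, buffer_chunk_limit: for time-sliced output (time_slice_format) the chunk size limit defaults to 256 MB
- Adjust these to the requirements of your site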
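
The timing can be worked out as slice end plus time_slice_wait. A small sketch for a daily slice that closes at 00:00; earliest_flush is a hypothetical helper name, and the actual upload lands a few seconds later than the computed time.

```shell
# Hypothetical helper: earliest flush time for a daily slice closing at 00:00.
# $1 = time_slice_wait in minutes.
earliest_flush() {
  secs=$(( $1 * 60 ))
  printf '%02d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

earliest_flush 10
# prints 00:10:00
```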

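The official documentation's buffer overview explains the buffer mechanism clearly.

As the number of buffer definitions in td-agent.conf grows, restarting the service takes longer and the daemon may fail to stop cleanly within the timeout, so the stop timeout needs to be reviewed.
- In my environment a stop took about 04:40 on average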

/etc/init.d/td-agent

# timeout can be overridden from /etc/sysconfig/td-agent
# STOPTIMEOUT=120
STOPTIMEOUT=300
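
As a quick sanity check of the chosen value, the measured stop time can be converted to seconds and compared with the timeout. The helper name mmss_to_seconds is hypothetical; 04:40 and 300 are the figures from this article.

```shell
# Hypothetical helper: convert "MM:SS" to seconds.
mmss_to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

stop_seconds=$(mmss_to_seconds 04:40)
stop_timeout=300
if [ "$stop_seconds" -lt "$stop_timeout" ]; then
  echo "${stop_seconds}s fits within STOPTIMEOUT=${stop_timeout}"
else
  echo "${stop_seconds}s exceeds STOPTIMEOUT=${stop_timeout}"
fi
# prints 280s fits within STOPTIMEOUT=300
```

The average of 280 seconds exceeds the default of 120 and leaves only a 20 second margin under 300, so a larger value may be worth considering if stop times vary.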
