
Setting the table dynamically when using fluent-plugin-bigquery

Posted at 2014-06-10

Background

Just as I was about to migrate our log analysis platform, the following article appeared.
"Why Fluentd users no longer have a reason not to use BigQuery, Google's crown jewel" #gcpja

BigQuery also offers an API for Google Apps Script, which has many users inside our company, so I felt I had to evaluate it.

This article was extremely helpful for the evaluation (thank you!):
"Inserting logs into Google BigQuery with Fluentd and running queries"

And then I ran into a problem.
With fluent-plugin-bigquery alone, the destination BigQuery table cannot be changed dynamically.

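According to the BigQuery pricing page, each query is billed by the amount of data in the tables it reads. Records also cannot be deleted, so I concluded that the destination table has to be switched periodically to keep the tables being queried from growing too large.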

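Last week I attended the Osaka session of Cloud Platform Developer Roadshow 2014, where I asked kazunori279 about this during the Q&A and talked at the reception with several people who had the same concern. My first thought was to file an issue, but the discussion there is in English...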

Thinking it over, though, I realized that the placeholders of fluent-plugin-forest might be enough to achieve this. What follows is the result of testing that idea.

Results

In short, it worked.
The test environment was Amazon Linux AMI 2014.03, with fluentd installed as td-agent via yum.

Two plugins make this possible: record_reformer and forest, both of which appear in the configuration below.

The example below stores the Apache access_log in BigQuery, splitting the table by month. It also backs the logs up to Amazon S3.

<source>
  type tail
  format apache
  time_format %d/%b/%Y:%H:%M:%S %z
  path /var/log/httpd/access_log
  tag td.apache_access_log
  pos_file /var/lib/fluent/td.apache_access_log.pos
</source>

...

<match td.apache_access_log>
  type record_reformer
  enable_ruby true
  tag ${tag}.${time.strftime('%Y%m')}
  <record>
    input_tag    ${tag}
    port         ${port}
    host         ${host}
    user         ${user}
    method       ${method}
    path         ${path}
    http_version ${http_version}
    code         ${code}
    size         ${size}
    referer      ${referer}
    agent        ${agent}
  </record>
</match>

<match td.**>
  type forest
  subtype copy
  remove_prefix td
  <template>
    # Amazon S3
    <store>
      type s3
      s3_bucket s3-bucket
      path /
      buffer_path /var/log/td-agent/buffer/s3.${tag}.*.buffer
      time_slice_format ${tag_parts[0]}/%Y/%Y%m%d/${tag}-%Y%m%d-%H
    </store>
  </template>
  # Google BigQuery
  <case apache_access_log.**>
    <store>
      type bigquery
      method insert
      auth_method private_key
      email XXXXXXXXXXXX@developer.gserviceaccount.com
      private_key_path /etc/td-agent/XXXXXXXXXXXX-privatekey.p12
      project your-project-id
      dataset your_dataset
      table apache_access_log_${tag_parts[-1]}
      time_format %s
      time_field time
      schema_path /etc/td-agent/apache_access_log_schema.json
    </store>
  </case>
</match>

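There are two key points:

In record_reformer, set enable_ruby true and append the year and month to the end of the tag with tag ${tag}.${time.strftime('%Y%m')}.
In the bigquery store, set table apache_access_log_${tag_parts[-1]} so that the destination table is split by month automatically.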
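
To make the two placeholder expansions concrete, here is a minimal Ruby sketch of what they evaluate to. It is not plugin code: monthly_tag and table_name are hypothetical helpers written for illustration, the sample time is fixed in UTC, and the prefix handling only imitates the effect of remove_prefix td.

```ruby
# Hypothetical helpers that mimic the two placeholder expansions.
# They are illustrations only, not part of any Fluentd plugin.

# record_reformer step: append the event's year and month to the tag,
# as in  tag ${tag}.${time.strftime('%Y%m')}
def monthly_tag(tag, time)
  "#{tag}.#{time.strftime('%Y%m')}"
end

# forest step: drop the leading "td." (imitating remove_prefix td),
# then use the last tag part, as in  table apache_access_log_${tag_parts[-1]}
def table_name(tag, prefix = 'td')
  stripped = tag.sub(/\A#{Regexp.escape(prefix)}\./, '')
  tag_parts = stripped.split('.')
  "apache_access_log_#{tag_parts[-1]}"
end

event_time = Time.utc(2014, 6, 10, 12, 0, 0) # sample event time (assumption)
tag = monthly_tag('td.apache_access_log', event_time)
puts tag             # td.apache_access_log.201406
puts table_name(tag) # apache_access_log_201406
```

Because the suffix comes from the event time, a record logged in July would get the tag suffix 201407 and go to apache_access_log_201407.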

Open issues

The BigQuery tables are not created for you, so a separate mechanism is needed to create them.
Setting enable_ruby true in record_reformer appears to be discouraged:

but, please note that enabling ruby codes is not encouraged by security reasons and also in terms of the performance.

So it would be great if fluent-plugin-bigquery itself supported splitting tables.
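
As for the first open issue, one possible shape for the separate table-creation mechanism is a small script run from cron before each month begins. The sketch below is only an assumption about how that could look: next_month_table and bq_mk_command are hypothetical helpers, it assumes the bq command-line tool is installed and that the schema file used by the plugin is also accepted by bq mk, and it only prints the command rather than running it.

```ruby
require 'date'

# Hypothetical helper: name of the table for the month after `today`.
def next_month_table(today, base = 'apache_access_log')
  "#{base}_#{today.next_month.strftime('%Y%m')}"
end

# Build (but do not run) a bq command that would create that table.
# The project, dataset and schema path are the placeholder values
# from the configuration above.
def bq_mk_command(today, project: 'your-project-id', dataset: 'your_dataset',
                  schema_path: '/etc/td-agent/apache_access_log_schema.json')
  table = next_month_table(today)
  ['bq', 'mk', '--table', "#{project}:#{dataset}.#{table}", schema_path]
end

# Print the command for inspection; pass the array to system(*cmd) to run it.
puts bq_mk_command(Date.new(2014, 6, 25)).join(' ')
```

Creating the table a few days before the month changes leaves room to notice and retry a failed run before the first record with the new tag arrives.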
