Common column operations in Embulk

Embulk

Last updated at 2018-09-13, posted at 2018-09-12

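Overview

These are notes on the column operations I perform most often when using Embulk.
For now they cover the ones I use regularly and the places where I got stuck.
I do plenty of other things with it too, so I will update this when I feel like it.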

Expanding JSON

filters:
  - type: expand_json
    json_column_name: record
    expanded_columns:
      - {name: _id, type: string}
      - {name: user_id, type: long}
      - {name: created_at, type: timestamp, format: '%Y-%m-%dT%H:%M:%S.%N %z'}

Adding columns

You can add a column with a value you specify, or copy an existing column into a new one.
This uses embulk-filter-column.

filters:
  - type: expand_json
    json_column_name: record
    expanded_columns:
      - {name: _id, type: string}
      - {name: user_id, type: long}
      - {name: created_at, type: timestamp, format: '%Y-%m-%dT%H:%M:%S.%N %z'}
  - type: column
    add_columns:
      - {name: time, type: timestamp, default: "2015-07-13", format: "%Y-%m-%d"}
      - {name: ymd, src: created_at}
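
The two add_columns entries behave differently: one adds a constant, the other copies an existing column. The Python sketch below mimics that for one row. The helper name add_columns and the sample row are hypothetical, and treating the default value as UTC is an assumption made for this sketch.

```python
from datetime import datetime, timezone

def add_columns(row):
    """Return a copy of row with a constant 'time' and 'ymd' copied from 'created_at'."""
    out = dict(row)
    # default: "2015-07-13" with format "%Y-%m-%d" (UTC assumed in this sketch).
    out["time"] = datetime.strptime("2015-07-13", "%Y-%m-%d").replace(
        tzinfo=timezone.utc
    )
    # src: created_at -> the existing value is copied under a new name.
    out["ymd"] = row["created_at"]
    return out

# Hypothetical row, as it might look after the expand_json step.
row = {
    "_id": "abc",
    "user_id": 42,
    "created_at": datetime(2018, 9, 12, 1, 20, 30, tzinfo=timezone.utc),
}
result = add_columns(row)
print(list(result))
```

The new columns are appended after the existing ones, which is why a later reordering step can be useful.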

Changing the format or type based on an existing column

Use typecast to change the type.
This uses embulk-filter-typecast.
In this example, the timestamp column created_at is copied to ymd, cast to string, and finally cast to long.

filters:
  - type: expand_json
    json_column_name: record
    expanded_columns:
      - {name: _id, type: string}
      - {name: user_id, type: long}
      - {name: created_at, type: timestamp, format: '%Y-%m-%dT%H:%M:%S.%N %z'}
  - type: column
    add_columns:
      - {name: ymd, src: created_at}
  - type: typecast
    columns:
      - {name: ymd, type: string, format: "%Y%m%d", timezone: "+09:00"}
  - type: typecast
    columns:
      - {name: ymd, type: long}
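
The chain of casts can be sketched in Python as follows. This only illustrates the intended result (timestamp to a "%Y%m%d" string in +09:00, then to an integer); the function name ymd_as_long is hypothetical and the code is not taken from the plugin.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))  # corresponds to timezone: "+09:00"

def ymd_as_long(created_at):
    """timestamp -> string ("%Y%m%d" rendered in +09:00) -> integer."""
    as_string = created_at.astimezone(JST).strftime("%Y%m%d")
    return int(as_string)

# 20:00 UTC on 12 September is already 13 September in +09:00.
value = ymd_as_long(datetime(2018, 9, 12, 20, 0, 0, tzinfo=timezone.utc))
print(value)
```

The timezone setting matters here: the same instant can fall on different calendar days depending on the zone used when formatting.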

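Changing the column order

This is useful, for example, when you have added a column with add_columns and want to move it to the front.
This uses embulk-filter-column.
It seems that all you need to do is list the columns in the intended order under columns in a final embulk-filter-column step.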

filters:
  - type: expand_json
    json_column_name: record
    expanded_columns:
      - {name: _id, type: string}
      - {name: user_id, type: long}
      - {name: created_at, type: timestamp}
  - type: column
    add_columns:
      - {name: ymd, src: created_at}
  - type: typecast
    columns:
      - {name: ymd, type: string, format: "%Y%m%d", timezone: "+09:00"}
  - type: typecast
    columns:
      - {name: ymd, type: long}
  - type: column
    columns:
      - {name: ymd, type: long}
      - {name: _id, type: string}
      - {name: user_id, type: long}
      - {name: created_at, type: timestamp}
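
In effect, the final column step rebuilds each row in the listed order. A minimal Python sketch of that idea follows. The helper name reorder and the sample row are hypothetical, and dropping unlisted columns reflects my understanding of the columns option rather than anything stated above.

```python
def reorder(row, order):
    """Return a new dict whose keys follow `order`; keys not listed are dropped."""
    return {name: row[name] for name in order}

# Hypothetical row after add_columns and the two typecast steps.
row = {
    "_id": "abc",
    "user_id": 42,
    "created_at": "2018-09-12 10:20:30",
    "ymd": 20180912,
}
reordered = reorder(row, ["ymd", "_id", "user_id", "created_at"])
print(list(reordered))
```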

Splitting an array inside JSON

This uses embulk-parser-jsonpath.
The example below is based on Twitter's Search API.

{
	"statuses" : [
		{
			"created_at" : "Sun Feb 25 18:11:01 +0000 2018",
			"id" : 1234,
			"id_str" : "1234",
			"text" : "おはよう",
			...
		},
		{
			"created_at" : "Sun Feb 25 18:11:01 +0000 2018",
			"id" : 2345,
			"id_str" : "2345",
			"text" : "こんにちは",
			...
		},
		...
	]
}

in:
  type: http
  url: ...
  params:
    ...
  parser:
    type: jsonpath
    root: "$.statuses"
    default_timezone: "Asia/Tokyo"
    columns:
      - {name: "created_at", type: string}
      - {name: "id", type: long}
      - {name: "id_str", type: string}
      - {name: "text", type: string}
  method: get

When this is passed to out, one record is emitted for each element of the array.
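
The same idea in plain Python, as a sketch only: select the array under statuses and turn each element into a record with the four declared columns. The response body is a trimmed, hypothetical version of the sample, and the helper name to_records is made up for illustration.

```python
import json

# Trimmed, hypothetical response body modelled on the sample.
response = """
{
  "statuses": [
    {"created_at": "Sun Feb 25 18:11:01 +0000 2018", "id": 1234,
     "id_str": "1234", "text": "おはよう"},
    {"created_at": "Sun Feb 25 18:11:01 +0000 2018", "id": 2345,
     "id_str": "2345", "text": "こんにちは"}
  ]
}
"""

COLUMNS = ["created_at", "id", "id_str", "text"]

def to_records(body):
    """Select the array at $.statuses and emit one record per element."""
    statuses = json.loads(body)["statuses"]
    # Fields missing from an element become None in this sketch.
    return [{name: status.get(name) for name in COLUMNS} for status in statuses]

records = to_records(response)
print(len(records))
```

Two array elements give two records, each with only the declared columns.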
