17
12

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 5 years have passed since last update.

AthenaでJSONにクエリを実行してみる

Last updated at Posted at 2019-02-20

JSONファイルに対してもAthenaからクエリを実行できるらしいので試してみました。
ネストされた(入れ子になった)JSONでもQueryを実行できます。

参考:JSONSerDe によるマッピングを使って,入れ子の JSON から Amazon Athena のテーブルを作成する

#テスト用データ作成
##テスト用のデータ構造

サンプル
{
	"eventType": "Send",
	"mail": {
		"timestamp": "2017-01-09T18:08:44.830Z",
		"source": "youraddress@example.com",
		"sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com",
		"sendingAccountId": "111222333",
		"messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000",
		"destination": [
			"success@simulator.amazonses.com"
		],
		"headersTruncated": false,
		"headers": [
			{
				"name": "From",
				"value": "youraddress@example.com"
			},
			{
				"name": "To",
				"value": "success@simulator.amazonses.com"
			},
			{
				"name": "Subject",
				"value": "Bounced Like a Bad Check"
			},
			{
				"name": "MIME-Version",
				"value": "1.0"
			},
			{
				"name": "Content-Type",
				"value": "text/plain; charset=UTF-8"
			},
			{
				"name": "Content-Transfer-Encoding",
				"value": "7bit"
			}
		],
		"commonHeaders": {
			"from": [
				"youraddress@example.com"
			],
			"to": [
				"success@simulator.amazonses.com"
			],
			"messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000",
			"subject": "Test"
		},
		"tags": {
			"ses:configuration-set": [
				"Firehose"
			],
			"ses:source-ip": [
				"54.55.55.55"
			],
			"ses:from-domain": [
				"amazon.com"
			],
			"ses:caller-identity": [
				"root"
			]
		}
	},
	"send": {}
}

参考URLではJSONが改行やタブなどで見やすく整形されていますが、
保存するファイルは下記「テスト用データ」のように整形されていない状態にします。
1行1レコードとして扱われるようです。ファイルは複数ファイルに分かれていても問題なく読み込めます。

※改行で見やすく整形されている場合、クエリ発行時に以下のようなエラーが発生します。

エラー例
Your query has the following error(s):

HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

This query ran against the "test-db" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 8791cb04-3272-4b5f-b427-883e455eb8a.

上記エラーはCREATE TABLEする際の以下のオプション設定で無視できるようです。
・ignore.malformed.json を true に設定する。(詳細は参考URLを確認)
参考:Amazon Athena の JSON データを読み込もうとするとエラーが発生します。

テスト用データ

以下の内容を任意のS3バケットに保存する。

テストデータ
{ "eventType": "Send", "mail": { "timestamp": "2017-01-09T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "root" ] } }, "send": {} }
{ "eventType": "Bounce", "mail": { "timestamp": "2017-01-10T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test1" ] } }, "send": {} }
{ "eventType": "Send", "mail": { "timestamp": "2017-01-13T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test2" ] } }, "send": {} }
{ "eventType": "Bounce", "mail": { "timestamp": "2017-01-14T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test3" ] } }, "send": {} }
{ "eventType": "Bounce", "mail": { "timestamp": "2017-01-15T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test4" ] } }, "send": {} }
{ "eventType": "Send", "mail": { "timestamp": "2017-01-18T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test5" ] } }, "send": {} }

#テーブル作成
以下のQueryをAthenaのマネジメントコンソールから実行する。

テーブル作成
CREATE EXTERNAL TABLE sesblog2 (
  eventType string,
  mail struct<`timestamp`:string,
              source:string,
              sourceArn:string,
              sendingAccountId:string,
              messageId:string,
              destination:string,
              headersTruncated:boolean,
              headers:array<struct<name:string,value:string>>,
              commonHeaders:struct<`from`:array<string>,to:array<string>,messageId:string,subject:string>,
              tags:struct<ses_configurationset:string,ses_source_ip:string,ses_from_domain:string,ses_caller_identity:string>
              > 
  )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  "mapping.ses_configurationset"="ses:configuration-set",
  "mapping.ses_source_ip"="ses:source-ip", 
  "mapping.ses_from_domain"="ses:from-domain", 
  "mapping.ses_caller_identity"="ses:caller-identity"
  )
LOCATION 's3://<YOUR BUCKET NAME>/<KEY>/' 

##補足説明
structでネストされたJSONの値を表現。
arrayでJSONの配列を表現。
・予約語と同一名のカラムを定義する場合にはバックォート(例:from)で囲む。
ROW FORMAT SERDEでJSON読み込み用のライブラリを指定している。
・「:」などは使用できない禁止文字はWITH SERDEPROPERTIESで別の文字で置き換えた文字列にマッピング。
LOCATION でJSONファイル保存場所を指定。
※予約語
https://prestodb.github.io/docs/current/language/types.html
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Keywords,Non-reservedKeywordsandReservedKeywords

#クエリ実行

##Preview table
Athenaのマネジメントコンソールが用意してくれているプレビュー用のクエリを実行してみます。

Preview_table
SELECT * FROM "test-db"."sesblog2" limit 10;

「*」で取得した場合、最上位階層のカラムのみ(ネストされた部分はひとまとめにされて)表示されています。

Preview_table実行結果(CSV保存した内容)
"eventtype","mail"
"Send","{timestamp=2017-01-09T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""root""]}}"
"Bounce","{timestamp=2017-01-10T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test1""]}}"
"Send","{timestamp=2017-01-13T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test2""]}}"
"Bounce","{timestamp=2017-01-14T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test3""]}}"
"Bounce","{timestamp=2017-01-15T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test4""]}}"
"Send","{timestamp=2017-01-18T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test5""]}}"

##ネストされた項目で絞り込み
ネストされたデータ項目をwhere句やカラムでも問題なく使用できます。

ネストされた項目で絞り込み
SELECT eventtype as Event,
       mail.destination as Destination, 
       mail.messageId as MessageID,
       mail.timestamp as Timestamp
FROM sesblog2
WHERE eventType = 'Bounce' and mail.timestamp like '2017-01-10%'
ネストされた項目で絞り込み実行結果(CSV保存した内容)
"Event","Destination","MessageID","Timestamp"
"Bounce","[""success@simulator.amazonses.com""]","01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000","2017-01-10T18:08:44.830Z"

##マッピングされた項目で絞り込み
マッピングした項目も問題なく使用できています。
クエリ実行時のPrestoの予約語はダブルクォートで囲む必要があります(例:"from")

マッピングされた項目で絞り込み
SELECT eventtype as Event,
         mail.timestamp as Timestamp,
         mail.tags.ses_source_ip as SourceIP,
         mail.tags.ses_caller_identity as AuthenticatedBy,
         mail.commonHeaders."from" as FromAddress,
         mail.commonHeaders.to as ToAddress
FROM sesblog2
WHERE mail.tags.ses_caller_identity = '["root"]'
マッピングされた項目で絞り込み実行結果(CSV保存した内容)
"Event","Timestamp","SourceIP","AuthenticatedBy","FromAddress","ToAddress"
"Send","2017-01-09T18:08:44.830Z","[""54.55.55.55""]","[""root""]","[youraddress@example.com]","[success@simulator.amazonses.com]"

#おまけ:hive-json-schema を使ってみる
以下の記述がありましたが、具体的な手順等はなかったのでやってみました。

hive-json-schema を使うことで,入れ子になった JSON DDL を記述することができます。

リンク先を確認したところhive-json-schemaはJava(Mavenプロジェクト)で作成されており、
使用するにはビルドする必要があります。

##mavenインストール
ビルドするためにmavenをインストールします。

Chocolateyでインストールしました。
jdkがインストールされていない場合、一緒にインストールされます。

mavenインストール
> choco install maven
Chocolatey v0.10.11
Installing the following packages:
maven
By installing you accept licenses for the packages.
Progress: Downloading jdk8 8.0.201... 100%
Progress: Downloading maven 3.6.0... 100%

jdk8 v8.0.201 [Approved]

~中略~

Do you want to run the script?([Y]es/[N]o/[P]rint): y

Downloading JDK from http://download.oracle.com/otn-pub/java/jdk/8u201-b09/42970487e3af4f5aa5bca3f542482c60/jdk-8u201-windows-x64.exe

~中略~

The install of jdk8 was successful.
 Software installed to 'C:\Program Files\Java\jdk1.8.0_201\'

maven v3.6.0 [Approved]

~中略~

Do you want to run the script?([Y]es/[N]o/[P]rint): y

Downloading Maven
  from 'https://archive.apache.org/dist/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.zip'

~中略~

The install of maven was successful.
 Software installed to 'C:\ProgramData\chocolatey\lib\maven'

Chocolatey installed 2/2 packages.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).

確認
> mvn -v
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-25T03:41:47+09:00)
Maven home: C:\ProgramData\chocolatey\lib\maven\apache-maven-3.6.0\bin\..
Java version: 1.8.0_201, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk1.8.0_201\jre
Default locale: ja_JP, platform encoding: MS932
OS name: "windows 8.1", version: "6.3", arch: "amd64", family: "windows"

hive-json-schemaのダウンロード

gitで管理されているのでgit cloneします。
gitコマンドがない場合は、直接ダウンロードしても大丈夫だと思います(試してはいない)。

hive-json-schemaのダウンロード
$ git clone https://github.com/quux00/hive-json-schema
Cloning into 'hive-json-schema'...
remote: Enumerating objects: 155, done.
remote: Total 155 (delta 0), reused 0 (delta 0), pack-reused 155
Receiving objects: 100% (155/155), 142.86 KiB | 0 bytes/s, done.
Resolving deltas: 100% (35/35), done.

##ビルド
ビルドするとtargetディレクトリの中にjarファイルが作成されます。

ビルド
$ cd hive-json-schema

$ ls
pom.xml  README.md  src/

$ mvn package
[INFO] Scanning for projects...

~中略~

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  7.541 s
[INFO] Finished at: 2019-02-15T18:39:03+09:00
[INFO] ------------------------------------------------------------------------

$ ls
pom.xml  README.md  src/  target/
$ ls target/
archive-tmp/  classes/  generated-sources/  json-hive-schema-1.0.jar  json-hive-schema-1.0-jar-with-dependencies.jar  maven-archiver/  maven-status/

##使ってみる
withなしのjarは-cpでクラスパスを指定しないといけないようなので、
json-hive-schema-1.0-jar-with-dependencies.jarを使ってみました。

###シンプルなJSONを変換してみる。
JSONファイルは整形された状態、1行の状態、どちらでも大丈夫でした。
また、複数行ある場合でも、構造が違うJSONが混じっている場合でも変換してくれました(先頭が優先されるっぽい)。

sample01.json
{
  "foo": [1, null, 3],
  "bar": {
    "hoge": [
       {"pyo": true},
       {"pyo": false}
     ]
  }
}

LOCATION がないので、そのままでは使用できませんが、いい感じに変換してくれているように見えます。
テーブル名は指定することができるようです。(以下の例は指定した場合、何も指定しないとデフォルトで「x」がテーブル名になる)

実行
$ java -jar target/json-hive-schema-1.0-jar-with-dependencies.jar sample_json/sample01.json table_name
CREATE TABLE table_name (
  bar struct<hoge:array<struct<pyo:boolean>>>,
  foo array<int>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOCATIONを追加してAthenaで実行しましたがエラーが発生しました。
CREATE EXTERNAL TABLEのみサポートしているとのことなので、Queryを変更して実行したところ問題なく作成されました。

CREATE_TABLEだとエラー
Your query has the following error(s):
Only external table creation is supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: ceeee646-1b
修正したQuery
CREATE EXTERNAL TABLE test_table (
  bar struct<hoge:array<struct<pyo:boolean>>>,
  foo array<int>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION
  's3://bucket/test'
;

###今回使用したJSONを変換してみる。

実行
$ java -jar target/json-hive-schema-1.0-jar-with-dependencies.jar sample_json/sample02.json
CREATE TABLE x (
  eventtype string,
  mail struct<commonheaders:struct<from:array<string>, messageid:string, subject:string, to:array<string>>, destination:array<string>, headers:array<struct<name:string, value:string>>, headerstruncated:boolean, messageid:string, sendingaccountid:string, source:string, sourcearn:string, tags:struct<ses:caller-identity:array<string>, ses:configuration-set:array<string>, ses:from-domain:array<string>, ses:source-ip:array<string>>, timestamp:string>,
  send struc>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

ちょっと項目が多く見ずらいので整形。

整形版生成したQuery
CREATE TABLE x (
eventtype string,
mail struct<
    commonheaders:struct<
        from:array<string>,
        messageid:string,
        subject:string,
        to:array<string>
    >,
    destination:array<string>,
    headers:array<
        struct<
            name:string,
            value:string
        >
    >,
    headerstruncated:boolean,
    messageid:string,
    sendingaccountid:string,
    source:string,
    sourcearn:string,
    tags:struct<
        ses:caller-identity:array<string>,
        ses:configuration-set:array<string>,
        ses:from-domain:array<string>,
        ses:source-ip:array<string>
    >,
    timestamp:string
>,
send struc>
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

うまく変換できない箇所がいくつかあり、使用する場合は手直しが必要そうです。
使わずに全部を手で作成するよりは良いのかな?

以下、修正が必要そうな個所をまとめて列挙。
CREATE EXTERNAL TABLE に変更する。
LOCATION の追記する。
・「:」などは使用できない禁止文字なので変更する。(WITH SERDEPROPERTIESでマッピングする)
・大文字が小文字に変換されているので直す。
・バックォート(例:from)で囲む必要がある文字列がそのまま。
"send": {}send struc>となり変換に失敗している。

以上、やってみたメモでした。

17
12
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
17
12

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?