JSONファイルに対してもAthenaからクエリを実行できるらしいので試してみました。
ネストされた(入れ子になった)JSONでもQueryを実行できます。
参考:JSONSerDe によるマッピングを使って,入れ子の JSON から Amazon Athena のテーブルを作成する
#テスト用データ作成
##テスト用のデータ構造
{
"eventType": "Send",
"mail": {
"timestamp": "2017-01-09T18:08:44.830Z",
"source": "youraddress@example.com",
"sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com",
"sendingAccountId": "111222333",
"messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000",
"destination": [
"success@simulator.amazonses.com"
],
"headersTruncated": false,
"headers": [
{
"name": "From",
"value": "youraddress@example.com"
},
{
"name": "To",
"value": "success@simulator.amazonses.com"
},
{
"name": "Subject",
"value": "Bounced Like a Bad Check"
},
{
"name": "MIME-Version",
"value": "1.0"
},
{
"name": "Content-Type",
"value": "text/plain; charset=UTF-8"
},
{
"name": "Content-Transfer-Encoding",
"value": "7bit"
}
],
"commonHeaders": {
"from": [
"youraddress@example.com"
],
"to": [
"success@simulator.amazonses.com"
],
"messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000",
"subject": "Test"
},
"tags": {
"ses:configuration-set": [
"Firehose"
],
"ses:source-ip": [
"54.55.55.55"
],
"ses:from-domain": [
"amazon.com"
],
"ses:caller-identity": [
"root"
]
}
},
"send": {}
}
参考URLではJSONが改行やタブなどで見やすく整形されていますが、
保存するファイルは下記「テスト用データ」のように整形されていない状態にします。
1行1レコードとして扱われるようです。ファイルは複数ファイルに分かれていても問題なく読み込めます。
※改行で見やすく整形されている場合、クエリ発行時に以下のようなエラーが発生します。
Your query has the following error(s):
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
This query ran against the "test-db" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 8791cb04-3272-4b5f-b427-883e455eb8a.
上記エラーはCREATE TABLE
する際の以下のオプション設定で無視できるようです。
・ignore.malformed.json を true に設定する。(詳細は参考URLを確認)
参考:Amazon Athena の JSON データを読み込もうとするとエラーが発生します。
テスト用データ
以下の内容を任意のS3バケットに保存する。
{ "eventType": "Send", "mail": { "timestamp": "2017-01-09T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "root" ] } }, "send": {} }
{ "eventType": "Bounce", "mail": { "timestamp": "2017-01-10T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test1" ] } }, "send": {} }
{ "eventType": "Send", "mail": { "timestamp": "2017-01-13T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test2" ] } }, "send": {} }
{ "eventType": "Bounce", "mail": { "timestamp": "2017-01-14T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test3" ] } }, "send": {} }
{ "eventType": "Bounce", "mail": { "timestamp": "2017-01-15T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test4" ] } }, "send": {} }
{ "eventType": "Send", "mail": { "timestamp": "2017-01-18T18:08:44.830Z", "source": "youraddress@example.com", "sourceArn": "arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com", "sendingAccountId": "111222333", "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "destination": [ "success@simulator.amazonses.com" ], "headersTruncated": false, "headers": [ { "name": "From", "value": "youraddress@example.com" }, { "name": "To", "value": "success@simulator.amazonses.com" }, { "name": "Subject", "value": "Bounced Like a Bad Check" }, { "name": "MIME-Version", "value": "1.0" }, { "name": "Content-Type", "value": "text/plain; charset=UTF-8" }, { "name": "Content-Transfer-Encoding", "value": "7bit" } ], "commonHeaders": { "from": [ "youraddress@example.com" ], "to": [ "success@simulator.amazonses.com" ], "messageId": "01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000", "subject": "Test" }, "tags": { "ses:configuration-set": [ "Firehose" ], "ses:source-ip": [ "54.55.55.55" ], "ses:from-domain": [ "amazon.com" ], "ses:caller-identity": [ "test5" ] } }, "send": {} }
#テーブル作成
以下のQueryをAthenaのマネジメントコンソールから実行する。
CREATE EXTERNAL TABLE sesblog2 (
eventType string,
mail struct<`timestamp`:string,
source:string,
sourceArn:string,
sendingAccountId:string,
messageId:string,
destination:string,
headersTruncated:boolean,
headers:array<struct<name:string,value:string>>,
commonHeaders:struct<`from`:array<string>,to:array<string>,messageId:string,subject:string>,
tags:struct<ses_configurationset:string,ses_source_ip:string,ses_from_domain:string,ses_caller_identity:string>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.ses_configurationset"="ses:configuration-set",
"mapping.ses_source_ip"="ses:source-ip",
"mapping.ses_from_domain"="ses:from-domain",
"mapping.ses_caller_identity"="ses:caller-identity"
)
LOCATION 's3://<YOUR BUCKET NAME>/<KEY>/'
##補足説明
・struct
でネストされたJSONの値を表現。
・array
でJSONの配列を表現。
・予約語と同一名のカラムを定義する場合にはバックォート(例:from
)で囲む。
・ROW FORMAT SERDE
でJSON読み込み用のライブラリを指定している。
・「:」などは使用できない禁止文字はWITH SERDEPROPERTIES
で別の文字で置き換えた文字列にマッピング。
・LOCATION
でJSONファイル保存場所を指定。
※予約語
・https://prestodb.github.io/docs/current/language/types.html
・https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Keywords,Non-reservedKeywordsandReservedKeywords
#クエリ実行
##Preview table
Athenaのマネジメントコンソールが用意してくれているプレビュー用のクエリを実行してみます。
SELECT * FROM "test-db"."sesblog2" limit 10;
「*」で取得した場合、最上位階層のカラムのみ(ネストされた部分はひとまとめにされて)表示されています。
"eventtype","mail"
"Send","{timestamp=2017-01-09T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""root""]}}"
"Bounce","{timestamp=2017-01-10T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test1""]}}"
"Send","{timestamp=2017-01-13T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test2""]}}"
"Bounce","{timestamp=2017-01-14T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test3""]}}"
"Bounce","{timestamp=2017-01-15T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test4""]}}"
"Send","{timestamp=2017-01-18T18:08:44.830Z, source=youraddress@example.com, sourcearn=arn:aws:ses:us-west-2:111222333:identity/youraddress@example.com, sendingaccountid=111222333, messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, destination=[""success@simulator.amazonses.com""], headerstruncated=false, headers=[{name=From, value=youraddress@example.com}, {name=To, value=success@simulator.amazonses.com}, {name=Subject, value=Bounced Like a Bad Check}, {name=MIME-Version, value=1.0}, {name=Content-Type, value=text/plain; charset=UTF-8}, {name=Content-Transfer-Encoding, value=7bit}], commonheaders={from=[youraddress@example.com], to=[success@simulator.amazonses.com], messageid=01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000, subject=Test}, tags={ses_configurationset=[""Firehose""], ses_source_ip=[""54.55.55.55""], ses_from_domain=[""amazon.com""], ses_caller_identity=[""test5""]}}"
##ネストされた項目で絞り込み
ネストされたデータ項目をwhere句やカラムでも問題なく使用できます。
SELECT eventtype as Event,
mail.destination as Destination,
mail.messageId as MessageID,
mail.timestamp as Timestamp
FROM sesblog2
WHERE eventType = 'Bounce' and mail.timestamp like '2017-01-10%'
"Event","Destination","MessageID","Timestamp"
"Bounce","[""success@simulator.amazonses.com""]","01010159b2c4471e-fc6e26e2-af14-4f28-b814-69e488740023-000000","2017-01-10T18:08:44.830Z"
##マッピングされた項目で絞り込み
マッピングした項目も問題なく使用できています。
クエリ実行時のPrestoの予約語はダブルクォートで囲む必要があります(例:"from")
SELECT eventtype as Event,
mail.timestamp as Timestamp,
mail.tags.ses_source_ip as SourceIP,
mail.tags.ses_caller_identity as AuthenticatedBy,
mail.commonHeaders."from" as FromAddress,
mail.commonHeaders.to as ToAddress
FROM sesblog2
WHERE mail.tags.ses_caller_identity = '["root"]'
"Event","Timestamp","SourceIP","AuthenticatedBy","FromAddress","ToAddress"
"Send","2017-01-09T18:08:44.830Z","[""54.55.55.55""]","[""root""]","[youraddress@example.com]","[success@simulator.amazonses.com]"
#おまけ:hive-json-schema を使ってみる
以下の記述がありましたが、具体的な手順等はなかったのでやってみました。
hive-json-schema を使うことで,入れ子になった JSON DDL を記述することができます。
リンク先を確認したところhive-json-schemaはJava(Mavenプロジェクト)で作成されており、
使用するにはビルドする必要があります。
##mavenインストール
ビルドするためにmavenをインストールします。
Chocolateyでインストールしました。
jdkがインストールされていない場合、一緒にインストールされます。
> choco install maven
Chocolatey v0.10.11
Installing the following packages:
maven
By installing you accept licenses for the packages.
Progress: Downloading jdk8 8.0.201... 100%
Progress: Downloading maven 3.6.0... 100%
jdk8 v8.0.201 [Approved]
~中略~
Do you want to run the script?([Y]es/[N]o/[P]rint): y
Downloading JDK from http://download.oracle.com/otn-pub/java/jdk/8u201-b09/42970487e3af4f5aa5bca3f542482c60/jdk-8u201-windows-x64.exe
~中略~
The install of jdk8 was successful.
Software installed to 'C:\Program Files\Java\jdk1.8.0_201\'
maven v3.6.0 [Approved]
~中略~
Do you want to run the script?([Y]es/[N]o/[P]rint): y
Downloading Maven
from 'https://archive.apache.org/dist/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.zip'
~中略~
The install of maven was successful.
Software installed to 'C:\ProgramData\chocolatey\lib\maven'
Chocolatey installed 2/2 packages.
See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).
> mvn -v
Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-25T03:41:47+09:00)
Maven home: C:\ProgramData\chocolatey\lib\maven\apache-maven-3.6.0\bin\..
Java version: 1.8.0_201, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk1.8.0_201\jre
Default locale: ja_JP, platform encoding: MS932
OS name: "windows 8.1", version: "6.3", arch: "amd64", family: "windows"
hive-json-schemaのダウンロード
gitで管理されているのでgit cloneします。
gitコマンドがない場合は、直接ダウンロードしても大丈夫だと思います(試してはいない)。
$ git clone https://github.com/quux00/hive-json-schema
Cloning into 'hive-json-schema'...
remote: Enumerating objects: 155, done.
remote: Total 155 (delta 0), reused 0 (delta 0), pack-reused 155
Receiving objects: 100% (155/155), 142.86 KiB | 0 bytes/s, done.
Resolving deltas: 100% (35/35), done.
##ビルド
ビルドするとtargetディレクトリの中にjarファイルが作成されます。
$ cd hive-json-schema
$ ls
pom.xml README.md src/
$ mvn package
[INFO] Scanning for projects...
~中略~
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 7.541 s
[INFO] Finished at: 2019-02-15T18:39:03+09:00
[INFO] ------------------------------------------------------------------------
$ ls
pom.xml README.md src/ target/
$ ls target/
archive-tmp/ classes/ generated-sources/ json-hive-schema-1.0.jar json-hive-schema-1.0-jar-with-dependencies.jar maven-archiver/ maven-status/
##使ってみる
withなしのjarは-cpでクラスパスを指定しないといけないようなので、
json-hive-schema-1.0-jar-with-dependencies.jarを使ってみました。
###シンプルなJSONを変換してみる。
JSONファイルは整形された状態、1行の状態、どちらでも大丈夫でした。
また、複数行ある場合でも、構造が違うJSONが混じっている場合でも変換してくれました(先頭が優先されるっぽい)。
{
"foo": [1, null, 3],
"bar": {
"hoge": [
{"pyo": true},
{"pyo": false}
]
}
}
LOCATION がないので、そのままでは使用できませんが、いい感じに変換してくれているように見えます。
テーブル名は指定することができるようです。(以下の例は指定した場合、何も指定しないとデフォルトで「x」がテーブル名になる)
$ java -jar target/json-hive-schema-1.0-jar-with-dependencies.jar sample_json/sample01.json table_name
CREATE TABLE table_name (
bar struct<hoge:array<struct<pyo:boolean>>>,
foo array<int>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
LOCATIONを追加してAthenaで実行しましたがエラーが発生しました。
CREATE EXTERNAL TABLEのみサポートしているとのことなので、Queryを変更して実行したところ問題なく作成されました。
Your query has the following error(s):
Only external table creation is supported. (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: ceeee646-1b
CREATE EXTERNAL TABLE test_table (
bar struct<hoge:array<struct<pyo:boolean>>>,
foo array<int>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION
's3://bucket/test'
;
###今回使用したJSONを変換してみる。
$ java -jar target/json-hive-schema-1.0-jar-with-dependencies.jar sample_json/sample02.json
CREATE TABLE x (
eventtype string,
mail struct<commonheaders:struct<from:array<string>, messageid:string, subject:string, to:array<string>>, destination:array<string>, headers:array<struct<name:string, value:string>>, headerstruncated:boolean, messageid:string, sendingaccountid:string, source:string, sourcearn:string, tags:struct<ses:caller-identity:array<string>, ses:configuration-set:array<string>, ses:from-domain:array<string>, ses:source-ip:array<string>>, timestamp:string>,
send struc>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
ちょっと項目が多く見ずらいので整形。
CREATE TABLE x (
eventtype string,
mail struct<
commonheaders:struct<
from:array<string>,
messageid:string,
subject:string,
to:array<string>
>,
destination:array<string>,
headers:array<
struct<
name:string,
value:string
>
>,
headerstruncated:boolean,
messageid:string,
sendingaccountid:string,
source:string,
sourcearn:string,
tags:struct<
ses:caller-identity:array<string>,
ses:configuration-set:array<string>,
ses:from-domain:array<string>,
ses:source-ip:array<string>
>,
timestamp:string
>,
send struc>
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
うまく変換できない箇所がいくつかあり、使用する場合は手直しが必要そうです。
使わずに全部を手で作成するよりは良いのかな?
以下、修正が必要そうな個所をまとめて列挙。
・CREATE EXTERNAL TABLE
に変更する。
・LOCATION
の追記する。
・「:」などは使用できない禁止文字なので変更する。(WITH SERDEPROPERTIES
でマッピングする)
・大文字が小文字に変換されているので直す。
・バックォート(例:from)で囲む必要がある文字列がそのまま。
・"send": {}
が send struc>
となり変換に失敗している。
以上、やってみたメモでした。