概要
S3 Selectをシンプルなコードで試しました
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectSELECTContent.html
2018-04-20
Node.jsで動かそうと思ったけど、SDKが対応してなさそうなので、Pythonにします
準備
てきとうなバケットを作って、JSONを置いておきます。
name.json
{"id": "16509", "name": "hoge"}
{"id": "6217", "name": "fuga"}
{"id": "1398", "name": "piyo"}
コード
# boto3をインストール
% pip install boto3
s3select.py
# -*- coding: utf-8 -*-
import boto3
s3 = boto3.client('s3')
response = s3.select_object_content(
Bucket='<BUCKET_NAME>',
Key='name.json',
ExpressionType='SQL',
Expression='select s.id, s.name from S3Object s',
InputSerialization={
'CompressionType': 'NONE',
'JSON': {
'Type':'LINES'
}
},
OutputSerialization={
'JSON': {
'RecordDelimiter': '\n',
}
}
)
for event in response['Payload']:
if 'Records' in event:
print(event['Records']['Payload'].decode('utf-8'))
% python s3select.py
{"id":"16509","name":"hoge"}
{"id":"6217","name":"fuga"}
{"id":"1398","name":"piyo"}
Expression
のSQLだけ変えてもう少し試します
一部の列だけ持ってくる
Expression='select s.name from S3Object s',
% python ./s3select.py
{"name":"hoge"}
{"name":"fuga"}
{"name":"piyo"}
group byはできないけど、
Expression='select s.name, sum(cast(s.id as int)), count(s.id) from S3Object s group by s.name',
% python s3select.py
Traceback (most recent call last):
File "s3select.py", line 19, in <module>
'RecordDelimiter': '\n',
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 612, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnsupportedSqlOperation) when calling the SelectObjectContent operation: Unsupported SQL operation GROUP BY. Please check the service documentation for supported operations.
ちょっとした集計くらいはできる
(id
を sum
するというよくわからないことをしているところには目をつむる)
Expression='select sum(cast(s.id as int)), count(s.id) from S3Object s',
% python s3select.py
{"_1":24124,"_2":3}
まとめ
S3 Selectを使うと、selectできて便利っぽいです