Help us understand the problem. What is going on with this article?

PythonでS3 Selectを試しました。2018-04-20

More than 1 year has passed since last update.

概要

S3 Selectをシンプルなコードで試しました
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectSELECTContent.html

2018-04-20
Node.jsで動かそうと思ったけど、SDKが対応してなさそうなので、Pythonにします

準備

てきとうなバケットを作って、JSONを置いておきます。

name.json
{"id": "16509", "name": "hoge"}
{"id": "6217", "name": "fuga"}
{"id": "1398", "name": "piyo"}

コード

# boto3をインストール
% pip install boto3
s3select.py
# -*- coding: utf-8 -*-

import boto3
s3 = boto3.client('s3')

response = s3.select_object_content(
    Bucket='<BUCKET_NAME>',
    Key='name.json',
    ExpressionType='SQL',
    Expression='select s.id, s.name from S3Object s',
    InputSerialization={
        'CompressionType': 'NONE',
        'JSON': {
            'Type':'LINES'
        }
    },
    OutputSerialization={
        'JSON': {
            'RecordDelimiter': '\n',
        }
    }
)

for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))

% python s3select.py
{"id":"16509","name":"hoge"}
{"id":"6217","name":"fuga"}
{"id":"1398","name":"piyo"}

Expression のSQLだけ変えてもう少し試します

一部の列だけ持ってくる

Expression='select s.name from S3Object s',
% python ./s3select.py
{"name":"hoge"}
{"name":"fuga"}
{"name":"piyo"}

group byはできないけど、

Expression='select s.name, sum(cast(s.id as int)), count(s.id) from S3Object s group by s.name',
% python s3select.py
Traceback (most recent call last):
  File "s3select.py", line 19, in <module>
    'RecordDelimiter': '\n',
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnsupportedSqlOperation) when calling the SelectObjectContent operation: Unsupported SQL operation GROUP BY. Please check the service documentation for supported operations.

ちょっとした集計くらいはできる

idsum するというよくわからないことをしているところには目をつむる)

Expression='select sum(cast(s.id as int)), count(s.id) from S3Object s',
% python s3select.py
{"_1":24124,"_2":3}

まとめ

S3 Selectを使うと、selectできて便利っぽいです

abetomo
Why not register and get more from Qiita?
  1. We will deliver articles that match you
    By following users and tags, you can catch up information on technical fields that you are interested in as a whole
  2. you can read useful information later efficiently
    By "stocking" the articles you like, you can search right away