
How to handle an overflow when reading a file with boto3 (OverflowError: signed integer is greater than maximum)

Last updated at 2022-07-28. Posted at 2022-07-28.

Introduction

When I tried to read a CSV file stored in S3 with boto3, the following error occurred:
OverflowError: signed integer is greater than maximum

The error started appearing after I upgraded from Python 3.6 to Python 3.8.

Program that raised the error

import os
# Module for S3 operations
import boto3
from botocore.client import Config
# Modules for data handling
import pandas as pd
import numpy as np
import io
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
import time

config = Config(connect_timeout=100, read_timeout=100)

role = 'role name'
bucket = 'bucket name'
key = 'prefix name'
accesskey = 'access key'
secretkey = 'secret access key'
my_region = 'ap-northeast-1'

##################### Load the data #####################
#######################################################
s3 = boto3.resource('s3', aws_access_key_id=accesskey, aws_secret_access_key=secretkey, region_name=my_region, config=config)
response = s3.Object(bucket, key)

# Read the whole CSV content in a single read() call (this is where the error is raised)
body = response.get()['Body'].read().decode('utf-8')
# Write the CSV text into an in-memory text buffer
buffer_str = io.StringIO(body)
# Read the buffer chunk by chunk with pandas read_csv(), yielding one DataFrame per chunk
reader = pd.read_csv(buffer_str, chunksize=1000000, encoding='utf8')

# Concatenate the per-chunk DataFrames held by reader with pandas concat()
model_data = pd.concat((r for r in reader), ignore_index=True)
model_data.head()
#######################################################

Cause and workaround

According to the article quoted below, the problem is caused by a bug in Python 3.8.

the core issue in Python 3.8 is the bug with reading more than 1gb at a time. You can use a variant of the workaround suggested in the bug to read the file in chunks.

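In other words, Python 3.8 apparently has a bug that causes an overflow when more than 1 GB is read in a single call.
According to the cited article, instead of reading chunks and storing them one by one, the problem can be avoided by reading the data in chunks into a preallocated bytearray buffer through a memoryview.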
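To make the size threshold more concrete: the error text is what CPython raises when a value does not fit into a C int, so one plausible reading is that the requested read length exceeded 2**31 - 1 bytes. That limit is an assumption on my part (the quoted article says 1 GB), and the helper name `exceeds_single_read_limit` is hypothetical. A minimal sketch of such a check:

```python
# Assumed limit: the largest signed 32-bit integer (not stated in the quoted article).
INT_MAX = 2**31 - 1


def exceeds_single_read_limit(content_length: int, limit: int = INT_MAX) -> bool:
    """Return True when reading content_length bytes in one call would exceed the limit."""
    return content_length > limit


# Hypothetical object sizes, in bytes.
print(exceeds_single_read_limit(500 * 1024**2))  # 500 MiB object
print(exceeds_single_read_limit(3 * 1024**3))    # 3 GiB object
```

A value such as `response['ContentLength']` could be passed in to decide whether a chunked read is needed, although reading in chunks unconditionally is simpler.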

Fixed program

import os
# Module for S3 operations
import boto3
from botocore.client import Config
# Modules for data handling
import pandas as pd
import numpy as np
import io
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

config = Config(connect_timeout=100, read_timeout=100)

role = 'role name'
bucket = 'bucket name'
key = 'prefix name'
accesskey = 'access key'
secretkey = 'secret access key'
my_region = 'ap-northeast-1'

########### Workaround code for files of 2 GB or more starts here ###########
#######################################################
s3 = boto3.client('s3', aws_access_key_id=accesskey, aws_secret_access_key=secretkey, region_name=my_region, config=config)
response = s3.get_object(Bucket=bucket, Key=key)

# Preallocate a byte buffer with the same size as the object
buf = bytearray(response['ContentLength'])
# Get a memoryview of the buffer so chunks can be written into it in place
view = memoryview(buf)

pos = 0
while True:
    # Read at most 64 MiB (67108864 bytes) per call
    chunk = response['Body'].read(67108864)
    if len(chunk) == 0:
        break

    # Copy the chunk into the buffer at the current position
    view[pos:pos+len(chunk)] = chunk
    pos += len(chunk)
#######################################################

buffer_str = io.BytesIO(view)
buffer_str.seek(0)
model_data = pd.read_csv(buffer_str, encoding='utf8')
model_data.head()

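With the code above, I confirmed that the data can be read on both Python 3.6 and the upgraded Python 3.8.
The reference site also introduced the option of using EFS (Amazon Elastic File System) instead of storing the data in S3. I plan to choose between them depending on the cost and response speed I need.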
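The chunked read in the fixed program can also be wrapped in a small reusable function. The sketch below is my own arrangement rather than code from the cited article: the name `read_all_chunked` and its parameters are hypothetical, and an `io.BytesIO` object stands in for `response['Body']` so that it runs without S3 access.

```python
import io


def read_all_chunked(stream, total_size, chunk_size=64 * 1024 * 1024):
    """Read total_size bytes from stream into a preallocated buffer, chunk by chunk."""
    buf = bytearray(total_size)
    view = memoryview(buf)  # lets each chunk be written into buf in place
    pos = 0
    while pos < total_size:
        # Never request more than chunk_size bytes in a single read() call.
        chunk = stream.read(min(chunk_size, total_size - pos))
        if not chunk:
            break  # the stream ended earlier than announced
        view[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    if pos != total_size:
        raise IOError(f"expected {total_size} bytes, got {pos}")
    return buf


# Stand-in for response['Body']: any object with a read(n) method works here.
data = b"a,b\n1,2\n3,4\n"
result = read_all_chunked(io.BytesIO(data), len(data), chunk_size=5)
print(bytes(result) == data)
```

With boto3, the call would presumably look like `read_all_chunked(response['Body'], response['ContentLength'])`, and the returned buffer can be wrapped in `io.BytesIO` before passing it to `pd.read_csv`. Raising an error on a short read is a design choice of this sketch; the original loop simply stops.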
