Two stumbling blocks when saving in Feather format

Posted at 2021-06-16 (last updated at 2021-06-16)

0.Intro

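I was trying to solve a competition on a data analysis competition site by applying a pipeline I had picked up elsewhere.
The first thing that pipeline does is save the training and test data in Feather format, and this note covers the two points where I got stuck at that step.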

1.Issues fixed

After fixing the following two points, I was able to save the data in Feather format.

1.1.Index

The first attempt failed with the following message:

sys:1: DtypeWarning: Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
  File "scripts/convert_to_feather.py", line 58, in <module>
    train_dfs.to_feather(data_dir.joinpath("input/train.feather"))
  File "/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py", line 2215, in to_feather
    to_feather(self, path, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/feather_format.py", line 48, in to_feather
    "feather does not support serializing a non-default index for the index; "
ValueError: feather does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)

The cause was that the Feather format cannot store a DataFrame whose index is anything other than the default one.
So,

train_dfs.reset_index(inplace=True)

this single line resolved it.
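As a minimal sketch of what "non-default index" means here, the snippet below builds a small frame by concatenation and checks its index before and after a reset. The helper `has_default_index` and the sample data are hypothetical and not part of the original pipeline; the check is my reading of the error message, not pandas' exact internal test.

```python
import pandas as pd

def has_default_index(df: pd.DataFrame) -> bool:
    """True if the index is the unnamed 0..n-1 range."""
    return bool(df.index.name is None and df.index.equals(pd.RangeIndex(len(df))))

# Hypothetical data: concatenated frames keep their own 0..n-1 labels
part_a = pd.DataFrame({"price": [100, 200]})
part_b = pd.DataFrame({"price": [300, 400]})
train_dfs = pd.concat([part_a, part_b])  # index labels are 0, 1, 0, 1

print(has_default_index(train_dfs))

# drop=True discards the old labels; without it, reset_index()
# keeps them as an extra column named "index"
fixed = train_dfs.reset_index(drop=True)
print(has_default_index(fixed))
```

Note that `reset_index(inplace=True)` as used above keeps the old labels as a new column, so `drop=True` is worth considering when those labels carry no information. Passing `ignore_index=True` to `pd.concat` is another way to avoid the situation in the first place.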

1.2.Mixed types

Next, the following error appeared:

sys:1: DtypeWarning: Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
  File "scripts/convert_to_feather.py", line 58, in <module>
    train_dfs.to_feather(data_dir.joinpath("input/train.feather"))
  File "/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py", line 2215, in to_feather
    to_feather(self, path, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pandas/io/feather_format.py", line 64, in to_feather
    feather.write_feather(df, path, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/feather.py", line 151, in write_feather
    table = Table.from_pandas(df, preserve_index=False)
  File "pyarrow/table.pxi", line 1376, in pyarrow.lib.Table.from_pandas
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 593, in dataframe_to_arrays
    arrays[i] = maybe_fut.result()
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 559, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 265, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 80, in pyarrow.lib._ndarray_to_array
TypeError: an integer is required (got type str)

After a fair amount of investigation, the cause of this error turned out to be mixed types within a single column.

from pandas.api.types import infer_dtype

Using this function, I checked each column with

train_dfs.apply(infer_dtype)

which reported:

(omitted)
面積（㎡）           mixed-integer
(omitted)

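so the mixed types were flagged.
I suspect this happened because the DataFrame was built by concatenating several DataFrames.
So I aligned the type of the whole column with something like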

df_tmp["面積（㎡）"] = df_tmp["面積（㎡）"].apply(lambda x: int(x))

and then confirmed with dtypes that the column had changed from object to float64.

With that, the data was saved in Feather format without problems.
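The detection and conversion steps can be sketched together as follows. This is only an illustration under assumptions: the column names and values are made up, and `pd.to_numeric` is swapped in for the `int(x)` conversion used above because it also copes with values that cannot be parsed.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical frame: "area" holds both int and str values
df = pd.DataFrame({
    "area": [35, "40", 55, "2000"],
    "name": ["a", "b", "c", "d"],
})

# infer_dtype labels columns with more than one value type as "mixed..."
kinds = df.apply(infer_dtype)
mixed_cols = [col for col, kind in kinds.items() if kind.startswith("mixed")]
print(mixed_cols)

# Convert every flagged column to a numeric dtype;
# errors="coerce" turns unparseable values into NaN instead of raising
for col in mixed_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```

If any value is coerced to NaN, the column ends up as float64 rather than an integer dtype, so it is worth inspecting the result before saving.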

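2.Conclusion

The initial data load takes a fair amount of time. Even if it is only a few seconds per run, the number of runs is high, especially while tuning features, so any speedup large enough to notice is worth having.
While investigating, I felt it was somewhat self-defeating to have to do something like preprocessing when all I wanted was to load the data faster, but that kind of work has to be done sooner or later anyway.
I have not tried this on other data yet, but as long as you watch out for
・the index
・mixed types
as described above, converting to Feather looks straightforward.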
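The two points could also be checked in one pass before calling to_feather. The function below is a hypothetical helper, not something from the pipeline or from pandas itself, and it only reports problems without changing the data.

```python
import pandas as pd
from pandas.api.types import infer_dtype

def feather_blockers(df: pd.DataFrame) -> list:
    """Return human-readable reasons why saving df to Feather may fail."""
    problems = []
    # Check 1: the index should be the unnamed default 0..n-1 range
    if df.index.name is not None or not df.index.equals(pd.RangeIndex(len(df))):
        problems.append("non-default index")
    # Check 2: columns whose values have more than one type
    for col in df.columns:
        kind = infer_dtype(df[col])
        if kind.startswith("mixed"):
            problems.append(f"column {col!r} is {kind}")
    return problems

# Hypothetical frame that has both problems
df = pd.DataFrame({"area": [35, "40"]}, index=[10, 11])
print(feather_blockers(df))
```

An empty list means neither of the two issues in this note was detected; it does not guarantee that the save will succeed for other reasons.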
