
`pandas.read_csv` (v1.5.3): when the CSV has fewer columns than the `names` argument specifies, a CSV with only one row raises `ParserError`

Posted at 2023-03-08

Environment

  • Python 3.11.2
  • pandas 1.5.3 (the error does not occur with pandas 2.0.0)
In [37]: pandas.show_versions()
/home/vagrant/.pyenv/versions/3.11.2/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.11.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.0-60-generic
Version          : #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : ja_JP.UTF-8
LOCALE           : ja_JP.UTF-8

pandas           : 1.5.3
numpy            : 1.24.2
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 23.0.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.2
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.11.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

What I want to do

I want to read a CSV file that has no header row with the pandas.read_csv function.

input1.csv
1,Alice,Japan
2,Bob,U.S.
3,Chris,China
In [39]: df = pandas.read_csv("input1.csv", header=None, names=["id","name","country"])

In [40]: df
Out[40]: 
   id   name country
0   1  Alice   Japan
1   2    Bob    U.S.
2   3  Chris   China

In [41]: df.dtypes
Out[41]: 
id          int64
name       object
country    object
dtype: object

The country column is optional, so I also want to be able to read CSV files that have no third column.

input2.csv
1,Alice
2,Bob
3,Chris
In [43]: df = pandas.read_csv("input2.csv", header=None, names=["id","name","country"])

In [44]: df
Out[44]: 
   id   name  country
0   1  Alice      NaN
1   2    Bob      NaN
2   3  Chris      NaN

input2.csv, which has no third column, was read successfully, and the missing country column was filled with NaN.

What happened

When the data has only one row

When the CSV has no third column and contains only one row of data, ParserError: Too many columns specified: expected 3 and found 2 was raised.
The cause is unknown.

input3.csv
1,Alice
In [45]: df = pandas.read_csv("input3.csv", header=None, names=["id","name","country"])
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[45], line 1
----> 1 df = pandas.read_csv("input3.csv", header=None, names=["id","name","country"])

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/io/parsers/readers.py:950, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    946     defaults={"delimiter": ","},
    947 )
    948 kwds.update(kwds_defaults)
--> 950 return _read(filepath_or_buffer, kwds)

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/io/parsers/readers.py:611, in _read(filepath_or_buffer, kwds)
    608     return parser
    610 with parser:
--> 611     return parser.read(nrows)

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1778, in TextFileReader.read(self, nrows)
   1771 nrows = validate_integer("nrows", nrows)
   1772 try:
   1773     # error: "ParserBase" has no attribute "read"
   1774     (
   1775         index,
   1776         columns,
   1777         col_dict,
-> 1778     ) = self._engine.read(  # type: ignore[attr-defined]
   1779         nrows
   1780     )
   1781 except Exception:
   1782     self.close()

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py:230, in CParserWrapper.read(self, nrows)
    228 try:
    229     if self.low_memory:
--> 230         chunks = self._reader.read_low_memory(nrows)
    231         # destructive to chunks
    232         data = _concatenate_chunks(chunks)

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/_libs/parsers.pyx:808, in pandas._libs.parsers.TextReader.read_low_memory()

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/_libs/parsers.pyx:890, in pandas._libs.parsers.TextReader._read_rows()

File ~/.pyenv/versions/3.11.2/lib/python3.11/site-packages/pandas/_libs/parsers.pyx:952, in pandas._libs.parsers.TextReader._convert_column_data()

ParserError: Too many columns specified: expected 3 and found 2
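The two cases above can be compared side by side without creating files. The following is a minimal sketch, assuming the CSV contents are passed as in-memory strings through StringIO; the helper name try_read is hypothetical and not part of pandas. Whether the single-row case returns a DataFrame or an error message depends on the installed pandas version, so no expected output is shown.

```python
from io import StringIO

import pandas


def try_read(csv_text, names):
    """Return the parsed DataFrame, or the ParserError message if parsing fails.

    try_read is a hypothetical helper written for this comparison.
    """
    try:
        return pandas.read_csv(StringIO(csv_text), header=None, names=names)
    except pandas.errors.ParserError as exc:
        return f"ParserError: {exc}"


names = ["id", "name", "country"]

# Same contents as input2.csv (three rows) and input3.csv (one row).
multi_row = try_read("1,Alice\n2,Bob\n3,Chris\n", names)
single_row = try_read("1,Alice\n", names)

print("pandas", pandas.__version__)
print(multi_row)
print(single_row)
```

Printing the pandas version next to the results makes it easier to tell which release a given outcome belongs to.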

When header=0 is specified

Even with input2.csv, which has multiple rows, specifying header=0 so that the first row is treated as the header row raised the same ParserError.

df = pandas.read_csv("input2.csv", header=0, names=["id","name","country"])
...
ParserError: Too many columns specified: expected 3 and found 2

The pandas test code includes a test that checks this case, so this behavior appears to be intended.

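Solution

Reading the CSV without the names argument and renaming the columns afterwards solves the problem.
However, this takes more code, so it is harder to follow than passing the names argument.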

In [63]: df = pandas.read_csv("input2.csv", header=None)

In [64]: df
Out[64]: 
   0      1
0  1  Alice
1  2    Bob
2  3  Chris

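In [65]: import numpy

In [66]: def rename_columns(df):
    ...:     # Rename the positional labels in place; add "country" as NaN when the CSV had no third column.
    ...:     df.rename(columns={0:"id",1:"name",2:"country"}, inplace=True)
    ...:     if "country" not in df.columns:
    ...:         df["country"] = numpy.nan
    ...: 

In [67]: rename_columns(df)

In [68]: df
Out[68]: 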
   id   name  country
0   1  Alice      NaN
1   2    Bob      NaN
2   3  Chris      NaN
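The same workaround can be written so that it returns a new DataFrame and does not hard-code the column positions. This is a minimal sketch, assuming that missing columns are always trailing ones; the function name read_with_optional_columns is hypothetical. It relies on DataFrame.reindex to add any absent columns filled with NaN.

```python
from io import StringIO

import pandas


def read_with_optional_columns(source, names):
    """Read a headerless CSV and pad missing trailing columns with NaN.

    Hypothetical helper: `names` lists every expected column in order,
    and only trailing columns are assumed to be optional.
    """
    # Read without `names`, so the parser never compares column counts.
    df = pandas.read_csv(source, header=None)
    if df.shape[1] > len(names):
        raise ValueError(
            f"expected at most {len(names)} columns, found {df.shape[1]}"
        )
    # Label the columns that exist, then add the missing ones as NaN.
    df.columns = names[: df.shape[1]]
    return df.reindex(columns=names)


names = ["id", "name", "country"]
one_row = read_with_optional_columns(StringIO("1,Alice\n"), names)
full = read_with_optional_columns(StringIO("1,Alice,Japan\n"), names)
print(one_row)
print(full)
```

Because names is not passed to read_csv here, the single-row input goes through the same code path as the multi-row one.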

Summary

Passing the names argument to pandas.read_csv can lead to unexpected behavior that is easy to get stuck on.

Addendum

I opened an issue on pandas.
https://github.com/pandas-dev/pandas/issues/51924

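According to a comment on the issue, the error does not occur on the main branch, so it may be resolved in the next version.
Update: it was fixed in pandas 2.0.0.

I also found that specifying engine="python" avoids the error.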

In [17]: from io import StringIO

In [18]: data2="""
    ...: 1,Alice
    ...: """

In [25]: pandas.read_csv(StringIO(data2), header=None, names=["id","name","country"], engine="python")
Out[25]: 
   id   name  country
0   1  Alice      NaN

The pandas.read_csv documentation describes the "python" engine as "more feature-complete", so perhaps that is why the "python" engine did not raise the error.

Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.
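One way to use this observation while keeping the faster default engine for normal input is to fall back to engine="python" only when the default engine raises ParserError. This is a minimal sketch under that assumption; the helper name read_csv_text is hypothetical.

```python
from io import StringIO

import pandas


def read_csv_text(csv_text, names):
    """Parse headerless CSV text, retrying with the python engine on ParserError.

    Hypothetical helper: takes the CSV contents as a string so that the
    input can be re-read from the start on the second attempt.
    """
    try:
        return pandas.read_csv(StringIO(csv_text), header=None, names=names)
    except pandas.errors.ParserError:
        # Retry with the python engine, which the article found to accept
        # a single short row on pandas 1.5.3.
        return pandas.read_csv(
            StringIO(csv_text), header=None, names=names, engine="python"
        )


df = read_csv_text("1,Alice\n", ["id", "name", "country"])
print(df)
```

The fallback also runs for input that is genuinely malformed, so an error raised by the second attempt should be treated as a real parsing problem.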
