pandas DataFrame: in `data[...]`, an index refers to columns and a slice refers to rows

Last updated at 2018-07-24. Posted at 2018-07-04.

This post summarizes something that tripped me up the first time I used a pandas DataFrame.

Environment

Python 3.6
pandas 0.23.0
IPython 6.4.0

What I want to do

Suppose we have the following pandas DataFrame object.

import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})

IPython_Console

In [1]: data
Out[1]:
              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135

I want to retrieve data from this DataFrame object row by row.

Where I got stuck

Accessing data with a slice returns rows.

IPython_Console

In [2]: data[1:3]
Out[2]:
            area       pop
Texas     695662  26448193
New York  141297  19651127

However, accessing data with a single integer index raised a KeyError.

In [2]: data[1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\Anaconda3_5.2.0\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3062             try:
-> 3063                 return self._engine.get_loc(key)
   3064             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 1

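I expected to be able to treat a DataFrame as a list of rows, so I could not make sense of this behavior.

Cause

In a pandas DataFrame, `data[...]` with a single key (indexing) refers to columns, while `data[...]` with a slice refers to rows.

The English edition of the "Python Data Science Handbook" contains the following sentence.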

First, while indexing refers to columns, slicing refers to rows:

Quoted from: Data Indexing and Selection - Python Data Science Handbook

A slice can also use row labels instead of positions.
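The rule quoted above can be checked with a small sketch. The three-row frame below is a shortened stand-in for the article's data, built with an explicit index so the row order does not depend on the pandas version.

```python
import pandas as pd

# Shortened stand-in for the article's DataFrame (first three states only).
data = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# A single key is looked up among the column labels.
col = data["area"]

# A slice is applied to the rows.
rows = data[1:3]

# A single integer is still treated as a column label, so this fails
# because no column is labelled 1.
try:
    data[1]
    raised = False
except KeyError:
    raised = True
```

Here col is a Series holding the area column, rows is a DataFrame holding the Texas and New York rows, and raised ends up True.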

IPython_Console

In [3]: data['Florida':'Illinois']
Out[3]:
            area       pop
Florida   170312  19552860
Illinois  149995  12882135
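One difference between the two kinds of slice is worth seeing side by side: a label slice includes its end label, while a positional slice excludes its end position. A minimal sketch, rebuilding the same five-row frame with an explicit index:

```python
import pandas as pd

data = pd.DataFrame(
    {"area": [423967, 695662, 141297, 170312, 149995],
     "pop": [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=["California", "Texas", "New York", "Florida", "Illinois"],
)

# Label slice: both endpoints are included.
by_label = data["Texas":"Florida"]

# Positional slice: the end position (3, i.e. Florida) is excluded.
by_pos = data[1:3]

print(list(by_label.index))  # ['Texas', 'New York', 'Florida']
print(list(by_pos.index))    # ['Texas', 'New York']
```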

pandas DataFrame source code

This is the body of the __getitem__ method, which is called when you access data[...].

frame.py

class DataFrame(NDFrame):
# ....

    def __getitem__(self, key):
        key = com._apply_if_callable(key, self)

        # shortcut if we are an actual column
        is_mi_columns = isinstance(self.columns, MultiIndex)
        try:
            if key in self.columns and not is_mi_columns:
                return self._getitem_column(key)
        except:
            pass

        # see if we can slice the rows
        indexer = convert_to_index_sliceable(self, key)
        if indexer is not None:
            return self._getitem_slice(indexer)

        if isinstance(key, (Series, np.ndarray, Index, list)):
            # either boolean or fancy integer index
            return self._getitem_array(key)
        elif isinstance(key, DataFrame):
            return self._getitem_frame(key)
        elif is_mi_columns:
            return self._getitem_multilevel(key)
        else:
            return self._getitem_column(key)

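With data[1], the key 1 is not an existing column label, cannot be converted to a row slice, and is not a Series, ndarray, Index, list, or DataFrame. Execution therefore reaches the final else block and calls self._getitem_column(key).
Because there is no column named 1, a KeyError is raised.

Solution

Use data.iloc.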
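One way to confirm that an integer key is treated as a column label is to build a frame whose columns really are labelled with integers. This is a small sketch with made-up values, not taken from the article's data:

```python
import pandas as pd

# Without explicit column names, the columns are labelled 0 and 1.
numbered = pd.DataFrame([[10, 20], [30, 40]])

# Here 1 is a valid column label, so this returns the second column,
# not the second row.
second_col = numbered[1]
print(list(second_col))  # [20, 40]

# With string column labels, the same key has nothing to match.
labelled = pd.DataFrame({"area": [10, 30], "pop": [20, 40]})
try:
    labelled[1]
    raised = False
except KeyError:
    raised = True
```

So data[1] is not rejected because integers are disallowed; it fails only because the article's frame has no column with that label.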

Purely integer-location based indexing for selection by position.

Quoted from: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.iloc.html

IPython_Console

In [4]: data.iloc[1]
Out[4]:
area      695662
pop     26448193
Name: Texas, dtype: int64

Reference: Indexers: loc, iloc, and ix

Supplement

Slicing with column labels
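For comparison, here is a short sketch of row access by position and by label. It rebuilds a three-row version of the frame; loc is the label-based counterpart of iloc.

```python
import pandas as pd

data = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# By position: the second row, returned as a Series.
row_by_pos = data.iloc[1]

# By label: the same row.
row_by_label = data.loc["Texas"]

# Passing a list of positions keeps the result a DataFrame.
row_as_frame = data.iloc[[1]]

print(row_by_pos.name)      # Texas
print(row_as_frame.shape)   # (1, 2)
```

Being explicit with iloc or loc avoids relying on the column-versus-row rule of plain data[...].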

dict_list = [dict(a=1,b=2,c=3,d=4), dict(a=11,b=12,c=13,d=14)]
data2 = pd.DataFrame(dict_list)

To extract columns b through d, use data2.loc[:, "b":"d"].

In [21]: data2
Out[21]:
    a   b   c   d
0   1   2   3   4
1  11  12  13  14

In [24]: data2.loc[:, "b":"d"]
Out[24]:
    b   c   d
0   2   3   4
1  12  13  14
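The same columns can also be selected by position. The sketch below assumes the column order a, b, c, d shown above; note that the label slice includes "d" while the positional slice stops before position 4.

```python
import pandas as pd

dict_list = [dict(a=1, b=2, c=3, d=4), dict(a=11, b=12, c=13, d=14)]
data2 = pd.DataFrame(dict_list)

# Label-based column slice: "d" is included.
by_label = data2.loc[:, "b":"d"]

# Position-based column slice: positions 1, 2, 3 (4 is excluded).
by_pos = data2.iloc[:, 1:4]

print(list(by_label.columns))    # ['b', 'c', 'd']
print(by_label.equals(by_pos))   # True
```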

Extracting multiple columns

Pass a list of the column names you want to data[...].

In [26]: data2[["a","c"]]
Out[26]:
    a   c
0   1   3
1  11  13

Incidentally, passing a list as the index returns a DataFrame, while passing a string returns a Series.

In [30]: type(data2[["a"]])
Out[30]: pandas.core.frame.DataFrame

In [31]: type(data2["a"])
Out[31]: pandas.core.series.Series

References

The code in this article is based on the code at the following site.
https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.02-Data-Indexing-and-Selection.ipynb
