
Can you tell pandas Series and DataFrame apart when you use them?

Posted at 2018-03-09 (last updated at 2018-03-09)

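When I learned that a pandas DataFrame lets you work with data in Python much as you would in a relational database, I was delighted. Then I quickly ran into a question: what is a Series? Rather than settling for "as long as I can handle the data," let's properly understand why pandas provides both Series and DataFrame, how each is used, and when to use which.

According to the pandas documentation, a Series is: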

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

and a DataFrame is:

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

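In other words, a Series is the data structure that represents a single column of a DataFrame, and the data in a DataFrame can be thought of as a collection of Series. Put another way, a Series is a one-dimensional data structure and a DataFrame is a two-dimensional one.

Let's define a Series and a DataFrame to see what this means in practice.

Defining a Series

First, import pandas.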

import pandas as pd

There are two ways to define a Series: array style (from a list plus an index) and dict style.

s1 = pd.Series([80, 50, 60, 70, 90], index=['Japanese', 'math', 'science', 'society', 'English'])
s1

s2 = pd.Series({'Japanese': 80, 'math': 50, 'science': 60, 'society': 70, 'English': 90})
s2

Both produce the following result.

Japanese    80
math        50
science     60
society     70
English     90
dtype: int64

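These are one person's test scores in Japanese, math, science, social studies, and English. This is essentially Python's dictionary data structure: the data is stored in the form {'Japanese': 80, 'math': 50, 'science': 60, 'society': 70, 'English': 90}. It is a one-dimensional data structure. What a dict calls keys, such as 'Japanese', a Series calls the index.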
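To make the dict analogy concrete, here is a minimal sketch of moving between a dict and a Series. The variable names (scores, labels, as_dict, mean_score) are made up for this illustration.

```python
import pandas as pd

# Build a Series from a plain dict; the dict keys become the index
scores = pd.Series({'Japanese': 80, 'math': 50, 'science': 60, 'society': 70, 'English': 90})

# The labels live in .index, separate from the data
labels = list(scores.index)

# A Series converts back to a plain dict
as_dict = scores.to_dict()

# Unlike a dict, a Series supports vectorized operations over its values
mean_score = scores.mean()

print(labels)
print(as_dict)
print(mean_score)
```

So a Series behaves like a dict for lookups, but it also carries the array-style operations that a plain dict lacks.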

Getting data out of a Series

There are likewise two ways to retrieve data: array style and dict style.

# array style (by position)
s1.iloc[2]

# dict style (by label)
s1['science']

# => 60
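The two access styles correspond to the .iloc (position) and .loc (label) indexers. The following is a small sketch, not the only way to do it; the variable names are chosen for this example.

```python
import pandas as pd

s = pd.Series([80, 50, 60, 70, 90],
              index=['Japanese', 'math', 'science', 'society', 'English'])

# Position-based access: the third element (positions start at 0)
by_position = s.iloc[2]

# Label-based access: the element whose index label is 'science'
by_label = s.loc['science']

# Both indexers also accept several items at once and return a Series
subset = s.loc[['Japanese', 'English']]
first_two = s.iloc[:2]

print(by_position, by_label)
print(subset)
print(first_two)
```

Spelling out .iloc or .loc makes it explicit whether a number is meant as a position or as a label.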

Adding values to a Series

Next, let's add a value.

s3 = pd.Series({'Japanese': 80, 'math': 50, 'science': 60, 'society': 70, 'English': 90})
s4 = pd.Series({'arts': 100})
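s3 = pd.concat([s3, s4])  # Series.append was removed in pandas 2.0
s3

Japanese     80
math         50
science      60
society      70
English      90
arts        100
dtype: int64

Adding a value is easy: you just concatenate another Series onto the existing one. In effect, it is one Series joined onto another (a concatenation, not element-wise arithmetic).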

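There is much more you can do, such as other ways of retrieving data from a Series or adding Series together, but I'll skip that here. The documentation covers it in detail.

Next, let's look at DataFrame.

Defining a DataFrame

As stated above, "the data in a DataFrame is a collection of Series," so a DataFrame can be built from the Series we defined earlier. Once you know each person's test scores, you naturally want to line up the scores of the whole class. This is where the data gains one dimension and DataFrame comes in.

As class members, consider Taro and Jiro.

A DataFrame can also be defined in two ways, array style and dict style.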
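As a brief illustration of adding Series together, the sketch below shows that arithmetic between two Series is aligned by index label rather than by position. The midterm and final scores are invented for this example.

```python
import pandas as pd

# Two Series with the same labels in a different order (made-up scores)
midterm = pd.Series({'Japanese': 80, 'math': 50, 'science': 60})
final = pd.Series({'math': 70, 'science': 65, 'Japanese': 90})

# Values are matched by label, not by position
total = midterm + final

# A label present in only one operand yields NaN for that label
with_missing = midterm + pd.Series({'arts': 100})

print(total)
print(with_missing)
```

This is element-wise arithmetic, which is a different operation from joining one Series onto the end of another.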

df1 = pd.DataFrame([s1, s2], index=['Taro', 'Jiro'])
df1

      English  Japanese  math  science  society
Taro       90        80    50       60       70
Jiro       90        80    50       60       70

df2 = pd.DataFrame({'Taro': s1, 'Jiro': s2})
df2

          Jiro  Taro
English     90    90
Japanese    80    80
math        50    50
science     60    60
society     70    70

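The two resulting DataFrames have different shapes. The one we want is the first. The strings that were the index of each Series have become the columns of the DataFrame. You can confirm this with the index and columns attributes.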

df1.index
# => Index(['Taro', 'Jiro'], dtype='object')

df1.columns
# => Index(['English', 'Japanese', 'math', 'science', 'society'], dtype='object')
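The two construction styles yield layouts that are transposes of each other. The sketch below assumes two small hypothetical Series, taro and jiro, with different scores so that the orientation is easy to see.

```python
import pandas as pd

# Hypothetical scores, used only for this illustration
taro = pd.Series({'Japanese': 80, 'math': 50, 'science': 60})
jiro = pd.Series({'Japanese': 75, 'math': 65, 'science': 55})

# A list of Series: each Series becomes one row
by_rows = pd.DataFrame([taro, jiro], index=['Taro', 'Jiro'])

# A dict of Series: each Series becomes one column
by_cols = pd.DataFrame({'Taro': taro, 'Jiro': jiro})

# .T swaps rows and columns, turning one layout into the other
flipped = by_cols.T

print(by_rows.shape, by_cols.shape, flipped.shape)
print(flipped)
```

If the dict style is more convenient to write, transposing the result is one way to get the row-per-person layout.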

Getting values out of a DataFrame

Next, let's pull values out of the DataFrame. First, let's look at Taro's test results.

df1.loc['Taro', :]

English     90
Japanese    80
math        50
science     60
society     70
Name: Taro, dtype: int64

type(df1.loc['Taro', :])
# => pandas.core.series.Series

We got one-dimensional data, that is, a Series. Next, let's get the results of the Japanese test.

df1.loc[:, 'Japanese']

Taro    80
Jiro    80
Name: Japanese, dtype: int64

type(df1.loc[:, 'Japanese'])
# => pandas.core.series.Series

This is also one-dimensional data, that is, a Series. Next, let's get Taro's result on the Japanese test.

df1.loc['Taro', 'Japanese']
# => 80

type(df1.loc['Taro', 'Japanese'])
# => numpy.int64

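This is a numpy.int64; it is no longer a pandas type. Next, let's add a new row to the DataFrame. Since a DataFrame is made up of Series, we just need to add another Series.

A closer look at the name attribute of a Series

By the way, when we ran df1.loc['Taro', :], the output included Name: Taro. That's right: a Series has a name attribute. Let's redefine Taro's and Jiro's scores with the name attribute set, and build a DataFrame from them.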
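The type you get back depends on how many dimensions the selection removes. A minimal sketch, using a small DataFrame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'Japanese': [80, 75], 'math': [50, 65]},
                  index=['Taro', 'Jiro'])

row = df.loc['Taro']                   # one row -> Series
col = df.loc[:, 'Japanese']            # one column -> Series
cell = df.loc['Taro', 'Japanese']      # one row and one column -> scalar
sub = df.loc[['Taro'], ['Japanese']]   # lists of labels -> DataFrame

print(type(row).__name__)
print(type(col).__name__)
print(type(cell).__name__)
print(type(sub).__name__)
```

Passing a list of labels keeps that dimension, which is why the last selection is still a DataFrame even though it holds a single value.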

s5 = pd.Series([80, 50, 60, 70, 90], index=['Japanese', 'math', 'science', 'society', 'English'], name='Taro')
s5

Japanese    80
math        50
science     60
society     70
English     90
Name: Taro, dtype: int64

s6 = pd.Series([80, 50, 60, 70, 90], index=['Japanese', 'math', 'science', 'society', 'English'], name='Jiro')
s6

Japanese    80
math        50
science     60
society     70
English     90
Name: Jiro, dtype: int64

Both Series now have a name. Let's define a DataFrame from them.

df3 = pd.DataFrame([s5, s6])
df3

      Japanese  math  science  society  English
Taro        80    50       60       70       90
Jiro        80    50       60       70       90

When we defined df1 we also specified index, but when the Series have a name attribute, the names become the index.
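The name attribute travels in both directions: it supplies the row label when a DataFrame is built from Series, and a Series pulled out of a DataFrame is named after the row or column it came from. A short sketch with made-up scores:

```python
import pandas as pd

taro = pd.Series({'Japanese': 80, 'math': 50}, name='Taro')
jiro = pd.Series({'Japanese': 75, 'math': 65}, name='Jiro')

# Built from a list of named Series: the names become the row index
df = pd.DataFrame([taro, jiro])
row_labels = list(df.index)

# Selecting a row or a column gives a Series named after that label
row_name = df.loc['Jiro'].name
col_name = df['math'].name

# Concatenating along axis=1 uses the names as column labels instead
wide = pd.concat([taro, jiro], axis=1)
col_labels = list(wide.columns)

print(row_labels, row_name, col_name, col_labels)
```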

Adding a row to a DataFrame

Finally, let's add a row.

s7 = pd.Series([80, 50, 60, 70, 90], index=['Japanese', 'math', 'science', 'society', 'English'], name='Saburo')
df4 = pd.concat([df3, s7.to_frame().T])  # DataFrame.append was removed in pandas 2.0
df4

        Japanese  math  science  society  English
Taro          80    50       60       70       90
Jiro          80    50       60       70       90
Saburo        80    50       60       70       90
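DataFrame.append was removed in pandas 2.0, so on current versions a row is usually added in one of the two ways sketched below. The values are made up, and which approach fits better depends on the situation.

```python
import pandas as pd

df = pd.DataFrame({'Japanese': [80, 75], 'math': [50, 65]},
                  index=['Taro', 'Jiro'])
saburo = pd.Series({'Japanese': 60, 'math': 95}, name='Saburo')

# Option 1: turn the Series into a one-row DataFrame and concatenate.
# to_frame() makes a single column named 'Saburo'; .T flips it into a row.
df_concat = pd.concat([df, saburo.to_frame().T])

# Option 2: assign to a new row label with .loc (modifies the copy in place)
df_loc = df.copy()
df_loc.loc['Saburo'] = saburo

print(df_concat)
print(df_loc)
```

In both cases the name of the Series (or the label given to .loc) becomes the index label of the new row.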

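Summary

When you first touch a pandas DataFrame, it is tempting to treat it like a relational database, and the finer points get neglected. Once you understand the Series and DataFrame data structures that pandas provides, you start to see things from a different angle.

In conclusion: a Series is a one-dimensional data structure, a DataFrame is a two-dimensional one, and a DataFrame is composed of Series.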
