
@Shmwa2 (Shohei Miwa) in 株式会社BeeX

Introduction to Pandas

Posted 2021-12-24, last updated 2021-12-24

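Following on from matplotlib, this article looks at Pandas, another Python library.

What is Pandas

Pandas is a Python library for data analysis that handles tabular data and matrices.
Aggregation and calculation tasks can be moved to Pandas.

It can be installed with pip and similar tools, but since it is included in Anaconda, this article uses that.

To check results right away, Jupyter Notebook is used again this time.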

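Pandas terminology

There are a few specialized terms, so let's go through them first.

Series
A one-dimensional labeled array; it can be thought of as a table with a single column.
DataFrame
Tabular data made up of rows and columns. This is the central data structure in Pandas.
index
Labels that can be attached to the rows of a Series or DataFrame.
columns
Labels that can be attached to the columns of a DataFrame.
integer-location
Because a DataFrame is laid out in rows and columns, data can also be accessed by position number.
This approach is called integer-location.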
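
To make these terms concrete, here is a minimal sketch using a small made-up DataFrame (the values and labels are assumptions chosen only for illustration). It shows the index, the columns, a Series taken from one column, and the difference between label-based access with .loc and integer-location access with .iloc.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the terms.
df = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    columns=["col1", "col2"],
    index=["a", "b", "c"],
)

print(df.index.tolist())    # row labels (index)
print(df.columns.tolist())  # column labels (columns)

# A single column of a DataFrame is a Series.
col = df["col1"]
print(type(col).__name__)

# Label-based access uses .loc; integer-location access uses .iloc.
by_label = df.loc["b", "col2"]
by_position = df.iloc[1, 1]
print(by_label, by_position)
```

Both lookups point at the same cell: row label b is at position 1, and column label col2 is at position 1.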

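Commonly used pandas data types

The data types used in pandas include, besides the standard int, NumPy-based floats, strings, and so on.

DataType	Description
bool	Boolean value
np.int64	64-bit integer (the usual dtype for Python int values)
np.float64	64-bit floating point (equivalent to Python float)
pd.StringDtype()	pandas string type
object	Arbitrary Python object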
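
As a rough illustration of the table, the sketch below builds one Series per data type and prints the dtype pandas reports. The variable names and values are assumptions made for this example.

```python
import numpy as np
import pandas as pd

flags = pd.Series([True, False, True])                       # bool
ints = pd.Series([1, 2, 3], dtype=np.int64)                  # 64-bit integer
floats = pd.Series([1.5, 2.5, 3.5])                          # 64-bit float
strings = pd.Series(["a", "b", "c"], dtype=pd.StringDtype()) # pandas string
mixed = pd.Series([1, "a", None])                            # generic objects

for name, s in [
    ("flags", flags),
    ("ints", ints),
    ("floats", floats),
    ("strings", strings),
    ("mixed", mixed),
]:
    print(name, s.dtype)
```

A Series holding values of different Python types, like mixed here, falls back to the object dtype.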

Series

A Series is created with the pd.Series() constructor.
In this example, the values a, b, c, d, e are given the index labels "1", "2", "3", "4", "5" (strings, not integers).

sample1.py

import pandas as pd

s = pd.Series(["a", "b", "c", "d", "e"], index=["1", "2", "3", "4", "5"])
print(s)

To display a specific element, specify its index label.
The data type can be set with the dtype argument and read back from the dtype attribute.

sample2.py

import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
# Show the value of s1 at index label a
print(s1.a)

s2 = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype=int)
# Show the data type of s2
print(s2.dtype)
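
Attribute-style access such as s1.a only works when the label is a valid Python identifier and does not clash with an existing Series attribute, so labels like "1" from sample1.py cannot be reached that way. The following sketch, with made-up values, shows the more general ways to look up elements.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_bracket = s["a"]     # label lookup with brackets
by_loc = s.loc["b"]     # explicit label lookup
by_iloc = s.iloc[2]     # positional lookup (integer-location)
subset = s[["a", "c"]]  # several labels at once returns a Series

print(by_bracket, by_loc, by_iloc)
print(subset)
```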

DataFrame

A DataFrame holds tabular data; in addition to the index of a Series, it has columns, which label the column data. Call the pd.DataFrame() constructor to create one.

From a list, the basic form is pd.DataFrame(2D array data, columns=list of column labels, index=list of index labels).

sample3.py

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], columns=['col1', 'col2'], index=["a", "b", "c", "d", "e"])
print(df)

From a dictionary, the basic form is data = {column1: data array, column2: data array, ...} followed by pd.DataFrame(data, index=list of index labels).

sample4.py

import pandas as pd

data = {'col1': [1, 3, 5, 7, 9], 'col2': [2, 4, 6, 8, 10]}
df = pd.DataFrame(data, index=["a", "b", "c", "d", "e"])
print(df)
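
The list form describes the table row by row, while the dictionary form describes it column by column. As a quick check, this sketch rebuilds the data from sample3.py and sample4.py and compares the results; the variable names are assumptions for this example.

```python
import pandas as pd

index = ["a", "b", "c", "d", "e"]

# Row-oriented: each inner list is one row.
from_rows = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]],
    columns=["col1", "col2"],
    index=index,
)

# Column-oriented: each dictionary value is one column.
from_dict = pd.DataFrame(
    {"col1": [1, 3, 5, 7, 9], "col2": [2, 4, 6, 8, 10]},
    index=index,
)

print(from_rows.equals(from_dict))  # same values, labels, and dtypes
print(from_dict.shape)              # (number of rows, number of columns)
```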

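Reading and writing files

To read data from a CSV file, use the pd.read_csv() function.
The basic form is pd.read_csv(file path); by default, rows are labeled with sequential integers starting at 0.
Specifying index_col uses the given column as the index instead.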

sample5.py

import pandas as pd

df = pd.read_csv("C:/Users/user-name/data.csv", index_col="col1")
print(df)
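
To try index_col without a file on disk, pd.read_csv() also accepts a file-like object. The sketch below feeds it an in-memory string through io.StringIO; the CSV content is made up for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = "col1,col2\na,1\nb,2\nc,3\n"

# Without index_col, rows get the default integer index.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the named column becomes the index.
labeled = pd.read_csv(io.StringIO(csv_text), index_col="col1")

print(default_index.index.tolist())
print(labeled.index.tolist())
print(labeled.columns.tolist())
```

When col1 is used as the index, it no longer appears among the columns.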

To write a DataFrame to CSV, use df.to_csv().
The basic form is df.to_csv(file path); passing index=False leaves the index out of the output.

sample6.py

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], columns=['col1', 'col2'], index=["a", "b", "c", "d", "e"])
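# Write with the index (the first column holds the labels a to e)
df.to_csv("C:/Users/user-name/data2.csv")
# Write without the index
df.to_csv("C:/Users/user-name/data3.csv", index=False)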

Output of data2.csv
Output of data3.csv (index=False)
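
The effect of index=False can also be inspected without writing files: when no path is given, df.to_csv() returns the CSV text as a string. This is a small sketch with made-up data.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["col1", "col2"], index=["a", "b"])

# No path given, so the CSV text is returned instead of written to disk.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines())
print(without_index.splitlines())
```

With the index included, the header row starts with an empty field because the index has no name.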

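Arithmetic and statistics

Arithmetic between DataFrames is applied element by element after aligning on index and column labels; a label present in only one of the operands yields NaN.
The operators are +, -, *, and /.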

sample7.py

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], columns=['col1', 'col2'], index=["a", "b", "c", "d", "e"])
df2 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], columns=['col1', 'col2'], index=["a", "b", "c", "d", "e"])
# Named total rather than sum to avoid shadowing the built-in sum()
total = df1 + df2

print(total)
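
DataFrame arithmetic lines up its operands by index and column labels rather than by position. The sketch below uses two made-up frames whose indexes only partly overlap to show what happens to unmatched labels, and how the add() method with fill_value can treat a missing side as 0.

```python
import pandas as pd

left = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, index=["a", "b"])
right = pd.DataFrame({"col1": [10, 20], "col2": [30, 40]}, index=["b", "c"])

# Only row b exists in both frames; rows a and c become NaN.
aligned = left + right
print(aligned)

# fill_value=0 substitutes 0 for the side that is missing a label.
filled = left.add(right, fill_value=0)
print(filled)
```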

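For statistics, use the following methods.

Method	Description
df.count()	Number of non-missing values
df.mean()	Mean
df.max()	Maximum
df.min()	Minimum
df.var()	Variance (sample variance by default)
df.describe()	Summary statistics in one call
df.sample()	Randomly selected rows (one row by default)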

sample8.py

import pandas as pd
df = pd.DataFrame([[1,2], [3,4], [5,6], [7,8], [9,10]], columns=['x','y'])

# Get the number of values in each column
c = df.count()

print(c)
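
The other methods in the table are called the same way. As a sketch on the same data as sample8.py, the following computes the mean and variance per column and the summary produced by describe().

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], columns=["x", "y"])

means = df.mean()        # per-column mean
variances = df.var()     # per-column sample variance (divides by n - 1)
summary = df.describe()  # count, mean, std, min, quartiles, max

print(means)
print(variances)
print(summary)
```

Each method returns one value per column, so the results are themselves a Series (or a DataFrame for describe()).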

Replacing, sorting, and aggregating

To replace data, use the df.replace() method.
Call it as df.replace(value to replace, new value).

sample9.py

import pandas as pd

df1 = pd.DataFrame([["yuko","tomohisa"], ["eriko","yosure"]], columns=['Women', 'Men'], index=["a", "b"])
print(df1)
# Replace yosure with yosuke (the DataFrame defined above is df1)
df2 = df1.replace('yosure', 'yosuke')
print(df2)
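
Two details of replace() are worth seeing in code: it returns a new DataFrame and leaves the original untouched, and a plain string matches whole cell values rather than substrings. The sketch below also passes a dictionary to replace several values at once; the replacement yuka is a made-up value for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["yuko", "tomohisa"], ["eriko", "yosure"]],
    columns=["Women", "Men"],
    index=["a", "b"],
)

# A dictionary maps each old value to its new value.
fixed = df.replace({"yosure": "yosuke", "yuko": "yuka"})
print(fixed)

# A partial string does not match, so nothing changes here.
unchanged = df.replace("yos", "xxx")
print(unchanged.equals(df))
```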

Sorting uses the df.sort_values() method.
The basic form is df.sort_values('column to sort by', ascending=True (ascending) / False (descending)).
Multiple columns can be given; columns earlier in the list take priority.

sample10.py

import pandas as pd

df = pd.DataFrame([[5, 4], [2, 6], [7, 8], [1, 10], [3, 9]], columns=['col1', 'col2'], index=["a", "b", "c", "d", "e"])
print(df)

df2 = df.sort_values(['col1'], ascending=[True])
print(df2)
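
To illustrate the priority rule for multiple columns, this sketch sorts made-up data by col1 in ascending order and breaks ties using col2 in descending order. The ascending list is matched to the column list position by position.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [2, 1, 2, 1], "col2": [5, 8, 7, 6]},
    index=["a", "b", "c", "d"],
)

# col1 decides the order first; col2 only matters among equal col1 values.
result = df.sort_values(["col1", "col2"], ascending=[True, False])
print(result)
```

The rows come out in the order b, d, c, a, and the original df keeps its order because sort_values() returns a new DataFrame.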

Aggregation uses the df.groupby() method.

Method	Value
df.groupby([columns]).max()	Maximum
df.groupby([columns]).min()	Minimum
df.groupby([columns]).var()	Variance
df.groupby([columns]).sum()	Sum
df.groupby([columns]).mean()	Mean
df.groupby([columns]).std()	Standard deviation

In the following example, the team column takes the values a, b, and c, and each row holds height and weight data. The aggregation is computed per team.

sample10.py

import pandas as pd

team = ["a", "b", "c", "b", "c", "a"]
height = [181, 180, 171, 167, 188, 159]
weight = [72, 82, 70, 67, 90, 59]
data = {'team': team, 'height': height, 'weight': weight}
df = pd.DataFrame(data)

m = df.groupby("team").max()
print(m)
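
The same grouped object can be reused for other aggregations. As a sketch on the same data, the following computes the mean per team and then uses agg() to apply several aggregations to a single column in one call.

```python
import pandas as pd

data = {
    "team": ["a", "b", "c", "b", "c", "a"],
    "height": [181, 180, 171, 167, 188, 159],
    "weight": [72, 82, 70, 67, 90, 59],
}
df = pd.DataFrame(data)

grouped = df.groupby("team")

# Mean of every numeric column, one row per team.
mean_by_team = grouped.mean()
print(mean_by_team)

# Several aggregations of one column; each becomes a result column.
height_stats = grouped["height"].agg(["min", "max", "mean"])
print(height_stats)
```

In the results, the team values become the index.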

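Filtering

A filter is a sequence of bool values with the same length as the index.
Only the rows where the value is True are displayed.
In the following example, rows 0, 2, and 4 are displayed.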

sample11.py

import pandas as pd

data = {'A': [1, 2, 3, 4, 5], 'B': ["a", "b", "c", "d", "e"]}
df = pd.DataFrame(data)

condition = [True, False, True, False, True]
print(df[condition])
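
In practice the bool sequence is usually computed from the data rather than typed by hand. The sketch below, on the same data, builds a mask from a comparison on column A and then combines two conditions with &; the particular conditions are assumptions chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "b", "c", "d", "e"]})

# A comparison on a column yields a bool Series usable as a filter.
mask = df["A"] % 2 == 1
odd_rows = df[mask]
print(odd_rows)

# Combine conditions with & (and) or | (or); each needs its own parentheses.
combined = df[(df["A"] > 1) & (df["B"] != "e")]
print(combined)
```

The first mask is True, False, True, False, True, so it selects the same rows 0, 2, and 4 as the hand-written list.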

Pandas has many other methods, far too many to cover in one article.
Next time I plan to write about combining pandas with matplotlib.

That's all.
