pandas 基本機能
基礎項目に加え、データサイエンス・機械学習、Kaggle等でよく使う機能をまとめました。
Pandasは、Pythonでデータ分析を行うためのライブラリで、データの読み込みや編集、統計量の表示が可能。
公式ドキュメント:http://pandas.pydata.org/pandas-docs/stable/
###0.ライブラリのインポート
import pandas as pd
###1.シリーズとデータフレーム
pandasには、Numpyをベースとした Series と DataFrame という型が存在する。
■ Seriesの生成
ser = pd.Series([10, 15, 20, 25])
>>> ser
0 10
1 15
2 20
3 25
dtype: int64
■ DataFrameの生成
pd.DataFrame([[100, "a", True],
[150, "b", False],
[300, "c", False],
[550, "d", True]])
![Screen Shot 2019-01-27 at 16.36.27.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fdd36f0f4-3fde-7c33-4cdb-eed4513cdce8.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=c15e267aa276f53372a4d310fd4e655f)
DataFrameオブジェクトを作る方法は多数ある
pd.DataFrame(np.arange(12).reshape((3, 4)))
![Screen Shot 2019-01-27 at 16.37.42.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fbc8bcb76-1b2c-369c-5fa0-7b83ad0a5b4a.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=c98286c8f0bae272f525dd29e144401d)
# 辞書からデータフレームを作成
pd.DataFrame({
'A' : [1, 5, 2, 7, 3],
'B' : [1, 6, 2, 2, 4]
})
![Screen Shot 2019-01-27 at 16.34.03.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F560c979d-67b5-920d-5392-af1670144d7e.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=25a27307f072dc8e4f6a594e40dba240)
2.データの基本操作・整形
生成したデータを使い基本操作を解説します。
■ データの作成
df = pd.DataFrame(np.arange(15).reshape((3, 5)))
df
![Screen Shot 2019-01-27 at 17.30.09.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F72fb1390-5b74-9164-e3ed-fb1e03b04a32.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=32d161aaa94c5fbb0ae044d2b85e0ebc)
■ 行名、列名の指定
df.index = ["01", "02", "03"]
df.columns = ["A", "B", "C", "D", "E"]
df
![Screen Shot 2019-01-27 at 17.35.47.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F047c95d7-83f8-846f-3fe4-3064c9fe9623.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=aa73426779566e91bf11a34bb129e2c4)
■ 列を指定し表示
df["A"] #A列のみ表示
![Screen Shot 2019-01-27 at 17.39.19.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F9009a9a7-2056-bc8b-717c-d283ab369004.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=051fe6a7d14bf3488a67c1311905cb52)
■ 複数の列の指定
df[["A", "B"]] #[]忘れないように..
![Screen Shot 2019-01-27 at 17.41.00.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F2b6f976d-321a-c67f-7608-ffd1d93ce729.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=c37a8fddf728c322ee2eff168d5cb0e3)
■ 行の指定
df[1:] #2行目以降を表示
![Screen Shot 2019-01-27 at 17.46.13.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F5d102249-75a7-795d-24be-5fb57b35e62c.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=46936fe1f1932bf0c28ccf387ce580c7)
■ locメソッドを使った範囲指定
loc[]の引数には、行名、列名を入力する
df.loc["02", ["B", "D"]] #行"02"、列"B"と"D"の値を出力
![Screen Shot 2019-01-27 at 17.53.06.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F7f8aabd7-72f5-2630-33c1-626a66d94239.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=a6970d3338e40c01456e76ab3028b224)
■ ilocメソッドを使った範囲指定
loc[]の引数には、行番号、列番号を入力する
df.iloc[:2, 2:4]
![Screen Shot 2019-01-27 at 18.00.03.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F170cc110-49f5-cbf3-88fa-fab37e3ed213.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=0201d83ced7ef17d630233672f67b43f)
■ 条件で抽出
df[df["E"] > 5] #列Eが5より大きいデータを抽出
![Screen Shot 2019-01-27 at 18.35.29.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fc403a9ca-70a2-72a4-5b4c-edf41b049c02.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=96be726dfcc9957e571e06c23b4f5361)
■ queryメソッドで抽出
SQL構文のように複数条件で抽出可能
df.query("A > 3 and C < 10")
![Screen Shot 2019-01-27 at 19.33.51.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F9ab072b1-a7f8-b573-f66a-4b8c19b45be0.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=16e868a1161e88443400e4324551177f)
■ 列の削除
df.drop("A", axis=1) #列名を指定
![Screen Shot 2019-01-27 at 19.48.45.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Ffa89f25d-e640-e09c-b55b-8e1a49eb0f62.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=5fa432f595cfe2c5e87b623fb0c32810)
■ 行の削除
df.drop(["01", "03"])
![Screen Shot 2019-01-27 at 21.11.28.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F6370776d-1470-85f8-3989-f8b957a93189.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=04a772babd1a59a4d7a0d50b0e890040)
■ 列の追加
df['new']=[100,200,300]
df
![Screen Shot 2019-01-28 at 0.41.01.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F28fb3a43-320d-6ec7-d3b4-cf391596ebba.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=3da9ff473b40691368afd72dc3b8d1c9)
3.データを基本情報を調べる
アヤメのデータを使って基本情報を抽出します。
■ CSVファイルの読み込み
df = pd.read_csv("/Users/..../iris.csv")
#CSVファイルのローカルアドレスをタプルに入力
#CSVに区切りがある場合はタプルに("~.scv", sep = ",")のように指定する
■ 最初の5行のみ表示
df.head() #タプル内に数字を指定し任意の行数に変えることも可能
![Screen Shot 2019-01-27 at 21.13.14.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fe01bc3df-bfd1-56f7-9e63-d6a015beb579.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=c3dc708307769d77587d3de353383c32)
■ 最後の5行のみを表示
df.tail()
![Screen Shot 2019-01-27 at 21.13.57.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F6971a146-eb91-93c1-a9ff-4f139da9f685.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=458aab5e0f805dfe65553a4a6121697d)
■ データフレームの大きさを確認
>>> df.shape
(150, 5) #150行5列のデータフレーム
■ 基本統計量をまとめて表示
df.describe()
![Screen Shot 2019-01-27 at 22.03.00.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F02ddda08-8177-29f5-6c9c-fbf7f53765dd.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=ce827b2b2b32555ef4d330bf73531a9e)
■ 基本統計量メソッド
- .max() 最大値
- .min() 最小値
- .mode() 最頻値
- .mean() 平均値
- .median() 中央値
- .std() 標準偏差
- .count() 件数
- .corr() 相関係数(列間のデータの相関関係を数値化、主にテーブル全体に対して使う。)
- .unique() カテゴリカルデータの項目を表示
- .value_counts() 項目別数量
df.mean() #列ごとの平均を出力
![Screen Shot 2019-01-29 at 1.56.19.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F07eccb69-971e-6d25-804a-fe877821187b.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=db0e02a2122940176c7036561fe0b402)
4.その他の実用的な整形メソッド
データ分析で使う様々なメソッドを紹介します。
3.同様アヤメのデータ(df)を使用します。
■ ソート・並べ替え
df.sort_values(by="PetalWidth")
#"PetalWidth"列でソート
#デフォルトでは昇順
![Screen Shot 2019-01-28 at 11.37.22.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F54cd5f50-c4f3-46c5-9ec6-4d957d4d0b2f.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=2c64c87dfd83fe61462d834a2e402094)
df.sort_values(by="PetalWidth",ascending=False)
#降順
![Screen Shot 2019-01-28 at 11.38.15.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F74e05513-7bb8-0870-388a-51143be5b172.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=cc06d55a123b6d93d8720c7edfdfd307)
■ グループバイ
df_groupby = df.groupby("Name") #Name列で集約されたオブジェクトを作成
df_groupby.mean() #Nameのカテゴリー別平均
![Screen Shot 2019-01-28 at 13.47.14.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F43496239-e5d8-b653-027b-147614eac4c6.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=ad8a1939a5cd1628a712a6de3ce5513b)
df_groupby.describe() #カテゴリー別基本統計量をまとめて表示
![Screen Shot 2019-01-28 at 13.50.21.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fd01f348d-8ca0-f627-55b6-ae0086322347.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=9caca5d3fa5489afddc2d3190f5ee0ba)
■ ピポッドテーブル
新しくデータフレームを作成します
df=DataFrame([['1-09','A','Apple',150],
['1-10','A','Banana',100],
['1-10','C','Orange',200],
['1-10','C','Aplle',170],
['1-11','B','Orange',200],
['1-11','D','Orange',210],
['1-12','B','Banana',110]],
columns=['Date','Shop','Fruit','Price'])
df
![Screen Shot 2019-01-28 at 21.27.57.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fcd237f61-3381-e3c2-ed83-f448791efe88.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=af811c6f167a47b7b8ec8752468f7d68)
df.pivot_table(values = 'Price', index = 'Shop', aggfunc = sum)
#店舗別の売上合計
![Screen Shot 2019-01-28 at 21.32.14.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Ffbd12038-4f3a-d34c-7144-f5f5b266d93a.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=3bf1b466fa841a9f96b1de7e0e5387a0)
df.pivot_table(values = 'Price', index = 'Shop', columns = 'Fruit', aggfunc = sum)
#columnsを追加し商品別に売上を表示
![Screen Shot 2019-01-28 at 22.07.41.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F2f165d4e-68cb-9061-e7e2-892cbabbf8ec.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=87bc875cf62eefa3249b8e5c15ebcc76)
df.pivot_table(values = 'Price', index = 'Shop', columns = 'Fruit',
aggfunc = sum, fill_value = 0)
#fill_value = 0 でNaNを0に変換
![Screen Shot 2019-01-28 at 22.14.15.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F0c99750f-d5da-797b-ccd2-93995f0bd2c3.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=d6c89ac6a047376759683c6bde4ee923)
df.pivot_table(values = 'Price', index = 'Shop', columns = 'Fruit',
aggfunc = sum, fill_value = 0, margins = True)
#margins = True 合計を表示
![Screen Shot 2019-01-28 at 22.18.14.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F58672a7f-270f-1df5-f252-9ea75364e331.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=de112d4370ccc163799c5d25daf6468e)
df.pivot_table(values = 'Price', index = ['Date','Shop'], columns = 'Fruit',
aggfunc = sum, fill_value = 0, margins = True)
#'Date'を加えてindexを階層化
![Screen Shot 2019-01-28 at 22.22.47.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fe39d2476-6431-aff1-013a-386bf73cb4e5.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=3913a13b2fe71dda878625a42169f77b)
■ データフレームの連結
データフレーム df を改めて確認
df
![Screen Shot 2019-01-28 at 22.36.14.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Ffd948910-ff05-8fc9-9106-364eccb821f9.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=b389f806392360b06efc6305b38302f0)
同じ構造をもつ df_2 を作成
df_2 = DataFrame([['1-13','D','Banana',100],
['1-13','C','Orange',200],
['1-13','B','Apple',150],
['1-14','C','Aplle',170],
['1-14','A','Banana',120],
['1-15','C','Orange',210],
['1-15','D','Banana',100]],
columns=['Date','Shop','Fruit','Price'])
df_2
![Screen Shot 2019-01-28 at 22.48.16.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fa13a4144-2200-7a3c-0520-c5148b2ae5f0.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=85755d5a6d304efd80deb435925a00da)
df、 df_2 を行で連結
pd.concat([df, df_2], ignore_index=True)
#ignore_index=True でインデックスの振り直しが可能
![Screen Shot 2019-01-28 at 22.48.55.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fe413ac44-b6f8-0a05-e167-60dead13dd35.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=163a078ba671386ba05428d540728d90)
df、 df_2 を列で連結
pd.concat([df, df_2], axis=1, join_axes=[df.index])
![Screen Shot 2019-01-28 at 23.01.06.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F1feac42b-d6ce-3ee9-634f-2e9a32cfe97a.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=ee02461b9e7486fbaf6fb06ef6d5b7ca)
連結元のデータをラベルで分けて連結
pd.concat([df, df_2], axis=1, join_axes=[df.index], keys=['X', 'Y'])
![Screen Shot 2019-01-28 at 23.17.58.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F9429f73f-aac6-c9e9-888b-86f926a8f35d.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=cd9c87865b95a66ae200925e6046aca9)
■ データフレームの結合
引き続きデータフレーム df を使用
df
![Screen Shot 2019-01-28 at 22.36.14.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Ffd948910-ff05-8fc9-9106-364eccb821f9.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=b389f806392360b06efc6305b38302f0)
*df_3を作成 *
df_3 = DataFrame([['1','A','East'],
['2','B','West'],
['3','C','South'],
['4','D','North'],
['5','E','EC']],
columns=['Shop_No','Shop','Area'])
df_3
![Screen Shot 2019-01-29 at 1.20.04.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Faffd0e02-0469-0656-1131-7accce1f6bb7.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=79249f62593ffbb9ed8f42066b2e12b9)
'Shop'を基点に結合
pd.merge(df, df_3, on='Shop')
![Screen Shot 2019-01-29 at 0.58.49.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F7be1a8f1-3966-1efc-34cd-1542d6c90e37.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=0368718f658d0c4c77cb6ed31113c6a8)
howで結合方法を選択可能
- inner: 両方のデータに含まれるキーだけを残す。(内部結合)
- left: 1つめのデータのキーをすべて残す。(左外部結合)
- right: 2つめのデータのキーをすべて残す。(右外部結合)
- outer: すべてのキーを残す。(完全外部結合)
pd.merge(df, df_3, on='Shop', how='left')
![Screen Shot 2019-01-29 at 1.23.29.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2F1228d897-f8bf-3db1-0f98-85610695da3d.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=f620f78a93473f26fe567ba62713686b)
pd.merge(df, df_3, on='Shop', how='right')
![Screen Shot 2019-01-29 at 1.23.34.png](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.amazonaws.com%2F0%2F278569%2Fcd68ff00-0d02-cbdf-ae8f-068f3c74c756.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=c4bae4f061ba345e760580acfb769a9e)
5.その他メモ
■ インデックス番号で抽出
df.ix[5:10]
■ データ変換(DataFrame → Numpy配列)
アルゴリズムによってはDataFrameのままでは読み込めないものもあり、その際に変換が必要。
df["A"].values #列AをNumpy配列に変換
■ データフレームに関数を適用
#データフレームの全ての値を2乗にしたい場合
func = lambda x: x**2
df.apply(func)
#列に適用し更新したい場合
df["A"] = df["A"].apply(func)
#タプルにそのまま関数を書いても良い
df.apply(lambda x: x**2)
■ インデックス変更・更新
df.set_index('Date', inplace=True)
#第一引数にindexにしたい列名を入力
#inplace=True でデータを更新
■ オブジェクト型のみに絞る / オブジェクト型を除外
#オブジェクト型のみ出力
df.select_dtypes(include=['object'])
#オブジェクト型を除外
df.select_dtypes(exclude=['object'])
■ エクセルファイルの読み込み
pd.read_excel("~.xlsx")