Part 14
NaN in a DataFrame
import numpy as np
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
.dropna() drops every record that contained a NaN. (The index stays as it was; in general, the index is not renumbered unless you call .reset_index().)
df.dropna().head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
Passing axis=1 drops the columns that contain NaN (the default is axis=0, which drops rows).
This is rarely used. It belongs to the strategy of reducing the number of explanatory variables (features) rather than the number of records when building a model, but you would almost never drop a feature just because it contains a single NaN. Which features matter for a model is an important question that deserves careful thought.
df.dropna(axis=1).head()
PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 0 | A/5 21171 | 7.2500 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 1 | 0 | PC 17599 | 71.2833 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 0 | 0 | STON/O2. 3101282 | 7.9250 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 1 | 0 | 113803 | 53.1000 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 0 | 0 | 373450 | 8.0500 |
Passing a list of column names to the subset argument drops only the rows that have NaN in those columns.
You will often need to drop only the rows with NaN in specific columns, so this is well worth remembering.
As usual, calling .dropna() does not overwrite the original df. To update it, use the familiar inplace=True or reassign with df = df.dropna().
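The point that .dropna() returns a copy rather than modifying the original can be checked with a tiny made-up DataFrame (the column names x and y here are invented for illustration):

```python
import numpy as np
import pandas as pd

# a tiny stand-in DataFrame (made-up data) with one NaN
small = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': ['a', 'b', 'c']})

dropped = small.dropna()   # returns a new DataFrame
print(len(small))          # 3 -- the original still has all rows
print(len(dropped))        # 2

# to keep the result, reassign (or pass inplace=True)
small = small.dropna()
print(len(small))          # 2
```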
df.dropna(subset=['Age']).head()  # the row at index 888 (whose Age is NaN) is dropped
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
.fillna(value)
Substitutes a specific value for every NaN.
df.fillna('THIS IS IT').head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | THIS IS IT | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | THIS IS IT | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | THIS IS IT | S |
Fill the NaN in a particular column with that column's mean
df['Age'].mean()
29.69911764705882
df['Age'].fillna(df['Age'].mean()).head()
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
pd.isna(df).head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | True | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | True | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | True | False |
# create a Cabin_nan column holding whether Cabin is NaN
df['Cabin_nan'] = pd.isna(df['Cabin'])
df
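A common next step after pd.isna() is to count the NaN per column by chaining .sum() onto the boolean frame (True counts as 1). A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                    'Cabin': [np.nan, 'C85', np.nan]})

# boolean frame -> per-column NaN counts
nan_counts = pd.isna(toy).sum()
print(nan_counts['Age'])    # 1
print(nan_counts['Cabin'])  # 2
```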
Part 15
Grouping with .groupby()
df = pd.read_csv('train.csv')
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Group by Pclass
df.groupby('Pclass') on its own only forms the groups; it doesn't say what to do with them, so you chain a function such as .mean() or .count() after it.
df.groupby('Pclass').mean()
PassengerId | Survived | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|
Pclass | ||||||
1 | 461.597222 | 0.629630 | 38.233441 | 0.416667 | 0.356481 | 84.154687 |
2 | 445.956522 | 0.472826 | 29.877630 | 0.402174 | 0.380435 | 20.662183 |
3 | 439.154786 | 0.242363 | 25.140620 | 0.615071 | 0.393075 | 13.675550 |
df = df[df['Pclass']==1]  # keep only the first-class rows (note this overwrites df)
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
Take the statistics of every column for the rows where Pclass == 1
df[df['Pclass']==1].describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 216.000000 | 216.000000 | 216.0 | 186.000000 | 216.000000 | 216.000000 | 216.000000 |
mean | 461.597222 | 0.629630 | 1.0 | 38.233441 | 0.416667 | 0.356481 | 84.154687 |
std | 246.737616 | 0.484026 | 0.0 | 14.802856 | 0.611898 | 0.693997 | 78.380373 |
min | 2.000000 | 0.000000 | 1.0 | 0.920000 | 0.000000 | 0.000000 | 0.000000 |
25% | 270.750000 | 0.000000 | 1.0 | 27.000000 | 0.000000 | 0.000000 | 30.923950 |
50% | 472.000000 | 1.000000 | 1.0 | 37.000000 | 0.000000 | 0.000000 | 60.287500 |
75% | 670.500000 | 1.000000 | 1.0 | 49.000000 | 1.000000 | 0.000000 | 93.500000 |
max | 890.000000 | 1.000000 | 1.0 | 80.000000 | 3.000000 | 4.000000 | 512.329200 |
Pulling out just the mean
Extracting statistics one by one is tedious; it is easier to view them per group with .groupby().
df[df['Pclass']==1].describe().loc['mean']
PassengerId 461.597222
Survived 0.629630
Pclass 1.000000
Age 38.233441
SibSp 0.416667
Parch 0.356481
Fare 84.154687
Name: mean, dtype: float64
After a groupby, the index takes the values of the column given as by (the first argument of groupby); in the example above, the Pclass values (1, 2, 3). The result of a groupby is itself a DataFrame, so you can fetch the Series for a particular group with .loc[].
df.groupby('Pclass').mean().loc[1]
PassengerId 461.597222
Survived 0.629630
Age 38.233441
SibSp 0.416667
Parch 0.356481
Fare 84.154687
Name: 1, dtype: float64
df.groupby('Pclass').count().loc[1]  # count(), sum(), etc. work too
PassengerId 216
Survived 216
Name 216
Sex 216
Age 186
SibSp 216
Parch 216
Ticket 216
Fare 216
Cabin 176
Embarked 214
Name: 1, dtype: int64
df.groupby('Pclass').describe()
PassengerId | Survived | Age | SibSp | Parch | Fare | |||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
Pclass | ||||||||||||||||||||||||||||||||||||||||||||||||
1 | 216.0 | 461.597222 | 246.737616 | 2.0 | 270.75 | 472.0 | 670.5 | 890.0 | 216.0 | 0.629630 | 0.484026 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 186.0 | 38.233441 | 14.802856 | 0.92 | 27.0 | 37.0 | 49.0 | 80.0 | 216.0 | 0.416667 | 0.611898 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 216.0 | 0.356481 | 0.693997 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 216.0 | 84.154687 | 78.380373 | 0.0 | 30.92395 | 60.2875 | 93.5 | 512.3292 |
2 | 184.0 | 445.956522 | 250.852161 | 10.0 | 234.50 | 435.5 | 668.0 | 887.0 | 184.0 | 0.472826 | 0.500623 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 173.0 | 29.877630 | 14.001077 | 0.67 | 23.0 | 29.0 | 36.0 | 70.0 | 184.0 | 0.402174 | 0.601633 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 184.0 | 0.380435 | 0.690963 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 184.0 | 20.662183 | 13.417399 | 0.0 | 13.00000 | 14.2500 | 26.0 | 73.5000 |
3 | 491.0 | 439.154786 | 264.441453 | 1.0 | 200.00 | 432.0 | 666.5 | 891.0 | 491.0 | 0.242363 | 0.428949 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 355.0 | 25.140620 | 12.495398 | 0.42 | 18.0 | 24.0 | 32.0 | 74.0 | 491.0 | 0.615071 | 1.374883 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 | 491.0 | 0.393075 | 0.888861 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 491.0 | 13.675550 | 11.778142 | 0.0 | 7.75000 | 8.0500 | 15.5 | 69.5500 |
df.groupby('Pclass').describe()['Age']  # extract just the Age block
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Pclass | ||||||||
1 | 186.0 | 38.233441 | 14.802856 | 0.92 | 27.0 | 37.0 | 49.0 | 80.0 |
2 | 173.0 | 29.877630 | 14.001077 | 0.67 | 23.0 | 29.0 | 36.0 | 70.0 |
3 | 355.0 | 25.140620 | 12.495398 | 0.42 | 18.0 | 24.0 | 32.0 | 74.0 |
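If you want only a few statistics rather than everything .describe() produces, .agg() on a grouped column lets you pick them. A minimal sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'Pclass': [1, 1, 2, 2, 3],
                    'Age': [38.0, 54.0, 29.0, 23.0, 25.0]})

# pick just the statistics you need, per group
stats = toy.groupby('Pclass')['Age'].agg(['mean', 'count'])
print(stats.loc[1, 'mean'])   # 46.0
print(stats.loc[3, 'count'])  # 1
```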
When a DataFrame has more columns or rows than fit the display, Jupyter truncates the output.
To show every column (or every row) without truncation, run the following:
# show all columns
pd.set_option('display.max_columns', None)
# show all rows
pd.set_option('display.max_rows', None)
Looping over the result of a groupby with a for statement
for i, group_df in df.groupby('Pclass'):
print("{}: group_df's type is {} and has {}".format(i, type(group_df), len(group_df)))
1: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 216
2: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 184
3: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 491
Within each Pclass group, number every record by its Fare (sorted in ascending order here, so 0 is the cheapest)
df = pd.read_csv('train.csv')
results = []
for i, group_df in df.groupby('Pclass'):
sorted_group_df = group_df.sort_values('Fare')
sorted_group_df['RankInClass'] = np.arange(len(sorted_group_df))
results.append(sorted_group_df)
results_df = pd.concat(results)
results_df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | RankInClass | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0 | NaN | S | 0 |
822 | 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.0 | 0 | 0 | 19972 | 0.0 | NaN | S | 1 |
815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0 | B102 | S | 2 |
806 | 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S | 3 |
263 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S | 4 |
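As an aside, pandas can also assign within-group ranks directly with .rank() on a grouped column, without the explicit loop above. A sketch with made-up data (note that .rank() starts at 1, while RankInClass above started at 0):

```python
import pandas as pd

toy = pd.DataFrame({'Pclass': [1, 1, 2, 2],
                    'Fare': [70.0, 50.0, 20.0, 10.0]})

# rank of Fare within each Pclass, cheapest = 1
toy['RankInClass'] = toy.groupby('Pclass')['Fare'].rank(method='first').astype(int)
print(toy['RankInClass'].tolist())  # [2, 1, 2, 1]
```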
Joining tables
There are, broadly, two kinds of join:
Joining on a key, using a particular column or the index
Simply sticking DataFrames together, side by side or stacked vertically
import pandas as pd
df1 = pd.DataFrame({'Key':['k0','k','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key':['k0','k1','k2'],
'C':['c0','c2','c3'],
'D':['d0','d1','d2']})
df1
Key | A | B | |
---|---|---|---|
0 | k0 | a0 | b0 |
1 | k | a1 | b1 |
2 | k2 | a2 | b2 |
df2
Key | C | D | |
---|---|---|---|
0 | k0 | c0 | d0 |
1 | k1 | c2 | d1 |
2 | k2 | c3 | d2 |
Both DataFrames have a column called 'Key' (note that df1's second key is 'k' rather than 'k1', so it has no partner in df2), while the other columns hold different values. We join the two DataFrames horizontally on this 'Key' column, using .merge().
df1.merge(df2)
Key | A | B | C | D | |
---|---|---|---|---|---|
0 | k0 | a0 | b0 | c0 | d0 |
1 | k2 | a2 | b2 | c3 | d2 |
Simply concatenating DataFrames side by side (or stacking them vertically)
Use pd.concat() (concat = concatenate).
# vertical (the common case)
pd.concat([df1,df2], axis=0)  # the default is axis=0
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
A | B | C | D | Key | |
---|---|---|---|---|---|
0 | a0 | b0 | NaN | NaN | k0 |
1 | a1 | b1 | NaN | NaN | k |
2 | a2 | b2 | NaN | NaN | k2 |
0 | NaN | NaN | c0 | d0 | k0 |
1 | NaN | NaN | c2 | d1 | k1 |
2 | NaN | NaN | c3 | d2 | k2 |
# horizontal
pd.concat([df1,df2], axis=1)
Key | A | B | Key | C | D | |
---|---|---|---|---|---|---|
0 | k0 | a0 | b0 | k0 | c0 | d0 |
1 | k | a1 | b1 | k1 | c2 | d1 |
2 | k2 | a2 | b2 | k2 | c3 | d2 |
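As the vertical result above shows, pd.concat() keeps each frame's original index (0, 1, 2 appears twice). Passing ignore_index=True renumbers from scratch; a minimal sketch:

```python
import pandas as pd

a = pd.DataFrame({'A': ['a0', 'a1']})
b = pd.DataFrame({'A': ['a2', 'a3']})

# without ignore_index the result's index would be 0, 1, 0, 1
stacked = pd.concat([a, b], ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3]
```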
Part 16
How to use .merge()
how: how to join → {'left', 'right', 'outer', 'inner'}; the default is 'inner'
on: the column to use as the key (a column present in both DataFrames). If omitted, the join falls back to the common column(s)
left_on: the key column of the left DataFrame
right_on: the key column of the right DataFrame
left_index: set to True to use the left DataFrame's index as its key
right_index: set to True to use the right DataFrame's index as its key
how
df1
Key | A | B | |
---|---|---|---|
0 | k0 | a0 | b0 |
1 | k | a1 | b1 |
2 | k2 | a2 | b2 |
df2
Key | C | D | |
---|---|---|---|
0 | k0 | c0 | d0 |
1 | k1 | c2 | d1 |
2 | k2 | c3 | d2 |
df1.merge(df2, how='left')
Key | A | B | C | D | |
---|---|---|---|---|---|
0 | k0 | a0 | b0 | c0 | d0 |
1 | k | a1 | b1 | NaN | NaN |
2 | k2 | a2 | b2 | c3 | d2 |
df1.merge(df2, how='outer')
Key | A | B | C | D | |
---|---|---|---|---|---|
0 | k0 | a0 | b0 | c0 | d0 |
1 | k | a1 | b1 | NaN | NaN |
2 | k2 | a2 | b2 | c3 | d2 |
3 | k1 | NaN | NaN | c2 | d1 |
df1.merge(df2, how='inner')
Key | A | B | C | D | |
---|---|---|---|---|---|
0 | k0 | a0 | b0 | c0 | d0 |
1 | k2 | a2 | b2 | c3 | d2 |
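When checking which rows matched in a join, indicator=True is handy: it adds a _merge column saying which side each row came from. A sketch using frames shaped like df1/df2 above:

```python
import pandas as pd

left = pd.DataFrame({'Key': ['k0', 'k', 'k2'], 'A': ['a0', 'a1', 'a2']})
right = pd.DataFrame({'Key': ['k0', 'k1', 'k2'], 'C': ['c0', 'c2', 'c3']})

# _merge is 'both', 'left_only' or 'right_only' per row
merged = left.merge(right, how='outer', indicator=True)
print(merged[['Key', '_merge']])
```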
on
The on argument specifies which column to use as the key when joining. Only columns present in both tables can be used.
If the tables share a common column, it is used as the key automatically even without on, but you should specify it anyway: it is clearer and safer, and "it silently joined on a column I didn't expect" happens all the time. With multiple common columns you can't be sure which one is used, and even with a single common column it is worth being explicit. (The earlier examples deliberately omitted on because it hadn't been introduced yet.)
To use the index as the key, set right_index/left_index to True (described below). If the key column names differ between the two tables, use left_on/right_on (also below).
df1 = pd.DataFrame({'Key':['k0','k1','k2'],
'ID':['aa','bb','cc'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key':['k0','k1','k3'],
'ID':['aa','bb','cc'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.merge(df2, on='Key')
Key | ID_x | A | B | ID_y | C | D | |
---|---|---|---|---|---|---|---|
0 | k0 | aa | a0 | b0 | aa | c0 | d0 |
1 | k1 | bb | a1 | b1 | bb | c1 | d1 |
df1.merge(df2, on='ID')
Key_x | ID | A | B | Key_y | C | D | |
---|---|---|---|---|---|---|---|
0 | k0 | aa | a0 | b0 | k0 | c0 | d0 |
1 | k1 | bb | a1 | b1 | k1 | c1 | d1 |
2 | k2 | cc | a2 | b2 | k3 | c3 | d3 |
Changing the suffixes
df1.merge(df2, on='ID', suffixes=('_left', '_right'))
Key_left | ID | A | B | Key_right | C | D | |
---|---|---|---|---|---|---|---|
0 | k0 | aa | a0 | b0 | k0 | c0 | d0 |
1 | k1 | bb | a1 | b1 | k1 | c1 | d1 |
2 | k2 | cc | a2 | b2 | k3 | c3 | d3 |
left_on, right_on
Use these when the key column names differ between the left and right tables.
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.merge(df2, left_on='Key1', right_on='Key2')
Key1 | A | B | Key2 | C | D | |
---|---|---|---|---|---|---|
0 | k0 | a0 | b0 | k0 | c0 | d0 |
1 | k1 | a1 | b1 | k1 | c1 | d1 |
left_index, right_index
To use the index rather than a column as the key, set left_index and right_index to True.
df1.merge(df2, left_index=True, right_index=True)
Key1 | A | B | Key2 | C | D | |
---|---|---|---|---|---|---|
0 | k0 | a0 | b0 | k0 | c0 | d0 |
1 | k1 | a1 | b1 | k1 | c1 | d1 |
2 | k2 | a2 | b2 | k3 | c3 | d3 |
join
The .join() function joins on the index, but .merge() can do nearly the same thing, so there is little need to learn it.
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.join(df2)
Key1 | A | B | Key2 | C | D | |
---|---|---|---|---|---|---|
0 | k0 | a0 | b0 | k0 | c0 | d0 |
1 | k1 | a1 | b1 | k1 | c1 | d1 |
2 | k2 | a2 | b2 | k3 | c3 | d3 |
df1.merge(df2, left_index=True, right_index=True)
Key1 | A | B | Key2 | C | D | |
---|---|---|---|---|---|---|
0 | k0 | a0 | b0 | k0 | c0 | d0 |
1 | k1 | a1 | b1 | k1 | c1 | d1 |
2 | k2 | a2 | b2 | k3 | c3 | d3 |
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df3 = pd.DataFrame({'Key3':['k0','k1','k4'],
'E':['c0','c1','c3'],
'F':['d0','d1','d3']})
df1.join([df2, df3])
Key1 | A | B | Key2 | C | D | Key3 | E | F | |
---|---|---|---|---|---|---|---|---|---|
0 | k0 | a0 | b0 | k0 | c0 | d0 | k0 | c0 | d0 |
1 | k1 | a1 | b1 | k1 | c1 | d1 | k1 | c1 | d1 |
2 | k2 | a2 | b2 | k3 | c3 | d3 | k4 | c3 | d3 |
Part 17
.unique() and .nunique()
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df['Pclass'].unique()
array([3, 1, 2])
df['Pclass'].nunique()
3
.value_counts()
df['Pclass'].value_counts()
3 491
1 216
2 184
Name: Pclass, dtype: int64
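Passing normalize=True to .value_counts() returns proportions instead of raw counts, convenient for a quick class-balance check; a minimal sketch:

```python
import pandas as pd

s = pd.Series([3, 3, 3, 1, 2])

# proportions instead of counts
ratios = s.value_counts(normalize=True)
print(ratios.loc[3])  # 0.6
```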
(Very important) .apply()
With the .apply() function you can run a computation over every record of a DataFrame and store the results in another column. Think of it as applying a function to each row.
def get_age_group(age):
return str(age)[0] + '0s'
get_age_group(45)
'40s'
df = pd.DataFrame({'name':['John','Mike','Emily'],
'age':['23','36','42']})
df
name | age | |
---|---|---|
0 | John | 23 |
1 | Mike | 36 |
2 | Emily | 42 |
df['age'].apply(get_age_group)
0 20s
1 30s
2 40s
Name: age, dtype: object
Using .apply() with a lambda
# assign the lambda to a variable f
f = lambda x: str(x)[0] + '0s'
# try it on 43
f(43)
'40s'
df['age_group'] = df['age'].apply(lambda x: str(x)[0] + '0s')
df
name | age | age_group | |
---|---|---|---|
0 | John | 23 | 20s |
1 | Mike | 36 | 30s |
2 | Emily | 42 | 40s |
Using .apply() on an entire record (row)
df = pd.DataFrame({'name':['John','Mike','Emily'],
'age':['23','36','42']})
df['description'] = df.apply(lambda row:'{} is {} years old'.format(row['name'], row['age']), axis=1)
df
name | age | description | |
---|---|---|---|
0 | John | 23 | John is 23 years old |
1 | Mike | 36 | Mike is 36 years old |
2 | Emily | 42 | Emily is 42 years old |
Part 18
Saving a DataFrame as CSV with .to_csv()
df = pd.read_csv('train.csv')
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df['Adult'] = df['Age'].apply(lambda x: x>20)
df.tail()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S | True |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S | False |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S | False |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C | True |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q | True |
df.to_csv('train_w_adult.csv')
df = pd.read_csv('train_w_adult.csv')
df.head(3)
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | True |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | True |
Looking at df.head(3), the index that was saved last time has come back as a mysterious 'Unnamed: 0' column.
Passing index=False to .to_csv() avoids saving the index; as a rule, always save to CSV with index=False.
Note that if a file of the same name already exists at the destination, it is overwritten.
df = pd.read_csv('train.csv')
df['Adult'] = df['Age'].apply(lambda x: x>20)
df.to_csv('train_w_adult.csv', index=False)
df = pd.read_csv('train_w_adult.csv')
df.head(3)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | True |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | True |
df = pd.DataFrame({'A':[['a', 'b'], 2, 3], 'B':[['c', 'd'], 5, 6]})
df
A | B | |
---|---|---|
0 | [a, b] | [c, d] |
1 | 2 | 5 |
2 | 3 | 6 |
# confirm the stored value is a list
type(df['A'].iloc[0])
list
# save as csv
df.to_csv('temp.csv', index=False)
# reload the saved csv
df = pd.read_csv('temp.csv')
df
A | B | |
---|---|---|
0 | ['a', 'b'] | ['c', 'd'] |
1 | 2 | 5 |
2 | 3 | 6 |
type(df['A'].iloc[0])
str
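If you do need the lists back after such a round trip, one common workaround (an assumption on my part, not part of the lesson) is ast.literal_eval, which safely parses strings of Python literals:

```python
import ast
import pandas as pd

# strings like those produced by the csv round trip above
s = pd.Series(["['a', 'b']", "['c', 'd']"])
recovered = s.apply(ast.literal_eval)
print(type(recovered.iloc[0]))  # <class 'list'>
```

This only works when every cell is a valid Python literal; otherwise consider a different storage format.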
.iterrows() to iterate over a DataFrame
Use this to loop over a DataFrame with a for statement. The name is easy to get wrong: you iterate over rows, so iter + row + s; the result is something you can loop over, hence the plural s.
"Iteration" simply means running a repeated loop. A list, for instance, can be iterated with for i in list: (see Part 4).
A DataFrame cannot be iterated over row by row directly with for i in df: the way a list can. Instead, use the .iterrows() function like this:
df = pd.read_csv('train.csv')
for idx, row in df.iterrows():
if row['Age'] > 40 and row['Pclass'] == 3 and row['Sex'] == 'male' and row['Survived'] == 1:
print('{} is very lucky guy...!'.format(row['Name']))
Dahl, Mr. Karl Edwart is very lucky guy...!
Sundman, Mr. Johan Julian is very lucky guy...!
Use .apply() when you want to store the result of processing each record in another column; .iterrows() is more for when you want to perform some action per row rather than return a value.
For example, if a DataFrame holds file paths, you can .iterrows() over it and move or load each file.
.sort_values() to sort on a specific column
# sort by age, youngest first
df.sort_values('Age')  # returns a sorted copy; df itself is unchanged, which is why df.head() below is still unsorted
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
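.sort_values() also accepts a list of columns and per-column ascending flags, and, like .dropna(), it returns a new DataFrame, so reassign to keep the order. A minimal sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'Pclass': [3, 1, 3], 'Age': [22.0, 38.0, 4.0]})

# by class ascending, then oldest first within each class
toy = toy.sort_values(['Pclass', 'Age'], ascending=[True, False])
print(toy['Age'].tolist())  # [38.0, 22.0, 4.0]
```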
.pivot_table() to build a pivot table
data = {'Data':['Jan-1','Jan-1','Jan-1','Jan-2','Jan-2','Jan-2'],
'User':['Emily', 'John','Nick','Kevin','Emily','John'],
'Method':['Card','Card','Cash','Card','Cash','Cash'],
'Price':[100,250,200,460,200,130]}
df = pd.DataFrame(data)
df
Data | User | Method | Price | |
---|---|---|---|---|
0 | Jan-1 | Emily | Card | 100 |
1 | Jan-1 | John | Card | 250 |
2 | Jan-1 | Nick | Cash | 200 |
3 | Jan-2 | Kevin | Card | 460 |
4 | Jan-2 | Emily | Cash | 200 |
5 | Jan-2 | John | Cash | 130 |
values takes the column you want to aggregate; here that is Price. It is what fills each cell of the resulting table.
Then simply pass the columns you want as index and as columns, each as a list.
df.pivot_table(values='Price', index=['Data', 'User'], columns=['Method'])
Method | Card | Cash | |
---|---|---|---|
Data | User | ||
Jan-1 | Emily | 100.0 | NaN |
John | 250.0 | NaN | |
Nick | NaN | 200.0 | |
Jan-2 | Emily | NaN | 200.0 |
John | NaN | 130.0 | |
Kevin | 460.0 | NaN |
df.pivot_table(values='Price', index=['Data', 'Method'], columns=['User'])
User | Emily | John | Kevin | Nick | |
---|---|---|---|---|---|
Data | Method | ||||
Jan-1 | Card | 100.0 | 250.0 | NaN | NaN |
Cash | NaN | NaN | NaN | 200.0 | |
Jan-2 | Card | NaN | NaN | 460.0 | NaN |
Cash | 200.0 | 130.0 | NaN | NaN |
Once you are clear about which column you want to aggregate and put it in values, the rest is just deciding what goes into index and columns.
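One more knob worth knowing: when several rows fall into the same cell, .pivot_table() aggregates them with the mean by default; pass aggfunc to change that. A sketch where one user pays twice on the same day (made-up data):

```python
import pandas as pd

toy = pd.DataFrame({'Data': ['Jan-1', 'Jan-1', 'Jan-1'],
                    'User': ['Emily', 'Emily', 'John'],
                    'Price': [100, 300, 250]})

# default aggfunc is 'mean'; 'sum' totals the duplicates instead
summed = toy.pivot_table(values='Price', index='Data', columns='User', aggfunc='sum')
print(summed.loc['Jan-1', 'Emily'])  # 400
```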
.xs() for cross-section operations
.xs() is short for cross section. It too is not used very often, but it is invaluable when manipulating a DataFrame with multiple index levels, such as a pivot table, so it is worth remembering together with .pivot_table().
Where .xs() helps: with the pivot table above, it lets you cleanly pull out, say, only the 'Card' rows (a cross-section, literally).
# build the pivot table again, then slice out the 'Card' cross-section
pivot = df.pivot_table(values='Price', index=['Data', 'Method'], columns=['User'])
pivot.xs('Card', level='Method')