Trying Out Basic Pandas Operations in Jupyter Lab (Part 1)

In this article I work through the basic Pandas operations described on Kame's (@usdatascientist) blog (https://datawokagaku.com/python_for_ds_summary/) by actually coding them in Jupyter Lab.



import pandas as pd
import numpy as np


data = {'name':'John', 'sex':'male', 'age': 22}
john_s = pd.Series(data)
name    John
sex     male
age       22
dtype: object
array = np.array([10,20,30])
pd.Series(array)
0    10
1    20
2    30
dtype: int64
array = np.array([10,20,30])
labels = ['a','b','c']
pd.Series(array, labels)
a    10
b    20
c    30
dtype: int64
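As a small sketch of how the index labels behave, the following builds the same kind of Series and reads a value back both by label and by position. The variable names (`values`, `s`, `person`) are illustrative and not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Build a Series from a NumPy array with explicit index labels.
values = np.array([10, 20, 30])
labels = ['a', 'b', 'c']
s = pd.Series(values, index=labels)

# Label-based access uses the index; .iloc uses the integer position.
by_label = s['b']
by_position = s.iloc[1]

# A dict becomes a Series whose index is the dict keys.
person = pd.Series({'name': 'John', 'sex': 'male', 'age': 22})

print(by_label, by_position, list(person.index))
```

Both lookups return the same element here because the label 'b' sits at position 1.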




ndarray = np.random.randint(5, size=(5,4))
pd.DataFrame(ndarray)
   0  1  2  3
0  1  1  1  0
1  4  1  0  0
2  3  2  1  0
3  3  1  1  3
4  4  0  1  3
columns = ['a','b','c','d']
index = np.arange(0,50,10)
pd.DataFrame(data=ndarray, index=index, columns=columns)
    a  b  c  d
0   1  1  1  0
10  4  1  0  0
20  3  2  1  0
30  3  1  1  3
40  4  0  1  3


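data1 = {'name': 'John', 'sex': 'male', 'age': 22}
data2 = {'name': 'Zack', 'sex': 'male', 'age': 30}
data3 = {'name': 'Emily', 'sex': 'female', 'age': 32}
pd.DataFrame([data1, data2, data3])
    name     sex  age
0   John    male   22
1   Zack    male   30
2  Emily  female   32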
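The two ways of building a DataFrame shown above can be put side by side in a short sketch. It uses a deterministic array instead of random integers so the result is reproducible; the names `arr`, `df_arr`, `rows` and `df_rows` are made up for this example.

```python
import numpy as np
import pandas as pd

# Deterministic 5x4 array (the notebook above uses random integers instead).
arr = np.arange(20).reshape(5, 4)
df_arr = pd.DataFrame(
    data=arr,
    index=np.arange(0, 50, 10),   # row labels 0, 10, 20, 30, 40
    columns=['a', 'b', 'c', 'd'],
)

# A list of dicts: each dict is one row and the keys become column names.
rows = [
    {'name': 'John', 'sex': 'male', 'age': 22},
    {'name': 'Zack', 'sex': 'male', 'age': 30},
    {'name': 'Emily', 'sex': 'female', 'age': 32},
]
df_rows = pd.DataFrame(rows)

print(df_arr.shape, list(df_arr.index))
print(list(df_rows.columns), df_rows['age'].mean())
```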
df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S





df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
type(df.describe())  # the type is DataFrame


df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
type(df.columns)  # the type is Index
df.index  # the rows have an index too
RangeIndex(start=0, stop=891, step=1)
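In the describe() output above, the count for Age (714) is lower than for the other columns (891) because describe() skips missing values. A minimal sketch on a made-up four-row frame (`toy` is not part of the Titanic data) shows the same effect, together with the column and row index objects:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age value.
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
    'Name': ['A', 'B', 'C', 'D'],
})

summary = toy.describe()                    # numeric columns only by default
age_count = summary.loc['count', 'Age']     # NaN is not counted
fare_count = summary.loc['count', 'Fare']

print(list(summary.columns), age_count, fare_count)
print(type(toy.columns), toy.index)
```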


df['Age'].head()
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64


df[['Age', 'Parch', 'Fare']].head()
    Age  Parch     Fare
0  22.0      0   7.2500
1  38.0      0  71.2833
2  26.0      0   7.9250
3  35.0      0  53.1000
4  35.0      0   8.0500


df.iloc[888] #index location
PassengerId                                         889
Survived                                              0
Pclass                                                3
Name           Johnston, Miss. Catherine Helen "Carrie"
Sex                                              female
Age                                                 NaN
SibSp                                                 1
Parch                                                 2
Ticket                                       W./C. 6607
Fare                                              23.45
Cabin                                               NaN
Embarked                                              S
Name: 888, dtype: object
ndarray = np.random.randint(10, size=(5,5))
columns = [0,1,2,3,4]
index = ['a','b','c','d','e']
df_1 = pd.DataFrame(data=ndarray, index=index, columns=columns)
df_1
   0  1  2  3  4
a  5  8  9  5  0
b  0  1  7  6  9
c  2  4  5  2  4
d  2  4  7  7  9
e  1  7  0  6  9
df_1[0]
a    5
b    0
c    2
d    2
e    1
Name: 0, dtype: int64
df_1.loc['c']  # when the row labels are not integers, pass the label as a string
0    2
1    4
2    5
3    2
4    4
Name: c, dtype: int64
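To make the difference between the accessors concrete, here is a short sketch on a deterministic frame with the same shape as df_1. The name `frame` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same layout as df_1 above: string row labels, integer column names.
frame = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    index=['a', 'b', 'c', 'd', 'e'],
    columns=[0, 1, 2, 3, 4],
)

row_by_label = frame.loc['c']      # row selected by its label
row_by_position = frame.iloc[2]    # row selected by its integer position
column_zero = frame[0]             # plain [] selects a column by name
cell = frame.loc['c', 3]           # row label and column name together

print(row_by_label.equals(row_by_position), list(column_zero), cell)
```

Plain square brackets look up columns, while .loc and .iloc look up rows first, which is why frame[0] returns a column even though 0 looks like a row number.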



df.drop(0).head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q


df.drop('Age', axis=1).head()
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 NaN S

To drop multiple columns, pass a list as the argument: .drop([...]). Dropping does not modify the original df.

df.drop(['Age','PassengerId'], axis=1).head()
Survived Pclass Name Sex SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 NaN S
df.head()  # the original df is unchanged
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
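The point that .drop() returns a new DataFrame and leaves the original alone can be checked with a small sketch. The frame `people` is a made-up three-row stand-in for the Titanic data; `columns=` is an equivalent way of writing `axis=1`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic data.
people = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Age': [22.0, 38.0, 26.0],
    'Fare': [7.25, 71.2833, 7.925],
})

without_first_row = people.drop(0)                          # drop a row by index label
without_age = people.drop('Age', axis=1)                    # drop one column
without_two = people.drop(columns=['Age', 'PassengerId'])   # same as axis=1 with a list

# The original still has every row and column.
print(list(people.columns), len(people))
print(list(without_age.columns), list(without_two.columns), list(without_first_row.index))
```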


df = pd.read_csv('train.csv')
df.drop(['Age', 'Cabin'], axis=1, inplace=True)  # modifies df in place
df.head()
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 S
4 5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 S
df = pd.read_csv('train.csv')
df = df.drop(['Age', 'Cabin'], axis=1)


PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Embarked
5 6 0 3 Moran, Mr. James male 0 0 330877 8.4583 Q
6 7 0 1 McCarthy, Mr. Timothy J male 0 0 17463 51.8625 S
7 8 0 3 Palsson, Master. Gosta Leonard male 3 1 349909 21.0750 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 0 2 347742 11.1333 S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 1 0 237736 30.0708 C
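The two ways of making a drop permanent, inplace=True and reassignment, end up with the same frame. A minimal sketch, assuming a made-up helper `make_frame` that stands in for re-reading train.csv:

```python
import pandas as pd

def make_frame():
    """Hypothetical stand-in for pd.read_csv('train.csv')."""
    return pd.DataFrame({
        'Age': [22.0, 38.0],
        'Cabin': [None, 'C85'],
        'Fare': [7.25, 71.2833],
    })

# Option 1: modify the frame in place. The call itself returns None.
a = make_frame()
result = a.drop(['Age', 'Cabin'], axis=1, inplace=True)

# Option 2: keep the default (inplace=False) and reassign the returned copy.
b = make_frame()
b = b.drop(['Age', 'Cabin'], axis=1)

print(result, list(a.columns), a.equals(b))
```

Because the in-place call returns None, writing `df = df.drop(..., inplace=True)` would replace df with None, so the two styles should not be mixed.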



df = pd.read_csv('train.csv')
(df['Survived'] == 1).head()  # boolean mask that is True for survivors
0    False
1     True
2     True
3     True
4    False
Name: Survived, dtype: bool
mask = df['Survived'] == 1  # store the boolean mask in a variable
df[mask].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
df[df['Survived'] == 1].head()  # this one-line form is more common
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
df[df['Survived'] == 1].describe()  # describe only the survivors
PassengerId Survived Pclass Age SibSp Parch Fare
count 342.000000 342.0 342.000000 290.000000 342.000000 342.000000 342.000000
mean 444.368421 1.0 1.950292 28.343690 0.473684 0.464912 48.395408
std 252.358840 0.0 0.863321 14.950952 0.708688 0.771712 66.596998
min 2.000000 1.0 1.000000 0.420000 0.000000 0.000000 0.000000
25% 250.750000 1.0 1.000000 19.000000 0.000000 0.000000 12.475000
50% 439.500000 1.0 2.000000 28.000000 0.000000 0.000000 26.000000
75% 651.500000 1.0 3.000000 36.000000 1.000000 1.000000 57.000000
max 890.000000 1.0 3.000000 80.000000 4.000000 5.000000 512.329200
df.describe()  # the original data
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
df[df['Age'] >= 60].describe()  # only rows with Age >= 60
PassengerId Survived Pclass Age SibSp Parch Fare
count 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000
mean 455.807692 0.269231 1.538462 65.096154 0.230769 0.307692 43.467950
std 240.078490 0.452344 0.811456 5.110811 0.429669 0.837579 51.269998
min 34.000000 0.000000 1.000000 60.000000 0.000000 0.000000 6.237500
25% 277.250000 0.000000 1.000000 61.250000 0.000000 0.000000 10.500000
50% 489.000000 0.000000 1.000000 63.500000 0.000000 0.000000 28.275000
75% 629.750000 0.750000 2.000000 69.000000 0.000000 0.000000 58.860450
max 852.000000 1.000000 3.000000 80.000000 1.000000 4.000000 263.000000
df[(df['Age']>=60) & (df['Sex']=='female')]  # only women aged 60 or older
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
366 367 1 1 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) female 60.0 1 0 110813 75.2500 D37 C
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
df[(df['Pclass']==1) | (df['Age']<10)]  # only first-class passengers or children under 10
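The combined conditions can be tried on a small frame where the expected rows are easy to verify by eye. The data in `passengers` is made up; only the column names follow the Titanic file. Each condition needs its own parentheses because & and | bind more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample using Titanic-style column names.
passengers = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 63.0, 4.0, np.nan, 61.0],
})

# AND: both conditions must hold.
older_women = passengers[(passengers['Age'] >= 60) & (passengers['Sex'] == 'female')]

# OR: either condition is enough. A comparison against NaN is False,
# so row 3 is kept only because of its Pclass.
first_or_child = passengers[(passengers['Pclass'] == 1) | (passengers['Age'] < 10)]

print(list(older_women.index), list(first_or_child.index))
```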

Prefixing a condition with ~ (tilde) applies a NOT operation, so you can filter on the negated condition.

data =[{'Name':'John', 'Survived':True},
      {'Name':'Emily', 'Survived':False},
      {'Name':'Ben', 'Survived':True}]
df = pd.DataFrame(data)
Name Survived
0 John True
1 Emily False
2 Ben True


df[df['Survived'] == True]
Name Survived
0 John True
2 Ben True

The Survived column is already Boolean, so == True is unnecessary. df['Survived'] is itself a Boolean Series, so it can be used directly as the filter: df[df['Survived']].

Name Survived
0 John True
2 Ben True

To narrow the data down to Survived == False, there is no need to write df[df['Survived'] == False]; you can do it like this:

df[~df['Survived']]
Name Survived
1 Emily False
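The NOT filter can be sketched end to end as follows. It rebuilds the same three-row frame under the name `flags` (an illustrative name) and also negates a comparison, which needs parentheses around the whole condition.

```python
import pandas as pd

data = [
    {'Name': 'John', 'Survived': True},
    {'Name': 'Emily', 'Survived': False},
    {'Name': 'Ben', 'Survived': True},
]
flags = pd.DataFrame(data)

survivors = flags[flags['Survived']]         # Boolean column used directly
non_survivors = flags[~flags['Survived']]    # ~ flips True and False
not_john = flags[~(flags['Name'] == 'John')] # negate a whole comparison

print(list(survivors['Name']), list(non_survivors['Name']), list(not_john['Name']))
```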



df = pd.read_csv('train.csv')
df = df[df['Sex']=='male']
df.head()  # the index is no longer consecutive
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S


As with .drop(), the original df is not overwritten, so to update df either pass inplace=True or reassign with df = df.reset_index().

df.reset_index().head()
index PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
2 5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
3 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
4 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
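As the output above shows, reset_index() keeps the old index as a new column called index. If that column is not wanted, the drop=True parameter discards it. A minimal sketch on a made-up four-row frame (`df_all` and its shortened names are illustrative only):

```python
import pandas as pd

# Hypothetical four-row sample; only the column names follow the Titanic file.
df_all = pd.DataFrame({
    'Name': ['Braund', 'Cumings', 'Heikkinen', 'Allen'],
    'Sex': ['male', 'female', 'female', 'male'],
})

males = df_all[df_all['Sex'] == 'male']        # keeps the original labels 0 and 3

renumbered = males.reset_index()               # old index becomes an 'index' column
renumbered_clean = males.reset_index(drop=True)  # old index is thrown away

print(list(males.index))
print(list(renumbered.columns), list(renumbered['index']))
print(list(renumbered_clean.columns), list(renumbered_clean.index))
```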



As with .reset_index(), .set_index() accepts inplace=True to overwrite the original df.

df.set_index('Name').head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Moran, Mr. James 6 0 3 male NaN 0 0 330877 8.4583 NaN Q
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Palsson, Master. Gosta Leonard 8 0 3 male 2.0 3 1 349909 21.0750 NaN S
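Once a column has been promoted to the index, rows can be looked up by that value with .loc. The sketch below assumes a made-up three-row frame `males` whose names are taken from the output above.

```python
import pandas as pd

# Hypothetical three-row sample with Titanic-style columns.
males = pd.DataFrame({
    'PassengerId': [1, 5, 6],
    'Name': ['Braund, Mr. Owen Harris', 'Allen, Mr. William Henry', 'Moran, Mr. James'],
    'Fare': [7.25, 8.05, 8.4583],
})

by_name = males.set_index('Name')   # returns a new frame; males is unchanged

# The Name column is now the row index, so .loc accepts a passenger name.
fare = by_name.loc['Allen, Mr. William Henry', 'Fare']

print(by_name.index.name, list(by_name.columns), fare)
```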

