
Trying out basic Pandas operations in Jupyter Lab (Part 1)

Last updated at Posted at 2020-04-15

In this article I take the basic Pandas operations described on the blog of Kame (@usdatascientist) (https://datawokagaku.com/python_for_ds_summary/) and code them myself in Jupyter Lab.

Summary of basic Pandas operations

Lesson 10

import pandas as pd
import numpy as np

Series

data = {'name':'John', 'sex':'male', 'age': 22}
john_s = pd.Series(data)
print(john_s)
name    John
sex     male
age       22
dtype: object
array = np.array([10,20,30])
pd.Series(array)
0    10
1    20
2    30
dtype: int64
array = np.array([10,20,30])
labels = ['a','b','c']
pd.Series(array, labels)
a    10
b    20
c    30
dtype: int64
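
As a minimal sketch of what the labels are for (the variable names `s`, `by_label` and `by_position` are illustrative and not from the original article), a labelled Series can be read by label as well as by position:

```python
import numpy as np
import pandas as pd

# Build the same labelled Series as above.
array = np.array([10, 20, 30])
labels = ['a', 'b', 'c']
s = pd.Series(array, index=labels)

by_label = s['b']          # label-based access
by_position = s.iloc[1]    # position-based access to the same element
print(by_label, by_position)
```

Passing the labels as the keyword argument index= is equivalent to passing them as the second positional argument, as the code above does.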

Lesson 11

How to create a DataFrame

Creating from an ndarray

data = {'name':'John', 'sex':'male', 'age': 22}
john_s = pd.Series(data)
print(john_s)
print(john_s['age'])
name    John
sex     male
age       22
dtype: object
22
ndarray = np.random.randint(5, size=(5,4))
pd.DataFrame(data=ndarray)
   0  1  2  3
0  1  1  1  0
1  4  1  0  0
2  3  2  1  0
3  3  1  1  3
4  4  0  1  3
columns = ['a','b','c','d']
index = np.arange(0,50,10)
pd.DataFrame(data=ndarray, index=index, columns=columns)
    a  b  c  d
0   1  1  1  0
10  4  1  0  0
20  3  2  1  0
30  3  1  1  3
40  4  0  1  3

Creating from dictionaries

data1 = {
    'name':'John',
    'sex':'male',
    'age':22
}
data2 = {
    'name':'Zack',
    'sex':'male',
    'age':30
}
data3 ={
    'name':'Emily',
    'sex':'female',
    'age':32
}
pd.DataFrame([data1, data2, data3])
    name     sex  age
0   John    male   22
1   Zack    male   30
2  Emily  female   32
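
One detail of this constructor that is worth knowing, sketched here with made-up records rather than anything from the original article: when one dictionary lacks a key that the others have, pandas fills the gap with NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical records; the second one has no 'age' key.
records = [
    {'name': 'John', 'sex': 'male', 'age': 22},
    {'name': 'Zack', 'sex': 'male'},
]
people = pd.DataFrame(records)

# Each dict becomes a row, each key a column; missing values become NaN.
print(people)
print(people['age'].isna().tolist())
```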
df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Lesson 12

Show the first 5 rows with .head()

df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Check summary statistics with .describe()

df.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
type(df.describe())  # the type is DataFrame
pandas.core.frame.DataFrame
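
Because .describe() returns an ordinary DataFrame, individual statistics can be pulled out with the usual indexing. A minimal sketch on a small made-up frame (the values are illustrative and not taken from train.csv):

```python
import pandas as pd

# Toy data standing in for two numeric columns.
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})
stats = toy.describe()

# Rows of the result are statistic names, columns are the original columns.
mean_age = stats.loc['mean', 'Age']
max_fare = stats.loc['max', 'Fare']
print(mean_age, max_fare)
```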

Show the list of columns with .columns

df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
type(df.columns)  # the type is Index
pandas.core.indexes.base.Index
df.index  # the rows have an index too
RangeIndex(start=0, stop=891, step=1)

Use brackets [] to extract a single column as a Series.

df['Age'].head()
0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64
type(df['Age'])
pandas.core.series.Series

Put a list of column names inside the brackets [] to extract several columns at once

df[['Age','Parch','Fare']].head()
    Age  Parch     Fare
0  22.0      0   7.2500
1  38.0      0  71.2833
2  26.0      0   7.9250
3  35.0      0  53.1000
4  35.0      0   8.0500

Get a specific row as a Series with .iloc[int]

df.iloc[888]  # iloc selects by integer position
PassengerId                                         889
Survived                                              0
Pclass                                                3
Name           Johnston, Miss. Catherine Helen "Carrie"
Sex                                              female
Age                                                 NaN
SibSp                                                 1
Parch                                                 2
Ticket                                       W./C. 6607
Fare                                              23.45
Cabin                                               NaN
Embarked                                              S
Name: 888, dtype: object
df.iloc[888]['Age']
nan
np.isnan(df.iloc[888]['Age'])
True
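
np.isnan works here because Age is stored as a float. A somewhat more general check, sketched below on made-up values (the `row` Series is hypothetical), is pd.isna, which also accepts None and non-numeric objects:

```python
import numpy as np
import pandas as pd

# A hand-made row with one missing float and one missing object value.
row = pd.Series({'Name': 'Johnston', 'Age': np.nan, 'Cabin': None})

print(pd.isna(row['Age']))    # missing float
print(pd.isna(row['Cabin']))  # missing object value
print(pd.isna(row['Name']))   # value that is present
```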
np.random.seed(1)
ndarray = np.random.randint(10, size=(5,5))
columns = [0,1,2,3,4]
index = ['a','b','c','d','e']
df_1 = pd.DataFrame(data=ndarray, index=index, columns=columns)
df_1
   0  1  2  3  4
a  5  8  9  5  0
b  0  1  7  6  9
c  2  4  5  2  4
d  2  4  7  7  9
e  1  7  0  6  9
df_1[0] 
a    5
b    0
c    2
d    2
e    1
Name: 0, dtype: int64
df_1.loc['c']  # when the row labels are not ints, select by label with .loc['str']
0    2
1    4
2    5
3    2
4    4
Name: c, dtype: int64
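
To make the difference between the two accessors concrete, here is a small sketch that reuses the df_1 layout (the values are typed in by hand rather than drawn at random): .loc takes the label, .iloc takes the integer position, and both return the same row.

```python
import pandas as pd

df_1 = pd.DataFrame(
    [[5, 8, 9], [0, 1, 7], [2, 4, 5]],
    index=['a', 'b', 'c'],
    columns=[0, 1, 2],
)

row_by_label = df_1.loc['c']      # select by row label
row_by_position = df_1.iloc[2]    # third row, counting from 0
cell = df_1.loc['c', 1]           # row label 'c', column label 1
print(row_by_label.equals(row_by_position), cell)
```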

Drop specific rows or columns with .drop()

Drop index=0 (row 0)

df.drop(0).head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

Drop the 'Age' column

df.drop('Age', axis=1).head()
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 NaN S

To drop several columns, pass a list as the argument: .drop([...]). Dropping does not change the original df.

df.drop(['Age','PassengerId'], axis=1).head()
Survived Pclass Name Sex SibSp Parch Ticket Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 NaN S
df.head()  # the original df is unchanged after the drop
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
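
The same behaviour can be checked without train.csv. In this sketch (toy data, hypothetical variable names), .drop returns a new DataFrame and leaves the original columns in place:

```python
import pandas as pd

toy = pd.DataFrame({
    'PassengerId': [1, 2],
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.28],
})

# axis=1 means "drop columns"; columns=['Age', 'PassengerId'] is an equivalent spelling.
dropped = toy.drop(['Age', 'PassengerId'], axis=1)

print(list(dropped.columns))
print(list(toy.columns))
```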

There are two ways to overwrite df. One is inplace=True, which modifies the original DataFrame; the other is to assign the result back to df.

df = pd.read_csv('train.csv')
df.drop(['Age', 'Cabin'], axis=1, inplace=True) 
df.head()
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833 C
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250 S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0 113803 53.1000 S
4 5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500 S
df = pd.read_csv('train.csv')
df = df.drop(['Age', 'Cabin'], axis=1)
id(df)
140285150057616
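
A minimal sketch of the two approaches side by side, on toy data (the helper `make_toy` is hypothetical). Note that with inplace=True, .drop returns None, so its result should not be assigned back to the variable.

```python
import pandas as pd

def make_toy():
    """Return a fresh toy DataFrame so both approaches start from the same data."""
    return pd.DataFrame({
        'Age': [22.0, 38.0],
        'Cabin': [None, 'C85'],
        'Fare': [7.25, 71.28],
    })

# Way 1: modify the existing object in place.
df_a = make_toy()
result = df_a.drop(['Age', 'Cabin'], axis=1, inplace=True)

# Way 2: build a new DataFrame and rebind the name to it.
df_b = make_toy()
df_b = df_b.drop(['Age', 'Cabin'], axis=1)

print(result, list(df_a.columns), list(df_b.columns))
```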

Get multiple rows by slicing

df.iloc[5:10]
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare Embarked
5 6 0 3 Moran, Mr. James male 0 0 330877 8.4583 Q
6 7 0 1 McCarthy, Mr. Timothy J male 0 0 17463 51.8625 S
7 8 0 3 Palsson, Master. Gosta Leonard male 3 1 349909 21.0750 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 0 2 347742 11.1333 S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 1 0 237736 30.0708 C
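
As the output shows, an .iloc slice excludes the end position, like a Python list slice. Label-based slices with .loc behave differently and include the end label. A small sketch on a toy frame (not from the original article):

```python
import pandas as pd

toy = pd.DataFrame({'x': range(10)})   # default index 0..9

by_position = toy.iloc[5:10]   # positions 5 to 9; position 10 is excluded
by_label = toy.loc[5:9]        # labels 5 to 9; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```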

Lesson 13

Filter a DataFrame by a condition

df = pd.read_csv('train.csv')
(df['Survived'] == 1).head()  # boolean Series that is True for survivors; df itself is left untouched
0    False
1     True
2     True
3     True
4    False
Name: Survived, dtype: bool
filter = df['Survived'] == 1  # store the mask in a variable named filter (this shadows Python's built-in filter)
df[filter].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
df[df['Survived'] == 1].head()  # this form is more common
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
df[df['Survived'] == 1].describe()  # describe only the survivors
PassengerId Survived Pclass Age SibSp Parch Fare
count 342.000000 342.0 342.000000 290.000000 342.000000 342.000000 342.000000
mean 444.368421 1.0 1.950292 28.343690 0.473684 0.464912 48.395408
std 252.358840 0.0 0.863321 14.950952 0.708688 0.771712 66.596998
min 2.000000 1.0 1.000000 0.420000 0.000000 0.000000 0.000000
25% 250.750000 1.0 1.000000 19.000000 0.000000 0.000000 12.475000
50% 439.500000 1.0 2.000000 28.000000 0.000000 0.000000 26.000000
75% 651.500000 1.0 3.000000 36.000000 1.000000 1.000000 57.000000
max 890.000000 1.0 3.000000 80.000000 4.000000 5.000000 512.329200
df.describe()  # the original data
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
df[df['Age'] >= 60].describe()  # only 'Age' >= 60
PassengerId Survived Pclass Age SibSp Parch Fare
count 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000 26.000000
mean 455.807692 0.269231 1.538462 65.096154 0.230769 0.307692 43.467950
std 240.078490 0.452344 0.811456 5.110811 0.429669 0.837579 51.269998
min 34.000000 0.000000 1.000000 60.000000 0.000000 0.000000 6.237500
25% 277.250000 0.000000 1.000000 61.250000 0.000000 0.000000 10.500000
50% 489.000000 0.000000 1.000000 63.500000 0.000000 0.000000 28.275000
75% 629.750000 0.750000 2.000000 69.000000 0.000000 0.000000 58.860450
max 852.000000 1.000000 3.000000 80.000000 1.000000 4.000000 263.000000
df[(df['Age']>=60) & (df['Sex']=='female')]  # only women aged 60 or over
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
275 276 1 1 Andrews, Miss. Kornelia Theodosia female 63.0 1 0 13502 77.9583 D7 S
366 367 1 1 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) female 60.0 1 0 110813 75.2500 D37 C
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0 0 4134 9.5875 NaN S
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0000 B28 NaN
df[(df['Pclass']==1) | (df['Age']<10)]  # only 1st-class passengers or those under 10
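
The parentheses around each condition are required: in Python, & and | bind more tightly than == and >=, so without them the expression is parsed differently. A sketch on toy data (values made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'Pclass': [1, 3, 3, 2],
    'Age': [63.0, 4.0, 30.0, 61.0],
    'Sex': ['female', 'male', 'male', 'male'],
})

# AND: both conditions must hold for a row to be kept.
old_and_female = toy[(toy['Age'] >= 60) & (toy['Sex'] == 'female')]
# OR: a row is kept if either condition holds.
first_or_child = toy[(toy['Pclass'] == 1) | (toy['Age'] < 10)]

print(old_and_female.index.tolist())
print(first_or_child.index.tolist())
```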

Prefix a condition with ~ (tilde, or "squiggle") to filter with a NOT operation

data =[{'Name':'John', 'Survived':True},
      {'Name':'Emily', 'Survived':False},
      {'Name':'Ben', 'Survived':True}]
df = pd.DataFrame(data)
df
    Name  Survived
0   John      True
1  Emily     False
2    Ben      True

This is often used when filtering on a column whose values are boolean.

df[df['Survived']==True] 
   Name  Survived
0  John      True
2   Ben      True

The Survived column is already boolean, so == True is unnecessary. df['Survived'] is itself a boolean Series, so it can be used directly as the filter, as follows.

df[df['Survived']] 
   Name  Survived
0  John      True
2   Ben      True

To narrow the data down to Survived == False, there is no need to write df[df['Survived']==False]; it can be done as follows.

df[~df['Survived']] 
    Name  Survived
1  Emily     False
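
Since ~ is an element-wise NOT on a boolean Series, it can also negate a whole compound condition. A small sketch using the same three people (the Age column is made up for illustration, and `df_bool` is a hypothetical name chosen to avoid clobbering df):

```python
import pandas as pd

df_bool = pd.DataFrame({
    'Name': ['John', 'Emily', 'Ben'],
    'Survived': [True, False, True],
    'Age': [22, 32, 70],   # hypothetical values
})

# Rows where Survived is False.
not_survived = df_bool[~df_bool['Survived']]
# Everyone except survivors aged 60 or over.
not_old_survivor = df_bool[~(df_bool['Survived'] & (df_bool['Age'] >= 60))]

print(not_survived['Name'].tolist())
print(not_old_survivor['Name'].tolist())
```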

Changing the index

Reassign the index with .reset_index()

df = pd.read_csv('train.csv')
df = df[df['Sex']=='male']
df.head()  # the index now has gaps
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S

Renumber the index

As with .drop(), the original df is not overwritten, so to update df either pass inplace=True or reassign with df = df.reset_index().

df.reset_index().head()
index PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
2 5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
3 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
4 7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
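
In the output above, the old index is kept as a new 'index' column. If that column is not wanted, reset_index accepts drop=True. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'male', 'male']})
males = toy[toy['Sex'] == 'male']          # remaining index labels: 0, 2, 3

kept = males.reset_index()                 # old index saved in an 'index' column
discarded = males.reset_index(drop=True)   # old index thrown away

print(list(kept.columns), kept['index'].tolist())
print(list(discarded.columns), discarded.index.tolist())
```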

Make a specific column the index with .set_index()

Set the index to 'Name'

As with .reset_index(), inplace=True overwrites the original df.

df.set_index('Name').head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Moran, Mr. James 6 0 3 male NaN 0 0 330877 8.4583 NaN Q
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Palsson, Master. Gosta Leonard 8 0 3 male 2.0 3 1 349909 21.0750 NaN S
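
Once a column has become the index, rows can be looked up by its values with .loc. A sketch with two names typed in by hand (the `toy` and `by_name` variables are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Allen, Mr. William Henry'],
    'Age': [22.0, 35.0],
})

by_name = toy.set_index('Name')   # returns a new frame indexed by Name
age = by_name.loc['Allen, Mr. William Henry', 'Age']

print(age)
print(list(toy.columns))          # toy itself still has its Name column
```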