はじめに
機械学習の勉強を始めたVBAユーザです。
備忘録としてPython・Rの文法をVBAと比較しながらまとめています。
以前の記事でデータフレームについて書きましたが、今回はその続きで、Rのdplyr
というパッケージについて書きます。
前回と同じ項目について、Pythonのpandas
とRのdplyr
を比較したいと思います。
目次
データフレーム
今回もデータフレームについてです。Pythonのpandas
とRのdplyr
を比較します。
データフレームは、このような形をしたデータで、縦方向に同じデータ型のデータが同じ数だけ並んでいるものです。つまり、列ベクトルを横に並べたようなものです。(ベクトルについてはこちら参照。)
||X|Y|Z|
|--:|:-:|:-:|:-:|:-:|
|1|20|100.1|AAA|
|2|40|200.2|BBB|
|3|10|300.3|CCC|
|4|30|400.4|DDD|
|5|50|500.5|EEE|
縦方向を「行」、横方向を「列」といいます。
Python(pandas) | R | RDB(Access) | Excel | 行列 | |
---|---|---|---|---|---|
オブジェクト | DataFrame | data.frame | テーブル | テーブル | 行列(matrix) |
行 | index(axis=0) | 行(row) | レコード | 行(ROW) | 行(row) |
列 | columns(axis=1) | 列(col) | フィールド | 列(COLUMN) | 列(column) |
要素 | values | 値 | CELL | 要素 |
データフレームへのアクセス
指定した列へのアクセス
データフレームから列を取り出します。
Python
import numpy as np
import pandas as pd
x = np.array([2,4,1,3,5])*10
y = np.arange(1,5+1)*100+np.arange(1,5+1)/10
z = ['AAA','BBB','CCC','DDD','EEE']
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df
# X Y Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE
df.X
df['X']
print(df['X'])
# 0 20
# 1 40
# 2 10
# 3 30
# 4 50
# Name: X, dtype: int32
type(df['X'])
# pandas.core.series.Series
df[['X']]
print(df[['X']])
df.loc[:,['X']]
print(df.loc[:,['X']])
# X
# 0 20
# 1 40
# 2 10
# 3 30
# 4 50
type(df[['X']])
# pandas.core.frame.DataFrame
df[['X','Z']]
print(df[['X','Z']])
df.loc[:, ['X','Z']]
print(df.loc[:, ['X','Z']])
df.iloc[:, [1-1,3-1]]
print(df.iloc[:, [1-1,3-1]])
# X Z
# 0 20 AAA
# 1 40 BBB
# 2 10 CCC
# 3 30 DDD
# 4 50 EEE
df[['Z','Y','X']]
print(df[['Z','Y','X']])
df.loc[:,['Z','Y','X']]
print(df.loc[:,['Z','Y','X']])
df.iloc[:,[3-1,2-1,1-1]]
print(df.iloc[:,[3-1,2-1,1-1]])
var = ['Z','Y','X']
df.loc[:,var]
print(df.loc[:,var])
# Z Y X
# 0 AAA 100.1 20
# 1 BBB 200.2 40
# 2 CCC 300.3 10
# 3 DDD 400.4 30
# 4 EEE 500.5 50
pandasのDataFrameはSeries(1次元)を横に並べたものです。
df.X
,df['X']
は、そのDataFrameのもとになっているSeriesを返します。
df[['X']]
は、df[['X','Z']]
やdf[['Z','Y','X']]
と同じく、DataFrameから指定した列だけ抜き出したもの(その列数がたまたま1個だっただけ)で、DataFrameのsubset(一部)のDataFrameが返ります。
R
library(dplyr)
x = c(2,4,1,3,5)*10
y = 1:5*100+1:5/10
z = c("AAA","BBB","CCC","DDD","EEE")
df <- data.frame(X=x, Y=y, Z=z)
df
# X Y Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE
pull(df, X)
pull(df, 1)
df %>% pull(X)
df %>% pull(1)
# [1] 20 40 10 30 50
class(pull(df, X))
# [1] "numeric"
pull(df, -2) # 右から2列目
df %>% pull(-2)
# [1] 100.1 200.2 300.3 400.4 500.5
select(df, X)
df %>% select(X)
# X
# 1 20
# 2 40
# 3 10
# 4 30
# 5 50
class(select(df, X))
# [1] "data.frame"
select(df, X, Z)
df %>% select(X, Z)
# X Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE
select(df, -Y) # Y以外
df %>% select(-Y)
# X Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE
select(df, Z, Y, X)
df %>% select(Z, Y, X)
# Z Y X
# 1 AAA 100.1 20
# 2 BBB 200.2 40
# 3 CCC 300.3 10
# 4 DDD 400.4 30
# 5 EEE 500.5 50
select(df, X:Z)
df %>% select(X:Z)
# X Y Z
1 20 100.1 AAA
2 40 200.2 BBB
3 10 300.3 CCC
4 30 400.4 DDD
5 50 500.5 EEE
select(df, Z:X)
df %>% select(Z:X)
# Z Y X
# 1 AAA 100.1 20
# 2 BBB 200.2 40
# 3 CCC 300.3 10
# 4 DDD 400.4 30
# 5 EEE 500.5 50
var <- c("Z", "Y", "X")
select(df, !!var)
df %>% select(!!var)
# Z Y X
# 1 AAA 100.1 20
# 2 BBB 200.2 40
# 3 CCC 300.3 10
# 4 DDD 400.4 30
# 5 EEE 500.5 50
pull(df, X)
はベクトルを返します。一方、select(df, X)
は1列だけのデータフレームを返します。
pull(df, X)
と書くのとdf %>% pull(X)
と書くのは等価です。
%>%
はパイプという機能で、左のdf
を右の関数pull()
の第一引数に渡しています。これを使うことで、関数を何重にも重ねる代わりに、右に連結していけます(関数の連結)。dplyr
の関数は第一引数にデータフレームを取るように統一されているため、このパイプ機能が相性良く使えます。
df %>% 関数1() %>% 関数2() %>% 関数3() %>% ...
のような形で。
なお、このパイプ演算子%>%
は、RStudioでは Shift + Ctrl + M で入力できます。
最後の例の!!var
は、変数varに入っている文字列を列名として使っています。
指定した行へのアクセス
先頭の方の行と末尾の方の行を取り出します。
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.head(3)
print(df.head(3))
# X Y Z
# 0 20 100.1 AAA
# 1 40 200.2 BBB
# 2 10 300.3 CCC
df.tail(2)
print(df.tail(2))
# X Y Z
# 3 30 400.4 DDD
# 4 50 500.5 EEE
R
df <- data.frame(X=x, Y=y, Z=z)
df
slice_head(df, n = 3)
df %>% slice_head(n = 3)
# X Y Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
slice_tail(df, n = 2)
df %>% slice_tail(n = 2)
# X Y Z
# 1 30 400.4 DDD
# 2 50 500.5 EEE
slice(df, 1, 3)
df %>% slice(1, 3)
# X Y Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC
slice(df, c(1, 3))
df %>% slice(c(1, 3))
# X Y Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC
slice(df, 2:4) # 2-4列
df %>% slice(2:4)
# X Y Z
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD
slice(df, -(2:4))
df %>% slice(-(2:4)) # 2-4列以外
# X Y Z
# 1 20 100.1 AAA
# 2 50 500.5 EEE
slice(df, -c(1,5)) # 1,5列以外
df %>% slice(-c(1,5))
# X Y Z
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD
slice(df, -2, -(4:5))
slice(df, -c(2, 4:5))
df %>% slice(-2, -(4:5))
df %>% slice(-c(2, 4:5))
# X Y Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC
指定した行と列の選択
データフレームから行・列を指定して取り出します。
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.iat[1-1,1-1]
df.at[1-1,'X']
df.iloc[1-1,1-1]
df.loc[1-1,'X']
df['X'][1-1]
df.X[1-1]
# 20
df.iloc[[1-1,3-1],[1-1,3-1]]
print(df.iloc[[1-1,3-1],[1-1,3-1]])
df.loc[[1-1,3-1],['X','Z']]
print(df.loc[[1-1,3-1],['X','Z']])
# X Z
# 0 20 AAA
# 2 10 CCC
df.iloc[:,[1-1,3-1]]
print(df.iloc[:,[1-1,3-1]])
df.loc[:,['X','Z']]
print(df.loc[:,['X','Z']])
df[['X','Z']]
print(df[['X','Z']])
# X Z
# 0 20 AAA
# 1 40 BBB
# 2 10 CCC
# 3 30 DDD
# 4 50 EEE
df.iloc[[1-1,3-1],:]
# print(df.iloc[[1-1,3-1],:])
df.loc[[1-1,3-1],:]
# print(df.loc[[1-1,3-1],:])
df.iloc[[1-1,3-1]]
# print(df.iloc[[1-1,3-1]])
df.loc[[1-1,3-1]]
# print(df.loc[[1-1,3-1]])
# X Y Z
# 0 20 100.1 AAA
# 2 10 300.3 CCC
loc
もiloc
も選択する場所(location)を指定。loc
とiloc
の違いは、loc
が label-based(名前で指定)、iloc
が integer-location based(番号で指定)。
R
pull(df, 1)[1]
pull(df, X)[1]
df %>% pull(1) %>% `[`(1)
df %>% pull(X) %>% `[`(1)
df %>% `[`(1,1)
df[1,1]
# [1] 20
# なお、ベクトルxに対して x %>% `[`(1) は x[1] と同等
x %>% `[`(1)
x[1]
# [1] 20
select(slice(df, 1, 3), 1, 3)
select(slice(df, 1, 3), X, Z)
slice(select(df, 1, 3), 1, 3)
slice(select(df, X, Z), 1, 3)
slice(select(df, c(1, 3)), c(1, 3))
slice(select(df, c(X, Z)), c(1, 3))
df %>% select(1, 3) %>% slice(1, 3)
df %>% select(X, Z) %>% slice(1, 3)
df %>% slice(1, 3) %>% select(1, 3)
df %>% slice(1, 3) %>% select(X, Z)
df %>% slice(c(1, 3)) %>% select(c(1, 3))
df %>% slice(c(1, 3)) %>% select(c(X, Z))
# X Z
# 1 20 AAA
# 2 10 CCC
select(df, 1, 3)
select(df, X, Z)
select(df, c(1, 3))
select(df, c(X, Z))
select(df, c("X", "Z"))
# X Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE
select(df, -Y)
select(df, !Y)
df %>% select(-Y)
df %>% select(!Y)
# X Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE
slice(df, 1, 3)
df %>% slice(1, 3)
# X Y Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC
slice(df, 2:4)
df %>% slice(2:4)
# X Y Z
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD
slice(df, -2, -(4:5))
slice(df, -c(2, 4:5))
df %>% slice(-2, -(4:5))
df %>% slice(-c(2, 4:5))
# X Y Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC
slice(select(df, X, Z), 1, 3)
は、select()
関数にslice()
関数を重ねていますが、パイプを使って連結するとdf %>% select(X, Z) %>% slice(1, 3)
とすっきり書けてわかりやすくなります。
データフレームの操作
列の選択
SQLのSELECT文に相当する列選択です。
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df
# X Y Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE
df[['X']]
print(df[['X']])
df.iloc[:,[1-1]]
print(df.iloc[:,[1-1]])
df.loc[:,['X']]
print(df.loc[:,['X']])
# X
# 0 20
# 1 40
# 2 10
# 3 30
# 4 50
df[['X','Z']]
print(df[['X','Z']])
df.loc[:,['X','Z']]
print(df.loc[:,['X','Z']])
df.iloc[:,[1-1,3-1]]
print(df.iloc[:,[1-1,3-1]])
# X Z
# 0 20 AAA
# 1 40 BBB
# 2 10 CCC
# 3 30 DDD
# 4 50 EEE
df[['Z','Y','X']]
print(df[['Z','Y','X']])
df.loc[:,['Z','Y','X']]
print(df.loc[:,['Z','Y','X']])
df.iloc[:,[3-1,2-1,1-1]]
print(df.iloc[:,[3-1,2-1,1-1]])
var = ['Z','Y','X']
df.loc[:,var]
print(df.loc[:,var])
# Z Y X
# 0 AAA 100.1 20
# 1 BBB 200.2 40
# 2 CCC 300.3 10
# 3 DDD 400.4 30
# 4 EEE 500.5 50
R
df <- data.frame(X = x, Y = y, Z = z)
df
# X Y Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE
select(df, X)
df %>% select(X)
# X
# 1 20
# 2 40
# 3 10
# 4 30
# 5 50
select(df, X, Z)
df %>% select(X, Z)
# X Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE
select(df, Z, X, Y)
df %>% select(Z, X, Y)
# Z X Y
# 1 AAA 20 100.1
# 2 BBB 40 200.2
# 3 CCC 10 300.3
# 4 DDD 30 400.4
# 5 EEE 50 500.5
SQL
SELECT X FROM df;
SELECT Z, X FROM df;
SELECT Z, X, Y FROM df;
列の追加
データフレームに新しい列を追加する方法についてです。
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df['A'] = np.arange(1,5+1,1)
df
df['X3'] = df['X'] * 3
df
df['S'] = list('MFMFM') # ['M', 'F', 'M', 'F', 'M']
df
print(df)
# X Y Z A X3 S
# 0 20 100.1 AAA 1 60 M
# 1 40 200.2 BBB 2 120 F
# 2 10 300.3 CCC 3 30 M
# 3 30 400.4 DDD 4 90 F
# 4 50 500.5 EEE 5 150 M
R
df <- data.frame(X=x, Y=y, Z=z)
df
mutate(df, A = 1:5)
df %>% mutate(A = 1:5)
# X Y Z A
# 1 20 100.1 AAA 1
# 2 40 200.2 BBB 2
# 3 10 300.3 CCC 3
# 4 30 400.4 DDD 4
# 5 50 500.5 EEE 5
mutate(df, X3 = X*3)
df %>% mutate(X3 = X*3)
# X Y Z A S X3
# 1 20 100.1 AAA 1 M 60
# 2 40 200.2 BBB 2 F 120
# 3 10 300.3 CCC 3 M 30
# 4 30 400.4 DDD 4 F 90
# 5 50 500.5 EEE 5 M 150
mutate(df, s = tolower(S))
df %>% mutate(s = tolower(S))
# X Y Z X3
# 1 20 100.1 AAA 60
# 2 40 200.2 BBB 120
# 3 10 300.3 CCC 30
# 4 30 400.4 DDD 90
# 5 50 500.5 EEE 150
mutate(mutate(df, A = 1:5), X3 = X*3)
df %>% mutate(A = 1:5) %>% mutate(X3 = X*3)
# X Y Z A X3
# 1 20 100.1 AAA 1 60
# 2 40 200.2 BBB 2 120
# 3 10 300.3 CCC 3 30
# 4 30 400.4 DDD 4 90
# 5 50 500.5 EEE 5 150
mutate(df, A = 1:5, X3 = X*3)
df %>% mutate(A = 1:5, X3 = X*3)
# X Y Z A X3
# 1 20 100.1 AAA 1 60
# 2 40 200.2 BBB 2 120
# 3 10 300.3 CCC 3 30
# 4 30 400.4 DDD 4 90
# 5 50 500.5 EEE 5 150
mutate(df, A = 1:5,
X3 = X*3,
S = rep(c("M","F"),length=5),
s = tolower(S))
df %>% mutate(A = 1:5,
X3 = X*3,
S = rep(c("M","F"),length=5),
s = tolower(S))
# X Y Z A X3 S s
# 1 20 100.1 AAA 1 60 M m
# 2 40 200.2 BBB 2 120 F f
# 3 10 300.3 CCC 3 30 M m
# 4 30 400.4 DDD 4 90 F f
# 5 50 500.5 EEE 5 150 M m
df %>%
mutate(A = 1:5) %>%
mutate(X3 = X*3) %>%
mutate(S = rep(c("M","F"),length=5)) %>%
mutate(s = tolower(S))
SQL
SELECT X, Y, Z, X * 3 AS X3
FROM df;
SELECT
X,
Y,
Z,
X * 3 AS X3
FROM
df
;
列名変更
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.columns = ['xx','yy','zz']
df
print(df)
# xx yy zz
# 0 20 100.1 AAA
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD
# 4 50 500.5 EEE
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.rename(columns={'X':'xx','Y':'yy','Z':'zz'})
print(df.rename(columns={'X':'xx','Y':'yy','Z':'zz'}))
# xx yy zz
# 0 20 100.1 AAA
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD
# 4 50 500.5 EEE
df.rename(columns={'X':'xx','Y':'yy','Z':'zz'}, inplace=True)
df
print(df)
# xx yy zz
# 0 20 100.1 AAA
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD
# 4 50 500.5 EEE
R
df <- data.frame(X=x, Y=y, Z=z)
df
rename(df, xx = X, yy = Y, zz = Z)
df %>% rename(xx = X, yy = Y, zz = Z)
# xx yy zz
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE
df %>%
rename(xx = X, yy = Y, zz = Z) %>%
rename(yyy = yy)
# xx yyy zz
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE
SQL
SELECT X AS xx, Y AS yyy, Z AS zz
FROM df;
列への代入
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df['Y'] = df['X'] / 1000
df
print(df)
# X Y Z
# 0 20 0.02 AAA
# 1 40 0.04 BBB
# 2 10 0.01 CCC
# 3 30 0.03 DDD
# 4 50 0.05 EEE
R
df <- data.frame(X=x, Y=y, Z=z)
df
mutate(df, Y = X / 1000)
df %>% mutate(Y = X / 1000)
# X Y Z
# 1 20 0.02 AAA
# 2 40 0.04 BBB
# 3 10 0.01 CCC
# 4 30 0.03 DDD
# 5 50 0.05 EEE
SQL
SELECT
X,
X / 100 AS Y,
Z
FROM
df;
行の条件抽出フィルタ
SQLのWHERE条件に相当する、条件を指定して行を抽出(フィルタ)する方法です。
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df['X'] < 30
# 0 True
# 1 False
# 2 True
# 3 False
# 4 False
# Name: X, dtype: bool
type(df['X'] < 30)
# pandas.core.series.Series
df[df['X'] < 30]
print(df[df['X'] < 30])
df.loc[df['X'] < 30,:]
print(df.loc[df['X'] < 30,:])
df.query('X < 30')
print(df.query('X < 30'))
# X Y Z
# 0 20 100.1 AAA
# 2 10 300.3 CCC
(df['X'] > 10) & (df['X'] < 50)
# 0 True
# 1 True
# 2 False
# 3 True
# 4 False
# Name: X, dtype: bool
df[(df['X'] > 10) & (df['X'] < 50)]
print(df[(df['X'] > 10) & (df['X'] < 50)])
df.query('X > 10 & X < 50')
print(df.query('X > 10 & X < 50'))
df.query('10 < X < 50')
print(df.query('10 < X < 50'))
# X Y Z
# 0 20 100.1 AAA
# 1 40 200.2 BBB
# 3 30 400.4 DDD
R
df <- data.frame(X=x, Y=y, Z=z)
df
filter(df, X < 30)
df %>% filter(X < 30)
# X Y Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC
filter(df, X > 10 & X < 50)
df %>% filter(X > 10 & X < 50)
# X Y Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 30 400.4 DDD
SQL
SELECT *
FROM df
WHERE X < 30;
SELECT *
FROM df
WHERE X > 10 AND X < 50;
行のソート
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.sort_values(by='X')
print(df.sort_values(by='X'))
# X Y Z
# 2 10 300.3 CCC
# 0 20 100.1 AAA
# 3 30 400.4 DDD
# 1 40 200.2 BBB
# 4 50 500.5 EEE
df.sort_values(by='X', ascending=False)
print(df.sort_values(by='X', ascending=False))
# X Y Z
# 4 50 500.5 EEE
# 1 40 200.2 BBB
# 3 30 400.4 DDD
# 0 20 100.1 AAA
# 2 10 300.3 CCC
df['S'] = ['M','F','M','F','M']
df
print(df)
# X Y Z S
# 0 20 100.1 AAA M
# 1 40 200.2 BBB F
# 2 10 300.3 CCC M
# 3 30 400.4 DDD F
# 4 50 500.5 EEE M
df.sort_values(by=['S','X'])
print(df.sort_values(by=['S','X']))
# X Y Z S
# 3 30 400.4 DDD F
# 1 40 200.2 BBB F
# 2 10 300.3 CCC M
# 0 20 100.1 AAA M
# 4 50 500.5 EEE M
R
df <- data.frame(X=x, Y=y, Z=z)
df
arrange(df, X)
df %>% arrange(X)
# X Y Z
# 1 10 300.3 CCC
# 2 20 100.1 AAA
# 3 30 400.4 DDD
# 4 40 200.2 BBB
# 5 50 500.5 EEE
arrange(df, desc(X))
df %>% arrange(desc(X))
# X Y Z
# 1 50 500.5 EEE
# 2 40 200.2 BBB
# 3 30 400.4 DDD
# 4 20 100.1 AAA
# 5 10 300.3 CCC
df <- data.frame(X=x, Y=y, Z=z)
df <- df %>% mutate(S = rep(c("M","F"),length=5))
df
# X Y Z S
# 1 20 100.1 AAA M
# 2 40 200.2 BBB F
# 3 10 300.3 CCC M
# 4 30 400.4 DDD F
# 5 50 500.5 EEE M
arrange(df, S, X)
df %>% arrange(S, X)
# X Y Z S
# 4 30 400.4 DDD F
# 2 40 200.2 BBB F
# 3 10 300.3 CCC M
# 1 20 100.1 AAA M
# 5 50 500.5 EEE M
SQL
SELECT *
FROM df
ORDER BY X ASC;
SELECT *
FROM df
ORDER BY X DESC;
SELECT *
FROM df
ORDER BY S ASC, X ASC;
重複行の削除
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df
x = rep(1:3, 1:3)*10
y = 1:6*100+1:6/10
z = c("AAA","BBB","CCC","DDD","EEE", "FFF")
df <- data.frame(X = x, Y = y, Z = z)
df
# X Y Z
# 1 10 100.1 AAA
# 2 20 200.2 BBB
# 3 20 300.3 CCC
# 4 30 400.4 DDD
# 5 30 500.5 EEE
# 6 30 600.6 FFF
distinct(df, X)
distinct(df, X, .keep_all = FALSE)
df %>% distinct(X)
df %>% distinct(X, .keep_all = FALSE)
# X
# 1 10
# 2 20
# 3 30
distinct(df, X, .keep_all = TRUE)
df %>% distinct(X, .keep_all = TRUE)
# X Y Z
# 1 10 100.1 AAA
# 2 20 200.2 BBB
# 3 30 400.4 DDD
df %>% arrange(X, desc(Y))
# X Y Z
# 1 10 100.1 AAA
# 2 20 300.3 CCC
# 3 20 200.2 BBB
# 4 30 600.6 FFF
# 5 30 500.5 EEE
# 6 30 400.4 DDD
df %>% arrange(X, desc(Y)) %>% distinct(X, .keep_all = TRUE)
# X Y Z
# 1 10 100.1 AAA
# 2 20 300.3 CCC
# 3 30 600.6 FFF
データフレームの集計
基本統計量
基本統計量(要約統計量)の表示です。
Python
データフレームのdescribe()
メソッドで基本統計量を表示できます。
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df
print(df)
# X Y Z
# 0 20 100.1 AAA
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD
# 4 50 500.5 EEE
df.describe()
print(df.describe())
# X Y
# count 5.000000 5.000000
# mean 30.000000 300.300000
# std 15.811388 158.271997
# min 10.000000 100.100000
# 25% 20.000000 200.200000
# 50% 30.000000 300.300000
# 75% 40.000000 400.400000
# max 50.000000 500.500000
df.describe().T
print(df.describe().T)
# count mean std min 25% 50% 75% max
# X 5.0 30.0 15.811388 10.0 20.0 30.0 40.0 50.0
# Y 5.0 300.3 158.271997 100.1 200.2 300.3 400.4 500.5
type(df.describe())
# pandas.core.frame.DataFrame
結果は、8行(今の場合は8×2)のデータフレームが返ります。
df.describe().T
は転置して(2×8の形で)表示しているだけです。
R
dplyr
の機能ではありませんが、summary()
でデータフレームのサマリーを表示できます。
df <- data.frame(X=x, Y=y, Z=z)
df
# X Y Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE
summary(df)
df %>% summary()
# X Y Z
# Min. :10 Min. :100.1 AAA:1
# 1st Qu.:20 1st Qu.:200.2 BBB:1
# Median :30 Median :300.3 CCC:1
# Mean :30 Mean :300.3 DDD:1
# 3rd Qu.:40 3rd Qu.:400.4 EEE:1
# Max. :50 Max. :500.5
SQL
集計
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.count()
print(df.count())
# X 5
# Y 5
# Z 5
# S 5
# dtype: int64
type(df.count())
# pandas.core.series.Series
df.sum()
print(df.sum())
# X 150
# Y 1501.5
# Z AAABBBCCCDDDEEE
# S MFMFM
# dtype: object
df[['X','Y']].sum()
print(df[['X','Y']].sum())
# X 150.0
# Y 1501.5
# dtype: float64
df.mean()
print(df.mean())
# X 30.0
# Y 300.3
# dtype: float64
df.cumsum()
print(df.cumsum())
# X Y Z S
# 0 20 100.1 AAA M
# 1 60 300.3 AAABBB MF
# 2 70 600.6 AAABBBCCC MFM
# 3 100 1001.0 AAABBBCCCDDD MFMF
# 4 150 1501.5 AAABBBCCCDDDEEE MFMFM
df.cov()
print(df.cov())
# X Y
# X 250.00 1251.250
# Y 1251.25 25050.025
df.corr()
print(df.corr())
# X Y
# X 1.0 0.5
# Y 0.5 1.0
R
df <- data.frame(X=x, Y=y, Z=z)
df
count(df)
df %>% count()
# n
# 1 5
summarise(df, sumX = sum(X), sumY = sum(Y), meanX = mean(X), meanY = mean(Y))
df %>% summarise(sumX = sum(X), sumY = sum(Y), meanX = mean(X), meanY = mean(Y))
# sumX sumY meanX meanY
# 1 150 1501.5 30 300.3
summarise(df, n(), min(X), max(X))
df %>% summarise(n(), min(X), max(X), median(X))
# n() min(X) max(X) median(X)
# 1 5 10 50 30
summarise(df, mean(X), sd(X), IQR(X), mad(X))
df %>% summarise(mean(X), sd(X), IQR(X), mad(X))
# mean(X) sd(X) IQR(X) mad(X)
# 1 30 15.81139 20 14.826
summarise(df, quantile(X), quantile(Y))
df %>% summarise(quantile(X), quantile(Y))
# quantile(X) quantile(Y)
# 1 10 100.1
# 2 20 200.2
# 3 30 300.3
# 4 40 400.4
# 5 50 500.5
SQL
SELECT count(*), sum(X), sum(Y), mean(X), mean(Y)
FROM df;
グループ化
グループ化して集計についてです。
Python
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
s = ['M','F','M','F','M']
df['S'] = s
df
print(df)
# X Y Z S
# 0 20 100.1 AAA M
# 1 40 200.2 BBB F
# 2 10 300.3 CCC M
# 3 30 400.4 DDD F
# 4 50 500.5 EEE M
# グループ化
df.groupby('S')
df.groupby('S').sum()
print(df.groupby('S').sum())
# X Y
# S
# F 70 600.6
# M 80 900.9
df.groupby('S').mean()
print(df.groupby('S').mean())
# X Y
# S
# F 35.000000 300.3
# M 26.666667 300.3
df.groupby(['S','Z'])
df.groupby(['S','Z']).sum()
print(df.groupby(['S','Z']).sum())
# X Y
# S Z
# F BBB 40 200.2
# DDD 30 400.4
# M AAA 20 100.1
# CCC 10 300.3
# EEE 50 500.5
df.groupby(['S','Z']).mean()
print(df.groupby(['S','Z']).mean())
# X Y
# S Z
# F BBB 40 200.2
# DDD 30 400.4
# M AAA 20 100.1
# CCC 10 300.3
# EEE 50 500.5
R
s <- rep(c("M","F"),length=5)
df <- data.frame(X=x, Y=y, Z=z)
df <- df %>% mutate(S = s)
df
# X Y Z S
# 1 20 100.1 AAA M
# 2 40 200.2 BBB F
# 3 10 300.3 CCC M
# 4 30 400.4 DDD F
# 5 50 500.5 EEE M
df %>% group_by(S) %>% summarise(sumX = sum(X), sumY = sum(Y))
# S sumX sumY
# <chr> <dbl> <dbl>
# 1 F 70 601.
# 2 M 80 901.
df %>% group_by(S, Z) %>%
summarise(sumX = sum(X), sumY = sum(Y))
# S Z sumX sumY
# <chr> <fct> <dbl> <dbl>
# 1 F BBB 40 200.
# 2 F DDD 30 400.
# 3 M AAA 20 100.
# 4 M CCC 10 300.
# 5 M EEE 50 500.
SQL
SELECT sum(X), sum(Y), mean(X), mean(Y)
FROM df
GROUP BY S;
SELECT sum(X), sum(Y), mean(X), mean(Y)
FROM df
GROUP BY S, Z;
データフレームの結合
横に並べる
Python
k1 = np.arange(1,5+1)
k2 = np.array([1,3,5,1,3])
x1 = k1*10
x2 = k1*100
x3 = k1*1000
k1, k2, x1, x2, x3
# (array([1, 2, 3, 4, 5]),
# array([1, 3, 5, 1, 3]),
# array([10, 20, 30, 40, 50]),
# array([100, 200, 300, 400, 500]),
# array([1000, 2000, 3000, 4000, 5000]))
dfk1x1 = pd.DataFrame({'k1':k1, 'x1':x1})
dfk1x1
print(dfk1x1)
# k1 x1
# 0 1 10
# 1 2 20
# 2 3 30
# 3 4 40
# 4 5 50
dfk2x2 = pd.DataFrame({'k2':k2, 'x2':x2})
dfk2x2
print(dfk2x2)
# k2 x2
# 0 1 100
# 1 3 200
# 2 5 300
# 3 1 400
# 4 3 500
# 横に並べる
pd.concat([dfk1x1, dfk2x2], axis=1)
print(pd.concat([dfk1x1, dfk2x2], axis=1))
# k1 x1 k2 x2
# 0 1 10 1 100
# 1 2 20 3 200
# 2 3 30 5 300
# 3 4 40 1 400
# 4 5 50 3 500
R
k1 <- 1:5
k2 <- c(1,3,5,1,3)
x1 <- k1*10
x2 <- k1*100
x3 <- k1*1000
k1
# [1] 1 2 3 4 5
k2
# [1] 1 3 5 1 3
x1
# [1] 10 20 30 40 50
x2
# [1] 100 200 300 400 500
x3
# [1] 1000 2000 3000 4000 5000
dfk1x1 <- data.frame(k1, x1)
dfk1x1
# k1 x1
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
dfk2x2 <- data.frame(k2, x2)
dfk2x2
# k2 x2
# 1 1 100
# 2 3 200
# 3 5 300
# 4 1 400
# 5 3 500
# 横に並べる
bind_cols(dfk1x1, dfk2x2)
# k1 x1 k2 x2
# 1 1 10 1 100
# 2 2 20 3 200
# 3 3 30 5 300
# 4 4 40 1 400
# 5 5 50 3 500
SQL
縦に並べる
データフレームを縦に並べます。SQLのUNIONの操作です。
Python
dfk2x2_2 = pd.DataFrame({'k1':k2, 'x1':x2})
dfk2x2_2
print(dfk2x2_2)
# k1 x1
# 0 1 100
# 1 3 200
# 2 5 300
# 3 1 400
# 4 3 500
# 縦に並べる
pd.concat([dfk1x1, dfk2x2_2], axis=0)
print(pd.concat([dfk1x1, dfk2x2_2], axis=0))
pd.concat([dfk1x1, dfk2x2_2])
print(pd.concat([dfk1x1, dfk2x2_2]))
dfk1x1.append(dfk2x2_2)
print(dfk1x1.append(dfk2x2_2))
# k1 x1
# 0 1 10
# 1 2 20
# 2 3 30
# 3 4 40
# 4 5 50
# 0 1 100
# 1 3 200
# 2 5 300
# 3 1 400
# 4 3 500
pd.concat([dfk1x1, dfk2x2_2], ignore_index=True)
print(pd.concat([dfk1x1, dfk2x2_2], ignore_index=True))
dfk1x1.append(dfk2x2_2, ignore_index=True)
print(dfk1x1.append(dfk2x2_2, ignore_index=True))
# k1 x1
# 0 1 10
# 1 2 20
# 2 3 30
# 3 4 40
# 4 5 50
# 5 1 100
# 6 3 200
# 7 5 300
# 8 1 400
# 9 3 500
横に並べるのに使ったconcat()
をaxis=0
(デフォルト)として使える他に、append()
が使えます。ignore_index=True
とするとインデックスがふり直されます。列名は一致させておく必要があります。
R
dfk2x2_2 <- data.frame(k1 = k2, x1 = x2)
dfk2x2_2
# k1 x1
# 1 1 100
# 2 3 200
# 3 5 300
# 4 1 400
# 5 3 500
# 縦に並べる
bind_rows(dfk1x1, dfk2x2_2)
# k1 x1
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 1 100
# 7 3 200
# 8 5 300
# 9 1 400
# 10 3 500
union_all(x = dfk1x1, y = dfk2x2_2)
dfk1x1 %>% union_all(y = dfk2x2_2)
# k1 x1
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 1 100
# 7 3 200
# 8 5 300
# 9 1 400
# 10 3 500
union_all(x = dfk1x1, y = dfk1x1) # 重複行を削除せずUNION
dfk1x1 %>% union_all(y = dfk1x1)
# k1 x1
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
# 6 1 10
# 7 2 20
# 8 3 30
# 9 4 40
# 10 5 50
union(x = dfk1x1, y = dfk1x1) # 重複行を削除してUNION
dfk1x1 %>% union(y = dfk1x1)
# k1 x1
# 1 1 10
# 2 2 20
# 3 3 30
# 4 4 40
# 5 5 50
SQL
SELECT * FROM dfk1x1
UNION ALL
SELECT * FROM dfk2x2
;
SELECT * FROM dfk1x1
UNION
SELECT * FROM dfk2x2
;
キーで結合
SQLのジョインの操作です。
1つのキーで結合
Python
dfk1x2 = pd.DataFrame({'k':k1, 'x2':x2}).drop(2-1)
dfk1x2
print(dfk1x2)
# k x2
# 0 1 100
# 2 3 300
# 3 4 400
# 4 5 500
dfk1x3 = pd.DataFrame({'k':k1, 'x3':x3}).drop(3-1)
dfk1x3
print(dfk1x3)
# k x3
# 0 1 1000
# 1 2 2000
# 3 4 4000
# 4 5 5000
# 内部結合(INNER JOIN)
pd.merge(dfk1x2, dfk1x3, on='k')
print(pd.merge(dfk1x2, dfk1x3, on='k'))
pd.merge(dfk1x2, dfk1x3, on='k', how='inner')
print(pd.merge(dfk1x2, dfk1x3, on='k', how='inner'))
# k x2 x3
# 0 1 100 1000
# 1 4 400 4000
# 2 5 500 5000
# 左外部結合(LEFT OUTER JOIN)
pd.merge(dfk1x2, dfk1x3, on='k', how='left')
print(pd.merge(dfk1x2, dfk1x3, on='k', how='left'))
# k x2 x3
# 0 1 100 1000.0
# 1 3 300 NaN
# 2 4 400 4000.0
# 3 5 500 5000.0
# 右外部結合(RIGHT OUTER JOIN)
pd.merge(dfk1x2, dfk1x3, on='k', how='right')
print(pd.merge(dfk1x2, dfk1x3, on='k', how='right'))
# k x2 x3
# 0 1 100.0 1000
# 1 4 400.0 4000
# 2 5 500.0 5000
# 3 2 NaN 2000
# 完全外部結合(FULL OUTER JOIN)
pd.merge(dfk1x2, dfk1x3, on='k', how='outer')
print(pd.merge(dfk1x2, dfk1x3, on='k', how='outer'))
# k x2 x3
# 0 1 100.0 1000.0
# 1 3 300.0 NaN
# 2 4 400.0 4000.0
# 3 5 500.0 5000.0
# 4 2 NaN 2000.0
R
dfk1x2 <- data.frame(k1, x2)[-2,]
dfk1x2
# k1 x2
# 1 1 100
# 3 3 300
# 4 4 400
# 5 5 500
dfk1x3 <- data.frame(k1, x3)[-3,]
dfk1x3
# k1 x3
# 1 1 1000
# 2 2 2000
# 4 4 4000
# 5 5 5000
# 内部結合(INNER JOIN)
inner_join(x = dfk1x2, y = dfk1x3, by = "k1")
dfk1x2 %>% inner_join(y = dfk1x3, by = "k1")
# k1 x2 x3
# 1 1 100 1000
# 2 4 400 4000
# 3 5 500 5000
# 左外部結合(LEFT OUTER JOIN)
left_join(x = dfk1x2, y = dfk1x3, by ="k1")
dfk1x2 %>% left_join(y = dfk1x3, by ="k1")
# k1 x2 x3
# 1 1 100 1000
# 2 3 300 NA
# 3 4 400 4000
# 4 5 500 5000
# 右外部結合(RIGHT OUTER JOIN)
right_join(x = dfk1x2, y = dfk1x3, by = "k1")
dfk1x2 %>% right_join(y = dfk1x3, by = "k1")
# k1 x2 x3
# 1 1 100 1000
# 2 4 400 4000
# 3 5 500 5000
# 4 2 NA 2000
right_join(x = dfk1x2, y = dfk1x3, by = "k1") %>% arrange(k1)
dfk1x2 %>% right_join(y = dfk1x3, by = "k1") %>% arrange(k1)
# k1 x2 x3
# 1 1 100 1000
# 2 2 NA 2000
# 3 4 400 4000
# 4 5 500 5000
# 完全外部結合(FULL OUTER JOIN)
full_join(x = dfk1x2, y = dfk1x3, by = "k1")
dfk1x2 %>% full_join(y = dfk1x3, by = "k1")
# # k1 x2 x3
# 1 1 100 1000
# 2 3 300 NA
# 3 4 400 4000
# 4 5 500 5000
# 5 2 NA 2000
full_join(x = dfk1x2, y = dfk1x3, by = "k1") %>% arrange(k1)
dfk1x2 %>% full_join(x = dfk1x2, y = dfk1x3, by = "k1") %>% arrange(k1)
# k1 x2 x3
# 1 1 100 1000
# 2 2 NA 2000
# 3 3 300 NA
# 4 4 400 4000
# 5 5 500 5000
SQL
SELECT *
FROM dfk1x2 INNER JOIN dfk1x3 ON dfk1x2.k = dfk1x3.k
;
SELECT *
FROM dfk1x2 LEFT JOIN dfk1x3 ON dfk1x2.k = dfk1x3.k
;
SELECT *
FROM dfk1x2 RIGHT JOIN dfk1x3 ON dfk1x2.k = dfk1x3.k
;
SELECT *
FROM dfk1x2 FULL JOIN dfk1x3 ON dfk1x2.k = dfk1x3.k
;
名前の異なるキーで結合
Python
# 名前の異なるキーで結合
pd.merge(dfk1x1, dfk2x2, left_on='k1', right_on='k2', how='left')
print(pd.merge(dfk1x1, dfk2x2, left_on='k1', right_on='k2', how='left'))
# k1 x1 k2 x2
# 0 1 10 1.0 100.0
# 1 1 10 1.0 400.0
# 2 2 20 NaN NaN
# 3 3 30 3.0 200.0
# 4 3 30 3.0 500.0
# 5 4 40 NaN NaN
# 6 5 50 5.0 300.0
R
# 名前の異なるキーで結合
left_join(x = dfk1x1, y = dfk2x2, by = c("k1" = "k2"))
dfk1x1 %>% left_join(y = dfk2x2, by = c("k1" = "k2"))
# k1 x1 x2
# 1 1 10 100
# 2 1 10 400
# 3 2 20 NA
# 4 3 30 200
# 5 3 30 500
# 6 4 40 NA
# 7 5 50 300
SQL
SELECT *
FROM dfk1x1 LEFT JOIN dfk2x2 ON dfk1x1.k1 = dfk2x2.k2
;
複数のキーで結合
Python
# 複数のキーで結合
dfk1k2x2 = pd.DataFrame({'k1':k1, 'k2':k2, 'x2':x2}).drop(2-1)
dfk1k2x2
print(dfk1k2x2)
# k1 k2 x2
# 0 1 1 100
# 2 3 5 300
# 3 4 1 400
# 4 5 3 500
dfk1k2x3 = pd.DataFrame({'k1':k1, 'k2':k2, 'x3':x3}).drop(3-1)
dfk1k2x3
print(dfk1k2x3)
# k1 k2 x3
# 0 1 1 1000
# 1 2 3 2000
# 3 4 1 4000
# 4 5 3 5000
pd.merge(dfk1k2x2, dfk1k2x3, on=['k1','k2'], how='left')
print(pd.merge(dfk1k2x2, dfk1k2x3, on=['k1','k2'], how='left'))
# k1 k2 x2 x3
# 0 1 1 100 1000.0
# 1 3 5 300 NaN
# 2 4 1 400 4000.0
# 3 5 3 500 5000.0
dfk1k2x2.merge(dfk1k2x3, on=['k1','k2'], how='left')
print(dfk1k2x2.merge(dfk1k2x3, on=['k1','k2'], how='left'))
# k1 k2 x2 x3
# 0 1 1 100 1000.0
# 1 3 5 300 NaN
# 2 4 1 400 4000.0
# 3 5 3 500 5000.0
最後の例ように、pd.merge(dfx, dfy, ...)
をdfx.merge(dfy, ...)
の形にも書くこともできます。他の上の例も同様です。
R
# 複数のキーで結合
dfk1k2x2 <- data.frame(k1, k2, x2)[-2,]
dfk1k2x2
# k1 k2 x2
# 1 1 1 100
# 3 3 5 300
# 4 4 1 400
# 5 5 3 500
dfk1k2x3 <- data.frame(k1, k2, x3)[-3,]
dfk1k2x3
# k1 k2 x3
# 1 1 1 1000
# 2 2 3 2000
# 4 4 1 4000
# 5 5 3 5000
left_join(x = dfk1k2x2, y = dfk1k2x3, by = c("k1","k2"))
dfk1k2x2 %>% left_join(y = dfk1k2x3, by = c("k1","k2"))
# k1 k2 x2 x3
# 1 1 1 100 1000
# 2 3 5 300 NA
# 3 4 1 400 4000
# 4 5 3 500 5000
SQL
SELECT *
FROM dfk1k2x2 LEFT JOIN dfk1k2x3
ON dfk1k2x2.k1 = dfk1k2x3.k1 AND
dfk1k2x2.k2 = dfk1k2x3.k2
;
dplyr
について詳しくは、こちらの記事参照
dplyr — 高速data.frame処理
まとめ
一覧
各言語で使用する関数等を一覧にまとめます。(比較のために、SQLも示しました。)
Python(pandas) | R | RDB(Access) | Excel | 行列 | |
---|---|---|---|---|---|
オブジェクト | DataFrame | data.frame | テーブル | テーブル | 行列(matrix) |
行 | index(axis=0) | 行(row) | レコード | 行(ROW) | 行(row) |
列 | columns(axis=1) | 列(col) | フィールド | 列(COLUMN) | 列(column) |
要素 | values | 値 | CELL | 要素 | |
列の正体 | pandasのSeries | ベクトル | 同じデータ型の値 |
データフレームdf
||X|Y|Z|
|--:|:-:|:-:|:-:|:-:|
|1|20|100.1|AAA|
|2|40|200.2|BBB|
|3|10|300.3|CCC|
|4|30|400.4|DDD|
|5|50|500.5|EEE|
データフレームへのアクセス
Python(pandas) | R(標準) | R(dplyr) | 結果 | |
---|---|---|---|---|
列ベクトル |
df.X df['X']
|
df$X df[["X"]] df[[1]]
|
pull(df, X) pull(df, 1)
|
ベクトル (20,40, 10,30,50) |
列指定 |
df[['X']] df.loc[:,['X']]
|
df["X"] df[1]
|
select(df, X) select(df, 1)
|
データフレーム |
複数列指定 |
df[['X','Z']] df.loc[:,['X','Z']] df.iloc[:, [1-1,3-1]]
|
df[c("X","Z")] df[c(1,3)]
|
select(df, X, Z) select(df, 1, 3)
|
データフレーム |
df[['Z','Y','X']] df.loc[:, ['Z','Y','X']] df.iloc[:, [3-1,2-1,1-1]]
|
df[c("Z","X","Y")] df[c(3,1,2)]
|
select(df, Z, Y, X) select(df, 3, 2, 1)
|
データフレーム | |
先頭n行 | df.head(n) |
head(df, n) |
slice_head(df, n = n) |
データフレーム |
末尾n行 | df.tail(n) |
tail(df, n) |
slice_tail(df, n = 2) |
データフレーム |
要素 |
df.iat[1-1,1-1] df.at[1-1,'X'] df.iloc[1-1,1-1] df.loc[1-1,'X'] df['X'][1-1] df.X[1-1]
|
df[1,1] df[1,"X"] df[[1,1]] df[[1,"X"]] df$X[1] df[["X"]][1]
|
pull(df, 1)[1] pull(df, X)[1]
|
値 20 |
複数行 複数列 指定 |
df.iloc[ [1-1,3-1],[1-1,3-1]] df.loc[ [1-1,3-1],['X','Z']]
|
df[c(1,3),c(1,3)] df[c(1,3),c("X","Z")] df[c(T,F,T,F,F), c(T,F,T)]
|
slice(select(df, 1, 3), 1, 3) slice(select(df, X, Z), 1, 3) slice(select(df, c(1, 3)), c(1, 3)) slice(select(df, c(X, Z)), c(1, 3))
|
データフレーム |
複数列指定 |
df.iloc[:,[1-1,3-1]] df.loc[:,['X','Z']] df[['X','Z']]
|
df[,c(1,3)] df[,c("X","Z")] df[,c(T,F,T)]
|
select(df, 1, 3) select(df, X, Z) select(df, c(1, 3)) select(df, c(X, Z)) select(df, c("X", "Z"))
|
データフレーム |
複数行指定 |
df.iloc[[1-1,3-1],:] df.loc[[1-1,3-1],:] df.iloc[[1-1,3-1]] df.loc[[1-1,3-1]]
|
df[c(1,3),] df[c(T,F,T,F,F),]
|
slice(df, 1, 3) slice(df, c(1, 3))
|
データフレーム |
データフレームの操作
Python(pandas) | R(標準) | R(dplyr) | SQL | |
---|---|---|---|---|
列の選択 | df[['Z','Y','X']] |
df[c("Z","X","Y")] |
select(df, Z, X, Y)
|
SELECT Z, X, Y FROM df;
|
列の追加 | df['X3'] = df['X'] * 3 |
df$X3 <- df$X * 3 |
mutate(df, X3 = X*3)
|
SELECT X, Y, Z, X * 3 AS X3 FROM df;
|
列名変更 |
df.rename( columns={'X':'xx', 'Y':'yy', 'Z':'zz'})
|
names(df) <- c("xx", "yy", "zz")
|
rename(df, xx=X, yy=Y, zz=Z)
|
SELECT X AS xx, Y AS yy, Z AS zzz FROM df;
|
行の条件抽出 フィルタ |
df['X'] < 30 df.query( 'X < 30')
|
df[df$X < 30,] |
filter(df, X < 30)
|
SELECT * FROM df WHERE X < 30;
|
df[(df['X'] > 10) & (df['X'] < 50)] df.query( '10 < X < 50')
|
df[df$X > 10 & df$X < 50,]
|
filter(df, X > 10 & X < 50)
|
SELECT * FROM df WHERE X > 10 AND X < 50;
|
|
行のソート |
df.sort_values( by='X')
|
df[order(df$X), ] |
arrange(df, X)
|
SELECT * FROM df ORDER BY X ASC;
|
行のソート |
df.sort_values( by='X', ascending=False)
|
df[order(df$X, decreasing=T),]
|
arrange(df, desc(X))
|
SELECT * FROM df ORDER BY X DESC;
|
行のソート |
df.sort_values( by=['S','X'])
|
df[order( df$S, df$X), ]
|
arrange(df, S, X)
|
SELECT * FROM df ORDER BY S ASC, X ASC;
|
重複行の削除 |
distinct(df, X, .keep_all=T)
|
データフレームの集計
Python(pandas) | R(標準) | R(dplyr) | SQL | |
---|---|---|---|---|
基本統計量 |
df.describe() df.describe().T
|
summary(df) |
||
グループ化 | df.groupby('S') |
group_by(df, S) |
GROUP BY S; |
|
グループ化 | df.groupby(['S','Z']) |
group_by(df, S, Z)
|
GROUP BY S, Z; |
データフレームの結合
Python(pandas) | R(標準) | R(dplyr) | SQL | |
---|---|---|---|---|
横に並べる |
pd.concat([df1, df2], axis=1)
|
cbind(df1, df2) |
bind_cols( df1, df2)
|
|
縦に並べる |
pd.concat([df1, df2], axis=0) pd.concat([df1, df2], ignore_index=True) df1.append(df2, ignore_index=True)
|
rbind(df1, df2) |
bind_rows( df1, df2) union_all( x = df1, y = df2)
|
UNION ALL |
内部結合 (INNER JOIN) |
pd.merge(df1, df2, on='k') pd.merge(df1, df2, on='k', how='inner')
|
merge(df1, df2) merge(df1, df2, by="k")
|
inner_join( x = df1, y = df2, by = "k")
|
FROM df1 INNER JOIN df2 ON df1.k = df2.k
|
左外部結合 (LEFT OUTER JOIN) |
pd.merge(df1, df2, on='k', how='left')
|
merge(df1, df2, by="k", all.x=T)
|
left_join( x = df1, y = df2, by = "k")
|
FROM df1 LEFT JOIN df2 ON df1.k = df2.k
|
右外部結合 (RIGHT OUTER JOIN) |
pd.merge(df1, df2, on='k', how='right')
|
merge(df1, df2, by="k", all.y=T)
|
right_join( x = df1, y = df2, by = "k")
|
FROM df1 RIGHT JOIN df2 ON df1.k = df2.k
|
完全外部結合 (FULL OUTER JOIN) |
pd.merge(df1, df2, on='k', how='outer')
|
merge(df1, df2, by="k", all=T)
|
full_join( x = df1, y = df2, by = "k")
|
FROM df1 FULL JOIN df2 ON df1.k = df2.k
|
名前の異なる キーで結合 |
pd.merge(df1, df2, left_on='k1', right_on='k2', how='left')
|
merge(df1, df2, by.x="k1", by.y="k2", all.x=T)
|
left_join( x = df1, y = df2, by = c("k1"="k2"))
|
FROM df1 LEFT JOIN df2 ON df1.k1 = df2.k2
|
複数のキーで結合 |
pd.merge(df1, df2, on=['k1','k2'], how='left')
|
merge(df1, df2, by=c("k1","k2"), all.x=T)
|
left_join( x = df1, y = df2, by = c("k1","k2"))
|
FROM df2 LEFT JOIN df2 ON df1.k1 = df2.k1 AND df1.k2 = df2.k2
|
プログラム全体
参考までに使ったプログラムの全体を示します。
(今回は長くなったので省略します。)
Python
R
VBA
参考
- Python
- R