LoginSignup
0
3

More than 1 year has passed since last update.

VBAユーザがPython・Rを使ってみた:データフレーム(続)

Last updated at Posted at 2021-02-23

はじめに

機械学習の勉強を始めたVBAユーザです。
備忘録としてPython・Rの文法をVBAと比較しながらまとめています。

以前の記事でデータフレームについて書きましたが、今回はその続きで、Rのdplyr というパッケージについて書きます。
前回と同じ項目について、PythonのpandasとRのdplyrを比較したいと思います。

目次

データフレーム

今回もデータフレームについてです。PythonのpandasとRのdplyrを比較します。

データフレームは、このような形をしたデータで、縦方向に同じデータ型のデータが同じ数だけ並んでいるものです。つまり、列ベクトルを横に並べたようなものです。(ベクトルについてはこちら参照。)

X Y Z
1 20 100.1 AAA
2 40 200.2 BBB
3 10 300.3 CCC
4 30 400.4 DDD
5 50 500.5 EEE

縦方向を「行」、横方向を「列」といいます。

Python(pandas) R RDB(Access) Excel 行列
オブジェクト DataFrame data.frame テーブル テーブル 行列(matrix)
index(axis=0) 行(row) レコード 行(ROW) 行(row)
columns(axis=1) 列(col) フィールド 列(COLUMN) 列(column)
要素 values CELL 要素

データフレームへのアクセス

指定した列へのアクセス

データフレームから列を取り出します。

Python

Python3
import numpy as np
import pandas as pd

x = np.array([2,4,1,3,5])*10
y = np.arange(1,5+1)*100+np.arange(1,5+1)/10
z = ['AAA','BBB','CCC','DDD','EEE']
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df
#    X     Y   Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE

df.X
df['X']
print(df['X'])
# 0    20
# 1    40
# 2    10
# 3    30
# 4    50
# Name: X, dtype: int32
type(df['X'])
# pandas.core.series.Series

df[['X']]
print(df[['X']])
df.loc[:,['X']]
print(df.loc[:,['X']])
#     X
# 0  20
# 1  40
# 2  10
# 3  30
# 4  50
type(df[['X']])
# pandas.core.frame.DataFrame

df[['X','Z']]
print(df[['X','Z']])
df.loc[:, ['X','Z']]
print(df.loc[:, ['X','Z']])
df.iloc[:, [1-1,3-1]]
print(df.iloc[:, [1-1,3-1]])
#     X    Z
# 0  20  AAA
# 1  40  BBB
# 2  10  CCC
# 3  30  DDD
# 4  50  EEE

df[['Z','Y','X']]
print(df[['Z','Y','X']])
df.loc[:,['Z','Y','X']]
print(df.loc[:,['Z','Y','X']])
df.iloc[:,[3-1,2-1,1-1]]
print(df.iloc[:,[3-1,2-1,1-1]])
var = ['Z','Y','X']
df.loc[:,var]
print(df.loc[:,var])
#      Z      Y   X
# 0  AAA  100.1  20
# 1  BBB  200.2  40
# 2  CCC  300.3  10
# 3  DDD  400.4  30
# 4  EEE  500.5  50

pandasのDataFrameはSeries(1次元)を横に並べたものです。
df.X,df['X']は、そのDataFrameのもとになっているSeriesを返します。
df[['X']]は、df[['X','Z']]df[['Z','Y','X']]と同じく、DataFrameから指定した列だけ抜き出したもの(その列数がたまたま1個だっただけ)で、DataFrameのsubset(一部)のDataFrameが返ります。

R

R
library(dplyr)

x = c(2,4,1,3,5)*10
y = 1:5*100+1:5/10
z = c("AAA","BBB","CCC","DDD","EEE")
df <- data.frame(X=x, Y=y, Z=z)
df
#    X     Y   Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE

pull(df, X)
pull(df, 1)
df %>% pull(X)
df %>% pull(1)
# [1] 20 40 10 30 50
class(pull(df, X))
# [1] "numeric"

pull(df, -2) # 右から2列目
df %>% pull(-2)
# [1] 100.1 200.2 300.3 400.4 500.5

select(df, X)
df %>% select(X)
#    X
# 1 20
# 2 40
# 3 10
# 4 30
# 5 50
class(select(df, X))
# [1] "data.frame"

select(df, X, Z)
df %>% select(X, Z)
#    X   Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE

select(df, -Y) # Y以外
df %>% select(-Y)
#    X   Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE

select(df, Z, Y, X)
df %>% select(Z, Y, X)
#     Z     Y  X
# 1 AAA 100.1 20
# 2 BBB 200.2 40
# 3 CCC 300.3 10
# 4 DDD 400.4 30
# 5 EEE 500.5 50

select(df, X:Z)
df %>% select(X:Z)
#    X     Y   Z
1 20 100.1 AAA
2 40 200.2 BBB
3 10 300.3 CCC
4 30 400.4 DDD
5 50 500.5 EEE

select(df, Z:X)
df %>% select(Z:X)
#     Z     Y  X
# 1 AAA 100.1 20
# 2 BBB 200.2 40
# 3 CCC 300.3 10
# 4 DDD 400.4 30
# 5 EEE 500.5 50

var <- c("Z", "Y", "X")
select(df, !!var)
df %>% select(!!var)
#     Z     Y  X
# 1 AAA 100.1 20
# 2 BBB 200.2 40
# 3 CCC 300.3 10
# 4 DDD 400.4 30
# 5 EEE 500.5 50

pull(df, X)はベクトルを返します。一方、select(df, X)は1列だけのデータフレームを返します。

pull(df, X)と書くのとdf %>% pull(X)と書くのは等価です。
%>%はパイプという機能で、左のdfを右の関数pull()の第一引数に渡しています。これを使うことで、関数を何重にも重ねる代わりに、右に連結していけます(関数の連結)。dplyrの関数は第一引数にデータフレームを取るように統一されているため、このパイプ機能が相性良く使えます。
df %>% 関数1() %>% 関数2() %>% 関数3() %>% ...のような形で。
なお、このパイプ演算子%>%は、RStudioでは Shift + Ctrl + M で入力できます。

最後の例の!!varは、変数varに入っている文字列を列名として使っています。

指定した行へのアクセス

先頭の方の行と末尾の方の行を取り出します。

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.head(3)
print(df.head(3))
#     X      Y    Z
# 0  20  100.1  AAA
# 1  40  200.2  BBB
# 2  10  300.3  CCC
df.tail(2)
print(df.tail(2))
#     X      Y    Z
# 3  30  400.4  DDD
# 4  50  500.5  EEE

R

R
df <- data.frame(X=x, Y=y, Z=z)
df
slice_head(df, n = 3)
df %>% slice_head(n = 3)
#    X     Y   Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC

slice_tail(df, n = 2)
df %>% slice_tail(n = 2)
#    X     Y   Z
# 1 30 400.4 DDD
# 2 50 500.5 EEE

slice(df, 1, 3)
df %>% slice(1, 3)
#    X     Y   Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC

slice(df, c(1, 3))
df %>% slice(c(1, 3))
#    X     Y   Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC

slice(df, 2:4) # 2-4列
df %>% slice(2:4)
#    X     Y   Z
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD

slice(df, -(2:4))
df %>% slice(-(2:4)) # 2-4列以外
#    X     Y   Z
# 1 20 100.1 AAA
# 2 50 500.5 EEE 

slice(df, -c(1,5)) # 1,5列以外
df %>% slice(-c(1,5))
#    X     Y   Z
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD

slice(df, -2, -(4:5))
slice(df, -c(2, 4:5))
df %>% slice(-2, -(4:5))
df %>% slice(-c(2, 4:5))
#    X     Y   Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC

指定した行と列の選択

データフレームから行・列を指定して取り出します。

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.iat[1-1,1-1]
df.at[1-1,'X']
df.iloc[1-1,1-1]
df.loc[1-1,'X']
df['X'][1-1]
df.X[1-1]
# 20

df.iloc[[1-1,3-1],[1-1,3-1]]
print(df.iloc[[1-1,3-1],[1-1,3-1]])
df.loc[[1-1,3-1],['X','Z']]
print(df.loc[[1-1,3-1],['X','Z']])
#     X    Z
# 0  20  AAA
# 2  10  CCC

df.iloc[:,[1-1,3-1]]
print(df.iloc[:,[1-1,3-1]])
df.loc[:,['X','Z']]
print(df.loc[:,['X','Z']])
df[['X','Z']]
print(df[['X','Z']])
#     X    Z
# 0  20  AAA
# 1  40  BBB
# 2  10  CCC
# 3  30  DDD
# 4  50  EEE

df.iloc[[1-1,3-1],:]
# print(df.iloc[[1-1,3-1],:])
df.loc[[1-1,3-1],:]
# print(df.loc[[1-1,3-1],:])
df.iloc[[1-1,3-1]]
# print(df.iloc[[1-1,3-1]])
df.loc[[1-1,3-1]]
# print(df.loc[[1-1,3-1]])
#     X      Y    Z
# 0  20  100.1  AAA
# 2  10  300.3  CCC

locilocも選択する場所(location)を指定。locilocの違いは、locが label-based(名前で指定)、ilocが integer-location based(番号で指定)。

R

R
pull(df, 1)[1]
pull(df, X)[1]
df %>% pull(1) %>% `[`(1)
df %>% pull(X) %>% `[`(1)
df %>% `[`(1,1)
df[1,1]
# [1] 20

# なお、ベクトルxに対して x %>% `[`(1) は x[1] と同等
x %>% `[`(1)
x[1]
# [1] 20

select(slice(df, 1, 3), 1, 3)
select(slice(df, 1, 3), X, Z)
slice(select(df, 1, 3), 1, 3)
slice(select(df, X, Z), 1, 3)
slice(select(df, c(1, 3)), c(1, 3))
slice(select(df, c(X, Z)), c(1, 3))
df %>% select(1, 3) %>% slice(1, 3)
df %>% select(X, Z) %>% slice(1, 3)
df %>% slice(1, 3) %>% select(1, 3)
df %>% slice(1, 3) %>% select(X, Z)
df %>% slice(c(1, 3)) %>% select(c(1, 3))
df %>% slice(c(1, 3)) %>% select(c(X, Z))
#    X   Z
# 1 20 AAA
# 2 10 CCC

select(df, 1, 3)
select(df, X, Z)
select(df, c(1, 3))
select(df, c(X, Z))
select(df, c("X", "Z"))
#    X   Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE

select(df, -Y)
select(df, !Y)
df %>% select(-Y)
df %>% select(!Y)
#    X   Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE

slice(df, 1, 3)
df %>% slice(1, 3)
#    X     Y   Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC

slice(df, 2:4)
df %>% slice(2:4)
#    X     Y   Z
# 1 40 200.2 BBB
# 2 10 300.3 CCC
# 3 30 400.4 DDD

slice(df, -2, -(4:5))
slice(df, -c(2, 4:5))
df %>% slice(-2, -(4:5))
df %>% slice(-c(2, 4:5))
#    X     Y   Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC

slice(select(df, X, Z), 1, 3)は、select()関数にslice()関数を重ねていますが、パイプを使って連結するとdf %>% select(X, Z) %>% slice(1, 3)とすっきり書けてわかりやすくなります。

データフレームの操作

列の選択

SQLのSELECT文に相当する列選択です。

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df
#    X     Y   Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE

df[['X']]
print(df[['X']])
df.iloc[:,[1-1]]
print(df.iloc[:,[1-1]])
df.loc[:,['X']]
print(df.loc[:,['X']])
#     X
# 0  20
# 1  40
# 2  10
# 3  30
# 4  50

df[['X','Z']]
print(df[['X','Z']])
df.loc[:,['X','Z']]
print(df.loc[:,['X','Z']])
df.iloc[:,[1-1,3-1]]
print(df.iloc[:,[1-1,3-1]])
#     X    Z
# 0  20  AAA
# 1  40  BBB
# 2  10  CCC
# 3  30  DDD
# 4  50  EEE

df[['Z','Y','X']]
print(df[['Z','Y','X']])
df.loc[:,['Z','Y','X']]
print(df.loc[:,['Z','Y','X']])
df.iloc[:,[3-1,2-1,1-1]]
print(df.iloc[:,[3-1,2-1,1-1]])
var = ['Z','Y','X']
df.loc[:,var]
print(df.loc[:,var])
#      Z      Y   X
# 0  AAA  100.1  20
# 1  BBB  200.2  40
# 2  CCC  300.3  10
# 3  DDD  400.4  30
# 4  EEE  500.5  50

R

R
df <- data.frame(X = x, Y = y, Z = z)
df
#    X     Y   Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE

select(df, X)
df %>% select(X)
#    X
# 1 20
# 2 40
# 3 10
# 4 30
# 5 50

select(df, X, Z)
df %>% select(X, Z)
#    X   Z
# 1 20 AAA
# 2 40 BBB
# 3 10 CCC
# 4 30 DDD
# 5 50 EEE

select(df, Z, X, Y)
df %>% select(Z, X, Y)
#     Z  X     Y
# 1 AAA 20 100.1
# 2 BBB 40 200.2
# 3 CCC 10 300.3
# 4 DDD 30 400.4
# 5 EEE 50 500.5

SQL

SQL
SELECT X FROM df;

SELECT Z, X FROM df;

SELECT Z, X, Y FROM df;

列の追加

データフレームに新しい列を追加する方法についてです。

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df['A'] = np.arange(1,5+1,1)
df
df['X3'] = df['X'] * 3
df
df['S'] = list('MFMFM') # ['M', 'F', 'M', 'F', 'M']
df
print(df)
#     X      Y    Z  A   X3  S
# 0  20  100.1  AAA  1   60  M
# 1  40  200.2  BBB  2  120  F
# 2  10  300.3  CCC  3   30  M
# 3  30  400.4  DDD  4   90  F
# 4  50  500.5  EEE  5  150  M

R

R
df <- data.frame(X=x, Y=y, Z=z)
df

mutate(df, A = 1:5)
df %>% mutate(A = 1:5)
#    X     Y   Z A
# 1 20 100.1 AAA 1
# 2 40 200.2 BBB 2
# 3 10 300.3 CCC 3
# 4 30 400.4 DDD 4
# 5 50 500.5 EEE 5

mutate(df, X3 = X*3)
df %>% mutate(X3 = X*3)
#    X     Y   Z A S  X3
# 1 20 100.1 AAA 1 M  60
# 2 40 200.2 BBB 2 F 120
# 3 10 300.3 CCC 3 M  30
# 4 30 400.4 DDD 4 F  90
# 5 50 500.5 EEE 5 M 150

mutate(df, s = tolower(S))
df %>% mutate(s = tolower(S))
#    X     Y   Z  X3
# 1 20 100.1 AAA  60
# 2 40 200.2 BBB 120
# 3 10 300.3 CCC  30
# 4 30 400.4 DDD  90
# 5 50 500.5 EEE 150

mutate(mutate(df, A = 1:5), X3 = X*3)
df %>% mutate(A = 1:5) %>% mutate(X3 = X*3)
#    X     Y   Z A  X3
# 1 20 100.1 AAA 1  60
# 2 40 200.2 BBB 2 120
# 3 10 300.3 CCC 3  30
# 4 30 400.4 DDD 4  90
# 5 50 500.5 EEE 5 150

mutate(df, A = 1:5, X3 = X*3)
df %>% mutate(A = 1:5, X3 = X*3)
#    X     Y   Z A  X3
# 1 20 100.1 AAA 1  60
# 2 40 200.2 BBB 2 120
# 3 10 300.3 CCC 3  30
# 4 30 400.4 DDD 4  90
# 5 50 500.5 EEE 5 150

mutate(df, A = 1:5,
           X3 = X*3,
           S = rep(c("M","F"),length=5),
           s = tolower(S))
df %>% mutate(A = 1:5,
              X3 = X*3,
              S = rep(c("M","F"),length=5),
              s = tolower(S))
#    X     Y   Z A  X3 S s
# 1 20 100.1 AAA 1  60 M m
# 2 40 200.2 BBB 2 120 F f
# 3 10 300.3 CCC 3  30 M m
# 4 30 400.4 DDD 4  90 F f
# 5 50 500.5 EEE 5 150 M m

df %>%
  mutate(A = 1:5) %>%
  mutate(X3 = X*3) %>%
  mutate(S = rep(c("M","F"),length=5)) %>%
  mutate(s = tolower(S))

SQL

SQL
SELECT X, Y, Z, X * 3 AS X3
FROM df;

SELECT
  X,
  Y,
  Z,
  X * 3 AS X3   
FROM
  df
;

列名変更

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.columns = ['xx','yy','zz']
df
print(df)
#    xx     yy   zz
# 0  20  100.1  AAA
# 1  40  200.2  BBB
# 2  10  300.3  CCC
# 3  30  400.4  DDD
# 4  50  500.5  EEE

df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.rename(columns={'X':'xx','Y':'yy','Z':'zz'})
print(df.rename(columns={'X':'xx','Y':'yy','Z':'zz'}))
#    xx     yy   zz
# 0  20  100.1  AAA
# 1  40  200.2  BBB
# 2  10  300.3  CCC
# 3  30  400.4  DDD
# 4  50  500.5  EEE
df.rename(columns={'X':'xx','Y':'yy','Z':'zz'}, inplace=True)
df
print(df)
#    xx     yy   zz
# 0  20  100.1  AAA
# 1  40  200.2  BBB
# 2  10  300.3  CCC
# 3  30  400.4  DDD
# 4  50  500.5  EEE

R

R
df <- data.frame(X=x, Y=y, Z=z)
df

rename(df, xx = X, yy = Y, zz = Z)
df %>% rename(xx = X, yy = Y, zz = Z)
#   xx    yy  zz
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE

df %>%
  rename(xx = X, yy = Y, zz = Z) %>%
  rename(yyy = yy)
#   xx   yyy  zz
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE

SQL

SQL
SELECT X AS xx, Y AS yyy, Z AS zz
FROM df;

列への代入

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df['Y'] = df['X'] / 1000
df
print(df)
#     X     Y    Z
# 0  20  0.02  AAA
# 1  40  0.04  BBB
# 2  10  0.01  CCC
# 3  30  0.03  DDD
# 4  50  0.05  EEE

R

R
df <- data.frame(X=x, Y=y, Z=z)
df

mutate(df, Y = X / 1000)
df %>% mutate(Y = X / 1000)
#    X    Y   Z
# 1 20 0.02 AAA
# 2 40 0.04 BBB
# 3 10 0.01 CCC
# 4 30 0.03 DDD
# 5 50 0.05 EEE

SQL

SQL
SELECT
  X,
  X / 100 AS Y,
  Z
FROM
  df;

行の条件抽出フィルタ

SQLのWHERE条件に相当する、条件を指定して行を抽出(フィルタ)する方法です。

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df['X'] < 30
# 0     True
# 1    False
# 2     True
# 3    False
# 4    False
# Name: X, dtype: bool
type(df['X'] < 30)
# pandas.core.series.Series
df[df['X'] < 30]
print(df[df['X'] < 30])
df.loc[df['X'] < 30,:]
print(df.loc[df['X'] < 30,:])
df.query('X < 30')
print(df.query('X < 30'))
#     X      Y    Z
# 0  20  100.1  AAA
# 2  10  300.3  CCC

(df['X'] > 10) & (df['X'] < 50)
# 0     True
# 1     True
# 2    False
# 3     True
# 4    False
# Name: X, dtype: bool
df[(df['X'] > 10) & (df['X'] < 50)]
print(df[(df['X'] > 10) & (df['X'] < 50)])
df.query('X > 10 & X < 50')
print(df.query('X > 10 & X < 50'))
df.query('10 < X < 50')
print(df.query('10 < X < 50'))
#     X      Y    Z
# 0  20  100.1  AAA
# 1  40  200.2  BBB
# 3  30  400.4  DDD

R

R
df <- data.frame(X=x, Y=y, Z=z)
df

filter(df, X < 30)
df %>% filter(X < 30)
#    X     Y   Z
# 1 20 100.1 AAA
# 2 10 300.3 CCC

filter(df, X > 10 & X < 50)
df %>% filter(X > 10 & X < 50)
#    X     Y   Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 30 400.4 DDD

SQL

SQL
SELECT *
FROM df
WHERE X < 30;

SELECT *
FROM df
WHERE X > 10 AND X < 50;

行のソート

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.sort_values(by='X')
print(df.sort_values(by='X'))
#     X      Y    Z
# 2  10  300.3  CCC
# 0  20  100.1  AAA
# 3  30  400.4  DDD
# 1  40  200.2  BBB
# 4  50  500.5  EEE

df.sort_values(by='X', ascending=False)
print(df.sort_values(by='X', ascending=False))
#     X      Y    Z
# 4  50  500.5  EEE
# 1  40  200.2  BBB
# 3  30  400.4  DDD
# 0  20  100.1  AAA
# 2  10  300.3  CCC

df['S'] = ['M','F','M','F','M']
df
print(df)
#     X      Y    Z  S
# 0  20  100.1  AAA  M
# 1  40  200.2  BBB  F
# 2  10  300.3  CCC  M
# 3  30  400.4  DDD  F
# 4  50  500.5  EEE  M
df.sort_values(by=['S','X'])
print(df.sort_values(by=['S','X']))
#     X      Y    Z  S
# 3  30  400.4  DDD  F
# 1  40  200.2  BBB  F
# 2  10  300.3  CCC  M
# 0  20  100.1  AAA  M
# 4  50  500.5  EEE  M

R

R
df <- data.frame(X=x, Y=y, Z=z)
df

arrange(df, X)
df %>% arrange(X)
#    X     Y   Z
# 1 10 300.3 CCC
# 2 20 100.1 AAA
# 3 30 400.4 DDD
# 4 40 200.2 BBB
# 5 50 500.5 EEE

arrange(df, desc(X))
df %>% arrange(desc(X))
#    X     Y   Z
# 1 50 500.5 EEE
# 2 40 200.2 BBB
# 3 30 400.4 DDD
# 4 20 100.1 AAA
# 5 10 300.3 CCC

df <- data.frame(X=x, Y=y, Z=z)
df <- df %>% mutate(S = rep(c("M","F"),length=5))
df
#    X     Y   Z S
# 1 20 100.1 AAA M
# 2 40 200.2 BBB F
# 3 10 300.3 CCC M
# 4 30 400.4 DDD F
# 5 50 500.5 EEE M

arrange(df, S, X)
df %>% arrange(S, X)
#    X     Y   Z S
# 4 30 400.4 DDD F
# 2 40 200.2 BBB F
# 3 10 300.3 CCC M
# 1 20 100.1 AAA M
# 5 50 500.5 EEE M

SQL

SQL
SELECT *
FROM df
ORDER BY X ASC;

SELECT *
FROM df
ORDER BY X DESC;

SELECT *
FROM df
ORDER BY S ASC, X ASC;

重複行の削除

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df

R
x = rep(1:3, 1:3)*10
y = 1:6*100+1:6/10
z = c("AAA","BBB","CCC","DDD","EEE", "FFF")
df <- data.frame(X = x, Y = y, Z = z)
df
#    X     Y   Z
# 1 10 100.1 AAA
# 2 20 200.2 BBB
# 3 20 300.3 CCC
# 4 30 400.4 DDD
# 5 30 500.5 EEE
# 6 30 600.6 FFF

distinct(df, X)
distinct(df, X, .keep_all = FALSE)
df %>% distinct(X)
df %>% distinct(X, .keep_all = FALSE)
#    X
# 1 10
# 2 20
# 3 30

distinct(df, X, .keep_all = TRUE)
df %>% distinct(X, .keep_all = TRUE)
#    X     Y   Z
# 1 10 100.1 AAA
# 2 20 200.2 BBB
# 3 30 400.4 DDD

df %>% arrange(X, desc(Y))
#    X     Y   Z
# 1 10 100.1 AAA
# 2 20 300.3 CCC
# 3 20 200.2 BBB
# 4 30 600.6 FFF
# 5 30 500.5 EEE
# 6 30 400.4 DDD
df %>% arrange(X, desc(Y)) %>% distinct(X, .keep_all = TRUE)
#    X     Y   Z
# 1 10 100.1 AAA
# 2 20 300.3 CCC
# 3 30 600.6 FFF
SQL

データフレームの集計

基本統計量

基本統計量(要約統計量)の表示です。

Python
データフレームのdescribe()メソッドで基本統計量を表示できます。

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df
print(df)
#     X      Y    Z
# 0  20  100.1  AAA
# 1  40  200.2  BBB
# 2  10  300.3  CCC
# 3  30  400.4  DDD
# 4  50  500.5  EEE
df.describe()
print(df.describe())
#                X           Y
# count   5.000000    5.000000
# mean   30.000000  300.300000
# std    15.811388  158.271997
# min    10.000000  100.100000
# 25%    20.000000  200.200000
# 50%    30.000000  300.300000
# 75%    40.000000  400.400000
# max    50.000000  500.500000
df.describe().T
print(df.describe().T)
#    count   mean         std    min    25%    50%    75%    max
# X    5.0   30.0   15.811388   10.0   20.0   30.0   40.0   50.0
# Y    5.0  300.3  158.271997  100.1  200.2  300.3  400.4  500.5
type(df.describe())
# pandas.core.frame.DataFrame

結果は、8行(今の場合は8×2)のデータフレームが返ります。
df.describe().Tは転置して(2×8の形で)表示しているだけです。

R
dplyrの機能ではありませんが、summary()でデータフレームのサマリーを表示できます。

R
df <- data.frame(X=x, Y=y, Z=z)
df
#    X     Y   Z
# 1 20 100.1 AAA
# 2 40 200.2 BBB
# 3 10 300.3 CCC
# 4 30 400.4 DDD
# 5 50 500.5 EEE

summary(df)
df %>% summary()
#        X            Y           Z    
#  Min.   :10   Min.   :100.1   AAA:1  
#  1st Qu.:20   1st Qu.:200.2   BBB:1  
#  Median :30   Median :300.3   CCC:1  
#  Mean   :30   Mean   :300.3   DDD:1  
#  3rd Qu.:40   3rd Qu.:400.4   EEE:1  
#  Max.   :50   Max.   :500.5             

SQL

SQL

集計

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
df.count()
print(df.count())
# X    5
# Y    5
# Z    5
# S    5
# dtype: int64
type(df.count())
# pandas.core.series.Series

df.sum()
print(df.sum())
# X                150
# Y             1501.5
# Z    AAABBBCCCDDDEEE
# S              MFMFM
# dtype: object

df[['X','Y']].sum()
print(df[['X','Y']].sum())
# X     150.0
# Y    1501.5
# dtype: float64

df.mean()
print(df.mean())
# X     30.0
# Y    300.3
# dtype: float64

df.cumsum()
print(df.cumsum())
#      X       Y                Z      S
# 0   20   100.1              AAA      M
# 1   60   300.3           AAABBB     MF
# 2   70   600.6        AAABBBCCC    MFM
# 3  100  1001.0     AAABBBCCCDDD   MFMF
# 4  150  1501.5  AAABBBCCCDDDEEE  MFMFM

df.cov()
print(df.cov())
#          X          Y
# X   250.00   1251.250
# Y  1251.25  25050.025

df.corr()
print(df.corr())
#      X    Y
# X  1.0  0.5
# Y  0.5  1.0

R

R
df <- data.frame(X=x, Y=y, Z=z)
df

count(df)
df %>% count()
#   n
# 1 5

summarise(df, sumX = sum(X), sumY = sum(Y), meanX = mean(X), meanY = mean(Y))
df %>% summarise(sumX = sum(X), sumY = sum(Y), meanX = mean(X), meanY = mean(Y))
#   sumX   sumY meanX meanY
# 1  150 1501.5    30 300.3

summarise(df, n(), min(X), max(X))
df %>% summarise(n(), min(X), max(X), median(X))
#   n() min(X) max(X) median(X)
# 1   5     10     50        30

summarise(df, mean(X), sd(X), IQR(X), mad(X))
df %>% summarise(mean(X), sd(X), IQR(X), mad(X))
#   mean(X)    sd(X) IQR(X) mad(X)
# 1      30 15.81139     20 14.826

summarise(df, quantile(X), quantile(Y))
df %>% summarise(quantile(X), quantile(Y))
#   quantile(X) quantile(Y)
# 1          10       100.1
# 2          20       200.2
# 3          30       300.3
# 4          40       400.4
# 5          50       500.5

SQL

SQL
SELECT count(*), sum(X), sum(Y), mean(X), mean(Y)
FROM df;

グループ化

グループ化して集計についてです。

Python

Python3
df = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
s = ['M','F','M','F','M']
df['S'] = s
df
print(df)
#     X      Y    Z  S
# 0  20  100.1  AAA  M
# 1  40  200.2  BBB  F
# 2  10  300.3  CCC  M
# 3  30  400.4  DDD  F
# 4  50  500.5  EEE  M

# グループ化
df.groupby('S')
df.groupby('S').sum()
print(df.groupby('S').sum())
#     X      Y
# S           
# F  70  600.6
# M  80  900.9
df.groupby('S').mean()
print(df.groupby('S').mean())
#            X      Y
# S                  
# F  35.000000  300.3
# M  26.666667  300.3

df.groupby(['S','Z'])
df.groupby(['S','Z']).sum()
print(df.groupby(['S','Z']).sum())
#         X      Y
# S Z             
# F BBB  40  200.2
#   DDD  30  400.4
# M AAA  20  100.1
#   CCC  10  300.3
#   EEE  50  500.5
df.groupby(['S','Z']).mean()
print(df.groupby(['S','Z']).mean())
#         X      Y
# S Z             
# F BBB  40  200.2
#   DDD  30  400.4
# M AAA  20  100.1
#   CCC  10  300.3
#   EEE  50  500.5

R

R
s <- rep(c("M","F"),length=5)
df <- data.frame(X=x, Y=y, Z=z)
df <- df %>% mutate(S = s)
df
#    X     Y   Z S
# 1 20 100.1 AAA M
# 2 40 200.2 BBB F
# 3 10 300.3 CCC M
# 4 30 400.4 DDD F
# 5 50 500.5 EEE M

df %>% group_by(S) %>% summarise(sumX = sum(X), sumY = sum(Y))
#   S      sumX  sumY
#   <chr> <dbl> <dbl>
# 1 F        70  601.
# 2 M        80  901.

df %>% group_by(S, Z) %>%
  summarise(sumX = sum(X), sumY = sum(Y))
#   S     Z      sumX  sumY
#   <chr> <fct> <dbl> <dbl>
# 1 F     BBB      40  200.
# 2 F     DDD      30  400.
# 3 M     AAA      20  100.
# 4 M     CCC      10  300.
# 5 M     EEE      50  500.

SQL

SQL
SELECT sum(X), sum(Y), mean(X), mean(Y)
FROM df
GROUP BY S;

SELECT sum(X), sum(Y), mean(X), mean(Y)
FROM df
GROUP BY S, Z;

データフレームの結合

横に並べる

Python

Python3
k1 = np.arange(1,5+1)
k2 = np.array([1,3,5,1,3])
x1 = k1*10
x2 = k1*100
x3 = k1*1000
k1, k2, x1, x2, x3
# (array([1, 2, 3, 4, 5]),
#  array([1, 3, 5, 1, 3]),
#  array([10, 20, 30, 40, 50]),
#  array([100, 200, 300, 400, 500]),
#  array([1000, 2000, 3000, 4000, 5000]))

dfk1x1 = pd.DataFrame({'k1':k1, 'x1':x1})
dfk1x1
print(dfk1x1)
#    k1  x1
# 0   1  10
# 1   2  20
# 2   3  30
# 3   4  40
# 4   5  50
dfk2x2 = pd.DataFrame({'k2':k2, 'x2':x2})
dfk2x2
print(dfk2x2)
#    k2   x2
# 0   1  100
# 1   3  200
# 2   5  300
# 3   1  400
# 4   3  500

# 横に並べる
pd.concat([dfk1x1, dfk2x2], axis=1)
print(pd.concat([dfk1x1, dfk2x2], axis=1))
#    k1  x1  k2   x2
# 0   1  10   1  100
# 1   2  20   3  200
# 2   3  30   5  300
# 3   4  40   1  400
# 4   5  50   3  500

R

R
k1 <- 1:5
k2 <- c(1,3,5,1,3)
x1 <- k1*10
x2 <- k1*100
x3 <- k1*1000
k1
# [1] 1 2 3 4 5
k2
# [1] 1 3 5 1 3
x1
# [1] 10 20 30 40 50
x2
# [1] 100 200 300 400 500
x3
# [1] 1000 2000 3000 4000 5000

dfk1x1 <- data.frame(k1, x1)
dfk1x1
#   k1 x1
# 1  1 10
# 2  2 20
# 3  3 30
# 4  4 40
# 5  5 50
dfk2x2 <- data.frame(k2, x2)
dfk2x2
#   k2  x2
# 1  1 100
# 2  3 200
# 3  5 300
# 4  1 400
# 5  3 500

# 横に並べる
bind_cols(dfk1x1, dfk2x2)
#   k1 x1 k2  x2
# 1  1 10  1 100
# 2  2 20  3 200
# 3  3 30  5 300
# 4  4 40  1 400
# 5  5 50  3 500

SQL

SQL

縦に並べる

データフレームを縦に並べます。SQLのUNIONの操作です。

Python

Python3
dfk2x2_2 = pd.DataFrame({'k1':k2, 'x1':x2})
dfk2x2_2
print(dfk2x2_2)
#    k1   x1
# 0   1  100
# 1   3  200
# 2   5  300
# 3   1  400
# 4   3  500

# 縦に並べる
pd.concat([dfk1x1, dfk2x2_2], axis=0)
print(pd.concat([dfk1x1, dfk2x2_2], axis=0))
pd.concat([dfk1x1, dfk2x2_2])
print(pd.concat([dfk1x1, dfk2x2_2]))
dfk1x1.append(dfk2x2_2)
print(dfk1x1.append(dfk2x2_2))
#    k1   x1
# 0   1   10
# 1   2   20
# 2   3   30
# 3   4   40
# 4   5   50
# 0   1  100
# 1   3  200
# 2   5  300
# 3   1  400
# 4   3  500

pd.concat([dfk1x1, dfk2x2_2], ignore_index=True)
print(pd.concat([dfk1x1, dfk2x2_2], ignore_index=True))
dfk1x1.append(dfk2x2_2, ignore_index=True)
print(dfk1x1.append(dfk2x2_2, ignore_index=True))
#    k1   x1
# 0   1   10
# 1   2   20
# 2   3   30
# 3   4   40
# 4   5   50
# 5   1  100
# 6   3  200
# 7   5  300
# 8   1  400
# 9   3  500

横に並べるのに使ったconcat()axis=0(デフォルト)として使える他に、append()が使えます。ignore_index=Trueとするとインデックスがふり直されます。列名は一致させておく必要があります。

R

R
dfk2x2_2 <- data.frame(k1 = k2, x1 = x2)
dfk2x2_2
#   k1  x1
# 1  1 100
# 2  3 200
# 3  5 300
# 4  1 400
# 5  3 500

# 縦に並べる
bind_rows(dfk1x1, dfk2x2_2)
#    k1  x1
# 1   1  10
# 2   2  20
# 3   3  30
# 4   4  40
# 5   5  50
# 6   1 100
# 7   3 200
# 8   5 300
# 9   1 400
# 10  3 500

union_all(x = dfk1x1, y = dfk2x2_2)
dfk1x1 %>% union_all(y = dfk2x2_2)
#    k1  x1
# 1   1  10
# 2   2  20
# 3   3  30
# 4   4  40
# 5   5  50
# 6   1 100
# 7   3 200
# 8   5 300
# 9   1 400
# 10  3 500

union_all(x = dfk1x1, y = dfk1x1) # 重複行を削除せずUNION
dfk1x1 %>% union_all(y = dfk1x1)
#    k1 x1
# 1   1 10
# 2   2 20
# 3   3 30
# 4   4 40
# 5   5 50
# 6   1 10
# 7   2 20
# 8   3 30
# 9   4 40
# 10  5 50

union(x = dfk1x1, y = dfk1x1)     # 重複行を削除してUNION
dfk1x1 %>% union(y = dfk1x1)
#   k1 x1
# 1  1 10
# 2  2 20
# 3  3 30
# 4  4 40
# 5  5 50

SQL

SQL
SELECT * FROM dfk1x1
UNION ALL
SELECT * FROM dfk2x2
;

SELECT * FROM dfk1x1
UNION
SELECT * FROM dfk2x2
;

キーで結合

SQLのジョインの操作です。

1つのキーで結合

Python

Python3
dfk1x2 = pd.DataFrame({'k':k1, 'x2':x2}).drop(2-1)
dfk1x2
print(dfk1x2)
#    k   x2
# 0  1  100
# 2  3  300
# 3  4  400
# 4  5  500
dfk1x3 = pd.DataFrame({'k':k1, 'x3':x3}).drop(3-1)
dfk1x3
print(dfk1x3)
#    k    x3
# 0  1  1000
# 1  2  2000
# 3  4  4000
# 4  5  5000

# 内部結合(INNER JOIN)
pd.merge(dfk1x2, dfk1x3, on='k')
print(pd.merge(dfk1x2, dfk1x3, on='k'))
pd.merge(dfk1x2, dfk1x3, on='k', how='inner')
print(pd.merge(dfk1x2, dfk1x3, on='k', how='inner'))
#    k   x2    x3
# 0  1  100  1000
# 1  4  400  4000
# 2  5  500  5000

# 左外部結合(LEFT OUTER JOIN)
pd.merge(dfk1x2, dfk1x3, on='k', how='left')
print(pd.merge(dfk1x2, dfk1x3, on='k', how='left'))
#    k   x2      x3
# 0  1  100  1000.0
# 1  3  300     NaN
# 2  4  400  4000.0
# 3  5  500  5000.0

# 右外部結合(RIGHT OUTER JOIN)
pd.merge(dfk1x2, dfk1x3, on='k', how='right')
print(pd.merge(dfk1x2, dfk1x3, on='k', how='right'))
#    k     x2    x3
# 0  1  100.0  1000
# 1  4  400.0  4000
# 2  5  500.0  5000
# 3  2    NaN  2000

# 完全外部結合(FULL OUTER JOIN)
pd.merge(dfk1x2, dfk1x3, on='k', how='outer')
print(pd.merge(dfk1x2, dfk1x3, on='k', how='outer'))
#    k     x2      x3
# 0  1  100.0  1000.0
# 1  3  300.0     NaN
# 2  4  400.0  4000.0
# 3  5  500.0  5000.0
# 4  2    NaN  2000.0

R

R
dfk1x2 <- data.frame(k1, x2)[-2,]
dfk1x2
#   k1  x2
# 1  1 100
# 3  3 300
# 4  4 400
# 5  5 500
dfk1x3 <- data.frame(k1, x3)[-3,]
dfk1x3
#   k1   x3
# 1  1 1000
# 2  2 2000
# 4  4 4000
# 5  5 5000

# 内部結合(INNER JOIN)
inner_join(x = dfk1x2, y = dfk1x3, by = "k1")
dfk1x2 %>% inner_join(y = dfk1x3, by = "k1")
#   k1  x2   x3
# 1  1 100 1000
# 2  4 400 4000
# 3  5 500 5000

# 左外部結合(LEFT OUTER JOIN)
left_join(x = dfk1x2, y = dfk1x3, by ="k1")
dfk1x2 %>% left_join(y = dfk1x3, by ="k1")
#   k1  x2   x3
# 1  1 100 1000
# 2  3 300   NA
# 3  4 400 4000
# 4  5 500 5000

# 右外部結合(RIGHT OUTER JOIN)
right_join(x = dfk1x2, y = dfk1x3, by = "k1")
dfk1x2 %>% right_join(y = dfk1x3, by = "k1")
#   k1  x2   x3
# 1  1 100 1000
# 2  4 400 4000
# 3  5 500 5000
# 4  2  NA 2000
right_join(x = dfk1x2, y = dfk1x3, by = "k1") %>% arrange(k1)
dfk1x2 %>% right_join(y = dfk1x3, by = "k1") %>% arrange(k1)
#   k1  x2   x3
# 1  1 100 1000
# 2  2  NA 2000
# 3  4 400 4000
# 4  5 500 5000

# 完全外部結合(FULL OUTER JOIN)
full_join(x = dfk1x2, y = dfk1x3, by = "k1")
dfk1x2 %>% full_join(y = dfk1x3, by = "k1")
# #   k1  x2   x3
# 1  1 100 1000
# 2  3 300   NA
# 3  4 400 4000
# 4  5 500 5000
# 5  2  NA 2000
full_join(x = dfk1x2, y = dfk1x3, by = "k1") %>% arrange(k1)
dfk1x2 %>% full_join(x = dfk1x2, y = dfk1x3, by = "k1") %>% arrange(k1)
#   k1  x2   x3
# 1  1 100 1000
# 2  2  NA 2000
# 3  3 300   NA
# 4  4 400 4000
# 5  5 500 5000

SQL

SQL
SELECT *
FROM dfk1x2 INNER JOIN dfk1x3 ON dfk1x2.k = dfk1x3.k
;

SELECT *
FROM dfk1x2 LEFT JOIN dfk1x3 ON dfk1x2.k = dfk1x3.k
;

SELECT *
FROM dfk1x2 RIGHT JOIN dfk1x3 ON dfk1x2.k = dfk1x3.k
;

SELECT *
FROM dfk1x2 FULL JOIN dfk1x3 ON dfk1x2.k = dfk1x3.k
;

名前の異なるキーで結合

Python

Python3
# 名前の異なるキーで結合
pd.merge(dfk1x1, dfk2x2, left_on='k1', right_on='k2', how='left')
print(pd.merge(dfk1x1, dfk2x2, left_on='k1', right_on='k2', how='left'))
#    k1  x1   k2     x2
# 0   1  10  1.0  100.0
# 1   1  10  1.0  400.0
# 2   2  20  NaN    NaN
# 3   3  30  3.0  200.0
# 4   3  30  3.0  500.0
# 5   4  40  NaN    NaN
# 6   5  50  5.0  300.0

R

R
# 名前の異なるキーで結合
left_join(x = dfk1x1, y = dfk2x2, by = c("k1" = "k2"))
dfk1x1 %>% left_join(y = dfk2x2, by = c("k1" = "k2"))
#   k1 x1  x2
# 1  1 10 100
# 2  1 10 400
# 3  2 20  NA
# 4  3 30 200
# 5  3 30 500
# 6  4 40  NA
# 7  5 50 300

SQL

SQL
SELECT *
FROM dfk1x1 LEFT JOIN dfk2x2 ON dfk1x1.k1 = dfk2x2.k2
;

複数のキーで結合

Python

Python3
# 複数のキーで結合
dfk1k2x2 = pd.DataFrame({'k1':k1, 'k2':k2, 'x2':x2}).drop(2-1)
dfk1k2x2
print(dfk1k2x2)
#    k1  k2   x2
# 0   1   1  100
# 2   3   5  300
# 3   4   1  400
# 4   5   3  500
dfk1k2x3 = pd.DataFrame({'k1':k1, 'k2':k2, 'x3':x3}).drop(3-1)
dfk1k2x3
print(dfk1k2x3)
#    k1  k2    x3
# 0   1   1  1000
# 1   2   3  2000
# 3   4   1  4000
# 4   5   3  5000
pd.merge(dfk1k2x2, dfk1k2x3, on=['k1','k2'], how='left')
print(pd.merge(dfk1k2x2, dfk1k2x3, on=['k1','k2'], how='left'))
#    k1  k2   x2      x3
# 0   1   1  100  1000.0
# 1   3   5  300     NaN
# 2   4   1  400  4000.0
# 3   5   3  500  5000.0
dfk1k2x2.merge(dfk1k2x3, on=['k1','k2'], how='left')
print(dfk1k2x2.merge(dfk1k2x3, on=['k1','k2'], how='left'))
#    k1  k2   x2      x3
# 0   1   1  100  1000.0
# 1   3   5  300     NaN
# 2   4   1  400  4000.0
# 3   5   3  500  5000.0

最後の例ように、pd.merge(dfx, dfy, ...)dfx.merge(dfy, ...)の形にも書くこともできます。他の上の例も同様です。

R

R
# 複数のキーで結合
dfk1k2x2 <- data.frame(k1, k2, x2)[-2,]
dfk1k2x2
#   k1 k2  x2
# 1  1  1 100
# 3  3  5 300
# 4  4  1 400
# 5  5  3 500
dfk1k2x3 <- data.frame(k1, k2, x3)[-3,]
dfk1k2x3
#   k1 k2   x3
# 1  1  1 1000
# 2  2  3 2000
# 4  4  1 4000
# 5  5  3 5000

left_join(x = dfk1k2x2, y = dfk1k2x3, by = c("k1","k2"))
dfk1k2x2 %>% left_join(y = dfk1k2x3, by = c("k1","k2"))
#   k1 k2  x2   x3
# 1  1  1 100 1000
# 2  3  5 300   NA
# 3  4  1 400 4000
# 4  5  3 500 5000

SQL

SQL
SELECT *
FROM dfk1k2x2 LEFT JOIN dfk1k2x3
 ON dfk1k2x2.k1 = dfk1k2x3.k1 AND
    dfk1k2x2.k2 = dfk1k2x3.k2
;

dplyrについて詳しくは、こちらの記事参照
dplyr — 高速data.frame処理

まとめ

一覧

各言語で使用する関数等を一覧にまとめます。(比較のために、SQLも示しました。)

Python(pandas) R RDB(Access) Excel 行列
オブジェクト DataFrame data.frame テーブル テーブル 行列(matrix)
index(axis=0) 行(row) レコード 行(ROW) 行(row)
columns(axis=1) 列(col) フィールド 列(COLUMN) 列(column)
要素 values CELL 要素
列の正体 pandasのSeries ベクトル 同じデータ型の値

データフレームdf

X Y Z
1 20 100.1 AAA
2 40 200.2 BBB
3 10 300.3 CCC
4 30 400.4 DDD
5 50 500.5 EEE

データフレームへのアクセス

Python(pandas) R(標準) R(dplyr) 結果
列ベクトル df.X
df['X']
df$X
df[["X"]]
df[[1]]
pull(df, X)
pull(df, 1)
ベクトル
(20,40,
10,30,50)
列指定 df[['X']]
df.loc[:,['X']]
df["X"]
df[1]
select(df, X)
select(df, 1)
データフレーム
複数列指定 df[['X','Z']]
df.loc[:,['X','Z']]
df.iloc[:,
[1-1,3-1]]
df[c("X","Z")]
df[c(1,3)]
select(df, X, Z)
select(df, 1, 3)
データフレーム
df[['Z','Y','X']]
df.loc[:,
['Z','Y','X']]
df.iloc[:,
[3-1,2-1,1-1]]
df[c("Z","X","Y")]
df[c(3,1,2)]
select(df, Z, Y, X)
select(df, 3, 2, 1)
データフレーム
先頭n行 df.head(n) head(df, n) slice_head(df, n = n) データフレーム
末尾n行 df.tail(n) tail(df, n) slice_tail(df, n = 2) データフレーム
要素 df.iat[1-1,1-1]
df.at[1-1,'X']
df.iloc[1-1,1-1]
df.loc[1-1,'X']
df['X'][1-1]
df.X[1-1]
df[1,1]
df[1,"X"]
df[[1,1]]
df[[1,"X"]]
df$X[1]
df[["X"]][1]
pull(df, 1)[1]
pull(df, X)[1]

20
複数行
複数列
指定
df.iloc[
[1-1,3-1],[1-1,3-1]]
df.loc[
[1-1,3-1],['X','Z']]
df[c(1,3),c(1,3)]
df[c(1,3),c("X","Z")]
df[c(T,F,T,F,F),
c(T,F,T)]
slice(select(df,
1, 3), 1, 3)
slice(select(df,
X, Z), 1, 3)
slice(select(df,
c(1, 3)), c(1, 3))
slice(select(df,
c(X, Z)), c(1, 3))
データフレーム
複数列指定 df.iloc[:,[1-1,3-1]]
df.loc[:,['X','Z']]
df[['X','Z']]
df[,c(1,3)]
df[,c("X","Z")]
df[,c(T,F,T)]
select(df, 1, 3)
select(df, X, Z)
select(df, c(1, 3))
select(df, c(X, Z))
select(df, c("X", "Z"))
データフレーム
複数行指定 df.iloc[[1-1,3-1],:]
df.loc[[1-1,3-1],:]
df.iloc[[1-1,3-1]]
df.loc[[1-1,3-1]]
df[c(1,3),]
df[c(T,F,T,F,F),]
slice(df, 1, 3)
slice(df, c(1, 3))
データフレーム

データフレームの操作

Python(pandas) R(標準) R(dplyr) SQL
列の選択 df[['Z','Y','X']] df[c("Z","X","Y")] select(df,
Z, X, Y)
SELECT
Z, X, Y
FROM df;
列の追加 df['X3'] = df['X'] * 3 df$X3 <- df$X * 3 mutate(df,
X3 = X*3)
SELECT
X, Y, Z,
X * 3 AS X3
FROM df;
列名変更 df.rename(
columns={'X':'xx',
'Y':'yy',
'Z':'zz'})
names(df) <-
c("xx", "yy", "zz")
rename(df,
xx=X,
yy=Y,
zz=Z)
SELECT
X AS xx,
Y AS yy,
Z AS zzz
FROM df;
行の条件抽出
フィルタ
df['X'] < 30
df.query(
'X < 30')
df[df$X < 30,] filter(df,
X < 30)
SELECT *
FROM df
WHERE
X < 30;
df[(df['X'] > 10)
& (df['X'] < 50)]
df.query(
'10 < X < 50')
df[df$X > 10
& df$X < 50,]
filter(df,
X > 10
& X < 50)
SELECT *
FROM df
WHERE
X > 10
AND
X < 50;
行のソート df.sort_values(
by='X')
df[order(df$X), ] arrange(df,
X)
SELECT *
FROM df
ORDER BY
X ASC;
行のソート df.sort_values(
by='X',
ascending=False)
df[order(df$X,
decreasing=T),]
arrange(df,
desc(X))
SELECT *
FROM df
ORDER BY
X DESC;
行のソート df.sort_values(
by=['S','X'])
df[order(
df$S, df$X), ]
arrange(df,
S, X)
SELECT *
FROM df
ORDER BY
S ASC,
X ASC;
重複行の削除 distinct(df,
X,
.keep_all=T)

データフレームの集計

Python(pandas) R(標準) R(dplyr) SQL
基本統計量 df.describe()
df.describe().T
summary(df)
グループ化 df.groupby('S') group_by(df, S) GROUP BY S;
グループ化 df.groupby(['S','Z']) group_by(df,
S, Z)
GROUP BY S, Z;

データフレームの結合

Python(pandas) R(標準) R(dplyr) SQL
横に並べる pd.concat([df1, df2],
axis=1)
cbind(df1, df2) bind_cols(
df1, df2)
縦に並べる pd.concat([df1, df2],
axis=0)
pd.concat([df1, df2],
ignore_index=True)
df1.append(df2,
ignore_index=True)
rbind(df1, df2) bind_rows(
df1, df2)
union_all(
x = df1,
y = df2)
UNION ALL
内部結合
(INNER JOIN)
pd.merge(df1, df2,
on='k')
pd.merge(df1, df2,
on='k',
how='inner')
merge(df1, df2)
merge(df1, df2,
by="k")
inner_join(
x = df1,
y = df2,
by = "k")
FROM df1
INNER JOIN df2
ON df1.k = df2.k
左外部結合
(LEFT
OUTER JOIN)
pd.merge(df1, df2,
on='k',
how='left')
merge(df1, df2,
by="k",
all.x=T)
left_join(
x = df1,
y = df2,
by = "k")
FROM df1
LEFT JOIN df2
ON df1.k = df2.k
右外部結合
(RIGHT
OUTER JOIN)
pd.merge(df1, df2,
on='k',
how='right')
merge(df1, df2,
by="k",
all.y=T)
right_join(
x = df1,
y = df2,
by = "k")
FROM df1
RIGHT JOIN df2
ON df1.k = df2.k
完全外部結合
(FULL
OUTER JOIN)
pd.merge(df1, df2,
on='k',
how='outer')
merge(df1, df2,
by="k",
all=T)
full_join(
x = df1,
y = df2,
by = "k")
FROM df1
FULL JOIN df2
ON df1.k = df2.k
名前の異なる
キーで結合
pd.merge(df1, df2,
left_on='k1',
right_on='k2',
how='left')
merge(df1, df2,
by.x="k1",
by.y="k2",
all.x=T)
left_join(
x = df1,
y = df2,
by = c("k1"="k2"))
FROM df1
LEFT JOIN df2
ON df1.k1 = df2.k2
複数のキーで結合 pd.merge(df1, df2,
on=['k1','k2'],
how='left')
merge(df1, df2,
by=c("k1","k2"),
all.x=T)
left_join(
x = df1,
y = df2,
by = c("k1","k2"))
FROM df2
LEFT JOIN df2
ON df1.k1 = df2.k1
AND df1.k2 = df2.k2

プログラム全体

参考までに使ったプログラムの全体を示します。
(今回は長くなったので省略します。)

Python

Python3

R

R

VBA

VBA

参考

0
3
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
3