はじめに
Rのデータフレームの計算に用いられる関数について、サンプルデータを用いて具体的に練習してみました。
目次
Rのデータフレームの計算に用いる関数
Rのデータフレームの計算に用いる関数について、サンプルデータを用いて具体的に練習します。
なお、以下については別記事でまとめています。
参考:
https://r4ds.had.co.nz/transform.html
https://github.com/tidyverse/dplyr
https://heavywatal.github.io/rstats/dplyr.html
練習に用いるデータ
Rのサンプルデータiris
とanscombe
を用います。
行番号を表示した方が分かりやすいため、tibble::rowid_to_column()
で行番号を列名にしています。
(こちらの記事と同じデータを用います。)
library(dplyr)
library(tibble)
library(tidyr)
iris
# 行番号を列名に
iris_tbl <- iris %>% as_tibble() %>% rowid_to_column("id")
iris_tbl
# # A tibble: 150 x 6
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <int> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa
# 2 2 4.9 3 1.4 0.2 setosa
anscombe
anscombe_tbl <- anscombe %>% as_tibble() %>% rowid_to_column("id")
anscombe_tbl
# # A tibble: 11 x 9
# id x1 x2 x3 x4 y1 y2 y3 y4
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 10 10 10 8 8.04 9.14 7.46 6.58
# 2 2 8 8 8 8 6.95 8.14 6.77 5.76
主にmutate()やfilter()を補助するもの
row_number(), min_rank(), dense_rank(), percent_rank(), cume_dist(), ntile():順位計算
順位(昇順の順位)を計算する関数です。グループ化しておくと、グループごとの順位になります。
引数なしでrow_number()
とすると、行の通し番号になります。
# row_number()
iris_tbl %>%
mutate(row = row_number())
# A tibble: 150 x 7
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
# <int> <dbl> <dbl> <dbl> <dbl> <fct> <int>
# 1 1 5.1 3.5 1.4 0.2 setosa 1
# 2 2 4.9 3 1.4 0.2 setosa 2
# 3 3 4.7 3.2 1.3 0.2 setosa 3
# 4 4 4.6 3.1 1.5 0.2 setosa 4
# 5 5 5 3.6 1.4 0.2 setosa 5
# 6 6 5.4 3.9 1.7 0.4 setosa 6
# 7 7 4.6 3.4 1.4 0.3 setosa 7
# 8 8 5 3.4 1.5 0.2 setosa 8
# 9 9 4.4 2.9 1.4 0.2 setosa 9
# 10 10 4.9 3.1 1.5 0.1 setosa 10
# # ... with 140 more rows
iris_tbl %>%
group_by(Species) %>%
mutate(row = row_number()) %>%
slice_head(n = 5)
# # A tibble: 15 x 7
# # Groups: Species [3]
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species row
# <int> <dbl> <dbl> <dbl> <dbl> <fct> <int>
# 1 1 5.1 3.5 1.4 0.2 setosa 1
# 2 2 4.9 3 1.4 0.2 setosa 2
# 3 3 4.7 3.2 1.3 0.2 setosa 3
# 4 4 4.6 3.1 1.5 0.2 setosa 4
# 5 5 5 3.6 1.4 0.2 setosa 5
# 6 51 7 3.2 4.7 1.4 versicolor 1
# 7 52 6.4 3.2 4.5 1.5 versicolor 2
# 8 53 6.9 3.1 4.9 1.5 versicolor 3
# 9 54 5.5 2.3 4 1.3 versicolor 4
# 10 55 6.5 2.8 4.6 1.5 versicolor 5
# 11 101 6.3 3.3 6 2.5 virginica 1
# 12 102 5.8 2.7 5.1 1.9 virginica 2
# 13 103 7.1 3 5.9 2.1 virginica 3
# 14 104 6.3 2.9 5.6 1.8 virginica 4
# 15 105 6.5 3 5.8 2.2 virginica 5
引数を指定すると、row_number()
は、引数の昇順に並び換えたときの行番号(同順位でも別番号)になります。
min_rank()
とdense_rank()
は順位(同順位は同じ番号)で、dense_rank()
は番号が飛ばない順位になります。percent_rank()
は、min_rank()
を0から1の範囲に変換したもの、cume_dist()
は、累積密度です。ntile()
は、引数の順にn個のクラスに分けたときのクラスの番号です。
desc()
関数を使うと、降順の順位が計算できます。
R Documentation の記載から引用:
-
row_number()
: equivalent to rank(ties.method = "first") -
min_rank()
: equivalent to rank(ties.method = "min") -
dense_rank()
: like min_rank(), but with no gaps between ranks -
percent_rank()
: a number between 0 and 1 computed by rescaling min_rank to [0, 1] -
cume_dist()
: a cumulative distribution function. Proportion of all values less than or equal to the current rank. -
ntile()
: a rough rank, which breaks the input vector into n buckets. The size of the buckets may differ by up to one, larger buckets have lower rank.
# row_number(), min_rank(), dense_rank(), percent_rank(), cume_dist(), ntile()
iris_tbl %>%
select(id, Sepal.Length, Species) %>%
group_by(Species) %>%
mutate(row_number = row_number(Sepal.Length),
min_rank = min_rank(Sepal.Length),
dense_rank = dense_rank(Sepal.Length),
percent_rank = percent_rank(Sepal.Length),
cume_dist = cume_dist(Sepal.Length),
ntile5 = ntile(Sepal.Length, n = 5))
# # A tibble: 150 x 9
# Groups: Species [3]
# id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
# <int> <dbl> <fct> <int> <int> <int> <dbl> <dbl> <int>
# 1 1 5.1 setosa 29 29 9 0.571 0.72 3
# 2 2 4.9 setosa 17 17 7 0.327 0.4 2
# 3 3 4.7 setosa 10 10 5 0.184 0.22 1
# 4 4 4.6 setosa 6 6 4 0.102 0.18 1
# 5 5 5 setosa 21 21 8 0.408 0.56 3
# 6 6 5.4 setosa 41 41 12 0.816 0.9 5
# 7 7 4.6 setosa 7 6 4 0.102 0.18 1
# 8 8 5 setosa 22 21 8 0.408 0.56 3
# 9 9 4.4 setosa 2 2 2 0.0204 0.08 1
# 10 10 4.9 setosa 18 17 7 0.327 0.4 2
# # ... with 140 more rows
iris_tbl %>%
select(id, Sepal.Length, Species) %>%
group_by(Species) %>%
mutate(row_number = row_number(Sepal.Length),
min_rank = min_rank(Sepal.Length),
dense_rank = dense_rank(Sepal.Length),
percent_rank = percent_rank(Sepal.Length),
cume_dist = cume_dist(Sepal.Length),
ntile5 = ntile(Sepal.Length, n = 5)) %>%
arrange(Sepal.Length)
# # A tibble: 150 x 9
# Groups: Species [3]
# id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
# <int> <dbl> <fct> <int> <int> <int> <dbl> <dbl> <int>
# 1 14 4.3 setosa 1 1 1 0 0.02 1
# 2 9 4.4 setosa 2 2 2 0.0204 0.08 1
# 3 39 4.4 setosa 3 2 2 0.0204 0.08 1
# 4 43 4.4 setosa 4 2 2 0.0204 0.08 1
# 5 42 4.5 setosa 5 5 3 0.0816 0.1 1
# 6 4 4.6 setosa 6 6 4 0.102 0.18 1
# 7 7 4.6 setosa 7 6 4 0.102 0.18 1
# 8 23 4.6 setosa 8 6 4 0.102 0.18 1
# 9 48 4.6 setosa 9 6 4 0.102 0.18 1
# 10 3 4.7 setosa 10 10 5 0.184 0.22 1
# # ... with 140 more rows
# desc()
iris_tbl %>%
select(id, Sepal.Length, Species) %>%
group_by(Species) %>%
mutate(row_number = row_number(desc(Sepal.Length)),
min_rank = min_rank(desc(Sepal.Length)),
dense_rank = dense_rank(desc(Sepal.Length)),
percent_rank = percent_rank(desc(Sepal.Length)),
cume_dist = cume_dist(desc(Sepal.Length)),
ntile5 = ntile(desc(Sepal.Length), n = 5)) %>%
arrange(Sepal.Length)
# # A tibble: 150 x 9
# # Groups: Species [3]
# id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
# <int> <dbl> <fct> <int> <int> <int> <dbl> <dbl> <int>
# 1 14 4.3 setosa 50 50 15 1 1 5
# 2 9 4.4 setosa 47 47 14 0.939 0.98 5
# 3 39 4.4 setosa 48 47 14 0.939 0.98 5
# 4 43 4.4 setosa 49 47 14 0.939 0.98 5
# 5 42 4.5 setosa 46 46 13 0.918 0.92 5
# 6 4 4.6 setosa 42 42 12 0.837 0.9 5
# 7 7 4.6 setosa 43 42 12 0.837 0.9 5
# 8 23 4.6 setosa 44 42 12 0.837 0.9 5
# 9 48 4.6 setosa 45 42 12 0.837 0.9 5
# 10 3 4.7 setosa 40 40 11 0.796 0.82 4
# # ... with 140 more rows
なお、これらの関数は、ベクトルに対して使える関数ですので、データフレームの列の計算だけでなく、filter()
の条件指定などでも使えます。
iris_tbl %>%
group_by(Species) %>%
filter(between(row_number(), 1, 3))
# これと同じ
iris_tbl %>%
group_by(Species) %>%
slice_head(n = 3)
# # A tibble: 9 x 6
# Groups: Species [3]
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <int> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa
# 2 2 4.9 3 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa
# 4 51 7 3.2 4.7 1.4 versicolor
# 5 52 6.4 3.2 4.5 1.5 versicolor
# 6 53 6.9 3.1 4.9 1.5 versicolor
# 7 101 6.3 3.3 6 2.5 virginica
# 8 102 5.8 2.7 5.1 1.9 virginica
# 9 103 7.1 3 5.9 2.1 virginica
iris_tbl %>%
arrange(Sepal.Length) %>%
group_by(Species) %>%
filter(row_number() == 1)
# これと同じ
iris_tbl %>%
arrange(Sepal.Length) %>%
group_by(Species) %>%
slice_head(n = 1)
iris_tbl %>%
arrange(Sepal.Length) %>%
distinct(Species, .keep_all = TRUE)
# A tibble: 3 x 6
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <int> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 14 4.3 3 1.1 0.1 setosa
# 2 58 4.9 2.4 3.3 1 versicolor
# 3 107 4.9 2.5 4.5 1.7 virginica
ベクトルに使った例も示します。
v <- c(2, 2, 4, 4, 4, 4, 1, 3, 5, NA)
v
# [1] 2 2 4 4 4 4 1 3 5 NA
row_number(v)
# [1] 2 3 5 6 7 8 1 4 9 NA
min_rank(v)
# [1] 2 2 5 5 5 5 1 4 9 NA
dense_rank(v)
# [1] 2 2 4 4 4 4 1 3 5 NA
percent_rank(v)
# [1] 0.125 0.125 0.500 0.500 0.500 0.500 0.000 0.375 1.000 NA
cume_dist(v)
# [1] 0.3333333 0.3333333 0.8888889 0.8888889 0.8888889 0.8888889
# [7] 0.1111111 0.4444444 1.0000000 NA
ntile(v, n = 5)
# [1] 1 2 3 3 4 4 1 2 5 NA
ntile(row_number(v), n = 5)
# [1] 1 2 3 3 4 4 1 2 5 NA
row_number(desc(v))
# [1] 7 8 2 3 4 5 9 6 1 NA
min_rank(desc(v))
# [1] 7 7 2 2 2 2 9 6 1 NA
dense_rank(desc(v))
# [1] 4 4 2 2 2 2 5 3 1 NA
percent_rank(desc(v))
# [1] 0.750 0.750 0.125 0.125 0.125 0.125 1.000 0.625 0.000 NA
cume_dist(desc(v))
# [1] 0.8888889 0.8888889 0.5555556 0.5555556 0.5555556 0.5555556
# [7] 1.0000000 0.6666667 0.1111111 NA
ntile(desc(v), n = 5)
# [1] 4 4 1 2 2 3 5 3 1 NA
ntile(row_number(desc(v)), n = 5)
# [1] 4 4 1 2 2 3 5 3 1 NA
data.frame(v = v,
r1 = row_number(v),
r2 = min_rank(v),
r3 = dense_rank(v),
r4 = percent_rank(v),
r5 = cume_dist(v),
r6 = ntile(v, n = 5)) %>%
arrange(v)
# v r1 r2 r3 r4 r5 r6
# 1 1 1 1 1 0.000 0.1111111 1
# 2 2 2 2 2 0.125 0.3333333 1
# 3 2 3 2 2 0.125 0.3333333 2
# 4 3 4 4 3 0.375 0.4444444 2
# 5 4 5 5 4 0.500 0.8888889 3
# 6 4 6 5 4 0.500 0.8888889 3
# 7 4 7 5 4 0.500 0.8888889 4
# 8 4 8 5 4 0.500 0.8888889 4
# 9 5 9 9 5 1.000 1.0000000 5
# 10 NA NA NA NA NA NA NA
data.frame(v = v,
r1 = row_number(desc(v)),
r2 = min_rank(desc(v)),
r3 = dense_rank(desc(v)),
r4 = percent_rank(desc(v)),
r5 = cume_dist(desc(v)),
r6 = ntile(desc(v), n = 5)) %>%
arrange(v)
# v r1 r2 r3 r4 r5 r6
# 1 1 9 9 5 1.000 1.0000000 5
# 2 2 7 7 4 0.750 0.8888889 4
# 3 2 8 7 4 0.750 0.8888889 4
# 4 3 6 6 3 0.625 0.6666667 3
# 5 4 2 2 2 0.125 0.5555556 1
# 6 4 3 2 2 0.125 0.5555556 2
# 7 4 4 2 2 0.125 0.5555556 2
# 8 4 5 2 2 0.125 0.5555556 3
# 9 5 1 1 1 0.000 0.1111111 1
# 10 NA NA NA NA NA NA NA
cumsum(), cumprod(), cummean(), cummax(), cummin():累積計算
# cumsum(), cumprod(), cummean(), cummax(), cummin()
iris_tbl %>%
select(id, Sepal.Length) %>%
arrange(Sepal.Length) %>%
mutate(cumsum = cumsum(Sepal.Length),
cumprod = cumprod(Sepal.Length),
cummean = cummean(Sepal.Length),
cummax = cummax(Sepal.Length),
cummin = cummin(Sepal.Length))
# # A tibble: 150 x 7
# id Sepal.Length cumsum cumprod cummean cummax cummin
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 14 4.3 4.3 4.3 4.3 4.3 4.3
# 2 9 4.4 8.7 18.9 4.3 4.4 4.3
# 3 39 4.4 13.1 83.2 4.33 4.4 4.3
# 4 43 4.4 17.5 366. 4.35 4.4 4.3
# 5 42 4.5 22 1648. 4.36 4.5 4.3
# 6 4 4.6 26.6 7582. 4.38 4.6 4.3
# 7 7 4.6 31.2 34878. 4.41 4.6 4.3
# 8 23 4.6 35.8 160440. 4.44 4.6 4.3
# 9 48 4.6 40.4 738024. 4.46 4.6 4.3
# 10 3 4.7 45.1 3468712. 4.47 4.7 4.3
# # ... with 140 more rows
ベクトルに使った例も示します。
v
# [1] 2 2 4 4 4 4 1 3 5 NA
cumsum(v)
# [1] 2 4 8 12 16 20 21 24 29 NA
cumprod(v)
# [1] 2 4 16 64 256 1024 1024 3072 15360 NA
cummean(v)
# [1] 2.000000 2.000000 2.000000 2.500000 2.800000 3.000000 3.142857
# [8] 2.875000 2.888889 3.100000
cummax(v)
# [1] 2 2 4 4 4 4 4 4 5 NA
cummin(v)
# [1] 2 2 2 2 2 2 1 1 1 NA
lead(), lag():行方向のずらし(リード・ラグ)計算
# lead(), lag()
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(behind = lag(Sepal.Length), # 1つ前
ahead = lead(Sepal.Length)) # 1つ後ろ
# # A tibble: 150 x 4
# id Sepal.Length behind ahead
# <int> <dbl> <dbl> <dbl>
# 1 1 5.1 NA 4.9
# 2 2 4.9 5.1 4.7
# 3 3 4.7 4.9 4.6
# 4 4 4.6 4.7 5
# 5 5 5 4.6 5.4
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(lag_0 = lag(Sepal.Length, n = 0),
lag_1 = lag(Sepal.Length, n = 1), # 1つ後ろ
lag_2 = lag(Sepal.Length, n = 2), # 2つ後ろ
lag_3 = lag(Sepal.Length, n = 3)) # 3つ後ろ
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(lag_0 = Sepal.Length) %>%
mutate(across(.cols = Sepal.Length,
.fns = list(~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
.names = "lag_{fn}"))
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(across(.cols = Sepal.Length,
.fns = list(~lag(., n = 0), ~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
.names = "lag_{fn-1}"))
# # A tibble: 150 x 6
# id Sepal.Length lag_0 lag_1 lag_2 lag_3
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5.1 5.1 NA NA NA
# 2 2 4.9 4.9 5.1 NA NA
# 3 3 4.7 4.7 4.9 5.1 NA
# 4 4 4.6 4.6 4.7 4.9 5.1
# 5 5 5 5 4.6 4.7 4.9
注意)mutate(across())
については、こちらを参照。
ベクトルに使った例も示します。
v
# [1] 2 2 4 4 4 4 1 3 5 NA
lead(v)
# [1] 2 4 4 4 4 1 3 5 NA NA
lead(v, 2)
# [1] 4 4 4 4 1 3 5 NA NA NA
lag(v)
# [1] NA 2 2 4 4 4 4 1 3 5
lag(v, 2)
# [1] NA NA 2 2 4 4 4 4 1 3
if_else(), case_when():条件分岐
# if_else()
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(sl = if_else(Sepal.Length < 5.0, 1, 0))
# # A tibble: 150 x 3
# id Sepal.Length sl
# <int> <dbl> <dbl>
# 1 1 5.1 0
# 2 2 4.9 1
# 3 3 4.7 1
# 4 4 4.6 1
# 5 5 5 0
# case_when()
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(sl = case_when(is.na(Sepal.Length) ~ as.character(NA), # NA_character_
Sepal.Length < 4.0 ~ "<3.9",
Sepal.Length < 5.0 ~ "4.0-4.9",
Sepal.Length < 6.0 ~ "5.0-5.9",
TRUE ~ "6.0-"))
# # A tibble: 150 x 3
# id Sepal.Length sl
# <int> <dbl> <chr>
# 1 1 5.1 5.0-5.9
# 2 2 4.9 4.0-4.9
# 3 3 4.7 4.0-4.9
# 4 4 4.6 4.0-4.9
# 5 5 5 5.0-5.9
ベクトルに使った例も示します。
v
# [1] 2 2 4 4 4 4 1 3 5 NA
if_else(v %% 2 == 0, 0, 1)
# [1] 0 0 0 0 0 0 1 1 1 NA
case_when(v <= 2 ~ "1-2",
v <= 4 ~ "3-4",
TRUE ~ "5-")
# [1] "1-2" "1-2" "3-4" "3-4" "3-4" "3-4" "1-2" "3-4" "5-" "5-"
case_when(is.na(v) ~ NA_character_,
v <= 2 ~ "1-2",
v <= 4 ~ "3-4",
TRUE ~ "5-")
case_when(is.na(v) ~ as.character(NA),
v <= 2 ~ "1-2",
v <= 4 ~ "3-4",
TRUE ~ "5-")
# [1] "1-2" "1-2" "3-4" "3-4" "3-4" "3-4" "1-2" "3-4" "5-" NA
coalesce():NAの置き換え
# coalesce()
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(across(.cols = Sepal.Length,
.fns = list(~lag(., n = 0), ~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
.names = "lag_{fn-1}")) %>%
mutate(x_1 = coalesce(lag_1, 0),
x_2 = coalesce(lag_2, 0),
x_3 = coalesce(lag_3, 0))
# # A tibble: 150 x 9
# id Sepal.Length lag_0 lag_1 lag_2 lag_3 x_1 x_2 x_3
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5.1 5.1 NA NA NA 0 0 0
# 2 2 4.9 4.9 5.1 NA NA 5.1 0 0
# 3 3 4.7 4.7 4.9 5.1 NA 4.9 5.1 0
# 4 4 4.6 4.6 4.7 4.9 5.1 4.7 4.9 5.1
# 5 5 5 5 4.6 4.7 4.9 4.6 4.7 4.9
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(across(.cols = Sepal.Length,
.fns = list(~lag(., n = 0), ~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
.names = "lag_{fn-1}")) %>%
mutate(x_1 = coalesce(lag_1, lag_0),
x_2 = coalesce(lag_2, lag_1, lag_0),
x_3 = coalesce(lag_3, lag_2, lag_1, lag_0))
# # A tibble: 150 x 9
# id Sepal.Length lag_0 lag_1 lag_2 lag_3 x_1 x_2 x_3
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5.1 5.1 NA NA NA 5.1 5.1 5.1
# 2 2 4.9 4.9 5.1 NA NA 5.1 5.1 5.1
# 3 3 4.7 4.7 4.9 5.1 NA 4.9 5.1 5.1
# 4 4 4.6 4.6 4.7 4.9 5.1 4.7 4.9 5.1
# 5 5 5 5 4.6 4.7 4.9 4.6 4.7 4.9
ベクトルに使った例も示します。
v
# [1] 2 2 4 4 4 4 1 3 5 NA
coalesce(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0
lag_0 <- lag(v, n = 0)
lag_0
# [1] 2 2 4 4 4 4 1 3 5 NA
lag_1 <- lag(v, n = 1)
lag_1
# [1] NA 2 2 4 4 4 4 1 3 5
lag_2 <- lag(v, n = 2)
lag_2
# [1] NA NA 2 2 4 4 4 4 1 3
lag_3 <- lag(v, n = 3)
lag_3
# [1] NA NA NA 2 2 4 4 4 4 1
coalesce(lag_1, lag_0)
# [1] 2 2 2 4 4 4 4 1 3 5
coalesce(lag_2, lag_1, lag_0)
# [1] 2 2 2 2 4 4 4 4 1 3
coalesce(lag_3, lag_2, lag_1, lag_0)
# [1] 2 2 2 2 2 4 4 4 4 1
replace_na():NAの置き換え
# replace_na()
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(across(.cols = Sepal.Length,
.fns = list(~lag(., n = 0), ~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
.names = "lag_{fn-1}")) %>%
mutate(x_1 = replace_na(lag_1, 0),
x_2 = replace_na(lag_2, 0),
x_3 = replace_na(lag_3, 0))
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(across(.cols = Sepal.Length,
.fns = list(~lag(., n = 0), ~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
.names = "lag_{fn-1}")) %>%
mutate(across(lag_1:lag_3, ~replace_na(., 0)))
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(across(.cols = Sepal.Length,
.fns = list(~lag(., n = 0), ~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
.names = "lag_{fn-1}")) %>%
replace_na(list(lag_1 = 0, lag_2 = 0, lag_3 = 0))
# # A tibble: 150 x 9
# id Sepal.Length lag_0 lag_1 lag_2 lag_3 x_1 x_2 x_3
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5.1 5.1 NA NA NA 0 0 0
# 2 2 4.9 4.9 5.1 NA NA 5.1 0 0
# 3 3 4.7 4.7 4.9 5.1 NA 4.9 5.1 0
# 4 4 4.6 4.6 4.7 4.9 5.1 4.7 4.9 5.1
# 5 5 5 5 4.6 4.7 4.9 4.6 4.7 4.9
ベクトルに使った例も示します。
v
# [1] 2 2 4 4 4 4 1 3 5 NA
replace_na(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0
coalesce(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0
if_else(is.na(v), 0, v)
# [1] 2 2 4 4 4 4 1 3 5 0
if_else(!is.na(v), v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0
na_if():NAへの置き換え
# na_if()
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(sl = na_if(Sepal.Length, 4.9))
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(sl = if_else(Sepal.Length == 4.9, as.double(NA), Sepal.Length))
iris_tbl %>%
select(id, Sepal.Length) %>%
mutate(sl = if_else(Sepal.Length == 4.9, NA_real_, Sepal.Length))
# # A tibble: 150 x 3
# id Sepal.Length sl
# <int> <dbl> <dbl>
# 1 1 5.1 5.1
# 2 2 4.9 NA
# 3 3 4.7 4.7
# 4 4 4.6 4.6
# 5 5 5 5
# 6 6 5.4 5.4
# 7 7 4.6 4.6
# 8 8 5 5
# 9 9 4.4 4.4
# 10 10 4.9 NA
ベクトルに使った例も示します。
v
# [1] 2 2 4 4 4 4 1 3 5 NA
na_if(v, 4)
# [1] 2 2 NA NA NA NA 1 3 5 NA
recode():コードの振り直し
# recode()
iris_tbl %>%
mutate(SP = recode(Species, setosa = toupper("setosa")))
# # A tibble: 150 x 7
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species SP
# <int> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa SETOSA
# 2 2 4.9 3 1.4 0.2 setosa SETOSA
iris_tbl %>%
mutate(SP2 = recode(Species,
setosa = toupper("setosa"),
versicolor = toupper("versicolor"),
virginica = toupper("virginica"))) %>%
group_by(Species) %>% slice_head(n = 3)
# A tibble: 9 x 7
# Groups: Species [3]
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species SP2
# <int> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa SETOSA
# 2 2 4.9 3 1.4 0.2 setosa SETOSA
# 3 3 4.7 3.2 1.3 0.2 setosa SETOSA
# 4 51 7 3.2 4.7 1.4 versicolor VERSICOLOR
# 5 52 6.4 3.2 4.5 1.5 versicolor VERSICOLOR
# 6 53 6.9 3.1 4.9 1.5 versicolor VERSICOLOR
# 7 101 6.3 3.3 6 2.5 virginica VIRGINICA
# 8 102 5.8 2.7 5.1 1.9 virginica VIRGINICA
# 9 103 7.1 3 5.9 2.1 virginica VIRGINICA
sp <- levels(iris_tbl$Species) # 変更前のlevels
sp
# [1] "setosa" "versicolor" "virginica"
sp_2 <- toupper(sp) # 変更後のlevels
named_sp <- setNames(sp_2, sp) # 名前付きベクトル
iris_tbl %>%
mutate(Species_2 = recode(Species, !!!named_sp)) %>%
group_by(Species) %>% slice_head(n = 2)
# # A tibble: 6 x 7
# # Groups: Species [3]
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_2
# <int> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa SETOSA
# 2 2 4.9 3 1.4 0.2 setosa SETOSA
# 3 51 7 3.2 4.7 1.4 versicolor VERSICOLOR
# 4 52 6.4 3.2 4.5 1.5 versicolor VERSICOLOR
# 5 101 6.3 3.3 6 2.5 virginica VIRGINICA
# 6 102 5.8 2.7 5.1 1.9 virginica VIRGINICA
ベクトルに使った例も示します。
v <- c(2, 2, 4, 4, 4, 4, 1, 3, 5, NA)
v <- LETTERS[v]
v
# [1] "B" "B" "D" "D" "D" "D" "A" "C" "E" NA
recode(v, A = "AAA", B = "BBB", C = "CCC")
# [1] "BBB" "BBB" "D" "D" "D" "D" "AAA" "CCC" "E" NA
vf <- factor(v, levels = LETTERS[1:5])
vf
# [1] B B D D D D A C E <NA>
# Levels: A B C D E
recode(vf, A = "AAA", B = "BBB", C = "CCC")
# [1] BBB BBB D D D D AAA CCC E <NA>
# Levels: AAA BBB CCC D E
LETTERS[1:3] # 変更前のlevels
# [1] "A" "B" "C"
library(stringr)
str_dup(LETTERS[1:3], 3) # 変更後のlevels
# [1] "AAA" "BBB" "CCC"
named_levels <- setNames(str_dup(LETTERS[1:3], 3), LETTERS[1:3]) # 名前付きベクトル
recode(vf, !!!named_levels)
# [1] BBB BBB D D D D AAA CCC E <NA>
# Levels: AAA BBB CCC D E
指定した列のNAを置き換えるもの
まず、NAを含むデータを準備しておきます。
# NAを含むデータを準備
iris_tbl_na <- iris_tbl %>%
slice_head(n = 10) %>%
mutate(Sepal.Length = na_if(Sepal.Length, 4.9),
Sepal.Width = na_if(Sepal.Width, 3.4)) %>% print()
# # A tibble: 10 x 6
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <int> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa
# 2 2 NA 3 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa
# 5 5 5 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa
# 7 7 4.6 NA 1.4 0.3 setosa
# 8 8 5 NA 1.5 0.2 setosa
# 9 9 4.4 2.9 1.4 0.2 setosa
# 10 10 NA 3.1 1.5 0.1 setosa
replace_na():NAを指定した値で置き換え
指定した列のNAを指定した値で置き換えます。指定する列と置き換える値はリストで与えます。
# replace_na()
iris_tbl_na %>%
replace_na(list(Sepal.Length = 0.0, Sepal.Width = 99.9))
# # A tibble: 10 x 6
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <int> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa
# 2 2 0 3 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa
# 5 5 5 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa
# 7 7 4.6 99.9 1.4 0.3 setosa
# 8 8 5 99.9 1.5 0.2 setosa
# 9 9 4.4 2.9 1.4 0.2 setosa
# 10 10 0 3.1 1.5 0.1 setosa
list(Sepal.Length = 0.0, Sepal.Width = 99.9)
# $Sepal.Length
# [1] 0
#
# $Sepal.Width
# [1] 99.9
#
class(list(Sepal.Length = 0.0, Sepal.Width = 99.9))
# [1] "list"
なお、上で出てきたreplace_na()
はベクトルに対する関数でしたが、こちらのreplace_na()
はデータフレームに対する関数になっています。
fill():NAを前後の値で置き換え
指定した列のNAを同じ列の前後(上下)のNAでない値で置き換えます。引数.direction
で前後のどちらの方向で埋めるかを指定できます。指定できる方向には、"down"
(the default), "up"
, "downup"
(i.e. first down and then up) or "updown"
(first up and then down) の4通りあります。
# fill()
iris_tbl_na %>%
fill(Sepal.Length, Sepal.Width)
# # A tibble: 10 x 6
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <int> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa
# 2 2 5.1 3 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa
# 5 5 5 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa
# 7 7 4.6 3.9 1.4 0.3 setosa
# 8 8 5 3.9 1.5 0.2 setosa
# 9 9 4.4 2.9 1.4 0.2 setosa
# 10 10 4.4 3.1 1.5 0.1 setosa
iris_tbl_na %>%
fill(Sepal.Length, Sepal.Width, .direction = "up")
# # A tibble: 10 x 6
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <int> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa
# 2 2 4.7 3 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa
# 5 5 5 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa
# 7 7 4.6 2.9 1.4 0.3 setosa
# 8 8 5 2.9 1.5 0.2 setosa
# 9 9 4.4 2.9 1.4 0.2 setosa
# 10 10 NA 3.1 1.5 0.1 setosa
iris_tbl_na %>%
fill(Sepal.Length, Sepal.Width, .direction = "updown")
# # A tibble: 10 x 6
# id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <int> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 1 5.1 3.5 1.4 0.2 setosa
# 2 2 4.7 3 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa
# 5 5 5 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa
# 7 7 4.6 2.9 1.4 0.3 setosa
# 8 8 5 2.9 1.5 0.2 setosa
# 9 9 4.4 2.9 1.4 0.2 setosa
# 10 10 4.4 3.1 1.5 0.1 setosa
参考
- tidyverse
- R for Data Science
- Heavy Watal - R stats
- Rのデータベースの操作についての自前のまとめ