More than 3 years have passed since last update.

Rのデータフレームの計算に用いる関数についてのメモ

Last updated at 2021-04-29Posted at 2021-03-13

はじめに

Rのデータフレームの計算に用いられる関数について、サンプルデータを用いて具体的に練習してみました。

Rのデータフレームの計算に用いる関数
参考

Rのデータフレームの計算に用いる関数

Rのデータフレームの計算に用いる関数について、サンプルデータを用いて具体的に練習します。
なお、以下については別記事でまとめています。

行の操作について：こちら
列の操作について：こちら
pivot操作について：こちら
集計の操作について：こちら

参考：
https://r4ds.had.co.nz/transform.html
https://github.com/tidyverse/dplyr
https://heavywatal.github.io/rstats/dplyr.html

練習に用いるデータ

Rのサンプルデータirisとanscombeを用います。
行番号を表示した方が分かりやすいため、tibble::rowid_to_column()で行番号を列名にしています。
（こちらの記事と同じデータを用います。）

library(dplyr)
library(tibble)
library(tidyr)

iris

# 行番号を列名に
iris_tbl <- iris %>% as_tibble() %>% rowid_to_column("id")
iris_tbl
# # A tibble: 150 x 6
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
# 1     1          5.1         3.5          1.4         0.2 setosa 
# 2     2          4.9         3            1.4         0.2 setosa 

anscombe

anscombe_tbl <- anscombe %>% as_tibble() %>% rowid_to_column("id")
anscombe_tbl
# # A tibble: 11 x 9
#      id    x1    x2    x3    x4    y1    y2    y3    y4
#   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1    10    10    10     8  8.04  9.14  7.46  6.58
# 2     2     8     8     8     8  6.95  8.14  6.77  5.76

主にmutate()やfilter()を補助するもの

row_number(), min_rank(), dense_rank(), percent_rank(), cume_dist(), ntile()：順位計算

順位（昇順の順位）を計算する関数です。グループ化しておくと、グループごとの順位になります。

引数なしでrow_number()とすると、行の通し番号になります。

# row_number()
iris_tbl %>%
  mutate(row = row_number())
# A tibble: 150 x 7
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species   row
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
#  1     1          5.1         3.5          1.4         0.2 setosa      1
#  2     2          4.9         3            1.4         0.2 setosa      2
#  3     3          4.7         3.2          1.3         0.2 setosa      3
#  4     4          4.6         3.1          1.5         0.2 setosa      4
#  5     5          5           3.6          1.4         0.2 setosa      5
#  6     6          5.4         3.9          1.7         0.4 setosa      6
#  7     7          4.6         3.4          1.4         0.3 setosa      7
#  8     8          5           3.4          1.5         0.2 setosa      8
#  9     9          4.4         2.9          1.4         0.2 setosa      9
# 10    10          4.9         3.1          1.5         0.1 setosa     10
# # ... with 140 more rows

iris_tbl %>%
  group_by(Species) %>%
  mutate(row = row_number()) %>%
  slice_head(n = 5)
# # A tibble: 15 x 7
# # Groups:   Species [3]
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species      row
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>      <int>
#  1     1          5.1         3.5          1.4         0.2 setosa         1
#  2     2          4.9         3            1.4         0.2 setosa         2
#  3     3          4.7         3.2          1.3         0.2 setosa         3
#  4     4          4.6         3.1          1.5         0.2 setosa         4
#  5     5          5           3.6          1.4         0.2 setosa         5
#  6    51          7           3.2          4.7         1.4 versicolor     1
#  7    52          6.4         3.2          4.5         1.5 versicolor     2
#  8    53          6.9         3.1          4.9         1.5 versicolor     3
#  9    54          5.5         2.3          4           1.3 versicolor     4
# 10    55          6.5         2.8          4.6         1.5 versicolor     5
# 11   101          6.3         3.3          6           2.5 virginica      1
# 12   102          5.8         2.7          5.1         1.9 virginica      2
# 13   103          7.1         3            5.9         2.1 virginica      3
# 14   104          6.3         2.9          5.6         1.8 virginica      4
# 15   105          6.5         3            5.8         2.2 virginica      5

引数を指定すると、row_number()は、引数の昇順に並び換えたときの行番号（同順位でも別番号）になります。
min_rank()とdense_rank()は順位（同順位は同じ番号）で、dense_rank()は番号が飛ばない順位になります。percent_rank()は、min_rank()を0から1の範囲に変換したもの、cume_dist()は、累積密度です。ntile()は、引数の順にn個のクラスに分けたときのクラスの番号です。

desc()関数を使うと、降順の順位が計算できます。

R Documentation の記載から引用：

row_number(): equivalent to rank(ties.method = "first")
min_rank(): equivalent to rank(ties.method = "min")
dense_rank(): like min_rank(), but with no gaps between ranks
percent_rank(): a number between 0 and 1 computed by rescaling min_rank to [0, 1]
cume_dist(): a cumulative distribution function. Proportion of all values less than or equal to the current rank.
ntile(): a rough rank, which breaks the input vector into n buckets. The size of the buckets may differ by up to one, larger buckets have lower rank.

# row_number(), min_rank(), dense_rank(), percent_rank(), cume_dist(), ntile()
iris_tbl %>%
  select(id, Sepal.Length, Species) %>%
  group_by(Species) %>%
  mutate(row_number   = row_number(Sepal.Length),
         min_rank     = min_rank(Sepal.Length),
         dense_rank   = dense_rank(Sepal.Length),
         percent_rank = percent_rank(Sepal.Length),
         cume_dist    = cume_dist(Sepal.Length),
         ntile5       = ntile(Sepal.Length, n = 5))
# # A tibble: 150 x 9
# Groups:   Species [3]
#       id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
#    <int>        <dbl> <fct>        <int>    <int>      <int>        <dbl>     <dbl>  <int>
#  1     1          5.1 setosa          29       29          9       0.571       0.72      3
#  2     2          4.9 setosa          17       17          7       0.327       0.4       2
#  3     3          4.7 setosa          10       10          5       0.184       0.22      1
#  4     4          4.6 setosa           6        6          4       0.102       0.18      1
#  5     5          5   setosa          21       21          8       0.408       0.56      3
#  6     6          5.4 setosa          41       41         12       0.816       0.9       5
#  7     7          4.6 setosa           7        6          4       0.102       0.18      1
#  8     8          5   setosa          22       21          8       0.408       0.56      3
#  9     9          4.4 setosa           2        2          2       0.0204      0.08      1
# 10    10          4.9 setosa          18       17          7       0.327       0.4       2
# # ... with 140 more rows

iris_tbl %>%
  select(id, Sepal.Length, Species) %>%
  group_by(Species) %>%
  mutate(row_number   = row_number(Sepal.Length),
         min_rank     = min_rank(Sepal.Length),
         dense_rank   = dense_rank(Sepal.Length),
         percent_rank = percent_rank(Sepal.Length),
         cume_dist    = cume_dist(Sepal.Length),
         ntile5       = ntile(Sepal.Length, n = 5)) %>%
  arrange(Sepal.Length)
# # A tibble: 150 x 9
# Groups:   Species [3]
#       id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
#    <int>        <dbl> <fct>        <int>    <int>      <int>        <dbl>     <dbl>  <int>
#  1    14          4.3 setosa           1        1          1       0           0.02      1
#  2     9          4.4 setosa           2        2          2       0.0204      0.08      1
#  3    39          4.4 setosa           3        2          2       0.0204      0.08      1
#  4    43          4.4 setosa           4        2          2       0.0204      0.08      1
#  5    42          4.5 setosa           5        5          3       0.0816      0.1       1
#  6     4          4.6 setosa           6        6          4       0.102       0.18      1
#  7     7          4.6 setosa           7        6          4       0.102       0.18      1
#  8    23          4.6 setosa           8        6          4       0.102       0.18      1
#  9    48          4.6 setosa           9        6          4       0.102       0.18      1
# 10     3          4.7 setosa          10       10          5       0.184       0.22      1
# # ... with 140 more rows

# desc()
iris_tbl %>%
  select(id, Sepal.Length, Species) %>%
  group_by(Species) %>%
  mutate(row_number   = row_number(desc(Sepal.Length)),
         min_rank     = min_rank(desc(Sepal.Length)),
         dense_rank   = dense_rank(desc(Sepal.Length)),
         percent_rank = percent_rank(desc(Sepal.Length)),
         cume_dist    = cume_dist(desc(Sepal.Length)),
         ntile5       = ntile(desc(Sepal.Length), n = 5)) %>%
  arrange(Sepal.Length)
# # A tibble: 150 x 9
# # Groups:   Species [3]
#       id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
#    <int>        <dbl> <fct>        <int>    <int>      <int>        <dbl>     <dbl>  <int>
#  1    14          4.3 setosa          50       50         15        1          1         5
#  2     9          4.4 setosa          47       47         14        0.939      0.98      5
#  3    39          4.4 setosa          48       47         14        0.939      0.98      5
#  4    43          4.4 setosa          49       47         14        0.939      0.98      5
#  5    42          4.5 setosa          46       46         13        0.918      0.92      5
#  6     4          4.6 setosa          42       42         12        0.837      0.9       5
#  7     7          4.6 setosa          43       42         12        0.837      0.9       5
#  8    23          4.6 setosa          44       42         12        0.837      0.9       5
#  9    48          4.6 setosa          45       42         12        0.837      0.9       5
# 10     3          4.7 setosa          40       40         11        0.796      0.82      4
# # ... with 140 more rows

なお、これらの関数は、ベクトルに対して使える関数ですので、データフレームの列の計算だけでなく、filter()の条件指定などでも使えます。

iris_tbl %>%
  group_by(Species) %>%
  filter(between(row_number(), 1, 3))
# これと同じ
iris_tbl %>%
  group_by(Species) %>%
  slice_head(n = 3)
# # A tibble: 9 x 6
# Groups:   Species [3]
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
# 1     1          5.1         3.5          1.4         0.2 setosa    
# 2     2          4.9         3            1.4         0.2 setosa    
# 3     3          4.7         3.2          1.3         0.2 setosa    
# 4    51          7           3.2          4.7         1.4 versicolor
# 5    52          6.4         3.2          4.5         1.5 versicolor
# 6    53          6.9         3.1          4.9         1.5 versicolor
# 7   101          6.3         3.3          6           2.5 virginica 
# 8   102          5.8         2.7          5.1         1.9 virginica 
# 9   103          7.1         3            5.9         2.1 virginica 

iris_tbl %>%
  arrange(Sepal.Length) %>%
  group_by(Species) %>%
  filter(row_number() == 1)
# これと同じ
iris_tbl %>%
  arrange(Sepal.Length) %>%
  group_by(Species) %>%
  slice_head(n = 1)
iris_tbl %>%
  arrange(Sepal.Length) %>%
  distinct(Species, .keep_all = TRUE)
# A tibble: 3 x 6
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
# 1    14          4.3         3            1.1         0.1 setosa    
# 2    58          4.9         2.4          3.3         1   versicolor
# 3   107          4.9         2.5          4.5         1.7 virginica

ベクトルに使った例も示します。

v <- c(2, 2, 4, 4, 4, 4, 1, 3, 5, NA)
v
# [1]  2  2  4  4  4  4  1  3  5 NA

row_number(v)
# [1]  2  3  5  6  7  8  1  4  9 NA
min_rank(v)
# [1]  2  2  5  5  5  5  1  4  9 NA
dense_rank(v)
# [1]  2  2  4  4  4  4  1  3  5 NA
percent_rank(v)
# [1] 0.125 0.125 0.500 0.500 0.500 0.500 0.000 0.375 1.000    NA
cume_dist(v)
# [1] 0.3333333 0.3333333 0.8888889 0.8888889 0.8888889 0.8888889
# [7] 0.1111111 0.4444444 1.0000000        NA
ntile(v, n = 5)
# [1]  1  2  3  3  4  4  1  2  5 NA
ntile(row_number(v), n = 5)
# [1]  1  2  3  3  4  4  1  2  5 NA

row_number(desc(v))
# [1]  7  8  2  3  4  5  9  6  1 NA
min_rank(desc(v))
# [1]  7  7  2  2  2  2  9  6  1 NA
dense_rank(desc(v))
# [1]  4  4  2  2  2  2  5  3  1 NA
percent_rank(desc(v))
# [1] 0.750 0.750 0.125 0.125 0.125 0.125 1.000 0.625 0.000    NA
cume_dist(desc(v))
# [1] 0.8888889 0.8888889 0.5555556 0.5555556 0.5555556 0.5555556
# [7] 1.0000000 0.6666667 0.1111111        NA
ntile(desc(v), n = 5)
# [1]  4  4  1  2  2  3  5  3  1 NA
ntile(row_number(desc(v)), n = 5)
# [1]  4  4  1  2  2  3  5  3  1 NA

data.frame(v = v,
           r1 = row_number(v),
           r2 = min_rank(v),
           r3 = dense_rank(v),
           r4 = percent_rank(v),
           r5 = cume_dist(v),
           r6 = ntile(v, n = 5)) %>%
  arrange(v)
#     v r1 r2 r3    r4        r5 r6
# 1   1  1  1  1 0.000 0.1111111  1
# 2   2  2  2  2 0.125 0.3333333  1
# 3   2  3  2  2 0.125 0.3333333  2
# 4   3  4  4  3 0.375 0.4444444  2
# 5   4  5  5  4 0.500 0.8888889  3
# 6   4  6  5  4 0.500 0.8888889  3
# 7   4  7  5  4 0.500 0.8888889  4
# 8   4  8  5  4 0.500 0.8888889  4
# 9   5  9  9  5 1.000 1.0000000  5
# 10 NA NA NA NA    NA        NA NA

data.frame(v = v,
           r1 = row_number(desc(v)),
           r2 = min_rank(desc(v)),
           r3 = dense_rank(desc(v)),
           r4 = percent_rank(desc(v)),
           r5 = cume_dist(desc(v)),
           r6 = ntile(desc(v), n = 5)) %>%
  arrange(v)
#     v r1 r2 r3    r4        r5 r6
# 1   1  9  9  5 1.000 1.0000000  5
# 2   2  7  7  4 0.750 0.8888889  4
# 3   2  8  7  4 0.750 0.8888889  4
# 4   3  6  6  3 0.625 0.6666667  3
# 5   4  2  2  2 0.125 0.5555556  1
# 6   4  3  2  2 0.125 0.5555556  2
# 7   4  4  2  2 0.125 0.5555556  2
# 8   4  5  2  2 0.125 0.5555556  3
# 9   5  1  1  1 0.000 0.1111111  1
# 10 NA NA NA NA    NA        NA NA

cumsum(), cumprod(), cummean(), cummax(), cummin()：累積計算

# cumsum(), cumprod(), cummean(), cummax(), cummin()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  arrange(Sepal.Length) %>%
  mutate(cumsum = cumsum(Sepal.Length),
         cumprod = cumprod(Sepal.Length),
         cummean = cummean(Sepal.Length),
         cummax  = cummax(Sepal.Length),
         cummin  = cummin(Sepal.Length))
# # A tibble: 150 x 7
#       id Sepal.Length cumsum   cumprod cummean cummax cummin
#    <int>        <dbl>  <dbl>     <dbl>   <dbl>  <dbl>  <dbl>
#  1    14          4.3    4.3       4.3    4.3     4.3    4.3
#  2     9          4.4    8.7      18.9    4.3     4.4    4.3
#  3    39          4.4   13.1      83.2    4.33    4.4    4.3
#  4    43          4.4   17.5     366.     4.35    4.4    4.3
#  5    42          4.5   22      1648.     4.36    4.5    4.3
#  6     4          4.6   26.6    7582.     4.38    4.6    4.3
#  7     7          4.6   31.2   34878.     4.41    4.6    4.3
#  8    23          4.6   35.8  160440.     4.44    4.6    4.3
#  9    48          4.6   40.4  738024.     4.46    4.6    4.3
# 10     3          4.7   45.1 3468712.     4.47    4.7    4.3
# # ... with 140 more rows

ベクトルに使った例も示します。

v
# [1]  2  2  4  4  4  4  1  3  5 NA
cumsum(v)
# [1]  2  4  8 12 16 20 21 24 29 NA
cumprod(v)
# [1]     2     4    16    64   256  1024  1024  3072 15360    NA
cummean(v)
# [1] 2.000000 2.000000 2.000000 2.500000 2.800000 3.000000 3.142857
# [8] 2.875000 2.888889 3.100000
cummax(v)
# [1]  2  2  4  4  4  4  4  4  5 NA
cummin(v)
# [1]  2  2  2  2  2  2  1  1  1 NA

lead(), lag()：行方向のずらし（リード・ラグ）計算

# lead(), lag()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(behind = lag(Sepal.Length),   # 1つ前
         ahead = lead(Sepal.Length))   # 1つ後ろ
# # A tibble: 150 x 4
#      id Sepal.Length behind ahead
#   <int>        <dbl>  <dbl> <dbl>
# 1     1          5.1   NA     4.9
# 2     2          4.9    5.1   4.7
# 3     3          4.7    4.9   4.6
# 4     4          4.6    4.7   5  
# 5     5          5      4.6   5.4

iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(lag_0 = lag(Sepal.Length, n = 0),
         lag_1 = lag(Sepal.Length, n = 1),   # 1つ後ろ
         lag_2 = lag(Sepal.Length, n = 2),   # 2つ後ろ
         lag_3 = lag(Sepal.Length, n = 3))   # 3つ後ろ
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(lag_0 = Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn}"))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}"))
# # A tibble: 150 x 6
#      id Sepal.Length lag_0 lag_1 lag_2 lag_3
#   <int>        <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1          5.1   5.1  NA    NA    NA  
# 2     2          4.9   4.9   5.1  NA    NA  
# 3     3          4.7   4.7   4.9   5.1  NA  
# 4     4          4.6   4.6   4.7   4.9   5.1
# 5     5          5     5     4.6   4.7   4.9

注意）mutate(across())については、こちらを参照。

ベクトルに使った例も示します。

v
# [1]  2  2  4  4  4  4  1  3  5 NA
lead(v)
# [1]  2  4  4  4  4  1  3  5 NA NA
lead(v, 2)
# [1]  4  4  4  4  1  3  5 NA NA NA
lag(v)
# [1] NA  2  2  4  4  4  4  1  3  5
lag(v, 2)
# [1] NA NA  2  2  4  4  4  4  1  3

if_else(), case_when()：条件分岐

# if_else()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = if_else(Sepal.Length < 5.0, 1, 0))
# # A tibble: 150 x 3
#      id Sepal.Length    sl
#   <int>        <dbl> <dbl>
# 1     1          5.1     0
# 2     2          4.9     1
# 3     3          4.7     1
# 4     4          4.6     1
# 5     5          5       0

# case_when()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = case_when(is.na(Sepal.Length) ~ as.character(NA), # NA_character_
                        Sepal.Length < 4.0  ~ "<3.9",
                        Sepal.Length < 5.0  ~ "4.0-4.9",
                        Sepal.Length < 6.0  ~ "5.0-5.9",
                        TRUE                ~ "6.0-"))
# # A tibble: 150 x 3
#      id Sepal.Length sl     
#   <int>        <dbl> <chr>  
# 1     1          5.1 5.0-5.9
# 2     2          4.9 4.0-4.9
# 3     3          4.7 4.0-4.9
# 4     4          4.6 4.0-4.9
# 5     5          5   5.0-5.9

ベクトルに使った例も示します。

v
# [1]  2  2  4  4  4  4  1  3  5 NA
if_else(v %% 2 == 0, 0, 1)
# [1]  0  0  0  0  0  0  1  1  1 NA

case_when(v <= 2 ~ "1-2",
          v <= 4 ~ "3-4",
          TRUE   ~ "5-")
# [1] "1-2" "1-2" "3-4" "3-4" "3-4" "3-4" "1-2" "3-4" "5-"  "5-" 

case_when(is.na(v) ~ NA_character_,
          v <= 2 ~ "1-2",
          v <= 4 ~ "3-4",
          TRUE   ~ "5-")
case_when(is.na(v) ~ as.character(NA),
          v <= 2 ~ "1-2",
          v <= 4 ~ "3-4",
          TRUE   ~ "5-")
# [1] "1-2" "1-2" "3-4" "3-4" "3-4" "3-4" "1-2" "3-4" "5-"  NA

coalesce()：NAの置き換え

# coalesce()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  mutate(x_1 = coalesce(lag_1, 0),
         x_2 = coalesce(lag_2, 0),
         x_3 = coalesce(lag_3, 0))
# # A tibble: 150 x 9
#      id Sepal.Length lag_0 lag_1 lag_2 lag_3   x_1   x_2   x_3
#   <int>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1          5.1   5.1  NA    NA    NA     0     0     0  
# 2     2          4.9   4.9   5.1  NA    NA     5.1   0     0  
# 3     3          4.7   4.7   4.9   5.1  NA     4.9   5.1   0  
# 4     4          4.6   4.6   4.7   4.9   5.1   4.7   4.9   5.1
# 5     5          5     5     4.6   4.7   4.9   4.6   4.7   4.9

iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  mutate(x_1 = coalesce(lag_1, lag_0),
         x_2 = coalesce(lag_2, lag_1, lag_0),
         x_3 = coalesce(lag_3, lag_2, lag_1, lag_0))
# # A tibble: 150 x 9
#      id Sepal.Length lag_0 lag_1 lag_2 lag_3   x_1   x_2   x_3
#   <int>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1          5.1   5.1  NA    NA    NA     5.1   5.1   5.1
# 2     2          4.9   4.9   5.1  NA    NA     5.1   5.1   5.1
# 3     3          4.7   4.7   4.9   5.1  NA     4.9   5.1   5.1
# 4     4          4.6   4.6   4.7   4.9   5.1   4.7   4.9   5.1
# 5     5          5     5     4.6   4.7   4.9   4.6   4.7   4.9

ベクトルに使った例も示します。

v
# [1]  2  2  4  4  4  4  1  3  5 NA
coalesce(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0

lag_0 <- lag(v, n = 0)
lag_0
# [1]  2  2  4  4  4  4  1  3  5 NA
lag_1 <- lag(v, n = 1)
lag_1
# [1] NA  2  2  4  4  4  4  1  3  5
lag_2 <- lag(v, n = 2)
lag_2
# [1] NA NA  2  2  4  4  4  4  1  3
lag_3 <- lag(v, n = 3)
lag_3
# [1] NA NA NA  2  2  4  4  4  4  1
coalesce(lag_1, lag_0)
# [1] 2 2 2 4 4 4 4 1 3 5
coalesce(lag_2, lag_1, lag_0)
# [1] 2 2 2 2 4 4 4 4 1 3
coalesce(lag_3, lag_2, lag_1, lag_0)
# [1] 2 2 2 2 2 4 4 4 4 1

replace_na()：NAの置き換え

# replace_na()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  mutate(x_1 = replace_na(lag_1, 0),
         x_2 = replace_na(lag_2, 0),
         x_3 = replace_na(lag_3, 0))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  mutate(across(lag_1:lag_3, ~replace_na(., 0)))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  replace_na(list(lag_1 = 0, lag_2 = 0, lag_3 = 0))
# # A tibble: 150 x 9
#      id Sepal.Length lag_0 lag_1 lag_2 lag_3   x_1   x_2   x_3
#   <int>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1          5.1   5.1  NA    NA    NA     0     0     0  
# 2     2          4.9   4.9   5.1  NA    NA     5.1   0     0  
# 3     3          4.7   4.7   4.9   5.1  NA     4.9   5.1   0  
# 4     4          4.6   4.6   4.7   4.9   5.1   4.7   4.9   5.1
# 5     5          5     5     4.6   4.7   4.9   4.6   4.7   4.9

ベクトルに使った例も示します。

v
# [1]  2  2  4  4  4  4  1  3  5 NA
replace_na(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0
coalesce(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0
if_else(is.na(v), 0, v)
# [1] 2 2 4 4 4 4 1 3 5 0
if_else(!is.na(v), v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0

na_if()：NAへの置き換え

# na_if()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = na_if(Sepal.Length, 4.9))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = if_else(Sepal.Length == 4.9, as.double(NA), Sepal.Length))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = if_else(Sepal.Length == 4.9, NA_real_, Sepal.Length))
# # A tibble: 150 x 3
#       id Sepal.Length    sl
#    <int>        <dbl> <dbl>
#  1     1          5.1   5.1
#  2     2          4.9  NA  
#  3     3          4.7   4.7
#  4     4          4.6   4.6
#  5     5          5     5  
#  6     6          5.4   5.4
#  7     7          4.6   4.6
#  8     8          5     5  
#  9     9          4.4   4.4
# 10    10          4.9  NA

ベクトルに使った例も示します。

v
# [1]  2  2  4  4  4  4  1  3  5 NA
na_if(v, 4)
# [1]  2  2 NA NA NA NA  1  3  5 NA

recode()：コードの振り直し

# recode()
iris_tbl %>%
  mutate(SP = recode(Species, setosa = toupper("setosa")))
# # A tibble: 150 x 7
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species SP    
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>   <fct> 
# 1     1          5.1         3.5          1.4         0.2 setosa  SETOSA
# 2     2          4.9         3            1.4         0.2 setosa  SETOSA

iris_tbl %>%
  mutate(SP2 = recode(Species,
                      setosa = toupper("setosa"),
                      versicolor = toupper("versicolor"),
                      virginica = toupper("virginica"))) %>%
  group_by(Species) %>% slice_head(n = 3)
# A tibble: 9 x 7
# Groups:   Species [3]
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species    SP2       
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>      <fct>     
# 1     1          5.1         3.5          1.4         0.2 setosa     SETOSA    
# 2     2          4.9         3            1.4         0.2 setosa     SETOSA    
# 3     3          4.7         3.2          1.3         0.2 setosa     SETOSA    
# 4    51          7           3.2          4.7         1.4 versicolor VERSICOLOR
# 5    52          6.4         3.2          4.5         1.5 versicolor VERSICOLOR
# 6    53          6.9         3.1          4.9         1.5 versicolor VERSICOLOR
# 7   101          6.3         3.3          6           2.5 virginica  VIRGINICA 
# 8   102          5.8         2.7          5.1         1.9 virginica  VIRGINICA 
# 9   103          7.1         3            5.9         2.1 virginica  VIRGINICA 

sp <- levels(iris_tbl$Species) # 変更前のlevels
sp
# [1] "setosa"     "versicolor" "virginica" 
sp_2 <- toupper(sp)            # 変更後のlevels
named_sp <- setNames(sp_2, sp) # 名前付きベクトル
iris_tbl %>%
  mutate(Species_2 = recode(Species, !!!named_sp)) %>%
  group_by(Species) %>% slice_head(n = 2)
# # A tibble: 6 x 7
# # Groups:   Species [3]
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species    Species_2 
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>      <fct>     
# 1     1          5.1         3.5          1.4         0.2 setosa     SETOSA    
# 2     2          4.9         3            1.4         0.2 setosa     SETOSA    
# 3    51          7           3.2          4.7         1.4 versicolor VERSICOLOR
# 4    52          6.4         3.2          4.5         1.5 versicolor VERSICOLOR
# 5   101          6.3         3.3          6           2.5 virginica  VIRGINICA 
# 6   102          5.8         2.7          5.1         1.9 virginica  VIRGINICA

ベクトルに使った例も示します。

v <- c(2, 2, 4, 4, 4, 4, 1, 3, 5, NA)
v <- LETTERS[v]
v
# [1] "B" "B" "D" "D" "D" "D" "A" "C" "E" NA 
recode(v, A = "AAA", B = "BBB", C = "CCC")
# [1] "BBB" "BBB" "D"   "D"   "D"   "D"   "AAA" "CCC" "E"   NA 

vf <- factor(v, levels = LETTERS[1:5])
vf
# [1] B    B    D    D    D    D    A    C    E    <NA>
# Levels: A B C D E
recode(vf, A = "AAA", B = "BBB", C = "CCC")
# [1] BBB  BBB  D    D    D    D    AAA  CCC  E    <NA>
# Levels: AAA BBB CCC D E

LETTERS[1:3]             # 変更前のlevels
# [1] "A" "B" "C"
library(stringr)
str_dup(LETTERS[1:3], 3) # 変更後のlevels
# [1] "AAA" "BBB" "CCC"
named_levels <- setNames(str_dup(LETTERS[1:3], 3), LETTERS[1:3]) # 名前付きベクトル
recode(vf, !!!named_levels)
# [1] BBB  BBB  D    D    D    D    AAA  CCC  E    <NA>
# Levels: AAA BBB CCC D E

指定した列のNAを置き換えるもの

まず、NAを含むデータを準備しておきます。

# NAを含むデータを準備
iris_tbl_na <- iris_tbl %>%
  slice_head(n = 10) %>%
  mutate(Sepal.Length = na_if(Sepal.Length, 4.9),
         Sepal.Width = na_if(Sepal.Width, 3.4)) %>% print()
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2         NA           3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6        NA            1.4         0.3 setosa 
#  8     8          5          NA            1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10         NA           3.1          1.5         0.1 setosa

replace_na()：NAを指定した値で置き換え

指定した列のNAを指定した値で置き換えます。指定する列と置き換える値はリストで与えます。

# replace_na()
iris_tbl_na %>%
  replace_na(list(Sepal.Length = 0.0, Sepal.Width = 99.9))
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2          0           3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6        99.9          1.4         0.3 setosa 
#  8     8          5          99.9          1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10          0           3.1          1.5         0.1 setosa 

list(Sepal.Length = 0.0, Sepal.Width = 99.9)
# $Sepal.Length
# [1] 0
# 
# $Sepal.Width
# [1] 99.9
# 
class(list(Sepal.Length = 0.0, Sepal.Width = 99.9))
# [1] "list"

なお、上で出てきたreplace_na()はベクトルに対する関数でしたが、こちらのreplace_na()はデータフレームに対する関数になっています。

fill()：NAを前後の値で置き換え

指定した列のNAを同じ列の前後（上下）のNAでない値で置き換えます。引数.directionで前後のどちらの方向で埋めるかを指定できます。指定できる方向には、"down" (the default), "up", "downup" (i.e. first down and then up) or "updown" (first up and then down) の４通りあります。

# fill()
iris_tbl_na %>%
  fill(Sepal.Length, Sepal.Width)
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2          5.1         3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6         3.9          1.4         0.3 setosa 
#  8     8          5           3.9          1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10          4.4         3.1          1.5         0.1 setosa 

iris_tbl_na %>%
  fill(Sepal.Length, Sepal.Width, .direction = "up")
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2          4.7         3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6         2.9          1.4         0.3 setosa 
#  8     8          5           2.9          1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10         NA           3.1          1.5         0.1 setosa 

iris_tbl_na %>%
  fill(Sepal.Length, Sepal.Width, .direction = "updown")
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2          4.7         3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6         2.9          1.4         0.3 setosa 
#  8     8          5           2.9          1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10          4.4         3.1          1.5         0.1 setosa

参考

tidyverse
- dplyr
- tidyr
R for Data Science
- 5 Data transformation
- 12 Tidy data
Heavy Watal - R stats
- dplyr — 高速data.frame処理
- tidyr — シンプルなデータ変形ツール
Rのデータベースの操作についての自前のまとめ

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up