LoginSignup
1
4

More than 1 year has passed since last update.

Rのデータフレームの計算に用いる関数についてのメモ

Last updated at Posted at 2021-03-13

はじめに

Rのデータフレームの計算に用いられる関数について、サンプルデータを用いて具体的に練習してみました。

目次

Rのデータフレームの計算に用いる関数

Rのデータフレームの計算に用いる関数について、サンプルデータを用いて具体的に練習します。
なお、以下については別記事でまとめています。
- 行の操作について:こちら
- 列の操作について:こちら
- pivot操作について:こちら
- 集計の操作について:こちら

参考:
https://r4ds.had.co.nz/transform.html
https://github.com/tidyverse/dplyr
https://heavywatal.github.io/rstats/dplyr.html

練習に用いるデータ

Rのサンプルデータirisanscombeを用います。
行番号を表示した方が分かりやすいため、tibble::rowid_to_column()で行番号を列名にしています。
こちらの記事と同じデータを用います。)

R
library(dplyr)
library(tibble)
library(tidyr)

iris

# 行番号を列名に
iris_tbl <- iris %>% as_tibble() %>% rowid_to_column("id")
iris_tbl
# # A tibble: 150 x 6
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
# 1     1          5.1         3.5          1.4         0.2 setosa 
# 2     2          4.9         3            1.4         0.2 setosa 

anscombe

anscombe_tbl <- anscombe %>% as_tibble() %>% rowid_to_column("id")
anscombe_tbl
# # A tibble: 11 x 9
#      id    x1    x2    x3    x4    y1    y2    y3    y4
#   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1    10    10    10     8  8.04  9.14  7.46  6.58
# 2     2     8     8     8     8  6.95  8.14  6.77  5.76

主にmutate()やfilter()を補助するもの

row_number(), min_rank(), dense_rank(), percent_rank(), cume_dist(), ntile():順位計算

順位(昇順の順位)を計算する関数です。グループ化しておくと、グループごとの順位になります。

引数なしでrow_number()とすると、行の通し番号になります。

R
# row_number()
iris_tbl %>%
  mutate(row = row_number())
# A tibble: 150 x 7
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species   row
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>   <int>
#  1     1          5.1         3.5          1.4         0.2 setosa      1
#  2     2          4.9         3            1.4         0.2 setosa      2
#  3     3          4.7         3.2          1.3         0.2 setosa      3
#  4     4          4.6         3.1          1.5         0.2 setosa      4
#  5     5          5           3.6          1.4         0.2 setosa      5
#  6     6          5.4         3.9          1.7         0.4 setosa      6
#  7     7          4.6         3.4          1.4         0.3 setosa      7
#  8     8          5           3.4          1.5         0.2 setosa      8
#  9     9          4.4         2.9          1.4         0.2 setosa      9
# 10    10          4.9         3.1          1.5         0.1 setosa     10
# # ... with 140 more rows

iris_tbl %>%
  group_by(Species) %>%
  mutate(row = row_number()) %>%
  slice_head(n = 5)
# # A tibble: 15 x 7
# # Groups:   Species [3]
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species      row
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>      <int>
#  1     1          5.1         3.5          1.4         0.2 setosa         1
#  2     2          4.9         3            1.4         0.2 setosa         2
#  3     3          4.7         3.2          1.3         0.2 setosa         3
#  4     4          4.6         3.1          1.5         0.2 setosa         4
#  5     5          5           3.6          1.4         0.2 setosa         5
#  6    51          7           3.2          4.7         1.4 versicolor     1
#  7    52          6.4         3.2          4.5         1.5 versicolor     2
#  8    53          6.9         3.1          4.9         1.5 versicolor     3
#  9    54          5.5         2.3          4           1.3 versicolor     4
# 10    55          6.5         2.8          4.6         1.5 versicolor     5
# 11   101          6.3         3.3          6           2.5 virginica      1
# 12   102          5.8         2.7          5.1         1.9 virginica      2
# 13   103          7.1         3            5.9         2.1 virginica      3
# 14   104          6.3         2.9          5.6         1.8 virginica      4
# 15   105          6.5         3            5.8         2.2 virginica      5

引数を指定すると、row_number()は、引数の昇順に並び換えたときの行番号(同順位でも別番号)になります。
min_rank()dense_rank()は順位(同順位は同じ番号)で、dense_rank()は番号が飛ばない順位になります。percent_rank()は、min_rank()を0から1の範囲に変換したもの、cume_dist()は、累積密度です。ntile()は、引数の順にn個のクラスに分けたときのクラスの番号です。

desc()関数を使うと、降順の順位が計算できます。

R Documentation の記載から引用:

  • row_number(): equivalent to rank(ties.method = "first")
  • min_rank(): equivalent to rank(ties.method = "min")
  • dense_rank(): like min_rank(), but with no gaps between ranks
  • percent_rank(): a number between 0 and 1 computed by rescaling min_rank to [0, 1]
  • cume_dist(): a cumulative distribution function. Proportion of all values less than or equal to the current rank.
  • ntile(): a rough rank, which breaks the input vector into n buckets. The size of the buckets may differ by up to one, larger buckets have lower rank.
R
# row_number(), min_rank(), dense_rank(), percent_rank(), cume_dist(), ntile()
iris_tbl %>%
  select(id, Sepal.Length, Species) %>%
  group_by(Species) %>%
  mutate(row_number   = row_number(Sepal.Length),
         min_rank     = min_rank(Sepal.Length),
         dense_rank   = dense_rank(Sepal.Length),
         percent_rank = percent_rank(Sepal.Length),
         cume_dist    = cume_dist(Sepal.Length),
         ntile5       = ntile(Sepal.Length, n = 5))
# # A tibble: 150 x 9
# Groups:   Species [3]
#       id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
#    <int>        <dbl> <fct>        <int>    <int>      <int>        <dbl>     <dbl>  <int>
#  1     1          5.1 setosa          29       29          9       0.571       0.72      3
#  2     2          4.9 setosa          17       17          7       0.327       0.4       2
#  3     3          4.7 setosa          10       10          5       0.184       0.22      1
#  4     4          4.6 setosa           6        6          4       0.102       0.18      1
#  5     5          5   setosa          21       21          8       0.408       0.56      3
#  6     6          5.4 setosa          41       41         12       0.816       0.9       5
#  7     7          4.6 setosa           7        6          4       0.102       0.18      1
#  8     8          5   setosa          22       21          8       0.408       0.56      3
#  9     9          4.4 setosa           2        2          2       0.0204      0.08      1
# 10    10          4.9 setosa          18       17          7       0.327       0.4       2
# # ... with 140 more rows

iris_tbl %>%
  select(id, Sepal.Length, Species) %>%
  group_by(Species) %>%
  mutate(row_number   = row_number(Sepal.Length),
         min_rank     = min_rank(Sepal.Length),
         dense_rank   = dense_rank(Sepal.Length),
         percent_rank = percent_rank(Sepal.Length),
         cume_dist    = cume_dist(Sepal.Length),
         ntile5       = ntile(Sepal.Length, n = 5)) %>%
  arrange(Sepal.Length)
# # A tibble: 150 x 9
# Groups:   Species [3]
#       id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
#    <int>        <dbl> <fct>        <int>    <int>      <int>        <dbl>     <dbl>  <int>
#  1    14          4.3 setosa           1        1          1       0           0.02      1
#  2     9          4.4 setosa           2        2          2       0.0204      0.08      1
#  3    39          4.4 setosa           3        2          2       0.0204      0.08      1
#  4    43          4.4 setosa           4        2          2       0.0204      0.08      1
#  5    42          4.5 setosa           5        5          3       0.0816      0.1       1
#  6     4          4.6 setosa           6        6          4       0.102       0.18      1
#  7     7          4.6 setosa           7        6          4       0.102       0.18      1
#  8    23          4.6 setosa           8        6          4       0.102       0.18      1
#  9    48          4.6 setosa           9        6          4       0.102       0.18      1
# 10     3          4.7 setosa          10       10          5       0.184       0.22      1
# # ... with 140 more rows

# desc()
iris_tbl %>%
  select(id, Sepal.Length, Species) %>%
  group_by(Species) %>%
  mutate(row_number   = row_number(desc(Sepal.Length)),
         min_rank     = min_rank(desc(Sepal.Length)),
         dense_rank   = dense_rank(desc(Sepal.Length)),
         percent_rank = percent_rank(desc(Sepal.Length)),
         cume_dist    = cume_dist(desc(Sepal.Length)),
         ntile5       = ntile(desc(Sepal.Length), n = 5)) %>%
  arrange(Sepal.Length)
# # A tibble: 150 x 9
# # Groups:   Species [3]
#       id Sepal.Length Species row_number min_rank dense_rank percent_rank cume_dist ntile5
#    <int>        <dbl> <fct>        <int>    <int>      <int>        <dbl>     <dbl>  <int>
#  1    14          4.3 setosa          50       50         15        1          1         5
#  2     9          4.4 setosa          47       47         14        0.939      0.98      5
#  3    39          4.4 setosa          48       47         14        0.939      0.98      5
#  4    43          4.4 setosa          49       47         14        0.939      0.98      5
#  5    42          4.5 setosa          46       46         13        0.918      0.92      5
#  6     4          4.6 setosa          42       42         12        0.837      0.9       5
#  7     7          4.6 setosa          43       42         12        0.837      0.9       5
#  8    23          4.6 setosa          44       42         12        0.837      0.9       5
#  9    48          4.6 setosa          45       42         12        0.837      0.9       5
# 10     3          4.7 setosa          40       40         11        0.796      0.82      4
# # ... with 140 more rows

なお、これらの関数は、ベクトルに対して使える関数ですので、データフレームの列の計算だけでなく、filter()の条件指定などでも使えます。

R
iris_tbl %>%
  group_by(Species) %>%
  filter(between(row_number(), 1, 3))
# これと同じ
iris_tbl %>%
  group_by(Species) %>%
  slice_head(n = 3)
# # A tibble: 9 x 6
# Groups:   Species [3]
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
# 1     1          5.1         3.5          1.4         0.2 setosa    
# 2     2          4.9         3            1.4         0.2 setosa    
# 3     3          4.7         3.2          1.3         0.2 setosa    
# 4    51          7           3.2          4.7         1.4 versicolor
# 5    52          6.4         3.2          4.5         1.5 versicolor
# 6    53          6.9         3.1          4.9         1.5 versicolor
# 7   101          6.3         3.3          6           2.5 virginica 
# 8   102          5.8         2.7          5.1         1.9 virginica 
# 9   103          7.1         3            5.9         2.1 virginica 

iris_tbl %>%
  arrange(Sepal.Length) %>%
  group_by(Species) %>%
  filter(row_number() == 1)
# これと同じ
iris_tbl %>%
  arrange(Sepal.Length) %>%
  group_by(Species) %>%
  slice_head(n = 1)
iris_tbl %>%
  arrange(Sepal.Length) %>%
  distinct(Species, .keep_all = TRUE)
# A tibble: 3 x 6
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>     
# 1    14          4.3         3            1.1         0.1 setosa    
# 2    58          4.9         2.4          3.3         1   versicolor
# 3   107          4.9         2.5          4.5         1.7 virginica 

ベクトルに使った例も示します。

R
v <- c(2, 2, 4, 4, 4, 4, 1, 3, 5, NA)
v
# [1]  2  2  4  4  4  4  1  3  5 NA

row_number(v)
# [1]  2  3  5  6  7  8  1  4  9 NA
min_rank(v)
# [1]  2  2  5  5  5  5  1  4  9 NA
dense_rank(v)
# [1]  2  2  4  4  4  4  1  3  5 NA
percent_rank(v)
# [1] 0.125 0.125 0.500 0.500 0.500 0.500 0.000 0.375 1.000    NA
cume_dist(v)
# [1] 0.3333333 0.3333333 0.8888889 0.8888889 0.8888889 0.8888889
# [7] 0.1111111 0.4444444 1.0000000        NA
ntile(v, n = 5)
# [1]  1  2  3  3  4  4  1  2  5 NA
ntile(row_number(v), n = 5)
# [1]  1  2  3  3  4  4  1  2  5 NA

row_number(desc(v))
# [1]  7  8  2  3  4  5  9  6  1 NA
min_rank(desc(v))
# [1]  7  7  2  2  2  2  9  6  1 NA
dense_rank(desc(v))
# [1]  4  4  2  2  2  2  5  3  1 NA
percent_rank(desc(v))
# [1] 0.750 0.750 0.125 0.125 0.125 0.125 1.000 0.625 0.000    NA
cume_dist(desc(v))
# [1] 0.8888889 0.8888889 0.5555556 0.5555556 0.5555556 0.5555556
# [7] 1.0000000 0.6666667 0.1111111        NA
ntile(desc(v), n = 5)
# [1]  4  4  1  2  2  3  5  3  1 NA
ntile(row_number(desc(v)), n = 5)
# [1]  4  4  1  2  2  3  5  3  1 NA

data.frame(v = v,
           r1 = row_number(v),
           r2 = min_rank(v),
           r3 = dense_rank(v),
           r4 = percent_rank(v),
           r5 = cume_dist(v),
           r6 = ntile(v, n = 5)) %>%
  arrange(v)
#     v r1 r2 r3    r4        r5 r6
# 1   1  1  1  1 0.000 0.1111111  1
# 2   2  2  2  2 0.125 0.3333333  1
# 3   2  3  2  2 0.125 0.3333333  2
# 4   3  4  4  3 0.375 0.4444444  2
# 5   4  5  5  4 0.500 0.8888889  3
# 6   4  6  5  4 0.500 0.8888889  3
# 7   4  7  5  4 0.500 0.8888889  4
# 8   4  8  5  4 0.500 0.8888889  4
# 9   5  9  9  5 1.000 1.0000000  5
# 10 NA NA NA NA    NA        NA NA

data.frame(v = v,
           r1 = row_number(desc(v)),
           r2 = min_rank(desc(v)),
           r3 = dense_rank(desc(v)),
           r4 = percent_rank(desc(v)),
           r5 = cume_dist(desc(v)),
           r6 = ntile(desc(v), n = 5)) %>%
  arrange(v)
#     v r1 r2 r3    r4        r5 r6
# 1   1  9  9  5 1.000 1.0000000  5
# 2   2  7  7  4 0.750 0.8888889  4
# 3   2  8  7  4 0.750 0.8888889  4
# 4   3  6  6  3 0.625 0.6666667  3
# 5   4  2  2  2 0.125 0.5555556  1
# 6   4  3  2  2 0.125 0.5555556  2
# 7   4  4  2  2 0.125 0.5555556  2
# 8   4  5  2  2 0.125 0.5555556  3
# 9   5  1  1  1 0.000 0.1111111  1
# 10 NA NA NA NA    NA        NA NA

cumsum(), cumprod(), cummean(), cummax(), cummin():累積計算

R
# cumsum(), cumprod(), cummean(), cummax(), cummin()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  arrange(Sepal.Length) %>%
  mutate(cumsum = cumsum(Sepal.Length),
         cumprod = cumprod(Sepal.Length),
         cummean = cummean(Sepal.Length),
         cummax  = cummax(Sepal.Length),
         cummin  = cummin(Sepal.Length))
# # A tibble: 150 x 7
#       id Sepal.Length cumsum   cumprod cummean cummax cummin
#    <int>        <dbl>  <dbl>     <dbl>   <dbl>  <dbl>  <dbl>
#  1    14          4.3    4.3       4.3    4.3     4.3    4.3
#  2     9          4.4    8.7      18.9    4.3     4.4    4.3
#  3    39          4.4   13.1      83.2    4.33    4.4    4.3
#  4    43          4.4   17.5     366.     4.35    4.4    4.3
#  5    42          4.5   22      1648.     4.36    4.5    4.3
#  6     4          4.6   26.6    7582.     4.38    4.6    4.3
#  7     7          4.6   31.2   34878.     4.41    4.6    4.3
#  8    23          4.6   35.8  160440.     4.44    4.6    4.3
#  9    48          4.6   40.4  738024.     4.46    4.6    4.3
# 10     3          4.7   45.1 3468712.     4.47    4.7    4.3
# # ... with 140 more rows

ベクトルに使った例も示します。

R
v
# [1]  2  2  4  4  4  4  1  3  5 NA
cumsum(v)
# [1]  2  4  8 12 16 20 21 24 29 NA
cumprod(v)
# [1]     2     4    16    64   256  1024  1024  3072 15360    NA
cummean(v)
# [1] 2.000000 2.000000 2.000000 2.500000 2.800000 3.000000 3.142857
# [8] 2.875000 2.888889 3.100000
cummax(v)
# [1]  2  2  4  4  4  4  4  4  5 NA
cummin(v)
# [1]  2  2  2  2  2  2  1  1  1 NA

lead(), lag():行方向のずらし(リード・ラグ)計算

R
# lead(), lag()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(behind = lag(Sepal.Length),   # 1つ前
         ahead = lead(Sepal.Length))   # 1つ後ろ
# # A tibble: 150 x 4
#      id Sepal.Length behind ahead
#   <int>        <dbl>  <dbl> <dbl>
# 1     1          5.1   NA     4.9
# 2     2          4.9    5.1   4.7
# 3     3          4.7    4.9   4.6
# 4     4          4.6    4.7   5  
# 5     5          5      4.6   5.4

iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(lag_0 = lag(Sepal.Length, n = 0),
         lag_1 = lag(Sepal.Length, n = 1),   # 1つ後ろ
         lag_2 = lag(Sepal.Length, n = 2),   # 2つ後ろ
         lag_3 = lag(Sepal.Length, n = 3))   # 3つ後ろ
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(lag_0 = Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 1), ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn}"))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}"))
# # A tibble: 150 x 6
#      id Sepal.Length lag_0 lag_1 lag_2 lag_3
#   <int>        <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1          5.1   5.1  NA    NA    NA  
# 2     2          4.9   4.9   5.1  NA    NA  
# 3     3          4.7   4.7   4.9   5.1  NA  
# 4     4          4.6   4.6   4.7   4.9   5.1
# 5     5          5     5     4.6   4.7   4.9

注意)mutate(across())については、こちらを参照。

ベクトルに使った例も示します。

R
v
# [1]  2  2  4  4  4  4  1  3  5 NA
lead(v)
# [1]  2  4  4  4  4  1  3  5 NA NA
lead(v, 2)
# [1]  4  4  4  4  1  3  5 NA NA NA
lag(v)
# [1] NA  2  2  4  4  4  4  1  3  5
lag(v, 2)
# [1] NA NA  2  2  4  4  4  4  1  3

if_else(), case_when():条件分岐

R
# if_else()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = if_else(Sepal.Length < 5.0, 1, 0))
# # A tibble: 150 x 3
#      id Sepal.Length    sl
#   <int>        <dbl> <dbl>
# 1     1          5.1     0
# 2     2          4.9     1
# 3     3          4.7     1
# 4     4          4.6     1
# 5     5          5       0

# case_when()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = case_when(is.na(Sepal.Length) ~ as.character(NA), # NA_character_
                        Sepal.Length < 4.0  ~ "<3.9",
                        Sepal.Length < 5.0  ~ "4.0-4.9",
                        Sepal.Length < 6.0  ~ "5.0-5.9",
                        TRUE                ~ "6.0-"))
# # A tibble: 150 x 3
#      id Sepal.Length sl     
#   <int>        <dbl> <chr>  
# 1     1          5.1 5.0-5.9
# 2     2          4.9 4.0-4.9
# 3     3          4.7 4.0-4.9
# 4     4          4.6 4.0-4.9
# 5     5          5   5.0-5.9

ベクトルに使った例も示します。

R
v
# [1]  2  2  4  4  4  4  1  3  5 NA
if_else(v %% 2 == 0, 0, 1)
# [1]  0  0  0  0  0  0  1  1  1 NA

case_when(v <= 2 ~ "1-2",
          v <= 4 ~ "3-4",
          TRUE   ~ "5-")
# [1] "1-2" "1-2" "3-4" "3-4" "3-4" "3-4" "1-2" "3-4" "5-"  "5-" 

case_when(is.na(v) ~ NA_character_,
          v <= 2 ~ "1-2",
          v <= 4 ~ "3-4",
          TRUE   ~ "5-")
case_when(is.na(v) ~ as.character(NA),
          v <= 2 ~ "1-2",
          v <= 4 ~ "3-4",
          TRUE   ~ "5-")
# [1] "1-2" "1-2" "3-4" "3-4" "3-4" "3-4" "1-2" "3-4" "5-"  NA 

coalesce():NAの置き換え

R
# coalesce()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  mutate(x_1 = coalesce(lag_1, 0),
         x_2 = coalesce(lag_2, 0),
         x_3 = coalesce(lag_3, 0))
# # A tibble: 150 x 9
#      id Sepal.Length lag_0 lag_1 lag_2 lag_3   x_1   x_2   x_3
#   <int>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1          5.1   5.1  NA    NA    NA     0     0     0  
# 2     2          4.9   4.9   5.1  NA    NA     5.1   0     0  
# 3     3          4.7   4.7   4.9   5.1  NA     4.9   5.1   0  
# 4     4          4.6   4.6   4.7   4.9   5.1   4.7   4.9   5.1
# 5     5          5     5     4.6   4.7   4.9   4.6   4.7   4.9

iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  mutate(x_1 = coalesce(lag_1, lag_0),
         x_2 = coalesce(lag_2, lag_1, lag_0),
         x_3 = coalesce(lag_3, lag_2, lag_1, lag_0))
# # A tibble: 150 x 9
#      id Sepal.Length lag_0 lag_1 lag_2 lag_3   x_1   x_2   x_3
#   <int>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1          5.1   5.1  NA    NA    NA     5.1   5.1   5.1
# 2     2          4.9   4.9   5.1  NA    NA     5.1   5.1   5.1
# 3     3          4.7   4.7   4.9   5.1  NA     4.9   5.1   5.1
# 4     4          4.6   4.6   4.7   4.9   5.1   4.7   4.9   5.1
# 5     5          5     5     4.6   4.7   4.9   4.6   4.7   4.9

ベクトルに使った例も示します。

R
v
# [1]  2  2  4  4  4  4  1  3  5 NA
coalesce(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0

lag_0 <- lag(v, n = 0)
lag_0
# [1]  2  2  4  4  4  4  1  3  5 NA
lag_1 <- lag(v, n = 1)
lag_1
# [1] NA  2  2  4  4  4  4  1  3  5
lag_2 <- lag(v, n = 2)
lag_2
# [1] NA NA  2  2  4  4  4  4  1  3
lag_3 <- lag(v, n = 3)
lag_3
# [1] NA NA NA  2  2  4  4  4  4  1
coalesce(lag_1, lag_0)
# [1] 2 2 2 4 4 4 4 1 3 5
coalesce(lag_2, lag_1, lag_0)
# [1] 2 2 2 2 4 4 4 4 1 3
coalesce(lag_3, lag_2, lag_1, lag_0)
# [1] 2 2 2 2 2 4 4 4 4 1

replace_na():NAの置き換え

R
# replace_na()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  mutate(x_1 = replace_na(lag_1, 0),
         x_2 = replace_na(lag_2, 0),
         x_3 = replace_na(lag_3, 0))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  mutate(across(lag_1:lag_3, ~replace_na(., 0)))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(across(.cols = Sepal.Length,
                .fns = list(~lag(., n = 0), ~lag(., n = 1),  ~lag(., n = 2), ~lag(., n = 3)),
                .names = "lag_{fn-1}")) %>%
  replace_na(list(lag_1 = 0, lag_2 = 0, lag_3 = 0))
# # A tibble: 150 x 9
#      id Sepal.Length lag_0 lag_1 lag_2 lag_3   x_1   x_2   x_3
#   <int>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1          5.1   5.1  NA    NA    NA     0     0     0  
# 2     2          4.9   4.9   5.1  NA    NA     5.1   0     0  
# 3     3          4.7   4.7   4.9   5.1  NA     4.9   5.1   0  
# 4     4          4.6   4.6   4.7   4.9   5.1   4.7   4.9   5.1
# 5     5          5     5     4.6   4.7   4.9   4.6   4.7   4.9

ベクトルに使った例も示します。

R
v
# [1]  2  2  4  4  4  4  1  3  5 NA
replace_na(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0
coalesce(v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0
if_else(is.na(v), 0, v)
# [1] 2 2 4 4 4 4 1 3 5 0
if_else(!is.na(v), v, 0)
# [1] 2 2 4 4 4 4 1 3 5 0

na_if():NAへの置き換え

R
# na_if()
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = na_if(Sepal.Length, 4.9))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = if_else(Sepal.Length == 4.9, as.double(NA), Sepal.Length))
iris_tbl %>%
  select(id, Sepal.Length) %>%
  mutate(sl = if_else(Sepal.Length == 4.9, NA_real_, Sepal.Length))
# # A tibble: 150 x 3
#       id Sepal.Length    sl
#    <int>        <dbl> <dbl>
#  1     1          5.1   5.1
#  2     2          4.9  NA  
#  3     3          4.7   4.7
#  4     4          4.6   4.6
#  5     5          5     5  
#  6     6          5.4   5.4
#  7     7          4.6   4.6
#  8     8          5     5  
#  9     9          4.4   4.4
# 10    10          4.9  NA  

ベクトルに使った例も示します。

R
v
# [1]  2  2  4  4  4  4  1  3  5 NA
na_if(v, 4)
# [1]  2  2 NA NA NA NA  1  3  5 NA

recode():コードの振り直し

R
# recode()
iris_tbl %>%
  mutate(SP = recode(Species, setosa = toupper("setosa")))
# # A tibble: 150 x 7
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species SP    
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>   <fct> 
# 1     1          5.1         3.5          1.4         0.2 setosa  SETOSA
# 2     2          4.9         3            1.4         0.2 setosa  SETOSA

iris_tbl %>%
  mutate(SP2 = recode(Species,
                      setosa = toupper("setosa"),
                      versicolor = toupper("versicolor"),
                      virginica = toupper("virginica"))) %>%
  group_by(Species) %>% slice_head(n = 3)
# A tibble: 9 x 7
# Groups:   Species [3]
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species    SP2       
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>      <fct>     
# 1     1          5.1         3.5          1.4         0.2 setosa     SETOSA    
# 2     2          4.9         3            1.4         0.2 setosa     SETOSA    
# 3     3          4.7         3.2          1.3         0.2 setosa     SETOSA    
# 4    51          7           3.2          4.7         1.4 versicolor VERSICOLOR
# 5    52          6.4         3.2          4.5         1.5 versicolor VERSICOLOR
# 6    53          6.9         3.1          4.9         1.5 versicolor VERSICOLOR
# 7   101          6.3         3.3          6           2.5 virginica  VIRGINICA 
# 8   102          5.8         2.7          5.1         1.9 virginica  VIRGINICA 
# 9   103          7.1         3            5.9         2.1 virginica  VIRGINICA 

sp <- levels(iris_tbl$Species) # 変更前のlevels
sp
# [1] "setosa"     "versicolor" "virginica" 
sp_2 <- toupper(sp)            # 変更後のlevels
named_sp <- setNames(sp_2, sp) # 名前付きベクトル
iris_tbl %>%
  mutate(Species_2 = recode(Species, !!!named_sp)) %>%
  group_by(Species) %>% slice_head(n = 2)
# # A tibble: 6 x 7
# # Groups:   Species [3]
#      id Sepal.Length Sepal.Width Petal.Length Petal.Width Species    Species_2 
#   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>      <fct>     
# 1     1          5.1         3.5          1.4         0.2 setosa     SETOSA    
# 2     2          4.9         3            1.4         0.2 setosa     SETOSA    
# 3    51          7           3.2          4.7         1.4 versicolor VERSICOLOR
# 4    52          6.4         3.2          4.5         1.5 versicolor VERSICOLOR
# 5   101          6.3         3.3          6           2.5 virginica  VIRGINICA 
# 6   102          5.8         2.7          5.1         1.9 virginica  VIRGINICA 

ベクトルに使った例も示します。

R
v <- c(2, 2, 4, 4, 4, 4, 1, 3, 5, NA)
v <- LETTERS[v]
v
# [1] "B" "B" "D" "D" "D" "D" "A" "C" "E" NA 
recode(v, A = "AAA", B = "BBB", C = "CCC")
# [1] "BBB" "BBB" "D"   "D"   "D"   "D"   "AAA" "CCC" "E"   NA 

vf <- factor(v, levels = LETTERS[1:5])
vf
# [1] B    B    D    D    D    D    A    C    E    <NA>
# Levels: A B C D E
recode(vf, A = "AAA", B = "BBB", C = "CCC")
# [1] BBB  BBB  D    D    D    D    AAA  CCC  E    <NA>
# Levels: AAA BBB CCC D E

LETTERS[1:3]             # 変更前のlevels
# [1] "A" "B" "C"
library(stringr)
str_dup(LETTERS[1:3], 3) # 変更後のlevels
# [1] "AAA" "BBB" "CCC"
named_levels <- setNames(str_dup(LETTERS[1:3], 3), LETTERS[1:3]) # 名前付きベクトル
recode(vf, !!!named_levels)
# [1] BBB  BBB  D    D    D    D    AAA  CCC  E    <NA>
# Levels: AAA BBB CCC D E

指定した列のNAを置き換えるもの

まず、NAを含むデータを準備しておきます。

R
# NAを含むデータを準備
iris_tbl_na <- iris_tbl %>%
  slice_head(n = 10) %>%
  mutate(Sepal.Length = na_if(Sepal.Length, 4.9),
         Sepal.Width = na_if(Sepal.Width, 3.4)) %>% print()
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2         NA           3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6        NA            1.4         0.3 setosa 
#  8     8          5          NA            1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10         NA           3.1          1.5         0.1 setosa 

replace_na():NAを指定した値で置き換え

指定した列のNAを指定した値で置き換えます。指定する列と置き換える値はリストで与えます。

R
# replace_na()
iris_tbl_na %>%
  replace_na(list(Sepal.Length = 0.0, Sepal.Width = 99.9))
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2          0           3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6        99.9          1.4         0.3 setosa 
#  8     8          5          99.9          1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10          0           3.1          1.5         0.1 setosa 

list(Sepal.Length = 0.0, Sepal.Width = 99.9)
# $Sepal.Length
# [1] 0
# 
# $Sepal.Width
# [1] 99.9
# 
class(list(Sepal.Length = 0.0, Sepal.Width = 99.9))
# [1] "list"

なお、上で出てきたreplace_na()はベクトルに対する関数でしたが、こちらのreplace_na()はデータフレームに対する関数になっています。

fill():NAを前後の値で置き換え

指定した列のNAを同じ列の前後(上下)のNAでない値で置き換えます。引数.directionで前後のどちらの方向で埋めるかを指定できます。指定できる方向には、"down" (the default), "up", "downup" (i.e. first down and then up) or "updown" (first up and then down) の4通りあります。

R
# fill()
iris_tbl_na %>%
  fill(Sepal.Length, Sepal.Width)
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2          5.1         3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6         3.9          1.4         0.3 setosa 
#  8     8          5           3.9          1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10          4.4         3.1          1.5         0.1 setosa 

iris_tbl_na %>%
  fill(Sepal.Length, Sepal.Width, .direction = "up")
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2          4.7         3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6         2.9          1.4         0.3 setosa 
#  8     8          5           2.9          1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10         NA           3.1          1.5         0.1 setosa 

iris_tbl_na %>%
  fill(Sepal.Length, Sepal.Width, .direction = "updown")
# # A tibble: 10 x 6
#       id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#    <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#  1     1          5.1         3.5          1.4         0.2 setosa 
#  2     2          4.7         3            1.4         0.2 setosa 
#  3     3          4.7         3.2          1.3         0.2 setosa 
#  4     4          4.6         3.1          1.5         0.2 setosa 
#  5     5          5           3.6          1.4         0.2 setosa 
#  6     6          5.4         3.9          1.7         0.4 setosa 
#  7     7          4.6         2.9          1.4         0.3 setosa 
#  8     8          5           2.9          1.5         0.2 setosa 
#  9     9          4.4         2.9          1.4         0.2 setosa 
# 10    10          4.4         3.1          1.5         0.1 setosa 

参考

1
4
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
1
4