
Imputing NA values with the column mean

Last updated at 2019-09-09, posted at 2019-08-29

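When data contain NA values, one common workaround is to substitute each column's mean as a placeholder value before further processing.
The following shows how to do this with the tidyverse.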

Creating data that contains NA values

The code below sets 30 randomly chosen rows to NA in each column whose name contains Length.

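library(magrittr)  # provides the %>% pipe

# sample() is evaluated once per matched column, so the NA rows differ between columns.
iris.na <- iris %>% 
  dplyr::mutate_at(dplyr::vars(tidyselect::matches("Length")),
                   list( ~ dplyr::case_when(dplyr::row_number() %in% sample(1:nrow(iris), 30) ~ NA_real_,
                                            TRUE ~ .)))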

> head(iris.na)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2           NA         3.0          1.4         0.2  setosa
3          4.7         3.2           NA         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5           NA         3.6          1.4         0.2  setosa
6           NA         3.9          1.7         0.4  setosa
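
Because the rows are drawn at random, the NA positions change on every run. The following is a minimal sketch of the same step with a fixed seed and a per-column NA count as a sanity check. The seed value 42 is an arbitrary assumption, so the resulting rows will not match the output printed above.

```r
library(magrittr)  # provides the %>% pipe

set.seed(42)  # arbitrary seed (an assumption), used only to make the run repeatable

iris.na <- iris %>%
  dplyr::mutate_at(dplyr::vars(tidyselect::matches("Length")),
                   list(~ dplyr::case_when(dplyr::row_number() %in% sample(1:nrow(iris), 30) ~ NA_real_,
                                           TRUE ~ .)))

# Count the NA values in every column.
# Only the two Length columns should contain NA values, 30 each.
na.counts <- colSums(is.na(iris.na))
print(na.counts)
```

sample() draws without replacement by default, so each Length column receives exactly 30 distinct NA rows.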

Replace the NA values in the table created above with the mean of each column.

iris.imputed <- iris.na %>% 
  dplyr::mutate_at(dplyr::vars(tidyselect::matches("Length")),
                   list( ~ tidyr::replace_na(., mean(., na.rm = TRUE))))

> head(iris.imputed)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1     5.100000         3.5       1.4000         0.2  setosa
2     5.864167         3.0       1.4000         0.2  setosa
3     4.700000         3.2       3.6975         0.2  setosa
4     4.600000         3.1       1.5000         0.2  setosa
5     5.864167         3.6       1.4000         0.2  setosa
6     5.864167         3.9       1.7000         0.4  setosa
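
In dplyr 1.0.0 and later, mutate_at() is superseded by across(). The following is a sketch of the same mean imputation written with across(), assuming such a dplyr version is installed. The NA values are injected here with base R and an arbitrary seed only to keep the example self-contained; this is not the article's original construction.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across()

set.seed(1)  # arbitrary seed (an assumption)

# Build a small test table: 30 random NA values in each Length column.
iris.na <- iris
iris.na$Sepal.Length[sample(nrow(iris), 30)] <- NA
iris.na$Petal.Length[sample(nrow(iris), 30)] <- NA

# Replace NA in every column matching "Length" with that column's mean.
iris.imputed <- iris.na %>%
  mutate(across(matches("Length"),
                ~ tidyr::replace_na(.x, mean(.x, na.rm = TRUE))))

# No NA values should remain after imputation.
anyNA(iris.imputed)
```

Filling with the mean leaves each column's mean unchanged, which is one reason this method is often used as a quick placeholder.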

To impute with the median instead of the mean, use median.

iris.imputed <- iris.na %>% 
  dplyr::mutate_at(dplyr::vars(tidyselect::matches("Length")),
                   list( ~ tidyr::replace_na(., median(., na.rm = TRUE))))

> head(iris.imputed)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5         1.40         0.2  setosa
2          5.8         3.0         1.40         0.2  setosa
3          4.7         3.2         4.25         0.2  setosa
4          4.6         3.1         1.50         0.2  setosa
5          5.8         3.6         1.40         0.2  setosa
6          5.8         3.9         1.70         0.4  setosa
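
The choice between mean and median matters most when a column contains outliers. The following small example uses a made-up vector x (hypothetical data, not taken from iris) to show how far apart the two fill values can be.

```r
# Hypothetical vector with one outlier and one missing value.
x <- c(1, 2, 3, 100, NA)

# The mean is pulled toward the outlier: (1 + 2 + 3 + 100) / 4 = 26.5
by.mean <- tidyr::replace_na(x, mean(x, na.rm = TRUE))

# The median of 1, 2, 3, 100 is the midpoint of 2 and 3: 2.5
by.median <- tidyr::replace_na(x, median(x, na.rm = TRUE))

print(by.mean)    # 1.0 2.0 3.0 100.0 26.5
print(by.median)  # 1.0 2.0 3.0 100.0 2.5
```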

To fill in the mean of each Species rather than the mean over all Species, use the group_by function from dplyr.

iris.imputed <- iris.na %>% 
  dplyr::group_by(Species) %>% 
  dplyr::mutate_at(dplyr::vars(tidyselect::matches("Length")),
                   list( ~ tidyr::replace_na(., mean(., na.rm = TRUE)))) %>% 
  dplyr::ungroup()

The result is a tibble.

iris.imputed
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1         5.1          3.5         1.4          0.2 setosa 
 2         5.01         3           1.4          0.2 setosa 
 3         4.7          3.2         1.46         0.2 setosa 
 4         4.6          3.1         1.5          0.2 setosa 
 5         5.01         3.6         1.4          0.2 setosa 
 6         5.01         3.9         1.7          0.4 setosa 
 7         4.6          3.4         1.4          0.3 setosa 
 8         5            3.4         1.5          0.2 setosa 
 9         5.01         2.9         1.4          0.2 setosa 
10         4.9          3.1         1.46         0.1 setosa 
# … with 140 more rows
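
One way to confirm that the grouped imputation really used per-species means is to compare the filled-in rows against a summary table. The following is a sketch for a single column. The name species.means, the base R NA injection, and the seed are assumptions made for illustration.

```r
library(dplyr)

set.seed(1)  # arbitrary seed (an assumption)

# Test table with 30 random NA values in Sepal.Length.
iris.na <- iris
iris.na$Sepal.Length[sample(nrow(iris), 30)] <- NA

# Mean of the observed values within each species.
species.means <- iris.na %>%
  group_by(Species) %>%
  summarise(mean.Sepal.Length = mean(Sepal.Length, na.rm = TRUE))

# Group-wise imputation, as in the article but for one column.
iris.imputed <- iris.na %>%
  group_by(Species) %>%
  mutate(Sepal.Length = tidyr::replace_na(Sepal.Length,
                                          mean(Sepal.Length, na.rm = TRUE))) %>%
  ungroup()

# Keep only the rows that were NA and attach the matching species mean.
filled <- iris.imputed[is.na(iris.na$Sepal.Length), ] %>%
  left_join(species.means, by = "Species")

# Every filled value should equal the mean of its own species.
all.equal(filled$Sepal.Length, filled$mean.Sepal.Length)
```

If every row of a species were NA, its mean would be NaN and the gaps would not be filled, so a check like this is worth running on real data.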
