
Extracting the most frequent data within each group

Posted at 2017-03-08 (last updated at 2017-03-08)

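I got stuck on the following case, so I am writing it down as a note.

Setup

Assume a data frame (df) with a date (date), an ID (id), and a value (value). The same date appears in several rows, and the ID and value attached to each of those rows can differ.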

library(tidyverse)
library(lubridate)

set.seed(1)

iter <- 5 # rows per date (an arbitrary value)

date <- ymd(rep(20170101:20170110, each = iter)) # dates
id <- sample(1:2, iter * 10, replace = TRUE)     # IDs
value <- rnorm(iter * 10)                        # values

df <- tibble(date, id, value) # data frame

df
# > # A tibble: 50 × 3
# >          date    id       value
# >        <date> <int>       <dbl>
# > 1  2017-01-01     1 -0.05612874
# > 2  2017-01-01     1 -0.15579551
# > 3  2017-01-01     2 -1.47075238
# > 4  2017-01-01     2 -0.47815006
# > 5  2017-01-01     1  0.41794156
# > 6  2017-01-02     2  1.35867955
# > 7  2017-01-02     2 -0.10278773
# > 8  2017-01-02     2  0.38767161
# > 9  2017-01-02     2 -0.05380504
# > 10 2017-01-02     1 -1.37705956
# > # ... with 40 more rows

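Goal

For each date, I want to keep only the rows whose ID appears most often on that date.
It is quite a brute-force approach, but the following pipeline does the extraction.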

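df %>% 
  count(date, id) %>% # count rows per (date, id); this creates column "n"
  group_by(date) %>% # group by date explicitly; recent dplyr's count() keeps the input grouping
  top_n(1, n) %>% # keep only the largest n within each date (ties are kept)
  ungroup() %>% 
  right_join(., df, by = c("date", "id")) %>% # join on date and ID (keeps every row of df on the right)
  na.omit() %>% # drop the rows of df that found no match
  select(-n) # remove the helper column n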
# > # A tibble: 34 × 3
# >          date    id       value
# >        <date> <int>       <dbl>
# > 1  2017-01-01     1 -0.05612874
# > 2  2017-01-01     1 -0.15579551
# > 3  2017-01-01     1  0.41794156
# > 4  2017-01-02     2  1.35867955
# > 5  2017-01-02     2 -0.10278773
# > 6  2017-01-02     2  0.38767161
# > 7  2017-01-02     2 -0.05380504
# > 8  2017-01-03     1 -0.41499456
# > 9  2017-01-03     1 -0.39428995
# > 10 2017-01-03     1  1.10002537
# > # ... with 24 more rows
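
To make the counting step easier to follow, here is a minimal sketch on a small hand-made table. The `toy` data frame and its values are made up for illustration and are not part of the data above; only `count()`, `group_by()` and `top_n()` from dplyr are used.

```r
library(dplyr)

# Hypothetical toy data: three dates, IDs chosen by hand
toy <- tibble(
  date  = as.Date(c("2017-01-01", "2017-01-01", "2017-01-01",
                    "2017-01-02", "2017-01-02",
                    "2017-01-03", "2017-01-03")),
  id    = c(1L, 1L, 2L, 2L, 2L, 1L, 2L),
  value = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
)

# Number of rows for every (date, id) pair
counts <- toy %>% count(date, id)

# Within each date, keep the ID(s) with the largest count
top_ids <- counts %>%
  group_by(date) %>%
  top_n(1, n) %>%
  ungroup()

top_ids
```

On 2017-01-03 both IDs appear once, so both survive: `top_n()` keeps ties. With an odd number of rows per date and only two IDs, as in the article's data, a tie cannot occur.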

I suspect there is a better way to do this, but I could not find one.
If you know of one, I would be grateful if you could share it.
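
One possible shorter variant, offered as a sketch and not as the definitive answer, is to attach the per-(date, id) count to every row with `add_count()` and then filter within each date. This avoids the join and the `na.omit()` step. It assumes a dplyr version that provides `add_count()`; the dates are built with base R here so the block runs without lubridate.

```r
library(dplyr)

set.seed(1)
iter <- 5 # rows per date, as in the article

# Same shape as the article's df (date, id, value)
df <- tibble(
  date  = rep(seq(as.Date("2017-01-01"), by = "day", length.out = 10),
              each = iter),
  id    = sample(1:2, iter * 10, replace = TRUE),
  value = rnorm(iter * 10)
)

result <- df %>%
  add_count(date, id) %>%   # n = number of rows sharing this (date, id)
  group_by(date) %>%
  filter(n == max(n)) %>%   # keep rows of the most frequent ID per date (ties kept)
  ungroup() %>%
  select(-n)                # drop the helper column

result
```

Like `top_n()`, the `n == max(n)` filter keeps every tied ID. The number of rows may differ from the 34 shown above, because R 3.6.0 changed the default algorithm behind `sample()`, so the same seed no longer reproduces the older draw.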
