{dplyr} Extracting the rows with the maximum value, or with values at or above the mean, within each group

Last updated at 2022-03-24, posted at 2022-03-24

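A quick memo, since this tripped me up for quite a while.

A. Just `group_by` the column you want to group on, then `filter`.

On a grouped tibble, `filter` behaves slightly differently than it does on an ungrouped one. The details are in the `filter` reference; here is a partial quote: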

In the ungrouped version, filter() compares the value of mass in each row to the global average (taken over the whole data set), keeping only the rows with mass greater than this global average. In contrast, the grouped version calculates the average mass separately for each gender group, and keeps rows with mass greater than the relevant within-gender average.

From https://dplyr.tidyverse.org/reference/filter.html#grouped-tibbles

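In other words, when you filter a grouped tibble with a condition such as `== max(<column>)` or `>= mean(<column>)`, each row is compared against the maximum or mean of its own group, not against the maximum or mean of the whole data set. Once you know this, extracting these rows is easy.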
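To make the difference concrete, here is a minimal sketch on a toy tibble. The data and the names `df`, `grp`, and `val` are made up for illustration and are not from the dplyr reference.

```r
library(dplyr)

# Hypothetical toy data: two groups on very different scales
df <- tibble(
  grp = c("a", "a", "a", "b", "b", "b"),
  val = c(1, 2, 3, 10, 20, 30)
)

# Ungrouped: every row is compared against the overall mean (11)
ungrouped <- df %>%
  filter(val >= mean(val))

# Grouped: each row is compared against its own group's mean
# (2 for "a", 20 for "b")
grouped <- df %>%
  group_by(grp) %>%
  filter(val >= mean(val)) %>%
  ungroup()

ungrouped$val  # 20 30
grouped$val    # 2 3 20 30
```

The ungrouped version drops group "a" entirely because its values are all below the overall mean, while the grouped version keeps the upper part of each group.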

Examples

Extract the rows with the maximum value in each group

library(dplyr)

# Keep the rows whose Sepal.Length equals the maximum within each Species
iris %>% 
  group_by(Species) %>% 
  filter(Sepal.Length == max(Sepal.Length))

## # A tibble: 3 x 5
## # Groups:   Species [3]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          5.8         4            1.2         0.2 setosa    
## 2          7           3.2          4.7         1.4 versicolor
## 3          7.9         3.8          6.4         2   virginica
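One thing worth keeping in mind with `== max(...)`: if several rows in a group share the maximum, all of them are kept. The sketch below shows this on made-up data (`scores`, `grp`, and `val` are hypothetical names). The `slice_max()` call is an alternative I am adding here for the case where exactly one row per group is wanted; it is not part of the original memo.

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied for the maximum
scores <- tibble(
  grp = c("a", "a", "a", "b", "b"),
  val = c(5, 5, 1, 7, 3)
)

# filter() keeps every row that equals the group maximum, ties included
all_max <- scores %>%
  group_by(grp) %>%
  filter(val == max(val)) %>%
  ungroup()

# slice_max() with with_ties = FALSE keeps exactly one row per group
one_per_group <- scores %>%
  group_by(grp) %>%
  slice_max(val, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(all_max)        # 3
nrow(one_per_group)  # 2
```

In the `iris` example above each species happens to have a single row at its maximum `Sepal.Length`, so the result has exactly three rows.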

Extract the rows with values at or above the mean in each group

iris %>% 
  group_by(Species) %>% 
  filter(Sepal.Length >= mean(Sepal.Length))

## # A tibble: 68 x 5
## # Groups:   Species [3]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          5.4         3.9          1.7         0.4 setosa 
##  3          5.4         3.7          1.5         0.2 setosa 
##  4          5.8         4            1.2         0.2 setosa 
##  5          5.7         4.4          1.5         0.4 setosa 
##  6          5.4         3.9          1.3         0.4 setosa 
##  7          5.1         3.5          1.4         0.3 setosa 
##  8          5.7         3.8          1.7         0.3 setosa 
##  9          5.1         3.8          1.5         0.3 setosa 
## 10          5.4         3.4          1.7         0.2 setosa 
## # ... with 58 more rows

Note: don't forget to `ungroup` when you need to.
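As a rough illustration of why this matters: the result of a grouped `filter` is still grouped, so later verbs keep working per group until you call `ungroup()`. The variable names below (`top_rows`, `still_grouped`, `flat`, `n_rows`) are just for this sketch.

```r
library(dplyr)

top_rows <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length == max(Sepal.Length))

# The filtered result is still grouped by Species
is_grouped_df(top_rows)  # TRUE

# Still grouped: n() counts the rows within each Species
still_grouped <- top_rows %>%
  mutate(n_rows = n())

# After ungroup(): n() counts all rows of the result
flat <- top_rows %>%
  ungroup() %>%
  mutate(n_rows = n())

still_grouped$n_rows  # 1 1 1
flat$n_rows           # 3 3 3
```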
