dplyr (0.4.3 as of September, 2015) and tidyr (0.4.1 as of February, 2015) are two R packages and all that you require for data munching in R. To make the most of the capabilities of these packages, they are often used together by data scientists.
( dplyr is the next version of the famous plyr package in R )
This is a short overview of the basic capabilities provided by these packages.
The pipe operator %>%
which was first introduced in the package magrittr is now supported in both of these packages and is very addictive. We will discuss the usefulness of pipe operator ( %>%
) first.
The R pipe: %>%
Using the pipe frequently in your code can improve the neatness of your code hugely. It helps avoid unnecessary use of temporary variables in your code.
The basic meaning of %>%
is :
R pipe code | Equivalent code |
---|---|
x %>% f |
f(x) |
x %>% f(y) |
f(x,y) |
x %>% f(.,y) |
f(x,y) |
x %>% f(y,.) |
f(y,x) |
x %>% f(y, z = .) |
f(y,z=x) |
x %>% f %>% g |
g(f(x)) |
z <- x %>% f |
z <- f(x) |
Note that the variable x
on the left side of %>%
is applied as the first argument in the function on the right side. This default behaviour can be changed using .
which is called a placeholder.
However, one important thing to remember is, when the .
appears in nested expressions, the first-argument-rule is still applied. But this behaviour can be suppressed using the curly braces{ }
. That is,
R pipe code | Equivalent code |
---|---|
data %>% f(x = ncol(.)) |
f(data, x = ncol(data)) |
data %>% { f(x = ncol(.)) } |
f(x = ncol(data)) |
That is all what is necessary to know about the pipe.
dplyr
This package provides intuitive "verbs" useful for working with data frames in R (example: for subsetting, summarizing , rearranging data frames).
The basic verbs/functions of dplyr package are:
dplyr verb | Meaning |
---|---|
select() |
keep only the columns you mention |
rename() |
rename the columns you mention |
mutate() |
add new columns and keep existing ones |
filter() |
returns subset of rows with matching conditions |
arrange() |
re-order the rows |
We look at a few examples to understand how these verbs and pipes can be used together. We will be using sensor.csv
data located in the repository here. So download this file into any folder first.
This data contains columns named: timestamp, id, tempature, humidity and precipitation for different sensors which are represented by the column named id.
We load the data first.
require(dplyr)
sensor <- read.csv('data/input/sensor.csv')
head(sensor)
#> timestamp id tempature humidity precipitation
#>1 2014-01-01 00:00:00 33114029 -0.5 80.6 0
#>2 2014-01-01 01:00:00 33114029 -0.7 81.8 0
#>3 2014-01-01 02:00:00 33114029 0.1 77.0 0
#>4 2014-01-01 03:00:00 33114029 0.5 74.0 0
#>5 2014-01-01 04:00:00 33114029 -0.6 77.3 0
#>6 2014-01-01 05:00:00 33114029 -2.0 82.5 0
Now we show the use of each of dplyr verbs one by one.
# SELECT, RENAME VERB
sensor <- sensor %>% select(one_of('id', 'tempature', 'humidity', 'precipitation')) %>% rename(temperature=tempature)
head(sensor)
#> id temperature humidity precipitation
#>1 33114029 -0.5 80.6 0
#>2 33114029 -0.7 81.8 0
#>3 33114029 0.1 77.0 0
#>4 33114029 0.5 74.0 0
#>5 33114029 -0.6 77.3 0
#>6 33114029 -2.0 82.5 0
# MUTATE VERB
sensor <- sensor %>% mutate(newcol = temperature*humidity)
head(sensor)
#> id temperature humidity precipitation newcol
#>1 33114029 -0.5 80.6 0 -40.30
#>2 33114029 -0.7 81.8 0 -57.26
#>3 33114029 0.1 77.0 0 7.70
#>4 33114029 0.5 74.0 0 37.00
#>5 33114029 -0.6 77.3 0 -46.38
#>6 33114029 -2.0 82.5 0 -165.00
# FILTER VERB
sensor <- sensor %>% filter(newcol > 2.1 )
head(sensor)
#> id temperature humidity precipitation newcol
#>1 33114029 0.1 77.0 0 7.70
#>2 33114029 0.5 74.0 0 37.00
#>3 33114029 2.8 69.5 0 194.60
#>4 33114029 5.6 55.6 0 311.36
#>5 33114029 8.9 47.8 0 425.42
#>6 33114029 11.0 44.5 0 489.50
# ARRANGE VERB
sensor <- sensor %>% arrange(newcol)
head(sensor)
#> id temperature humidity precipitation newcol
#>1 33114029 0.1 32.5 0 3.25
#>2 33114029 0.1 34.1 0 3.41
#>3 33114029 0.1 34.8 0 3.48
#>4 33130167 0.1 35.3 0 3.53
#>5 2030130001 0.1 35.3 0 3.53
#>6 33149672 0.1 36.6 0 3.66
In one line using the pipe, the above code would be:
sensor <- read.csv('data/input/sensor.csv') %>% select(one_of('id', 'tempature', 'humidity', 'precipitation'))
%>% rename(temperature=tempature) %>% mutate(newcol = temperature*humidity) %>% filter(newcol > 2.1 ) %>% arrange(newcol)
Pretty clean !
tidyr
This package provides verbs to clean/tidy the data.
The main verbs here are:
tidyr verb | Meaning |
---|---|
gather() |
makes wide data longer |
spread() |
makes long data wider |
There are two additional verbs which are sometimes useful:
tidyr verb | Meaning |
---|---|
unite() |
paste together multiple columns into one |
separate() |
separate one column into several |
Here are some examples on the same sensor.csv
dataset:
# LOAD DATA
sensor <- read.csv('data/input/sensor.csv') %>% rename(temperature=tempature)
head(sensor)
#> timestamp id temperature humidity precipitation
#>1 2014-01-01 00:00:00 33114029 -0.5 80.6 0
#>2 2014-01-01 01:00:00 33114029 -0.7 81.8 0
#>3 2014-01-01 02:00:00 33114029 0.1 77.0 0
#>4 2014-01-01 03:00:00 33114029 0.5 74.0 0
#>5 2014-01-01 04:00:00 33114029 -0.6 77.3 0
#>6 2014-01-01 05:00:00 33114029 -2.0 82.5 0
# GATHER VERB
# .. the columns temperature, humidity and precipitation have to be gathered into a
# .. single column named 'key' and the values represented by the column 'key'.
sensor <- sensor %>% gather(key, value, temperature:precipitation)
head(sensor)
#> timestamp id key value
#>1 2014-01-01 00:00:00 33114029 temperature -0.5
#>2 2014-01-01 01:00:00 33114029 temperature -0.7
#>3 2014-01-01 02:00:00 33114029 temperature 0.1
#>4 2014-01-01 03:00:00 33114029 temperature 0.5
#>5 2014-01-01 04:00:00 33114029 temperature -0.6
#>6 2014-01-01 05:00:00 33114029 temperature -2.0
# UNITE VERB
# .. we want to unite columns 'id' and 'key' into a single column 'key'
sensor <- sensor %>% unite(key, id, key)
head(sensor)
#> timestamp key value
#>1 2014-01-01 00:00:00 33114029_temperature -0.5
#>2 2014-01-01 01:00:00 33114029_temperature -0.7
#>3 2014-01-01 02:00:00 33114029_temperature 0.1
#>4 2014-01-01 03:00:00 33114029_temperature 0.5
#>5 2014-01-01 04:00:00 33114029_temperature -0.6
#>6 2014-01-01 05:00:00 33114029_temperature -2.0
# SPREAD VERB
# .. we want to spread the column 'key' into multiple columns each taking value mentioned in the column 'value'
sensor <- sensor %>% spread(key, value)
head(sensor)
#> timestamp 2030120003_humidity 2030120003_precipitation 2030120003_temperature 2030130001_humidity
#>1 2014-01-01 00:00:00 79.4 0 4.2 74.6
#>2 2014-01-01 01:00:00 74.8 0 3.5 76.0
#>3 2014-01-01 02:00:00 82.7 0 2.6 75.7
#>4 2014-01-01 03:00:00 83.2 0 2.7 76.6
#>5 2014-01-01 04:00:00 83.5 0 2.7 76.7
#>6 2014-01-01 05:00:00 82.5 0 3.5 74.3
Combining above code into single line:
sensor <- read.csv('data/input/sensor.csv') %>% rename(temperature=tempature)
%>% gather(key, value, temperature:precipitation) %>% unite(key, id, key) %>% spread(key, value)
Lastly, Here is a cheatsheet for quick reference to use these packages in tandem.