More than 5 years have passed since last update.

Data Munching in R: dplyr + tidyr + pipe

Posted at 2016-03-18

dplyr (0.4.3 as of September, 2015) and tidyr (0.4.1 as of February, 2015) are two R packages and all that you require for data munching in R. To make the most of the capabilities of these packages, they are often used together by data scientists.

( dplyr is the next version of the famous plyr package in R )

This is a short overview of the basic capabilities provided by these packages.

The pipe operator %>% which was first introduced in the package magrittr is now supported in both of these packages and is very addictive. We will discuss the usefulness of pipe operator ( %>% ) first.

The R pipe: `%>%`

Using the pipe frequently in your code can improve the neatness of your code hugely. It helps avoid unnecessary use of temporary variables in your code.

The basic meaning of %>% is :

R pipe code	Equivalent code
`x %>% f`	`f(x)`
`x %>% f(y)`	`f(x,y)`
`x %>% f(.,y)`	`f(x,y)`
`x %>% f(y,.)`	`f(y,x)`
`x %>% f(y, z = .)`	`f(y,z=x)`
`x %>% f %>% g`	`g(f(x))`
`z <- x %>% f`	`z <- f(x)`

Note that the variable x on the left side of %>% is applied as the first argument in the function on the right side. This default behaviour can be changed using . which is called a placeholder.

However, one important thing to remember is, when the . appears in nested expressions, the first-argument-rule is still applied. But this behaviour can be suppressed using the curly braces{ }. That is,

R pipe code	Equivalent code
`data %>% f(x = ncol(.))`	`f(data, x = ncol(data))`
`data %>% { f(x = ncol(.)) }`	`f(x = ncol(data))`

That is all what is necessary to know about the pipe.

dplyr

This package provides intuitive "verbs" useful for working with data frames in R (example: for subsetting, summarizing , rearranging data frames).

The basic verbs/functions of dplyr package are:

dplyr verb	Meaning
`select()`	keep only the columns you mention
`rename()`	rename the columns you mention
`mutate()`	add new columns and keep existing ones
`filter()`	returns subset of rows with matching conditions
`arrange()`	re-order the rows

We look at a few examples to understand how these verbs and pipes can be used together. We will be using sensor.csv data located in the repository here. So download this file into any folder first.

This data contains columns named: timestamp, id, tempature, humidity and precipitation for different sensors which are represented by the column named id.

We load the data first.

require(dplyr)
sensor <- read.csv('data/input/sensor.csv')
head(sensor)
# >            timestamp       id tempature humidity precipitation
# >1 2014-01-01 00:00:00 33114029      -0.5     80.6             0
# >2 2014-01-01 01:00:00 33114029      -0.7     81.8             0
# >3 2014-01-01 02:00:00 33114029       0.1     77.0             0
# >4 2014-01-01 03:00:00 33114029       0.5     74.0             0
# >5 2014-01-01 04:00:00 33114029      -0.6     77.3             0
# >6 2014-01-01 05:00:00 33114029      -2.0     82.5             0

Now we show the use of each of dplyr verbs one by one.


# SELECT, RENAME VERB

sensor <- sensor %>% select(one_of('id', 'tempature', 'humidity', 'precipitation')) %>%  rename(temperature=tempature)
head(sensor)
# >        id temperature humidity precipitation
# >1 33114029        -0.5     80.6             0
# >2 33114029        -0.7     81.8             0
# >3 33114029         0.1     77.0             0
# >4 33114029         0.5     74.0             0
# >5 33114029        -0.6     77.3             0
# >6 33114029        -2.0     82.5             0

# MUTATE VERB

sensor <- sensor %>% mutate(newcol = temperature*humidity)
head(sensor)
# >        id temperature humidity precipitation  newcol
# >1 33114029        -0.5     80.6             0  -40.30
# >2 33114029        -0.7     81.8             0  -57.26
# >3 33114029         0.1     77.0             0    7.70
# >4 33114029         0.5     74.0             0   37.00
# >5 33114029        -0.6     77.3             0  -46.38
# >6 33114029        -2.0     82.5             0 -165.00

# FILTER VERB

sensor <- sensor %>% filter(newcol > 2.1 )
head(sensor)
# >        id temperature humidity precipitation newcol
# >1 33114029         0.1     77.0             0   7.70
# >2 33114029         0.5     74.0             0  37.00
# >3 33114029         2.8     69.5             0 194.60
# >4 33114029         5.6     55.6             0 311.36
# >5 33114029         8.9     47.8             0 425.42
# >6 33114029        11.0     44.5             0 489.50

# ARRANGE VERB

sensor <- sensor %>% arrange(newcol)
head(sensor)
# >          id temperature humidity precipitation newcol
# >1   33114029         0.1     32.5             0   3.25
# >2   33114029         0.1     34.1             0   3.41
# >3   33114029         0.1     34.8             0   3.48
# >4   33130167         0.1     35.3             0   3.53
# >5 2030130001         0.1     35.3             0   3.53
# >6   33149672         0.1     36.6             0   3.66

In one line using the pipe, the above code would be:

sensor <- read.csv('data/input/sensor.csv') %>% select(one_of('id', 'tempature', 'humidity', 'precipitation')) 
%>%  rename(temperature=tempature) %>% mutate(newcol = temperature*humidity) %>% filter(newcol > 2.1 ) %>% arrange(newcol)

Pretty clean !

tidyr

This package provides verbs to clean/tidy the data.

The main verbs here are:

tidyr verb	Meaning
`gather()`	makes wide data longer
`spread()`	makes long data wider

There are two additional verbs which are sometimes useful:

tidyr verb	Meaning
`unite()`	paste together multiple columns into one
`separate()`	separate one column into several

Here are some examples on the same sensor.csv dataset:


# LOAD DATA

sensor <- read.csv('data/input/sensor.csv') %>% rename(temperature=tempature)
head(sensor)
# >            timestamp       id temperature humidity precipitation
# >1 2014-01-01 00:00:00 33114029        -0.5     80.6             0
# >2 2014-01-01 01:00:00 33114029        -0.7     81.8             0
# >3 2014-01-01 02:00:00 33114029         0.1     77.0             0
# >4 2014-01-01 03:00:00 33114029         0.5     74.0             0
# >5 2014-01-01 04:00:00 33114029        -0.6     77.3             0
# >6 2014-01-01 05:00:00 33114029        -2.0     82.5             0

# GATHER VERB
# .. the columns temperature, humidity and precipitation have to be gathered into a 
# .. single column named 'key' and the values represented by the column 'key'.

sensor <- sensor %>% gather(key, value, temperature:precipitation) 
head(sensor)
# >            timestamp       id         key value
# >1 2014-01-01 00:00:00 33114029 temperature  -0.5
# >2 2014-01-01 01:00:00 33114029 temperature  -0.7
# >3 2014-01-01 02:00:00 33114029 temperature   0.1
# >4 2014-01-01 03:00:00 33114029 temperature   0.5
# >5 2014-01-01 04:00:00 33114029 temperature  -0.6
# >6 2014-01-01 05:00:00 33114029 temperature  -2.0

# UNITE VERB
# .. we want to unite columns 'id' and 'key' into a single column 'key'

sensor <- sensor %>% unite(key, id, key) 
head(sensor)
# >            timestamp                  key value
# >1 2014-01-01 00:00:00 33114029_temperature  -0.5
# >2 2014-01-01 01:00:00 33114029_temperature  -0.7
# >3 2014-01-01 02:00:00 33114029_temperature   0.1
# >4 2014-01-01 03:00:00 33114029_temperature   0.5
# >5 2014-01-01 04:00:00 33114029_temperature  -0.6
# >6 2014-01-01 05:00:00 33114029_temperature  -2.0

# SPREAD VERB
# .. we want to spread the column 'key' into multiple columns each taking value mentioned in the column 'value'

sensor <- sensor %>% spread(key, value)
head(sensor)
# >            timestamp 2030120003_humidity 2030120003_precipitation 2030120003_temperature 2030130001_humidity
# >1 2014-01-01 00:00:00                79.4                        0                    4.2                74.6
# >2 2014-01-01 01:00:00                74.8                        0                    3.5                76.0
# >3 2014-01-01 02:00:00                82.7                        0                    2.6                75.7
# >4 2014-01-01 03:00:00                83.2                        0                    2.7                76.6
# >5 2014-01-01 04:00:00                83.5                        0                    2.7                76.7
# >6 2014-01-01 05:00:00                82.5                        0                    3.5                74.3

Combining above code into single line:

sensor <- read.csv('data/input/sensor.csv') %>% rename(temperature=tempature) 
%>% gather(key, value, temperature:precipitation) %>% unite(key, id, key) %>% spread(key, value)

Lastly, Here is a cheatsheet for quick reference to use these packages in tandem.

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

Data Munching in R: dplyr + tidyr + pipe

The R pipe: %>%

dplyr

tidyr

The R pipe: `%>%`