Connecting dplyr to Redshift

Posted at 2016-12-08

The page linked below describes how to connect dplyr to databases,
but it did not cover Redshift, so I looked into it.
dplyr and databases

For authoritative information, see the official link below:
Connecting R with Amazon Redshift

1. Fill in the connection details and connect to Redshift

# Run analyses with the dplyr package on Amazon Redshift
install.packages("dplyr")  # only needed the first time
library(dplyr)
library(RPostgreSQL)
# Template:
# myRedshift <- src_postgres("<DBNAME>",
#   host = "<ENDPOINT>",
#   port = <PORT>,
#   user = "<USER>",
#   password = "<PW>")
myRedshift <- src_postgres("demo",
  host = "redshiftdemo.ckffhmu2rolb.eu-west-1.redshift.amazonaws.com",
  port = 5439,
  user = "markus",
  password = "XXX")
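
As a variation on the connection step above, here is a minimal sketch that keeps the credentials out of the script by reading them from environment variables. The helper name `connect_redshift` and the `REDSHIFT_*` variable names are assumptions made for illustration and do not come from the original post or the AWS article. It uses `DBI::dbConnect` with the driver from the RPostgreSQL package already loaded above. Nothing is contacted until the function is called.

```r
library(DBI)

# Hypothetical helper: the function name and the REDSHIFT_* environment
# variable names are assumptions for illustration only.
connect_redshift <- function() {
  dbConnect(
    RPostgreSQL::PostgreSQL(),
    dbname   = Sys.getenv("REDSHIFT_DBNAME"),
    host     = Sys.getenv("REDSHIFT_HOST"),
    # Fall back to 5439, the port used in the example above
    port     = as.integer(Sys.getenv("REDSHIFT_PORT", "5439")),
    user     = Sys.getenv("REDSHIFT_USER"),
    password = Sys.getenv("REDSHIFT_PASSWORD")
  )
}
```

With the dbplyr package installed, the connection returned by `connect_redshift()` should be usable in `tbl()` in the same way as `myRedshift`.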

2. Create a table object with the tbl function

# create table reference
flights <- tbl(myRedshift, "flights")

# Base R functions for inspecting a data frame also work on the table reference
dim(flights)  # the row count may be reported as NA for a remote table
colnames(flights)
head(flights)

# summarize() reduces the (ungrouped) table to a single row
summarize(flights, avgdelay = mean(arrdelay))
summarize(flights, maxdelay = max(arrdelay))

When the table object is created, no data has been loaded into memory yet.
SQL runs only once some operation is actually executed on the object.
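
The lazy behaviour described above can be tried without a Redshift cluster. The sketch below assumes the DBI, RSQLite, and dbplyr packages are installed and uses an in-memory SQLite database as a stand-in for Redshift; the three sample rows are invented for illustration. The summary stays a lazy query object until `collect()` pulls the result into a local data frame.

```r
library(dplyr)

# Assumption: an in-memory SQLite database stands in for Redshift here
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "flights", data.frame(
  tailnum  = c("N1", "N2", "N3"),   # made-up sample rows
  depdelay = c(90, 10, 5),
  arrdelay = c(20, 12, 0)
))

flights <- tbl(con, "flights")  # a reference only; no rows are fetched
avg <- summarize(flights, avgdelay = mean(arrdelay, na.rm = TRUE))  # still lazy

is.data.frame(avg)      # FALSE: avg is a query description, not data
result <- collect(avg)  # the SQL runs here and the result is loaded into memory
is.data.frame(result)   # TRUE

DBI::dbDisconnect(con)
```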

3. From here, use dplyr however you like!

flights %>%
  filter(depdelay - arrdelay > 60) %>%
  select(tailnum, depdelay, arrdelay, dest)
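
To see roughly what SQL a pipeline like the one above turns into, dbplyr can render the query without any connection. This is a minimal sketch assuming the dbplyr package is installed. `flights_sim` is a hypothetical stand-in table with made-up values, and `simulate_postgres()` only approximates the PostgreSQL dialect, so the exact SQL sent to Redshift may differ.

```r
library(dplyr)
library(dbplyr)

# Assumption: a connection-less stand-in with the columns used in this post
flights_sim <- lazy_frame(
  tailnum = "N1", depdelay = 90, arrdelay = 20, dest = "NRT",
  con = simulate_postgres()
)

query <- flights_sim %>%
  filter(depdelay - arrdelay > 60) %>%
  select(tailnum, depdelay, arrdelay, dest)

show_query(query)                            # prints the generated SQL
sql_text <- as.character(sql_render(query))  # the same SQL as a string
```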

Convenient!
