Cheat sheet for data analysis with R, the statistical language

Last updated at 2017-05-05. Posted at 2015-05-18.

Background knowledge you need when analyzing data in R.

help

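help(base)  # help for a package
help(array)  # help for a function
help(class(1))  # help for an object's class: class(1) evaluates to "numeric" (1 is just an example object)
?array  # ? is shorthand for help()
??arr  # search the help system with fuzzy matching (shorthand for help.search)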
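
When the exact name of a function is not known, the same information can also be looked up programmatically. The following is a minimal sketch using only functions that ship with R; the search pattern and the choice of read.csv are just examples.

```r
# List objects on the search path whose names start with "read."
readers <- apropos("^read\\.")
print(head(readers))

# Show the argument list of a function without opening its help page
print(args(read.csv))

# Names of the formal arguments, as a character vector
arg_names <- names(formals(read.csv))
print(arg_names)
```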

Packages

install.packages('ggplot2')  # install a package
install.packages()  # choose packages to install from a list
library(ggplot2)  # load a package; it must be installed first
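
In scripts it is often convenient to install a package only when it is missing. Here is one possible sketch; load_or_install is a hypothetical helper name, not part of R, and the install step assumes network access and a configured CRAN mirror.

```r
# Hypothetical helper: install a package only when it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs network access and a configured CRAN mirror
  }
  # character.only = TRUE tells library() that pkg holds the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so nothing is installed here; it is only attached
load_or_install("stats")
```

Calling load_or_install("ggplot2") would then install ggplot2 on first use and simply load it afterwards.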

DataFrame

R-console

> head(diamonds)
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
> names(diamonds)
 [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
 [8] "x"       "y"       "z"  
> ncol(diamonds)
[1] 10
> nrow(diamonds)
[1] 53940
> dim(diamonds)
[1] 53940    10
> options(width = 90)  # set the console output width
> summary(diamonds)
     carat               cut        color        clarity          depth      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
                                    J: 2808   (Other): 2531                  
     table           price             x                y                z         
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710   Median : 3.530  
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900   Max.   :31.800  
> qplot(carat, price, color = clarity, data = diamonds)
> qplot(diamonds$carat, diamonds$price, color = diamonds$clarity)
> ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()

Result of qplot / ggplot
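
Beyond inspecting a data frame, the usual next steps are selecting columns, filtering rows, and aggregating. The sketch below uses base R only and a tiny hand-built data frame whose values are copied from rows 1, 2 and 4 of the head(diamonds) output above, so it runs without ggplot2.

```r
# Three rows taken from the head(diamonds) output, reduced to three columns
df <- data.frame(
  carat = c(0.23, 0.21, 0.29),
  cut   = factor(c("Ideal", "Premium", "Premium")),
  price = c(326L, 326L, 334L)
)

str(df)                 # compact view of column types and first values
prices <- df$price      # extract one column as a vector

# Filter rows with a logical condition; keep all columns
premium <- df[df$cut == "Premium", ]

# Sort by carat in descending order and keep two columns
by_carat <- df[order(-df$carat), c("carat", "price")]

# Mean price per cut
mean_by_cut <- aggregate(price ~ cut, data = df, FUN = mean)
print(mean_by_cut)
```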

Reading from CSV

input.csv

X,Y
1,1
3,4
5,7

R-console

> read.csv('input.csv')
  X Y
1 1 1
2 3 4
3 5 7

The function is defined like this (quoted from ?read.csv)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)
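
To show a few of these arguments in use, here is a small round-trip sketch. It writes the sample data to a temporary file (so it does not depend on input.csv existing), reads it back with explicit column types, and writes a derived table with write.csv. The column Z is an illustrative addition, not part of the original data.

```r
# Write the sample data to a temporary CSV file
path <- tempfile(fileext = ".csv")
writeLines(c("X,Y", "1,1", "3,4", "5,7"), path)

# Read it back; colClasses fixes the column types instead of letting R guess
df <- read.csv(path, header = TRUE, colClasses = c("integer", "integer"))
str(df)

# Add a derived column and write the result;
# row.names = FALSE avoids an extra index column in the output
df$Z <- df$X + df$Y
out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)
print(readLines(out))
```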

Reading from PostgreSQL

For MySQL use RMySQL; for SQLite use RSQLite.

library(RPostgreSQL)  # run install.packages('RPostgreSQL') once beforehand
con <- dbConnect(PostgreSQL(), host="localhost", port=5432, dbname="dev", user="postgres", password="")
data <- dbGetQuery(con, "SELECT * FROM users LIMIT 1000")
## For large results, replace dbGetQuery with dbSendQuery, dbFetch, dbClearResult
rs <- dbSendQuery(con, "SELECT * FROM users")
while(!dbHasCompleted(rs)) {
  chunk <- dbFetch(rs, n = 1000)  # fetch up to 1000 rows at a time
  # process chunk here
}
dbClearResult(rs)  # release the result set
## Always disconnect when done
dbDisconnect(con)
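
The chunked pattern can be tried without a PostgreSQL server. The sketch below swaps in an in-memory SQLite database through RSQLite (an assumption: the DBI and RSQLite packages are installed) and a synthetic users table with made-up id and score columns. The DBI calls are the same ones used above.

```r
library(DBI)

# Stand-in database: in-memory SQLite instead of PostgreSQL
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Synthetic table, for illustration only
dbWriteTable(con, "users", data.frame(id = 1:2500, score = (1:2500) %% 7))

rs <- dbSendQuery(con, "SELECT * FROM users")
total_rows <- 0
total_score <- 0
while (!dbHasCompleted(rs)) {
  chunk <- dbFetch(rs, n = 1000)          # at most 1000 rows per iteration
  total_rows <- total_rows + nrow(chunk)  # accumulate instead of keeping all rows
  total_score <- total_score + sum(chunk$score)
}
dbClearResult(rs)
dbDisconnect(con)

print(total_rows)
```

Accumulating summaries per chunk keeps memory use bounded by the chunk size rather than by the size of the table.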

Other useful functions (run these before dbDisconnect)

tables <- dbListTables(con)
fields <- dbListFields(con, tables[1])
data <- dbReadTable(con, tables[1])
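
These inspection functions can be tried on a throwaway database. As a sketch, again assuming DBI and RSQLite are installed and using a made-up three-row users table:

```r
library(DBI)

# Throwaway in-memory database with one illustrative table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "users", data.frame(id = 1:3, name = c("a", "b", "c")))

tables <- dbListTables(con)              # names of all tables
fields <- dbListFields(con, tables[1])   # column names of the first table
data <- dbReadTable(con, tables[1])      # whole table as a data frame
has_users <- dbExistsTable(con, "users") # check for a table before querying it

print(tables)
print(fields)
dbDisconnect(con)
```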

TODO

Image output
Visualization with ggplot facets, transparency, and jitter
Predicting the category of new data points (simple machine learning)
