
Playing with R: Predicting Wine Classes with randomForest

Posted at 2016-01-13

I played around with R a bit!

The Center for Machine Learning and Intelligent Systems at the University of California, Irvine publishes datasets for data analysis free of charge.

Here I use its Wine dataset and try predicting the wine class with a random forest.

UCI Machine Learning Repository: Wine Data Set

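Roughly speaking, random forest is a machine learning algorithm proposed by Leo Breiman. Its distinguishing feature is that it improves generalization performance through ensemble learning: many decision trees are trained and their predictions are combined. Here I predict the wine class using quantities such as the amounts of chemical constituents in each wine as explanatory variables. (In this dataset the three classes correspond to three grape cultivars.)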
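To make the ensemble idea concrete, here is a minimal sketch of bagging (bootstrap aggregation with majority voting), which is one ingredient of a random forest. It is not the randomForest implementation: it uses rpart trees and the built-in iris data as stand-ins, and it leaves out the random feature selection at each split that a random forest adds. The seed, the number of trees, and the split size are arbitrary assumptions.

```r
library(rpart)

set.seed(42)                               # arbitrary seed, for reproducibility only
train_idx <- sample(nrow(iris), 100)       # stand-in data: iris, not the Wine data
bag_train <- iris[train_idx, ]
bag_test  <- iris[-train_idx, ]

n_trees <- 25                              # arbitrary ensemble size

# Fit one tree per bootstrap resample and collect its test-set predictions.
# The result is a character matrix: one row per test sample, one column per tree.
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- bag_train[sample(nrow(bag_train), replace = TRUE), ]
  tree <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(tree, bag_test, type = "class"))
})

# Majority vote across the trees for each test sample
ensemble <- apply(votes, 1, function(v) names(which.max(table(v))))

accuracy <- mean(ensemble == bag_test$Species)
print(accuracy)
```

Averaging over trees trained on different resamples reduces the variance of any single tree, which is where the gain in generalization comes from.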

Download the data from the repository with the GET function from the httr package.

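library(httr)
# Download the file and parse the response body as CSV text (the file has no header row)
geturl <- GET("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data")
dat <- read.csv(text = content(geturl, as = "text", encoding = "UTF-8"), header = FALSE)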

Fix the column names, and convert class to a factor.

head(dat)
names(dat) <- c("class", paste0("V", 1:13))
dat <- transform(dat, class = as.factor(class))
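The renaming step can be tried without a network connection. The sketch below parses two toy rows in the same header-less, 14-column layout; the values are illustrative, not the downloaded data, and `toy` is a hypothetical name. It shows why renaming matters: with header = FALSE, read.csv names the columns V1 to V14, so the class label would otherwise be called V1 and every measurement would be off by one.

```r
# Two toy rows in the same layout as wine.data (illustrative values only)
txt <- "1,14.2,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
2,12.4,0.94,1.36,10.6,88,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520"

toy <- read.csv(text = txt, header = FALSE)
default_names <- names(toy)     # V1 .. V14, assigned automatically
print(default_names)

# First column is the class label; the 13 measurements become V1 .. V13
names(toy) <- c("class", paste0("V", 1:13))
toy <- transform(toy, class = as.factor(class))
str(toy)
```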

Let's also look at an overview of the data.

> str(dat)
'data.frame':	178 obs. of  14 variables:
 $ class: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ V1   : num  14.2 13.2 13.2 14.4 13.2 ...
 $ V2   : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ V3   : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ V4   : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ V5   : int  127 100 101 113 118 112 96 121 97 98 ...
 $ V6   : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ V7   : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ V8   : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ V9   : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ V10  : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ V11  : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ V12  : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ V13  : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

The response variable is class, and the remaining variables are the explanatory variables. V1 through V13 are the following attributes.

Variable	Attribute
V1	Alcohol
V2	Malic acid
V3	Ash
V4	Alcalinity of ash
V5	Magnesium
V6	Total phenols
V7	Flavanoids
V8	Nonflavanoid phenols
V9	Proanthocyanins
V10	Color intensity
V11	Hue
V12	OD280/OD315 of diluted wines
V13	Proline

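Split the data into a training set and an evaluation set. You could use the sample function, but I recommend createDataPartition from the caret package, because it keeps the class proportions of the response variable from becoming skewed in the split data.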

library(caret)
# Stratified 80/20 split: class proportions are preserved in both subsets
index <- createDataPartition(dat$class, p = 0.8, list = FALSE)
train <- dat[index, ]
test  <- dat[-index, ]
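The difference between a stratified split and a plain random split can be sketched in base R. This is only an illustration of the idea, not caret's actual algorithm. It assumes class labels with the per-class counts of the Wine data (59, 71, and 48) as a stand-in for dat$class; the seed and variable names are arbitrary.

```r
# Stand-in for dat$class: 178 labels with the Wine data's class counts (assumption)
cls <- factor(rep(1:3, times = c(59, 71, 48)))
set.seed(123)  # arbitrary seed

# Stratified: draw 80% of the row indices within each class separately
by_class  <- split(seq_along(cls), cls)
strat_idx <- unlist(lapply(by_class, function(i) sample(i, ceiling(0.8 * length(i)))),
                    use.names = FALSE)
strat_test <- table(cls[-strat_idx])

# Plain: draw 80% of all row indices, ignoring the class
plain_idx  <- sample(length(cls), ceiling(0.8 * length(cls)))
plain_test <- table(cls[-plain_idx])

print(strat_test)  # per-class test counts are fixed by construction
print(plain_test)  # per-class test counts change from seed to seed
```

With the stratified draw the held-out set always contains 11, 14, and 9 samples of the three classes, whatever the seed. With the plain draw a small class can end up under-represented by chance.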

Optimize mtry with tuneRF. In this run 6 came out as the best value, so I fit the random forest with that.

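library(randomForest)
# Search for the mtry with the lowest out-of-bag error (6 in this run)
tuneRF(train[, -1], train[, 1], doBest = TRUE)
rf <- randomForest(class ~ ., data = train, mtry = 6)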
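tuneRF chooses mtry by comparing out-of-bag (OOB) error estimates. A rough sketch of the same idea as a plain grid search is shown below. It is not tuneRF's step-wise search; it uses the built-in iris data (4 predictors) as a stand-in, and the seed, the grid, and ntree are arbitrary assumptions.

```r
library(randomForest)

set.seed(1)          # arbitrary seed; the selected mtry can change with it
mtry_grid <- 1:4     # iris has 4 predictors, so mtry can range from 1 to 4

# Fit one forest per candidate mtry and record its final OOB error rate
oob <- sapply(mtry_grid, function(m) {
  fit <- randomForest(Species ~ ., data = iris, mtry = m, ntree = 500)
  fit$err.rate[fit$ntree, "OOB"]
})
names(oob) <- mtry_grid

best_mtry <- mtry_grid[which.min(oob)]
print(oob)
print(best_mtry)
```

Because the forests are random, the best value is not stable across runs, so the 6 reported above should be read as the result of one run. Also note that with doBest = TRUE, tuneRF returns the forest fitted with the best mtry, so its return value can be assigned and used directly instead of refitting.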

Build a confusion matrix to evaluate the model.

> table(predict(rf, test), test$class)
   
     1  2  3
  1 11  0  0
  2  0 14  0
  3  0  0  9

Well done. Every prediction is correct.
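The result can be summarized numerically. The sketch below retypes the confusion matrix above by hand (so it runs without the fitted model) and computes overall accuracy, per-class recall, and an exact binomial confidence interval for the accuracy. The names `cm`, `recall`, and `ci` are illustrative.

```r
# Confusion matrix copied from the output above: rows = predicted, columns = actual
cm <- matrix(c(11,  0, 0,
                0, 14, 0,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("1", "2", "3"),
                             actual    = c("1", "2", "3")))

accuracy <- sum(diag(cm)) / sum(cm)   # share of correct predictions
recall   <- diag(cm) / colSums(cm)    # per class: correct / actual count

# Exact (Clopper-Pearson) 95% interval for the accuracy on 34 test samples
ci <- binom.test(sum(diag(cm)), sum(cm))$conf.int

print(accuracy)
print(recall)
print(ci)
```

The test set has only 34 samples, so even a perfect score leaves the lower end of the interval at roughly 0.9. A different random split may produce a few misclassifications.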

Let's also look at the importance of each explanatory variable.

> importance(rf)
    MeanDecreaseGini
V1        11.7547555
V2         1.3451287
V3         0.7519802
V4         1.1294739
V5         1.7996680
V6         2.4725046
V7        19.0491105
V8         0.3845735
V9         0.7663628
V10       11.6714396
V11        7.4439506
V12       13.1139074
V13       22.4820614

Proline, flavanoids, OD280/OD315, alcohol, and color intensity appear to contribute most to the wine class. Interesting.
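The ranking is easier to read once the values are sorted and labeled. The sketch below retypes the MeanDecreaseGini values from the output above and attaches the attribute names from the variable table; `gini`, `labels`, and `ranking` are illustrative names.

```r
# MeanDecreaseGini values copied from the importance(rf) output above
gini <- c(V1 = 11.7547555, V2 = 1.3451287, V3 = 0.7519802, V4 = 1.1294739,
          V5 = 1.7996680,  V6 = 2.4725046, V7 = 19.0491105, V8 = 0.3845735,
          V9 = 0.7663628,  V10 = 11.6714396, V11 = 7.4439506,
          V12 = 13.1139074, V13 = 22.4820614)

# Attribute names from the variable table
labels <- c(V1 = "Alcohol", V2 = "Malic acid", V3 = "Ash",
            V4 = "Alcalinity of ash", V5 = "Magnesium", V6 = "Total phenols",
            V7 = "Flavanoids", V8 = "Nonflavanoid phenols",
            V9 = "Proanthocyanins", V10 = "Color intensity", V11 = "Hue",
            V12 = "OD280/OD315 of diluted wines", V13 = "Proline")

ranked <- sort(gini, decreasing = TRUE)

ranking <- data.frame(
  variable  = names(ranked),
  attribute = unname(labels[names(ranked)]),
  share_pct = round(100 * unname(ranked) / sum(ranked), 1)  # share of total importance
)
print(ranking)
```

With the fitted model at hand, varImpPlot(rf) from the randomForest package draws the same ranking as a dot chart.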

...and that's all. Just a bit of fun.
