
Lesson 13: PCA (Principal Component Analysis) Summary, Intro to Machine Learning @ Udacity

Last updated at 2019-01-22, posted at 2019-01-20

Overview of Lesson 13

Principal Component Analysis
A technique often used to compress features and to visualize data.

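What PCA does

・Builds a new coordinate system from the coordinate system of the input data. (The axes are translated and rotated to fit the data; the new axes are called principal component axes.)
・The origin of the new coordinates is the center (mean) of the input data.
・The new X axis points in the direction in which the data varies the most.
・The new Y axis is orthogonal to the X axis.
・It also computes how important each axis is: each axis is an eigenvector of the covariance matrix, and its eigenvalue is the variance along that axis.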
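
The bullet points above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the way sklearn implements PCA internally; the function name pca_axes and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_axes(X):
    """Return the data center, eigenvalues (descending), and principal axes (rows)."""
    center = X.mean(axis=0)                 # new origin = center of the data
    cov = np.cov(X - center, rowvar=False)  # covariance of the centered data
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder by descending variance
    return center, eigvals[order], eigvecs[:, order].T

# Toy 2-D data stretched along the 45-degree line (hypothetical example data)
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)]) + [5.0, -3.0]

center, eigvals, axes = pca_axes(X)
scores = (X - center) @ axes.T   # coordinates in the new, rotated system
```

Here axes[0] is the new X axis, axes[1] is the new Y axis, and eigvals tells how much variance lies along each of them.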

I used the following as a reference.
Logics of Blue: 主成分分析の考え方 (The idea behind principal component analysis)
Now it makes sense.

Measurable vs Latent Features

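An intuitive way to understand how PCA is used in machine learning.
For example, suppose we run a regression that predicts house prices from the following features.
<Input features: measurable>
・Square footage (area in square feet)
・No. of rooms
・School ranking
・Neighborhood safety

What we really want out of these four is the following.
<Latent features (hypothesis): latent>
・Size of the house
・Quality of the neighborhood
With just this information, the regression should be possible (hypothesis).

Which tool is best for compressing the four input features into these two pieces of information?
✖ SelectPercentile → returns the top X% of the features. (Here, 50% would return two.)
〇 SelectKBest → returns exactly k features (we want exactly two, so k = 2).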

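From a large number of features, we want to extract only the information needed for pattern recognition.
PCA is used to build composite features, i.e. principal components.
Here we use it for dimensionality reduction.
PCA is an unsupervised learning algorithm.
Why project the data onto the new X axis, the one with the largest variance?
→ Because that axis carries the most information, i.e. projecting onto it loses the least.
The information lost is measured by the distance from each data point to its projection on the new X axis.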
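
The link between "largest variance" and "smallest information loss" can be checked numerically. The sketch below tries candidate axis directions one degree apart on made-up 2-D data and, for each, measures the variance along the axis and the mean squared distance from the points to the axis. Squared distance is an assumption here; it is the usual way to make "distance to the projection" precise.

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=300)
X = np.column_stack([2.0 * t, t + 0.3 * rng.normal(size=300)])  # hypothetical data
Xc = X - X.mean(axis=0)                                         # center the data

angles = np.linspace(0.0, np.pi, 180, endpoint=False)  # candidate directions
variance = []
loss = []
for a in angles:
    u = np.array([np.cos(a), np.sin(a)])   # unit vector of the candidate axis
    proj = Xc @ u                          # coordinate of each point along the axis
    residual = Xc - np.outer(proj, u)      # the part thrown away by the projection
    variance.append(np.mean(proj ** 2))
    loss.append(np.mean(np.sum(residual ** 2, axis=1)))

variance = np.array(variance)
loss = np.array(loss)
best_by_variance = int(np.argmax(variance))  # index of the max-variance direction
best_by_loss = int(np.argmin(loss))          # index of the min-loss direction
```

Variance plus loss is the same for every direction (it equals the total spread of the data), so the direction that maximizes one necessarily minimizes the other.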

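PCA as a general algorithm for feature transformation

In the example above, a person decided which latent features to extract from which input features. When the number of inputs explodes, as in face recognition, deciding this by hand is impossible. PCA extracts the latent features automatically.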

When to use PCA

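・When you want to know whether latent features exist.
・For dimensionality reduction: visualizing high-dimensional data, removing noise, and preprocessing before another algorithm.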

Mini-project

Use the Eigenfaces code on GitHub to learn about PCA as a general algorithm for feature transformation.

sklearn: Faces recognition example using eigenfaces and SVMs

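Eigenfaces processing flow

Download Labeled Faces in the Wild, a.k.a. LFW (images of people's faces with labels)
Each image has 1850 features
↓
PCA (Eigenfaces): the eigenvectors obtained by PCA are called eigenfaces
By default, the data projected onto 150 eigenfaces is fed to the SVM.
Dimensionality reduction from 1850 to 150
↓
Train the SVM
↓
Evaluate the SVM on the test set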
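
To make the 1850 → 150 step concrete, here is a shape-only walk-through with NumPy. It is a sketch under several assumptions: random numbers stand in for the face images, the image size of 50 x 37 pixels (which gives 1850 features) is assumed, PCA is done with a plain SVD, and the whitening used in the actual script is left out.

```python
import numpy as np

h, w = 50, 37                  # assumed image size; 50 * 37 = 1850 pixels
n_samples, n_components = 200, 150

rng = np.random.default_rng(0)
X_train = rng.random((n_samples, h * w))   # stand-in for flattened face images

mean_face = X_train.mean(axis=0)
# Rows of Vt are the principal axes, ordered by decreasing singular value
_, _, Vt = np.linalg.svd(X_train - mean_face, full_matrices=False)
components = Vt[:n_components]                          # shape (150, 1850)
eigenfaces = components.reshape((n_components, h, w))   # each axis viewed as an image
X_train_pca = (X_train - mean_face) @ components.T      # shape (200, 150), SVM input
```

Each row of X_train_pca describes one face with 150 numbers instead of 1850 pixel values, and that is what the SVM is trained on.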

Quiz 34

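Question: What fraction of the variance is explained by each of the top two principal components obtained with the procedure above?
Answer: 1st: 0.19346525, 2nd: 0.15116856
How:
After running eigenfaces.py, run the following.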


print pca.explained_variance_ratio_
[0.19346525 0.15116856 0.07083683 0.05951786 0.05157496 0.02887157
 0.02514487 0.02176468 0.02019385 0.01902124 0.01682215 0.01580599
 0.01223364 0.01087938 0.01064451 0.00979653 0.00892399 0.00854845
 0.00835711 0.00722635 0.00696569 0.00653856 0.00639558 0.00561316
 0.00531106 0.00520151 0.00507464 0.00484208 0.00443587 0.00417828
 0.00393703 0.00381725 0.00356056 0.00351196 0.0033455  0.00329926
 0.00314616 0.00296208 0.0029012  0.00284713 0.00279992 0.00267541
 0.00259882 0.0025839  0.00240904 0.00238963 0.00235378 0.00222558
 0.00217471 0.00216533 0.00208975 0.00205374 0.00200394 0.00197354
 0.00193785 0.00188711 0.00180117 0.00178833 0.00174779 0.0017299
 0.00165592 0.00162903 0.00157334 0.00153308 0.00149843 0.00147103
 0.00143757 0.00141774 0.00139575 0.00137988 0.00133799 0.00133058
 0.00128563 0.00125378 0.00124053 0.00121694 0.00120686 0.00117905
 0.00114871 0.00113216 0.00112037 0.00111245 0.00108908 0.00106488
 0.00105176 0.00103919 0.0010203  0.00101365 0.00099407 0.0009576
 0.00093842 0.00091282 0.00090397 0.00088409 0.00086667 0.00085584
 0.00083781 0.00083217 0.00082224 0.00079643 0.00077813 0.00077172
 0.00074961 0.00074222 0.00073455 0.00072277 0.00071809 0.00069917
 0.00069356 0.00068119 0.00065529 0.0006479  0.00063939 0.00062291
 0.00061409 0.00060696 0.00059474 0.00059068 0.0005887  0.00057458
 0.0005585  0.0005531  0.00054934 0.00054092 0.00052247 0.00051193
 0.00050467 0.00049339 0.00048404 0.00047758 0.00047176 0.00046963
 0.00045394 0.00044619 0.00043922 0.00042981 0.00042644 0.00041565
 0.00040899 0.00040353 0.00039503 0.00039135 0.00037254 0.00036907
 0.00035501 0.00034934 0.00033642 0.00032915 0.00032744 0.00032202]

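Note that there are 150 values, because the code specifies
n_components = 150
Reference: sklearn.decomposition.RandomizedPCA
Open question: the sklearn documentation above says that these explained variance ratios sum to 1, but why? Is the data standardized?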
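
A likely answer to the question above: each ratio is the variance along one component divided by the total variance of the data, so the ratios add up to 1 only when every component is kept. This is a normalization by the total variance, not a standardization of the input. With n_components = 150 out of 1850, the printed values should add up to somewhat less than 1. The sketch below reproduces the idea with plain NumPy rather than sklearn; the data and its scales are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 5 features with different spreads
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

cov = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # variance along each principal axis

ratio_all = eigvals / eigvals.sum()   # ratios when every component is kept
ratio_top2 = ratio_all[:2]            # what is left after truncating to 2 components

print(ratio_all.sum())    # 1 up to rounding error
print(ratio_top2.sum())   # below 1, because the discarded variance is missing
```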

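Quiz 35

In a multi-class problem like this one, it is harder than in the two-class case to judge intuitively how many principal components (PCs) are useful. A common way to decide how many to use is to look at the F1 score.
The lesson uses the F1 score, and this quiz asks you to find out whether a good classifier has a high or a low F1 score. Concretely, change the number of components (n_components) fed to the SVM and watch how the F1 score changes.

◆ Before the change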


n_components = 150

print "Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)
print "done in %0.3fs" % (time() - t0)

eigenfaces = pca.components_.reshape((n_components, h, w))

print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)

###############################################################################
# Train a SVM classification model

print "Fitting the classifier to the training set"
t0 = time()
param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_


###############################################################################
# Quantitative evaluation of the model quality on the test set

print "Predicting the people names on the testing set"
t0 = time()
y_pred = clf.predict(X_test_pca)
print "done in %0.3fs" % (time() - t0)

print classification_report(y_test, y_pred, target_names=target_names)
print confusion_matrix(y_test, y_pred, labels=range(n_classes))

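◆ After the change

# Quiz 36 code
# Turn n_components into a list
# and call each value bb.
# Loop over n_components with a for statement; the whole pipeline goes inside the loop.
# To show the current number of components, the line
#     print "n_components is: %d" % bb
# was added.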
n_components = [10, 15, 25, 50, 100, 250]

for bb in n_components:
    print "n_components is: %d" % bb 
    print "Extracting the top %d eigenfaces from %d faces" % (bb, X_train.shape[0])
    t0 = time()
    pca = RandomizedPCA(n_components=bb, whiten=True).fit(X_train)
    print "done in %0.3fs" % (time() - t0)
    
    eigenfaces = pca.components_.reshape((bb, h, w))
    
    print "Projecting the input data on the eigenfaces orthonormal basis"
    t0 = time()
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    print "done in %0.3fs" % (time() - t0)
    
    ###############################################################################
    # Train a SVM classification model
    
    print "Fitting the classifier to the training set"
    t0 = time()
    param_grid = {
        'C': [1e3, 5e3, 1e4, 5e4, 1e5],
        'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
    }
    # for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
    clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
    clf = clf.fit(X_train_pca, y_train)
    print "done in %0.3fs" % (time() - t0)
    print "Best estimator found by grid search:"
    print clf.best_estimator_
    
    
    ###############################################################################
    # Quantitative evaluation of the model quality on the test set
    
    print "Predicting the people names on the testing set"
    t0 = time()
    y_pred = clf.predict(X_test_pca)
    print "done in %0.3fs" % (time() - t0)
    
    print classification_report(y_test, y_pred, target_names=target_names)
    print confusion_matrix(y_test, y_pred, labels=range(n_classes))

◆ Summary of results

Figure 1

This is how it turned out.
n_components = 100 looks like the best setting.

Quiz 36

Question: Does a higher F1 score mean a better classifier?
Answer: Yes. See below.

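About the F1 score (F-measure)

According to the 朱鷺の杜Wiki (Toki no Mori Wiki),
F = 2 * Recall * Precision / (Recall + Precision)

Drawing on @conjugate_box's notes on the F-measure:

Recall = TP/(TP+FN)
Precision = TP/(TP+FP)

Larger TP and TN are better.

Smaller FP and FN are better.
As the names suggest, the former carry "True" because the true label and the predicted label agree, and the latter carry "False" because the two labels disagree. So we are happy when the counts that start with T go up.

Properties of the F-measure
The main advantage of the F-measure is probably that it combines the two metrics, Precision and Recall, into one. (Some may disagree, but that debate is outside the scope of this article.)
For example, when you propose a new model and there are two metrics, it can win on one and lose on the other. The F-measure uses both scores to define a single scalar that reflects overall quality, which makes it easier to discuss which model is better.

I see, so that is what it is for.
Result of rearranging the formula:

F = \frac{TP}{TP+\frac{1}{2}(FP+FN)}

Interpretation: FP and FN are both counts of wrong predictions, so if FP + FN is zero, the F-measure is 1.
If there are wrong predictions, half of their count is added to the denominator, so the more mistakes the classifier makes, the closer the F-measure gets to zero. This was very easy to follow. Thank you.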
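
As a quick check that the two forms of the formula agree, here is a small sketch in Python 3 (the division would need float() in Python 2). The function names and the counts are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

def f1_from_counts(tp, fp, fn):
    """The rearranged form: TP / (TP + (FP + FN) / 2)."""
    return tp / (tp + 0.5 * (fp + fn))

# Hypothetical counts, chosen only for illustration
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
f1_alt = f1_from_counts(tp=8, fp=2, fn=4)
```

With these counts both forms give 8/11, and setting FP and FN to zero gives exactly 1, which matches the interpretation above.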

Quiz 37

Question: Are there signs of overfitting as n_components increases?
Answer: Yes. In Figure 1 the F1 score drops at n_components = 250. The accuracy on the training dataset is probably still going up, though.

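Final video

You have to find for yourself the number of principal components (PCs) fed to the classifier that gives the highest F1 score.
- Just take the top 10%.
- Train the classifier on a small number of principal components, gradually increase the number, and stop when the F1 score drops. This is what the quiz above did.
- Feature selection: run the feature selection from Lesson 12 before PCA and feed everything it outputs to PCA.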
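
The second approach (increase the number of components until the F1 score drops) can be sketched as a small search loop. The helper name pick_n_components is hypothetical, and the F1 values in toy_f1 are made-up placeholders standing in for "fit PCA and the SVM, then evaluate"; they are not the results from Figure 1.

```python
def pick_n_components(candidates, score_fn):
    """Try candidates in increasing order and stop at the first drop in score."""
    best_n, best_score = None, float("-inf")
    for n in sorted(candidates):
        score = score_fn(n)
        if score < best_score:
            break              # the score dropped, so keep the previous best
        best_n, best_score = n, score
    return best_n, best_score

# Made-up F1 values for illustration only, not measured results
toy_f1 = {10: 0.30, 15: 0.55, 25: 0.70, 50: 0.80, 100: 0.85, 250: 0.82}
best_n, best_score = pick_n_components(toy_f1, toy_f1.get)
```

In a real run, score_fn would train the PCA and SVM pipeline for the given number of components and return the F1 score on held-out data.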

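Impressions

Features → (feature selection) → ☆ extract principal components with PCA → train the classifier
☆ In the PCA step, you decide how many principal components to feed to the classifier while watching the F1 score.
The lesson was not that large and was easy to work through.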

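English words I looked up in this lesson

orthogonal・・・adj
1 [mathematics] orthogonal (a: at right angles, perpendicular; b: of two functions in a function space with an inner product, having an inner product of 0)
2 [crystallography] orthogonal (crystal axes meeting at right angles)
3 [statistics] orthogonal (a: of variables, statistically independent; b: in experimental design, of a layout, orthogonal)
coordinate system・・・n, a system of coordinates
truncate・・・v, to cut the top off, to shorten by cutting
eigenvalue decomposition・・・n, decomposition of a matrix into eigenvalues and eigenvectors
latent・・・adj, hidden, not visible, lying dormant
quantitative・・・adj, of quantity, relating to quantity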
