Practicing K-means

Posted at 2021-02-04

#Introduction
I have started studying metric learning, so I tried out K-means using scikit-learn.

#Exploratory Data Analysis (EDA)
Load the Iris dataset.

from sklearn.datasets import load_iris
iris_data = load_iris()

Here is the description of iris_data.

print(iris_data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    (omitted)..

The data is stored in iris_data.data. Its shape is (n_samples, n_features).

iris_data.data.shape
(150, 4)

target_names (the ground-truth class labels) and feature_names (the features) can be checked with the following commands.

print(iris_data.target_names)
['setosa' 'versicolor' 'virginica']

print(iris_data.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Let's print the data and the labels.
Each row shows the four feature values ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] followed by the label, where ['setosa' 'versicolor' 'virginica'] is encoded as [0,1,2].

for data,target in zip(iris_data.data, iris_data.target):
    print(data,target)

[5.1 3.5 1.4 0.2] 0
[4.9 3.  1.4 0.2] 0
[4.7 3.2 1.3 0.2] 0
[4.6 3.1 1.5 0.2] 0
[5.  3.6 1.4 0.2] 0
(omitted)...
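To make the integer labels easier to read, the sketch below shows one way to turn them back into class names by indexing target_names with target. The variable name target_labels is only an illustration and does not appear in the original code.

```python
from sklearn.datasets import load_iris

iris_data = load_iris()

# Index the array of class names with the integer labels
# to get one class name per sample (target_labels is an illustrative name).
target_labels = iris_data.target_names[iris_data.target]

# Print the first few samples together with their class names.
for data, name in zip(iris_data.data[:3], target_labels[:3]):
    print(data, name)
```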

Let's draw a scatter plot of sepal length (cm) against sepal width (cm).

import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(7.5, 7.5))
# Column 0: 'sepal length (cm)', column 1: 'sepal width (cm)'
ax.scatter(iris_data.data[:,0], iris_data.data[:,1])
ax.set_title('Iris')

#K-means

Set the number of clusters to 3.


from sklearn.cluster import KMeans

n_cluster = 3
model = KMeans(n_clusters=n_cluster)
iris_data.data.shape #shape -> (n_samples, n_features)

(150, 4)

Run the training. I had wondered why fit() means training in Keras; my guess is that the name was borrowed from scikit-learn.

model.fit(X=iris_data.data) 

(Result)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
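As a rough idea of what fit() does for K-means, the sketch below runs the basic Lloyd iteration in NumPy: assign each sample to its nearest center, then move each center to the mean of its samples. It is a simplified illustration, not scikit-learn's implementation. The function name kmeans_sketch and its parameters are hypothetical, and it assumes plain random initialization instead of the 'k-means++' default shown above, with no n_init restarts.

```python
import numpy as np
from sklearn.datasets import load_iris


def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd iteration (illustrative only, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from randomly chosen samples rather than k-means++.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: find the nearest center for every sample.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned samples
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


X = load_iris().data
labels, centers = kmeans_sketch(X, n_clusters=3)
print(centers)
```

Because the starting centers are random, a single run like this can end in a worse local optimum than scikit-learn, which by default repeats the procedure n_init times and keeps the best run.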

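The clustering result is available in labels_. Let's print the ground truth alongside it.
The result is only so-so. Note that cluster numbers are arbitrary: here cluster 2 mostly corresponds to class 1 (versicolor) and cluster 1 to class 2 (virginica). Setosa (0) is separated cleanly, but versicolor and virginica are partly mixed.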

print(model.labels_)
print(iris_data.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
 1 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
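Since the numbers in labels_ are assigned arbitrarily, comparing them with target element by element can be misleading. One possible way to quantify the agreement is sketched below, using a contingency table and the adjusted Rand index from sklearn.metrics. The values random_state=0 and n_init=10 are assumptions chosen only to make the run repeatable; the original code does not fix a seed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

iris_data = load_iris()

# random_state and n_init are fixed here only so the sketch is repeatable.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris_data.data)

# Rows: true classes, columns: cluster numbers.
table = contingency_matrix(iris_data.target, model.labels_)
print(table)

# Adjusted Rand index: 1.0 means identical groupings,
# values near 0 mean chance-level agreement.
ari = adjusted_rand_score(iris_data.target, model.labels_)
print(ari)
```

Both measures ignore how the clusters are numbered, so they are unaffected by the swap between 1 and 2.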

Get the centroid of each cluster. Columns are features and rows are clusters.

print(model.cluster_centers_)

[[5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]
 [5.9016129  2.7483871  4.39354839 1.43387097]]
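Each row of cluster_centers_ is meant to be the mean of the samples assigned to that cluster. The sketch below recomputes those means from labels_ for comparison. The name manual_centers is illustrative, and random_state=0 and n_init=10 are assumptions for repeatability. The two results should agree closely; small differences are possible depending on how the run converged.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris_data = load_iris()
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris_data.data)

# Mean of the samples assigned to each cluster, one row per cluster.
manual_centers = np.array([
    iris_data.data[model.labels_ == k].mean(axis=0)
    for k in range(model.n_clusters)
])

# Largest absolute difference between the recomputed and the fitted centers.
print(np.abs(manual_centers - model.cluster_centers_).max())
```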

Plot the clustering result together with the cluster centroids.
The X axis is sepal length (cm) and the Y axis is sepal width (cm).

import numpy as np

f, ax = plt.subplots(figsize=(7,7))
rgb = np.array(['r','g','b'])
# Color each sample by its assigned cluster (model.labels_)
ax.scatter(iris_data.data[:,0], iris_data.data[:,1], color=rgb[model.labels_])
ax.scatter(model.cluster_centers_[:,0], model.cluster_centers_[:,1], marker='*', s=250, color='black', label='Clusters')
ax.legend()
