Clustering on Databricks

Posted at 2023-05-11

This article was a very helpful reference.

The MLlib documentation for KMeans is here.

k-means is one of the most commonly used clustering algorithms; it groups data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of k-means++ called kmeans||.
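As a rough illustration of what k-means does, here is a minimal pure-Python sketch of the classic Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its points). It is not MLlib's implementation: it is not distributed, and it uses a naive "first k points" initialization instead of kmeans||. The function name `lloyd_kmeans` and the sample points are made up for this example.

```python
import math


def lloyd_kmeans(points, k, max_iter=20):
    """Toy k-means (Lloyd's algorithm) on a list of equal-length tuples."""
    # Naive initialization: use the first k points as centers
    # (MLlib uses kmeans|| instead; this is only for illustration).
    centers = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
    return centers, assignments


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
centers, labels = lloyd_kmeans(points, k=2)
print(labels)
```

The number of clusters `k` has to be chosen up front, which is why the silhouette comparison later in this article matters.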

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.

I follow the article above, while also making use of some newer features.

Importing the libraries

Python

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession

Downloading the data

Workspace files are now supported, so you can download a file without going through DBFS. Handy.

Python

import urllib.request  # urlretrieve lives in the urllib.request submodule

urllib.request.urlretrieve("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", "/Workspace/Users/takaaki.yayoi@databricks.com/20230511_spark_clustering/winequality-red.csv")
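`urlretrieve` writes straight to the destination path, so it raises an error if the target folder does not exist yet. A small wrapper can create the folder first. This is a sketch, and the helper name `download_to` is hypothetical, not part of any library.

```python
import os
import urllib.request


def download_to(url, dest_path):
    """Download url to dest_path, creating the parent directory if needed."""
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # urlretrieve returns a (local_filename, headers) tuple.
    local_path, _headers = urllib.request.urlretrieve(url, dest_path)
    return local_path
```

Calling it with the UCI URL and the workspace path used above would behave like the original one-liner, assuming the notebook has write access to that folder.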

Loading the data

You can also load a workspace file directly into a Spark DataFrame.

Python

sp_path = 'file:/Workspace/Users/takaaki.yayoi@databricks.com/20230511_spark_clustering/winequality-red.csv'
sdf = spark.read.csv(sp_path, header=True, sep=';', inferSchema=True)
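The `sep=';'` argument matters because this file is delimited by semicolons rather than commas. The following sketch shows the effect with Python's standard `csv` module on a tiny inline sample. The sample keeps only three columns and is illustrative, and the float conversion is just a crude stand-in for `inferSchema=True`.

```python
import csv
import io

# Illustrative sample in the same semicolon-delimited layout (subset of columns).
sample = '"fixed acidity";"pH";"quality"\n7.4;3.51;5\n7.8;3.2;5\n'

# With the right delimiter, each header becomes its own column.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))

# Crude stand-in for inferSchema=True: convert every field to float.
typed = [{name: float(value) for name, value in row.items()} for row in rows]

# With the default comma delimiter, the whole header collapses into one field.
comma_fields = next(csv.reader(io.StringIO(sample)))

print(list(typed[0].keys()), len(comma_fields))
```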

Python

display(sdf)

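This dataset describes the chemical properties of wine.

Generating the vectors

The featuresCol input of MLlib's KMeans must be a Vector, so use VectorAssembler to convert the DataFrame columns into a single vector column.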

Python

from pyspark.ml.feature import VectorAssembler

train = VectorAssembler(inputCols=sdf.columns, outputCol="features").transform(sdf)
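Conceptually, VectorAssembler takes the listed input columns of each row, in order, and packs them into one vector. The pure-Python analogy below is only meant to show that idea. The helper `assemble` and the sample row are made up, and the real transformer produces Spark ML vectors rather than Python lists.

```python
def assemble(row, input_cols):
    """Toy analogue of VectorAssembler: pick columns in order into one list."""
    return [float(row[name]) for name in input_cols]


# Made-up row with three of the wine columns, for illustration only.
row = {"fixed acidity": 7.4, "pH": 3.51, "quality": 5}
features = assemble(row, list(row.keys()))
print(features)
```

Note that `inputCols=sdf.columns` passes every column, including `quality`, into the feature vector, which is why the cluster centers printed later have 12 values.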

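Comparing silhouette scores

The higher the silhouette score, the better the objects in each cluster match their own cluster. Comparing it across values of k helps identify an appropriate number of clusters.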
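To make the score concrete: for each point, let a be its mean distance to the other points in its own cluster and b the smallest mean distance to the points of any other cluster. The point's silhouette is (b - a) / max(a, b), and the overall score is the mean over all points. The sketch below computes this naively with squared Euclidean distance. It is an illustration only, not how Spark computes the metric at scale, and `silhouette_score` and the sample points are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette over all points (needs at least two clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(sq_dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(sq_dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up points: two tight pairs far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # grouping that splits the pairs
print(good, bad)
```

The natural grouping scores close to 1, while the grouping that splits each pair scores below 0.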

Python

for k in range(2,8):
  kmeans = KMeans().setK(k).setSeed(1)
  model = kmeans.fit(train)
  predictions = model.transform(train)
  evaluator = ClusteringEvaluator()
  silhouette = evaluator.evaluate(predictions)
  print("With K={}".format(k))
  print("Silhouette with squared euclidean distance = " + str(silhouette))
  print('--'*30)
  print("High value indicates that the object is well matched to its own cluster")

With K=2
Silhouette with squared euclidean distance = 0.772194945850551
------------------------------------------------------------
High value indicates that the object is well matched to its own cluster
With K=3
Silhouette with squared euclidean distance = 0.6895576732268774
------------------------------------------------------------
High value indicates that the object is well matched to its own cluster
With K=4
Silhouette with squared euclidean distance = 0.656312517141184
------------------------------------------------------------
High value indicates that the object is well matched to its own cluster
With K=5
Silhouette with squared euclidean distance = 0.6207361460993716
------------------------------------------------------------
High value indicates that the object is well matched to its own cluster
With K=6
Silhouette with squared euclidean distance = 0.613234543966971
------------------------------------------------------------
High value indicates that the object is well matched to its own cluster
With K=7
Silhouette with squared euclidean distance = 0.5619709932825219
------------------------------------------------------------
High value indicates that the object is well matched to its own cluster
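The scores fall steadily as k grows, which is why the next section uses k=2. Picking the best k from the printed results can be written as a one-liner. The values below are copied from the output above and rounded to four decimals.

```python
# Silhouette scores copied from the output above (rounded to 4 decimals).
silhouette_by_k = {2: 0.7722, 3: 0.6896, 4: 0.6563, 5: 0.6207, 6: 0.6132, 7: 0.5620}

# Choose the k with the highest silhouette score.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k)  # prints 2
```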

Creating the model (k=2)

Python

# Train the k-means model
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(train)

Note that the model is tracked in MLflow every time you fit. This is handy too.

Python

# Run the predictions
predictions = model.transform(train)

Python

# Display the cluster centers
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[ 8.42425785  0.51933418  0.26653944  2.39427481  0.08544614 12.37192536
 30.34435963  0.99667684  3.31552163  0.65653096 10.5402177   5.72434266]
[ 8.02595238  0.55164286  0.28342857  2.94452381  0.0931381  25.70833333
 91.72857143  0.99694274  3.2987381   0.66269048 10.09388889  5.38809524]
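Reading the two centers side by side, they differ mostly in the sulfur dioxide columns. That is probably because the features were not scaled, so columns with large numeric ranges dominate the distance. The sketch below ranks the columns by the gap between the two centers. The center values are copied from the output above, and the column order is assumed to follow the header of winequality-red.csv.

```python
# Column order assumed to match sdf.columns for winequality-red.csv.
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]
# Cluster centers copied from the printed output above.
center_0 = [8.42425785, 0.51933418, 0.26653944, 2.39427481, 0.08544614, 12.37192536,
            30.34435963, 0.99667684, 3.31552163, 0.65653096, 10.5402177, 5.72434266]
center_1 = [8.02595238, 0.55164286, 0.28342857, 2.94452381, 0.0931381, 25.70833333,
            91.72857143, 0.99694274, 3.2987381, 0.66269048, 10.09388889, 5.38809524]

# Rank columns by the absolute gap between the two centers, largest first.
gaps = sorted(
    ((abs(a - b), name) for name, a, b in zip(columns, center_0, center_1)),
    reverse=True,
)
for gap, name in gaps[:3]:
    print(f"{name}: {gap:.2f}")
```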

Visualization

Python

countByCluster = predictions.groupBy("prediction").count()
display(countByCluster)

The data has been split into clusters.

Databricks Quickstart Guide

Databricks Free Trial
