DatabricksでSparkNLPとMLLibを使って分散トピックモデリングをやってみる(日本語編)

Posted at 2022-05-17

こちらでトライした分散LDAの日本語対応版です。

ノートブックはこちらです。

クラスターの設定とライブラリの設定は前回と同じです。

データのロード

ダウンロード - 株式会社ロンウイットから、livedoor ニュースコーパス ldcc-20140209.tar.gzをダウンロードします。

以下のようなPythonスクリプトを使って記事のタイトルを抽出します。

extract_titles.py

import glob

files = glob.glob("./topic-news/topic-news-*.txt")
print("publish_date,headline_text")
for file in files:
    #print(file)
    f = open(file, 'r')
    datalist = f.readlines()
    print(datalist[1].strip(), ",",  datalist[2].strip())

ターミナルでスクリプトを実行してCSVファイルを作成します。

python extract_titles.py > topic-news.csv

以下の手順でCSVファイルをDatabricksにアップロードします。

サイドメニューのデータをクリックし、アップロードボタンをクリックします。
アップロードするパスを選択して、CSVファイルをドラッグ&ドロップします。

　以下の例では、下のパスにアップロードしています。

dbfs:/FileStore/shared_uploads/takaaki.yayoi@databricks.com/news/topic_news.csv

ファイルをロードして中身を確認します。

Python

file_location = "dbfs:/FileStore/shared_uploads/takaaki.yayoi@databricks.com/news/topic_news.csv"
file_type = "csv"

# CSVのオプション
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

# レコード数の確認
df.count()

Python

display(df)

Spark NLPを用いた前処理パイプライン

日本語に対応している以下のアノテーターを使用します。

WordSegmenterModel: Japanese Word Segmentation- Spark NLP Model
StopWordsCleaner: Stopwords Remover for Japanese language (153 entries)- Spark NLP Model

Python

# Spark NLPはドキュメントに変換する入力データフレームあるいはカラムが必要です
document_assembler = DocumentAssembler() \
    .setInputCol("headline_text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

# 文をトークンに分割(array)
tokenizer = Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

# 日本語の文をトークンに分割(array)
word_segmenter = WordSegmenterModel.pretrained('wordseg_gsd_ud', 'ja')\
        .setInputCols(["document"])\
        .setOutputCol("token")    

# 不要な文字やゴミを除外
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

# 日本語ストップワードの除外
stop_words_remover = StopWordsCleaner.pretrained("stopwords_iso", "ja") \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens")

# Finisherは最も重要なアノテーターです。Spark NLPはデータフレームの各行をドキュメントに変換する際に自身の構造を追加します。Finisherは期待される構造、すなわち、トークンの配列に戻す助けをしてくれます。 
finisher = Finisher() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCols(["tokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(False)

# それぞれのフェーズが順番に実行されるようにパイプラインを構築します。このパイプラインはモデルのテストにも使うことができます。
nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            word_segmenter,
            normalizer,
            stop_words_remover,
            finisher])

# パイプラインのトレーニング
nlp_model = nlp_pipeline.fit(df)

# データフレームを変換するためにパイプラインを適用します。
processed_df  = nlp_model.transform(df)

# NLPパイプラインは我々にとって不要な中間カラムを作成します。なので、必要なカラムのみを選択します。
tokens_df = processed_df.select('publish_date','tokens').limit(10000)

display(tokens_df)

特徴量エンジニアリング

こちらのロジックも前回から変更はありません。

Python

from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=500, minDF=3.0)

# モデルのトレーニング
cv_model = cv.fit(tokens_df)

# データを変換します。出力カラムが特徴量となります。
vectorized_tokens = cv_model.transform(tokens_df)

display(vectorized_tokens)

以降のロジックも前回から変更ありませんが、日本語記事からトピックを抽出することができています。

しかし、1文字のキーワードが抽出されていたりするので、NLPパイプラインを含めたチューニングを行う余地はまだあります。

Databricks 無料トライアル

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up