
🔰 How to pick just one item from a set of similar words with natural language processing

Last updated at 2024-07-02 / Posted at 2024-03-20

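Introduction

I enjoy building Google apps that have a news curation feature.
News curation apps reportedly gather content from newspapers and other media, then use AI and similar techniques to pick the articles that deliver the most value to the user (according to what I have read online).
While studying AI programming, I wrote code for the feature named in the title.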

You can install the app here 👇

Android

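Implementation flow

Including some extra background, I implemented the feature in the following steps.
The techniques I chose are marked with ⭐️.

1. Preprocessing

⭐️Cleaning (removing HTML tags and special characters)
⭐️Tokenization
⭐️Lemmatization
⭐️Stop word removal

2. Vectorization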

⭐️TF-IDF
・Word2Vec
・Doc2Vec
・BERT
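
To make the TF-IDF choice concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the default behavior of scikit-learn's `TfidfVectorizer`. The helper name `tfidf_vectors` and the sample headlines are hypothetical and are not part of this article's code.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2 normalization."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF (assumed to mirror TfidfVectorizer's defaults).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocabulary}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]
        # Scale each row to unit length (L2 normalization).
        norm = math.sqrt(sum(v * v for v in row))
        vectors.append([v / norm if norm else 0.0 for v in row])
    return vocabulary, vectors


# Hypothetical, already-tokenized headlines (not the article's test data).
docs = [
    ["japan", "wins", "wbc"],
    ["samurai", "japan", "wins", "wbc"],
    ["stock", "market", "falls"],
]
vocabulary, vectors = tfidf_vectors(docs)
```

In the second headline, "samurai" appears in only one document, so it receives a larger weight than "japan", which appears in two. That is the property that makes TF-IDF useful for telling similar articles apart.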

3. Similarity calculation

⭐️Euclidean distance
・Cosine similarity
・Manhattan distance
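
The three measures listed above can be compared with a short pure-Python sketch. The function names and sample vectors are hypothetical and only illustrate the definitions; they are not taken from this article's code.

```python
import math


def euclidean_distance(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    # Sum of the absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; it ignores their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two vectors pointing in the same direction but with different lengths.
a = [1.0, 0.0, 2.0]
b = [2.0, 0.0, 4.0]

d_euclid = euclidean_distance(a, b)
d_manhattan = manhattan_distance(a, b)
s_cosine = cosine_similarity(a, b)
```

Here the cosine similarity is 1 (up to floating-point rounding) because the directions match, while both distances are non-zero because the lengths differ. For vectors of unit length, the squared Euclidean distance equals `2 - 2 * cosine similarity`, so the two measures rank neighbors in the same order.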

4. Clustering

⭐️K-means
・DBSCAN
・Hierarchical clustering

5. Article selection

Select the article closest to the cluster center
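
As a small illustration of this selection step, the sketch below picks, for each cluster, the vector nearest to that cluster's center. It is pure Python with made-up 2D points; the names `centroid` and `pick_representatives` are hypothetical, and the labels are assigned by hand rather than produced by K-means.

```python
import math


def centroid(points):
    """Mean of a list of equal-length vectors."""
    return [sum(axis) / len(points) for axis in zip(*points)]


def pick_representatives(vectors, labels, centers):
    """Map each cluster id to the index of the vector nearest its center."""
    representatives = {}
    for cluster_id, center in enumerate(centers):
        members = [i for i, label in enumerate(labels) if label == cluster_id]
        if not members:
            continue  # An empty cluster has no representative.
        representatives[cluster_id] = min(
            members, key=lambda i: math.dist(vectors[i], center)
        )
    return representatives


# Made-up 2D points with hand-assigned cluster labels.
vectors = [
    [0.0, 0.0], [1.0, 0.0], [0.4, 0.0],
    [10.0, 10.0], [11.0, 10.0], [10.2, 10.0],
]
labels = [0, 0, 0, 1, 1, 1]
centers = [
    centroid([v for v, l in zip(vectors, labels) if l == 0]),
    centroid([v for v, l in zip(vectors, labels) if l == 1]),
]
representatives = pick_representatives(vectors, labels, centers)
```

The important detail is that the membership filter uses the current cluster id on every pass of the loop; comparing against a fixed number would return the same article for every cluster.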

Libraries

$ pip list
Package            Version
------------------ ---------
Flask              2.3.3
ginza              5.1.2
nltk               3.8.1
numpy              1.25.2
pip                23.2.1
pytest             7.4.2
scikit-learn       1.3.0

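I use scikit-learn as the machine learning library.
I use GiNZA as the Japanese natural language model.

Installing the libraries in a virtual environment lets you use them without polluting your local environment.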

Code

The code is built with Flask.
※I have written docstrings; if you find typos or would like something changed, please send an edit request.

1. Preprocessing

preprocessing.py

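import re
import nltk   # Natural language processing library
import spacy
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup

"""
Text preprocessing
・Text cleaning (removing HTML tags, special characters, etc.)
・Tokenization
・Stop word removal
・Stemming and lemmatization
"""

# Download the required resources
nltk.download('wordnet')
nltk.download('punkt')  # Tokenizer data used by nltk.word_tokenize (nltk 3.8.1)

# Stop word setup
nlp: spacy.language.Language = spacy.load('ja_ginza')  # Load the GiNZA Japanese model
stop_words = nlp.Defaults.stop_words  # Japanese words that matter little for text analysis

# Initialize the lemmatizer
lematizer = WordNetLemmatizer()  # Converts words to their dictionary form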

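def remove_html_tags(text):
    """
    Remove HTML tags

    Args:
        text (str): Web page element

    Returns:
        str: The element without tags
    """
    bs = BeautifulSoup(text, 'html.parser')
    return bs.get_text()

def remove_special_charcters(text):
    """
    Remove special characters

    Args:
        text (str): Web page element

    Returns:
        str: The element with special characters removed
    """
    # Keep only ASCII letters and whitespace (Japanese characters are removed too)
    return re.sub(r'[^a-zA-Z\s]', '', text)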

def prepprocess(text):
    """
    Text preprocessing
    ・Remove HTML tags
    ・Remove special characters
    ・Tokenize the text
    ・Lemmatize the tokens (convert words to their dictionary form)
    ・Remove stop words and non-alphabetic tokens

    Args:
        text (str): Web page element

    Returns:
        str: The preprocessed text
    """

    text = remove_html_tags(text)
    text = remove_special_charcters(text)
    tokens = nltk.word_tokenize(text)
    # Lemmatize each token, skipping stop words and non-alphabetic tokens
    tokens = [
        lematizer.lemmatize(token) for token in tokens
        if token.lower() not in stop_words and token.isalpha()
    ]
    return ' '.join(tokens)
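
One thing to be aware of: the pattern `[^a-zA-Z\s]` in `remove_special_charcters` also strips hiragana, katakana, and kanji, so only ASCII words such as "WBC" survive the cleaning step. If the Japanese text should be kept, one possible alternative is sketched below. The function name `remove_special_characters_keep_japanese` and the chosen Unicode ranges are my own assumptions, not part of the original code, and the result would still need a Japanese-aware tokenizer afterwards, because `nltk.word_tokenize` does not segment Japanese.

```python
import re

# Hypothetical alternative: keep ASCII letters, hiragana (U+3040-U+309F),
# katakana (U+30A0-U+30FF), common kanji (U+4E00-U+9FFF), and whitespace.
# Everything else (digits, punctuation, symbols) is dropped.
_KEEP_PATTERN = re.compile(
    r'[^a-zA-Z\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\s]'
)


def remove_special_characters_keep_japanese(text):
    """Remove special characters while preserving Japanese text."""
    return _KEEP_PATTERN.sub('', text)


sample = '日本がWBCで優勝しました！'
# Behavior of the pattern used in preprocessing.py: only "WBC" remains.
original = re.sub(r'[^a-zA-Z\s]', '', sample)
# Behavior of the alternative: only the full-width "！" is removed.
alternative = remove_special_characters_keep_japanese(sample)
```

With this change the TF-IDF step would see the Japanese words as well, provided the tokens are produced by a tokenizer that understands Japanese, such as the GiNZA model that is already loaded for the stop words.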

2. Vectorization

vectorization.py

from sklearn.feature_extraction.text import TfidfVectorizer  # Vectorization module
from app.services.preprocessing import prepprocess

def vectorize_texts(articles):
    """
    Convert text data into numeric vectors

    Args:
        articles (List[str]): Web page elements

    Returns:
        scipy.sparse.csr_matrix: TF-IDF vectors of the text data
        TfidfVectorizer: The fitted instance used to vectorize the text data
    """
    # Preprocess the texts
    texts = [prepprocess(article) for article in articles]
    vectorizer = TfidfVectorizer()
    # Convert to TF-IDF vectors
    X = vectorizer.fit_transform(texts)
    return X, vectorizer

4. Clustering

clustering.py

from sklearn.cluster import KMeans  # Clustering

def cluster_texts(X, n_clusters=10):
    """
    Predict a cluster for each text

    Args:
        X (scipy.sparse.csr_matrix): TF-IDF vectors of the text data
        n_clusters (int, optional): Number of clusters. Defaults to 10

    Returns:
        tuple:
            numpy.ndarray: Cluster number of each text
            KMeans: The fitted model
    """
    # Create the clustering model
    kmeans = KMeans(n_clusters=n_clusters)
    # Fit on the text data and predict which cluster each text belongs to
    clusters = kmeans.fit_predict(X)  # Array with the cluster number of each text
    return clusters, kmeans

3. Similarity calculation & 5. Article selection

representative.py

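import numpy as np

def select_representative_articles(X, articles, clusters, kmeans):
    """
    Pick the article that best represents each cluster

    Args:
        X (scipy.sparse.csr_matrix): TF-IDF vectors of the text data
        articles (List[str]): The articles that were clustered
        clusters (numpy.ndarray): Cluster number assigned to each article
        kmeans (KMeans): The fitted model

    Returns:
        list: The representative article of each cluster
    """
    # Get the number of clusters from the fitted model
    n_clusters = kmeans.n_clusters
    representative_articles = []

    for i in range(n_clusters):
        # Get the indices of the articles that belong to cluster i
        cluster_indices = np.where(clusters == i)[0]
        # Skip a cluster that has no articles (np.argmin fails on an empty array)
        if cluster_indices.size == 0:
            continue
        # Get the centroid of this cluster
        cluster_center = kmeans.cluster_centers_[i]
        # Compute the Euclidean distance between each article vector and the centroid
        distances = np.linalg.norm(
            X[cluster_indices].toarray() - cluster_center,
            axis=1
        )
        # Get the index of the article closest to the centroid
        representative_index = cluster_indices[np.argmin(distances)]
        # Add the selected representative article to the list
        representative_articles.append(articles[representative_index])

    # Return the list of representative articles, one per non-empty cluster
    return representative_articles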

Unit tests

I use pytest as the test framework.

test_ut.py

import pytest
from app.services import preprocessing, vectorization, clustering, representative
from tests import test_data

# Test for preprocessing.py
def test_preprocessing():
    text = test_data.test_articles[0]
    processed = preprocessing.prepprocess(text)

    assert "<p>" not in processed
    assert "WBC" in processed

# Test for vectorization.py
def test_vectorization():
    X, vectorizer = vectorization.vectorize_texts(test_data.test_articles)

    assert X.shape[0] == len(test_data.test_articles)
    assert "wbc" in vectorizer.get_feature_names_out()

# Test for clustering.py
def test_clustering():
    X, _ = vectorization.vectorize_texts(test_data.test_articles)
    clusters, kmeans = clustering.cluster_texts(X)

    assert len(clusters) == len(test_data.test_articles)
    assert kmeans.n_clusters == 10

# Test for representative.py
def test_representative():
    X, _ = vectorization.vectorize_texts(test_data.test_articles)
    clusters, kmeans = clustering.cluster_texts(X)
    reps = representative.select_representative_articles(
        X,
        test_data.test_articles,
        clusters,
        kmeans
    )

    assert len(reps) == kmeans.n_clusters

Test data

test_data.py

test_articles = [
    "<p>日本がWBCで優勝しました</p>",
    "<p>侍ジャパンがWBCで優勝！</p>",
    "<p>WBCの優勝国は日本</p>",
    "<p>優勝した日本の野球は強い</p>",
    "<p>日本、WBCの頂点に！</p>",
    "<p>WBCでの日本の活躍が注目される</p>",
    "<p>侍ジャパンの優勝、世界が注目</p>",
    "<p>WBC優勝、日本の偉業</p>",
    "<p>WBCで見せた日本の力</p>",
    "<p>日本のチームがWBCで輝く</p>",
]

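Conclusion

This time I wrote an article about natural language processing.
I had never written AI-related code before, so I started without really understanding it and stumbled right away at the text preprocessing stage.
I installed nltk, a natural language processing library, and while implementing the check for stop words in the text I looked for a Japanese language dataset. Datasets for English, Spanish, and other languages were available, but Japanese support was lagging behind, so I worked around it by using spacy.
People often say that APIs are released later in Japan, and this made me feel that some parts of the AI field are slow as well.
Incidentally, GiNZA, which I used this time, was developed with the involvement of Recruit and others, and it seems to be one of the most widely used models in Japan.
What I originally intended to build would fetch the HTML of web pages with BeautifulSoup4, clean it in the preprocessing step, vectorize it, and then select the representative articles.
That is why some of the comments in the source code read the way they do.
Since this was written by a beginner, I may get harsh remarks from veterans, which seems to be a habit of the IT industry. If there were a block feature like the one on X, I would want to use it lol. I wish we could all just improve while feeling good about it.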
