
Classifying machine-learning-related information with a topic model

Posted at 2016-11-09


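Several months have passed since I actually ran this investigation, so some details may be slightly out of date.
I should also say up front that the results did not turn out to be satisfying.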

I am not yet used to either Qiita or Python, so some of what I wrote may be odd; I would appreciate comments pointing out such places.

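The processing described in this article flows as follows.

　❶ Crawling the sites
　　Crawled documents (articles) are placed under the bookmarks.crawled directory.
　　　↓
　❷ Turning articles into Python objects
　　Each document (article) is turned into a Python object.
　　　↓
　❸ Turning the corpus into a Python object
　　The whole set of documents is turned into a Python object as a corpus.
　　　↓
　❹ Classification with a topic model
　　This corpus is used to attempt classification with a topic model.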

The thesaurus part is described out of chronological order, but otherwise I explain things in sequence as far as possible.

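❶ Crawling the sites

In step ❷ of 「機械学習関連情報の収集と分類(構想)」 (Collecting and classifying machine-learning-related information: a concept), the scenario was to feed the results collected by FESS directly as input. This time, instead, the content downloaded with crawl.rb from 「ショートカット・ディレクトリとプレインテキストの変換」 (Converting between shortcut directories and plain text) is placed under the bookmarks.crawled directory and used as input.

This is because feeding the FESS results in directly would mean that

・old documents expire, and
・duplicate or low-importance articles that were excluded during manual classification come back.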

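❷ Turning articles into Python objects

The HTML files under the bookmarks.crawled directory are read and stored in objects of a Python Article class.

  Article class
    Attributes
      path      path of the HTML file
      contents  text of the HTML file with HTML tags and the like removed
      tokens    list of the nouns found in contents (list of string)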
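
Before contents can be filled in, the file has to be decoded, and the crawled pages do not share one encoding. The sketch below illustrates the "try each candidate in order" idea that Article.get_contents relies on. The function name decode_with_fallback and the sample string are only for illustration and are not part of the actual scripts.

```python
# Candidate encodings, tried in order. latin_1 accepts any byte sequence,
# so it acts as a last-resort catch-all.
ENCODINGS = ["utf-8", "cp932", "euc-jp", "iso-2022-jp", "latin_1"]

def decode_with_fallback(data, encodings=ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes data."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")

# Hypothetical sample: bytes of a Shift_JIS (cp932) page
text, used = decode_with_fallback("機械学習".encode("cp932"))
print(used, text)
```

Because the first successful decode wins, the order of the list matters: a file can decode without error under the wrong encoding, so stricter encodings should come first.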

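★Extracting the body text from HTML files

The library survey in 「PythonでブログのHTMLから本文抽出 2015」 (Extracting body text from blog HTML with Python, 2015) is a useful reference.

A full-fledged implementation should probably use Webstemmer, but it requires generating a template for each blog site in advance, which is cumbersome, so I did not use it this time.

The Article class I implemented borrows from the regular expressions of extractcontent.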
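
As a rough illustration of this regular-expression approach, here is a minimal sketch using only the standard library. The function name strip_html and the sample HTML are made up for this example; it is a crude tag stripper, not a real HTML parser and not a substitute for extractcontent.

```python
import re

def strip_html(html):
    """Crude body-text extraction with regular expressions."""
    # Keep only what follows the first <body ...> tag, if there is one.
    parts = re.split(r"(?i)<body[^>]*>", html, maxsplit=1)
    body = parts[1] if len(parts) == 2 else html
    # Drop whole elements whose content is never visible text.
    body = re.sub(r"(?is)<(script|style|select|noscript)[^>]*>.*?</\1\s*>", "", body)
    # Drop every remaining tag, keeping the text between the tags.
    return re.sub(r"<[^>]+?>", "", body)

# Hypothetical sample page
sample = ("<html><head><title>t</title></head><body>"
          "<script>var x = 1;</script><p>Deep <b>learning</b></p></body></html>")
print(strip_html(sample))
```

Navigation menus, sidebars and footers survive this kind of stripping, which is one reason the extracted text is noisier than a true body extraction would be.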

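★Extracting tokens

(1) janome

I tried janome, a pure-Python Japanese morphological analysis library.
Its dictionary has almost the same structure as MeCab's, and English words are defined in full-width characters, so I used the half-width/full-width conversion library mojimoji for preprocessing.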
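
To show what this preprocessing does, the following standard-library sketch approximates a half-width to full-width conversion with digits left untouched, in the spirit of mojimoji.han_to_zen(text, digit=False). It is only an approximation: to_fullwidth is a name invented here, and mojimoji's real conversion table differs for some symbols and also handles half-width katakana.

```python
# Map printable half-width ASCII (0x21-0x7E) to the full-width block by
# adding 0xFEE0, leaving digits alone to mirror digit=False.
HAN_TO_ZEN = {code: code + 0xFEE0
              for code in range(0x21, 0x7F)
              if not chr(code).isdigit()}
HAN_TO_ZEN[0x20] = 0x3000  # half-width space -> ideographic space

def to_fullwidth(text):
    """Approximate half-width -> full-width conversion (digits unchanged)."""
    return text.translate(HAN_TO_ZEN)

print(to_fullwidth("Python 3"))
```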

article_janome.py

import codecs
import re
import mojimoji
from janome.tokenizer import Tokenizer

class Article:

    encodings = [
        "utf-8",
        "cp932",
        "euc-jp",
        "iso-2022-jp",
        "latin_1"
    ]

    tokenizer = Tokenizer("user_dic.csv", udic_type="simpledic", udic_enc="utf8")

    def __init__(self,path):
        print(path)
        self.path = path
        self.contents = self.preprocess(self.get_contents(path))
        self.tokens = [token.surface for token in self.tokenizer.tokenize(self.contents) if re.match("カスタム名詞|名詞,(固有|一般|サ変)", token.part_of_speech)]

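    def get_contents(self,path):
        # Try each candidate encoding in turn and keep the first one that decodes.
        exceptions = []
        for encoding in self.encodings:
            try:
                with codecs.open(path, 'r', encoding) as f:
                    html = f.read()
                # Split at the first <body> or <frame> tag: [head, tag name, body]
                parts = re.split("(?i)<(body|frame)[^>]*>", html, maxsplit=1)
                if len(parts) == 3:
                    head, void, body = parts
                else:
                    print('Cannot split ' + path)
                    body = html
                # Drop script/style/select/noscript elements, then all remaining tags
                return re.sub("<[^>]+?>", "", re.sub(r"(?is)<(script|style|select|noscript)[^>]*>.*?</\1\s*>","", body))
            except UnicodeDecodeError as e:
                exceptions.append(e)
                continue
        print('Cannot detect encoding of ' + path)
        print(exceptions)
        return None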

    def get_title(self,path):
        # The file name (last path component) serves as the title
        return re.split('/', path)[-1]

    def preprocess(self, text):
        text = re.sub("&[^;]+;",  " ", text)
        text = mojimoji.han_to_zen(text, digit=False)
        text = re.sub(r'(\s|　|＃)+', " ", text)
        return text

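(2) Extending the dictionary

With the default IPA dictionary, 「人工知能」 (artificial intelligence) is split into two words, 「人工」 and 「知能」. So I registered the terms I wanted treated as single words in user_dic.csv and had janome use it.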
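
For reference, janome's simplified user dictionary format (udic_type="simpledic") has three columns per line: surface form, part-of-speech label and reading. The sketch below builds such lines in memory. The two entries are hypothetical examples, not the contents of my actual user_dic.csv, and simpledic_text is a helper name made up for this illustration.

```python
import csv
import io

# Hypothetical entries: (surface form, part-of-speech label, reading)
entries = [
    ("人工知能", "カスタム名詞", "ジンコウチノウ"),
    ("深層学習", "カスタム名詞", "シンソウガクシュウ"),
]

def simpledic_text(rows):
    """Render rows as CSV text in the three-column simplified format."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

# Writing this text to user_dic.csv in UTF-8 would match udic_enc="utf8".
print(simpledic_text(entries))
```

The label カスタム名詞 is what the token filter in Article.__init__ looks for, so entries registered this way pass the part-of-speech check.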

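Later I also found

　mecab-ipadic-NEologd : Neologism dictionary for MeCab
　「Ubuntu14.04でmecabの辞書にWikipediaとはてな単語を追加」 (Adding Wikipedia and Hatena words to the mecab dictionary on Ubuntu 14.04)
　「形態素解析のために Wikipedia とはてなキーワードからユーザー辞書を生成し利用する」 (Generating and using a user dictionary from Wikipedia and Hatena keywords for morphological analysis)

and others, but by then I had already switched to using thesaurus.csv, so I have not tried them yet.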

(3) thesaurus

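As described later, with tokens cut out by a Japanese morphological analysis library the perplexity did not come into an acceptable range, and topic extraction did not work well.
So I hand-picked about 350 terms that appear frequently in AI-related articles and registered them in thesaurus.csv: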

thesaurus.csv (example)

自然言語処理,NLP,Natural Language Processing,natural language processing
質問応答
音声認識
AlphaGo,アルファ碁
…

I then implemented the processing that extracts only the matching words as tokens in the following file:

thesaurus.py

import re
import mojimoji

class Thesaurus:

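    def __init__(self,path):
        # Maps every synonym (normalized to full-width) to the first word on its line
        words_map = dict()
        with open(path, 'r') as thesaurus:
            for line in thesaurus.readlines():
                words = [mojimoji.han_to_zen(word, digit=False) for word in re.split(',', line.strip())]
                for word in words:
                    if word in words_map:
                        raise ValueError('Word duplicated: ' + word)
                    words_map[word] = words[0]
        self.words = words_map
        # Longest words first so that the longest match wins; escape regex metacharacters
        self.re    = re.compile("|".join(re.escape(word) for word in sorted(words_map.keys(), key=lambda x: -len(x))))

    def tokenize(self,sentence):
        # Yield one Token per match, carrying the canonical form of the matched word
        for token in re.finditer(self.re, sentence):
            yield(Token(self.words[token.group()]))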

class Token:

    def __init__(self, surface):
        self.surface = surface
        self.part_of_speech = "カスタム名詞"

With this class I replaced the Japanese morphological analysis library¹.
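
The core idea of the thesaurus tokenizer, longest-match search over a fixed word list with synonyms folded into one canonical term, can be tried without any external library. The following is a simplified sketch that skips the mojimoji normalization; build_matcher, tokenize and the sample sentence are names and data invented for this illustration.

```python
import re

def build_matcher(rows):
    """rows: list of synonym lists; the first item of each row is canonical."""
    canonical = {}
    for row in rows:
        for word in row:
            if word in canonical:
                raise ValueError("Word duplicated: " + word)
            canonical[word] = row[0]
    # Longest alternatives first, so a longer term wins over its own prefix.
    alternatives = sorted(canonical, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(w) for w in alternatives))
    return canonical, pattern

def tokenize(text, canonical, pattern):
    """Return the canonical term for every thesaurus word found in text."""
    return [canonical[m.group()] for m in pattern.finditer(text)]

rows = [["自然言語処理", "NLP"], ["AlphaGo", "アルファ碁"], ["音声認識"]]
canonical, pattern = build_matcher(rows)
tokens = tokenize("アルファ碁とNLP、そして音声認識", canonical, pattern)
print(tokens)
```

Everything that is not in the word list is silently dropped, which is exactly what shrinks the vocabulary to a few hundred terms.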

article.py

import codecs
import re
import mojimoji
from thesaurus import Thesaurus

class Article:

    encodings = [
        "utf-8",
        "cp932",
        "euc-jp",
        "iso-2022-jp",
        "latin_1"
    ]

    tokenizer = Thesaurus('thesaurus.csv')

    def __init__(self,path):
        print(path)
        self.path = path
        self.contents = self.preprocess(self.get_contents(path))
        self.tokens = [token.surface for token in self.tokenizer.tokenize(self.contents) if re.match("カスタム名詞|名詞,(固有|一般|サ変)", token.part_of_speech)]

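    def get_contents(self,path):
        # Try each candidate encoding in turn and keep the first one that decodes.
        exceptions = []
        for encoding in self.encodings:
            try:
                with codecs.open(path, 'r', encoding) as f:
                    html = f.read()
                # Split at the first <body> or <frame> tag: [head, tag name, body]
                parts = re.split("(?i)<(body|frame)[^>]*>", html, maxsplit=1)
                if len(parts) == 3:
                    head, void, body = parts
                else:
                    print('Cannot split ' + path)
                    body = html
                # Drop script/style/select/noscript elements, then all remaining tags
                return re.sub("<[^>]+?>", "", re.sub(r"(?is)<(script|style|select|noscript)[^>]*>.*?</\1\s*>","", body))
            except UnicodeDecodeError as e:
                exceptions.append(e)
                continue
        print('Cannot detect encoding of ' + path)
        print(exceptions)
        return None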

    def get_title(self,path):
        # The file name (last path component) serves as the title
        return re.split('/', path)[-1]

    def preprocess(self, text):
        text = re.sub("&[^;]+;",  " ", text)
        text = mojimoji.han_to_zen(text, digit=False)
        return text

❸ Turning the corpus into a Python object

A topic model handles text as a BOW (Bag of Words, a list of (word ID, count)). For this I defined the classes below.
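
As a small illustration of what a BOW looks like, the sketch below converts token lists into lists of (word ID, count) using only the standard library. It mimics the role of gensim's Dictionary and doc2bow but is not their implementation; in particular, gensim may assign different IDs. build_ids and to_bow are names made up for this example.

```python
from collections import Counter

def build_ids(texts):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    ids = {}
    for text in texts:
        for token in text:
            ids.setdefault(token, len(ids))
    return ids

def to_bow(tokens, ids):
    """Return a sorted list of (word ID, count); unknown tokens are ignored."""
    counts = Counter(ids[token] for token in tokens if token in ids)
    return sorted(counts.items())

# Hypothetical token lists for two articles
texts = [["機械学習", "Python", "機械学習"], ["Python", "画像"]]
ids = build_ids(texts)
bows = [to_bow(text, ids) for text in texts]
print(bows)
```

Note that the order in which words appear in an article is lost at this point; only the counts remain.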

★Corpus class

    Attributes
      articles   OrderedDict of (HTML file path: Article object)
      keys       list of HTML file paths (list of string)
      size       number of Article objects
      texts      tokens that make up the corpus (list of (list of string))
      corpus     texts converted to a list of BOW
    It has a save method and a load class method so that the object can be stored in files.

★Corpora class

    Attributes
      training   Corpus object for training
      test       Corpus object for test
      dictionary gensim.corpora.Dictionary object shared by training and test
                 (holds the mapping between word IDs (integer) and their surface forms (string))

corpus.py

import pickle
from collections import defaultdict
from gensim import corpora

class Corpora:

    def __init__(self, training, test, dictionary):
        self.training   = training
        self.test       = test
        self.dictionary = dictionary

    def save(self, title):
        self.training.save(title+'_training')
        self.test.save(title+'_test')
        self.dictionary.save(title+".dict")

    @classmethod
    def load(cls, title):
        training   = Corpus.load(title+'_training')
        test       = Corpus.load(title+'_test')
        dictionary = corpora.Dictionary.load(title+".dict")
        return cls(training, test, dictionary)

    @classmethod
    def generate(cls, training, test):
        training_corpus = Corpus.generate(training)
        test_corpus     = Corpus.generate(test)
        all_texts       = training_corpus.texts + test_corpus.texts
        frequency       = defaultdict(int)
        for text in all_texts:
            for token in text:
                frequency[token] += 1
        all_texts  = [[token for token in text if frequency[token] > 1] for text in all_texts]
        dictionary = corpora.Dictionary(all_texts)
        training_corpus.mm(dictionary)
        test_corpus.mm(dictionary)
        return cls(training_corpus, test_corpus, dictionary)

class Corpus:

    def __init__(self, articles):
        self.articles  = articles
        self.keys      = list(articles.keys())
        self.size      = len(articles.keys())

    def article(self, index):
        return self.articles[self.keys[index]]

    def mm(self, dictionary):
        values_set = set(dictionary.values())
        self.texts  = [[token for token in text if token in values_set] for text in self.texts]
      # print(self.texts[0])
        self.corpus = [dictionary.doc2bow(text) for text in self.texts]

    def save(self, title):
        with open(title+".pickle", 'wb') as f:
            pickle.dump(self.articles, f)
        corpora.MmCorpus.serialize(title+".mm", self.corpus)

    @classmethod
    def load(cls, title):
        with open(title+".pickle", 'rb') as f:
            articles = pickle.load(f)
        corpus = cls(articles)
        corpus.corpus = corpora.MmCorpus(title+".mm")
        return corpus

    @classmethod
    def generate(cls, articles):
        corpus = cls(articles)
        corpus.texts = [articles[key].tokens for key in articles.keys()]
        return corpus

Everything up to this point is needed in common, regardless of which on-premises tool is used.
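
One detail of Corpora.generate worth spelling out is that tokens occurring only once across training and test together are dropped before the dictionary is built. The following standalone sketch reproduces just that filter; drop_singletons and the sample token lists are invented for this illustration.

```python
from collections import defaultdict

def drop_singletons(texts):
    """Remove tokens whose total count over all texts is 1."""
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    return [[token for token in text if frequency[token] > 1] for text in texts]

# Hypothetical token lists
training_texts = [["GPU", "CNN", "画像"], ["CNN", "GPU"]]
test_texts = [["GPU", "囲碁"]]
filtered = drop_singletons(training_texts + test_texts)
print(filtered)
```

Since the counts are taken over training and test together, a word that appears once in each set is kept.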

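❹ Classification with a topic model

With the above tools in place, I classified the articles with a topic model, using

　「トピックモデルを利用したアプリケーションの作成」 (Building an application that uses a topic model) … (*1)

as a reference.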

test_view_LDA.py

import logging
import glob
import numpy as np
import matplotlib.pylab as plt
from collections import OrderedDict
from gensim import corpora, models, similarities
from pprint import pprint  # pretty-printer
from corpus import Corpus, Corpora
from article import Article

# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

topic_range = range(10, 11)
training_percent = 90
test_percent = 10
path_pattern = '/home/samba/suchowan/links/bookmarks.crawled/**/*.html'

def corpus_pair(path, training_range, test_range):
    all_paths         = glob.glob(path, recursive=True)
    training_paths    = [v for i, v in enumerate(all_paths) if ((i * 2017) % 100) in training_range]
    test_paths        = [v for i, v in enumerate(all_paths) if ((i * 2017) % 100) in test_range    ]
    training_articles = OrderedDict([(path,Article(path)) for path in training_paths])
    test_articles     = OrderedDict([(path,Article(path)) for path in test_paths])
    return  Corpora.generate(training_articles, test_articles)

def calc_perplexity(m, c):
    return np.exp(-m.log_perplexity(c))

def search_model(pair):
    most = [1.0e15, None]
    print("dataset: training/test = {0}/{1}".format(pair.training.size, pair.test.size))
    
    for t in topic_range:
        m  = models.LdaModel(corpus=pair.training.corpus, id2word=pair.dictionary, num_topics=t, iterations=500, passes=10)
        p1 = calc_perplexity(m, pair.training.corpus)
        p2 = calc_perplexity(m, pair.test.corpus)
        print("{0}: perplexity is {1}/{2}".format(t, p1, p2))
        if p2 < most[0]:
            most[0] = p2
            most[1] = m
    
    return most[0], most[1]

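# range(a, b) excludes b, so the two ranges below are disjoint (buckets 0-89 and 90-99);
# adding 1 to each upper bound would put bucket 90 into both training and test.
pair = corpus_pair(path_pattern, range(0, training_percent), range(training_percent, training_percent+test_percent))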
pair.save('article_contents')
perplexity, model = search_model(pair)
print("Best model: topics={0}, perplexity={1}".format(model.num_topics, perplexity))

def show_document_topics(c, m, r):

    # make document/topics matrix
    t_documents = OrderedDict()
    for s in r:
      # ts = m.__getitem__(c[s], -1)
        ts = m[c[s]]
        max_topic = max(ts, key=lambda x: x[1])
        if max_topic[0] not in t_documents:
            t_documents[max_topic[0]] = []
        t_documents[max_topic[0]] += [(s, max_topic[1])]
    
    return t_documents
    
topic_documents = show_document_topics(pair.test.corpus, model, range(0,pair.test.size))

for topic in topic_documents.keys():
    print("Topic #{0}".format(topic))
    for article in topic_documents[topic]:
        print(article[0], pair.test.article(article[0]).path)

pprint(model.show_topics())

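The library used is gensim. I also referred to 「tfidf, lsi, ldaを使ったツイッターユーザーの類似度計算」 (Computing similarity between Twitter users with tfidf, lsi and lda) and 「Pythonで自然言語処理をしてみる_トピックモデル」 (Trying natural language processing in Python: topic models).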

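★training

  Input : training corpus - list of (list of (word ID, count)), plus the number of topics
    list of (word ID, count) - the count of each word in an individual article (order of appearance is ignored)

  Output : LDA model - gensim.models.ldamodel
    list of (formula that computes the topic fit probability from a list of (word ID, count))

★test

  Input : test corpus - list of (list of (word ID, count))
    list of (word ID, count) - the count of each word in an individual article (order of appearance is ignored)


  Output : list of (list of fit probability)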
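
To turn this output into a classification, each article is assigned to the topic with the highest fit probability, which is what show_document_topics in test_view_LDA.py does. A minimal sketch of that step follows, assuming each document's result is a list of (topic ID, probability) pairs; the probabilities are made up and group_by_dominant_topic is a name used only here.

```python
from collections import OrderedDict

def group_by_dominant_topic(doc_topics):
    """doc_topics: one list of (topic ID, probability) pairs per document."""
    groups = OrderedDict()
    for index, topics in enumerate(doc_topics):
        # Pick the pair with the largest probability
        topic_id, prob = max(topics, key=lambda pair: pair[1])
        groups.setdefault(topic_id, []).append((index, prob))
    return groups

# Hypothetical probabilities for three documents
doc_topics = [[(0, 0.7), (3, 0.3)], [(3, 0.9), (5, 0.1)], [(0, 0.55), (5, 0.45)]]
groups = group_by_dominant_topic(doc_topics)
for topic_id, members in groups.items():
    print(topic_id, members)
```

A document such as the third one, split 0.55 to 0.45, is still forced into a single topic, so this hard assignment hides how uncertain some classifications are.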

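★Example run

I first built the corpus from the words extracted by janome, filtered only by part_of_speech, and tried classification with the topic model, but the perplexity took on astronomical values and the result was meaningless.

Partly this is because I skipped various preprocessing steps that are normally required, but the fundamental reason is obvious.

　number of distinct words　>>　number of documents

That is the reason.

A topic model has as many adjustable variables as the number of distinct words, plus α. Forcing convergence under the condition "number of distinct words >> number of documents" inevitably leads to overfitting.

　number of distinct words　<<　number of documents

The words must be narrowed down until this holds.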
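
Checking this condition on a corpus takes only a few lines. The sketch below counts distinct tokens and documents for token lists like Corpus.texts; vocabulary_report and the sample data are made up for this illustration.

```python
def vocabulary_report(texts):
    """Return (number of distinct tokens, number of documents)."""
    vocabulary = set(token for text in texts for token in text)
    return len(vocabulary), len(texts)

# Hypothetical token lists for three articles
texts = [["画像", "CNN", "GPU"], ["囲碁", "AlphaGo"], ["CNN", "画像"]]
n_words, n_docs = vocabulary_report(texts)
print(n_words, n_docs, "too many words" if n_words > n_docs else "ok")
```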

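The results below are for a corpus built using only the roughly 350 hand-picked terms, frequent in AI-related articles, that are registered in thesaurus.csv.

The number of topics is an input to training, but the choice can be automated by searching for the number of topics that minimizes the perplexity. For the run shown here I had confirmed beforehand that the perplexity is smallest at 10 topics.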

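According to (*1),

The reciprocal of the perplexity indicates how well the occurrence of words in a document can be predicted, so the best value is 1,
and the value grows as the model gets less accurate (two digits is good, the low three digits is so-so, anything beyond that is bad;
a single-digit value, on the contrary, suggests checking the model or the perplexity calculation for mistakes).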
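
To get a feel for these numbers, perplexity can be computed by hand as the exponential of the negative mean log-probability the model assigns to each observed word. The sketch below uses that textbook definition; it is not what gensim does internally, since log_perplexity returns a variational bound, and the probabilities here are made up.

```python
import math

def perplexity(word_probs):
    """exp of the negative mean log-probability of the observed words."""
    mean_log = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-mean_log)

# A model that spreads probability evenly over 350 words scores about 350,
# i.e. it is as uncertain as a uniform guess among 350 words.
uniform = perplexity([1 / 350] * 1000)
# A model that always predicts the observed word with certainty scores 1.
perfect = perplexity([1.0] * 5)
print(uniform, perfect)
```

Read this way, a two-digit perplexity on a 350-word vocabulary means the model narrows each word down to the equivalent of a few dozen equally likely candidates.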

In the run shown here, 1920 articles (90%) were used for training and 210 articles (10%) for test², and the perplexity on the test corpus is 68.4.
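
The training/test division is done in corpus_pair by bucketing each file index with (i * 2017) % 100. A minimal sketch of that idea follows, assuming the two bucket ranges are meant to be disjoint; split_paths and the file names are hypothetical.

```python
def split_paths(paths, training_percent=90, test_percent=10):
    """Deterministically bucket paths into disjoint training and test lists."""
    training_range = range(0, training_percent)
    test_range = range(training_percent, training_percent + test_percent)
    # 2017 and 100 share no common factor, so every run of 100 consecutive
    # indices visits each of the 100 buckets exactly once.
    training = [p for i, p in enumerate(paths) if (i * 2017) % 100 in training_range]
    test = [p for i, p in enumerate(paths) if (i * 2017) % 100 in test_range]
    return training, test

paths = ["doc{0}.html".format(i) for i in range(2000)]  # hypothetical file names
training, test = split_paths(paths)
print(len(training), len(test))
```

The multiplier scatters test articles across the whole list instead of taking the last 10%, and the split stays reproducible as long as the order of paths is the same on every run.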

The list of formulas for computing the topic fit probabilities is as follows.

[(0,
  '0.268*画像 + 0.124*Ｄｅｌｌ + 0.049*ＣＮＮ + 0.043*深層学習 + 0.038*ニューラルネットワーク + '
  '0.026*機械学習 + 0.025*Ｃｈａｉｎｅｒ + 0.024*ＧＰＵ + 0.023*記事 + 0.022*画像認識'),
 (1,
  '0.135*機械学習 + 0.121*Ｐｙｔｈｏｎ + 0.102*記事 + 0.055*Ｃｈａｉｎｅｒ + 0.052*Ｄｅｌｌ + '
  '0.037*深層学習 + 0.033*ｎｕｍｐｙ + 0.023*フレームワーク + 0.019*ニューラルネットワーク + 0.019*Ｓｐａｒｋ'),
 (2,
  '0.111*記事 + 0.097*予測 + 0.090*ランキング + 0.071*大学 + 0.055*検索 + 0.033*人工知能 + '
  '0.032*Ｙａｈｏｏ + 0.032*Ｄｅｌｌ + 0.029*データベース + 0.026*特許'),
 (3,
  '0.121*Ｒｕｂｙ + 0.100*ゲーム + 0.090*ＡｌｐｈａＧｏ + 0.085*囲碁 + 0.077*記事 + 0.076*人工知能 + '
  '0.053*Ｇｏｏｇｌｅ + 0.052*Ｍｉｃｒｏｓｏｆｔ + 0.047*Ｔａｙ + 0.034*Ｔｗｉｔｔｅｒ'),
 (4,
  '0.113*ＴｅｎｓｏｒＦｌｏｗ + 0.103*ＬＳＴＭ + 0.070*Ｄｅｌｌ + 0.068*ＣＮＮ + 0.063*ｌｉｎｅ + '
  '0.058*Ｔｈｅａｎｏ + 0.043*ＳＰＡＲＱＬ + 0.038*Ｋｅｒａｓ + 0.037*Ｐｙｔｈｏｎ + 0.035*ＭＮＩＳＴ'),
 (5,
  '0.130*クラウド + 0.096*セキュリティ + 0.079*ＡＷＳ + 0.079*Ａｍａｚｏｎ + 0.075*記事 + 0.057*ＩｏＴ '
  '+ 0.042*ビッグデータ + 0.031*書籍 + 0.023*攻撃 + 0.022*ＩＢＭ'),
 (6,
  '0.177*Ｇｏｏｇｌｅ + 0.137*ＡＰＩ + 0.100*検索 + 0.071*記事 + 0.055*Ｆａｃｅｂｏｏｋ + '
  '0.031*Ｗａｔｓｏｎ + 0.030*ＩＢＭ + 0.026*Ｂｌｕｅｍｉｘ + 0.026*機械学習 + 0.025*Ｔｗｉｔｔｅｒ'),
 (7,
  '0.351*人工知能 + 0.093*ロボット + 0.064*深層学習 + 0.049*記事 + 0.032*大学 + 0.029*機械学習 + '
  '0.020*東京大学 + 0.019*Ｆａｃｅｂｏｏｋ + 0.019*映画 + 0.019*Ｇｏｏｇｌｅ'),
 (8,
  '0.188*ｂｏｔ + 0.180*Ｍｉｃｒｏｓｏｆｔ + 0.057*Ａｚｕｒｅ + 0.056*Ｅｌａｓｔｉｃｓｅａｒｃｈ + '
  '0.042*ｗｏｒｄ2ｖｅｃ + 0.038*機械学習 + 0.033*ｌｉｎｅ + 0.030*検索 + 0.027*Ｋｉｂａｎａ + '
  '0.022*自然言語処理'),
 (9,
  '0.102*記事 + 0.094*Ｔｗｉｔｔｅｒ + 0.079*ロボット + 0.060*ＩｏＴ + 0.058*ソニー + 0.041*強化学習 '
  '+ 0.038*ＴｅｎｓｏｒＦｌｏｗ + 0.029*Ｊａｖａ + 0.028*Ｄｅｅｐ\u3000Ｑ−Ｎｅｔｗｏｒｋ + 0.027*ランキング')]

A perplexity of 68.4 does not seem that bad, but judging from these formulas it feels quite hard for a human to read the meaning of each topic.

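Going back to (*1), the tokens extracted from the original articles in that example are descriptions such as

　Large salon with 15 or more seats/Parking available/Open after 7 p.m./Open all year round/Within a 3-minute walk of the nearest station/
　Hair set/Nail/Open before 10 a.m./Drink service/Card payment OK/Many female staff/
　Private rooms/No smoking/Semi-private rooms

split at '/'.
These are less tokens extracted from natural language than direct feature values.
Even with such favorable tokens the perplexity with 2 topics is 17.1, so I do not think this run suffered from any serious mishandling.
Put the other way around, truly impressive unsupervised classification may simply be hard to achieve with a topic model on a data set of this size and content.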

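If improvement is possible, the following points come to mind.

・Extract the true body text with something like Webstemmer.
・Tune up thesaurus.csv.

As for the latter, though, if synonym relations have to be maintained by hand, it is unclear what the automation is for.
Moreover, whenever a new company enters the machine-learning field, someone has to notice and add it to thesaurus.csv.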

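The recently announced JUMAN++ may help solve the problem, judging from

　「新形態素解析器JUMAN++を触ってみたけど思ったより高精度でMeCabから乗り換えようかと思った話」 (I tried the new morphological analyzer JUMAN++ and found it accurate enough to consider switching from MeCab)

but that is left for future work.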

¹ The class implements an iterator and so on so that it exposes the same API.
² At the time of this investigation there were about 2,000 articles; there are now about 5,000.
