
[python] Getting vectors for English words from NLTK and a pretrained fastText model

Last updated at 2018-11-08. Posted at 2018-11-08.

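Overview

What I wanted to do

For certain reasons, I ended up having to classify short texts with an extremely small number of samples.
The plan was to get distributed representations with fastText, run hierarchical clustering, and draw a dendrogram.
Once I actually tried it, though, I got stuck in several places (the dendrogram in particular was so hard to read that I despaired...).
This article covers only the steps up to obtaining distributed representations from a pretrained fastText model.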

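Outline

This article mainly covers the following:

0. Environment setup (installation)
1. A brief overview of fastText
2. Preparing the dataset
3. Cleansing the dataset with NLTK
4. Obtaining distributed representations with fastText
5. Retrieving sentences similar to a given sentence based on cosine similarity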

0. Environment setup

My environment is Python 3.5.3.

First, install fastText, following the official instructions.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

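(Note) The build is picky about the GCC version; on a Mac you may need to install GCC separately from the Command Line Tools. Of my two Macs, one installed without trouble, but on the other I could not get it installed and gave up.

Install the other required libraries. (The PyPI package for sklearn is named scikit-learn.)

$ pip install numpy scipy jupyter pandas nltk scikit-learn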

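1. What is fastText?

fastText is a successor to Word2Vec proposed by Mikolov, the genius behind Word2Vec. Its distinguishing features are fast training and the use of subwords.
If you want a quick feel for how it works, articles like these are worth reading.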
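
To make the subword idea a little more concrete, here is a minimal sketch of splitting a word into character n-grams. The function name `char_ngrams` and its parameters are my own, chosen for illustration; the real fastText implementation also hashes the n-grams into buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with '<' and '>'."""
    padded = '<' + word + '>'
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the padded word.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


# Hypothetical example word; n-gram lengths limited to 3 and 4 to keep the output short.
print(char_ngrams('bike', min_n=3, max_n=4))
# ['<bi', 'bik', 'ike', 'ke>', '<bik', 'bike', 'ike>']
```

Because a word vector is built from pieces like these, a model with subwords can still return a vector for a word it never saw during training.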

I do not use it this time, but if anyone knows how to continue training on top of this pretrained model, I would appreciate an explanation of what is involved.

2. Preparing the dataset

For now, I use the ABC Australia News Corpus dataset.

from IPython.core.display import display
import pandas as pd
import numpy as np
np.random.seed(123)

# Randomly pick 500 article headlines
data = pd.read_csv('abcnews-date-text.csv')
rand_index = np.random.randint(0, data.shape[0], 500)
data = data.iloc[rand_index, 1]
display(data.head(10))

Here is the result.

773630            opinion split on bridge bike lane proposal
277869                anglican church fears sexuality schism
28030                      regulation key to telstra sale mp
1066306    volkswagen accused of attempting to hoodwink c...
194278                   vaughan book exposes childish smith
990803     brief of evidence still being prepared for fat...
1094779    game of thrones star jason momoa rape joke rev...
1041977             bling worth $3m stolen from rapper drake
448625     more surveys add more gloom to already dark ou...
118857            beach plaques urge surfers to toe the line
Name: headline_text, dtype: object
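
One caveat: `np.random.randint` draws with replacement, so the same row can be picked more than once. If unique rows matter, one possible alternative is `choice` with `replace=False`. The sketch below uses a made-up population size (`n_rows`) instead of the real CSV, and the variable names are my own.

```python
import numpy as np

rng = np.random.RandomState(123)
n_rows = 1000  # made-up population size standing in for data.shape[0]

# randint can return the same index more than once.
with_replacement = rng.randint(0, n_rows, 500)
# choice with replace=False returns 500 distinct indices.
without_replacement = rng.choice(n_rows, size=500, replace=False)

# Compare the number of distinct indices in each sample.
print(len(set(with_replacement.tolist())), len(set(without_replacement.tolist())))
```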

3. Cleansing the dataset with NLTK

The data could be used as is, but I do some simple cleansing.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('tagsets')

Morphological analysis

Let's roughly tag parts of speech with NLTK. The accuracy is fairly poor, so treat the result as a rough guide.

sent = ' '.join(list(data))
words = nltk.word_tokenize(sent)
tags = nltk.pos_tag(words)
tags = sorted(list(set([tag for word, tag in tags])))
for tag in tags:
    nltk.help.upenn_tagset(tag)  # prints the description itself and returns None

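Looking at the output, the following tags (parts of speech) appear in this dataset.

tag	meaning	drop	lemmatizer pos
$	dollar	yes	-
CC	conjunction	yes	-
CD	numeral	yes	-
DT	determiner	yes	-
FW	foreign word	no	n
IN	preposition	yes	-
JJ	adjective (or ordinal numeral)	no	a
JJR	adjective, comparative	no	a
JJS	adjective, superlative	no	a
MD	modal auxiliary	yes	-
NN	noun	no	n
NNP	proper noun	no	n
NNS	noun, plural	no	n
POS	possessive marker (e.g. ', 's)	yes	-
PRP	pronoun	yes	-
PRP$	pronoun, possessive	yes	-
RB	adverb	no	r
RBR	adverb, comparative	no	r
RP	particle (e.g. about, across, along, at, away)	yes	-
TO	"to" as preposition or infinitive marker	yes	-
VB	verb, base form	no	v
VBD	verb, past tense	no	v
VBG	verb, present participle or gerund	no	v
VBN	verb, past participle	no	v
VBP	verb, present tense, not 3rd person singular	no	v
VBZ	verb, present tense, 3rd person singular	no	v
WP	wh-pronoun	yes	-
WRB	wh-adverb	yes	-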

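Cleansing

Following the conversion table above, I cleanse the data.
Words whose part of speech is marked for dropping are removed from the corpus, and the remaining words are lemmatized.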
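
The drop-or-keep rule can be sketched without NLTK by working on pairs that are already tagged. The tagged sentence below is hand-written for illustration (it is not real `nltk.pos_tag` output), and `filter_by_tag`, `DROP_TAGS`, and `TAG_TO_POS` are hypothetical names of my own; in the actual function the surviving words are then passed to the lemmatizer.

```python
DROP_TAGS = {'$', 'CC', 'CD', 'DT', 'IN', 'MD', 'POS', 'PRP', 'PRP$', 'RP', 'TO', 'WP', 'WRB'}
TAG_TO_POS = {'JJ': 'a', 'NN': 'n', 'NNS': 'n', 'RB': 'r', 'VBZ': 'v'}  # subset of the table


def filter_by_tag(tagged_words):
    """Drop words whose tag is in DROP_TAGS and pair the rest with a lemmatizer pos."""
    return [(word, TAG_TO_POS[tag]) for word, tag in tagged_words if tag not in DROP_TAGS]


# Hand-written tags, assumed for illustration only.
tagged = [('opinion', 'NN'), ('split', 'NN'), ('on', 'IN'), ('bridge', 'NN'), ('proposal', 'NN')]
kept = filter_by_tag(tagged)
print(kept)
# [('opinion', 'n'), ('split', 'n'), ('bridge', 'n'), ('proposal', 'n')]
```

Note that a tag appearing in neither collection raises a KeyError in this sketch, so the two need to cover every tag that occurs in the data.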

def cleansing(x, drop_tag, tag_pos, lemmatizer):
    """
    Drop unwanted parts of speech, lemmatize, and return the result. Used inside apply.

    Args:
        x (Series): row Series passed in by apply
        drop_tag (list): NLTK POS tags to drop
        tag_pos (dict): key -> tag, value -> pos. Used to improve lemmatization accuracy.
        lemmatizer (nltk.stem.WordNetLemmatizer): lemmatizer

    Returns:
        (str): output sentence
    """
    words = [word for word in x['headline_text'].split(' ') if word != '']  # empty strings cause an error
    tags = nltk.pos_tag(words)  # get the POS tags
    words = [(word, tag_pos[tag]) for word, tag in tags if tag not in drop_tag]  # drop unwanted POS
    words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]
    sentence = ' '.join(words)  # join back into one string
    return sentence


def preprocess(data):
    """
    Preprocessing function.
    
    Args:
        data (DataFrame): input dataset
    
    Returns:
        (DataFrame): output dataset
    """
    # First, drop unwanted parts of speech and lemmatize.
    # Then write a TSV file for the later hierarchical clustering and a txt file for model training.
    lemmatizer = nltk.stem.WordNetLemmatizer()
    # POS tags to drop
    drop_tag = ['$', 'CC', 'CD', 'DT', 'IN', 'MD', 'POS', 'PRP', 'PRP$', 'RP', 'TO' , 'WP', 'WRB']
    # mapping from POS tag to pos (for the lemmatizer)
    tag_pos = {'FW': 'n', 'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 'NN': 'n', 'NNP': 'n', 'NNS': 'n', 
               'RB': 'r', 'RBR': 'r', 'VB': 'v', 'VBD': 'v', 'VBG': 'v', 'VBN': 'v', 'VBP': 'v', 'VBZ': 'v'}

    data = data.assign(preprocessed=data.apply(func=cleansing, axis=1, args=(drop_tag, tag_pos, lemmatizer,)))
    
    print('after drop and lemmatization')
    display(data.head())
    data.to_csv('data.tsv', sep='\t', index=False)
    data['preprocessed'].to_csv('text.txt', index=False, header=False)  # header=False keeps the column name out of text.txt
    return data

data = preprocess(pd.DataFrame(data))

As a result, we get data like the following.

	headline_text	preprocessed
773630	opinion split on bridge bike lane proposal	opinion split bridge bike lane proposal
277869	anglican church fears sexuality schism	anglican church fear sexuality schism
28030	regulation key to telstra sale mp	regulation key telstra sale mp
1066306	volkswagen accused of attempting to hoodwink c...	volkswagen accuse attempt hoodwink consumer
194278	vaughan book exposes childish smith	vaughan book expose childish smith

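4. Obtaining distributed representations with fastText

Now, at last, we get distributed representations from the pretrained model.

First, fetch the pretrained model.
Download "English: bin+text" from here. It is about 15 GB.

(Note) I renamed the downloaded bin file to 'fasttext_premodel_en.bin'.

Using the pretrained model and the prepared data, let's obtain the distributed representations.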

from os import path
import re
import fastText as ft  # module name at the time of writing; newer releases are imported as `fasttext`

def get_word_vector(data_name='text.txt', model_name='fasttext_premodel_en.bin'):
    """
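    Get distributed representations with fastText. The arguments are self-explanatory, so they are omitted here.

    Returns:
        (list of list): list of word lists, e.g. [['word_0_0', 'word_0_1'], ['word_1_0', 'word_1_1', 'word_1_2'], ...]
        (array): distributed representations, shape = (number of sentences, embedding dimension)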
    """
    sentences = []
    with open(data_name, mode='r') as f:
        for line in f.readlines():
            line = re.sub('\n', '', line)
            sentences.append(line.split(' '))

    # The model takes about 12 GB of memory, so free it once we are done.
    vec_name =  'sentences_vec.npy'
    if not path.exists(vec_name):
        model = ft.load_model(model_name)

        dim = model.get_dimension()
        sentences_vec = np.zeros((dim,))

        for words in sentences:
            vec = np.zeros((dim,))
            for word in words:
                if model.get_word_id(word) == -1:
                    print('this word does not exist in corpus: %s at %s' % (word, words))
                vec = np.vstack((vec, model.get_word_vector(word)))
            vec = vec[1:, :].mean(axis=0)
            sentences_vec = np.vstack((sentences_vec, vec))
        sentences_vec = sentences_vec[1:, :]
        del model

        np.save(vec_name, sentences_vec)
    else:
        sentences_vec = np.load(vec_name)
    return sentences, sentences_vec

sentences, vec = get_word_vector()

We have successfully obtained the distributed representations. The shape of vec is (500, 300).

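5. Retrieving sentences similar to a given sentence based on cosine similarity

For a given sentence, let's look for its nearest neighbors.
Note that a sentence vector is the simple average of its word vectors.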
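
The two ideas used here, averaging word vectors and comparing sentences by cosine similarity, can be shown on toy data. This is only a sketch: the 3-dimensional vectors are invented for the example, and `sentence_vector` and `cosine` are hypothetical helper names, not part of fastText or scikit-learn.

```python
import numpy as np


def sentence_vector(word_vectors):
    """Average a list of word vectors into one sentence vector."""
    return np.mean(np.asarray(word_vectors, dtype=float), axis=0)


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy 3-dimensional "word vectors", invented for this example.
s1 = sentence_vector([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # mean is [0.5, 0.5, 0.0]
s2 = sentence_vector([[1.0, 1.0, 0.0]])
s3 = sentence_vector([[0.0, 0.0, 1.0]])

print(cosine(s1, s2))  # close to 1: the two vectors point in the same direction
print(cosine(s1, s3))  # 0: the two vectors are orthogonal
```

Cosine similarity ignores vector length, so s1 and s2 count as identical even though their norms differ.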

from sklearn.metrics.pairwise import cosine_similarity

def get_similar_sentence(name, data, sentences, vec, num=5):
    """
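    Print the sentences closest to the given sentence by cosine similarity.

    Args:
        name (str): target sentence (a value from the preprocessed column)
        data (DataFrame): DataFrame returned by preprocess()
        sentences (list of list): list of word lists, e.g. [['word_0_0', 'word_0_1'], ['word_1_0', 'word_1_1', 'word_1_2'], ...]
        vec (np.array): distributed representations, shape = (number of sentences, embedding dimension)
        num (int): number of nearest neighbors to print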
    """
    arr = np.array(data)
    assert (np.where(arr == name)[0].shape == (1,))
    assert (len(sentences) == vec.shape[0] == data.shape[0])
    index = np.where(arr == name)[0][0]
    assert (sentences[index] == data.iloc[index, -1].split(' '))
    similarity = cosine_similarity(vec[index, :][np.newaxis, :], vec)[0]
    sorted_index = similarity.argsort()[::-1]
    print('--------------------------------')
    print('sentence: %s' % name)
    print('word list: %s' % (sentences[index]))
    for i in range(num):
        index = sorted_index[i + 1]
        print('--------------------------------')
        print('cosine similarity: %1.3f' % (similarity[index]))
        print('sentence: %s' % (data.iloc[index, 0]))
        print('word list: %s' % (sentences[index]))
    print('--------------------------------')
    return sorted_index[1:num + 1], data.iloc[sorted_index[1:num + 1]]

get_similar_sentence(data.iloc[0, 1], data, sentences, vec, 5)

The result looks like this.

--------------------------------
sentence: opinion split bridge bike lane proposal
word list: ['opinion', 'split', 'bridge', 'bike', 'lane', 'proposal']
--------------------------------
cosine similarity: 0.748
sentence: full council yet to approve pedestrian bridge plan
word list: ['full', 'council', 'yet', 'approve', 'pedestrian', 'bridge', 'plan']
--------------------------------
cosine similarity: 0.659
sentence: councils facebook page lures road fix supporters
word list: ['council', 'facebook', 'page', 'lure', 'road', 'fix', 'supporter']
--------------------------------
cosine similarity: 0.640
sentence: man given suspended sentence after fatal bike
word list: ['man', 'give', 'suspend', 'sentence', 'fatal', 'bike']
--------------------------------
cosine similarity: 0.614
sentence: tunnelling work starts on tugun bypass
word list: ['tunnel', 'work', 'start', 'tugun', 'bypass']
--------------------------------
cosine similarity: 0.612
sentence: farmers hit out at planned highway upgrade
word list: ['farmer', 'hit', 'planned', 'highway', 'upgrade']
--------------------------------

Whether or not the sentences are actually similar, we have successfully obtained distributed representations.

Closing

This is my very first article on Qiita.
If you find any problems in the code, I would appreciate a comment.
Finally, here is the full source code.

import pandas as pd
import numpy as np
import nltk
from os import path
import re
import fastText as ft  # module name at the time of writing; newer releases are imported as `fasttext`
from sklearn.metrics.pairwise import cosine_similarity


def cleansing(x, drop_tag, tag_pos, lemmatizer):
    """
    Drop unwanted parts of speech, lemmatize, and return the result. Used inside apply.

    Args:
        x (Series): row Series passed in by apply
        drop_tag (list): NLTK POS tags to drop
        tag_pos (dict): key -> tag, value -> pos. Used to improve lemmatization accuracy.
        lemmatizer (nltk.stem.WordNetLemmatizer): lemmatizer

    Returns:
        (str): output sentence
    """
    words = [word for word in x['headline_text'].split(' ') if word != '']  # empty strings cause an error
    tags = nltk.pos_tag(words)  # get the POS tags
    words = [(word, tag_pos[tag]) for word, tag in tags if tag not in drop_tag]  # drop unwanted POS
    words = [lemmatizer.lemmatize(word, pos=pos) for word, pos in words]
    sentence = ' '.join(words)  # join back into one string
    return sentence


def preprocess(data):
    """
    Preprocessing function.
    
    Args:
        data (DataFrame): input dataset
    
    Returns:
        (DataFrame): output dataset
    """
    # First, drop unwanted parts of speech and lemmatize.
    # Then write a TSV file for the later hierarchical clustering and a txt file for model training.
    lemmatizer = nltk.stem.WordNetLemmatizer()
    # POS tags to drop
    drop_tag = ['$', 'CC', 'CD', 'DT', 'IN', 'MD', 'POS', 'PRP', 'PRP$', 'RP', 'TO' , 'WP', 'WRB']
    # mapping from POS tag to pos (for the lemmatizer)
    tag_pos = {'FW': 'n', 'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 'NN': 'n', 'NNP': 'n', 'NNS': 'n', 'RB': 'r', 'RBR': 'r', 'VB': 'v',
               'VBD': 'v', 'VBG': 'v', 'VBN': 'v', 'VBP': 'v', 'VBZ': 'v'}

    data = data.assign(preprocessed=data.apply(func=cleansing, axis=1, args=(drop_tag, tag_pos, lemmatizer,)))
    
    print('after drop and lemmatization')
    print(data.head())
    data.to_csv('data.tsv', sep='\t', index=False)
    data['preprocessed'].to_csv('text.txt', index=False, header=False)  # header=False keeps the column name out of text.txt
    return data


def get_word_vector(data_name='text.txt', model_name='fasttext_premodel_en.bin'):
    """
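    Get distributed representations with fastText. The arguments are self-explanatory, so they are omitted here.

    Returns:
        (list of list): list of word lists, e.g. [['word_0_0', 'word_0_1'], ['word_1_0', 'word_1_1', 'word_1_2'], ...]
        (array): distributed representations, shape = (number of sentences, embedding dimension)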
    """
    sentences = []
    with open(data_name, mode='r') as f:
        for line in f.readlines():
            line = re.sub('\n', '', line)
            sentences.append(line.split(' '))

    # The model takes about 12 GB of memory, so free it once we are done.
    vec_name =  'sentences_vec.npy'
    if not path.exists(vec_name):
        model = ft.load_model(model_name)

        dim = model.get_dimension()
        sentences_vec = np.zeros((dim,))

        for words in sentences:
            vec = np.zeros((dim,))
            for word in words:
                if model.get_word_id(word) == -1:
                    print('this word does not exist in corpus: %s at %s' % (word, words))
                vec = np.vstack((vec, model.get_word_vector(word)))
            vec = vec[1:, :].mean(axis=0)
            sentences_vec = np.vstack((sentences_vec, vec))
        sentences_vec = sentences_vec[1:, :]
        del model

        np.save(vec_name, sentences_vec)
    else:
        sentences_vec = np.load(vec_name)
    return sentences, sentences_vec


def get_similar_sentence(name, data, sentences, vec, num=5):
    """
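    Print the sentences closest to the given sentence by cosine similarity.

    Args:
        name (str): target sentence (a value from the preprocessed column)
        data (DataFrame): DataFrame returned by preprocess()
        sentences (list of list): list of word lists, e.g. [['word_0_0', 'word_0_1'], ['word_1_0', 'word_1_1', 'word_1_2'], ...]
        vec (np.array): distributed representations, shape = (number of sentences, embedding dimension)
        num (int): number of nearest neighbors to print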
    """
    arr = np.array(data)
    assert (np.where(arr == name)[0].shape == (1,))
    assert (len(sentences) == vec.shape[0] == data.shape[0])
    index = np.where(arr == name)[0][0]
    assert (sentences[index] == data.iloc[index, -1].split(' '))
    similarity = cosine_similarity(vec[index, :][np.newaxis, :], vec)[0]
    sorted_index = similarity.argsort()[::-1]
    print('--------------------------------')
    print('sentence: %s' % name)
    print('word list: %s' % (sentences[index]))
    for i in range(num):
        index = sorted_index[i + 1]
        print('--------------------------------')
        print('cosine similarity: %1.3f' % (similarity[index]))
        print('sentence: %s' % (data.iloc[index, 0]))
        print('word list: %s' % (sentences[index]))
    print('--------------------------------')
    return sorted_index[1:num + 1], data.iloc[sorted_index[1:num + 1]]


if __name__ == '__main__':
    np.random.seed(123)
    # Randomly pick 500 article headlines
    data = pd.read_csv('abcnews-date-text.csv')
    rand_index = np.random.randint(0, data.shape[0], 500)
    data = data.iloc[rand_index, 1]
    print('raw data')
    print(data.head())
    
    # sent = ' '.join(list(data))
    # words = nltk.word_tokenize(sent)
    # tags = nltk.pos_tag(words)
    # tags = sorted(list(set([tag for word, tag in tags])))
    # for i in tags:
        # print(nltk.help.upenn_tagset(i))
    data = preprocess(pd.DataFrame(data))
    
    sentences, vec = get_word_vector()
    print(vec.shape)
    get_similar_sentence(data.iloc[0, 1], data, sentences, vec, 5)
