
[Python] Building a module that computes the similarity between English sentences in just 4 lines

Posted at 2019-08-27, last updated at 2019-08-27

Introduction

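This article uses cosine similarity.
For a detailed explanation of cosine similarity, see my earlier Qiita article ↓
Using Python to put a number on "our vectors just don't line up" (lol)

This time, let's compute the similarity between two sentences with 4 lines of simple Python!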

Similarity between two sentences

Suppose we have the following two sentences:

text1 = "I like apple and lemon"
text2 = "I like apple and banana"

Summarizing which words appear in each sentence as a table (1 = present, 0 = absent):

	I	like	apple	and	lemon	banana
text1	1	1	1	1	1	0
text2	1	1	1	1	0	1
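
The table above can be reproduced with a few lines of plain Python. This is only a minimal sketch: it assumes whitespace tokenization with `str.split` instead of polyglot, and the names `vocab` and `to_vector` are illustrative and do not appear in the original module.

```python
text1 = "I like apple and lemon"
text2 = "I like apple and banana"

# Assumption: whitespace splitting is good enough for these short sentences.
words1 = text1.split()
words2 = text2.split()

# Vocabulary in order of first appearance (dict preserves insertion order).
vocab = list(dict.fromkeys(words1 + words2))


def to_vector(words, vocab):
    """Return 1 for each vocabulary word that occurs in `words`, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in vocab]


alpha = to_vector(words1, vocab)
beta = to_vector(words2, vocab)

print(vocab)
print(alpha)
print(beta)
```

Each row of the table corresponds to one of the printed vectors, with the columns given by `vocab`.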

Representing the two texts as vectors:
α = (1,1,1,1,1,0)
β = (1,1,1,1,0,1)
α・β = 4
|α| = √5
|β| = √5

From these we obtain the cosine similarity.

cosθ = (α・β) / (|α||β|)
so

cosθ = 4 / (√5 × √5) = 0.8
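
The same arithmetic can be checked numerically. The sketch below applies the formula directly to the two vectors; the helper name `cosine` is an assumption made for illustration and is not part of the module built later in this article.

```python
import math

alpha = (1, 1, 1, 1, 1, 0)
beta = (1, 1, 1, 1, 0, 1)


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))        # α・β
    norm_u = math.sqrt(sum(a * a for a in u))     # |α|
    norm_v = math.sqrt(sum(b * b for b in v))     # |β|
    return dot / (norm_u * norm_v)


# Rounded for display, since floating-point square roots are not exact.
print(round(cosine(alpha, beta), 10))
```

Because the square roots are computed in floating point, the raw result may differ from 0.8 in the last digits, which is why the example rounds before printing.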

Writing it in Python

Environment

Ubuntu 18.04
Python 3

Installing polyglot, the module needed for tokenization

$ sudo apt-get update
$ sudo apt-get install libicu-dev
$ sudo apt-get install python3-pip
$ sudo -H pip3 install --upgrade pip
$ sudo -H pip3 install numpy
$ sudo -H pip3 install polyglot
$ sudo -H pip3 install pyicu
$ sudo -H pip3 install pycld2
$ sudo -H pip3 install morfessor

Source: "Recommended for English natural language processing! Trying out the easy-to-use Polyglot."

See the article above for basic usage ↑

Below is the 4-line module that computes the cosine similarity:

cos_sentence.py

import math
from polyglot.text import Text
def calc_cos(text1,text2):
    return len([i for i in Text(text1).words if i in Text(text2).words])/(math.sqrt(len(Text(text1).words)*len(Text(text2).words)))

OK, I forced that into 4 lines, so here is a more readable version lol

cos_sentence.py


import math
from polyglot.text import Text

def calc_cos(text1, text2):

    # Tokenize each text and store the words in a list
    list1 = Text(text1).words
    list2 = Text(text2).words

    # Denominator of the cosine similarity
    denominator = math.sqrt(len(list1) * len(list2))

    # Numerator of the cosine similarity: words of list1 that also occur in list2
    numerator = len([i for i in list1 if i in list2])

    return numerator / denominator

Call this module to compute the similarity of the two sentences above:

test.py


import cos_sentence
print(cos_sentence.calc_cos("I like apple and lemon","I like apple and banana"))

Result ↓
0.8
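
Why does counting shared words give the same value as the vector formula? For 0/1 vectors, the dot product equals the number of words the two texts share, and each norm is the square root of the number of distinct words in that text. The sketch below makes this explicit with sets. It assumes whitespace tokenization and no repeated words; `calc_cos_sets` is a hypothetical name, and returning 0.0 for empty input is a choice made here, not something the original module does.

```python
import math


def calc_cos_sets(text1, text2):
    """Cosine similarity of binary word vectors, expressed with sets."""
    set1 = set(text1.split())
    set2 = set(text2.split())
    if not set1 or not set2:
        # Assumption: treat empty input as similarity 0 instead of dividing by zero.
        return 0.0
    shared = len(set1 & set2)                     # dot product of the 0/1 vectors
    return shared / math.sqrt(len(set1) * len(set2))


print(calc_cos_sets("I like apple and lemon", "I like apple and banana"))
```

For the two example sentences this computes 4 / √(5 × 5), matching the hand calculation.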

Afterword (when text1 and text2 are not short sentences)

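This tokenization also keeps punctuation such as ".", "?" and "!" as tokens. For an article with several sentences, you need to either strip them with something like replace("!", ""), or, if you want to be accurate, run morphological analysis (part-of-speech tagging) and store only the word tokens in the list.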
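
As a rough illustration of the first option, punctuation can be stripped in one pass instead of chaining several `replace` calls. This sketch uses only the standard library (`string.punctuation` with `str.translate`) and whitespace splitting instead of polyglot; `strip_punctuation` is a hypothetical helper name.

```python
import string


def strip_punctuation(text):
    """Remove every ASCII punctuation character from `text`."""
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


print(strip_punctuation("I like apple! Do you like lemon?").split())
```

Note that this also deletes apostrophes and hyphens, so "don't" becomes "dont". That is one reason the tag-based filtering described next is the more accurate route.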

polyglot can also do the morphological analysis. The part-of-speech tag for ".", "?" and "!" is 'PUNCT'.

When a text contains multiple sentences, it would look something like this:

cos_sentence.py


import math
from polyglot.text import Text


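def calc_cos(text1, text2):

    # Tokenize each text and store the words (without punctuation) in a list
    list1 = text_to_list(text1)
    list2 = text_to_list(text2)

    # Denominator of the cosine similarity
    denominator = math.sqrt(len(list1) * len(list2))

    # Numerator of the cosine similarity
    numerator = 0
    for word in list1:
        if word in list2:
            numerator += 1
            # Remove the matched word so it is not counted twice
            list2.remove(word)
    return numerator / denominator

def text_to_list(text):  # Store the words of text in a list, skipping punctuation
    words = []  # named `words` to avoid shadowing the built-in `list`
    for token in Text(text).pos_tags:
        # token is a (word, tag) pair
        if token[1] != 'PUNCT':
            words.append(token[0])
    return words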
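
The loop with `list2.remove(word)` counts each word at most as many times as it occurs in both texts, which is a multiset intersection. The same count can be sketched with `collections.Counter`. This version assumes the words have already been tokenized and had punctuation removed (here with `str.split` for brevity, not polyglot); `calc_cos_from_words` is a hypothetical name, and the empty-input guard is an addition.

```python
import math
from collections import Counter


def calc_cos_from_words(words1, words2):
    """Same measure as the loop-based calc_cos, given pre-tokenized word lists."""
    if not words1 or not words2:
        # Assumption: return 0 for empty input instead of dividing by zero.
        return 0.0
    # Counter & Counter keeps the minimum count of each shared word.
    shared = Counter(words1) & Counter(words2)
    numerator = sum(shared.values())
    denominator = math.sqrt(len(words1) * len(words2))
    return numerator / denominator


words1 = "I like apple and I like lemon".split()
words2 = "I like apple and banana".split()
print(calc_cos_from_words(words1, words2))
```

Unlike the list version, this does not modify either input list. When words repeat, the value is the article's overlap-based measure and is no longer exactly the cosine of the word-count vectors.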
