Study Group on Natural Language Processing and Deep Learning (for Beginners)

Posted at 2017-07-14

Installing scikit-learn

scikit-learn (0.18.2)

!pip list

pip install scikit-learn

pip install --upgrade scikit-learn


Importing the dataset

from sklearn.datasets import fetch_20newsgroups

fetch (download the data and cache it locally)

data = fetch_20newsgroups()

Check the number of documents

len(data.data)  # data is a Bunch; len(data) would count its keys, not the documents

BoW


Count (term-frequency) vectorizer

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(data.data)

vec.transform(["This is a pen. i have a pen"])

The result has 1 row and 130107 columns (one column per vocabulary term).
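To make the shape of that result concrete, here is a minimal plain-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the two-sentence toy corpus are assumptions made for illustration. The only part borrowed from scikit-learn is the regular expression, which is CountVectorizer's default token_pattern.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (the same regex as CountVectorizer's default token_pattern).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def fit_vocabulary(documents):
    # Assign one column index to each distinct token, in alphabetical order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(document, vocabulary):
    # Build one row of counts; tokens outside the vocabulary are ignored.
    counts = Counter(tokenize(document))
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row

# Hypothetical toy corpus standing in for the 20 newsgroups data.
corpus = ["I have a pen", "This is an apple"]
vocab = fit_vocabulary(corpus)
row = transform("This is a pen. i have a pen", vocab)

print(vocab)  # token -> column index
print(row)    # one count per vocabulary column
```

The row always has as many entries as the vocabulary has terms, which is why a single sentence transformed with a vocabulary fitted on a large corpus becomes a very wide, mostly zero vector.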


Similarity


Cosine similarity

\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}
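As a small worked example of the formula, take two count vectors chosen purely for illustration (they do not come from the dataset), A = (1, 2, 0) and B = (2, 1, 1):

```latex
A \cdot B = 1 \cdot 2 + 2 \cdot 1 + 0 \cdot 1 = 4
\|A\| = \sqrt{1^2 + 2^2 + 0^2} = \sqrt{5}, \qquad
\|B\| = \sqrt{2^2 + 1^2 + 1^2} = \sqrt{6}
\cos\theta = \frac{4}{\sqrt{5}\,\sqrt{6}} = \frac{4}{\sqrt{30}} \approx 0.730
```

Count vectors have no negative entries, so the value lies between 0 (no shared terms) and 1 (same direction).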

Computing similarity


Measuring similarity

document1 = """
Ah, look at all the lonely people
Eleanor Rigby picks up the rice
"""

document2 = """
I hear Jerusalem bells a ringing
Roman Cavalry choirs are singing
"""


Vector representation

d1 = vec.transform([document1])
d2 = vec.transform([document2])

d1vec = d1.toarray()[0]
d2vec = d2.toarray()[0]

Cosine similarity

\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}

import numpy as np

d1vec.dot(d2vec) / (np.linalg.norm(d1vec) * np.linalg.norm(d2vec))
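The dense arrays above have one entry per vocabulary term, almost all of them zero. As a rough alternative sketch, the same cosine similarity can be computed directly from per-document token counts, touching only the terms that actually occur. This is plain Python, not a scikit-learn API: `tokenize` and `cosine_from_counts` are hypothetical helpers, and the two short phrases are made up for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def cosine_from_counts(a, b):
    # Dot product over the shared tokens only; all other terms contribute 0.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat its similarity as 0.
        return 0.0
    return dot / (norm_a * norm_b)

c1 = Counter(tokenize("the lonely people"))
c2 = Counter(tokenize("the lonely bells"))
sim = cosine_from_counts(c1, c2)
print(sim)
```

The explicit zero-norm check matters in practice: a document whose words are all missing from the vocabulary becomes an all-zero vector, and dividing by its norm would otherwise fail or produce NaN.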

N-gram model

Reference
http://recognize-speech.com/language-model/n-gram-model/comparison

N=1: unigram
N=2: bigram
N=3: trigram
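The three cases can be sketched with one small helper that slides a window of N tokens over a token list. The function name `ngrams` and the example tokens are assumptions for illustration.

```python
def ngrams(tokens, n):
    # Join every run of n consecutive tokens into one space-separated string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["this", "is", "a", "sentence"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sequence of k tokens yields k - N + 1 N-grams, so larger N gives fewer but more specific features.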


Implementation

vec = CountVectorizer(ngram_range=(1,2))

N=1: unigram
N=2: bigram


Fitting on the full dataset is expensive, so we use a much smaller dataset here.


vec.fit(["This is a sentence"])

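Resulting vocabulary: (this, is, sentence, this is, is sentence). CountVectorizer's default token_pattern keeps only tokens of two or more characters, so "a" is dropped before the N-grams are built.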

vn = vec.transform(["this is python"])
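To see what fit and transform do with ngram_range=(1,2), here is a plain-Python approximation. It is a sketch under stated assumptions, not scikit-learn's code: `word_ngrams` is a hypothetical helper, the regex is CountVectorizer's default token_pattern, and the features are sorted alphabetically here to get a fixed column order.

```python
import re
from collections import Counter

def word_ngrams(text, min_n=1, max_n=2):
    # Tokenize first (single-character tokens such as "a" are dropped),
    # then collect every N-gram for N in [min_n, max_n].
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

# "fit": the distinct unigrams and bigrams of the training sentence.
features = sorted(set(word_ngrams("This is a sentence")))

# "transform": count how often each known feature occurs in a new sentence.
counts = Counter(word_ngrams("this is python"))
vector = [counts[f] for f in features]

print(features)
print(vector)
```

Terms that were not seen during fitting, such as "python" and "is python", have no column and are silently ignored.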


Extras

For those who want to study natural language processing on their own
http://cl.sd.tmu.ac.jp/prospective/prerequisite

NLP 100 Exercises (言語処理100本ノック)
http://www.cl.ecei.tohoku.ac.jp/nlp100/

Introduction to Machine Learning with Python (Japanese edition: Pythonではじめる機械学習)
https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb

JUMAN++, a Japanese morphological analysis system
http://nlp.ist.i.kyoto-u.ac.jp/?JUMAN%2B%2B
