Natural Language Processing and Deep Learning Study Group (for Beginners)


by weeyble

Installing scikit-learn

http://scikit-learn.org/stable/

scikit-learn (0.18.2)

!pip list

pip install scikit-learn

pip install --upgrade scikit-learn


Importing the dataset

from sklearn.datasets import fetch_20newsgroups

fetch (downloads the data and caches it locally)

data = fetch_20newsgroups()

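Check the number of documents

len(data.data)  # data is a dict-like Bunch; len(data) would count its keys, not the documents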

BoW (Bag of Words)


Count vectorizer (word frequencies)

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(data.data)

vec.transform(["This is a pen. i have a pen"])

Result: a sparse matrix with 1 row and 130107 columns (one column per vocabulary word)
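To make the fit/transform steps above concrete, here is a minimal hand-rolled sketch of a bag-of-words counter using only the standard library. It is not how scikit-learn is implemented; the regex is an assumption meant to mirror `CountVectorizer`'s default `token_pattern`, and the toy corpus and the helper names `tokenize`, `fit_vocabulary`, and `transform` are hypothetical.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default token_pattern, which keeps
# only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(document, vocabulary):
    """Return a dense count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy corpus standing in for the 20 newsgroups data.
corpus = ["I have a pen", "This is a pen", "This is an apple"]
vocabulary = fit_vocabulary(corpus)
bow = transform("This is a pen. i have a pen", vocabulary)

print(vocabulary)  # {'an': 0, 'apple': 1, 'have': 2, 'is': 3, 'pen': 4, 'this': 5}
print(bow)         # [0, 0, 1, 1, 2, 1]
```

Note that the single-character tokens "a" and "i" never reach the vocabulary, and that the vector length equals the vocabulary size, which is why the real transform returns 130107 columns.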


Similarity


Cosine similarity

\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}

Computing similarity


Measuring similarity

document1 = """
Ah, look at all the lonely people
Eleanor Rigby picks up the rice
"""

document2 = """
I hear Jerusalem bells a ringing
Roman Cavalry choirs are singing
"""


Vector representation

d1 = vec.transform([document1])
d2 = vec.transform([document2])

d1vec = d1.toarray()[0]
d2vec = d2.toarray()[0]

Cosine similarity

\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}

import numpy as np

d1vec.dot(d2vec) / (np.linalg.norm(d1vec) * np.linalg.norm(d2vec))
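The one-liner above divides by zero when either document contains no word from the fitted vocabulary. The following is a small sketch of the same formula wrapped in a function. The name `cosine_similarity` is a hypothetical local helper, and returning 0.0 for an all-zero vector is a convention assumed here, not something the formula defines.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors.

    Assumption: returns 0.0 when either vector is all zeros, where the
    formula itself is undefined.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(a.dot(b) / denom)


# Hand-checkable example: A.B = 1 and ||A|| = ||B|| = sqrt(2), so cos(theta) is about 0.5.
a = np.array([1, 1, 0])
b = np.array([1, 0, 1])
sim_ab = cosine_similarity(a, b)
sim_zero = cosine_similarity(a, np.zeros(3))
print(sim_ab, sim_zero)
```

The same function can be called as `cosine_similarity(d1vec, d2vec)` with the dense vectors built earlier.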

N-gram model

Reference
http://recognize-speech.com/language-model/n-gram-model/comparison

N=1: unigram
N=2: bigram
N=3: trigram
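As an illustration of these three cases, here is a minimal sketch that builds word-level n-grams from a token list. The helper name `word_ngrams` and the whitespace tokenization are assumptions made for this example only.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this is a sentence".split()

unigrams = word_ngrams(tokens, 1)  # ['this', 'is', 'a', 'sentence']
bigrams = word_ngrams(tokens, 2)   # ['this is', 'is a', 'a sentence']
trigrams = word_ngrams(tokens, 3)  # ['this is a', 'is a sentence']

for n, grams in ((1, unigrams), (2, bigrams), (3, trigrams)):
    print(n, grams)
```

A sentence with k tokens yields k - n + 1 n-grams, so the feature space grows quickly as N increases.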


Implementation

vec = CountVectorizer(ngram_range=(1,2))

N=1: unigram
N=2: bigram


Fitting is expensive, so we use a much smaller dataset here


vec.fit(["This is a sentence"])

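Learned vocabulary: (is, is sentence, sentence, this, this is)
The default token_pattern drops single-character tokens such as "a". To get (this, is, a, sentence, this is, is a, a sentence), pass token_pattern=r"(?u)\b\w+\b".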

vn = vec.transform(["this is python"])
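To see what the fitted vectorizer actually learned, the sketch below inspects `vocabulary_` and the transformed counts. It assumes scikit-learn is installed. The variable names are hypothetical, and the second vectorizer with a custom `token_pattern` is added only for comparison, since the default pattern ignores single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default tokenization: tokens need at least two word characters.
default_vec = CountVectorizer(ngram_range=(1, 2))
default_vec.fit(["This is a sentence"])
default_vocab = sorted(default_vec.vocabulary_)
print(default_vocab)
# ['is', 'is sentence', 'sentence', 'this', 'this is']

# Comparison: a token_pattern that also keeps single-character tokens.
keep_short_vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
keep_short_vec.fit(["This is a sentence"])
keep_short_vocab = sorted(keep_short_vec.vocabulary_)
print(keep_short_vocab)
# ['a', 'a sentence', 'is', 'is a', 'sentence', 'this', 'this is']

# Unseen n-grams ("python", "is python") are ignored at transform time.
counts = default_vec.transform(["this is python"]).toarray()[0].tolist()
print(counts)
# [1, 0, 0, 1, 1] -> 'is', 'this' and 'this is' each occur once
```

Columns follow the alphabetical order of the vocabulary, so each position in `counts` lines up with the matching entry of `default_vocab`.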


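Extras

For people who want to self-study natural language processing
http://cl.sd.tmu.ac.jp/prospective/prerequisite

NLP 100 Exercises (言語処理100本ノック)
http://www.cl.ecei.tohoku.ac.jp/nlp100/

Introduction to Machine Learning with Python (Pythonではじめる機械学習)
https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb

JUMAN++, a Japanese morphological analysis system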
http://nlp.ist.i.kyoto-u.ac.jp/?JUMAN%2B%2B
