# Natural Language Processing and Deep Learning Study Group (for Beginners)


by weeyble

# Introducing scikit-learn

http://scikit-learn.org/stable/

Check the installed version from a notebook cell with `!pip list`. This session uses:

scikit-learn (0.18.2)

# Importing the dataset

from sklearn.datasets import fetch_20newsgroups


# fetch (download and cache locally)

data = fetch_20newsgroups()


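# Checking the number of documents

# data is a dict-like Bunch, so len(data) would count its keys;
# the documents themselves are in data.data
len(data.data)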


# Frequency (count) vectorizer

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(data.data)


vec.transform(["This is a pen. i have a pen"])
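To see what `transform` actually returns, here is a minimal sketch on a tiny corpus. The corpus and the names `toy_corpus`, `toy_vec`, and `counts` are made up for illustration (they are not from the session), so the example runs without downloading 20 Newsgroups.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used instead of the 20 Newsgroups data
toy_corpus = ["This is a pen", "I have a pen", "I have an apple"]

toy_vec = CountVectorizer()
toy_vec.fit(toy_corpus)

# vocabulary_ maps each token to its column index; sorting by index gives column order
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

# transform returns a scipy sparse matrix of shape (n_documents, n_vocabulary)
X = toy_vec.transform(["This is a pen. i have a pen"])

# Pair each token with its count in the transformed sentence
counts = {token: int(c) for token, c in zip(vocab, X.toarray()[0])}
print(counts)
# {'an': 0, 'apple': 0, 'have': 1, 'is': 1, 'pen': 2, 'this': 1}
```

Note that text is lowercased and one-character tokens such as "a" and "i" do not appear: the default `token_pattern` only keeps tokens of two or more characters.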

# Cosine similarity

\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}
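The formula can be checked by hand on small vectors. The following is a sketch, not the session's code: the helper name `cosine_sim` and the choice to return 0.0 for an all-zero vector are assumptions made here.

```python
import numpy as np

def cosine_sim(a, b):
    """Return cos(theta) between two 1-D vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an all-zero vector has similarity 0 with everything
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))

a = np.array([1, 1, 0])
b = np.array([1, 0, 1])

# A . B = 1 and ||A|| = ||B|| = sqrt(2), so the result is approximately 0.5
sim = cosine_sim(a, b)
print(sim)
```

Identical directions give 1, and vectors with no shared non-zero component give 0, which is why the measure works for word-count vectors.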


# Measuring similarity

document1 = """
Ah, look at all the lonely people
Eleanor Rigby picks up the rice
"""

document2 = """
I hear Jerusalem bells a ringing
Roman Cavalry choirs are singing
"""

# Vector representation

d1 = vec.transform([document1])
d2 = vec.transform([document2])

d1vec = d1.toarray()[0]
d2vec = d2.toarray()[0]


# Cosine similarity

\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}

import numpy as np

# Dot product divided by the product of the two vector norms
d1vec.dot(d2vec) / (np.linalg.norm(d1vec) * np.linalg.norm(d2vec))
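As an alternative to converting to dense arrays, scikit-learn's `cosine_similarity` accepts the sparse matrices directly. This is a sketch under assumptions: the vectorizer is fitted on just three short documents rather than on 20 Newsgroups, and `doc_c` is an extra line added here for illustration so that one pair shares words.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "Ah, look at all the lonely people"
doc_b = "I hear Jerusalem bells a ringing"
# Illustrative third document that shares several words with doc_a
doc_c = "All the lonely people, where do they all come from"

pair_vec = CountVectorizer()
X = pair_vec.fit_transform([doc_a, doc_b, doc_c])  # sparse count matrix, one row per document

# Pairwise similarities between all rows, returned as a dense (3, 3) array
sims = cosine_similarity(X)

sim_ab = float(sims[0, 1])  # no shared words, so the dot product is 0
sim_ac = float(sims[0, 2])  # shared words, so strictly between 0 and 1
print(sim_ab, sim_ac)
```

Documents with no words in common score 0 however long they are, which is the main limitation of plain count vectors.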


N=1: unigram
N=2: bigram
N=3: trigram
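A minimal pure-Python sketch of what the three cases mean. The helper `ngrams` is hypothetical and splits on whitespace only, which is simpler than scikit-learn's tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a sentence".split()

unigrams = ngrams(tokens, 1)  # ['this', 'is', 'a', 'sentence']
bigrams = ngrams(tokens, 2)   # ['this is', 'is a', 'a sentence']
trigrams = ngrams(tokens, 3)  # ['this is a', 'is a sentence']
print(unigrams, bigrams, trigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, so the feature space grows quickly as N increases.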

# Implementation

vec = CountVectorizer(ngram_range=(1,2))


N=1: unigram
N=2: bigram

# This is heavy processing, so we shrink the dataset

vec.fit(["This is a sentence"])


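Conceptually, the unigrams and bigrams are (this, is, a, sentence, this is, is a, a sentence). Note that CountVectorizer's default token_pattern drops one-character tokens such as "a", so the vocabulary it actually learns here is (this, is, sentence, this is, is sentence).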

vn = vec.transform(["this is python"])
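To inspect what the fitted vectorizer learned and what `vn` contains, a sketch like the following may help. The names `bigram_vec`, `keep_all_vec`, `features`, and `row` are introduced here for illustration. The second vectorizer passes a custom `token_pattern` that also keeps one-character tokens, which is one possible way to retain "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

bigram_vec = CountVectorizer(ngram_range=(1, 2))
bigram_vec.fit(["This is a sentence"])

# Column order of the learned features (the default tokenizer drops "a")
features = sorted(bigram_vec.vocabulary_, key=bigram_vec.vocabulary_.get)
# ['is', 'is sentence', 'sentence', 'this', 'this is']

vn = bigram_vec.transform(["this is python"])
row = [int(c) for c in vn.toarray()[0]]
# [1, 0, 0, 1, 1]  ("python" and "is python" are not in the vocabulary, so they are ignored)

# Assumed variant: keep tokens of one or more characters
keep_all_vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
keep_all_vec.fit(["This is a sentence"])
all_features = sorted(keep_all_vec.vocabulary_, key=keep_all_vec.vocabulary_.get)
# ['a', 'a sentence', 'is', 'is a', 'sentence', 'this', 'this is']
print(features, row, all_features)
```

Words and n-grams unseen during `fit` are silently dropped at `transform` time, so the vocabulary should come from a corpus that resembles the text being vectorized.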


# Miscellaneous extras

http://cl.sd.tmu.ac.jp/prospective/prerequisite

http://www.cl.ecei.tohoku.ac.jp/nlp100/

http://nlp.ist.i.kyoto-u.ac.jp/?JUMAN%2B%2B
