Improving the search experience with word2vec

Posted at 2021-02-24

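Artists often release albums under several names, which we can call aliases. When searching for an artist's albums, can we also find the albums released under those aliases, along with other related albums?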

Agenda

  1. Download the Japanese Wikipedia corpus
  2. Preprocess the Japanese Wikipedia corpus
  3. Generate a word2vec model with gensim
  4. Final

Download the Japanese Wikipedia corpus

wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2

Preprocess the Japanese Wikipedia corpus

WikiCorpus


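from gensim.corpora import WikiCorpus

# Passing an empty dictionary skips the slow vocabulary-building pass.
# lemmatize= is a gensim 3.x argument; gensim 4 no longer accepts it.
corpus = WikiCorpus('./jawiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})
# Open the output in write mode; each article becomes one line of space-separated tokens.
with open('./ja-wiki-raw', 'w', encoding='utf-8') as output:
    for text in corpus.get_texts():
        output.write(' '.join(text) + '\n')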

MeCab + mecab-ipadic-neologd

mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd -Owakati ja-wiki-raw -o ja-wiki-token -b 1000000000
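
The `-Owakati` option makes MeCab write each input line back out as tokens separated by spaces, which is the layout the training step reads. As a rough illustration of that layout, here is a minimal sketch that reads such a file into token lists. The helper name `iter_token_lines` and the sample sentences are made up for this example, and the reader is a simplification of what gensim's `LineSentence` does, not its actual implementation.

```python
import os
import tempfile


def iter_token_lines(path):
    """Yield one list of tokens per non-empty line of a wakati-style file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()  # tokens are separated by whitespace
            if tokens:  # skip blank lines
                yield tokens


# Write a tiny hypothetical sample in the same format as ja-wiki-token.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample-token")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("澤野弘之 は 日本 の 作曲家\n\nnzk は 澤野弘之 の プロジェクト\n")
    sample = list(iter_token_lines(sample_path))

for tokens in sample:
    print(tokens)
```

Running the same loop over the first few lines of `ja-wiki-token` is a quick way to check that the dictionary kept names such as 澤野弘之 as single tokens before spending time on training.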

Generate a word2vec model with gensim

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

source = "./ja-wiki-token"
vector_size = 200
min_count = 10
window_size = 5
workers = 3

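# LineSentence streams the tokenized file: one document per line, tokens separated by spaces.
sentences = word2vec.LineSentence(source)
# size= is the gensim 3.x keyword; gensim 4 renamed it to vector_size=.
model = word2vec.Word2Vec(sentences, size=vector_size, min_count=min_count, window=window_size, workers=workers)
# Save only the word vectors, in the binary word2vec format.
model.wv.save_word2vec_format('./ja-wiki-model.vec.pt', binary=True)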

Final

  1. Compute the similarity between 澤野弘之 and nzk
  2. List the words most similar to 澤野弘之

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('./ja-wiki-model.vec.pt', binary=True)

similarity = word_vectors.similarity('澤野弘之', 'nzk')

print(similarity)

data = word_vectors.most_similar('澤野弘之')

for word in data:
    print(word)

Output:

0.7995355
('nzk', 0.7995355129241943)
('梶浦由記', 0.79180908203125)
('tielle', 0.7494333982467651)
('gemie', 0.7381956577301025)
('小林未郁', 0.7224546074867249)
('yamanaiame', 0.7218720316886902)
('和田貴史', 0.711627721786499)
('caldito', 0.7096607685089111)
('blackschleger', 0.7088098526000977)
('cyua', 0.7018069624900818)
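
The scores above are cosine similarities between word vectors, which is the measure gensim's `similarity` and `most_similar` use. To make the number easier to interpret, here is a minimal pure-Python sketch of the calculation. The three-dimensional vectors are made up for illustration; the model trained above uses 200 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not taken from the trained model.
v_same_direction_1 = [1.0, 2.0, 0.0]
v_same_direction_2 = [2.0, 4.0, 0.0]
v_orthogonal = [0.0, 0.0, 3.0]

print(cosine_similarity(v_same_direction_1, v_same_direction_2))
print(cosine_similarity(v_same_direction_1, v_orthogonal))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, so a value near 0.8 for 澤野弘之 and nzk indicates that the two words appear in very similar contexts.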

Put the strings returned by most_similar into an Elasticsearch multi_match query, and boost the score of each one according to its similarity.
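
One way this could look is sketched below. It builds the request body as a plain dictionary, with one `multi_match` clause per word inside a `bool`/`should` query and the similarity used as that clause's `boost`. The field names `artist` and `album_title`, the helper `build_expanded_query`, and the `min_score` cutoff are assumptions for illustration and are not from the original setup.

```python
import json


def build_expanded_query(term, similar_words, fields, min_score=0.75):
    """Build a bool/should query: the original term plus boosted related words."""
    # The original search term keeps the highest boost.
    should = [{"multi_match": {"query": term, "fields": fields, "boost": 1.0}}]
    for word, score in similar_words:
        if score < min_score:
            continue  # drop weakly related words to limit noise
        should.append(
            {"multi_match": {"query": word, "fields": fields, "boost": float(score)}}
        )
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


# A few (word, similarity) pairs copied from the most_similar output above.
similar_words = [
    ("nzk", 0.7995355129241943),
    ("梶浦由記", 0.79180908203125),
    ("cyua", 0.7018069624900818),
]

query = build_expanded_query("澤野弘之", similar_words, ["artist", "album_title"])
print(json.dumps(query, ensure_ascii=False, indent=2))
```

The resulting dictionary can be sent as the body of a search request. Because every related word has a boost below 1.0, documents matching the original term should generally rank above documents that only match a related word.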