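Artists often release albums under more than one name; we can call these names aliases. When searching for an artist's albums, can we also find the albums released under those aliases, along with other related albums?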
#agenda
- Download the ja-wikipedia corpus
- Preprocess the ja-wikipedia corpus
- Use gensim to generate a word2vec model
- Final
Download the ja-wikipedia corpus
wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
Preprocess the ja-wikipedia corpus
WikiCorpus
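from gensim.corpora import WikiCorpus
# lemmatize=False follows the gensim 3.x API used throughout this post.
# dictionary={} skips building a vocabulary, which is not needed for plain text extraction.
corpus = WikiCorpus('./jawiki-latest-pages-articles.xml.bz2', lemmatize=False, dictionary={})
# Write one article per line, with tokens separated by spaces.
with open('./ja-wiki-raw', 'w', encoding='utf-8') as output:
    for text in corpus.get_texts():
        output.write(' '.join(text) + '\n')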
Mecab + mecab-ipadic-neologd
mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd -Owakati ja-wiki-raw -o ja-wiki-token -b 1000000000
Use gensim to generate a word2vec model
from gensim.models import word2vec
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
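source = "./ja-wiki-token"  # MeCab output: one article per line, space-separated tokens
vector_size = 200  # dimensionality of the word vectors
min_count = 10  # ignore tokens that occur fewer than 10 times
window_size = 5  # maximum distance between the target word and a context word
workers = 3  # number of training threads
sentences = word2vec.LineSentence(source)
# `size` is the gensim 3.x keyword; gensim 4 renamed it to `vector_size`.
model = word2vec.Word2Vec(sentences, size=vector_size, min_count=min_count, window=window_size, workers=workers)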
model.wv.save_word2vec_format('./ja-wiki-model.vec.pt', binary=True)
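As a rough illustration of what `word2vec.LineSentence` hands to training: it treats each line of the file as one sentence and splits it on whitespace, which is why the MeCab `-Owakati` output can be used directly. The sketch below mimics that behaviour in plain Python. `iter_sentences` is a hypothetical helper, the sample text is made up, and gensim's splitting of very long lines into chunks is left out.

```python
import io


def iter_sentences(stream):
    """Yield one token list per non-empty line, splitting on whitespace.

    Simplified, hypothetical stand-in for gensim's LineSentence.
    """
    for line in stream:
        tokens = line.split()
        if tokens:  # blank lines produce no sentence
            yield tokens


# Made-up stand-in for a few lines of the tokenized corpus file.
sample = io.StringIO("澤野 弘之 は 日本 の 作曲家\n\nnzk の アルバム\n")
sentences = list(iter_sentences(sample))
print(sentences)
```

Each yielded list is one training sentence, so the quality of the MeCab segmentation directly decides which tokens end up in the vocabulary.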
Final
- Compute the similarity between 澤野弘之 and nzk
- List the words most related to 澤野弘之
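from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('./ja-wiki-model.vec.pt', binary=True)
# Cosine similarity between the artist name and the alias token.
similarity = word_vectors.similarity('澤野弘之', 'nzk')
print(similarity)
# The ten nearest tokens (topn defaults to 10), as (token, similarity) pairs.
data = word_vectors.most_similar('澤野弘之')
for word in data:
    print(word)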
0.7995355
('nzk', 0.7995355129241943)
('梶浦由記', 0.79180908203125)
('tielle', 0.7494333982467651)
('gemie', 0.7381956577301025)
('小林未郁', 0.7224546074867249)
('yamanaiame', 0.7218720316886902)
('和田貴史', 0.711627721786499)
('caldito', 0.7096607685089111)
('blackschleger', 0.7088098526000977)
('cyua', 0.7018069624900818)
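The scores above are cosine similarities between word vectors. A minimal sketch of that calculation in plain Python, using made-up two-dimensional vectors and hypothetical helper names (`cosine_similarity`, `rank_neighbours`), might look like this. It illustrates the idea only and is not gensim's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_neighbours(word, vectors, topn=10):
    """Return (token, similarity) pairs for the tokens closest to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # never return the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for illustration; real vectors here have 200 dimensions.
toy_vectors = {
    "artist": [1.0, 0.0],
    "alias": [1.0, 1.0],
    "unrelated": [0.0, 1.0],
}
neighbours = rank_neighbours("artist", toy_vectors)
print(neighbours)
```

Tokens that appear in similar contexts in the corpus end up with vectors pointing in similar directions, which is why an alias can rank close to the artist's name.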
Put the terms returned by most_similar into an Elasticsearch MultiMatch query and boost the score according to how related each term is.
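One possible way to build such a query body is sketched below. It only constructs the JSON dictionary and does not call Elasticsearch. The helper `build_alias_query`, the field names `artist` and `album_title`, the `min_score` cutoff, and the choice to use the similarity value directly as the boost are all assumptions made for illustration.

```python
import json


def build_alias_query(artist, similar_terms, fields, min_score=0.7):
    """Build a bool/should query with one multi_match clause per term.

    Hypothetical helper: the original name gets boost 1.0, and each related
    term is boosted by its word2vec similarity.
    """
    should = [{"multi_match": {"query": artist, "fields": fields, "boost": 1.0}}]
    for term, score in similar_terms:
        if score < min_score:
            continue  # drop weakly related terms (assumed cutoff)
        should.append(
            {"multi_match": {"query": term, "fields": fields, "boost": round(score, 4)}}
        )
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


# A few (token, similarity) pairs taken from the most_similar output above.
similar = [
    ("nzk", 0.7995355129241943),
    ("梶浦由記", 0.79180908203125),
    ("cyua", 0.7018069624900818),
]
# "artist" and "album_title" are assumed index fields.
query = build_alias_query("澤野弘之", similar, ["artist", "album_title"], min_score=0.75)
print(json.dumps(query, ensure_ascii=False, indent=2))
```

With a cutoff like this, documents matching the artist's own name rank highest, while aliases and closely related artists still contribute in proportion to their similarity.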