
Vectorizing Words Based on the Distributional Hypothesis (Count-Based Method)

Last updated at 2023-11-07, posted at 2023-11-07

The Distributional Hypothesis

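The distributional hypothesis is the idea that a word has no meaning by itself; its meaning is formed by the words around it (its context).
This post is a write-up of my own understanding of how to vectorize text based on the distributional hypothesis.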

Recording which words appear around each word

The following sentence is used as the example (quoted from the lyrics of the Beatles classic "Hello, Goodbye"):

text = 'You say goodbye and I say hello.'

Split the sentence into words:

text = text.lower()
text = text.replace('.', ' .')
words = text.split(" ")
print(words)

['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']

Assign a unique ID to each word. Build two dictionaries, one mapping word to ID and one mapping ID to word:

word_to_id = {}
id_to_word = {}

for word in words:
  if word not in word_to_id:
    new_id = len(word_to_id)
    word_to_id[word] = new_id
    id_to_word[new_id] = word

print(word_to_id)

{'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}

Using this word/ID table, convert the original sentence into a corpus (a list of word IDs):

corpus = [word_to_id[w] for w in words]
print(corpus)

[0, 1, 2, 3, 4, 1, 5, 6]
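The three steps above (splitting into words, building the ID dictionaries, converting to a corpus) can be bundled into a single helper. The following is a minimal sketch; the function name `preprocess` and its return order are assumptions for illustration, not something defined elsewhere in this post.

```python
def preprocess(text):
    """Tokenize text and return (corpus, word_to_id, id_to_word)."""
    # Lowercase, then split the period off as its own token.
    words = text.lower().replace('.', ' .').split(' ')

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)  # IDs are assigned in order of first appearance
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    # The corpus is the sentence rewritten as a list of word IDs.
    corpus = [word_to_id[w] for w in words]
    return corpus, word_to_id, id_to_word


corpus, word_to_id, id_to_word = preprocess('You say goodbye and I say hello.')
print(corpus)  # [0, 1, 2, 3, 4, 1, 5, 6]
```

Returning both dictionaries keeps it easy to go from a word to its row in a matrix and from a row index back to a word.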

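Next, build a matrix that records which words appear on either side of each word. The word 'goodbye' (ID 2) has 'say' (ID 1) on its left and 'and' (ID 3) on its right, so its row is [0, 1, 0, 1, 0, 0, 0].
Build this row for every word.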
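As a check on the 'goodbye' row described above, the single-position case can be sketched on its own. The helper name `context_row` is hypothetical and used only for illustration; it counts the neighbors of one position in the corpus, and does not sum over repeated occurrences of the same word.

```python
corpus = [0, 1, 2, 3, 4, 1, 5, 6]  # 'you say goodbye and i say hello .'
vocab_size = 7


def context_row(corpus, position, vocab_size, window_size=1):
    """Count the words within window_size of a single corpus position."""
    row = [0] * vocab_size
    start = max(0, position - window_size)
    end = min(len(corpus), position + window_size + 1)
    for idx in range(start, end):
        if idx != position:  # skip the target word itself
            row[corpus[idx]] += 1
    return row


# Position 2 in the corpus is 'goodbye'.
print(context_row(corpus, 2, vocab_size))  # [0, 1, 0, 1, 0, 0, 0]
```

A word such as 'say' occurs at two positions, so its full row is the sum of the rows for each occurrence.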

import numpy as np

def create_co_matrix(corpus, vocab_size, window_size=1):
  # Rows are target words, columns are context words.
  corpus_size = len(corpus)
  co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

  for idx, word_id in enumerate(corpus):
    for i in range(1, window_size + 1):
      left_idx = idx - i
      right_idx = idx + i

      # Count the word i positions to the left, if there is one.
      if left_idx >= 0:
        left_word_id = corpus[left_idx]
        co_matrix[word_id, left_word_id] += 1

      # Count the word i positions to the right, if there is one.
      if right_idx < corpus_size:
        right_word_id = corpus[right_idx]
        co_matrix[word_id, right_word_id] += 1

  return co_matrix

vocab_size = len(word_to_id)
C = create_co_matrix(corpus, vocab_size)
print(C)

[[0 1 0 0 0 0 0]
 [1 0 1 0 1 1 0]
 [0 1 0 1 0 0 0]
 [0 0 1 0 1 0 0]
 [0 1 0 1 0 0 0]
 [0 1 0 0 0 0 1]
 [0 0 0 0 0 1 0]]
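Each row of this matrix is the vector for one word. One common way to compare two such vectors is cosine similarity. The following is a minimal sketch: the function name `cos_similarity` and the small constant `eps` (added to avoid division by zero) are assumptions for illustration, and the matrix is copied from the output above so that the block runs on its own.

```python
import numpy as np

# Co-occurrence matrix copied from the output above.
C = np.array([
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
], dtype=np.int32)
word_to_id = {'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}


def cos_similarity(x, y, eps=1e-8):
    """Cosine of the angle between x and y; eps guards against zero vectors."""
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return float(np.dot(nx, ny))


sim_you_i = cos_similarity(C[word_to_id['you']], C[word_to_id['i']])
sim_goodbye_i = cos_similarity(C[word_to_id['goodbye']], C[word_to_id['i']])
sim_you_and = cos_similarity(C[word_to_id['you']], C[word_to_id['and']])
print(sim_you_i, sim_goodbye_i, sim_you_and)
```

In this tiny corpus, 'goodbye' and 'i' have identical rows, so their similarity is essentially 1, while 'you' and 'and' share no neighbors and score 0. The value for 'you' and 'i' is about 0.71.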

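We have now vectorized words based on the distributional hypothesis. With these vectors, words whose vectors are highly similar can be regarded as similar in meaning. Left as it is, however, this approach has the following problems: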

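Words that occur very frequently show spurious relatedness, because they co-occur with many words regardless of meaning.
As the corpus grows, the vocabulary grows with it, so the matrix becomes large and expensive to compute with.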
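To make the second problem concrete, the size of a dense co-occurrence matrix can be estimated directly from the vocabulary size. This is a rough sketch only: the helper name `dense_matrix_bytes` and the vocabulary sizes other than 7 are hypothetical, and 4 bytes per entry assumes the `np.int32` dtype used above.

```python
def dense_matrix_bytes(vocab_size, bytes_per_entry=4):
    """Bytes needed for a dense vocab_size x vocab_size matrix of int32 counts."""
    return vocab_size * vocab_size * bytes_per_entry


# 7 is the vocabulary of the example sentence; the larger sizes are hypothetical.
for vocab_size in (7, 10_000, 100_000):
    gib = dense_matrix_bytes(vocab_size) / 2**30
    print(f'{vocab_size:>7} words -> {gib:.2f} GiB')
```

The memory grows with the square of the vocabulary size: under these assumptions, 10,000 words need roughly 0.37 GiB and 100,000 words roughly 37 GiB, and most entries in such a matrix are zero.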

Main reference
