
Natural Language Processing: Preprocessing


Posted at 2022-04-16

Overall flow

① Cleaning
② Morphological analysis
③ Normalization
④ Stop word removal
⑤ Word vectorization

Cleaning

・Remove unwanted characters, such as leftover markup in data scraped with BeautifulSoup (bs).
　Accessing .string on an element keeps only its text.
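The cleaning step can be sketched without any third-party package. The example below uses the standard library's html.parser in place of BeautifulSoup, so it is a stand-in for the approach above rather than the same method; the names TextExtractor and clean_html are hypothetical. With BeautifulSoup itself, note that .string returns None when an element has more than one child, in which case get_text() is the usual alternative.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined without a separator, which suits Japanese text
        return "".join(self._chunks).strip()


def clean_html(html):
    """Return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<div><script>var x = 1;</script><p>今日は<b>晴れ</b>です。</p></div>"
print(clean_html(sample))  # 今日は晴れです。
```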

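Morphological analysis: splitting text into words

Library: MeCab

・MeCab has several output formats:

-Ochasen: splits the text into words and also shows part of speech, conjugation, etc.
-Owakati: prints the words separated by half-width spaces (wakati-gaki)
-Oyomi: prints the reading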
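As a rough sketch of how the -Owakati format might be used from Python, the helper below turns MeCab output into a list of tokens. It assumes the mecab-python3 package and a dictionary are installed; the function name tokenize and its tagger parameter are hypothetical, added so the helper can also run with a stand-in object.

```python
def tokenize(text, tagger=None):
    """Return the words of `text` as a list.

    `tagger` is any object with a parse(str) -> str method. When omitted,
    a MeCab tagger in -Owakati mode is created (assumes mecab-python3 and
    a dictionary are installed).
    """
    if tagger is None:
        import MeCab  # imported lazily so the helper loads without MeCab
        tagger = MeCab.Tagger("-Owakati")
    # -Owakati output is space-separated and ends with a newline,
    # so str.split() is enough to recover the tokens
    return tagger.parse(text).split()
```

With MeCab installed, calling tokenize("すもももももももものうち") should return the words as a list. Whether -Ochasen and -Oyomi are available depends on the installed dictionary.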

Normalization: resolving spelling and notation variants

Library: neologdn

import neologdn

# Convert half-width katakana to full-width
full_letter = neologdn.normalize("ｸﾙﾏ")
print(full_letter) # クルマ
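For comparison, part of this normalization can be approximated with the standard library alone. The sketch below uses Unicode NFKC normalization, which is a different technique from neologdn: it unifies half-width and full-width forms but does not apply neologdn's additional rules. The function name normalize_nfkc is hypothetical.

```python
import unicodedata


def normalize_nfkc(text):
    """Unify half-width/full-width variants using Unicode NFKC."""
    return unicodedata.normalize("NFKC", text)


print(normalize_nfkc("ｸﾙﾏ"))          # クルマ
print(normalize_nfkc("Ｐｙｔｈｏｎ３"))  # Python3
```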

Numbers: replace every run of digits with 0

import re
t = "12月のクリスマスに100万円のダイヤモンドをプレゼントする"
print(re.sub(r"[0-9]+", "0", t)) # 0月のクリスマスに0万円のダイヤモンドをプレゼントする
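One possible variation, assuming the input may also contain full-width digits: the pattern [0-9] only covers ASCII digits, whereas \d on a Python 3 str pattern matches any Unicode decimal digit. The helper name mask_numbers is hypothetical.

```python
import re


def mask_numbers(text, placeholder="0"):
    """Replace each run of decimal digits with a single placeholder."""
    # \d matches full-width digits such as １００ as well as 0-9.
    # Kanji numerals (for example 百) are not digits and are left as-is.
    return re.sub(r"\d+", placeholder, text)


print(mask_numbers("12月のクリスマスに１００万円のダイヤモンドをプレゼントする"))
# 0月のクリスマスに0万円のダイヤモンドをプレゼントする
```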

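Stop word removal: removing very frequent words that carry little meaning (は, が, を, に, and so on)

① Dictionary-based

Libraries: SlothLib (a list of Japanese stop words)
           NLTK (English stop words)
How to download the SlothLib list ↓↓↓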

import os
import urllib.request

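def download(path):
    # Location of the SlothLib Japanese stop word list
    url = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
    if os.path.exists(path):
        # Skip the download when the file is already present
        print('File already exists.')
    else:
        print('Downloading...')
        urllib.request.urlretrieve(url, path)
        print("Finish!")

# Save the list to a local file named "stopwords"
download("stopwords")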
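Once the list has been downloaded, applying it could look roughly like the sketch below. It assumes the file is UTF-8 text with one stop word per line (blank lines are ignored); the helper names load_stopwords and remove_stopwords are hypothetical.

```python
def load_stopwords(path):
    """Read one stop word per line, ignoring blank lines (assumes UTF-8)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stopwords]
```

For example, remove_stopwords(tokens, load_stopwords("stopwords")) would filter a token list produced by the morphological analysis step.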

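② Frequency-based
Use the Counter class

from collections import Counter

words = ["は", "猫", "は", "犬"]  # illustrative token list
print(Counter(words).most_common())  # listed from most to least frequent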
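A minimal sketch of frequency-based removal, assuming the documents are already tokenized into lists of words: count every token, treat the n most frequent ones as stop words, and filter them out. The function names and the sample data are illustrative only.

```python
from collections import Counter


def frequent_words(docs, n):
    """Return the set of the n most frequent tokens across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    return {word for word, _ in counts.most_common(n)}


def drop_frequent(docs, n):
    """Remove the n most frequent tokens from every document."""
    stop = frequent_words(docs, n)
    return [[token for token in doc if token not in stop] for doc in docs]


docs = [
    ["猫", "は", "魚", "を", "食べる"],
    ["犬", "は", "肉", "を", "食べる"],
    ["猫", "は", "寝る"],
]
# "は" appears three times, more than any other token, so n=1 removes it
print(drop_frequent(docs, 1))
```

How large n should be depends on the corpus; a fixed cutoff is only one possible choice.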

Word vector representation: measuring the similarity between words

word2vec

from gensim.models import word2vec
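To make the idea concrete, here is a hedged sketch of training a model and comparing word vectors. It assumes gensim 4.x, where the dimensionality parameter is named vector_size (older releases used size); the helper names train_word2vec and cosine_similarity are hypothetical. Cosine similarity is written out in plain Python to show how closeness between two vectors is measured.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_u * norm_v)


def train_word2vec(sentences, vector_size=100, window=5, min_count=1):
    """Train a Word2Vec model on tokenized sentences (lists of words).

    Assumes gensim 4.x is installed; imported lazily so that
    cosine_similarity can be used without gensim.
    """
    from gensim.models import word2vec

    return word2vec.Word2Vec(
        sentences, vector_size=vector_size, window=window, min_count=min_count
    )
```

With a trained model, model.wv["猫"] returns the vector for a word in the vocabulary, and model.wv.similarity("猫", "犬") returns the cosine similarity computed by gensim itself.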
