■ [Google Colaboratory] Preprocessing for Natural Language Processing & Morphological Analysis (janome)

Posted at 2020-05-20

1. Read the data using "with open"

Let's load "The Nose" ("鼻") by Ryunosuke Akutagawa (芥川龍之介) from Aozora Bunko (青空文庫).
The file's character encoding is shift_jis.

# Reading and writing (I/O) of text files in Python
with open('/hana.txt', mode='r', encoding='shift_jis') as f: 
  nose_hana = f.read()

print(nose_hana)
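
If it is not certain that a downloaded file really is shift_jis, a small fallback can help. The following is a minimal sketch and not part of the original notebook: the helper name `read_text_with_fallback` and the list of candidate encodings are assumptions made for illustration. It tries each encoding in turn and returns the first one that decodes without error. A small temporary file stands in for `/hana.txt` so the example runs on its own.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("shift_jis", "cp932", "utf-8")):
    """Return (text, encoding) for the first encoding that decodes the file.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    last_error = None
    for enc in encodings:
        try:
            with open(path, mode="r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error


# Demo: write a short shift_jis file, then read it back through the helper.
sample = "禅智内供の鼻と云えば、池の尾で知らない者はない。"
with tempfile.NamedTemporaryFile(
    "w", encoding="shift_jis", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

text, used_encoding = read_text_with_fallback(tmp_path)
os.remove(tmp_path)

print(used_encoding)
print(text)
```

In the notebook itself, the call would presumably look like `nose_hana, enc = read_text_with_fallback('/hana.txt')`.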

2. Preprocessing "Hana" (The Nose)

# Data preprocessing
import re

nose = re.sub('《[^》]+》', '', nose_hana)    # remove ruby (furigana) annotations
nose = re.sub('[|―  「」\n]', '', nose)      # remove |, ―, full-width spaces, 「」 brackets, and newlines
nose = re.sub('[ ]', '', nose)                # remove half-width spaces
nose = re.sub('[\u3000]', '', nose)           # remove full-width spaces (\u3000)

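sentence_end = '。'

nose_list = nose.split(sentence_end)                # split the text at each 。
nose_list.pop()                                     # drop the fragment left after the last 。
nose_list = [x + sentence_end for x in nose_list]   # put back the 。 that split() removed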

print(nose_list)
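
To see what each preprocessing step does, it can help to run the same idea on a couple of lines instead of the whole story. The sketch below is an illustration only: the sample string is hand-made in Aozora Bunko style (ruby in 《》, a full-width space as indent, a trailing colophon-like line), and the whitespace patterns are merged into one character class for brevity.

```python
import re

# Hand-made sample in Aozora Bunko style (an assumption for illustration).
sample = (
    "\u3000禅智内供《ぜんちないぐ》の鼻と云えば、知らない者はない。\n"
    "長さは五六寸あって、顋《あご》の下まで下っている。\n"
    "底本：サンプル"
)

cleaned = re.sub("《[^》]+》", "", sample)     # drop the ruby readings
cleaned = re.sub("[\u3000 \n]", "", cleaned)    # drop full-width/half-width spaces and newlines

sentence_end = "。"
sentences = cleaned.split(sentence_end)
sentences.pop()                                  # discard whatever follows the last "。"
sentences = [s + sentence_end for s in sentences]

print(sentences)
```

The `pop()` call matters here: `split` always leaves one last element after the final "。" (in this sample, the colophon-like line), and that element is not a sentence.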

3. Wakati-gaki (分かち書き, word segmentation)

from janome.tokenizer import Tokenizer

s = Tokenizer()

for sentence in nose_list:
  # list() materializes the tokens, since recent janome versions return a generator
  print(list(s.tokenize(sentence, wakati=True)))

4. Analyzing the wakati-gaki results

# collections.Counter can count how often each word appears
import collections

s = Tokenizer() # create a Tokenizer instance
words = []
for sentence in nose_list:
  words += s.tokenize(sentence, wakati=True)

c = collections.Counter(words)
print(c)
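
Printing the whole Counter is hard to read for a full story, and particles and punctuation tend to dominate the counts. As a minimal sketch of one possible next step, the example below filters out a small stop list and shows only the most frequent words with `most_common`. The token list is hand-written to stand in for the janome output, and the stop list is a hypothetical choice, not something taken from the original article.

```python
import collections

# Hand-written tokens standing in for janome's wakati output (assumption).
words = [
    "内供", "の", "鼻", "は", "長い", "。",
    "内供", "は", "鼻", "を", "気", "に", "し", "た", "。",
]

# Hypothetical stop list: punctuation and common particles/auxiliaries to ignore.
stop_words = {"。", "、", "の", "は", "を", "に", "し", "た"}

counter = collections.Counter(w for w in words if w not in stop_words)

# most_common(n) returns the n most frequent (word, count) pairs.
top = counter.most_common(2)
print(top)
```

In the notebook, the same filtering could presumably be applied to the `words` list built from `nose_list`.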

Reference

  1. Installing the morphological analysis tool (janome)