More than 5 years have passed since last update.

言語処理100本ノック　第4章: 形態素解析　30. 形態素解析結果の読み込み

Posted at 2020-04-29

実行環境
・MacOS
・AnacondaからのJupyter Notebook
・Python 3.7

30. 形態素解析結果の読み込み

形態素解析結果（neko.txt.mecab）を読み込むプログラムを実装せよ．ただし，各形態素は表層形（surface），基本形（base），品詞（pos），品詞細分類1（pos1）をキーとするマッピング型に格納し，1文を形態素（マッピング型）のリストとして表現せよ．第4章の残りの問題では，ここで作ったプログラムを活用せよ．

前回、以下の前処理したデータを使って続きの処理をする。

前回の処理

# ファイル読み込み
path = '/Users/shin/Documents/NLP/neko.txt'
with open(path) as f:
    l = f.readlines()
    print(l)

# Pythonで文字列のリスト（配列）の条件を満たす要素を抽出、置換
table = str.maketrans({
    '\n': '',
    '\u3000': '',
    ' ':'　'
})
l_replace = [s.translate(table) for s in l]

# 空のlist削除
l_replace = [s for s in l_replace if s != '']

ここからが今回の処理

# 言語処理100本ノック2020:　第四章(形態素解析)
# https://qiita.com/TsumaR/items/40ebf328deeb66e3ba86

from pyknp import Juman
jumanpp = Juman()
result_list = []
for s in l_replace:
    morpheme_list = []
    result = jumanpp.analysis(s)
    for mrph in result.mrph_list(): # 各形態素にアクセス
        component_dict = {
            "surface":mrph.midasi,
            "base":mrph.genkei,
            "pos":mrph.hinsi,
            "pos1":mrph.bunrui
        }
        morpheme_list.append(component_dict)
    result_list.append(morpheme_list)

結果

[[{'surface': '一', 'base': '一', 'pos': '名詞', 'pos1': '数詞'}],
 [{'surface': '吾輩', 'base': '吾輩', 'pos': '名詞', 'pos1': '普通名詞'},
  {'surface': 'は', 'base': 'は', 'pos': '助詞', 'pos1': '副助詞'},
  {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '普通名詞'},
  {'surface': 'である', 'base': 'だ', 'pos': '判定詞', 'pos1': '*'},
  {'surface': '。', 'base': '。', 'pos': '特殊', 'pos1': '句点'}],
~

ひとまず良さそう。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

言語処理100本ノック 第4章: 形態素解析 30. 形態素解析結果の読み込み

30. 形態素解析結果の読み込み

言語処理100本ノック　第4章: 形態素解析　30. 形態素解析結果の読み込み