
Trying document classification with BoW (GiNZA)

GiNZA

Last updated at 2020-04-17, posted at 2019-10-27

This post walks through the whole sequence of steps needed to classify documents with BoW (Bag of Words).

Prerequisites
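
As a quick refresher, a bag-of-words representation discards word order and keeps only how often each word occurs. The following is a minimal sketch, assuming the text is already split into tokens (Japanese text would first need a tokenizer such as GiNZA). The names `bag_of_words` and `vectorize` are illustrative only and do not come from any library.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs, ignoring order."""
    return Counter(tokens)

def vectorize(bow, vocabulary):
    """Turn a bag of words into a count vector over a fixed vocabulary."""
    return [bow.get(word, 0) for word in vocabulary]

# Whitespace tokenization is an assumption that only suits this English toy data
docs = ["the cat sat on the mat", "the dog sat"]
bows = [bag_of_words(doc.split()) for doc in docs]
vocabulary = sorted(set(word for bow in bows for word in bow))
vectors = [vectorize(bow, vocabulary) for bow in bows]
```

Each document becomes a vector of the same length, which is what a classifier can then be trained on.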

Hands-on

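Reference: はじめての自然言語処理 第4回 spaCy/GiNZA を用いた自然言語処理 (Introduction to Natural Language Processing, Part 4: NLP with spaCy/GiNZA)
Of the sections in that article, this post tries "5.3 bow による分類" (classification with bow).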

The steps are, in order: scraping → labeling → preprocessing → document classification.

Scraping

Training apparently requires tab-separated text files (TSV).
I prepare them with roughly the following script, but the scraping source itself is omitted here.

scrapeWiki.py

from bs4 import BeautifulSoup
import csv
import os.path
import requests
import time

def read_csv(path):
    # Read with the same tab delimiter that write_csv uses
    with open(path, 'r', newline='', encoding='utf-8') as read_file:
        reader = csv.reader(read_file, delimiter='\t')
        return list(reader)

def write_csv(path, data):
    with open(path, 'w', newline='', encoding='utf-8') as write_file:
        writer = csv.writer(write_file, delimiter='\t', quoting=csv.QUOTE_ALL)
        writer.writerows(data)

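# EN_LINE_PATH, JA_LINE_PATH, suffix_en, suffix_ja, get_titles(), scrape()
# and extract_lang() are omitted from this listing.
if __name__ == '__main__':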
    en_lines = [['text']] # set header
    ja_lines = [['text']] # set header

    if os.path.exists(EN_LINE_PATH):
        en_lines = read_csv(EN_LINE_PATH)
    
    if os.path.exists(JA_LINE_PATH):
        ja_lines = read_csv(JA_LINE_PATH)
    
    crawl_titles = get_titles()

    for title in crawl_titles:
        print(title + " Ready")
        time.sleep(10)
        quotes = scrape(title)
        
        # dict to list
        en_lines += extract_lang(quotes, suffix_en)
        ja_lines += extract_lang(quotes, suffix_ja)

        write_csv(EN_LINE_PATH, en_lines)
        write_csv(JA_LINE_PATH, ja_lines)

        interval = 290
        print(title + " Done. Next request'll be sent after " + str(interval) + " s.")
        time.sleep(interval)
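
Since the script reloads the files it wrote on a previous run, the reader and the writer need to agree on the delimiter and quoting. Below is a small round-trip sketch using only the standard library. The helper names `write_tsv` and `read_tsv` and the sample rows are assumptions made for illustration.

```python
import csv
import os
import tempfile

def write_tsv(path, rows):
    """Write rows as tab-separated values, quoting every field."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t", quoting=csv.QUOTE_ALL).writerows(rows)

def read_tsv(path):
    """Read rows back using the same tab delimiter."""
    with open(path, "r", newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))

# Sample data, including a field that itself contains a tab
rows = [["text"], ["Hello, world"], ["こんにちは\t世界"]]
path = os.path.join(tempfile.mkdtemp(), "lines.tsv")
write_tsv(path, rows)
restored = read_tsv(path)
```

Quoting every field keeps text that contains tabs or commas intact across the round trip.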

Labeling

doccano looks like a good tool for this.
I did not know about it at the time, so I based my script on the code referenced below instead...

Reference: Creating a Custom Classifier for Text Cleaning

labelManually.py

import numpy as np
import pandas as pd

def manually_label(input_file):
    if input_file.endswith(".csv"):
        df = pd.read_csv(input_file, delimiter='\t')
    elif input_file.endswith(".pickle"):
        df = pd.read_pickle(input_file)
    else:
        raise ValueError("Improper argument")

    # Add label column 
    if 'label' not in df:
        df.insert(loc=0, column='label', value=np.nan)

    for index, row in df.iterrows():
        print("index: " + str(index))

        if pd.isnull(row.text):
            # Nothing to label
            print("pass: " + str(row.text))
            df.drop(index, inplace=True)
            df.to_csv(OUTPUT_FILE, sep='\t', index=False)
            continue

        if pd.isnull(row.label) or not str(row.label).strip():
            # Label not found
            print("raw: " + row.text)
            
            label = input()
            df.loc[index, 'label'] = label
            df.to_csv(OUTPUT_FILE, sep='\t', index=False)
        else:
            # Label found
            print(str(row.label) + ": " + row.text)
            
    print('No more labels to classify!')

if __name__ == '__main__':
    manually_label(OUTPUT_FILE)
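
The core of the labeling loop (skip rows with no text, prompt only for rows that have no label yet) can also be sketched without pandas. This is an illustrative version, not the referenced article's code: `label_rows`, `is_blank` and the injectable `ask` callback are hypothetical names, and rows are assumed to be plain dicts with `text` and `label` keys.

```python
def is_blank(value):
    """True for None, empty strings, and whitespace-only values."""
    return value is None or str(value).strip() == ""

def label_rows(rows, ask=input):
    """Return rows that have text, asking for a label where one is missing."""
    labeled = []
    for row in rows:
        if is_blank(row.get("text")):
            continue  # nothing to label, drop the row
        if is_blank(row.get("label")):
            # Copy the row so the caller's data is not modified in place
            row = {**row, "label": ask("raw: " + row["text"] + "\nlabel> ")}
        labeled.append(row)
    return labeled
```

Passing `ask` as a parameter makes it possible to test the loop with canned answers instead of typing into `input()`.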

Preprocessing and splitting the training data

Why preprocess

自然言語処理における前処理の種類とその威力 (Types of preprocessing in natural language processing and their power)

Preprocessing examples
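
As a rough illustration of this kind of preprocessing, here is a sketch that uses only the standard library. It substitutes `unicodedata.normalize("NFKC", ...)` for neologdn, which is an assumption made to keep the example dependency-free. neologdn performs additional Japanese-specific normalization, so the two are not equivalent. The function name `normalize_text` is hypothetical.

```python
import re
import unicodedata

def normalize_text(text):
    # NFKC folds full-width digits and punctuation into their ASCII forms
    text = unicodedata.normalize("NFKC", text)
    # Remove a comma or period sitting between digits: "1,234" -> "1234"
    text = re.sub(r"(\d)[,.](\d+)", r"\1\2", text)
    # Replace the remaining ASCII punctuation with a space
    text = re.sub(r"[!-/:-@\[-`{-~]", " ", text)
    return text

result = normalize_text("価格は１，２３４円！")
```

Here the full-width digits and comma are folded to ASCII, the separator inside the number is dropped, and the trailing exclamation mark becomes a space.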

Splitting the data

How to split data into 3 sets (train, validation and test)?
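
The idea behind that answer is to shuffle once and then cut the shuffled sequence at two points. A minimal standard-library sketch of the same logic, operating on a plain list rather than a DataFrame (an assumption made to keep it self-contained):

```python
import random

def train_validate_test_split(items, train_percent=0.6, validate_percent=0.2, seed=None):
    """Shuffle items and cut them into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train_end = int(train_percent * len(shuffled))
    validate_end = train_end + int(validate_percent * len(shuffled))
    return shuffled[:train_end], shuffled[train_end:validate_end], shuffled[validate_end:]

train, validate, test = train_validate_test_split(range(10), seed=0)
```

Whatever is left after the training and validation slices goes to the test set, so the three parts always cover the data exactly once.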

(The labels produced by the previous step are numeric, so they are replaced with strings.)

Replacing column values in a pandas DataFrame

normalizeAndSplit.py

import numpy as np
import pandas as pd
import neologdn
import re
import sys

LABEL_NAME = {
    1: "category name1",
    2: "category name2",
}

# Reference: https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test
def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test

def main():
    df = pd.read_csv(INPUT_FILE, delimiter='\t')
    # Reference: https://stackoverflow.com/questions/23307301/replacing-column-values-in-a-pandas-dataframe
    df['label'] = df['label'].map(LABEL_NAME)

    for index, row in df.iterrows():
        normalized = neologdn.normalize(row.text, repeat=1)
        # Remove a comma or period between digits
        normalized = re.sub(r'(\d)([,.])(\d+)', r'\1\3', normalized)
        # Replace ASCII punctuation with a space
        normalized = re.sub(r'[!-/:-@[-`{-~]', r' ', normalized)
        # Store the normalized text back into the DataFrame
        df.at[index, 'text'] = normalized

    train, validate, test = train_validate_test_split(df)
    train.to_csv("train.tsv", sep='\t', index=False)
    validate.to_csv("dev.tsv", sep='\t', index=False)
    test.to_csv("test.tsv", sep='\t', index=False)

if __name__ == "__main__":
    main()

Document classification

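When running on Google Colaboratory, pkg_resources apparently has to be reloaded with the code below,
but for some reason that does not solve the problem in my environment, so I restart the runtime instead.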

# importlib.reload replaces the deprecated imp.reload
import importlib
import pkg_resources
importlib.reload(pkg_resources)

I start training by following the article below.
My notebook is no different from the reference source, but I am posting it as well.

Reference: はじめての自然言語処理 第4回 spaCy/GiNZA を用いた自然言語処理

Stumbling block 1

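When I wrote the trained model to a file with to_disk,
it returned the error message "Maximum recursion level reached".
The cause was that the label values were of type 'numpy.int64', but the error message makes that hard to figure out.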
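
One way to avoid this class of problem is to coerce labels to plain Python strings before they reach the model. The sketch below assumes the training examples use spaCy's `{"cats": {label: bool}}` annotation format for text classification. `to_cats` is a hypothetical helper written for this post, not part of spaCy.

```python
def to_cats(label, all_labels):
    """Build a cats dict with plain-str keys, whatever type the label had."""
    target = str(label)
    return {"cats": {str(name): str(name) == target for name in all_labels}}

# str() is applied first, so int, str, and numpy.int64 labels are treated alike
example = ("ありがとうございます", to_cats(1, [1, 2]))
```

Because every key passes through `str()`, no NumPy scalar ends up in the label set that gets serialized.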

Stumbling block 2

I did not know how to load the model.

import spacy
# Load the model that was saved with to_disk('/content/drive/My Drive')
nlp = spacy.load('/content/drive/My Drive')
doc = nlp("ありがとうございます")
doc.cats
max(doc.cats, key=doc.cats.get)
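
The last line picks the label with the highest score from the `doc.cats` dictionary. The same idea can be wrapped in a small helper that also rejects low-confidence predictions. In this sketch `top_category` and its `threshold` parameter are hypothetical, and the scores are made up as a stand-in for a real `doc.cats`.

```python
def top_category(cats, threshold=0.0):
    """Return (label, score) for the best-scoring category, or None if below threshold."""
    if not cats:
        return None
    label = max(cats, key=cats.get)
    score = cats[label]
    return (label, score) if score >= threshold else None

# Stand-in for doc.cats; these scores are invented for illustration
cats = {"category name1": 0.91, "category name2": 0.09}
best = top_category(cats)
```

A threshold is useful when an input may belong to none of the trained categories.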

Bonus

If you want to try multi-label classification, the fastText tutorial looks like a good place to start.
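
fastText's supervised mode reads one example per line, with each label prefixed by `__label__`, and several labels may appear on the same line. A small sketch of producing such lines follows. The helper name `to_fasttext_line` and the sample text are assumptions, and labels are assumed to contain no whitespace.

```python
def to_fasttext_line(text, labels):
    """Format one example as '__label__a __label__b text' on a single line."""
    prefix = " ".join(f"__label__{label}" for label in labels)
    single_line = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"{prefix} {single_line}"

line = to_fasttext_line("How do I bake\nbread?", ["baking", "bread"])
```

Collapsing whitespace matters because a newline inside the text would otherwise split one example into two.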
