100 Language Processing Knocks (言語処理100本ノック) No. 74, using scikit-learn: Prediction

Posted at 2019-12-24

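This is my record of knock 74, "Prediction", from 100 Language Processing Knocks 2015.
Using the model trained in the previous post, I predict (infer) the polarity (negative or positive) of each sentence and also output the prediction probability.
Until now my answers were essentially the same as those in "素人の言語処理100本ノック" (An Amateur's 100 Language Processing Knocks), so I did not post them to this blog. For "Chapter 8: Machine Learning", however, I have put in serious time and changed the answers to a fair degree, so I am posting them. I mainly use scikit-learn.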

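Reference links

Link	Notes
074.予測.ipynb	GitHub link to the answer program
素人の言語処理100本ノック:74 (An Amateur's 100 Language Processing Knocks: 74)	I always rely on this series when working through the knocks
言語処理100本ノックでPython入門 #74 - 機械学習、scikit-learnでロジスティック回帰の予測 (Introduction to Python with 100 Language Processing Knocks #74: machine learning, logistic regression prediction with scikit-learn)	Knock results using scikit-learn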

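Environment

Type	Version	Notes
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes work with several Python environments
Python	3.6.9	Python 3.6.9 running under pyenv. There is no deep reason for not using 3.7 or 3.8. Packages are managed with venv

In the environment above I use the following additional Python packages. They are simply installed with plain pip.

Type	Version
numpy	1.17.4
pandas	0.25.3
scikit-learn	0.21.3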

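Task

Chapter 8: Machine Learning

In this chapter, we work on the task of classifying sentences as positive or negative (polarity analysis), using the sentence polarity dataset v1.0 from the Movie Review Data published by Bo Pang and Lillian Lee.

74. Prediction

Using the logistic regression model trained in 73, implement a program that computes the polarity label of a given sentence ("+1" for a positive example, "-1" for a negative example) and its prediction probability.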

Answer

Answer program: 074.予測.ipynb

This is essentially the previous post's "Answer program (analysis version) 073_2.学習(訓練).ipynb" with a prediction step added.

import csv

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# Wrapper class so the word vectorization method can be tuned by GridSearchCV
class myVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, method='tfidf', min_df=0.0005, max_df=0.10):
        self.method = method
        self.min_df = min_df
        self.max_df = max_df

    def fit(self, x, y=None):
        if self.method == 'tfidf':
            self.vectorizer = TfidfVectorizer(min_df=self.min_df, max_df=self.max_df)
        else:
            self.vectorizer = CountVectorizer(min_df=self.min_df, max_df=self.max_df)
        self.vectorizer.fit(x)
        return self

    def transform(self, x, y=None):
        return self.vectorizer.transform(x)

# Parameters for GridSearchCV
PARAMETERS = [
    {
        'vectorizer__method':['tfidf', 'count'], 
        'vectorizer__min_df': [0.0003, 0.0004], 
        'vectorizer__max_df': [0.07, 0.10], 
        'classifier__C': [1, 3],    # also tried 10, but it was only slower and scored lower
        'classifier__solver': ['newton-cg', 'liblinear']},
    ]

# Read one column from the tab-separated file
def read_csv_column(col):
    with open('./sentiment_stem.txt') as file:
        reader = csv.reader(file, delimiter='\t')
        next(reader)  # skip the header row
        return [row[col] for row in reader]

x_all = read_csv_column(1)
y_all = read_csv_column(0)
x_train, x_test, y_train, y_test = train_test_split(x_all, y_all)

def train(x_train, y_train, file):
    pipeline = Pipeline([('vectorizer', myVectorizer()), ('classifier', LogisticRegression())])
    
    # clf stands for classification
    clf = GridSearchCV(
            pipeline,             # estimator to tune
            PARAMETERS,           # parameter sets to search
            cv = 5)               # number of cross-validation folds
    
    clf.fit(x_train, y_train)
    pd.DataFrame.from_dict(clf.cv_results_).to_csv(file)

    print('Grid Search Best parameters:', clf.best_params_)
    print('Grid Search Best validation score:', clf.best_score_)
    print('Grid Search Best training score:', clf.best_estimator_.score(x_train, y_train))    

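    return clf.best_estimator_

def validate(estimator, x_test, y_test):
    # Print the predicted probabilities for the first 30 test sentences.
    # This assumes the labels are the strings '0' (negative) and '1' (positive),
    # so the index of the larger probability, as a string, is the predicted label.
    for i, (x, y) in enumerate(zip(x_test, y_test)):
        y_pred = estimator.predict_proba([x])
        if y == np.argmax(y_pred).astype(str):
            if y == '1':
                result = 'TP: actual Positive, predicted Positive'
            else:
                result = 'TN: actual Negative, predicted Negative'
        else:
            if y == '1':
                result = 'FN: actual Positive, predicted Negative'
            else:
                result = 'FP: actual Negative, predicted Positive'
        print(result, y_pred, x)
        if i == 29:
            break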

estimator = train(x_train, y_train, 'gs_result.csv')
validate(estimator, x_test, y_test)
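
The keys in PARAMETERS follow scikit-learn's `<step name>__<parameter>` naming, which is how GridSearchCV reaches the parameters of each pipeline step. The following is a minimal sketch of that mechanism only; it swaps in a plain TfidfVectorizer for myVectorizer, and the parameter values are arbitrary illustrations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Same step names as in the post; TfidfVectorizer is a stand-in for myVectorizer
pipeline = Pipeline([('vectorizer', TfidfVectorizer()),
                     ('classifier', LogisticRegression())])

# 'vectorizer__min_df' means: the min_df parameter of the step named 'vectorizer'
pipeline.set_params(vectorizer__min_df=2, classifier__C=3)

min_df = pipeline.named_steps['vectorizer'].min_df
C = pipeline.named_steps['classifier'].C
print(min_df, C)
```

GridSearchCV performs the same kind of assignment for every combination in the grid, which is why a custom step such as myVectorizer has to expose its settings as `__init__` arguments.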

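Explanation of the answer

Data split

The train_test_split function splits the data into training data and test data. Accuracy is naturally higher on the data the model was trained on, so I set aside data that is not used for training and use it for prediction.
I studied this in my earlier post "Coursera機械学習入門コース(6週目 - 様々なアドバイス)" (Coursera Introductory Machine Learning Course, Week 6: Various Advice).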

x_train, x_test, y_train, y_test = train_test_split(x_all, y_all)
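
As a rough illustration of what this call does, the sketch below splits a toy list. The data is hypothetical, and `random_state` and `stratify` are optional arguments that the post does not use: the first makes the split reproducible and the second keeps the label ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for x_all / y_all (hypothetical, not the real dataset)
x_all = ['sentence {}'.format(i) for i in range(20)]
y_all = ['1' if i % 2 == 0 else '0' for i in range(20)]

# With no size arguments, 25% of the samples go to the test set
x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all, random_state=0, stratify=y_all)

print(len(x_train), len(x_test))  # 15 5
```

Without `random_state`, as in the post, the split differs on every run, so the scores and the sentences printed will also differ from run to run.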

Prediction

Prediction uses the predict_proba function.
There is a similar function, predict, but it returns only the result (0 or 1) and not the probabilities.
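
The difference between the two functions can be sketched on a tiny made-up training set. The sentences and the plain CountVectorizer pipeline here are assumptions for illustration, not the model from the post.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny hypothetical training set; labels are strings, as in the post
texts = ['good great fun', 'great act good', 'bad dull bore', 'dull bad mess']
labels = ['1', '1', '0', '0']

model = Pipeline([('vectorizer', CountVectorizer()),
                  ('classifier', LogisticRegression(solver='liblinear'))])
model.fit(texts, labels)

sample = ['good fun']
proba = model.predict_proba(sample)  # shape (1, 2): one probability per class
label = model.predict(sample)        # shape (1,): the label only

# The columns of predict_proba follow the order of classes_
classes = model.named_steps['classifier'].classes_
label_from_proba = classes[np.argmax(proba, axis=1)]
print(classes, proba, label, label_from_proba)
```

Looking the label up through `classes_` avoids relying on the column index happening to match the label text.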

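def validate(estimator, x_test, y_test):
    # Print the predicted probabilities for the first 30 test sentences.
    # This assumes the labels are the strings '0' (negative) and '1' (positive),
    # so the index of the larger probability, as a string, is the predicted label.
    for i, (x, y) in enumerate(zip(x_test, y_test)):
        y_pred = estimator.predict_proba([x])
        if y == np.argmax(y_pred).astype(str):
            if y == '1':
                result = 'TP: actual Positive, predicted Positive'
            else:
                result = 'TN: actual Negative, predicted Negative'
        else:
            if y == '1':
                result = 'FN: actual Positive, predicted Negative'
            else:
                result = 'FP: actual Negative, predicted Positive'
        print(result, y_pred, x)
        if i == 29:
            break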

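The 30 lines of output look like this. Each line shows the result, the probabilities in the order [negative, positive], and the stemmed sentence.
For TP, TN, FP, and FN, see the article "【入門者向け】機械学習の分類問題評価指標解説(正解率・適合率・再現率など)" ([For beginners] Evaluation metrics for machine learning classification problems explained: accuracy, precision, recall, etc.).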

TN: actual Negative, predicted Negative [[0.7839262 0.2160738]] restrain freak show mercenari obviou cerebr dull pretenti engag isl defi easi categor
FN: actual Positive, predicted Negative [[0.6469949 0.3530051]] chronicl man quest presid man singl handedli turn plane full hard bitten cynic journalist essenti campaign end extend public depart
TN: actual Negative, predicted Negative [[0.87843253 0.12156747]] insuffer movi mean make think existenti suffer instead put sleep
TN: actual Negative, predicted Negative [[0.90800564 0.09199436]] minut condens episod tv seri pitfal expect
TP: actual Positive, predicted Positive [[0.12240474 0.87759526]] absorb unsettl psycholog drama
TP: actual Positive, predicted Positive [[0.42977787 0.57022213]] rodriguez chop smart aleck film school brat imagin big kid
FN: actual Positive, predicted Negative [[0.59805784 0.40194216]] gangster movi capac surpris
TP: actual Positive, predicted Positive [[0.29473058 0.70526942]] confront stanc todd solondz take aim polit correct suburban famili
TP: actual Positive, predicted Positive [[0.21660554 0.78339446]] except act quietli affect cop drama
TP: actual Positive, predicted Positive [[0.47919199 0.52080801]] steer unexpectedli adam streak warm blood empathi dispar manhattan denizen especi hole
TN: actual Negative, predicted Negative [[0.67294895 0.32705105]] standard gun martial art clich littl new add
TN: actual Negative, predicted Negative [[0.66582407 0.33417593]] sweet gentl jesu screenwrit cut past everi bad action movi line histori
TP: actual Positive, predicted Positive [[0.41463847 0.58536153]] malcolm mcdowel cool paul bettani cool paul bettani play malcolm mcdowel cool
TP: actual Positive, predicted Positive [[0.33183064 0.66816936]] center humor constant ensembl give buoyant deliveri
TN: actual Negative, predicted Negative [[0.63371373 0.36628627]] let subtitl fool movi prove holli wood longer monopoli mindless action
TP: actual Positive, predicted Positive [[0.25740295 0.74259705]] taiwanes auteur tsai ming liang good news fall sweet melancholi spell uniqu director previou film
FN: actual Positive, predicted Negative [[0.57810652 0.42189348]] turntabl outsel electr guitar
FN: actual Positive, predicted Negative [[0.52506635 0.47493365]] movi stay afloat thank hallucinatori product design
TN: actual Negative, predicted Negative [[0.57268778 0.42731222]] non-mysteri mysteri
TP: actual Positive, predicted Positive [[0.07663805 0.92336195]] beauti piec count heart import humor
TN: actual Negative, predicted Negative [[0.86860199 0.13139801]] toothless dog alreadi cabl lose bite big screen
FP: actual Negative, predicted Positive [[0.4918716 0.5081284]] sandra bullock hugh grant make great team predict romant comedi get pink slip
TN: actual Negative, predicted Negative [[0.61861307 0.38138693]] movi comedi work better ambit say subject willing
FP: actual Negative, predicted Positive [[0.47041114 0.52958886]] like lead actor lot manag squeez laugh materi tread water best forgett effort
TP: actual Positive, predicted Positive [[0.26767592 0.73232408]] writer director juan carlo fresnadillo make featur debut fulli form remark assur
FP: actual Negative, predicted Positive [[0.40931838 0.59068162]] grand fart come director begin resembl crazi french grandfath
FP: actual Negative, predicted Positive [[0.43081731 0.56918269]] perform sustain intellig stanford anoth subtl humour bebe neuwirth older woman seduc oscar film founder lack empathi social milieu rich new york intelligentsia
TP: actual Positive, predicted Positive [[0.29555115 0.70444885]] perform uniformli good
TP: actual Positive, predicted Positive [[0.34561148 0.65438852]] droll well act charact drive comedi unexpect deposit feel
TP: actual Positive, predicted Positive [[0.31537766 0.68462234]] great participatori spectat sport
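
As a small worked example, the four outcome types in the 30 lines above can be tallied by hand (13 TP, 9 TN, 4 FP, 4 FN) and turned into the usual metrics. This covers only those 30 samples, so it is a rough illustration and not an evaluation of the model.

```python
# Counts tallied by hand from the 30 result lines above
tp, tn, fp, fn = 13, 9, 4, 4
total = tp + tn + fp + fn

accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted positives that are correct
recall = tp / (tp + fn)        # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print('accuracy  {:.3f}'.format(accuracy))   # 0.733
print('precision {:.3f}'.format(precision))  # 0.765
print('recall    {:.3f}'.format(recall))     # 0.765
print('f1        {:.3f}'.format(f1))         # 0.765
```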
