[Java beginner] Machine learning with Weka 06: Turning a natural-language CSV into an ARFF processed with StringToWordVector

Posted at 2024-02-25

Doing machine learning with text you collected yourself

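The procedure for machine learning on natural-language data obtained by scraping or extracted from Excel is as follows.

Create a CSV with two fields: テキスト (the morphologically analyzed text) and カテゴリー (the category, for example spam or ham)
Convert it to ARFF with Weka. At this point, declare the テキスト field as a string attribute with setStringAttributes
Turn the text into a word vector with the StringToWordVector filter
After this processing the カテゴリー field ends up first, so use Reorder to move it to the very end (last)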
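
The first step can be sketched in plain Java as follows. This is only a minimal illustration: the names CsvRowSketch and toCsvRow are made up, the tokens are assumed to have already come out of a morphological analyzer (the article does not say which one), and dropping embedded double quotes is an assumption made to keep the row easy to parse.

```java
import java.util.List;

class CsvRowSketch {
    // Joins already-tokenized words with single spaces and wraps the text in double quotes.
    // Assumption: embedded double quotes are simply dropped rather than escaped.
    static String toCsvRow(List<String> tokens, String category) {
        String text = String.join(" ", tokens).replace("\"", "");
        return "\"" + text + "\"," + category;
    }

    public static void main(String[] args) {
        // Header row with the two field names used in this article
        System.out.println("テキスト,カテゴリー");
        System.out.println(toCsvRow(List.of("異常", "は", "検出", "さ", "れ", "まし", "た"), "spam"));
    }
}
```

Quoting the text field keeps any comma inside the text from being read as a field separator.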

Sample text data

I put this together by hand, so it is rough. For real training you would need more data and some preprocessing, such as removing particles.

テキスト,カテゴリー
"部下 を 成長 さ せ 、 結果 を 出す リーダー に なる 為 の  ミーティングを 学ぶ 講座 の ご 案内 ",ham
"本 メール と 行き 違い に お 申込み を いただい て いる 場合 は 失礼 を お詫び 申し上げ ます",ham
"キャッシュレス決済 の 裏 に 潜む 脅威",ham
"自分 を 高め て くれる 、 なり たい 自分 へ と 導い て くれる",ham
"セキュリティ 運用 の 最適 解 も 伝授",ham
"お 支払い 方法 の 情報 を 更新 し て ください",spam
"異常 は 検出 さ れ まし た",spam
"ご 請求 金額 確定 の ご 案内 平素 は カード を ご 利用 いただき 、 誠に ありがとう ござい ます",spam
"残念 ながら 、 アカウント を 更新 でき ませ ん でし た",spam
"あなた の アカウント の 状態 が 異常 で ある こと を 発見 し まし た",spam
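
As one possible way to act on the note about removing particles, the following sketch drops a small hand-picked set of tokens from a space-separated line before it goes into the CSV. The stop list is an assumption chosen from the sample rows, not a complete list of Japanese particles, and ParticleFilterSketch and removeParticles are illustrative names.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

class ParticleFilterSketch {
    // Hypothetical stop list: a few particles and one punctuation mark seen in the sample rows.
    static final Set<String> STOP = Set.of("を", "の", "は", "に", "が", "と", "へ", "も", "て", "、");

    // Splits on ASCII whitespace, drops stop tokens, and joins the rest with single spaces.
    static String removeParticles(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(token -> !token.isEmpty() && !STOP.contains(token))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(removeParticles("キャッシュレス決済 の 裏 に 潜む 脅威"));
        // prints: キャッシュレス決済 裏 潜む 脅威
    }
}
```

StringToWordVector also has a stopwords handler setting (the relation name in the output later in this article shows it left at Null), which may be an alternative to filtering in advance.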

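Processing step by step (written in Java)

weka.jar must be on the classpath (I use mini-weka).

There is probably a way to apply several filters at once, but applying them one at a time makes the steps easier to follow.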

package sample;

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;
import weka.filters.unsupervised.attribute.Reorder;

public class string2wordvecTest {

    public static void main(String[] args) throws Exception {
        // Input CSV data
        String inputFile = "sample.csv";
        // Load the CSV so it can be converted to ARFF
        CSVLoader loader = new CSVLoader();
        loader.setFile(new File(inputFile));
        // Treat the first field as a string attribute (StringToWordVector will not process it otherwise)
        loader.setStringAttributes("1");
        Instances data = loader.getDataSet();
        // Specify that the class to predict is the last field
        data.setClassIndex(data.numAttributes() - 1);
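        // Convert the text to a word vector
        StringToWordVector filter = new StringToWordVector();
        filter.setTFTransform(true);  // use the TF transform
        filter.setIDFTransform(true); // use the IDF transform
        // Set the options first; setInputFormat should be the last call before the filter is used
        filter.setInputFormat(data);
        // The following line produces the word vectors
        Instances filteredData = Filter.useFilter(data, filter);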
        
        // Move the first field to the very end (this makes later processing more straightforward)
        Reorder reorder = new Reorder();
        String[] options = new String[2];
        options[0] = "-R";
        options[1] = "2-last,first"; // attributes 2 through last, followed by the first attribute
        reorder.setOptions(options);
        reorder.setInputFormat(filteredData);
        Instances reordered = Filter.useFilter(filteredData, reorder);
        // Write the ARFF text as UTF-8 (Files.writeString uses UTF-8 by default)
        Files.writeString(Path.of("data.arff"), reordered.toString());
    }
}
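
The range string "2-last,first" is easier to picture on a plain list. The following stand-alone illustration does not call Weka. It only mimics the resulting column order, and the names ReorderSketch and moveFirstToLast are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class ReorderSketch {
    // Mimics the column order produced by Reorder with "-R 2-last,first":
    // columns 2..last keep their order, then the old first column is appended.
    static <T> List<T> moveFirstToLast(List<T> columns) {
        if (columns.size() < 2) {
            return new ArrayList<>(columns);
        }
        List<T> result = new ArrayList<>(columns.subList(1, columns.size()));
        result.add(columns.get(0));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(moveFirstToLast(List.of("カテゴリー", "、", "いただい", "金額")));
        // prints: [、, いただい, 金額, カテゴリー]
    }
}
```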

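Result

The contents of data.arff are as follows. The rows are written in sparse format, so zero values are omitted. ham is the first label of カテゴリー and is stored as 0, which is why the ham rows show no class value.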

@relation 'sample-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-T-I-N0-stemmerweka.core.stemmers.NullStemmer-stopwords-handlerweka.core.stopwords.Null-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"-weka.filters.unsupervised.attribute.Reorder-R2-last,first'

@attribute 、 numeric
@attribute いただい numeric
@attribute いる numeric
@attribute お numeric
@attribute お詫び numeric
@attribute くれる numeric
@attribute ご numeric
@attribute さ numeric
@attribute せ numeric
@attribute たい numeric
@attribute て numeric
@attribute と numeric
@attribute なり numeric
@attribute なる numeric
@attribute に numeric
@attribute の numeric
@attribute は numeric
@attribute へ numeric
@attribute ます numeric
@attribute も numeric
@attribute を numeric
@attribute キャッシュレス決済 numeric
@attribute セキュリティ numeric
(omitted)
@attribute 請求 numeric
@attribute 金額 numeric
@attribute カテゴリー {ham,spam}

@data
{0 0.635124,6 1.115577,7 1.115577,8 1.59603,13 1.59603,14 0.83453,15 0.354077,20 0.247228,23 1.59603,25 1.59603,27 1.59603,30 1.59603,32 1.59603,35 1.115577,37 1.59603,40 1.59603,46 1.59603,49 1.59603}
{1 1.59603,2 1.59603,3 1.115577,4 1.59603,10 0.83453,11 1.115577,14 0.83453,16 0.83453,18 1.115577,20 0.247228,24 1.59603,28 1.59603,29 1.59603,34 1.59603,38 1.59603,39 1.59603,43 1.59603,48 1.59603}
{14 0.83453,15 0.354077,21 1.59603,36 1.59603,41 1.59603,44 1.59603}
{0 0.635124,5 1.59603,9 1.59603,10 0.83453,11 1.115577,12 1.59603,17 1.59603,20 0.247228,31 1.59603,42 1.59603,50 1.59603}
{15 0.354077,19 1.59603,22 1.59603,26 1.59603,33 1.59603,45 1.59603,47 1.59603}
{3 1.115577,10 0.83453,15 0.354077,20 0.247228,56 1.59603,59 1.115577,73 1.59603,74 1.59603,75 1.59603,76 1.115577,86 spam}
{7 1.115577,16 0.83453,60 0.83453,65 1.115577,67 1.59603,77 1.59603,80 1.115577,86 spam}
{0 0.635124,6 1.115577,15 0.354077,16 0.83453,18 1.115577,20 0.247228,35 1.115577,52 1.59603,54 1.59603,58 1.59603,70 1.59603,71 1.59603,72 1.59603,82 1.59603,83 1.59603,84 1.59603,85 1.59603,86 spam}
{0 0.635124,20 0.247228,60 0.83453,62 1.59603,63 1.59603,64 1.59603,66 1.59603,68 1.59603,69 1.115577,76 1.115577,78 1.59603,86 spam}
{15 0.354077,20 0.247228,51 1.59603,53 1.59603,55 1.59603,57 1.59603,59 1.115577,60 0.83453,61 1.59603,65 1.115577,69 1.115577,79 1.59603,80 1.115577,81 1.59603,86 spam}
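
The weights in the @data rows can be checked by hand. The sketch below assumes the transforms described in the StringToWordVector documentation, log(1 + f) for TF and f * log(N / df) for IDF with natural logarithms. It also assumes f is 1 for any word that is present, since the code in this article does not enable word counts. Under those assumptions it reproduces the values in the listing for the N = 10 sample rows. TfIdfSketch and weight are illustrative names.

```java
class TfIdfSketch {
    // f: term value in the document (1 when only presence is recorded)
    // numDocs: total number of documents; docFreq: number of documents containing the word
    static double weight(double f, int numDocs, int docFreq) {
        double tf = Math.log(1.0 + f);                     // TF transform
        return tf * Math.log((double) numDocs / docFreq);  // IDF transform applied to the TF value
    }

    public static void main(String[] args) {
        int n = 10; // five ham rows plus five spam rows
        System.out.printf("df=1: %.6f%n", weight(1, n, 1)); // compare with 1.59603 in the listing
        System.out.printf("df=2: %.6f%n", weight(1, n, 2)); // compare with 1.115577
        System.out.printf("df=7: %.6f%n", weight(1, n, 7)); // compare with 0.247228 (attribute 20, を)
    }
}
```

を occurs in 7 of the 10 rows, hence its small weight. Because only presence is recorded, the two occurrences of を in the first row do not raise its value. StringToWordVector has a setOutputWordCounts option if actual counts are wanted.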

After that, all that remains is to build a classification model and train it.
