[Preprocessing] Building a Keras model (MLP) that classifies news into topics with the Reuters dataset (TensorFlow 2.x)

Last updated at 2020-01-04. Posted at 2020-01-03.

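Overview

I tried text classification with Keras and wrote up the results in this article.
The dataset is the Reuters newswire dataset bundled with TensorFlow (English text data).

I had worked through this once before, following a blog post titled "Keras MLPの文章カテゴリー分類を理解する" (Understanding document category classification with a Keras MLP).
This time I worked hands-on while consulting the documentation, and I am writing this article to deepen my understanding.
The model I built is a very simple MLP.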

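The write-up grew long, so I split it into two articles:

Covered in this article
- The dataset
- Preprocessing
Covered in the next article
- Training the model
- Evaluating the model's performance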

Environment

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.14.6
BuildVersion:	18G103
$ python -V  # using a virtual environment created with the venv module
Python 3.7.3
$ pip list  # excerpt of the main packages
ipython              7.11.0
matplotlib           3.1.2
numpy                1.18.0
pip                  19.3.1
scikit-learn         0.22.1
scipy                1.4.1
tensorflow           2.0.0

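Dataset

Loading

The data can be loaded with tensorflow.keras.datasets.reuters.load_data (documentation).
Because the test_split argument defaults to 0.2, the data is split into 80% for training and 20% for testing.
Note: the data is downloaded the first time this runs.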

In [1]: from tensorflow.keras.datasets import reuters

In [2]: (x_train, y_train), (x_test, y_test) = reuters.load_data()

In [3]: len(y_train), len(y_test)
Out[3]: (8982, 2246)  # 11228 items in total
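
The two sizes follow from the default test_split=0.2. The sketch below reproduces the arithmetic, assuming the split point is obtained by truncating total * (1 - test_split) to an integer. The helper name split_sizes is hypothetical and is not part of Keras.

```python
def split_sizes(total, test_split=0.2):
    """Return (train, test) sizes for a split at int(total * (1 - test_split))."""
    n_train = int(total * (1 - test_split))
    return n_train, total - n_train


print(split_sizes(11228))  # (8982, 2246)
```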

Looking at the labels

The labels apparently represent the topic of each news item.
Let's look at one label.

In [4]: y_train[1000]
Out[4]: 19

The labels are integers (note: I was not able to find out which topic each number stands for).

Next, check how many distinct labels appear across the training and test data combined.
Convert y_train and y_test (both numpy.ndarray) to lists and pass them to collections.Counter (documentation).

In [5]: from collections import Counter
In [8]: counter = Counter(list(y_train) + list(y_test))

In [9]: len(counter)
Out[9]: 46

There are 46 topics in total.

Now check how many items each topic has.

In [10]: for i in range(46):
    ...:     print(f'{i}: {counter[i]},')
    ...:
0: 67,
1: 537,
2: 94,
3: 3972,
4: 2423,
5: 22,
6: 62,
7: 19,
8: 177,
9: 126,
10: 154,
11: 473,
12: 62,
13: 209,
14: 28,
15: 29,
16: 543,
17: 51,
18: 86,
19: 682,
20: 339,
21: 127,
22: 22,
23: 53,
24: 81,
25: 123,
26: 32,
27: 19,
28: 58,
29: 23,
30: 57,
31: 52,
32: 42,
33: 16,
34: 57,
35: 16,
36: 60,
37: 21,
38: 22,
39: 29,
40: 46,
41: 38,
42: 16,
43: 27,
44: 17,
45: 19,

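Topics 3 and 4 are far more common than the rest: together they account for about 57% of all items (6395 of 11228).
The number of articles per topic is imbalanced, but this time I proceed with the problem framed as classification into 46 classes.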
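
The share of the two largest topics can be checked with a short calculation. This is a minimal sketch that hard-codes only the two counts for topics 3 and 4 and the overall total from the output above. With the full counter from the session, Counter.most_common(2) would return the same two entries.

```python
from collections import Counter

# Counts for the two largest topics, copied from the listing above;
# total is the combined size of the training and test sets.
counter = Counter({3: 3972, 4: 2423})
total = 11228

top_two = counter.most_common(2)
share = sum(count for _, count in top_two) / total
print(top_two)         # [(3, 3972), (4, 2423)]
print(f"{share:.1%}")  # 57.0%
```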

Looking at the news text

Let's look at one news item as well.

In [12]: x_train[1000]
Out[12]:
[1,
 437,
 495,
 1237,
 55,
 9070,
 :
 12]

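A list of integers is displayed.
In this dataset, each word has been converted to an integer.
Let's recover the original text.

First, the mapping between words and integers can be obtained with tensorflow.keras.datasets.reuters.get_word_index (documentation).
Note: the data is downloaded the first time this runs.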

In [16]: word_index = reuters.get_word_index()

In [17]: len(word_index)
Out[17]: 30979

In [18]: word_index
Out[18]:
{'mdbl': 10996,
 'fawc': 16260,
 'degussa': 12089,
 'woods': 8803,
 'hanging': 13796,
 'localized': 20672,
 :
 'hebei': 9407,
 ...}

In [19]: for word, index in word_index.items():
    ...:     if index in [0, 1, 2]:
    ...:         print(word, index)
    ...:
the 1
of 2

In [20]: for word, index in word_index.items():
    ...:     if index in [30978, 30979, 30980]:
    ...:         print(word, index)
    ...:
jung 30978
northerly 30979

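word_index is a dictionary that maps each word to an integer.
Inverting it gives a dictionary from integer to word, which looks like what we need.
However, the integers used in x_train and x_test are offset from the integers in word_index, and that offset has to be handled.

The offset comes from three arguments of load_data.

1. Integer that marks the start: start_char=1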

The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character.

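In x_train and x_test, 1 marks the start of a sequence.
The default is 1 because 0 is typically used as the padding character (a filler used to pad out empty positions).

2. Integer for out-of-vocabulary words: oov_char=2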

words that were cut out because of the num_words or skip_top limit will be replaced with this character.

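When the num_words or skip_top argument restricts the range of words used, words outside that range are replaced with oov_char.
The x_train and x_test loaded so far were obtained without these arguments, so oov_char does not come into play yet.
(Judging from the load_data documentation, oov stands for "out of vocabulary".)

3. First integer assigned to an actual word: index_from=3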

index actual words with this index and higher.

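Because 0, 1, and 2 carry special meanings, the word indices are shifted.
With the index_from argument, x_train and x_test are loaded so that a word_index value of 0 would correspond to 3 (that is, every index is shifted up by 3)¹.

To view a news item as a sequence of words, build an integer: word dictionary that accounts for this offset.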

In [22]: index_word_map = {
    ...:     index + 3: word for word, index in word_index.items()
    ...: }
    ...: index_word_map[0] = "[padding]"
    ...: index_word_map[1] = "[start]"
    ...: index_word_map[2] = "[oov]"
In [23]: len(index_word_map)
Out[23]: 30982

With this dictionary, the text can be restored from the integer sequences in x_train and x_test.

In [24]: for index in x_train[1000]:
    ...:     print(index_word_map[index], end = " ")
    ...:
[start] german banking authorities are weighing rules for banks' off balance sheet activities in an attempt to cope with the growing volume of sophisticated capital market instruments banking sources said interest rate and currency swaps and ...
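
The decoding steps above can be wrapped in one small function. This is a sketch that runs on a toy vocabulary rather than the real word_index. The helper name decode and the entry "bank": 3 are made up for illustration, while "the": 1 and "of": 2 match the output shown earlier.

```python
def decode(sequence, word_index, index_from=3):
    """Map shifted integer indices back to words (illustrative helper)."""
    index_word_map = {index + index_from: word for word, index in word_index.items()}
    # Placeholder strings for the three reserved indices, as in the article.
    index_word_map.update({0: "[padding]", 1: "[start]", 2: "[oov]"})
    # Fall back to "[oov]" for any index that is missing from the mapping.
    return " ".join(index_word_map.get(i, "[oov]") for i in sequence)


# Toy vocabulary for illustration; the real word_index has 30979 entries.
toy_word_index = {"the": 1, "of": 2, "bank": 3}
print(decode([1, 4, 6, 5, 4, 2], toy_word_index))
# [start] the bank of the [oov]
```

The fallback to "[oov]" is a design choice of this sketch, so that an unexpected index does not raise KeyError.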

Preprocessing

Text preprocessing

The integer-encoded news articles vary in length.

In [31]: for x in x_train[998:1003]:
    ...:     print(len(x))
    ...:
133
51
626
17
442

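So the preprocessing converts them to a uniform length.
This time, each news article is represented by whether or not it contains each of the 1000 most frequent words across all news articles.

For example, suppose the word "currency" is among the top 1000 words and its integer is 500.
When each article is represented as a sequence of 0s and 1s, an article that contains "currency" has a 1 at index 500.
An article that does not contain "currency" has a 0 at index 500.
The same applies to every other word.

With this conversion:

- Every news article has length 1000
- Every news article is represented as a sequence of 1000 values, each 0 or 1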
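
The idea can be illustrated without any library. The sketch below is a plain-Python stand-in for the conversion, not the Keras implementation. The helper name to_binary_vector is hypothetical, and a vector length of 8 is used instead of 1000 to keep the output readable.

```python
def to_binary_vector(sequence, num_words):
    """Return a list of num_words 0/1 flags marking which indices occur."""
    vector = [0] * num_words
    for index in sequence:
        if index < num_words:  # indices that do not fit in the vector are ignored
            vector[index] = 1
    return vector


# Toy example with a vector of length 8 instead of 1000.
print(to_binary_vector([1, 4, 5, 4, 2], num_words=8))
# [0, 1, 1, 0, 1, 1, 0, 0]
print(to_binary_vector([1, 6], num_words=8))
# [0, 1, 0, 0, 0, 0, 1, 0]
```

Sequences of different lengths (5 and 2 here) both become vectors of the same length, which is the point of the conversion.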

Pass 1000 to the num_words argument² of the load_data method so that x_train and x_test are expressed using the 1000 most frequent words.

In [32]: (x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=1000)
    ...:

In [33]: len(y_train), len(y_test)
Out[33]: (8982, 2246)

Since num_words was set to the 1000 most frequent words, any word outside that set appears in x_train and x_test as oov_char (the integer 2).

In [34]: for index in x_train[1000]:
    ...:     print(index_word_map[index], end = " ")
    ...:
[start] german banking [oov] are [oov] [oov] for [oov] off balance [oov] [oov] in an [oov] to [oov] with the growing volume of [oov] capital market [oov] banking sources said interest rate and currency [oov] and ...
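
The replacement seen above can be sketched as a simple filter. This assumes the cutoff is applied to the already shifted indices, which is consistent with the maximum index of 999 observed later in this article. The helper name apply_num_words is hypothetical, and the input indices are made up.

```python
def apply_num_words(sequence, num_words, oov_char=2):
    """Replace every index at or above num_words with oov_char."""
    return [index if index < num_words else oov_char for index in sequence]


# Toy cutoff of 10 instead of 1000; the input indices are made up.
print(apply_num_words([1, 4, 12, 9, 10, 5], num_words=10))
# [1, 4, 2, 9, 2, 5]
```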

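At this stage the news articles still differ in length.
To represent whether each of the top 1000 words appears in a news article, use tensorflow.keras.preprocessing.text.Tokenizer (documentation).

Pass 1000 to the num_words argument when initializing the Tokenizer.
Processing done through the Tokenizer then takes num_words-1 words into account.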

num_words: the maximum number of words to keep, based
on word frequency. Only the most common num_words-1 words will
be kept.

In [37]: from tensorflow.keras.preprocessing.text import Tokenizer

In [42]: tokenizer = Tokenizer(1000)

Because load_data kept only the top 1000 words, the largest integer in x_train and x_test is 999³.

In [71]: max_index = 0

In [72]: for x in list(x_train)+list(x_test):
    ...:     now_max = max(x)
    ...:     if now_max > max_index:
    ...:         max_index = now_max
    ...:

In [73]: max_index
Out[73]: 999

Convert x_train and x_test with the sequences_to_matrix method (documentation).

In [36]: x_train.shape, x_test.shape
Out[36]: ((8982,), (2246,))

In [76]: x_train = tokenizer.sequences_to_matrix(x_train, "binary")

In [77]: x_test = tokenizer.sequences_to_matrix(x_test, "binary")

In [78]: x_train.shape, x_test.shape
Out[78]: ((8982, 1000), (2246, 1000))

Every news article is now represented with length 1000.

According to the sequences_to_matrix documentation,

a sequence is a list of integer word indices

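In other words, a sequence is a list of integers that each stand for a word, so x_train and x_test are exactly sequences.

As for the second argument, "binary" represents each word as 0/1 depending on whether it is present.
"count", "tfidf", and "freq" can also be specified.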
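
To get a feel for how the modes differ, here is a plain-Python sketch of three of them. It assumes that "count" stores the number of occurrences and that "freq" divides each count by the length of the sequence. "tfidf" is left out because it depends on statistics gathered over many documents. The helper name to_vector is hypothetical and this is not the Keras implementation.

```python
from collections import Counter


def to_vector(sequence, num_words, mode="binary"):
    """Encode one sequence as a length-num_words vector (sketch of three modes)."""
    counts = Counter(index for index in sequence if index < num_words)
    vector = [0.0] * num_words
    for index, count in counts.items():
        if mode == "binary":
            vector[index] = 1.0
        elif mode == "count":
            vector[index] = float(count)
        elif mode == "freq":
            vector[index] = count / len(sequence)
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return vector


sequence = [1, 4, 4, 5]  # toy input
print(to_vector(sequence, 6, "binary"))  # [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
print(to_vector(sequence, 6, "count"))   # [0.0, 1.0, 0.0, 0.0, 2.0, 1.0]
print(to_vector(sequence, 6, "freq"))    # [0.0, 0.25, 0.0, 0.0, 0.5, 0.25]
```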

Let's look at one converted news article:

In [58]: x_train[1000]
Out[58]:
array([0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0., 0., 1.,
       ...

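It is expressed in 0s and 1s.
In index_word_map, 0 (the padding character) and 3 (no word has index 0 in word_index, so the key 3 does not exist) never appear in any news item, so those positions are 0.
1 ([start]), 2 ([oov]), 4 (the), 5 (of), and so on appear in x_train[1000], so those positions are 1.

That completes the text preprocessing.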

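Label preprocessing

The labels (news topics) are integers from 0 to 45; convert them to a one-hot representation.
(The reason for using one-hot is to avoid implying an ordering among the labels.)

Use tensorflow.keras.utils.to_categorical (documentation).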

In [80]: y_train.shape, y_test.shape
Out[80]: ((8982,), (2246,))

In [81]: from tensorflow import keras

In [85]: number_of_classes = len(counter)

In [86]: y_train = keras.utils.to_categorical(y_train, number_of_classes)

In [88]: y_test = keras.utils.to_categorical(y_test, number_of_classes)

In [89]: y_train.shape, y_test.shape
Out[89]: ((8982, 46), (2246, 46))

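Treating each label as an index, the conversion produces a vector with 1 at the label's index and 0 everywhere else.
y_train[1000] was 19, so after conversion y_train[1000][19] is 1 and all other entries are 0.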

In [90]: y_train[1000]
Out[90]:
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)
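
The same conversion for a single label can be written in plain Python. This is a minimal sketch to show the idea, not a replacement for to_categorical. The helper name one_hot is hypothetical.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position label and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


encoded = one_hot(19, 46)
print(len(encoded), encoded.index(1.0), sum(encoded))  # 46 19 1.0
```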

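That completes the preprocessing.

Summary

The dataset
- Reuters newswire topic classification data (multi-class classification)
- There are 46 label classes, and the number of news items per class is imbalanced
- News items are represented as lists of integers. They can be restored to word sequences to check the original text
Preprocessing
- Convert the text to a fixed-length sequence of 0s and 1s (whether each of the 1000 most frequent words is present)
- Convert the labels to a one-hot representation

This article continues in [Model building].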

¹ word_index has no entry for 0, but 1 corresponds to 4, 2 corresponds to 5, and so on.
² "max number of words to include. Words are ranked by how often they occur (in the training set) and only the most frequent words are kept" (this specifies how many of the most frequent words in the training data are used when extracting x_train and x_test).
³ That is, the words whose keys in index_word_map go up to 999.
