Fuel's TextFile is handy #Python

Fuel

Fuel's main components are

Dataset
DataStream
IterationScheme

Here I show an example using Dataset.

A class that inherits from fuel.datasets.base.Dataset defines what kind of data it holds.
For example, consider batch processing over training data for classifying two-dimensional grayscale images.

>>> dataset.axis_labels
OrderedDict([('features', ('batch', 'height', 'width')), ('targets', ('batch', 'index'))])
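To make the axis labels concrete, here is a small sketch in plain Python that pairs each label with a dimension of one batch. The shapes (32 images of 28x28 pixels, one label column) are hypothetical values chosen for illustration, and the describe helper is not part of Fuel.

```python
from collections import OrderedDict

# Axis labels as shown above: one tuple of axis names per data source.
axis_labels = OrderedDict([
    ('features', ('batch', 'height', 'width')),
    ('targets', ('batch', 'index')),
])

# Hypothetical shapes for a single batch (assumption, not taken from Fuel).
batch_shapes = {'features': (32, 28, 28), 'targets': (32, 1)}


def describe(source):
    """Pair each axis name of a source with the size of that axis."""
    return dict(zip(axis_labels[source], batch_shapes[source]))


print(describe('features'))
print(describe('targets'))
```

Read this way, 'features' is a stack of 2-D images along the batch axis, and 'targets' holds one class index per image.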

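However, a Dataset handles data that has already been converted to numbers, which raises the question of what to do with text data.
For that purpose, a class called TextFile is provided.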

TextFile

TextFile — Fuel 0.1.1 documentation

For example, suppose you have a vocab.pkl like the following.

>>> import cPickle
>>> d = cPickle.load(open('vocab.pkl'))
>>> d
{'and': 12, 'cute': 13, 'forget': 14, 'it': 15, 'an': 16, 'break-': 17, 'are': 18, 'horrendous': 19, '&apos;re': 20, '<UNK>': 0, 'again': 5, 'what': 22, 'make': 23, ',': 6, '.': 3, 'start': 24, 'pronunciation': 25, 'asking': 26, 'roxanne': 27, 'you': 7, 'out': 21, 'public': 29, '?': 8, '&apos;d': 30, 'we': 9, 'okay': 31, 'that': 10, '<S>': 1, 'andrew': 32, 'if': 33, '&apos;s': 4, 'quad': 34, 'with': 11, 'barrett': 35, 'me': 36, 'on': 37, 'incredibly': 28, 'your': 38, 'name': 39, 'this': 40, 'well': 41, 'up': 42, 'thought': 43, 'i': 44, 'korrine': 45, 'so': 46, 'can': 47, 'quick': 48, 'the': 49, '</S>': 2, 'having': 50}

Also suppose you have a train_post.txt.tok like the following.

train_post.txt.tok

Can we make this quick ? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad . Again .
You &apos;re asking me out . That &apos;s so cute . What &apos;s your name again ?

Then it works like this.

>>> import cPickle
>>> d = cPickle.load(open('vocab.pkl'))
>>> def get_key(n):
...     for k, v in d.items():
...         if v == n:
...             return k
...     return '<NULL>'
... 
>>> from fuel.datasets.text import TextFile
>>> text_data = TextFile(['train_post.txt.tok'], d)
>>> s = text_data.open()
>>> data = text_data.get_data(s)
>>> " ".join([get_key(n) for n in data[0]])
'<S> <UNK> we make this quick ? <UNK> <UNK> and <UNK> <UNK> are having an incredibly horrendous public break- up on the quad . <UNK> . </S>'
>>> 
>>> def lower(s):
...     return s.lower()
... 
>>> text_data = TextFile(['train_post.txt.tok'], d, preprocess=lower)
>>> s = text_data.open()
>>> data = text_data.get_data(s)
>>> " ".join([get_key(n) for n in data[0]])
'<S> can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break- up on the quad . again . </S>'
>>> 
>>> data = text_data.get_data(s)
>>> data[0]
[1, 7, 20, 26, 36, 21, 3, 10, 4, 46, 13, 3, 22, 4, 38, 39, 5, 8, 2]
>>> " ".join([get_key(n) for n in data[0]])
'<S> you &apos;re asking me out . that &apos;s so cute . what &apos;s your name again ? </S>'

Because the words were lowercased when the dictionary was built, I pass preprocess so that the input text is lowercased as well.
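As a side note, the get_key helper in the session above scans the whole dictionary for every id. A minimal sketch of an alternative is to invert the dictionary once and look ids up directly. The small vocab here is a subset of the vocab.pkl shown earlier, and id_to_word and decode are names I made up for illustration.

```python
# Subset of the vocabulary shown earlier (word -> id).
vocab = {'<UNK>': 0, '<S>': 1, '</S>': 2, '.': 3,
         'you': 7, 'out': 21, 'asking': 26, 'me': 36}

# Invert the mapping once (id -> word) instead of scanning per lookup.
id_to_word = {index: word for word, index in vocab.items()}


def decode(ids, unknown='<NULL>'):
    """Turn a list of ids back into a space-separated string."""
    return ' '.join(id_to_word.get(i, unknown) for i in ids)


print(decode([1, 7, 26, 36, 21, 3, 2]))
```

This assumes every id maps to exactly one word, which holds for the dictionary above.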

So by passing in a dictionary like this, TextFile converts each word to its corresponding number.
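To illustrate the conversion, here is a rough pure-Python sketch of what the output above suggests is happening for each line: apply preprocess, split on whitespace, map unknown words to <UNK>, and wrap the result in <S> and </S>. This is my own approximation inferred from the results, not Fuel's actual implementation, and encode_line and its parameter names are hypothetical.

```python
def encode_line(line, dictionary, preprocess=None,
                bos_token='<S>', eos_token='</S>', unk_token='<UNK>'):
    """Convert one whitespace-tokenized line into a list of word ids."""
    if preprocess is not None:
        line = preprocess(line)
    ids = [dictionary[bos_token]]
    for word in line.split():
        # Words missing from the dictionary fall back to the unknown id.
        ids.append(dictionary.get(word, dictionary[unk_token]))
    ids.append(dictionary[eos_token])
    return ids


# Subset of the vocabulary shown earlier (word -> id).
vocab = {'<UNK>': 0, '<S>': 1, '</S>': 2, '.': 3,
         'you': 7, 'out': 21, 'asking': 26, 'me': 36}

print(encode_line('You asking me out .', vocab))
print(encode_line('You asking me out .', vocab, preprocess=str.lower))
```

Without preprocess, the capitalized "You" is not in the lowercase dictionary and becomes <UNK>, which matches the first result in the session above.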

Of course, all of this only matters if you are using Fuel in the first place.

Reference

Tokenization and dictionary creation have to be done beforehand, so I did the following.

Tokenize

mosesdecoder/tokenizer.perl at master · moses-smt/mosesdecoder

I borrowed the script above.

perl tokenizer.perl < train_post.txt -l en 2>/dev/null > train_post.txt.tok

train_post.txt

Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
You're asking me out.  That's so cute. What's your name again?

train_post.txt.tok

Can we make this quick ? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad . Again .
You &apos;re asking me out . That &apos;s so cute . What &apos;s your name again ?

The apostrophe (') comes out escaped as &apos;, so it still needs to be dealt with somehow.
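One possible way to handle it is to unescape the tokenized text before building the dictionary. The sketch below assumes Python 3, where html.unescape is in the standard library (the sessions in this post use Python 2), and unescape_tokens is a hypothetical helper name.

```python
import html


def unescape_tokens(line):
    """Replace HTML entities such as &apos; with the characters they stand for."""
    return html.unescape(line)


line = 'You &apos;re asking me out . That &apos;s so cute .'
print(unescape_tokens(line))
```

If you do this, the dictionary has to be built from the unescaped text too, otherwise tokens like 're would all end up as <UNK>.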

Vocabulary

GroundHog/preprocess.py at master · lisa-groundhog/GroundHog

I borrowed the script above.

python preprocess.py -l train_post.txt.tok

Running it generates vocab.pkl.

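The -d option sets the path of the generated dictionary. The default is vocab.pkl.
The -l option lowercases the text.
The -v option sets the vocabulary size. If it is omitted, all words are included.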

However, when you want to build a dictionary from multiple files, running

python preprocess.py -l train_post.txt.tok train_reply.txt.tok

does not work properly.
The cause is this line:

vocab_count = counter.most_common()

I changed it to

vocab_count = combined_counter.most_common()

to fix it.

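If you specify the vocabulary size, multiple files work without any problem.
I suppose people basically never run it without specifying the vocabulary size...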
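The difference between the two lines can be reproduced with a small sketch. Only the names counter and combined_counter come from the script; the surrounding loop is my own assumption about its structure, and the in-memory lists stand in for the two files.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens over an iterable of lines."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


# Stand-ins for two tokenized files (hypothetical contents).
file_a = ['can we make this quick ?']
file_b = ['what is your name again ?']

combined_counter = Counter()
for lines in (file_a, file_b):
    counter = count_words(lines)   # counts for the current file only
    combined_counter += counter    # running total over all files

# After the loop, counter only holds the last file's words.
last_file_only = dict(counter.most_common())
all_files = dict(combined_counter.most_common())

print('can' in last_file_only, 'can' in all_files)
```

Using the per-file counter after the loop silently drops every word that appears only in the earlier files, which is why the combined counter is the one to use.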

Also, when calling text_data = TextFile(['train_post.txt.tok'], d), you can specify the special symbols such as BOS. The defaults are as follows.

symbol	meaning
<UNK>	Unknown term
<S>	Beginning of sentence
</S>	End of sentence

An error is raised unless all three are in the dictionary, so I modified the script as follows.

vocab = {'<UNK>': 0, '<S>': 1, '</S>': 2}
for i, (word, count) in enumerate(vocab_count):
    vocab[word] = i + 3
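The same idea can be written as a standalone sketch that reserves the first ids for the special symbols and numbers the remaining words by frequency. build_vocab and its specials parameter are hypothetical names, and the sample sentence is only for illustration.

```python
from collections import Counter


def build_vocab(vocab_count, specials=('<UNK>', '<S>', '</S>')):
    """Build a word -> id mapping with the special symbols at the lowest ids."""
    vocab = {token: i for i, token in enumerate(specials)}
    offset = len(vocab)
    for i, (word, count) in enumerate(vocab_count):
        # Regular words start right after the reserved ids.
        vocab[word] = i + offset
    return vocab


counter = Counter('you asking me out . that is so cute .'.split())
vocab = build_vocab(counter.most_common())
print(vocab)
```

Deriving the offset from the number of special symbols avoids hard-coding the 3, in case another reserved symbol is added later.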