

Posted at 2016-09-08





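Its overwhelming expressive power, its reputation as a rich person's game that goes nowhere without a large number of expensive GPGPUs, and so on:
I tried it for myself, and it really was impressive, so I am writing this post.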

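In this article, I try Japanese-to-English neural machine translation (NMT) on KFTT, a redistributable parallel corpus.
In addition, I compare it with Moses, which has been the typical tool for phrase-based statistical machine translation (PBMT).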


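This article covers:

  • Related information on NMT
  • How to install and run an NMT tool (seq2seq-attn)
  • The NMT translation results, plus a comparison of the NMT and PBMT outputs

It does not cover:

  • The principles of how NMT works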

I hope this article helps anyone interested in machine translation, one of the major fields of natural language processing.

Useful information on NMT

To start with, here are some links that look useful as entry points to NMT.





  • Install Torch's dependencies
  • Install Torch
  • Install seq2seq-attn's dependencies
  • Clone seq2seq-attn





git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;




  • Software installed with yum
sudo yum install -y epel-release # a lot of things live in EPEL
sudo yum install -y make cmake curl readline-devel ncurses-devel \
                    gcc-c++ gcc-gfortran git gnuplot unzip \
                    libjpeg-turbo-devel libpng-devel \
                    ImageMagick GraphicsMagick-devel fftw-devel \
                    sox-devel sox zeromq3-devel \
                    qt-devel qtwebkit-devel sox-plugins-freeworld
sudo yum install -y python-ipython
  • Compiled from source
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make install

## Installing Torch itself


git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch
./install.sh

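Next, load the activation script. During installation, a line that loads /lab/takahashi/tools/torch/install/bin/torch-activate (or the equivalent path on your machine) is appended to the end of the configuration file of the shell you are running (.zshrc, for example).
Therefore, to run Torch right after installation, you first need to run source ~/.zshrc or the equivalent.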
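Whether the activation has taken effect in the current environment can be checked by looking for the `th` launcher on `PATH`. The following is a minimal sketch in Python, assuming that torch-activate puts Torch's `bin` directory on `PATH`; the helper name `torch_is_activated` is made up for illustration.

```python
import shutil


def torch_is_activated(command="th"):
    """Return True if the given launcher can be found on PATH.

    Assumption: torch-activate adds Torch's bin directory (which holds `th`)
    to PATH, so a missing `th` suggests the script has not been sourced yet.
    """
    return shutil.which(command) is not None


if __name__ == "__main__":
    if torch_is_activated():
        print("th found on PATH")
    else:
        print("th not found; source your shell configuration file first")
```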


After torch-activate has been loaded, do the following.

# Clone seq2seq-attn
git clone https://github.com/harvardnlp/seq2seq-attn.git
# Install seq2seq-attn's dependencies
luarocks install hdf5
luarocks install nn
luarocks install nngraph
luarocks install luautf8
# If you use a GPU
luarocks install cutorch
luarocks install cunn

Running the translator on Japanese-English parallel data




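At this stage, the following preprocessing is performed.

  • Reducing the vocabulary size
  • Removing sentence pairs that contain too many words
  • Converting words to IDs (generating the dictionaries)
  • Building mini-batches
  • Shuffling the order of the parallel data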


python preprocess.py \
    --srcfile kftt/data/tok/kyoto-train.cln.ja \
    --targetfile kftt/data/tok/kyoto-train.cln.en \
    --srcvalfile kftt/data/tok/kyoto-dev.ja \
    --targetvalfile kftt/data/tok/kyoto-dev.en \
    --outputfile kftt/data/kyoto \
    --batchsize 64 \
    --shuffle 1
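
The preprocessing steps listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual implementation of preprocess.py: the function names, the `<unk>` handling, the default sizes, and the order of the steps are all assumptions made for the example.

```python
import random
from collections import Counter

UNK = "<unk>"  # assumed placeholder for out-of-vocabulary words


def build_vocab(sentences, max_size):
    """Keep the max_size most frequent words and give each an integer ID."""
    counts = Counter(word for sent in sentences for word in sent)
    vocab = {UNK: 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab


def to_ids(sentence, vocab):
    """Replace each word with its ID; unknown words map to the UNK ID."""
    return [vocab.get(word, vocab[UNK]) for word in sentence]


def preprocess(src, tgt, vocab_size=4, max_len=5, batch_size=2, seed=0):
    """Filter, index, batch, and shuffle tokenized sentence pairs."""
    # Drop sentence pairs in which either side is too long.
    pairs = [(s, t) for s, t in zip(src, tgt)
             if len(s) <= max_len and len(t) <= max_len]
    # Truncate each vocabulary to the most frequent words.
    src_vocab = build_vocab([s for s, _ in pairs], vocab_size)
    tgt_vocab = build_vocab([t for _, t in pairs], vocab_size)
    # Convert words to IDs.
    encoded = [(to_ids(s, src_vocab), to_ids(t, tgt_vocab)) for s, t in pairs]
    # Group into mini-batches, then shuffle the order of the batches.
    batches = [encoded[i:i + batch_size]
               for i in range(0, len(encoded), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches, src_vocab, tgt_vocab
```

Words that fall outside the truncated vocabulary all share one ID, which is why rare words can come out of the translator as `<unk>`.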


th train.lua \
    -data_file kftt/data/kyoto-train.hdf5 \
    -val_data_file kftt/data/kyoto-val.hdf5 \
    -savefile kftt/data/kyoto-model \
    -gpuid 1


Training takes about half a day on a GeForce GTX TITAN X (roughly 20k tokens/s).
Memory usage is around 1 GB, which is quite low compared with various other implementations.
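
To put the throughput figure in perspective, the following sketch converts tokens per second into the total number of tokens processed. It assumes that "half a day" means 12 hours; `corpus_tokens` is a hypothetical parameter that you would fill in with the token count of your own training data.

```python
def tokens_processed(tokens_per_second, hours):
    """Total number of tokens processed in the given wall-clock time."""
    return tokens_per_second * hours * 3600


def epochs_completed(tokens_per_second, hours, corpus_tokens):
    """Number of passes over a corpus of corpus_tokens tokens (hypothetical size)."""
    return tokens_processed(tokens_per_second, hours) / corpus_tokens


if __name__ == "__main__":
    # 20k tokens/s for an assumed 12 hours
    print(tokens_processed(20_000, 12))  # 864000000
```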


th evaluate.lua \
    -model kftt/data/kyoto-model_final.t7 \
    -src_file kftt/data/tok/kyoto-test.ja \
    -src_dict kftt/data/kyoto.src.dict \
    -targ_dict kftt/data/kyoto.targ.dict \
    -output_file kftt/data/tok/pred.txt \
    -gpuid 1



With a different morphological analysis dictionary and different corpus preprocessing, the Japanese-to-English result was BLEU 21.8 and RIBES 68.7.


|System | BLEU | RIBES|
|---|---|---|
|PBMT (Moses) | 19.3 | 66.4 |
|NMT (seq2seq-attn) | 21.8 | 68.7 |

BLEU improves by 2.5 points. Impressive...
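
For readers unfamiliar with the metric, here is a minimal sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, one reference per sentence, brevity penalty). It is not the evaluation script used for the scores above, and it returns a value between 0 and 1, whereas the scores above are on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1] with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Calling `corpus_bleu` with two lists of tokenized sentences (for example, each line of pred.txt and of the reference file split on spaces) gives one score for the whole test set.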



  • Japanese input
  • Reference translation
  • PBMT (Moses) output
  • NMT output

TRG: InfoboxBuddhist
REF: infobox buddhist
PBMT: InfoboxBuddhist
NMT: <unk>

TRG: 道元 は 、 鎌倉 時代 初期 の 禅僧 。
REF: dogen was a zen monk in the early kamakura period .
PBMT: dogen , the priests of the early kamakura period .
NMT: dogen was a zen priest in the early kamakura period .

TRG: 曹洞 宗 の 開祖 。
REF: the founder of soto zen
PBMT: the founder of the soto sect .
NMT: he was the founder of the soto sect .

TRG: 晩年 に 希玄 と いう 異称 も 用い た 。
REF: later in his life he also went by the name kigen .
PBMT: it was also used as another name for 希玄 in his later years .
NMT: it was also called <unk> in his later years .

TRG: 同 宗旨 で は 高祖 と 尊称 さ れる 。
REF: within the sect he is referred to by the honorary title koso .
PBMT: is referred to as koso in the religious doctrines of the same year .
NMT: in the same religious doctrine , he was honorifically called koso .

TRG: 一般 に は 道元 禅師 と 呼ば れる 。
REF: he is generally called dogen zenji .
PBMT: it is generally called dogen zenji .
NMT: he is generally called dogen zenji .

TRG: 日本 に 歯磨き 洗面 、 食事 の 際 の 作法 や 掃除 の 習慣 を 広め た と いわ れる 。
REF: he is reputed to have been the one that spread the practices of tooth brushing , face washing , table manners and cleaning in japan .
PBMT: in japan , it is said that the custom to clean wash 歯磨き when eating manners and spread .
NMT: it is said that he spread the manners and customs of meals in japan and the custom of cleaning .

TRG: 最初 に モウソウチク ( 孟宗 竹 ) を 持ち帰っ た と する 説 も ある 。
REF: another story has it that he was the first one to bring moso-chiku ( moso bamboo ) to japan .
PBMT: there is also a theory that the first moso-chiku ( phyllostachys edulis species in the back ) .
NMT: there is a theory that he was first brought back to japan .

TRG: 道元 の 出生 に は 不明 の 点 が 多い が 、 内 大臣 土御門 通親 ( 源 通親 あるいは 久我 通親 ) の 嫡流 に 生まれ た と する 点 で は 諸説 が 一致 し て いる 。
REF: though some points are unclear about dogen 's birth , all accounts agree that he was born in the line of udaijin ( minister of the right ) michichika tsuchimikado ( minamoto no michichika or michichika koga ) .
PBMT: there are various theories as to the birth of michichika koga of naidaijin ( minister ) minamoto no michichika ( michichika tsuchimikado , but it is often was born in the direct line of the intersection with dogen , or ) ) .
NMT: there are many theories about the birth of dogen , but there are various theories regarding the fact that he was born as the direct descendant of the minister of the center michichika tsuchimikado ( minamoto no michichika or michichika koga ) .

TRG: 定説 で は 京都 木幡 の 松 殿 山荘 で 通親 と 太政 大臣 松 殿 基房 ( 藤原 基房 ) の 娘 藤原 伊 子 の 子 と し て 生まれ た と さ れ て いる が 、 近年 の 研究 で は 定説 で は 養父 と さ れ て いる 堀川 通具 の 実子 と する 説 が 有力 に なり つつ ある 。
REF: although it is generally accepted that he was born in shoden sanso in kohata , kyoto , to michichika and fujiwara no ishi , the daughter of daijo-daijin ( grand minister of state ) motofusa matsudono ( fujiwara no motofusa ) , recent research suggests that he may have been the son of michitomo horikawa , who was presumed to be his adoptive father .
PBMT: he was born as the child of fujiwara no ishi , a daughter of fujiwara no motofusa matsudono and his foster father was an established theory is becoming influential in the study of the biological son of motofusa matsudono ( grand minister ) , but in recent years , michitomo horikawa , kyoto of kohata villa with michichika .
NMT: it is said that he was born as a son of the grand minister of state motofusa matsudono ( fujiwara no motofusa ) , who was the daughter of the grand minister of state , motofusa matsudono ( fujiwara no motofusa ) , and the daughter of the grand minister of state motofusa matsudono ( fujiwara no motofusa ) in the <unk> villa in kohata , kyoto .

TRG: また 、 通親 の 子 、 源 通 宗 また は 久我 通光 を 父親 と する 説 も ある 。
REF: another account says his father was the son of michichika , minamoto no michimune or michiteru koga .
PBMT: in addition , there is a theory that the father and the son of michichika , minamoto no michimune or michiteru koga .
NMT: there is also a theory that michichika 's son , minamoto no michimune or michiteru koga is his father .

TRG: 伝記 で ある 『 建撕 記 』 に よれ ば 、 3 歳 で 父 ( 通親 ) を 、 8 歳 で 母 を 失っ て 、 異母 兄 で ある 堀川 通具 の 養子 に なっ た 。
REF: according to the biography " kenzeiki " , he lost his father ( michichika ) at 3 years of age , his mother at 8 , and was adopted by his half brother michitomo horikawa .
PBMT: according to the " kojiki , " 建撕 lost his mother at the age of eight , his older paternal half-brother , michitomo horikawa , became an adopted son of his father at the age of three ( michichika ) , is his biography .
NMT: according to " <unk> , " a biography , his father ( michichika ) was born when he was three years old , and his mother was the adopted child of michitomo horikawa , who was a paternal half-brother .

TRG: また 、 一説 に よれ ば 両親 の 死後 に 母方 の 叔父 で ある 松 殿 師家 ( 元 摂政 内 大臣 ) から 松 殿 家 の 養 嗣子 に し たい と いう 話 が あっ た が 、 世 の 無常 を 感じ て い た 道元 が 断っ た と も 言わ れ て いる 。
REF: yet another account tells that his maternal uncle moroie matsudono ( former regent and interior minister ) wanted to adopt him as an heir after his parents died , but dogen , feeling the uncertainty of the world , declined .
PBMT: in addition , it is also said that it was the adopted heir of the matsudono family declined after the death of his parents , who felt the absence of absolutes regent moroie matsudono ( according to one theory , the story as maternal uncle to dogen ) from naidaijin ( minister ) .
NMT: according to one theory , he wanted to become an adopted heir of matsudono family ( former regent ) from his maternal uncle , moroie matsudono ( former regent , minister of the palace ) after the death of his parents , but it is said that dogen , who felt sorry for the impermanence of the world , refused .

TRG: その 他 、 御 手伝 と 称する 課役 や 江戸 時代 末期 に は 海岸 防備 を 命ぜ られる こと も あり 、 大名 は 常 に 経済 的 に も 苦しかっ た 。
REF: other than that , there was a system of assignments called otetsudai , and at the end of the edo period some were ordered to defend the coast , so the daimyo were always in a difficult position financially .
PBMT: other than that , he was assigned to defend the coast , and in the end of the edo period , the daimyo in economic spread as the assistant of assignments .
NMT: in addition , the daimyo was always in financial difficulty in the end of edo period , and in the end of edo period , daimyo always had financial difficulty .

Compared with the PBMT output, it is impressive that the NMT output is actually readable.
That said, I did not use a copy model this time, so unknown words are left as <unk>.

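The NMT output looks plausible, but it has problems such as:

  • Mistakes in he/she/it, and also in tense
  • Output that sounds plausible for the input sentence but states something completely unrelated
  • For long sentences, repeating similar English phrases to pad out the word count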
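
The last kind of error, repeated phrases in long sentences, is easy to flag automatically. The following is a minimal sketch that lists the n-grams occurring more than once in a single output sentence; the function name and the default values of `n` and `min_count` are arbitrary choices for illustration.

```python
from collections import Counter


def repeated_ngrams(tokens, n=4, min_count=2):
    """Return the n-grams that occur at least min_count times in one sentence."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    output = ("in addition , the daimyo was always in financial difficulty "
              "in the end of edo period , and in the end of edo period , "
              "daimyo always had financial difficulty .").split()
    for gram, count in repeated_ngrams(output).items():
        print(count, " ".join(gram))
```

An empty result means no n-gram of that length is repeated. A lower `n` catches shorter repetitions but also flags ordinary repeated function words.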


Neural versus Phrase-Based Machine Translation Quality: a Case Study


