
An Easy Data Augmentation Technique for Natural Language Processing

Last updated at 2019-12-31, posted at 2019-12-31

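Introduction

Deep learning requires large amounts of training data, but in practice that much data is often impossible to collect.
Against this background, techniques for learning from small datasets have become increasingly common in recent years.
Three approaches to learning from limited data are: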

Using high-quality data
Data augmentation
Transfer learning

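This article focuses on data augmentation through translation into other languages, as used in natural language processing. It first reviews what data augmentation is and what to watch out for when applying it, and then walks through an actual implementation.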

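Data Augmentation in Natural Language Processing

Data augmentation is a technique that increases the amount of training data by transforming the original data. It is widely used not only in natural language processing but also in image processing and other fields.
As an aside, the Japanese original of this article uses the informal term 水増し (mizumashi, roughly "padding") for what is called "data augmentation" in English; the literal Japanese rendering would be データ拡張 ("data expansion").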
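
The idea can be sketched in a few lines. The snippet below is a minimal illustration and not the method implemented later in this article: `augment_dataset` and the toy `swap_synonyms` transform are hypothetical names introduced here, and the synonym table is made up for the example. What it shows is that each transformed text is added alongside the original and inherits the original's label.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)

# Toy synonym table, made up for this illustration only.
SYNONYMS = {"great": "good", "start": "beginning"}


def swap_synonyms(text: str) -> str:
    """Hypothetical stand-in transform: replace a few words with synonyms."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())


def augment_dataset(data: List[Example], transform: Callable[[str], str]) -> List[Example]:
    """Return the original examples plus transformed copies with the same labels."""
    augmented = list(data)
    for text, label in data:
        new_text = transform(text)
        # Skip transforms that changed nothing; they would only add duplicates.
        if new_text != text:
            augmented.append((new_text, label))
    return augmented


data = [("a great start", 0), ("this is awful", 1)]
result = augment_dataset(data, swap_synonyms)
for text, label in result:
    print(label, text)
```

Any function that maps a text to a slightly different text with the same meaning could be passed as `transform`; round-trip translation is one such function.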

Analysis Environment and Preparation

The implementation in this article runs on a Kaggle Kernel.
The specifications and settings of the Kaggle environment used are listed below.

Python 3.6.6
Anaconda conda 4.6.14
RAM 16GB
Disk 4.9GB
Language Python
GPU Off
Internet On

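When using a Kaggle Kernel, remember to set Internet to On, because the translation step sends requests to an external service.
If you are working in a local environment instead, install the required modules by running the following commands at a command prompt.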

pip install -U joblib textblob
python -m textblob.download_corpora

Import the modules as follows.

augmentation.py

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# augmentation
from joblib import Parallel, delayed
from textblob import TextBlob
from textblob.translate import NotTranslated

# sleep
from time import sleep 

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

The data used here is the dataset from the Kaggle competition Jigsaw Unintended Bias in Toxicity Classification.
No preprocessing is applied before translation; only the first 100 records are extracted for this trial run.

augmentation.py

# importing the dataset
train=pd.read_csv("../input/train.csv")
train=train.head(100)

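Translation

With the modules and data ready, we can move on to the translation itself.
As a first test, run the code below to translate the example sentence assigned to x from English into Japanese.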

augmentation.py

# a random example
x = "Great question! It's one we're asked a lot. We've designed the system assuming that people *will* try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!"

# translate
analysis = TextBlob(x)
print(analysis)
print(analysis.translate(to='ja'))

It should return a result like the following.

Great question! It's one we're asked a lot. We've designed the system assuming that people *will* try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!
素晴らしい質問です。それは私たちがたくさん求められているものです。私たちは、人々がそれを悪用しようとすると仮定してシステムを設計しました。そのため、ピアレビューに加えて、バックエンドに多くのメタ分析を行うアルゴリズムがあります。私はシステムがまだ100％完璧ではないと確信しています、しかし我々は何ヶ月ものベータテストからそれが本当に堅実なスタートであることを知っています、そして我々はそれを改善するために努力し続ける！

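The translation appears to have worked.
Next, to make the translation step reusable, we wrap it in a function.
This time we will try three languages: Spanish, German, and French.
First, define the parameters: the languages used for translation, the number of cores used for parallel processing (n_jobs=-1 uses all available cores), and how often progress is printed (verbose).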

augmentation.py

languages = ["es", "de", "fr"]
parallel = Parallel(n_jobs=-1, backend="threading", verbose=5)
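
Translation requests spend most of their time waiting on the network, which is why a thread-based backend is a reasonable fit here. The sketch below illustrates the effect under stated assumptions: it uses the standard library's `ThreadPoolExecutor` as a stand-in for joblib so that it runs without extra packages, and `fake_translate` is a hypothetical function that simulates network latency with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_translate(comment: str) -> str:
    """Hypothetical stand-in for a network call: wait briefly, then return a result."""
    time.sleep(0.05)  # simulated network latency
    return comment.upper()


comments = ["comment {}".format(i) for i in range(8)]

# Sequential baseline: total time is roughly 8 * 0.05 s.
start = time.perf_counter()
sequential = [fake_translate(c) for c in comments]
sequential_time = time.perf_counter() - start

# Four worker threads: the waits overlap, so the wall-clock time drops.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order.
    threaded = list(pool.map(fake_translate, comments))
threaded_time = time.perf_counter() - start

print("sequential: {:.2f}s, threaded: {:.2f}s".format(sequential_time, threaded_time))
```

More threads also means more requests per unit of time, which matters if the translation service limits request rates.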

Next, define the translation function.
The code below translates a comment into the given language and then back into English.

augmentation.py

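def translate_text(comment, language):
    # Decode byte strings so that TextBlob always receives text
    if hasattr(comment, "decode"):
        comment = comment.decode("utf-8")
    text = TextBlob(comment)
    try:
        # Translate into the target language, then back into English
        text = text.translate(to=language)
        sleep(0.4)  # pause between requests to avoid HTTP 429
        text = text.translate(to="en")
        sleep(0.4)
    except NotTranslated:
        # The translation was identical to its input; keep the text as it is
        pass
    return str(text)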

The key point in the code above is that the sleep function from the time module pauses briefly after each translation request.
Without these pauses, the following error occurs:

HTTPError: HTTP Error 429: Too Many Requests
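
The fixed pause is the approach this article uses. A common alternative, shown here only as a sketch and not as part of the original implementation, is to retry a rejected request with an increasing delay (exponential backoff). The names `call_with_backoff` and `flaky_service` are hypothetical, and the fake service stands in for the real translation call so that the example runs offline.

```python
import io
import time
from urllib.error import HTTPError


def call_with_backoff(func, *args, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(*args); on HTTP 429, wait and retry with a doubling delay."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except HTTPError as err:
            # Re-raise anything that is not a rate-limit error, or when retries run out.
            if err.code != 429 or attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))


# Demo with a fake service that rejects the first two calls.
calls = {"count": 0}


def flaky_service(text):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise HTTPError("http://example.invalid", 429, "Too Many Requests", {}, io.BytesIO())
    return text[::-1]


# Record the requested delays instead of actually sleeping, to keep the demo fast.
waits = []
result = call_with_backoff(flaky_service, "abc", sleep=waits.append)
print(result, waits)
```

In real use the `sleep` argument would be left at its default, and `func` would be the function that performs the translation request.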

Now run the translation using the parameters and function defined above.
The following code does this.

augmentation.py

# Keep the original comments; missing values become the string "unknown"
comments_list = train["comment_text"].fillna("unknown").values

for language in languages:
    print('Translate comments using "{0}" language'.format(language))
    # Translate every comment into this language and back, in parallel
    translated_data = parallel(delayed(translate_text)(comment, language) for comment in comments_list)
    # Replace the text column and write one CSV file per language
    train['comment_text'] = translated_data
    result_path = "train_" + language + ".csv"
    train.to_csv(result_path, index=False)

Running the code above should produce log output like the following.

Translate comments using "es" language

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.4s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   20.8s finished

Translate comments using "de" language

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   20.7s finished

Translate comments using "fr" language

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   21.4s finished

Once every run has reached "finished", the translation results should have been written out as CSV files.

Augmentation Results

Let's look at the translated output, using the following original text as an example.

Original text

Great question! It's one we're asked a lot. We've designed the system assuming that people will try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!

Compared with the original, the wording in the translated results below has changed slightly.

English -> Spanish -> English

Great question It is one that they ask us a lot. We have designed the system assuming that people will try to abuse it. So, in addition to peer reviews, there are algorithms in the backend that perform many meta-analyzes. I'm sure the system is not 100% perfect yet, but we know for months of beta testing that it's a really solid start, and we'll keep working to improve it!

English -> German -> English

Good question! We are often asked about it. We designed the system on the assumption that people will try to abuse it. In addition to the peer reviews, there are backend algorithms that do a lot of meta-analysis. I'm sure the system is not 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll continue to work on improving it!

English -> French -> English

Good question! We are asked a lot. We designed the system on the assumption that people will * try * to abuse it. Thus, in addition to peer reviews, there are algorithms on the backend that do a lot of meta-analysis. I'm sure the system is not 100% perfect yet, but months of beta testing have taught us that it was a good start, and we will continue to improve it!
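
How much each result differs from the original can also be checked programmatically. The sketch below is one possible approach, not something the original article does: it scores word-level similarity with the standard library's `difflib.SequenceMatcher` and drops candidates that are identical to the original or to an earlier candidate. The helper names `similarity` and `keep_new_variants` are hypothetical; the first two candidates are the opening sentences of the Spanish and German results above, and the third is an assumed case in which the text came back unchanged.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means the word sequences are identical."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def keep_new_variants(original, candidates):
    """Drop candidates that equal the original or repeat an earlier candidate."""
    seen = {original}
    kept = []
    for text in candidates:
        if text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


original = "Great question! It's one we're asked a lot."
candidates = [
    "Great question It is one that they ask us a lot.",  # via Spanish
    "Good question! We are often asked about it.",       # via German
    "Great question! It's one we're asked a lot.",       # unchanged (assumed case)
]

# Print a similarity score next to each variant that survives the filter.
for text in keep_new_variants(original, candidates):
    print("{:.2f}  {}".format(similarity(original, text), text))
```

Filtering of this kind keeps exact duplicates out of the augmented training set, where they would add rows without adding variety.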

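References

Here are some sites that informed this article or that cover related topics.

TextBlob: Simplified Text Processing:
A good place to learn more about TextBlob, the library used here for translation.

JoblibのParallelの全引数を解説 (an explanation of every argument of Joblib's Parallel):
Covers Joblib, the library used for parallel processing, including arguments other than the ones used in this article.

水増しと転移学習 (Vol.7) (Data Augmentation and Transfer Learning, Vol. 7):
Discusses data augmentation techniques in image processing.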

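Conclusion

In a follow-up article, I hope to look at how much the similarity between the source language and the languages used for translation affects model accuracy when a model is finally built on the augmented data.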
