
[NLP] Getting Started with Hugging Face 🤗 Transformers

Posted at 2021-02-15

Introduction 🤗

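🤗 Transformers is a Python module you can use to try out natural language processing.
I had already used PEGASUS, one of its pretrained models for text summarization, and I got curious about what else the library can do, so I went through the official site.
When I used PEGASUS, I built a text summarization browser.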

The goal this time is to find out what you can do with transformers. At first glance it is hard to tell what this module even offers, so that is where I will start.
Let's begin with the Quick tour!

Quick tour

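Let's start with the features of the Transformers library. The library downloads pretrained models for natural language processing tasks such as analyzing the sentiment of a text, completing a prompt, or generating text as in translation.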

First, we will see how to use the pipeline API to run pretrained models for inference. Then we will dig a little deeper and look at how the library gives you access to the models and how it helps you preprocess your data.

Tasks available with pipeline

The easiest way to use a pretrained model for a given task is the pipeline() function. 🤗 Transformers provides the following tasks out of the box:

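Sentiment analysis: is this text positive or negative?
Text generation (in English): give the model a prompt and it generates the text that follows.
Named entity recognition: labels each word in a sentence with the entity it represents.
Question answering: give the model some context and a question, and it extracts the answer from the context.
Filling masked text: fills in the blanks in a text whose words have been replaced by [MASK].
Summarization: generates a summary of a long text.
Translation: translates text from one language into another.
Feature extraction: returns a tensor representation of the text.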

Let's see how pipeline() works, using sentiment analysis as an example.

>>> from transformers import pipeline
>>> classifier = pipeline('sentiment-analysis')

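The first time you run this command, a pretrained model and its tokenizer are downloaded and cached. The tokenizer's job is to preprocess the text into a form the model can accept; the model then generates predictions. pipeline() bundles these steps together and post-processes the predictions into a human-readable form.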

Example:

>>> classifier('We are very happy to show you the 🤗 Transformers library.')
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

How encouraging! pipeline() also accepts a list of sentences. The sentences are preprocessed and fed to the model as a batch, and the result comes back as a list of dictionaries, like this:

>>> results = classifier(["We are very happy to show you the 🤗 Transformers library.",
...            "We hope you don't hate it."])
>>> for result in results:
...     print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309

The second sentence is classified as NEGATIVE, but its score is quite close to the middle, so the prediction is fairly neutral.
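To see why a score near 0.5 means the classifier is almost undecided, here is a minimal sketch of the kind of post-processing a two-class classifier needs: a softmax over two raw scores (logits), then picking the more probable label. The helper name `logits_to_prediction` and the logit values are made up for illustration; this is not the pipeline's actual code.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Hypothetical helper: pick the most probable label and report its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Made-up logits: one confident case and one nearly undecided case.
confident = logits_to_prediction([-4.0, 4.0])
undecided = logits_to_prediction([0.06, -0.06])
print(confident)
print(undecided)
```

With two classes and a softmax, the winning score can never drop below 0.5, so a value like 0.5309 is about as unsure as such a classifier can get.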

By default, the model downloaded for this pipeline() is called distilbert-base-uncased-finetuned-sst-2-english. It uses the DistilBERT architecture and has been fine-tuned for sentiment analysis on a dataset called SST-2.

Let's try a different model, for instance one that has been trained on French data. From the pretrained models registered on the model hub, we pick nlptown/bert-base-multilingual-uncased-sentiment.

You can pass its name directly to pipeline().

>>> classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

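This classifier can now handle not only English and French but also Dutch, German, Italian, and Spanish! The model name can also be replaced with the path to a pretrained model saved in your local environment.
You can also pass a model object and its associated tokenizer.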

To do that, you need two classes. The first is AutoTokenizer, which downloads and instantiates the tokenizer associated with the model you picked. The second is AutoModelForSequenceClassification (or TFAutoModelForSequenceClassification if you are using TensorFlow), which downloads and instantiates the model itself.

To download and use the model and tokenizer, call the from_pretrained() method.

>>> from transformers import AutoTokenizer, AutoModelForSequenceClassification
>>> model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
>>> model = AutoModelForSequenceClassification.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

If you cannot find a model that was pretrained on data similar to yours, you will need to fine-tune a model on your own data.

Under the hood

An explanation of what happens under the hood of pretrained models is available here

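Summary of the tasks

This part covers the most frequent use cases of the library. The models support many different configurations and can be used for a very wide range of purposes. The simplest ones, such as question answering, sequence classification, and named entity recognition, are introduced here.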

These examples use auto models. An auto model is a class that automatically selects the appropriate model architecture for a given checkpoint and instantiates that model. They are provided as classes such as AutoModel.
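The idea of picking an architecture from a checkpoint can be sketched as a simple lookup. Everything below (the toy classes, the `ARCHITECTURES` table, the `model_type` key, and `toy_from_pretrained`) is a hypothetical stand-in written for illustration, not the library's implementation.

```python
class ToyBert:
    """Stand-in for one model architecture."""
    def __init__(self, config):
        self.config = config

class ToyGPT2:
    """Stand-in for another model architecture."""
    def __init__(self, config):
        self.config = config

# Hypothetical registry: architecture name stored with the checkpoint -> class.
ARCHITECTURES = {"bert": ToyBert, "gpt2": ToyGPT2}

def toy_from_pretrained(checkpoint_config):
    """Instantiate the class that matches the checkpoint's declared architecture."""
    model_type = checkpoint_config["model_type"]
    try:
        model_cls = ARCHITECTURES[model_type]
    except KeyError:
        raise ValueError(f"Unknown architecture: {model_type!r}") from None
    return model_cls(checkpoint_config)

model = toy_from_pretrained({"model_type": "gpt2", "n_layer": 12})
print(type(model).__name__)
```

The caller never names a concrete class; the information saved with the checkpoint decides which one gets built.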

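For a model to perform well on a task, it must be loaded from a checkpoint that corresponds to that task. These checkpoints are usually pretrained on a large corpus of data and then fine-tuned on a specific task. This means the following:

Not all models have been fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you will need to handle that yourself.
A fine-tuned model was fine-tuned on a specific dataset. That dataset may or may not match your use case, so here too you may need to adapt, for example by writing your own script.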

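The library also provides several mechanisms for running inference on a task:

Pipelines: highly abstracted and very easy to use, sometimes taking as few as two lines of code.
Direct model use: less abstracted, but more flexible and powerful because you have direct access to the tokenizer.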

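Sequence Classification

Sequence classification is the task of classifying sequences according to a given number of classes. An example of a sequence classification dataset is GLUE. If you want to fine-tune a model on a GLUE sequence classification task, you can use a script such as run_glue.py.

The following program performs sentiment analysis as an example of sequence classification. It determines whether a given sentence is positive or negative, using a model fine-tuned on SST-2, one of the GLUE tasks.
It returns a label, "POSITIVE" or "NEGATIVE", together with a score.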

>>> from transformers import pipeline

>>> nlp = pipeline("sentiment-analysis")

>>> result = nlp("I hate you")[0]
>>> print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: NEGATIVE, with score: 0.9991

>>> result = nlp("I love you")[0]
>>> print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999

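Extractive Question Answering

Extractive question answering is the task of extracting an answer from a context, given a question. An example of a question answering dataset is SQuAD, which is entirely based on this task. If you want to fine-tune a model on a SQuAD task, you can use a script such as run_tf_squad.py.

The following sample answers questions using a model fine-tuned on SQuAD.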

>>> from transformers import pipeline

>>> nlp = pipeline("question-answering")

>>> context = r"""
... Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
... question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
... a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
... """

>>> result = nlp(question="What is extractive question answering?", context=context)
>>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'the task of extracting an answer from a text given a question.', score: 0.6226, start: 34, end: 96

>>> result = nlp(question="What is a good example of a question answering dataset?", context=context)
>>> print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
Answer: 'SQuAD dataset,', score: 0.5053, start: 147, end: 161
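The start and end values in the result are character offsets into the context, which is what makes the task "extractive": the answer is a slice of the input. A small sketch to check this, using a shortened copy of the context and a hand-written dictionary that mirrors the first result above (no model is run here):

```python
# Shortened copy of the context above; it starts with a newline, as the original does.
context = (
    "\nExtractive Question Answering is the task of extracting an answer "
    "from a text given a question. An example of a question answering "
    "dataset is the SQuAD dataset, which is entirely based on that task.\n"
)

# Hand-written stand-in for the first pipeline result shown above.
result = {
    "answer": "the task of extracting an answer from a text given a question.",
    "score": 0.6226,
    "start": 34,
    "end": 96,
}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span == result["answer"])
```

Because the offsets point back into the original text, they are handy for highlighting the answer in place.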

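Language Modeling

Language modeling is the task of fitting a model to a corpus. All popular transformer-based models are trained using a variant of language modeling: for example, BERT with masked language modeling and GPT-2 with causal language modeling.

Masked Language Modeling

Masked language modeling replaces some tokens in a sequence with a masking token and asks the model to fill each mask with an appropriate token. The model can attend to the context on both the right and the left of the mask.

Here is a sample that fills in a mask from its context.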

>>> from transformers import pipeline

>>> nlp = pipeline("fill-mask")
>>> from pprint import pprint
>>> pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))
[{'score': 0.1792745739221573,
  'sequence': '<s>HuggingFace is creating a tool that the community uses to '
              'solve NLP tasks.</s>',
  'token': 3944,
  'token_str': 'Ġtool'},
 {'score': 0.11349421739578247,
  'sequence': '<s>HuggingFace is creating a framework that the community uses '
              'to solve NLP tasks.</s>',
  'token': 7208,
  'token_str': 'Ġframework'},
 {'score': 0.05243554711341858,
  'sequence': '<s>HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.</s>',
  'token': 5560,
  'token_str': 'Ġlibrary'},
 {'score': 0.03493533283472061,
  'sequence': '<s>HuggingFace is creating a database that the community uses '
              'to solve NLP tasks.</s>',
  'token': 8503,
  'token_str': 'Ġdatabase'},
 {'score': 0.02860250137746334,
  'sequence': '<s>HuggingFace is creating a prototype that the community uses '
              'to solve NLP tasks.</s>',
  'token': 17715,
  'token_str': 'Ġprototype'}]
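The raw fill-mask output is fairly verbose. Here is a minimal sketch of tidying it up, using two entries copied from the output above. The helper name `candidate_words` and its `min_score` parameter are made up. The `Ġ` prefix is how byte-level BPE vocabularies (as used by GPT-2 and RoBERTa) mark a token that begins with a space, so it is stripped for display.

```python
# Two entries copied from the fill-mask output above (sequence field omitted).
results = [
    {"score": 0.1792745739221573, "token": 3944, "token_str": "Ġtool"},
    {"score": 0.11349421739578247, "token": 7208, "token_str": "Ġframework"},
]

def candidate_words(results, min_score=0.0):
    """Hypothetical helper: return (word, rounded score) pairs, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [
        (r["token_str"].replace("Ġ", " ").strip(), round(r["score"], 4))
        for r in ranked
        if r["score"] >= min_score
    ]

print(candidate_words(results))
```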

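Causal Language Modeling

Causal language modeling is the task of predicting the token that follows a given sequence of tokens. In this case, the model attends only to the left context. Usually, the next token is predicted by sampling from the logits of the last hidden state that the model produces from the input sequence.

The following sample uses the model and tokenizer together with the top_k_top_p_filtering() function to sample the token that follows an input sequence.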

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
>>> import torch
>>> from torch.nn import functional as F

>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("gpt2")

>>> sequence = f"Hugging Face is based in DUMBO, New York City, and "

>>> input_ids = tokenizer.encode(sequence, return_tensors="pt")

>>> # get logits of last hidden state
>>> next_token_logits = model(input_ids).logits[:, -1, :]

>>> # filter
>>> filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

>>> # sample
>>> probs = F.softmax(filtered_next_token_logits, dim=-1)
>>> next_token = torch.multinomial(probs, num_samples=1)

>>> generated = torch.cat([input_ids, next_token], dim=-1)

>>> resulting_string = tokenizer.decode(generated.tolist()[0])

>>> print(resulting_string)
Hugging Face is based in DUMBO, New York City, and has
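The filter-then-sample step is easier to follow on a toy example. The sketch below reimplements only the top-k part in plain Python; it is not the library's top_k_top_p_filtering(), and the vocabulary and logits are made up. In the sample above top_p is 1.0, so only the top-k cut has any effect there as well.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so softmax gives them 0.
    Ties with the k-th largest value are all kept."""
    if k <= 0:
        raise ValueError("k must be positive")
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def softmax(logits):
    """Turn logits into probabilities; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits over a five-token toy vocabulary.
vocab = ["has", "is", "the", "banana", "was"]
logits = [2.5, 2.0, 1.0, -3.0, 1.5]

filtered = top_k_filter(logits, k=3)
probs = softmax(filtered)

rng = random.Random(0)  # seeded so that runs are repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print(probs)
print(next_token)
```

Tokens outside the top k end up with probability exactly zero, so sampling can only ever return one of the k best candidates.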

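Text Generation

The goal of text generation is to produce a coherent piece of text that continues a given context. The following example shows how GPT-2 can be used in a pipeline to generate text. By default, all models apply Top-K sampling when used in pipelines, as set in their respective configurations.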

>>> from transformers import pipeline

>>> text_generator = pipeline("text-generation")
>>> print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))
[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

Text generation is currently possible with GPT-2, OpenAI-GPT, CTRL, XLNet, Transfo-XL, and Reformer in PyTorch, and with most models in TensorFlow as well.
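Passing do_sample=False, as in the example above, switches sampling off; with the default single-beam settings the most probable token is then chosen at every step (greedy decoding). Here is a toy sketch of that loop. The table `NEXT_TOKEN_PROBS` and the function `greedy_generate` are invented for illustration, and unlike a real model this toy looks only at the previous token rather than the whole prefix.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "I": {"will": 0.6, "am": 0.4},
    "will": {"be": 0.7, "go": 0.3},
    "be": {"there": 0.5, "fine": 0.3, "<eos>": 0.2},
    "there": {"<eos>": 0.9, "soon": 0.1},
}

def greedy_generate(prompt, max_length=10):
    """Repeatedly append the single most probable next token."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        choices = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not choices:
            break  # the toy table has no continuation for this token
        best = max(choices, key=choices.get)
        if best == "<eos>":
            break  # end-of-sequence token: stop generating
        tokens.append(best)
    return tokens

print(" ".join(greedy_generate(["I"])))
```

Greedy decoding is deterministic, which is why the same prompt yields the same text on every run.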

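Named Entity Recognition

Named entity recognition (NER) is the task of classifying tokens according to a class, for example identifying a token as a person, an organization, or a location. An example of a named entity recognition dataset is CoNLL-2003. If you want to fine-tune a model on an NER task, you can use a script such as run_pl_ner.py.

Here is a sample of named entity recognition. It tries to assign each token to one of nine classes: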

O, Outside of a named entity
B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
I-MISC, Miscellaneous entity
B-PER, Beginning of a person’s name right after another person’s name
I-PER, Person’s name
B-ORG, Beginning of an organisation right after another organisation
I-ORG, Organisation
B-LOC, Beginning of a location right after another location
I-LOC, Location

>>> from transformers import pipeline

>>> nlp = pipeline("ner")

>>> sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very "
...             "close to the Manhattan Bridge which is visible from the window.")
>>> print(nlp(sequence))
[
    {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
    {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
    {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
    {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
    {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
    {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
    {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
    {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
    {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
    {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
    {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
    {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]

"Hugging Face" is classified as an organization, and "New York City", "DUMBO", and "Manhattan Bridge" are correctly recognized as locations.
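In the output above, some words are split into WordPiece fragments such as 'Hu' and '##gging', where the '##' prefix marks a continuation of the previous piece. Here is a minimal sketch of gluing the pieces back together, using a few entries copied from that output. The helper `merge_wordpieces` is made up, and keeping the lowest score of the merged pieces is just one possible choice.

```python
# A few entries copied from the NER output above.
tokens = [
    {"word": "Hu", "score": 0.9995632767677307, "entity": "I-ORG"},
    {"word": "##gging", "score": 0.9915938973426819, "entity": "I-ORG"},
    {"word": "Face", "score": 0.9982671737670898, "entity": "I-ORG"},
    {"word": "D", "score": 0.9825621843338013, "entity": "I-LOC"},
    {"word": "##UM", "score": 0.936983048915863, "entity": "I-LOC"},
    {"word": "##BO", "score": 0.8987102508544922, "entity": "I-LOC"},
]

def merge_wordpieces(tokens):
    """Hypothetical helper: join '##' continuation pieces onto the previous word."""
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            merged[-1]["word"] += tok["word"][2:]
            # Assumption: keep the least confident piece's score for the whole word.
            merged[-1]["score"] = min(merged[-1]["score"], tok["score"])
        else:
            merged.append(dict(tok))  # copy so the input list is left untouched
    return merged

for item in merge_wordpieces(tokens):
    print(item["word"], item["entity"], round(item["score"], 4))
```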

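Summarization

Summarization is the task of condensing a document or article into a shorter text. An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles.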

>>> from transformers import pipeline

>>> summarizer = pipeline("summarization")

>>> ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
... A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
... Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
... In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
... Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
... 2010 marriage license application, according to court documents.
... Prosecutors said the marriages were part of an immigration scam.
... On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
... After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
... Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
... All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
... Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
... Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
... The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
... Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
... Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
... If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
... """
>>> print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))
[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]

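Because the summarization pipeline() depends on the PreTrainedModel.generate() method, you can override that method's default arguments directly in the pipeline call, as done above with max_length and min_length.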

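Translation

Translation is the task of translating text written in one language into another.

An example of a translation dataset is WMT English to German, which contains English sentences as input and the corresponding German sentences as targets.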

>>> from transformers import pipeline

>>> translator = pipeline("translation_en_to_de")
>>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

Because the translation pipeline() also depends on the PreTrainedModel.generate() method, you can override its default arguments directly in the pipeline call, as done above with max_length.

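In Closing 🤗

I hope this gives you a rough picture of what you can do with Transformers.
Personally, I am most interested in translation, summarization, and question answering.
So far I have only touched on the kinds of things it can do and the basics of using it, so I plan to dig deeper whenever I find the motivation!
Thank you for reading.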
