In this post I try out T5 (Text-to-Text Transfer Transformer), a recent multitask model from Google.
( About the T5 model )
Google AI Blog
See the __paper__ for details. It is a substantial work, 67 pages in total.
( How to declare the target task for the multitask T5 model )
tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
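The task is switched by changing the string stored in t5_prepared_Text in the line above.
The variable preprocess_text that appears below is a str object holding the English input text.
- For abstractive summarization: "summarize: "+preprocess_text
- For English-to-German machine translation: "translate English to German: "+preprocess_text
- For English-to-French machine translation: "translate English to French: "+preprocess_text
For example, for English-to-German machine translation: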
t5_prepared_Text = "translate English to German: "+preprocess_text
tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
According to the original paper, the following task-declaration prefixes are also provided.
- sst2 sentence
- stsb sentence1
- cb hypothesis
- copa choice1
- multirc question
- wic pos
- wsc
- question
- translate English to Romanian
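The prefix switching described above can be sketched as a small lookup helper. This is a minimal sketch, not code from the paper or from Ramsri's script: the names `TASK_PREFIXES` and `prepare_t5_input` and the short task keys are my own assumptions, and only the three prefixes actually used in this post are included.

```python
# Hypothetical lookup table: short task key -> T5 task prefix used in this post.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_de": "translate English to German: ",
    "en_to_fr": "translate English to French: ",
}


def prepare_t5_input(task, text):
    """Return `text` with the prefix for `task` prepended."""
    if task not in TASK_PREFIXES:
        raise ValueError("unknown task: " + repr(task))
    return TASK_PREFIXES[task] + text


print(prepare_t5_input("en_to_de", "The house is wonderful."))
# prints: translate English to German: The house is wonderful.
```

The returned string plays the role of t5_prepared_Text and can be passed to tokenizer.encode unchanged.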
Code that performs abstractive summarization of English text with the Google T5 model is published on __Ramsri's web page__.
This time, I run Ramsri's Python code.
This script also appears to be registered on Hugging Face.
- [Huggingface Model: remi/bertabs-finetuned-xsum-extractive-abstractive-summarization](https://huggingface.co/remi/bertabs-finetuned-xsum-extractive-abstractive-summarization)
( Execution environment )
- Machine : MacBookPro (CPU)
- OS : macOS Catalina
- Python : ver.3.6.3
- transformers : ver.2.8.0
- torch : ver.1.4.0
Terminal
Desktop % mkdir t5_asbtract_summarization
Desktop % cd t5_asbtract_summarization
ocean@AfoGuardMacBook-Pro t5_asbtract_summarization % pyenv local TensorFlow
ocean@AfoGuardMacBook-Pro t5_asbtract_summarization % pyenv versions
system
3.6.0
3.6.0/envs/TensorFlow
3.6.1
3.6.3
3.6.3/envs/gpt2_ja
3.9.0
* TensorFlow (set by /Users/ocean/Desktop/t5_asbtract_summarization/.python-version)
gpt2_ja
ocean@AfoGuardMacBook-Pro t5_asbtract_summarization %
Install transformers and torch with pinned versions
Terminal
ocean@AfoGuardMacBook-Pro t5_asbtract_summarization % pip install transformers==2.8.0
pip3 install torch torchvision install transformers==2.8.0
Collecting torch
Cache entry deserialization failed, entry ignored
Downloading https://files.pythonhosted.org/packages/b6/01/fffb29c3892d80801bc6400e07c90b8fa6cd5f3db5ce9d7ca8068e14e0b2/torch-1.7.1-cp36-none-macosx_10_9_x86_64.whl (108.8MB)
100% |████████████████████████████████| 108.8MB 14kB/s
( ... omitted ... )
Installing collected packages: typing-extensions, dataclasses, numpy, torch, pillow, torchvision, install, idna, certifi, urllib3, chardet, requests, tokenizers, filelock, sentencepiece, tqdm, regex, six, click, joblib, sacremoses, jmespath, python-dateutil, botocore, s3transfer, boto3, transformers
Running setup.py install for sacremoses ... done
Successfully installed boto3-1.16.35 botocore-1.19.35 certifi-2020.12.5 chardet-3.0.4 click-7.1.2 dataclasses-0.8 filelock-3.0.12 idna-2.10 install-1.3.4 jmespath-0.10.0 joblib-0.17.0 numpy-1.19.4 pillow-8.0.1 python-dateutil-2.8.1 regex-2020.11.13 requests-2.25.0 s3transfer-0.3.3 sacremoses-0.0.43 sentencepiece-0.1.94 six-1.15.0 tokenizers-0.5.2 torch-1.7.1 torchvision-0.8.2 tqdm-4.54.1 transformers-2.8.0 typing-extensions-3.7.4.3 urllib3-1.26.2
You are using pip version 9.0.1, however version 20.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
ocean@AfoGuardMacBook-Pro t5_asbtract_summarization %
Terminal
ocean@AfoGuardMacBook-Pro t5_asbtract_summarization % pip install torch==1.4.0
( ... omitted ... )
ocean@AfoGuardMacBook-Pro t5_asbtract_summarization %
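Because the first pip run above pulled in torch 1.7.1 before torch 1.4.0 was pinned, it can be worth confirming which versions are actually importable. A minimal sketch, assuming the packages expose a `__version__` attribute (torch and transformers both do); `installed_version` is a hypothetical helper name.

```python
import importlib


def installed_version(package_name):
    """Return the package's __version__ string, or None if unavailable."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        # The package is not installed in the active environment.
        return None
    return getattr(module, "__version__", None)


# Example usage (run inside the pyenv environment set up above):
# for name in ("torch", "transformers"):
#     print(name, installed_version(name))
```

In the environment described in this post, the check is expected to report 1.4.0 for torch and 2.8.0 for transformers once both pins have been applied.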
Start the Python 3.6.3 interactive interpreter
Terminal
ocean@AfoGuardMacBook-Pro t5_asbtract_summarization % python
Python 3.6.3 (default, Dec 10 2020, 22:43:16)
[GCC Apple LLVM 12.0.0 (clang-1200.0.32.27)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Task 1: Abstractive summarization of English text
( Text given as input )
Input text
The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors."We'll be the comeback kids, all of us," he said. "We want to get our country back."The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than that.
( Text output by the T5 model )
Output text
the us has over 637,000 confirmed Covid-19 cases and over 30,826 deaths. president Donald Trump predicts some states will reopen the country in april, he said. "we'll be the comeback kids, all of us," the president says.
( Code executed )
Python3.6.3
>>> import torch
>>> import json
>>> from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
>>>
>>> model = T5ForConditionalGeneration.from_pretrained('t5-small')
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.20k/1.20k [00:00<00:00, 354kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 242M/242M [00:24<00:00, 10.1MB/s]
>>>
>>> tokenizer = T5Tokenizer.from_pretrained('t5-small')
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 792k/792k [00:01<00:00, 730kB/s]
>>>
>>> device = torch.device('cpu')
>>>
>>> text = """The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.
...
... The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.
...
... At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors.
...
... "We'll be the comeback kids, all of us," he said. "We want to get our country back."
...
... The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than that."""
>>>
>>> preprocess_text = text.strip().replace("\n","")
>>> print(preprocess_text)
The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors."We'll be the comeback kids, all of us," he said. "We want to get our country back."The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than that.
>>>
>>> t5_prepared_Text = "summarize: "+preprocess_text
>>> print(("original text preprocessed: \n", preprocess_text))
('original text preprocessed: \n', 'The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors."We\'ll be the comeback kids, all of us," he said. "We want to get our country back."The Trump administration has previously fixed May 1 as a possible date to reopen the world\'s largest economy, but the president said some states may be able to return to normalcy earlier than that.')
>>>
>>> tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
>>> type(tokenized_text)
<class 'torch.Tensor'>
>>>
>>> summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
>>>
>>> print(type(summary_ids))
<class 'torch.Tensor'>
>>>
>>> output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
>>> print(type(output))
<class 'str'>
>>>
>>> print ("\n\nSummarized text: \n",output)
Summarized text:
the us has over 637,000 confirmed Covid-19 cases and over 30,826 deaths. president Donald Trump predicts some states will reopen the country in april, he said. "we'll be the comeback kids, all of us," the president says.
>>>
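Note that `text.strip().replace("\n","")` deletes the newlines without inserting a space, which is why the preprocessed text above contains run-together sentences such as `month.The US`. Below is a minimal sketch of an alternative that collapses every run of whitespace into a single space. `normalize_whitespace` is a hypothetical name, and I have not rerun the model with it, so the summary may differ from the one shown above.

```python
def normalize_whitespace(text):
    """Collapse all whitespace runs (including newlines) into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empties.
    return " ".join(text.split())


sample = """First sentence.

Second sentence."""

# Original preprocessing from the code above: sentences run together.
print(sample.strip().replace("\n", ""))  # prints: First sentence.Second sentence.

# Whitespace-normalizing variant: sentence boundaries keep one space.
print(normalize_whitespace(sample))      # prints: First sentence. Second sentence.
```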
Task 2: Machine translation from English to German
( Text given as input )
Input text
The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors."We'll be the comeback kids, all of us," he said. "We want to get our country back."The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than that.
( Text output by the T5 model )
Output text
Die USA haben den Höchststand auf neuen Koronavirus-Fällen "passiert", sagte Präsident Donald Trump und prognostizierte, dass einige Staaten in diesem Monat wieder eröffnen würden.Die USA verfügen über mehr als 637.000 bestätigte Covid-19-Fälle und über 30.826 Todesfälle, die höchste für jedes Land der Welt.
( Code executed )
Python3.6.3
>>> t5_prepared_Text = "translate English to German: "+preprocess_text
>>> print ("original text preprocessed: \n", preprocess_text)
original text preprocessed:
The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors."We'll be the comeback kids, all of us," he said. "We want to get our country back."The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than tha
>>>
>>> tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
>>> summary_ids = model.generate(tokenized_text,
... num_beams=4,
... no_repeat_ngram_size=2,
... min_length=30,
... max_length=100,
... early_stopping=True)
>>>
>>> output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
>>> print ("\n\nTranslated text: \n",output)
Translated text:
Die USA haben den Höchststand auf neuen Koronavirus-Fällen "passiert", sagte Präsident Donald Trump und prognostizierte, dass einige Staaten in diesem Monat wieder eröffnen würden.Die USA verfügen über mehr als 637.000 bestätigte Covid-19-Fälle und über 30.826 Todesfälle, die höchste für jedes Land der Welt.
>>>
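The German output above stops after the second sentence of the input, which is consistent with generation being capped at `max_length=100` tokens. One possible workaround, sketched below under that assumption, is to translate the original `text` one paragraph at a time. `split_paragraphs` and `prepare_per_paragraph` are hypothetical helper names, and I have not verified the translation quality of this approach.

```python
def split_paragraphs(text):
    """Split on line breaks and drop blank lines and surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def prepare_per_paragraph(prefix, text):
    """Build one prefixed T5 input string per non-empty paragraph."""
    return [prefix + paragraph for paragraph in split_paragraphs(text)]


sample = """First paragraph.

Second paragraph."""

for prompt in prepare_per_paragraph("translate English to German: ", sample):
    print(prompt)
# prints:
# translate English to German: First paragraph.
# translate English to German: Second paragraph.
```

Each returned string can then be passed through tokenizer.encode and model.generate as in the code above. For short paragraphs, `min_length=30` may need to be lowered, since it forces at least 30 generated tokens.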
Task 3: Machine translation from English to French
( Text given as input )
Input text
The US has "passed the peak" on new coronavirus cases, President Donald Trump said and predicted that some states would reopen this month.The US has over 637,000 confirmed Covid-19 cases and over 30,826 deaths, the highest for any country in the world.At the daily White House coronavirus briefing on Wednesday, Trump said new guidelines to reopen the country would be announced on Thursday after he speaks to governors."We'll be the comeback kids, all of us," he said. "We want to get our country back."The Trump administration has previously fixed May 1 as a possible date to reopen the world's largest economy, but the president said some states may be able to return to normalcy earlier than that.
( Text output by the T5 model )
Output text
Les Etats-Unis ont «passé le sommet» sur les nouveaux cas de coronavirus, a déclaré le président Donald Trump et prévoyait que certains États rouvriraient ce mois-ci. Les ÉtatsUnis affichent plus de 637 000 cas confirmés de Covid-19 ainsi que 30 826 décès, le plus élevé pour tout pays du monde. Lors de l'exposé quotidien
( Code executed )
Python3.6.3
>>> t5_prepared_Text = "translate English to French: "+preprocess_text
>>> tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
>>> summary_ids = model.generate(tokenized_text,
... num_beams=4,
... no_repeat_ngram_size=2,
... min_length=30,
... max_length=100,
... early_stopping=True)
>>>
>>> output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
>>> print ("\n\nTranslated text: \n",output)
Translated text:
Les Etats-Unis ont «passé le sommet» sur les nouveaux cas de coronavirus, a déclaré le président Donald Trump et prévoyait que certains États rouvriraient ce mois-ci. Les ÉtatsUnis affichent plus de 637 000 cas confirmés de Covid-19 ainsi que 30 826 décès, le plus élevé pour tout pays du monde. Lors de l'exposé quotidien
>>>
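The three runs above differ only in the prefix string, so the shared steps can be wrapped in one function. This is a minimal sketch: `run_t5` is a hypothetical helper name, the generation parameters are copied from the code above, and `model` and `tokenizer` are assumed to be the `t5-small` objects loaded earlier in this post.

```python
def run_t5(model, tokenizer, prefix, text, device="cpu", max_length=100):
    """Run one T5 task by prepending `prefix` to `text` and decoding the result."""
    prompt = prefix + text
    # Encode the prefixed prompt as a PyTorch tensor of token IDs.
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    # Beam search with the same settings used for all three tasks above.
    output_ids = model.generate(
        input_ids,
        num_beams=4,
        no_repeat_ngram_size=2,
        min_length=30,
        max_length=max_length,
        early_stopping=True,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage, assuming model, tokenizer and preprocess_text from above:
# print(run_t5(model, tokenizer, "summarize: ", preprocess_text))
# print(run_t5(model, tokenizer, "translate English to French: ", preprocess_text))
```

Passing the model and tokenizer in as arguments keeps the helper independent of how they were loaded, so the same function works unchanged if a larger T5 checkpoint is substituted.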