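In the __previous article__, I confirmed that the script runs correctly. In this article, I have __wrapped it into functions__.
Here we run T5 (Text-to-Text Transfer Transformer), the multitask model released by Google.
#### ( About the T5 model )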
Google AI Blog
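The __paper__ is here. It is a substantial work, 67 pages in total.
#### ( How to declare the target task for the multitask T5 model )
The target task is selected by changing the string stored in `t5_prepared_Text` in the following line.
```Python:Python3.9.0
tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
```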
In the examples below, `preprocess_text` is a `str` variable that holds the English input text.
- Abstractive summarization: `"summarize: " + preprocess_text`
- English-to-German machine translation: `"translate English to German: " + preprocess_text`
- English-to-French machine translation: `"translate English to French: " + preprocess_text`
For example, for English-to-German machine translation:
```Python:Python3.9.0
t5_prepared_Text = "translate English to German: " + preprocess_text
tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
```
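The prefix switching described above can be sketched as a small lookup table. This is only an illustration: the names `TASK_PREFIXES` and `build_t5_input` and the task keys are hypothetical and do not appear in the original script. The prefix strings themselves are the three listed above.

```python
# Task prefixes listed in this article; the dictionary keys are made up for this sketch.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_de": "translate English to German: ",
    "en_to_fr": "translate English to French: ",
}


def build_t5_input(task: str, preprocess_text: str) -> str:
    """Return the prefixed string that would be passed to tokenizer.encode()."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PREFIXES[task] + preprocess_text


print(build_t5_input("en_to_de", "Hello world."))
# prints: translate English to German: Hello world.
```

The returned string plays the role of `t5_prepared_Text` in the line shown earlier.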
Reading the original paper shows that, in addition to these, the following task-declaration strings are available.
- sst2 sentence
- stsb sentence1
- cb hypothesis
- copa choice1
- multirc question
- wic pos
- wsc
- question
- translate English to Romanian
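Several of these prefixes introduce inputs that have more than one field. The sketch below assumes a `name: value` layout, which is how I read the examples in the paper. The helper `build_structured_input`, the field names `sentence2` and `context`, and the sample sentences are assumptions for illustration and should be checked against the paper before use.

```python
from typing import List, Tuple


def build_structured_input(task: str, fields: List[Tuple[str, str]]) -> str:
    """Join a task name and (field, value) pairs into a single input string."""
    body = " ".join(f"{name}: {value}" for name, value in fields)
    # An empty task name (as for "question") leaves only the fields.
    return f"{task} {body}".strip()


stsb_input = build_structured_input(
    "stsb",
    [("sentence1", "A man is playing a guitar."),
     ("sentence2", "A person plays music.")],
)
qa_input = build_structured_input(
    "",
    [("question", "Who chaired the stock-take?"),
     ("context", "Boris Johnson chaired a stock-take on Friday.")],
)
print(stsb_input)
print(qa_input)
```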
The relevant passages of the original paper are listed below.
( summarize )
( translate English to German )
( wsc )
( question )
( translate English to Romanian )
( multirc question )
( copa choice1 )
( cb hypothesis )
( sst2 sentence )
### Function for abstractive summarization of English text
```Python:Python3.9.0
def abstract_sum(input_text: str) -> str:
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    # Load the pretrained t5-small model and its tokenizer
    model = T5ForConditionalGeneration.from_pretrained('t5-small')
    tokenizer = T5Tokenizer.from_pretrained('t5-small')
    device = torch.device('cpu')
    text = input_text
    # Strip surrounding whitespace and remove newlines
    # (lines are joined without a separating space)
    preprocess_text = text.strip().replace("\n", "")
    # The "summarize: " prefix declares the task
    t5_prepared_Text = "summarize: " + preprocess_text
    print("original text preprocessed: \n", preprocess_text)
    tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
    # summarize
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print("\n\nSummarized text: \n", output)
    return output
```
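The preprocessing step removes newlines outright, so sentences that sat on separate lines are joined without a space (the "original text preprocessed" output in the execution results shows `sides.The leaders`). A minimal alternative is sketched below, assuming that collapsing whitespace into single spaces is acceptable. `normalize_whitespace` is a hypothetical helper, and I have not checked whether it changes the model's output.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


sample = "First sentence.\nSecond sentence.\n\nThird sentence."

# Behaviour of the preprocessing used in this article:
print(sample.strip().replace("\n", ""))
# prints: First sentence.Second sentence.Third sentence.

# Behaviour of the alternative:
print(normalize_whitespace(sample))
# prints: First sentence. Second sentence. Third sentence.
```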
### Function for translation (English -> German)
```Python:Python3.9.0
def translate_eng_to_german(input_text: str) -> str:
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    # Load the pretrained t5-small model and its tokenizer
    model = T5ForConditionalGeneration.from_pretrained('t5-small')
    tokenizer = T5Tokenizer.from_pretrained('t5-small')
    device = torch.device('cpu')
    text = input_text
    # Strip surrounding whitespace and remove newlines
    # (lines are joined without a separating space)
    preprocess_text = text.strip().replace("\n", "")
    # The "translate English to German: " prefix declares the task
    t5_prepared_Text = "translate English to German: " + preprocess_text
    print("original text preprocessed: \n", preprocess_text)
    tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
    # translate
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print("\n\nTranslated German text: \n", output)
    return output
```
### Function for translation (English -> French)
```Python:Python3.9.0
def translate_eng_to_french(input_text: str) -> str:
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    # Load the pretrained t5-small model and its tokenizer
    model = T5ForConditionalGeneration.from_pretrained('t5-small')
    tokenizer = T5Tokenizer.from_pretrained('t5-small')
    device = torch.device('cpu')
    text = input_text
    # Strip surrounding whitespace and remove newlines
    # (lines are joined without a separating space)
    preprocess_text = text.strip().replace("\n", "")
    # The "translate English to French: " prefix declares the task
    t5_prepared_Text = "translate English to French: " + preprocess_text
    print("original text preprocessed: \n", preprocess_text)
    tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
    # translate
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print("\n\nTranslated French text: \n", output)
    return output
```
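The three functions differ only in their prefix string, and each call reloads the model. One way to share the common part is sketched below. It is not the author's code: `generation_options`, `load_t5` and `run_t5` are hypothetical names, and the model-dependent part is left uncalled here because it needs `transformers` and a model download. The library calls are the same ones used in the functions above, except that `.to(device)` is omitted because everything stays on the CPU by default.

```python
from functools import lru_cache

# Generation settings copied from the three functions in this article.
DEFAULT_GENERATE_OPTIONS = {
    "num_beams": 4,
    "no_repeat_ngram_size": 2,
    "min_length": 30,
    "max_length": 100,
    "early_stopping": True,
}


def generation_options(**overrides) -> dict:
    """Return a copy of the default generate() options with overrides applied."""
    options = dict(DEFAULT_GENERATE_OPTIONS)
    options.update(overrides)
    return options


@lru_cache(maxsize=1)
def load_t5(model_name: str = "t5-small"):
    """Load model and tokenizer; repeated calls with the same name reuse them."""
    # Imported lazily, as in the original functions.
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    return model, tokenizer


def run_t5(prefix: str, input_text: str, **overrides) -> str:
    """Run the T5 task selected by `prefix` (for example "summarize: ")."""
    model, tokenizer = load_t5()
    preprocess_text = input_text.strip().replace("\n", "")
    input_ids = tokenizer.encode(prefix + preprocess_text, return_tensors="pt")
    output_ids = model.generate(input_ids, **generation_options(**overrides))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(generation_options(max_length=200))
```

With this in place, a call such as `run_t5("translate English to German: ", given_text)` would stand in for `translate_eng_to_german(given_text)`, minus the progress printing.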
### Script used to verify the behavior
```Python:Python3.9.0
given_text = """Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.
The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.
On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.
Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.
The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.
Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.
The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues."""
```
```
summarized_text = abstract_sum(given_text)
print(summarized_text)
translated_german_text = translate_eng_to_german(given_text)
print(translated_german_text)
translated_french_text = translate_eng_to_french(given_text)
print(translated_french_text)
```
### Execution results
#### ( Environment setup )
```bash:Terminal
Desktop % mkdir t5_abstract_summarization
Desktop % cd t5_abstract_summarization
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% virtualenv t5_abst_summary
```
```bash:Terminal
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% ls
t5_abst_summary
ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% ls t5_abst_summary
bin lib pyvenv.cfg
ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% source t5_abst_summary/bin/activate
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip freeze
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
```
```bash:Terminal
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip install transformers==2.8.0
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip install torch==1.4.0
ERROR: Could not find a version that satisfies the requirement torch==1.4.0
ERROR: No matching distribution found for torch==1.4.0
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip3 install torch==1.4.0
ERROR: Could not find a version that satisfies the requirement torch==1.4.0
ERROR: No matching distribution found for torch==1.4.0
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip3 install torch
Successfully installed torch-1.7.1 typing-extensions-3.7.4.3
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
```
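In the log above, pip could not find `torch==1.4.0`, presumably because no build of that release is published for this Python 3.9 environment, and the unpinned install picked torch 1.7.1 instead. To confirm which versions actually ended up in the virtualenv, something like the following could be run inside it. `installed_version` is a hypothetical helper. It relies on `importlib.metadata`, which is in the standard library from Python 3.8.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages this article depends on.
for name in ("transformers", "torch"):
    print(name, installed_version(name))
```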
#### ( Execution results )
```Python:Python3.9.0
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization % python
Python 3.9.0 (default, Dec 3 2020, 16:09:02)
[Clang 12.0.0 (clang-1200.0.32.27)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> given_text = """Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.
...
... The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.
...
... On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.
...
... Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.
...
... The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.
...
... Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.
...
... The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues."""
>>>
>>> def abstract_sum(input_text:str) -> str:
...     import torch, json
...     from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
...     model = T5ForConditionalGeneration.from_pretrained('t5-small')
...     tokenizer = T5Tokenizer.from_pretrained('t5-small')
...     device = torch.device('cpu')
...     text = input_text
...     preprocess_text = text.strip().replace("\n","")
...     t5_prepared_Text = "summarize: "+preprocess_text
...     print ("original text preprocessed: \n", preprocess_text)
...     tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
...     # summmarize
...     summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
...     output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
...     print ("\n\nSummarized text: \n",output)
...     return output
...
>>> def translate_eng_to_german(input_text:str) -> str:
...     import torch, json
...     from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
...     model = T5ForConditionalGeneration.from_pretrained('t5-small')
...     tokenizer = T5Tokenizer.from_pretrained('t5-small')
...     device = torch.device('cpu')
...     text = input_text
...     preprocess_text = text.strip().replace("\n","")
...     t5_prepared_Text = "translate English to German: "+preprocess_text
...     print ("original text preprocessed: \n", preprocess_text)
...     tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
...     # summmarize
...     summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
...     output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
...     print ("\n\nTranslated German text: \n",output)
...     return output
...
>>> def translate_eng_to_french(input_text:str) -> str:
...     import torch, json
...     from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
...     model = T5ForConditionalGeneration.from_pretrained('t5-small')
...     tokenizer = T5Tokenizer.from_pretrained('t5-small')
...     device = torch.device('cpu')
...     text = input_text
...     preprocess_text = text.strip().replace("\n","")
...     t5_prepared_Text = "translate English to French: "+preprocess_text
...     print ("original text preprocessed: \n", preprocess_text)
...     tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
...     # summmarize
...     summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
...     output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
...     print ("\n\nTranslated French text: \n",output)
...     return output
...
>>> # Abstractive summarization of English text
>>> summarized_text = abstract_sum(given_text)
original text preprocessed:
Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues.
Summarized text:
the leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday. the two sides met in Brussels on friday after months of talks failed to achieve an agreement.
>>>
>>> print(summarized_text)
the leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday. the two sides met in Brussels on friday after months of talks failed to achieve an agreement.
>>>
>>>
>>> # English -> German
>>>
>>> translated_german_text = translate_eng_to_german(given_text)
original text preprocessed:
Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues.
Translated German text:
Am Freitag hat Boris Johnson den Vorsitz des britischen Ministeriums für Verteidigung (MoD) inne, dass vier Patrouillenboote der Royal Navy bereit sind, britische Fischereigewässer zu schützen. Die Sonntagsfrist wurde von Herrn Johnson und der Präsidentin der EU Ursula von der Leyen festgelegt, nachdem die beiden Parteien am Mittwoch in Brüssel zusammentraf
>>>
>>>
>>> print(translated_german_text)
Am Freitag hat Boris Johnson den Vorsitz des britischen Ministeriums für Verteidigung (MoD) inne, dass vier Patrouillenboote der Royal Navy bereit sind, britische Fischereigewässer zu schützen. Die Sonntagsfrist wurde von Herrn Johnson und der Präsidentin der EU Ursula von der Leyen festgelegt, nachdem die beiden Parteien am Mittwoch in Brüssel zusammentraf
>>>
>>> # English -> French
>>> translated_french_text = translate_eng_to_french(given_text)
original text preprocessed:
Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues.
Translated French text:
Les pourparlers commerciaux entre le Royaume-Uni et l'Union européenne se poursuivent à Bruxelles avec un jour d'attente jusqu'à un délai imposé par les deux parties. Les dirigeants des deux partis ont mis en garde qu'ils ne seront probablement pas parvenus au règlement commercial post-Brexit le dimanche. Vendredi, Boris Johnson a présidé
>>>
>>> print(translated_french_text)
Les pourparlers commerciaux entre le Royaume-Uni et l'Union européenne se poursuivent à Bruxelles avec un jour d'attente jusqu'à un délai imposé par les deux parties. Les dirigeants des deux partis ont mis en garde qu'ils ne seront probablement pas parvenus au règlement commercial post-Brexit le dimanche. Vendredi, Boris Johnson a présidé
>>>
```
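Both translations above stop mid-sentence, which is consistent with the `max_length=100` cap passed to `generate()`. One possible workaround is to translate sentence by sentence, so that each call stays well under the cap. The sketch below is untested against the model. `split_sentences` and `translate_by_sentence` are hypothetical helpers, the regular expression is a naive splitter, and `translate_fn` stands for any callable that translates one string. A wrapper around the translation functions in this article would probably also need a lower `min_length` for short sentences.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def translate_by_sentence(text: str, translate_fn: Callable[[str], str]) -> str:
    """Translate each sentence separately and join the results with spaces."""
    return " ".join(translate_fn(sentence) for sentence in split_sentences(text))


# str.upper stands in for a real translation function in this demonstration.
demo = "Talks are continuing.\nA deal is unlikely by Sunday."
print(split_sentences(demo))
print(translate_by_sentence(demo, str.upper))
```

The split has to be applied to the original text, before the newline removal in the preprocessing step, since that step joins sentences without any whitespace.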
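## What I want to do next
I would like to feed Japanese text into T5.
Abstractive summarization of Japanese text with T5 is covered in the following article.
- [@yuko1658, "Multilingual T5で日本語の文章要約" (Japanese text summarization with Multilingual T5)](https://qiita.com/yuko1658/items/02a2321b20dd870d6afe)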