
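Wrapping abstractive summarization of English text, English-to-German translation, and English-to-French translation with Google's multitask model T5 into functions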

Posted at 2020-12-12

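In the __previous article__ I confirmed that the script works. In this article I have __wrapped it into functions__.

We will run T5 (Text-to-Text Transfer Transformer), the latest multitask model released by Google.

( About the T5 model )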

Google AI Blog

The __paper__ is here. It is a substantial work of 67 pages in total.

( How to declare the target task for the multitask T5 model )

tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)

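The task is switched by changing the string stored in t5_prepared_Text.

Note that the variable preprocess_text that appears below is a str object holding the English text.

  1. Abstractive summarization: "summarize: "+preprocess_text
  2. Machine translation from English to German: "translate English to German: "+preprocess_text
  3. Machine translation from English to French: "translate English to French: "+preprocess_text

Machine translation from English to German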
t5_prepared_Text = "translate English to German: "+preprocess_text
tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
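
The prefix switching described above can be sketched as a small helper. This is only an illustration: the `TASK_PREFIXES` table, its keys, and the `build_t5_input` function are hypothetical names that do not appear in the original script. Only the three prefix strings come from the article.

```python
# Hypothetical helper (not part of the original script):
# maps a task name to the T5 prefix strings used in this article.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_de": "translate English to German: ",
    "en_to_fr": "translate English to French: ",
}


def build_t5_input(task: str, preprocess_text: str) -> str:
    """Return the string to store in t5_prepared_Text for the given task."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PREFIXES[task] + preprocess_text


if __name__ == "__main__":
    # The result would then be passed to tokenizer.encode(...) as shown above.
    print(build_t5_input("en_to_de", "The house is wonderful."))
```

Keeping the prefixes in one table makes it harder to mistype a prefix, since a misspelled prefix would not raise an error but would simply not select the intended task.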

Reading the original paper, we can see that the following task-declaration prefixes are also provided.

  • sst2 sentence
  • stsb sentence1
  • cb hypothesis
  • copa choice1
  • multirc question
  • wic pos
  • wsc
  • question
  • translate English to Romanian
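
As a rough sketch of how inputs for these other prefixes are laid out: the examples in the paper appear to follow the pattern `<task> <field1>: <text> <field2>: <text> ...`. The helper below only assembles such a string. The function name `format_task_input` and the example sentences are my own assumptions, not taken from the paper or the original script, and whether a given checkpoint such as t5-small responds well to each prefix would need to be checked separately.

```python
def format_task_input(task: str, **fields: str) -> str:
    """Join a task name and its named text fields into one T5 input string.

    Keyword arguments keep their order, so the caller controls the field order.
    """
    parts = [f"{name}: {value}" for name, value in fields.items()]
    if task:
        parts.insert(0, task)
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical example sentences, for illustration only.
    print(format_task_input(
        "stsb",
        sentence1="A man is playing a guitar.",
        sentence2="A person plays music.",
    ))
    # The "question" prefix has no leading task word, so task is left empty.
    print(format_task_input("", question="Who chaired the meeting?",
                            context="Boris Johnson chaired the meeting."))
```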

The relevant passages of the original paper are shown below.

( summarize )

[Image: Screenshot 2020-12-13 19.02.35.png]

( translate English to German )

[Image: Screenshot 2020-12-13 19.02.57.png]

( wsc )

[Image: Screenshot 2020-12-13 19.03.41.png]

( question )

[Image: Screenshot 2020-12-13 19.03.53.png]

( translate English to Romanian )

[Image: Screenshot 2020-12-13 19.04.07.png]

( multirc question )

[Image: Screenshot 2020-12-13 19.04.42.png]

( copa choice1 )

[Image: Screenshot 2020-12-13 19.05.17.png]

( cb hypothesis )

[Image: Screenshot 2020-12-13 19.05.37.png]

( sst2 sentence )

[Image: Screenshot 2020-12-13 19.05.46.png]


Function for abstractive summarization of English text

Python3.9.0
def abstract_sum(input_text: str) -> str:
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    # Load the pretrained t5-small model and its tokenizer
    model = T5ForConditionalGeneration.from_pretrained('t5-small')
    tokenizer = T5Tokenizer.from_pretrained('t5-small')
    device = torch.device('cpu')
    # Strip surrounding whitespace and remove newlines
    # (no space is inserted, so paragraphs are joined directly)
    preprocess_text = input_text.strip().replace("\n", "")
    # The "summarize: " prefix declares the task to T5
    t5_prepared_Text = "summarize: " + preprocess_text
    print("original text preprocessed: \n", preprocess_text)
    tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
    # Summarize with beam search; the output length is limited to 30-100 tokens
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print("\n\nSummarized text: \n", output)
    return output

Function for translation (English -> German)

Python3.9.0
def translate_eng_to_german(input_text: str) -> str:
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    # Load the pretrained t5-small model and its tokenizer
    model = T5ForConditionalGeneration.from_pretrained('t5-small')
    tokenizer = T5Tokenizer.from_pretrained('t5-small')
    device = torch.device('cpu')
    # Strip surrounding whitespace and remove newlines
    # (no space is inserted, so paragraphs are joined directly)
    preprocess_text = input_text.strip().replace("\n", "")
    # The "translate English to German: " prefix declares the task to T5
    t5_prepared_Text = "translate English to German: " + preprocess_text
    print("original text preprocessed: \n", preprocess_text)
    tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
    # Translate with beam search; the output length is limited to 30-100 tokens,
    # so a translation of a long input is cut off at max_length
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print("\n\nTranslated German text: \n", output)
    return output

Function for translation (English -> French)

Python3.9.0
def translate_eng_to_french(input_text: str) -> str:
    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    # Load the pretrained t5-small model and its tokenizer
    model = T5ForConditionalGeneration.from_pretrained('t5-small')
    tokenizer = T5Tokenizer.from_pretrained('t5-small')
    device = torch.device('cpu')
    # Strip surrounding whitespace and remove newlines
    # (no space is inserted, so paragraphs are joined directly)
    preprocess_text = input_text.strip().replace("\n", "")
    # The "translate English to French: " prefix declares the task to T5
    t5_prepared_Text = "translate English to French: " + preprocess_text
    print("original text preprocessed: \n", preprocess_text)
    tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
    # Translate with beam search; the output length is limited to 30-100 tokens,
    # so a translation of a long input is cut off at max_length
    summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    print("\n\nTranslated French text: \n", output)
    return output
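
The three functions above differ only in the prefix and each one reloads the model on every call. As a minimal sketch of one way to share that work (not the article's implementation), the model and tokenizer could be loaded once and passed in. The names `load_t5` and `run_t5` are hypothetical, and unlike the functions above this sketch collapses newlines into single spaces instead of deleting them.

```python
from typing import Any, Tuple


def load_t5(model_name: str = "t5-small") -> Tuple[Any, Any]:
    """Load the model and tokenizer once (import deferred, as in the functions above)."""
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    return model, tokenizer


def run_t5(prefix: str, input_text: str, model: Any, tokenizer: Any) -> str:
    """Run one T5 task, selected by its prefix, on input_text."""
    # Assumption: collapse all whitespace runs (including newlines) to one space.
    preprocess_text = " ".join(input_text.split())
    tokenized_text = tokenizer.encode(prefix + preprocess_text, return_tensors="pt")
    # Same generation settings as the functions in this article.
    output_ids = model.generate(
        tokenized_text,
        num_beams=4,
        no_repeat_ngram_size=2,
        min_length=30,
        max_length=100,
        early_stopping=True,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    model, tokenizer = load_t5()
    sample = "Trade talks between the UK and European Union are continuing in Brussels."
    print(run_t5("summarize: ", sample, model, tokenizer))
    print(run_t5("translate English to German: ", sample, model, tokenizer))
```

Passing the model and tokenizer as arguments also makes the function easy to exercise with stand-in objects, without downloading any weights.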

Script used to verify the behavior

```Python:Python3.9.0
given_text = """Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.

The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.

On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.

Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.

The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.

Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.

The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues."""
```

```
summarized_text = abstract_sum(given_text)
print(summarized_text)

translated_german_text = translate_eng_to_german(given_text)
print(translated_german_text)

translated_french_text = translate_eng_to_french(given_text)
print(translated_french_text)
```


### Results

#### ( Environment setup )

```bash:Terminal
Desktop % mkdir t5_abstract_summarization
Desktop % cd t5_abstract_summarization 
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% virtualenv t5_abst_summary
```

```bash:Terminal
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% ls
t5_abst_summary
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% 
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% ls t5_abst_summary 
bin		lib		pyvenv.cfg
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% 
ocean@AfoGuardMacBook-Pro t5_abstract_summarization% source t5_abst_summary/bin/activate
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% 
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip freeze
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
```

```bash:Terminal
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip install transformers==2.8.0
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% 
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip install torch==1.4.0       
ERROR: Could not find a version that satisfies the requirement torch==1.4.0
ERROR: No matching distribution found for torch==1.4.0
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% 
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip3 install torch==1.4.0                                                                          
ERROR: Could not find a version that satisfies the requirement torch==1.4.0
ERROR: No matching distribution found for torch==1.4.0
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization% pip3 install torch       
Successfully installed torch-1.7.1 typing-extensions-3.7.4.3
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization%
```
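
In the log above, pip could not find torch==1.4.0 and the unpinned install resolved to torch 1.7.1, presumably because no torch 1.4.0 build was published for Python 3.9. A small check like the following can confirm which versions actually ended up in the virtualenv. This is a sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name in ("transformers", "torch"):
        print(name, installed_version(name))
```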

#### ( Execution results )

```Python:Python3.9.0
(t5_abst_summary) ocean@AfoGuardMacBook-Pro t5_abstract_summarization % python
Python 3.9.0 (default, Dec  3 2020, 16:09:02) 
[Clang 12.0.0 (clang-1200.0.32.27)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> given_text = """Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.
... 
... The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.
... 
... On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.
... 
... Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.
... 
... The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.
... 
... Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.
... 
... The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues."""
>>> 
>>> def abstract_sum(input_text:str) -> str:
...     import torch, json
...     from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
...     model = T5ForConditionalGeneration.from_pretrained('t5-small')
...     tokenizer = T5Tokenizer.from_pretrained('t5-small')
...     device = torch.device('cpu')
...     text = input_text
...     preprocess_text = text.strip().replace("\n","")
...     t5_prepared_Text = "summarize: "+preprocess_text
...     print ("original text preprocessed: \n", preprocess_text)
...     tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
...     # summmarize 
...     summary_ids = model.generate(tokenized_text,  num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
...     output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
...     print ("\n\nSummarized text: \n",output)
...     return output
... 
>>> def translate_eng_to_german(input_text:str) -> str:
...     import torch, json
...     from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
...     model = T5ForConditionalGeneration.from_pretrained('t5-small')
...     tokenizer = T5Tokenizer.from_pretrained('t5-small')
...     device = torch.device('cpu')
...     text = input_text
...     preprocess_text = text.strip().replace("\n","")
...     t5_prepared_Text = "translate English to German: "+preprocess_text
...     print ("original text preprocessed: \n", preprocess_text)
...     tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
...     # summmarize 
...     summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
...     output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
...     print ("\n\nTranslated German text: \n",output)
...     return output
... 
>>> def translate_eng_to_french(input_text:str) -> str:
...     import torch, json
...     from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
...     model = T5ForConditionalGeneration.from_pretrained('t5-small')
...     tokenizer = T5Tokenizer.from_pretrained('t5-small')
...     device = torch.device('cpu')
...     text = input_text
...     preprocess_text = text.strip().replace("\n","")
...     t5_prepared_Text = "translate English to French: "+preprocess_text
...     print ("original text preprocessed: \n", preprocess_text)
...     tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)
...     # summmarize 
...     summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
...     output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
...     print ("\n\nTranslated French text: \n",output)
...     return output
... 
>>> # Abstractive summarization of English text
>>> summarized_text = abstract_sum(given_text)
original text preprocessed: 
 Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues.


Summarized text: 
 the leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday. the two sides met in Brussels on friday after months of talks failed to achieve an agreement.
>>> 
>>> print(summarized_text)
the leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday. the two sides met in Brussels on friday after months of talks failed to achieve an agreement.
>>>
>>>
>>> # English -> German
>>>
>>> translated_german_text = translate_eng_to_german(given_text)
original text preprocessed: 
 Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues.


Translated German text: 
 Am Freitag hat Boris Johnson den Vorsitz des britischen Ministeriums für Verteidigung (MoD) inne, dass vier Patrouillenboote der Royal Navy bereit sind, britische Fischereigewässer zu schützen. Die Sonntagsfrist wurde von Herrn Johnson und der Präsidentin der EU Ursula von der Leyen festgelegt, nachdem die beiden Parteien am Mittwoch in Brüssel zusammentraf
>>> 
>>> 
>>> print(translated_german_text)
Am Freitag hat Boris Johnson den Vorsitz des britischen Ministeriums für Verteidigung (MoD) inne, dass vier Patrouillenboote der Royal Navy bereit sind, britische Fischereigewässer zu schützen. Die Sonntagsfrist wurde von Herrn Johnson und der Präsidentin der EU Ursula von der Leyen festgelegt, nachdem die beiden Parteien am Mittwoch in Brüssel zusammentraf
>>> 
>>> # English -> French
>>> translated_french_text = translate_eng_to_french(given_text)
original text preprocessed: 
 Trade talks between the UK and European Union are continuing in Brussels with one day to go until a deadline imposed by the two sides.The leaders of both parties have warned they are unlikely to reach a post-Brexit trade deal by Sunday.On Friday, Boris Johnson chaired a "stock-take" on the UK's preparedness for a no-deal scenario.Meanwhile, the Ministry of Defence (MoD) said four Royal Navy patrol boats are ready to protect UK fishing waters.The Sunday deadline was set by Mr Johnson and European Commission President Ursula von der Leyen after the pair met in Brussels on Wednesday, after months of talks failed to achieve an agreement.Mr Johnson said the EU needed to make a "big change" over the main sticking points on fishing rights and business competition rules, while Mrs von der Leyen said no deal was the most probable end to "difficult" talks.The EU has rejected Mr Johnson's request to bypass the European Commission and speak directly to French President Emmanuel Macron and Germany's Angela Merkel about the unresolved issues.


Translated French text: 
 Les pourparlers commerciaux entre le Royaume-Uni et l'Union européenne se poursuivent à Bruxelles avec un jour d'attente jusqu'à un délai imposé par les deux parties. Les dirigeants des deux partis ont mis en garde qu'ils ne seront probablement pas parvenus au règlement commercial post-Brexit le dimanche. Vendredi, Boris Johnson a présidé
>>> 
>>> print(translated_french_text)
Les pourparlers commerciaux entre le Royaume-Uni et l'Union européenne se poursuivent à Bruxelles avec un jour d'attente jusqu'à un délai imposé par les deux parties. Les dirigeants des deux partis ont mis en garde qu'ils ne seront probablement pas parvenus au règlement commercial post-Brexit le dimanche. Vendredi, Boris Johnson a présidé
>>> 
```
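
The German and French outputs above stop mid-sentence, which appears consistent with the max_length=100 limit passed to generate. One possible workaround, which the article does not try, is to process the text one paragraph at a time. The sketch below only shows the splitting step and is independent of T5. The names `split_paragraphs` and `apply_per_paragraph` are hypothetical, and `fn` stands for any of the functions defined in this article.

```python
from typing import Callable, List


def split_paragraphs(text: str) -> List[str]:
    """Split text on newlines and drop empty lines."""
    return [line.strip() for line in text.split("\n") if line.strip()]


def apply_per_paragraph(text: str, fn: Callable[[str], str]) -> List[str]:
    """Apply fn (for example a translation function) to each paragraph separately."""
    return [fn(paragraph) for paragraph in split_paragraphs(text)]


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph."
    # str.upper is a stand-in for a real translation function.
    print(apply_per_paragraph(sample, str.upper))
```

Note that the functions in this article reload the model on every call and use min_length=30, so calling them per paragraph would be slow and would force at least 30 output tokens even for a short paragraph. Those settings would likely need adjusting before using this approach.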

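## What I want to do next

Next, I would like to feed Japanese text into T5.
Abstractive summarization of Japanese text with T5 is covered in the following article.

- [@yuko1658, "Japanese text summarization with Multilingual T5"](https://qiita.com/yuko1658/items/02a2321b20dd870d6afe)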