
[Memo] Embedding Japanese text into feature vectors with a pretrained BERT model (Colab environment, 2022/01/19)

Posted at 2023-01-19

Colab environment

!pip install transformers
!pip install fugashi
!pip install ipadic
!pip install tqdm
import torch
from transformers import AutoModel, AutoTokenizer
from tqdm import tqdm  # import the tqdm function itself, not just the module

bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
device = "cuda" if torch.cuda.is_available() else "cpu"

bertjapanese.to(device)
bertjapanese.eval()  # inference mode: disables dropout

for i, text in enumerate(tqdm(texts)):
    # The pretrained BERT model accepts at most 512 input tokens
    input_ids = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():  # no gradients are needed for feature extraction, which also saves memory
        outputs = bertjapanese(input_ids.to(device))
    last_hidden_states = outputs[0]
    # bertjapanese can process multiple texts at once, but here they are converted
    # one by one, so the first 0 selects the only sequence in the batch;
    # the second 0 selects the [CLS] token position
    feature = last_hidden_states[0, 0, :]
    path = "path_to_save_dir" + str(i) + ".pt"
    torch.save(feature, path)
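The saved per-text features can later be read back and stacked into one matrix for downstream use. A minimal sketch of that round trip, using dummy 768-dimensional vectors in place of real [CLS] features and a placeholder file prefix (`save_prefix` is an assumption, analogous to `"path_to_save_dir"` above):

```python
import torch

save_prefix = "feature_"  # hypothetical placeholder, like "path_to_save_dir" above
num_texts = 3
hidden_size = 768  # hidden size of bert-base-japanese

# Save dummy features the same way the loop above does
for i in range(num_texts):
    feature = torch.randn(hidden_size)  # stands in for a real [CLS] feature
    torch.save(feature, save_prefix + str(i) + ".pt")

# Load everything back and stack into an (N, 768) matrix
features = torch.stack(
    [torch.load(save_prefix + str(i) + ".pt") for i in range(num_texts)]
)
print(features.shape)  # torch.Size([3, 768])
```

The stacked matrix can then be fed directly to a classifier or used for nearest-neighbor search.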

Trying to process multiple texts at once runs out of memory and crashes.
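If one-by-one conversion is too slow, a middle ground is to feed the texts in small batches instead of all at once. The chunking itself is plain Python; the batch size and the batched tokenizer call shown in the comments are assumptions, not something from the original article:

```python
def chunks(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = ["文1", "文2", "文3", "文4", "文5"]  # dummy texts for illustration
batches = list(chunks(texts, 2))
print(batches)  # [['文1', '文2'], ['文3', '文4'], ['文5']]

# For each batch one would tokenize with padding so sequences align, e.g.:
# encoded = tokenizer(batch, padding=True, truncation=True,
#                     max_length=512, return_tensors="pt")
# outputs = bertjapanese(**encoded.to(device))
# and take outputs[0][:, 0, :] for the per-text [CLS] features.
```

Shrinking the batch size until the GPU stops running out of memory is usually enough to find a workable setting.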
