
[Memo] Embedding Japanese text into feature vectors with a pretrained BERT model (Colab environment, 2022/01/19)

Posted at 2023-01-19

Colab environment

!pip install transformers
!pip install fugashi
!pip install ipadic
!pip install tqdm
import torch
from transformers import AutoModel, AutoTokenizer
from tqdm import tqdm  # import the tqdm function itself, not just the module

bertjapanese = AutoModel.from_pretrained("cl-tohoku/bert-base-japanese")
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
device = "cuda" if torch.cuda.is_available() else "cpu"

bertjapanese.to(device)
bertjapanese.eval()  # inference mode: disables dropout

for i, text in enumerate(tqdm(texts)):
    # The pretrained BERT model accepts at most 512 input tokens
    input_ids = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():  # no gradients are needed for feature extraction, which also saves memory
        outputs = bertjapanese(input_ids.to(device))
    last_hidden_states = outputs[0]
    # bertjapanese can process multiple texts at once, but here they are converted
    # one by one, so the first 0 selects the only sequence in the batch;
    # the second 0 selects the [CLS] token position
    feature = last_hidden_states[0, 0, :]
    path = "path_to_save_dir" + str(i) + ".pt"
    torch.save(feature, path)
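The saved per-text features can later be read back and stacked into one matrix for downstream use. A minimal sketch of that round trip, using dummy 768-dimensional vectors in place of real [CLS] features and a placeholder file prefix (`save_prefix` is an assumption, analogous to `"path_to_save_dir"` above):

```python
import torch

save_prefix = "feature_"  # hypothetical placeholder, like "path_to_save_dir" above
num_texts = 3
hidden_size = 768  # hidden size of bert-base-japanese

# Save dummy features the same way the loop above does
for i in range(num_texts):
    feature = torch.randn(hidden_size)  # stands in for a real [CLS] feature
    torch.save(feature, save_prefix + str(i) + ".pt")

# Load everything back and stack into an (N, 768) matrix
features = torch.stack(
    [torch.load(save_prefix + str(i) + ".pt") for i in range(num_texts)]
)
print(features.shape)  # torch.Size([3, 768])
```

The stacked matrix can then be fed directly to a classifier or used for nearest-neighbor search.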

Trying to process multiple texts at once runs out of memory and crashes.
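If one-by-one conversion is too slow, a middle ground is to feed the texts in small batches instead of all at once. The chunking itself is plain Python; the batch size and the batched tokenizer call shown in the comments are assumptions, not something from the original article:

```python
def chunks(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = ["文1", "文2", "文3", "文4", "文5"]  # dummy texts for illustration
batches = list(chunks(texts, 2))
print(batches)  # [['文1', '文2'], ['文3', '文4'], ['文5']]

# For each batch one would tokenize with padding so sequences align, e.g.:
# encoded = tokenizer(batch, padding=True, truncation=True,
#                     max_length=512, return_tensors="pt")
# outputs = bertjapanese(**encoded.to(device))
# and take outputs[0][:, 0, :] for the per-text [CLS] features.
```

Shrinking the batch size until the GPU stops running out of memory is usually enough to find a workable setting.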
