Speed comparison of English tokenization and lemmatization

Posted at 2023-03-10

1. Introduction

Libraries for English lemmatization are summarized in the following article.


2. How the elapsed time is measured


from functools import wraps
from statistics import mean
from time import perf_counter

def timeit(func):
    # Run func 1000 times; return its result and the mean elapsed time in seconds.
    @wraps(func)
    def _wrapper(*args, **kwargs):
        es = []
        for _ in range(1000):
            t = perf_counter()
            result = func(*args, **kwargs)
            es.append(perf_counter() - t)
        return result, mean(es)
    return _wrapper
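
As a usage sketch, the decorator can be applied to any function, and the decorated call then returns a (result, mean seconds) pair. The variant below is self-contained and makes two assumptions that are not in the article: a hypothetical `repeat` parameter (the article's version always runs 1000 times) and a toy function `split_words` standing in for a real lemmatizer.

```python
from functools import wraps
from statistics import mean
from time import perf_counter


def timeit(repeat: int = 1000):
    """Decorator factory: run the function `repeat` times and average the time.

    `repeat` is a hypothetical parameter added for this sketch.
    """
    def _decorator(func):
        @wraps(func)
        def _wrapper(*args, **kwargs):
            elapsed = []
            for _ in range(repeat):
                start = perf_counter()
                result = func(*args, **kwargs)
                elapsed.append(perf_counter() - start)
            # Return the last result together with the mean time in seconds.
            return result, mean(elapsed)
        return _wrapper
    return _decorator


@timeit(repeat=100)
def split_words(sent: str) -> list[str]:
    # Toy stand-in for a lemmatizer: whitespace tokenization only.
    return sent.split()


tokens, seconds = split_words("Tom stood up and walked to the window.")
print(tokens, f"{seconds:.6f}")
```

Because the wrapper returns a tuple, callers need to unpack the result and the timing separately, as in the last two lines.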

3. Libraries used




3-1. nltk

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

tag_dict_nltk = {
    'J': wordnet.ADJ,
    'N': wordnet.NOUN,
    'V': wordnet.VERB,
    'R': wordnet.ADV,
}
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    # Map the first letter of the word's POS tag to a WordNet POS (default: noun).
    tag = nltk.pos_tag([word])[0][1][0].upper()
    return tag_dict_nltk.get(tag, wordnet.NOUN)

def lemmatize_by_wordnet(sent: str) -> list[str]:
    return [
        lemmatizer.lemmatize(word, get_wordnet_pos(word))
        for word in nltk.word_tokenize(sent)
    ]

3-2. Pattern

from pattern.en import lemma

def lemmatize_by_pattern(sent: str) -> list[str]:
    return [lemma(word) for word in sent.split()]

3-3. spaCy

import spacy

nlp_en = spacy.load('en_core_web_sm', enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer'])

def lemmatize_by_spacy(sent: str) -> list[str]:
    return [token.lemma_ for token in nlp_en(sent)]

3-4. TextBlob

from textblob import TextBlob

tag_dict_tb = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}

def lemmatize_by_textblob(sent: str) -> list[str]:
    return [
        word.lemmatize(tag_dict_tb.get(pos[0], 'n'))
        for word, pos in TextBlob(sent).tags
    ]

4. Results

4-1. Sentence ①

Mr. Bean met with widespread acclaim and attracted large television audiences.

| Library | Output | Mean time (s) |
| --- | --- | --- |
| nltk | Mr. / Bean / met / with / widespread / acclaim / and / attract / large / television / audience / . | 0.006062 |
| Pattern | mr. / bean / meet / with / widespread / acclaim / and / attract / large / television / audiences. | 0.000013 |
| spaCy | Mr. / Bean / meet / with / widespread / acclaim / and / attract / large / television / audience / . | 0.002754 |
| TextBlob | Mr. / Bean / meet / with / widespread / acclaim / and / attract / large / television / audience | 0.001292 |

4-2. Sentence ②

I'm often asked what inspired me to create Dothraki, but I always find the question a little odd.

| Library | Output | Mean time (s) |
| --- | --- | --- |
| nltk | I / 'm / often / ask / what / inspire / me / to / create / Dothraki / , / but / I / always / find / the / question / a / little / odd / . | 0.009829 |
| Pattern | i'm / often / ask / what / inspire / me / to / create / dothraki, / but / i / alway / find / the / question / a / little / odd. | 0.000020 |
| spaCy | I / be / often / ask / what / inspire / I / to / create / Dothraki / , / but / I / always / find / the / question / a / little / odd / . | 0.003094 |
| TextBlob | I / 'm / often / ask / what / inspire / me / to / create / Dothraki / but / I / always / find / the / question / a / little / odd | 0.001477 |

4-3. Sentence ③

The best and worst foods for your teeth

| Library | Output | Mean time (s) |
| --- | --- | --- |
| nltk | The / best / and / bad / food / for / your / teeth | 0.003903 |
| Pattern | the / best / and / worst / food / for / your / teeth | 0.000010 |
| spaCy | the / good / and / bad / food / for / your / tooth | 0.002266 |
| TextBlob | The / best / and / bad / food / for / your / teeth | 0.000997 |

4-4. Sentence ④

Tom stood up and walked to the window.

| Library | Output | Mean time (s) |
| --- | --- | --- |
| nltk | Tom / stood / up / and / walk / to / the / window / . | 0.004425 |
| Pattern | tom / stand / up / and / walk / to / the / window. | 0.000009 |
| spaCy | Tom / stand / up / and / walk / to / the / window / . | 0.002315 |
| TextBlob | Tom / stand / up / and / walk / to / the / window | 0.001090 |

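If speed is all that matters, Pattern is the fastest, but its accuracy is low: it only splits on whitespace, so punctuation stays attached to words, and it lowercases everything.
If accuracy matters most, spaCy is the best, but it is slower than Pattern and TextBlob.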
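
To put the speed gap in numbers, the per-call times from the four tables above can be averaged per library and expressed relative to Pattern. The figures are copied from the tables; the variable names are only illustrative.

```python
from statistics import mean

# Mean seconds per call, copied from tables 4-1 to 4-4 (sentences 1 to 4).
timings = {
    "nltk":     [0.006062, 0.009829, 0.003903, 0.004425],
    "Pattern":  [0.000013, 0.000020, 0.000010, 0.000009],
    "spaCy":    [0.002754, 0.003094, 0.002266, 0.002315],
    "TextBlob": [0.001292, 0.001477, 0.000997, 0.001090],
}

# Average the four sentences per library and use Pattern as the baseline.
averages = {name: mean(values) for name, values in timings.items()}
baseline = averages["Pattern"]

# Print libraries from fastest to slowest with their slowdown relative to Pattern.
for name, avg in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{name:<8} {avg:.6f} s  x{avg / baseline:.0f}")
```

On these four sentences the order from fastest to slowest is Pattern, TextBlob, spaCy, nltk, with spaCy roughly 200 times slower than Pattern per call. Four short sentences are a small sample, so the ratios are only indicative.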

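5. Bonus

spaCy is accurate but slow. According to the link below, however, running it through nlp.pipe processes the texts in batches and is more efficient.

The machine used for this test runs Windows, and the task is "counting the occurrences of each distinct lemma in a corpus".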

from collections import defaultdict

import spacy

nlp = spacy.load('en_core_web_sm', enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer'])

# Calling nlp directly on each text
def nlp_with_for_loop(texts: list[str]) -> defaultdict[str, int]:
    counter = defaultdict(int)
    for text in texts:
        for token in nlp(text):
            counter[token.lemma_] += 1
    return counter

# Using nlp.pipe
def nlp_pipe(texts: list[str],
             n_process: int,
             batch_size: int
             ) -> defaultdict[str, int]:
    counter = defaultdict(int)
    # Pass the parameters through so that they actually take effect.
    for doc in nlp.pipe(texts, n_process=n_process, batch_size=batch_size):
        for token in doc:
            counter[token.lemma_] += 1
    return counter
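
The measurements below vary n_process and batch_size. One way such a grid could be timed is sketched here. This is an assumption about the procedure, not the article's actual benchmark script. `time_grid` and `fake_pipe_count` are hypothetical names, and `fake_pipe_count` is a spaCy-free stand-in with the same signature as nlp_pipe so that the sketch runs on its own.

```python
from collections import Counter
from itertools import product
from time import perf_counter
from typing import Callable


def fake_pipe_count(texts: list[str], n_process: int, batch_size: int) -> Counter:
    # Hypothetical stand-in for nlp_pipe: counts whitespace tokens batch by batch.
    # n_process is accepted only to mirror the signature; it is ignored here.
    counter: Counter = Counter()
    for start in range(0, len(texts), batch_size):
        for text in texts[start:start + batch_size]:
            counter.update(text.split())
    return counter


def time_grid(func: Callable, texts: list[str],
              n_processes: list[int],
              batch_sizes: list[int]) -> dict[tuple[int, int], float]:
    # Time one call of func for every (n_process, batch_size) combination.
    results = {}
    for n_process, batch_size in product(n_processes, batch_sizes):
        start = perf_counter()
        func(texts, n_process=n_process, batch_size=batch_size)
        results[(n_process, batch_size)] = perf_counter() - start
    return results


texts = ["Tom stood up and walked to the window."] * 1000
grid = time_grid(fake_pipe_count, texts, [1, 2, 4, 7], [256, 512, 1024, 2048])
for (n_process, batch_size), seconds in grid.items():
    print(n_process, batch_size, f"{seconds:.4f}")
```

To time the real pipeline, nlp_pipe would be passed in place of the stand-in. On Windows, Python's multiprocessing starts fresh interpreter processes, so a script that uses n_process greater than 1 generally needs its entry point under an `if __name__ == "__main__":` guard.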


For reference, the result of calling nlp directly instead of nlp.pipe is also included.

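| n_process \ batch_size | nlp | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- |
| nlp | 19.26 | - | - | - | - |
| 1 | - | 5.92 | 4.82 | 4.84 | 4.86 |
| 2 | - | 4.82 | 5.12 | 5.74 | 6.36 |
| 4 | - | 6.64 | 8.36 | 6.40 | 6.47 |
| 7 | - | 6.46 | 6.35 | 6.37 | 6.38 |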



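| n_process \ batch_size | nlp | 256 | 512 | 1024 | 2048 |
| --- | --- | --- | --- | --- | --- |
| nlp | 203.48 | - | - | - | - |
| 1 | - | 66.14 | 57.71 | 63.00 | 64.30 |
| 2 | - | 62.29 | 63.91 | 65.93 | 79.05 |
| 4 | - | 73.53 | 61.34 | 47.85 | 47.57 |
| 7 | - | 52.79 | 60.56 | 61.29 | 61.02 |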



