
Comparing the speed of English tokenization and lemmatization

Last updated at 2023-10-24, posted at 2023-03-10

1. Introduction

Libraries for English lemmatization are summarized in the following article.

That article does not measure speed, however, so this post compares it.

2. How the time is measured

Each function is run 1000 times, and the mean time per call (in seconds) is computed.

from functools import wraps
from time import perf_counter
from statistics import mean

def timeit(func):
    @wraps(func)
    def _wrapper(*args, **kwargs):
        es = []
        for _ in range(1000):
            t = perf_counter()
            result = func(*args, **kwargs)
            e = perf_counter() - t
            es.append(e)
        return result, mean(es)
    return _wrapper
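
Because the wrapper returns a tuple, a decorated function hands back both its normal result and the mean time per call. A minimal usage sketch follows; the decorator is repeated so the block runs on its own, and `split_words` is a hypothetical stand-in for a lemmatizer, used only to show the shape of the return value.

```python
from functools import wraps
from statistics import mean
from time import perf_counter


def timeit(func):
    """Same decorator as above: run func 1000 times, return (result, mean seconds)."""
    @wraps(func)
    def _wrapper(*args, **kwargs):
        es = []
        for _ in range(1000):
            t = perf_counter()
            result = func(*args, **kwargs)
            es.append(perf_counter() - t)
        return result, mean(es)
    return _wrapper


@timeit
def split_words(sent: str) -> list[str]:
    # Hypothetical stand-in for a lemmatizer, used only for illustration.
    return sent.split()


# The call returns a (result, mean_seconds) tuple, so unpack both values.
tokens, avg_seconds = split_words("Tom stood up and walked to the window.")
print(tokens)
print(f"{avg_seconds:.6f} s per call")
```

Note that the measured time covers only the call itself; anything done at import time, such as loading a model, falls outside the loop.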

3. Libraries used

The following four libraries are compared.

nltk==3.8.1
Pattern==3.6
spacy==3.5.1
textblob==0.17.1

The code is based on the implementations in the article linked at the beginning.

3-1. nltk

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

tag_dict_nltk = {
    'J': wordnet.ADJ,
    'N': wordnet.NOUN,
    'V': wordnet.VERB,
    'R': wordnet.ADV,
}
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    return tag_dict_nltk.get(tag, wordnet.NOUN)

@timeit
def lemmatize_by_wordnet(sent: str) -> list[str]:
    return [
        lemmatizer.lemmatize(word, get_wordnet_pos(word))
        for word in nltk.word_tokenize(sent)
    ]
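
`get_wordnet_pos` keeps only the first letter of the Penn Treebank tag returned by `nltk.pos_tag` and falls back to noun for everything else. The sketch below reproduces just that mapping without NLTK, assuming the WordNet POS constants have their usual single-letter values (`'a'`, `'n'`, `'v'`, `'r'`, the same letters the TextBlob section uses); `first_letter_to_wordnet` is a hypothetical helper name.

```python
# Assumed values of nltk.corpus.wordnet.ADJ / NOUN / VERB / ADV.
TAG_TO_WORDNET = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}


def first_letter_to_wordnet(penn_tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter, defaulting to noun."""
    return TAG_TO_WORDNET.get(penn_tag[0].upper(), 'n')


# Adjective, plural noun, past-tense verb, adverb, and a preposition (falls back to noun).
for tag in ['JJ', 'NNS', 'VBD', 'RB', 'IN']:
    print(tag, '->', first_letter_to_wordnet(tag))
```

Since the article's function tags each word on its own (`nltk.pos_tag([word])`), the tagger has no sentence context to work with, which may affect the tag it picks.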

3-2. Pattern

from pattern.en import lemma

@timeit
def lemmatize_by_pattern(sent: str) -> list[str]:
    return [lemma(word) for word in sent.split()]

3-3. spaCy

import spacy

nlp_en = spacy.load('en_core_web_sm', enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer'])

@timeit
def lemmatize_by_spacy(sent: str) -> list[str]:
    return [token.lemma_ for token in nlp_en(sent)]

3-4. TextBlob

from textblob import TextBlob

tag_dict_tb = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}

@timeit
def lemmatize_by_textblob(sent: str) -> list[str]:
    return [
        word.lemmatize(tag_dict_tb.get(pos[0], 'n'))
        for word, pos in TextBlob(sent).tags
    ]

4. Results

4-1. Sentence 1

Mr. Bean met with widespread acclaim and attracted large television audiences.

Library	Output	Mean time per call (s)
nltk	Mr. / Bean / met / with / widespread / acclaim / and / attract / large / television / audience / .	0.006062
Pattern	mr. / bean / meet / with / widespread / acclaim / and / attract / large / television / audiences.	0.000013
spaCy	Mr. / Bean / meet / with / widespread / acclaim / and / attract / large / television / audience / .	0.002754
TextBlob	Mr. / Bean / meet / with / widespread / acclaim / and / attract / large / television / audience	0.001292

4-2. Sentence 2

I'm often asked what inspired me to create Dothraki, but I always find the question a little odd.

Library	Output	Mean time per call (s)
nltk	I / 'm / often / ask / what / inspire / me / to / create / Dothraki / , / but / I / always / find / the / question / a / little / odd / .	0.009829
Pattern	i'm / often / ask / what / inspire / me / to / create / dothraki, / but / i / alway / find / the / question / a / little / odd.	0.000020
spaCy	I / be / often / ask / what / inspire / I / to / create / Dothraki / , / but / I / always / find / the / question / a / little / odd / .	0.003094
TextBlob	I / 'm / often / ask / what / inspire / me / to / create / Dothraki / but / I / always / find / the / question / a / little / odd	0.001477

4-3. Sentence 3

The best and worst foods for your teeth

Library	Output	Mean time per call (s)
nltk	The / best / and / bad / food / for / your / teeth	0.003903
Pattern	the / best / and / worst / food / for / your / teeth	0.000010
spaCy	the / good / and / bad / food / for / your / tooth	0.002266
TextBlob	The / best / and / bad / food / for / your / teeth	0.000997

4-4. Sentence 4

Tom stood up and walked to the window.

Library	Output	Mean time per call (s)
nltk	Tom / stood / up / and / walk / to / the / window / .	0.004425
Pattern	tom / stand / up / and / walk / to / the / window.	0.000009
spaCy	Tom / stand / up / and / walk / to / the / window / .	0.002315
TextBlob	Tom / stand / up / and / walk / to / the / window	0.001090

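If raw speed is all that matters, Pattern is the clear winner at roughly 10 to 20 microseconds per sentence, but its accuracy is low.
If accuracy matters most, spaCy is the best choice, but it is slower than Pattern and TextBlob (though faster than the nltk setup used here).
That is the overall result.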
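
Some of what looks like inaccuracy in the Pattern rows comes from tokenization rather than lemmatization: the Pattern function above splits with `str.split()`, so punctuation stays attached to the neighbouring word (for example `audiences.` and `dothraki,` in the tables). A small standard-library illustration follows; `split_words_and_punct` is a hypothetical helper that the article did not use, showing one crude way to separate punctuation before lemmatizing.

```python
import re

sentence = "Mr. Bean met with widespread acclaim and attracted large television audiences."

# What the Pattern function above feeds to lemma(): whitespace-separated chunks.
whitespace_tokens = sentence.split()
print(whitespace_tokens[-1])  # the final period is still attached to the word


def split_words_and_punct(text: str) -> list[str]:
    """Hypothetical helper: runs of word characters/apostrophes, or single punctuation marks."""
    return re.findall(r"[\w']+|[^\w\s]", text)


print(split_words_and_punct(sentence))
```

This regular expression is deliberately simple: it also splits `Mr.` into `Mr` and `.`, which the nltk and spaCy tokenizers in the tables do not do.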

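5. Bonus

spaCy is accurate but slow. According to the link below, however, running it through nlp.pipe processes texts in batches and is more efficient.

So how much difference does it make? Let's try it while varying the number of items, the number of processes, and the batch size.
The test machine runs Windows, and the task is to count the occurrences of each distinct lemma in a corpus.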

from collections import defaultdict

import spacy

nlp = spacy.load('en_core_web_sm', enable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer'])

# Using nlp directly, one text at a time
def nlp_with_for_loop(texts: list[str]) -> defaultdict[str, int]:
    counter = defaultdict(int)
    for text in texts:
        for token in nlp(text):
            counter[token.lemma_] += 1
    return counter

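# Using nlp.pipe
def nlp_pipe(texts: list[str],
             n_process: int,
             batch_size: int
             ) -> defaultdict[str, int]:
    counter = defaultdict(int)
    # Pass the tuning parameters through so that they actually take effect
    for doc in nlp.pipe(texts, n_process=n_process, batch_size=batch_size):
        for token in doc:
            counter[token.lemma_] += 1
    return counter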
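
The article does not show the loop that produced the tables below. One way such a grid could be timed is sketched here under stated assumptions: `time_grid` is a hypothetical helper, each configuration is run once and timed with `perf_counter`, and `fake_pipe` is a stand-in workload so that the sketch runs without spaCy. In a real run, a function with the same three parameters as `nlp_pipe` above would be passed instead.

```python
from collections import Counter
from itertools import product
from time import perf_counter
from typing import Callable


def time_grid(run: Callable[[list[str], int, int], object],
              texts: list[str],
              n_processes: list[int],
              batch_sizes: list[int]) -> dict[tuple[int, int], float]:
    """Time run(texts, n_process, batch_size) once per combination; values are seconds."""
    timings = {}
    for n_process, batch_size in product(n_processes, batch_sizes):
        start = perf_counter()
        run(texts, n_process, batch_size)
        timings[(n_process, batch_size)] = perf_counter() - start
    return timings


def fake_pipe(texts: list[str], n_process: int, batch_size: int) -> Counter:
    # Stand-in workload (assumption): counts whitespace tokens and ignores both tuning parameters.
    return Counter(word for text in texts for word in text.split())


if __name__ == '__main__':
    sample = ["Tom stood up and walked to the window."] * 1000
    results = time_grid(fake_pipe, sample, [1, 2], [256, 512])
    for (n_process, batch_size), seconds in results.items():
        print(f"n_process={n_process} batch_size={batch_size}: {seconds:.4f} s")
```

The `if __name__ == '__main__':` guard matters once real multiprocessing is involved: on Windows, worker processes are started by re-importing the main module, so unguarded top-level code would run again in every worker.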

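Number of items: 10,000

The table below summarizes the processing time for 10,000 items while varying the number of processes and the batch size.
For reference, it also includes the case where nlp is used directly instead of nlp.pipe.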

batch_size→ n_process↓	nlp	256	512	1024	2048
nlp	19.26	-	-	-	-
1	-	5.92	4.82	4.84	4.86
2	-	4.82	5.12	5.74	6.36
4	-	6.64	8.36	6.40	6.47
7	-	6.46	6.35	6.37	6.38

Number of items: 100,000

The same measurements with 100,000 items.

batch_size→ n_process↓	nlp	256	512	1024	2048
nlp	203.48	-	-	-	-
1	-	66.14	57.71	63.00	64.30
2	-	62.29	63.91	65.93	79.05
4	-	73.53	61.34	47.85	47.57
7	-	52.79	60.56	61.29	61.02

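Even without tuning the number of processes or the batch size, nlp.pipe is about 3 to 4 times faster, so there is no reason not to use it.
I did not try it this time, but spaCy can also use a GPU, so with enough hardware it can be made faster still.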
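
The speedup figure can be checked directly against the tables: dividing the plain `nlp` time by each `n_process=1` time gives the ratio. The numbers below are copied from the two tables above; the variable names are only for illustration.

```python
# Times copied from the tables above (plain nlp loop vs. nlp.pipe with n_process=1).
baseline = {10_000: 19.26, 100_000: 203.48}
pipe_n1 = {
    10_000: {256: 5.92, 512: 4.82, 1024: 4.84, 2048: 4.86},
    100_000: {256: 66.14, 512: 57.71, 1024: 63.00, 2048: 64.30},
}

# Speedup = baseline time / nlp.pipe time, per data size and batch size.
speedups = {
    size: {batch: round(baseline[size] / t, 2) for batch, t in row.items()}
    for size, row in pipe_n1.items()
}

for size, row in speedups.items():
    print(size, row)
```

Every single-process cell lands between about 3 and 4, which matches the claim in the text.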
