LoginSignup
0
0

More than 3 years have passed since last update.

spaCyのpipeline周りに詳しくなる(なりたい)

Posted at

注意

以下はjupyter notebookで記述した、出力された内容をmarkdownでダウンロードして貼り付けたものです。

https://github.com/booink/spacy-trial1/tree/master
こちらの公開リポジトリに動作環境を反映してあります。

30分程度しか手を動かせていないのをお試しでmarkdown出力しただけのペラペラな内容なので、読み応えはありませんので悪しからず。


上から写経していく。

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

要は nlp メソッドにテキストを渡すと、トークン化したテキストをDocクラスのオブジェクトに入れて返してくれると。
そのDocオブジェクトは pipeline という仕組みで、連鎖的に処理した結果をDocオブジェクトのバケツリレーをするってことかな。
pipeline には tagger、parser、entity recognizer(ner) があるよ。

なるほど。
docオブジェクトの型を見てみよう。

import spacy

nlp = spacy.load("en")
doc = nlp("This is a text")
type(doc)
---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

<ipython-input-4-69cc80a89d2d> in <module>
      1 import spacy
      2 
----> 3 nlp = spacy.load("en")
      4 doc = nlp("This is a text")
      5 type(doc)


/usr/local/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)
     28     if depr_path not in (True, False, None):
     29         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 30     return util.load_model(name, **overrides)
     31 
     32 


/usr/local/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)
    167     elif hasattr(name, "exists"):  # Path or Path-like to model data
    168         return load_model_from_path(name, **overrides)
--> 169     raise IOError(Errors.E050.format(name=name))
    170 
    171 


OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

enモデルが無いよ、って怒られました。

QuickStartの通りにやってみる

!python -m spacy download en_core_web_sm
Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
[K     |████████████████████████████████| 12.0 MB 476 kB/s eta 0:00:01
[?25hRequirement already satisfied: spacy>=2.2.2 in /usr/local/lib/python3.7/site-packages (from en_core_web_sm==2.2.5) (2.2.4)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.0.3)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (4.44.1)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.18.2)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.23.0)
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (7.4.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.2)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.6.0)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.1.3)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (46.0.0)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.4.1)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/site-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.2)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2.9)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.25.8)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/site-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2019.11.28)
Requirement already satisfied: importlib-metadata>=0.20; python_version < "3.8" in /usr/local/lib/python3.7/site-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.6.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/site-packages (from importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.1.0)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... [?25ldone
[?25h  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.2.5-py3-none-any.whl size=12011738 sha256=4e741a4ef6924b14806dc4789ff4156bf93b98c79d33f5959516f6a04c73f4bb
  Stored in directory: /tmp/pip-ephem-wheel-cache-yazrb305/wheels/51/19/da/a3885266a3c241aff0ad2eb674ae058fd34a4870fef1c0a5a0
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.2.5
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')

ダウンロードできた。
コードを実行してみる

import spacy
nlp = spacy.load("en_core_web_sm")
---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

<ipython-input-6-14d257ed08ca> in <module>
      1 import spacy
----> 2 nlp = spacy.load("en_core_web_sm")


/usr/local/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)
     28     if depr_path not in (True, False, None):
     29         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 30     return util.load_model(name, **overrides)
     31 
     32 


/usr/local/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)
    167     elif hasattr(name, "exists"):  # Path or Path-like to model data
    168         return load_model_from_path(name, **overrides)
--> 169     raise IOError(Errors.E050.format(name=name))
    170 
    171 


OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

むむ。
jupyter notebook上だとアカンのか?
一度Dockerfileに書いてビルドし直してみる。

ビルドし直してみた。
再度実行してみる。

import spacy
nlp = spacy.load("en_core_web_sm")

何もエラー出ない。成功か。
docの型を見てみよう。

doc = nlp("This is a text")
type(doc)
spacy.tokens.doc.Doc

spacy.tokens.doc.Doc なるほど。
pipeline は何が設定されているか。

for p in nlp.pipeline:
    print(p)
('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fc3c78613d0>)
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fc39292ede0>)
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fc3928c5360>)

ふむふむ。
tagger、parser、ner 確かに。

ちなみに、モデルのQuickStart見てたら、こんな書き方↓もできるみたい。

import en_core_web_sm # 文字列でloadするモデルを指定する方法の他に、モジュールとして読み込む方法があるようだ
nlp = en_core_web_sm.load() # 引数なしの load メソッドが nlp を返すのか
doc = nlp("This is a text")
print(doc)

for p in nlp.pipeline:
    print(p)
This is a text
('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fc3903805d0>)
('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fc3928bad70>)
('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fc3928ba9f0>)

nlp オブジェクトってなんだ

type(nlp)
spacy.lang.en.English

ふーん

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0