
Quickly creating data for sequence labeling (POS tagging)

Posted at 2013-01-03

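This is easy with the Brown Corpus that ships with NLTK's nltk_data. To create data for POS tagging, all you need to do is call tagged_sents(). If you specify categories, you can restrict the data to that domain (besides news, there are many others such as reviews, fiction, romance, and mystery).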

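import nltk
from nltk.corpus import brown

# Requires the Brown Corpus; fetch it once with nltk.download('brown').
# Tagged sentences from the 'news' category: each is a list of (word, pos) pairs.
corpus = brown.tagged_sents(categories='news')

def dataset(N=100):
    """Return (words, pos_tags) pairs for the first N sentences."""
    d = []
    for tagged_sent in corpus[:N]:
        untagged_sent = nltk.tag.untag(tagged_sent)
        pos_sequence = [pos for (word, pos) in tagged_sent]
        d.append((untagged_sent, pos_sequence))
    return d

if __name__ == "__main__":
    data = dataset()  # separate name, so the dataset function is not overwritten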
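
To make the shape of the resulting data concrete, here is a minimal sketch that does the same split without needing NLTK or the corpus installed. The helper name `split_tagged` and the sample sentence are hypothetical, written by hand for illustration; the tags only imitate the Brown tag style and are not taken from the corpus.

```python
def split_tagged(tagged_sents):
    """Turn sentences of (word, pos) pairs into (words, pos_tags) pairs."""
    data = []
    for tagged_sent in tagged_sents:
        words = [word for (word, pos) in tagged_sent]
        tags = [pos for (word, pos) in tagged_sent]
        data.append((words, tags))
    return data

# Hand-written sample in the same (word, pos) format as tagged_sents() output.
sample = [[("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")]]

pairs = split_tagged(sample)
for words, tags in pairs:
    print(words)
    print(tags)
```

Each item is a pair of equal-length lists, so the word at position i lines up with the tag at position i. That is the input/label layout a sequence labeler is usually trained on.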