
Splitting strings with Lucene's Analyzer

lucene

Posted at 2018-03-25

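This article uses Lucene's Analyzer implementations (subclasses of org.apache.lucene.analysis.Analyzer) to split strings into tokens. Four concrete classes are tried: StandardAnalyzer, CJKAnalyzer, JapaneseAnalyzer, and ICUCollationKeyAnalyzer. The code is written in Scala.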

Preparation

Set up Java and sbt, then create the following four files.

File name	Description
`build.sbt`	sbt build definition. Declares the Lucene libraries.
`src/main/scala/Main.scala`	Scala entry point. Reads the test data and calls Tokenizer.
`src/main/scala/Tokenizer.scala`	Wrapper around an Analyzer. Generates tokens from a string and prints their attributes.
`src/main/resources/sampletext.txt`	Sample strings, one sentence per line.

This is the sbt build definition. It sets the Scala version and makes the Lucene libraries available. Here the Scala version is 2.12.4 and the Lucene version is 7.2.1.

build.sbt

scalaVersion := "2.12.4"

name := "lucene-tokenize-sample"
organization := "com.example.scala"
version := "1.0"

val luceneVersion = "7.2.1"

// https://mvnrepository.com/artifact/org.apache.lucene/
libraryDependencies ++= Seq(
    "org.apache.lucene" % "lucene-core" % luceneVersion,
    "org.apache.lucene" % "lucene-analyzers-common" % luceneVersion,
    "org.apache.lucene" % "lucene-analyzers-kuromoji" % luceneVersion,
    "org.apache.lucene" % "lucene-analyzers-icu" % luceneVersion
)

This is the Scala entry point.
Source.fromResource reads the file under src/main/resources/, and each line is then analyzed by StandardAnalyzer, CJKAnalyzer, JapaneseAnalyzer, and ICUCollationKeyAnalyzer in turn.

src/main/scala/Main.scala

import scala.io.Source

import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.cjk.CJKAnalyzer
import org.apache.lucene.analysis.ja.JapaneseAnalyzer
import org.apache.lucene.collation.ICUCollationKeyAnalyzer
import org.apache.lucene.util.Version

object Main extends App {
  println(s"Hello, Lucene ${Version.LATEST}!")

  val sampleFile = "sampletext.txt"
  val resource = Source.fromResource(sampleFile)
  val collator = Collator.getInstance(new ULocale("ja"))  // "ja" is the language code for Japanese; "jp" is a country code
  val tokenizers = Array(
    new Tokenizer(new StandardAnalyzer),
    new Tokenizer(new CJKAnalyzer),  // 2-gram for Japanese text
    new Tokenizer(new JapaneseAnalyzer),
    new Tokenizer(new ICUCollationKeyAnalyzer(collator)),
  )
  println(s"test ${tokenizers.length} tokenizers")
  resource.getLines.foreach { line =>
    tokenizers.foreach { t =>
      val tokens = t.tokenize(line)
      println(s"tokenized into ${tokens.length} tokens")
    }
  }
  resource.close
}

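This is the wrapper around an Analyzer. It calls tokenStream to split the string into tokens, then reads and prints the attributes of each token. The token attributes are defined in the tokenattributes packages; see the Javadoc for details. Note that the general attributes (org.apache.lucene.analysis.tokenattributes) and the Kuromoji attributes (org.apache.lucene.analysis.ja.tokenattributes) live in different JAR files, so moving between their documentation and source code takes some care until you get used to it. Keeping each one open in a separate tab or window helps.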

src/main/scala/Tokenizer.scala

import java.io.{IOException, StringReader}

import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.tokenattributes.{CharTermAttribute, OffsetAttribute, PositionIncrementAttribute, PositionLengthAttribute, TypeAttribute}
import org.apache.lucene.analysis.ja.tokenattributes.{BaseFormAttribute, InflectionAttribute, PartOfSpeechAttribute, ReadingAttribute}

class Tokenizer(a: Analyzer) {
  private val analyzer: Analyzer = a

  def tokenize(s: String) : Array[String] = {
    println("------------------------------")
    println(s"text = $s")
    println(s"analyzer = ${analyzer.getClass}")
    var tokens = new Array[String](0)
    // Make analyzer to create token stream
    val stream  = analyzer.tokenStream(null, new StringReader(s))
    try {
      stream.reset
      var i = 1
      while (stream.incrementToken) {
        val termAtt = stream.getAttribute(classOf[CharTermAttribute])
        tokens :+= termAtt.toString
        val typeAtt = stream.getAttribute(classOf[TypeAttribute])
        val offsetAtt = stream.getAttribute(classOf[OffsetAttribute])
        val poslenAtt = stream.getAttribute(classOf[PositionLengthAttribute])
        val posincAtt = stream.getAttribute(classOf[PositionIncrementAttribute])
        println(s"    ${i}\ttoken=${termAtt.toString}\ttype=${typeAtt.`type`}\tstart=${offsetAtt.startOffset}\tend=${offsetAtt.endOffset}\tlength=${poslenAtt.getPositionLength}\tincrement=${posincAtt.getPositionIncrement}")
        // Kuromoji attributes
        val baseFormAtt = stream.getAttribute(classOf[BaseFormAttribute])
        val inflectionAtt = stream.getAttribute(classOf[InflectionAttribute])
        val partOfSpeechAtt = stream.getAttribute(classOf[PartOfSpeechAttribute])
        val readingAtt = stream.getAttribute(classOf[ReadingAttribute])
        if (baseFormAtt != null) {
          println(s"    \tpartOfSpeech=${partOfSpeechAtt.getPartOfSpeech}\treading=${readingAtt.getReading}\tpronounciation=${readingAtt.getPronunciation}\tbase=${baseFormAtt.getBaseForm}\tinflectionForm=${inflectionAtt.getInflectionForm}\tinflectionType=${inflectionAtt.getInflectionType}")
        }
        i += 1
      }
      stream.end
    } catch {
      case ex: IOException => {
        // not thrown b/c we're using a string reader...
        throw new RuntimeException(ex)
      }
    } finally {
      stream.close
    }
    return tokens
  }
}

These are the sample strings. The file contains both English and Japanese text.

src/main/resources/sampletext.txt

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) (https://github.com/niderhoff/nlp-datasets)
春は曙。やうやう白くなりゆく山際、すこしあかりて、紫だちたる雲の細くたなびきたる。

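Analyzers, Tokenizers, and Filters are explained in the official Solr documentation. The answers to a Stack Overflow question on the same topic are also easy to follow.

The Scala code above uses Analyzers, each of which combines a Lucene Tokenizer with Filters. The aim is to do more than split the string: steps such as stop-word removal should run as part of the same pass.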
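The Tokenizer-plus-Filters idea can be sketched in plain Scala. This is a conceptual model only and does not use Lucene's API: the object name `PipelineSketch`, the splitting rule, and the abbreviated stop-word list are all assumptions made for illustration.

```scala
// Conceptual model of an Analyzer as "Tokenizer, then Filters".
// NOT Lucene's API; names and the stop-word list are illustrative assumptions.
object PipelineSketch {
  // Assumed, abbreviated English stop-word list.
  val stopWords: Set[String] = Set("a", "an", "and", "for", "in", "of", "the", "with")

  // "Tokenizer" stage: split on any run of characters that are not letters or digits.
  def tokenize(text: String): List[String] =
    text.split("[^\\p{L}\\p{N}]+").toList.filter(_.nonEmpty)

  // "Filter" stage 1: lower-case every token.
  def lowerCase(tokens: List[String]): List[String] = tokens.map(_.toLowerCase)

  // "Filter" stage 2: drop stop words.
  def removeStopWords(tokens: List[String]): List[String] = tokens.filterNot(stopWords)

  // "Analyzer": the tokenizer followed by the filters, applied in order.
  def analyze(text: String): List[String] =
    removeStopWords(lowerCase(tokenize(text)))
}
```

For example, `PipelineSketch.analyze("Alphabetical list of free/public domain")` yields `alphabetical`, `list`, `free`, `public`, `domain`: the word "of" is dropped by the stop-word stage.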

Running

Run the program with the sbt run command.

$ sbt run

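The results are summarized below. ICUCollationKeyAnalyzer does not split the string (it emits the whole input as a single token), so it is left out of the tables.

For Japanese, the output of Kuromoji's JapaneseAnalyzer is the easiest to read. However, this Analyzer has no English stop words configured, so words such as "of" and "with" remain. It is a good idea to identify the language before analysis and route each text to a suitable Analyzer. Also, the second half of 「すこしあかりて」, 「あかりて」, is not interpreted well, so depending on the target domain you may need to add to or replace the dictionary.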
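One rough way to route text by language is to check which Unicode scripts it contains. The following is a minimal sketch under that assumption: `ScriptRouter` and its methods are hypothetical names, and it returns the Analyzer's class name as a string so that it runs without Lucene on the classpath. Treating Han characters as Japanese is itself an assumption, since they also occur in Chinese text.

```scala
import java.lang.Character.UnicodeScript

// Hypothetical helper: picks an Analyzer by the scripts found in the text.
object ScriptRouter {
  // Assumption: any of these scripts means "treat as Japanese".
  private val japaneseScripts: Set[UnicodeScript] =
    Set(UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA, UnicodeScript.HAN)

  // True if at least one code point belongs to a script listed above.
  def containsJapanese(text: String): Boolean =
    text.codePoints().toArray.exists(cp => japaneseScripts.contains(UnicodeScript.of(cp)))

  // Returns the name of the Analyzer to use; a real project would return the instance.
  def analyzerNameFor(text: String): String =
    if (containsJapanese(text)) "JapaneseAnalyzer" else "StandardAnalyzer"
}
```

With this sketch, the third sample sentence would go to JapaneseAnalyzer and the two English sentences to StandardAnalyzer, which applies English stop words.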

String 1:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

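The text is Latin, so it is simply split on whitespace and punctuation, and all three Analyzers give the same result. Each of them includes a LowerCaseFilter, which converts the tokens to lower case.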

No.	StandardAnalyzer	CJKAnalyzer	JapaneseAnalyzer
1	lorem	lorem	lorem
2	ipsum	ipsum	ipsum
3	dolor	dolor	dolor
4	sit	sit	sit
5	amet	amet	amet
6	consectetur	consectetur	consectetur
7	adipiscing	adipiscing	adipiscing
8	elit	elit	elit
9	sed	sed	sed
10	do	do	do
11	eiusmod	eiusmod	eiusmod
12	tempor	tempor	tempor
13	incididunt	incididunt	incididunt
14	ut	ut	ut
15	labore	labore	labore
16	et	et	et
17	dolore	dolore	dolore
18	magna	magna	magna
19	aliqua	aliqua	aliqua

String 2:

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) (https://github.com/niderhoff/nlp-datasets)

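StandardAnalyzer and CJKAnalyzer give the same result. JapaneseAnalyzer has no English stop-word filter, so words such as "of" remain. The Analyzers also treat periods differently, which determines whether the domain name github.com is split.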

No.	StandardAnalyzer	CJKAnalyzer	JapaneseAnalyzer
1	alphabetical	alphabetical	alphabetical
2	list	list	list
3	free	free	of
4	public	public	free
5	domain	domain	public
6	datasets	datasets	domain
7	text	text	datasets
8	data	data	with
9	use	use	text
10	natural	natural	data
11	language	language	for
12	processing	processing	use
13	nlp	nlp	in
14	https	https	natural
15	github.com	github.com	language
16	niderhoff	niderhoff	processing
17	nlp	nlp	nlp
18	datasets	datasets	https
19			github
20			com
21			niderhoff
22			nlp
23			datasets

String 3:

春は曙。やうやう白くなりゆく山際、すこしあかりて、紫だちたる雲の細くたなびきたる。

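The three Analyzers handle Japanese very differently. StandardAnalyzer emits one token per character, CJKAnalyzer emits N-grams with N=2 (bigrams), and JapaneseAnalyzer emits words in their base form.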

No.	StandardAnalyzer	CJKAnalyzer	JapaneseAnalyzer
1	春	春は	春
2	は	は曙	曙
3	曙	やう	やうやう
4	や	うや	白い
5	う	やう	ゆく
6	や	う白	山際
7	う	白く	すこし
8	白	くな	かりる
9	く	なり	紫
10	な	りゆ	ちる
11	り	ゆく	雲
12	ゆ	く山	細い
13	く	山際	たなびく
14	山	すこ
15	際	こし
16	す	しあ
17	こ	あか
18	し	かり
19	あ	りて
20	か	紫だ
21	り	だち
22	て	ちた
23	紫	たる
24	だ	る雲
25	ち	雲の
26	た	の細
27	る	細く
28	雲	くた
29	の	たな
30	細	なび
31	く	びき
32	た	きた
33	な	たる
34	び
35	き
36	た
37	る
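The CJKAnalyzer column above can be approximated in plain Scala with a sliding window of two characters. This is only a sketch of the bigram idea, not Lucene's implementation: `BigramSketch` is a hypothetical name, and splitting on 「。」 and 「、」 is an assumption that covers this sample sentence only.

```scala
// Sketch of character bigrams (N-gram with N=2); not Lucene's CJK implementation.
object BigramSketch {
  // Split at the Japanese punctuation marks used in the sample,
  // then slide a two-character window over each segment.
  def bigrams(text: String): List[String] =
    text.split("[。、]").toList
      .filter(_.nonEmpty)
      .flatMap(segment => segment.sliding(2).toList)
}
```

For the start of the sample, `BigramSketch.bigrams("春は曙。やうやう")` gives 春は, は曙, やう, うや, やう, which matches rows 1 to 5 of the CJKAnalyzer column. No bigram spans the 「。」 because each segment is windowed separately.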

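The sbt run command prints the following. The output shows each token's type, its start and end offsets, and its position increment; an increment of 2 means that one token, such as a stop word, was removed just before it. With Kuromoji, the reading of each word is extracted as well.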

Hello, Lucene 7.2.1!
test 4 tokenizers
------------------------------
text = Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
analyzer = class org.apache.lucene.analysis.standard.StandardAnalyzer
    1	token=lorem	type=<ALPHANUM>	start=0	end=5	length=1	increment=1
    2	token=ipsum	type=<ALPHANUM>	start=6	end=11	length=1	increment=1
    3	token=dolor	type=<ALPHANUM>	start=12	end=17	length=1	increment=1
    4	token=sit	type=<ALPHANUM>	start=18	end=21	length=1	increment=1
    5	token=amet	type=<ALPHANUM>	start=22	end=26	length=1	increment=1
    6	token=consectetur	type=<ALPHANUM>	start=28	end=39	length=1	increment=1
    7	token=adipiscing	type=<ALPHANUM>	start=40	end=50	length=1	increment=1
    8	token=elit	type=<ALPHANUM>	start=51	end=55	length=1	increment=1
    9	token=sed	type=<ALPHANUM>	start=57	end=60	length=1	increment=1
    10	token=do	type=<ALPHANUM>	start=61	end=63	length=1	increment=1
    11	token=eiusmod	type=<ALPHANUM>	start=64	end=71	length=1	increment=1
    12	token=tempor	type=<ALPHANUM>	start=72	end=78	length=1	increment=1
    13	token=incididunt	type=<ALPHANUM>	start=79	end=89	length=1	increment=1
    14	token=ut	type=<ALPHANUM>	start=90	end=92	length=1	increment=1
    15	token=labore	type=<ALPHANUM>	start=93	end=99	length=1	increment=1
    16	token=et	type=<ALPHANUM>	start=100	end=102	length=1	increment=1
    17	token=dolore	type=<ALPHANUM>	start=103	end=109	length=1	increment=1
    18	token=magna	type=<ALPHANUM>	start=110	end=115	length=1	increment=1
    19	token=aliqua	type=<ALPHANUM>	start=116	end=122	length=1	increment=1
tokenized into 19 tokens
------------------------------
text = Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
analyzer = class org.apache.lucene.analysis.cjk.CJKAnalyzer
    1	token=lorem	type=<ALPHANUM>	start=0	end=5	length=1	increment=1
    2	token=ipsum	type=<ALPHANUM>	start=6	end=11	length=1	increment=1
    3	token=dolor	type=<ALPHANUM>	start=12	end=17	length=1	increment=1
    4	token=sit	type=<ALPHANUM>	start=18	end=21	length=1	increment=1
    5	token=amet	type=<ALPHANUM>	start=22	end=26	length=1	increment=1
    6	token=consectetur	type=<ALPHANUM>	start=28	end=39	length=1	increment=1
    7	token=adipiscing	type=<ALPHANUM>	start=40	end=50	length=1	increment=1
    8	token=elit	type=<ALPHANUM>	start=51	end=55	length=1	increment=1
    9	token=sed	type=<ALPHANUM>	start=57	end=60	length=1	increment=1
    10	token=do	type=<ALPHANUM>	start=61	end=63	length=1	increment=1
    11	token=eiusmod	type=<ALPHANUM>	start=64	end=71	length=1	increment=1
    12	token=tempor	type=<ALPHANUM>	start=72	end=78	length=1	increment=1
    13	token=incididunt	type=<ALPHANUM>	start=79	end=89	length=1	increment=1
    14	token=ut	type=<ALPHANUM>	start=90	end=92	length=1	increment=1
    15	token=labore	type=<ALPHANUM>	start=93	end=99	length=1	increment=1
    16	token=et	type=<ALPHANUM>	start=100	end=102	length=1	increment=1
    17	token=dolore	type=<ALPHANUM>	start=103	end=109	length=1	increment=1
    18	token=magna	type=<ALPHANUM>	start=110	end=115	length=1	increment=1
    19	token=aliqua	type=<ALPHANUM>	start=116	end=122	length=1	increment=1
tokenized into 19 tokens
------------------------------
text = Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
analyzer = class org.apache.lucene.analysis.ja.JapaneseAnalyzer
    1	token=lorem	type=word	start=0	end=5	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    2	token=ipsum	type=word	start=6	end=11	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    3	token=dolor	type=word	start=12	end=17	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    4	token=sit	type=word	start=18	end=21	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    5	token=amet	type=word	start=22	end=26	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    6	token=consectetur	type=word	start=28	end=39	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    7	token=adipiscing	type=word	start=40	end=50	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    8	token=elit	type=word	start=51	end=55	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    9	token=sed	type=word	start=57	end=60	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    10	token=do	type=word	start=61	end=63	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    11	token=eiusmod	type=word	start=64	end=71	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    12	token=tempor	type=word	start=72	end=78	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    13	token=incididunt	type=word	start=79	end=89	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    14	token=ut	type=word	start=90	end=92	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    15	token=labore	type=word	start=93	end=99	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    16	token=et	type=word	start=100	end=102	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    17	token=dolore	type=word	start=103	end=109	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    18	token=magna	type=word	start=110	end=115	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    19	token=aliqua	type=word	start=116	end=122	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
tokenized into 19 tokens
------------------------------
text = Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
analyzer = class org.apache.lucene.collation.ICUCollationKeyAnalyzer
    1	token=Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.	type=word	start=0	end=123	length=1	increment=1
tokenized into 1 tokens
------------------------------
text = Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) (https://github.com/niderhoff/nlp-datasets)
analyzer = class org.apache.lucene.analysis.standard.StandardAnalyzer
    1	token=alphabetical	type=<ALPHANUM>	start=0	end=12	length=1	increment=1
    2	token=list	type=<ALPHANUM>	start=13	end=17	length=1	increment=1
    3	token=free	type=<ALPHANUM>	start=21	end=25	length=1	increment=2
    4	token=public	type=<ALPHANUM>	start=26	end=32	length=1	increment=1
    5	token=domain	type=<ALPHANUM>	start=33	end=39	length=1	increment=1
    6	token=datasets	type=<ALPHANUM>	start=40	end=48	length=1	increment=1
    7	token=text	type=<ALPHANUM>	start=54	end=58	length=1	increment=2
    8	token=data	type=<ALPHANUM>	start=59	end=63	length=1	increment=1
    9	token=use	type=<ALPHANUM>	start=68	end=71	length=1	increment=2
    10	token=natural	type=<ALPHANUM>	start=75	end=82	length=1	increment=2
    11	token=language	type=<ALPHANUM>	start=83	end=91	length=1	increment=1
    12	token=processing	type=<ALPHANUM>	start=92	end=102	length=1	increment=1
    13	token=nlp	type=<ALPHANUM>	start=104	end=107	length=1	increment=1
    14	token=https	type=<ALPHANUM>	start=110	end=115	length=1	increment=1
    15	token=github.com	type=<ALPHANUM>	start=118	end=128	length=1	increment=1
    16	token=niderhoff	type=<ALPHANUM>	start=129	end=138	length=1	increment=1
    17	token=nlp	type=<ALPHANUM>	start=139	end=142	length=1	increment=1
    18	token=datasets	type=<ALPHANUM>	start=143	end=151	length=1	increment=1
tokenized into 18 tokens
------------------------------
text = Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) (https://github.com/niderhoff/nlp-datasets)
analyzer = class org.apache.lucene.analysis.cjk.CJKAnalyzer
    1	token=alphabetical	type=<ALPHANUM>	start=0	end=12	length=1	increment=1
    2	token=list	type=<ALPHANUM>	start=13	end=17	length=1	increment=1
    3	token=free	type=<ALPHANUM>	start=21	end=25	length=1	increment=2
    4	token=public	type=<ALPHANUM>	start=26	end=32	length=1	increment=1
    5	token=domain	type=<ALPHANUM>	start=33	end=39	length=1	increment=1
    6	token=datasets	type=<ALPHANUM>	start=40	end=48	length=1	increment=1
    7	token=text	type=<ALPHANUM>	start=54	end=58	length=1	increment=2
    8	token=data	type=<ALPHANUM>	start=59	end=63	length=1	increment=1
    9	token=use	type=<ALPHANUM>	start=68	end=71	length=1	increment=2
    10	token=natural	type=<ALPHANUM>	start=75	end=82	length=1	increment=2
    11	token=language	type=<ALPHANUM>	start=83	end=91	length=1	increment=1
    12	token=processing	type=<ALPHANUM>	start=92	end=102	length=1	increment=1
    13	token=nlp	type=<ALPHANUM>	start=104	end=107	length=1	increment=1
    14	token=https	type=<ALPHANUM>	start=110	end=115	length=1	increment=1
    15	token=github.com	type=<ALPHANUM>	start=118	end=128	length=1	increment=1
    16	token=niderhoff	type=<ALPHANUM>	start=129	end=138	length=1	increment=1
    17	token=nlp	type=<ALPHANUM>	start=139	end=142	length=1	increment=1
    18	token=datasets	type=<ALPHANUM>	start=143	end=151	length=1	increment=1
tokenized into 18 tokens
------------------------------
text = Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) (https://github.com/niderhoff/nlp-datasets)
analyzer = class org.apache.lucene.analysis.ja.JapaneseAnalyzer
    1	token=alphabetical	type=word	start=0	end=12	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    2	token=list	type=word	start=13	end=17	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    3	token=of	type=word	start=18	end=20	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    4	token=free	type=word	start=21	end=25	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    5	token=public	type=word	start=26	end=32	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    6	token=domain	type=word	start=33	end=39	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    7	token=datasets	type=word	start=40	end=48	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    8	token=with	type=word	start=49	end=53	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    9	token=text	type=word	start=54	end=58	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    10	token=data	type=word	start=59	end=63	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    11	token=for	type=word	start=64	end=67	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    12	token=use	type=word	start=68	end=71	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    13	token=in	type=word	start=72	end=74	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    14	token=natural	type=word	start=75	end=82	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    15	token=language	type=word	start=83	end=91	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    16	token=processing	type=word	start=92	end=102	length=1	increment=1
    	partOfSpeech=名詞-固有名詞-組織	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    17	token=nlp	type=word	start=104	end=107	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    18	token=https	type=word	start=110	end=115	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    19	token=github	type=word	start=118	end=124	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    20	token=com	type=word	start=125	end=128	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    21	token=niderhoff	type=word	start=129	end=138	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    22	token=nlp	type=word	start=139	end=142	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
    23	token=datasets	type=word	start=143	end=151	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=null	pronounciation=null	base=null	inflectionForm=null	inflectionType=null
tokenized into 23 tokens
------------------------------
text = Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) (https://github.com/niderhoff/nlp-datasets)
analyzer = class org.apache.lucene.collation.ICUCollationKeyAnalyzer
    1	token=Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) (https://github.com/niderhoff/nlp-datasets)	type=word	start=0	end=152	length=1	increment=1
tokenized into 1 tokens
------------------------------
text = 春は曙。やうやう白くなりゆく山際、すこしあかりて、紫だちたる雲の細くたなびきたる。
analyzer = class org.apache.lucene.analysis.standard.StandardAnalyzer
    1	token=春	type=<IDEOGRAPHIC>	start=0	end=1	length=1	increment=1
    2	token=は	type=<HIRAGANA>	start=1	end=2	length=1	increment=1
    3	token=曙	type=<IDEOGRAPHIC>	start=2	end=3	length=1	increment=1
    4	token=や	type=<HIRAGANA>	start=4	end=5	length=1	increment=1
    5	token=う	type=<HIRAGANA>	start=5	end=6	length=1	increment=1
    6	token=や	type=<HIRAGANA>	start=6	end=7	length=1	increment=1
    7	token=う	type=<HIRAGANA>	start=7	end=8	length=1	increment=1
    8	token=白	type=<IDEOGRAPHIC>	start=8	end=9	length=1	increment=1
    9	token=く	type=<HIRAGANA>	start=9	end=10	length=1	increment=1
    10	token=な	type=<HIRAGANA>	start=10	end=11	length=1	increment=1
    11	token=り	type=<HIRAGANA>	start=11	end=12	length=1	increment=1
    12	token=ゆ	type=<HIRAGANA>	start=12	end=13	length=1	increment=1
    13	token=く	type=<HIRAGANA>	start=13	end=14	length=1	increment=1
    14	token=山	type=<IDEOGRAPHIC>	start=14	end=15	length=1	increment=1
    15	token=際	type=<IDEOGRAPHIC>	start=15	end=16	length=1	increment=1
    16	token=す	type=<HIRAGANA>	start=17	end=18	length=1	increment=1
    17	token=こ	type=<HIRAGANA>	start=18	end=19	length=1	increment=1
    18	token=し	type=<HIRAGANA>	start=19	end=20	length=1	increment=1
    19	token=あ	type=<HIRAGANA>	start=20	end=21	length=1	increment=1
    20	token=か	type=<HIRAGANA>	start=21	end=22	length=1	increment=1
    21	token=り	type=<HIRAGANA>	start=22	end=23	length=1	increment=1
    22	token=て	type=<HIRAGANA>	start=23	end=24	length=1	increment=1
    23	token=紫	type=<IDEOGRAPHIC>	start=25	end=26	length=1	increment=1
    24	token=だ	type=<HIRAGANA>	start=26	end=27	length=1	increment=1
    25	token=ち	type=<HIRAGANA>	start=27	end=28	length=1	increment=1
    26	token=た	type=<HIRAGANA>	start=28	end=29	length=1	increment=1
    27	token=る	type=<HIRAGANA>	start=29	end=30	length=1	increment=1
    28	token=雲	type=<IDEOGRAPHIC>	start=30	end=31	length=1	increment=1
    29	token=の	type=<HIRAGANA>	start=31	end=32	length=1	increment=1
    30	token=細	type=<IDEOGRAPHIC>	start=32	end=33	length=1	increment=1
    31	token=く	type=<HIRAGANA>	start=33	end=34	length=1	increment=1
    32	token=た	type=<HIRAGANA>	start=34	end=35	length=1	increment=1
    33	token=な	type=<HIRAGANA>	start=35	end=36	length=1	increment=1
    34	token=び	type=<HIRAGANA>	start=36	end=37	length=1	increment=1
    35	token=き	type=<HIRAGANA>	start=37	end=38	length=1	increment=1
    36	token=た	type=<HIRAGANA>	start=38	end=39	length=1	increment=1
    37	token=る	type=<HIRAGANA>	start=39	end=40	length=1	increment=1
tokenized into 37 tokens
------------------------------
text = 春は曙。やうやう白くなりゆく山際、すこしあかりて、紫だちたる雲の細くたなびきたる。
analyzer = class org.apache.lucene.analysis.cjk.CJKAnalyzer
    1	token=春は	type=<DOUBLE>	start=0	end=2	length=1	increment=1
    2	token=は曙	type=<DOUBLE>	start=1	end=3	length=1	increment=1
    3	token=やう	type=<DOUBLE>	start=4	end=6	length=1	increment=1
    4	token=うや	type=<DOUBLE>	start=5	end=7	length=1	increment=1
    5	token=やう	type=<DOUBLE>	start=6	end=8	length=1	increment=1
    6	token=う白	type=<DOUBLE>	start=7	end=9	length=1	increment=1
    7	token=白く	type=<DOUBLE>	start=8	end=10	length=1	increment=1
    8	token=くな	type=<DOUBLE>	start=9	end=11	length=1	increment=1
    9	token=なり	type=<DOUBLE>	start=10	end=12	length=1	increment=1
    10	token=りゆ	type=<DOUBLE>	start=11	end=13	length=1	increment=1
    11	token=ゆく	type=<DOUBLE>	start=12	end=14	length=1	increment=1
    12	token=く山	type=<DOUBLE>	start=13	end=15	length=1	increment=1
    13	token=山際	type=<DOUBLE>	start=14	end=16	length=1	increment=1
    14	token=すこ	type=<DOUBLE>	start=17	end=19	length=1	increment=1
    15	token=こし	type=<DOUBLE>	start=18	end=20	length=1	increment=1
    16	token=しあ	type=<DOUBLE>	start=19	end=21	length=1	increment=1
    17	token=あか	type=<DOUBLE>	start=20	end=22	length=1	increment=1
    18	token=かり	type=<DOUBLE>	start=21	end=23	length=1	increment=1
    19	token=りて	type=<DOUBLE>	start=22	end=24	length=1	increment=1
    20	token=紫だ	type=<DOUBLE>	start=25	end=27	length=1	increment=1
    21	token=だち	type=<DOUBLE>	start=26	end=28	length=1	increment=1
    22	token=ちた	type=<DOUBLE>	start=27	end=29	length=1	increment=1
    23	token=たる	type=<DOUBLE>	start=28	end=30	length=1	increment=1
    24	token=る雲	type=<DOUBLE>	start=29	end=31	length=1	increment=1
    25	token=雲の	type=<DOUBLE>	start=30	end=32	length=1	increment=1
    26	token=の細	type=<DOUBLE>	start=31	end=33	length=1	increment=1
    27	token=細く	type=<DOUBLE>	start=32	end=34	length=1	increment=1
    28	token=くた	type=<DOUBLE>	start=33	end=35	length=1	increment=1
    29	token=たな	type=<DOUBLE>	start=34	end=36	length=1	increment=1
    30	token=なび	type=<DOUBLE>	start=35	end=37	length=1	increment=1
    31	token=びき	type=<DOUBLE>	start=36	end=38	length=1	increment=1
    32	token=きた	type=<DOUBLE>	start=37	end=39	length=1	increment=1
    33	token=たる	type=<DOUBLE>	start=38	end=40	length=1	increment=1
tokenized into 33 tokens
------------------------------
text = 春は曙。やうやう白くなりゆく山際、すこしあかりて、紫だちたる雲の細くたなびきたる。
analyzer = class org.apache.lucene.analysis.ja.JapaneseAnalyzer
    1	token=春	type=word	start=0	end=1	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=ハル	pronounciation=ハル	base=null	inflectionForm=null	inflectionType=null
    2	token=曙	type=word	start=2	end=3	length=1	increment=2
    	partOfSpeech=名詞-一般	reading=アケボノ	pronounciation=アケボノ	base=null	inflectionForm=null	inflectionType=null
    3	token=やうやう	type=word	start=4	end=8	length=1	increment=1
    	partOfSpeech=副詞-一般	reading=ヤウヤウ	pronounciation=ヨーヨー	base=null	inflectionForm=null	inflectionType=null
    4	token=白い	type=word	start=8	end=10	length=1	increment=1
    	partOfSpeech=形容詞-自立	reading=シロク	pronounciation=シロク	base=白い	inflectionForm=連用テ接続	inflectionType=形容詞・アウオ段
    5	token=ゆく	type=word	start=12	end=14	length=1	increment=2
    	partOfSpeech=動詞-非自立	reading=ユク	pronounciation=ユク	base=null	inflectionForm=基本形	inflectionType=五段・カ行促音便ユク
    6	token=山際	type=word	start=14	end=16	length=1	increment=1
    	partOfSpeech=名詞-一般	reading=ヤマギワ	pronounciation=ヤマギワ	base=null	inflectionForm=null	inflectionType=null
    7	token=すこし	type=word	start=17	end=20	length=1	increment=1
    	partOfSpeech=副詞-助詞類接続	reading=スコシ	pronounciation=スコシ	base=null	inflectionForm=null	inflectionType=null
    8	token=かりる	type=word	start=21	end=23	length=1	increment=2
    	partOfSpeech=動詞-自立	reading=カリ	pronounciation=カリ	base=かりる	inflectionForm=連用形	inflectionType=一段
    9	token=紫	type=word	start=25	end=26	length=1	increment=2
    	partOfSpeech=名詞-一般	reading=ムラサキ	pronounciation=ムラサキ	base=null	inflectionForm=null	inflectionType=null
    10	token=ちる	type=word	start=27	end=28	length=1	increment=2
    	partOfSpeech=動詞-自立	reading=チ	pronounciation=チ	base=ちる	inflectionForm=体言接続特殊２	inflectionType=五段・ラ行
    11	token=雲	type=word	start=30	end=31	length=1	increment=2
    	partOfSpeech=名詞-一般	reading=クモ	pronounciation=クモ	base=null	inflectionForm=null	inflectionType=null
    12	token=細い	type=word	start=32	end=34	length=1	increment=2
    	partOfSpeech=形容詞-自立	reading=ホソク	pronounciation=ホソク	base=細い	inflectionForm=連用テ接続	inflectionType=形容詞・アウオ段
    13	token=たなびく	type=word	start=34	end=38	length=1	increment=1
    	partOfSpeech=動詞-自立	reading=タナビキ	pronounciation=タナビキ	base=たなびく	inflectionForm=連用形	inflectionType=五段・カ行イ音便
tokenized into 13 tokens
------------------------------
text = 春は曙。やうやう白くなりゆく山際、すこしあかりて、紫だちたる雲の細くたなびきたる。
analyzer = class org.apache.lucene.collation.ICUCollationKeyAnalyzer
    1	token=春は曙。やうやう白くなりゆく山際、すこしあかりて、紫だちたる雲の細くたなびきたる。	type=word	start=0	end=41	length=1	increment=1
tokenized into 1 tokens
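The increment=2 values in the log above can be reproduced with a small model of stop-word removal. This is a sketch under the assumption that each removed token adds 1 to the increment of the next surviving token; `PositionIncrementSketch` is a hypothetical name and the code does not call Lucene.

```scala
// Model of how removing tokens produces position increments greater than 1.
// Illustrative only; it does not use Lucene.
object PositionIncrementSketch {
  // Returns each surviving token with its position increment.
  def withIncrements(tokens: List[String], stopWords: Set[String]): List[(String, Int)] = {
    val out = List.newBuilder[(String, Int)]
    var removed = 0 // tokens dropped since the last surviving token
    for (token <- tokens) {
      if (stopWords.contains(token)) {
        removed += 1
      } else {
        out += (token -> (removed + 1))
        removed = 0
      }
    }
    out.result()
  }
}
```

Calling `PositionIncrementSketch.withIncrements(List("alphabetical", "list", "of", "free"), Set("of"))` gives increments 1, 1, 2 for alphabetical, list, free, the same pattern as the first three StandardAnalyzer tokens for string 2.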
