Verifying SchemaSimilarityFactory

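Hello, everyone.

In this post I take a look at SchemaSimilarityFactory, which was reportedly added in Solr 4.0, to see how it actually behaves.

Apologies that this is a somewhat dated topic.

For the investigation I prepared the following two variants of schema.xml. Pattern A sets SchemaSimilarityFactory as the global similarity and DefaultSimilarityFactory on the text_ja field type; Pattern B sets only DefaultSimilarityFactory as the global similarity.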

  • Pattern A
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="aucfan-db" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <similarity class="solr.DefaultSimilarityFactory" />
    </fieldType>
  </types>
  <fields>
    <field name="site_product_id" type="string" indexed="true" stored="true" required="true" />
    <field name="_version_" type="long" indexed="true" stored="true" />
    <field name="title" type="text_ja" indexed="true" stored="true" required="true" />
  </fields>
  <similarity class="solr.SchemaSimilarityFactory" />
</schema>
  • Pattern B
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="aucfan-db" version="1.5">
 <types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  </fieldType>
 </types>
 <fields>
   <field name="site_product_id" type="string" indexed="true" stored="true" required="true" />
   <field name="_version_" type="long" indexed="true" stored="true" />
   <field name="title" type="text_ja" indexed="true" stored="true" required="true" />
 </fields>
 <similarity class="solr.DefaultSimilarityFactory" />
</schema>

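I then indexed the documents and ran an arbitrary search with debugQuery enabled to compare the scores.

That procedure is well documented elsewhere, so I will skip the details here.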
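For readers who want to reproduce the comparison, a request with debugQuery enabled could be built roughly as follows. This is only a sketch: the host, port, core name `collection1`, and the `fl` and `wt` parameters are assumptions, not the original setup. Only the field `title`, the term `af`, and the use of `debugQuery` come from this article.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_debug_query_url(field, term, base_url=SOLR_SELECT_URL):
    """Return a select URL that asks Solr to include score explanations."""
    params = {
        "q": f"{field}:{term}",   # single-term query, e.g. title:af
        "fl": "*,score",          # return stored fields plus the score
        "debugQuery": "true",     # include the explain output per document
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_debug_query_url("title", "af")
print(url)
```

Sending the same URL to an index built with each schema and comparing the `explain` section of the two responses gives the comparison described here.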

The scores for the same document and query under each schema are shown below.

  • Pattern A
13.652491 = (MATCH) weight(title:af in 48112) [], result of:
  13.652491 = score(doc=48112,freq=2.0 = termFreq=2.0 ), product of:
    5.073794 = queryWeight, product of:
      5.073794 = idf(docFreq=8005, maxDocs=470589)
      1.0 = queryNorm
    2.6907854 = fieldWeight in 48112, product of:
      1.4142135 = tf(freq=2.0), with freq of:
        2.0 = termFreq=2.0
      5.073794 = idf(docFreq=8005, maxDocs=470589)
      0.375 = fieldNorm(doc=48112)
  • Pattern B
2.6907854 = (MATCH) weight(title:af in 48112) [DefaultSimilarity], result of:
  2.6907854 = fieldWeight in 48112, product of:
    1.4142135 = tf(freq=2.0), with freq of:
      2.0 = termFreq=2.0
    5.073794 = idf(docFreq=8005, maxDocs=470589)
    0.375 = fieldNorm(doc=48112)
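The two outputs can be checked by hand. The sketch below assumes the classic TF-IDF formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), and takes freq, docFreq, maxDocs and fieldNorm from the explain output. The function names are illustrative only. It shows that fieldWeight is identical in both patterns and that Pattern A simply multiplies it by an extra queryWeight equal to idf.

```python
import math


def tf(freq):
    """Classic TF-IDF term frequency factor: sqrt(freq)."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Classic TF-IDF inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Inputs copied from the explain output.
FREQ = 2.0
DOC_FREQ = 8005
MAX_DOCS = 470589
FIELD_NORM = 0.375

term_idf = idf(DOC_FREQ, MAX_DOCS)
field_weight = tf(FREQ) * term_idf * FIELD_NORM  # tf * idf * fieldNorm

score_b = field_weight                      # Pattern B reports fieldWeight only
score_a = (term_idf * 1.0) * field_weight   # Pattern A: (idf * queryNorm) * fieldWeight

print(f"idf         = {term_idf:.6f}")
print(f"fieldWeight = {field_weight:.6f}")
print(f"Pattern A   = {score_a:.6f}")
print(f"Pattern B   = {score_b:.6f}")
```

The computed values agree with the explain output to roughly four decimal places. Small differences are expected because Lucene works in single precision. The ratio between the two scores is exactly the idf of the term.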

The scores differ by a startling amount: Pattern A is about five times Pattern B.

What is going on?!

It turns out the Javadoc explains it:

> Users should be aware that in addition to supporting Similarity configurations specified on individual field types, this factory also differs in behavior from DefaultSimilarityFactory because of other differences in the implementations of PerFieldSimilarityWrapper and DefaultSimilarity - notably in methods such as Similarity.coord(int, int) and Similarity.queryNorm(float).
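The queryNorm difference mentioned in the quote is enough to account for the two scores. As a sketch, assume DefaultSimilarity computes queryNorm as 1 / sqrt(sumOfSquaredWeights), and that a single-term query with boost 1 has sumOfSquaredWeights = idf². The query weight then collapses to 1.0, whereas a constant queryNorm of 1.0 (the value printed in the Pattern A output) leaves the full idf in the query weight. The function names below are illustrative, not Lucene APIs.

```python
import math

# Values copied from the explain output.
IDF = 5.073794            # idf(docFreq=8005, maxDocs=470589)
FIELD_WEIGHT = 2.6907854  # fieldWeight in 48112


def default_query_norm(sum_of_squared_weights):
    """Assumed DefaultSimilarity behavior: 1 / sqrt(sumOfSquaredWeights)."""
    return 1.0 / math.sqrt(sum_of_squared_weights)


def constant_query_norm(sum_of_squared_weights):
    """Constant normalization, matching '1.0 = queryNorm' in Pattern A."""
    return 1.0


def term_score(query_norm_fn):
    """Score of a single-term query: queryWeight * fieldWeight."""
    sum_of_squared_weights = IDF ** 2          # one term, boost 1 (assumption)
    query_weight = IDF * query_norm_fn(sum_of_squared_weights)
    return query_weight * FIELD_WEIGHT


print("constant queryNorm:", term_score(constant_query_norm))
print("default queryNorm: ", term_score(default_query_norm))
```

With the constant norm the result is close to the Pattern A score, and with the default norm it matches the Pattern B score, which is consistent with the explanation in the Javadoc.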

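In other words, once SchemaSimilarityFactory is the global similarity, a field type configured with DefaultSimilarityFactory no longer scores the way it does under a global DefaultSimilarityFactory.

I have not found a workaround yet.

Judging from the source code, I suspect there may not be one.

For now, if you want the scoring behavior of DefaultSimilarityFactory, it is safer not to use SchemaSimilarityFactory.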
