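Hi everyone.
I'd like to take a look at SchemaSimilarityFactory, which is said to have been added in Solr 4.0, and see how it actually behaves.
Sorry that this is a slightly old topic...
For the investigation I prepared the following two versions of schema.xml.
- Pattern A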
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="aucfan-db" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <similarity class="solr.DefaultSimilarityFactory" />
    </fieldType>
  </types>
  <fields>
    <field name="site_product_id" type="string" indexed="true" stored="true" required="true" />
    <field name="_version_" type="long" indexed="true" stored="true" />
    <field name="title" type="text_ja" indexed="true" stored="true" required="true" />
  </fields>
  <similarity class="solr.SchemaSimilarityFactory" />
</schema>
- Pattern B
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="aucfan-db" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
    </fieldType>
  </types>
  <fields>
    <field name="site_product_id" type="string" indexed="true" stored="true" required="true" />
    <field name="_version_" type="long" indexed="true" stored="true" />
    <field name="title" type="text_ja" indexed="true" stored="true" required="true" />
  </fields>
  <similarity class="solr.DefaultSimilarityFactory" />
</schema>
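After that, I indexed the documents, ran an arbitrary search with debugQuery enabled, and compared the scores.
I assume everyone already has the know-how for this part, so I'll skip the details.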
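For anyone who wants a starting point anyway, the request can be sketched roughly as below. This is only a minimal sketch: the host, port, core name (`collection1`) and the query term are assumptions for illustration, not the actual environment used here. `debugQuery=true` is the standard Solr parameter that adds the per-document explain output to the response.

```python
import json
import urllib.parse
import urllib.request


def build_debug_url(base_url, query, rows=10):
    """Build a /select URL that asks Solr for score explanations."""
    params = {
        "q": query,
        "fl": "site_product_id,title,score",
        "rows": rows,
        "wt": "json",
        "debugQuery": "true",  # adds the "debug" section with per-document explain
    }
    return base_url.rstrip("/") + "/select?" + urllib.parse.urlencode(params)


def fetch_explain(base_url, query):
    """Return the {document id: explain text} map (needs a running Solr)."""
    with urllib.request.urlopen(build_debug_url(base_url, query)) as resp:
        body = json.load(resp)
    return body["debug"]["explain"]


if __name__ == "__main__":
    # Hypothetical local core; adjust to your own environment.
    print(build_debug_url("http://localhost:8983/solr/collection1", "title:af"))
```

Running the same query against a core built with each schema and comparing the explain text for one document is enough to reproduce the comparison.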
The score comparison came out as follows.
- Pattern A
13.652491 = (MATCH) weight(title:af in 48112) [], result of:
  13.652491 = score(doc=48112,freq=2.0 = termFreq=2.0 ), product of:
    5.073794 = queryWeight, product of:
      5.073794 = idf(docFreq=8005, maxDocs=470589)
      1.0 = queryNorm
    2.6907854 = fieldWeight in 48112, product of:
      1.4142135 = tf(freq=2.0), with freq of:
        2.0 = termFreq=2.0
      5.073794 = idf(docFreq=8005, maxDocs=470589)
      0.375 = fieldNorm(doc=48112)
- Pattern B
2.6907854 = (MATCH) weight(title:af in 48112) [DefaultSimilarity], result of:
  2.6907854 = fieldWeight in 48112, product of:
    1.4142135 = tf(freq=2.0), with freq of:
      2.0 = termFreq=2.0
    5.073794 = idf(docFreq=8005, maxDocs=470589)
    0.375 = fieldNorm(doc=48112)
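To make the numbers easier to read, here is a small sketch that recomputes the parts both patterns share. It assumes the classic DefaultSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the fieldNorm is taken straight from the explain output rather than recomputed, since it is a lossy stored value.

```python
import math

# Values read off the explain output above.
FREQ = 2.0
DOC_FREQ = 8005
MAX_DOCS = 470589
FIELD_NORM = 0.375  # stored norm, used as-is


def tf(freq):
    # Assumed DefaultSimilarity term-frequency factor.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Assumed DefaultSimilarity inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as listed in both explain outputs.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


if __name__ == "__main__":
    print("tf          =", tf(FREQ))
    print("idf         =", idf(DOC_FREQ, MAX_DOCS))
    print("fieldWeight =", field_weight(FREQ, DOC_FREQ, MAX_DOCS, FIELD_NORM))
    # Ratio of the two reported scores (Pattern A / Pattern B).
    print("score ratio =", 13.652491 / 2.6907854)
```

The fieldWeight is identical in both patterns, and the ratio of the two scores is the idf itself, so the whole gap comes from the queryWeight part.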
Wow, the two scores are frighteningly different.
What is going on!?
...or so I thought, but it turns out this is written in the Javadoc.
> Users should be aware that in addition to supporting Similarity configurations specified on individual field types, this factory also differs in behavior from DefaultSimilarityFactory because of other differences in the implementations of PerFieldSimilarityWrapper and DefaultSimilarity - notably in methods such as Similarity.coord(int, int) and Similarity.queryNorm(float).
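In other words, once you use SchemaSimilarityFactory you cannot get the same scoring as a global DefaultSimilarityFactory: even with DefaultSimilarityFactory set on the field type, coord and queryNorm are handled differently.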
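The effect of queryNorm alone can be illustrated with a toy model. This is not Lucene's code, just a sketch under a few assumptions: a single-term query with boost 1.0, a DefaultSimilarity-style queryNorm of 1 / sqrt(sumOfSquaredWeights), and a constant queryNorm of 1.0 on the SchemaSimilarityFactory side, which is what the Pattern A explain output reports. The function names are made up for this example.

```python
import math


def default_query_norm(sum_of_squared_weights):
    # DefaultSimilarity-style normalization (assumed formula).
    return 1.0 / math.sqrt(sum_of_squared_weights)


def constant_query_norm(sum_of_squared_weights):
    # Matches the "1.0 = queryNorm" line in the Pattern A explain output.
    return 1.0


def term_score(idf, field_weight, query_norm_fn):
    # Single term, boost 1.0: the sum of squared weights is just idf^2.
    query_weight = idf * query_norm_fn(idf ** 2)
    return query_weight * field_weight


if __name__ == "__main__":
    idf, fw = 5.073794, 2.6907854  # values from the explain output
    print("constant queryNorm:", term_score(idf, fw, constant_query_norm))
    print("default queryNorm: ", term_score(idf, fw, default_query_norm))
```

With the default normalization the queryWeight collapses to 1.0 and the score equals the fieldWeight (Pattern B); with a constant 1.0 the idf is multiplied in a second time (Pattern A).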
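I haven't found a workaround yet...
From reading the source, I have a feeling there isn't a clean way to make it work.
For the time being, if you want DefaultSimilarityFactory scoring, you are better off not using SchemaSimilarityFactory.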