
GroupingSearch in Lucene

Posted at 2018-04-29 (last updated at 2018-04-29)

I wanted to do something like MySQL's distinct in Lucene.
GroupingSearch looked like it could do something similar, so I tried it out.

Setup

lucene-analyzers-common-7.2.1
lucene-core-7.2.1
lucene-grouping-7.2.1

Add the JARs listed above to the project via "Add External JARs" in Eclipse.

Creating the data

First, load some data into Lucene.

 public void feed() {
    Analyzer analyzer = new WhitespaceAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    Path path = Paths.get("src/main/resources/test");
    // try-with-resources closes the writer and the directory even if indexing fails
    try (Directory directory = FSDirectory.open(path);
         IndexWriter writer = new IndexWriter(directory, iwc)) {

      for (int j = 0; j < 3; j++) {
        for (int i = 1; i < 4; i++) {
          // Create a Document
          Document doc = new Document();

          // Put the value "test text" into the text field
          doc.add(new TextField("text", "test text", Field.Store.YES));

          // Put the id value into the id field; this is the field we group on.
          // The grouping field must also be added as a SortedDocValuesField, otherwise grouping does not work
          doc.add(new StringField("id", "id_" + i, Field.Store.YES));
          doc.add(new SortedDocValuesField("id", new BytesRef("id_" + i)));

          // Put a value into the cnt field (indexed, stored, and as doc values for sorting)
          doc.add(new IntPoint("cnt", i + (j * 10)));
          doc.add(new StoredField("cnt", i + (j * 10)));
          doc.add(new NumericDocValuesField("cnt", i + (j * 10)));

          writer.addDocument(doc);
        }
      }

    } catch (IOException e) {
      e.printStackTrace();
      System.exit(1);
    }
 }

Conceptually, the index ends up holding the following data.

text	id	cnt
test text	id_1	1
test text	id_1	11
test text	id_1	21
test text	id_2	2
test text	id_2	12
test text	id_2	22
test text	id_3	3
test text	id_3	13
test text	id_3	23
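
The table above is ordered by id for readability. As a rough illustration in plain Java with no Lucene involved, the sketch below replays the same nested loop as feed() to show the order in which the documents are actually added. The class name FeedOrderSketch and the "id,cnt" string format are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it does not touch a Lucene index.
final class FeedOrderSketch {

  /** Returns "id,cnt" pairs in the order feed() would add them to the index. */
  static List<String> insertionOrder() {
    List<String> rows = new ArrayList<>();
    for (int j = 0; j < 3; j++) {      // j contributes 0, 10, or 20 to cnt
      for (int i = 1; i < 4; i++) {    // i is both the id suffix and the ones digit of cnt
        rows.add("id_" + i + "," + (i + (j * 10)));
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    insertionOrder().forEach(System.out::println);
  }
}
```

Documents sharing an id are therefore interleaved in the index rather than adjacent. Grouping on the id field by name, as done in this article, should not depend on them being adjacent.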

Searching with grouping

Now search this data using grouping.

 public void groupSearch() {
    Path path = Paths.get("src/main/resources/test");

    // Search query and sort order (cnt, descending)
    TermQuery tq = new TermQuery(new Term("text", "test"));
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    Query q = builder.add(tq, BooleanClause.Occur.MUST).build();
    Sort sort = new Sort(new SortField("cnt", SortField.Type.INT, true));

    // try-with-resources closes the reader and the directory even if the search fails
    try (Directory directory = FSDirectory.open(path);
         DirectoryReader reader = DirectoryReader.open(directory)) {
      IndexSearcher searcher = new IndexSearcher(reader);

      // Grouping search on the id field
      GroupingSearch gs = new GroupingSearch("id");

      // Sort order between groups
      gs.setGroupSort(sort);
      // Sort order within each group
      gs.setSortWithinGroup(sort);

      // Number of documents to return per group
      gs.setGroupDocsLimit(1);

      // Make the total number of groups available
      gs.setAllGroups(true);

      int groupOffset = 0;
      int groupLimit = 50;
      TopGroups<BytesRef> result = gs.search(searcher, q, groupOffset, groupLimit);
      GroupDocs<BytesRef>[] groupDocs = result.groups;

      System.out.println("start search");
      System.out.println("TotalHitCount:" + result.totalHitCount);
      System.out.println("TotalGroupCount:" + result.totalGroupCount);
      for (int i = 0; i < groupDocs.length; i++) {
        ScoreDoc[] groupHits = groupDocs[i].scoreDocs;
        for (int j = 0; j < groupHits.length; j++) {
          Document hitDoc = searcher.doc(groupHits[j].doc);
          System.out.println("-------------");
          System.out.println("id:" + hitDoc.get("id"));
          System.out.println("cnt:" + hitDoc.get("cnt"));
        }
      }
    } catch (IOException e) {
      e.printStackTrace();
      System.exit(1);
    }
  }

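The output is shown below. All 9 documents match the query, but only the document with the highest cnt is returned for each id, so it looks as if duplicates had been removed per id value.

Result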

start search
TotalHitCount:9
TotalGroupCount:3
-------------
id:id_3
cnt:23
-------------
id:id_2
cnt:22
-------------
id:id_1
cnt:21

If this part is changed to

 // Number of documents to return per group
 gs.setGroupDocsLimit(2);

then two documents are returned for each id value.

Result

start search
TotalHitCount:9
TotalGroupCount:3
-------------
id:id_3
cnt:23
-------------
id:id_3
cnt:13
-------------
id:id_2
cnt:22
-------------
id:id_2
cnt:12
-------------
id:id_1
cnt:21
-------------
id:id_1
cnt:11
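
To make the two outputs easier to reason about, here is a minimal plain-Java sketch that emulates what the search settings do to the sample data: group by id, sort each group by cnt descending, keep the first N rows per group, and order the groups by their top cnt. It does not use Lucene and is not how Lucene implements grouping. The names GroupingSketch, Row, sampleRows, and topPerGroup are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical emulation for illustration only; not Lucene's implementation.
final class GroupingSketch {

  /** One row of the sample data: the group key (id) and the sort value (cnt). */
  static final class Row {
    final String id;
    final int cnt;
    Row(String id, int cnt) { this.id = id; this.cnt = cnt; }
    @Override public String toString() { return id + ":" + cnt; }
  }

  /** The same nine rows that feed() indexes, in the same insertion order. */
  static List<Row> sampleRows() {
    List<Row> rows = new ArrayList<>();
    for (int j = 0; j < 3; j++) {
      for (int i = 1; i < 4; i++) {
        rows.add(new Row("id_" + i, i + (j * 10)));
      }
    }
    return rows;
  }

  /**
   * Groups rows by id, keeps the groupDocsLimit rows with the largest cnt in each
   * group, and orders the groups by their largest cnt, descending.
   */
  static List<List<Row>> topPerGroup(List<Row> rows, int groupDocsLimit) {
    if (groupDocsLimit < 1) {
      throw new IllegalArgumentException("groupDocsLimit must be at least 1");
    }
    Comparator<Row> byCntDesc = Comparator.comparingInt((Row r) -> r.cnt).reversed();

    // Collect rows per id, preserving first-seen order of the ids
    Map<String, List<Row>> groups = new LinkedHashMap<>();
    for (Row row : rows) {
      groups.computeIfAbsent(row.id, k -> new ArrayList<>()).add(row);
    }

    List<List<Row>> result = new ArrayList<>();
    for (List<Row> group : groups.values()) {
      group.sort(byCntDesc); // plays the role of the within-group sort
      int end = Math.min(groupDocsLimit, group.size());
      result.add(new ArrayList<>(group.subList(0, end)));
    }
    // Plays the role of the sort between groups: compare each group's top row
    result.sort((a, b) -> byCntDesc.compare(a.get(0), b.get(0)));
    return result;
  }

  public static void main(String[] args) {
    System.out.println(topPerGroup(sampleRows(), 1)); // [[id_3:23], [id_2:22], [id_1:21]]
    System.out.println(topPerGroup(sampleRows(), 2));
  }
}
```

With a limit of 1 the sketch yields one row per id, which is the distinct-like behavior described at the start. With a limit of 2 it yields the same six rows, in the same order, as the second result above.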
