
Mynavi Advent Calendar 2021

Using OpenSearch's search features from Python

Posted at 2021-12-11 (last updated at 2021-12-11)

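Introduction

This article is a brief summary of the workflow for inserting data into OpenSearch from Python and then retrieving it.
I have been learning OpenSearch mainly through Python, and I wrote this to organize what I have learned.
I hope it gives you a rough idea of how to work with OpenSearch from Python.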

Environment

I ran OpenSearch and OpenSearch Dashboards with Docker.
I also set the environment up so that kuromoji, the Japanese analysis plugin, is available.

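Preparing the data

As sample data to insert, I prepared the following list of fictional books.
The columns are as follows.
id : unique id
bookName : title of the book
description : summary of the book
genre : genre
publicationDate : publication date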

bookdata.csv

id,bookName,description,genre,publicationDate
a0001,pythonを学ぶ,python初心者向けの本です,プログラミング,2023-12-24
a0002,Go100選,GOしっかりと学びたい人向けです,プログラミング,2022-12-31
a0003,今年の冬はこれだ!,冬のイベント情報が詰まっています,旅行,2025-11-21
a0004,今年の秋はこれだ!,秋のイベント情報が詰まっています,旅行,2023-08-21
a0005,世界128の秘境,世界の128の秘境が画像とともに紹介されています,旅行,2027-02-21
a0006,日本の秘境６選,日本の秘境６つに焦点をあてて細く解説していきます,旅行,2022-04-03
a0007,三丁目の水野さん,三丁目の水野さんは八百屋さんにいくそうです,小説,2021-12-24
a0008,三丁目の水野さん2,三丁目の水野さんは京都に行くそうです,小説,2022-12-24
a0009,三丁目の水野さん3,三丁目の水野さんはイタリアにいくそうです,小説,2023-12-24
a0010,ライゼ旅物語,ベルエルン国で生まれたライゼの旅物語,小説,2025-03-01
a0011,水の美味しい飲み方,水の美味しい飲み方を紹介します,生活,2023-02-03
a0012,水の美味しい飲み方2,水の美味しい飲み方を紹介します,生活,2023-03-03

Inserting data into OpenSearch

The complete code is as follows.

insert.py

# About the opensearchpy module
import csv
from opensearchpy import OpenSearch
from setting import settings

# About the connection
host = "localhost"
port = 9200
auth = ("admin", "admin")
client = OpenSearch(
    hosts=[{"host": host, "port": port}],
    http_compress=True,
    http_auth=auth,
    use_ssl=True,
    verify_certs=False,  # skips certificate verification (local development only)
)

# Creating the index and configuring the data in the index
index_name = "bookinfo"
if not client.indices.exists(index=index_name):
    client.indices.create(index=index_name, body=settings)

# Inserting the data
doc_id = 1  # sequential document _id; named doc_id to avoid shadowing the built-in id()
with open("bookdata.csv", "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)  # each row becomes a dict keyed by the header line
    for data in reader:
        client.index(
            index=index_name,
            body=data,
            id=doc_id,
        )
        doc_id += 1
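
The loop above sends one request per row. As a minimal sketch (not part of the original script), the same rows could instead be sent in a single request with the `helpers.bulk` function from opensearch-py. `build_bulk_actions` and `load_books` are hypothetical helper names, and reusing the CSV `id` column as the document `_id` is an assumption; the original script uses a counter.

```python
import csv
from typing import Dict, Iterable, Iterator


def build_bulk_actions(rows: Iterable[Dict[str, str]], index_name: str) -> Iterator[dict]:
    """Yield one bulk action per CSV row (hypothetical helper)."""
    for row in rows:
        yield {
            "_index": index_name,
            # Assumption: reuse the CSV "id" column as the document _id.
            "_id": row["id"],
            "_source": row,
        }


def load_books(client, csv_path: str, index_name: str):
    """Send every row of the CSV file to OpenSearch in one bulk request."""
    # Imported here so build_bulk_actions can be tried without the package installed.
    from opensearchpy import helpers

    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        # helpers.bulk returns (number of successful actions, list of errors).
        return helpers.bulk(client, build_bulk_actions(reader, index_name))
```

With the `client` from insert.py, this would be called as `load_books(client, "bookdata.csv", "bookinfo")`. For a dozen rows the difference is negligible; the bulk form mainly pays off as the file grows.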

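About the opensearchpy module

This module lets you access OpenSearch from Python.
It has to be installed first (the package name is opensearch-py, for example pip install opensearch-py).
I used the opensearch-py client below.
https://github.com/opensearch-project/opensearch-py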

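About the connection

In the code above, client holds the connection settings.
I followed the page below for how to write the connection settings.
It seems many other options can be configured.
https://opensearch.org/docs/latest/clients/python/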
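
Both scripts in this article hard-code the same host, port, and credentials. As a rough sketch, those values could be read from environment variables instead. The variable names (`OPENSEARCH_HOST` and so on) and the helper `build_client_kwargs` are hypothetical; the defaults simply mirror insert.py.

```python
import os
from typing import Mapping, Optional


def build_client_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for OpenSearch(...) from environment variables.

    The variable names are hypothetical; the defaults mirror insert.py.
    """
    env = os.environ if env is None else env
    host = env.get("OPENSEARCH_HOST", "localhost")
    port = int(env.get("OPENSEARCH_PORT", "9200"))
    user = env.get("OPENSEARCH_USER", "admin")
    password = env.get("OPENSEARCH_PASSWORD", "admin")
    return {
        "hosts": [{"host": host, "port": port}],
        "http_compress": True,
        "http_auth": (user, password),
        "use_ssl": True,
        # Certificate verification stays disabled, as in the local setup of this article.
        "verify_certs": False,
    }


# Usage (requires opensearch-py and a running cluster):
#   from opensearchpy import OpenSearch
#   client = OpenSearch(**build_client_kwargs())
```

Keeping the settings in one function also means insert.py and search_book.py would not have to repeat them.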

Creating the index and configuring the data in the index

insert.py(1)

index_name = "bookinfo"
if not client.indices.exists(index=index_name):
    client.indices.create(index=index_name,body=settings)

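I named the index bookinfo.
The index is created only when it does not already exist.
The body argument holds the settings for the data in the index.
The mapping is defined with settings like the following.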

setting.py

settings = {
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "bookName": {
        "type": "text",
        "analyzer": "kuromoji"
      },
      "description": {
        "type": "text",
        "analyzer": "kuromoji"
      },
      "publicationDate": {
        "type": "date"
      }
    }
  }
}

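These are the settings for the data in the index.
type sets the data type of a field.
Available types include text, keyword, integer, float, date, and object.
Details are on the page below.
https://opensearch.org/docs/latest/search-plugins/sql/datatypes/
Setting analyzer to kuromoji makes Japanese full-text search possible.

The mapping is applied when the index is created, and the type or analyzer of an existing field cannot be changed in place. To update the mapping, delete the index once and create it again. Deleting the index also deletes its documents, so the data has to be inserted again afterwards.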

client.indices.delete(index="bookinfo")
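
The delete-then-create cycle described above can be wrapped in a small function. This is a minimal sketch, assuming the `client` and `settings` objects from insert.py; `recreate_index` is a hypothetical helper name, not part of opensearch-py.

```python
def recreate_index(client, index_name: str, body: dict) -> None:
    """Drop the index if it exists, then create it again with the given body.

    Warning: deleting an index also deletes every document stored in it.
    """
    if client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
    client.indices.create(index=index_name, body=body)


# Usage, with the objects from insert.py:
#   recreate_index(client, "bookinfo", settings)
```

After calling it, the insertion part of insert.py needs to be run again to repopulate the index.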

Checking the inserted data

I checked the index in OpenSearch Dashboards.
The data appears to have been inserted.
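
The same check can also be done from Python. The sketch below, assuming the `client` from insert.py, counts the documents in the index; `count_documents` is a hypothetical helper name.

```python
def count_documents(client, index_name: str) -> int:
    """Return the number of documents currently searchable in the index."""
    # Refresh first so documents indexed a moment ago are included in the count.
    client.indices.refresh(index=index_name)
    response = client.count(index=index_name)
    return response["count"]


# Usage, with the client from insert.py:
#   print(count_documents(client, "bookinfo"))
```

If the number matches the number of data rows in bookdata.csv, every row was indexed.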

Searching OpenSearch

I came up with a few example searches.
The complete code is as follows.

search_book.py

from opensearchpy import OpenSearch
import json

host = "localhost"
port = 9200
auth = ("admin", "admin")
client = OpenSearch(
    hosts=[{"host": host, "port": port}],
    http_compress=True,
    http_auth=auth,
    use_ssl=True,
    verify_certs=False,
)


# Search a single field
def search_book1(index_name: str, q: str):
    query = {
        "query": {
            "match": {"description": q}
        }
    }
    result = client.search(index=index_name, body=json.dumps(query))
    print(result)
    return result

search_book1("bookinfo", "三丁目")


# Search multiple fields
def search_book2(index_name: str, q: str, start: int, size: int):
    query = {
        "query": {
            "multi_match": {
                "fields": [
                    "bookName",
                    "description",
                    "genre"
                ],
                "query": q,
            }
        },
        "from": start,  # zero-based offset of the first hit to return
        "size": size,   # maximum number of hits to return
    }

    result = client.search(index=index_name, body=json.dumps(query))
    print(result)
    return result

# start=0 begins at the first hit (start=1 would skip it)
search_book2("bookinfo", "三丁目", 0, 5)


# Exact-match search by id
def search_book_detail(index_name: str, id: str):
    query = {
        "query": {"term": {"id": id}},
        # return only these fields from each hit
        "_source": [
            "bookName",
            "description",
            "genre",
            "publicationDate"
        ],
    }
    result = client.search(index=index_name, body=json.dumps(query))
    print(result)
    return result

search_book_detail("bookinfo", "a0001")


# Get books in descending date order
def get_new_book(index_name: str):
    query = {
        "query": {"match_all": {}},
        "sort": [{"publicationDate": {"order": "desc"}}],
        "size": 5,
    }
    result = client.search(index=index_name, body=json.dumps(query))
    print(result)
    return result

get_new_book("bookinfo")

The following line runs the search and retrieves the data.

search_book.py(1)

client.search(index=index_name, body=json.dumps(query))

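Searching a single field

This is the simplest search.
It searches a single field.
In the mapping, bookName and description are set to type text. genre is not listed in the mapping, so it is mapped dynamically (by default a string becomes a text field with a keyword sub-field).
Because the analyzer for bookName and description is kuromoji, Japanese text is split into words and can be searched.
As a result, a query such as a single hiragana character generally produces no hits, because matching is done on whole tokens rather than on individual characters.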
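
To see how kuromoji actually splits a string, the analyze API can be called from Python. The following is a minimal sketch, assuming the `client` from search_book.py and a cluster with the kuromoji plugin installed; `analyze_tokens` is a hypothetical helper name, not part of opensearch-py.

```python
from typing import List


def analyze_tokens(client, text: str, analyzer: str = "kuromoji") -> List[str]:
    """Return the tokens the given analyzer produces for the text."""
    response = client.indices.analyze(body={"analyzer": analyzer, "text": text})
    return [token["token"] for token in response["tokens"]]


# Usage, with the client from search_book.py:
#   print(analyze_tokens(client, "三丁目の水野さん"))
```

Running it on both a document's description and a query string shows whether they share any token. If the query string produces an empty token list, a match query on it cannot hit anything.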

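Searching multiple fields

This searches across multiple fields.
List the fields you want to search in fields.

Setting from and size controls which part of the hits is returned.

from : the zero-based offset of the first hit to return (if omitted, it is 0, so results start from the first hit)
size : the number of hits to return (if omitted, 10 hits are returned)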
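
Since from is an offset and not a page number, a page-based interface has to convert between the two. The sketch below builds the same multi_match query for a 1-based page number; `build_paged_query` is a hypothetical helper name.

```python
def build_paged_query(q: str, page: int, size: int = 10) -> dict:
    """Build a multi_match query for a 1-based page number (hypothetical helper)."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be 1 or greater")
    return {
        "query": {
            "multi_match": {
                "fields": ["bookName", "description", "genre"],
                "query": q,
            }
        },
        # "from" is a zero-based offset, so page 1 starts at offset 0.
        "from": (page - 1) * size,
        "size": size,
    }


# Usage, with the client from search_book.py:
#   result = client.search(index="bookinfo", body=build_paged_query("三丁目", page=1, size=5))
```

With size=5, page 1 maps to from=0, page 2 to from=5, and so on.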

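Exact-match search by id

I think this is the right approach for searching unique values such as an id.
It works because the field is mapped as keyword.
The term query returns the documents whose value matches exactly.
_source lets you specify which fields to return.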

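Getting results in descending date order

publicationDate is mapped with type date.
sort orders the returned data.
match_all is used to match every document. The number of documents actually returned is still limited by size.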
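
Because publicationDate is a date field, it can be used for range filtering as well as sorting. The sketch below is a hedged variation on get_new_book, not the original code: `build_sorted_query` is a hypothetical helper, and the optional `since` parameter (a YYYY-MM-DD string) is an addition for illustration.

```python
from typing import Optional


def build_sorted_query(order: str = "desc", size: int = 5, since: Optional[str] = None) -> dict:
    """Build a query sorted by publicationDate, optionally limited to dates >= since."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    if since is None:
        query = {"match_all": {}}
    else:
        # A range query works here because the field is mapped as a date.
        query = {"range": {"publicationDate": {"gte": since}}}
    return {
        "query": query,
        "sort": [{"publicationDate": {"order": order}}],
        "size": size,
    }


# Usage, with the client from search_book.py:
#   result = client.search(index="bookinfo", body=build_sorted_query(since="2023-01-01"))
```

Called without arguments, it produces the same query body as get_new_book.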

Retrieved data

When the code above is run, search_book1("bookinfo","三丁目"), for example,
returns a result like the following.

{'took': 3, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 2.39394, 'hits': [{'_index': 'bookinfo', '_type': '_doc', '_id': '8', '_score': 2.39394, '_source': {'id': 'a0008', 'bookName': '三丁目の水野さん2', 'description': '三丁目の水野さんは京都に行くそうです', 'genre': '小説', 'publicationDate': '2022-12-24'}}, {'_index': 'bookinfo', '_type': '_doc', '_id': '9', '_score': 2.39394, '_source': {'id': 'a0009', 'bookName': '三丁目の水野さん3', 'description': '三丁目の水野さんはイタリアにいくそうです', 'genre': '小説', 'publicationDate': '2023-12-24'}}, {'_index': 'bookinfo', '_type': '_doc', '_id': '7', '_score': 2.2460306, '_source': {'id': 'a0007', 'bookName': '三丁目の水野さん', 'description': '三丁目の水野さんは八百屋さんにいくそうです', 'genre': '小説', 'publicationDate': '2021-12-24'}}]}}

Post-processing this retrieved data opens up many possible uses.
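
As a minimal sketch of such post-processing, the function below pulls the documents out of a response shaped like the one shown above. `extract_books` is a hypothetical helper name, and attaching the score under a `score` key is a choice made for this example.

```python
from typing import Any, Dict, List


def extract_books(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return each hit's _source with its relevance score attached."""
    books = []
    for hit in result["hits"]["hits"]:
        book = dict(hit["_source"])  # copy so the response itself is not modified
        book["score"] = hit["_score"]
        books.append(book)
    return books


# Usage, with the functions from search_book.py:
#   for book in extract_books(search_book1("bookinfo", "三丁目")):
#       print(book["id"], book["bookName"], book["score"])
```

The resulting list of plain dicts is easy to pass on, for example to a web API response or a template.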

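Conclusion

I summarized a simple end-to-end workflow for working with OpenSearch from Python.
Beyond what is covered here, there is more to explore, such as finer-grained searches and alias settings.
I also found the explanations written in the code on GitHub easy to follow, so I plan to keep reading them.
https://github.com/opensearch-project/opensearch-py

Thank you for reading this far.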

References

https://opensearch.org/docs/latest/clients/python/
https://github.com/opensearch-project/opensearch-py
https://opensearch.org/docs/latest
