Similarity search with Django's TrigramSimilarity #Python

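What is TrigramSimilarity?

I can't go into much detail here, but in short it is a feature that lets you search by similarity, using the pg_trgm module provided by PostgreSQL.
The pg_trgm documentation describes it as follows.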

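The pg_trgm module provides functions and operators for determining the similarity of text based on trigram matching, as well as index operator classes that support fast searching for similar strings.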
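
To make "trigram matching" concrete, here is a rough pure-Python sketch of the idea. It is not pg_trgm's actual implementation. It assumes the behavior described in the PostgreSQL documentation: each word is lowercased, padded with two leading spaces and one trailing space, and cut into three-character pieces, and the similarity is the number of shared trigrams divided by the number of distinct trigrams across both strings. The helper names `trigrams` and `similarity` are made up for this illustration.

```python
import re


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm's trigram extraction (simplified assumption)."""
    result: set[str] = set()
    # Split into runs of letters/digits; everything else is a word separator.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("cat")))                    # ['  c', ' ca', 'at ', 'cat']
print(round(similarity("word", "two words"), 3))  # 0.364
print(similarity("夢", "無限の夢"))                # 0.0
```

Under this simplified model, a short Japanese query such as 夢 shares no trigram with a longer title like 無限の夢, because a title written without spaces is treated as one long word. How the real extension handles non-ASCII text also depends on the database's locale and encoding, so actual scores may differ.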

Let's try it out, using the code in Django Docs - Trigram similarity as a reference.

Environment setup

First, create a simple model.

models.py

from django.db import models


class Book(models.Model):
    title = models.CharField(max_length=200)

Next, create some data.

Prepared data

[
"青い空の彼方に",
"無限の夢",
"夜明けの詩",
"心の声を聞いて",
"未来への扉",
"時の流れに身を任せて",
"星降る夜に",
"静かな湖畔で",
"遠い日の思い出",
"秘密の庭",
"風のささやき",
"月明かりの下で",
"終わりなき旅",
"虹の彼方へ",
"永遠の約束",
"海辺の物語",
"夢見る頃を過ぎても",
"静寂の森で",
"希望の光",
"忘れられた時間",
"草原の風",
"季節の変わり目",
"夢追い人",
"深い森の中で",
"流れる雲",
"春の訪れ",
"夕暮れのメロディ",
"思い出のアルバム",
"光と影の間で",
"新しい世界へ",
"儚い夢",
"巡り会い",
"雪の降る街",
"夏の夜の夢",
"孤独な星",
"海の見える丘",
"終わらない歌",
"幻想の森",
"記憶のかけら",
"運命の輪",
"星空の下で",
"秘密の約束",
"月の光に照らされて",
"過去への旅",
"未知の世界へ",
"奇跡の瞬間",
"一瞬の輝き",
"永遠の夢",
"深い海の底で",
"心の旅路",
"未来への航海",
"風の詩",
"時の果てで",
"夜空の星たち",
"風の行方",
"春の息吹",
"夢幻の世界",
"森のささやき",
"遥かなる道",
"永遠の愛",
"青い鳥を探して",
"夜の静寂",
"月夜の物語",
"時の砂",
"夢の途中",
"遠い未来へ",
"希望の道",
"雪原の彼方に",
"星の輝き",
"風の記憶",
"夕焼けの街",
"静かな夜に",
"新しい出会い",
"夢の彼方",
"時の旅人",
"深い森の静けさ",
"遠い約束",
"未来の記憶",
"心の絆",
"月夜の静寂",
"青い空の下で",
"風のささやき",
"夜の散歩",
"星のささやき",
"忘れられた道",
"夢見る時間",
"未来への希望",
"深い眠り",
"星降る夜の奇跡",
"青い海の彼方に",
"静かな時間",
"風のささやき",
"夜明けの光",
"夢追い人の歌",
"未来への扉",
"希望の星",
"永遠の輝き",
"Beyond the Blue Sky",
"Infinite Dreams",
"Poem at Dawn",
"Listening to the Heart",
"Door to the Future",
"Going with the Flow of Time",
"On a Starry Night",
"By the Quiet Lakeside",
"Memories of Distant Days",
"The Secret Garden",
"Whisper of the Wind",
"Under the Moonlight",
"Endless Journey",
"Beyond the Rainbow",
"Eternal Promise",
"Tales by the Seaside",
"Even After Dreaming",
"In the Silent Forest",
"Light of Hope",
"Forgotten Time",
"Wind on the Prairie",
"Changing Seasons",
"Dream Chaser",
"In the Deep Forest",
"Flowing Clouds",
"Arrival of Spring",
"Melody at Dusk",
"Album of Memories",
"Between Light and Shadow",
"To a New World",
"Fleeting Dreams",
"Meeting Again",
"Snowy Town",
"Summer Night's Dream",
"Lonely Star",
"Hill with an Ocean View",
"Endless Song",
"Enchanted Forest",
"Fragments of Memory",
"Wheel of Fate",
"Under the Starry Sky",
"Secret Promise",
"Illuminated by Moonlight",
"Journey to the Past",
"To the Unknown World",
"Moment of Miracle",
"Moment of Brilliance",
"Eternal Dreams",
"At the Bottom of the Deep Sea",
"Journey of the Heart",
"Voyage to the Future",
"Poem of the Wind",
"At the Edge of Time",
"Stars in the Night Sky",
"Path of the Wind",
"Breath of Spring",
"World of Fantasy",
"Whispers of the Forest",
"Distant Road",
"Eternal Love",
"Searching for the Bluebird",
"Silence of the Night",
"Story of the Moonlit Night",
"Sand of Time",
"In the Midst of Dreams",
"To the Far Future",
"Path of Hope",
"Beyond the Snowfield",
"Radiance of the Stars",
"Memory of the Wind",
"Town at Sunset",
"In the Quiet Night",
"New Encounter",
"Beyond the Dream",
"Traveler of Time",
"Quietude of the Deep Forest",
"Distant Promise",
"Memory of the Future",
"Bonds of the Heart",
"Stillness of the Moonlit Night",
"Under the Blue Sky",
"Whispers of the Wind",
"Night Walk",
"Murmur of the Stars",
"Forgotten Path",
"Dreaming Time",
"Hope for the Future",
"Deep Sleep",
"Miracle on a Starry Night",
"Beyond the Blue Ocean",
"Quiet Time",
"Whispers of the Wind",
"Light at Dawn",
"Song of the Dream Chaser",
"Door to the Future",
"Star of Hope",
"Eternal Radiance"
]

However, trying to use TrigramSimilarity as-is at this point raises an error.

function similarity(character varying, unknown) does not exist

Running the following command in pgAdmin's Query Tool installs the extension, after which it works.

CREATE EXTENSION pg_trgm;
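
As an alternative to running the SQL by hand, Django ships a migration operation, `TrigramExtension`, that enables pg_trgm when you run `migrate`. The following is a minimal sketch. The app label `books` and the dependency on `0001_initial` are assumptions for illustration, so adjust them to your project. The database user also needs permission to create extensions.

```python
from django.contrib.postgres.operations import TrigramExtension
from django.db import migrations


class Migration(migrations.Migration):
    # Hypothetical dependency: point this at your own app's latest migration.
    dependencies = [
        ("books", "0001_initial"),
    ]

    operations = [
        # Enables the pg_trgm extension in the target database.
        TrigramExtension(),
    ]
```

Keeping this in a migration means every environment (local, CI, production) gets the extension without a manual step.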

Searching with TrigramSimilarity

views.py

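from typing import Any

from django.contrib.postgres.search import TrigramSimilarity
from django.views.generic import TemplateView

from .models import Book


class IndexView(TemplateView):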
    template_name = "index.html"

    def get_context_data(self, **kwargs: Any) -> dict[str, Any]:
        context = super().get_context_data(**kwargs)

        q = self.request.GET.get("q", "")

        search_results = (
            Book.objects.annotate(
                title_similarity=TrigramSimilarity("title", q),
            )
            .filter(title_similarity__gt=0.1)  # threshold for the title similarity score
            .order_by("-title_similarity")
        )
        context.update(
            {
                "search_results": search_results[:10],
            }
        )

        return context

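The threshold is set to 0.1 here, but that is an arbitrary value, so you will need to tune it for real use.
The higher the threshold, the more similar a title has to be to match (and the fewer results you get).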
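
To get a feel for how the threshold affects the result set without a database, the sketch below imitates the annotate, filter, order_by pipeline in plain Python. The scoring function is a simplified approximation of pg_trgm (an assumption, not the real extension), and `trigram_set`, `score`, and `search` are hypothetical helper names. The titles are taken from the prepared data.

```python
import re


def trigram_set(text: str) -> set[str]:
    """Simplified stand-in for pg_trgm's trigram extraction."""
    grams: set[str] = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def score(title: str, query: str) -> float:
    """Shared trigrams divided by all distinct trigrams."""
    a, b = trigram_set(title), trigram_set(query)
    return len(a & b) / len(a | b) if a and b else 0.0


def search(titles: list[str], query: str, threshold: float) -> list[tuple[str, float]]:
    """Score every title, keep those above the threshold, best match first."""
    scored = [(title, score(title, query)) for title in titles]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


TITLES = [
    "Door to the Future",
    "Voyage to the Future",
    "Hope for the Future",
    "Deep Sleep",
    "Night Walk",
]

for threshold in (0.1, 0.5, 0.9):
    hits = search(TITLES, "door to the future", threshold)
    print(threshold, [title for title, _ in hits])
```

With these five titles and this simplified scoring, raising the threshold from 0.1 to 0.5 to 0.9 narrows the hits from three titles to two to one.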

Experiment

I tried it out with a variety of inputs.

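Closing thoughts

Having actually tried it, I found it interesting because it lets you search from a different angle than an ordinary search.
The English results seem reasonably good, but the Japanese results felt unstable.
I suspect a fair amount of this also depends on how the threshold is set.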