nlp4j-local-search: Local Full-Text and Vector Search from Python, Powered by Apache Lucene

Posted at 2026-06-14

Introduction

When working on NLP or RAG experiments in Python, I often want a simple way to search local data.

For example:

Search a small set of local documents
Search JSONL data
Try Japanese full-text search
Try English keyword search
Build a quick RAG prototype
Avoid setting up Elasticsearch, OpenSearch, Solr, or Docker
Try vector search locally

For this purpose, I created nlp4j-local-search.

It is a Python package that provides local search backed by Apache Lucene.

The package is now available on PyPI:

pip install nlp4j-local-search

What is nlp4j-local-search?

nlp4j-local-search is a lightweight local search library for Python.

Internally, it uses Apache Lucene.

The main features are:

Use Lucene from Python
No Elasticsearch required
No OpenSearch required
No Solr required
No Docker required
Local full-text search
Japanese keyword search
English keyword search
Vector search
Useful for NLP and RAG experiments

The goal is not to replace a production search engine cluster.

Instead, the goal is to make local search easy when you want to run quick experiments from Python.

Installation

You can install it from PyPI.

pip install nlp4j-local-search

Since this package uses Java-based components internally, a Java runtime environment is required.
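Before installing, it can be handy to confirm that a Java runtime is reachable. The following is a minimal sketch using only the Python standard library. It assumes the runtime is exposed as a java executable on PATH; how nlp4j-local-search actually locates Java, and which Java version it needs, is not covered here. The helper names find_java and java_version_banner are my own.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on PATH, or None if missing."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the text printed by `java -version` (Java writes it to stderr)."""
    completed = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (completed.stderr or completed.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("No `java` executable found on PATH; install a Java runtime first.")
    else:
        print(f"Found Java at: {java_path}")
        print(java_version_banner(java_path))
```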

Basic English Keyword Search

Here is a simple English keyword search example.

from nlp4j_local_search import SearchEngine


def test_text_search():
    print("=== Search test ===")

    with SearchEngine("en") as search:
        search.add("1", "I run every morning")
        search.add("2", "She runs every day")
        search.add("3", "He is running in the park")
        search.add("4", "This document is about search engines")
        search.commit()

        results = search.search("run", 10)
        print(f"number of results: {len(results)}")
        for i, result in enumerate(results):
            print(f"result[{i}].id: {result.id}")
            print(f"result[{i}].body: {result.body}")
            print(f"result[{i}].score: {result.score}")

        assert len(results) == 3, f"Expected 3 results, got {len(results)}"
        print("✓ OK\n")


if __name__ == "__main__":
    test_text_search()

Run the script:

python tests/test_search_en.py

Example output:

=== Search test ===
number of results: 3
result[0].id: 3
result[0].body: He is running in the park
result[0].score: 0.17657174170017242
result[1].id: 1
result[1].body: I run every morning
result[1].score: 0.15782077610492706
result[2].id: 2
result[2].body: She runs every day
result[2].score: 0.15782077610492706
✓ OK


In this example, the query is:

results = search.search("run", 10)

The indexed documents are:

search.add("1", "I run every morning")
search.add("2", "She runs every day")
search.add("3", "He is running in the park")
search.add("4", "This document is about search engines")

The query run matches:

I run every morning
She runs every day
He is running in the park

It does not match:

This document is about search engines

This is a good example of English keyword search because it shows how word forms such as run, runs, and running can be handled by the search analyzer.
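To make the idea of word-form normalization concrete, here is a toy sketch in plain Python. It is not the analyzer used by nlp4j-local-search or Lucene. The functions toy_stem and toy_analyze are hypothetical helpers that strip a couple of common suffixes, which is just enough to show why run, runs, and running can end up as the same index term.

```python
def toy_stem(word):
    """Very rough suffix stripping; an illustration only, not a real stemmer."""
    word = word.lower()
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # "running" -> "runn" -> "run": collapse a doubled final consonant.
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word


def toy_analyze(text):
    """Split on whitespace and stem each token (toy_stem also lowercases)."""
    return [toy_stem(token) for token in text.split()]


documents = {
    "1": "I run every morning",
    "2": "She runs every day",
    "3": "He is running in the park",
    "4": "This document is about search engines",
}

# The query goes through the same normalization as the documents.
query_term = toy_stem("run")
matching_ids = [
    doc_id for doc_id, body in documents.items()
    if query_term in toy_analyze(body)
]
print(matching_ids)  # ['1', '2', '3']
```

A real stemmer handles far more cases than this, and the toy rules mangle some words (for example, "this" becomes "thi"). The point is only that documents and queries are reduced to the same normalized terms before matching.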

Why the English Search Example Uses run / runs / running

For Japanese search, a good example is often something like:

京都
東京都

With simple substring matching, a search for 京都 (Kyoto) also matches 東京都 (Tokyo Metropolis), because the string 東京都 contains the characters 京都.

Depending on the use case, that can be unwanted noise.
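The difference between substring matching and token-based matching can be sketched in a few lines of plain Python. The token lists below are segmented by hand purely for illustration. A real Japanese analyzer produces tokens automatically and may segment the text differently.

```python
documents = {
    "1": "東京都は日本の都道府県のひとつです",
    "2": "京都は日本の都市です。",
}

query = "京都"

# Plain substring matching: `in` only checks whether the characters appear
# consecutively, so it cannot tell 京都 apart from the 京都 inside 東京都.
substring_hits = [doc_id for doc_id, body in documents.items() if query in body]
print(substring_hits)  # ['1', '2']

# Hand-written word segmentation (an assumption for illustration only).
tokenized = {
    "1": ["東京都", "は", "日本", "の", "都道府県", "の", "ひとつ", "です"],
    "2": ["京都", "は", "日本", "の", "都市", "です"],
}

# Token matching: the query must equal a whole token, so 東京都 is not a hit.
token_hits = [doc_id for doc_id, tokens in tokenized.items() if query in tokens]
print(token_hits)  # ['2']
```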

For English, however, words are usually separated by spaces. Therefore, substring-based examples are less interesting.

A more typical English search example is word form normalization.

For example:

run
runs
running

These words are different surface forms, but they are closely related.

So, for an English search example, the following dataset is easier to understand:

search.add("1", "I run every morning")
search.add("2", "She runs every day")
search.add("3", "He is running in the park")
search.add("4", "This document is about search engines")

Then search with:

results = search.search("run", 10)

This example demonstrates that nlp4j-local-search is not just doing a naive substring search.

It behaves more like a search engine.

Basic Japanese Keyword Search

Japanese search is also supported.

from nlp4j_local_search import SearchEngine

with SearchEngine("ja") as search:
    search.add("1", "東京都は日本の都道府県のひとつです")
    search.add("2", "京都は日本の都市です。")
    search.add("3", "京都市には任天堂の本社があります")
    search.add_json({
        "id": "4",
        "body": "京都府は広いです"
    })

    search.commit()

    results = search.search("京都", 10)

    for i, result in enumerate(results):
        print(f"result[{i}].id: {result.id}")
        print(f"result[{i}].body: {result.body}")
        print(f"result[{i}].score: {result.score}")

The first argument of SearchEngine specifies the language.

SearchEngine("ja")

In this example, "ja" means Japanese search.

Documents can be added in two ways.

The first way is to add an ID and body text directly.

search.add("1", "東京都は日本の都道府県のひとつです")

The second way is to add a JSON-like dictionary.

search.add_json({
    "id": "4",
    "body": "京都府は広いです"
})

After adding documents, call commit().

search.commit()

Then you can search.

results = search.search("京都", 10)

The first argument is the query string, and the second argument is the maximum number of results.
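Since add_json takes a dictionary with id and body, JSONL data can be prepared for it with a small loader. The sketch below uses only the standard library and does not call nlp4j-local-search itself. The function name iter_jsonl_documents is hypothetical, and it assumes each JSONL line carries id and body fields like the dictionary above.

```python
import io
import json


def iter_jsonl_documents(lines):
    """Yield one dict per non-empty JSONL line, keeping only `id` and `body`."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if "id" not in record or "body" not in record:
            raise ValueError(f"line {line_number}: expected 'id' and 'body' keys")
        # IDs in the examples are strings, so normalize numeric IDs as well.
        yield {"id": str(record["id"]), "body": record["body"]}


# In-memory stand-in for a JSONL file, so the sketch runs without any files.
sample = io.StringIO(
    '{"id": "1", "body": "京都は日本の都市です。"}\n'
    "\n"
    '{"id": 2, "body": "京都府は広いです"}\n'
)

documents = list(iter_jsonl_documents(sample))
print(documents)
```

Each resulting dictionary could then be passed to search.add_json(...) inside the with block, followed by a single search.commit(). For a real file, an open file object can be passed in place of the StringIO.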

Vector Search

nlp4j-local-search also supports vector search.

To use vector search, specify vector_dimension when creating SearchEngine.

from nlp4j_local_search import SearchEngine

with SearchEngine("ja", vector_dimension=2) as search:
    search.add("1_East", [1.0, 0.0])
    search.add("2_North", [1.0, 1.0])
    search.add("3_West", [-1.0, 0.0])
    search.add("4_South", [-1.0, -1.0])

    search.commit()

    results = search.search([0.9, 0.1], 10)

    for i, result in enumerate(results):
        print(f"result[{i}].id: {result.id}")
        print(f"result[{i}].body: {result.body}")
        print(f"result[{i}].score: {result.score}")
        print("---")

In this example, we use two-dimensional vectors.

search.add("1_East", [1.0, 0.0])
search.add("2_North", [1.0, 1.0])
search.add("3_West", [-1.0, 0.0])
search.add("4_South", [-1.0, -1.0])

Then we search with the following query vector:

results = search.search([0.9, 0.1], 10)

The vector [0.9, 0.1] is close to [1.0, 0.0].

Therefore, the expected top result is:

assert results[0].id == "1_East"

This makes it possible to try simple nearest-neighbor search locally.

If you generate embeddings from your own model, you can store those vectors and search for similar items.
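To see why 1_East is expected to rank first, the ranking can be reproduced by brute force in plain Python. This is only a sketch of the nearest-neighbor idea. Which similarity function nlp4j-local-search uses internally is not stated in this article, so the example computes both cosine similarity and Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = {
    "1_East": [1.0, 0.0],
    "2_North": [1.0, 1.0],
    "3_West": [-1.0, 0.0],
    "4_South": [-1.0, -1.0],
}
query = [0.9, 0.1]

# Higher cosine similarity is better, so sort in descending order.
by_cosine = sorted(
    vectors,
    key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
    reverse=True,
)
# Smaller Euclidean distance is better, so sort in ascending order.
by_distance = sorted(
    vectors,
    key=lambda doc_id: euclidean_distance(query, vectors[doc_id]),
)

print(by_cosine[0], by_distance[0])  # 1_East 1_East
```

Both measures put 1_East first here, although they order the two farthest vectors differently, so the lower ranks can depend on which similarity function the engine uses.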

Vector Dimension Validation

When using vector search, every vector you add or search with must have exactly the number of dimensions given by vector_dimension.

For example, if the search engine is initialized with two dimensions:

with SearchEngine("ja", vector_dimension=2) as search:

then this vector is valid:

search.add("test", [1.0, 2.0])

However, this vector is invalid:

search.add("test", [1.0, 2.0, 3.0])

The same rule applies to search queries.

search.search([1.0, 2.0, 3.0], 10)

This should raise an error because the query vector has three dimensions, while the search engine expects two dimensions.

This validation is important when working with embeddings.

Embedding models usually have a fixed vector dimension, such as 384, 768, or 1024.

When using vector search, the vector_dimension value must match the embedding model's output size.
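If you want to catch a mismatch before the vector even reaches the search engine, a small pre-check is enough. The sketch below is a hypothetical helper of my own (validate_vector), not part of nlp4j-local-search. The exception type the library itself raises is not specified in this article, so the helper simply uses ValueError.

```python
def validate_vector(vector, expected_dimension):
    """Raise ValueError unless `vector` has exactly `expected_dimension` numbers."""
    if len(vector) != expected_dimension:
        raise ValueError(
            f"expected {expected_dimension} dimensions, got {len(vector)}"
        )
    # Normalize ints to floats so downstream code sees a uniform type.
    return [float(value) for value in vector]


EXPECTED_DIMENSION = 2  # would be 384, 768, 1024, ... for a real embedding model

print(validate_vector([1.0, 2.0], EXPECTED_DIMENSION))

try:
    validate_vector([1.0, 2.0, 3.0], EXPECTED_DIMENSION)
except ValueError as error:
    print(f"rejected: {error}")
```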

Separating Text Search and Vector Search

If you want normal text search, you can create the search engine without vector_dimension.

with SearchEngine("en") as search:
    search.add("test", "This is a test document")
    search.commit()
    results = search.search("test", 10)

If you want vector search, specify vector_dimension.

with SearchEngine("ja", vector_dimension=2) as search:
    search.add("1", [1.0, 0.0])
    search.commit()
    results = search.search([0.9, 0.1], 10)

If vector_dimension is not specified, adding or searching vectors is not allowed.

For example, the following call is not allowed, because the engine was created without vector_dimension:

with SearchEngine("en") as search:
    search.add("test", [1.0, 2.0])

This helps avoid ambiguous usage.

The search engine should know whether it is being used for text search or vector search.

Use Cases

I think nlp4j-local-search is useful for small NLP and RAG experiments.

For example:

Search local Markdown files
Search text extracted from PDFs or PowerPoint files
Search JSONL datasets
Try Japanese keyword search
Try English keyword search
Try vector search with embeddings
Build a local RAG prototype
Test search ideas before introducing Elasticsearch or OpenSearch

For production-scale distributed search, Elasticsearch or OpenSearch is usually a better choice.

However, for experiments, installing and operating a search server can be too heavy.

In such cases, a local search library that can be used directly from Python is convenient.
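For the RAG prototype case, the retrieval step usually ends with turning search results into prompt context. The following is a rough sketch of that step only. It uses a namedtuple as a stand-in for the result objects (which expose id, body, and score in the examples above), the build_prompt function and the prompt wording are my own assumptions, and no LLM is called.

```python
from collections import namedtuple

# Stand-in for search results; the real objects come from search.search(...).
Result = namedtuple("Result", ["id", "body", "score"])


def build_prompt(question, results, max_documents=3):
    """Format the top results as numbered context followed by the question."""
    context_lines = [f"[{r.id}] {r.body}" for r in results[:max_documents]]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Sample data taken from the English example; scores are rounded.
results = [
    Result("3", "He is running in the park", 0.1766),
    Result("1", "I run every morning", 0.1578),
]

prompt = build_prompt("Who runs in the park?", results)
print(prompt)
```

In an actual experiment, results would come from a search call, and the prompt would be sent to whichever model you are testing.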

Difference from Elasticsearch and OpenSearch

nlp4j-local-search is not intended to be a full replacement for Elasticsearch or OpenSearch.

Elasticsearch and OpenSearch are better when you need:

Distributed search
Large-scale indexing
Cluster management
Production monitoring
Access control
REST APIs
Operational tooling

On the other hand, nlp4j-local-search is useful when you want:

Local search
Simple Python API
No search server
No Docker
Quick NLP experiments
Japanese full-text search
English keyword search
Vector search in a local environment

In short, it is designed for local experiments rather than production search infrastructure.

Why I Built This

I often want to try search ideas quickly while working on NLP experiments.

Using a full search server is powerful, but it adds operational overhead.

For small experiments, I wanted something like this:

with SearchEngine("en") as search:
    search.add("1", "I run every morning")
    search.add("2", "She runs every day")
    search.add("3", "He is running in the park")
    search.commit()
    results = search.search("run", 10)

I also wanted to try vector search in a similarly simple way:

with SearchEngine("ja", vector_dimension=2) as search:
    search.add("1_East", [1.0, 0.0])
    search.commit()
    results = search.search([0.9, 0.1], 10)

That is the motivation behind nlp4j-local-search.

Summary

nlp4j-local-search makes it easy to use Apache Lucene-based local search from Python.

You can install it from PyPI:

pip install nlp4j-local-search

English text search example:

with SearchEngine("en") as search:
    search.add("1", "I run every morning")
    search.add("2", "She runs every day")
    search.add("3", "He is running in the park")
    search.commit()
    results = search.search("run", 10)

Japanese text search example:

with SearchEngine("ja") as search:
    search.add("1", "京都は日本の都市です。")
    search.commit()
    results = search.search("京都", 10)

Vector search example:

with SearchEngine("ja", vector_dimension=2) as search:
    search.add("1_East", [1.0, 0.0])
    search.commit()
    results = search.search([0.9, 0.1], 10)

The main benefits are:

No Elasticsearch
No OpenSearch
No Solr
No Docker
Local full-text search
Japanese keyword search
English keyword search
Vector search
Simple Python API

I hope this package will be useful for NLP and RAG experiments.
