Introduction
When working on NLP or RAG experiments in Python, I often want a simple way to search local data.
For example:
- Search a small set of local documents
- Search JSONL data
- Try Japanese full-text search
- Try English keyword search
- Build a quick RAG prototype
- Avoid setting up Elasticsearch, OpenSearch, Solr, or Docker
- Try vector search locally
For this purpose, I created nlp4j-local-search.
It is a Python package that lets you use Apache Lucene-based local search from Python.
The package is now available on PyPI:
pip install nlp4j-local-search
What is nlp4j-local-search?
nlp4j-local-search is a lightweight local search library for Python.
Internally, it uses Apache Lucene.
The main features are:
- Use Lucene from Python
- No Elasticsearch required
- No OpenSearch required
- No Solr required
- No Docker required
- Local full-text search
- Japanese keyword search
- English keyword search
- Vector search
- Useful for NLP and RAG experiments
The goal is not to replace a production search engine cluster.
Instead, the goal is to make local search easy when you want to run quick experiments from Python.
Installation
You can install it from PyPI.
pip install nlp4j-local-search
Since this package uses Java-based components internally, a Java runtime environment is required.
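Before installing, it can help to confirm that a Java runtime is visible from Python. The following is a minimal sketch using only the standard library. The helper names java_available and java_version_text are hypothetical, and the article does not say how the package locates Java, so checking the PATH is an assumption rather than the package's own logic.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def java_version_text() -> str:
    """Return the output of `java -version`, or an empty string if Java is missing."""
    if not java_available():
        return ""
    # `java -version` writes its report to stderr rather than stdout.
    completed = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=False
    )
    return (completed.stderr or completed.stdout).strip()


if __name__ == "__main__":
    if java_available():
        print(java_version_text())
    else:
        print("No `java` command found on PATH; install a Java runtime first.")
```

If the check reports that Java is missing, install a Java runtime before running the examples below.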
Basic English Keyword Search
Here is a simple English keyword search example.
from nlp4j_local_search import SearchEngine


def test_text_search():
    print("=== Search test ===")
    with SearchEngine("en") as search:
        search.add("1", "I run every morning")
        search.add("2", "She runs every day")
        search.add("3", "He is running in the park")
        search.add("4", "This document is about search engines")
        search.commit()
        results = search.search("run", 10)
        print(f"number of results: {len(results)}")
        for i, result in enumerate(results):
            print(f"result[{i}].id: {result.id}")
            print(f"result[{i}].body: {result.body}")
            print(f"result[{i}].score: {result.score}")
        assert len(results) == 3, f"Expected 3 results, got {len(results)}"
    print("✓ OK\n")


if __name__ == "__main__":
    test_text_search()
Run the script:
python tests/test_search_en.py
Example output:
=== Search test ===
number of results: 3
result[0].id: 3
result[0].body: He is running in the park
result[0].score: 0.17657174170017242
result[1].id: 1
result[1].body: I run every morning
result[1].score: 0.15782077610492706
result[2].id: 2
result[2].body: She runs every day
result[2].score: 0.15782077610492706
✓ OK
==================================================
OK!
==================================================
In this example, the query is:
results = search.search("run", 10)
The indexed documents are:
search.add("1", "I run every morning")
search.add("2", "She runs every day")
search.add("3", "He is running in the park")
search.add("4", "This document is about search engines")
The query run matches:
I run every morning
She runs every day
He is running in the park
It does not match:
This document is about search engines
This makes a good English keyword search example because it shows the search analyzer treating the word forms run, runs, and running as matches for the same query.
Why the English Search Example Uses run / runs / running
For Japanese search, a good example is often something like:
京都
東京都
With simple substring matching, searching for 京都 may also match 東京都.
Depending on the use case, that can be unwanted noise.
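The substring problem is easy to reproduce in plain Python. The sketch below does not use nlp4j-local-search at all. It only shows what a naive `in` check returns for the same four Japanese documents used later in this article. The function name substring_search is hypothetical.

```python
# Naive substring matching over the Japanese sample documents.
DOCS = {
    "1": "東京都は日本の都道府県のひとつです",
    "2": "京都は日本の都市です。",
    "3": "京都市には任天堂の本社があります",
    "4": "京都府は広いです",
}


def substring_search(query, docs):
    """Return the IDs of documents whose text contains the query as a substring."""
    return [doc_id for doc_id, text in docs.items() if query in text]


hits = substring_search("京都", DOCS)
print(hits)  # ['1', '2', '3', '4']
```

Document 1 is about 東京都 (Tokyo), yet it is returned because the characters 京都 appear inside 東京都.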
For English, however, words are usually separated by spaces. Therefore, substring-based examples are less interesting.
A more typical English search example is word form normalization.
For example:
run
runs
running
These words are different surface forms, but they are closely related.
So, for an English search example, the following dataset is easier to understand:
search.add("1", "I run every morning")
search.add("2", "She runs every day")
search.add("3", "He is running in the park")
search.add("4", "This document is about search engines")
Then search with:
results = search.search("run", 10)
This example demonstrates that nlp4j-local-search matches related word forms instead of only the exact query token.
It behaves like a search engine rather than a plain string comparison.
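To make the idea of word form normalization concrete, here is a small pure-Python sketch. It is not how nlp4j-local-search or Lucene works internally. The article does not say which analyzer is configured, so toy_stem is a deliberately crude, hypothetical suffix stripper that only handles this dataset. It compares exact token matching with matching on normalized forms.

```python
import re

DOCS = {
    "1": "I run every morning",
    "2": "She runs every day",
    "3": "He is running in the park",
    "4": "This document is about search engines",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(token):
    """Crude suffix stripping; a toy stand-in, not a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "runn" -> "run".
    if len(token) >= 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
        token = token[:-1]
    return token


def exact_match(query, docs):
    """IDs of documents that contain the query as an exact token."""
    return [doc_id for doc_id, text in docs.items() if query in tokenize(text)]


def stemmed_match(query, docs):
    """IDs of documents that share a normalized form with the query."""
    target = toy_stem(query)
    return [
        doc_id
        for doc_id, text in docs.items()
        if target in {toy_stem(t) for t in tokenize(text)}
    ]


print(exact_match("run", DOCS))    # ['1']
print(stemmed_match("run", DOCS))  # ['1', '2', '3']
```

Exact token matching finds only document 1, while matching on normalized forms finds documents 1, 2, and 3, which is the same set the search example returns.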
Basic Japanese Keyword Search
Japanese search is also supported.
from nlp4j_local_search import SearchEngine

with SearchEngine("ja") as search:
    search.add("1", "東京都は日本の都道府県のひとつです")
    search.add("2", "京都は日本の都市です。")
    search.add("3", "京都市には任天堂の本社があります")
    search.add_json({
        "id": "4",
        "body": "京都府は広いです"
    })
    search.commit()
    results = search.search("京都", 10)
    for i, result in enumerate(results):
        print(f"result[{i}].id: {result.id}")
        print(f"result[{i}].body: {result.body}")
        print(f"result[{i}].score: {result.score}")
The first argument of SearchEngine specifies the language.
SearchEngine("ja")
In this example, "ja" means Japanese search.
Documents can be added in two ways.
The first way is to add an ID and body text directly.
search.add("1", "東京都は日本の都道府県のひとつです")
The second way is to add a JSON-like dictionary.
search.add_json({
    "id": "4",
    "body": "京都府は広いです"
})
After adding documents, call commit().
search.commit()
Then you can search.
results = search.search("京都", 10)
The first argument is the query string, and the second argument is the maximum number of results.
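Because add_json takes a dictionary with id and body keys, JSONL data maps onto it naturally. The following is a minimal sketch of reading such records with the standard library. The helper iter_jsonl is hypothetical, and it assumes every line holds an object with exactly the id and body keys shown above. Whether add_json accepts other keys is not covered in this article.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one dict per non-empty JSONL line, checking for 'id' and 'body'."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if "id" not in record or "body" not in record:
            raise ValueError(f"line {line_number}: expected 'id' and 'body' keys")
        yield record


# An in-memory stand-in for a JSONL file opened with open(path, encoding="utf-8").
sample = io.StringIO(
    '{"id": "1", "body": "東京都は日本の都道府県のひとつです"}\n'
    "\n"
    '{"id": "4", "body": "京都府は広いです"}\n'
)

records = list(iter_jsonl(sample))
print([record["id"] for record in records])  # ['1', '4']
```

Each record could then be passed to search.add_json(record) inside the with block, followed by a single search.commit().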
Vector Search
nlp4j-local-search also supports vector search.
To use vector search, specify vector_dimension when creating SearchEngine.
from nlp4j_local_search import SearchEngine

with SearchEngine("ja", vector_dimension=2) as search:
    search.add("1_East", [1.0, 0.0])
    search.add("2_North", [1.0, 1.0])
    search.add("3_West", [-1.0, 0.0])
    search.add("4_South", [-1.0, -1.0])
    search.commit()
    results = search.search([0.9, 0.1], 10)
    for i, result in enumerate(results):
        print(f"result[{i}].id: {result.id}")
        print(f"result[{i}].body: {result.body}")
        print(f"result[{i}].score: {result.score}")
        print("---")
In this example, we use two-dimensional vectors.
search.add("1_East", [1.0, 0.0])
search.add("2_North", [1.0, 1.0])
search.add("3_West", [-1.0, 0.0])
search.add("4_South", [-1.0, -1.0])
Then we search with the following query vector:
results = search.search([0.9, 0.1], 10)
The vector [0.9, 0.1] is close to [1.0, 0.0].
Therefore, the expected top result is:
assert results[0].id == "1_East"
This makes it possible to try simple nearest-neighbor search locally.
If you generate embeddings from your own model, you can store those vectors and search similar items.
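To see why 1_East is the expected top result, the ranking can be reproduced by hand. The sketch below uses cosine similarity in pure Python. This is an assumption for illustration only: the article does not state which similarity function nlp4j-local-search uses, and the helper names cosine_similarity and rank are hypothetical.

```python
import math

VECTORS = {
    "1_East": [1.0, 0.0],
    "2_North": [1.0, 1.0],
    "3_West": [-1.0, 0.0],
    "4_South": [-1.0, -1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, vectors):
    """Return (id, similarity) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank([0.9, 0.1], VECTORS)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.4f}")
```

With this measure, 1_East ranks first and 2_North second. Euclidean distance would also put 1_East first for this query, so the top result does not depend on which of the two measures is assumed.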
Vector Dimension Validation
When using vector search, the vector dimension must match.
For example, if the search engine is initialized with two dimensions:
with SearchEngine("ja", vector_dimension=2) as search:
then this vector is valid:
search.add("test", [1.0, 2.0])
However, this vector is invalid:
search.add("test", [1.0, 2.0, 3.0])
The same rule applies to search queries.
search.search([1.0, 2.0, 3.0], 10)
This should raise an error because the query vector has three dimensions, while the search engine expects two dimensions.
This validation is important when working with embeddings.
Embedding models usually have a fixed vector dimension, such as 384, 768, or 1024.
When using vector search, the vector_dimension value must match the embedding model output size.
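When vectors come from an embedding model, it can be convenient to check their size before they reach the search engine. The following is a minimal sketch of such a guard. The function check_dimension is hypothetical, and ValueError is an assumption: the article does not state which exception type nlp4j-local-search itself raises.

```python
def check_dimension(vector, expected_dimension):
    """Return the vector unchanged, or raise if its length is unexpected."""
    if len(vector) != expected_dimension:
        raise ValueError(
            f"expected {expected_dimension} dimensions, got {len(vector)}"
        )
    return vector


# A two-dimensional vector passes through unchanged.
print(check_dimension([1.0, 2.0], 2))

# A three-dimensional vector is rejected before it is added or searched.
try:
    check_dimension([1.0, 2.0, 3.0], 2)
except ValueError as error:
    print(f"rejected: {error}")
```

Calling a guard like this on every embedding makes a mismatch between the model output size and vector_dimension visible with a clear message.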
Separating Text Search and Vector Search
If you want normal text search, you can create the search engine without vector_dimension.
with SearchEngine("en") as search:
    search.add("test", "This is a test document")
    search.commit()
    results = search.search("test", 10)
If you want vector search, specify vector_dimension.
with SearchEngine("ja", vector_dimension=2) as search:
    search.add("1", [1.0, 0.0])
    search.commit()
    results = search.search([0.9, 0.1], 10)
If vector_dimension is not specified, adding or searching vectors is not allowed.
For example:
with SearchEngine("en") as search:
    # Not allowed: the engine was created without vector_dimension.
    search.add("test", [1.0, 2.0])
This helps avoid ambiguous usage.
The search engine should know whether it is being used for text search or vector search.
Use Cases
I think nlp4j-local-search is useful for small NLP and RAG experiments.
For example:
- Search local Markdown files
- Search text extracted from PDFs or PowerPoint files
- Search JSONL datasets
- Try Japanese keyword search
- Try English keyword search
- Try vector search with embeddings
- Build a local RAG prototype
- Test search ideas before introducing Elasticsearch or OpenSearch
For production-scale distributed search, Elasticsearch or OpenSearch is usually a better choice.
However, for experiments, installing and operating a search server can be too heavy.
In such cases, a local search library that can be used directly from Python is convenient.
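For the RAG prototype use case, the step after searching is usually to place the retrieved text into a prompt. The following is a minimal sketch of that step only. Hit is a hypothetical stand-in that mirrors the id, body, and score attributes printed in the examples above, and build_prompt and its prompt wording are illustrative assumptions, not part of nlp4j-local-search.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Stand-in for a search result with the attributes used in this article."""
    id: str
    body: str
    score: float


def build_prompt(question, hits, max_hits=3):
    """Combine the top hits and the question into one prompt string."""
    context_lines = [f"[{hit.id}] {hit.body}" for hit in hits[:max_hits]]
    return (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_lines) + "\n\n"
        f"Question: {question}\n"
    )


hits = [
    Hit("3", "He is running in the park", 0.18),
    Hit("1", "I run every morning", 0.16),
]
prompt = build_prompt("Who runs in the park?", hits)
print(prompt)
```

In a real experiment, the list of hits would come from search.search(...), and the prompt would be sent to whichever language model you are testing.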
Difference from Elasticsearch and OpenSearch
nlp4j-local-search is not intended to be a full replacement for Elasticsearch or OpenSearch.
Elasticsearch and OpenSearch are better when you need:
- Distributed search
- Large-scale indexing
- Cluster management
- Production monitoring
- Access control
- REST APIs
- Operational tooling
On the other hand, nlp4j-local-search is useful when you want:
- Local search
- Simple Python API
- No search server
- No Docker
- Quick NLP experiments
- Japanese full-text search
- English keyword search
- Vector search in a local environment
In short, it is designed for local experiments rather than production search infrastructure.
Why I Built This
I often want to try search ideas quickly while working on NLP experiments.
Using a full search server is powerful, but it adds operational overhead.
For small experiments, I wanted something like this:
with SearchEngine("en") as search:
    search.add("1", "I run every morning")
    search.add("2", "She runs every day")
    search.add("3", "He is running in the park")
    search.commit()
    results = search.search("run", 10)
I also wanted to try vector search in a similarly simple way:
with SearchEngine("ja", vector_dimension=2) as search:
    search.add("1_East", [1.0, 0.0])
    search.commit()
    results = search.search([0.9, 0.1], 10)
That is the motivation behind nlp4j-local-search.
Summary
nlp4j-local-search makes it easy to use Apache Lucene-based local search from Python.
You can install it from PyPI:
pip install nlp4j-local-search
English text search example:
with SearchEngine("en") as search:
    search.add("1", "I run every morning")
    search.add("2", "She runs every day")
    search.add("3", "He is running in the park")
    search.commit()
    results = search.search("run", 10)
Japanese text search example:
with SearchEngine("ja") as search:
    search.add("1", "京都は日本の都市です。")
    search.commit()
    results = search.search("京都", 10)
Vector search example:
with SearchEngine("ja", vector_dimension=2) as search:
    search.add("1_East", [1.0, 0.0])
    search.commit()
    results = search.search([0.9, 0.1], 10)
The main benefits are:
- No Elasticsearch
- No OpenSearch
- No Solr
- No Docker
- Local full-text search
- Japanese keyword search
- English keyword search
- Vector search
- Simple Python API
I hope this package will be useful for NLP and RAG experiments.