Using Multilingual E5 Embeddings Locally with Python: nlp4j-llm-embedding-e5

Last updated at 2026-06-20. Posted at 2026-06-20.

This tool lets you generate multilingual E5 embeddings from JSONL files or use them via a lightweight HTTP API, without writing Python integration code every time.

Introduction

I released a Python package named nlp4j-llm-embedding-e5.

PyPI:

https://pypi.org/project/nlp4j-llm-embedding-e5/

This package provides simple command-line tools for using multilingual E5 embeddings locally.

It provides two commands, one for each main use case:

nlp4j-embedding-local-e5
nlp4j-embedding-server-e5

The first command processes JSONL files and adds embedding vectors.

The second command starts a lightweight HTTP server that exposes embedding, semantic search, and cosine similarity APIs.

Internally, this package uses sentence-transformers and the multilingual E5 model:

intfloat/multilingual-e5-large

Installation

The package is available on PyPI, so it can be installed with pip.

pip install nlp4j-llm-embedding-e5

After installation, the following commands are available:

nlp4j-embedding-local-e5
nlp4j-embedding-server-e5

You can check the command options with:

nlp4j-embedding-local-e5 --help
nlp4j-embedding-server-e5 --help

Why I Created This Tool

Embedding models are useful for many NLP and LLM-related tasks, such as:

semantic search
RAG preprocessing
document retrieval
similarity calculation
clustering
multilingual search
local text analysis

However, using embedding models directly from Python code is not always convenient.

In many systems, the main application may be written in Java, Node.js, shell scripts, or another language. In such cases, a simple command-line tool or HTTP API can be more convenient than writing Python integration code every time.

This package provides both:

a local JSONL embedding command
a lightweight HTTP embedding server

This makes multilingual E5 embeddings easier to use from local scripts, batch jobs, and non-Python applications.

What Is E5?

E5 is a family of embedding models designed for text retrieval and semantic search.

This package uses:

intfloat/multilingual-e5-large

E5 models expect different prefixes depending on the role of the text.

For documents or passages:

passage: your document text

For search queries:

query: your search query

For example:

query: Japanese NLP
passage: GiNZA is a Japanese NLP library.

This package automatically adds these prefixes depending on the selected mode.

If the input text already starts with query: or passage:, the existing prefix is preserved.
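
The prefix rule described above can be sketched in a few lines of Python. This is only an illustration of the behavior, not the package's actual code: the function name add_e5_prefix and the E5_PREFIXES table are hypothetical names I introduce here.

```python
# Hypothetical sketch of the prefix rule; not the package's real implementation.
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def add_e5_prefix(text: str, text_type: str = "passage") -> str:
    """Return text with the E5 prefix for text_type, unless one is already present."""
    if text_type == "none":
        # Prefixing disabled: pass the text through unchanged.
        return text
    if text_type not in E5_PREFIXES:
        raise ValueError(f"unknown text_type: {text_type!r}")
    # Preserve an existing prefix, whichever of the two it is.
    if text.startswith(("query:", "passage:")):
        return text
    return E5_PREFIXES[text_type] + text


print(add_e5_prefix("Japanese NLP", "query"))
print(add_e5_prefix("query: Japanese NLP", "passage"))
```

Both calls print "query: Japanese NLP": the first adds the prefix, the second keeps the one that is already there.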

Command 1: Local JSONL Embedding

The command nlp4j-embedding-local-e5 reads a JSONL file, embeds text from a specified JSON attribute, and writes a new JSONL file with embedding vectors added.

Input Example

Create a JSONL file named input.jsonl.

{"id":"1","text":"Kyoto is a city in Japan."}
{"id":"2","text":"Tokyo is the capital of Japan."}
{"id":"3","text":"GiNZA is a Japanese NLP library."}

Run Embedding

nlp4j-embedding-local-e5 --input input.jsonl --output output.jsonl

Short options are also available:

nlp4j-embedding-local-e5 -i input.jsonl -o output.jsonl

By default:

input text attribute: text
output vector attribute: vector
E5 text type: passage
batch size: 32
max length: 512

Output Example

The output JSONL will contain a new vector field.

{"id":"1","text":"Kyoto is a city in Japan.","vector":[0.0123,-0.0456,...]}
{"id":"2","text":"Tokyo is the capital of Japan.","vector":[0.0234,-0.0567,...]}
{"id":"3","text":"GiNZA is a Japanese NLP library.","vector":[0.0345,-0.0678,...]}

The actual vector is a high-dimensional floating-point array.

For intfloat/multilingual-e5-large, the embedding dimension is 1024.
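
A quick sanity check on the output file is to confirm that every record has a vector of the expected length. The following is a minimal sketch using only the standard library. The helper names iter_vectors and find_bad_dimensions are mine, and the sample uses 3-dimensional toy vectors instead of real 1024-dimensional ones.

```python
import json


def iter_vectors(lines, vector_attr="vector"):
    """Yield (record, vector) pairs from JSONL lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record, record[vector_attr]


def find_bad_dimensions(lines, expected_dim=1024, vector_attr="vector"):
    """Return the ids of records whose vector length differs from expected_dim."""
    return [
        record.get("id")
        for record, vector in iter_vectors(lines, vector_attr)
        if len(vector) != expected_dim
    ]


# Toy data: the second record is deliberately one value short.
sample = [
    '{"id":"1","text":"Kyoto is a city in Japan.","vector":[0.1,0.2,0.3]}',
    '{"id":"2","text":"Tokyo is the capital of Japan.","vector":[0.1,0.2]}',
]
print(find_bad_dimensions(sample, expected_dim=3))  # ['2']
```

For a real file, pass an open file object instead of the sample list, for example find_bad_dimensions(open("output.jsonl", encoding="utf-8")). An empty result means every vector has 1024 values.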

Custom JSON Attributes

If your JSONL file uses a different attribute name, you can specify it.

For example:

{"doc_id":"a","body":"Natural language processing handles text data."}
{"doc_id":"b","body":"Vector search can retrieve semantically similar documents."}

Run:

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --text-attr body \
  --vector-attr embedding

The output will contain the embedding field.

{"doc_id":"a","body":"Natural language processing handles text data.","embedding":[...]}
{"doc_id":"b","body":"Vector search can retrieve semantically similar documents.","embedding":[...]}

Specifying E5 Text Type

For document embeddings, use passage.

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --text-type passage

For query embeddings, use query.

nlp4j-embedding-local-e5 \
  --input queries.jsonl \
  --output query_vectors.jsonl \
  --text-type query

To disable automatic E5 prefixing, use none.

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --text-type none

Useful Local Options

Batch Size

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --batch-size 64

A larger batch size may improve throughput if enough memory is available.

If memory is limited, reduce the batch size.

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --batch-size 8
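
To see why batch size affects memory, it helps to look at how a list of texts is split into batches. The sketch below only illustrates the general idea; how the command batches internally is not shown in this article, and the helper name batched is hypothetical.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"document {i}" for i in range(70)]
sizes = [len(batch) for batch in batched(texts, batch_size=32)]
print(sizes)  # [32, 32, 6]
```

Each batch is encoded in one model call, so peak memory grows with the batch size, while a smaller batch size means more calls for the same input.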

Max Length

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --max-length 512

This sets the maximum number of tokens per text. Longer texts are truncated, and 512 tokens is the limit of intfloat/multilingual-e5-large.

Token Count Check

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --check-token-count

If the token count exceeds --max-length, a warning is printed.

Verbose Mode

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --verbose

This prints batch-level processing time.

Command 2: HTTP Embedding Server

The command nlp4j-embedding-server-e5 starts a lightweight HTTP server.

nlp4j-embedding-server-e5

By default, the server listens on:

127.0.0.1:8888

You can specify the host and port.

nlp4j-embedding-server-e5 --host 127.0.0.1 --port 8888

The server provides the following endpoints:

/embeddings
/semantic_search
/cos_sim

Model Warmup

By default, the server loads and warms up the model at startup.

This may take some time, especially on the first run, because:

the model may need to be downloaded
sentence-transformers loads the model
PyTorch initializes CPU/GPU resources
a warmup embedding is executed

The server prints status messages during startup, such as model loading and warmup completion.

If you want to skip startup warmup, use:

nlp4j-embedding-server-e5 --no-warmup

In this mode, the server starts faster, but the first API request may take longer because the model is loaded on demand.

API: /embeddings

The /embeddings endpoint generates an embedding for a single text.

This endpoint is intended for document embeddings and uses the E5 passage: prefix internally.

GET

curl "http://127.0.0.1:8888/embeddings?text=This%20is%20a%20test."

POST

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"This is a test."}' \
  http://127.0.0.1:8888/embeddings

Response Example

{
  "message": "ok",
  "time": "2026-06-20T12:00:00",
  "text": "This is a test.",
  "embeddings": [0.0123, -0.0456, 0.0789]
}
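
The same GET request can be made from Python with the standard library. This is a minimal client sketch, assuming the server is running on the default 127.0.0.1:8888 and returns the response shape shown above. The function names embeddings_url and fetch_embedding are my own.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:8888"  # default host and port of the server


def embeddings_url(text, base_url=BASE_URL):
    """Build the GET URL for /embeddings, percent-encoding the text."""
    # quote_via=quote encodes spaces as %20, matching the curl examples.
    query = urllib.parse.urlencode({"text": text}, quote_via=urllib.parse.quote)
    return f"{base_url}/embeddings?{query}"


def fetch_embedding(text, base_url=BASE_URL, timeout=60):
    """Call the server and return the list under the "embeddings" key."""
    with urllib.request.urlopen(embeddings_url(text, base_url), timeout=timeout) as resp:
        body = json.load(resp)
    return body["embeddings"]


print(embeddings_url("This is a test."))
```

With the server running, fetch_embedding("This is a test.") returns the vector as a Python list of floats.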

Token Count Check

curl "http://127.0.0.1:8888/embeddings?text=This%20is%20a%20test.&checktokencount=true"

The response includes token count information.

{
  "message": "ok",
  "text": "This is a test.",
  "token_count": 10,
  "max_tokens": 512,
  "truncated": false,
  "embeddings": [...]
}

API: /semantic_search

The /semantic_search endpoint compares a query with one or more candidate texts.

The query is encoded with:

query: ...

The candidate texts are encoded with:

passage: ...

This is the recommended endpoint for retrieval-style search.

GET

GET supports one query and one candidate text.

curl "http://127.0.0.1:8888/semantic_search?text1=Japanese%20NLP&text2=GiNZA%20is%20a%20Japanese%20NLP%20library."

POST

POST supports multiple candidate texts.

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"Japanese NLP","texts":["GiNZA is a Japanese NLP library.","This document is about image processing.","Tokyo is the capital of Japan."]}' \
  http://127.0.0.1:8888/semantic_search

Response Example

{
  "message": "ok",
  "time": "2026-06-20T12:00:00",
  "text": "Japanese NLP",
  "r": [
    {
      "corpus_id": 0,
      "score": 0.8234
    },
    {
      "corpus_id": 1,
      "score": 0.3123
    }
  ]
}

The corpus_id indicates the index of the candidate text in the input list.

The score is the similarity between the query and that candidate. A higher score means a closer match.
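
Because the response only contains indexes, the client has to map each corpus_id back to the original text. Here is a small sketch of that step, using the response shape shown above. The helper name rank_candidates is hypothetical, and the scores are the illustrative values from the example, not real model output.

```python
def rank_candidates(response, candidates):
    """Return (candidate_text, score) pairs, best match first."""
    hits = sorted(response["r"], key=lambda hit: hit["score"], reverse=True)
    # corpus_id is the position of the candidate in the list that was sent.
    return [(candidates[hit["corpus_id"]], hit["score"]) for hit in hits]


candidates = [
    "GiNZA is a Japanese NLP library.",
    "This document is about image processing.",
    "Tokyo is the capital of Japan.",
]

# Illustrative response copied from the example above.
response = {
    "message": "ok",
    "text": "Japanese NLP",
    "r": [
        {"corpus_id": 0, "score": 0.8234},
        {"corpus_id": 1, "score": 0.3123},
    ],
}

for text, score in rank_candidates(response, candidates):
    print(f"{score:.4f}  {text}")
```

The candidates list must be the same list, in the same order, that was sent in the texts field of the request.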

API: /cos_sim

The /cos_sim endpoint calculates cosine similarity between two texts.

By default, this endpoint does not add an E5 prefix. It is a simple compatibility endpoint for comparing two raw texts.

For retrieval-style search, /semantic_search is recommended because it applies query: and passage: prefixes correctly.

GET

curl "http://127.0.0.1:8888/cos_sim?text1=This%20is%20a%20test.&text2=This%20is%20an%20exam."

POST

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"text1":"This is a test.","text2":"This is an exam.","checktokencount":true}' \
  http://127.0.0.1:8888/cos_sim

Response Example

{
  "text1": "This is a test.",
  "text2": "This is an exam.",
  "cosine_similarity": 0.8123
}
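
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following pure-Python sketch shows the calculation on toy 2-dimensional vectors. It illustrates the formula only and says nothing about how the server computes it internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

The result ranges from -1 to 1, and vectors pointing in the same direction score 1 regardless of their length.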

Python API

The package can also be used directly from Python.

Single Text Embedding

from nlp4j_embedding import e5_model

vector, elapsed = e5_model.embed_text(
    "Kyoto is a city in Japan.",
    text_type="passage"
)

print(len(vector))
print(elapsed)

Batch Embedding

from nlp4j_embedding import e5_model

vectors, elapsed = e5_model.embed_texts(
    [
        "Kyoto is a city in Japan.",
        "Tokyo is the capital of Japan."
    ],
    text_type="passage"
)

print(len(vectors))

Semantic Search

from nlp4j_embedding import e5_model

results = e5_model.semantic_search(
    "Japanese city",
    [
        "Kyoto is a city in Japan.",
        "Python is a programming language."
    ]
)

print(results)

Cosine Similarity

from nlp4j_embedding import e5_model

score = e5_model.cos_sim(
    "This is a test.",
    "This is an exam."
)

print(score)

Testing the Local Command

Create a small JSONL file.

cat > input.jsonl << 'EOF'
{"id":"1","text":"Kyoto is a city in Japan."}
{"id":"2","text":"Tokyo is the capital of Japan."}
EOF

Run:

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --check-token-count \
  --verbose

Check the result.

cat output.jsonl

You should see vectors added to each JSON object.

Testing the HTTP Server

Start the server.

nlp4j-embedding-server-e5 --port 8888

Then test the embedding endpoint.

curl "http://127.0.0.1:8888/embeddings?text=This%20is%20a%20test."

Test semantic search.

curl "http://127.0.0.1:8888/semantic_search?text1=Japanese%20NLP&text2=GiNZA%20is%20a%20Japanese%20NLP%20library."

Test cosine similarity.

curl "http://127.0.0.1:8888/cos_sim?text1=This%20is%20a%20test.&text2=This%20is%20an%20exam."

Why an HTTP API Is Useful

Although the embedding model runs in Python, many real-world applications are not written entirely in Python.

A lightweight HTTP server makes the model usable from:

Java applications
Node.js applications
shell scripts
web applications
enterprise batch systems
internal tools

For example, a Java or Node.js application can call:

http://127.0.0.1:8888/embeddings?text=...

without directly importing Python libraries.

This is useful when Python is used as the NLP model runtime and another language is used for application development.

Comparison with Vector Databases and Search Engines

This package is not intended to replace Elasticsearch, OpenSearch, Lucene, PostgreSQL with pgvector, or dedicated vector databases.

Instead, it is useful when you want:

a simple local embedding tool
JSONL preprocessing
a lightweight HTTP embedding server
quick semantic search experiments
integration with your own search engine
embedding generation before indexing into another system

For example, you can use this package to generate vectors first, and then store those vectors in your own search index.
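
As a rough idea of what "your own search index" could look like at its simplest, here is a brute-force in-memory search over the vectors in a JSONL file. This is a sketch under several assumptions: the helper names load_index and top_k are mine, the vectors are toy 2-dimensional values, and in practice the query vector would be produced separately with the query text type.

```python
import json
import math


def load_index(lines, vector_attr="vector"):
    """Parse JSONL lines into a list of (record, vector) pairs."""
    index = []
    for line in lines:
        if line.strip():
            record = json.loads(line)
            index.append((record, record[vector_attr]))
    return index


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(index, query_vector, k=3):
    """Return the k best (score, record) pairs, highest score first."""
    scored = [(_cosine(query_vector, vec), rec) for rec, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors for illustration only; real vectors have 1024 values.
lines = [
    '{"id":"1","text":"Kyoto is a city in Japan.","vector":[1.0,0.0]}',
    '{"id":"2","text":"Tokyo is the capital of Japan.","vector":[0.6,0.8]}',
    '{"id":"3","text":"GiNZA is a Japanese NLP library.","vector":[0.0,1.0]}',
]
index = load_index(lines)
for score, record in top_k(index, [1.0, 0.0], k=2):
    print(f"{score:.2f}  {record['id']}  {record['text']}")
```

A linear scan like this is fine for small experiments. For larger collections, the same vectors can be loaded into one of the systems mentioned above.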

Performance Notes

The first execution may take longer because the model must be downloaded and loaded.

The HTTP server warms up the model by default so that the first request can be faster after startup.

nlp4j-embedding-server-e5

To skip warmup:

nlp4j-embedding-server-e5 --no-warmup

For large JSONL files, adjust the batch size depending on available memory and GPU capacity.

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --batch-size 64

If memory is limited, reduce the batch size.

nlp4j-embedding-local-e5 \
  --input input.jsonl \
  --output output.jsonl \
  --batch-size 8

Conclusion

nlp4j-llm-embedding-e5 is a small Python package for using multilingual E5 embeddings locally.

It provides:

pip installation from PyPI
a local JSONL embedding command
a lightweight HTTP embedding server
semantic search API
cosine similarity API
automatic E5 prefix handling
token count checking
direct Python API access

Install it with:

pip install nlp4j-llm-embedding-e5

This package is useful for local NLP experiments, RAG preprocessing, semantic search workflows, and integration with applications written in languages other than Python.
