An "ultra-compact" OCR model: trying out SmolDocling-256M-preview

Last updated at 2025-04-05 / Posted at 2025-04-04

Written on 2025/04/04

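This is a memo for my own reference.
I tried running SmolDocling, an extremely lightweight model whose speed and accuracy have been getting attention in the OCR field (or so I like to think).
I had run Llama and Gemma before, but the code for this model is written a little differently from those, so I am writing it down here.
For such a light model it handled OCR quite well, so I have high hopes for it.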

What is SmolDocling?

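It is described as an "ultra-compact" OCR model developed jointly by IBM and Hugging Face.
It is a vision model with only 256M parameters, and it is also nice that the result can be obtained as Markdown (the model emits DocTags, which docling_core converts to Markdown, as in the sample code below).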

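For now it has the constraints listed below. Still, I was surprised by how accurate and fast it is for such a small model. I hope it will support Japanese in the future.
Incidentally, it used a little over 4 GB of GPU memory in my environment, so it should run without problems on a PC with 8 GB or more of VRAM. (Honestly, I wish it ran on CPU alone.)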

・Only English is supported
・It does not run without a GPU (more precisely, VRAM); at least that was the case for me
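
As a rough sanity check on the memory figure above, the weights themselves should be small. The following is a back-of-the-envelope sketch, assuming "256M" means exactly 256 million parameters and that each one is stored in bfloat16 (2 bytes), the dtype used in the sample code. The helper name `weight_memory_mib` is made up for illustration. The rest of the roughly 4 GB is presumably activations, the generation cache, and CUDA overhead, but that breakdown is a guess, not a measurement.

```python
def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Return the memory needed for the raw weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


# Assumption: "256M" is taken as exactly 256 million parameters.
NUM_PARAMS = 256_000_000

bf16 = weight_memory_mib(NUM_PARAMS, 2)  # bfloat16: 2 bytes per parameter
fp32 = weight_memory_mib(NUM_PARAMS, 4)  # float32: 4 bytes per parameter

print(f"bfloat16: {bf16:.0f} MiB, float32: {fp32:.0f} MiB")
# prints "bfloat16: 488 MiB, float32: 977 MiB"
```

So the weights account for only about half a gigabyte of the observed usage.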

Environment

・OS : Ubuntu 24.04 (more precisely, a WSL2 environment on Windows 11, with about 20 GB of GPU memory)
・python : 3.11.11 (I use pyenv in my environment)
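
To compare your own machine against the environment above, a small check like the following may help. It is only a sketch: the function name `environment_report` and the dictionary keys are made up for illustration, and it assumes that GPU index 0 is the one you intend to use. It reports "no CUDA" rather than failing when PyTorch is not installed.

```python
import importlib.util
import platform


def environment_report() -> dict:
    """Collect OS, Python version, and (if PyTorch is present) GPU info."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "gpu_total_gib": None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the script also runs without PyTorch

        if torch.cuda.is_available():
            report["cuda_available"] = True
            # Assumption: device 0 is the GPU of interest.
            props = torch.cuda.get_device_properties(0)
            report["gpu_total_gib"] = round(props.total_memory / 1024 ** 3, 1)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```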

Sample code

# main.py
from services.smoldocling import SmolDoc

def main():
    SDS = SmolDoc()
    SDS.load_model()
    SDS.vision("./document.jpg")
    
if __name__ == "__main__":
    main()

# services/smoldocling.py
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers.image_utils import load_image

class SmolDoc():
    model_id = "ds4sd/SmolDocling-256M-preview"

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(self.device)

    def load_model(self):
        self.processor = AutoProcessor.from_pretrained(self.model_id)
        # This says .to(self.device), but since a GPU is required it is effectively .to("cuda")
        self.model = AutoModelForVision2Seq.from_pretrained(
            self.model_id,
            torch_dtype=torch.bfloat16,
            _attn_implementation="eager"
            # _attn_implementation="flash_attention_2" if self.device == "cuda" else "eager",
        ).to(self.device)

    def vision(self, image_path):
        # Load the image
        image = load_image(image_path)

        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": "Convert this page to docling."}
            ]}
        ]

        # Processor: build the chat prompt and preprocess the inputs
        prompt = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = self.processor(text=prompt, images=image, return_tensors="pt").to(self.device)

        # Output: generate, then drop the prompt tokens from the result
        generated_ids = self.model.generate(**inputs, max_new_tokens=4096)
        prompt_length = inputs.input_ids.shape[1]
        trimmed_generated_ids = generated_ids[:, prompt_length:]

        doctags = self.processor.batch_decode(
            trimmed_generated_ids,
            skip_special_tokens=False,
        )[0]

        doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
        doc = DoclingDocument(name="Document")
        doc.load_from_doctags(doctags_doc)
        print(doc.export_to_markdown())

        # Write the result to a file
        doc.save_as_markdown("./output.md")
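
The commented-out line in `load_model` would switch to FlashAttention 2 whenever CUDA is available, which only works if the separate `flash_attn` package is installed. One way to make that choice safer is sketched below. The function name `choose_attn_implementation` is hypothetical, and the check only confirms that the package can be found, not that the GPU actually supports it.

```python
import importlib.util


def choose_attn_implementation(device: str) -> str:
    """Return "flash_attention_2" only on CUDA with flash_attn installed, else "eager"."""
    # find_spec looks the package up without importing it.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if device == "cuda" and has_flash_attn:
        return "flash_attention_2"
    return "eager"


print(choose_attn_implementation("cpu"))  # prints "eager"
```

The returned string could then be passed as the `_attn_implementation` argument in `load_model` in place of the hard-coded "eager".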

Sites I referred to

・What is SmolDocling
https://www.aibase.com/ja/news/16427
・SmolDocling-256M-preview
https://huggingface.co/ds4sd/SmolDocling-256M-preview
・About the code
https://adasci.org/a-hands-on-guide-to-compact-vision-language-models-using-smoldocling/
https://koshurai.medium.com/smoldocling-256m-tutorial-891cb3353c78
https://www.reddit.com/r/LocalLLaMA/comments/1je4eka/smoldocling_256m_vlm_for_document_understanding/?rdt=35822
