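With one eye on 青池亨 and 木下貴文's poster 『軽量なレイアウト認識モデルを活用した大規模なOCRテキストデータの構造化及び成果物の分析』 ("Structuring large-scale OCR text data with a lightweight layout recognition model, and an analysis of the results"; Jinmoncom 2025, poster P-2-18), I tried out NDL古典籍OCR-Lite on Google Colaboratory.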
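# Sample image (a page of 『漢書零片』) and the OCR repository
img="http://kanji.zinbun.kyoto-u.ac.jp/db-machine/toho/L/B0010001.jpg"
url="https://github.com/ndl-lab/ndlkotenocr-lite"
import os
d=os.path.basename(url)
# Clone the repository once, then install its requirements with the "==" pins relaxed to ">="
!test -d {d} || git clone --depth=1 {url}
!pip install `sed 's/==/>=/' {d}/requirements.txt`
f=os.path.basename(img)
# Download the image once, run the OCR, and show the XML it writes
!test -f {f} || curl -LO {img}
!( cd {d}/src && python ocr.py --sourceimg ../../{f} --output ../.. )
!cat {f.replace(".jpg",".xml")}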
import json
# Load the JSON that the OCR writes alongside the XML
with open(f.replace(".jpg",".json"),"r",encoding="utf-8") as r:
  doc=json.load(r)
# Draw each vertical line as SVG text stretched over its bounding box
s=f'<svg xmlns="http://www.w3.org/2000/svg" width="{doc["imginfo"]["img_width"]}" height="{doc["imginfo"]["img_height"]}">\n'
for z in doc["contents"][0]:
  if z["isVertical"]=="true":
    t,b=z["text"],z["boundingBox"]
    x,y=(b[0][0]+b[3][0])/2,(b[0][1]+b[3][1])/2
    w,h=b[3][0]-b[0][0],b[3][1]-b[0][1]
    s+=f'<text transform="scale({w},{h/len(t)})" x="{x/w}" y="{y/h*len(t)}" text-anchor="middle" font-size="1" font-family="sans-serif" writing-mode="vertical-rl">{t}</text>\n'
s+='</svg>'
from IPython.display import SVG,display
display(SVG(s))
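The scaling trick in the SVG conversion is easy to misread, so here is a minimal sketch of it as a standalone function. With font-size 1 each glyph occupies a 1×1 cell; scaling by (w, h/n) makes each cell w wide and h/n tall, so n characters fill the box, and x and y are divided by the same factors because the transform applies to them as well. The function name `vertical_text_svg` is hypothetical, the corner order (index 0 top-left, index 3 bottom-right) is assumed from the way the code above indexes `boundingBox`, and the XML escaping and the guard against empty text are additions that the original cell does not have.

```python
from xml.sax.saxutils import escape

def vertical_text_svg(text, box):
    """Return one SVG <text> element that stretches `text` over `box`.

    Assumption: box[0] is the top-left corner and box[3] the bottom-right
    corner, as in the cell above.
    """
    (x0, y0), (x1, y1) = box[0], box[3]
    w, h = x1 - x0, y1 - y0
    if not text or w <= 0 or h <= 0:
        return ""  # nothing sensible to draw; also avoids division by zero
    n = len(text)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # centre of the box
    return (
        f'<text transform="scale({w},{h / n})" x="{cx / w}" y="{cy / h * n}" '
        'text-anchor="middle" font-size="1" font-family="sans-serif" '
        f'writing-mode="vertical-rl">{escape(text)}</text>'
    )

# A 10x100 box holding four characters: each cell becomes 10 wide and 25 tall.
box = [[0, 0], [0, 100], [10, 0], [10, 100]]
print(vertical_text_svg("abcd", box))
```

Escaping matters only if the recognized text ever contains `<` or `&`, which would otherwise make the SVG ill-formed.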
I fed it 『漢書零片』, had it write XML, and also converted the result to SVG. On my (安岡孝一) end, the output was as follows.
<OCRDATASET>
<PAGE IMAGENAME="B0010001.jpg" WIDTH="1500" HEIGHT="992">
<LINE TYPE="本文" X="1301" Y="111" WIDTH="45" HEIGHT="22" CONF="0.424" ORDER="0" STRING="の一" />
<LINE TYPE="本文" X="1300" Y="703" WIDTH="29" HEIGHT="175" CONF="0.745" ORDER="1" STRING="東方文化学院京都研究所" />
<LINE TYPE="本文" X="1171" Y="233" WIDTH="49" HEIGHT="660" CONF="0.843" ORDER="2" STRING="因章事挙直言極諫並見耶從官展盡其" />
<LINE TYPE="本文" X="1113" Y="231" WIDTH="51" HEIGHT="661" CONF="0.873" ORDER="3" STRING="意加於往前以明示四方使天下咸知主" />
<LINE TYPE="本文" X="1059" Y="233" WIDTH="48" HEIGHT="659" CONF="0.882" ORDER="4" STRING="上聖明不以言罪下也若此則流言消釈" />
<LINE TYPE="本文" X="1002" Y="234" WIDTH="49" HEIGHT="657" CONF="0.868" ORDER="5" STRING="疑惑著明鳳白行其策錄之補過將美皆" />
<LINE TYPE="本文" X="948" Y="239" WIDTH="46" HEIGHT="114" CONF="0.731" ORDER="6" STRING="此類也" />
<LINE TYPE="本文" X="973" Y="358" WIDTH="22" HEIGHT="84" CONF="0.738" ORDER="7" STRING="師古曰" />
<LINE TYPE="本文" X="947" Y="357" WIDTH="24" HEIGHT="83" CONF="0.821" ORDER="8" STRING="将助也" />
<LINE TYPE="本文" X="945" Y="430" WIDTH="50" HEIGHT="462" CONF="0.738" ORDER="9" STRING="優游不仕以寿終欽子及" />
<LINE TYPE="本文" X="892" Y="235" WIDTH="51" HEIGHT="654" CONF="0.866" ORDER="10" STRING="昆第支屬至二千石者且十人欽兄緩前" />
<LINE TYPE="本文" X="839" Y="226" WIDTH="48" HEIGHT="668" CONF="0.875" ORDER="11" STRING="免太常以列侯奉朝請成帝時乃薨子業一" />
<LINE TYPE="本文" X="783" Y="230" WIDTH="49" HEIGHT="651" CONF="0.876" ORDER="12" STRING="嗣業有材能以列侯運復爲太常數言唱" />
<LINE TYPE="本文" X="725" Y="224" WIDTH="52" HEIGHT="675" CONF="0.858" ORDER="13" STRING="失不事權貴與丞相雇方進衛尉定慶長" />
<LINE TYPE="本文" X="686" Y="348" WIDTH="31" HEIGHT="134" CONF="0.593" ORDER="14" STRING="前漢伝三十" />
<LINE TYPE="本文" X="626" Y="226" WIDTH="51" HEIGHT="669" CONF="0.830" ORDER="15" STRING="淳于長不平後業坐法免官復爲函谷關" />
<LINE TYPE="本文" X="572" Y="230" WIDTH="50" HEIGHT="657" CONF="0.823" ORDER="16" STRING="都尉会定陵侯長有罪當就國長勇紅" />
<LINE TYPE="本文" X="518" Y="231" WIDTH="46" HEIGHT="661" CONF="0.839" ORDER="17" STRING="侯立與業書曰誠哀老姉垂白隨無狀一" />
<LINE TYPE="本文" X="487" Y="316" WIDTH="23" HEIGHT="273" CONF="0.790" ORDER="18" STRING="・・・師古曰垂白者言白髪" />
<LINE TYPE="本文" X="469" Y="229" WIDTH="38" HEIGHT="91" CONF="0.748" ORDER="19" STRING="出圖" />
<LINE TYPE="本文" X="461" Y="585" WIDTH="48" HEIGHT="307" CONF="0.747" ORDER="20" STRING="願勿復用前事相一" />
<LINE TYPE="本文" X="461" Y="321" WIDTH="26" HEIGHT="270" CONF="0.799" ORDER="21" STRING="下垂也無状猶言不肖" />
<LINE TYPE="本文" X="428" Y="697" WIDTH="26" HEIGHT="192" CONF="0.792" ORDER="22" STRING="蘇林曰長與許_一" />
<LINE TYPE="本文" X="409" Y="225" WIDTH="48" HEIGHT="482" CONF="0.765" ORDER="24" STRING="侵定陵侯既出關伏罪復発" />
<LINE TYPE="本文" X="404" Y="697" WIDTH="25" HEIGHT="192" CONF="0.814" ORDER="25" STRING="后書也語在外" />
<LINE TYPE="本文" X="352" Y="231" WIDTH="50" HEIGHT="661" CONF="0.807" ORDER="26" STRING="飜下離陽獄丞相史捜得紅陽侯書奏業" />
<LINE TYPE="本文" X="321" Y="399" WIDTH="24" HEIGHT="148" CONF="0.800" ORDER="27" STRING="服度曰受立" />
<LINE TYPE="本文" X="295" Y="542" WIDTH="48" HEIGHT="352" CONF="0.766" ORDER="28" STRING="一理坐免就國其春承祖" />
<LINE TYPE="本文" X="299" Y="228" WIDTH="47" HEIGHT="174" CONF="0.808" ORDER="29" STRING="聴請不断" />
<LINE TYPE="本文" X="296" Y="398" WIDTH="24" HEIGHT="150" CONF="0.818" ORDER="30" STRING="属請爲不敬" />
<LINE TYPE="本文" X="240" Y="230" WIDTH="51" HEIGHT="661" CONF="0.861" ORDER="31" STRING="方進薨業上書言方進本與長深結厚長" />
<LINE TYPE="本文" X="209" Y="370" WIDTH="24" HEIGHT="114" CONF="0.739" ORDER="32" STRING="師古曰更" />
<LINE TYPE="本文" X="185" Y="230" WIDTH="48" HEIGHT="140" CONF="0.825" ORDER="33" STRING="相稱薦" />
<LINE TYPE="本文" X="183" Y="369" WIDTH="25" HEIGHT="116" CONF="0.803" ORDER="34" STRING="晋工衡反" />
<LINE TYPE="本文" X="179" Y="478" WIDTH="52" HEIGHT="418" CONF="0.773" ORDER="35" STRING="《題:長陷大悪獨得不坐苟大悪獨得不坐苟" />
</PAGE>
</OCRDATASET>
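To find the doubtful lines in output like this, the XML can be read back with the standard library. The following is a minimal sketch, assuming only the attribute names visible above (`ORDER`, `CONF`, `STRING`); the helper names `read_lines` and `low_confidence` and the 0.6 threshold are hypothetical choices for illustration, and `SAMPLE` is just the first three lines of the output above.

```python
import xml.etree.ElementTree as ET

# First three LINE elements of the output above, used as sample input.
SAMPLE = """<OCRDATASET>
<PAGE IMAGENAME="B0010001.jpg" WIDTH="1500" HEIGHT="992">
<LINE TYPE="本文" X="1301" Y="111" WIDTH="45" HEIGHT="22" CONF="0.424" ORDER="0" STRING="の一" />
<LINE TYPE="本文" X="1300" Y="703" WIDTH="29" HEIGHT="175" CONF="0.745" ORDER="1" STRING="東方文化学院京都研究所" />
<LINE TYPE="本文" X="1171" Y="233" WIDTH="49" HEIGHT="660" CONF="0.843" ORDER="2" STRING="因章事挙直言極諫並見耶從官展盡其" />
</PAGE>
</OCRDATASET>"""

def read_lines(xml_text):
    """Parse the OCR XML and return its lines sorted by reading order."""
    root = ET.fromstring(xml_text)
    lines = [
        {
            "order": int(e.get("ORDER")),
            "conf": float(e.get("CONF")),
            "text": e.get("STRING"),
        }
        for e in root.iter("LINE")
    ]
    return sorted(lines, key=lambda line: line["order"])

def low_confidence(lines, threshold=0.6):
    """Return the lines whose confidence is below `threshold` (an arbitrary cut-off)."""
    return [line for line in lines if line["conf"] < threshold]

lines = read_lines(SAMPLE)
for line in low_confidence(lines):
    print(line["order"], line["conf"], line["text"])  # prints: 0 0.424 の一
```

Applied to the full output above, the same 0.6 cut-off would flag only ORDER 0 ("の一") and ORDER 14 ("前漢伝三十").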
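There are misreadings here and there, but it reads even the interlinear notes (夾註) fairly well, so the result is pretty decent. It doesn't use a GPU either, which makes it very lightweight for a classical Chinese OCR. It doesn't support Python 3.9, though, so the open question is how to get the environment to match.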
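One way to catch an environment mismatch early is to check the interpreter version before cloning anything. This is only a sketch: the minimum of 3.10 is my assumption (the only thing established above is that 3.9 does not work), and `MIN_PYTHON` and `python_ok` are hypothetical names.

```python
import sys

# Assumption: 3.10 is the oldest version that works. The post only
# establishes that 3.9 does not.
MIN_PYTHON = (3, 10)

def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if `version` (default: the running interpreter) is new enough."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)

if not python_ok():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} is older than "
        f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}; the OCR may fail to install or run."
    )
```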
