Can NDL古典籍OCR-Lite Be Used as a Kanbun (Classical Chinese) OCR?

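With one eye on the poster by 青池亨 and 木下貴文, 『軽量なレイアウト認識モデルを活用した大規模なOCRテキストデータの構造化及び成果物の分析』 [Structuring large-scale OCR text data with a lightweight layout recognition model, and an analysis of the results] (じんもんこん2025, poster P-2-18), I tried out NDL古典籍OCR-Lite on Google Colaboratory.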

# Sample page image and the NDL古典籍OCR-Lite repository
img="http://kanji.zinbun.kyoto-u.ac.jp/db-machine/toho/L/B0010001.jpg"
url="https://github.com/ndl-lab/ndlkotenocr-lite"
import os,json
from xml.sax.saxutils import escape
d=os.path.basename(url)
# Clone the repository once, then install its requirements with "==" pins relaxed to ">="
!test -d {d} || git clone --depth=1 {url}
!pip install `sed 's/==/>=/' {d}/requirements.txt`
f=os.path.basename(img)
# Download the image once, run the OCR, and print the resulting XML
!test -f {f} || curl -LO {img}
!( cd {d}/src && python ocr.py --sourceimg ../../{f} --output ../.. )
!cat {f.replace(".jpg",".xml")}
# Load the JSON output (kept in j so that it no longer overwrites the directory name d)
with open(f.replace(".jpg",".json"),"r",encoding="utf-8") as r:
  j=json.load(r)
s=f'<svg xmlns="http://www.w3.org/2000/svg" width="{j["imginfo"]["img_width"]}" height="{j["imginfo"]["img_height"]}">\n'
# Redraw every vertical line as SVG text stretched to fill its bounding box
for z in j["contents"][0]:
  if z["isVertical"]=="true":
    t,b=z["text"],z["boundingBox"]
    x,y=(b[0][0]+b[3][0])/2,(b[0][1]+b[3][1])/2
    w,h=b[3][0]-b[0][0],b[3][1]-b[0][1]
    # escape() keeps "&" or "<" in the OCR text from breaking the SVG
    s+=f'<text transform="scale({w},{h/len(t)})" x="{x/w}" y="{y/h*len(t)}" text-anchor="middle" font-size="1" font-family="sans-serif" writing-mode="vertical-rl">{escape(t)}</text>\n'
s+='</svg>'
from IPython.display import SVG,display
display(SVG(s))
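
The `<text>` element built in the loop above is the least obvious part of the cell, so here is a minimal sketch of the same arithmetic as a standalone function. It assumes, as the cell does, that `boundingBox` holds four `[x, y]` corners with index 0 and index 3 diagonally opposite. The name `vertical_text_attrs` is made up for this illustration and is not part of NDL古典籍OCR-Lite.

```python
def vertical_text_attrs(text, box):
    """Return (transform, x, y) for one vertical line, mirroring the cell above.

    Assumption: box is a list of four [x, y] corners in which box[0] and
    box[3] are diagonally opposite (top-left and bottom-right).
    """
    n = len(text)
    if n == 0:
        raise ValueError("empty text has no glyph height")
    (x0, y0), (x1, y1) = box[0], box[3]
    w, h = x1 - x0, y1 - y0                 # box size in pixels
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2   # box centre in pixels
    # With font-size="1", scale(w, h/n) stretches each glyph cell to
    # w pixels wide and h/n pixels tall, so n glyphs span the whole box.
    sx, sy = w, h / n
    # x and y are interpreted in the scaled coordinate system,
    # so the pixel centre has to be divided by the scale factors.
    return f"scale({sx},{sy})", cx / sx, cy / sy


# Box rebuilt from X="1171" Y="233" WIDTH="49" HEIGHT="660";
# the corner order used here is an assumption.
box = [[1171, 233], [1171, 893], [1220, 233], [1220, 893]]
transform, x, y = vertical_text_attrs("因章事挙直言極諫並見耶從官展盡其", box)
print(transform, x, y)
```

Because `text-anchor="middle"` centres the run on the given point, passing the box centre should place the 16 characters so that they fill the box from top to bottom, whatever the font's natural proportions are.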

I fed it 『漢書零片』, had it output XML, and also converted the result to SVG. On my (安岡孝一) machine, the output was as follows.

<OCRDATASET>
<PAGE IMAGENAME="B0010001.jpg" WIDTH="1500" HEIGHT="992">
  <LINE TYPE="本文" X="1301" Y="111" WIDTH="45" HEIGHT="22" CONF="0.424" ORDER="0" STRING="の一" />
  <LINE TYPE="本文" X="1300" Y="703" WIDTH="29" HEIGHT="175" CONF="0.745" ORDER="1" STRING="東方文化学院京都研究所" />
  <LINE TYPE="本文" X="1171" Y="233" WIDTH="49" HEIGHT="660" CONF="0.843" ORDER="2" STRING="因章事挙直言極諫並見耶從官展盡其" />
  <LINE TYPE="本文" X="1113" Y="231" WIDTH="51" HEIGHT="661" CONF="0.873" ORDER="3" STRING="意加於往前以明示四方使天下咸知主" />
  <LINE TYPE="本文" X="1059" Y="233" WIDTH="48" HEIGHT="659" CONF="0.882" ORDER="4" STRING="上聖明不以言罪下也若此則流言消釈" />
  <LINE TYPE="本文" X="1002" Y="234" WIDTH="49" HEIGHT="657" CONF="0.868" ORDER="5" STRING="疑惑著明鳳白行其策錄之補過將美皆" />
  <LINE TYPE="本文" X="948" Y="239" WIDTH="46" HEIGHT="114" CONF="0.731" ORDER="6" STRING="此類也" />
  <LINE TYPE="本文" X="973" Y="358" WIDTH="22" HEIGHT="84" CONF="0.738" ORDER="7" STRING="師古曰" />
  <LINE TYPE="本文" X="947" Y="357" WIDTH="24" HEIGHT="83" CONF="0.821" ORDER="8" STRING="将助也" />
  <LINE TYPE="本文" X="945" Y="430" WIDTH="50" HEIGHT="462" CONF="0.738" ORDER="9" STRING="優游不仕以寿終欽子及" />
  <LINE TYPE="本文" X="892" Y="235" WIDTH="51" HEIGHT="654" CONF="0.866" ORDER="10" STRING="昆第支屬至二千石者且十人欽兄緩前" />
  <LINE TYPE="本文" X="839" Y="226" WIDTH="48" HEIGHT="668" CONF="0.875" ORDER="11" STRING="免太常以列侯奉朝請成帝時乃薨子業一" />
  <LINE TYPE="本文" X="783" Y="230" WIDTH="49" HEIGHT="651" CONF="0.876" ORDER="12" STRING="嗣業有材能以列侯運復爲太常數言唱" />
  <LINE TYPE="本文" X="725" Y="224" WIDTH="52" HEIGHT="675" CONF="0.858" ORDER="13" STRING="失不事權貴與丞相雇方進衛尉定慶長" />
  <LINE TYPE="本文" X="686" Y="348" WIDTH="31" HEIGHT="134" CONF="0.593" ORDER="14" STRING="前漢伝三十" />
  <LINE TYPE="本文" X="626" Y="226" WIDTH="51" HEIGHT="669" CONF="0.830" ORDER="15" STRING="淳于長不平後業坐法免官復爲函谷關" />
  <LINE TYPE="本文" X="572" Y="230" WIDTH="50" HEIGHT="657" CONF="0.823" ORDER="16" STRING="都尉会定陵侯長有罪當就國長勇紅" />
  <LINE TYPE="本文" X="518" Y="231" WIDTH="46" HEIGHT="661" CONF="0.839" ORDER="17" STRING="侯立與業書曰誠哀老姉垂白隨無狀一" />
  <LINE TYPE="本文" X="487" Y="316" WIDTH="23" HEIGHT="273" CONF="0.790" ORDER="18" STRING="・・・師古曰垂白者言白髪" />
  <LINE TYPE="本文" X="469" Y="229" WIDTH="38" HEIGHT="91" CONF="0.748" ORDER="19" STRING="出圖" />
  <LINE TYPE="本文" X="461" Y="585" WIDTH="48" HEIGHT="307" CONF="0.747" ORDER="20" STRING="願勿復用前事相一" />
  <LINE TYPE="本文" X="461" Y="321" WIDTH="26" HEIGHT="270" CONF="0.799" ORDER="21" STRING="下垂也無状猶言不肖" />
  <LINE TYPE="本文" X="428" Y="697" WIDTH="26" HEIGHT="192" CONF="0.792" ORDER="22" STRING="蘇林曰長與許_一" />
  <LINE TYPE="本文" X="409" Y="225" WIDTH="48" HEIGHT="482" CONF="0.765" ORDER="24" STRING="侵定陵侯既出關伏罪復発" />
  <LINE TYPE="本文" X="404" Y="697" WIDTH="25" HEIGHT="192" CONF="0.814" ORDER="25" STRING="后書也語在外" />
  <LINE TYPE="本文" X="352" Y="231" WIDTH="50" HEIGHT="661" CONF="0.807" ORDER="26" STRING="飜下離陽獄丞相史捜得紅陽侯書奏業" />
  <LINE TYPE="本文" X="321" Y="399" WIDTH="24" HEIGHT="148" CONF="0.800" ORDER="27" STRING="服度曰受立" />
  <LINE TYPE="本文" X="295" Y="542" WIDTH="48" HEIGHT="352" CONF="0.766" ORDER="28" STRING="一理坐免就國其春承祖" />
  <LINE TYPE="本文" X="299" Y="228" WIDTH="47" HEIGHT="174" CONF="0.808" ORDER="29" STRING="聴請不断" />
  <LINE TYPE="本文" X="296" Y="398" WIDTH="24" HEIGHT="150" CONF="0.818" ORDER="30" STRING="属請爲不敬" />
  <LINE TYPE="本文" X="240" Y="230" WIDTH="51" HEIGHT="661" CONF="0.861" ORDER="31" STRING="方進薨業上書言方進本與長深結厚長" />
  <LINE TYPE="本文" X="209" Y="370" WIDTH="24" HEIGHT="114" CONF="0.739" ORDER="32" STRING="師古曰更" />
  <LINE TYPE="本文" X="185" Y="230" WIDTH="48" HEIGHT="140" CONF="0.825" ORDER="33" STRING="相稱薦" />
  <LINE TYPE="本文" X="183" Y="369" WIDTH="25" HEIGHT="116" CONF="0.803" ORDER="34" STRING="晋工衡反" />
  <LINE TYPE="本文" X="179" Y="478" WIDTH="52" HEIGHT="418" CONF="0.773" ORDER="35" STRING="《題:長陷大悪獨得不坐苟大悪獨得不坐苟" />
  </PAGE>

</OCRDATASET>
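
Since every `<LINE>` carries a `CONF` score and an `ORDER` index, the XML can be post-processed with the standard library alone. The following is a minimal sketch and not part of NDL古典籍OCR-Lite: the function name `low_confidence_lines` and the 0.75 threshold are arbitrary choices for illustration, and `SAMPLE` embeds three lines copied from the output above.

```python
import xml.etree.ElementTree as ET

# Three <LINE> elements copied from the output above.
SAMPLE = """<OCRDATASET>
<PAGE IMAGENAME="B0010001.jpg" WIDTH="1500" HEIGHT="992">
  <LINE TYPE="本文" X="1301" Y="111" WIDTH="45" HEIGHT="22" CONF="0.424" ORDER="0" STRING="の一" />
  <LINE TYPE="本文" X="1171" Y="233" WIDTH="49" HEIGHT="660" CONF="0.843" ORDER="2" STRING="因章事挙直言極諫並見耶從官展盡其" />
  <LINE TYPE="本文" X="686" Y="348" WIDTH="31" HEIGHT="134" CONF="0.593" ORDER="14" STRING="前漢伝三十" />
</PAGE>
</OCRDATASET>"""


def low_confidence_lines(xml_text, threshold=0.75):
    """Return (ORDER, CONF, STRING) for lines whose CONF is below threshold."""
    root = ET.fromstring(xml_text)
    # Sort by the reading order assigned by the OCR.
    lines = sorted(root.iter("LINE"), key=lambda e: int(e.get("ORDER")))
    return [
        (int(e.get("ORDER")), float(e.get("CONF")), e.get("STRING"))
        for e in lines
        if float(e.get("CONF")) < threshold
    ]


for order, conf, text in low_confidence_lines(SAMPLE):
    print(order, conf, text)
```

On the sample this picks out ORDER 0 (「の一」, 0.424) and ORDER 14 (「前漢伝三十」, 0.593), which are also the two lowest-scoring lines in the full output above. A threshold like this is only a rough way to decide which lines to proofread first.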

(Figure: B0010001.png, the OCR result rendered as SVG)

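There are misreadings here and there, but it reads even the interlinear notes (夾註) fairly well, so it is doing pretty nicely. It does not use a GPU either, which makes it very lightweight for a kanbun OCR. However, it does not support python3.9, so the open question is how to set up an environment that matches.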
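
Because the tool does not run on python3.9, a notebook or script can fail fast before cloning or installing anything. A minimal sketch follows. This article only establishes that 3.9 is unsupported, so the `(3, 10)` floor used here is an assumption, and `MIN_PYTHON` and `python_is_supported` are names invented for this example.

```python
import sys

# Assumption: the article only says python3.9 is unsupported.
# Treating 3.10 as the minimum is a guess; adjust it after checking the repository.
MIN_PYTHON = (3, 10)


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the assumed floor."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= minimum


if not python_is_supported():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} is below "
        f"the assumed minimum {MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
    )
```

Running this check at the top of the cell keeps an unsupported interpreter from surfacing later as an obscure install or import error.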
