
Recording the iPhone screen and extracting text with OCR (Google Colab edition)

Posted at 2023-12-03, last updated at 2023-12-03

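Text shown on an iPhone screen is easy to copy when it fits on one screen: take a screenshot and copy the text from it. When there is so much that you have to scroll through many screens, that becomes tedious. So let's try recording the screen as a video and running OCR on it frame by frame.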

First, the usual boilerplate: give Google Colab access to Google Drive.

from google.colab import drive
drive.mount('/content/drive')

Running these two lines opens a popup in the browser; press the button to allow access.

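Next, install the ffmpeg-python bindings for video processing, plus tesseract (a free OCR engine) and pyocr to call it from Python. This takes a little while.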

!pip install ffmpeg-python
import ffmpeg

!apt install tesseract-ocr libtesseract-dev tesseract-ocr-jpn
!pip install pyocr
import pyocr

ocr_tools = pyocr.get_available_tools()
print(ocr_tools)

ocr = ocr_tools[0]

Then upload a suitable video from the iPhone to Google Drive, split it into frames, and read the text from each frame with OCR.

from PIL import Image
import numpy as np

video_path = '/content/drive/My Drive/Scan/TEST.MP4'

ffmpeg_cmd = ffmpeg.input(video_path)
ffmpeg_cmd = ffmpeg.output(ffmpeg_cmd, 'pipe:', format='rawvideo', pix_fmt='rgb24')
process = ffmpeg_cmd.run_async(pipe_stdout=True)
video_info = ffmpeg.probe(video_path)
print(video_info['streams'])

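# Pick the video stream by type instead of hard-coding index 1 (stream order varies between files)
video_stream = next(s for s in video_info['streams'] if s['codec_type'] == 'video')
width = video_stream['width']
height = video_stream['height']
num, den = video_stream['avg_frame_rate'].split('/')  # e.g. '60/1'
fps = int(num) / int(den)  # parsed without eval
num_frames = int(video_stream['nb_frames'])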

list1 = []
j = 0
for i in range(num_frames):
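  raw_image = process.stdout.read(width * height * 3)
  if len(raw_image) < width * height * 3:
    break  # the stream ended before nb_frames was reached
  image = Image.frombytes('RGB', (width, height), raw_image)
  if i == 0:
    prev_image = image
  # Shrink both frames to a grayscale thumbnail (1 px wide, 10 px tall) and compare them
  a = np.array(list(image.resize((1,10)).convert('L').getdata()))
  b = np.array(list(prev_image.resize((1,10)).convert('L').getdata()))
  dist = np.linalg.norm(a-b)
  if dist > 5: # run OCR only when the frame differs from the last scanned frame by more than 5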
    j = j + 1
    if ocr:
      text = ocr.image_to_string(image,lang='jpn',builder=pyocr.builders.TextBuilder(tesseract_layout=11))
      for line in text.splitlines():
        list1.append(line)
    prev_image = image

print(j)
print(list1)
print(len(list1))

process.stdout.close()
process.wait()

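In my environment, running OCR on every frame of a roughly 16-second video (720 frames) took about an hour. So the code shrinks each frame to a 1×10-pixel grayscale thumbnail (1 pixel wide, 10 pixels tall), compares it with the thumbnail of the last frame that was scanned, and runs OCR only when the difference is large enough. This cuts the number of frames passed to OCR to about a quarter.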
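The frame-skipping idea can be tried on its own without a video file. The following is a minimal sketch, not the code used above: it assumes frames are already NumPy arrays of shape (height, width, 3), replaces PIL's resize with a plain average over horizontal bands, and uses an unweighted mean of R, G, B as grayscale. The names `thumbnail` and `select_frames` are made up for this illustration. Unlike the loop above, it always keeps the first frame.

```python
import numpy as np

def thumbnail(frame, rows=10):
    """Reduce an (H, W, 3) RGB frame to `rows` grayscale values, one per horizontal band."""
    gray = frame.astype(float).mean(axis=2)      # crude grayscale: mean of R, G, B
    bands = np.array_split(gray, rows, axis=0)   # split the image into horizontal bands
    return np.array([band.mean() for band in bands])

def select_frames(frames, threshold=5.0):
    """Return the indices of frames that differ enough from the last selected frame."""
    selected = []
    prev = None
    for i, frame in enumerate(frames):
        thumb = thumbnail(frame)
        if prev is None or np.linalg.norm(thumb - prev) > threshold:
            selected.append(i)
            prev = thumb                         # compare later frames against this one
    return selected

# Synthetic 20x8 frames: two identical dark frames, then two with a bright top half
dark = np.zeros((20, 8, 3), dtype=np.uint8)
changed = dark.copy()
changed[:10] = 255
print(select_frames([dark, dark, changed, changed]))  # [0, 2]
```

The threshold of 5 is the value used in this article; a suitable value probably depends on the video resolution and how fast the screen is scrolled.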

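The OCR call specifies tesseract_layout=11, which pyocr passes to Tesseract as the page segmentation mode. Changing it to suit the kind of text being scanned, based on the information at the link below, may improve recognition accuracy.
https://note.com/yucco72/n/nf416803fa9eb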
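One way to experiment with the layout value is to run the same frame through several settings and compare the results. The sketch below is only an illustration under assumptions: `pick_layout` and the `recognize` callable are hypothetical names, and "most non-blank characters" is a crude score that will also reward garbage output, so the returned text should still be checked by eye. A stand-in recognizer is used so the example runs without Tesseract.

```python
def pick_layout(recognize, layouts=(3, 4, 6, 11)):
    """Try each layout value and return (layout, text) for the one with the most non-blank characters.

    `recognize` is a hypothetical callable: it takes a layout number and returns the OCR text.
    """
    best_layout, best_text, best_score = None, "", -1
    for layout in layouts:
        text = recognize(layout)
        score = sum(1 for ch in text if not ch.isspace())
        if score > best_score:
            best_layout, best_text, best_score = layout, text, score
    return best_layout, best_text

# Stand-in recognizer for illustration. With pyocr it could instead be something like:
#   lambda n: ocr.image_to_string(image, lang='jpn',
#                                 builder=pyocr.builders.TextBuilder(tesseract_layout=n))
fake_results = {3: "abc", 4: "abc def", 6: "", 11: "abc def ghi"}
print(pick_layout(lambda n: fake_results[n]))  # (11, 'abc def ghi')
```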

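The extracted strings are now in list1, but a look at them shows they are full of junk. Also, when you record while scrolling, the same text shows up repeatedly, with different misrecognitions each time. The next step cleans this up.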

import difflib
import re

list2 = []
for line in list1:
  flag = True
  line = line.replace('"','')
  line = line.replace("'","")
  line = line.replace(".","")  # the original "\." only matched a literal backslash followed by a dot
  for char in " (){}<>[]*;:/=@®『』【】|※-—,ヽ":
    line = line.replace(char,'')
  line = re.sub("[0-9]","",line)
  line = re.sub(r'^[a-zA-Zあ-ん]+','', line)  # strip leading Latin letters and hiragana from the line
  if len(line) > 10: # keep only lines longer than 10 characters
    for word in ["更新","表示数"]:
      if word in line:
        flag = False  
    if flag:
      if line not in list2:
        list2.append(line)

print(list2)
print(len(list2))

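In my case, the information I wanted was always longer than 10 characters, and lines containing strings such as 更新 ("updated") were unnecessary, so I filtered on those conditions and stored the result in list2.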
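Since these conditions will differ for every recording, it may help to wrap the cleanup in a function with the conditions as parameters. This is a minimal sketch, not the code above: `clean_lines` and `NOISE` are names invented here, the noise character set is only a subset of the one used above, and `str.translate` is swapped in for the chain of `replace` calls.

```python
import re

# Characters to drop entirely (an illustrative subset; extend it to match your own OCR noise)
NOISE = str.maketrans("", "", " \"'.(){}<>[]*;:/=@|,0123456789")

def clean_lines(lines, min_len=10, banned=("更新", "表示数")):
    """Strip noise characters, drop short or banned lines, and remove exact duplicates in order."""
    seen = {}
    for line in lines:
        line = line.translate(NOISE)
        line = re.sub(r"^[a-zA-Zあ-ん]+", "", line)  # strip leading Latin letters and hiragana
        if len(line) <= min_len:
            continue
        if any(word in line for word in banned):
            continue
        seen.setdefault(line, None)                 # a dict keeps first-seen order
    return list(seen)

sample = [
    "12 東京都の天気は晴れのち曇りです",
    "東京都の天気は晴れのち曇りです.",
    "更新: 東京都の天気は晴れのち曇りです",
    "短い行",
]
print(clean_lines(sample))  # ['東京都の天気は晴れのち曇りです']
```

Using a dict for deduplication avoids rescanning the whole result list for every line, which matters little at this scale but keeps the function simple.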

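Even so, list2 still contains a mix of correctly recognized strings and misrecognized variants, such as 「ざわめく」 and 「ざねわめ」. I used the difflib library to compute the similarity between strings and collected the pairs whose similarity is above 0.9 (but not identical).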

import difflib
import pprint

list3 = []
for w1 in list2:
  for w2 in list2:
    d = difflib.SequenceMatcher(None, w1, w2).ratio()
    if 0.9 < d and d < 1 :
      if (w1,w2) not in list3:
        if (w2,w1) not in list3:
          list3.append((w1,w2))
print(len(list3))
pprint.pprint(list3)
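The pairs in list3 show which lines are near-duplicates, but they are not yet merged. One possible next step, sketched here under assumptions, is to group similar lines greedily and keep one representative per group. `group_similar` is a hypothetical helper, and taking the first member as the representative is an arbitrary choice: difflib cannot tell which variant is the correct reading.

```python
import difflib

def group_similar(lines, threshold=0.9):
    """Greedily group lines whose similarity to a group's first member exceeds `threshold`."""
    groups = []
    for line in lines:
        for group in groups:
            if difflib.SequenceMatcher(None, group[0], line).ratio() > threshold:
                group.append(line)
                break
        else:
            groups.append([line])   # no similar group found: start a new one
    return groups

lines = [
    "abcdefghijklmnopqrst",
    "abcdefghijklmnopqrsX",   # one character differs from the first line
    "zzzzzzzzzzzzzzzzzzzz",
]
groups = group_similar(lines)
print(groups)
print([g[0] for g in groups])  # one representative per group
```

Because each line is compared only with the first member of each group, the result depends on input order; that is usually acceptable for a quick cleanup like this one.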

For my purposes, the result is reasonably usable at this stage, so I'm satisfied.
