OCR on PDFs with GAS and Google Colab

Posted at 2024-04-14

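Introduction

I had a large number of PDFs that I wanted to turn into data. I was willing to accept some manual correction afterwards, but I wanted to extract the text with as little effort as possible, so I ran OCR on them.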

Assumptions

Some PDFs are recognized as proper documents (text can be selected in a PDF reader).
Some PDFs are recognized as images.
Some PDFs mix the two.

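Workflow

Save the PDFs to Google Drive
Convert the PDFs to images with Python (PDFs recognized as documents are converted to images too, so everything is handled the same way)
Run OCR with GAS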

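That is all.
The image conversion could just as well be done locally, but I did not want to keep jumping between environments, so I used Google Colab to keep everything inside a Google account.

Incidentally, this was my first time using GAS.

Also, if you are fine with handling files manually one at a time, opening a PDF with Google Docs will extract the text for you.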

1. Converting PDFs to images with Python

First, install poppler-utils, the Linux package that pdf2image needs.

pdf2image.ipynb

!apt install poppler-utils

Then install the Python package pdf2image.

pdf2image.ipynb

!pip install pdf2image

Mount Google Drive, since the PDF files to process are stored there.

pdf2image.ipynb

from google.colab import drive
drive.mount('/content/drive')

Next, create the working folders.

pdf2image.ipynb

%cd /content/drive/MyDrive

%mkdir -p 99_work
%cd 99_work

%mkdir -p 99_pdf
%mkdir -p 99_jpg
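
For reference, the same folder layout can also be created from plain Python instead of notebook magics. This is a minimal sketch and not part of the original notebook; `make_work_dirs` is a hypothetical helper name, and the base path is whatever your Drive mount point is.

```python
from pathlib import Path


def make_work_dirs(base_dir):
    """Create 99_work/99_pdf and 99_work/99_jpg under base_dir (hypothetical helper)."""
    work_dir = Path(base_dir) / "99_work"
    pdf_dir = work_dir / "99_pdf"
    jpg_dir = work_dir / "99_jpg"
    for d in (pdf_dir, jpg_dir):
        # parents=True also creates 99_work; exist_ok=True mirrors `mkdir -p`
        d.mkdir(parents=True, exist_ok=True)
    return pdf_dir, jpg_dir
```

In Colab this would be called as `make_work_dirs('/content/drive/MyDrive')`. Unlike `%cd`, it does not change the current directory, so relative paths such as `99_pdf/` would need to be replaced with the returned paths.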

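The conversion itself does nothing fancy; it simply uses pdf2image (in hindsight, giving the notebook the same name as the library is confusing...).
This creates one folder per PDF file, containing one JPG file per page.

pdf2image.ipynb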

import os
from pdf2image import convert_from_path

# Resolution (DPI)
dpi = 300
# Output as JPG
extension = 'JPEG'

source_dir = '99_pdf/'
dest_dir = '99_jpg/'

print('Deleting files from previous runs')

for file in os.listdir(dest_dir):
  file_path = os.path.join(dest_dir, file)

  if os.path.isfile(file_path):
    print(" >", file_path, '... ', end='')

    try:
      os.remove(file_path)
      print('success')
    except Exception as e:
      print(f'failed : {e}')
print()

print('PDF -> image conversion')

for root, _, files in os.walk(source_dir):
  for file in files:
    if file.lower().endswith('.pdf'):
      pdf_path = os.path.join(root, file)
      dest_file_name = os.path.splitext(file)[0]
      dest_sub_dir = os.path.join(dest_dir, dest_file_name)

      os.makedirs(dest_sub_dir, exist_ok=True)

      print(" >", pdf_path, '... ', end='')

      try:
        _ = convert_from_path(pdf_path, dpi=dpi, output_folder=dest_sub_dir, output_file=dest_file_name, fmt=extension)
        print('success')
      except Exception as e:
        print(f'failed : {e}')
print()
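
Note that the cleanup loop above only removes files directly under `99_jpg/`, while the converted pages are written into per-PDF subfolders, so images from an earlier run stay in place. If you want a clean slate on every run, something like the following sketch could be used instead. `clear_output_dir` is a hypothetical helper, not from the original notebook, and it deletes without confirmation, so it should only be pointed at a folder whose contents can be regenerated.

```python
import shutil
from pathlib import Path


def clear_output_dir(dest_dir):
    """Delete everything inside dest_dir, including per-PDF subfolders."""
    removed = []
    for entry in Path(dest_dir).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # removes the subfolder and the JPGs inside it
        else:
            entry.unlink()  # plain file (or symlink) directly under dest_dir
        removed.append(entry.name)
    return sorted(removed)
```

Calling `clear_output_dir('99_jpg/')` before the conversion loop would return the names of the entries it removed, which can be printed as a simple log.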

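2. OCR with GAS

After getting the OCR result for each page, the script joins the results for all pages (images) into a single file.
To avoid losing track of where the page breaks are, the image file name is inserted before each page's text.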
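
Because each page is introduced by its file name followed by a dashed line, the combined file can be split back into pages later, for example when correcting the text by hand. The following is a minimal sketch of that idea; `split_pages` and `SEPARATOR` are hypothetical names, and it assumes the OCR text itself never contains a line made only of ten or more dashes.

```python
import re

# A line consisting only of dashes, as written between file name and text
SEPARATOR = re.compile(r"^-{10,}$")


def split_pages(combined_text):
    """Split combined OCR text into (image_file_name, page_text) pairs."""
    lines = combined_text.replace("\r\n", "\n").split("\n")
    pages = []
    current_name = None
    current_lines = []
    for i, line in enumerate(lines):
        next_is_separator = i + 1 < len(lines) and SEPARATOR.match(lines[i + 1])
        if next_is_separator:
            # This line is a file-name header: close the previous page first
            if current_name is not None:
                pages.append((current_name, "\n".join(current_lines).strip()))
            current_name = line
            current_lines = []
        elif SEPARATOR.match(line):
            continue  # skip the dashed line itself
        elif current_name is not None:
            current_lines.append(line)
    if current_name is not None:
        pages.append((current_name, "\n".join(current_lines).strip()))
    return pages
```

Each returned pair holds the image file name and the text recognized for that page, so a page can be checked against its source image.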

The image folder and the text output folder are specified by ID, which is taken from the end of the URL shown when the folder is opened in Google Drive.


https://drive.google.com/drive/u/0/folders/(this part is the ID)
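
If you would rather not copy the ID by hand, it can be pulled out of the URL programmatically. The sketch below is an illustration only: `folder_id_from_url` is a hypothetical helper, and it assumes the URL has the `/folders/<ID>` shape shown above.

```python
from urllib.parse import urlparse


def folder_id_from_url(url):
    """Return the path segment after 'folders' in a Drive folder URL, or None."""
    # urlparse drops the query string, so '?usp=sharing' and similar are ignored
    segments = [s for s in urlparse(url).path.split("/") if s]
    if "folders" in segments:
        index = segments.index("folders")
        if index + 1 < len(segments):
            return segments[index + 1]
    return None
```

The returned string is what goes into `rootFolderID` and `textFolderID` in the script below.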

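To operate on Google Drive from the script, the Drive API has to be enabled (added as an advanced service in the Apps Script editor).
One thing to watch out for: use v2.
With v3 the specification seems to have changed, and DocumentApp.openById fails with an error.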

pdf2image.gs

// Main function
function myFunction() {
  ocrAllImagesInFolders();
}

// Root folder containing the images
const rootFolderID = 'replace this with the folder ID';

// Folder where the text files are saved
const textFolderID = 'replace this with the folder ID';
const textFolder = DriveApp.getFolderById(textFolderID);

// OCR settings
const option = {
  'ocr': true,
  'ocrLanguage': 'ja',
};

function ocrAllImagesInFolders() {
  const folders = DriveApp.getFolderById(rootFolderID).getFolders();

  while (folders.hasNext()) {
    const folder = folders.next();

    if (folder.getId() === textFolderID) {
      // Skip the text output folder
      continue;
    }

    const folderName = folder.getName();

    console.log(folderName + ' : start...');

    const files = folder.getFiles();

    // Sort the image files
    const sortedFiles = sortImageFiles(files);

    // Run OCR on all sorted image files and get the joined text
    const allImageText = ocrAllImagesInFolder(sortedFiles);

    console.log(' >> text file save start...');

    const textFileName = folderName + '.txt';
    // Write the text file
    writeTextFile(allImageText, textFileName);

    console.log(' >> text file save end... : ' + textFileName);

    console.log(folderName + ' : end...');
  }
}

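// Sort the image files by name
function sortImageFiles(files) {
  const filesArray = [];

  while (files.hasNext()) {
    filesArray.push(files.next());
  }

  filesArray.sort((x, y) => {
    // Compare file names explicitly and return 0 for equal names
    const a = x.getName();
    const b = y.getName();
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
  });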

  return filesArray.values();
}

// Run OCR on every image file in the folder and return the joined text
function ocrAllImagesInFolder(files) {
  let allImageText = '';

  for (const file of files) {
    if (file.getMimeType() === 'application/vnd.google-apps.script') {
      // Exclude GAS files from processing
      continue;
    }

    const fileName = file.getName();

    console.log(' > ocr start... : ' + fileName);

    // Run OCR
    const text = ocrImage(file.getId(), fileName);
    // Append the extracted text
    // Prefix it with the file name so the page can be identified
    allImageText += fileName + '\r\n' + '--------------------------------------------------\r\n' + text + '\r\n';

    console.log(' > ocr end... : ' + fileName);
  }

  return allImageText;
}

// Run OCR
function ocrImage(fileId, fileName) {
  const resource = { title: fileName };

  const image = Drive.Files.copy(resource, fileId, option);
  const doc = DocumentApp.openById(image.id);
  const text = doc.getBody().getText();

  Drive.Files.remove(doc.getId());

  return text;
}

// Write the text file
function writeTextFile(text, fileName) {
  const contentType = 'text/plain';
  const charSet = 'UTF8';
  const blob = Utilities.newBlob('', contentType, fileName).setDataFromString(text, charSet);

  textFolder.createFile(blob);
}
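
The script orders pages by plain string comparison, which is fine as long as the page numbers in the file names are zero-padded to the same width. If names ever arrive unpadded (for example `page-10.jpg` and `page-2.jpg`, which are hypothetical names), a natural sort key keeps the pages in numeric order. The sketch below is in Python for consistency with the notebook; `natural_key` is a hypothetical helper, and the same idea would carry over to the GAS comparator.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'p-2' sorts before 'p-10'."""
    # re.split with a capturing group alternates: text, digits, text, digits, ...
    chunks = re.split(r"(\d+)", name)
    return [int(c) if i % 2 else c.lower() for i, c in enumerate(chunks)]
```

Usage would be `sorted(names, key=natural_key)`; plain `sorted(names)` would instead place `page-10.jpg` before `page-2.jpg`.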

GitHub
