
How I Tried to Automate a Bit of My Book Scanning (Jisui)

Last updated at 2020-05-24, posted at 2020-05-07

Overview

The program watches a specified directory, and when a PDF file is placed there, it automatically renames the PDF to the title of the book.
book_maker

Tested OS

macOS Catalina

Requirements

Install Poppler(for PDF command)
```
$ brew install poppler
```

Install Tesseract(for OCR)

$ brew install tesseract
$ brew install tesseract-lang

Library
- watchdog
- pdf2image
- pyocr
- pyzbar
- pillow
- requests
- python-box

Usage

$ python3 src/watch.py input_path [output_path] [*extensions]
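
The argument handling inside src/watch.py is not shown in this article. As a minimal sketch of how the usage line above could be parsed, assuming that output_path falls back to input_path and that extensions fall back to pdf (the helper name `parse_args` and both defaults are assumptions, not the author's code):

```python
import argparse


def parse_args(argv):
    """Parse: input_path [output_path] [*extensions] (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog='watch.py')
    parser.add_argument('input_path', help='directory to watch')
    # Assumption: when omitted, output_path falls back to input_path
    parser.add_argument('output_path', nargs='?', default=None)
    # Assumption: when omitted, only PDF files are watched
    parser.add_argument('extensions', nargs='*', default=['pdf'])
    args = parser.parse_args(argv)
    if args.output_path is None:
        args.output_path = args.input_path
    return args
```

In a script this would typically be called as `parse_args(sys.argv[1:])`.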

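Why I made it

I finally decided to work through the large pile of books at my parents' home, so I splurged on a paper cutter and a scanner.
However, I had often heard that scanning your own books (jisui) is tedious, so I wrote this program to automate at least part of the process.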

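Workflow

The flow is as follows.

1. Specify the directory to watch and start src/watch.py
2. Place a PDF in the watched directory
3. Detect the event and extract the ISBN code from the contents of the PDF file
- Ways of obtaining the ISBN code
  - From the barcode, using a shell script
  - From the barcode, in Python code
  - From the text, in Python code
4. Fetch the book information from each API based on the ISBN
- APIs used
  - Google Books APIs
  - openBD
5. Fix the file name and move the PDF file to the output directory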
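
Step 3 tries several ways of obtaining the ISBN in turn. A minimal sketch of that fallback idea, assuming each method is wrapped as a callable that takes a path and returns an ISBN string or None (`first_isbn` and `extractors` are hypothetical names used only for illustration):

```python
def first_isbn(pdf_path, extractors):
    """Return the first ISBN produced by any extractor, or None.

    `extractors` is a list of callables taking a path and returning an
    ISBN string or None. Both names are hypothetical.
    """
    for extract in extractors:
        isbn = extract(pdf_path)
        if isbn:
            return isbn
    return None
```

Ordering the list from cheapest to most expensive method keeps OCR as the last resort.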

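Watching a specific directory

To watch a directory continuously, I used a library called watchdog.
For details on how to use watchdog, the following documentation and article were very helpful.
Thank you.

watchdog official documentation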
- API Reference
Qiita

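To use watchdog, you need a Handler and an Observer.
A Handler describes what to do, and how, when each event (create, delete, move, modify) occurs.
In this program, only on_created, which handles creation events, is defined.
This on_created method overrides the method of the same name on the FileSystemEventHandler class in watchdog.events.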

src/handler/handler.py

from watchdog.events import PatternMatchingEventHandler

class Handler(PatternMatchingEventHandler):
    def __init__(self, input_path, output_path, patterns=None):
        if patterns is None:
            patterns = ['*.pdf']
        super(Handler, self).__init__(patterns=patterns,
                                      ignore_directories=True,
                                      case_sensitive=False)

    def on_created(self, event):
        # Do something with the newly created file
        pass

The Handler class inherits from PatternMatchingEventHandler, which adds pattern matching.
Using it lets you restrict which kinds of files trigger events.
There is also RegexMatchingEventHandler, which accepts regular-expression patterns.
Since I wanted to process PDFs only, I set patterns=['*.pdf'].
I set ignore_directories=True so that directories are ignored, and case_sensitive=False so that both *.pdf and *.PDF are detected.
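
To illustrate the effect of a glob pattern combined with case-insensitive matching, here is a small sketch using the standard-library fnmatch module. It only mimics the observable behavior described above and is not watchdog's internal implementation; the helper name `matches` is an assumption.

```python
import fnmatch


def matches(path, patterns, case_sensitive=False):
    """Return True if `path` matches any glob pattern (illustrative helper)."""
    if not case_sensitive:
        # Lower-case both sides so that *.pdf also matches BOOK.PDF
        path = path.lower()
        patterns = [p.lower() for p in patterns]
    return any(fnmatch.fnmatchcase(path, p) for p in patterns)
```

With case_sensitive=True, a file named book.PDF would no longer match '*.pdf'.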

Next, prepare an Observer, which watches the directory and dispatches events to the Handler.

src/watch.py

import time
from watchdog.observers import Observer
from src.handler.handler import Handler


def watch(input_path, output_path, extensions):
    print([f'*.{extension}' for extension in extensions], flush=True)
    event_handler = Handler(input_path=input_path,
                            output_path=output_path,
                            patterns=[f'*.{extension}' for extension in extensions])
    observer = Observer()
    observer.schedule(event_handler, input_path, recursive=False)
    observer.start()
    print('--Start Observer--', flush=True)
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.unschedule_all()
        observer.stop()
        print('--End Observer--', flush=True)
    observer.join()

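The Handler object, the directory to watch, and whether to watch subdirectories recursively are passed to schedule() on the created Observer object.
observer.start() begins watching, and the while loop with time.sleep(1) keeps the process running.
When Ctrl+C is pressed, observer.unschedule_all() stops all watches and detaches the event handlers, and observer.stop() signals the thread to stop.
Finally, observer.join() waits until the thread has terminated.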

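Getting the ISBN code from the barcode with a shell script

I referred to the following blog post.
Thank you.

I want to read the barcode image from the PDF of a self-scanned book, get the ISBN, and link to the title obtained from Amazon's API

The ISBN code is obtained from the barcode.
To get the information out of the PDF, I used pdfinfo, pdfimages, and zbarimg.
pdfinfo gets the total number of pages in the PDF.
pdfimages, based on the total page count from pdfinfo, extracts only the first and last pages as JPEG.
zbarimg reads the ISBN code from the JPEGs generated by pdfimages.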

getISBN.sh

#!/bin/bash

# Number of pages to check in PDF
PAGE_COUNT=1
# File path
FILE_PATH="$1"

# If the file extension is not pdf
shopt -s nocasematch
if [[ ! $1 =~ .+(\.pdf)$ ]]; then
  exit 1
fi
shopt -u nocasematch

# Delete all .image* generated by pdfimages
rm -f .image*

# Get total count of PDF pages
pages=$(pdfinfo "$FILE_PATH" | grep -E "^Pages" | sed -E "s/^Pages: +//") &&
# Generate JPEG from PDF
pdfimages -j -l "$PAGE_COUNT" "$FILE_PATH" .image_h &&
pdfimages -j -f $((pages - PAGE_COUNT)) "$FILE_PATH" .image_t &&
# Grep ISBN
isbnTitle="$(zbarimg -q .image* | sort | uniq | grep -E '^EAN-13:978' | sed -E 's/^EAN-13://' | sed 's/-//')" &&
# If the ISBN was found, echo the ISBN
[ "$isbnTitle" != "" ] &&
echo "$isbnTitle" && rm -f .image* && exit 0 ||
# Else, exit with error code
rm -f .image* && exit 1

Finally, when the ISBN code is found, echo "$isbnTitle" writes it to standard output, and the Python side receives it from there.
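
The page-count step of the script (pdfinfo piped through grep and sed) can also be done on the Python side once the pdfinfo output has been captured. A minimal sketch, assuming the output contains a line of the form "Pages: N" as the script expects (`parse_page_count` is a hypothetical helper):

```python
import re


def parse_page_count(pdfinfo_output):
    """Extract the page count from `pdfinfo` text output, or None if absent."""
    match = re.search(r'^Pages:\s+(\d+)', pdfinfo_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else None
```

The text to parse could come from `subprocess.run(['pdfinfo', path], capture_output=True, text=True).stdout`, which avoids building a shell pipeline as a string.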

I did not really understand what && and || meant here, but the following article helped.
Thank you.

Convenient but tricky control operators: `&&` and `||`

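Getting the ISBN code with Python

From the barcode

To read the ISBN from the barcode, I used pdf2image to convert the PDF into images and pyzbar to decode the barcode.

pdf2image generates JPEG images for the last 2 pages of the PDF, decode() from pyzbar is called on those images, and if a decoded string matches the ISBN regular-expression pattern (^978), it is returned.

I wanted the directory holding the generated images to be temporary, so I used TemporaryDirectory().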
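
A prefix check on 978 accepts any decoded string that starts with those digits. As an optional extra guard, which is not part of the original program, the EAN-13 check digit can be verified before trusting the value. The helper name `is_valid_isbn13` is an assumption:

```python
def is_valid_isbn13(code):
    """Return True if `code` is 13 digits with a correct EAN-13 check digit."""
    digits = code.replace('-', '')
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])
```

A misread barcode that still begins with 978 would usually fail this check.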

src/isbn_from_pdf.py

import re
import subprocess
import tempfile
from pyzbar.pyzbar import decode
from pdf2image import convert_from_path

# Assumption: PAGE_COUNT was not shown in the original excerpt; 1 scans the last 2 pages
PAGE_COUNT = 1


# The enclosing function was omitted in the original excerpt; this name is illustrative
def isbn_from_barcode(input_path):
    cmd = f'echo $(pdfinfo "{input_path}" | grep -E "^Pages" | sed -E "s/^Pages: +//")'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    total_page_count = int(result.stdout.strip())

    with tempfile.TemporaryDirectory() as temp_path:
        last_pages = convert_from_path(input_path,
                                       first_page=total_page_count - PAGE_COUNT,
                                       output_folder=temp_path,
                                       fmt='jpeg')
        # Extract the ISBN from any barcode found on these pages
        for page in last_pages:
            for data in decode(page):
                text = data[0].decode('utf-8', 'ignore')
                if re.match('978', text):
                    return text.replace('-', '')
    # No barcode starting with 978 was found
    return None

From the text

The other approach is to extract the ISBN code from the last pages of the book, where information such as the publisher and edition is printed.

I used pyocr to extract text from the images.
pyocr needs an OCR tool, so Google's tesseract must be installed.

src/isbn_from_pdf.py

import re
import sys
import subprocess
import tempfile
import pyocr
import pyocr.builders
from pdf2image import convert_from_path

# Assumption: PAGE_COUNT was not shown in the original excerpt; 1 scans the last 2 pages
PAGE_COUNT = 1


# The enclosing function was omitted in the original excerpt; this name is illustrative
def isbn_from_text(input_path):
    texts = []
    cmd = f'echo $(pdfinfo "{input_path}" | grep -E "^Pages" | sed -E "s/^Pages: +//")'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    total_page_count = int(result.stdout.strip())

    with tempfile.TemporaryDirectory() as temp_path:
        last_pages = convert_from_path(input_path,
                                       first_page=total_page_count - PAGE_COUNT,
                                       output_folder=temp_path,
                                       fmt='jpeg')
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print('[ERROR] No OCR tool found.', flush=True)
            sys.exit()

        # Convert each page image to a string
        tool = tools[0]
        lang = 'jpn'
        for page in last_pages:
            text = tool.image_to_string(
                page,
                lang=lang,
                builder=pyocr.builders.TextBuilder(tesseract_layout=3)
            )
            texts.append(text)
        # Extract the ISBN from the recognized text
        for text in texts:
            if re.search(r'ISBN978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9]', text):
                return re.findall(r'978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9]', text).pop().replace('-', '')
    # No ISBN was recognized in the text
    return None
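
The pattern in the code above fixes the two middle groups at four digits each, while printed ISBNs use groups of varying length. A more tolerant variant could look like the sketch below; the pattern, its group-length limits, and the helper name `find_isbn` are assumptions, not the author's code.

```python
import re

# Assumption: 978/979 prefix followed by four hyphen-separated groups,
# with the total digit count checked separately below.
ISBN_PATTERN = re.compile(r'97[89]-\d{1,5}-\d{1,7}-\d{1,7}-\d')


def find_isbn(text):
    """Return the first hyphenated ISBN-13 in `text` as 13 digits, or None."""
    for candidate in ISBN_PATTERN.findall(text):
        digits = candidate.replace('-', '')
        if len(digits) == 13:
            return digits
    return None
```

OCR output is noisy, so a looser pattern like this may also pick up more false positives than the stricter one.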

Fetching book information from each API

To get the book information, I used two APIs: Google Books APIs and openBD.

Both return JSON, but with different structures. I wanted the code for the two to look as similar as possible, so I used a library called Box.

Box lets you write dict.key.another_key where you would normally write dict.get('key') or dict['key'].
dict['key'] still works as well.

It also has convenient features: it can convert camelCase keys to snake_case, the Python naming convention, and when a key contains a space, such as personal thoughts, it lets you access it as dict.personal_thoughts.
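
To make the key conversion concrete, here is a small standard-library sketch of the kind of mapping described above. It is only an illustration of the idea and not how Box implements it; the helper name `to_snake_key` is an assumption.

```python
import re


def to_snake_key(key):
    """Convert a camelCase or space-separated key to snake_case (illustrative)."""
    key = key.replace(' ', '_')
    # Insert an underscore before each capital that follows a lowercase letter or digit
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', key).lower()
```

Under this mapping, a JSON key such as volumeInfo would be reached as volume_info.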

The code below fetches from openBD.

src/bookinfo_from_isbn.py

import re
import json
import requests
from box import Box

OPENBD_API_URL = 'https://api.openbd.jp/v1/get?isbn={}'

HEADERS = {"content-type": "application/json"}

class BookInfo:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return f'<{self.__class__.__name__}>{json.dumps(self.__dict__, indent=4, ensure_ascii=False)}'


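def _format_title(title):
    # Replace full-width parentheses and full-width spaces with a half-width space
    title = re.sub('[（）　]', ' ', title).rstrip()
    # Collapse runs of one or more half-width spaces into a single space
    return re.sub(' +', ' ', title)


def _format_author(author):
    # Remove everything from the full-width slash "／" onward
    return re.sub('／.+', '', author)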


def book_info_from_openbd(isbn):
    res = requests.get(OPENBD_API_URL.format(isbn), headers=HEADERS)
    if res.status_code == 200:
        # openBD returns null for an ISBN it does not know, so check before wrapping in Box
        book = res.json()[0]
        if book is not None:
            openbd_res = Box(book, camel_killer_box=True, default_box=True, default_box_attr='')
            open_bd_summary = openbd_res.summary
            title = _format_title(open_bd_summary.title)
            author = _format_author(open_bd_summary.author)
            return BookInfo(title=title, author=author)
    else:
        print(f'[WARNING] openBD status code was {res.status_code}', flush=True)

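The titles and author names returned contain a mix of full-width and half-width characters, so I prepared functions to clean each of them up (_format_title and _format_author).
I have not actually cut and scanned a book yet, so these functions will probably need tuning.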
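
One possible direction for that tuning, offered only as a sketch and not as the author's method, is Unicode NFKC normalization, which folds full-width letters, digits, parentheses, and the full-width space into their half-width forms. Unlike _format_title, this keeps the parentheses instead of replacing them with spaces. The helper name `normalize_title` is an assumption.

```python
import re
import unicodedata


def normalize_title(title):
    """Fold full-width characters to half-width and collapse spaces (illustrative)."""
    # NFKC maps full-width letters, digits, parentheses and U+3000 to ASCII forms
    title = unicodedata.normalize('NFKC', title)
    return re.sub(' +', ' ', title).strip()
```

NFKC also changes other compatibility characters, so whether it suits real openBD titles would need checking against actual data.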

For Box, camel_killer_box=True converts camelCase to snake_case, and default_box=True with default_box_attr='' covers the case where a value is missing.
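
The Google Books counterpart (book_info_from_google) is not shown in this article. As a rough sketch of reading the same two fields from a decoded Google Books volumes response, assuming the usual items / volumeInfo layout (the helper name `parse_google_books` is hypothetical, and plain dict access is used here instead of Box):

```python
GOOGLE_BOOKS_API_URL = 'https://www.googleapis.com/books/v1/volumes?q=isbn:{}'


def parse_google_books(payload):
    """Return (title, author) from a decoded volumes response, or None."""
    items = payload.get('items') or []
    if not items:
        # No volume matched the ISBN
        return None
    volume_info = items[0].get('volumeInfo', {})
    title = volume_info.get('title', '')
    # `authors` is a list in this API, so join it into one string
    author = ', '.join(volume_info.get('authors', []))
    return title, author
```

The payload would come from something like `requests.get(GOOGLE_BOOKS_API_URL.format(isbn)).json()`.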

Fixing the file name and moving the file to the appropriate directory

First, on startup, the folder that renamed PDFs will be moved into is created.

src/handler/handler.py

import os
import datetime
from watchdog.events import PatternMatchingEventHandler

class Handler(PatternMatchingEventHandler):
    def __init__(self, input_path, output_path, patterns=None):
        if patterns is None:
            patterns = ['*.pdf']
        super(Handler, self).__init__(patterns=patterns,
                                      ignore_directories=True,
                                      case_sensitive=False)
        self.input_path = input_path
        # If the output_path is equal to input_path, then make a directory named with current time
        if input_path == output_path:
            self.output_path = os.path.join(self.input_path, datetime.datetime.now().strftime('%Y%m%d_%H%M%S'))
        else:
            self.output_path = output_path
        os.makedirs(self.output_path, exist_ok=True)

        # Create tmp directory inside of output directory
        self.tmp_path = os.path.join(self.output_path, 'tmp')
        os.makedirs(self.tmp_path, exist_ok=True)

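When processing starts, the output folder is created: either a folder named with the current date and time, or the output folder that was specified.
A tmp folder is then created inside the output folder to hold files for which something went wrong (the same PDF book already exists, no ISBN was found, or no book information was available).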

src/handler/handler.py

    def __del__(self):
        # Delete the tmp directory, when the directory is empty
        tmp_files_len = len(os.listdir(self.tmp_path))
        if tmp_files_len == 0:
            os.rmdir(self.tmp_path)

        # Delete the output directory, when the directory is empty
        output_files_len = len(os.listdir(self.output_path))
        if output_files_len == 0:
            os.rmdir(self.output_path)

When processing ends, the __del__ method keeps the output folder and the tmp folder if they contain files and deletes them if they are empty.

src/handler/handler.py

import shutil
import subprocess
from src.isbn_from_pdf import get_isbn_from_pdf, NoSuchISBNException
from src.bookinfo_from_isbn import book_info_from_google, book_info_from_openbd, NoSuchBookInfoException

    def on_created(self, event):
        print('!Create Event!', flush=True)
        shell_path = os.path.join(os.path.dirname(__file__), '../../getISBN.sh')
        event_src_path = event.src_path
        # Pass the arguments as a list so that paths containing spaces are handled correctly
        cmd = [shell_path, event_src_path]
        result = subprocess.run(cmd, capture_output=True, text=True)
        try:
            if result.returncode == 0:
                # Retrieve ISBN from shell
                isbn = result.stdout.strip()
                print(f'ISBN from Shell -> {isbn}', flush=True)
                self._book_info_from_each_api(isbn, event_src_path)

            else:
                # Get ISBN from pdf barcode or text
                isbn = get_isbn_from_pdf(event_src_path)
                print(f'ISBN from Python -> {isbn}', flush=True)
                self._book_info_from_each_api(isbn, event_src_path)

        except (NoSuchISBNException, NoSuchBookInfoException) as e:
            print(e.args[0], flush=True)
            shutil.move(event_src_path, self.tmp_path)
            print(f'Move {os.path.basename(event_src_path)} to {self.tmp_path}', flush=True)

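The on_created method implements the overall flow described in the workflow.

The shell script is run with subprocess.run() so that its standard output can be captured; result.returncode gives the script's exit status and result.stdout gives its standard output.

In addition, dedicated exceptions are raised when the ISBN cannot be obtained or when the book information cannot be retrieved from the ISBN.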
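
The final rename-and-move step (_book_info_from_each_api) is not shown in this article. The sketch below shows one way it could work, assuming a file whose target name already exists is parked in the tmp folder rather than overwritten; `move_with_title` and its parameters are hypothetical.

```python
import os
import shutil


def move_with_title(src_path, title, output_dir, tmp_dir):
    """Move src_path to output_dir as '<title><ext>'; use tmp_dir on a name clash."""
    ext = os.path.splitext(src_path)[1]
    # Replace path separators so the title cannot point outside output_dir
    safe_title = title.replace('/', '_').replace(os.sep, '_')
    dest = os.path.join(output_dir, safe_title + ext)
    if os.path.exists(dest):
        # Same book already present: park the file in tmp instead of overwriting
        dest = os.path.join(tmp_dir, os.path.basename(src_path))
    return shutil.move(src_path, dest)
```

The function returns the final path, which makes it easy to log where each PDF ended up.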

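Summary

Thank you for reading this far.
I struggled with launching commands and with naming variables and functions, but I managed to get it into a minimally working shape.
At the moment only PDF is supported, but I would like to support formats such as epub as well.
I would also like to make it work on Windows.

If you notice any typos, mistakes, or places that could be done better, please let me know.
Thank you.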
