Extracting PDF text with Python and copying it to the clipboard

Last updated at 2023-03-01. Posted at 2023-02-28.

I want to copy table data out of a PDF

If you copy a table from a PDF and paste it straight into Excel, the result is a disappointing mess.

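It is still imperfect, but I am getting to the point where I can pull the data out as a table, so I am writing this down before I forget.
I use a library called tabula. Behind the scenes it appears to be a Java application.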

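Roughly speaking:
Pass the PDF file names as arguments and run the script. When it succeeds, the extracted tables are printed to standard output as TSV.
Originally the TSV data was pushed onto the clipboard automatically (forcibly?), so you could just open Excel and press Ctrl-V without thinking, but I switched to standard output because the Tk clipboard seemed a bit unreliable.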
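
As a rough illustration of the standard-output approach (not the script itself), the sketch below serializes rows as TSV with the standard `csv` module and writes them to stdout. The helper name `rows_to_tsv` and the sample rows are hypothetical. Output like this can then be piped into an OS clipboard tool, for example `clip` on Windows or `pbcopy` on macOS.

```python
import csv
import sys
from io import StringIO


def rows_to_tsv(rows, sep="\t"):
    """Serialize an iterable of rows into delimited text (hypothetical helper)."""
    buf = StringIO()
    # lineterminator="\n" keeps the output free of "\r\n" on every platform.
    writer = csv.writer(buf, delimiter=sep, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    sample = [["name", "qty"], ["apple", 3], ["pear", 5]]  # made-up data
    sys.stdout.write(rows_to_tsv(sample))
```

Tab-separated text pastes into Excel as one cell per field, which is why TSV is a convenient default here.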

When several PDF files are read in one go, everything is lumped into a single output, with each file's section introduced by a separator line like
-- filename --
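
The separator idea can be sketched as follows. This is a minimal illustration under my own assumptions, not the script's exact output format: the helper name `join_sections` and the sample file names are hypothetical, and the spacing around the separator is simplified.

```python
def join_sections(sections):
    """Join (filename, text) pairs, each under a '-- filename --' line."""
    parts = []
    for name, text in sections:
        # One header line, a blank line, then that file's extracted text.
        parts.append(f"-- {name} --\n\n{text.rstrip()}\n")
    # Two blank lines between consecutive files.
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = [("a.pdf", "x\ty\n"), ("b.pdf", "1\t2\n")]  # made-up content
    print(join_sections(demo), end="")
```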

Truly hopeless data still ends up with stray line breaks and spaces inside cells, but this is as far as I can take it.
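
One possible post-processing step for those stray line breaks and spaces, which is not part of the script here, is to collapse whitespace inside each cell before writing it out. The sketch below assumes that joining the fragments with a single space is acceptable; for Japanese text you may prefer to join with an empty string instead. The name `clean_cell` is hypothetical.

```python
import re


def clean_cell(value):
    """Collapse line breaks and runs of whitespace inside one cell."""
    if value is None:
        return ""
    # \s also matches the full-width space (U+3000) for str patterns.
    return re.sub(r"\s+", " ", str(value)).strip()


if __name__ == "__main__":
    print(clean_cell("Tokyo\r\n  Station "))
```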

import sys
import re
import tabula
from io import StringIO

def unicode_escape(x):
    return x.encode().decode("unicode_escape")

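def load(f, sep="\t"):
    # lattice=True extracts cells using the ruled lines of the table.
    dfs = tabula.read_pdf(f, lattice=True, pages="all", silent=True)
    sio = StringIO()
    cnt = 0  # blank lines emitted so far for empty, header-less tables

    for df in dfs:
        # Drop rows in which every cell is empty.
        df = df[df.notnull().any(axis=1)]
        # Treat a first column named "Unnamed: 0" as "no real header row".
        is_head = df.columns[0] != "Unnamed: 0"
        if df.size > 0:
            df.to_csv(sio, index=False, header=is_head, sep=sep)
        elif is_head:
            # Header but no data rows: emit only the header line.
            print(sep.join(df.columns), file=sio)
        elif cnt < 2:
            # Empty table: emit at most two blank lines in total.
            print(file=sio)
            cnt += 1

    # Normalize any "\r\n" line endings to "\n" before printing.
    text = re.sub(r"\r?\n", "\n", sio.getvalue())
    print(text)
    # Count only non-whitespace output, so blank lines alone are not a success.
    return len(text.strip())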

def main():
    from argparse import ArgumentParser
    ps = ArgumentParser(prog="info",
                        description="pdf to text\n")
    padd = ps.add_argument

    padd("files",
         metavar="<files>",
         nargs="+", default=[],
         help="pdf file path")

    padd('-s', '--sep', type=unicode_escape, default="\t",
         help='output separator (default `\\t`)')

    args = ps.parse_args()
    kw = dict(sep=args.sep)
    done = 0  # total amount of text extracted across all files
    if len(args.files) == 1:
        done = load(args.files[0], **kw)
    else:
        for i, f in enumerate(args.files):
            # Separator line before each file; blank lines only between files.
            print(("\n\n" if i else "") + f"-- {f} --\n")
            done += load(f, **kw)

    if done:
        print("Success! Wrote the PDF table text to standard output.", file=sys.stderr)
    else:
        print("Failed? No text was found.", file=sys.stderr)


if __name__ == "__main__":
    main()
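
The `--sep` option goes through `unicode_escape` so that a backslash sequence typed on the command line, such as `\t`, becomes the real character. The sketch below isolates that behavior. `parse_sep` is a hypothetical wrapper added only for illustration, and it takes the argument list explicitly so it can be tried without a shell.

```python
from argparse import ArgumentParser


def unicode_escape(x):
    # Turn a literal escape such as r"\t" into the character it names.
    return x.encode().decode("unicode_escape")


def parse_sep(argv):
    """Parse only the separator option from argv (hypothetical wrapper)."""
    ps = ArgumentParser()
    ps.add_argument("-s", "--sep", type=unicode_escape, default="\t")
    return ps.parse_args(argv).sep


if __name__ == "__main__":
    print(repr(parse_sep(["-s", r"\t"])))
    print(repr(parse_sep(["--sep", ","])))
```

Note that this decoding is only safe for ASCII separators: a non-ASCII character is encoded to UTF-8 first and then decoded byte by byte, which garbles it.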

This script exists thanks to a crazy government that smugly publishes its open data as PDFs.