[Python] Checking figure and table number consistency in PDF files with Python

Posted at 2024-07-21

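Introduction

In a previous article, I showed how to use Python to automatically collect the figure numbers from the slides of a PowerPoint file and then check whether each of those numbers is referenced from another slide.

[Continued] [PowerPoint] Checking figure and table number consistency in PowerPoint files with Python (XML automation)

This time I will show how to run the same figure and table number consistency check on PDF files instead of PowerPoint files.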

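What you will learn

- How to automatically extract figure numbers from PDF files
- How to automatically check that figure number references are consistent

Who this article is for

- Anyone who wants to work with PDF files in Python
- Anyone who wants to make document review work more efficient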

Environment, tools, and language

OS version
- Windows 11 23H2
Tools
- Spyder 5.5.1
Language
- Python 3.12 (Anaconda3)

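Automatically checking figure and table number consistency across multiple PDF files with Python

Let's get straight to it and write code with the following features.
　① Load the PDF files (.pdf) in a specified folder
　② Automatically extract figure and table numbers
　③ Check consistency
　④ Write a CSV file and print errors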

Required libraries

The following required library is not included in Anaconda3. Install it beforehand, for example with pip.

- PyMuPDF (imported as fitz)

pip install PyMuPDF

Code example

diag_check_pdf.py

import re
import pandas as pd
from collections import defaultdict
import os
import fitz  # PyMuPDF

def extract_figure_and_table_info_from_pdf(pdf_file):
    """Collect figure/table numbers, titles, pages, and references from one PDF."""
    info_dict = defaultdict(lambda: {'title': '', 'files': [], 'pages': [], 'references': []})

    doc = fitz.open(pdf_file)
    # 図 = figure, 表 = table. Figure numbers may carry sub-numbers (e.g. 図1-1, 図12.1-1).
    figure_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)\s*(.*)')
    table_pattern = re.compile(r'(表\d+)\s*(.*)')
    reference_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?|表\d+)')

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()

        # Extract figure numbers and figure titles
        figure_matches = figure_pattern.findall(text)
        for match in figure_matches:
            figure_number, figure_title = match[0], match[1].strip()
            # Text ending in punctuation is body text rather than a caption,
            # so record it as an in-text call-out ("呼び出し")
            if figure_title and figure_title[-1] in '。．，、':
                figure_title = "呼び出し"
            info_dict[figure_number]['title'] = figure_title
            info_dict[figure_number]['files'].append(pdf_file)
            info_dict[figure_number]['pages'].append(page_num + 1)

        # Extract table numbers and table titles
        table_matches = table_pattern.findall(text)
        for match in table_matches:
            table_number, table_title = match[0], match[1].strip()
            info_dict[table_number]['title'] = table_title
            info_dict[table_number]['files'].append(pdf_file)
            info_dict[table_number]['pages'].append(page_num + 1)

        # Record every occurrence (captions and in-text mentions) as a reference
        reference_matches = reference_pattern.findall(text)
        for ref in reference_matches:
            info_dict[ref]['references'].append(f"{pdf_file}: Page {page_num + 1}")

    doc.close()
    return info_dict

def extract_figure_and_table_info_from_folder(folder_path):
    """Merge the figure/table information of every PDF file in the folder."""
    all_info_dict = defaultdict(lambda: {'title': '', 'files': [], 'pages': [], 'references': []})

    for file_name in os.listdir(folder_path):
        # Compare in lower case so that files named *.PDF are also picked up
        if file_name.lower().endswith(".pdf"):
            pdf_file = os.path.join(folder_path, file_name)
            info_dict = extract_figure_and_table_info_from_pdf(pdf_file)
            for key, value in info_dict.items():
                all_info_dict[key]['title'] = value['title']
                all_info_dict[key]['files'].extend(value['files'])
                all_info_dict[key]['pages'].extend(value['pages'])
                all_info_dict[key]['references'].extend(value['references'])

    return all_info_dict

def check_references(info_dict):
    errors = []
    for diag_table, info in info_dict.items():
        if '図' in diag_table or '表' in diag_table:
            if len(info['references']) == 1:
                errors.append(f"Error: {diag_table} is referenced in only one slide {info['references'][0]}.")
    return errors


# Specify the folder
folder_path = 'target'  # folder containing the files to check

# main
info_dict = extract_figure_and_table_info_from_folder(folder_path)
errors = check_references(info_dict)


if info_dict:
    # Build the rows for the DataFrame
    data = []
    for diag_table, info in info_dict.items():
        files = ','.join(sorted(set(info['files'])))
        pages = ','.join(map(str, sorted(set(info['pages']))))
        references = ','.join(sorted(set(info['references'])))
        data.append([diag_table, info['title'], files, pages, references])

    df = pd.DataFrame(data, columns=["Diag/Table", "Title", "File", "Page", "References"])
  
    # Save as a CSV file (to_csv writes comma-separated values by default)
    output_csv_file = folder_path + '.csv'
    df.to_csv(output_csv_file, index=False, encoding='utf-8-sig')
    print(f"Data saved to CSV file {output_csv_file}.")

if errors:
    for error in errors:
        print(error)
else:
    print("All references are consistent.")

Example output

I prepared a folder containing a PDF file as shown below and ran the code above.
Please prepare your own PDF file to try it out (the same applies when using multiple PDF files).

target
└sample.pdf

Data saved to CSV file target.csv.
Error: 図1 is referenced in only one slide target\sample.pdf: Page 1.
Error: 図7 is referenced in only one slide target\sample.pdf: Page 1.
Error: 図1-1 is referenced in only one slide target\sample.pdf: Page 2.
Error: 表2 is referenced in only one slide target\sample.pdf: Page 5.
Error: 図11 is referenced in only one slide target\sample.pdf: Page 7.
Error: 図12.1-1 is referenced in only one slide target\sample.pdf: Page 9.
Error: 図1-1.2 is referenced in only one slide target\sample.pdf: Page 10.
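
To see why these lines are reported, it may help to apply the same rule to plain strings. The sketch below is not part of the script above: it reuses reference_pattern as written there, but find_single_occurrences and the sample page texts are hypothetical names and data introduced only for illustration. A number that occurs just once in total is either a caption that is never mentioned in the text or a mention without a caption.

```python
import re
from collections import defaultdict

# Same pattern as reference_pattern in diag_check_pdf.py
reference_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?|表\d+)')

def find_single_occurrences(page_texts):
    """Return the numbers that occur exactly once across all pages (hypothetical helper)."""
    occurrences = defaultdict(list)
    for page_num, text in enumerate(page_texts, start=1):
        for number in reference_pattern.findall(text):
            occurrences[number].append(page_num)
    return {number: pages for number, pages in occurrences.items() if len(pages) == 1}

# Made-up page texts standing in for the output of page.get_text()
sample_pages = [
    "図1 はじめに\n詳細は図2を参照。",  # page 1: caption for 図1, mention of 図2
    "図2 概要\n表1 比較表",             # page 2: captions for 図2 and 表1
    "表1に結果をまとめる。",            # page 3: mention of 表1
]

single = find_single_occurrences(sample_pages)
print(single)  # {'図1': [1]}
```

Here only 図1 is flagged, because 図2 and 表1 each appear twice (once as a caption and once as a mention).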

The CSV file contains the automatically extracted file names (File), figure and table numbers (Diag/Table), and the pages where each number appears (References).

target.csv

Diag/Table	Title	File	Page	References
図1	はじめに	target\sample.pdf	1	target\sample.pdf: Page 1
図2	概要	target\sample.pdf	1,3	target\sample.pdf: Page 1,target\sample.pdf: Page 3
図3	結果	target\sample.pdf	1,4	target\sample.pdf: Page 1,target\sample.pdf: Page 4
図7	呼び出し	target\sample.pdf	1	target\sample.pdf: Page 1
図1.1-1	補足	target\sample.pdf	1,8	target\sample.pdf: Page 1,target\sample.pdf: Page 8
図1-1	序論	target\sample.pdf	2	target\sample.pdf: Page 2
表1	比較表	target\sample.pdf	2,6	target\sample.pdf: Page 2,target\sample.pdf: Page 6
表2	作りかけ	target\sample.pdf	5	target\sample.pdf: Page 5
図11	さいごに	target\sample.pdf	7	target\sample.pdf: Page 7
図12.1-1	補足	target\sample.pdf	9	target\sample.pdf: Page 9
図1-1.2	補足	target\sample.pdf	10	target\sample.pdf: Page 10
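
The Title column comes from the text that follows each figure number on the same line, and a title of 呼び出し (call-out) means the match was part of a sentence rather than a caption. The following sketch shows that heuristic in isolation, using figure_pattern as written in the script; classify_figure_lines and the sample text are hypothetical and exist only for this illustration.

```python
import re

# Same pattern as figure_pattern in diag_check_pdf.py
figure_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)\s*(.*)')

def classify_figure_lines(text):
    """Return (number, title) pairs using the script's caption heuristic (hypothetical helper)."""
    results = []
    for number, title in figure_pattern.findall(text):
        title = title.strip()
        # Trailing Japanese punctuation suggests a sentence, not a caption
        if title and title[-1] in '。．，、':
            title = "呼び出し"
        results.append((number, title))
    return results

# Made-up text: one caption line and one sentence that mentions a figure
sample_text = "図3 結果\n詳細は図7に示す。"
print(classify_figure_lines(sample_text))  # [('図3', '結果'), ('図7', '呼び出し')]
```

Because the heuristic looks only at the last character of the line, a caption that happens to end with punctuation would also be labelled as a call-out.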

In the References column the file name is repeated for every page, which is a little redundant, so there is room for improvement here.
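
One possible way to reduce this redundancy is to group the page numbers by file before joining them. The following is a minimal sketch, assuming the references keep the "<file>: Page <n>" format produced by the script above; compact_references is a hypothetical helper name and is not part of the original code.

```python
from collections import defaultdict

def compact_references(references):
    """Group "<file>: Page <n>" strings by file (hypothetical helper)."""
    pages_by_file = defaultdict(set)
    for ref in references:
        # rpartition splits on the last ": Page ", so paths containing ":" stay intact
        file_name, sep, page = ref.rpartition(': Page ')
        if not sep:
            continue  # skip strings that are not in the expected format
        pages_by_file[file_name].add(int(page))
    return '; '.join(
        f"{file_name}: Page {','.join(map(str, sorted(pages)))}"
        for file_name, pages in sorted(pages_by_file.items())
    )

refs = ["target\\sample.pdf: Page 3", "target\\sample.pdf: Page 1", "target\\sample.pdf: Page 1"]
print(compact_references(refs))  # target\sample.pdf: Page 1,3
```

In the script, this could replace the line that builds references, for example references = compact_references(info['references']).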

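Supporting both PowerPoint and PDF files

Let's combine this with the code from my previous article, which checks figure and table number consistency in PowerPoint files, so that both PowerPoint and PDF files can be handled.

[Continued] [PowerPoint] Checking figure and table number consistency in PowerPoint files with Python (XML automation)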

Code example

diag_check_ppt_pdf.py

from pptx import Presentation
import re
import pandas as pd
from collections import defaultdict
import os
import zipfile
import shutil
import fitz  # PyMuPDF

def extract_figure_and_table_info_from_pptx(pptx_file):
    """Collect figure/table numbers, titles, slides, and references from one PowerPoint file."""
    info_dict = defaultdict(lambda: {'title': '', 'slides': [], 'references': []})

    # Define the figure and table number patterns (図 = figure, 表 = table)
    figure_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)\s*(.*)')
    table_pattern = re.compile(r'(表\d+)\s*(.*)')
    reference_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?|表\d+)')

    # Process only the file passed in; the caller already loops over the folder,
    # so scanning the folder again here would count every file several times
    file_name = os.path.basename(pptx_file)
    prs = Presentation(pptx_file)

    for slide_index, slide in enumerate(prs.slides, start=1):
        for shape in slide.shapes:
            if shape.has_text_frame:
                text = shape.text_frame.text

                # Extract figure numbers and figure titles
                figure_matches = figure_pattern.findall(text)
                for match in figure_matches:
                    figure_number, figure_title = match[0], match[1].strip()
                    # Text ending in punctuation is body text rather than a caption,
                    # so record it as an in-text call-out ("呼び出し")
                    if figure_title and figure_title[-1] in '。．，、':
                        figure_title = "呼び出し"
                    info_dict[figure_number]['title'] = figure_title
                    info_dict[figure_number]['slides'].append(f"{file_name}: Slide {slide_index}")

                # Extract table numbers and table titles
                table_matches = table_pattern.findall(text)
                for match in table_matches:
                    table_number, table_title = match[0], match[1].strip()
                    info_dict[table_number]['title'] = table_title
                    info_dict[table_number]['slides'].append(f"{file_name}: Slide {slide_index}")

                # Record every occurrence (captions and in-text mentions) as a reference
                reference_matches = reference_pattern.findall(text)
                for ref in reference_matches:
                    info_dict[ref]['references'].append(f"{file_name}: Slide {slide_index}")

    return info_dict

def extract_figure_and_table_info_from_pdf(pdf_file):
    """Collect figure/table numbers, titles, pages, and references from one PDF."""
    info_dict = defaultdict(lambda: {'title': '', 'files': [], 'pages': [], 'references': []})

    doc = fitz.open(pdf_file)
    # 図 = figure, 表 = table. Figure numbers may carry sub-numbers (e.g. 図1-1, 図12.1-1).
    figure_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)\s*(.*)')
    table_pattern = re.compile(r'(表\d+)\s*(.*)')
    reference_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?|表\d+)')

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text = page.get_text()

        # Extract figure numbers and figure titles
        figure_matches = figure_pattern.findall(text)
        for match in figure_matches:
            figure_number, figure_title = match[0], match[1].strip()
            # Text ending in punctuation is body text rather than a caption,
            # so record it as an in-text call-out ("呼び出し")
            if figure_title and figure_title[-1] in '。．，、':
                figure_title = "呼び出し"
            info_dict[figure_number]['title'] = figure_title
            info_dict[figure_number]['files'].append(pdf_file)
            info_dict[figure_number]['pages'].append(page_num + 1)

        # Extract table numbers and table titles
        table_matches = table_pattern.findall(text)
        for match in table_matches:
            table_number, table_title = match[0], match[1].strip()
            info_dict[table_number]['title'] = table_title
            info_dict[table_number]['files'].append(pdf_file)
            info_dict[table_number]['pages'].append(page_num + 1)

        # Record every occurrence (captions and in-text mentions) as a reference
        reference_matches = reference_pattern.findall(text)
        for ref in reference_matches:
            info_dict[ref]['references'].append(f"{pdf_file}: Page {page_num + 1}")

    doc.close()
    return info_dict

def extract_figure_and_table_info_from_folder(folder_path):
    """Collect figure/table information from every .pptx and .pdf file in the folder."""
    combined_info_dict_pdf = defaultdict(lambda: {'title': '', 'files': [], 'pages': [], 'references': []})
    combined_info_dict = defaultdict(lambda: {'title': '', 'slides': [], 'references': []})

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".pptx"):
            pptx_file = os.path.join(folder_path, file_name)
            info_dict = extract_figure_and_table_info_from_pptx(pptx_file)
            for key, value in info_dict.items():
                combined_info_dict[key]['title'] = value['title']
                combined_info_dict[key]['slides'].extend(value['slides'])
                combined_info_dict[key]['references'].extend(value['references'])

        elif file_name.endswith(".pdf"):
            pdf_file = os.path.join(folder_path, file_name)
            info_dict = extract_figure_and_table_info_from_pdf(pdf_file)
            for key, value in info_dict.items():
                combined_info_dict_pdf[key]['title'] = value['title']
                combined_info_dict_pdf[key]['files'].extend(value['files'])
                combined_info_dict_pdf[key]['pages'].extend(value['pages'])
                combined_info_dict_pdf[key]['references'].extend(value['references'])

    # When any PDF results exist, only those are returned; the PowerPoint results
    # are returned only if no figure or table numbers were found in a PDF file.
    # Choosing here (instead of rebinding inside the loop) keeps the PowerPoint
    # branch from writing into the PDF dictionary when a PDF is listed first.
    return combined_info_dict_pdf if combined_info_dict_pdf else combined_info_dict

def check_references(info_dict):
    errors = []
    for diag_table, info in info_dict.items():
        if '図' in diag_table or '表' in diag_table:
            if len(info['references']) == 1:
                errors.append(f"Error: {diag_table} is referenced in only one slide {info['references'][0]}.")
    return errors

def convert_and_extract_pptx(file_path):
    # Check if the file is a PowerPoint file
    if not file_path.lower().endswith('.pptx'):
        raise ValueError("The file must be a PowerPoint (.pptx) file.")

    # Split the path into the directory and the file name without its extension
    dir_name = os.path.dirname(file_path)
    stem = os.path.splitext(os.path.basename(file_path))[0]

    # Build the .zip path; splitext also handles upper-case extensions such as .PPTX
    zip_file_path = os.path.join(dir_name, stem + '.zip')

    # Copy the .pptx file to a .zip file (the original file is left untouched)
    shutil.copyfile(file_path, zip_file_path)

    # Create a directory to extract the contents
    extract_dir = os.path.join(dir_name, stem)
    os.makedirs(extract_dir, exist_ok=True)

    # Extract the .zip file
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)

    # Optionally, remove the .zip file after extraction
    os.remove(zip_file_path)

    return extract_dir

def process_all_pptx_in_folder(folder_path):
    # List all files in the directory
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        
        # Check if the file is a PowerPoint file
        if file_path.lower().endswith('.pptx'):
            try:
                extracted_dir = convert_and_extract_pptx(file_path)
                print(f"Extracted contents of {file_name} are located in: {extracted_dir}")
            except Exception as e:
                print(f"Failed to process {file_name}: {e}")


# Specify the folder
folder_path = 'target'
process_all_pptx_in_folder(folder_path)

# main
info_dict = extract_figure_and_table_info_from_folder(folder_path)
errors = check_references(info_dict)


if info_dict:
    # Build the rows for the DataFrame
    data = []
    for diag_table, info in info_dict.items():
        # PowerPoint-only results have no 'files'/'pages' keys, so fall back to empty lists
        files = ','.join(sorted(set(info.get('files', []))))
        pages = ','.join(map(str, sorted(set(info.get('pages', [])))))
        references = ','.join(sorted(set(info['references'])))
        data.append([diag_table, info['title'], files, pages, references])

    df = pd.DataFrame(data, columns=["Diag/Table", "Title", "File", "Page", "References"])
  
    # Save as a CSV file (to_csv writes comma-separated values by default)
    output_csv_file = folder_path + '.csv'
    df.to_csv(output_csv_file, index=False, encoding='utf-8-sig')
    print(f"Data saved to CSV file {output_csv_file}.")

if errors:
    for error in errors:
        print(error)
else:
    print("All references are consistent.")

Example output

I prepared a folder containing PowerPoint and PDF files as shown below and ran the code above.
Please prepare your own PowerPoint and PDF files to try it out.

target
├data1.pptx
├data2.pptx
├data3.pptx
└sample.pdf

Extracted contents of data1.pptx are located in: target\data1
Extracted contents of data2.pptx are located in: target\data2
Extracted contents of data3.pptx are located in: target\data3
Data saved to CSV file target.csv.
Error: 図1 is referenced in only one slide target\sample.pdf: Page 1.
Error: 図7 is referenced in only one slide target\sample.pdf: Page 1.
Error: 図1-1 is referenced in only one slide target\sample.pdf: Page 2.
Error: 表2 is referenced in only one slide target\sample.pdf: Page 5.
Error: 図11 is referenced in only one slide target\sample.pdf: Page 7.
Error: 図12.1-1 is referenced in only one slide target\sample.pdf: Page 9.
Error: 図1-1.2 is referenced in only one slide target\sample.pdf: Page 10.

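The script runs the consistency check on both the PowerPoint and the PDF files, but only the analysis results for the PDF files are output.
I plan to make it output the consistency check results for both PowerPoint and PDF files in the future.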
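
As a possible starting point for that, both result dictionaries could be converted to a common shape before they are checked. The sketch below assumes the dictionary layouts returned by the two extraction functions above (slides for PowerPoint, files and pages for PDF); merge_results and the locations key are hypothetical names introduced here and are not part of the original code.

```python
from collections import defaultdict

def merge_results(pptx_info, pdf_info):
    """Merge PowerPoint and PDF results into one dictionary (hypothetical helper)."""
    merged = defaultdict(lambda: {'title': '', 'locations': [], 'references': []})
    for number, info in pptx_info.items():
        merged[number]['title'] = info['title']
        merged[number]['locations'].extend(info['slides'])
        merged[number]['references'].extend(info['references'])
    for number, info in pdf_info.items():
        # Keep a title that was already found in a PowerPoint file
        merged[number]['title'] = merged[number]['title'] or info['title']
        # files and pages are appended in pairs by the PDF extraction function
        merged[number]['locations'].extend(
            f"{file_name}: Page {page}" for file_name, page in zip(info['files'], info['pages'])
        )
        merged[number]['references'].extend(info['references'])
    return dict(merged)

# Made-up results in the shapes produced by the two extraction functions
pptx_info = {'図1': {'title': '概要', 'slides': ['data1.pptx: Slide 2'],
                    'references': ['data1.pptx: Slide 2']}}
pdf_info = {'図1': {'title': '', 'files': ['sample.pdf'], 'pages': [4],
                   'references': ['sample.pdf: Page 4']}}

merged = merge_results(pptx_info, pdf_info)
print(merged['図1']['references'])  # ['data1.pptx: Slide 2', 'sample.pdf: Page 4']
```

Since check_references only reads the references list, it should work on the merged dictionary without changes; the CSV step would then need a single Locations column in place of File and Page.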