
【PowerPoint】Checking the consistency of figure and table numbers in a PowerPoint file with Python

Posted at 2024-07-17

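Introduction

After creating material in PowerPoint, you sometimes want to check that the figure numbers are consistent. As a first step, the previous article showed how to use Python to automatically extract the figure numbers from the slides of a PowerPoint file.

This article shows how to automatically check whether each extracted figure number is referenced from another slide.

Note that the PowerPoint file must be converted to XML form beforehand.
See the previous article for how to do the conversion.

【PowerPoint】Automatically extracting figure numbers from a PowerPoint file with Python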

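What you will learn from this article

How to automatically extract figure numbers from a PowerPoint file
How to automatically check that figure number references are consistent

Intended audience

People who want to work with PowerPoint files from Python
People who want to make document review work more efficient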

Environment, tools, and language

OS version
- Windows 11 23H2
Tools
- Spyder 5.5.1
Language
- Python 3.12

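Automatically checking figure and table numbers with Python

Let's get straight to writing code with the following features. The code uses the python-pptx and pandas packages.
　① Load the PowerPoint file (.pptx) that has been converted to XML form (.xml)
　② Automatically extract the figure and table numbers
　③ Check consistency
　④ Write a CSV file and print any errors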
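Step ② can be tried in isolation before looking at the full script. The following is a minimal sketch, assuming numbers are written as 図 (figure) or 表 (table) followed by digits and optional ".N" or "-N" suffixes. The helper name `find_numbers` and the sample strings are hypothetical and used only for illustration.

```python
import re

# "図" (figure) or "表" (table), digits, then any number of ".N" or "-N" suffixes.
NUMBER_PATTERN = re.compile(r'(?:図|表)\d+(?:[.-]\d+)*')


def find_numbers(text):
    """Return every figure/table number found in text, in order of appearance."""
    return NUMBER_PATTERN.findall(text)


# Hypothetical sample strings: a caption and a sentence that cites two numbers.
print(find_numbers('図1-1 序論'))
print(find_numbers('詳細は図12.1-1と表2を参照。'))
```

Because the pattern has no capturing groups, `findall` returns the matched numbers as plain strings.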

Code example

diag_check.py

from collections import defaultdict
import re

import pandas as pd
from pptx import Presentation

# ① Load the XML-form file, ② extract the figure and table numbers
def extract_figure_and_table_info(pptx_file):
    prs = Presentation(pptx_file)
    info_dict = defaultdict(lambda: {'title': '', 'slides': [], 'references': []})

    # Patterns for figure (図) and table (表) numbers: digits followed by
    # any number of ".N" or "-N" suffixes, e.g. 図1, 図1-1, 図12.1-1.
    number = r'\d+(?:[.-]\d+)*'
    figure_pattern = re.compile(r'(図' + number + r')\s*(.*)')
    table_pattern = re.compile(r'(表' + number + r')\s*(.*)')
    reference_pattern = re.compile(r'(図' + number + r'|表' + number + r')')

    for slide_index, slide in enumerate(prs.slides, start=1):
        for shape in slide.shapes:
            if shape.has_text_frame:
                text = shape.text_frame.text

                # Extract figure numbers and figure titles
                figure_matches = figure_pattern.findall(text)
                for match in figure_matches:
                    figure_number, figure_title = match[0], match[1].strip()
                    # Text ending in punctuation is treated as a sentence that
                    # cites the figure, not a caption ("呼び出し" = "reference").
                    if figure_title and figure_title[-1] in '。．，、':
                        figure_title = "呼び出し"
                    info_dict[figure_number]['title'] = figure_title
                    info_dict[figure_number]['slides'].append(slide_index)

                # Extract table numbers and table titles
                table_matches = table_pattern.findall(text)
                for match in table_matches:
                    table_number, table_title = match[0], match[1].strip()
                    info_dict[table_number]['title'] = table_title
                    info_dict[table_number]['slides'].append(slide_index)

                # Record every slide on which each figure or table number appears
                reference_matches = reference_pattern.findall(text)
                for ref in reference_matches:
                    info_dict[ref]['references'].append(slide_index)

    return info_dict

# ③ Consistency check
def check_references(info_dict):
    errors = []
    for diag_table, info in info_dict.items():
        # Count distinct slides so that repeated mentions on the same slide
        # are not mistaken for a reference from another slide.
        ref_slides = sorted(set(info['references']))
        if len(ref_slides) == 1:  # only one slide: nothing else refers to it
            errors.append(f"Error: {diag_table} is referenced in only one slide {ref_slides[0]}.")
    return errors

# main
pptx_file = 'sample'  # replace 'sample' with the folder the PowerPoint file was extracted to
info_dict = extract_figure_and_table_info(pptx_file)
errors = check_references(info_dict)

if info_dict:
    # Build a DataFrame with one row per figure or table number
    data = []
    for diag_table, info in info_dict.items():
        slides = ','.join(map(str, sorted(set(info['slides']))))
        references = ','.join(map(str, sorted(set(info['references']))))
        data.append([diag_table, info['title'], slides, references])

    df = pd.DataFrame(data, columns=["Diag/Table", "Title", "Slide", "References"])

    # ④ Save to a CSV file
    output_csv_file = pptx_file + '.csv'
    df.to_csv(output_csv_file, index=False, encoding='utf-8-sig')
    # Message: "The data was saved to CSV file <name>."
    print(f"データがCSVファイル {output_csv_file} に保存されました。")

# ④ Print the errors
if errors:
    for error in errors:
        print(error)
else:
    # Message: "All references are correct."
    print("すべての参照が正しいです。")

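Before running the code, complete the preparation described in the previous article (change the target PowerPoint file to a .zip file, then extract it).
When you run the code, replace "sample" with the folder into which you extracted your own PowerPoint file.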
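The exact steps from the previous article are not reproduced here. Since a .pptx file is a ZIP archive, one possible way to do the same preparation from Python is sketched below, using only the standard library. The function name `extract_pptx` and the file names in the usage comment are placeholders, not part of the original script.

```python
import zipfile
from pathlib import Path


def extract_pptx(pptx_path, out_dir):
    """Extract a .pptx file (a ZIP archive) into out_dir and return it as a Path."""
    out_dir = Path(out_dir)
    # zipfile inspects the file contents, so renaming to .zip is not required.
    with zipfile.ZipFile(pptx_path) as archive:
        archive.extractall(out_dir)
    return out_dir


# Usage (placeholder names): extract sample.pptx into a folder named "sample".
# extract_pptx('sample.pptx', 'sample')
```

The extracted folder contains the slide XML under `ppt/slides/`, and its path is what would be assigned to `pptx_file` in the script.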

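Example output

I prepared a sample PowerPoint file (sample.pptx) in which 図1 calls out several figure and table numbers, and ran the code above on it. The first output line reports that the CSV file was saved; each following line flags a number that appears on only one slide.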

データがCSVファイル sample6.csv に保存されました。
Error: 図1 is referenced in only one slide 1.
Error: 図7 is referenced in only one slide 1.
Error: 図1-1 is referenced in only one slide 2.
Error: 表2 is referenced in only one slide 5.
Error: 図11 is referenced in only one slide 7.
Error: 図12.1-1 is referenced in only one slide 9.
Error: 図1-1.2 is referenced in only one slide 10.
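The rule behind these messages can be sketched on a plain dictionary, without a PowerPoint file. This is a minimal sketch, assuming a number counts as unreferenced when it appears on exactly one distinct slide. The function name `find_single_slide_entries` is hypothetical, and the input uses a few of the sample values shown in this article.

```python
def find_single_slide_entries(slides_by_number):
    """Return (number, slide) pairs for numbers that appear on exactly one slide."""
    flagged = []
    for number, slides in slides_by_number.items():
        distinct = sorted(set(slides))  # ignore repeated mentions on one slide
        if len(distinct) == 1:
            flagged.append((number, distinct[0]))
    return flagged


# A few entries from the sample results: number -> slides where it appears.
sample = {'図1': [1], '図2': [1, 3], '表1': [2, 6], '表2': [5]}
for number, slide in find_single_slide_entries(sample):
    print(f"{number} appears only on slide {slide}")
```

Comparing distinct slides rather than raw occurrences keeps a number that is mentioned twice on the same slide from passing the check.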

The CSV file contains the automatically extracted figure and table numbers (Diag/Table) and the slides on which they are referenced (References).

sample.csv

Diag/Table	Title	Slide	References
図1	はじめに	1	1
図2	概要	1,3	1,3
図3	結果	1,4	1,4
図7	呼び出し	1	1
図1.1-1	補足	1,8	1,8
図1-1	序論	2	2
表1	比較表	2,6	2,6
表2	作りかけ	5	5
図11	さいごに	7	7
図12.1-1	補足	9	9
図1-1.2	補足	10	10
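The script saves the CSV with `encoding='utf-8-sig'`, which writes a byte order mark (BOM) at the start of the file; spreadsheet applications such as Excel generally use it to recognize UTF-8, so the Japanese titles are less likely to appear garbled. The following is a small sketch of that effect using only the standard library rather than pandas. The helper name `rows_to_csv_bytes` is hypothetical.

```python
import csv
import io


def rows_to_csv_bytes(rows, header):
    """Serialize rows to CSV and encode them as UTF-8 with a leading BOM."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode('utf-8-sig')


header = ['Diag/Table', 'Title', 'Slide', 'References']
data = rows_to_csv_bytes([['図2', '概要', '1,3', '1,3']], header)
print(data[:3])                   # the three BOM bytes
print(data.decode('utf-8-sig'))   # decoding with the same codec strips the BOM
```

Fields that contain commas, such as the slide list `1,3`, are quoted in the file so that they stay in a single column.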

Conclusion

In this article, I tried out a way to check the consistency of figure and table numbers in a PowerPoint file with Python.

I hope this article is useful to someone.
