[Continued] [PowerPoint] Checking figure and table number consistency in PowerPoint files with Python (automating the XML conversion)

Posted at 2024-07-21

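Introduction

After creating a deck in PowerPoint, you sometimes want to check that the figure numbers are consistent.
The previous articles showed how to use Python to extract the figure numbers from the slides of a PowerPoint file automatically, and how to check automatically whether each extracted figure number is referenced from other slides.

Until now, however, the target PowerPoint file had to be converted to XML format beforehand.
This time I added a feature that converts a PowerPoint file to XML format with Python, and this article introduces it.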

[PowerPoint] Checking figure and table number consistency in PowerPoint files with Python

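What you will learn from this article

How to extract figure numbers from a PowerPoint file automatically
How to check automatically that figure number references are consistent
How to convert a PowerPoint file to XML format with Python

Who this article is for

People who want to work with PowerPoint files in Python
People who want to make document review work more efficient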

Environment, tools, and language

OS version
- Windows 11 23H2
Tools
- Spyder 5.5.1
Language
- Python 3.12

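Converting a PowerPoint file to XML files manually

I covered this in an earlier article, but here again is how to convert a PowerPoint file into XML files.

[PowerPoint] Extracting figure numbers from a PowerPoint file automatically with Python

A PowerPoint file (.pptx) is a ZIP archive that bundles a set of XML files (.xml). Extracting the archive gives you each slide as an .xml file.
The steps are as follows.

Change the extension of the PowerPoint file to ".zip"
Extract (unzip) the renamed zip file

The extracted folder has the following structure.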
_rels
docProps
ppt
[Content_Types].xml

Inside the ppt folder there is a folder named "slides", which holds the data of each slide as an XML file.
slide1.xml
slide2.xml
...
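The same layout can also be inspected without renaming anything, because Python's zipfile module can open a .pptx directly. The following is a minimal sketch and not part of the original scripts; list_slide_parts and the demo archive are hypothetical names made up for illustration, and the demo file only imitates the folder layout of a real presentation.

```python
import os
import tempfile
import zipfile


def list_slide_parts(pptx_path):
    """Return the slide XML entries (ppt/slides/slideN.xml) inside a .pptx."""
    with zipfile.ZipFile(pptx_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith("ppt/slides/") and name.endswith(".xml")
        )


# Build a tiny stand-in archive with the same layout as a real .pptx.
# (Hypothetical demo data; a real file contains many more parts.)
demo_dir = tempfile.mkdtemp()
demo_pptx = os.path.join(demo_dir, "demo.pptx")
with zipfile.ZipFile(demo_pptx, "w") as archive:
    for name in ("[Content_Types].xml", "_rels/.rels", "docProps/app.xml",
                 "ppt/slides/slide1.xml", "ppt/slides/slide2.xml"):
        archive.writestr(name, "<xml/>")

print(list_slide_parts(demo_pptx))
# ['ppt/slides/slide1.xml', 'ppt/slides/slide2.xml']
```

Note that sorted() orders the names as strings, so with ten or more slides slide10.xml would come before slide2.xml.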

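Converting a PowerPoint file to XML files automatically with Python

Now let's automate this. The processing needed for the automation is as follows.
① Confirm that the input file is a PowerPoint file (.pptx)
② Get the file name
③ Copy the file and change the extension of the copy to .zip
④ Extract the zip file
⑤ Delete the zip file

Code example

ppt2xml.py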

import os
import zipfile
import shutil

def convert_and_extract_pptx(file_path):
    # Check if the file is a PowerPoint file
    if not file_path.lower().endswith('.pptx'):
        raise ValueError("The file must be a PowerPoint (.pptx) file.")

    # Split the path into its directory and its file name without the extension.
    # splitext also handles upper-case extensions such as ".PPTX".
    dir_name = os.path.dirname(file_path)
    stem, _ext = os.path.splitext(os.path.basename(file_path))

    # Build the path of the temporary .zip copy
    zip_file_path = os.path.join(dir_name, stem + '.zip')

    # Copy the .pptx file to a .zip file (the original is left untouched)
    shutil.copyfile(file_path, zip_file_path)

    # Create a directory to extract the contents
    extract_dir = os.path.join(dir_name, stem)
    os.makedirs(extract_dir, exist_ok=True)

    # Extract the .zip file
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)

    # Optionally, remove the .zip file after extraction
    os.remove(zip_file_path)

    return extract_dir

# main
pptx_file = 'sample.pptx'  # Specify the PowerPoint file name
extracted_dir = convert_and_extract_pptx(pptx_file)
print(f"Extracted contents are located in: {extracted_dir}")

Replace sample.pptx with the name of the file you want to convert to XML format.
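As a side note, the copy-and-rename step mirrors the manual procedure, but zipfile does not look at the file extension, so the .pptx can be extracted as it is. The sketch below shows that variant under that assumption; extract_pptx_directly is a hypothetical name and the demo archive is a stand-in, not a real presentation.

```python
import os
import tempfile
import zipfile


def extract_pptx_directly(file_path):
    """Extract a .pptx next to itself without making a .zip copy first."""
    if not file_path.lower().endswith(".pptx"):
        raise ValueError("The file must be a PowerPoint (.pptx) file.")
    # "folder/Demo.PPTX" -> "folder/Demo"
    extract_dir, _ext = os.path.splitext(file_path)
    with zipfile.ZipFile(file_path) as archive:
        archive.extractall(extract_dir)
    return extract_dir


# Stand-in archive for the demo (hypothetical data)
work_dir = tempfile.mkdtemp()
demo_pptx = os.path.join(work_dir, "Demo.PPTX")
with zipfile.ZipFile(demo_pptx, "w") as archive:
    archive.writestr("ppt/slides/slide1.xml", "<xml/>")

out_dir = extract_pptx_directly(demo_pptx)
print(os.path.isfile(os.path.join(out_dir, "ppt", "slides", "slide1.xml")))
```

Skipping the copy means there is no temporary .zip file to delete afterwards, and no risk of overwriting an unrelated .zip that happens to have the same name.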

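Converting multiple PowerPoint files in the same folder to XML files automatically

The code above converts a single PowerPoint file to XML format. Let's extend it so that it can convert multiple PowerPoint files.

To do this, add a step that loops over all the files in the specified folder.

Code example

ppt2xml_all.py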

import os
import zipfile
import shutil

def convert_and_extract_pptx(file_path):
    # Check if the file is a PowerPoint file
    if not file_path.lower().endswith('.pptx'):
        raise ValueError("The file must be a PowerPoint (.pptx) file.")

    # Split the path into its directory and its file name without the extension.
    # splitext also handles upper-case extensions such as ".PPTX".
    dir_name = os.path.dirname(file_path)
    stem, _ext = os.path.splitext(os.path.basename(file_path))

    # Build the path of the temporary .zip copy
    zip_file_path = os.path.join(dir_name, stem + '.zip')

    # Copy the .pptx file to a .zip file (the original is left untouched)
    shutil.copyfile(file_path, zip_file_path)

    # Create a directory to extract the contents
    extract_dir = os.path.join(dir_name, stem)
    os.makedirs(extract_dir, exist_ok=True)

    # Extract the .zip file
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)

    # Optionally, remove the .zip file after extraction
    os.remove(zip_file_path)

    return extract_dir

def process_all_pptx_in_folder(folder_path):
    # List all files in the directory
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        
        # Check if the file is a PowerPoint file
        if file_path.lower().endswith('.pptx'):
            try:
                extracted_dir = convert_and_extract_pptx(file_path)
                print(f"Extracted contents of {file_name} are located in: {extracted_dir}")
            except Exception as e:
                print(f"Failed to process {file_name}: {e}")

# main
folder_path = 'sample_folder'  # Specify the folder
process_all_pptx_in_folder(folder_path)

sample_folder is the name of the folder that contains the PowerPoint files.
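The folder scan can also be written with pathlib. The sketch below is one possible variant, not the article's code: find_pptx_files is a hypothetical helper, and it additionally skips the "~$" owner files that PowerPoint typically leaves next to a presentation while it is open, since those are not valid archives and would only end up in the "Failed to process" branch.

```python
import tempfile
from pathlib import Path


def find_pptx_files(folder_path):
    """Return the .pptx files directly inside folder_path, sorted by name."""
    return sorted(
        (
            path for path in Path(folder_path).iterdir()
            if path.is_file()
            and path.suffix.lower() == ".pptx"
            and not path.name.startswith("~$")  # skip Office lock files
        ),
        key=lambda path: path.name.lower(),
    )


# Empty placeholder files are enough to demonstrate the filtering
demo_folder = Path(tempfile.mkdtemp())
for name in ("data1.pptx", "DATA2.PPTX", "notes.txt", "~$data1.pptx"):
    (demo_folder / name).touch()

print([path.name for path in find_pptx_files(demo_folder)])
# ['data1.pptx', 'DATA2.PPTX']
```

Sorting by the lower-cased name keeps the processing order the same on Windows and on other platforms.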

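Checking figure and table number consistency automatically with Python

Let's add the automatic PowerPoint-to-XML conversion to the figure and table number consistency check code from the previous article.

Code example

diag_check_auto.py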

from pptx import Presentation
import re
import pandas as pd
from collections import defaultdict
import os
import zipfile
import shutil

def extract_figure_and_table_info(pptx_file):
    prs = Presentation(pptx_file)
    info_dict = defaultdict(lambda: {'title': '', 'slides': [], 'references': []})

    # Define the patterns for figure numbers (図) and table numbers (表)
    figure_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)\s*(.*)')
    table_pattern = re.compile(r'(表\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)\s*(.*)')
    reference_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?|表\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)')



    for slide_index, slide in enumerate(prs.slides, start=1):
        for shape in slide.shapes:
            if shape.has_text_frame:
                text = shape.text_frame.text

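                # Extract figure numbers and figure titles
                figure_matches = figure_pattern.findall(text)
                for match in figure_matches:
                    figure_number, figure_title = match[0], match[1].strip()
                    # Text ending in Japanese punctuation is a sentence, not a caption,
                    # so record it as "呼び出し" (an in-text reference)
                    if figure_title and figure_title[-1] in '。．，、':
                        figure_title = "呼び出し"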
                    info_dict[figure_number]['title'] = figure_title
                    info_dict[figure_number]['slides'].append(slide_index)

                # Extract table numbers and table titles
                table_matches = table_pattern.findall(text)
                for match in table_matches:
                    table_number, table_title = match[0], match[1].strip()
                    info_dict[table_number]['title'] = table_title
                    info_dict[table_number]['slides'].append(slide_index)

                # Extract the figure and table numbers referenced in the text
                reference_matches = reference_pattern.findall(text)
                for ref in reference_matches:
                    info_dict[ref]['references'].append(slide_index)

    return info_dict

def check_references(info_dict):
    errors = []
    for diag_table, info in info_dict.items():
        if '図' in diag_table or '表' in diag_table:
            if len(info['references']) == 1:
                errors.append(f"Error: {diag_table} is referenced in only one slide {info['references'][0]}.")
    return errors

def convert_and_extract_pptx(file_path):
    # Check if the file is a PowerPoint file
    if not file_path.lower().endswith('.pptx'):
        raise ValueError("The file must be a PowerPoint (.pptx) file.")

    # Split the path into its directory and its file name without the extension.
    # splitext also handles upper-case extensions such as ".PPTX".
    dir_name = os.path.dirname(file_path)
    stem, _ext = os.path.splitext(os.path.basename(file_path))

    # Build the path of the temporary .zip copy
    zip_file_path = os.path.join(dir_name, stem + '.zip')

    # Copy the .pptx file to a .zip file (the original is left untouched)
    shutil.copyfile(file_path, zip_file_path)

    # Create a directory to extract the contents
    extract_dir = os.path.join(dir_name, stem)
    os.makedirs(extract_dir, exist_ok=True)

    # Extract the .zip file
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)

    # Optionally, remove the .zip file after extraction
    os.remove(zip_file_path)

    return extract_dir

# main
pptx_file = 'sample.pptx'  # Target PowerPoint file
extracted_dir = convert_and_extract_pptx(pptx_file)
print(f"Extracted contents are located in: {extracted_dir}")

info_dict = extract_figure_and_table_info(pptx_file)
errors = check_references(info_dict)

if info_dict:
    # Collect the results into a DataFrame
    data = []
    for diag_table, info in info_dict.items():
        slides = ','.join(map(str, sorted(set(info['slides']))))
        references = ','.join(map(str, sorted(set(info['references']))))
        data.append([diag_table, info['title'], slides, references])

    df = pd.DataFrame(data, columns=["Diag/Table", "Title", "Slide", "References"])
    
    # Save as a comma-separated CSV file (pandas' default separator)
    output_csv_file = pptx_file + '.csv'
    df.to_csv(output_csv_file, index=False, encoding='utf-8-sig')
    print(f"データがCSVファイル {output_csv_file} に保存されました。")  # "The data was saved to the CSV file ..."

if errors:
    for error in errors:
        print(error)
else:
    print("すべての参照が正しいです。")  # "All references are correct."

Example output

I prepared a sample PowerPoint file (sample.pptx) that references several figure and table numbers, as in Figure 1, and ran the code above.
Please prepare your own PowerPoint file to try it out.

Extracted contents are located in: sample
データがCSVファイル sample.pptx.csv に保存されました。
Error: 図1 is referenced in only one slide 1.
Error: 図7 is referenced in only one slide 1.
Error: 図1-1 is referenced in only one slide 2.
Error: 表2 is referenced in only one slide 5.
Error: 図11 is referenced in only one slide 7.
Error: 図12.1-1 is referenced in only one slide 9.
Error: 図1-1.2 is referenced in only one slide 10.

The CSV file contains the automatically extracted figure and table numbers (Diag/Table) and the slides that reference them (References).

sample.pptx.csv

Diag/Table	Title	Slide	References
図1	はじめに	1	1
図2	概要	1,3	1,3
図3	結果	1,4	1,4
図7	呼び出し	1	1
図1.1-1	補足	1,8	1,8
図1-1	序論	2	2
表1	比較表	2,6	2,6
表2	作りかけ	5	5
図11	さいごに	7	7
図12.1-1	補足	9	9
図1-1.2	補足	10	10
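The Title value "呼び出し" (in-text reference) in the 図7 row comes from the punctuation check in the extraction loop. The sketch below reuses the two regular expressions from diag_check_auto.py on two made-up strings to show how a caption and a sentence are told apart; classify and the sample strings are hypothetical and exist only for this illustration.

```python
import re

# Same patterns as in diag_check_auto.py
figure_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)\s*(.*)')
reference_pattern = re.compile(
    r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?|表\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)'
)


def classify(text):
    """Return (figure number, title) pairs the way the extraction loop does."""
    results = []
    for number, rest in figure_pattern.findall(text):
        title = rest.strip()
        # A trailing Japanese punctuation mark means the text is a sentence
        if title and title[-1] in '。．，、':
            title = "呼び出し"
        results.append((number, title))
    return results


caption = "図1.1-1 補足"          # looks like a caption
sentence = "図7と図2を参照。"      # looks like a sentence in the body text

print(classify(caption))                     # [('図1.1-1', '補足')]
print(classify(sentence))                    # [('図7', '呼び出し')]
print(reference_pattern.findall(sentence))   # ['図7', '図2']
```

Because figure_pattern captures the rest of the line as the title, only the first figure number on a line is picked up by it. The second number in the sentence (図2) is found only by reference_pattern.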

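Checking figure and table number consistency across multiple PowerPoint files automatically with Python

Now that multiple files can be converted to XML format, let's make the figure and table number consistency check handle multiple files as well.

Code example

diag_check_auto_all.py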

from pptx import Presentation
import re
import pandas as pd
from collections import defaultdict
import os
import zipfile
import shutil

def extract_figure_and_table_info_from_folder(folder_path):
    info_dict = defaultdict(lambda: {'title': '', 'slides': [], 'references': []})

    # Define the patterns for figure numbers (図) and table numbers (表)
    figure_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?)\s*(.*)')
    table_pattern = re.compile(r'(表\d+)\s*(.*)')
    reference_pattern = re.compile(r'(図\d+(?:[\.-]\d+)*(?:[\.-]\d+)?|表\d+)')

    for file_name in os.listdir(folder_path):
        if file_name.lower().endswith(".pptx"):
            pptx_file = os.path.join(folder_path, file_name)
            prs = Presentation(pptx_file)
            
            for slide_index, slide in enumerate(prs.slides, start=1):
                for shape in slide.shapes:
                    if shape.has_text_frame:
                        text = shape.text_frame.text

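                        # Extract figure numbers and figure titles
                        figure_matches = figure_pattern.findall(text)
                        for match in figure_matches:
                            figure_number, figure_title = match[0], match[1].strip()
                            # Text ending in Japanese punctuation is a sentence, not a caption,
                            # so record it as "呼び出し" (an in-text reference)
                            if figure_title and figure_title[-1] in '。．，、':
                                figure_title = "呼び出し"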
                            info_dict[figure_number]['title'] = figure_title
                            info_dict[figure_number]['slides'].append(f"{file_name}: Slide {slide_index}")

                        # Extract table numbers and table titles
                        table_matches = table_pattern.findall(text)
                        for match in table_matches:
                            table_number, table_title = match[0], match[1].strip()
                            info_dict[table_number]['title'] = table_title
                            info_dict[table_number]['slides'].append(f"{file_name}: Slide {slide_index}")

                        # Extract the figure and table numbers referenced in the text
                        reference_matches = reference_pattern.findall(text)
                        for ref in reference_matches:
                            info_dict[ref]['references'].append(f"{file_name}: Slide {slide_index}")

    return info_dict


def check_references(info_dict):
    errors = []
    for diag_table, info in info_dict.items():
        if '図' in diag_table or '表' in diag_table:
            if len(info['references']) == 1:
                errors.append(f"Error: {diag_table} is referenced in only one slide {info['references'][0]}.")
    return errors

def convert_and_extract_pptx(file_path):
    # Check if the file is a PowerPoint file
    if not file_path.lower().endswith('.pptx'):
        raise ValueError("The file must be a PowerPoint (.pptx) file.")

    # Split the path into its directory and its file name without the extension.
    # splitext also handles upper-case extensions such as ".PPTX".
    dir_name = os.path.dirname(file_path)
    stem, _ext = os.path.splitext(os.path.basename(file_path))

    # Build the path of the temporary .zip copy
    zip_file_path = os.path.join(dir_name, stem + '.zip')

    # Copy the .pptx file to a .zip file (the original is left untouched)
    shutil.copyfile(file_path, zip_file_path)

    # Create a directory to extract the contents
    extract_dir = os.path.join(dir_name, stem)
    os.makedirs(extract_dir, exist_ok=True)

    # Extract the .zip file
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)

    # Optionally, remove the .zip file after extraction
    os.remove(zip_file_path)

    return extract_dir

def process_all_pptx_in_folder(folder_path):
    # List all files in the directory
    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)
        
        # Check if the file is a PowerPoint file
        if file_path.lower().endswith('.pptx'):
            try:
                extracted_dir = convert_and_extract_pptx(file_path)
                print(f"Extracted contents of {file_name} are located in: {extracted_dir}")
            except Exception as e:
                print(f"Failed to process {file_name}: {e}")


# main
folder_path = 'target_folder'  # Specify the folder
process_all_pptx_in_folder(folder_path)

info_dict = extract_figure_and_table_info_from_folder(folder_path)
errors = check_references(info_dict)


if info_dict:
    # Collect the results into a DataFrame
    data = []
    for diag_table, info in info_dict.items():
        slides = ','.join(map(str, sorted(set(info['slides']))))
        references = ','.join(map(str, sorted(set(info['references']))))
        data.append([diag_table, info['title'], slides, references])

    df = pd.DataFrame(data, columns=["Diag/Table", "Title", "Slide", "References"])
    
    # Save as a comma-separated CSV file (pandas' default separator)
    output_csv_file = folder_path + '.csv'
    df.to_csv(output_csv_file, index=False, encoding='utf-8-sig')
    print(f"データがCSVファイル {output_csv_file} に保存されました。")  # "The data was saved to the CSV file ..."

if errors:
    for error in errors:
        print(error)
else:
    print("すべての参照が正しいです。")  # "All references are correct."

Example output

I prepared a folder containing multiple PowerPoint files, as shown below, and ran the code above.
Please prepare your own PowerPoint files to try it out.

target_folder
├data1.pptx
├data2.pptx
└data3.pptx

Extracted contents of data1.pptx are located in: target_folder\data1
Extracted contents of data2.pptx are located in: target_folder\data2
Extracted contents of data3.pptx are located in: target_folder\data3
データがCSVファイル target_folder.csv に保存されました。
Error: 図1 is referenced in only one slide data1.pptx: Slide 1.
Error: 図7 is referenced in only one slide data1.pptx: Slide 1.
Error: 図1-1 is referenced in only one slide data1.pptx: Slide 2.
Error: 表2 is referenced in only one slide data2.pptx: Slide 2.
Error: 図11 is referenced in only one slide data3.pptx: Slide 1.
Error: 図12.1-1 is referenced in only one slide data3.pptx: Slide 3.
Error: 図1-1.2 is referenced in only one slide data3.pptx: Slide 4.

The CSV file contains the file names together with the automatically extracted figure and table numbers (Diag/Table) and the slides that reference them (References).

target_folder.csv

Diag/Table	Title	Slide	References
図1	はじめに	data1.pptx: Slide 1	data1.pptx: Slide 1
図2	概要	data1.pptx: Slide 1,data1.pptx: Slide 3	data1.pptx: Slide 1,data1.pptx: Slide 3
図3	結果	data1.pptx: Slide 1,data2.pptx: Slide 1	data1.pptx: Slide 1,data2.pptx: Slide 1
図7	呼び出し	data1.pptx: Slide 1	data1.pptx: Slide 1
図1.1-1	補足	data1.pptx: Slide 1,data3.pptx: Slide 2	data1.pptx: Slide 1,data3.pptx: Slide 2
図1-1	序論	data1.pptx: Slide 2	data1.pptx: Slide 2
表1	比較表	data1.pptx: Slide 2,data2.pptx: Slide 3	data1.pptx: Slide 2,data2.pptx: Slide 3
表2	作りかけ	data2.pptx: Slide 2	data2.pptx: Slide 2
図11	さいごに	data3.pptx: Slide 1	data3.pptx: Slide 1
図12.1-1	補足	data3.pptx: Slide 3	data3.pptx: Slide 3
図1-1.2	補足	data3.pptx: Slide 4	data3.pptx: Slide 4

The file name is repeated many times, which is somewhat redundant, so there is room for improvement here.
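One possible way to reduce the repetition is to group the slide numbers by file before joining them. The following is only a sketch of that idea under the assumption that every entry keeps the "file: Slide N" form used above; group_by_file is a hypothetical helper and is not part of diag_check_auto_all.py.

```python
from collections import defaultdict


def group_by_file(entries):
    """Collapse 'file: Slide N' strings into 'file: N,M; other: K'."""
    grouped = defaultdict(set)
    for entry in entries:
        # "data1.pptx: Slide 3" -> ("data1.pptx", ": Slide ", "3")
        file_name, _, slide_part = entry.rpartition(": Slide ")
        grouped[file_name].add(int(slide_part))
    return "; ".join(
        f"{file_name}: {','.join(str(n) for n in sorted(slides))}"
        for file_name, slides in sorted(grouped.items())
    )


entries = [
    "data1.pptx: Slide 1",
    "data1.pptx: Slide 3",
    "data2.pptx: Slide 1",
    "data1.pptx: Slide 1",  # duplicates are merged
]
print(group_by_file(entries))
# data1.pptx: 1,3; data2.pptx: 1
```

Storing the slide numbers as integers also makes them sort numerically, whereas the strings "Slide 10" and "Slide 2" sort in the opposite order.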

Conclusion

In this article I tried converting multiple PowerPoint files to XML files with Python and checking the consistency of their figure and table numbers.

I hope this article is useful to someone.
