PythonによるテキストファイルデータをExcelファイルに上書きするスクリプト

Last updated at 2025-01-30Posted at 2025-01-30

概要

　Excelファイルとテキストファイルの組を与えたとき、テキストファイルを参照して、それに合うようにExcelファイルを修正する。
　Excelファイルは先頭のシートに上書きする。元の書式を保持する。
　テキストファイルの先頭列をキー列として、その項目が一致するExcelファイルの行を探して、それ以降のテキストファイルにある列を上書きする。
　Excelファイルとテキストファイルの列数は一致していない。テキストファイルの列の項目のうちでExcelファイルの列の項目にあるものだけをExcelファイルに上書きする。
　テキストファイルのキー列の項目がExcelファイルにない場合はExcelファイルの末尾に新たに行を追記する。
　Excelファイルにテキストファイルにない行がある場合はそのままとする。

スクリプト

#!/bin/bash

FILE1=$1
FILE2=$2

# Bashスクリプトの内容
python3 <<EOF
import pandas as pd
from openpyxl import load_workbook
import os

def update_excel_with_text(excel_file, text_file):
   try:
       # Excelファイルを読み込む
       excel_data = pd.read_excel(excel_file, dtype=str)

       # テキストファイルを読み込む
       text_data = pd.read_csv(text_file, dtype=str, header=0)

       # デバッグ: Excelデータとテキストデータの最初の数行を表示
       print("Excel Data (first 5 rows):")
       print(excel_data.head())
       print("\nText Data (first 5 rows):")
       print(text_data.head())

       print(excel_data.columns)
       print(text_data.columns)

       # カラム名の前処理
       excel_data.columns = excel_data.columns.str.strip()
       text_data.columns = text_data.columns.str.strip()

       # インデックスをリセット
       excel_data = excel_data.reset_index(drop=True)
       text_data = text_data.reset_index(drop=True)

       # 先頭のカラムをキーとして設定
       key_column = excel_data.columns[0]
       if key_column in text_data.columns:
           print(f"Using '{key_column}' as the key column.")
       else:
           raise ValueError(f"Key column '{key_column}' not found in text file.")

       # デバッグ: カラム名と一意性の確認
       print(f"Excel '{key_column}' unique values count:", excel_data[key_column].nunique())
       print(f"Excel '{key_column}' total count:", len(excel_data[key_column]))
       print(f"Text '{key_column}' unique values count:", text_data[key_column].nunique())
       print(f"Text '{key_column}' total count:", len(text_data[key_column]))

       # ファイル形式に基づいてkeep_vbaを設定
       _, file_extension = os.path.splitext(excel_file)
       keep_vba = file_extension.lower() == '.xlsm'

       # Excelファイルを読み込む
       wb = load_workbook(excel_file, keep_vba=keep_vba)
       ws = wb.active

       print("Merging data...")
       excel_key_set = set(excel_data[key_column])
       text_key_set = set(text_data[key_column])

       print(f"{key_column} in Excel but not in Text:", excel_key_set - text_key_set)
       print(f"{key_column} in Text but not in Excel:", text_key_set - excel_key_set)

       # key_columnを基準にテキストファイルのデータでExcelを更新
       for _, row in text_data.iterrows():
           key_column_value = row[key_column]

           # key_columnの行を検索
           matching_rows = excel_data[excel_data[key_column] == key_column_value]
           if not matching_rows.empty:
               # 既存行を更新
               row_index = matching_rows.index[0] + 2  # openpyxlは1-based index
               for col in text_data.columns:
                   if col in excel_data.columns:
                       col_index = excel_data.columns.get_loc(col) + 1
                       cell = ws.cell(row=row_index, column=col_index)
                       if cell.data_type != 'f':  # 数式セルはスキップ
                           cell.value = row[col]
           else:
               # 新しい行を追加
               new_row = [row.get(col, "") for col in excel_data.columns]
               ws.append(new_row)

       wb.save(excel_file)
       wb.close()  # ワークブックを閉じる

   except Exception as e:
       print(f"ファイルの読み込みまたはデータ操作中にエラーが発生しました: {e}")

if __name__ == "__main__":
   # Bashの変数をPythonに渡す
   excel_file = "${FILE1}"
   text_file = "${FILE2}"

   # 関数を呼び出して更新処理を実行
   update_excel_with_text(excel_file, text_file)

EOF

解説

ファイルデータの読み込み

       # Excelファイルを読み込む
       excel_data = pd.read_excel(excel_file, dtype=str)

       # テキストファイルを読み込む
       text_data = pd.read_csv(text_file, dtype=str, header=0)

　pandasでExcelファイルとテキストファイルを読み込んで、pandasのデータフレーム形式にする。dtype=strは全てのデータを文字列として読み込むこと。

カラム名の前処理

        # カラム名の前処理
        excel_data.columns = excel_data.columns.str.strip()
        text_data.columns = text_data.columns.str.strip()

カラム名から余分な空白を削除する。

インデックスのリセット

        # インデックスをリセット
        excel_data = excel_data.reset_index(drop=True)
        text_data = text_data.reset_index(drop=True)

　データフレーム（excel_dataとtext_data）のインデックスをリセットする。現在のインデックスを新しいデフォルトのインデックス（0, 1, 2, ...)に置き換える。drop=Trueの引数が指定されると、現在のインデックスを新しい列として追加せずに、完全に削除する。つまり、元のインデックスは保持されず、新しいインデックスのみが適用される。
　データのマージや結合の際に、特にキー列を基準に行う場合、インデックスが連続していれば、よりエラーが少なくなる。
　

キー列の設定

        # 先頭のカラムをキーとして設定
        key_column = excel_data.columns[0]

　Excelファイルの最初のカラムをキーとして使用する。

Excelファイルの読み込み (openpyxlライブラリ)

        # ファイル形式に基づいてkeep_vbaを設定
        _, file_extension = os.path.splitext(excel_file)
        keep_vba = file_extension.lower() == '.xlsm'

        # Excelファイルを読み込む
        wb = load_workbook(excel_file, keep_vba=keep_vba)
        ws = wb.active

　Excelファイルのワークブック全体を読み込む。
　keep_vbaはExcelファイルのVBA（Visual Basic for Applications）マクロを保持するためのオプション。

キーセットの取得

        excel_key_set = set(excel_data[key_column])
        text_key_set = set(text_data[key_column])

Excelデータとテキストデータからキー列を取得。どのキー値がExcelファイルにだけ存在するか、またはテキストファイルにだけ存在するかを確認。

key_columnが一致する行を検索

            # key_columnの行を検索
            matching_rows = excel_data[excel_data[key_column] == key_column_value]

　テキストデータのkey_columnの値と一致するExcelデータの行を判定。

既存行を更新

                # 既存行を更新
                row_index = matching_rows.index[0] + 2  # openpyxlは1-based index
                for col in text_data.columns:
                    if col in excel_data.columns:
                        col_index = excel_data.columns.get_loc(col) + 1
                        cell = ws.cell(row=row_index, column=col_index)
                        if cell.data_type != 'f':  # 数式セルはスキップ
                            cell.value = row[col]

・matching_rowsの最初の行のインデックスを取得し、openpyxlは1ベースのインデックスを使用するため、2を足してExcelの行番号に変換する。
・text_dataの各カラムをループし、そのカラムがexcel_dataにも存在するか確認する。
・if col in excel_data.columns:: カラム名が一致する場合、
col_index = excel_data.columns.get_loc(col) + 1: そのカラムのインデックスを取得し、1を加えてExcelの列番号に変換する。
cell = ws.cell(row=row_index, column=col_index): 対応するセルを取得する。
if cell.data_type != 'f': セルのデータ型が数式でない場合に、セルの値を新しい値で更新する。

新しい行を追加

                # 新しい行を追加
                new_row = [row.get(col, "") for col in excel_data.columns]
                ws.append(new_row)

　excel_dataのカラム順に従って、新しい行の値を作成する。row.get(col, "")は、テキストファイルの現在の行にそのカラムが存在しない場合、空文字列を返す。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up