ケース1 カラムヘッダーとローヘッダの両方が存在するフォーム
質問
Could you normalize this table in this html? Show me all data as key-value table which has key and value column. All headers and values must be normalized. Nested header must be concatenated by '_'.
回答
Here is an example of how the table could be normalized:
#### 評価
これまでで一番複雑なテーブルなのに一発で正解。これはたまたまなのかな?安定して正解を出力する方法がわからない。神Excelっぽい部分の正規化も完璧。
ケース2 スクレイピングを行うプログラムの生成
質問
Could you show me the program to reproduce this result?
回答
from bs4 import BeautifulSoup
# Replace `html_string` with the actual HTML string of the table
soup = BeautifulSoup(html_string, 'html.parser')
table = soup.find('table')
# Create an empty list to store the normalized data
data = []
# Iterate over each row in the table
for row in table.find_all('tr'):
# Create an empty dictionary for the row data
row_data = {}
# Get the columns in the row
cols = row.find_all(['th', 'td'])
# Iterate over the columns
for i, col in enumerate(cols):
# Get the column span
col_span = int(col.get('colspan', 1))
# Get the cell value
value = col.text.strip()
# Get the previous column header
prev_header = headers[i - 1] if i > 0 else None
# If the column has a row span, use the value of the previous column header as the key
if col.get('rowspan'):
key = f"{prev_header}_{value}"
# If the column does not have a row span, use the value as the key
else:
key = value
# If the column span is greater than 1, concatenate the values
if col_span > 1:
next_cols = cols[i + 1: i + col_span]
value = ''.join([c.text.strip() for c in next_cols])
# Update the headers to skip the concatenated columns
headers = headers[:i] + [''] * (col_span - 1) + headers[i + col_span:]
# Add the key-value pair to the row data
row_data[key] = value
# Add the row data to the list of normalized data
data.append(row_data)
print(data)
評価
出力されるプログラムは微妙。これ本当に動くのかな。今は細かい点はどうでも良い。