Scraping with ChatGPT (Part 4: Forms)

Posted at 2023-01-09

Case 1: A form with both column headers and row headers

Question

Could you normalize this table in this html? Show me all data as key-value table which has key and value column. All headers and values must be normalized. Nested header must be concatenated by '_'.

Answer

Here is an example of how the table could be normalized:

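Evaluation

This is the most complex table so far, yet ChatGPT got it right on the first try. Was that just luck? I still don't know how to get correct output consistently. Even the parts that resemble "Kami Excel" (a spreadsheet laid out like a paper form, with heavily merged cells) were normalized perfectly.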
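
To make the requested output shape concrete, here is a minimal sketch of the rule in the prompt (every header level joined with '_'), applied to a small hypothetical table. The labels and values are made up for illustration and are not taken from the form used above. Putting the row header before the column headers is also my assumption; the prompt does not fix the order.

```python
# Hypothetical table, already parsed into Python structures.
# Column headers are nested: a group label and a sub-label per column.
column_headers = [("Price", "Min"), ("Price", "Max"), ("Stock", "Count")]
rows = {
    "Apple": ["100", "150", "20"],
    "Orange": ["80", "120", "35"],
}


def normalize(column_headers, rows):
    """Return (key, value) pairs, joining every header level with '_'."""
    pairs = []
    for row_header, values in rows.items():
        for header_path, value in zip(column_headers, values):
            # Key = row header followed by the nested column headers.
            key = "_".join((row_header, *header_path))
            pairs.append((key, value))
    return pairs


pairs = normalize(column_headers, rows)
for key, value in pairs:
    print(f"{key}\t{value}")
```

Each cell becomes one row of the key-value table, so a 2 x 3 body yields six pairs, starting with `Apple_Price_Min` and `100`.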

Case 2: Generating a program that performs the scraping

Question

Could you show me the program to reproduce this result?

Answer

from bs4 import BeautifulSoup

# Replace `html_string` with the actual HTML string of the table
soup = BeautifulSoup(html_string, 'html.parser')
table = soup.find('table')

# Create an empty list to store the normalized data
data = []

# Iterate over each row in the table
for row in table.find_all('tr'):
    # Create an empty dictionary for the row data
    row_data = {}
    # Get the columns in the row
    cols = row.find_all(['th', 'td'])
    # Iterate over the columns
    for i, col in enumerate(cols):
        # Get the column span
        col_span = int(col.get('colspan', 1))
        # Get the cell value
        value = col.text.strip()
        # Get the previous column header
        prev_header = headers[i - 1] if i > 0 else None
        # If the column has a row span, use the value of the previous column header as the key
        if col.get('rowspan'):
            key = f"{prev_header}_{value}"
        # If the column does not have a row span, use the value as the key
        else:
            key = value
        # If the column span is greater than 1, concatenate the values
        if col_span > 1:
            next_cols = cols[i + 1: i + col_span]
            value = ''.join([c.text.strip() for c in next_cols])
            # Update the headers to skip the concatenated columns
            headers = headers[:i] + [''] * (col_span - 1) + headers[i + col_span:]
        # Add the key-value pair to the row data
        row_data[key] = value
    # Add the row data to the list of normalized data
    data.append(row_data)

print(data)

Evaluation

The generated program is questionable. Does it really run? For now, I'm not going to worry about the details.
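
One concrete problem with the generated program is that it reads `headers` without ever defining it, so it would likely stop with a `NameError` on the first row that has more than one cell. For comparison, below is a rough sketch of one way the normalization could be done. It is not what ChatGPT produced, and it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The sample HTML, the function names, and the `header_rows` / `header_cols` parameters (how many leading rows and columns are headers) are all assumptions for illustration; a real form would need those counts supplied or detected.

```python
from html.parser import HTMLParser

# Hypothetical sample table, not the form used in this article.
SAMPLE_HTML = """
<table>
<tr><th rowspan="2">Item</th><th colspan="2">Price</th></tr>
<tr><th>Min</th><th>Max</th></tr>
<tr><th>Apple</th><td>100</td><td>150</td></tr>
<tr><th>Orange</th><td>80</td><td>120</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect each <tr> as a list of (text, rowspan, colspan) cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def to_grid(rows):
    """Expand rowspan/colspan so every covered grid position holds the cell text."""
    grid = {}
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # skip positions filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    return grid


def dedupe(labels):
    """Drop empty labels and consecutive repeats left by expanded spans."""
    out = []
    for label in labels:
        if label and (not out or out[-1] != label):
            out.append(label)
    return out


def normalize(html, header_rows, header_cols):
    """Return (key, value) pairs; keys join row and column headers with '_'."""
    parser = TableParser()
    parser.feed(html)
    grid = to_grid(parser.rows)
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    pairs = []
    for r in range(header_rows, n_rows):
        for c in range(header_cols, n_cols):
            if (r, c) not in grid:
                continue
            row_path = [grid.get((r, k), "") for k in range(header_cols)]
            col_path = [grid.get((k, c), "") for k in range(header_rows)]
            key = "_".join(dedupe(row_path) + dedupe(col_path))
            pairs.append((key, grid[(r, c)]))
    return pairs


result = normalize(SAMPLE_HTML, header_rows=2, header_cols=1)
for key, value in result:
    print(f"{key}\t{value}")
```

The main design choice is to expand every spanned cell into a full grid first, so that each data cell can simply look left for its row headers and up for its column headers. This sketch only handles a single table with well-formed spans; nested tables and overlapping spans are out of scope.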
