Scraping with ChatGPT (Part 4: Forms)

Case 1: A form with both column headers and row headers

image.png

Question

Could you normalize this table in this html? Show me all data as key-value table which has key and value column. All headers and values must be normalized. Nested header must be concatenated by '_'.

Answer

Here is an example of how the table could be normalized:

image.png

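Evaluation

This is the most complex table so far, yet ChatGPT got it right on the first try. Was that just luck? I still don't know how to get correct output consistently. Even the "Kami Excel"-style part (cells heavily merged to mimic a paper form) was normalized perfectly.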
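
To make the requested output shape concrete, here is a minimal sketch of the key-value normalization the prompt asks for. The original table is only available as an image, so the labels and values below are hypothetical, and so are the function name `to_key_value` and the choice to put row-header labels before column-header labels in each key.

```python
def to_key_value(row_headers, col_headers, values, sep="_"):
    """Flatten a two-way table into (key, value) rows.

    row_headers: one tuple of labels per data row (outermost label first)
    col_headers: one tuple of labels per data column (outermost label first)
    values:      values[r][c] is the cell at data row r, data column c
    """
    rows = []
    for r, row_path in enumerate(row_headers):
        for c, col_path in enumerate(col_headers):
            # Nested header labels are concatenated with '_' as the prompt requests.
            rows.append((sep.join(row_path + col_path), values[r][c]))
    return rows


# Hypothetical data standing in for the table in the screenshot.
row_headers = [("Applicant", "Name"), ("Applicant", "Address")]
col_headers = [("Current",), ("Previous",)]
values = [["Sato", "Suzuki"], ["Tokyo", "Osaka"]]

for key, value in to_key_value(row_headers, col_headers, values):
    print(f"{key}\t{value}")
```

Each data cell becomes one row whose key is the full header path, which is the form a downstream program can consume without knowing the table layout.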

Case 2: Generating a scraping program

Question

Could you show me the program to reproduce this result?

Answer

```python
from bs4 import BeautifulSoup

# Replace `html_string` with the actual HTML string of the table
soup = BeautifulSoup(html_string, 'html.parser')
table = soup.find('table')

# Create an empty list to store the normalized data
data = []

# Iterate over each row in the table
for row in table.find_all('tr'):
    # Create an empty dictionary for the row data
    row_data = {}
    # Get the columns in the row
    cols = row.find_all(['th', 'td'])
    # Iterate over the columns
    for i, col in enumerate(cols):
        # Get the column span
        col_span = int(col.get('colspan', 1))
        # Get the cell value
        value = col.text.strip()
        # Get the previous column header
        prev_header = headers[i - 1] if i > 0 else None
        # If the column has a row span, use the value of the previous column header as the key
        if col.get('rowspan'):
            key = f"{prev_header}_{value}"
        # If the column does not have a row span, use the value as the key
        else:
            key = value
        # If the column span is greater than 1, concatenate the values
        if col_span > 1:
            next_cols = cols[i + 1: i + col_span]
            value = ''.join([c.text.strip() for c in next_cols])
            # Update the headers to skip the concatenated columns
            headers = headers[:i] + [''] * (col_span - 1) + headers[i + col_span:]
        # Add the key-value pair to the row data
        row_data[key] = value
    # Add the row data to the list of normalized data
    data.append(row_data)

print(data)
```

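Evaluation

The generated program is questionable, and I doubt it runs as is: `headers` is read without ever being initialized, which would raise a `NameError` once a row's second cell is reached. The finer details don't matter for now.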
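
For comparison, here is a minimal sketch of one way such a normalization could be implemented. It is not ChatGPT's answer or a fix of it. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra installs, and it assumes a single non-nested table, `<th>` for header cells, and keys built from the row-header path followed by the column-header path. The sample HTML and the names `TableGridParser`, `expand_spans`, `header_path`, and `normalize` are hypothetical.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect table cells row by row, keeping text, spans, and th/td kind."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell dicts per <tr>
        self._cell = None  # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = {
                "text": "",
                "header": tag == "th",
                "rowspan": int(attrs.get("rowspan") or 1),
                "colspan": int(attrs.get("colspan") or 1),
            }
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data


def expand_spans(rows):
    """Map every (row, col) grid position to the cell that covers it."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in grid:  # position already taken by an earlier rowspan
                c += 1
            for dr in range(cell["rowspan"]):
                for dc in range(cell["colspan"]):
                    grid[(r + dr, c + dc)] = cell
            c += cell["colspan"]
    return grid


def header_path(cells):
    """Return non-empty header labels in order, counting a spanning cell once."""
    labels, seen = [], set()
    for cell in cells:
        if cell["header"] and id(cell) not in seen:
            seen.add(id(cell))
            text = cell["text"].strip()
            if text:
                labels.append(text)
    return labels


def normalize(html):
    """Return (key, value) pairs, one per data cell, with '_'-joined header keys."""
    parser = TableGridParser()
    parser.feed(html)
    grid = expand_spans(parser.rows)
    pairs, emitted = [], set()
    for r, c in sorted(grid):
        cell = grid[(r, c)]
        if cell["header"] or id(cell) in emitted:
            continue
        emitted.add(id(cell))
        left = [grid[(r, k)] for k in range(c) if (r, k) in grid]
        above = [grid[(k, c)] for k in range(r) if (k, c) in grid]
        key = "_".join(header_path(left) + header_path(above))
        pairs.append((key, cell["text"].strip()))
    return pairs


# Hypothetical sample table with a two-level column header and row headers.
SAMPLE_HTML = """
<table>
  <tr><th rowspan="2"></th><th colspan="2">Sales</th></tr>
  <tr><th>Q1</th><th>Q2</th></tr>
  <tr><th>Tokyo</th><td>10</td><td>20</td></tr>
  <tr><th>Osaka</th><td>30</td><td>40</td></tr>
</table>
"""

for key, value in normalize(SAMPLE_HTML):
    print(f"{key}\t{value}")
```

The main design choice is to expand `rowspan` and `colspan` into a full grid first, so that each data cell can find its headers simply by looking left along its row and up along its column.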
