Using ChatGPT to scrape a table with a tree-structured header.
Case 1: An HTML table whose header forms a tree
Question:
I'm going to show you html table. Could you answer my question? Please use only information that can be read from this table in your answer.
name | gender | gender | gender | age | ID | ID | ID | ID |
---|---|---|---|---|---|---|---|---|
 | | | | | Type1 | Type1 | Type2 | Type2 |
 | male | female | other | | A | B | C | D |
John Doe | 1 | 0 | 0 | 30 | 1 | 2 | 0 | |
Jane Doe | 0 | 1 | 0 | 29 | 3 | 4 | 5 |
Answer:
This HTML table has nine columns and two rows. The first row contains the column labels "name", "gender", "age", and "ID". The second row divides the "gender" and "ID" columns into subcolumns. The remaining rows contain data for two individuals, including their names, genders, ages, and IDs.
The "gender" column is divided into three subcolumns: "male", "female", and "other". Each of these subcolumns represents a different gender, and the values in these cells are binary, with a value of 1 indicating that the individual is of the corresponding gender and a value of 0 indicating that they are not.
The "ID" column is divided into four subcolumns: "A", "B", "C", and "D". These subcolumns represent different types of IDs, and the values in these cells are non-binary. It is not clear from the table what these values represent.
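Assessment: It appears to read the table correctly. At the moment I cannot see how something that is only a language model manages to make sense of a table this complex.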
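To make concrete what "reading" this header involves, the sketch below expands the rowspan/colspan layout into one path per leaf column. It is only an illustration and says nothing about how ChatGPT does it internally. The header cells are typed in by hand from the sample table, and the helper name `expandHeader` is hypothetical.

```javascript
// Header rows of the sample table, written out by hand.
// rowspan and colspan default to 1 when omitted.
const headerRows = [
  [
    { text: 'name', rowspan: 3 },
    { text: 'gender', rowspan: 2, colspan: 3 },
    { text: 'age', rowspan: 3 },
    { text: 'ID', colspan: 4 },
  ],
  [
    { text: 'Type1', colspan: 2 },
    { text: 'Type2', colspan: 2 },
  ],
  ['male', 'female', 'other', 'A', 'B', 'C', 'D'].map((text) => ({ text })),
];

// Hypothetical helper: place every cell on a grid, then read each column top to bottom.
function expandHeader(rows) {
  const grid = rows.map(() => []);
  rows.forEach((cells, r) => {
    let c = 0;
    for (const { text, rowspan = 1, colspan = 1 } of cells) {
      // Skip slots already occupied by a rowspan from an earlier row.
      while (grid[r][c] !== undefined) c += 1;
      for (let dr = 0; dr < rowspan; dr += 1) {
        for (let dc = 0; dc < colspan; dc += 1) {
          grid[r + dr][c + dc] = text;
        }
      }
      c += colspan;
    }
  });

  const paths = [];
  for (let c = 0; c < grid[0].length; c += 1) {
    const path = [];
    for (let r = 0; r < grid.length; r += 1) {
      const label = grid[r][c];
      // A cell spanning several rows repeats down the column; keep it once.
      // (Simplification: a child with the same label as its parent would be merged too.)
      if (path[path.length - 1] !== label) path.push(label);
    }
    paths.push(path.join('.'));
  }
  return paths;
}

const columnPaths = expandHeader(headerRows);
console.log(columnPaths);
```

Each of the nine data columns ends up with a name such as `gender.male` or `ID.Type2.C`, which is the information the later normalization cases rely on.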
Case 2: Normalizing the table (single table, tree columns expanded)
Question:
Could you normalize this table as tidy data? Please show me the result as single table which has all columns in html table.
Answer:
To normalize the table as tidy data and include all of the columns from the original table, we can create new columns for each subcategory of the gender and identification columns. Here is what the resulting table would look like:
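Assessment: Correct. However, as the later cases show, the normalized result varies from one query to the next, and some columns are occasionally dropped from the output without being asked. Stressing that the result must "include all columns in the HTML" seems to make the output reasonably stable. My understanding is that this kind of small-scale tinkering is what gets lumped together as the somewhat mysterious craft called "prompt engineering".
In practice the displayed values are slightly misaligned, but the model appears to hold the correct data internally, so this is not a problem.
At this stage, minor glitches do not matter.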
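Since columns sometimes go missing from the answer, a small check against the expected leaf columns can catch that. The following is a minimal sketch under assumptions: the column paths are written by hand, the helper names `toWideRecord` and `missingColumns` are hypothetical, and the `truncated` object is made-up data that stands in for an answer with dropped columns.

```javascript
// Leaf columns of the original table, in order (paths written by hand).
const expectedColumns = [
  'name', 'gender.male', 'gender.female', 'gender.other', 'age',
  'ID.Type1.A', 'ID.Type1.B', 'ID.Type2.C', 'ID.Type2.D',
];

// Hypothetical helper: turn one row of cell texts into a wide record keyed by column path.
function toWideRecord(cells, columns) {
  return Object.fromEntries(columns.map((column, i) => [column, cells[i] ?? '']));
}

// Hypothetical helper: list the expected columns that a record lacks.
function missingColumns(record, expected) {
  return expected.filter((column) => !(column in record));
}

// John Doe's row as it appears in the table; the last cell is empty.
const john = toWideRecord(
  ['John Doe', '1', '0', '0', '30', '1', '2', '0', ''],
  expectedColumns,
);

// Made-up example of an answer that silently dropped two columns.
const truncated = {
  name: 'John Doe',
  'gender.male': '1',
  'gender.female': '0',
  age: '30',
  'ID.Type1.A': '1',
  'ID.Type1.B': '2',
  'ID.Type2.C': '0',
};

console.log(missingColumns(john, expectedColumns));
console.log(missingColumns(truncated, expectedColumns));
```

The first call reports nothing missing, while the second names the two dropped columns, so a drifting answer can be detected mechanically rather than by eye.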
Case 3: Normalizing the table (single table, key-value)
Question:
Could you normalize this table as tidy data? Please show me the result as single table which has all column header in html as key and all data cell in html as value.
Answer:
To normalize the table as tidy data and include all of the column headers from the original table as keys and all of the data cells as values, we can create a new "Key" column to indicate the type of data contained in each row and a new "Value" column to contain the actual data. Here is what the resulting table would look like:
Assessment: It works if you explicitly insist on key-value output.
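For reference, the key-value shape requested here can be sketched as one `{ name, key, value }` row per data cell. This is an assumed layout, not the table ChatGPT returned: using `name` as the identifying column and the dotted column paths as keys are both my choices, and `toKeyValue` is a hypothetical helper.

```javascript
// Leaf columns of the original table, in order (paths written by hand).
const columns = [
  'name', 'gender.male', 'gender.female', 'gender.other', 'age',
  'ID.Type1.A', 'ID.Type1.B', 'ID.Type2.C', 'ID.Type2.D',
];

// Hypothetical helper: one output row per non-identifier cell.
function toKeyValue(cells, cols) {
  const [name, ...rest] = cells;
  // rest[i] belongs to cols[i + 1] because the first column is the identifier.
  return rest.map((value, i) => ({ name, key: cols[i + 1], value }));
}

// John Doe's row as it appears in the table; the last cell is empty.
const longRows = toKeyValue(
  ['John Doe', '1', '0', '0', '30', '1', '2', '0', ''],
  columns,
);

console.log(longRows);
```

One wide row with eight value cells becomes eight key-value rows, and the empty `ID.Type2.D` cell is kept as an empty string rather than dropped.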
Case 4: Normalizing the table (multiple tables)
Question:
Could you normalize this table as tidy data? Please show me the result as separated tables which contains all data in html table.
Answer:
To normalize the table as tidy data and create separate tables for a SQL database that include all of the data from the original table, we can create new tables for each subcategory of the gender and identification columns.
First, we can create a "people" table to store information about individuals:
Next, we can create a "identification" table to store information about the different types of identification and the corresponding values:
Finally, we can create a "gender" table to store information about the gender of each individual:
These tables could then be used to store the data in a SQL database.
Assessment: The normalization is slightly loose.
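As a point of comparison, a somewhat stricter split into three tables could look like the sketch below. The schema is an assumption of mine, not the tables ChatGPT produced: the `personId` key, the field names, and the choice to skip empty ID cells are all hypothetical, as is the helper `splitTables`.

```javascript
const genderLabels = ['male', 'female', 'other'];

// The four ID leaf columns with their parent type, in table order.
const idColumns = [
  { type: 'Type1', id: 'A' },
  { type: 'Type1', id: 'B' },
  { type: 'Type2', id: 'C' },
  { type: 'Type2', id: 'D' },
];

// Hypothetical helper: split wide rows into people / gender / identification tables.
function splitTables(rows) {
  const people = [];
  const gender = [];
  const identification = [];
  rows.forEach((cells, i) => {
    const personId = i + 1; // surrogate key, assumed for illustration
    const [name, male, female, other, age, ...ids] = cells;
    people.push({ personId, name, age: Number(age) });
    // The one-hot gender flags collapse into a single label per person.
    [male, female, other].forEach((flag, g) => {
      if (flag === '1') gender.push({ personId, gender: genderLabels[g] });
    });
    // Empty ID cells are skipped instead of being stored as blanks.
    ids.forEach((value, k) => {
      if (value !== '') {
        identification.push({ personId, ...idColumns[k], value: Number(value) });
      }
    });
  });
  return { people, gender, identification };
}

// John Doe's row as it appears in the table; the last cell is empty.
const tables = splitTables([['John Doe', '1', '0', '0', '30', '1', '2', '0', '']]);
console.log(JSON.stringify(tables, null, 2));
```

Whether this is the "right" degree of normalization depends on how the data will be queried; the sketch only shows one way to remove the one-hot columns and the empty cells.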
Case 5: Generating a scraping program
Question:
Could you normalize this table as tidy data? Please show me the result as single table which has all columns in html table. and show me the scraping program to reproduce this result in JavaScript.
Answer:
To normalize the table as tidy data and create a single table with all of the columns from the original table using JavaScript, we can use a library such as cheerio to parse the HTML and create a new table with the desired structure.
Here is an example of how this could be done
const cheerio = require('cheerio');
const html = `
<table>
  <tr>
    <th rowspan=3>name</th>
    <th rowspan=2 colspan=3>gender</th>
    <th rowspan=3>age</th>
    <th colspan=4>ID</th>
  </tr>
  <tr>
    <th colspan=2>Type1</th>
    <th colspan=2>Type2</th>
  </tr>
  <tr>
    <th>male</th>
    <th>female</th>
    <th>other</th>
    <th>A</th>
    <th>B</th>
    <th>C</th>
    <th>D</th>
  </tr>
  <tr>
    <td>John Doe</td>
    <td>1</td>
    <td>0</td>
    <td>0</td>
    <td>30</td>
    <td>1</td>
    <td>2</td>
    <td>0</td>
    <td></td>
  </tr>
  <tr>
    <td>Jane Doe</td>
    <td>0</td>
    <td>1</td>
    <td>0</td>
    <td>29</td>
    </td><td></td>
    <td>3</td>
    <td>4</td>
    <td>5</td>
  </tr>
</table>
`;
const $ = cheerio.load(html);
const rows = $('table tr');
const data = [];
rows.each((index, row) => {
  const cells = $(row).find('td');
  const values = [];
  cells.each((index, cell) => {
    values.push($(cell).text());
  });
  data.push(values);
});
const result = data.map((row) => {
  return {
    name: row[0],
    male: row[1],
    female: row[2],
    other: row[3],
    age: row[4],
    Type1_A: row[5],
    Type1_B: row[6],
    Type1_C: row[7],
    Type2_C: row[8],
    Type2_D: row[9],
  };
});
console.log(result);
Assessment: Correct. Rather than respectfully asking it to scrape the raw HTML as-is, it seems better to first have it normalize the table into a sensible shape.
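On a close read, the generated program has two rough spots worth noting. The three header rows contain only `th` cells, so the `td` loop collects an empty array for each of them, and the mapping lists five ID keys (including a `Type1_C` that the table does not have) for four ID columns. The sketch below is one possible adjustment, not ChatGPT's output. It assumes the cell texts have already been collected into `data` as in the program, with values copied from the HTML string there, and it leaves cheerio out so that the snippet runs on its own.

```javascript
// Cell texts per <tr>, as the td loop would collect them.
// The three header rows hold only <th> cells, so they come out empty.
const data = [
  [],
  [],
  [],
  ['John Doe', '1', '0', '0', '30', '1', '2', '0', ''],
  ['Jane Doe', '0', '1', '0', '29', '', '3', '4', '5'],
];

// One key per leaf column: nine in total, four of them under ID.
const keys = [
  'name', 'male', 'female', 'other', 'age',
  'Type1_A', 'Type1_B', 'Type2_C', 'Type2_D',
];

const result = data
  .filter((row) => row.length > 0) // drop the header rows
  .map((row) => Object.fromEntries(keys.map((key, i) => [key, row[i]])));

console.log(result);
```

Deriving the keys from the header instead of typing them by hand would remove the remaining manual step, at the cost of handling rowspan and colspan explicitly.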
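So what counts as a "sensible normalization"? I think it depends on the input data and on what the data will be used for.
And that may not be something that can be decided at this layer. The problem may not be framed well.
How should data be stored in a general-purpose way?
To be continued. Next I want to try scraping forms. Wouldn't plain key-value be enough for that?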