
Scraping a table from a PowerPoint (pptx) file

Posted at 2020-11-17

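This scrapes the table in the pptx file listing the Open Data Evangelists (オープンデータ伝道師一覧), published as open data on the Government CIO Portal (政府CIOポータル), and saves it as a CSV.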

wget https://cio.go.jp/sites/default/files/uploads/documents/opendata-dendoushi_ichiran.pptx -O ichiran.pptx
pip install python-pptx
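# Python: read the table on each slide and combine everything into one CSV
import pptx
import pandas as pd

prs = pptx.Presentation("ichiran.pptx")

# One DataFrame per slide
dfs = []

for page in prs.slides:
    # Assumes the table is the second shape (index 1) on every slide
    data = [[cell.text for cell in row.cells] for row in page.shapes[1].table.rows]

    # The first row of each table holds the column headers
    dfs.append(pd.DataFrame(data[1:], columns=data[0]))

df = pd.concat(dfs).set_index("No.")

# Remove line breaks inside the affiliation column (literal match, no regex needed)
df["所属団体等"] = df["所属団体等"].str.replace("\n", "", regex=False)

# The name cell holds the furigana reading and the name on separate lines;
# split it into two columns and drop the original
df1 = df.join(
    df["氏名"].str.split("\n", expand=True).rename(columns={0: "ふりがな", 1: "名前"})
).drop("氏名", axis=1)

# Reorder the columns for output
df2 = df1.reindex(columns=["名前", "ふりがな", "主な活動エリア", "これまでの主な実績等", "所属団体等"])

# utf_8_sig writes a BOM so that Excel detects the encoding
df2.to_csv("ichiran.csv", encoding="utf_8_sig")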
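
The loop above reads `page.shapes[1]` on every slide, which only works while the table is always the second shape. As a minimal sketch of a more defensive variant, the helper below checks each shape's `has_table` property instead of relying on a fixed index. The names `iter_table_rows` and `tables_from_pptx` are hypothetical and not part of the original article, and whether this particular file needs the extra check is an assumption.

```python
from typing import Iterable, Iterator, List


def iter_table_rows(slides: Iterable) -> Iterator[List[List[str]]]:
    """Yield every table found on the slides as a list of rows of cell text.

    Hypothetical helper: it looks at each shape's has_table flag rather than
    assuming the table sits at a fixed position in slide.shapes.
    """
    for slide in slides:
        for shape in slide.shapes:
            # python-pptx reports has_table=True for a graphic frame that
            # contains a table; getattr is only a defensive fallback.
            if getattr(shape, "has_table", False):
                yield [[cell.text for cell in row.cells] for row in shape.table.rows]


def tables_from_pptx(path: str) -> List[List[List[str]]]:
    """Collect all tables from a pptx file (hypothetical convenience wrapper)."""
    # Imported lazily so iter_table_rows can be used without python-pptx installed.
    import pptx

    return list(iter_table_rows(pptx.Presentation(path).slides))
```

Each yielded table can be turned into a DataFrame the same way as in the original loop, for example `pd.DataFrame(data[1:], columns=data[0])` for each `data` in `tables_from_pptx("ichiran.pptx")`. Slides without a table are skipped instead of raising an error.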