Converting the list of licensed construction businesses from PDF to CSV

Posted at 2022-11-20

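<Notice>
* This list is published temporarily because the search system is malfunctioning under heavy access.
* Commercial use, publication, or secondary distribution to unspecified or numerous recipients of this list of construction businesses, or of any processed or modified version of it, is not permitted.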

Download

# PDF of the licensed construction business list
wget https://www.mlit.go.jp/totikensangyo/const/content/001520358.pdf -O data.pdf

# tabula-java
wget https://github.com/tabulapdf/tabula-java/releases/download/v1.0.5/tabula-1.0.5-jar-with-dependencies.jar

Converting the PDF to CSV

# Set the maximum heap size to 12 GB
java -Xmx12G -jar tabula-1.0.5-jar-with-dependencies.jar -o data.csv -p all -l data.pdf
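
The same conversion can be driven from Python when it is one step of a larger script. The following is a minimal sketch, not something from the original workflow: the helper names `build_tabula_command` and `convert_pdf_to_csv` are hypothetical, and it assumes `java` is on the PATH and the jar sits in the working directory. The flags are the ones used in the command above.

```python
import subprocess

# Jar downloaded in the "Download" step (assumed to be in the working directory)
TABULA_JAR = "tabula-1.0.5-jar-with-dependencies.jar"


def build_tabula_command(pdf_path, csv_path, heap="12G", jar=TABULA_JAR):
    """Return the argument list for the tabula-java call shown above."""
    return [
        "java",
        f"-Xmx{heap}",   # maximum heap size
        "-jar", jar,
        "-o", csv_path,  # output file
        "-p", "all",     # process every page
        "-l",            # same -l flag as the original command
        pdf_path,
    ]


def convert_pdf_to_csv(pdf_path="data.pdf", csv_path="data.csv"):
    # check=True raises CalledProcessError if java exits with a non-zero status
    subprocess.run(build_tabula_command(pdf_path, csv_path), check=True)
```

Calling `convert_pdf_to_csv()` with no arguments would reproduce the command above. Passing the arguments as a list avoids shell quoting problems if the file names ever contain spaces.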

Data cleansing

import pandas as pd

df0 = pd.read_csv("data.csv", dtype=str, header=None)
df0

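# License number: drop rows where it is missing
df0.dropna(subset=[0], inplace=True)

# License number: keep only rows where it is purely numeric
df1 = df0[df0[0].str.isnumeric()].copy()

# Addresses (Kanagawa Prefecture in particular) contain many extra spaces,
# so collapse each run of spaces into a single space
for col in df1.select_dtypes(include=object).columns:
    df1[col] = df1[col].str.replace(" +", " ", regex=True)

# Save as CSV (UTF-8 with BOM)
df1.to_csv("result.csv", encoding="utf_8_sig", index=False, header=False)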
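
To see what each cleansing step does, it can help to run them on a tiny frame. The sketch below is illustrative only: the `cleanse` helper is a hypothetical name, and the rows are made up to imitate the kind of tabula output the steps above guard against (a header row, a row with no license number, and addresses padded with spaces). Because every column was read as a string, the sketch loops over all columns instead of filtering by dtype.

```python
import pandas as pd


def cleanse(df):
    """Apply the three cleansing steps above to a frame of string columns."""
    # 1. Drop rows with no license number in column 0
    df = df.dropna(subset=[0])
    # 2. Keep only rows whose license number is purely numeric
    df = df[df[0].str.isnumeric()].copy()
    # 3. Collapse each run of spaces into a single space
    for col in df.columns:
        df[col] = df[col].str.replace(" +", " ", regex=True)
    return df


# Hypothetical rows imitating the tabula output (not real license data)
raw = pd.DataFrame(
    [
        ["No.", "Name", "Address"],                  # header row
        [None, None, "stray fragment"],              # no license number
        ["000123", "Example Kensetsu", "神奈川県  横浜市   1-2-3"],
        ["000456", "Sample Komuten", "東京都 千代田区 4-5-6"],
    ]
)

cleaned = cleanse(raw)
print(cleaned.to_string(index=False, header=False))
```

Only the two rows with numeric license numbers survive, and the padded Kanagawa address is reduced to single spaces.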

Automatic PDF download

apt install html-xml-utils
apt install libxml2-utils

curl -sS https://www.mlit.go.jp/totikensangyo/const/1_6_bt_000089.html \
| hxnormalize -x \
| xmllint --html --xpath '//a[strong/span[contains(text(),"建設業許可業者一覧(PDF)")]]/@href' - \
| cut -d= -f2 \
| tr -d '"' \
| sed 's;^/;https://www.mlit.go.jp/;' \
| xargs -n 1 curl -o data.pdf
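
The link extraction in the pipeline above can also be sketched in Python with only the standard library. This is an alternative under stated assumptions, not the method used above: `PdfLinkParser` and `find_pdf_links` are hypothetical names, the sample markup is invented to match the shape the XPath expects, and the match is looser than the XPath (any `<a>` whose text contains the label, regardless of the `strong/span` nesting).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = "https://www.mlit.go.jp/totikensangyo/const/1_6_bt_000089.html"
LINK_TEXT = "建設業許可業者一覧(PDF)"  # label matched by the XPath above


class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> whose text contains LINK_TEXT."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if LINK_TEXT in "".join(self._text):
                # Resolve site-relative paths against the page URL
                self.links.append(urljoin(PAGE_URL, self._href))
            self._href = None


def find_pdf_links(html):
    parser = PdfLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup shaped like the XPath expects; the real page may differ
sample = (
    '<p><a href="/totikensangyo/const/content/001520358.pdf">'
    "<strong><span>建設業許可業者一覧(PDF)</span></strong></a></p>"
)
print(find_pdf_links(sample))
```

To use it against the live page, the HTML could be fetched with `urllib.request.urlopen(PAGE_URL)` and each returned link saved with `urllib.request.urlretrieve(link, "data.pdf")`. `urljoin` replaces the `sed` step, since it turns a path starting with `/` into an absolute URL on the same host.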