Partially automating job postings: auto_html_job_posting

Posted at 2025-06-15

Purpose

Write the contents of downloaded HTML files to an Excel file and save it.

# A short explanation
https://www.youtube.com/watch?v=9NWq6ibRj-4&list=PLLagHHmIJDB_vrMZicySJ8-WAfLWW491x

Languages and libraries used

python
os - directory operations
pandas - DataFrames and writing to Excel
Beautiful Soup - web scraping (HTML parsing)
openpyxl - creating the Excel file

dammy.py

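Create the Excel file
Prepare the data
Get the files in the folder
Loop over the files one at a time
- Extract the data - look up each element by id and take only its text
If the record is not yet in the Excel file: append it to the lists in the dictionary
Convert the data to a DataFrame
Write to Excel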
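
The extraction step in the list above (look up an element by id and take only its text) can be sketched as a small helper. This is a minimal sketch, assuming Beautiful Soup is installed; `text_by_id` and the sample values are hypothetical and not part of the original script. Calling `get_text` directly on the result of `find` raises `AttributeError` when the div is missing, so the helper falls back to a default instead.

```python
from bs4 import BeautifulSoup


def text_by_id(soup, element_id, default=""):
    """Return the stripped text of the div with this id, or default if it is absent."""
    tag = soup.find("div", id=element_id)
    return tag.get_text(strip=True) if tag is not None else default


# Hypothetical two-field snippet; real files contain many more divs.
html = '<div id="ID_kjNo"> 18155-1114 </div><div id="ID_sksu">voicevox</div>'
soup = BeautifulSoup(html, "html.parser")

print(text_by_id(soup, "ID_kjNo"))  # id that exists in the snippet
print(text_by_id(soup, "ID_chgn"))  # id that is missing, so the default "" is returned
```

With a helper like this, a file that lacks one field still yields a row, with an empty cell for the missing value.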

import os
import time
from bs4 import BeautifulSoup
from collections import defaultdict
import pandas as pd
from openpyxl import Workbook, load_workbook

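# Start timing
start_time = time.time()

# Path to the Excel file
excel_file_path = "excel_open_posting.xlsx"
if not os.path.exists(excel_file_path):
    # Create a new workbook with a header row
    # Columns: id, link, company name, job type, wage, working days, annual days off,
    # overtime, Article 36 agreement, fixed overtime, experience
    wb = Workbook()
    ws = wb.active
    ws.title = "Sheeting1"
    ws.append(["id", "リンク", "社名", "職種", "賃金", "出勤日", "年間休日", "残業", "36協定", "固定残業", "経験"])
    wb.save(excel_file_path)
    print(f"Created {excel_file_path}")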

# Get the list of HTML files in the folder
folder_path = "html_open_posting"  # path to the folder
html_files = [f for f in os.listdir(folder_path) if f.endswith(".html")]  # keep only .html files

# Empty dictionary (one list per column)
data = {
    "id" : [],
    "リンク" : [],
    "社名" : [],
    "職種" : [],
    "賃金" : [],
    "出勤日" : [],
    "年間休日" : [],
    "残業" : [],
    "36協定" : [],
    "固定残業" : [],
    "経験" : [],

}

# Collect the div attributes of each HTML file (currently unused)
div_attributes = defaultdict(list)

# Read the ids already in the Excel file and initialize the hyperlink
df = pd.read_excel(excel_file_path)
df_id = df["id"].tolist()
hyperlink = None

for file_name in html_files:
    file_path = os.path.join(folder_path, file_name)
    hyperlink = f'=HYPERLINK("./{folder_path}/{file_name}", "{file_name}")'
    print(hyperlink)

    with open(file_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")

    # Extract the data
    file_id = soup.find("div", id="ID_kjNo").get_text(strip=True)  # find() returns the element, get_text() returns its text
    company = soup.find("div", id="ID_jgshMei")
    company_text = company.get_text(strip=True) if company else ""
    occupation = soup.find("div", id="ID_sksu").get_text(strip=True)
    money = soup.find("div", id="ID_chgn").get_text(strip=True)
    working_day = soup.find("div", id="ID_thkinRodoNissu").get_text(strip=True)
    overtime = soup.find("div", id="ID_jkgiRodoJn").get_text(strip=True)
    overtime_36 = soup.find("div", id="ID_sanrokuKyotei").get_text(strip=True)
    rest_day = soup.find("div", id="ID_nenkanKjsu").get_text(strip=True)
    fixed_overtime_pay = soup.find("div", id="ID_koteiZngyKbn").get_text(strip=True)
    experience = soup.find("div", id="ID_hynaKiknt").get_text(strip=True)
    print(file_id)
    print(company_text)
    print(occupation)
    print(money)
    print(working_day)
    print(overtime)
    print(overtime_36)
    print(rest_day)
    print(fixed_overtime_pay)
    print(experience)
    print(hyperlink)
    print("-" * 60)
    # Store the data
    if file_id in df_id:
        print("This record is already in the Excel file")
        
    else:
        data["id"].append(file_id)
        data["リンク"].append(hyperlink)
        data["社名"].append(company_text)
        data["職種"].append(occupation)
        data["賃金"].append(money)
        data["出勤日"].append(working_day)
        data["残業"].append(overtime)
        data["36協定"].append(overtime_36)
        data["年間休日"].append(rest_day)
        data["固定残業"].append(fixed_overtime_pay)
        data["経験"].append(experience)

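# Build the DataFrame
new_df = pd.DataFrame(data)
print(new_df)

# Append the new rows to the Excel file.
# Rows are appended with openpyxl rather than rereading the sheet with pandas and
# rewriting it: formulas saved without a cached value (such as the HYPERLINK cells)
# are read back as empty, so rewriting would wipe the links saved by earlier runs.
wb = load_workbook(excel_file_path)
ws = wb.active
header = [cell.value for cell in ws[1]]
for column in new_df.columns:
    if column not in header:  # add columns missing from the existing header row
        header.append(column)
        ws.cell(row=1, column=len(header), value=column)
for row in new_df.to_dict("records"):
    ws.append([row.get(column, "") for column in header])
wb.save(excel_file_path)
print(f"Appended the data to {excel_file_path}")

# Stop timing
end_time = time.time()
print(f"Elapsed time: {end_time - start_time:.6f} s")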
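
The script decides whether to store a record by checking its id against the ids already in the Excel file. A minimal sketch of the same idea as a standalone function, which also skips ids repeated within a single run (a case the loop does not cover). `filter_new_records` and the sample records are hypothetical names introduced for illustration.

```python
def filter_new_records(records, existing_ids):
    """Return the records whose id is neither in existing_ids nor repeated earlier in records."""
    seen = {str(existing_id) for existing_id in existing_ids}
    fresh = []
    for record in records:
        record_id = str(record["id"])
        if record_id in seen:
            continue  # already saved, or a duplicate within this run
        seen.add(record_id)
        fresh.append(record)
    return fresh


# Hypothetical sample data: one id is already in the Excel file, one is repeated.
existing_ids = ["18155-1114"]
records = [
    {"id": "18155-1114", "職種": "voicevox"},
    {"id": "18155-2000", "職種": "editor"},
    {"id": "18155-2000", "職種": "editor"},
]
new_records = filter_new_records(records, existing_ids)
print([record["id"] for record in new_records])
```

Comparing ids as strings avoids a mismatch when a purely numeric id is read back from Excel as a number. The returned list of dicts can be passed to `pd.DataFrame` directly.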

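html (reference HTML file)

Note: the ids in this reference file (file_id, company, ...) are simplified and differ from the ID_... ids that dammy.py looks up.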

<!DOCTYPE html>
<html lang="ja">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>求人サイト</title>
    <style>
        body {
            font-family: 'Arial', sans-serif;
            background-color: #f4f4f9;
            margin: 0;
            padding: 20px;
        }
        .job-container {
            width: 80%;
            max-width: 800px;
            margin: auto;
            display: flex;
            flex-direction: column;
            gap: 20px;
        }
        .job {
            background-color: #ffffff;
            border-radius: 10px;
            box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.1);
            padding: 20px;
            transition: transform 0.3s ease-in-out;
        }
        .job:hover {
            transform: scale(1.03);
        }
        .job div {
            margin: 10px 0;
        }
        .title {
            font-size: 22px;
            font-weight: bold;
            color: #007BFF;
        }
        .highlight {
            font-weight: bold;
            color: #333;
        }
    </style>
</head>
<body>

    <div class="job-container">
        <div class="job">
            <div id="file_id" class="highlight">ID: 18155-1114</div>
            <div id="company" class="title">社名: つむぎカンパニー</div>
            <div id="occupation"><span class="highlight">職種:</span> voicevox</div>
            <div id="money"><span class="highlight">賃金:</span> 200,000円~300,000円</div>
            <div id="working_day"><span class="highlight">月平均出勤日数:</span> 20日</div>
            <div id="overtime"><span class="highlight">残業の有無:</span> なし</div>
            <div id="overtime_36"><span class="highlight">36協定の有無:</span> なし</div>
            <div id="rest_day"><span class="highlight">年間休日:</span> 145日</div>
            <div id="fixed_overtime_pay"><span class="highlight">固定残業代の有無:</span> あり</div>
            <div id="experience"><span class="highlight">経験が必須かどうか:</span> なし</div>
        </div>
    </div>
</body>
</html>
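
In the reference file above, each div carries a label before its value ("ID: 18155-1114", "職種: voicevox"), so taking the text alone would store the label too. A minimal sketch of reading such a file and keeping only the part after the label, assuming Beautiful Soup is installed and the label ends with an ASCII colon as it does here. `value_after_label` is a hypothetical helper, and the snippet is reduced to three fields.

```python
from bs4 import BeautifulSoup

# Excerpt of the reference file, reduced to three fields.
SAMPLE_HTML = """
<div class="job">
    <div id="file_id" class="highlight">ID: 18155-1114</div>
    <div id="company" class="title">社名: つむぎカンパニー</div>
    <div id="occupation"><span class="highlight">職種:</span> voicevox</div>
</div>
"""


def value_after_label(soup, element_id):
    """Return the text after the first ASCII colon of the div with this id, or "" if absent."""
    tag = soup.find("div", id=element_id)
    if tag is None:
        return ""
    text = tag.get_text(" ", strip=True)
    _label, separator, value = text.partition(":")
    # If there is no colon, keep the whole text instead of dropping it.
    return value.strip() if separator else text


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
record = {key: value_after_label(soup, key) for key in ("file_id", "company", "occupation")}
print(record)
```

A full-width colon would not be split by this sketch and would need its own handling.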
