
[Python] Collecting the titles of local HTML files into an Excel file

Last updated at 2019-09-12 Posted at 2019-11-05

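This script collects the titles of a (large) number of HTML files stored locally under a given folder into a single Excel file.
tqdm gives a rough indication of progress while it runs.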

.py

import pathlib
from tqdm import tqdm
from bs4 import BeautifulSoup
import openpyxl

# Specify the folder to scan
p_temp = pathlib.Path('C:/hoge/')

# Prepare the Excel workbook
wb = openpyxl.Workbook()
sheet_title = wb.active
sheet_title.title = 'title'

# Collect the titles (stored in the dict "titlelist")
titlelist = {}
# Match .html files exactly one folder level below p_temp
filelist_t = p_temp.glob('*/*.html')
for i in tqdm(filelist_t):
    filename = str(i)
    with open(filename, encoding='utf-8') as f:
        html = f.read()
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find('title').text
    # Keep only the file name (assumes Windows-style '\\' separators)
    filelist = filename.split('\\')
    filenamecut = filelist[-1]
    titlelist[filenamecut] = title_tag

# Write the file names to column A and the corresponding values to column B
def toexcel(listname, sheetname):
    for row, key in enumerate(listname, start=1):
        sheetname['A' + str(row)] = key
        sheetname['B' + str(row)] = listname[key]

toexcel(titlelist,sheet_title)

wb.save('titlelist.xlsx')
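The script above takes the file name by splitting the path string on a backslash, which works when the paths use Windows-style separators. As a minimal sketch of a separator-independent alternative, pathlib's `.name` attribute returns the last path component directly. The example paths below are hypothetical and only serve to illustrate the difference.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical example paths, not taken from the article
win_path = PureWindowsPath(r'C:\hoge\sub\page.html')
posix_path = PurePosixPath('/home/hoge/sub/page.html')

# .name returns the last path component regardless of the separator style
win_name = win_path.name
posix_name = posix_path.name

# Splitting on a backslash leaves a POSIX-style path string untouched
split_name = str(posix_path).split('\\')[-1]

print(win_name, posix_name, split_name)
```

In the loop, `i` is already a `pathlib.Path`, so `i.name` could stand in for the `split('\\')` step if the script needs to run outside Windows.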

.py

soup.find('title').text

retrieves only the text of the element we need (here, the title tag).
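One thing to watch for: `soup.find('title')` returns `None` when a file has no title tag, so accessing `.text` on it raises `AttributeError`. The following is a minimal sketch of a guard, assuming an empty string is an acceptable placeholder for such files. `extract_title` is a hypothetical helper name, and the `strip()` call is an extra that the original script does not perform.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the text of the first <title> tag, or '' if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find('title')  # None when the document has no <title>
    if title is None:
        return ''
    # strip() removes surrounding whitespace and newlines (not in the original)
    return title.text.strip()


with_title = extract_title('<html><head><title> Hello </title></head></html>')
without_title = extract_title('<p>no title here</p>')
print(repr(with_title), repr(without_title))
```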

.py

    filelist = filename.split('\\')
    filenamecut = filelist[-1]
    titlelist[filenamecut] = title_tag

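keeps only the file name as the key. Since the results are stored in a dict, adapt this as needed, for example by grouping the entries by folder level.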
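As one possible way to group by folder, the sketch below builds a nested dict keyed first by the immediate parent folder and then by file name. This also avoids a side effect of file-name-only keys: two files with the same name in different subfolders would otherwise overwrite each other in the dict. `group_titles_by_folder` and the sample data are hypothetical and stand in for the real scan results.

```python
from collections import defaultdict
from pathlib import PurePath


def group_titles_by_folder(titles_by_path):
    """Turn {path: title} into {parent folder name: {file name: title}}."""
    grouped = defaultdict(dict)
    for path, title in titles_by_path.items():
        p = PurePath(path)
        # p.parent.name is the immediate folder; p.name is the file name
        grouped[p.parent.name][p.name] = title
    return dict(grouped)


# Hypothetical sample data: note that index.html appears in two folders
sample = {
    'hoge/a/index.html': 'Top page',
    'hoge/a/about.html': 'About',
    'hoge/b/index.html': 'Another top page',
}
grouped = group_titles_by_folder(sample)
print(grouped)
```

To use this with the script, the dict would need to be keyed by the full path (for example `str(i)`) instead of the file name alone.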

Result
titlelist.xlsx
