
[Python] Collecting the titles of locally stored HTML files into an Excel workbook

Posted at 2019-11-05

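This script collects the titles of the (many) HTML files saved locally under a given folder into an Excel workbook.
tqdm gives a rough indication of progress (a running count rather than a percentage, because the file list is a generator of unknown length).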

.py
import pathlib
from tqdm import tqdm
from bs4 import BeautifulSoup
import openpyxl

# Specify the folder to scan
p_temp = pathlib.Path('C:/hoge/')

# Prepare the Excel workbook
wb = openpyxl.Workbook()
sheet_title = wb.active
sheet_title.title = 'title'

# Collect the titles (stored in `titlelist`, keyed by file name)
titlelist = {}
filelist_t = p_temp.glob('*/*.html')  # HTML files one level below the folder
for i in tqdm(filelist_t):
    filename = str(i)
    with open(filename, encoding='utf-8') as f:
        html = f.read()
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find('title').text
    filelist = filename.split('\\')
    filenamecut = filelist[-1]
    titlelist[filenamecut] = title_tag

# Write the file names to column A and the values (titles) to column B
def toexcel(listname, sheetname):
    ii = 1  # Excel rows are 1-indexed
    for i in listname:
        f_name = 'A' + str(ii)
        f_value = 'B' + str(ii)
        ii += 1
        sheetname[f_name] = i
        sheetname[f_value] = listname[i]

toexcel(titlelist,sheet_title)

wb.save('titlelist.xlsx')
.py
soup.find('title').text

retrieves only the text of the element we need (the title tag).
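
Note that `soup.find('title')` returns `None` when a file has no title tag, so calling `.text` on it raises `AttributeError` and stops the whole run. A minimal sketch of a more forgiving lookup follows; the helper name `extract_title` and its `default` parameter are hypothetical and not part of the original script.

```python
from bs4 import BeautifulSoup


def extract_title(html, default=""):
    """Return the stripped <title> text, or `default` when there is no <title>.

    Hypothetical helper for illustration; the original script calls
    soup.find('title').text directly.
    """
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")  # None when the page has no <title>
    if title_tag is None:
        return default
    # get_text(strip=True) also trims surrounding whitespace and newlines
    return title_tag.get_text(strip=True)


print(extract_title("<html><head><title> Sample page </title></head></html>"))
print(repr(extract_title("<html><body>no title here</body></html>")))
```

In the loop above, `title_tag = extract_title(html)` could stand in for the direct `.text` call if some files might lack a title.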

.py
    filelist = filename.split('\\')
    filenamecut = filelist[-1]
    titlelist[filenamecut] = title_tag

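keeps only the file name as the dictionary key. Because the results are stored in a dict, you can adapt this as needed, for example by grouping the entries by directory level.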
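
The grouping by directory level mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the function name `group_titles_by_folder` and the sample paths are made up, and `PureWindowsPath` is used so that the Windows-style paths parse on any OS. It also avoids splitting on `'\\'` by hand, which only works with Windows separators.

```python
from collections import defaultdict
from pathlib import PureWindowsPath


def group_titles_by_folder(titles_by_path, root):
    """Turn {full path: title} into {subfolder: {file name: title}}.

    Hypothetical helper; `root` is the scanned folder (e.g. 'C:/hoge').
    """
    grouped = defaultdict(dict)
    for path_str, title in titles_by_path.items():
        path = PureWindowsPath(path_str)
        # Folder relative to the root, e.g. 'docs' for C:\hoge\docs\index.html
        folder = path.parent.relative_to(root).as_posix()
        # path.name is the file name only, independent of the separator used
        grouped[folder][path.name] = title
    return dict(grouped)


# Made-up sample data for illustration
sample = {
    r"C:\hoge\docs\index.html": "Home",
    r"C:\hoge\docs\about.html": "About",
    r"C:\hoge\blog\post.html": "First post",
}
grouped = group_titles_by_folder(sample, "C:/hoge")
for folder, titles in grouped.items():
    print(folder, titles)
```

Inside the original loop, `i` is already a `pathlib.Path`, so `i.name` and `i.parent.name` would give the file name and its folder directly without converting to a string first.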

Result
titlelist.xlsx
