
I didn't know how to concatenate HTML files, so I tried a few things

Posted at 2019-09-26, last updated at 2019-09-26

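Introduction

When I tried to fetch HTML from EDINET, a single release was split across several HTML files, so I wanted to combine them into one file to make later processing easier.

I therefore needed to concatenate HTML files, but I found surprisingly few articles on the topic, so I am summarizing what I did here.

Final script

Pass a list of HTML file paths to the bind_html function below and the files should be concatenated in the order given.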

from bs4 import BeautifulSoup

def bind_html(html_path_list: list) -> str:
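    '''Concatenate multiple HTML files in order.

    Args:
        html_path_list (list[str]): Paths of the HTML files to combine.

    Attributes:
        pure_bound_html (str): The HTML documents joined purely as strings.
        bound_html (str): pure_bound_html with the seams between documents removed.

    Returns:
        str: The combined HTML.
    '''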
    soup_list = []
    for html_path in html_path_list:
        with open(html_path) as f:
            soup_list.append(BeautifulSoup(f.read(), 'lxml'))
    pure_bound_html = ''.join([soup.prettify() for soup in soup_list])
    bound_html = pure_bound_html.replace('</html>\n<html>\n', '')
    return bound_html

To save the result as is, appending the following is enough.

'''
Attributes:
    html_path_list: list of HTML file paths
    save_path: where to save the combined HTML
'''
bound_html = bind_html(html_path_list)
with open(save_path, mode='w') as f:
    f.write(bound_html)
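Because bind_html concatenates in list order, how the list is built matters. Here is a minimal sketch, assuming all parts of a release sit in one directory and that sorting by file name gives the intended order. Both are assumptions, and the helper name collect_html_paths is hypothetical (it is not part of the original script).

```python
from pathlib import Path


def collect_html_paths(directory: str, pattern: str = '*.html') -> list:
    """Return paths matching `pattern` in `directory`, sorted by file name.

    Hypothetical helper: it assumes that file-name order equals the
    intended reading order, which may not hold for every data source.
    """
    return sorted(str(p) for p in Path(directory).glob(pattern))
```

The result can be passed straight to bind_html. If the files use a different extension such as .htm, pass pattern='*.htm' instead.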

What I tried

The first thing I tried was to concatenate the HTML files as they are, treating them as plain text files.

I prepared two sample HTML files.

sample1.html

<html>
  <body>
    <h1>テスト1</h1>
    <p>p tag1</p>
  </body>
</html>

sample2.html

<html>
  <body>
    <h1>テスト2</h1>
    <p>p tag2</p>
  </body>
</html>

Simply concatenating them gives the following.

sample_pure_bound.html

<html>
  <body>
    <h1>テスト1</h1>
    <p>p tag1</p>
  </body>
</html>
<html>
  <body>
    <h1>テスト2</h1>
    <p>p tag2</p>
  </body>
</html>
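The plain-text concatenation that produces sample_pure_bound.html can be sketched as follows. This is only an illustration: the function name concat_as_text is hypothetical, and it assumes the files are UTF-8 encoded, which the original article does not state.

```python
from pathlib import Path


def concat_as_text(paths, encoding='utf-8'):
    """Read each file as text and join the contents in the given order."""
    # No parsing happens here: the result is just the files back to back.
    return ''.join(Path(p).read_text(encoding=encoding) for p in paths)
```

For example, concat_as_text(['sample1.html', 'sample2.html']) returns a single string that still contains two complete <html>...</html> documents.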

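Saving sample_pure_bound.html and opening it in Chrome displays both parts as expected. However, parsing it with BeautifulSoup (using the lxml parser) yields only the first HTML document, as shown below.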

from bs4 import BeautifulSoup
with open('sample_pure_bound.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
print(soup.prettify())
# <html>
# <body>
#  <h1>
#   テスト1
#  </h1>
#  <p>
#   p tag1
#  </p>
# </body>
# </html>
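To confirm that the second document is lost at parse time rather than during concatenation, one option is to count start tags with the standard library's html.parser, which tokenizes the text without applying any error recovery. This is a minimal sketch. TagCounter and count_start_tags are hypothetical names introduced for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each start tag appears in the input."""

    def __init__(self):
        super().__init__()
        self.starts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        self.starts[tag] += 1


def count_start_tags(html_text: str) -> Counter:
    """Return a Counter mapping tag name to number of start tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.starts
```

Applied to the contents of sample_pure_bound.html, count_start_tags(text)['html'] is 2, so both documents are present in the raw text even though the parsed tree above shows only one.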

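This seems to be because the parser treats the first <html>...</html> pair as the whole document and drops whatever follows it. If so, removing the seam between the two documents,

</html>
<html>

might be enough to make it work.

So I tried it.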

sample_bound.html

<html>
  <body>
    <h1>テスト1</h1>
    <p>p tag1</p>
  </body>
  <body>
    <h1>テスト2</h1>
    <p>p tag2</p>
  </body>
</html>

from bs4 import BeautifulSoup
with open('sample_bound.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')
print(soup.prettify())
# <html>
# <body>
#  <h1>
#   テスト1
#  </h1>
#  <p>
#   p tag1
#  </p>
# </body>
# <body>
#  <h1>
#   テスト2
#  </h1>
#  <p>
#   p tag2
#  </p>
# </body>
# </html>

This time the whole file is parsed. Chrome also displays it without problems, so all that remains is to write this seam-removal step in Python.
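One way to express the seam removal that does not depend on the exact whitespace around the tags is a regular expression. This is a hedged sketch, not the approach used in the final script: the names SEAM and remove_seams are hypothetical, and the pattern does not handle a doctype or XML declaration sitting between two documents.

```python
import re

# Matches a closing </html>, any whitespace, and the next opening <html ...> tag.
SEAM = re.compile(r'</html\s*>\s*<html\b[^>]*>', re.IGNORECASE)


def remove_seams(html_text: str) -> str:
    """Delete every '</html> ... <html>' seam between concatenated documents."""
    return SEAM.sub('', html_text)
```

The first <html> and the last </html> are left untouched because neither is part of a seam, so the result has a single root element wrapping one <body> per source file.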

To walk through the final script:

soup_list = []
for html_path in html_path_list:
    with open(html_path) as f:
        # Read each HTML file, parse it with BeautifulSoup, and append the result to soup_list
        soup_list.append(BeautifulSoup(f.read(), 'lxml'))
# prettify each soup into a string, then join them with no separator
pure_bound_html = ''.join([soup.prettify() for soup in soup_list])
# After joining, remove the seams ('</html>\n<html>\n')
bound_html = pure_bound_html.replace('</html>\n<html>\n', '')

That is the gist of it.
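The string-based approach leaves one <body> per source file inside a single <html>. As an alternative that is not what the article's script does, the documents could be merged at the tree level by moving the children of each later <body> into the first one. This is a minimal sketch under stated assumptions: merge_bodies is a hypothetical name, the first input must contain a <body>, and the default parser here is Python's built-in 'html.parser' rather than 'lxml' purely to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def merge_bodies(html_texts, parser='html.parser'):
    """Move the <body> children of every later document into the first one.

    Sketch only: assumes the first input has a <body>; <head> content of
    the later documents is discarded.
    """
    soups = [BeautifulSoup(text, parser) for text in html_texts]
    merged = soups[0]
    for other in soups[1:]:
        if other.body is None:
            continue  # nothing to merge from this document
        # Copy the list first: append() moves nodes out of `other`.
        for node in list(other.body.contents):
            merged.body.append(node)
    return merged
```

The return value is a BeautifulSoup object, so merged.prettify() or str(merged) gives the combined HTML as text.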

That's all.

References

BeautifulSoup prettify method
str join method
Reading and writing files
