
How to convert foo.webarchive to HTML

Last updated at 2021-02-21, posted at 2019-09-05

Intro

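Some web pages lose information when saved as HTML, for example because of JavaScript, but saving them in the .webarchive format from Safari or a similar browser keeps almost everything. That much is well known; this is a memo on what to do with the file afterwards.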

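That said, since this assumes you are on a Mac anyway, in many cases you can simply open a terminal and convert the file in one shot with the textutil command, and you are done. In those cases there is little need to read the rest of this article.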

~ % textutil -convert html foo.webarchive

Still, some pages are not that straightforward, and for those you will probably end up using the method below.

Method

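Save the target page in .webarchive format from Safari or a similar browser.
Open the .webarchive in a text editor and clean up the header (see the example below).
Then load it directly with BeautifulSoup (tested only with Beautiful Soup 4).
Done.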
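
The manual steps above can also be sketched in Python. This is only an illustration, not the procedure actually used in this article: the function names are hypothetical, the page is assumed to be UTF-8, and the marker is matched case-sensitively, so a page that writes `<!doctype html>` in lowercase would need a different marker.

```python
from pathlib import Path

DOCTYPE = b"<!DOCTYPE html>"


def strip_webarchive_header(data: bytes, marker: bytes = DOCTYPE) -> bytes:
    """Drop everything before the last occurrence of `marker`."""
    start = data.rfind(marker)
    if start == -1:
        raise ValueError("marker not found in the data")
    return data[start:]


def load_html_text(path: str) -> str:
    """Read a .webarchive file and return the cleaned part as text."""
    raw = Path(path).read_bytes()  # read as bytes, so no decoding error here
    cleaned = strip_webarchive_header(raw)
    # Assumption: UTF-8 page. Stray binary bytes after </html> become U+FFFD
    # instead of raising UnicodeDecodeError.
    return cleaned.decode("utf-8", errors="replace")


# Hypothetical usage (requires the third-party beautifulsoup4 package):
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(load_html_text("foo.webarchive"), "html.parser")
```

Working on bytes and decoding only after the header is gone avoids feeding the binary prefix to a text decoder.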

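《Before cleanup》
In this example the part to delete is only one or two lines long, but in some cases it runs to several hundred lines. Search for <!DOCTYPE html> (there may be more than one match) and delete everything up to just before the last <!DOCTYPE html>.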

bplist00�������T_WebMainResource_WebSubframeArchives_WebSubresources������	�
���
�_WebResourceData_WebResourceMIMEType_WebResourceTextEncodingName^WebResourceURL_WebResourceFrameNameO���<!DOCTYPE html><html lang="ja" class="enhanced vanilla-layout mouse-optimized is-search-header-sticky" data-service-name="search" data-is-logged-in="true" data-user-rank="DSR3" data-sentry-dsn="https://example.com/foo" data-kite-env="production">

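《After cleanup》
Everything from the start of the file up to just before <!DOCTYPE html> has been deleted.
From here on it is almost ordinary HTML.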

<!DOCTYPE html><html lang="ja" class="enhanced vanilla-layout mouse-optimized is-search-header-sticky" data-service-name="search" data-is-logged-in="true" data-user-rank="DSR3" data-sentry-dsn="https://example.com/foo" data-kite-env="production">
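
The `bplist00` prefix in the uncleaned sample indicates a binary property list, and the key names visible in that header (`WebMainResource`, `WebResourceData`, `WebResourceTextEncodingName`) suggest a structured alternative to hand editing. The following is a minimal sketch using Python's standard `plistlib`, not the method used in this article; the function name is hypothetical, and it assumes the main page's HTML is stored as raw bytes under those keys and that the encoding name is one Python recognizes.

```python
import plistlib


def extract_main_html(archive_bytes: bytes) -> str:
    """Return the main page's HTML from the bytes of a .webarchive file."""
    # plistlib detects the binary format on its own.
    archive = plistlib.loads(archive_bytes)
    main = archive["WebMainResource"]
    # Assumption: fall back to UTF-8 when no encoding name is recorded.
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    return main["WebResourceData"].decode(encoding, errors="replace")


# Hypothetical usage:
# from pathlib import Path
# html = extract_main_html(Path("foo.webarchive").read_bytes())
```

With this approach there is nothing to trim, because only the main resource's data is decoded.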

Conclusion

🍺😋

Addendum

How to trim the header

I deleted everything from the start of the file up to just before the last <!DOCTYPE html>.

At first, when I opened foo.webarchive with bs4 exactly as saved, I got an error message like

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 8: invalid continuation byte

so I tried deleting up to the � that looked like the 0xd3 byte, and this time I got

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

That error was beyond my understanding, so I cut out the whole region around the � characters in one go, and it worked.
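
The UnicodeDecodeError quoted above can be reproduced without the real file. A small sketch, assuming the file starts with the eight ASCII characters `bplist00` followed by the byte 0xd3; the three bytes after 0xd3 are made up for illustration, and the helper name is hypothetical.

```python
def first_decode_error(data: bytes):
    """Return (position, byte, reason) of the first UTF-8 error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start], err.reason
    return None


# "bplist00" occupies positions 0-7, so 0xd3 sits at position 8.
header = b"bplist00\xd3\x01\x02\x03"

result = first_decode_error(header)
if result is not None:
    position, byte, reason = result
    print(f"byte {byte:#x} at position {position}: {reason}")
```

In UTF-8, 0xd3 announces a two-byte character, and the byte after it is not a valid second byte, so decoding stops right there. Text that starts at <!DOCTYPE html> does not hit this particular error.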

Note that there is also some code-like data after the </html> near the end of the file, but it can be left alone without problems.

Incidentally, once IOPub data rate exceeded. appears, I had to restart the Jupyter Notebook kernel before anything worked again.
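
The IOPub message is Jupyter's guard against a cell sending too much output at once, so one plausible trigger here is that a very large string ended up being displayed in the notebook. That is a guess, not something confirmed in this article. Under that assumption, a sketch like the following (the helper name and the 500-character limit are arbitrary) keeps the displayed output small.

```python
def preview(text: str, limit: int = 500) -> str:
    """Return at most `limit` characters, noting how much was left out."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return f"{text[:limit]}\n... [{omitted} more characters not shown]"


# Hypothetical usage in a notebook cell:
# print(preview(str(soup)))
```

Assigning the parsed result to a variable and printing only a preview avoids dumping the whole page into the cell output.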

That's all.
