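This article is the day-23 entry of the ConoHa Advent Calendar 2019.
ConoHa, of "neat and cute" fame, released a template image called ArchiveBox, so I gave it a try. That's the whole story.
Summary first
First, copy and paste the following (as root):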
apt-get update;
yes | apt-get -y upgrade;
apt-get -y dist-upgrade;
cd /opt/archivebox/;
git checkout master;
git pull;
apt-get remove -y youtube-dl;
wget https://yt-dl.org/latest/youtube-dl -O /usr/local/bin/youtube-dl;
chmod a+x /usr/local/bin/youtube-dl;
hash -r;
Once that is done, you can archive a page with:
echo "URL of the web page you want to archive" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive
The results are saved under /opt/archivebox/output/ and can be checked through the web UI.
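What on earth is ArchiveBox?
[Release] [VPS] "ArchiveBox" template image now available | ConoHa VPS
On Wednesday, April 24, 2019, ConoHa began offering the "ArchiveBox" application template image.
With the template, you can start using "ArchiveBox" right away; it lets you easily save and archive the content of URLs you specify in formats such as HTML, PDF, and images.
In a word, it is apparently
"The open-source self-hosted web archive."
(Think of a do-it-yourself web gyotaku, i.e. a page-snapshot service for preserving your embarrassing past.)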
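Creating the server
Log in to the ConoHa dashboard and quickly spin up a server.
This looks like it will need a lot of storage, so I chose the 1 GB plan, which comes with more storage (a 50 GB SSD for under 1,000 yen. Cheap!!)
The image type is, of course, "ArchiveBox". Set the root password and so on as you like.
Logging in and setting up
Log in to the IP address you were assigned.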
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
System information as of Sun Dec 22 xx:xx:xx JST 2019
System load: 1.5 Processes: 120
Usage of /: 11.5% of 49.09GB Users logged in: 0
Memory usage: 25% IP address for eth0: 255.255.255.255
Swap usage: 0%
* Overheard at KubeCon: "microk8s.status just blew my mind".
https://microk8s.io/docs/commands#microk8s.status
241 packages can be updated.
112 updates are security updates.
*** System restart required ***
================================================
Welcome to ArchiveBox image!
Server address : http://255.255.255.255/
ArchiveBox directory : /opt/archivebox/
ArchiveBox Web Username : abox_user
ArchiveBox Web Password : XXXXXXXXXX
Enjoy Minecraft!
To delete this message: rm -f /etc/motd
================================================
The handy-looking "System information" block catches my eye, but for now let's look at the ArchiveBox part.
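Since the banner says it can be deleted with rm -f /etc/motd, the same values should be readable from that file later. Here is a minimal sketch for pulling one field back out. The function name motd_field is my own, and it assumes the file keeps the "Key : value" layout shown above.

```shell
# Hypothetical helper: print the value of one "Key : value" line from a MOTD-style file.
motd_field() {
    local key="$1" file="${2:-/etc/motd}"
    awk -v key="$key" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)            # tolerate indentation
            if (index(line, key) == 1) {        # the line starts with the requested key
                pos = index(line, ": ")         # first colon followed by a space
                if (pos > 0) { print substr(line, pos + 2); exit }
            }
        }
    ' "$file"
}
```

For example, motd_field "ArchiveBox Web Username" would print the user name from the banner. Matching on ": " rather than ":" keeps values such as http://... intact.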
Oh, there's a URL, so there must be a web UI. Let's try it right away.
Enter the Web Username and Web Password.
...Huh?
Let me check the documentation.
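Viewing archived web pages
Archived web pages can be viewed from a web browser. The archive viewing page is not generated until you have run at least one archive, and you will get a 404 error until then, so run an archive by following the earlier section "Archiving web pages with ArchiveBox".
I see.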
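So the 404 just means no archive has run yet. Assuming the viewer page is the index.html that the logs in this article write under /opt/archivebox/output/ (that is my assumption, and so is the function name index_ready), a quick check from the shell might look like this:

```shell
# Hypothetical check: has at least one archive run produced the index page?
index_ready() {
    local output_dir="${1:-/opt/archivebox/output}"
    [ -f "$output_dir/index.html" ]
}
```

Something like `index_ready || echo "run an archive first"` would then say whether the web UI has anything to show.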
So this is the archive command.
$ cd /opt/archivebox/ && \
sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False \
/opt/archivebox/archive "URL of the web page you want to archive"
Before actually running it, do the customary update first,
$ apt-get update
$ apt-get -y upgrade
$ apt-get -y dist-upgrade
and then, finally, archive the documentation page.
$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://support.conoha.jp/v/archivebox/"
[*] [2019-12-22 15:53:48] Downloading https://support.conoha.jp/v/archivebox/
> output/sources/support.conoha.jp-1576997628.txt
[*] [2019-12-22 15:53:49] Parsing new links from output/sources/support.conoha.jp-1576997628.txt...
[X] No links found :(
...Something's off.
Maybe the page is the problem? This time, Google:
$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://www.google.com/?hl=ja"
[*] [2019-12-22 15:54:09] Downloading https://www.google.com/?hl=ja
> output/sources/www.google.com-1576997649.txt
[*] [2019-12-22 15:54:09] Parsing new links from output/sources/www.google.com-1576997649.txt...
> Adding 14 new links to index (parsed import as Plain Text)
[*] [2019-12-22 15:54:09] Updating main index files...
> output/index.json
> output/index.html
[?] [2019-12-22 15:54:09] Updating content for 14 pages in archive...
[+] [2019-12-22 15:54:10] "https://www.youtube.com/?gl=JP&tab=w1"
https://www.youtube.com/?gl=JP&tab=w1
> output/archive/1576997649 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:10] "https://www.google.com/setprefdomain?prefdom=JP&prev=https://www.google.co.jp/&sig=K_JeAmkhcsGNpZGRumn5RDR2zO--w%3D"
https://www.google.com/setprefdomain?prefdom=JP&prev=https://www.google.co.jp/&sig=K_JeAmkhcsGNpZGRumn5RDR2zO--w%3D
> output/archive/1576997649.0 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:10] "https://www.google.com/logos/doodles/2019/winter-2019-northern-hemisphere-5325275381366784-2x.jpg"
https://www.google.com/logos/doodles/2019/winter-2019-northern-hemisphere-5325275381366784-2x.jpg
> output/archive/1576997649.1 (new)
> favicon
> title
> wget
> pdf
> screenshot
> dom
> git
> media
√ index.json
√ index.html
[+] [2019-12-22 15:54:12] "https://www.google.co.jp/intl/ja/about/products?tab=wh"
https://www.google.co.jp/intl/ja/about/products?tab=wh
> output/archive/1576997649.2 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://www.google.co.jp/imghp?hl=ja&tab=wi"
https://www.google.co.jp/imghp?hl=ja&tab=wi
> output/archive/1576997649.3 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://play.google.com/?hl=ja&tab=w8"
https://play.google.com/?hl=ja&tab=w8
> output/archive/1576997649.4 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://news.google.co.jp/nwshp?hl=ja&tab=wn"
https://news.google.co.jp/nwshp?hl=ja&tab=wn
> output/archive/1576997649.5 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://maps.google.co.jp/maps?hl=ja&tab=wl"
https://maps.google.co.jp/maps?hl=ja&tab=wl
> output/archive/1576997649.6 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://mail.google.com/mail/?tab=wm"
https://mail.google.com/mail/?tab=wm
> output/archive/1576997649.7 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://drive.google.com/?tab=wo"
https://drive.google.com/?tab=wo
> output/archive/1576997649.8 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://accounts.google.com/ServiceLogin?hl=ja&passive=true&continue=https://www.google.com/%3Fhl%3Dja"
https://accounts.google.com/ServiceLogin?hl=ja&passive=true&continue=https://www.google.com/%3Fhl%3Dja
> output/archive/1576997649.9 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://www.google.co.jp/intl/ja/services/"
http://www.google.co.jp/intl/ja/services/
> output/archive/1576997649.10 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://www.google.co.jp/history/optout?hl=ja"
http://www.google.co.jp/history/optout?hl=ja
> output/archive/1576997649.11 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://schema.org/WebPage"
http://schema.org/WebPage
> output/archive/1576997649.12 (new)
> favicon
! Failed to archive link: KeyError: 'domain'
[√] [2019-12-22 15:54:12] Update of 14 pages complete (2.93 sec)
- 1 entries skipped
- 7 entries updated
- 0 errors
To view your archive, open: output/index.html
[*] [2019-12-22 15:54:12] Updating main index files...
> output/index.json
> output/index.html
That's long! Back to the browser...!!
Not what I expected.
I thought it would be one URL = one row, but entries like play.google.com are in there. Why?
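A quick tally of the log helps show what the run actually did. In the Google run above, 13 of the 14 links end in "Failed to archive link", yet the footer reports "0 errors". Below is a minimal sketch that counts both. The function name summarize_archive_log is made up, and it assumes the log lines look like the ones printed in this article.

```shell
# Hypothetical helper: summarise an ArchiveBox log read from stdin.
summarize_archive_log() {
    awk '
        /\[\+\] \[/               { attempted++ }   # one "[+] [timestamp]" line per link attempted
        /Failed to archive link/  { failed++ }      # one such line per link that failed outright
        END { printf "attempted=%d failed=%d\n", attempted, failed }
    '
}
```

Saving a run with `... 2>&1 | tee archive.log` and then running `summarize_archive_log < archive.log` would give the two counts without scrolling through everything.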
Maybe the version is old.
It looks like a Git checkout, so let's update it.
$ cd /opt/archivebox/
$ git checkout master
$ git pull
Retry (this time with Yahoo)
$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://www.yahoo.co.jp/"
[*] [2019-12-22 15:56:01] Downloading https://www.yahoo.co.jp/
> output/sources/www.yahoo.co.jp-1576997761.txt
[*] [2019-12-22 15:56:01] Parsing new links from output/sources/www.yahoo.co.jp-1576997761.txt...
> Adding 65 new links to index (parsed import as Plain Text)
[*] [2019-12-22 15:56:01] Saving main index files...
√ output/index.json
√ output/index.html
[?] [2019-12-22 15:56:01] Updating content for 79 pages in archive...
[+] [2019-12-22 15:56:01] "https://www.yahoo.co.jp/"
https://www.yahoo.co.jp/
> output/archive/1576997761
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:56:13] "https://www.yahoo-help.jp/app/answers/detail/p/533/a_id/43883"
https://www.yahoo-help.jp/app/answers/detail/p/533/a_id/43883
> output/archive/1576997761.0
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:56:24] "https://www.yahoo-help.jp/"
https://www.yahoo-help.jp/
> output/archive/1576997761.1
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:56:41] "https://weather.yahoo.co.jp/weather/"
https://weather.yahoo.co.jp/weather/
> output/archive/1576997761.2
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:56:53] "https://tv.yahoo.co.jp/"
https://tv.yahoo.co.jp/
> output/archive/1576997761.3
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:57:09] "https://trilltrill.jp/"
https://trilltrill.jp/
> output/archive/1576997761.4
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:57:25] "https://travel.yahoo.co.jp/?sc_e=ytsl"
https://travel.yahoo.co.jp/?sc_e=ytsl
> output/archive/1576997761.5
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:57:46] "https://travel.yahoo.co.jp/?sc_e=ytmh"
https://travel.yahoo.co.jp/?sc_e=ytmh
> output/archive/1576997761.6
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:58:04] "https://transit.yahoo.co.jp/"
https://transit.yahoo.co.jp/
> output/archive/1576997761.7
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:58:13] "https://sports.yahoo.co.jp/"
https://sports.yahoo.co.jp/
> output/archive/1576997761.8
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:58:25] "https://shopping.yahoo.co.jp/?sc_e=ytc"
https://shopping.yahoo.co.jp/?sc_e=ytc
> output/archive/1576997761.9
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:58:49] "https://shopping.yahoo.co.jp/"
https://shopping.yahoo.co.jp/
> output/archive/1576997761.10
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:59:12] "https://services.yahoo.co.jp/?mode=pc"
https://services.yahoo.co.jp/?mode=pc
> output/archive/1576997761.11
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:59:16] "https://search.yahoo.co.jp/search"
https://search.yahoo.co.jp/search
> output/archive/1576997761.12
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:59:20] "https://retty.me/?utm_y_pc_top"
https://retty.me/?utm_y_pc_top
> output/archive/1576997761.13
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:59:39] "https://realestate.yahoo.co.jp/"
https://realestate.yahoo.co.jp/
> output/archive/1576997761.14
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 15:59:50] "https://rdsig.yahoo.co.jp/travel_kanko/yjtop_cont/RV=1/RU=aHR0cHM6Ly93d3cuaWt5dS5jb20vaWtDby5hc2h4P2Nvc2lkPWlrMDEwMDAyJnN1cmw9JTJG"
https://rdsig.yahoo.co.jp/travel_kanko/yjtop_cont/RV=1/RU=aHR0cHM6Ly93d3cuaWt5dS5jb20vaWtDby5hc2h4P2Nvc2lkPWlrMDEwMDAyJnN1cmw9JTJG
> output/archive/1576997761.15
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:00:05] "https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9"
https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9
> output/archive/1576997761.16
> title
Failed: Unable to detect page title
Run to see full output:
cd /opt/archivebox/output/archive/1576997761.16;
curl https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9 | grep <title>
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:00:26] "https://rdsig.yahoo.co.jp/partner/from_ytop/pc/list1/RV=1/RU=aHR0cHM6Ly9wYXJ0bmVyLnlhaG9vLmNvLmpwLw--"
https://rdsig.yahoo.co.jp/partner/from_ytop/pc/list1/RV=1/RU=aHR0cHM6Ly9wYXJ0bmVyLnlhaG9vLmNvLmpwLw--
> output/archive/1576997761.17
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:00:40] "https://rdsig.yahoo.co.jp/auction/promo/yearend2019/pc/ytop/txt/RV=1/RU=aHR0cHM6Ly9hdWN0aW9ucy55YWhvby5jby5qcC90b3BpYy9wcm9tby95ZWFyZW5kMjAxOS8_Y3BpZD1wcl95ZWFyZW5kMjAxOSZtZW51PXRvcHBhZ2UmdGFyPXRvcCZjcj10b3A-"
https://rdsig.yahoo.co.jp/auction/promo/yearend2019/pc/ytop/txt/RV=1/RU=aHR0cHM6Ly9hdWN0aW9ucy55YWhvby5jby5qcC90b3BpYy9wcm9tby95ZWFyZW5kMjAxOS8_Y3BpZD1wcl95ZWFyZW5kMjAxOSZtZW51PXRvcHBhZ2UmdGFyPXRvcCZjcj10b3A-
> output/archive/1576997761.18
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:00:51] "https://privacy.yahoo.co.jp/"
https://privacy.yahoo.co.jp/
> output/archive/1576997761.19
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:00:57] "https://premium.yahoo.co.jp/"
https://premium.yahoo.co.jp/
> output/archive/1576997761.20
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:01:26] "https://points.yahoo.co.jp/"
https://points.yahoo.co.jp/
> output/archive/1576997761.21
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:01:36] "https://news.yahoo.co.jp/topics/top-picks?date=20191222&mc=f&mp=f"
https://news.yahoo.co.jp/topics/top-picks?date=20191222&mc=f&mp=f
> output/archive/1576997761.22
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:01:46] "https://news.yahoo.co.jp/pickup/6346062"
https://news.yahoo.co.jp/pickup/6346062
> output/archive/1576997761.23
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:01:58] "https://news.yahoo.co.jp/pickup/6346059"
https://news.yahoo.co.jp/pickup/6346059
> output/archive/1576997761.24
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:02:11] "https://news.yahoo.co.jp/pickup/6346056"
https://news.yahoo.co.jp/pickup/6346056
> output/archive/1576997761.25
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:02:29] "https://news.yahoo.co.jp/pickup/6346053"
https://news.yahoo.co.jp/pickup/6346053
> output/archive/1576997761.26
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[+] [2019-12-22 16:02:44] "https://news.yahoo.co.jp/pickup/6346052"
https://news.yahoo.co.jp/pickup/6346052
> output/archive/1576997761.27
> title
> favicon
> wget
> pdf
> screenshot
?? 2.4% (1/60sec)^C
[X] [2019-12-22 16:02:52] Downloading paused on link 1576997761.27 (29/79)
To view your archive, open: output/index.html
Continue where you left off by running:
archive 1576997761.27
Five minutes in and it still hasn't finished...
Time to look at the official docs...
How does it work?
echo 'http://example.com' | ./archive
A pipe???
$ echo "https://www.yahoo.co.jp/" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive
[*] [2019-12-22 16:10:13] Parsing new links from output/sources/stdin-1576998613.txt...
> Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:10:13] Saving main index files...
√ output/index.json
√ output/index.html
[?] [2019-12-22 16:10:13] Updating content for 1 pages in archive...
[+] [2019-12-22 16:10:13] "https://www.yahoo.co.jp/"
https://www.yahoo.co.jp/
> output/archive/1576998613
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[√] [2019-12-22 16:10:26] Update of 1 pages complete (12.93 sec)
- 0 links skipped
- 1 links updated
- 0 links had errors
To view your archive, open: output/index.html
[*] [2019-12-22 16:10:26] Saving main index files...
√ output/index.json
√ output/index.html
Refresh!!
ヽ(=´▽`=)ノ
Click the link!
ヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノ
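What I learned from the official documentation
Usage ・ pirate/ArchiveBox Wiki ・ GitHub
- Running ./archive performs the archive.
- Input from a pipe is treated as a single URL → only that page is archived.
- A parameter is treated as a URL list → the given page is downloaded and every link found in it is archived.
Or so it seems.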
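Since piping one URL archives exactly one page, several pages can be handled by piping them in one at a time. Here is a minimal sketch of that loop. The function name archive_each and the list file urls.txt are my own inventions, and the archiving command is passed in as arguments rather than hard-coded.

```shell
# Hypothetical helper: feed each URL in a list file to a command, one URL per invocation.
archive_each() {
    local list_file="$1"; shift
    local url
    while IFS= read -r url || [ -n "$url" ]; do
        # Skip blank lines and "#" comments.
        case "$url" in ''|'#'*) continue ;; esac
        # "$@" is the archiving command; it receives the single URL on its stdin.
        printf '%s\n' "$url" | "$@"
    done < "$list_file"
}
```

Usage would be along the lines of `archive_each urls.txt sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive`. Passing `cat` as the command gives a dry run that just prints the URLs it would send.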
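Also, it can apparently pull URL lists from
- external URLs such as RSS or XML feeds
- Chrome and Firefox browser history
One more thing that caught my eye was this part: "Audio & Video: media/ all audio/video files + playlists, including subtitles & metadata with youtube-dl".
Could it be...?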
$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive
[*] [2019-12-22 16:11:20] Parsing new links from output/sources/stdin-1576998680.txt...
> Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:11:20] Saving main index files...
√ output/index.json
√ output/index.html
[?] [2019-12-22 16:11:20] Updating content for 2 pages in archive...
[+] [2019-12-22 16:11:20] "https://www.youtube.com/watch?v=3F7cYxVFgKo"
https://www.youtube.com/watch?v=3F7cYxVFgKo
> output/archive/1576998680
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
Failed: Failed to download media
Got youtube-dl response code: 1.
ERROR: 3F7cYxVFgKo: "token" parameter not in video info for unknown reason; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
Run to see full output:
cd /opt/archivebox/output/archive/1576998680;
youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://www.youtube.com/watch?v=3F7cYxVFgKo
[*] [2019-12-22 16:11:34] "Yahoo! JAPAN"
https://www.yahoo.co.jp/
√ output/archive/1576998613
[√] [2019-12-22 16:11:34] Update of 2 pages complete (14.62 sec)
- 1 links skipped
- 0 links updated
- 1 links had errors
To view your archive, open: output/index.html
[*] [2019-12-22 16:11:34] Saving main index files...
√ output/index.json
√ output/index.html
Does youtube-dl need an upgrade too...?
$ sudo youtube-dl -U
Usage: youtube-dl [OPTIONS] URL [URL...]
youtube-dl: error: youtube-dl's self-update mechanism is disabled on Debian.
Please update youtube-dl using apt(8).
See https://packages.debian.org/sid/youtube-dl for the latest packaged version.
Disabled...?
If you have installed youtube-dl using a package manager like apt-get or yum, use the standard system update mechanism to update. Note that distribution packages are often outdated. As a rule of thumb, youtube-dl releases at least once a month, and often weekly or even daily. Simply go to https://yt-dl.org to find out the current version. Unfortunately, there is nothing we youtube-dl developers can do if your distribution serves a really outdated version. You can (and should) complain to your distribution in their bugtracker or support forum.
I see. So the version from the distro repository lags behind.
Remove it once and download the binary directly.
apt-get remove -y youtube-dl
wget https://yt-dl.org/latest/youtube-dl -O /usr/local/bin/youtube-dl
chmod a+x /usr/local/bin/youtube-dl
hash -r
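The hash -r at the end clears the shell's cached command locations, so the next lookup finds the new copy instead of the removed one. To confirm the swap worked, something like the following sketch could be used. The function name check_tool_path is hypothetical; command -v and hash -r are standard shell builtins.

```shell
# Hypothetical check: does the shell now resolve a command to the expected path?
check_tool_path() {
    local name="$1" expected="$2" actual
    hash -r    # forget cached command locations, as in the steps above
    actual="$(command -v "$name")" || { echo "$name: not found in PATH" >&2; return 1; }
    if [ "$actual" = "$expected" ]; then
        echo "$name -> $actual"
    else
        echo "$name resolves to $actual, expected $expected" >&2
        return 1
    fi
}
```

Running `check_tool_path youtube-dl /usr/local/bin/youtube-dl && youtube-dl --version` would then show both where the binary lives and which release it is.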
The update is done, so retry.
$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive
[*] [2019-12-22 16:14:20] Parsing new links from output/sources/stdin-1576998860.txt...
> Adding 0 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:14:20] Saving main index files...
√ output/index.json
√ output/index.html
[?] [2019-12-22 16:14:20] Updating content for 2 pages in archive...
[*] [2019-12-22 16:14:20] "美雲このは(CV:上坂すみれ)「空色Drops」 - YouTube"
https://www.youtube.com/watch?v=3F7cYxVFgKo
√ output/archive/1576998680
[*] [2019-12-22 16:14:20] "Yahoo! JAPAN"
https://www.yahoo.co.jp/
√ output/archive/1576998613
[√] [2019-12-22 16:14:20] Update of 2 pages complete (0.02 sec)
- 2 links skipped
- 0 links updated
- 0 links had errors
To view your archive, open: output/index.html
[*] [2019-12-22 16:14:20] Saving main index files...
√ output/index.json
√ output/index.html
- 2 links skipped
Hmm...
$ rm -rf /opt/archivebox/output/*
$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive
[*] [2019-12-22 16:15:36] Parsing new links from output/sources/stdin-1576998936.txt...
> Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:15:36] Saving main index files...
√ output/index.json
√ output/index.html
[?] [2019-12-22 16:15:36] Updating content for 1 pages in archive...
[+] [2019-12-22 16:15:36] "https://www.youtube.com/watch?v=3F7cYxVFgKo"
https://www.youtube.com/watch?v=3F7cYxVFgKo
> output/archive/1576998936
> title
> favicon
> wget
> pdf
> screenshot
> dom
> media
[√] [2019-12-22 16:15:52] Update of 1 pages complete (16.15 sec)
- 0 links skipped
- 1 links updated
- 0 links had errors
To view your archive, open: output/index.html
[*] [2019-12-22 16:15:52] Saving main index files...
√ output/index.json
√ output/index.html
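Wiping output/ got the URL re-fetched, but it also throws away everything archived so far. A more cautious variant is to move the old contents aside instead. This is only a sketch: the function name stash_output and the .bak.<timestamp> naming are my own, and it leaves output/ empty just like the rm did.

```shell
# Hypothetical alternative to "rm -rf output/*": move the old contents into a timestamped backup.
stash_output() {
    local output_dir="${1:-/opt/archivebox/output}"
    local backup_dir="${output_dir%/}.bak.$(date +%Y%m%d%H%M%S)"
    local entry
    mkdir -p "$backup_dir" || return 1
    for entry in "$output_dir"/*; do
        [ -e "$entry" ] || continue      # the glob matched nothing: directory is already empty
        mv -- "$entry" "$backup_dir"/ || return 1
    done
    echo "$backup_dir"                   # tell the caller where the old data went
}
```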
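Refresh!!
Endless loading... so it doesn't rewrite links for you (well, of course it doesn't).
Click the Media link in the upper right.
The text is garbled (mojibake), but it looks like the video was archived.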
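That's all.
Afterword
Getting it to work took quite some effort, but the archive quality looks good.
That said, wouldn't it be nice to use ConoHa's object storage while we're at it? I tried all sorts of things with that in mind, but no luck.
(Mounting it with goofys leads to errors from restrictions around permissions, temporary files, and so on. Maybe it would work to create a temporary directory and mv the results over as soon as the archive finishes?)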
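That last idea can be sketched as follows: archive onto local disk as usual, then move the finished snapshot directories to the mounted bucket. I have not tried this against goofys, so treat it as a guess. The function name flush_snapshots and the mount point in the usage line are hypothetical, and it uses cp -R plus rm instead of mv so that nothing is deleted locally unless the copy succeeded.

```shell
# Hypothetical helper: move finished snapshot directories to another directory
# (for example, a mounted bucket).
flush_snapshots() {
    local src="$1" dest="$2" snap
    mkdir -p "$dest" || return 1
    for snap in "$src"/*/; do
        [ -d "$snap" ] || continue       # the glob matched nothing
        snap="${snap%/}"
        # Copy first; remove the local copy only if the copy succeeded.
        cp -R "$snap" "$dest"/ || return 1
        rm -rf "$snap" || return 1
    done
}
```

Usage would be something like `flush_snapshots /opt/archivebox/output/archive /mnt/conoha-object/archive`. The index pages would still point at the local paths, so this only covers the storage side.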