ConoHa Advent Calendar 2019

Day 23

Trying Out ConoHa's ArchiveBox Application

Last updated at Posted at 2019-06-16

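This article is the Day 23 entry in the ConoHa Advent Calendar 2019.

ConoHa, of the neat-and-cute mascot, has released an ArchiveBox template, so I gave it a try. That's what this post is about.

Summary first

First, copy and paste the following: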

apt-get update;
yes | apt-get -y upgrade;
apt-get -y dist-upgrade;
cd /opt/archivebox/;
git checkout master;
git pull;
apt-get remove -y youtube-dl;
wget https://yt-dl.org/latest/youtube-dl -O /usr/local/bin/youtube-dl;
chmod a+x /usr/local/bin/youtube-dl;
hash -r;

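Once that finishes, you can archive a page with the following:

echo "URL of the web page you want to archive" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive

Results are saved under /opt/archivebox/output/ and can be viewed through the web UI.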
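
The one-liner above can be wrapped in a small shell function. This is only a minimal sketch, assuming the template layout shown in this article (the archivebox user and /opt/archivebox/archive). The function name archive_one and the ARCHIVE_CMD override are hypothetical additions of mine, not part of ArchiveBox or the ConoHa image.

```shell
#!/bin/sh
# Hypothetical helper: archive exactly one URL by piping it to ArchiveBox on stdin.
# ARCHIVE_CMD is an assumption of this sketch so the command can be swapped out;
# its default is the invocation used in this article.
: "${ARCHIVE_CMD:=sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive}"

archive_one() {
    url=$1
    # Reject anything that is not an http(s) URL before calling ArchiveBox.
    case $url in
        http://*|https://*) ;;
        *) echo "not an http(s) URL: $url" >&2; return 2 ;;
    esac
    # Unquoted on purpose: ARCHIVE_CMD holds a command plus its arguments.
    printf '%s\n' "$url" | $ARCHIVE_CMD
}

# Example (run on the ArchiveBox server):
#   archive_one "https://www.yahoo.co.jp/"
```

Piping the URL rather than passing it as an argument matters here: as the rest of this article shows, the two forms behave very differently.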


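What is ArchiveBox??

[Release] [VPS] "ArchiveBox" template image now available | ConoHa VPS

On Wednesday, April 24, 2019, ConoHa began offering the "ArchiveBox" application template image.

With the template, you can immediately start using ArchiveBox, which makes it easy to save and archive the content of URLs you specify in formats such as HTML, PDF, and images.

See here for more about ArchiveBox.

In a word:
"The open-source self-hosted web archive."
Apparently. (Think of a self-built web-snapshot service, like Web Gyotaku, for preserving embarrassing history.)

Creating the server

Log in to the ConoHa dashboard and quickly spin up a server.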


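This workload looks storage-hungry, so I chose the 1GB plan, which comes with more storage (a 50GB SSD for under 1,000 yen. Cheap!!)

For the image type, "ArchiveBox" of course. Set the root password and the rest as you like.

Log in and set up

Log in to the assigned IP address.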

Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Sun Dec 22 xx:xx:xx JST 2019

  System load:  1.5                Processes:           120
  Usage of /:   11.5% of 49.09GB   Users logged in:     0
  Memory usage: 25%                IP address for eth0: 255.255.255.255
  Swap usage:   0%

 * Overheard at KubeCon: "microk8s.status just blew my mind".

     https://microk8s.io/docs/commands#microk8s.status

241 packages can be updated.
112 updates are security updates.


*** System restart required ***
================================================
Welcome to ArchiveBox image!

Server address : http://255.255.255.255/

ArchiveBox directory : /opt/archivebox/

ArchiveBox Web Username : abox_user
ArchiveBox Web Password : XXXXXXXXXX

Enjoy Minecraft!

To delete this message: rm -f /etc/motd
================================================

Something handy-looking called System information catches the eye, but for now let's look at the ArchiveBox part.

I see, there is a URL, so there must be a web UI. Let's try it.

Enter the Web Username and Web Password.


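...Huh?

Let me check the documentation.

Viewing archived web pages
Archived web pages can be viewed from a web browser. The archive viewing page is not generated until you have run at least one archive, and until then you will get a 404 error, so run an archive by following the earlier section "Archiving web pages with ArchiveBox".

How to use the ArchiveBox application image | ConoHa VPS Support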
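
Whether the viewing page exists yet can be checked from the shell before opening the browser. A minimal sketch, assuming the web UI serves the index.html under /opt/archivebox/output/ (the logs later in this article print "To view your archive, open: output/index.html"). The function name archive_index_ready is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: has at least one archive run produced the index page?
# Succeeds only if index.html exists and is non-empty.
archive_index_ready() {
    output_dir=${1:-/opt/archivebox/output}
    [ -s "$output_dir/index.html" ]
}

# Example (run on the ArchiveBox server):
#   archive_index_ready || echo "No index yet: run an archive first (expect a 404 until then)"
```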

I see.

So this is the archive command.

$ cd /opt/archivebox/ && \
sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False \
 /opt/archivebox/archive "URL of the web page you want to archive"

Before actually running it, do the customary update first,

$ apt-get update
$ apt-get -y upgrade
$ apt-get -y dist-upgrade
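
The login banner above already showed "*** System restart required ***", and a dist-upgrade can leave the server in the same state. A minimal sketch for checking this, assuming Ubuntu's usual flag file /var/run/reboot-required. The function name reboot_required and its optional path argument are hypothetical.

```shell
#!/bin/sh
# Ubuntu creates this flag file when an installed update needs a reboot.
# The path argument exists only so the check can be pointed at another file.
reboot_required() {
    [ -f "${1:-/var/run/reboot-required}" ]
}

# Example (not executed here, since it would restart the server):
#   reboot_required && reboot
```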

and then archive the documentation page.

$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://support.conoha.jp/v/archivebox/"
[*] [2019-12-22 15:53:48] Downloading https://support.conoha.jp/v/archivebox/
    > output/sources/support.conoha.jp-1576997628.txt
[*] [2019-12-22 15:53:49] Parsing new links from output/sources/support.conoha.jp-1576997628.txt...
[X] No links found :(

...Something is off.

Maybe the page is the problem? This time, Google.

$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://www.google.com/?hl=ja"
[*] [2019-12-22 15:54:09] Downloading https://www.google.com/?hl=ja
    > output/sources/www.google.com-1576997649.txt
[*] [2019-12-22 15:54:09] Parsing new links from output/sources/www.google.com-1576997649.txt...
    > Adding 14 new links to index (parsed import as Plain Text)
[*] [2019-12-22 15:54:09] Updating main index files...
    > output/index.json
    > output/index.html
[?] [2019-12-22 15:54:09] Updating content for 14 pages in archive...
[+] [2019-12-22 15:54:10] "https://www.youtube.com/?gl=JP&tab=w1"
    https://www.youtube.com/?gl=JP&tab=w1
    > output/archive/1576997649 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:10] "https://www.google.com/setprefdomain?prefdom=JP&prev=https://www.google.co.jp/&sig=K_JeAmkhcsGNpZGRumn5RDR2zO--w%3D"
    https://www.google.com/setprefdomain?prefdom=JP&prev=https://www.google.co.jp/&sig=K_JeAmkhcsGNpZGRumn5RDR2zO--w%3D
    > output/archive/1576997649.0 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:10] "https://www.google.com/logos/doodles/2019/winter-2019-northern-hemisphere-5325275381366784-2x.jpg"
    https://www.google.com/logos/doodles/2019/winter-2019-northern-hemisphere-5325275381366784-2x.jpg
    > output/archive/1576997649.1 (new)
      > favicon
      > title
      > wget
      > pdf
      > screenshot
      > dom
      > git
      > media
      √ index.json
      √ index.html
[+] [2019-12-22 15:54:12] "https://www.google.co.jp/intl/ja/about/products?tab=wh"
    https://www.google.co.jp/intl/ja/about/products?tab=wh
    > output/archive/1576997649.2 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://www.google.co.jp/imghp?hl=ja&tab=wi"
    https://www.google.co.jp/imghp?hl=ja&tab=wi
    > output/archive/1576997649.3 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://play.google.com/?hl=ja&tab=w8"
    https://play.google.com/?hl=ja&tab=w8
    > output/archive/1576997649.4 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://news.google.co.jp/nwshp?hl=ja&tab=wn"
    https://news.google.co.jp/nwshp?hl=ja&tab=wn
    > output/archive/1576997649.5 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://maps.google.co.jp/maps?hl=ja&tab=wl"
    https://maps.google.co.jp/maps?hl=ja&tab=wl
    > output/archive/1576997649.6 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://mail.google.com/mail/?tab=wm"
    https://mail.google.com/mail/?tab=wm
    > output/archive/1576997649.7 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://drive.google.com/?tab=wo"
    https://drive.google.com/?tab=wo
    > output/archive/1576997649.8 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://accounts.google.com/ServiceLogin?hl=ja&passive=true&continue=https://www.google.com/%3Fhl%3Dja"
    https://accounts.google.com/ServiceLogin?hl=ja&passive=true&continue=https://www.google.com/%3Fhl%3Dja
    > output/archive/1576997649.9 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://www.google.co.jp/intl/ja/services/"
    http://www.google.co.jp/intl/ja/services/
    > output/archive/1576997649.10 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://www.google.co.jp/history/optout?hl=ja"
    http://www.google.co.jp/history/optout?hl=ja
    > output/archive/1576997649.11 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://schema.org/WebPage"
    http://schema.org/WebPage
    > output/archive/1576997649.12 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[√] [2019-12-22 15:54:12] Update of 14 pages complete (2.93 sec)
    - 1 entries skipped
    - 7 entries updated
    - 0 errors
    To view your archive, open: output/index.html
[*] [2019-12-22 15:54:12] Updating main index files...
    > output/index.json
    > output/index.html

Long! Back to the browser...!!


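Not what I expected.
I thought it would be 1 URL = 1 row, but play.google.com and others are in there. Why?

Maybe the version is old.
It looks like a Git checkout, so let's update it.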

$ cd /opt/archivebox/
$ git checkout master
$ git pull

Retry (this time with Yahoo)

$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://www.yahoo.co.jp/"
[*] [2019-12-22 15:56:01] Downloading https://www.yahoo.co.jp/
    > output/sources/www.yahoo.co.jp-1576997761.txt
[*] [2019-12-22 15:56:01] Parsing new links from output/sources/www.yahoo.co.jp-1576997761.txt...
    > Adding 65 new links to index (parsed import as Plain Text)
[*] [2019-12-22 15:56:01] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 15:56:01] Updating content for 79 pages in archive...

[+] [2019-12-22 15:56:01] "https://www.yahoo.co.jp/"
    https://www.yahoo.co.jp/
    > output/archive/1576997761
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:56:13] "https://www.yahoo-help.jp/app/answers/detail/p/533/a_id/43883"
    https://www.yahoo-help.jp/app/answers/detail/p/533/a_id/43883
    > output/archive/1576997761.0
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:56:24] "https://www.yahoo-help.jp/"
    https://www.yahoo-help.jp/
    > output/archive/1576997761.1
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:56:41] "https://weather.yahoo.co.jp/weather/"
    https://weather.yahoo.co.jp/weather/
    > output/archive/1576997761.2
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:56:53] "https://tv.yahoo.co.jp/"
    https://tv.yahoo.co.jp/
    > output/archive/1576997761.3
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:57:09] "https://trilltrill.jp/"
    https://trilltrill.jp/
    > output/archive/1576997761.4
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:57:25] "https://travel.yahoo.co.jp/?sc_e=ytsl"
    https://travel.yahoo.co.jp/?sc_e=ytsl
    > output/archive/1576997761.5
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:57:46] "https://travel.yahoo.co.jp/?sc_e=ytmh"
    https://travel.yahoo.co.jp/?sc_e=ytmh
    > output/archive/1576997761.6
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:58:04] "https://transit.yahoo.co.jp/"
    https://transit.yahoo.co.jp/
    > output/archive/1576997761.7
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:58:13] "https://sports.yahoo.co.jp/"
    https://sports.yahoo.co.jp/
    > output/archive/1576997761.8
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:58:25] "https://shopping.yahoo.co.jp/?sc_e=ytc"
    https://shopping.yahoo.co.jp/?sc_e=ytc
    > output/archive/1576997761.9
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:58:49] "https://shopping.yahoo.co.jp/"
    https://shopping.yahoo.co.jp/
    > output/archive/1576997761.10
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:12] "https://services.yahoo.co.jp/?mode=pc"
    https://services.yahoo.co.jp/?mode=pc
    > output/archive/1576997761.11
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:16] "https://search.yahoo.co.jp/search"
    https://search.yahoo.co.jp/search
    > output/archive/1576997761.12
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:20] "https://retty.me/?utm_y_pc_top"
    https://retty.me/?utm_y_pc_top
    > output/archive/1576997761.13
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:39] "https://realestate.yahoo.co.jp/"
    https://realestate.yahoo.co.jp/
    > output/archive/1576997761.14
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:50] "https://rdsig.yahoo.co.jp/travel_kanko/yjtop_cont/RV=1/RU=aHR0cHM6Ly93d3cuaWt5dS5jb20vaWtDby5hc2h4P2Nvc2lkPWlrMDEwMDAyJnN1cmw9JTJG"
    https://rdsig.yahoo.co.jp/travel_kanko/yjtop_cont/RV=1/RU=aHR0cHM6Ly93d3cuaWt5dS5jb20vaWtDby5hc2h4P2Nvc2lkPWlrMDEwMDAyJnN1cmw9JTJG
    > output/archive/1576997761.15
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:05] "https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9"
    https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9
    > output/archive/1576997761.16
      > title
        Failed: Unable to detect page title
        Run to see full output:
            cd /opt/archivebox/output/archive/1576997761.16;
            curl https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9 | grep <title>
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:26] "https://rdsig.yahoo.co.jp/partner/from_ytop/pc/list1/RV=1/RU=aHR0cHM6Ly9wYXJ0bmVyLnlhaG9vLmNvLmpwLw--"
    https://rdsig.yahoo.co.jp/partner/from_ytop/pc/list1/RV=1/RU=aHR0cHM6Ly9wYXJ0bmVyLnlhaG9vLmNvLmpwLw--
    > output/archive/1576997761.17
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:40] "https://rdsig.yahoo.co.jp/auction/promo/yearend2019/pc/ytop/txt/RV=1/RU=aHR0cHM6Ly9hdWN0aW9ucy55YWhvby5jby5qcC90b3BpYy9wcm9tby95ZWFyZW5kMjAxOS8_Y3BpZD1wcl95ZWFyZW5kMjAxOSZtZW51PXRvcHBhZ2UmdGFyPXRvcCZjcj10b3A-"
    https://rdsig.yahoo.co.jp/auction/promo/yearend2019/pc/ytop/txt/RV=1/RU=aHR0cHM6Ly9hdWN0aW9ucy55YWhvby5jby5qcC90b3BpYy9wcm9tby95ZWFyZW5kMjAxOS8_Y3BpZD1wcl95ZWFyZW5kMjAxOSZtZW51PXRvcHBhZ2UmdGFyPXRvcCZjcj10b3A-
    > output/archive/1576997761.18
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:51] "https://privacy.yahoo.co.jp/"
    https://privacy.yahoo.co.jp/
    > output/archive/1576997761.19
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:57] "https://premium.yahoo.co.jp/"
    https://premium.yahoo.co.jp/
    > output/archive/1576997761.20
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:01:26] "https://points.yahoo.co.jp/"
    https://points.yahoo.co.jp/
    > output/archive/1576997761.21
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:01:36] "https://news.yahoo.co.jp/topics/top-picks?date=20191222&mc=f&mp=f"
    https://news.yahoo.co.jp/topics/top-picks?date=20191222&mc=f&mp=f
    > output/archive/1576997761.22
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:01:46] "https://news.yahoo.co.jp/pickup/6346062"
    https://news.yahoo.co.jp/pickup/6346062
    > output/archive/1576997761.23
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:01:58] "https://news.yahoo.co.jp/pickup/6346059"
    https://news.yahoo.co.jp/pickup/6346059
    > output/archive/1576997761.24
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:02:11] "https://news.yahoo.co.jp/pickup/6346056"
    https://news.yahoo.co.jp/pickup/6346056
    > output/archive/1576997761.25
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:02:29] "https://news.yahoo.co.jp/pickup/6346053"
    https://news.yahoo.co.jp/pickup/6346053
    > output/archive/1576997761.26
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:02:44] "https://news.yahoo.co.jp/pickup/6346052"
    https://news.yahoo.co.jp/pickup/6346052
    > output/archive/1576997761.27
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      ??                                                                                          2.4% (1/60sec)^C


[X] [2019-12-22 16:02:52] Downloading paused on link 1576997761.27 (29/79)
    To view your archive, open: output/index.html
    Continue where you left off by running:
        archive 1576997761.27

Five minutes in and it still hasn't finished...

Time to look at the official docs...

How does it work?

echo 'http://example.com' | ./archive

GitHub - pirate/ArchiveBox: The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

A pipe???

$ echo "https://www.yahoo.co.jp/" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False  /opt/archivebox/archive
[*] [2019-12-22 16:10:13] Parsing new links from output/sources/stdin-1576998613.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:10:13] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 16:10:13] Updating content for 1 pages in archive...

[+] [2019-12-22 16:10:13] "https://www.yahoo.co.jp/"
    https://www.yahoo.co.jp/
    > output/archive/1576998613
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media
[√] [2019-12-22 16:10:26] Update of 1 pages complete (12.93 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors
    To view your archive, open: output/index.html
[*] [2019-12-22 16:10:26] Saving main index files...
    √ output/index.json
    √ output/index.html

Refresh!!


ヽ(=´▽`=)ノ

Click the link!


ヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノ

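What I learned from reading the official documentation

Usage · pirate/ArchiveBox Wiki · GitHub

  • Running ./archive performs the archive.
    • Input piped on stdin is treated as a single URL → only that page is archived.
    • A URL given as an argument is treated as a list of URLs → every link found in that page is archived.

Apparently.
Also,

  • External URLs such as RSS and XML feeds
  • Chrome and Firefox browser history

can apparently serve as sources for the URL list.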
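
A possible extension of the pipe form is feeding several URLs at once from a text file. This is a sketch only: it assumes ArchiveBox picks up every URL found in the piped text (the "parsed import as Plain Text" log line suggests so, but this article did not test it). The helper name clean_url_list and the file name urls.txt are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a text file that start with
# http:// or https://, dropping blank lines, "#" comments and duplicates
# (the first occurrence of each URL is kept, in order).
clean_url_list() {
    grep -E '^https?://' "$1" | awk '!seen[$0]++'
}

# Example (run on the ArchiveBox server):
#   clean_url_list urls.txt | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive
```

Filtering first keeps stray text out of the import, which matters given how eagerly the argument form above picked up every link it found.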

One more thing that caught my eye was the part that says "Audio & Video: media/ all audio/video files + playlists, including subtitles & metadata with youtube-dl".

Could it be...

美雲このは(CV:上坂すみれ)

$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo"  | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False  /opt/archivebox/archive
[*] [2019-12-22 16:11:20] Parsing new links from output/sources/stdin-1576998680.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:11:20] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 16:11:20] Updating content for 2 pages in archive...

[+] [2019-12-22 16:11:20] "https://www.youtube.com/watch?v=3F7cYxVFgKo"
    https://www.youtube.com/watch?v=3F7cYxVFgKo
    > output/archive/1576998680
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media
        Failed: Failed to download media
            Got youtube-dl response code: 1.
            ERROR: 3F7cYxVFgKo: "token" parameter not in video info for unknown reason; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
        Run to see full output:
            cd /opt/archivebox/output/archive/1576998680;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://www.youtube.com/watch?v=3F7cYxVFgKo

[*] [2019-12-22 16:11:34] "Yahoo! JAPAN"
    https://www.yahoo.co.jp/
    √ output/archive/1576998613
[√] [2019-12-22 16:11:34] Update of 2 pages complete (14.62 sec)
    - 1 links skipped
    - 0 links updated
    - 1 links had errors
    To view your archive, open: output/index.html
[*] [2019-12-22 16:11:34] Saving main index files...
    √ output/index.json
    √ output/index.html

Does youtube-dl need an upgrade too...?

$ sudo youtube-dl -U
Usage: youtube-dl [OPTIONS] URL [URL...]

youtube-dl: error: youtube-dl's self-update mechanism is disabled on Debian.
Please update youtube-dl using apt(8).
See https://packages.debian.org/sid/youtube-dl for the latest packaged version.

Disabled...?

If you have installed youtube-dl using a package manager like apt-get or yum, use the standard system update mechanism to update. Note that distribution packages are often outdated. As a rule of thumb, youtube-dl releases at least once a month, and often weekly or even daily. Simply go to https://yt-dl.org to find out the current version. Unfortunately, there is nothing we youtube-dl developers can do if your distribution serves a really outdated version. You can (and should) complain to your distribution in their bugtracker or support forum.

GitHub - ytdl-org/youtube-dl: Command-line program to download videos from YouTube.com and other video sites

I see. The version in the distribution repository is slow to get updates.

So remove it and download the binary directly.

apt-get remove -y youtube-dl
wget https://yt-dl.org/latest/youtube-dl -O /usr/local/bin/youtube-dl
chmod a+x /usr/local/bin/youtube-dl
hash -r
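
After swapping the binary it is worth confirming that the shell now resolves youtube-dl to the new file and not to a leftover packaged copy. A minimal sketch; the function name resolves_to is hypothetical, and it only inspects the PATH of the current user, so it assumes the archivebox user sees the same PATH under sudo.

```shell
#!/bin/sh
# Hypothetical check: does the command NAME ($1) resolve to the file EXPECTED ($2)
# on the current PATH?
resolves_to() {
    [ "$(command -v "$1")" = "$2" ]
}

# Example (run on the ArchiveBox server):
#   resolves_to youtube-dl /usr/local/bin/youtube-dl && youtube-dl --version
```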

The update is done, so retry.

$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo"  | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False  /opt/archivebox/archive
[*] [2019-12-22 16:14:20] Parsing new links from output/sources/stdin-1576998860.txt...
    > Adding 0 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:14:20] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 16:14:20] Updating content for 2 pages in archive...

[*] [2019-12-22 16:14:20] "美雲このは(CV:上坂すみれ)「空色Drops」 - YouTube"
    https://www.youtube.com/watch?v=3F7cYxVFgKo
    √ output/archive/1576998680

[*] [2019-12-22 16:14:20] "Yahoo! JAPAN"
    https://www.yahoo.co.jp/
    √ output/archive/1576998613
[√] [2019-12-22 16:14:20] Update of 2 pages complete (0.02 sec)
    - 2 links skipped
    - 0 links updated
    - 0 links had errors
    To view your archive, open: output/index.html
[*] [2019-12-22 16:14:20] Saving main index files...
    √ output/index.json
    √ output/index.html
 - 2 links skipped

Hmm...

$ rm -rf /opt/archivebox/output/*
$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo"  | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False  /opt/archivebox/archive
[*] [2019-12-22 16:15:36] Parsing new links from output/sources/stdin-1576998936.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:15:36] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 16:15:36] Updating content for 1 pages in archive...

[+] [2019-12-22 16:15:36] "https://www.youtube.com/watch?v=3F7cYxVFgKo"
    https://www.youtube.com/watch?v=3F7cYxVFgKo
    > output/archive/1576998936
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media
[√] [2019-12-22 16:15:52] Update of 1 pages complete (16.15 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors
    To view your archive, open: output/index.html
[*] [2019-12-22 16:15:52] Saving main index files...
    √ output/index.json
    √ output/index.html
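
The rm -rf above throws away everything archived so far. A less destructive variant is to move the contents aside first. This is a minimal sketch, not something the article tested: the function name reset_output and the ".bak-<timestamp>" naming are hypothetical. It moves the entries rather than the directory itself so that the output directory keeps its existing owner and permissions.

```shell
#!/bin/sh
# Hypothetical helper: empty the ArchiveBox output directory by moving its
# contents into a timestamped sibling directory, then print the backup path.
reset_output() {
    output_dir=${1:-/opt/archivebox/output}
    backup_dir="${output_dir%/}.bak-$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    # Move every top-level entry, including dotfiles, out of the output directory.
    find "$output_dir" -mindepth 1 -maxdepth 1 -exec mv {} "$backup_dir"/ \;
    printf '%s\n' "$backup_dir"
}

# Example (run as root on the ArchiveBox server):
#   reset_output /opt/archivebox/output
```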

Refresh!!

Infinite loading... so it doesn't rewrite links and the like (well, of course not).


Click the Media link in the upper right.
The text is garbled, but it looks like the video was archived.


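That's all.

Afterword

Getting it to run took some effort, but the archive quality looks good.
Since this is ConoHa, I wanted to use ConoHa's Object Storage as well and tried a number of things, but couldn't get it working.
(Mounting it with goofys leads to errors because of restrictions on permissions, temporary files, and so on. Creating a temporary directory and running mv once the archive finishes might work?)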
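
The "archive locally, then move" idea could be sketched roughly as follows. This is untested against goofys or ConoHa Object Storage. The function name offload_dir and the mount point in the example are hypothetical, and using a plain cp -R followed by rm in place of mv is my own assumption, chosen because a plain copy does not try to preserve ownership and modes on the destination.

```shell
#!/bin/sh
# Hypothetical helper: copy a finished local directory SRC ($1) to DEST ($2)
# (for example a directory on a mounted bucket), then delete the local copy
# only if the copy succeeded.
offload_dir() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # "src/." copies the contents of src, including dotfiles, into dest.
    cp -R "$src"/. "$dest"/ || return 1
    rm -rf "$src"
}

# Example (hypothetical mount point /mnt/objstore):
#   offload_dir /opt/archivebox/output/archive/1576998936 /mnt/objstore/archive/1576998936
```

Note that moving a snapshot away would leave the local index pointing at a directory that no longer exists, so the index would still need to be dealt with separately.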
