
Extracting data from Wikipedia on the command line with curl, pup, and jq

Posted at 2016-05-21

I wanted to parse HTML fetched with curl directly on the command line,
so I tried ericchiang/pup: Parsing HTML at the command line.

Below is an example that extracts the names of the countries in the 2014 World Cup from Wikipedia.

$ brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

$ curl -s https://ja.wikipedia.org/wiki/2014_FIFA%E3%83%AF%E3%83%BC%E3%83%AB%E3%83%89%E3%82%AB%E3%83%83%E3%83%97 > tmp.html

$ cat tmp.html | pup '.wikitable td a:nth-child(2) json{}' | jq ". [] .text" 
"ブラジル"
"アルゼンチン"
"コロンビア"
...

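On a Mac, pup can be installed with brew.
curl -s is silent mode. It suppresses the progress meter and error messages; I only want the HTML.
Since this takes some trial and error, I save the page to tmp.html.
pup '.wikitable td a:nth-child(2)' extracts the links for the country names.
The text{} notation (pup '.hoge text{}') returns a node's text, but the structure here is a[title='title']name, so I got the title in addition to the name. I only want the name.
So I decided to convert the nodes to JSON and pull the value out of that. The json{} notation (pup '.hoge json{}') converts nodes to JSON, which looks like this: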

$ pup '.wikitable td a:nth-child(2) json{}'
[
 {
  "href": "/wiki/%E3%82%B5%E3%83%83%E3%82%AB%E3%83%BC%E3%83%96%E3%83%A9%E3%82%B8%E3%83%AB%E4%BB%A3%E8%A1%A8",
  "tag": "a",
  "text": "ブラジル",
  "title": "サッカーブラジル代表"
 },
...

From here, use jq to extract only the text: jq ". [] .text"
- The output is an array, so . [] unwraps the array.
- .text extracts only the text field.
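
The jq step can be tried on its own, without curl or pup. The following is a minimal sketch that assumes jq is installed; the `json` and `names` variables are hypothetical, and the sample data is a trimmed, hand-written stand-in for the pup json{} output shown earlier. It also uses jq's -r (raw output) option, which is not used in the commands above, to print the strings without the surrounding double quotes.

```shell
#!/bin/sh
# Trimmed sample shaped like the pup json{} output (an array of objects).
json='[{"tag":"a","text":"ブラジル","title":"サッカーブラジル代表"},{"tag":"a","text":"アルゼンチン"}]'

# .[] iterates over the array elements; .text picks the "text" field of each.
# -r prints raw strings instead of JSON-quoted ones.
names=$(printf '%s\n' "$json" | jq -r '.[].text')

printf '%s\n' "$names"
```

Raw output is convenient when the result is going to be pasted somewhere else, since the quotes do not need to be stripped afterwards.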

Until now I would install nokogiri for ruby and so on,
but being able to extract this with the command line alone is quite a bit easier.

Finally, piping the result to something like | pbcopy lets you paste it right away, which is easier still.
