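Scraping
Scraping extracts part of a site's data directly from its pages instead of using a feed such as RSS, so it can put load on the web site and needs to be done with care.
Note
Regular expressions are not used to extract data from the HTML this time (the only one is in the small trim helper).
They are just too hard for me, sorry m(_ _)m
Modules used
Fetching the HTML data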
LWP::UserAgent
Note: this is the go-to module for fetching HTML.
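As a rough sketch of how LWP::UserAgent is commonly set up: the agent string, the timeout value, and the `fetch_html` helper name below are assumptions made for illustration and are not part of the script in this post.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent with an explicit timeout and an identifying agent string.
# Both values are placeholders chosen for this sketch.
my $ua = LWP::UserAgent->new(
    timeout => 10,
    agent   => 'my-sample-scraper/0.1',
);

# Hypothetical helper: return the response body, or undef on failure.
sub fetch_html {
    my ($ua, $url) = @_;
    my $res = $ua->get($url);
    unless ($res->is_success) {
        warn "GET $url failed: " . $res->status_line . "\n";
        return;
    }
    return $res->content;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $html = fetch_html($ua, $ARGV[0]);
    print length($html), " bytes fetched\n" if defined $html;
}
```

Checking `is_success` and printing `status_line` makes it easier to tell a network error from a page that simply changed its layout.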
Parsing the HTML
HTML::TreeBuilder
Note: with this module you can pick out elements in a jQuery-like way.
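To give a feel for that jQuery-like lookup, here is a minimal sketch that parses a made-up HTML fragment. The `sampleRank` class name and the table contents are invented for this example and do not come from the real site.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up HTML standing in for a ranking page.
my $html = <<'HTML';
<html><body>
<table class="sampleRank">
  <tr><td>1</td><td>apple</td></tr>
  <tr><td>2</td><td>banana</td></tr>
</table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;    # tell the parser that the document is complete

# look_down(attribute => value) in scalar context returns the first match.
my $table = $tree->look_down(class => 'sampleRank');

my @words;
for my $tr ($table->find('tr')) {
    # find() in list context returns every matching descendant.
    my @cells = $tr->find('td');
    push @words, $cells[1]->as_text;
}
print join(',', @words), "\n";

$tree->delete;    # release the tree when done
```

`look_down` narrows the search to one element by attribute, and `find` then walks only inside that element, which is roughly what a selector like `.sampleRank tr` would do in jQuery.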
Simple date generation
Date::Simple
Note: if all you need is something like the current date, this module is enough.
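A small sketch of the Date::Simple calls that matter for this kind of script follows. The sample date strings are arbitrary values picked for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Date::Simple qw(d8 today);

# d8() parses a YYYYMMDD string; it returns undef for an invalid date.
my $date = d8('20141201');
print $date->format('%Y-%m-%d'), "\n";

my $bad = d8('20141301');    # month 13 does not exist
print "invalid date\n" unless defined $bad;

# today() returns the current local date as a Date::Simple object.
print today()->format('%Y-%m-%d'), "\n";
```

Because `d8` returns undef for bad input, it is worth checking the result before calling `format` on it.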
Implementation
get_html_sample1.pl
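#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use FindBin;
use lib "$FindBin::Bin/../lib";
use LWP::UserAgent;
use HTML::TreeBuilder;
use Date::Simple (':all');

my $url = "http://ejje.weblio.jp/ranking";
my $date_format;
if (defined $ARGV[0]) {
    # First argument: ranking date in YYYYMMDD format
    print "date=" . $ARGV[0] . "\n";
    $url = "http://ejje.weblio.jp/ranking/$ARGV[0]";
    # d8() returns undef for an invalid date, so stop here in that case
    my $date = d8($ARGV[0]);
    die "invalid date (expected YYYYMMDD): $ARGV[0]\n" unless defined $date;
    $date_format = $date->format('%Y-%m-%d');
} else {
    # No argument: use today's date
    my $date = Date::Simple->new();
    $date_format = $date->format('%Y-%m-%d');
}
if (defined $ARGV[1]) {
    # Second argument: category (only printed, not used yet)
    print "category=" . $ARGV[1] . "\n";
}

# Fetch the HTML
print "url=" . $url . "\n";
print "date=" . $date_format . "\n";
my $ua  = LWP::UserAgent->new();
my $res = $ua->get($url);
if ($res->is_success) {
    my $content = $res->content;

    # Parse the HTML
    my $tree = HTML::TreeBuilder->new();
    $tree->parse($content);
    $tree->eof;    # signal end of input so the tree is finalized

    # Fetch the data row by row and format it
    my $table = $tree->look_down('class', 'mainRankCC');
    if (defined $table) {
        my @list      = $table->find('tr');
        my $rank_list = [];
        for my $tr (@list) {
            my $td1 = $tr->find('td');
            next unless defined $td1;    # skip rows without a td cell
            my $td2 = $td1->right;
            next unless ref $td2;        # skip rows without a second cell

            # Strip surrounding whitespace and store the values
            my $items = {};
            $items->{rank} = trim($td1->as_text);
            $items->{word} = trim($td2->as_text);

            # Append to the array (dereferencing the array reference)
            push(@$rank_list, $items);
        }
        print Dumper($rank_list);
    } else {
        print "failed to get the ranking\n";
    }
    $tree->delete;    # free the parse tree
} else {
    print "GET request failed: " . $res->status_line . "\n";
}

# Strip leading and trailing whitespace
sub trim {
    my ($value) = @_;
    $value =~ s/^\s+|\s+$//g;
    return $value;
}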
Result
$ perl get_html_sample1.pl
url=http://ejje.weblio.jp/ranking
date=2014-12-01
$VAR1 = [
{
'rank' => '1',
'word' => 'flood'
},
{
'rank' => '2',
'word' => 'legend'
},
{
'word' => '12月',
'rank' => '3'
},
{
'word' => '訃報',
'rank' => '4'
},
{
'word' => 'silly',
'rank' => '5'
},
{
'word' => 'undefined',
'rank' => '6'
},
{
'rank' => '7',
'word' => 'インターステラー'
},
{
'word' => 'confirm',
'rank' => '8'
},
{
'rank' => '9',
'word' => 'provide'
},
{
'word' => 'December',
'rank' => '10'
},
{
'word' => 'appreciate',
'rank' => '11'
},
{
'rank' => '12',
'word' => 'present'
},
{
'word' => 'expect',
'rank' => '13'
},
{
'word' => 'consider',
'rank' => '14'
},
{
'rank' => '15',
'word' => 'reference'
},
{
'word' => '単語',
'rank' => '16'
},
{
'rank' => '17',
'word' => 'refrain'
},
{
'word' => 'available',
'rank' => '18'
},
{
'word' => 'apply',
'rank' => '19'
},
{
'rank' => '20',
'word' => 'describe'
},
{
'word' => 'concern',
'rank' => '21'
},
{
'rank' => '22',
'word' => 'issue'
},
{
'word' => 'remain',
'rank' => '23'
},
{
'rank' => '24',
'word' => 'appropriate'
},
{
'word' => 'determine',
'rank' => '25'
},
{
'rank' => '26',
'word' => 'while'
},
{
'rank' => '27',
'word' => 'assume'
},
{
'rank' => '28',
'word' => 'implement'
},
{
'word' => 'leave',
'rank' => '29'
},
{
'rank' => '30',
'word' => 'feature'
},
{
'rank' => '31',
'word' => 'further'
},
{
'rank' => '32',
'word' => 'cause'
},
{
'rank' => '33',
'word' => 'indicate'
},
{
'rank' => '34',
'word' => 'even'
},
{
'rank' => '35',
'word' => 'awesome'
},
{
'rank' => '36',
'word' => 'メリークリスマス'
},
{
'rank' => '37',
'word' => 'through'
},
{
'word' => 'represent',
'rank' => '38'
},
{
'word' => 'due to',
'rank' => '39'
},
{
'rank' => '40',
'word' => 'property'
},
{
'word' => 'application',
'rank' => '41'
},
{
'rank' => '42',
'word' => 'involve'
},
{
'rank' => '43',
'word' => 'improve'
},
{
'word' => '紅葉',
'rank' => '44'
},
{
'rank' => '45',
'word' => 'affect'
},
{
'rank' => '46',
'word' => 'require'
},
{
'word' => 'accept',
'rank' => '47'
},
{
'rank' => '48',
'word' => 'respect'
},
{
'word' => '用務員',
'rank' => '49'
},
{
'rank' => '50',
'word' => 'description'
}
];
$
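In the output above, 'rank' and 'word' appear in a different order from entry to entry because Perl hashes do not keep a fixed key order. If a stable order is preferred, Data::Dumper can sort the keys. The sketch below uses a hand-written two-entry list as stand-in data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

# Sort hash keys so that every entry prints 'rank' before 'word'.
local $Data::Dumper::Sortkeys = 1;
# Use a compact fixed-width indentation style.
local $Data::Dumper::Indent = 1;

# Stand-in data shaped like the $rank_list built by the script.
my $rank_list = [
    { word => 'flood',  rank => '1' },
    { word => 'legend', rank => '2' },
];
my $out = Dumper($rank_list);
print $out;
```

Sorted keys also make it easier to diff the output of two runs.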
This is my first post on Qiita, so I wrote it partly as an experiment.
Web scraping puts load on the target server, so I think you need to be quite careful whenever you do it.
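One possible way to keep the load down is LWP::RobotUA, a subclass of LWP::UserAgent shipped with libwww-perl that consults robots.txt and waits between requests to the same host. This is only a sketch: the robot name, the contact address, the ten-second delay, and the idea of passing several dates on the command line are all assumptions for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::RobotUA;

# The robot name and contact address are placeholders for this sketch.
my $ua = LWP::RobotUA->new('my-sample-scraper/0.1', 'me@example.com');

# delay() is expressed in minutes; 10/60 means roughly ten seconds.
$ua->delay(10 / 60);

# Hypothetical usage: ranking dates (YYYYMMDD) given on the command line.
for my $ymd (@ARGV) {
    my $res = $ua->get("http://ejje.weblio.jp/ranking/$ymd");
    print "$ymd: ", $res->status_line, "\n";
}
```

Since LWP::RobotUA is a drop-in subclass, the rest of a script that already uses `$ua->get` would not need to change.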