How to scrape a 2ch thread

Last updated at 2019-12-25 / Posted at 2014-03-12

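This article shows how to scrape a 2ch thread.
A few fairly simple regular expressions are enough to pull out the data.

Checking the contents of a 2ch thread

Roughly, the structure of a thread page looks like this: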

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
<base href="http://uni.2ch.net/newsplus/">
<title>タイトル</title>
<script type="text/javascript" src="http://www2.2ch.net/snow/index.js" defer></script>
</head>
<body bgcolor=#efefef text=black link=blue alink=red vlink=#660099>
... (omitted)
<hr style="background-color:#888;color:#888;border-width:0;height:1px;position:relative;top:-.4em;">
<h1 style="color:red;font-size:larger;font-weight:normal;margin:-.5em 0 0;">【ウクライナ情勢】ウクライナ軍艦接収へ</h1>
<dl class="thread">
<dt>1 ：<font color=green><b>名無しさん＠転載禁止</b></font>：2014/03/12(水) 19:34:28.27 ID:XXXXX<dd> コメントコメントコメント <br><br>
<dt>2 ：<a href="mailto:sage"><b>名無しさん＠転載禁止</b></a>：2014/03/12(水) 19:35:25.05 ID:mWgwEdiQ0<dd> コメントコメントコメント <br><br>
... (omitted)
</dl>
... (omitted)
</body>
</html>

This time we extract the thread title and, for each post, the post number, ID, handle name, posting time, and comment.

The scraping program

<?php
$url = "http://hayabusa3.2ch.net/test/read.cgi/mnewsplus/[thread number]/";
// For debugging, read a locally saved copy; use file_get_contents($url) to fetch the live thread.
$raw_contents = file_get_contents("sample.html");
// 2ch pages are Shift_JIS, so convert them to UTF-8.
$raw_contents = mb_convert_encoding($raw_contents, "UTF-8", "SJIS");

// Get the title
preg_match('/<title>.+?<\/title>/', $raw_contents , $match_title);
$title = removeStr($match_title[0], array("<title>", "</title>"));
echo($title. "\n");

// Get the comments
mb_ereg('<dl class=\"thread\">(\n|.)*?<\/dl>', $raw_contents, $match_dl);
preg_match_all('/<dt>(\n|.)*?<br><br>/', $match_dl[0] , $match_dt);
foreach($match_dt[0] as $dt){
    // Comment number
    preg_match('/<dt>\d+/', $dt , $match_comment_no);
    $comment_no = removeStr($match_comment_no[0], array("<dt>"));
    echo($comment_no. "\n");

    // Handle name
    preg_match('/<b>.*?<\/b><\//', $dt , $match_name);
    $name = removeStr($match_name[0], array("<b>", "</b></"));
    echo($name. "\n");

    // Posting time
    preg_match('/\d{4}\/\d{2}\/\d{2}\(.*?\)\s\d{2}:\d{2}:\d{2}/', $dt , $match_time);
    $time = $match_time[0];
    echo($time. "\n");

    // ID (not every post has one)
    preg_match('/ID:.*<dd>/', $dt , $match_id);
    if(isset($match_id[0])){
        $id = removeStr($match_id[0], array("ID:", "<dd>"));
    } else {
        $id = "";
    }
    echo($id. "\n");

    // Comment body
    preg_match('/<dd>.*\s<br><br>$/', $dt , $match_contents);
    $comment = removeStr($match_contents[0], array("<dd> ", " <br><br>"));
    $comment = str_replace("<br>", "<br/>", $comment);
    echo($comment. "\n");
}
exit();

function removeStr($data, $remove_array){
    foreach($remove_array as $remove_data){
        $data = str_replace($remove_data, "", $data);
    }
    // Convert full-width spaces to half-width, trim, and keep the result
    $data = trim(mb_convert_kana($data, "s"));
    return $data;
}
?>

Explanation

Fetching the target file

$url = "http://hayabusa3.2ch.net/test/read.cgi/mnewsplus/[thread number]/";
$raw_contents = file_get_contents($url);
$raw_contents = mb_convert_encoding($raw_contents, "UTF-8", "SJIS");

The example uses the 芸能スポ速報＋ (mnewsplus) board.
2ch threads are encoded in Shift_JIS (SJIS), so if your program works in UTF-8, the encoding has to be converted.
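
The fetch-and-convert step can be wrapped with a basic failure check. The following is only a minimal sketch, not the article's original code: the function name readThreadAsUtf8 is hypothetical, and the "SJIS-win" default is an assumption (mbstring's Windows-flavoured Shift_JIS, which tends to cover more of the vendor-specific characters seen in posts). Pass "SJIS" to mirror the code above.

```php
<?php
// Hypothetical helper: load a thread from a URL or local path and return it as UTF-8.
// The "SJIS-win" default is an assumption; pass "SJIS" to mirror the article's code.
function readThreadAsUtf8(string $source, string $fromEncoding = 'SJIS-win'): string
{
    // file_get_contents() returns false on failure, so check before converting.
    $raw = @file_get_contents($source);
    if ($raw === false) {
        throw new RuntimeException("Could not read: {$source}");
    }
    return mb_convert_encoding($raw, 'UTF-8', $fromEncoding);
}
```

Because file_get_contents accepts both URLs and local paths, the same helper works for the live thread and for a saved sample.html.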

Extracting the title

preg_match('/<title>.+?<\/title>/', $raw_contents, $match_title);

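This gets the data enclosed by <title> and </title>.
Regex points:
- . ⇒ any single character
- + ⇒ one or more of the preceding item
- ? ⇒ placed after + or *, makes the match lazy (as short as possible)
- .+? ⇒ one or more of any character, stopping at the first </title> that follows
- \ ⇒ escape (needed before / because / is the pattern delimiter)
Running the above yields <title>タイトル</title>, so the unneeded tags are then removed like this: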

$data = str_replace("<title>", "", $data);
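
As an alternative to stripping the tags afterwards, a capture group can return just the inner text. This is a small sketch rather than the article's method; the function name extractTitle is hypothetical, and it assumes the HTML has already been converted to UTF-8.

```php
<?php
// Hypothetical helper: return the text inside <title>...</title>, or null if absent.
// Assumes $html is already UTF-8 (the "u" flag rejects invalid UTF-8 input).
function extractTitle(string $html): ?string
{
    // (.+?) captures only the inner text, so no str_replace() is needed afterwards.
    if (preg_match('/<title>(.+?)<\/title>/u', $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

$html = '<head><title>タイトル</title></head>';
echo extractTitle($html), "\n"; // prints: タイトル
```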

Getting the whole comment block

mb_ereg('<dl class=\"thread\">(\n|.)*?<\/dl>', $raw_contents, $match_dl);
preg_match_all('/<dt>(\n|.)*?<br><br>/', $match_dl[0] , $match_dt);
foreach($match_dt[0] as $dt){
    // Comment number, handle name, posting time, ID, comment
}

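First, get the part enclosed by <dl class="thread"> and </dl>.
- With a large amount of data, preg_match did not work (with this pattern it probably runs into PCRE's backtracking limits), so mb_ereg is used instead.
Regex points:
- \n ⇒ newline
- | ⇒ OR
- .+? alone does not match newlines, so it cannot capture the block correctly; newlines have to be included explicitly.
Then, from the data extracted above, get each piece enclosed by <dt> and <br><br>.
- There can be more than one, so preg_match_all is used.
- Loop over the matches and extract each field.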
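
Another way to include newlines is PCRE's s modifier, which makes . match newline characters and removes the need for the (\n|.) alternation. The sketch below is an assumption-laden alternative, not the article's code: splitPosts is a hypothetical name, and whether this avoids the preg_match failure on very large threads has not been verified here, so the error code from preg_last_error is included in the exception message.

```php
<?php
// Hypothetical helper: split the thread body into one string per post.
function splitPosts(string $html): array
{
    // The "s" modifier makes "." match newlines, replacing the (\n|.) alternation.
    if (preg_match('/<dl class="thread">(.*?)<\/dl>/s', $html, $dl) !== 1) {
        // Covers both "no match" and PCRE errors such as a backtrack limit hit.
        throw new RuntimeException(
            'thread block not found (PCRE error code ' . preg_last_error() . ')'
        );
    }
    preg_match_all('/<dt>.*?<br><br>/s', $dl[1], $posts);
    return $posts[0];
}

$html = "<dl class=\"thread\">\n"
      . "<dt>1 ：<b>a</b></font>：2014/03/12(水) 19:34:28.27 ID:XXXXX<dd> one <br><br>\n"
      . "<dt>2 ：<b>b</b></a>：2014/03/12(水) 19:35:25.05 ID:YYYYY<dd> two <br><br>\n"
      . "</dl>";
echo count(splitPosts($html)), "\n"; // prints: 2
```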

Getting the comment number

preg_match('/<dt>\d+/', $dt , $match_comment_no);

Gets the number that follows <dt>.
Regex points:
- \d ⇒ a digit

Getting the handle name

preg_match('/<b>.*?<\/b><\//', $dt , $match_name);

Gets the data enclosed by <b> and </b></.

Getting the posting time

preg_match('/\d{4}\/\d{2}\/\d{2}\(.*?\)\s\d{2}:\d{2}:\d{2}/', $dt , $match_time);

Regex points:
- \d{4} ⇒ a 4-digit number; the value inside {} is the number of digits
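
If the posting time is needed as a date object rather than a string, the matched parts can be handed to DateTimeImmutable. This is a sketch under assumptions: parsePostTime is a hypothetical name, and treating the timestamps as Japan Standard Time (Asia/Tokyo) is a guess that the article does not state.

```php
<?php
// Hypothetical helper: turn "2014/03/12(水) 19:34:28" into a DateTimeImmutable.
// Assumes the timestamps are Japan Standard Time, which the article does not state.
function parsePostTime(string $dt): ?DateTimeImmutable
{
    $pattern = '/(\d{4}\/\d{2}\/\d{2})\(.*?\)\s(\d{2}:\d{2}:\d{2})/';
    if (preg_match($pattern, $dt, $m) !== 1) {
        return null;
    }
    // Drop the "(weekday)" part and parse the remaining date and time.
    $parsed = DateTimeImmutable::createFromFormat(
        'Y/m/d H:i:s',
        $m[1] . ' ' . $m[2],
        new DateTimeZone('Asia/Tokyo')
    );
    return $parsed === false ? null : $parsed;
}

$time = parsePostTime('<dt>1 ：名無し：2014/03/12(水) 19:34:28.27 ID:XXXXX<dd>');
echo $time->format('Y-m-d H:i:s'), "\n"; // prints: 2014-03-12 19:34:28
```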

Getting the ID

preg_match('/ID:.*<dd>/', $dt , $match_id);
if(isset($match_id[0])){
    $id = removeStr($match_id[0], array("ID:", "<dd>"));
} else {
    $id = "";
}

Some posts have no ID, so check whether one was matched.

Getting the comment

preg_match('/<dd>.*\s<br><br>$/', $dt , $match_contents);

Gets the data enclosed by <dd> and <br><br>.
Regex points:
- \s ⇒ a whitespace character
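
The per-field patterns above can also be combined into one pattern with named groups, so that each post is parsed in a single call. This is a sketch, not the article's approach: parsePost is a hypothetical name, the pattern assumes UTF-8 input in the same shape as the sample HTML at the top, and it has only been checked against that shape.

```php
<?php
// Hypothetical helper: parse one "<dt>...<br><br>" post into an array, or null.
// Assumes UTF-8 input shaped like the sample HTML in this article.
function parsePost(string $dt): ?array
{
    // "x" allows whitespace and comments in the pattern, "s" lets "." span newlines.
    $pattern = '/
        <dt>(?<no>\d+)                       # post number
        .*?<b>(?<name>.*?)<\/b><\/           # handle name
        .*?(?<time>\d{4}\/\d{2}\/\d{2}\(.*?\)\s\d{2}:\d{2}:\d{2})
        (?:\.\d+)?                           # fractional seconds, ignored
        (?:\s+ID:(?<id>[^<\s]+))?            # optional ID
        [^<]*<dd>\s?(?<comment>.*?)\s?<br><br>$
    /sux';
    if (preg_match($pattern, $dt, $m) !== 1) {
        return null;
    }
    return [
        'no'      => (int) $m['no'],
        'name'    => $m['name'],
        'time'    => $m['time'],
        'id'      => $m['id'] ?? '',   // empty when the post has no ID
        'comment' => $m['comment'],
    ];
}

$post = parsePost('<dt>2 ：<a href="mailto:sage"><b>名無しさん＠転載禁止</b></a>：'
    . '2014/03/12(水) 19:35:25.05 ID:mWgwEdiQ0<dd> コメントコメントコメント <br><br>');
echo $post['no'], ' ', $post['id'], "\n"; // prints: 2 mWgwEdiQ0
```

Keeping the optional ID inside a non-capturing group with ? means a post without an ID still matches and simply yields an empty string.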

Notes

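When debugging, hitting the site on every run is slow and puts load on the server, so it is better to download the file locally first and work from that.
Some threads prohibit reposting, so repost at your own risk.
おーぷん2ちゃんねる (Open 2channel) appears to allow free reposting.
- Its character encoding is UTF-8, so be careful with the encoding conversion step.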
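
The "download once, then work locally" advice can be sketched as a small cache wrapper. The function name loadWithCache and the cache file path are hypothetical; the raw bytes are stored unconverted, so the encoding conversion (or, for a UTF-8 site, no conversion) still happens after loading.

```php
<?php
// Hypothetical helper: reuse a local copy when one exists, otherwise download once.
function loadWithCache(string $url, string $cachePath): string
{
    if (is_file($cachePath)) {
        // Reuse the saved copy so repeated debug runs do not hit the site.
        $cached = file_get_contents($cachePath);
        if ($cached !== false) {
            return $cached;
        }
    }
    $raw = @file_get_contents($url);
    if ($raw === false) {
        throw new RuntimeException("Download failed: {$url}");
    }
    // Save the raw bytes; convert the encoding after loading, not before saving.
    file_put_contents($cachePath, $raw);
    return $raw;
}
```

Deleting the cache file forces a fresh download on the next run.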
