Scraping with ScrapySharp in C#

Posted at 2016-08-19, last updated at 2016-08-19

Overview

I could not find many articles on how to use ScrapySharp, a scraping library for C#, so here is a short summary.
ScrapySharp is used together with HtmlAgilityPack.

Setup (adding the libraries)

First, add the ScrapySharp and HtmlAgilityPack libraries to the project. As usual, declare them in project.json as shown below and let NuGet restore them into the project.


{
  "frameworks": {
    "net46": {
      "dependencies": {
        "HtmlAgilityPack": "1.4.9.4",
        "ScrapySharp": "2.6.2",
        "FSharp.Core": "4.0.0.1"
      }
    }
  }
}

Scraping code example

With ScrapySharp, the procedure has two steps: fetch the target page with ScrapingBrowser, then apply CSS selectors or XPath to the result to extract the data you want.

For example, suppose that http://example.com/page.html returns the following HTML.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>title</title>
</head>
<body>
  <ul>
    <li>hoge 1</li>
    <li>fuga 2</li>
    <li>piyo 3</li>
  </ul>

  <table>
    <tr>
      <td>weather</td>
      <td>sunny</td>
    </tr>
    <tr>
      <td>my location is</td>
      <td>Tokyo</td>
    </tr>
  </table>
</body>
</html>

Code that scrapes such a page looks like this.

using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace ScrapySharpTest
{
    class Program
    {
        static void Main(string[] args)
        {
            var browser = new ScrapingBrowser();
            browser.AllowAutoRedirect = true;
            browser.AllowMetaRedirect = true;

            // First, fetch the page to scrape.
            var pageResult = browser.NavigateToPage(new Uri("http://example.com/page.html"));

            // Apply a CSS selector to the page and take the first matching DOM node.
            // -> "hoge 1"
            string firstItem = pageResult.Html.CssSelect("ul li").First().InnerText;
            Console.WriteLine(firstItem);

            // Apply a CSS selector to get a set of DOM nodes, then use LINQ to pick
            // the first node whose InnerText contains "fuga".
            // -> "fuga 2"
            string fugaItem = pageResult.Html.CssSelect("ul li").First(elem => elem.InnerText.Contains("fuga")).InnerText;
            Console.WriteLine(fugaItem);

            // Use XPath to find the <td> whose text contains "location" and take
            // the <td> that follows it in the same row.
            // -> "Tokyo"
            string location = pageResult.Html.SelectNodes("//td[contains(text(),'location')]/following-sibling::td").First().InnerText;
            Console.WriteLine(location);
        }
    }
}
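
Since http://example.com/page.html is only a placeholder, it can be handy to try the same selectors on an HTML string without any network access. The following is a minimal sketch, not part of the original code: the class name OfflineSelectors and the helpers Parse and ValueAfter are hypothetical names introduced for illustration, and the sample markup is an abbreviated copy of the page above. It relies on HtmlAgilityPack's HtmlDocument.LoadHtml and on the CssSelect extension from ScrapySharp.Extensions. It also guards against SelectNodes returning null, which, as far as I know, is what HtmlAgilityPack does when an XPath query matches nothing.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;

namespace ScrapySharpTest
{
    // Hypothetical helper class, introduced only for this illustration.
    public static class OfflineSelectors
    {
        // Abbreviated copy of the sample page shown earlier.
        public const string SampleHtml =
            "<html><body>" +
            "<ul><li>hoge 1</li><li>fuga 2</li><li>piyo 3</li></ul>" +
            "<table>" +
            "<tr><td>weather</td><td>sunny</td></tr>" +
            "<tr><td>my location is</td><td>Tokyo</td></tr>" +
            "</table>" +
            "</body></html>";

        // Parses an HTML string in memory; no HTTP request is made.
        public static HtmlNode Parse(string html)
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            return doc.DocumentNode;
        }

        // Returns the text of the <td> that follows the first <td> whose text
        // contains `label`, or null when nothing matches.
        // Assumption: `label` contains no single quote, because it is pasted
        // into the XPath expression as-is.
        public static string ValueAfter(HtmlNode root, string label)
        {
            var nodes = root.SelectNodes(
                "//td[contains(text(),'" + label + "')]/following-sibling::td");
            if (nodes == null)
            {
                return null; // no match
            }
            var node = nodes.FirstOrDefault();
            return node == null ? null : node.InnerText;
        }

        public static void Main()
        {
            var root = Parse(SampleHtml);

            // Same CSS selector as in the example above.
            Console.WriteLine(root.CssSelect("ul li").First().InnerText);

            // Same XPath idea as in the example above, with a null guard.
            Console.WriteLine(ValueAfter(root, "location"));
            Console.WriteLine(ValueAfter(root, "no such label") ?? "(no match)");
        }
    }
}
```

The null guard matters in the original example as well: calling First() directly on the result of SelectNodes would throw if the page did not contain the expected cell.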

Notes

The code above is also published as a Gist.
