Trying out the basics of C# + AngleSharp

Posted at 2022-07-14

Purpose

Notes from trying the following with AngleSharp against the main topics on Yahoo! News:
- The output of "Copy selector"
- The output of "Copy XPath"
- Combining regular expressions with LINQ

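After navigating to an article, "Copy selector" returns the following for the [記事全文を読む] (Read the full article) link:

#uamods-pickup > div.sc-kAZVpg.sPlJi > div > p > a

The sc-kAZVpg.sPlJi part appears to change periodically, so a selector that depends on it will stop matching. Using XPath, which addresses the element by position instead of by class name, is probably the better choice.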
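The point above can be tried offline. The following is a minimal sketch, assuming the AngleSharp and AngleSharp.XPath packages listed in this post; the HTML string is hypothetical markup that only mimics the structure implied by the selector, not the real page. It compares the positional XPath with a CSS selector that drops the generated class names.

```csharp
using System;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;
using AngleSharp.XPath;

// Hypothetical markup for illustration only; the real page may differ.
const string html = @"
<div id=""uamods-pickup"">
  <div class=""header"">headline</div>
  <div class=""sc-kAZVpg sPlJi"">
    <div><p><a href=""https://news.yahoo.co.jp/articles/example"">Read the full article</a></p></div>
  </div>
</div>";

var doc = new HtmlParser().ParseDocument(html);

// Positional XPath: does not depend on the generated class names.
var byXPath = doc.DocumentElement
    .SelectSingleNode("//*[@id=\"uamods-pickup\"]/div[2]/div/p/a") as IElement;

// CSS alternative: keep only the structure and replace the class with a position.
var byCss = doc.QuerySelector("#uamods-pickup > div:nth-child(2) > div > p > a");

Console.WriteLine(byXPath?.GetAttribute("href"));
Console.WriteLine(byCss?.GetAttribute("href"));
```

Both lookups still break if the page layout itself changes, so either form is only as stable as the surrounding structure.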

Additional packages

After creating a .NET 6 project,
add the following from Project -> Manage NuGet Packages (versions as of the date of writing):

AngleSharp(0.17.1)
AngleSharp.Css(0.16.4)
AngleSharp.Xml(0.17.0)
AngleSharp.XPath(2.0.0)
System.Text.Encoding.CodePages(6.0.0)

C# sample code

using System.Diagnostics;
using AngleSharp;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;
using AngleSharp.Xml;
using AngleSharp.XPath;
using System.Text.RegularExpressions;

    // async void is fine for an event handler; awaiting lets exceptions from YahooNews() surface
    private async void btnYahooNews_Click(object sender, EventArgs e)
    {
        await YahooNews();
    }

    public async Task YahooNews()
    {
        IBrowsingContext context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
        IDocument qdoc = await context.OpenAsync("https://news.yahoo.co.jp/");

        // Regular expression version
        //
        //Regex regyh = new Regex(@"news\.yahoo\.co\.jp/pickup");
        //var nodes = qdoc.QuerySelectorAll<IHtmlAnchorElement>("a")
        //    .Where(v => regyh.IsMatch(v.GetAttribute("href") ?? ""))   // href can be null
        //    .Select(s => new { ctxt = s.TextContent.Trim(), href = s.GetAttribute("href") });

        // XPath version
        // Needs investigation: nothing is retrieved on the top page
        //var nodes = qdoc.Body.SelectNodes("//*[@id=\"uamods-topics\"]/div/div/div/ul/li[1]/a");
        //Debug.WriteLine(nodes.Count);

        //
        // #uamods-topics > div > div > div > ul > li:nth-child(1) > a // a single item (1 to 8)
        //var node = qdoc.QuerySelectorAll("#uamods-topics > div > div > div > ul > li:nth-child(1) > a")
        //    .Select(s => new { ctxt = s.TextContent.Trim(), href = s.GetAttribute("href") });

        //
        // #uamods-topics > div > div > div > ul > li:nth-child(n) > a // all items
        // (nth-child(n) matches every li, so this is equivalent to a plain "li")
        //
        var nodes = qdoc.QuerySelectorAll<IHtmlAnchorElement>("#uamods-topics > div > div > div > ul > li:nth-child(n) > a")
            .Select(s => new { ctxt = s.TextContent.Trim(), href = s.GetAttribute("href") });

        // Visit each article and get the URL behind [記事全文を読む] (Read the full article)
        foreach (var item in nodes)
        {
            //Debug.WriteLine(item.ctxt.Trim() + " " + item.href);

            // Skip anchors without an href; OpenAsync needs a URL
            if (string.IsNullOrEmpty(item.href)) continue;

            IBrowsingContext ctx = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
            IDocument doc = await ctx.OpenAsync(item.href);

            // pattern_1 Copy selector
            // NOTE: regdt is not defined anywhere in this snippet; declare it before enabling this block
            //var node = doc.QuerySelectorAll<IHtmlAnchorElement>("#uamods-pickup > div.sc-euitrJ.gxSnEc > div > p > a")
            //    .Where(v => regdt.IsMatch(v.GetAttribute("href") ?? ""))
            //    .Select(s => new { ctxt = s.TextContent.Trim(), href = s.GetAttribute("href") });

            //foreach (var itm in node)
            //{
            //    Debug.WriteLine(itm.ctxt.Trim() + " " + itm.href);
            //}

            // pattern_2 Copy XPath
            var node = doc.Body.SelectSingleNode("//*[@id=\"uamods-pickup\"]/div[2]/div/p/a");
            // SelectSingleNode returns null when nothing matches (for example after a layout change)
            if (node == null) continue;
            //Debug.WriteLine(node.ToMarkup());

            // Extracts <a href="https://news.yahoo.co.jp/articles/ ～>記事全文を読む</a>
            var parser = new HtmlParser();
            var docmkup = parser.ParseDocument(node.ToMarkup());
            var lstmkup = docmkup.QuerySelectorAll("a")
                 .Select(s => new { ctxt = s.TextContent.Trim(), href = s.GetAttribute("href") });

            foreach (var itm in lstmkup)
            {
                Debug.WriteLine(itm.ctxt.Trim() + " " + itm.href);
            }
        }
    }
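I referred to the following sites:

AngleSharp/AngleSharp.XPath
Form Submission by Example
Scraping techniques in C# with .NET 6 and AngleSharp
C#: results are not retrieved as expected when using regular expressions (Regex) with LINQ
Web technology for developers > CSS: Cascading Style Sheets > :nth-child()
Trying out Windows 10 + Python 3 + BeautifulSoup4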
