
rvest and HTML, Part 1

Last updated at 2016-08-09. Posted at 2015-11-09.

Suppose we have the following HTML:

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8" />
  <meta http-equiv="content-language" content="ja" />

  <style type="text/css">
<!--
p {color:blue; line-height:1.5;}
p.green { color: green; }
p#red { color: red; }
.table3 {
  border-collapse: collapse;
}
.myTable th {
  background-color: #00cc00;
}
div#divRoot { color: yellow; }
div.inDiv { color: blue; }
span.inSpan { color: pink; }

-->
</style>
</head>

<body>

<p>p タグ </p>
  
<p class="green">classを使った例</p>

<p id="red">idを使った例</p>

<div id='divRoot'> 根本のdiv
  <ul class='sub'>
    <li>
      <div class='inDiv'>
	div の中のdiv でクラスは inDiv
      </div>
      <span class='inSpan' title="2.0"> test </span>
    </li>
  </ul>
</div>



<table  border=1>
 <tr><th></th><th>A</th><th>B</th></tr>
 <tr><td>R1</td><td>R-A1</td><td>R-B1</td></tr>
 <tr><td>R2</td><td>R-A2</td><td>R-B2</td></tr>
 <tr><td>R3</td><td>R-A3</td><td>R-B3</td></tr>
</table>


<table class="myTable" border=1>
 <tr><th></th><th>列-A</th><th>列-B</th></tr>
 <tr><td>行-1</td><td>要素A1</td><td>要素B1</td></tr>
 <tr><td>行-2</td><td>要素A2</td><td>要素B2</td></tr>
 <tr><td>行-3</td><td>要素A3</td><td>要素B3</td></tr>
</table>

	
</body>
</html>

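Save this file as sample.html and extract parts of it with rvest.

When HTML whose source is UTF-8 is displayed on Windows, the text comes out garbled, so the examples convert the character encoding with iconv. For tables, however, specifying the conversion for each element is a bit tedious, so that case is covered later. The %>% iconv(from = "UTF-8") step is unnecessary for Mac users.

Loading: read_html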

> library(rvest)  # also re-exports the %>% pipe
> # Load the file (assuming a Windows environment)
> x <- read_html("C:/test/sample.html", encoding = "UTF-8")
> # x <- read_html("http://rmecab.jp/R/sample.html")
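To try the loading step without the original file, here is a minimal sketch. The HTML string is a reduced, ASCII-only stand-in for sample.html (an assumption made for illustration, not the article's file), and the temporary file replaces the Windows path above.

```r
library(rvest)

# Reduced, ASCII-only stand-in for sample.html (hypothetical content).
html <- '<html><body>
<p>plain</p>
<p class="green">by class</p>
<p id="red">by id</p>
</body></html>'

# read_html() accepts a literal HTML string ...
doc_from_string <- read_html(html)

# ... or a file path, with the source encoding stated explicitly.
tmp <- tempfile(fileext = ".html")
writeLines(html, tmp)
doc_from_file <- read_html(tmp, encoding = "UTF-8")

# Both documents contain the same three p nodes.
n_string <- length(html_nodes(doc_from_string, "p"))
n_file   <- length(html_nodes(doc_from_file, "p"))
```

Passing `encoding` only tells the parser how to decode the source. It does not change how the console displays the result, which is why the article still applies iconv afterwards on Windows.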

Extracting by specifying tags (nodes)

p tags

> # The XPath //p extracts every p tag in the HTML
> x %>% html_nodes( xpath = "//p") %>% html_text %>% iconv(from = "UTF-8")
[1] "p タグ "         "classを使った例" "idを使った例"
> 
> # Text of the p tag whose class is green
> x %>% html_nodes( xpath = "//p[@class = 'green']") %>% html_text () %>% 
+   iconv(from = "UTF-8")
[1] "classを使った例"
> # Shorthand for the above (CSS selector)
> x %>% html_nodes( ".green") %>% html_text () %>% 
+   iconv(from = "UTF-8")
[1] "classを使った例"
> 
> # Text of the p tag whose id is red
> x %>% html_nodes( xpath = "//p[@id = 'red']") %>% html_text () %>% 
+   iconv(from = "UTF-8")
[1] "idを使った例"
> # Shorthand for the above (CSS selector)
> x %>% html_nodes( "#red") %>% html_text () %>% 
+   iconv(from = "UTF-8")
[1] "idを使った例"
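The XPath and CSS forms above return the same result for sample.html because only one element carries the class green. As a hedged sketch of where they differ, the stand-in HTML below (hypothetical, not the article's file) adds a div with the same class: `.green` matches any element with that class, while `p.green` is the closer CSS counterpart of `//p[@class = 'green']`.

```r
library(rvest)

# Hypothetical stand-in: one p and one div share the class "green".
doc <- read_html('<html><body>
<p>plain</p>
<p class="green">by class</p>
<p id="red">by id</p>
<div class="green">a div that is also green</div>
</body></html>')

# XPath restricted to p elements.
xpath_green <- doc %>% html_nodes(xpath = "//p[@class = 'green']") %>% html_text()

# CSS restricted to p elements: same single match.
css_p_green <- doc %>% html_nodes("p.green") %>% html_text()

# CSS without a tag name: matches the p and the div.
css_any_green <- doc %>% html_nodes(".green") %>% html_text()
```

When the tag matters, writing the tag name in the CSS selector keeps the two notations interchangeable.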

div tags

> # div tags
> x %>% html_nodes("div") %>% html_text() %>% 
+   iconv(from = "UTF-8")
[1] " 根本のdiv \n    \n      \n\tdiv の中のdiv でクラスは inDiv\n\n     \n      \n    \n"
[2] "\n\tdiv の中のdiv でクラスは inDiv\n\n     "                          
> # Text of the div nested inside the outer div
> x %>% html_nodes(".inDiv") %>% html_text() %>% 
+   iconv(from = "UTF-8")
[1] "\n\tdiv の中のdiv でクラスは inDiv\n\n     "
> 
> # Another way to specify it
> x %>% html_nodes("#divRoot .inDiv") %>% html_text() %>% 
+   iconv(from = "UTF-8")
[1] "\n\tdiv の中のdiv でクラスは inDiv\n\n     "
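The extracted strings above keep the newlines and indentation of the source. A small sketch, assuming a simplified ASCII version of the nested div structure, shows that `html_text()` has a `trim` argument that strips leading and trailing whitespace.

```r
library(rvest)

# Simplified, hypothetical version of the nested div structure.
doc <- read_html('<div id="divRoot"> root div
  <ul class="sub">
    <li>
      <div class="inDiv">
        inner div
      </div>
    </li>
  </ul>
</div>')

# "div" matches the outer and the inner div.
n_div <- length(html_nodes(doc, "div"))

# Descendant selector plus trim = TRUE gives the bare text.
inner <- doc %>% html_nodes("#divRoot .inDiv") %>% html_text(trim = TRUE)
```

`trim = TRUE` only affects the ends of each string. Whitespace inside the text, as in the outer div's result, is left as it is.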

Getting attribute values

> # Get the value of the title attribute of the span tag
> x %>% html_nodes(".inSpan") %>% html_attr("title") %>% 
+   iconv(from = "UTF-8")
[1] "2.0"
> 
> # Another way to do the same
> x %>% html_nodes("#divRoot li span.inSpan") %>% html_attr("title") %>% 
+   iconv(from = "UTF-8")
[1] "2.0"
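Attribute values always come back as character strings. The sketch below uses a hypothetical two-span fragment to show what `html_attr()` returns when an attribute is missing, how to list all attributes of one node with `html_attrs()`, and how to convert the value to a number.

```r
library(rvest)

# Hypothetical fragment: the second span has no title attribute.
doc <- read_html('<span class="inSpan" title="2.0"> test </span>
<span class="other"> x </span>')

spans <- html_nodes(doc, "span")

# A missing attribute yields NA rather than an error.
titles <- html_attr(spans, "title")

# All attributes of the first span as a named character vector.
first_attrs <- html_attrs(spans[[1]])

# Convert the string "2.0" to a number when it is needed as one.
title_num <- as.numeric(titles[1])
```

Because "2.0" is plain ASCII, the iconv step in the transcript above makes no difference for this particular value.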

Extracting tables

> # All tables in the HTML are extracted and returned as a list
> x %>% html_table %>% `[[`(1) 
        A    B
1 R1 R-A1 R-B1
2 R2 R-A2 R-B2
3 R3 R-A3 R-B3
> # Select by class (garbled on Windows)
> x %>% html_node(".myTable") %>% html_table
              蛻\x97-A    蛻\x97-B
1 陦\x8c-1 隕∫ｴ\xa0A1 隕∫ｴ\xa0B1
2 陦\x8c-2 隕∫ｴ\xa0A2 隕∫ｴ\xa0B2
3 陦\x8c-3 隕∫ｴ\xa0A3 隕∫ｴ\xa0B3
> # Get an attribute
> x %>% html_node(".myTable") %>% html_attr ("border")
[1] "1"
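As a self-contained sketch of the two table calls above, the following uses a hypothetical ASCII-only pair of tables. Selecting all table nodes returns a list with one entry per table, while selecting a single node returns one table.

```r
library(rvest)

# Hypothetical ASCII-only stand-in for the two tables in sample.html.
doc <- read_html('<table border="1">
 <tr><th></th><th>A</th><th>B</th></tr>
 <tr><td>R1</td><td>R-A1</td><td>R-B1</td></tr>
</table>
<table class="myTable" border="1">
 <tr><th></th><th>col-A</th><th>col-B</th></tr>
 <tr><td>row-1</td><td>a1</td><td>b1</td></tr>
 <tr><td>row-2</td><td>a2</td><td>b2</td></tr>
</table>')

# All tables: a list with one element per table node.
tables <- doc %>% html_nodes("table") %>% html_table()

# One table, selected by class.
my_table <- doc %>% html_node(".myTable") %>% html_table()

# The th row becomes the header, leaving two data rows and three columns.
my_dim <- dim(my_table)
```

Selecting by class avoids depending on the position of the table in the page, which `[[`(1) does.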

Workaround for garbled text (for Windows users; Mac users do not need this step). As of August 2016, with rvest_0.3.2, xml2_1.0.0, and dplyr_0.5.0, the garbling was not observed.

> ## Workaround for garbled text
> library(readr)

 Attaching package: ‘readr’ 
> # Column names are still garbled
> x %>% html_node(".myTable") %>% html_table %>% type_convert()
       蛻\x97-A 蛻\x97-B
1 行-1   要素A1   要素B1
2 行-2   要素A2   要素B2
3 行-3   要素A3   要素B3
> # Either convert just the column names beforehand, or avoid Japanese names altogether (the latter is recommended)
> x2 <- x %>% html_node(".myTable") %>% html_table 
> colnames(x2) <- iconv(colnames(x2), from = "UTF-8")
> # or: colnames(x2) <- c("A","B","C")
> x2 %>% type_convert
         列-A   列-B
1 行-1 要素A1 要素B1
2 行-2 要素A2 要素B2
3 行-3 要素A3 要素B3
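> ##
> # The following is hard to follow and is not recommended
> # (%T>% is exported by magrittr, so library(magrittr) is needed)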
> x %>% html_node(".myTable") %>% html_table %T>% {
+   colnames(.) <- iconv(colnames(.) , from = "UTF-8")
+ }%>% readr::type_convert()
         列-A   列-B
1 行-1 要素A1 要素B1
2 行-2 要素A2 要素B2
3 行-3 要素A3 要素B3
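The manual steps above can be wrapped in a small function. This is only a sketch: `fix_encoding` is a hypothetical helper name, not part of rvest or readr, and it assumes the goal is the same as in the article, namely converting column names and character columns from UTF-8 to the native encoding with iconv. The demo data is ASCII-only, so the conversion leaves it unchanged.

```r
# Hypothetical helper: apply iconv() to the column names and to every
# character column of a data frame. With the default `to`, iconv()
# converts to the encoding of the current locale.
fix_encoding <- function(df, from = "UTF-8") {
  colnames(df) <- iconv(colnames(df), from = from)
  is_chr <- vapply(df, is.character, logical(1))
  if (any(is_chr)) {
    df[is_chr] <- lapply(df[is_chr], iconv, from = from)
  }
  df
}

# ASCII-only demo data (an assumption for illustration).
df <- data.frame(
  `col-A` = c("a1", "a2"),
  `col-B` = c(1, 2),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

fixed <- fix_encoding(df)
```

In the article's setting the call would be along the lines of `x %>% html_node(".myTable") %>% html_table %>% fix_encoding()`. With the package versions noted above, where the garbling no longer appeared, this step may be unnecessary.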
