勝手に添削：Tokyo.Rの新規率・直帰率について集計・プロットしてみた #rstatsj

Posted at 2015-07-16

呼ばれた気がしたので添削してみる。

Tokyo.Rの新規率・直帰率について集計・プロットしてみた - INPUTしたらOUTPUT!

1. イベントIDの取得

library(dplyr)
library(rvest)
library(stringr)

url <- iconv("http://api.atnd.org/events/?keyword=R勉強会＠東京&count=100", "sjis", "utf8")

doc <- html(url)

event_id <- doc %>% 
  html_nodes(xpath = "//events/event/event_id") %>% 
  html_text %>% 
  as.integer
title <- doc %>% 
  html_nodes(xpath = "//events/event/title") %>% 
  html_text

events <- data.frame(event_id, title) %>% 
  filter(str_detect(title, "^第[[:digit:]]+回R勉強会＠東京")) %>%
  arrange(desc(event_id)) %>%
  mutate(time = rev(row_number())) %>%
  select(event_id, time)

head(events)

結果

  event_id time
1    68150   49
2    66469   48
3    63990   47
4    61553   46
5    60908   45
6    57640   44

Windows だからか、URL は UTF-8 に変換してやらないとうまくいかなかった。

ポイントは 2 つ。

"第[0-9|０-９]+回" という正規表現の書き方は、第[[:digit:]]+回 みたいな書き方があります。この書き方については R における正規表現 - RjpWiki が詳しいです。
time = seq(dim(.)[1], 1, -1) のところは time = rev(row_number()) と書けます。row_number() は dplyr の便利関数です。

2. 出席者の取得

指定したイベントIDの出席者を取得する関数を作成し、各回の出席者を取得する。 @hoxo_m氏のpforeachパッケージでもっと楽にできそうな気もする。

ということなので、pforeach で書いてみる。

library(pforeach)

get_users <- function(event_id, time) {
  cat(time, event_id, "\n")
  url <- sprintf("http://api.atnd.org/events/users/?event_id=%s", event_id)
  doc <- html(url)
  Sys.sleep(1)
  get_node_text <- function(node_name) {
    xpath <- sprintf("//events/event/users/user/%s", node_name)
    doc %>% html_nodes(xpath = xpath) %>% html_text
  }
  data_frame(event_id = event_id, 
             time = time, 
             user_id = get_node_text("user_id"),
             nickname = get_node_text("nickname"),
             twitter_id = get_node_text("twitter_id"),
             status = get_node_text("status"))
}

npforeach(row=rows(events), .c=rbind)({
  get_users(row$event_id, row$time)
}) -> users

head(users)

結果

  event_id time user_id       nickname     twitter_id status
1    68150   49  145711 __john_smith__ __john_smith__      1
2    68150   49  159246         缶これ    wonder_zone      1
3    68150   49   21686       kos59125       kos59125      1
4    68150   49  109170   tom_of_death   tom_of_death      1
5    68150   49  173679         re_t_s         re_t_s      1
6    68150   49  166177   YugoKawamura                     1

特に楽にはならないっす。

好みの問題だけど、私ならこういうときは lambdaR を使います。

library(lambdaR)

users <- events %>% 
  Map2_(get_users) %>%
  rbind_all

3. 前回出席回、次回出席回、直帰フラグの追加

ここからちょっと元記事とは違う定義にします。

新規参加：初めての参加
離脱：その回が最後の参加
直帰：初めての参加かつ、その回が最後

好みの問題かもしれませんが、この定義の方がしっくりくる。
なので、ここで追加する列もシンプルになります。

users2 <- users %>% 
  group_by(user_id) %>% 
  arrange(time) %>% 
  mutate(first = first(time) == time, 
         last = last(time) == time, 
         bounce = first & last)

head(users2)

結果

  event_id time user_id         nickname   twitter_id status first  last bounce
1     5441    6   10046          rti7743    super_rti      1  TRUE FALSE  FALSE
2     6439    7   10046          rti7743    super_rti      1 FALSE FALSE  FALSE
3     9177   10   10046          rti7743    super_rti      1 FALSE FALSE  FALSE
4    14967   13   10046          rti7743    super_rti      1 FALSE  TRUE  FALSE
5    28416   23  100744         天才白菜        bae_j      1  TRUE  TRUE   TRUE
6    24629   20  100799 Akihiro Kasahara Aki_Kasahara      1  TRUE FALSE  FALSE

4. 新規率、離脱率、直帰率の算出

result <- users2 %>% 
  group_by(time) %>%
  summarise(n = n(), 
            new = sum(first),
            exit = sum(last),
            bounce = sum(bounce),
            new_rate = new / n,
            exit_rate = exit / n,
            bounce_rate = bounce / new) %>%
  arrange(desc(time))

head(result)

結果

  time   n new exit bounce  new_rate exit_rate bounce_rate
1   49 113  31  113     31 0.2743363 1.0000000   1.0000000
2   48  97  36   60     30 0.3711340 0.6185567   0.8333333
3   47  84  38   44     34 0.4523810 0.5238095   0.8947368
4   46 138  56   69     42 0.4057971 0.5000000   0.7500000
5   45  90  29   35     17 0.3222222 0.3888889   0.5862069
6   44  99  32   47     27 0.3232323 0.4747475   0.8437500

一つポイントなのは、summarise(とmutate)は左で求めた値を右側で使い回すことができるということです。(transform では無理)
なので n とかは使い回したほうが再計算されなくて地球にやさしい。

library(ggvis)

data <- result %>% slice(1:20)
result %>% ggvis(~ time, ~ bounce_rate) %>% layer_paths

result %>% ggvis(~ time, ~ new_rate) %>% layer_paths

プロットしてみると、定義は違うけど、ほとんど同じ動きをしています。
おもしろいなー。

Enjoy!

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up