
Exploring the traits of retained users with logistic regression on the GA4 sample data

Posted at 2023-09-25 (last updated at 2023-09-25)

What we'll do

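Using the GA4 access-log dataset for the puzzle game "Flood-It!" from the BigQuery sample data, we use a correlation matrix and logistic regression to explore what characterizes retained versus churned users. (Outlier and anomaly handling, accuracy tuning, cross-validation, and similar steps are omitted.)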

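Note: earlier articles written with the same dataset
Computing a growth index from the GA4 sample data
Building per-user-type metrics as a share of MAU from the GA4 sample data
Computing retention rates by user type and by degree of stickiness from the GA4 sample data
Computing the active rate, an engagement metric, from the GA4 sample data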

Preprocessing (SQL)

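The features are the number of active days in the target period (August 2018), the event counts for representative events, and a user flag (new / continue / return; see the earlier articles). The target variable is 1 for users retained in the following month and 0 for users who churned.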

  WITH tmp AS (
    SELECT DISTINCT
      DATE_TRUNC(PARSE_DATE('%Y%m%d', event_date), MONTH) AS event_month
      , PARSE_DATE('%Y%m%d', event_date) AS event_date
      , event_name
      , user_pseudo_id
    FROM `firebase-public-project.analytics_153293282.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20180701' AND '20180930'
    AND event_name IN ('session_start', 'first_open')
  )
  # All-user log
  , mst_all_users AS (
    SELECT DISTINCT
      event_month
      , user_pseudo_id
    FROM tmp
  )
  # New-user log
  , mst_fitst_users AS (
    SELECT DISTINCT
      event_month
      , user_pseudo_id
      , 'new' AS flag
    FROM tmp
    WHERE event_name = 'first_open'
  )
  # Continuing-user log
  , mst_continue_users AS (
    SELECT DISTINCT
      A.event_month
      , A.user_pseudo_id
      , 'continue' AS flag
    FROM mst_all_users AS A # this month
    INNER JOIN mst_all_users AS B # previous month
      ON A.user_pseudo_id = B.user_pseudo_id
      AND DATE_DIFF(A.event_month, B.event_month, MONTH) = 1
  )
  # Returning-user log
  , mst_return_users AS (
    SELECT
      event_month
      , user_pseudo_id
      , 'return' AS flag
    FROM (
      SELECT event_month, user_pseudo_id FROM mst_all_users
      EXCEPT DISTINCT
      SELECT event_month, user_pseudo_id FROM mst_fitst_users
      EXCEPT DISTINCT
      SELECT event_month, user_pseudo_id FROM mst_continue_users
    )
  )
  # All users with their flag
  , mst_union_users AS (
    SELECT event_month, user_pseudo_id, flag FROM mst_fitst_users
    UNION ALL
    SELECT event_month, user_pseudo_id, flag FROM mst_continue_users
    UNION ALL
    SELECT event_month, user_pseudo_id, flag FROM mst_return_users
  )
  # JOIN users who were still active in the following month
  , mst_join_next_month AS (
    SELECT
      A.event_month
      , A.flag
      , A.user_pseudo_id
      , MAX(IF(B.user_pseudo_id IS NOT NULL, 1, 0)) AS next_month_flag
    FROM mst_union_users AS A # current month
    LEFT JOIN mst_all_users AS B # following month
      ON A.user_pseudo_id = B.user_pseudo_id
      AND DATE_DIFF(B.event_month, A.event_month, MONTH) = 1
    WHERE
      A.event_month = '2018-08-01'
    GROUP BY 1, 2, 3
  )
  # Number of active days
  , mst_action_days AS (
    SELECT
      user_pseudo_id
      , COUNT(DISTINCT event_date) AS action_days
    FROM tmp
    WHERE event_month = '2018-08-01'
    GROUP BY 1
  )
  # Per-event counts
  , event_log AS (
    SELECT
      user_pseudo_id
      , event_name
      , COUNT(*) AS num_events
    FROM `firebase-public-project.analytics_153293282.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20180801' AND '20180831'
    AND event_name IN ('screen_view', 'post_score', 'select_content', 'level_up'
      , 'level_retry', 'level_start', 'level_end', 'level_fail', 'level_reset'
      , 'level_start_quickplay', 'level_end_quickplay', 'level_complete_quickplay', 'level_fail_quickplay')
    GROUP BY 1, 2
  )
  SELECT
    A.user_pseudo_id
    , A.next_month_flag
    , A.flag AS user_flag
    , B.action_days
    , MAX(IF(event_name = 'screen_view', num_events, 0)) AS num_screen_view
    , MAX(IF(event_name = 'post_score', num_events, 0)) AS num_post_score
    , MAX(IF(event_name = 'select_content', num_events, 0)) AS num_select_content
    , MAX(IF(event_name = 'level_up', num_events, 0)) AS num_level_up
    , MAX(IF(event_name = 'level_retry', num_events, 0)) AS num_level_retry
    , MAX(IF(event_name = 'level_start', num_events, 0)) AS num_level_start
    , MAX(IF(event_name = 'level_end', num_events, 0)) AS num_level_end
    , MAX(IF(event_name = 'level_fail', num_events, 0)) AS num_level_fail
    , MAX(IF(event_name = 'level_reset', num_events, 0)) AS num_level_reset
    , MAX(IF(event_name = 'level_start_quickplay', num_events, 0)) AS num_level_start_quickplay
    , MAX(IF(event_name = 'level_end_quickplay', num_events, 0)) AS num_level_end_quickplay
    , MAX(IF(event_name = 'level_complete_quickplay', num_events, 0)) AS num_level_complete_quickplay
    , MAX(IF(event_name = 'level_fail_quickplay', num_events, 0)) AS num_level_fail_quickplay
  FROM mst_join_next_month AS A
  LEFT JOIN mst_action_days AS B
    ON A.user_pseudo_id = B.user_pseudo_id
  LEFT JOIN event_log AS C
    ON A.user_pseudo_id = C.user_pseudo_id
  GROUP BY
    1, 2, 3, 4
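
The labeling logic in the query above can be hard to follow in SQL alone. As a rough illustration only, the following pandas sketch reproduces the idea on a tiny made-up log. The rows and the `users_in` and `user_flag` helpers are hypothetical and not part of the original query. Unlike the SQL, which can emit one row per flag, this sketch assigns each user a single flag, with new taking priority.

```python
import pandas as pd

# Hypothetical toy log: one row per (user, month, event). Not real Flood-It! data.
log = pd.DataFrame(
    [
        ("u1", "2018-07", "session_start"),
        ("u1", "2018-08", "session_start"),
        ("u2", "2018-08", "first_open"),
        ("u2", "2018-09", "session_start"),
        ("u3", "2018-08", "session_start"),
        ("u4", "2018-07", "session_start"),
        ("u4", "2018-08", "session_start"),
        ("u4", "2018-09", "session_start"),
    ],
    columns=["user_pseudo_id", "event_month", "event_name"],
)


def users_in(month, event_name=None):
    """Return the set of users seen in a month, optionally for one event only."""
    rows = log[log["event_month"] == month]
    if event_name is not None:
        rows = rows[rows["event_name"] == event_name]
    return set(rows["user_pseudo_id"])


target, prev_month, next_month = "2018-08", "2018-07", "2018-09"

active = users_in(target)                      # mst_all_users for the target month
new = users_in(target, "first_open")           # mst_fitst_users
cont = active & users_in(prev_month)           # mst_continue_users
ret = active - new - cont                      # mst_return_users


def user_flag(user):
    """Assign one flag per user (simplification: new wins over continue)."""
    if user in new:
        return "new"
    if user in cont:
        return "continue"
    return "return"


labels = pd.DataFrame({"user_pseudo_id": sorted(active)})
labels["user_flag"] = labels["user_pseudo_id"].map(user_flag)
# 1 if the user is also active in the following month, else 0
labels["next_month_flag"] = (
    labels["user_pseudo_id"].isin(users_in(next_month)).astype(int)
)
print(labels)
```

Here `u3` is active in August without a `first_open` and without any July activity, so it falls into the return group by elimination, which mirrors the `EXCEPT DISTINCT` step in the query.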

Logistic regression (Python)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('event_log.csv')
df = df.drop(['user_pseudo_id'], axis=1)

# Encoding (one-hot encode the categorical user_flag column)
cat_cols = df.select_dtypes(include=object)
num_cols = df.select_dtypes(exclude=object)
cat_cols_en = pd.get_dummies(cat_cols)
data = pd.merge(cat_cols_en, num_cols, left_index=True, right_index=True)

# Correlation matrix
df_corr = data.corr()
plt.figure(figsize=(15, 15))
sns.heatmap(df_corr, vmax=1, vmin=-1, center=0, annot=True, cmap='Blues')
plt.show()

# Preprocessing
X = data.drop(['next_month_flag'], axis=1)
y = data['next_month_flag']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, shuffle=True, random_state=42)

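## Standardization (the first three columns are the user_flag dummies, so only the numeric columns after them are scaled)
num_col_names = X_train.columns[3:]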
scaler = StandardScaler()
scaler.fit(X_train[num_col_names])

X_train[num_col_names] = scaler.transform(X_train[num_col_names])
X_test[num_col_names] = scaler.transform(X_test[num_col_names])

## Logistic regression
model_lr = LogisticRegression(max_iter=100, multi_class='ovr', solver='liblinear', C=0.1, penalty='l1', random_state=0)
model_lr.fit(X_train, y_train)

y_test_pred = model_lr.predict(X_test)
ac_score = accuracy_score(y_test, y_test_pred)
print('accuracy_score:', ac_score)

## Confusion matrix
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('pred')
plt.ylabel('label')
plt.show()

## Regression coefficients
importances = model_lr.coef_[0]
indices = np.argsort(importances)  # ascending order, so barh draws the largest coefficient at the top

plt.figure(figsize=(10, 10))
plt.title('LogisticRegression coef')
plt.barh(range(len(indices)), importances[indices])
plt.yticks(range(len(indices)), X_train.columns[indices], rotation=0)
plt.show()

Correlation matrix

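What matters most is the correlation with the next-month retention flag (next_month_flag). A larger number of active days, and already being a continuing user from the previous month, both show relatively high correlations with it. Conversely, being a new user or a returning user is negatively correlated.
In addition, some of the level-related metrics are highly correlated with one another, which suggests that the more often a user plays, the more these similar events fire together. (I looked for detailed definitions of each event but could not find any.)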
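
Reading these two things off a large heatmap can be error-prone, so they can also be pulled out of the correlation matrix directly. The following is a minimal sketch on synthetic data, not the article's dataset: the generated columns, the retention probability `action_days / 40`, and the 0.8 threshold are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic stand-in for the article's `data` frame (values are made up).
rng = np.random.default_rng(0)
n = 500
action_days = rng.integers(1, 31, size=n)
num_level_start = rng.poisson(20, size=n)
# level_end nearly duplicates level_start, mimicking correlated level events
num_level_end = num_level_start + rng.integers(0, 3, size=n)
# Assumption for the toy data: more active days means higher retention odds
next_month_flag = (rng.random(n) < action_days / 40).astype(int)

data = pd.DataFrame(
    {
        "next_month_flag": next_month_flag,
        "action_days": action_days,
        "num_level_start": num_level_start,
        "num_level_end": num_level_end,
    }
)

corr = data.corr()

# 1) Correlation of every feature with the target, strongest positive first
target_corr = (
    corr["next_month_flag"].drop("next_month_flag").sort_values(ascending=False)
)
print(target_corr)

# 2) Feature pairs that are strongly correlated with each other
feature_cols = [c for c in data.columns if c != "next_month_flag"]
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for a, b in combinations(feature_cols, 2)
    if abs(corr.loc[a, b]) > 0.8  # hypothetical threshold
]
print(high_pairs)
```

Pairs that show up in the second list are candidates for multicollinearity, which is one reason their individual coefficients in a regression are hard to interpret.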

Confusion matrix

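Accuracy was 79.9%. For a logistic regression with essentially no feature engineering this matches my intuition, so the model can be considered reasonably accurate.
L1 regularization (C=0.1) is applied; running a grid search would probably improve accuracy a little further.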
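
As a hedged sketch of what such a grid search could look like, the example below tunes `C` with scikit-learn's `GridSearchCV` on synthetic data generated by `make_classification`. The data, the candidate values of `C`, and the majority-class baseline are assumptions added for illustration and are not taken from the article's experiment.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary data standing in for the retention dataset
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=3,
    weights=[0.7, 0.3],
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", random_state=0),
)
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Majority-class baseline: the accuracy obtained by always predicting the larger class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

best_c = search.best_params_["logisticregression__C"]
test_acc = search.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)
print("best C:", best_c)
print("test accuracy:", test_acc)
print("baseline accuracy:", baseline_acc)
```

Comparing against the baseline is useful because, when one class dominates, a high accuracy figure on its own says little about how much the model has learned.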

Regression coefficients (feature importance)

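Because each quantitative variable is standardized, the magnitude of the regression coefficients can be read as feature importance (although no outlier handling was done).
The result is similar to the correlation matrix: the number of active days contributes most to retention, followed by being a continuing user, which also contributes positively.
New users and returning users, on the other hand, show negative effects, and the remaining variables are shrunk to 0 by the regularization.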
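
To illustrate how L1 regularization drives coefficients to exactly 0, and how a coefficient on a standardized feature can be read, here is a small sketch on synthetic data. The dataset and the three values of `C` are assumptions for illustration; the numbers it prints say nothing about the Flood-It! results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 2 informative features and 8 pure-noise features
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

# Smaller C means stronger regularization, so fewer coefficients tend to survive
nonzero_counts = {}
for C in (0.01, 0.1, 1.0):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, random_state=0)
    model.fit(X, y)
    coef = model.coef_[0]
    nonzero_counts[C] = int(np.count_nonzero(coef))
    print(f"C={C}: {nonzero_counts[C]} non-zero coefficients out of {coef.size}")

# With standardized inputs, exp(coef) is the multiplicative change in the odds
# of the positive class for a one-standard-deviation increase in that feature.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, random_state=0)
model.fit(X, y)
odds_ratios = np.exp(model.coef_[0])
print(np.round(odds_ratios, 2))
```

A coefficient of exactly 0 gives an odds ratio of 1, meaning the model ignores that feature. With correlated features, L1 tends to keep one and drop the rest, so a zero coefficient does not by itself show that a feature is unrelated to retention.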

Closing thoughts

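In real work I would do this more carefully, but the output is roughly what I usually see: users who are already retained tend to stay retained, while converting new or returning users into retained users is quite hard.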
