
Retrieving App Reviews from the App Store and Google Play

Last updated at 2024-05-10, posted at 2024-02-29

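Overview

Notes to self on how to retrieve app reviews from the App Store and Google Play.
The rating, review text, and app version are extracted and loaded into a pandas DataFrame.
Python is used throughout.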

1. Retrieving App Store reviews

The App Store app ID is used to fetch XML files, and the rating, review text, and version are then extracted from those files.

The ID appears in the URL of the App Store page of the app whose reviews you want.

Example: Suica

App Store page
https://apps.apple.com/jp/app/suica/id1156875272
Review XML feed (page=1 through page=10)
https://itunes.apple.com/jp/rss/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml

For Suica, the number in id1156875272 (1156875272) is the app ID.
The XML feed appears to have at most 10 pages.
Fetch page=1 through page=10 by scraping or similar means and save them (details omitted).
From here on, the files are assumed to be saved in the data directory as page1.xml, ..., page10.xml.
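
The download step is omitted above. One way it might be done with only the standard library is sketched here. The helper names (extract_app_store_id, build_review_url, fetch_bytes, save_review_pages), the one-second pause, and the injectable fetch argument are assumptions made for illustration and are not part of the original workflow; the URL pattern is the one shown in the Suica example.

```python
import re
import time
import urllib.request
from pathlib import Path


def extract_app_store_id(page_url):
    """Return the digits that follow 'id' in an App Store page URL."""
    match = re.search(r"/id(\d+)", page_url)
    if match is None:
        raise ValueError(f"no app ID found in {page_url!r}")
    return match.group(1)


def build_review_url(app_id, page, country="jp"):
    """Build the customer-review feed URL for one page."""
    return (
        f"https://itunes.apple.com/{country}/rss/customerreviews/"
        f"page={page}/id={app_id}/sortby=mostrecent/xml"
    )


def fetch_bytes(url):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_review_pages(app_id, out_dir="./data", pages=range(1, 11),
                      fetch=fetch_bytes, pause_seconds=1.0):
    """Save each feed page as page<N>.xml and return the written paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for page in pages:
        body = fetch(build_review_url(app_id, page))
        target = out_path / f"page{page}.xml"
        target.write_bytes(body)
        written.append(target)
        time.sleep(pause_seconds)  # pause between requests to go easy on the server
    return written
```

Calling save_review_pages(extract_app_store_id("https://apps.apple.com/jp/app/suica/id1156875272")) would then write page1.xml through page10.xml under ./data, provided the feed is reachable. Handling of missing pages or HTTP errors is left out of this sketch.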

Extracting data from a single file

BeautifulSoup is used here to extract the review text, rating, and version from page1.xml and convert them into a pandas DataFrame.

Excerpt from page1.xml

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom" xml:lang="ja">
  <id>https://mzstoreservices-int.itunes.apple.com/jp/rss/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml</id><title>iTunes Store: カスタマーレビュー</title><updated>2024-02-28T15:41:16-07:00</updated><link rel="alternate" type="text/html" href="https://music.apple.com/WebObjects/MZStore.woa/wa/viewGrouping?cc=jp&amp;id=1000"/><link rel="self" href="https://mzstoreservices-int.itunes.apple.com/jp/rss/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml"/><link rel="first" href="https://itunes.apple.com/jp/rss/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml?urlDesc=/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml"/><link rel="last" href="https://itunes.apple.com/jp/rss/customerreviews/page=10/id=1156875272/sortby=mostrecent/xml?urlDesc=/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml"/><link rel="previous" href="https://itunes.apple.com/jp/rss/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml?urlDesc=/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml"/><link rel="next" href="https://itunes.apple.com/jp/rss/customerreviews/page=2/id=1156875272/sortby=mostrecent/xml?urlDesc=/customerreviews/page=1/id=1156875272/sortby=mostrecent/xml"/><icon>http://itunes.apple.com/favicon.ico</icon><author><name>iTunes Store</name><uri>http://www.apple.com/jp/itunes/</uri></author><rights>Copyright 2008 Apple Inc.</rights>
  <entry>
	<id>111111</id>
	<title>突然使えなくなった</title>
	<content type="text">今まではiPhoneの顔認証でパスワード認識していたけど、(略)</content>
	<im:contentType term="Application" label="アプリケーション"/>
	<im:voteSum>0</im:voteSum>
	<im:voteCount>0</im:voteCount>
	<im:rating>1</im:rating>
	<updated>2024-02-27T06:44:40-07:00</updated>
	<im:version>3.2.3</im:version>
〜〜〜〜〜〜〜（略）〜〜〜〜〜〜〜

  </entry>
</feed>

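The contents of page1.xml show where each field to extract lives.

The review text is the text of content type="text"
The rating is the text of im:rating
The version is the text of im:version
Based on this, the text values are collected into lists and then combined into a DataFrame.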

from bs4 import BeautifulSoup
import pandas as pd

filename = "./data/page1.xml"
# The feed declares UTF-8, so read it explicitly as UTF-8
with open(filename, "rt", encoding="utf-8") as file:
    text = file.read()

soup = BeautifulSoup(text, features="xml")

# Extract the fields of interest; find_all returns every matching tag
review_list  = [tag.text for tag in soup.find_all("content", {"type": "text"})]
score_list   = [tag.text for tag in soup.find_all("im:rating")]
version_list = [tag.text for tag in soup.find_all("im:version")]

# Zip the three lists together and name the columns
data = pd.DataFrame(zip(score_list, review_list, version_list), columns=["score", "content", "version"])
print(data.head(3))
#  score                                            content version
#0     1  今まではiPhoneの顔認証でパスワード認識していたけど、突然顔認証もパスワードも認証しなく...   3.2.3
#1     1  機種変でアプリ入れ直したら「このSuicaはご利用になれません」の一点張り。ログインすらさせ...   3.2.3
#2     1  お知らせを通知するバッチを消すのにいちいちリンク先まで飛ばないといけないのはなんとかなりませんか？   3.2.3
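
The three lists above are matched up purely by position, so a single entry lacking, say, im:version would shift every later row. As a hedged alternative, the sketch below walks the feed one entry at a time using the standard library's xml.etree.ElementTree instead of BeautifulSoup. The function name parse_review_entries and the inline sample XML are illustrative assumptions; the namespace URIs are taken from the feed element in the excerpt.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared on the <feed> element of the review feed
NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "im": "http://itunes.apple.com/rss",
}


def parse_review_entries(xml_text):
    """Return one dict per <entry>, with None for any missing field."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", NAMESPACES):
        rows.append({
            "score": entry.findtext("im:rating", default=None, namespaces=NAMESPACES),
            "content": entry.findtext("atom:content[@type='text']", default=None, namespaces=NAMESPACES),
            "version": entry.findtext("im:version", default=None, namespaces=NAMESPACES),
        })
    return rows


# Made-up sample: the second entry has no im:version element
SAMPLE = """<feed xmlns:im="http://itunes.apple.com/rss" xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="text">first review</content>
    <im:rating>1</im:rating>
    <im:version>3.2.3</im:version>
  </entry>
  <entry>
    <content type="text">second review</content>
    <im:rating>5</im:rating>
  </entry>
</feed>"""

rows = parse_review_entries(SAMPLE)
print(rows[1])
# {'score': '5', 'content': 'second review', 'version': None}
```

Passing rows to pd.DataFrame would give the same three columns as before, with missing values kept on the row they belong to.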

Extracting data from every file in a directory

Each of the 10 files saved in the data directory (page1.xml, ..., page10.xml) is read into its own DataFrame. Concatenating the 10 DataFrames completes the job.

from bs4 import BeautifulSoup
import glob
import pandas as pd

# Sorting is optional, but it makes the file order deterministic
# (note that a plain string sort puts page10.xml before page2.xml)
filename_list  = sorted(glob.glob("./data/*.xml"))

# Open a file and extract the rating, review text, and version as a DataFrame
def convert2df(filename):
    with open(filename, "rt", encoding="utf-8") as file:
        text = file.read()

    xml_text = BeautifulSoup(text, "xml")
    review_list = [tag.text for tag in xml_text.find_all("content", {"type": "text"})]
    score_list = [tag.text for tag in xml_text.find_all("im:rating")]
    version_list = [tag.text for tag in xml_text.find_all("im:version")]

    df = pd.DataFrame(zip(score_list, review_list, version_list), columns=["score", "content", "version"])
    return df

# Convert every file into a DataFrame
df_list = [convert2df(filename) for filename in filename_list]

# Concatenate the DataFrames in df_list
data = pd.concat(df_list, ignore_index=True)

data now holds the ratings, review texts, and versions.
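
A plain sorted() orders file names as strings, so page10.xml lands between page1.xml and page2.xml. That does not change which rows end up in data, only their order. If the concatenated rows should follow the page order, a numeric sort key is one option. The helper name page_number below is an assumption for illustration.

```python
import re
from pathlib import Path


def page_number(filename):
    """Return the integer N from a file name like 'page<N>.xml'."""
    match = re.search(r"page(\d+)\.xml$", Path(filename).name)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return int(match.group(1))


names = ["./data/page10.xml", "./data/page2.xml", "./data/page1.xml"]
ordered = sorted(names, key=page_number)
print(ordered)
# ['./data/page1.xml', './data/page2.xml', './data/page10.xml']
```

The same key could be passed to the sorted() call that builds filename_list.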

2. Retrieving Google Play reviews

Reviews are fetched from Google Play with google_play_scraper. Thanks to the library's author.
Version used: google-play-scraper 1.2.6

pip3 install google-play-scraper

【Google play scraper】

Because every review is fetched, extraction takes a long time for apps with many reviews.

The usage described for google_play_scraper is followed as-is. app_id is the ID of the app whose reviews you want.

Example: Suica

Google Play page
https://play.google.com/store/apps/details?id=com.mobilesuica.msb.android&hl=ja&gl=JP
app_id is the id part of the Google Play URL.
app_id=com.mobilesuica.msb.android
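
If the app_id should be pulled out of a pasted URL rather than copied by hand, the id query parameter can be parsed with the standard library. This is a small sketch; the function name extract_google_play_id is an assumption for illustration.

```python
from urllib.parse import parse_qs, urlparse


def extract_google_play_id(page_url):
    """Return the value of the 'id' query parameter in a Google Play URL."""
    query = parse_qs(urlparse(page_url).query)
    if "id" not in query:
        raise ValueError(f"no id parameter found in {page_url!r}")
    return query["id"][0]


url = "https://play.google.com/store/apps/details?id=com.mobilesuica.msb.android&hl=ja&gl=JP"
print(extract_google_play_id(url))
# com.mobilesuica.msb.android
```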

from google_play_scraper import app
from google_play_scraper import Sort, reviews_all

import numpy as np
import pandas as pd
app_id = "com.mobilesuica.msb.android"  # Suica

jp_reviews = reviews_all(
    app_id = app_id,
    sleep_milliseconds=0,  # defaults to 0
    lang='ja',  # defaults to 'en'
    country='jp',  # defaults to 'us'
    sort=Sort.MOST_RELEVANT,  # defaults to Sort.MOST_RELEVANT
)

print(jp_reviews[0])
#{'reviewId': 'xxxxxx', 
# 'userName': 'xxxxxx', 
# 'userImage': 'https://play-lh.googleusercontent.com/a-/xxxxxx', 
# 'content': 'iPhone版にはありませんでしたが、画面を移動すると稀に数秒のロード（読み込み）が入ります。(略)', 
# 'score': 3, 
# 'thumbsUpCount': 66, 
# 'reviewCreatedVersion': '6.2.3', 
# 'at': datetime.datetime(2023, 12, 27, 2, 17, 3), 
# 'replyContent': None, 
# 'repliedAt': None, 
# 'appVersion': '6.2.3'}

The result is a list of dictionaries (JSON-like records), which is converted into a pandas DataFrame.

# Load the fetched reviews into a DataFrame (one row per review, one column per key)
df_reviews = pd.DataFrame(jp_reviews)
df = df_reviews[['score', 'content', 'appVersion']]    # rating, review text, version
print(df.head(3))
#   score                                            content appVersion
# 0      3  iPhone版にはありませんでしたが、画面を移動すると稀に数秒のロード（読み込み）が入ります...      6.2.3
# 1      1  改札にタッチしても反応しない時があり地獄。携帯の機器のせいかと思っていたがここで同じ書き込み...      6.2.3
# 2      1  何の権限があるのか知りませんが、勝手にクレジットカードチャージ金額の上限をアプリ側が設定して...      6.2.3
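
Since reviews_all downloads every review, a capped download may be preferable for popular apps. The sketch below is a generic paging loop, not part of google_play_scraper: collect_reviews and its fetch_page argument are hypothetical names. It assumes fetch_page behaves like the library's reviews function as described in its README, that is, it accepts a continuation_token keyword and returns a (results, continuation_token) pair. That assumption should be checked against the installed version.

```python
def collect_reviews(fetch_page, max_reviews=500):
    """Call fetch_page repeatedly until max_reviews are collected or no more arrive."""
    collected = []
    token = None  # None asks fetch_page for the first batch
    while len(collected) < max_reviews:
        batch, token = fetch_page(continuation_token=token)
        if not batch:  # an empty batch is treated as the end of the reviews
            break
        collected.extend(batch)
    # The last batch may overshoot the cap, so trim the result
    return collected[:max_reviews]
```

With the library installed, fetch_page might be something like functools.partial(reviews, app_id, lang='ja', country='jp', count=100), assuming the reviews signature matches the README. The returned list of dictionaries can be passed to pd.DataFrame in the same way as the output of reviews_all.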

3. Thoughts

Unfortunately, at most 500 App Store reviews can be retrieved this way.
To be continued: text mining on the collected reviews. Zzz

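4. Reference articles

Excellent articles that go all the way through to analysis.

Retrieving reviews with Perl

A worked example of game review analysis

App Store review analysis of WAON and nanaco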
