Porting the "Sazae-san Janken Data Analysis" from R to Python

Last updated at 2018-12-31, posted at 2017-01-03

Introduction

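I originally picked up R in 2013 while studying machine learning at the Shizuoka Developers study group. Since then I have only launched RStudio once a year, solely for the "Sazae-san janken data analysis", and each time I have to recall how to operate it.
Last year I managed to set up a Python development environment in Visual Studio Code, so I tried porting the analysis from R to Python. That said, this was my first time seriously writing Python, so it took a fair amount of effort.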

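[Added 2018/12/31]
According to the Winter 2017 edition of the Sazae-san Janken White Paper (サザエさんじゃんけん白書), scissors is more likely in the first broadcast of each cour (quarter), that is, the first broadcast in January, April, July, and October. I incorporated that rule this time.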

Development environment

OS: Windows 10 Home (64-bit)
Python: Python 3.5.2 :: Anaconda 4.2.0.0 (64-bit)
Editor: Visual Studio Code version 1.24.0

Installation

pip install beautifulsoup4
pip install pandasql

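Web scraping

This was my first time writing a web scraper in Python, so I searched around and found a script published on GitHub, "サザエさん(とプリキュア)のジャンケンデータのダウンロード" (downloading janken data for Sazae-san and PreCure), which I used as a reference.
This version uses pandas to write the data to a CSV file.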

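[Added 2018/01/01]
The site "サザエさんのジャンケン学" ended on 2017/06/25, so data from 2017/07 onward can no longer be fetched from it. Thank you for the many years of work m(_ _)m
Data after that date was added by hand from "サザエさんとの勝負結果(年別)" (results against Sazae-san, by year).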

GetSazaeData.py

# -*- coding: utf-8 -*-

# get data from http://www.asahi-net.or.jp/~tk7m-ari/sazae_ichiran.html

'''
Fetch the data used for Sazae-san janken prediction
'''
import urllib.request
import datetime as dt
import bs4 as bs
import pandas as pd

def normalized_year(year):
    '''
    Convert a two-digit year to a four-digit year
    '''
    return int(year) + 2000 if year < 91 else int(year) + 1900

def read_data():
    '''
    Fetch the data by web scraping
    '''
    result = []
    response = urllib.request.urlopen('http://www.asahi-net.or.jp/~tk7m-ari/sazae_ichiran.html')
    soup = bs.BeautifulSoup(response, "lxml")

    for i in soup.body.contents[9].contents:
        if i.string and len(i.string.strip()) != 0:
            # Use i.string so this also works when the node is a tag wrapping the text
            split = i.string.strip().split()
            seq = int(split[0][1:-1])
            year, month, day = map(int, split[1].split('.'))
            year = normalized_year(year)
            #the data contain illegal date: 91.11.31 -> 91.12.01
            if year == 1991 and month == 11 and day == 31:
                date = dt.date(year, month + 1, 1)
            else:
                date = dt.date(year, month, day)

            kind, idx = ('-', 9)
            hand = split[2]
            if hand == 'グー':
                kind, idx = ('G', 1)
            if hand == 'チョキ':
                kind, idx = ('C', 2)
            if hand == 'パー':
                kind, idx = ('P', 3)

            result.append((seq, year, date, kind, idx))
    result.reverse()

    return result

def main():
    '''
    Main entry point
    '''
    df_data = pd.DataFrame(read_data(), columns=['seq', 'year', 'date', 'kind', 'idx'])
    df_data.to_csv('SazaeData.csv', index=False)

if __name__ == '__main__':
    main()
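
To make the per-line parsing in read_data() easier to follow, here is a minimal standalone sketch of the same steps applied to a single record. The sample line format ("第1412回 18.12.16 チョキ") is an assumption inferred from how the script slices each field, and parse_record and HANDS are hypothetical names that do not appear in the original script.

```python
import datetime as dt

# Hand name on the page -> (kind code, index), as in read_data()
HANDS = {'グー': ('G', 1), 'チョキ': ('C', 2), 'パー': ('P', 3)}

def parse_record(line):
    """Parse one whitespace-separated record into (seq, year, date, kind, idx)."""
    seq_text, date_text, hand = line.split()[:3]
    seq = int(seq_text[1:-1])  # drop the one-character prefix and suffix
    yy, month, day = map(int, date_text.split('.'))
    year = yy + 2000 if yy < 91 else yy + 1900
    if (year, month, day) == (1991, 11, 31):  # known invalid date in the source data
        month, day = 12, 1
    kind, idx = HANDS.get(hand, ('-', 9))  # anything else is marked as "no hand"
    return seq, year, dt.date(year, month, day), kind, idx

# Assumed sample line; the real page layout may differ
print(parse_record('第1412回 18.12.16 チョキ'))
```

Rows whose hand is not one of the three names get idx 9, which is what main.py later filters out with idx<>9.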

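Sazae-san janken prediction

Algorithm for predicting the next hand

Scissors is the most common hand, so the priority is rock > scissors > paper
She tends to play a hand different from the previous one, so pick the winning hand according to the priority above
If the hand two rounds ago and the previous hand differ, she tends to play the remaining hand, so pick the hand that beats it
If the last three hands contain a repeated hand, she tends to play the remaining hand, so pick the hand that beats it
If the hand two rounds ago and the previous hand are the same, she tends to play the hand that beats it, so pick the hand that loses to her previous hand
In the first week of January, April, July, and October scissors is likely, so pick rock (added)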
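
The rules about the last two hands and the first-week override can be expressed with two small relations: which hand beats which, and which hand is the "remaining" one. The following is a simplified sketch, not the get_guess function from main.py. It assumes both previous hands are known, ignores the three-hand rule, and all names in it (BEATS, remaining_hand, predict_sazae, choose_hand) are hypothetical.

```python
import datetime

# BEATS[x] is the hand that wins against x (G = rock, C = scissors, P = paper)
BEATS = {'G': 'P', 'C': 'G', 'P': 'C'}

def remaining_hand(a, b):
    """Return the one hand that is neither a nor b (a and b must differ)."""
    return ({'G', 'C', 'P'} - {a, b}).pop()

def predict_sazae(prev1, prev2):
    """Predict Sazae-san's next hand from her previous hand and the one before it."""
    if prev1 != prev2:
        return remaining_hand(prev1, prev2)  # different hands: expect the remaining one
    return BEATS[prev1]                      # same hand twice: expect the hand that beats it

def is_first_week_of_quarter(date):
    """True for day 1-7 of January, April, July, or October."""
    return date.month in (1, 4, 7, 10) and date.day <= 7

def choose_hand(prev1, prev2, date):
    """Pick our hand: rock in a quarter's first week, otherwise beat the prediction."""
    if is_first_week_of_quarter(date):
        return 'G'  # scissors is expected, so play rock
    return BEATS[predict_sazae(prev1, prev2)]

print(choose_hand('G', 'C', datetime.date(2018, 2, 11)))
```

For example, after rock and scissors the remaining hand is paper, so the sketch answers with scissors.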

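The script below reads the "SazaeData.csv" written by GetSazaeData.py and prints the win/loss results for the ten years from 2009 to 2018.
I don't expect to get comfortable with pandas data manipulation, so I use "pandasql" to handle the data with SQL instead. "pysqldf", a fork of pandasql that adds support for user-defined functions, also looks promising.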
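
As far as I know, pandasql runs its queries through SQLite by default, so the kind of aggregation used in main.py can be tried out with only the standard library sqlite3 module. The sketch below is an illustration under that assumption: the rows are made-up sample values, not real results, and the table is created by hand rather than from a DataFrame.

```python
import sqlite3

# Made-up sample rows (year, fight), for illustration only
rows = [(2018, 'win'), (2018, 'draw'), (2018, 'win'), (2018, 'lose'), (2017, 'win')]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE fighttbl (year INTEGER, fight TEXT)')
conn.executemany('INSERT INTO fighttbl VALUES (?, ?)', rows)

# Count each outcome per year; ORDER BY makes the row order deterministic
query = ('SELECT year, fight, COUNT(fight) AS cnt FROM fighttbl '
         'GROUP BY year, fight ORDER BY year, fight')
counts = conn.execute(query).fetchall()
conn.close()

print(counts)
```

Each result row is a (year, fight, cnt) tuple, which corresponds to the columns that sqldf returns as a DataFrame.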

main.py

# -*- coding: utf-8 -*-
'''
Sazae-san janken prediction
'''
import datetime
import pandas as pd
from pandasql import sqldf

def get_guess(fstkind, sndkind, thrkind):
    '''
    Choose our next hand from Sazae-san's previous hands (most recent first)
    '''

    guess = 'G'

    if fstkind == 'G':
        guess = 'C'
    elif fstkind == 'C':
        guess = 'G'
    elif fstkind == 'P':
        guess = 'G'

    # When the hand from two rounds ago exists
    if sndkind != '':
        if sndkind != fstkind:
            # Different hands: expect the remaining hand, so pick the hand that beats it
            ptn = fstkind + sndkind
            if ptn in ('GC', 'CG'):
                guess = 'C' # expect P, so play C
            elif ptn in ('CP', 'PC'):
                guess = 'P' # expect G, so play P
            elif ptn in ('PG', 'GP'):
                guess = 'G' # expect C, so play G
        else:
            # Same hand twice: expect the hand that beats it, so play the losing hand
            if fstkind == 'G':
                guess = 'C' # expect P, so play C
            elif fstkind == 'C':
                guess = 'P' # expect G, so play P
            elif fstkind == 'P':
                guess = 'G' # expect C, so play G

        # When the hand from three rounds ago exists
        if thrkind != '':
            # Different hands: expect the remaining hand, so pick the hand that beats it
            ptn = fstkind + sndkind
            if ptn in ('GC', 'CG'):
                guess = 'C' # P will come, so play C
            ptn = fstkind + sndkind + thrkind
            if ptn in ('GCG', 'CGC'):
                guess = 'C' # expect P, so play C
            elif ptn in ('CPC', 'PCP'):
                guess = 'P' # expect G, so play P
            elif ptn in ('PGP', 'GPG'):
                guess = 'G' # expect C, so play G
            elif ptn in ('GGC', 'CCG', 'GCC', 'CGG'):
                guess = 'C' # expect P, so play C
            elif ptn in ('CCP', 'PPC', 'PCC', 'CPP'):
                guess = 'P' # expect G, so play P
            elif ptn in ('PPG', 'GGP', 'GPP', 'PGG'):
                guess = 'G' # expect C, so play G

    return guess

def get_fight(kind, guess):
    '''
    Decide the outcome of our guess against Sazae-san's hand (kind)
    '''

    # ptn is Sazae-san's hand followed by ours
    ptn = kind + guess
    if ptn in ('GP', 'CG', 'PC'):
        result = 'win'
    elif kind == guess:
        result = 'draw'
    else:
        result = 'lose'

    return result

def isFirstWeek(value):
    '''
    Return True if the date is in the first week of January, April, July, or October
    '''
    date = datetime.datetime.strptime(value, '%Y-%m-%d')
    if (date.month - 1) % 3 != 0:
        return False

    # Days 1 to 7 count as the first week of the month
    return date.day <= 7

def get_fight_result(df_data):
    '''
    Play against the past data, resetting the hand history each year
    '''
    result = []
    i = 0
    oldyear = 0
    row = len(df_data)
    while i < row:
        # .loc replaces .ix, which was removed in newer pandas versions
        if oldyear != df_data.loc[i, 'year']:
            oldyear = df_data.loc[i, 'year']
            thrkind, sndkind, fstkind = ['', '', '']

        seq = df_data.loc[i, 'seq']
        year = df_data.loc[i, 'year']
        date = df_data.loc[i, 'date']
        kind = df_data.loc[i, 'kind']

        # Get the hand that should beat the predicted next hand
        guess = get_guess(fstkind, sndkind, thrkind)
        # Scissors is common in the first week of January, April, July, and October
        if isFirstWeek(date):
            guess = 'G'
        fight = get_fight(kind, guess)
        thrkind, sndkind, fstkind = [sndkind, fstkind, kind]

        result.append((seq, year, date, kind, guess, fight))
        i = i + 1

    return pd.DataFrame(result, columns=['seq', 'year', 'date', 'kind', 'guess', 'fight'])

def get_winning_percentage(df_data):
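    '''
    Compute the winning percentage per year
    Assumes every year has three rows ordered draw, lose, win
    '''
    result = []
    i = 0
    oldyear = 0
    row = len(df_data)
    while i < row:
        # .loc replaces .ix, which was removed in newer pandas versions
        if oldyear != df_data.loc[i, 'year']:
            oldyear = df_data.loc[i, 'year']
            year = oldyear
            draw = df_data.loc[i, 'cnt']
            lose = df_data.loc[i+1, 'cnt']
            win = df_data.loc[i+2, 'cnt']
            # Draws are excluded from the rate
            rate = round(win / (win + lose), 3)
            result.append((year, win, lose, draw, rate))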

        i = i + 1

    return pd.DataFrame(result, columns=['year', 'win', 'lose', 'draw', 'rate'])

def main():
    '''
    Main entry point
    '''
    # Load the Sazae-san janken data
    ytbl = pd.read_csv('SazaeData.csv')

    # Results against ten years of past data
    pd.set_option("display.max_rows", 100)
    query = "SELECT seq, year, date, kind, idx FROM ytbl WHERE idx<>9 AND year BETWEEN 2009 AND 2018;"
    ytblptn = sqldf(query, locals())
    fighttbl = get_fight_result(ytblptn)
    print(fighttbl)

    # Winning percentage per year over the ten years
    # ORDER BY year, fight keeps each year's rows in draw, lose, win order
    query = "SELECT year,fight,COUNT(fight) AS cnt FROM fighttbl GROUP BY year,fight ORDER BY year,fight;"
    fightcnt = sqldf(query, locals())
    ratetbl = get_winning_percentage(fightcnt)
    print(ratetbl)

if __name__ == '__main__':
    main()
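
The yearly tally in get_winning_percentage relies on every year having all three outcomes in a fixed row order. As an alternative, here is a minimal sketch that counts outcomes with collections.Counter, so a year with no draws or no losses still works. The function name winning_percentage and the sample input are hypothetical and not part of the original script.

```python
from collections import Counter

def winning_percentage(fights):
    """Tally (year, result) pairs into (year, win, lose, draw, rate) rows."""
    per_year = {}
    for year, result in fights:
        per_year.setdefault(year, Counter())[result] += 1

    table = []
    for year in sorted(per_year):
        counts = per_year[year]  # missing outcomes count as 0
        decided = counts['win'] + counts['lose']
        # Draws are excluded from the rate, as in the original calculation
        rate = round(counts['win'] / decided, 3) if decided else None
        table.append((year, counts['win'], counts['lose'], counts['draw'], rate))
    return table

# Made-up sample input, for illustration only
sample = [(2018, 'win'), (2018, 'win'), (2018, 'lose'), (2017, 'draw')]
print(winning_percentage(sample))
```

A year with no decided games gets a rate of None instead of raising a division error.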

Output

The full output is long, so only the 2018 predictions and results are shown here

	seq	year	date	kind	guess	fight
443	1364	2018	2018-01-07	C	G	win
444	1365	2018	2018-01-14	G	G	draw
445	1366	2018	2018-01-21	C	C	draw
446	1367	2018	2018-01-28	G	C	lose
447	1368	2018	2018-02-04	P	C	win
448	1369	2018	2018-02-11	G	G	draw
449	1370	2018	2018-02-18	C	G	win
450	1371	2018	2018-02-25	C	C	draw
451	1372	2018	2018-03-04	P	C	win
452	1373	2018	2018-03-11	G	P	win
453	1374	2018	2018-03-18	G	G	draw
454	1375	2018	2018-03-25	C	G	win
455	1376	2018	2018-04-01	C	G	win
456	1377	2018	2018-04-08	G	C	lose
457	1378	2018	2018-04-15	P	C	win
458	1379	2018	2018-04-22	P	G	lose
459	1380	2018	2018-04-29	C	G	win
460	1381	2018	2018-05-06	G	P	win
461	1382	2018	2018-05-13	C	C	draw
462	1383	2018	2018-05-20	P	C	win
463	1384	2018	2018-05-27	G	P	win
464	1385	2018	2018-06-03	C	G	win
465	1386	2018	2018-06-10	G	C	lose
466	1387	2018	2018-06-17	P	C	win
467	1388	2018	2018-06-24	P	G	lose
468	1389	2018	2018-07-01	C	G	win
469	1390	2018	2018-07-08	G	P	win
470	1391	2018	2018-07-15	P	C	win
471	1392	2018	2018-07-22	G	G	draw
472	1393	2018	2018-07-29	C	G	win
473	1394	2018	2018-08-05	C	C	draw
474	1395	2018	2018-08-12	P	C	win
475	1396	2018	2018-08-19	G	P	win
476	1397	2018	2018-08-26	G	G	draw
477	1398	2018	2018-09-02	P	G	lose
478	1399	2018	2018-09-09	C	G	win
479	1400	2018	2018-09-16	G	P	win
480	1401	2018	2018-09-23	C	C	draw
481	1402	2018	2018-09-30	P	C	win
482	1403	2018	2018-10-07	C	G	win
483	1404	2018	2018-10-14	G	P	win
484	1405	2018	2018-10-21	P	C	win
485	1406	2018	2018-11-04	P	G	lose
486	1407	2018	2018-11-11	C	G	win
487	1408	2018	2018-11-18	C	P	lose
488	1409	2018	2018-11-25	G	P	win
489	1410	2018	2018-12-02	G	C	lose
490	1411	2018	2018-12-09	P	C	win
491	1412	2018	2018-12-16	C	G	win

Win/loss results for 2009 to 2018

	year	win	lose	draw	rate
0	2009	32	5	12	0.865
1	2010	27	6	14	0.818
2	2011	30	8	12	0.789
3	2012	27	12	10	0.692
4	2013	26	11	12	0.703
5	2014	32	8	11	0.800
6	2015	34	8	8	0.810
7	2016	26	12	12	0.684
8	2017	34	8	6	0.810
9	2018	30	9	10	0.769
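
The rate column is win / (win + lose), with draws left out of the denominator. The short check below recomputes it from the counts in the table above. The overall figure across all ten years is derived here from those same counts and does not appear in the original output.

```python
# (win, lose, draw) per year, copied from the table above
results = {
    2009: (32, 5, 12), 2010: (27, 6, 14), 2011: (30, 8, 12), 2012: (27, 12, 10),
    2013: (26, 11, 12), 2014: (32, 8, 11), 2015: (34, 8, 8), 2016: (26, 12, 12),
    2017: (34, 8, 6), 2018: (30, 9, 10),
}

# Per-year rate, rounded to three decimals as in main.py
rates = {year: round(win / (win + lose), 3)
         for year, (win, lose, draw) in results.items()}

# Overall rate across all ten years, draws excluded
total_win = sum(win for win, _, _ in results.values())
total_lose = sum(lose for _, lose, _ in results.values())
overall = round(total_win / (total_win + total_lose), 3)

print(rates[2018], overall)
```

For 2018 this is 30 / (30 + 9), which rounds to 0.769 and matches the table.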

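Conclusion

The R data frames ported easily by using pandas in Python, and pandasql stood in for sqldf.
If anything, reducing the Pylint warnings and errors reported in Visual Studio Code was the harder part. Searching for the messages rarely led to sites written in Japanese, so I fixed them while working through the English explanations. I still need to learn Python's naming conventions and the like.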
