
Data analysis: frequently used techniques

Last updated at 2020-04-06, posted at 2020-02-26

Creating dummy variables

new_df = pd.get_dummies(data=old_df, columns=["column_to_encode"])
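
As a rough illustration of what `pd.get_dummies` produces, here is a minimal sketch on a tiny made-up DataFrame. The names `old_df`, `new_df`, and the `color` column are hypothetical and not from the original data.

```python
import pandas as pd

# Hypothetical sample data; "color" is the column to dummy-encode.
old_df = pd.DataFrame({"id": [1, 2, 3], "color": ["red", "blue", "red"]})

# Each distinct value of "color" becomes its own indicator column,
# and the original "color" column is dropped.
new_df = pd.get_dummies(data=old_df, columns=["color"])

print(new_df.columns.tolist())  # ['id', 'color_blue', 'color_red']
```

Columns not listed in `columns` (here `id`) are passed through unchanged.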

Updating the original data (inplace)

uselog_weekday.rename(columns={"log_id": "count"}, inplace=True)  # inplace controls whether the original data is updated (it is updated when True)
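
The difference between the two modes can be seen in a small sketch. The DataFrame below is made up for illustration; only the `log_id` and `count` column names come from the snippet above.

```python
import pandas as pd

df = pd.DataFrame({"log_id": [3, 5]})

# With inplace=True the DataFrame itself is modified and None is returned.
result = df.rename(columns={"log_id": "count"}, inplace=True)

# Without inplace, a renamed copy is returned and df is left untouched.
renamed = df.rename(columns={"count": "n"})

print(result)                    # None
print(df.columns.tolist())       # ['count']
print(renamed.columns.tolist())  # ['n']
```

Because the in-place call returns `None`, assigning its result back to the variable would discard the data.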

Converting object-dtype date/time data to datetime

When you want to extract only part of a timestamp such as 2019-12-16 09:09:00 (for example, the year and month):

datetime.py

import datetime
import pandas as pd
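old_df["発信開始した日時"] = pd.to_datetime(old_df["発信開始した日時"])
# "%Y%H" yields year + hour (e.g. "201909"); use "%Y%m" for year + month
old_df["発信開始時刻(～時)"] = old_df["発信開始した日時"].dt.strftime("%Y%H")
new_df = old_df[["発信開始時刻(～時)", "大職種名", "アポ結果"]]
# The column names are samples: 発信開始した日時 = call start datetime, 発信開始時刻(～時) = call start hour, 大職種名 = job category, アポ結果 = appointment result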
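
To make the effect of the format codes concrete, here is a small sketch with two made-up timestamps. The column names `started_at`, `year_month`, `year_hour`, and `hour` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"started_at": ["2019-12-16 09:09:00", "2020-01-05 18:30:00"]})

# Strings (object dtype) become datetime64 values.
df["started_at"] = pd.to_datetime(df["started_at"])

# %Y = 4-digit year, %m = 2-digit month, %H = 2-digit hour (24h clock).
df["year_month"] = df["started_at"].dt.strftime("%Y%m")
df["year_hour"] = df["started_at"].dt.strftime("%Y%H")

# The .dt accessor also exposes components directly as integers.
df["hour"] = df["started_at"].dt.hour

print(df["year_month"].tolist())  # ['201912', '202001']
print(df["year_hour"].tolist())   # ['201909', '202018']
print(df["hour"].tolist())        # [9, 18]
```

`strftime` returns strings, while accessors such as `.dt.hour` return numbers, which is usually more convenient for grouping or sorting.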

Checking for missing values

df.isnull().sum()

Dropping rows with a missing value in a specific column

df.dropna(subset=["column_name"], inplace=True)
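
The two steps above (counting missing values, then dropping rows by column) can be sketched together as follows. The data and the column names `name`, `age`, and `city` are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "age": [20, np.nan, 40],
    "city": [None, "Tokyo", "Osaka"],
})

# Number of missing values per column.
missing = df.isnull().sum()
print(missing["age"], missing["city"])  # 1 1

# Drop only the rows where "age" is missing; a missing "city" is tolerated.
df.dropna(subset=["age"], inplace=True)
print(len(df))  # 2
```

Without `subset`, `dropna` would remove every row that has a missing value in any column, which here would leave a single row.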

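Getting the distinct values in a column and their counts

import collections
print(collections.Counter(df["column_name"]))

# Per group of "column_name", count the non-null entries of another column
df.groupby("column_name").count()["column_to_count"]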
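
For comparison, pandas also has `value_counts` and `groupby(...).size()` for the same job. The sketch below uses a made-up `job` column and is only one way to do it.

```python
import collections
import pandas as pd

df = pd.DataFrame({"job": ["sales", "engineer", "sales", "sales"]})

# Standard-library approach: a dict-like mapping of value -> count.
counter = collections.Counter(df["job"])

# pandas approach: a Series sorted by count, most frequent first.
counts = df["job"].value_counts()

# groupby(...).size() counts rows per group, including rows with NaN elsewhere.
per_group = df.groupby("job").size()

print(counter["sales"], counts["sales"], per_group["sales"])  # 3 3 3
```

Note that `count()` after `groupby` skips missing values in the counted column, whereas `size()` counts every row in the group.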

Fixing column names after groupby

new_df = old_df.groupby("customer_id").agg(["mean", "median", "max", "min"])["count"]
new_df = new_df.reset_index(drop=False)  # drop=False (the default) keeps customer_id as a regular column
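
A minimal sketch of what this does to the column layout, using invented data. Only the `customer_id` and `count` column names come from the snippet above; `alt` is a hypothetical name for an equivalent form that selects the column before aggregating.

```python
import pandas as pd

old_df = pd.DataFrame({"customer_id": ["A", "A", "B"], "count": [2, 4, 10]})

# agg with a list gives two-level columns such as ("count", "mean");
# selecting ["count"] leaves the flat columns mean/median/max/min.
new_df = old_df.groupby("customer_id").agg(["mean", "median", "max", "min"])["count"]

# Move customer_id from the index back into an ordinary column.
new_df = new_df.reset_index(drop=False)
print(new_df.columns.tolist())  # ['customer_id', 'mean', 'median', 'max', 'min']

# Equivalent form: pick the column first, then aggregate.
alt = old_df.groupby("customer_id")["count"].agg(["mean", "median", "max", "min"]).reset_index()
```

Selecting the column before `agg` avoids aggregating columns that are not needed, which matters when the DataFrame also holds non-numeric columns.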

Creating a new DataFrame with only the columns you need

new_df = old_df[["column_1", "column_2"]]

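Defining a function and applying it

example.py

def some_function(x):
    time = x
    ans = time // 60
    # Returns None implicitly (a missing value after apply) when ans > 5
    if ans <= 5:
        return ans

# Add a new column computed from an existing one.
df['new_column'] = df["old_column"].apply(some_function)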
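
Here is a runnable sketch of the same pattern on made-up data, assuming the input is a duration in seconds. The function name `minutes_if_short` and the column names are hypothetical.

```python
import pandas as pd

def minutes_if_short(seconds):
    """Return whole minutes if at most 5, otherwise None."""
    minutes = seconds // 60
    if minutes <= 5:
        return minutes
    return None  # explicit, so the "no value" case is visible

df = pd.DataFrame({"duration_sec": [30, 120, 600]})

# apply calls the function once per element of the column.
df["duration_min"] = df["duration_sec"].apply(minutes_if_short)

# The 600-second row exceeds 5 minutes and ends up as a missing value.
print(df["duration_min"].isnull().tolist())  # [False, False, True]
```

Rows for which the function returns `None` show up as missing values, so a `dropna` or `fillna` step may be needed afterwards.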

Deleting a specific column

del df['column_name']

Plotting 1 (histogram)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# column_1 goes on the x-axis, the count on the y-axis, and the bars are split by the values of column_2
sns.countplot(x=df["column_1"], hue=df["column_2"])

Renaming columns and index labels

df = df.rename(columns={'old_column_name': 'new_column_name'}, index={'old_index_label': 'new_index_label'})

Extracting only the rows where a column has a given value

new_df = old_df.loc[old_df["大職種名"] == "営業"]
# The quoted strings are samples (大職種名 = job category, 営業 = sales)
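
The filter works by building a boolean mask from the same DataFrame that is being indexed. A minimal sketch with invented data and hypothetical English column names:

```python
import pandas as pd

old_df = pd.DataFrame({
    "job_category": ["sales", "engineering", "sales"],
    "result": [1, 0, 0],
})

# One True/False per row; it must be built from the DataFrame being filtered
# (or at least share its index), otherwise rows will not line up.
mask = old_df["job_category"] == "sales"

new_df = old_df.loc[mask]
print(new_df.index.tolist())  # [0, 2]
```

The original row labels are kept; `reset_index(drop=True)` can be chained if a fresh 0..n-1 index is wanted.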

Grouping by the values of a column (collapsing duplicates)

new_df = old_df.groupby(["発信開始時刻(～時)"]).sum()
# The column name is a sample (発信開始時刻(～時) = call start hour)
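
A short sketch of how duplicate keys collapse into one row each. The `hour` and `calls` columns are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"hour": ["09", "09", "10"], "calls": [1, 2, 5]})

# Rows sharing the same "hour" are merged and their numeric columns summed.
# The grouping key becomes the index of the result.
summed = df.groupby(["hour"]).sum()

print(summed.index.tolist())     # ['09', '10']
print(summed["calls"].tolist())  # [3, 5]
```

If the key should stay an ordinary column, `groupby(["hour"], as_index=False)` or a trailing `reset_index()` does that.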

Japanese font setup for plots (Google Colaboratory)

!apt-get -y install fonts-ipafont-gothic

!rm /root/.cache/matplotlib/fontlist-v310.json

!pip install japanize-matplotlib

import pandas as pd
import matplotlib.pyplot as plt
import japanize_matplotlib  # enables Japanese text in matplotlib
import seaborn as sns
sns.set(font="IPAexGothic")  # set the Japanese font

Setting a default value for NaN
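
One common way to do this in pandas is `fillna`. The following is a minimal sketch under that assumption, with hypothetical column names and default values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [1.0, np.nan, 3.0],
    "label": ["a", None, "c"],
})

# Replace missing values column by column with a chosen default.
df["score"] = df["score"].fillna(0)
df["label"] = df["label"].fillna("unknown")

print(df["score"].tolist())  # [1.0, 0.0, 3.0]
print(df["label"].tolist())  # ['a', 'unknown', 'c']
```

Choosing the default per column keeps numeric columns numeric and avoids mixing types.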

Things to watch out for when merging

How to use where

Evaluating a model trained with a decision tree

# Libraries required by the function
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Plot predicted values against true values
def True_Pred_map(pred_df):
    # Note: this is the plain RMSE; it equals RMSLE only if 'true' and 'pred' are already log-transformed
    RMSLE = np.sqrt(mean_squared_error(pred_df['true'], pred_df['pred']))
    R2 = r2_score(pred_df['true'], pred_df['pred'])
    plt.figure(figsize=(8,8))
    ax = plt.subplot(111)
    ax.scatter('true', 'pred', data=pred_df)
    ax.set_xlabel('True Value', fontsize=15)
    ax.set_ylabel('Pred Value', fontsize=15)
    ax.set_xlim(pred_df.min().min()-0.1 , pred_df.max().max()+0.1)
    ax.set_ylim(pred_df.min().min()-0.1 , pred_df.max().max()+0.1)
    # Reference line y = x: points on it are perfect predictions
    x = np.linspace(pred_df.min().min()-0.1, pred_df.max().max()+0.1, 2)
    y = x
    ax.plot(x,y,'r-')
    plt.text(0.1, 0.9, 'RMSLE = {}'.format(str(round(RMSLE, 5))), transform=ax.transAxes, fontsize=15)
    plt.text(0.1, 0.8, 'R^2 = {}'.format(str(round(R2, 5))), transform=ax.transAxes, fontsize=15)

Evaluating a model with the coefficient of determination
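
As a sketch of what the coefficient of determination measures, the standard definition R^2 = 1 - SS_res / SS_tot can be computed by hand with NumPy. The numbers below are made up, and the variable names are hypothetical.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: how far predictions are from the true values.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: how far true values are from their own mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# 1.0 means a perfect fit; 0.0 means no better than predicting the mean.
r2 = 1 - ss_res / ss_tot

# Root mean squared error, in the same units as the target.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(ss_res, ss_tot)  # 0.75 20.0
```

For these values R^2 works out to 1 - 0.75 / 20 = 0.9625, so the predictions explain most of the variance in the target.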
