
Code I use when analyzing data in Python [Memo]

Last updated at 2019-01-17. Posted at 2019-01-16.

Overview

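When analyzing data in Python,
there are snippets I should be using all the time but keep forgetting, so I am writing them down here as a memo.
There is no particular organization; these are just notes for my own future Python use, so please bear with that.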

Frequently used Python code

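DF[DF["A"] > 200] # extract rows where "A" is greater than 200
DF.at[1, "A"] = -1 # set the value at row label 1, column "A" to -1
DF["A"].map({0: "A", 1: "B"}) # map coded values to labels (e.g. for a categorical variable)
DF.dropna() # drop rows that contain NA
DF.fillna(0) # replace NA with 0
df.interpolate() # fill NaN by interpolating from neighboring values
sklearn.impute.SimpleImputer # impute NaN (replaces sklearn.preprocessing.Imputer, removed in scikit-learn 0.22)
sklearn.preprocessing.StandardScaler # standardize to mean 0, variance 1 (use MinMaxScaler to scale to [0, 1])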
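
To see what the pandas one-liners above actually return, here is a minimal sketch on a tiny made-up table. The column names and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: "A" holds numbers (one missing), "B" holds 0/1 codes.
DF = pd.DataFrame({
    "A": [100.0, 250.0, np.nan, 400.0],
    "B": [0, 1, 1, 0],
})

over_200 = DF[DF["A"] > 200]            # rows where A is strictly greater than 200
DF.at[1, "A"] = -1                      # overwrite the cell at row label 1, column "A"
labels = DF["B"].map({0: "A", 1: "B"})  # replace the 0/1 codes with labels
dropped = DF.dropna()                   # drop rows that contain NaN
filled = DF.fillna(0)                   # replace NaN with 0
interpolated = DF.interpolate()         # linear interpolation between neighbors

print(over_200, labels, dropped, filled, interpolated, sep="\n\n")
```

Note that dropna, fillna, and interpolate return new objects and leave DF itself unchanged unless the result is assigned back.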

Standard Scaler

from sklearn import preprocessing

# Fit the scaler (the fitted constants can be reused to normalize new data)
scaler = preprocessing.StandardScaler().fit(df)
# Print the fitted constants
print("mean:{} std:{}".format(scaler.mean_, scaler.scale_))
# Apply the scaling
df_scaled = scaler.transform(df)
print(df_scaled)
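
The point about reusing fixed normalization constants for new data can be sketched as follows. The two small DataFrames are hypothetical; the only assumption is that the new data has the same columns as the training data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and new data with the same columns.
train = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})
new = pd.DataFrame({"A": [2.0], "B": [40.0]})

scaler = StandardScaler().fit(train)    # learn mean and std from the training data only
train_scaled = scaler.transform(train)  # each column now has mean 0 and std 1
new_scaled = scaler.transform(new)      # new data is scaled with the training constants

print(train_scaled)
print(new_scaled)
```

Because the scaler is fitted once and only transform is called afterwards, the new row is expressed in the same units as the training data, which is usually what a trained model expects.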

Visualizing missing values

import missingno as msno
msno.matrix(DF)
# The version below has nicer colors
msno.matrix(df=train, figsize=(20,14), color=(0.5,0,0))
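
When a numeric summary is handier than a picture, roughly the same information can be read off with plain pandas. This is only a sketch on a hypothetical DataFrame, not part of missingno.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two of three columns.
DF = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": ["x", "y", None, "z"],
    "C": [1, 2, 3, 4],
})

missing_count = DF.isna().sum()   # number of missing values per column
missing_ratio = DF.isna().mean()  # fraction of missing values per column

print(missing_count)
print(missing_ratio)
```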

Plot

import matplotlib.pyplot as plt

# Plot a histogram (scores and files are assumed to be defined elsewhere)
plt.title('Histogram')
plt.xlabel('Score')
plt.ylabel('Number of students')
plt.hist(scores, range=(0, 100), bins=100)
plt.savefig(files['histogram'])
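
The histogram snippet depends on variables that are not shown, so here is a self-contained sketch of the same plot. The random scores and the output filename are assumptions made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 101, size=200)  # hypothetical integer test scores, 0 to 100

fig, ax = plt.subplots()
# hist returns the per-bin counts and the bin edges along with the drawn patches
counts, edges, _ = ax.hist(scores, range=(0, 100), bins=100)
ax.set_title("Histogram")
ax.set_xlabel("Score")
ax.set_ylabel("Number of students")
fig.savefig("histogram.png")  # hypothetical output path
```

One detail worth knowing: with range=(0, 100) and bins=100, each bin is one point wide and the last bin is closed on the right, so scores of 99 and 100 are counted together.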

df.plot.scatter(x="A",y="B")

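import seaborn as sns

# Pair plot and heatmap with Seaborn (both need data as the first argument)
sns.pairplot(df)
sns.heatmap(df.corr()) # e.g. heatmap of the correlation matrix

# Apply Seaborn's default style to matplotlib plots
sns.set()
plt.plot(function_1_1)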

Feature Importance using Random Forest


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

# max_features=1.0 uses all features; the old 'auto' value was removed in scikit-learn 1.3
rf = RandomForestRegressor(n_estimators=80, max_features=1.0)
rf.fit(X_train, y_train)
print('Training done using Random Forest')
# Sort feature indices from most to least important
ranking = np.argsort(-rf.feature_importances_)
f, ax = plt.subplots(figsize=(11, 9))
sns.barplot(x=rf.feature_importances_[ranking], y=X_train.columns.values[ranking], orient='h')
ax.set_xlabel("feature importance")
plt.tight_layout()
plt.show()
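
To check that the ranking behaves as expected, the same steps can be run on synthetic data where the answer is known. In this sketch the column names, the data, and random_state are all assumptions; only one column actually drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical features: "signal" determines y, the other two are pure noise.
X_train = pd.DataFrame(
    rng.normal(size=(300, 3)), columns=["signal", "noise_1", "noise_2"]
)
y_train = 3.0 * X_train["signal"] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=80, random_state=0)
rf.fit(X_train, y_train)

# Sort feature indices from most to least important
ranking = np.argsort(-rf.feature_importances_)
ranked = pd.Series(rf.feature_importances_[ranking], index=X_train.columns[ranking])
print(ranked)
```

The importances sum to 1, so the values can be read as each feature's share of the total impurity reduction across the forest.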

Summary

For now, I have only collected the snippets I use most often.
When I have time, I hope to add more little by little and organize them better.
