NumPy and pandas Cheat Sheet

NumPy

Filtering by condition

# Create an array
>>> import numpy as np
>>> array1 = np.array([16, 10, 3])
>>> array1.shape
(3,)

# A list of booleans can be used as a mask
>>> array1[[True, False, True]]
array([16,  3])

# A comparison returns a boolean array
>>> print(array1 > 5)
[ True  True False]

# Combining the two filters the array by a condition
>>> array1[array1 > 5]
array([16, 10])
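
Several conditions can be combined in the same way. The following is a minimal sketch that reuses `array1` from above; the variable names `between`, `outside`, `not_large`, and `idx` are only illustrative.

```python
import numpy as np

array1 = np.array([16, 10, 3])

# Combine element-wise conditions with & (and), | (or) and ~ (not).
# Each comparison needs its own parentheses because & and | bind
# more tightly than the comparison operators.
between = array1[(array1 > 5) & (array1 < 15)]
outside = array1[(array1 <= 5) | (array1 >= 15)]
not_large = array1[~(array1 > 5)]

# np.where with a single argument returns the positions where the mask is True.
idx = np.where(array1 > 5)[0]

print(between, outside, not_large, idx)
```

Python's `and` / `or` keywords do not work element-wise on arrays, which is why the bitwise operators are used here.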

bincount

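Counts how many times each value occurs in an array of non-negative integers. Index i of the result holds the count of value i.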

>>> a1 = np.array([0, 1, 1, 0, 1, 0, 0, 0])

>>> np.bincount(a1)
array([5, 3])
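
As a small sketch of two optional parameters, `minlength` pads the result to a fixed number of bins and `weights` sums a weight per value instead of counting. The weight array `w` below is made up for illustration.

```python
import numpy as np

a1 = np.array([0, 1, 1, 0, 1, 0, 0, 0])

# Plain counts: index i holds the number of occurrences of value i.
counts = np.bincount(a1)

# Force at least 4 bins, even though only the values 0 and 1 occur.
padded = np.bincount(a1, minlength=4)

# Sum the (hypothetical) weights per value instead of counting.
w = np.array([0.5, 1.0, 1.0, 0.5, 1.0, 0.5, 0.5, 0.5])
weighted = np.bincount(a1, weights=w)

print(counts, padded, weighted)
```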

arange, linspace

Creating NumPy arrays

>>> np.arange(5)
array([0, 1, 2, 3, 4])

>>> np.linspace(-3, 3, 5)
array([-3. , -1.5,  0. ,  1.5,  3. ])
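
The main difference between the two is that `arange` takes a step and excludes the stop value, while `linspace` takes a number of points and includes the stop value by default. A short sketch with illustrative values:

```python
import numpy as np

# arange(start, stop, step): stop is excluded.
stepped = np.arange(0, 10, 2)

# linspace(start, stop, num): stop is included by default.
# retstep=True also returns the spacing between points.
points, step = np.linspace(0, 1, 5, retstep=True)

# endpoint=False excludes the stop value, like arange.
open_end = np.linspace(0, 1, 5, endpoint=False)

print(stepped, points, step, open_end)
```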

reshape

Changing the shape (number of dimensions) of an array

>>> line = np.linspace(-3, 3, 10)

# 1-D array
>>> line
array([-3.        , -2.33333333, -1.66666667, -1.        , -0.33333333,
        0.33333333,  1.        ,  1.66666667,  2.33333333,  3.        ])

# Convert to a 2-D array (10 rows, 1 column)
>>> line.reshape(10, 1)
array([[-3.        ],
       [-2.33333333],
       [-1.66666667],
       [-1.        ],
       [-0.33333333],
       [ 0.33333333],
       [ 1.        ],
       [ 1.66666667],
       [ 2.33333333],
       [ 3.        ]])

# One of the dimensions may be -1; it is inferred from the other dimension
>>> line.reshape(-1, 1)
array([[-3.        ],
       [-2.33333333],
       [-1.66666667],
       [-1.        ],
       [-0.33333333],
       [ 0.33333333],
       [ 1.        ],
       [ 1.66666667],
       [ 2.33333333],
       [ 3.        ]])
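
The sketch below checks the resulting shapes and shows what happens when the requested shape does not divide the number of elements evenly. The names `column`, `grid`, and `flat` are illustrative.

```python
import numpy as np

line = np.linspace(-3, 3, 10)

column = line.reshape(-1, 1)   # 10 elements, 1 column -> 10 rows inferred
grid = line.reshape(2, -1)     # 2 rows -> 5 columns inferred
flat = column.ravel()          # back to a 1-D array

# 10 elements cannot be split into 3 equal rows, so this raises ValueError.
try:
    line.reshape(3, -1)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(column.shape, grid.shape, flat.shape, reshape_failed)
```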

Arithmetic on NumPy arrays

>>> x = np.array([1, 2, 3])
>>> x.mean()
2.0

>>> xc = x - x.mean()
>>> xc
array([-1.,  0.,  1.])
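
Subtracting a scalar from an array is applied to every element (broadcasting). As one possible extension of the centering above, the sketch below also divides by the standard deviation to standardize the values. Note that `np.ndarray.std` uses `ddof=0` (population standard deviation) by default.

```python
import numpy as np

x = np.array([1, 2, 3])

# Center: the scalar mean is broadcast against every element.
xc = x - x.mean()

# Standardize: divide the centered values by the standard deviation (ddof=0).
z = xc / x.std()

print(xc, z, z.mean(), z.std())
```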

Specifying axis

For a 3-D array:

  • axis=0: layer (depth)
  • axis=1: row
  • axis=2: column

>>> b = np.array([[[0, 1], [2, 3], [4, 5]]] * 2)
>>> b
array([[[0, 1],
        [2, 3],
        [4, 5]],

       [[0, 1],
        [2, 3],
        [4, 5]]])
>>> b.shape
(2, 3, 2)

>>> b.sum(axis=0)
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])

>>> b.sum(axis=1)
array([[6, 9],
       [6, 9]])

>>> b.sum(axis=2)
array([[1, 5, 9],
       [1, 5, 9]])
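
One way to remember this: the axis passed to `sum` is the one that disappears from the shape. The sketch below rebuilds `b` with the values shown above and checks the shapes; `keepdims=True` keeps the summed axis with length 1.

```python
import numpy as np

# Same values as b above, shape (2, 3, 2).
b = np.array([[[0, 1], [2, 3], [4, 5]]] * 2)

sum0 = b.sum(axis=0)   # sums over the 2 layers  -> shape (3, 2)
sum1 = b.sum(axis=1)   # sums over the 3 rows    -> shape (2, 2)
sum2 = b.sum(axis=2)   # sums over the 2 columns -> shape (2, 3)

# keepdims=True keeps the reduced axis as a length-1 dimension.
kept = b.sum(axis=1, keepdims=True)

print(sum0.shape, sum1.shape, sum2.shape, kept.shape)
```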

Pandas

Reading a CSV

Loading and inspecting

>>> import pandas as pd
>>> df = pd.read_csv('housing.csv')
>>> df.head()
        x1    x2    x3  x4     x5     x6    x7      x8  x9  x10   x11     x12   x13     y
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  45.8  6.0622   3  222  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  54.2  6.0622   3  222  18.7  396.90  5.33  36.2
>>> len(df)
506
>>> df.describe()
               x1          x2          x3          x4          x5     ...             x10         x11         x12         x13           y
count  506.000000  506.000000  506.000000  506.000000  506.000000     ...      506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779    0.069170    0.554695     ...      408.237154   18.455534  356.674032   12.653063   22.532806
std      8.601545   23.322453    6.860353    0.253994    0.115878     ...      168.537116    2.164946   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000    0.000000    0.385000     ...      187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000    0.000000    0.449000     ...      279.000000   17.400000  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000    0.000000    0.538000     ...      330.000000   19.050000  391.440000   11.360000   21.200000
75%      3.677082   12.500000   18.100000    0.000000    0.624000     ...      666.000000   20.200000  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000    1.000000    0.871000     ...      711.000000   22.000000  396.900000   37.970000   50.000000
>>> df.shape
(506, 14)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
x1     506 non-null float64
x2     506 non-null float64
x3     506 non-null float64
x4     506 non-null int64
x5     506 non-null float64
x6     506 non-null float64
x7     506 non-null float64
x8     506 non-null float64
x9     506 non-null int64
x10    506 non-null int64
x11    506 non-null float64
x12    506 non-null float64
x13    506 non-null float64
y      506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB

Retrieving data

Selecting columns

  • A single label (1-D indexing) returns a Series
>>> type(df['x1'])
<class 'pandas.core.series.Series'>

>>> df['x1']
0       0.00632
1       0.02731
         ...   
504     0.10959
505     0.04741
Name: x1, Length: 506, dtype: float64
  • A list of labels (2-D indexing) returns a DataFrame
>>> type(df[['x1']])
<class 'pandas.core.frame.DataFrame'>

>>> df[['x1']]
           x1
0     0.00632
1     0.02731
..        ...
504   0.10959
505   0.04741

[506 rows x 1 columns]

# Multiple columns
>>> df[['x1', 'x2']]
           x1    x2
0     0.00632  18.0
1     0.02731   0.0
..        ...   ...
504   0.10959   0.0
505   0.04741   0.0

[506 rows x 2 columns]

Selecting rows

>>> df[500:]
          x1   x2     x3  x4     x5     x6    x7      x8  x9  x10   x11     x12    x13     y
500  0.22438  0.0   9.69   0  0.585  6.027  79.7  2.4982   6  391  19.2  396.90  14.33  16.8
501  0.06263  0.0  11.93   0  0.573  6.593  69.1  2.4786   1  273  21.0  391.99   9.67  22.4
502  0.04527  0.0  11.93   0  0.573  6.120  76.7  2.2875   1  273  21.0  396.90   9.08  20.6
503  0.06076  0.0  11.93   0  0.573  6.976  91.0  2.1675   1  273  21.0  396.90   5.64  23.9
504  0.10959  0.0  11.93   0  0.573  6.794  89.3  2.3889   1  273  21.0  393.45   6.48  22.0
505  0.04741  0.0  11.93   0  0.573  6.030  80.8  2.5050   1  273  21.0  396.90   7.88  11.9

>>> df[:5]
        x1    x2    x3  x4     x5     x6    x7      x8  x9  x10   x11     x12   x13     y
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  45.8  6.0622   3  222  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  54.2  6.0622   3  222  18.7  396.90  5.33  36.2

>>> df[100:150:10]
          x1   x2     x3  x4     x5     x6    x7      x8  x9  x10   x11     x12    x13     y
100  0.14866  0.0   8.56   0  0.520  6.727  79.9  2.7778   5  384  20.9  394.76   9.42  27.5
110  0.10793  0.0   8.56   0  0.520  6.195  54.4  2.7778   5  384  20.9  393.49  13.00  21.7
120  0.06899  0.0  25.65   0  0.581  5.870  69.7  2.2577   2  188  19.1  389.15  14.37  22.0
130  0.34006  0.0  21.89   0  0.624  6.458  98.9  2.1185   4  437  21.2  395.04  12.60  19.2
140  0.29090  0.0  21.89   0  0.624  6.174  93.6  1.6119   4  437  21.2  388.08  24.16  14.0

Selecting rows and columns

By label (loc)

>>> df.loc[10:15, ['x1', 'x2']]
         x1    x2
10  0.22489  12.5
11  0.11747  12.5
12  0.09378  12.5
13  0.62976   0.0
14  0.63796   0.0
15  0.62739   0.0

By integer position (iloc)

>>> df.iloc[10:15, 0:2]
         x1    x2
10  0.22489  12.5
11  0.11747  12.5
12  0.09378  12.5
13  0.62976   0.0
14  0.63796   0.0

>>> df.iloc[:, :-1]  # all rows, every column except the last
           x1    x2     x3  x4     x5     x6     x7      x8  x9  x10   x11     x12    x13
0     0.00632  18.0   2.31   0  0.538  6.575   65.2  4.0900   1  296  15.3  396.90   4.98
1     0.02731   0.0   7.07   0  0.469  6.421   78.9  4.9671   2  242  17.8  396.90   9.14
2     0.02729   0.0   7.07   0  0.469  7.185   61.1  4.9671   2  242  17.8  392.83   4.03
..        ...   ...    ...  ..    ...    ...    ...     ...  ..  ...   ...     ...    ...
504   0.10959   0.0  11.93   0  0.573  6.794   89.3  2.3889   1  273  21.0  393.45   6.48
505   0.04741   0.0  11.93   0  0.573  6.030   80.8  2.5050   1  273  21.0  396.90   7.88

[506 rows x 13 columns]
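
Note that in the outputs above `loc[10:15]` returned six rows while `iloc[10:15]` returned five: label slices include the end label, position slices exclude the end position. The sketch below shows this on a small made-up DataFrame whose index labels differ from the positions.

```python
import pandas as pd

# Hypothetical data; the index labels start at 10, positions start at 0.
df = pd.DataFrame(
    {"x1": [0.1, 0.2, 0.3, 0.4], "x2": [1.0, 2.0, 3.0, 4.0]},
    index=[10, 11, 12, 13],
)

by_label = df.loc[10:12, ["x1"]]   # labels 10, 11, 12 (end label included)
by_pos = df.iloc[0:2, 0:1]         # positions 0, 1 (end position excluded)

print(by_label)
print(by_pos)
```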

Filtering by condition

This works basically the same way as in NumPy.

>>> df['x1'] > 50
0      False
1      False
       ...  
504    False
505    False
Name: x1, Length: 506, dtype: bool

>>> df[df['x1'] > 50]
          x1   x2    x3  x4     x5     x6     x7      x8  x9  x10   x11     x12    x13     y
380  88.9762  0.0  18.1   0  0.671  6.968   91.9  1.4165  24  666  20.2  396.90  17.21  10.4
405  67.9208  0.0  18.1   0  0.693  5.683  100.0  1.4254  24  666  20.2  384.97  22.98   5.0
410  51.1358  0.0  18.1   0  0.597  5.757  100.0  1.4130  24  666  20.2    2.60  10.11  15.0
418  73.5341  0.0  18.1   0  0.679  5.957  100.0  1.8026  24  666  20.2   16.45  20.62   8.8

>>> df[(df['x1'] > 50) & (df['x6'] > 6)]
          x1   x2    x3  x4     x5     x6    x7      x8  x9  x10   x11    x12    x13     y
380  88.9762  0.0  18.1   0  0.671  6.968  91.9  1.4165  24  666  20.2  396.9  17.21  10.4

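However, the condition must be passed as a Series (1-D).

Passing a DataFrame (2-D) does not filter rows; it masks individual cells instead.

Difference between Series and DataFrame

The examples in this section use a different DataFrame, one holding international football match results.
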
>>> df.head(3)
         date home_team away_team  home_score  away_score tournament     city   country  neutral
0  1872-11-30  Scotland   England           0           0   Friendly  Glasgow  Scotland    False
1  1873-03-08   England  Scotland           4           2   Friendly   London   England    False
2  1874-03-07  Scotland   England           2           1   Friendly  Glasgow  Scotland    False

# 1-D indexing returns a Series, 2-D indexing returns a DataFrame
>>> type(df['home_team'])
<class 'pandas.core.series.Series'>

>>> type(df[['home_team']])
<class 'pandas.core.frame.DataFrame'>

>>> type(df[['home_team', 'away_team']])
<class 'pandas.core.frame.DataFrame'>


>>> type(df['home_team'] == 'England')
<class 'pandas.core.series.Series'>

>>> type(df[['home_team']] == 'England')
<class 'pandas.core.frame.DataFrame'>

>>> type(df[['home_team', 'away_team']] == 'England')
<class 'pandas.core.frame.DataFrame'>


# How each is displayed
# Series
>>> df['home_team'] == 'England'
0        False
1         True
2        False
         ...  
39005    False
39006    False
39007    False
Name: home_team, Length: 39008, dtype: bool

# DataFrame
>>> df[['home_team']] == 'England'
       home_team
0          False
1           True
2          False
...          ...
39005      False
39006      False
39007      False

[39008 rows x 1 columns]

# DataFrame
>>> df[['home_team', 'away_team']] == 'England'
       home_team  away_team
0          False       True
1           True      False
2          False       True
...          ...        ...
39005      False      False
39006      False      False
39007      False      False

[39008 rows x 2 columns]

Using each as a condition

# Series (works as intended)
>>> df[df['home_team'] == 'England']
             date home_team         away_team  home_score  away_score                    tournament            city  country  neutral
1      1873-03-08   England          Scotland           4           2                      Friendly          London  England    False
3      1875-03-06   England          Scotland           2           2                      Friendly          London  England    False
...           ...       ...               ...         ...         ...                           ...             ...      ...      ...
38881  2018-03-27   England             Italy           1           1                      Friendly          London  England    False
38981  2018-06-02   England           Nigeria           2           1                      Friendly          London  England    False

[480 rows x 9 columns]

# DataFrame (rows are not filtered, and every cell that does not match becomes NaN)
>>> df[df[['home_team']] == 'England']
      date home_team away_team  home_score  away_score tournament city country  neutral
0      NaN       NaN       NaN         NaN         NaN        NaN  NaN     NaN      NaN
1      NaN   England       NaN         NaN         NaN        NaN  NaN     NaN      NaN
2      NaN       NaN       NaN         NaN         NaN        NaN  NaN     NaN      NaN
...    ...       ...       ...         ...         ...        ...  ...     ...      ...
39006  NaN       NaN       NaN         NaN         NaN        NaN  NaN     NaN      NaN
39007  NaN       NaN       NaN         NaN         NaN        NaN  NaN     NaN      NaN

[39008 rows x 9 columns]
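
The difference can be reproduced on a few rows. The sketch below uses a three-row DataFrame built from the first matches shown above (columns trimmed for brevity); `row_mask` and `cell_mask` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "home_team": ["Scotland", "England", "Scotland"],
    "away_team": ["England", "Scotland", "England"],
    "home_score": [0, 4, 2],
})

row_mask = df["home_team"] == "England"      # Series: one boolean per row
cell_mask = df[["home_team"]] == "England"   # DataFrame: one boolean per cell

rows = df[row_mask]     # keeps only the matching rows
cells = df[cell_mask]   # keeps the shape; non-matching cells become NaN

print(rows)
print(cells)
```

With a Series the mask selects rows; with a DataFrame it is applied cell by cell, and columns absent from the mask are treated as not matching.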

any

  • On a boolean DataFrame, any can be used to select rows where at least one of the columns is True
  • Calling any(axis=1) reduces the DataFrame to a Series

# Without any
>>> type(df[['home_team', 'away_team']] == 'England')
<class 'pandas.core.frame.DataFrame'>

>>> df[['home_team', 'away_team']] == 'England'
       home_team  away_team
0          False       True
1           True      False
2          False       True
...          ...        ...
39005      False      False
39006      False      False
39007      False      False

[39008 rows x 2 columns]

# With any
>>> type((df[['home_team', 'away_team']] == 'England').any(axis=1))
<class 'pandas.core.series.Series'>

>>> (df[['home_team', 'away_team']] == 'England').any(axis=1)
0         True
1         True
2         True
         ...  
39005    False
39006    False
39007    False
Length: 39008, dtype: bool

# Filtering rows with any
>>> df[(df[['home_team', 'away_team']] == 'England').any(axis=1)]
             date         home_team         away_team  home_score  away_score                    tournament           city      country  neutral
0      1872-11-30          Scotland           England           0           0                      Friendly        Glasgow     Scotland    False
1      1873-03-08           England          Scotland           4           2                      Friendly         London      England    False
...           ...               ...               ...         ...         ...                           ...            ...          ...      ...
38881  2018-03-27           England             Italy           1           1                      Friendly         London      England    False
38981  2018-06-02           England           Nigeria           2           1                      Friendly         London      England    False

[976 rows x 9 columns]
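
`all` is the counterpart of `any`: it is True only when every column in the row matches. A minimal sketch on made-up rows (the fourth match is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "home_team": ["Scotland", "England", "Scotland", "Wales"],
    "away_team": ["England", "Scotland", "England", "Ireland"],
})

mask = df[["home_team", "away_team"]] == "England"

either = mask.any(axis=1)   # True if England is the home OR the away team
both = mask.all(axis=1)     # True only if both columns equal 'England'

england_games = df[either]

print(either.tolist(), both.tolist())
print(england_games)
```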

Filtering with query

>>> df.query("home_team == 'Japan' | away_team == 'Japan'")
             date             home_team             away_team   ...                 city               country neutral
443    1917-05-07                 Japan           Philippines   ...                Tokyo                 Japan   False
571    1921-05-30                 Japan           Philippines   ...             Shanghai                 China    True
...           ...                   ...                   ...   ...                  ...                   ...     ...
38916  2018-03-27               Ukraine                 Japan   ...                Liège               Belgium    True
38957  2018-05-30                 Japan                 Ghana   ...             Yokohama                 Japan   False

[597 rows x 9 columns]
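
Inside a query string, a Python variable can be referenced with an `@` prefix, and `or` / `and` can be used in place of `|` / `&`. The sketch below, on made-up rows, checks that the query gives the same result as the equivalent boolean mask; `team` is an illustrative variable.

```python
import pandas as pd

df = pd.DataFrame({
    "home_team": ["Japan", "Ukraine", "Scotland"],
    "away_team": ["Ghana", "Japan", "England"],
})

team = "Japan"

# @team refers to the Python variable defined above.
by_query = df.query("home_team == @team or away_team == @team")

# Equivalent boolean-mask form.
by_mask = df[(df["home_team"] == team) | (df["away_team"] == team)]

print(by_query)
```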

Basic statistics

Mean (mean)

>>> df.mean()
x1       3.613524
x2      11.363636
x3      11.136779
x4       0.069170
x5       0.554695
x6       6.284634
x7      68.574901
x8       3.795043
x9       9.549407
x10    408.237154
x11     18.455534
x12    356.674032
x13     12.653063
y       22.532806
dtype: float64

>>> df[['x1', 'x2']].mean()
x1     3.613524
x2    11.363636
dtype: float64

Standard deviation (std)

>>> df.std()
x1       8.601545
x2      23.322453
x3       6.860353
x4       0.253994
x5       0.115878
x6       0.702617
x7      28.148861
x8       2.105710
x9       8.707259
x10    168.537116
x11      2.164946
x12     91.294864
x13      7.141062
y        9.197104
dtype: float64

Maximum (max)

>>> df
    a   b   c   d
0  11  12  13  14
1  21  22  23  24
2  31  32  33  34

>>> df.max()
a    31
b    32
c    33
d    34
dtype: int64

>>> df.max(axis=1)
0    14
1    24
2    34
dtype: int64

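Arithmetic on DataFrame columns

Subtracting df.mean() (a Series indexed by column name) subtracts each column's mean from that column.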

>>> df.head(3)
        x1    x2    x3  x4     x5     x6    x7      x8  x9  x10   x11     x12   x13     y
0  0.00632  18.0  2.31   0  0.538  6.575  65.2  4.0900   1  296  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  78.9  4.9671   2  242  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  61.1  4.9671   2  242  17.8  392.83  4.03  34.7

>>> df.mean()
x1       3.613524
x2      11.363636
x3      11.136779
x4       0.069170
x5       0.554695
x6       6.284634
x7      68.574901
x8       3.795043
x9       9.549407
x10    408.237154
x11     18.455534
x12    356.674032
x13     12.653063
y       22.532806
dtype: float64

>>> df_c = df - df.mean()
>>> df_c.head(3)
         x1         x2        x3       x4        x5        x6         x7        x8        x9         x10       x11        x12       x13          y
0 -3.607204   6.636364 -8.826779 -0.06917 -0.016695  0.290366  -3.374901  0.294957 -8.549407 -112.237154 -3.155534  40.225968 -7.673063   1.467194
1 -3.586214 -11.363636 -4.066779 -0.06917 -0.085695  0.136366  10.325099  1.172057 -7.549407 -166.237154 -0.655534  40.225968 -3.513063  -0.932806
2 -3.586234 -11.363636 -4.066779 -0.06917 -0.085695  0.900366  -7.474901  1.172057 -7.549407 -166.237154 -0.655534  36.155968 -8.623063  12.167194
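
The subtraction works because pandas aligns the Series returned by `df.mean()` with the DataFrame's column labels. The sketch below, on a small made-up DataFrame, extends this to standardization and shows `sub(..., axis=0)` for the row-wise case, where the Series has to be aligned with the index instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

# Column-wise centering: df.mean() is aligned on the column labels.
centered = df - df.mean()

# Column-wise standardization (pandas std uses ddof=1 by default).
standardized = (df - df.mean()) / df.std()

# Row-wise centering: align the row means with the index via axis=0.
row_centered = df.sub(df.mean(axis=1), axis=0)

print(centered)
print(standardized)
print(row_centered)
```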

Adding new columns

# Before adding
>>> df.head()
         date home_team away_team  home_score  away_score tournament     city   country  neutral
0  1872-11-30  Scotland   England           0           0   Friendly  Glasgow  Scotland    False
1  1873-03-08   England  Scotland           4           2   Friendly   London   England    False
2  1874-03-07  Scotland   England           2           1   Friendly  Glasgow  Scotland    False
3  1875-03-06   England  Scotland           2           2   Friendly   London   England    False
4  1876-03-04  Scotland   England           3           0   Friendly  Glasgow  Scotland    False

# Add the columns
>>> df['year'] = pd.to_numeric([date.split('-')[0] for date in df['date']])
>>> df['month'] = pd.to_numeric([date.split('-')[1] for date in df['date']])
>>> df['day'] = pd.to_numeric([date.split('-')[2] for date in df['date']])

# Check the result
>>> df.head()
         date home_team away_team  home_score  away_score tournament     city   country  neutral  year  month  day
0  1872-11-30  Scotland   England           0           0   Friendly  Glasgow  Scotland    False  1872     11   30
1  1873-03-08   England  Scotland           4           2   Friendly   London   England    False  1873      3    8
2  1874-03-07  Scotland   England           2           1   Friendly  Glasgow  Scotland    False  1874      3    7
3  1875-03-06   England  Scotland           2           2   Friendly   London   England    False  1875      3    6
4  1876-03-04  Scotland   England           3           0   Friendly  Glasgow  Scotland    False  1876      3    4
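
An alternative to splitting the strings by hand is to parse the column with `pd.to_datetime` and read the parts through the `.dt` accessor. This is a sketch on two made-up rows and assumes every date follows the `YYYY-MM-DD` format shown above.

```python
import pandas as pd

df = pd.DataFrame({"date": ["1872-11-30", "1873-03-08"]})

# Parse the strings once, then extract the components.
dates = pd.to_datetime(df["date"], format="%Y-%m-%d")
df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day"] = dates.dt.day

print(df)
```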

Sorting

>>> df.sort_values('home_score', ascending=False).head()
             date  home_team       away_team  home_score  away_score                    tournament           city    country  neutral  year  month  day
23569  2001-04-11  Australia  American Samoa          31           0  FIFA World Cup qualification  Coffs Harbour  Australia    False  2001      4   11
10860  1979-08-30       Fiji        Kiribati          24           0           South Pacific Games        Nausori       Fiji    False  1979      8   30
23566  2001-04-09  Australia           Tonga          22           0  FIFA World Cup qualification  Coffs Harbour  Australia    False  2001      4    9
22344  2000-02-14     Kuwait          Bhutan          20           0   AFC Asian Cup qualification    Kuwait City     Kuwait    False  2000      2   14
22257  2000-01-26      China            Guam          19           0   AFC Asian Cup qualification          Hanoi    Vietnam     True  2000      1   26

Replacing values

# Replace True with 1
>>> df.replace({True: 1})
             date         home_team         away_team  home_score  away_score ...        country neutral  year month  day
0      1872-11-30          Scotland           England           0           0 ...       Scotland   False  1872    11   30
1      1873-03-08           England          Scotland           4           2 ...        England   False  1873     3    8
...           ...               ...               ...         ...         ... ...            ...     ...   ...   ...  ...
39006  2018-06-04           Armenia           Moldova           0           0 ...        Austria       1  2018     6    4
39007  2018-06-04             India             Kenya           3           0 ...          India   False  2018     6    4

[39008 rows x 12 columns]
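
If the goal is to turn a boolean column into 0/1 integers, converting that one column with `astype(int)` may be a more explicit option than replacing values across the whole DataFrame. A sketch on made-up rows; the column name `neutral_int` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"neutral": [False, True, False], "home_score": [0, 1, 3]})

# Convert only the boolean column: False -> 0, True -> 1.
df["neutral_int"] = df["neutral"].astype(int)

print(df)
```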

Handling missing values

Drop every row that contains at least one missing value (dropna)

>>> df.dropna()
           x1    x2     x3  x4     x5     x6     x7      x8  x9  x10   x11     x12    x13     y
0     0.00632  18.0   2.31   0  0.538  6.575   65.2  4.0900   1  296  15.3  396.90   4.98  24.0


[506 rows x 14 columns]

Fill missing values (fillna)

>>> df.fillna(0)
           x1    x2     x3  x4     x5     x6     x7      x8  x9  x10   x11     x12    x13     y
0     0.00632  18.0   2.31   0  0.538  6.575   65.2  4.0900   1  296  15.3  396.90   4.98  24.0

[506 rows x 14 columns]

Select rows that have a missing value (isnull)

>>> df[df['x1'].isnull()]
Empty DataFrame
Columns: [x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, y]
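
The housing data above has no missing values, so the three calls leave it unchanged. The sketch below uses a small made-up DataFrame that does contain NaN to make the effect visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, np.nan, 3.0], "x2": [np.nan, 5.0, 6.0]})

dropped = df.dropna()                 # keeps only rows without any NaN
filled_zero = df.fillna(0)            # replaces NaN with a constant
filled_mean = df.fillna(df.mean())    # replaces NaN with each column's mean
missing_x1 = df[df["x1"].isnull()]    # rows where x1 is missing
n_missing = df.isnull().sum()         # number of missing values per column

print(dropped)
print(filled_mean)
print(n_missing)
```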

Applying functions to elements, rows, and columns

Apply to each element of a Series (map)

# DataFrame
>>> df
    a   b   c   d
0  11  12  13  14
1  21  22  23  24
2  31  32  33  34
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

# Series
>>> df['a']
0    11
1    21
2    31
Name: a, dtype: int64
>>> type(df['a'])
<class 'pandas.core.series.Series'>

# map applies a function to each element of a Series
>>> df['a'].map(lambda x: '[{}]'.format(x))
0    [11]
1    [21]
2    [31]
Name: a, dtype: object

A dict can also be passed to map (used much like replace). Values with no matching key become NaN.

>>> df['a']
0    11
1    21
2    31
Name: a, dtype: int64

>>> df['a'].map({11:'one',21:'two'})
0    one
1    two
2    NaN
Name: a, dtype: object
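
If the NaN for unmapped values is not wanted, one option is to chain `fillna` with a default. A minimal sketch; the default label `'other'` is made up for illustration.

```python
import pandas as pd

s = pd.Series([11, 21, 31], name="a")

# Values without a key in the dict become NaN ...
mapped = s.map({11: "one", 21: "two"})

# ... which can then be replaced with a default label.
with_default = mapped.fillna("other")

print(with_default)
```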

Apply to each row or column of a DataFrame (apply)

# DataFrame
>>> df
    a   b   c   d
0  11  12  13  14
1  21  22  23  24
2  31  32  33  34
>>> type(df)
<class 'pandas.core.frame.DataFrame'>

# apply applies a function to a DataFrame (column by column by default)
>>> df.apply(lambda x: max(x))
a    31
b    32
c    33
d    34
dtype: int64

# The result is a Series
>>> type(df.apply(lambda x: max(x)))
<class 'pandas.core.series.Series'>
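
By default `apply` passes each column to the function; with `axis=1` it passes each row instead. The sketch below rebuilds the small DataFrame shown above; `col_max` and `row_range` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [11, 21, 31],
    "b": [12, 22, 32],
    "c": [13, 23, 33],
    "d": [14, 24, 34],
})

# Default (axis=0): the function receives one column at a time.
col_max = df.apply(lambda col: col.max())

# axis=1: the function receives one row at a time.
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

print(col_max)
print(row_range)
```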