The ABCs of NumPy

Last updated at 2018-03-15 · Posted at 2018-03-15

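I am currently studying on DataQuest, a site like Codecademy for data scientists, so this is a record and a memo to myself. Please forgive the rough mix of English and Japanese. If you spot any mistakes or misunderstandings, I would appreciate a comment.

Basics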

script.py

import numpy as np

vector = np.array([10, 20, 30])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

vector_shape = vector.shape
matrix_shape = matrix.shape

print(vector_shape, matrix_shape)
# (3,) (3, 3) -> (number of elements,), (rows, columns)

np.array() builds a NumPy array (a vector or a matrix) from a Python list.
np.genfromtxt("file.csv", delimiter=",") reads a delimited text file into a NumPy array.
arr.shape returns a tuple describing the size of each dimension of the array arr.

script.py

import numpy as np
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",")
world_alcohol_dtype = world_alcohol.dtype
print(world_alcohol_dtype)

The concept of nan

Meaning

nan stands for "not a number" and is used to represent a missing value.
It plays a role similar to None in Python, although nan is itself a float value.

Because all of the values in a NumPy array have to have the same data type, NumPy attempted to convert all of the columns to floats when they were read in. The numpy.genfromtxt() function will attempt to guess the correct data type of the array it creates.

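All values in a NumPy array are expected to share the same data type. For example, when genfromtxt reads a file with its default float dtype, any field that cannot be parsed as a number, such as "foo", becomes nan. (Passing a mixed list such as [1, 2, 3, "foo"] directly to np.array does not produce nan; it converts every element to a string.)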
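
A minimal sketch of where nan comes from, assuming a small in-memory CSV. The `csv_text` sample is made up for illustration and only stands in for world_alcohol.csv. Fields that cannot be parsed as floats come back as nan, whereas a mixed Python list handed to np.array ends up as strings.

```python
import io
import numpy as np

# Hypothetical sample data; not the real world_alcohol.csv.
csv_text = "1986,Wine,0.5\n1987,foo,1.5\n"

# With the default float dtype, text fields cannot be converted and become nan.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(data.dtype)
print(np.isnan(data[:, 1]))  # the text column is nan in every row

# A mixed list given to np.array is converted to strings instead.
mixed = np.array([1, 2, 3, "foo"])
print(mixed.dtype.kind)      # "U" means a Unicode string dtype
```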

script.py

array([[             nan,              nan,              nan,              nan,              nan],
       [  1.98600000e+03,              nan,              nan,              nan,   0.00000000e+00],

Converting and retrieving data

Specifying the data type

To specify the data type for the entire NumPy array, we use the keyword argument dtype and set it to "U75". This specifies that we want to read in each value as a Unicode string of up to 75 characters.

So a data type can also fix the maximum length of each element in the array.

world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",", dtype="U75")

### Skipping the header

Add the skip_header=[# of lines to skip] parameter.

The skip_header parameter accepts an integer value, specifying the number of lines from the top of the file we want NumPy to ignore.

When reading data, you can optionally skip the first lines. In spreadsheet-style files the first row usually holds column names or categories rather than data, so this looks useful when loading data.

world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",", dtype="U75", skip_header=1)
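
To see the effect of dtype="U75" and skip_header together, here is a small sketch that reads an in-memory CSV instead of a file. The sample rows and column names are assumptions made up for this example; only the genfromtxt arguments come from the text above.

```python
import io
import numpy as np

# Hypothetical sample with a header line; stands in for world_alcohol.csv.
csv_text = (
    "Year,Region,Country,Beverage,Value\n"
    "1986,Americas,Canada,Wine,1.5\n"
    "1989,Africa,Algeria,Beer,0.25\n"
)

# Without skip_header the header line is read as the first data row.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", dtype="U75")
# With skip_header=1 the first line is ignored.
no_header = np.genfromtxt(
    io.StringIO(csv_text), delimiter=",", dtype="U75", skip_header=1
)

print(with_header.shape, no_header.shape)
print(no_header[0])
```

Because every value is read as a string, numeric columns such as the last one still need an explicit conversion before any arithmetic.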

### Retrieving data

matrix[row, column] where row = ↓ and column = →

script.py

>>> matrix = np.array([
...     [5, 10, 15],
...     [20, 25, 30]
... ])
>>> matrix[1, 2]
30

To select a whole column, use matrix[:, column].

This will select all of the rows, but only the column with index 1.

script.py

>>> matrix = np.array([
...     [5, 10, 15],
...     [20, 25, 30],
...     [35, 40, 45]
... ])
>>> matrix[:, 1]
array([10, 25, 40])

script.py

>>> matrix = np.array([
...     [5, 10, 15],
...     [20, 25, 30],
...     [35, 40, 45]
... ])
>>> matrix[1:3, 0:2]
array([[20, 25],
       [35, 40]])
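
The three indexing forms used so far can be compared side by side. This is only a sketch; the variable names `single`, `column`, and `block` are introduced here for illustration. Note that the end of a slice is exclusive, so 1:3 covers indices 1 and 2.

```python
import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45],
])

single = matrix[1, 2]      # one element: row 1, column 2
column = matrix[:, 1]      # every row, column 1 -> 1-D array
block = matrix[1:3, 0:2]   # rows 1-2 and columns 0-1 -> 2-D array

print(single, column.shape, block.shape)
```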

To extract only the rows where a Boolean vector is True:

script.py

import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])
# Boolean vector: True where the second column equals 25
second_column_25 = (matrix[:, 1] == 25)
print(matrix[second_column_25, :])
# [[20 25 30]]

is_algeria_and_1986 = (world_alcohol[:, 0] == "1986") & (world_alcohol[:, 2] == "Algeria")
# is_algeria_and_1986 = [True, False, True, True...]
rows_with_algeria_and_1986 = world_alcohol[is_algeria_and_1986, :]
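
A self-contained version of the same filter, assuming a tiny made-up array in the same column order as world_alcohol (year, region, country, beverage, value). The `rows` data is hypothetical and only serves to make the sketch runnable.

```python
import numpy as np

# Hypothetical rows in the same column order as world_alcohol.
rows = np.array([
    ["1986", "Africa", "Algeria", "Wine", "0.1"],
    ["1986", "Americas", "Canada", "Beer", "2.0"],
    ["1989", "Africa", "Algeria", "Beer", "0.3"],
])

# Each comparison yields a Boolean vector; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (rows[:, 0] == "1986") & (rows[:, 2] == "Algeria")
selected = rows[mask, :]

print(mask)
print(selected)
```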

### Overwriting data

script.py

import numpy as np

vector = np.array([5, 10, 15, 20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
vector[equal_to_ten_or_five] = 50
print(vector)
# [50 50 15 20]

script.py

is_equal_to_1986 = world_alcohol[:, 0] == "1986"
world_alcohol[is_equal_to_1986, 0] = "2014"

is_equal_to_Wine = world_alcohol[:, 3] == "Wine"
world_alcohol[is_equal_to_Wine, 3] = "Grog"
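# Careful: the `0` on the second line must not be replaced with `:`.
# Using `:` would overwrite every column of the matching rows with "2014".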
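
The difference between indexing a single column and indexing with `:` during assignment can be checked on a small array. The `rows` data below is made up for this sketch; the two copies exist only so that both variants start from the same values.

```python
import numpy as np

# Hypothetical rows; columns are year, country, beverage.
rows = np.array([
    ["1986", "Canada", "Wine"],
    ["1989", "Canada", "Beer"],
])
is_1986 = rows[:, 0] == "1986"

only_year = rows.copy()
only_year[is_1986, 0] = "2014"   # changes column 0 of the matching rows

whole_row = rows.copy()
whole_row[is_1986, :] = "2014"   # changes every column of the matching rows

print(only_year[0])
print(whole_row[0])
```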

### Changing the data type

Note that astype converts the data type of every element in the array.

script.py

import numpy as np

vector = np.array(["1", "2", "3"])
vector = vector.astype(float)
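
Two details of astype are worth checking in a short sketch: it returns a new array rather than changing the original, and it fails on values that cannot be parsed, such as an empty string. The variable names here are illustrative only.

```python
import numpy as np

strings = np.array(["1", "2", "3"])
floats = strings.astype(float)   # returns a new array

# The original array still holds strings.
print(strings.dtype.kind, floats.dtype)

# An empty string cannot be parsed as a float, so astype raises ValueError.
try:
    np.array(["1", "", "3"]).astype(float)
    failed = False
except ValueError:
    failed = True
print(failed)
```

This is why the exercises further down replace empty strings with "0" before converting a column to float.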

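Computation

For a matrix (two or more dimensions), matrix.sum() on its own adds up every element. To get one sum per row or per column, pass the axis argument: axis=1 sums across each row, and axis=0 sums down each column.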

script.py

import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])
matrix.sum(axis=1)
# array([ 30,  75, 120])
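
For comparison, a short sketch of all three ways to call sum on the same matrix. The variable names are introduced here for illustration.

```python
import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45],
])

total = matrix.sum()               # every element added together
column_sums = matrix.sum(axis=0)   # collapse the rows: one sum per column
row_sums = matrix.sum(axis=1)      # collapse the columns: one sum per row

print(total, column_sums, row_sums)
```

One way to remember it: the axis you pass is the one that disappears from the result.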

# About number notation

Scientific notation is a way to condense how very large or very precise numbers are displayed. We can represent 100 in scientific notation as 1e+02. The e+02 indicates that we should multiply what comes before it by 10 ^ 2 (10 to the power 2, or 10 squared). This results in 1 * 100, or 100. Thus, 1.98600000e+03 is actually 1.986 * 10 ^ 3, or 1986. 1000000000000000 can be written as 1e+15.

It may be obvious, but writing out 10 multiplied by itself as many times as the number after e+ would be tedious, which is why this notation is used. It is nice that the course explains even this.
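
The notation can be checked directly in plain Python, since scientific-notation literals are ordinary floats. A small sketch:

```python
# Scientific-notation literals are ordinary floats.
hundred = 1e+02
year = 1.98600000e+03
big = 1e+15

print(hundred, year, big == 1000000000000000)

# Formatting a float with the "e" presentation type produces the same notation.
formatted = format(1986.0, ".8e")
print(formatted)
```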

# Working through some problems

## Example 1

Create a matrix called canada_1986 that only contains the rows in world_alcohol where the first column is the string 1986 and the third column is the string Canada.
Extract the fifth column of canada_1986, replace any empty strings ('') with the string 0, and convert the column to the float data type. Assign the result to canada_alcohol.
Compute the sum of canada_alcohol. Assign the result to total_canadian_drinking.

My original attempt:

q1.py

canada_1986 = (world_alcohol[:, 0] == "1986") & (world_alcohol[:, 2] == "Canada")
canada_1986 = world_alcohol[canada_1986, :]
is_empty = canada_1986[:, 4] == ''
canada_alcohol[is_empty] = "0"
canada_alcohol = canada_alcohol.astype(float)
total_canadian_drinking = canada_alcohol.sum()

Answer:

q1.py

canada_1986 = (world_alcohol[:, 0] == "1986") & (world_alcohol[:, 2] == "Canada")
canada_1986 = world_alcohol[canada_1986, :]
canada_alcohol = canada_1986[:, 4]
is_empty = canada_alcohol == ''
canada_alcohol[is_empty] = "0"
canada_alcohol = canada_alcohol.astype(float)
total_canadian_drinking = canada_alcohol.sum()

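Lessons learned

Stop at each step and define every intermediate variable properly. My attempt failed because I cut a corner and skipped the line canada_alcohol = canada_1986[:, 4], so canada_alcohol was never defined by the time I indexed it. The parentheses are not the cause: is_empty = canada_1986[:, 4] == '' and is_empty = (canada_1986[:, 4] == '') are equivalent, although the parenthesized form is easier to read.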

## Example 2

We've assigned the list of all countries to the variable countries.
Find the total consumption for each country in countries for the year 1989.
When you're finished, totals should contain all of the country names as keys, with the corresponding alcohol consumption totals for 1989 as values.

## Hint
When you want the total alcohol consumption of each country for a given year, the following framework looks like a useful reference.

Create an empty dictionary called totals.
Select only the rows in world_alcohol that match a given year. Assign the result to year.
Loop through a list of countries. For each country:
Select only the rows from year that match the given country.
Assign the result to country_consumption.
Extract the fifth column from country_consumption.
Replace any empty string values in the column with the string 0.
Convert the column to the float data type.
Find the sum of the column.
Add the sum to the totals dictionary, with the country name as the key.

After the code executes, you'll have a dictionary containing all of the country names as keys, with the associated alcohol consumption totals as the values.

## My original attempt:

script.py

totals = {}
is_1989 = (world_alcohol[:,0] == "1989")
year = world_alcohol[is_1989]
for country in countries:
    is_matching_country = (year[:, 2] == country)
    country_consumption = year[is_matching_country]
    country_consumption = country_consumption[:, 4]
    is_empty = (country_consumption[:] == '')
    country_consumption[is_empty] = "0"
    country_consumption = country_consumption.astype(float)
    country_sum = country_consumption.sum()
    totals[country] = country_sum

## Answer:

script.py

totals = {}
is_year = world_alcohol[:,0] == "1989"
year = world_alcohol[is_year,:]

for country in countries:
    is_country = year[:,2] == country
    country_consumption = year[is_country,:]
    alcohol_column = country_consumption[:,4]
    is_empty = alcohol_column == ''
    alcohol_column[is_empty] = "0"
    alcohol_column = alcohol_column.astype(float)
    totals[country] = alcohol_column.sum()
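
The same steps can also be wrapped in a function and tried on a few rows without the CSV file. This is a sketch under assumptions: `totals_for_year` is a hypothetical helper name, and `sample` is made-up data in the same column order as world_alcohol.

```python
import numpy as np

def totals_for_year(data, countries, year):
    """Sum column 4 per country for one year (hypothetical helper)."""
    totals = {}
    year_rows = data[data[:, 0] == year, :]
    for country in countries:
        # Boolean indexing returns a copy, so edits here do not touch `data`.
        column = year_rows[year_rows[:, 2] == country, 4]
        column[column == ""] = "0"
        totals[country] = column.astype(float).sum()
    return totals

# Made-up sample rows in the same column order as world_alcohol.
sample = np.array([
    ["1989", "Americas", "Canada", "Wine", "1.5"],
    ["1989", "Americas", "Canada", "Beer", ""],
    ["1989", "Africa", "Algeria", "Beer", "0.25"],
    ["1986", "Africa", "Algeria", "Wine", "0.5"],
])
print(totals_for_year(sample, ["Canada", "Algeria"], "1989"))
```

Passing the year as a parameter makes it easy to reuse the loop for other years without editing the body.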

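## Lessons learned:
In my experience, most of my errors were careless mistakes. It is obvious, but for each line I should work out, slowly if necessary, which object it operates on, what processing it performs, what output I expect, and what the output actually is.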
