Overview
The Perceptron implementation is written in Python plus Cython (.pyx) and C, and you cannot really understand it without reading all of it. ChatGPT was completely useless here; it just kept making things up.
How to build
Cython sources such as _sgd_fast.pyx have to be generated; for some reason they are nowhere to be found on the web. I could not find a proper description of the build procedure either, so I improvised:
git clone https://github.com/scikit-learn/scikit-learn.git
pip install .
pip install meson-python
pip install ninja
pip install cython
make dev
This generates files such as build/cp310/sklearn/linear_model/_sgd_fast.pyx.
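As a quick sanity check (my own, assuming the build/cp310/... layout shown above; adjust the patterns if your Python version differs), you can list the generated sources used by the Perceptron/SGD path:

```python
# Sanity check (my own, assuming the meson build tree layout above):
# look for the generated Cython sources used by the Perceptron/SGD path.
from pathlib import Path

for pattern in ("**/linear_model/_sgd_fast.pyx",
                "**/utils/_seq_dataset.pyx",
                "**/utils/_weight_vector.pyx"):
    for p in Path("build").glob(pattern):
        print(p)
```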
Tracing the code of Perceptron.fit()
Below is a description of the chain of functions that starts at Perceptron.fit(): each function's role, its main inputs, and its main outputs. Since ChatGPT was no help at all, I traced almost all of this by hand.
Perceptron.fit()
- Role: effectively an alias for BaseSGDClassifier.fit() (Perceptron simply inherits it).
BaseSGDClassifier.fit() in sklearn/linear_model/_stochastic_gradient.py
- Role: trains the classifier using stochastic gradient descent (SGD).
- Main inputs: X, y, and other hyperparameters
- Main output: the trained model
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_stochastic_gradient.py#L902
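Since Perceptron only inherits this method, the easiest way to see the whole chain in action is to fit a tiny dataset and look at the resulting attributes. A minimal sketch of mine (assuming scikit-learn is installed):

```python
# Minimal sketch: Perceptron.fit() is the inherited BaseSGDClassifier.fit().
import numpy as np
from sklearn.linear_model import Perceptron

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, 1])

clf = Perceptron(max_iter=10, shuffle=False, random_state=0)
clf.fit(X, y)

print(type(clf).__mro__[:2])            # (Perceptron, BaseSGDClassifier): fit comes from the base
print(clf.coef_, clf.intercept_, clf.n_iter_)
```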
BaseSGDClassifier._fit()
- Role: runs the main loop of the training process, processing the data in batches.
- Main inputs: X, y, and other hyperparameters
- Main output: the trained model
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_stochastic_gradient.py#L670
BaseSGDClassifier._partial_fit()
- Role: branches between multiclass and binary classification.
- Main inputs: X, y, and other hyperparameters
- Main output: the trained model
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_stochastic_gradient.py#L574
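The effect of this branch is easy to observe from the outside: with two classes you get a single weight row (the binary path), with three or more you get one row per class (the one-vs-all multiclass path). A quick check of mine:

```python
# Sketch: the binary/multiclass branch in _partial_fit, observed via coef_.shape.
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.RandomState(0)
X = rng.rand(30, 4)
y2 = np.arange(30) % 2          # 2 classes -> binary path
y3 = np.arange(30) % 3          # 3 classes -> one-vs-all multiclass path

print(Perceptron().fit(X, y2).coef_.shape)   # (1, 4)
print(Perceptron().fit(X, y3).coef_.shape)   # (3, 4)
```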
BaseSGDClassifier._fit_binary()
- Role: calls fit_binary() and does the post-processing.
- Main inputs: X, y, and other hyperparameters
- Main output: the trained model
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_stochastic_gradient.py#L748
fit_binary() in sklearn/linear_model/_stochastic_gradient.py (a module-level function, not a method)
- Role: runs the training process for binary classification.
- Main inputs: X, y, and other hyperparameters
- Main output: the trained model
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_stochastic_gradient.py#L361
_plain_sgd64() in build/cp310/sklearn/linear_model/_sgd_fast.pyx
- Role: trains the model with the SGD algorithm implemented in Cython; the weight updates happen here. This is essentially the core of the learning.
- Main inputs: weights, intercept, average_weights, average_intercept, loss, penalty_type, alpha, C, l1_ratio, dataset, validation_mask, early_stopping, validation_score_cb, n_iter_no_change, max_iter, tol, fit_intercept, verbose, shuffle, seed, weight_pos, weight_neg, learning_rate, eta0, power_t, one_class, t, intercept_decay, average
- Main outputs: weights, intercept, average_weights, average_intercept, n_iter_
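To make the role of _plain_sgd64() concrete, here is a pure-Python sketch of what it boils down to under the Perceptron's defaults (constant learning rate with eta0=1.0, Hinge(threshold=0.0) loss, no penalty, no averaging). This is my own simplification for illustration, not the actual Cython code:

```python
# Pure-Python sketch of the _plain_sgd64 inner loop for Perceptron defaults.
import numpy as np

def plain_sgd_perceptron_sketch(X, y, max_iter=5, eta0=1.0):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    intercept = 0.0
    for epoch in range(max_iter):
        for i in range(n_samples):
            p = w @ X[i] + intercept             # p = w.dot(x) + intercept
            # Hinge(threshold=0.0): dloss = -y if p*y <= 0 else 0
            dloss = -y[i] if p * y[i] <= 0.0 else 0.0
            update = -eta0 * dloss               # update = -eta * dloss
            if update != 0.0:
                w += update * X[i]               # w.add(x_data, x_ind, xnnz, update)
                intercept += update              # fit_intercept branch
    return w, intercept
```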
SequentialDataset64.next() in _seq_dataset.pyx
- Role: fetches the next sample from the dataset.
- Main inputs: x_data_ptr, x_ind_ptr, nnz, y, sample_weight
- Main outputs: the next sample's feature values, feature indices, number of non-zero elements, target value, and sample weight
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_seq_dataset.pyx.tp#L96C14-L96C21
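These "inputs" are really pointers that next() fills in rather than values it consumes. You can poke at this from Python through the private _next_py() test helper visible in the dump below; this is my own experiment against private classes, so it may break across versions:

```python
# Sketch: fetching one sample through ArrayDataset64 via the _next_py() helper.
import numpy as np
from sklearn.utils._seq_dataset import ArrayDataset64

X = np.ascontiguousarray([[0.0, 2.0], [3.0, 0.0]], dtype=np.float64)
y = np.array([1.0, -1.0])
sw = np.ones(2)

ds = ArrayDataset64(X, y, sw, seed=1)
(x_data, x_ind, x_indptr), yi, wi, idx = ds._next_py()
print(x_data, x_ind, yi, wi, idx)   # dense path: x_ind = [0 1], nnz == n_features
```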
_sample() in sklearn/utils/_seq_dataset.pyx.tp
On the Cython _seq_dataset.pyx side the base-class _sample() is just a pass; the actual code appears to live on the template side, in the concrete dataset classes generated from the .pyx.tp. It is a very simple routine that just adds an offset to the data position to reach the next sample.
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_seq_dataset.pyx.tp#L336
WeightVector64.add() in _weight_vector.pyx
- Role: applies an update to the weight vector.
- Main inputs: x_data_ptr, x_ind_ptr, xnnz, update
- Main output: the updated weight vector
https://visual2.cs.ovgu.de/pubres/lambda-dbscan/-/blob/main/sklearn/utils/_weight_vector.pyx.tp
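add() works together with the wscale trick: the true weight vector is wscale * w, so scale() only touches the scalar while add() compensates by dividing the update by wscale. A small Python sketch of mine of that idea (not the Cython class itself):

```python
# Sketch of the wscale representation used by WeightVector64.
import numpy as np

class TinyWeightVector:
    def __init__(self, w):
        self.w = np.asarray(w, dtype=float)
        self.wscale = 1.0

    def scale(self, c):
        # O(1): scaling the whole vector only touches the scalar.
        self.wscale *= c

    def add(self, x_data, x_ind, update):
        # w_true[idx] += x * update  <=>  w[idx] += x * (update / wscale)
        for val, idx in zip(x_data, x_ind):
            self.w[idx] += val * (update / self.wscale)

    def dense(self):
        return self.wscale * self.w
```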
Does the Perceptron update the weights ωi of zero-valued features?
Having read through the source, it looks like the weights ωi of zero-valued features are updated as well. The xnnz argument of WeightVector64.add() is documented as "The number of non-zero features of ``x``", but I could not find any code on the way there that filters X down to its non-zero features. In the end all of X is passed in, so on this path xnnz struck me as a misleading name. I imagine xnnz really does arrive filtered when the data comes in through other routes (for instance the sparse CSR dataset).
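This matches the two _sample() implementations in the dump below: the dense ArrayDataset64 always reports nnz = n_features (zeros included), while CSRDataset64 reports the true number of stored non-zeros. A quick check of mine, again using private scikit-learn classes, so treat it as an experiment:

```python
# Dense vs. CSR: what "nnz" actually means on each path.
import numpy as np
from scipy import sparse
from sklearn.utils._seq_dataset import ArrayDataset64, CSRDataset64

X = np.ascontiguousarray([[0.0, 5.0, 0.0]], dtype=np.float64)
y = np.array([1.0])
sw = np.ones(1)

dense = ArrayDataset64(X, y, sw, seed=1)
(xd, xi, _), *_ = dense._next_py()
print(len(xd), xi)          # 3, [0 1 2]: zero-valued features are included

Xcsr = sparse.csr_matrix(X)
csr = CSRDataset64(Xcsr.data, Xcsr.indptr, Xcsr.indices, y, sw, seed=1)
(xd, xi, _), *_ = csr._next_py()
print(len(xd), xi)          # 1, [1]: only the truly non-zero feature
```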
The Cython sources
Since I went to the trouble of building them, I am recording them here.
_sgd_fast.pyx
"""SGD implementation"""
import numpy as np
from time import time
from cython cimport floating
from libc.math cimport exp, fabs, isfinite, log, pow, INFINITY
from .._loss._loss cimport CyLossFunction
from ..utils._typedefs cimport uint32_t, uint8_t
from ..utils._weight_vector cimport WeightVector32, WeightVector64
from ..utils._seq_dataset cimport SequentialDataset32, SequentialDataset64
cdef extern from *:
"""
/* Penalty constants */
#define NO_PENALTY 0
#define L1 1
#define L2 2
#define ELASTICNET 3
/* Learning rate constants */
#define CONSTANT 1
#define OPTIMAL 2
#define INVSCALING 3
#define ADAPTIVE 4
#define PA1 5
#define PA2 6
"""
int NO_PENALTY = 0
int L1 = 1
int L2 = 2
int ELASTICNET = 3
int CONSTANT = 1
int OPTIMAL = 2
int INVSCALING = 3
int ADAPTIVE = 4
int PA1 = 5
int PA2 = 6
# ----------------------------------------
# Extension Types for Loss Functions
# ----------------------------------------
cdef class Regression(CyLossFunction):
"""Base class for loss functions for regression"""
def py_loss(self, double p, double y):
"""Python version of `loss` for testing only.
Pytest needs a python function and can't use cdef functions.
Parameters
----------
p : double
The prediction, `p = w^T x + intercept`.
y : double
The true value (aka target).
Returns
-------
double
The loss evaluated at `p` and `y`.
"""
return self.cy_loss(y, p)
def py_dloss(self, double p, double y):
"""Python version of `dloss` for testing only.
Pytest needs a python function and can't use cdef functions.
Parameters
----------
p : double
The prediction, `p = w^T x`.
y : double
The true value (aka target).
Returns
-------
double
The derivative of the loss function with regards to `p`.
"""
return self.cy_gradient(y, p)
cdef class Classification(CyLossFunction):
"""Base class for loss functions for classification"""
def py_loss(self, double p, double y):
"""Python version of `loss` for testing only."""
return self.cy_loss(y, p)
def py_dloss(self, double p, double y):
"""Python version of `dloss` for testing only."""
return self.cy_gradient(y, p)
cdef class ModifiedHuber(Classification):
"""Modified Huber loss for binary classification with y in {-1, 1}
This is equivalent to quadratically smoothed SVM with gamma = 2.
See T. Zhang 'Solving Large Scale Linear Prediction Problems Using
Stochastic Gradient Descent', ICML'04.
"""
cdef double cy_loss(self, double y, double p) noexcept nogil:
cdef double z = p * y
if z >= 1.0:
return 0.0
elif z >= -1.0:
return (1.0 - z) * (1.0 - z)
else:
return -4.0 * z
cdef double cy_gradient(self, double y, double p) noexcept nogil:
cdef double z = p * y
if z >= 1.0:
return 0.0
elif z >= -1.0:
return 2.0 * (1.0 - z) * -y
else:
return -4.0 * y
def __reduce__(self):
return ModifiedHuber, ()
cdef class Hinge(Classification):
"""Hinge loss for binary classification tasks with y in {-1,1}
Parameters
----------
threshold : float > 0.0
Margin threshold. When threshold=1.0, one gets the loss used by SVM.
When threshold=0.0, one gets the loss used by the Perceptron.
"""
cdef double threshold
def __init__(self, double threshold=1.0):
self.threshold = threshold
cdef double cy_loss(self, double y, double p) noexcept nogil:
cdef double z = p * y
if z <= self.threshold:
return self.threshold - z
return 0.0
cdef double cy_gradient(self, double y, double p) noexcept nogil:
cdef double z = p * y
if z <= self.threshold:
return -y
return 0.0
def __reduce__(self):
return Hinge, (self.threshold,)
cdef class SquaredHinge(Classification):
"""Squared Hinge loss for binary classification tasks with y in {-1,1}
Parameters
----------
threshold : float > 0.0
Margin threshold. When threshold=1.0, one gets the loss used by
(quadratically penalized) SVM.
"""
cdef double threshold
def __init__(self, double threshold=1.0):
self.threshold = threshold
cdef double cy_loss(self, double y, double p) noexcept nogil:
cdef double z = self.threshold - p * y
if z > 0:
return z * z
return 0.0
cdef double cy_gradient(self, double y, double p) noexcept nogil:
cdef double z = self.threshold - p * y
if z > 0:
return -2 * y * z
return 0.0
def __reduce__(self):
return SquaredHinge, (self.threshold,)
cdef class EpsilonInsensitive(Regression):
"""Epsilon-Insensitive loss (used by SVR).
loss = max(0, |y - p| - epsilon)
"""
cdef double epsilon
def __init__(self, double epsilon):
self.epsilon = epsilon
cdef double cy_loss(self, double y, double p) noexcept nogil:
cdef double ret = fabs(y - p) - self.epsilon
return ret if ret > 0 else 0
cdef double cy_gradient(self, double y, double p) noexcept nogil:
if y - p > self.epsilon:
return -1
elif p - y > self.epsilon:
return 1
else:
return 0
def __reduce__(self):
return EpsilonInsensitive, (self.epsilon,)
cdef class SquaredEpsilonInsensitive(Regression):
"""Epsilon-Insensitive loss.
loss = max(0, |y - p| - epsilon)^2
"""
cdef double epsilon
def __init__(self, double epsilon):
self.epsilon = epsilon
cdef double cy_loss(self, double y, double p) noexcept nogil:
cdef double ret = fabs(y - p) - self.epsilon
return ret * ret if ret > 0 else 0
cdef double cy_gradient(self, double y, double p) noexcept nogil:
cdef double z
z = y - p
if z > self.epsilon:
return -2 * (z - self.epsilon)
elif z < -self.epsilon:
return 2 * (-z - self.epsilon)
else:
return 0
def __reduce__(self):
return SquaredEpsilonInsensitive, (self.epsilon,)
def _plain_sgd64(
const double[::1] weights,
double intercept,
const double[::1] average_weights,
double average_intercept,
CyLossFunction loss,
int penalty_type,
double alpha,
double C,
double l1_ratio,
SequentialDataset64 dataset,
const uint8_t[::1] validation_mask,
bint early_stopping,
validation_score_cb,
int n_iter_no_change,
unsigned int max_iter,
double tol,
int fit_intercept,
int verbose,
bint shuffle,
uint32_t seed,
double weight_pos,
double weight_neg,
int learning_rate,
double eta0,
double power_t,
bint one_class,
double t=1.0,
double intercept_decay=1.0,
int average=0,
):
"""SGD for generic loss functions and penalties with optional averaging
Parameters
----------
weights : ndarray[double, ndim=1]
The allocated vector of weights.
intercept : double
The initial intercept.
average_weights : ndarray[double, ndim=1]
The average weights as computed for ASGD. Should be None if average
is 0.
average_intercept : double
The average intercept for ASGD. Should be 0 if average is 0.
loss : CyLossFunction
A concrete ``CyLossFunction`` object.
penalty_type : int
The penalty 2 for L2, 1 for L1, and 3 for Elastic-Net.
alpha : float
The regularization parameter.
C : float
Maximum step size for passive aggressive.
l1_ratio : float
The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1.
l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1.
dataset : SequentialDataset
A concrete ``SequentialDataset`` object.
validation_mask : ndarray[uint8_t, ndim=1]
Equal to True on the validation set.
early_stopping : boolean
Whether to use a stopping criterion based on the validation set.
validation_score_cb : callable
A callable to compute a validation score given the current
coefficients and intercept values.
Used only if early_stopping is True.
n_iter_no_change : int
Number of iteration with no improvement to wait before stopping.
max_iter : int
The maximum number of iterations (epochs).
tol: double
The tolerance for the stopping criterion.
fit_intercept : int
Whether or not to fit the intercept (1 or 0).
verbose : int
Print verbose output; 0 for quite.
shuffle : boolean
Whether to shuffle the training data before each epoch.
weight_pos : float
The weight of the positive class.
weight_neg : float
The weight of the negative class.
seed : uint32_t
Seed of the pseudorandom number generator used to shuffle the data.
learning_rate : int
The learning rate:
(1) constant, eta = eta0
(2) optimal, eta = 1.0/(alpha * t).
(3) inverse scaling, eta = eta0 / pow(t, power_t)
(4) adaptive decrease
(5) Passive Aggressive-I, eta = min(alpha, loss/norm(x))
(6) Passive Aggressive-II, eta = 1.0 / (norm(x) + 0.5*alpha)
eta0 : double
The initial learning rate.
power_t : double
The exponent for inverse scaling learning rate.
one_class : boolean
Whether to solve the One-Class SVM optimization problem.
t : double
Initial state of the learning rate. This value is equal to the
iteration count except when the learning rate is set to `optimal`.
Default: 1.0.
average : int
The number of iterations before averaging starts. average=1 is
equivalent to averaging for all iterations.
Returns
-------
weights : array, shape=[n_features]
The fitted weight vector.
intercept : float
The fitted intercept term.
average_weights : array shape=[n_features]
The averaged weights across iterations. Values are valid only if
average > 0.
average_intercept : float
The averaged intercept across iterations.
Values are valid only if average > 0.
n_iter_ : int
The actual number of iter (epochs).
"""
# get the data information into easy vars
cdef Py_ssize_t n_samples = dataset.n_samples
cdef Py_ssize_t n_features = weights.shape[0]
cdef WeightVector64 w = WeightVector64(weights, average_weights)
cdef double *x_data_ptr = NULL
cdef int *x_ind_ptr = NULL
# helper variables
cdef int no_improvement_count = 0
cdef bint infinity = False
cdef int xnnz
cdef double eta = 0.0
cdef double p = 0.0
cdef double update = 0.0
cdef double intercept_update = 0.0
cdef double sumloss = 0.0
cdef double score = 0.0
cdef double best_loss = INFINITY
cdef double best_score = -INFINITY
cdef double y = 0.0
cdef double sample_weight
cdef double class_weight = 1.0
cdef unsigned int count = 0
cdef unsigned int train_count = n_samples - np.sum(validation_mask)
cdef unsigned int epoch = 0
cdef unsigned int i = 0
cdef int is_hinge = isinstance(loss, Hinge)
cdef double optimal_init = 0.0
cdef double dloss = 0.0
cdef double MAX_DLOSS = 1e12
cdef long long sample_index
# q vector is only used for L1 regularization
cdef double[::1] q = None
cdef double * q_data_ptr = NULL
if penalty_type == L1 or penalty_type == ELASTICNET:
q = np.zeros((n_features,), dtype=np.float64, order="c")
q_data_ptr = &q[0]
cdef double u = 0.0
if penalty_type == L2:
l1_ratio = 0.0
elif penalty_type == L1:
l1_ratio = 1.0
eta = eta0
if learning_rate == OPTIMAL:
typw = np.sqrt(1.0 / np.sqrt(alpha))
# computing eta0, the initial learning rate
initial_eta0 = typw / max(1.0, loss.cy_gradient(1.0, -typw))
# initialize t such that eta at first sample equals eta0
optimal_init = 1.0 / (initial_eta0 * alpha)
t_start = time()
with nogil:
for epoch in range(max_iter):
sumloss = 0
if verbose > 0:
with gil:
print("-- Epoch %d" % (epoch + 1))
if shuffle:
dataset.shuffle(seed)
for i in range(n_samples):
dataset.next(&x_data_ptr, &x_ind_ptr, &xnnz,
&y, &sample_weight)
sample_index = dataset.index_data_ptr[dataset.current_index]
if validation_mask[sample_index]:
# do not learn on the validation set
continue
p = w.dot(x_data_ptr, x_ind_ptr, xnnz) + intercept
if learning_rate == OPTIMAL:
eta = 1.0 / (alpha * (optimal_init + t - 1))
elif learning_rate == INVSCALING:
eta = eta0 / pow(t, power_t)
if verbose or not early_stopping:
sumloss += loss.cy_loss(y, p)
if y > 0.0:
class_weight = weight_pos
else:
class_weight = weight_neg
if learning_rate == PA1:
update = sqnorm(x_data_ptr, x_ind_ptr, xnnz)
if update == 0:
continue
update = min(C, loss.cy_loss(y, p) / update)
elif learning_rate == PA2:
update = sqnorm(x_data_ptr, x_ind_ptr, xnnz)
update = loss.cy_loss(y, p) / (update + 0.5 / C)
else:
dloss = loss.cy_gradient(y, p)
# clip dloss with large values to avoid numerical
# instabilities
if dloss < -MAX_DLOSS:
dloss = -MAX_DLOSS
elif dloss > MAX_DLOSS:
dloss = MAX_DLOSS
update = -eta * dloss
if learning_rate >= PA1:
if is_hinge:
# classification
update *= y
elif y - p < 0:
# regression
update *= -1
update *= class_weight * sample_weight
if penalty_type >= L2:
# do not scale to negative values when eta or alpha are too
# big: instead set the weights to zero
w.scale(max(0, 1.0 - ((1.0 - l1_ratio) * eta * alpha)))
if update != 0.0:
w.add(x_data_ptr, x_ind_ptr, xnnz, update)
if fit_intercept == 1:
intercept_update = update
if one_class: # specific for One-Class SVM
intercept_update -= 2. * eta * alpha
if intercept_update != 0:
intercept += intercept_update * intercept_decay
if 0 < average <= t:
# compute the average for the intercept and update the
# average weights, this is done regardless as to whether
# the update is 0
w.add_average(x_data_ptr, x_ind_ptr, xnnz,
update, (t - average + 1))
average_intercept += ((intercept - average_intercept) /
(t - average + 1))
if penalty_type == L1 or penalty_type == ELASTICNET:
u += (l1_ratio * eta * alpha)
l1penalty64(w, q_data_ptr, x_ind_ptr, xnnz, u)
t += 1
count += 1
# report epoch information
if verbose > 0:
with gil:
print("Norm: %.2f, NNZs: %d, Bias: %.6f, T: %d, "
"Avg. loss: %f"
% (w.norm(), np.nonzero(weights)[0].shape[0],
intercept, count, sumloss / train_count))
print("Total training time: %.2f seconds."
% (time() - t_start))
# floating-point under-/overflow check.
if (not isfinite(intercept) or any_nonfinite(weights)):
infinity = True
break
# evaluate the score on the validation set
if early_stopping:
with gil:
score = validation_score_cb(weights.base, intercept)
if tol > -INFINITY and score < best_score + tol:
no_improvement_count += 1
else:
no_improvement_count = 0
if score > best_score:
best_score = score
# or evaluate the loss on the training set
else:
if tol > -INFINITY and sumloss > best_loss - tol * train_count:
no_improvement_count += 1
else:
no_improvement_count = 0
if sumloss < best_loss:
best_loss = sumloss
# if there is no improvement several times in a row
if no_improvement_count >= n_iter_no_change:
if learning_rate == ADAPTIVE and eta > 1e-6:
eta = eta / 5
no_improvement_count = 0
else:
if verbose:
with gil:
print("Convergence after %d epochs took %.2f "
"seconds" % (epoch + 1, time() - t_start))
break
if infinity:
raise ValueError(("Floating-point under-/overflow occurred at epoch"
" #%d. Scaling input data with StandardScaler or"
" MinMaxScaler might help.") % (epoch + 1))
w.reset_wscale()
return (
weights.base,
intercept,
None if average_weights is None else average_weights.base,
average_intercept,
epoch + 1
)
def _plain_sgd32(
const float[::1] weights,
double intercept,
const float[::1] average_weights,
double average_intercept,
CyLossFunction loss,
int penalty_type,
double alpha,
double C,
double l1_ratio,
SequentialDataset32 dataset,
const uint8_t[::1] validation_mask,
bint early_stopping,
validation_score_cb,
int n_iter_no_change,
unsigned int max_iter,
double tol,
int fit_intercept,
int verbose,
bint shuffle,
uint32_t seed,
double weight_pos,
double weight_neg,
int learning_rate,
double eta0,
double power_t,
bint one_class,
double t=1.0,
double intercept_decay=1.0,
int average=0,
):
"""SGD for generic loss functions and penalties with optional averaging
Parameters
----------
weights : ndarray[float, ndim=1]
The allocated vector of weights.
intercept : double
The initial intercept.
average_weights : ndarray[float, ndim=1]
The average weights as computed for ASGD. Should be None if average
is 0.
average_intercept : double
The average intercept for ASGD. Should be 0 if average is 0.
loss : CyLossFunction
A concrete ``CyLossFunction`` object.
penalty_type : int
The penalty 2 for L2, 1 for L1, and 3 for Elastic-Net.
alpha : float
The regularization parameter.
C : float
Maximum step size for passive aggressive.
l1_ratio : float
The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1.
l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1.
dataset : SequentialDataset
A concrete ``SequentialDataset`` object.
validation_mask : ndarray[uint8_t, ndim=1]
Equal to True on the validation set.
early_stopping : boolean
Whether to use a stopping criterion based on the validation set.
validation_score_cb : callable
A callable to compute a validation score given the current
coefficients and intercept values.
Used only if early_stopping is True.
n_iter_no_change : int
Number of iteration with no improvement to wait before stopping.
max_iter : int
The maximum number of iterations (epochs).
tol: double
The tolerance for the stopping criterion.
fit_intercept : int
Whether or not to fit the intercept (1 or 0).
verbose : int
Print verbose output; 0 for quite.
shuffle : boolean
Whether to shuffle the training data before each epoch.
weight_pos : float
The weight of the positive class.
weight_neg : float
The weight of the negative class.
seed : uint32_t
Seed of the pseudorandom number generator used to shuffle the data.
learning_rate : int
The learning rate:
(1) constant, eta = eta0
(2) optimal, eta = 1.0/(alpha * t).
(3) inverse scaling, eta = eta0 / pow(t, power_t)
(4) adaptive decrease
(5) Passive Aggressive-I, eta = min(alpha, loss/norm(x))
(6) Passive Aggressive-II, eta = 1.0 / (norm(x) + 0.5*alpha)
eta0 : double
The initial learning rate.
power_t : double
The exponent for inverse scaling learning rate.
one_class : boolean
Whether to solve the One-Class SVM optimization problem.
t : double
Initial state of the learning rate. This value is equal to the
iteration count except when the learning rate is set to `optimal`.
Default: 1.0.
average : int
The number of iterations before averaging starts. average=1 is
equivalent to averaging for all iterations.
Returns
-------
weights : array, shape=[n_features]
The fitted weight vector.
intercept : float
The fitted intercept term.
average_weights : array shape=[n_features]
The averaged weights across iterations. Values are valid only if
average > 0.
average_intercept : float
The averaged intercept across iterations.
Values are valid only if average > 0.
n_iter_ : int
The actual number of iter (epochs).
"""
# get the data information into easy vars
cdef Py_ssize_t n_samples = dataset.n_samples
cdef Py_ssize_t n_features = weights.shape[0]
cdef WeightVector32 w = WeightVector32(weights, average_weights)
cdef float *x_data_ptr = NULL
cdef int *x_ind_ptr = NULL
# helper variables
cdef int no_improvement_count = 0
cdef bint infinity = False
cdef int xnnz
cdef double eta = 0.0
cdef double p = 0.0
cdef double update = 0.0
cdef double intercept_update = 0.0
cdef double sumloss = 0.0
cdef double score = 0.0
cdef double best_loss = INFINITY
cdef double best_score = -INFINITY
cdef float y = 0.0
cdef float sample_weight
cdef float class_weight = 1.0
cdef unsigned int count = 0
cdef unsigned int train_count = n_samples - np.sum(validation_mask)
cdef unsigned int epoch = 0
cdef unsigned int i = 0
cdef int is_hinge = isinstance(loss, Hinge)
cdef double optimal_init = 0.0
cdef double dloss = 0.0
cdef double MAX_DLOSS = 1e12
cdef long long sample_index
# q vector is only used for L1 regularization
cdef float[::1] q = None
cdef float * q_data_ptr = NULL
if penalty_type == L1 or penalty_type == ELASTICNET:
q = np.zeros((n_features,), dtype=np.float32, order="c")
q_data_ptr = &q[0]
cdef double u = 0.0
if penalty_type == L2:
l1_ratio = 0.0
elif penalty_type == L1:
l1_ratio = 1.0
eta = eta0
if learning_rate == OPTIMAL:
typw = np.sqrt(1.0 / np.sqrt(alpha))
# computing eta0, the initial learning rate
initial_eta0 = typw / max(1.0, loss.cy_gradient(1.0, -typw))
# initialize t such that eta at first sample equals eta0
optimal_init = 1.0 / (initial_eta0 * alpha)
t_start = time()
with nogil:
for epoch in range(max_iter):
sumloss = 0
if verbose > 0:
with gil:
print("-- Epoch %d" % (epoch + 1))
if shuffle:
dataset.shuffle(seed)
for i in range(n_samples):
dataset.next(&x_data_ptr, &x_ind_ptr, &xnnz,
&y, &sample_weight)
sample_index = dataset.index_data_ptr[dataset.current_index]
if validation_mask[sample_index]:
# do not learn on the validation set
continue
p = w.dot(x_data_ptr, x_ind_ptr, xnnz) + intercept
if learning_rate == OPTIMAL:
eta = 1.0 / (alpha * (optimal_init + t - 1))
elif learning_rate == INVSCALING:
eta = eta0 / pow(t, power_t)
if verbose or not early_stopping:
sumloss += loss.cy_loss(y, p)
if y > 0.0:
class_weight = weight_pos
else:
class_weight = weight_neg
if learning_rate == PA1:
update = sqnorm(x_data_ptr, x_ind_ptr, xnnz)
if update == 0:
continue
update = min(C, loss.cy_loss(y, p) / update)
elif learning_rate == PA2:
update = sqnorm(x_data_ptr, x_ind_ptr, xnnz)
update = loss.cy_loss(y, p) / (update + 0.5 / C)
else:
dloss = loss.cy_gradient(y, p)
# clip dloss with large values to avoid numerical
# instabilities
if dloss < -MAX_DLOSS:
dloss = -MAX_DLOSS
elif dloss > MAX_DLOSS:
dloss = MAX_DLOSS
update = -eta * dloss
if learning_rate >= PA1:
if is_hinge:
# classification
update *= y
elif y - p < 0:
# regression
update *= -1
update *= class_weight * sample_weight
if penalty_type >= L2:
# do not scale to negative values when eta or alpha are too
# big: instead set the weights to zero
w.scale(max(0, 1.0 - ((1.0 - l1_ratio) * eta * alpha)))
if update != 0.0:
w.add(x_data_ptr, x_ind_ptr, xnnz, update)
if fit_intercept == 1:
intercept_update = update
if one_class: # specific for One-Class SVM
intercept_update -= 2. * eta * alpha
if intercept_update != 0:
intercept += intercept_update * intercept_decay
if 0 < average <= t:
# compute the average for the intercept and update the
# average weights, this is done regardless as to whether
# the update is 0
w.add_average(x_data_ptr, x_ind_ptr, xnnz,
update, (t - average + 1))
average_intercept += ((intercept - average_intercept) /
(t - average + 1))
if penalty_type == L1 or penalty_type == ELASTICNET:
u += (l1_ratio * eta * alpha)
l1penalty32(w, q_data_ptr, x_ind_ptr, xnnz, u)
t += 1
count += 1
# report epoch information
if verbose > 0:
with gil:
print("Norm: %.2f, NNZs: %d, Bias: %.6f, T: %d, "
"Avg. loss: %f"
% (w.norm(), np.nonzero(weights)[0].shape[0],
intercept, count, sumloss / train_count))
print("Total training time: %.2f seconds."
% (time() - t_start))
# floating-point under-/overflow check.
if (not isfinite(intercept) or any_nonfinite(weights)):
infinity = True
break
# evaluate the score on the validation set
if early_stopping:
with gil:
score = validation_score_cb(weights.base, intercept)
if tol > -INFINITY and score < best_score + tol:
no_improvement_count += 1
else:
no_improvement_count = 0
if score > best_score:
best_score = score
# or evaluate the loss on the training set
else:
if tol > -INFINITY and sumloss > best_loss - tol * train_count:
no_improvement_count += 1
else:
no_improvement_count = 0
if sumloss < best_loss:
best_loss = sumloss
# if there is no improvement several times in a row
if no_improvement_count >= n_iter_no_change:
if learning_rate == ADAPTIVE and eta > 1e-6:
eta = eta / 5
no_improvement_count = 0
else:
if verbose:
with gil:
print("Convergence after %d epochs took %.2f "
"seconds" % (epoch + 1, time() - t_start))
break
if infinity:
raise ValueError(("Floating-point under-/overflow occurred at epoch"
" #%d. Scaling input data with StandardScaler or"
" MinMaxScaler might help.") % (epoch + 1))
w.reset_wscale()
return (
weights.base,
intercept,
None if average_weights is None else average_weights.base,
average_intercept,
epoch + 1
)
cdef inline bint any_nonfinite(const floating[::1] w) noexcept nogil:
for i in range(w.shape[0]):
if not isfinite(w[i]):
return True
return 0
cdef inline double sqnorm(
floating * x_data_ptr,
int * x_ind_ptr,
int xnnz,
) noexcept nogil:
cdef double x_norm = 0.0
cdef int j
cdef double z
for j in range(xnnz):
z = x_data_ptr[j]
x_norm += z * z
return x_norm
cdef void l1penalty64(
WeightVector64 w,
double * q_data_ptr,
int *x_ind_ptr,
int xnnz,
double u,
) noexcept nogil:
"""Apply the L1 penalty to each updated feature
This implements the truncated gradient approach by
[Tsuruoka, Y., Tsujii, J., and Ananiadou, S., 2009].
"""
cdef double z = 0.0
cdef int j = 0
cdef int idx = 0
cdef double wscale = w.wscale
cdef double *w_data_ptr = w.w_data_ptr
for j in range(xnnz):
idx = x_ind_ptr[j]
z = w_data_ptr[idx]
if wscale * z > 0.0:
w_data_ptr[idx] = max(
0.0, w_data_ptr[idx] - ((u + q_data_ptr[idx]) / wscale))
elif wscale * z < 0.0:
w_data_ptr[idx] = min(
0.0, w_data_ptr[idx] + ((u - q_data_ptr[idx]) / wscale))
q_data_ptr[idx] += wscale * (w_data_ptr[idx] - z)
cdef void l1penalty32(
WeightVector32 w,
float * q_data_ptr,
int *x_ind_ptr,
int xnnz,
double u,
) noexcept nogil:
"""Apply the L1 penalty to each updated feature
This implements the truncated gradient approach by
[Tsuruoka, Y., Tsujii, J., and Ananiadou, S., 2009].
"""
cdef double z = 0.0
cdef int j = 0
cdef int idx = 0
cdef double wscale = w.wscale
cdef float *w_data_ptr = w.w_data_ptr
for j in range(xnnz):
idx = x_ind_ptr[j]
z = w_data_ptr[idx]
if wscale * z > 0.0:
w_data_ptr[idx] = max(
0.0, w_data_ptr[idx] - ((u + q_data_ptr[idx]) / wscale))
elif wscale * z < 0.0:
w_data_ptr[idx] = min(
0.0, w_data_ptr[idx] + ((u - q_data_ptr[idx]) / wscale))
q_data_ptr[idx] += wscale * (w_data_ptr[idx] - z)
_weight_vector.pyx
# cython: binding=False
#
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
cimport cython
from libc.limits cimport INT_MAX
from libc.math cimport sqrt
from ._cython_blas cimport _dot, _scal, _axpy
cdef class WeightVector64(object):
"""Dense vector represented by a scalar and a numpy array.
The class provides methods to ``add`` a sparse vector
and scale the vector.
Representing a vector explicitly as a scalar times a
vector allows for efficient scaling operations.
Attributes
----------
w : ndarray, dtype=double, order='C'
The numpy array which backs the weight vector.
aw : ndarray, dtype=double, order='C'
The numpy array which backs the average_weight vector.
w_data_ptr : double*
A pointer to the data of the numpy array.
wscale : double
The scale of the vector.
n_features : int
The number of features (= dimensionality of ``w``).
sq_norm : double
The squared norm of ``w``.
"""
def __cinit__(self,
double[::1] w,
double[::1] aw):
if w.shape[0] > INT_MAX:
raise ValueError("More than %d features not supported; got %d."
% (INT_MAX, w.shape[0]))
self.w = w
self.w_data_ptr = &w[0]
self.wscale = 1.0
self.n_features = w.shape[0]
self.sq_norm = _dot(self.n_features, self.w_data_ptr, 1, self.w_data_ptr, 1)
self.aw = aw
if self.aw is not None:
self.aw_data_ptr = &aw[0]
self.average_a = 0.0
self.average_b = 1.0
cdef void add(self, double *x_data_ptr, int *x_ind_ptr, int xnnz,
double c) noexcept nogil:
"""Scales sample x by constant c and adds it to the weight vector.
This operation updates ``sq_norm``.
Parameters
----------
x_data_ptr : double*
The array which holds the feature values of ``x``.
x_ind_ptr : np.intc*
The array which holds the feature indices of ``x``.
xnnz : int
The number of non-zero features of ``x``.
c : double
The scaling constant for the example.
"""
cdef int j
cdef int idx
cdef double val
cdef double innerprod = 0.0
cdef double xsqnorm = 0.0
# the next two lines save a factor of 2!
cdef double wscale = self.wscale
cdef double* w_data_ptr = self.w_data_ptr
for j in range(xnnz):
idx = x_ind_ptr[j]
val = x_data_ptr[j]
innerprod += (w_data_ptr[idx] * val)
xsqnorm += (val * val)
w_data_ptr[idx] += val * (c / wscale)
self.sq_norm += (xsqnorm * c * c) + (2.0 * innerprod * wscale * c)
# Update the average weights according to the sparse trick defined
# here: https://research.microsoft.com/pubs/192769/tricks-2012.pdf
# by Leon Bottou
cdef void add_average(self, double *x_data_ptr, int *x_ind_ptr, int xnnz,
double c, double num_iter) noexcept nogil:
"""Updates the average weight vector.
Parameters
----------
x_data_ptr : double*
The array which holds the feature values of ``x``.
x_ind_ptr : np.intc*
The array which holds the feature indices of ``x``.
xnnz : int
The number of non-zero features of ``x``.
c : double
The scaling constant for the example.
num_iter : double
The total number of iterations.
"""
cdef int j
cdef int idx
cdef double val
cdef double mu = 1.0 / num_iter
cdef double average_a = self.average_a
cdef double wscale = self.wscale
cdef double* aw_data_ptr = self.aw_data_ptr
for j in range(xnnz):
idx = x_ind_ptr[j]
val = x_data_ptr[j]
aw_data_ptr[idx] += (self.average_a * val * (-c / wscale))
# Once the sample has been processed
# update the average_a and average_b
if num_iter > 1:
self.average_b /= (1.0 - mu)
self.average_a += mu * self.average_b * wscale
cdef double dot(self, double *x_data_ptr, int *x_ind_ptr,
int xnnz) noexcept nogil:
"""Computes the dot product of a sample x and the weight vector.
Parameters
----------
x_data_ptr : double*
The array which holds the feature values of ``x``.
x_ind_ptr : np.intc*
The array which holds the feature indices of ``x``.
xnnz : int
The number of non-zero features of ``x`` (length of x_ind_ptr).
Returns
-------
innerprod : double
The inner product of ``x`` and ``w``.
"""
cdef int j
cdef int idx
cdef double innerprod = 0.0
cdef double* w_data_ptr = self.w_data_ptr
for j in range(xnnz):
idx = x_ind_ptr[j]
innerprod += w_data_ptr[idx] * x_data_ptr[j]
innerprod *= self.wscale
return innerprod
cdef void scale(self, double c) noexcept nogil:
"""Scales the weight vector by a constant ``c``.
It updates ``wscale`` and ``sq_norm``. If ``wscale`` gets too
small we call ``reset_swcale``."""
self.wscale *= c
self.sq_norm *= (c * c)
if self.wscale < 1e-09:
self.reset_wscale()
cdef void reset_wscale(self) noexcept nogil:
"""Scales each coef of ``w`` by ``wscale`` and resets it to 1. """
if self.aw_data_ptr != NULL:
_axpy(self.n_features, self.average_a,
self.w_data_ptr, 1, self.aw_data_ptr, 1)
_scal(self.n_features, 1.0 / self.average_b, self.aw_data_ptr, 1)
self.average_a = 0.0
self.average_b = 1.0
_scal(self.n_features, self.wscale, self.w_data_ptr, 1)
self.wscale = 1.0
cdef double norm(self) noexcept nogil:
"""The L2 norm of the weight vector. """
return sqrt(self.sq_norm)
cdef class WeightVector32(object):
"""Dense vector represented by a scalar and a numpy array.
The class provides methods to ``add`` a sparse vector
and scale the vector.
Representing a vector explicitly as a scalar times a
vector allows for efficient scaling operations.
Attributes
----------
w : ndarray, dtype=float, order='C'
The numpy array which backs the weight vector.
aw : ndarray, dtype=float, order='C'
The numpy array which backs the average_weight vector.
w_data_ptr : float*
A pointer to the data of the numpy array.
wscale : float
The scale of the vector.
n_features : int
The number of features (= dimensionality of ``w``).
sq_norm : float
The squared norm of ``w``.
"""
def __cinit__(self,
float[::1] w,
float[::1] aw):
if w.shape[0] > INT_MAX:
raise ValueError("More than %d features not supported; got %d."
% (INT_MAX, w.shape[0]))
self.w = w
self.w_data_ptr = &w[0]
self.wscale = 1.0
self.n_features = w.shape[0]
self.sq_norm = _dot(self.n_features, self.w_data_ptr, 1, self.w_data_ptr, 1)
self.aw = aw
if self.aw is not None:
self.aw_data_ptr = &aw[0]
self.average_a = 0.0
self.average_b = 1.0
cdef void add(self, float *x_data_ptr, int *x_ind_ptr, int xnnz,
float c) noexcept nogil:
"""Scales sample x by constant c and adds it to the weight vector.
This operation updates ``sq_norm``.
Parameters
----------
x_data_ptr : float*
The array which holds the feature values of ``x``.
x_ind_ptr : np.intc*
The array which holds the feature indices of ``x``.
xnnz : int
The number of non-zero features of ``x``.
c : float
The scaling constant for the example.
"""
cdef int j
cdef int idx
cdef double val
cdef double innerprod = 0.0
cdef double xsqnorm = 0.0
# the next two lines save a factor of 2!
cdef float wscale = self.wscale
cdef float* w_data_ptr = self.w_data_ptr
for j in range(xnnz):
idx = x_ind_ptr[j]
val = x_data_ptr[j]
innerprod += (w_data_ptr[idx] * val)
xsqnorm += (val * val)
w_data_ptr[idx] += val * (c / wscale)
self.sq_norm += (xsqnorm * c * c) + (2.0 * innerprod * wscale * c)
# Update the average weights according to the sparse trick defined
# here: https://research.microsoft.com/pubs/192769/tricks-2012.pdf
# by Leon Bottou
cdef void add_average(self, float *x_data_ptr, int *x_ind_ptr, int xnnz,
float c, float num_iter) noexcept nogil:
"""Updates the average weight vector.
Parameters
----------
x_data_ptr : float*
The array which holds the feature values of ``x``.
x_ind_ptr : np.intc*
The array which holds the feature indices of ``x``.
xnnz : int
The number of non-zero features of ``x``.
c : float
The scaling constant for the example.
num_iter : float
The total number of iterations.
"""
cdef int j
cdef int idx
cdef double val
cdef double mu = 1.0 / num_iter
cdef double average_a = self.average_a
cdef double wscale = self.wscale
cdef float* aw_data_ptr = self.aw_data_ptr
for j in range(xnnz):
idx = x_ind_ptr[j]
val = x_data_ptr[j]
aw_data_ptr[idx] += (self.average_a * val * (-c / wscale))
# Once the sample has been processed
# update the average_a and average_b
if num_iter > 1:
self.average_b /= (1.0 - mu)
self.average_a += mu * self.average_b * wscale
cdef float dot(self, float *x_data_ptr, int *x_ind_ptr,
int xnnz) noexcept nogil:
"""Computes the dot product of a sample x and the weight vector.
Parameters
----------
x_data_ptr : float*
The array which holds the feature values of ``x``.
x_ind_ptr : np.intc*
The array which holds the feature indices of ``x``.
xnnz : int
The number of non-zero features of ``x`` (length of x_ind_ptr).
Returns
-------
innerprod : float
The inner product of ``x`` and ``w``.
"""
cdef int j
cdef int idx
cdef double innerprod = 0.0
cdef float* w_data_ptr = self.w_data_ptr
for j in range(xnnz):
idx = x_ind_ptr[j]
innerprod += w_data_ptr[idx] * x_data_ptr[j]
innerprod *= self.wscale
return innerprod
cdef void scale(self, float c) noexcept nogil:
"""Scales the weight vector by a constant ``c``.
It updates ``wscale`` and ``sq_norm``. If ``wscale`` gets too
small we call ``reset_swcale``."""
self.wscale *= c
self.sq_norm *= (c * c)
if self.wscale < 1e-06:
self.reset_wscale()
cdef void reset_wscale(self) noexcept nogil:
"""Scales each coef of ``w`` by ``wscale`` and resets it to 1. """
if self.aw_data_ptr != NULL:
_axpy(self.n_features, self.average_a,
self.w_data_ptr, 1, self.aw_data_ptr, 1)
_scal(self.n_features, 1.0 / self.average_b, self.aw_data_ptr, 1)
self.average_a = 0.0
self.average_b = 1.0
_scal(self.n_features, self.wscale, self.w_data_ptr, 1)
self.wscale = 1.0
cdef float norm(self) noexcept nogil:
"""The L2 norm of the weight vector. """
return sqrt(self.sq_norm)
_seq_dataset.pyx
"""Dataset abstractions for sequential data access."""
import numpy as np
cimport cython
from libc.limits cimport INT_MAX
from ._random cimport our_rand_r
from ._typedefs cimport float32_t, float64_t, uint32_t
#------------------------------------------------------------------------------
cdef class SequentialDataset64:
"""Base class for datasets with sequential data access.
SequentialDataset is used to iterate over the rows of a matrix X and
corresponding target values y, i.e. to iterate over samples.
There are two methods to get the next sample:
- next : Iterate sequentially (optionally randomized)
- random : Iterate randomly (with replacement)
Attributes
----------
index : np.ndarray
Index array for fast shuffling.
index_data_ptr : int
Pointer to the index array.
current_index : int
Index of current sample in ``index``.
The index of current sample in the data is given by
index_data_ptr[current_index].
n_samples : Py_ssize_t
Number of samples in the dataset.
seed : uint32_t
Seed used for random sampling. This attribute is modified at each call to the
`random` method.
"""
cdef void next(self, float64_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float64_t *y, float64_t *sample_weight) noexcept nogil:
"""Get the next example ``x`` from the dataset.
This method gets the next sample looping sequentially over all samples.
The order can be shuffled with the method ``shuffle``.
Shuffling once before iterating over all samples corresponds to a
random draw without replacement. It is used for instance in SGD solver.
Parameters
----------
x_data_ptr : float64_t**
A pointer to the float64_t array which holds the feature
values of the next example.
x_ind_ptr : np.intc**
A pointer to the int array which holds the feature
indices of the next example.
nnz : int*
A pointer to an int holding the number of non-zero
values of the next example.
y : float64_t*
The target value of the next example.
sample_weight : float64_t*
The weight of the next example.
"""
cdef int current_index = self._get_next_index()
self._sample(x_data_ptr, x_ind_ptr, nnz, y, sample_weight,
current_index)
cdef int random(self, float64_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float64_t *y, float64_t *sample_weight) noexcept nogil:
"""Get a random example ``x`` from the dataset.
This method gets next sample chosen randomly over a uniform
distribution. It corresponds to a random draw with replacement.
It is used for instance in SAG solver.
Parameters
----------
x_data_ptr : float64_t**
A pointer to the float64_t array which holds the feature
values of the next example.
x_ind_ptr : np.intc**
A pointer to the int array which holds the feature
indices of the next example.
nnz : int*
A pointer to an int holding the number of non-zero
values of the next example.
y : float64_t*
The target value of the next example.
sample_weight : float64_t*
The weight of the next example.
Returns
-------
current_index : int
Index of current sample.
"""
cdef int current_index = self._get_random_index()
self._sample(x_data_ptr, x_ind_ptr, nnz, y, sample_weight,
current_index)
return current_index
cdef void shuffle(self, uint32_t seed) noexcept nogil:
"""Permutes the ordering of examples."""
# Fisher-Yates shuffle
cdef int *ind = self.index_data_ptr
cdef int n = self.n_samples
cdef unsigned i, j
for i in range(n - 1):
j = i + our_rand_r(&seed) % (n - i)
ind[i], ind[j] = ind[j], ind[i]
cdef int _get_next_index(self) noexcept nogil:
cdef int current_index = self.current_index
if current_index >= (self.n_samples - 1):
current_index = -1
current_index += 1
self.current_index = current_index
return self.current_index
cdef int _get_random_index(self) noexcept nogil:
cdef int n = self.n_samples
cdef int current_index = our_rand_r(&self.seed) % n
self.current_index = current_index
return current_index
cdef void _sample(self, float64_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float64_t *y, float64_t *sample_weight,
int current_index) noexcept nogil:
pass
def _shuffle_py(self, uint32_t seed):
"""python function used for easy testing"""
self.shuffle(seed)
def _next_py(self):
"""python function used for easy testing"""
cdef int current_index = self._get_next_index()
return self._sample_py(current_index)
def _random_py(self):
"""python function used for easy testing"""
cdef int current_index = self._get_random_index()
return self._sample_py(current_index)
def _sample_py(self, int current_index):
"""python function used for easy testing"""
cdef float64_t* x_data_ptr
cdef int* x_indices_ptr
cdef int nnz, j
cdef float64_t y, sample_weight
# call _sample in cython
self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
current_index)
# transform the pointed data in numpy CSR array
cdef float64_t[:] x_data = np.empty(nnz, dtype=np.float64)
cdef int[:] x_indices = np.empty(nnz, dtype=np.int32)
cdef int[:] x_indptr = np.asarray([0, nnz], dtype=np.int32)
for j in range(nnz):
x_data[j] = x_data_ptr[j]
x_indices[j] = x_indices_ptr[j]
cdef int sample_idx = self.index_data_ptr[current_index]
return (
(np.asarray(x_data), np.asarray(x_indices), np.asarray(x_indptr)),
y,
sample_weight,
sample_idx,
)
cdef class ArrayDataset64(SequentialDataset64):
"""Dataset backed by a two-dimensional numpy array.
The dtype of the numpy array is expected to be ``np.float64`` (float64_t)
and C-style memory layout.
"""
def __cinit__(
self,
const float64_t[:, ::1] X,
const float64_t[::1] Y,
const float64_t[::1] sample_weights,
uint32_t seed=1,
):
"""A ``SequentialDataset`` backed by a two-dimensional numpy array.
Parameters
----------
X : ndarray, dtype=float64_t, ndim=2, mode='c'
The sample array, of shape(n_samples, n_features)
Y : ndarray, dtype=float64_t, ndim=1, mode='c'
The target array, of shape(n_samples, )
sample_weights : ndarray, dtype=float64_t, ndim=1, mode='c'
The weight of each sample, of shape(n_samples,)
"""
if X.shape[0] > INT_MAX or X.shape[1] > INT_MAX:
raise ValueError("More than %d samples or features not supported;"
" got (%d, %d)."
% (INT_MAX, X.shape[0], X.shape[1]))
# keep a reference to the data to prevent garbage collection
self.X = X
self.Y = Y
self.sample_weights = sample_weights
self.n_samples = X.shape[0]
self.n_features = X.shape[1]
self.feature_indices = np.arange(0, self.n_features, dtype=np.intc)
self.feature_indices_ptr = <int *> &self.feature_indices[0]
self.current_index = -1
self.X_stride = X.strides[0] // X.itemsize
self.X_data_ptr = <float64_t *> &X[0, 0]
self.Y_data_ptr = <float64_t *> &Y[0]
self.sample_weight_data = <float64_t *> &sample_weights[0]
# Use index array for fast shuffling
self.index = np.arange(0, self.n_samples, dtype=np.intc)
self.index_data_ptr = <int *> &self.index[0]
# seed should not be 0 for our_rand_r
self.seed = max(seed, 1)
cdef void _sample(self, float64_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float64_t *y, float64_t *sample_weight,
int current_index) noexcept nogil:
cdef long long sample_idx = self.index_data_ptr[current_index]
cdef long long offset = sample_idx * self.X_stride
y[0] = self.Y_data_ptr[sample_idx]
x_data_ptr[0] = self.X_data_ptr + offset
x_ind_ptr[0] = self.feature_indices_ptr
nnz[0] = self.n_features
sample_weight[0] = self.sample_weight_data[sample_idx]
cdef class CSRDataset64(SequentialDataset64):
"""A ``SequentialDataset`` backed by a scipy sparse CSR matrix. """
def __cinit__(
self,
const float64_t[::1] X_data,
const int[::1] X_indptr,
const int[::1] X_indices,
const float64_t[::1] Y,
const float64_t[::1] sample_weights,
uint32_t seed=1,
):
"""Dataset backed by a scipy sparse CSR matrix.
The feature indices of ``x`` are given by x_ind_ptr[0:nnz].
The corresponding feature values are given by
x_data_ptr[0:nnz].
Parameters
----------
X_data : ndarray, dtype=float64_t, ndim=1, mode='c'
The data array of the CSR features matrix.
X_indptr : ndarray, dtype=np.intc, ndim=1, mode='c'
The index pointer array of the CSR features matrix.
X_indices : ndarray, dtype=np.intc, ndim=1, mode='c'
The column indices array of the CSR features matrix.
Y : ndarray, dtype=float64_t, ndim=1, mode='c'
The target values.
sample_weights : ndarray, dtype=float64_t, ndim=1, mode='c'
The weight of each sample.
"""
# keep a reference to the data to prevent garbage collection
self.X_data = X_data
self.X_indptr = X_indptr
self.X_indices = X_indices
self.Y = Y
self.sample_weights = sample_weights
self.n_samples = Y.shape[0]
self.current_index = -1
self.X_data_ptr = <float64_t *> &X_data[0]
self.X_indptr_ptr = <int *> &X_indptr[0]
self.X_indices_ptr = <int *> &X_indices[0]
self.Y_data_ptr = <float64_t *> &Y[0]
self.sample_weight_data = <float64_t *> &sample_weights[0]
# Use index array for fast shuffling
self.index = np.arange(self.n_samples, dtype=np.intc)
self.index_data_ptr = <int *> &self.index[0]
# seed should not be 0 for our_rand_r
self.seed = max(seed, 1)
cdef void _sample(self, float64_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float64_t *y, float64_t *sample_weight,
int current_index) noexcept nogil:
cdef long long sample_idx = self.index_data_ptr[current_index]
cdef long long offset = self.X_indptr_ptr[sample_idx]
y[0] = self.Y_data_ptr[sample_idx]
x_data_ptr[0] = self.X_data_ptr + offset
x_ind_ptr[0] = self.X_indices_ptr + offset
nnz[0] = self.X_indptr_ptr[sample_idx + 1] - offset
sample_weight[0] = self.sample_weight_data[sample_idx]
#------------------------------------------------------------------------------
cdef class SequentialDataset32:
"""Base class for datasets with sequential data access.
SequentialDataset is used to iterate over the rows of a matrix X and
corresponding target values y, i.e. to iterate over samples.
There are two methods to get the next sample:
- next : Iterate sequentially (optionally randomized)
- random : Iterate randomly (with replacement)
Attributes
----------
index : np.ndarray
Index array for fast shuffling.
index_data_ptr : int
Pointer to the index array.
current_index : int
Index of current sample in ``index``.
The index of current sample in the data is given by
index_data_ptr[current_index].
n_samples : Py_ssize_t
Number of samples in the dataset.
seed : uint32_t
Seed used for random sampling. This attribute is modified at each call to the
`random` method.
"""
cdef void next(self, float32_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float32_t *y, float32_t *sample_weight) noexcept nogil:
"""Get the next example ``x`` from the dataset.
This method gets the next sample looping sequentially over all samples.
The order can be shuffled with the method ``shuffle``.
Shuffling once before iterating over all samples corresponds to a
random draw without replacement. It is used for instance in SGD solver.
Parameters
----------
x_data_ptr : float32_t**
A pointer to the float32_t array which holds the feature
values of the next example.
x_ind_ptr : np.intc**
A pointer to the int array which holds the feature
indices of the next example.
nnz : int*
A pointer to an int holding the number of non-zero
values of the next example.
y : float32_t*
The target value of the next example.
sample_weight : float32_t*
The weight of the next example.
"""
cdef int current_index = self._get_next_index()
self._sample(x_data_ptr, x_ind_ptr, nnz, y, sample_weight,
current_index)
cdef int random(self, float32_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float32_t *y, float32_t *sample_weight) noexcept nogil:
"""Get a random example ``x`` from the dataset.
This method gets next sample chosen randomly over a uniform
distribution. It corresponds to a random draw with replacement.
It is used for instance in SAG solver.
Parameters
----------
x_data_ptr : float32_t**
A pointer to the float32_t array which holds the feature
values of the next example.
x_ind_ptr : np.intc**
A pointer to the int array which holds the feature
indices of the next example.
nnz : int*
A pointer to an int holding the number of non-zero
values of the next example.
y : float32_t*
The target value of the next example.
sample_weight : float32_t*
The weight of the next example.
Returns
-------
current_index : int
Index of current sample.
"""
cdef int current_index = self._get_random_index()
self._sample(x_data_ptr, x_ind_ptr, nnz, y, sample_weight,
current_index)
return current_index
cdef void shuffle(self, uint32_t seed) noexcept nogil:
"""Permutes the ordering of examples."""
# Fisher-Yates shuffle
cdef int *ind = self.index_data_ptr
cdef int n = self.n_samples
cdef unsigned i, j
for i in range(n - 1):
j = i + our_rand_r(&seed) % (n - i)
ind[i], ind[j] = ind[j], ind[i]
cdef int _get_next_index(self) noexcept nogil:
cdef int current_index = self.current_index
if current_index >= (self.n_samples - 1):
current_index = -1
current_index += 1
self.current_index = current_index
return self.current_index
cdef int _get_random_index(self) noexcept nogil:
cdef int n = self.n_samples
cdef int current_index = our_rand_r(&self.seed) % n
self.current_index = current_index
return current_index
cdef void _sample(self, float32_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float32_t *y, float32_t *sample_weight,
int current_index) noexcept nogil:
pass
def _shuffle_py(self, uint32_t seed):
"""python function used for easy testing"""
self.shuffle(seed)
def _next_py(self):
"""python function used for easy testing"""
cdef int current_index = self._get_next_index()
return self._sample_py(current_index)
def _random_py(self):
"""python function used for easy testing"""
cdef int current_index = self._get_random_index()
return self._sample_py(current_index)
def _sample_py(self, int current_index):
"""python function used for easy testing"""
cdef float32_t* x_data_ptr
cdef int* x_indices_ptr
cdef int nnz, j
cdef float32_t y, sample_weight
# call _sample in cython
self._sample(&x_data_ptr, &x_indices_ptr, &nnz, &y, &sample_weight,
current_index)
# transform the pointed data in numpy CSR array
cdef float32_t[:] x_data = np.empty(nnz, dtype=np.float32)
cdef int[:] x_indices = np.empty(nnz, dtype=np.int32)
cdef int[:] x_indptr = np.asarray([0, nnz], dtype=np.int32)
for j in range(nnz):
x_data[j] = x_data_ptr[j]
x_indices[j] = x_indices_ptr[j]
cdef int sample_idx = self.index_data_ptr[current_index]
return (
(np.asarray(x_data), np.asarray(x_indices), np.asarray(x_indptr)),
y,
sample_weight,
sample_idx,
)
cdef class ArrayDataset32(SequentialDataset32):
"""Dataset backed by a two-dimensional numpy array.
The dtype of the numpy array is expected to be ``np.float32`` (float32_t)
and C-style memory layout.
"""
def __cinit__(
self,
const float32_t[:, ::1] X,
const float32_t[::1] Y,
const float32_t[::1] sample_weights,
uint32_t seed=1,
):
"""A ``SequentialDataset`` backed by a two-dimensional numpy array.
Parameters
----------
X : ndarray, dtype=float32_t, ndim=2, mode='c'
The sample array, of shape(n_samples, n_features)
Y : ndarray, dtype=float32_t, ndim=1, mode='c'
The target array, of shape(n_samples, )
sample_weights : ndarray, dtype=float32_t, ndim=1, mode='c'
The weight of each sample, of shape(n_samples,)
"""
if X.shape[0] > INT_MAX or X.shape[1] > INT_MAX:
raise ValueError("More than %d samples or features not supported;"
" got (%d, %d)."
% (INT_MAX, X.shape[0], X.shape[1]))
# keep a reference to the data to prevent garbage collection
self.X = X
self.Y = Y
self.sample_weights = sample_weights
self.n_samples = X.shape[0]
self.n_features = X.shape[1]
self.feature_indices = np.arange(0, self.n_features, dtype=np.intc)
self.feature_indices_ptr = <int *> &self.feature_indices[0]
self.current_index = -1
self.X_stride = X.strides[0] // X.itemsize
self.X_data_ptr = <float32_t *> &X[0, 0]
self.Y_data_ptr = <float32_t *> &Y[0]
self.sample_weight_data = <float32_t *> &sample_weights[0]
# Use index array for fast shuffling
self.index = np.arange(0, self.n_samples, dtype=np.intc)
self.index_data_ptr = <int *> &self.index[0]
# seed should not be 0 for our_rand_r
self.seed = max(seed, 1)
cdef void _sample(self, float32_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float32_t *y, float32_t *sample_weight,
int current_index) noexcept nogil:
cdef long long sample_idx = self.index_data_ptr[current_index]
cdef long long offset = sample_idx * self.X_stride
y[0] = self.Y_data_ptr[sample_idx]
x_data_ptr[0] = self.X_data_ptr + offset
x_ind_ptr[0] = self.feature_indices_ptr
nnz[0] = self.n_features
sample_weight[0] = self.sample_weight_data[sample_idx]
cdef class CSRDataset32(SequentialDataset32):
"""A ``SequentialDataset`` backed by a scipy sparse CSR matrix. """
def __cinit__(
self,
const float32_t[::1] X_data,
const int[::1] X_indptr,
const int[::1] X_indices,
const float32_t[::1] Y,
const float32_t[::1] sample_weights,
uint32_t seed=1,
):
"""Dataset backed by a scipy sparse CSR matrix.
The feature indices of ``x`` are given by x_ind_ptr[0:nnz].
The corresponding feature values are given by
x_data_ptr[0:nnz].
Parameters
----------
X_data : ndarray, dtype=float32_t, ndim=1, mode='c'
The data array of the CSR features matrix.
X_indptr : ndarray, dtype=np.intc, ndim=1, mode='c'
The index pointer array of the CSR features matrix.
X_indices : ndarray, dtype=np.intc, ndim=1, mode='c'
The column indices array of the CSR features matrix.
Y : ndarray, dtype=float32_t, ndim=1, mode='c'
The target values.
sample_weights : ndarray, dtype=float32_t, ndim=1, mode='c'
The weight of each sample.
"""
# keep a reference to the data to prevent garbage collection
self.X_data = X_data
self.X_indptr = X_indptr
self.X_indices = X_indices
self.Y = Y
self.sample_weights = sample_weights
self.n_samples = Y.shape[0]
self.current_index = -1
self.X_data_ptr = <float32_t *> &X_data[0]
self.X_indptr_ptr = <int *> &X_indptr[0]
self.X_indices_ptr = <int *> &X_indices[0]
self.Y_data_ptr = <float32_t *> &Y[0]
self.sample_weight_data = <float32_t *> &sample_weights[0]
# Use index array for fast shuffling
self.index = np.arange(self.n_samples, dtype=np.intc)
self.index_data_ptr = <int *> &self.index[0]
# seed should not be 0 for our_rand_r
self.seed = max(seed, 1)
cdef void _sample(self, float32_t **x_data_ptr, int **x_ind_ptr,
int *nnz, float32_t *y, float32_t *sample_weight,
int current_index) noexcept nogil:
cdef long long sample_idx = self.index_data_ptr[current_index]
cdef long long offset = self.X_indptr_ptr[sample_idx]
y[0] = self.Y_data_ptr[sample_idx]
x_data_ptr[0] = self.X_data_ptr + offset
x_ind_ptr[0] = self.X_indices_ptr + offset
nnz[0] = self.X_indptr_ptr[sample_idx + 1] - offset
sample_weight[0] = self.sample_weight_data[sample_idx]