Contents
This is a rough walkthrough of every LightGBM parameter. Since there is a lot of material, I will translate it little by little over several days. Finer points that catch my attention will be covered in separate articles as I go. If you find a mistake, I would appreciate it if you pointed it out.
The official LightGBM GitHub repository is `here <https://github.com/microsoft/LightGBM>`__.
Each entry follows the same basic format: default, type, options, constraints.
Core Parameters
---------------
- ``config`` , default = ``""``, type = string, aliases: ``config_file``
- path of config file
- **Note**: can be used only in CLI version
- ``task`` , default = ``train``, type = enum, options: ``train``, ``predict``, ``convert_model``, ``refit``, aliases: ``task_type``
- ``train``, for training, aliases: ``training``
- ``predict``, for prediction, aliases: ``prediction``, ``test``
- ``convert_model``, for converting the model file into if-else format, see more information in IO Parameters below
- ``refit``, for refitting existing models with new data, aliases: ``refit_tree``
- **Note**: can be used only in CLI version; for language-specific packages you can use the corresponding functions
- ``objective`` , default = ``regression``, type = enum, options: ``regression``, ``regression_l1``, ``huber``, ``fair``, ``poisson``, ``quantile``, ``mape``, ``gamma``, ``tweedie``, ``binary``, ``multiclass``, ``multiclassova``, ``cross_entropy``, ``cross_entropy_lambda``, ``lambdarank``, ``rank_xendcg``, aliases: ``objective_type``, ``app``, ``application``
- regression application
- ``regression``, L2 loss, aliases: ``regression_l2``, ``l2``, ``mean_squared_error``, ``mse``, ``l2_root``, ``root_mean_squared_error``, ``rmse``
- ``regression_l1``, L1 loss, aliases: ``l1``, ``mean_absolute_error``, ``mae``
- ``huber``, `Huber loss <https://en.wikipedia.org/wiki/Huber_loss>`__
- ``fair``, `Fair loss <https://www.kaggle.com/c/allstate-claims-severity/discussion/24520>`__
- ``poisson``, `Poisson regression <https://en.wikipedia.org/wiki/Poisson_regression>`__
- ``quantile``, `Quantile regression <https://en.wikipedia.org/wiki/Quantile_regression>`__
- ``mape``, `MAPE loss <https://en.wikipedia.org/wiki/Mean_absolute_percentage_error>`__, aliases: ``mean_absolute_percentage_error``
- ``gamma``, Gamma regression with log-link. It might be useful, e.g., for modeling insurance claims frequency, or for any other target that might be gamma-distributed
- ``tweedie``, Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any other target that might be tweedie-distributed
- binary classification application
- ``binary``, binary log loss classification (or logistic regression)
- multi-class classification application
- ``multiclass``, softmax objective function, aliases: ``softmax``
- ``multiclassova``, One-vs-All binary objective function, aliases: ``multiclass_ova``, ``ova``, ``ovr``
- ``num_class`` should be set as well
- cross-entropy application
- ``cross_entropy``, objective function for cross-entropy (with optional linear weights), aliases: ``xentropy``
- ``cross_entropy_lambda``, alternative parameterization of cross-entropy, aliases: ``xentlambda``
- label is anything in interval [0, 1]
- ranking application
- ``lambdarank``, lambdarank objective. ``label_gain`` (described later on this page) can be used to set the gain (weight) of ``int`` labels, and all label values must be smaller than the number of elements in ``label_gain``
- ``rank_xendcg``, XE_NDCG_MART ranking objective function, aliases: ``xendcg``, ``xe_ndcg``, ``xe_ndcg_mart``, ``xendcg_mart``
- ``rank_xendcg`` is faster than ``lambdarank`` and achieves similar performance
- label should be ``int`` type, and larger numbers represent higher relevance (e.g. 0: bad, 1: fair, 2: good, 3: perfect)
- ``boosting`` , default = ``gbdt``, type = enum, options: ``gbdt``, ``rf``, ``dart``, ``goss``, aliases: ``boosting_type``, ``boost``
- ``gbdt``, traditional Gradient Boosting Decision Tree, aliases: ``gbrt``
- ``rf``, Random Forest, aliases: ``random_forest``
- ``dart``, Dropouts meet Multiple Additive Regression Trees
- ``goss``, Gradient-based One-Side Sampling
- ``data`` , default = ``""``, type = string, aliases: ``train``, ``train_data``, ``train_data_file``, ``data_filename``
- path of training data; LightGBM will train from this data
- **Note**: can be used only in CLI version
- ``valid`` , default = ``""``, type = string, aliases: ``test``, ``valid_data``, ``valid_data_file``, ``test_data``, ``test_data_file``, ``valid_filenames``
- path(s) of validation/test data; LightGBM will output metrics for these data
- supports multiple validation data, separated by ``,``
- **Note**: can be used only in CLI version
- ``num_iterations`` , default = ``100``, type = int, aliases: ``num_iteration``, ``n_iter``, ``num_tree``, ``num_trees``, ``num_round``, ``num_rounds``, ``num_boost_round``, ``n_estimators``, constraints: ``num_iterations >= 0``
- number of boosting iterations
- **Note**: internally, LightGBM constructs ``num_class * num_iterations`` trees for multi-class classification problems
- ``learning_rate`` , default = ``0.1``, type = double, aliases: ``shrinkage_rate``, ``eta``, constraints: ``learning_rate > 0.0``
- shrinkage rate
- in ``dart``, it also affects the normalization weights of dropped trees
- ``num_leaves`` , default = ``31``, type = int, aliases: ``num_leaf``, ``max_leaves``, ``max_leaf``, constraints: ``1 < num_leaves <= 131072``
- max number of leaves in one tree
- ``tree_learner`` , default = ``serial``, type = enum, options: ``serial``, ``feature``, ``data``, ``voting``, aliases: ``tree``, ``tree_type``, ``tree_learner_type``
- specifies the tree learner. The terminology is specialized, so I keep the original English descriptions:
- ``serial``, single machine tree learner
- ``feature``, feature parallel tree learner, aliases: ``feature_parallel``
- ``data``, data parallel tree learner, aliases: ``data_parallel``
- ``voting``, voting parallel tree learner, aliases: ``voting_parallel``
- refer to the `Parallel Learning Guide <./Parallel-Learning-Guide.rst>`__ for more details
- ``num_threads`` , default = ``0``, type = int, aliases: ``num_thread``, ``nthread``, ``nthreads``, ``n_jobs``
- number of threads for LightGBM
- ``0`` means the default number of threads in OpenMP
- for the best speed, set this to the number of real CPU cores, not the number of threads (most CPUs use hyper-threading to generate 2 threads per CPU core)
- do not set it too large if your dataset is small (for instance, do not use 64 threads for a dataset with 10,000 rows)
- be aware that a task manager or any similar CPU monitoring tool might report that not all cores are being used. This is normal
- for parallel learning, do not use all CPU cores, because this will cause poor performance for network communication
- **Note**: please don't change this during training, especially when running multiple jobs simultaneously via external packages, otherwise it may cause undesirable errors
- ``device_type`` , default = ``cpu``, type = enum, options: ``cpu``, ``gpu``, aliases: ``device``
- device for the tree learning; you can use a GPU to achieve faster learning
- **Note**: it is recommended to use a smaller ``max_bin`` (e.g. 63) to get better speed
- **Note**: for faster speed, the GPU uses 32-bit floating point to sum up by default, so this may affect the accuracy for some tasks. You can set ``gpu_use_dp=true`` to enable 64-bit floating point, but it will slow down training
- **Note**: refer to the Installation Guide if you want to build LightGBM with GPU support
- ``seed`` , default = ``None``, type = int, aliases: ``random_seed``, ``random_state``
- this seed is used to generate other seeds, e.g. ``data_random_seed``, ``feature_fraction_seed``, etc.
- by default, this seed is unused in favor of the default values of the other seeds
- this seed has lower priority than the other seeds, which means that it will be overridden if you set any other seed explicitly
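To tie the core parameters together, here is a minimal training sketch with the Python package. The synthetic data and every parameter value below are illustrative assumptions, not tuned recommendations.

::

    import numpy as np
    import lightgbm as lgb

    # Illustrative synthetic regression data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = 2.0 * X[:, 0] + rng.normal(size=1000)

    train_set = lgb.Dataset(X, label=y)

    # Core parameters from this section (example values only).
    params = {
        "objective": "regression",  # L2 loss
        "boosting": "gbdt",
        "learning_rate": 0.1,
        "num_leaves": 31,
        "num_threads": 4,           # ideally the number of physical cores
        "seed": 42,                 # master seed that derives the other seeds
    }

    # num_boost_round corresponds to num_iterations.
    booster = lgb.train(params, train_set, num_boost_round=100)
    print(booster.predict(X[:5]))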
Learning Control Parameters
---------------------------
- ``force_col_wise`` , default = ``false``, type = bool
- used only with ``cpu`` device type
- set this to ``true`` to force col-wise histogram building
- enabling this is recommended when:
  - the number of columns is large, or the total number of bins is large
  - ``num_threads`` is large, e.g. ``> 20``
  - you want to reduce memory cost
- **Note**: when both ``force_col_wise`` and ``force_row_wise`` are ``false``, LightGBM will try both of them first and then use the faster one. To remove the overhead of testing, set the faster one to ``true`` manually
- **Note**: this parameter cannot be used at the same time with ``force_row_wise``, choose only one of them
- ``force_row_wise`` , default = ``false``, type = bool
- used only with ``cpu`` device type
- set this to ``true`` to force row-wise histogram building
- enabling this is recommended when:
  - the number of data points is large, or the total number of bins is relatively small
  - ``num_threads`` is relatively small, e.g. ``<= 16``
  - you want to use a small ``bagging_fraction`` or ``goss`` boosting to speed up training
- **Note**: setting this to ``true`` will double the memory cost for the dataset. If you do not have enough memory, use ``force_col_wise=true``
- **Note**: when both ``force_col_wise`` and ``force_row_wise`` are ``false``, LightGBM will try both of them first and then use the faster one. To remove the overhead of testing, set the faster one to ``true`` manually
- **Note**: this parameter cannot be used at the same time with ``force_col_wise``, choose only one of them
- ``histogram_pool_size`` , default = ``-1.0``, type = double, aliases: ``hist_pool_size``
- max cache size in MB for historical histogram
- ``< 0`` means no limit
- ``max_depth`` , default = ``-1``, type = int
- limit the max depth of the tree model. This is used to deal with over-fitting when ``#data`` is small. The tree still grows leaf-wise
- ``<= 0`` means no limit
- ``min_data_in_leaf`` , default = ``20``, type = int, aliases: ``min_data_per_leaf``, ``min_data``, ``min_child_samples``, constraints: ``min_data_in_leaf >= 0``
- minimal number of data in one leaf. Can be used to deal with over-fitting
- ``min_sum_hessian_in_leaf`` , default = ``1e-3``, type = double, aliases: ``min_sum_hessian_per_leaf``, ``min_sum_hessian``, ``min_hessian``, ``min_child_weight``, constraints: ``min_sum_hessian_in_leaf >= 0.0``
- minimal sum of hessians in one leaf. Like ``min_data_in_leaf``, it can be used to deal with over-fitting
- ``bagging_fraction`` , default = ``1.0``, type = double, aliases: ``sub_row``, ``subsample``, ``bagging``, constraints: ``0.0 < bagging_fraction <= 1.0``
- like ``feature_fraction``, but this will randomly select part of the data without resampling
- can be used to speed up training
- can be used to deal with over-fitting
- **Note**: to enable bagging, ``bagging_freq`` should be set to a non-zero value as well
- ``pos_bagging_fraction`` , default = ``1.0``, type = double, aliases: ``pos_sub_row``, ``pos_subsample``, ``pos_bagging``, constraints: ``0.0 < pos_bagging_fraction <= 1.0``
- used only in ``binary`` application
- used for imbalanced binary classification problems; will randomly sample ``#pos_samples * pos_bagging_fraction`` positive samples in bagging
- should be used together with ``neg_bagging_fraction``
- set this to ``1.0`` to disable
- **Note**: to enable this, you need to set ``bagging_freq`` and ``neg_bagging_fraction`` as well
- **Note**: if both ``pos_bagging_fraction`` and ``neg_bagging_fraction`` are set to ``1.0``, balanced bagging is disabled
- **Note**: if balanced bagging is enabled, ``bagging_fraction`` will be ignored
- ``neg_bagging_fraction`` , default = ``1.0``, type = double, aliases: ``neg_sub_row``, ``neg_subsample``, ``neg_bagging``, constraints: ``0.0 < neg_bagging_fraction <= 1.0``
- used only in ``binary`` application
- used for imbalanced binary classification problems; will randomly sample ``#neg_samples * neg_bagging_fraction`` negative samples in bagging
- should be used together with ``pos_bagging_fraction``
- set this to ``1.0`` to disable
- **Note**: to enable this, you need to set ``bagging_freq`` and ``pos_bagging_fraction`` as well
- **Note**: if both ``pos_bagging_fraction`` and ``neg_bagging_fraction`` are set to ``1.0``, balanced bagging is disabled
- **Note**: if balanced bagging is enabled, ``bagging_fraction`` will be ignored
- ``bagging_freq`` , default = ``0``, type = int, aliases: ``subsample_freq``
- frequency for bagging
- ``0`` means disable bagging; ``k`` means perform bagging at every ``k`` iterations
- **Note**: to enable bagging, ``bagging_fraction`` should also be set to a value smaller than ``1.0``
- ``bagging_seed`` , default = ``3``, type = int, aliases: ``bagging_fraction_seed``
- random seed for bagging
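A common pitfall here: bagging stays silently disabled unless both ``bagging_fraction`` and ``bagging_freq`` are set. A minimal sketch with the Python package (all values are illustrative; ``feature_fraction`` is described just below):

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] > 0).astype(int)

    params = {
        "objective": "binary",
        "bagging_fraction": 0.8,  # randomly use 80% of the rows...
        "bagging_freq": 5,        # ...re-drawn every 5 iterations (0 disables bagging)
        "feature_fraction": 0.9,  # also use 90% of the columns per tree
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)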
- ``feature_fraction`` , default = ``1.0``, type = double, aliases: ``sub_feature``, ``colsample_bytree``, constraints: ``0.0 < feature_fraction <= 1.0``
- if ``feature_fraction`` is smaller than ``1.0``, LightGBM will randomly select a subset of features on each iteration (tree). For example, if you set it to ``0.8``, LightGBM will select 80% of the features before training each tree
- can be used to speed up training
- can be used to deal with over-fitting
- ``feature_fraction_bynode`` , default = ``1.0``, type = double, aliases: ``sub_feature_bynode``, ``colsample_bynode``, constraints: ``0.0 < feature_fraction_bynode <= 1.0``
- if ``feature_fraction_bynode`` is smaller than ``1.0``, LightGBM will randomly select a subset of features at each tree node. For example, if you set it to ``0.8``, LightGBM will select 80% of the features at each tree node
- can be used to deal with over-fitting
- **Note**: unlike ``feature_fraction``, this cannot speed up training
- **Note**: if both ``feature_fraction`` and ``feature_fraction_bynode`` are smaller than ``1.0``, the final fraction for each node is ``feature_fraction * feature_fraction_bynode``
- ``feature_fraction_seed`` , default = ``2``, type = int
- random seed for ``feature_fraction``
- ``extra_trees`` , default = ``false``, type = bool
- use extremely randomized trees
- if set to ``true``, when evaluating node splits LightGBM will check only one randomly chosen threshold for each feature
- can be used to deal with over-fitting
- ``extra_seed`` , default = ``6``, type = int
- random seed for selecting thresholds when ``extra_trees`` is true
- ``early_stopping_round`` , default = ``0``, type = int, aliases: ``early_stopping_rounds``, ``early_stopping``, ``n_iter_no_change``
- will stop training if one metric of one validation data doesn't improve in the last ``early_stopping_round`` rounds
- ``<= 0`` means disable
- ``first_metric_only`` , default = ``false``, type = bool
- set this to ``true`` if you want to use only the first metric for early stopping
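A sketch of early stopping with the Python package, assuming a 3.x-era version where ``early_stopping_rounds`` is a keyword argument of ``lgb.train`` (newer versions express the same thing with the ``lgb.early_stopping`` callback). The data and the choice of 50 rounds are illustrative:

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = X[:, 0] + rng.normal(size=1000)

    train_set = lgb.Dataset(X[:800], label=y[:800])
    valid_set = lgb.Dataset(X[800:], label=y[800:], reference=train_set)

    booster = lgb.train(
        {"objective": "regression", "metric": "l2", "verbosity": -1},
        train_set,
        num_boost_round=1000,
        valid_sets=[valid_set],
        early_stopping_rounds=50,  # stop if l2 does not improve for 50 rounds
    )
    print("best iteration:", booster.best_iteration)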
- ``max_delta_step`` , default = ``0.0``, type = double, aliases: ``max_tree_output``, ``max_leaf_output``
- used to limit the max output of tree leaves
- ``<= 0`` means no constraint
- the final max output of leaves is ``learning_rate * max_delta_step``
- ``lambda_l1`` , default = ``0.0``, type = double, aliases: ``reg_alpha``, constraints: ``lambda_l1 >= 0.0``
- L1 regularization
- ``lambda_l2`` , default = ``0.0``, type = double, aliases: ``reg_lambda``, ``lambda``, constraints: ``lambda_l2 >= 0.0``
- L2 regularization
- ``min_gain_to_split`` , default = ``0.0``, type = double, aliases: ``min_split_gain``, constraints: ``min_gain_to_split >= 0.0``
- the minimal gain required to perform a split
- ``drop_rate`` , default = ``0.1``, type = double, aliases: ``rate_drop``, constraints: ``0.0 <= drop_rate <= 1.0``
- used only in ``dart``
- dropout rate: a fraction of previous trees to drop during the dropout (dropouts are used to mute a random fraction of the input features during the training phase)
- ``max_drop`` , default = ``50``, type = int
- used only in ``dart``
- max number of dropped trees during one boosting iteration
- ``<= 0`` means no limit
- ``skip_drop`` , default = ``0.5``, type = double, constraints: ``0.0 <= skip_drop <= 1.0``
- used only in ``dart``
- probability of skipping the dropout procedure during a boosting iteration
- ``xgboost_dart_mode`` , default = ``false``, type = bool
- used only in ``dart``
- set this to ``true``, if you want to use xgboost dart mode
- ``uniform_drop`` , default = ``false``, type = bool
- used only in ``dart``
- set this to ``true``, if you want to use uniform drop
- ``drop_seed`` , default = ``4``, type = int
- used only in ``dart``
- random seed to choose dropping models
- ``top_rate`` , default = ``0.2``, type = double, constraints: ``0.0 <= top_rate <= 1.0``
- used only in ``goss``
- the retain ratio of large gradient data
- ``other_rate`` , default = ``0.1``, type = double, constraints: ``0.0 <= other_rate <= 1.0``
- used only in ``goss``
- the retain ratio of small gradient data
- ``min_data_per_group`` , default = ``100``, type = int, constraints: ``min_data_per_group > 0``
- minimal number of data per categorical group
- ``max_cat_threshold`` , default = ``32``, type = int, constraints: ``max_cat_threshold > 0``
- used for the categorical features
- limit the max threshold points in categorical features
- ``cat_l2`` , default = ``10.0``, type = double, constraints: ``cat_l2 >= 0.0``
- used for the categorical features
- L2 regularization in categorical split
- ``cat_smooth`` , default = ``10.0``, type = double, constraints: ``cat_smooth >= 0.0``
- used for the categorical features
- this can reduce the effect of noise in categorical features, especially for categories with few data
- ``max_cat_to_onehot`` , default = ``4``, type = int, constraints: ``max_cat_to_onehot > 0``
- when the number of categories of one feature is smaller than or equal to ``max_cat_to_onehot``, the one-vs-other split algorithm will be used
- ``top_k`` , default = ``20``, type = int, aliases: ``topk``, constraints: ``top_k > 0``
- used only in ``voting`` tree learner, refer to `Voting parallel <./Parallel-Learning-Guide.rst#choose-appropriate-parallel-algorithm>`__
- set this to a larger value for more accurate results, but it will slow down the training speed
- ``monotone_constraints`` , default = ``None``, type = multi-int, aliases: ``mc``, ``monotone_constraint``
- used for constraints of monotonic features
- ``1`` means increasing, ``-1`` means decreasing, ``0`` means non-constraint
- you need to specify all features in order. For example, ``mc=-1,0,1`` means decreasing for the 1st feature, non-constraint for the 2nd feature and increasing for the 3rd feature (see the sketch below)
- ``monotone_constraints_method`` , default = ``basic``, type = string, aliases: ``monotone_constraining_method``, ``mc_method``
- used only if ``monotone_constraints`` is set
- monotone constraints method
  - ``basic``, the most basic monotone constraints method. It does not slow the library at all, but over-constrains the predictions
  - ``intermediate``, a `more advanced method <https://github.com/microsoft/LightGBM/files/3457826/PR-monotone-constraints-report.pdf>`__, which may slow the library very slightly. However, this method is much less constraining than the basic method and should significantly improve the results
- ``monotone_penalty`` , default = ``0.0``, type = double, aliases: ``monotone_splits_penalty``, ``ms_penalty``, ``mc_penalty``, constraints: ``monotone_penalty >= 0.0``
- used only if ``monotone_constraints`` is set
- `monotone penalty <https://github.com/microsoft/LightGBM/files/3457826/PR-monotone-constraints-report.pdf>`__: a penalization parameter X forbids any monotone splits on the first X (rounded down) level(s) of the tree. The penalty applied to monotone splits on a given depth is a continuous, increasing function of the penalization parameter
- if ``0.0`` (the default), no penalization is applied
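A sketch of declaring monotone constraints from the Python package. The three-feature dataset and the constraint vector are illustrative; the vector has one entry per column, in column order:

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    # Target is decreasing in feature 0 and increasing in feature 2.
    y = -2.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=1000)

    params = {
        "objective": "regression",
        # -1 decreasing, 0 unconstrained, 1 increasing (same as mc=-1,0,1).
        "monotone_constraints": [-1, 0, 1],
        "monotone_constraints_method": "intermediate",
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=50)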
- ``feature_contri`` , default = ``None``, type = multi-double, aliases: ``feature_contrib``, ``fc``, ``fp``, ``feature_penalty``
- used to control a feature's split gain; ``gain[i] = max(0, feature_contri[i]) * gain[i]`` will be used to replace the split gain of the i-th feature
- you need to specify all features in order
- ``forcedsplits_filename`` , default = ``""``, type = string, aliases: ``fs``, ``forced_splits_filename``, ``forced_splits_file``, ``forced_splits``
- path to a ``.json`` file that specifies splits to force at the top of every decision tree before best-first learning commences
- the ``.json`` file can be arbitrarily nested, and each split contains ``feature`` and ``threshold`` fields, as well as ``left`` and ``right`` fields representing subsplits
- categorical splits are forced in a one-hot fashion, with ``left`` representing the split containing the feature value and ``right`` representing other values
- **Note**: the forced split logic will be ignored if the split makes the gain worse
- see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/binary_classification/forced_splits.json>`__ as an example
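As an illustration, a sketch that writes a small forced-splits file and points training at it. The feature indices, thresholds, and nesting below are made-up values for illustration, and giving a child split for only one side follows the linked example file rather than any formal schema:

::

    import json
    import numpy as np
    import lightgbm as lgb

    # Force the root split of every tree (feature 0 at threshold 0.5),
    # with a nested forced split on feature 1 in the left child.
    forced = {
        "feature": 0,
        "threshold": 0.5,
        "left": {"feature": 1, "threshold": 0.0},
    }
    with open("forced_splits.json", "w") as f:
        json.dump(forced, f)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] > 0.5).astype(int)

    params = {
        "objective": "binary",
        "forcedsplits_filename": "forced_splits.json",
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=20)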
- ``refit_decay_rate`` , default = ``0.9``, type = double, constraints: ``0.0 <= refit_decay_rate <= 1.0``
- decay rate of the ``refit`` task; ``leaf_output = refit_decay_rate * old_leaf_output + (1.0 - refit_decay_rate) * new_leaf_output`` will be used to refit trees
- used only in the ``refit`` task in CLI version, or as an argument of the ``refit`` function in language-specific packages
- ``cegb_tradeoff`` , default = ``1.0``, type = double, constraints: ``cegb_tradeoff >= 0.0``
- cost-effective gradient boosting multiplier for all penalties
- ``cegb_penalty_split`` , default = ``0.0``, type = double, constraints: ``cegb_penalty_split >= 0.0``
- cost-effective gradient-boosting penalty for splitting a node
- ``cegb_penalty_feature_lazy`` , default = ``0,0,...,0``, type = multi-double
- cost-effective gradient boosting penalty for using a feature
- applied per data point
- ``cegb_penalty_feature_coupled`` , default = ``0,0,...,0``, type = multi-double
- cost-effective gradient boosting penalty for using a feature
- applied once per forest
- ``path_smooth`` , default = ``0``, type = double, constraints: ``path_smooth >= 0.0``
- controls smoothing applied to tree nodes
- helps prevent overfitting on leaves with few samples
- if set to zero, no smoothing is applied
- if ``path_smooth > 0`` then ``min_data_in_leaf`` must be at least ``2``
- larger values give stronger regularisation
- the weight of each node is ``(n / path_smooth) * w + w_p / (n / path_smooth + 1)``, where ``n`` is the number of samples in the node, ``w`` is the optimal node weight to minimise the loss (approximately ``-sum_gradients / sum_hessians``), and ``w_p`` is the weight of the parent node
- note that the parent output ``w_p`` itself has smoothing applied, unless it is the root node, so that the smoothing effect accumulates with the tree depth
- ``verbosity`` , default = ``1``, type = int, aliases: ``verbose``
- controls the level of LightGBM's verbosity
- ``< 0``: Fatal, ``= 0``: Error (Warning), ``= 1``: Info, ``> 1``: Debug
- ``input_model`` , default = ``""``, type = string, aliases: ``model_input``, ``model_in``
- filename of input model
- for ``prediction`` task, this model will be applied to prediction data
- for ``train`` task, training will be continued from this model
- **Note**: can be used only in CLI version
- ``output_model`` , default = ``LightGBM_model.txt``, type = string, aliases: ``model_output``, ``model_out``
- filename of output model in training
- **Note**: can be used only in CLI version
- ``snapshot_freq`` , default = ``-1``, type = int, aliases: ``save_period``
- frequency of saving model file snapshot
- set this to a positive value to enable this function. For example, the model file will be snapshotted at each iteration if ``snapshot_freq=1``
- **Note**: can be used only in CLI version
IO Parameters
-------------
Dataset Parameters
~~~~~~~~~~~~~~~~~~
- ``max_bin`` , default = ``255``, type = int, constraints: ``max_bin > 1``
- max number of bins that feature values will be bucketed in
- small number of bins may reduce training accuracy but may increase general power (deal with over-fitting)
- LightGBM will auto compress memory according to ``max_bin``. For example, LightGBM will use ``uint8_t`` for feature value if ``max_bin=255``
- ``max_bin_by_feature`` , default = ``None``, type = multi-int
- max number of bins for each feature
- if not specified, will use ``max_bin`` for all features
- ``min_data_in_bin`` , default = ``3``, type = int, constraints: ``min_data_in_bin > 0``
- minimal number of data inside one bin
- use this to avoid one-data-one-bin (potential over-fitting)
- ``bin_construct_sample_cnt`` , default = ``200000``, type = int, aliases: ``subsample_for_bin``, constraints: ``bin_construct_sample_cnt > 0``
- number of data that sampled to construct histogram bins
- setting this to larger value will give better training result, but will increase data loading time
- set this to larger value if data is very sparse
- ``data_random_seed`` , default = ``1``, type = int, aliases: ``data_seed``
- random seed for sampling data to construct histogram bins
- ``is_enable_sparse`` , default = ``true``, type = bool, aliases: ``is_sparse``, ``enable_sparse``, ``sparse``
- used to enable/disable sparse optimization
- ``enable_bundle`` , default = ``true``, type = bool, aliases: ``is_enable_bundle``, ``bundle``
- set this to ``false`` to disable Exclusive Feature Bundling (EFB), which is described in `LightGBM: A Highly Efficient Gradient Boosting Decision Tree <https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree>`__
- **Note**: disabling this may cause the slow training speed for sparse datasets
- ``use_missing`` , default = ``true``, type = bool
- set this to ``false`` to disable the special handle of missing value
- ``zero_as_missing`` , default = ``false``, type = bool
- set this to ``true`` to treat all zero as missing values (including the unshown values in LibSVM / sparse matrices)
- set this to ``false`` to use ``na`` for representing missing values
- ``feature_pre_filter`` , default = ``true``, type = bool
- set this to ``true`` to pre-filter the unsplittable features by ``min_data_in_leaf``
- as dataset object is initialized only once and cannot be changed after that, you may need to set this to ``false`` when searching parameters with ``min_data_in_leaf``, otherwise features are filtered by ``min_data_in_leaf`` firstly if you don't reconstruct dataset object
- **Note**: setting this to ``false`` may slow down the training
- ``pre_partition`` , default = ``false``, type = bool, aliases: ``is_pre_partition``
- used for parallel learning (excluding the ``feature_parallel`` mode)
- ``true`` if training data are pre-partitioned, and different machines use different partitions
- ``two_round`` , default = ``false``, type = bool, aliases: ``two_round_loading``, ``use_two_round_loading``
- set this to ``true`` if data file is too big to fit in memory
- by default, LightGBM will map data file to memory and load features from memory. This will provide faster data loading speed, but may cause an out-of-memory error when the data file is very big
- **Note**: works only in case of loading data directly from file
- ``header`` , default = ``false``, type = bool, aliases: ``has_header``
- set this to ``true`` if input data has header
- **Note**: works only in case of loading data directly from file
- ``label_column`` , default = ``""``, type = int or string, aliases: ``label``
- used to specify the label column
- use number for index, e.g. ``label=0`` means column\_0 is the label
- add a prefix ``name:`` for column name, e.g. ``label=name:is_click``
- **Note**: works only in case of loading data directly from file
- ``weight_column`` , default = ``""``, type = int or string, aliases: ``weight``
- used to specify the weight column
- use number for index, e.g. ``weight=0`` means column\_0 is the weight
- add a prefix ``name:`` for column name, e.g. ``weight=name:weight``
- **Note**: works only in case of loading data directly from file
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0, and weight is column\_1, the correct parameter is ``weight=0``
- ``group_column`` , default = ``""``, type = int or string, aliases: ``group``, ``group_id``, ``query_column``, ``query``, ``query_id``
- used to specify the query/group id column
- use number for index, e.g. ``query=0`` means column\_0 is the query id
- add a prefix ``name:`` for column name, e.g. ``query=name:query_id``
- **Note**: works only in case of loading data directly from file
- **Note**: data should be grouped by query\_id
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``, e.g. when label is column\_0 and query\_id is column\_1, the correct parameter is ``query=0``
- ``ignore_column`` , default = ``""``, type = multi-int or string, aliases: ``ignore_feature``, ``blacklist``
- used to specify some ignoring columns in training
- use number for index, e.g. ``ignore_column=0,1,2`` means column\_0, column\_1 and column\_2 will be ignored
- add a prefix ``name:`` for column name, e.g. ``ignore_column=name:c1,c2,c3`` means c1, c2 and c3 will be ignored
- **Note**: works only in case of loading data directly from file
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
- **Note**: despite the fact that specified columns will be completely ignored during the training, they still should have a valid format allowing LightGBM to load file successfully
- ``categorical_feature`` , default = ``""``, type = multi-int or string, aliases: ``cat_feature``, ``categorical_column``, ``cat_column``
- used to specify categorical features
- use number for index, e.g. ``categorical_feature=0,1,2`` means column\_0, column\_1 and column\_2 are categorical features
- add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features
- **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)
- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
- **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
- **Note**: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero
- **Note**: all negative values will be treated as **missing values**
- **Note**: the output cannot be monotonically constrained with respect to a categorical feature
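As an illustration of the notes above, a sketch marking an integer-encoded column as categorical via the Python package (column index and data are arbitrary):

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    X[:, 0] = rng.integers(0, 10, size=500)  # small consecutive integer codes
    y = (X[:, 0] % 2 == 0).astype(int)

    # Mark column 0 as categorical.
    train_set = lgb.Dataset(X, label=y, categorical_feature=[0])
    booster = lgb.train({"objective": "binary", "verbosity": -1},
                        train_set, num_boost_round=20)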
- ``forcedbins_filename`` , default = ``""``, type = string
- path to a ``.json`` file that specifies bin upper bounds for some or all features
- ``.json`` file should contain an array of objects, each containing the word ``feature`` (integer feature index) and ``bin_upper_bound`` (array of thresholds for binning)
- see `this file <https://github.com/microsoft/LightGBM/tree/master/examples/regression/forced_bins.json>`__ as an example
- ``save_binary`` , default = ``false``, type = bool, aliases: ``is_save_binary``, ``is_save_binary_file``
- if ``true``, LightGBM will save the dataset (including validation data) to a binary file. This speeds up data loading the next time
- **Note**: ``init_score`` is not saved in binary file
- **Note**: can be used only in CLI version; for language-specific packages you can use the correspondent function
Predict Parameters
~~~~~~~~~~~~~~~~~~
- ``num_iteration_predict`` , default = ``-1``, type = int
- used only in ``prediction`` task
- used to specify how many trained iterations will be used in prediction
- ``<= 0`` means no limit
- ``predict_raw_score`` , default = ``false``, type = bool, aliases: ``is_predict_raw_score``, ``predict_rawscore``, ``raw_score``
- used only in ``prediction`` task
- set this to ``true`` to predict only the raw scores
- set this to ``false`` to predict transformed scores
- ``predict_leaf_index`` , default = ``false``, type = bool, aliases: ``is_predict_leaf_index``, ``leaf_index``
- used only in ``prediction`` task
- set this to ``true`` to predict with leaf index of all trees
- ``predict_contrib`` , default = ``false``, type = bool, aliases: ``is_predict_contrib``, ``contrib``
- used only in ``prediction`` task
- set this to ``true`` to estimate `SHAP values <https://arxiv.org/abs/1706.06060>`__, which represent how each feature contributes to each prediction
- produces ``#features + 1`` values where the last value is the expected value of the model output over the training data
- **Note**: if you want to get more explanation for your model's predictions using SHAP values, like SHAP interaction values, you can install the `shap package <https://github.com/slundberg/shap>`__
- **Note**: unlike the shap package, with ``predict_contrib`` we return a matrix with an extra column, where the last column is the expected value
- ``predict_disable_shape_check`` , default = ``false``, type = bool
- used only in ``prediction`` task
- controls whether or not LightGBM raises an error when you try to predict on data with a different number of features than the training data
- if ``false`` (the default), a fatal error will be raised if the number of features in the dataset you predict on differs from the number seen during training
- if ``true``, LightGBM will attempt to predict on whatever data you provide. This is dangerous because you might get incorrect predictions, but you could use it in situations where it is difficult or expensive to generate some features and you are very confident that they were never chosen for splits in the model
- **Note**: be very careful setting this parameter to ``true``
- ``pred_early_stop`` , default = ``false``, type = bool
- used only in ``prediction`` task
- if ``true``, will use early-stopping to speed up the prediction. May affect the accuracy
- ``pred_early_stop_freq`` , default = ``10``, type = int
- used only in ``prediction`` task
- the frequency of checking early-stopping prediction
- ``pred_early_stop_margin`` , default = ``10.0``, type = double
- used only in ``prediction`` task
- the threshold of margin in early-stopping prediction
- ``output_result`` , default = ``LightGBM_predict_result.txt``, type = string, aliases: ``predict_result``, ``prediction_result``, ``predict_name``, ``prediction_name``, ``pred_name``, ``name_pred``
- used only in ``prediction`` task
- filename of prediction result
- **Note**: can be used only in CLI version
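In the Python package these options are passed as keyword arguments of ``predict`` rather than as ``prediction``-task parameters; a sketch with toy data:

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] > 0).astype(int)
    booster = lgb.train({"objective": "binary", "verbosity": -1},
                        lgb.Dataset(X, label=y), num_boost_round=20)

    raw = booster.predict(X, raw_score=True)         # cf. predict_raw_score
    leaves = booster.predict(X, pred_leaf=True)      # cf. predict_leaf_index
    contrib = booster.predict(X, pred_contrib=True)  # cf. predict_contrib
    # contrib has #features + 1 columns; the last is the expected value.
    print(contrib.shape)  # (500, 6)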
Convert Parameters
~~~~~~~~~~~~~~~~~~
- ``convert_model_language`` , default = ``""``, type = string
- used only in ``convert_model`` task
- only ``cpp`` is supported yet; for conversion model to other languages consider using `m2cgen <https://github.com/BayesWitnesses/m2cgen>`__ utility
- if ``convert_model_language`` is set and ``task=train``, the model will be also converted
- **Note**: can be used only in CLI version
- ``convert_model`` , default = ``gbdt_prediction.cpp``, type = string, aliases: ``convert_model_file``
- used only in ``convert_model`` task
- output filename of converted model
- **Note**: can be used only in CLI version
Objective Parameters
--------------------
- ``objective_seed`` , default = ``5``, type = int
- used only in ``rank_xendcg`` objective
- random seed for objectives, if random process is needed
- ``num_class`` , default = ``1``, type = int, aliases: ``num_classes``, constraints: ``num_class > 0``
- used only in ``multi-class`` classification application
- ``is_unbalance`` , default = ``false``, type = bool, aliases: ``unbalance``, ``unbalanced_sets``
- used only in ``binary`` and ``multiclassova`` applications
- set this to ``true`` if training data are unbalanced
- **Note**: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities
- **Note**: this parameter cannot be used at the same time with ``scale_pos_weight``, choose only **one** of them
- ``scale_pos_weight`` , default = ``1.0``, type = double, constraints: ``scale_pos_weight > 0.0``
- used only in ``binary`` and ``multiclassova`` applications
- weight of labels with positive class
- **Note**: while enabling this should increase the overall performance metric of your model, it will also result in poor estimates of the individual class probabilities
- **Note**: this parameter cannot be used at the same time with ``is_unbalance``, choose only **one** of them
- ``sigmoid`` , default = ``1.0``, type = double, constraints: ``sigmoid > 0.0``
- used only in ``binary`` and ``multiclassova`` classification and in ``lambdarank`` applications
- parameter for the sigmoid function
- ``boost_from_average`` , default = ``true``, type = bool
- used only in ``regression``, ``binary``, ``multiclassova`` and ``cross-entropy`` applications
- adjusts initial score to the mean of labels for faster convergence
- ``reg_sqrt`` , default = ``false``, type = bool
- used only in ``regression`` application
- used to fit ``sqrt(label)`` instead of original values and prediction result will be also automatically converted to ``prediction^2``
- might be useful in case of large-range labels
- ``alpha`` , default = ``0.9``, type = double, constraints: ``alpha > 0.0``
- used only in ``huber`` and ``quantile`` ``regression`` applications
- parameter for `Huber loss <https://en.wikipedia.org/wiki/Huber_loss>`__ and `Quantile regression <https://en.wikipedia.org/wiki/Quantile_regression>`__
- ``fair_c`` , default = ``1.0``, type = double, constraints: ``fair_c > 0.0``
- used only in ``fair`` ``regression`` application
- parameter for `Fair loss <https://www.kaggle.com/c/allstate-claims-severity/discussion/24520>`__
- ``poisson_max_delta_step`` , default = ``0.7``, type = double, constraints: ``poisson_max_delta_step > 0.0``
- used only in ``poisson`` ``regression`` application
- parameter for `Poisson regression <https://en.wikipedia.org/wiki/Poisson_regression>`__ to safeguard optimization
- ``tweedie_variance_power`` , default = ``1.5``, type = double, constraints: ``1.0 <= tweedie_variance_power < 2.0``
- used only in ``tweedie`` ``regression`` application
- used to control the variance of the tweedie distribution
- set this closer to ``2`` to shift towards a **Gamma** distribution
- set this closer to ``1`` to shift towards a **Poisson** distribution
- ``lambdarank_truncation_level`` , default = ``20``, type = int, constraints: ``lambdarank_truncation_level > 0``
- used only in ``lambdarank`` application
- used for truncating the max DCG, refer to "truncation level" in the Sec. 3 of `LambdaMART paper <https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf>`__
- ``lambdarank_norm`` , default = ``true``, type = bool
- used only in ``lambdarank`` application
- set this to ``true`` to normalize the lambdas for different queries, and improve the performance for unbalanced data
- set this to ``false`` to enforce the original lambdarank algorithm
- ``label_gain`` , default = ``0,1,3,7,15,31,63,...,2^30-1``, type = multi-double
- used only in ``lambdarank`` application
- relevant gain for labels. For example, the gain of label ``2`` is ``3`` in case of default label gains
- separate by ``,``
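To make the ranking parameters concrete, a sketch of a ``lambdarank`` setup in the Python package. The group sizes, labels, and gains are illustrative; ``group`` lists the number of consecutive rows per query, matching the Query Data format described later:

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(112, 5))
    y = rng.integers(0, 4, size=112)  # int relevance labels, larger = better
    # Three queries of 27, 18, and 67 consecutive rows.
    train_set = lgb.Dataset(X, label=y, group=[27, 18, 67])

    params = {
        "objective": "lambdarank",
        "metric": "ndcg",
        "eval_at": [1, 3, 5],
        "label_gain": [0, 1, 3, 7],  # gain for label values 0..3
        "verbosity": -1,
    }
    booster = lgb.train(params, train_set, num_boost_round=30)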
Metric Parameters
-----------------
- ``metric`` , default = ``""``, type = multi-enum, aliases: ``metrics``, ``metric_types``
- metric(s) to be evaluated on the evaluation set(s)
- ``""`` (empty string or not specified) means that metric corresponding to specified ``objective`` will be used (this is possible only for pre-defined objective functions, otherwise no evaluation metric will be added)
- ``"None"`` (string, **not** a ``None`` value) means that no metric will be registered, aliases: ``na``, ``null``, ``custom``
- ``l1``, absolute loss, aliases: ``mean_absolute_error``, ``mae``, ``regression_l1``
- ``l2``, square loss, aliases: ``mean_squared_error``, ``mse``, ``regression_l2``, ``regression``
- ``rmse``, root square loss, aliases: ``root_mean_squared_error``, ``l2_root``
- ``quantile``, `Quantile regression <https://en.wikipedia.org/wiki/Quantile_regression>`__
- ``mape``, `MAPE loss <https://en.wikipedia.org/wiki/Mean_absolute_percentage_error>`__, aliases: ``mean_absolute_percentage_error``
- ``huber``, `Huber loss <https://en.wikipedia.org/wiki/Huber_loss>`__
- ``fair``, `Fair loss <https://www.kaggle.com/c/allstate-claims-severity/discussion/24520>`__
- ``poisson``, negative log-likelihood for `Poisson regression <https://en.wikipedia.org/wiki/Poisson_regression>`__
- ``gamma``, negative log-likelihood for **Gamma** regression
- ``gamma_deviance``, residual deviance for **Gamma** regression
- ``tweedie``, negative log-likelihood for **Tweedie** regression
- ``ndcg``, `NDCG <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`__, aliases: ``lambdarank``, ``rank_xendcg``, ``xendcg``, ``xe_ndcg``, ``xe_ndcg_mart``, ``xendcg_mart``
- ``map``, `MAP <https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/>`__, aliases: ``mean_average_precision``
- ``auc``, `AUC <https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve>`__
- ``binary_logloss``, `log loss <https://en.wikipedia.org/wiki/Cross_entropy>`__, aliases: ``binary``
- ``binary_error``, for one sample: ``0`` for correct classification, ``1`` for error classification
- ``auc_mu``, `AUC-mu <http://proceedings.mlr.press/v97/kleiman19a/kleiman19a.pdf>`__
- ``multi_logloss``, log loss for multi-class classification, aliases: ``multiclass``, ``softmax``, ``multiclassova``, ``multiclass_ova``, ``ova``, ``ovr``
- ``multi_error``, error rate for multi-class classification
- ``cross_entropy``, cross-entropy (with optional linear weights), aliases: ``xentropy``
- ``cross_entropy_lambda``, "intensity-weighted" cross-entropy, aliases: ``xentlambda``
- ``kullback_leibler``, `Kullback-Leibler divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`__, aliases: ``kldiv``
- support multiple metrics, separated by ``,``
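For example, several metrics can be tracked at once from the Python package. The sketch below assumes a 3.x-era version where ``evals_result`` is a keyword argument of ``lgb.train`` (newer versions use the ``lgb.record_evaluation`` callback instead):

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0] + rng.normal(size=1000)

    train_set = lgb.Dataset(X[:800], label=y[:800])
    valid_set = lgb.Dataset(X[800:], label=y[800:], reference=train_set)

    params = {
        "objective": "regression",
        "metric": ["l2", "l1", "mape"],  # multiple metrics, as a list
        "verbosity": -1,
    }
    evals = {}
    booster = lgb.train(params, train_set, num_boost_round=50,
                        valid_sets=[valid_set], evals_result=evals)
    print(evals["valid_0"].keys())  # dict_keys(['l2', 'l1', 'mape'])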
- ``metric_freq`` , default = ``1``, type = int, aliases: ``output_freq``, constraints: ``metric_freq > 0``
- frequency for metric output
- **Note**: can be used only in CLI version
- ``is_provide_training_metric`` , default = ``false``, type = bool, aliases: ``training_metric``, ``is_training_metric``, ``train_metric``
- set this to ``true`` to output metric result over training dataset
- **Note**: can be used only in CLI version
- ``eval_at`` , default = ``1,2,3,4,5``, type = multi-int, aliases: ``ndcg_eval_at``, ``ndcg_at``, ``map_eval_at``, ``map_at``
- used only with ``ndcg`` and ``map`` metrics
- `NDCG <https://en.wikipedia.org/wiki/Discounted_cumulative_gain#Normalized_DCG>`__ and `MAP <https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/>`__ evaluation positions, separated by ``,``
- ``multi_error_top_k`` , default = ``1``, type = int, constraints: ``multi_error_top_k > 0``
- used only with ``multi_error`` metric
- threshold for top-k multi-error metric
- the error on each sample is ``0`` if the true class is among the top ``multi_error_top_k`` predictions, and ``1`` otherwise
- more precisely, the error on a sample is ``0`` if there are at least ``num_classes - multi_error_top_k`` predictions strictly less than the prediction on the true class
- when ``multi_error_top_k=1`` this is equivalent to the usual multi-error metric
- ``auc_mu_weights`` , default = ``None``, type = multi-double
- used only with ``auc_mu`` metric
- list representing flattened matrix (in row-major order) giving loss weights for classification errors
- list should have ``n * n`` elements, where ``n`` is the number of classes
- the matrix co-ordinate ``[i, j]`` should correspond to the ``i * n + j``-th element of the list
- if not specified, will use equal weights for all classes
Network Parameters
------------------
- ``num_machines`` , default = ``1``, type = int, aliases: ``num_machine``, constraints: ``num_machines > 0``
- the number of machines for parallel learning application
- this parameter is needed to be set in both **socket** and **mpi** versions
- ``local_listen_port`` , default = ``12400``, type = int, aliases: ``local_port``, ``port``, constraints: ``local_listen_port > 0``
- TCP listen port for local machines
- **Note**: don't forget to allow this port in firewall settings before training
- ``time_out`` , default = ``120``, type = int, constraints: ``time_out > 0``
- socket time-out in minutes
- ``machine_list_filename`` , default = ``""``, type = string, aliases: ``machine_list_file``, ``machine_list``, ``mlist``
- path of file that lists machines for this parallel learning application
- each line contains one IP and one port for one machine. The format is ``ip port`` (space as a separator)
- ``machines`` , default = ``""``, type = string, aliases: ``workers``, ``nodes``
- list of machines in the following format: ``ip1:port1,ip2:port2``
GPU Parameters
--------------
- ``gpu_platform_id`` , default = ``-1``, type = int
- OpenCL platform ID. Usually each GPU vendor exposes one OpenCL platform
- ``-1`` means the system-wide default platform
- **Note**: refer to `GPU Targets <./GPU-Targets.rst#query-opencl-devices-in-your-system>`__ for more details
- ``gpu_device_id`` , default = ``-1``, type = int
- OpenCL device ID in the specified platform. Each GPU in the selected platform has a unique device ID
- ``-1`` means the default device in the selected platform
- **Note**: refer to `GPU Targets <./GPU-Targets.rst#query-opencl-devices-in-your-system>`__ for more details
- ``gpu_use_dp`` , default = ``false``, type = bool
- set this to ``true`` to use double precision math on GPU (by default single precision is used)
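Assuming a GPU-enabled build of LightGBM, the GPU parameters combine with the earlier ``device_type`` setting roughly like this (a sketch; platform and device IDs depend on your system):

::

    # Example parameter dict for GPU training (illustrative values).
    params = {
        "objective": "regression",
        "device_type": "gpu",   # requires LightGBM built with GPU support
        "gpu_platform_id": -1,  # -1: system-wide default OpenCL platform
        "gpu_device_id": -1,    # -1: default device in the selected platform
        "gpu_use_dp": False,    # True: 64-bit floats, more accurate but slower
        "max_bin": 63,          # smaller max_bin is recommended on GPU
    }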
Others
------
Continued Training with Input Score
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
LightGBM supports continued training with initial scores. It uses an additional file to store these initial scores, like the following:
::
0.5
-0.1
0.9
...
It means the initial score of the first data row is ``0.5``, the second is ``-0.1``, and so on.
The initial score file corresponds with the data file line by line, with one score per line.
If the name of the data file is ``train.txt``, the initial score file should be named ``train.txt.init`` and placed in the same folder as the data file.
In this case, LightGBM will load the initial score file automatically if it exists.
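A sketch of preparing such a file from Python (the file name follows the convention above; with the Python package you can alternatively pass ``init_score`` directly to ``lgb.Dataset``):

::

    import numpy as np

    # One initial score per training row, in the same order as train.txt.
    init_scores = np.array([0.5, -0.1, 0.9])
    np.savetxt("train.txt.init", init_scores)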
Weight Data
~~~~~~~~~~~
LightGBM supports weighted training. It uses an additional file to store weight data, like the following:
::
1.0
0.5
0.8
...
It means the weight of the first data row is ``1.0``, second is ``0.5``, and so on.
The weight file corresponds with the data file line by line, with one weight per line.
If the name of the data file is ``train.txt``, the weight file should be named ``train.txt.weight`` and placed in the same folder as the data file.
In this case, LightGBM will load the weight file automatically if it exists.
Also, you can include a weight column in your data file. Please refer to the ``weight_column`` `parameter <#weight_column>`__ above.
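Equivalently, with the Python package the weights can be attached in memory instead of via a ``.weight`` file (a sketch with illustrative weights):

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = (X[:, 0] > 0).astype(int)
    w = rng.uniform(0.5, 1.0, size=100)  # one weight per row

    train_set = lgb.Dataset(X, label=y, weight=w)
    booster = lgb.train({"objective": "binary", "verbosity": -1},
                        train_set, num_boost_round=20)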
Query Data
~~~~~~~~~~
For learning to rank, query information is needed for the training data.
LightGBM uses an additional file to store query data, like the following:
::
27
18
67
...
It means the first ``27`` lines of samples belong to one query, the next ``18`` lines belong to another, and so on.
**Note**: data should be ordered by the query.
If the name of data file is ``train.txt``, the query file should be named as ``train.txt.query`` and placed in the same folder as the data file.
In this case, LightGBM will load the query file automatically if it exists.
Also, you can include a query/group id column in your data file. Please refer to the ``group_column`` `parameter <#group_column>`__ above.