secretflow.ml.boost.homo_boost.tree_core#
secretflow.ml.boost.homo_boost.tree_core.criterion#
Classes:
- Criterion – Base class for split criterion
- XgboostCriterion – XgboostCriterion split criterion class
- class secretflow.ml.boost.homo_boost.tree_core.criterion.Criterion[source]#
Bases: ABC
Base class for split criterion
Methods:
- split_gain(left_node_sum, right_node_sum)
- class secretflow.ml.boost.homo_boost.tree_core.criterion.XgboostCriterion(reg_lambda: float = 0.1, reg_alpha: float = 0, decimal: int = 10)[source]#
Bases: Criterion
XgboostCriterion split criterion class.
- reg_lambda#
L2 regularization term on weight
- reg_alpha#
L1 regularization term on weight
- decimal#
truncation precision (number of decimal places)
Methods:
- __init__([reg_lambda, reg_alpha, decimal])
- split_gain(node_sum, left_node_sum, ...) – Calculate split gain
- truncate(f[, decimal]) – Truncate f to control precision
- node_gain(sum_grad, sum_hess) – Calculate node gain
- node_weight(sum_grad, sum_hess) – Calculate node weight
- split_gain(node_sum: Tuple[float, float], left_node_sum: Tuple[float, float], right_node_sum: Tuple[float, float]) → float[source]#
Calculate split gain.
- Parameters:
node_sum – sum of Grad and Hess at the node being split
left_node_sum – sum of Grad and Hess at the left child after the split
right_node_sum – sum of Grad and Hess at the right child after the split
- Returns:
Split gain of this split
- Return type:
gain
- static truncate(f, decimal=10)[source]#
Truncate f to the given number of decimal places; controlling precision this way can reduce training time by allowing earlier stopping.
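For orientation, here is a hedged usage sketch of these methods; the formulas in the comments are the standard XGBoost ones, and the exact values returned depend on reg_lambda, reg_alpha, and truncation:

```python
from secretflow.ml.boost.homo_boost.tree_core.criterion import XgboostCriterion

# Standard XGBoost formulas these methods implement (up to regularization
# details): node_gain(G, H) = G^2 / (H + lambda),
# node_weight(G, H) = -G / (H + lambda),
# split_gain = node_gain(left) + node_gain(right) - node_gain(parent).
criterion = XgboostCriterion(reg_lambda=0.1, reg_alpha=0, decimal=10)

parent = (10.0, 20.0)                  # (sum_grad, sum_hess) of the node being split
left, right = (4.0, 8.0), (6.0, 12.0)  # children; element-wise they sum to the parent

gain = criterion.split_gain(parent, left, right)
weight = criterion.node_weight(*left)  # leaf weight if the left child stops here
rounded = XgboostCriterion.truncate(weight, decimal=4)
```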
secretflow.ml.boost.homo_boost.tree_core.decision_tree#
Classes:
- DecisionTree – Class for local version decision tree
- class secretflow.ml.boost.homo_boost.tree_core.decision_tree.DecisionTree(tree_param: Optional[TreeParam] = None, data: Optional[DataFrame] = None, bin_split_points: Optional[ndarray] = None, tree_id: Optional[int] = None, group_id: Optional[int] = None, iter_round: Optional[int] = None, grad_key: str = 'grad', hess_key: str = 'hess', label_key: str = 'label')[source]#
Bases: object
Class for local version decision tree
- tree_param#
params for tree build
- data#
training data, an HDataFrame
- bin_split_points#
global binning info
- tree_id#
tree id
- group_id#
group id indicates which class the tree classifies
- iter_round#
iteration round
- hess_key#
unique column name for hess value
- grad_key#
unique column name for grad value
Methods:
- __init__([tree_param, data, ...])
- feature_col_sample(all_features[, sample_rate]) – Column sample for features
- … – convert bid to real value
- get_grad_hess_sum(data_frame) – Calculate the sum of grad and hess
- update_feature_importance(split_info) – Calculate feature importance, by default as split count
- fit() – Entrance for the local decision tree
- init_xgboost_model(model_path) – Init a standard xgboost model
- update_tree(cur_to_split, split_info, ...) – Tree update function
- save_xgboost_model(model_path, tree_nodes) – Transform tree info into a standard xgboost model
- __init__(tree_param: Optional[TreeParam] = None, data: Optional[DataFrame] = None, bin_split_points: Optional[ndarray] = None, tree_id: Optional[int] = None, group_id: Optional[int] = None, iter_round: Optional[int] = None, grad_key: str = 'grad', hess_key: str = 'hess', label_key: str = 'label')[source]#
- feature_col_sample(all_features: List[str], sample_rate: float = 1.0)[source]#
Column sample for features.
- Parameters:
all_features – A list of feature names for all columns
sample_rate – subsample rate, a float in [0, 1]
- Returns:
A dict of valid features, which will be used in this round's build
- Return type:
valid_features
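The return shape suggests per-column validity flags. Below is a minimal, self-contained sketch of equivalent sampling logic; sample_columns and the feature names are illustrative, not part of the API:

```python
import random
from typing import Dict, List

def sample_columns(all_features: List[str], sample_rate: float = 1.0) -> Dict[int, bool]:
    # Keep roughly sample_rate of the columns (at least one) and mark the
    # rest invalid, mirroring the documented Dict[id: bool] valid_features.
    k = max(1, int(len(all_features) * sample_rate))
    chosen = set(random.sample(range(len(all_features)), k))
    return {i: i in chosen for i in range(len(all_features))}

valid_features = sample_columns(["x1", "x2", "x3"], sample_rate=0.67)
```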
- get_grad_hess_sum(data_frame)[source]#
Calculate the sum of grad and hess.
- Parameters:
data_frame – data frame which contains the hess and grad columns
- Returns:
grad – sum of grad; hess – sum of hess
- Return type:
grad, hess
- update_feature_importance(split_info)[source]#
Calculate feature importance, by default as split count.
- Parameters:
split_info – Global optimal splitting information calculated from the histogram
- update_tree(cur_to_split: List[Node], split_info: List[SplitInfo], cur_data_frames: List[DataFrame])[source]#
Tree update function.
- Parameters:
cur_to_split – List of nodes to be split
split_info – Global optimal split info
cur_data_frames – List of dataframes, one per node
- Returns:
next_layer_node – List of nodes to be evaluated in the next iteration; next_layer_data – List of data to be evaluated in the next iteration
- Return type:
next_layer_node, next_layer_data
- save_xgboost_model(model_path: str, tree_nodes: List[Node])[source]#
Transform tree info into a standard xgboost model. Ref: https://xgboost.readthedocs.io/en/latest/dev/structxgboost_1_1TreeParam.html#aab8ff286e59f1bbab47bfa865da4a107
- Parameters:
model_path – model path
tree_nodes – federated decision tree internal model
- Returns:
Updates the standard xgboost model at the model path
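Putting the pieces together, here is a hedged construction sketch based only on the signature above. The data values are synthetic, and a fully populated TreeParam (not shown on this page) is required before fit() can actually run:

```python
import numpy as np
import pandas as pd
from secretflow.ml.boost.homo_boost.tree_core.decision_tree import DecisionTree

# A local frame carrying per-sample grad/hess under the documented column keys.
df = pd.DataFrame({
    "x1": np.random.rand(100),
    "grad": np.random.randn(100),
    "hess": np.ones(100),
    "label": np.random.randint(0, 2, 100),
})
# One row of global bin split points per feature (values are illustrative).
bin_split_points = np.array([np.quantile(df["x1"], [0.25, 0.5, 0.75])])

tree = DecisionTree(
    tree_param=None,  # pass a real TreeParam here; None is only a placeholder
    data=df,
    bin_split_points=bin_split_points,
    tree_id=0,
    group_id=0,
    iter_round=0,
    grad_key="grad",
    hess_key="hess",
    label_key="label",
)
# tree.fit()  # entrance for the local decision tree (needs a valid TreeParam)
```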
secretflow.ml.boost.homo_boost.tree_core.feature_histogram#
Classes:
- HistogramBag – Histogram container
- FeatureHistogram – Feature Histogram
- class secretflow.ml.boost.homo_boost.tree_core.feature_histogram.HistogramBag(histogram: Optional[List] = None, hid: int = -1, p_hid: int = -1)[source]#
Bases: object
Histogram container
- histogram#
Histogram list calculated by calculate_histogram
- Type:
List
- hid#
histogram id
- Type:
int
- p_hid#
parent histogram id
- Type:
int
Attributes:
- histogram, hid, p_hid
Methods:
- binary_op(other, func[, inplace])
- __init__([histogram, hid, p_hid])
- histogram: List = None#
- hid: int = -1#
- p_hid: int = -1#
- __init__(histogram: Optional[List] = None, hid: int = -1, p_hid: int = -1) → None#
- class secretflow.ml.boost.homo_boost.tree_core.feature_histogram.FeatureHistogram[source]#
Bases: object
Feature Histogram
Methods:
- calculate_histogram(data_frame_list, ...[, ...]) – Calculate histogram according to G and H
- calculate_single_histogram(data, bin_split_point)
- static calculate_histogram(data_frame_list: List[DataFrame], bin_split_points: ndarray, valid_features: Optional[Dict] = None, use_missing: bool = False, grad_key: str = 'grad', hess_key: str = 'hess', thread_pool: Optional[ThreadPoolExecutor] = None)[source]#
Calculate histogram according to G and H. Histogram layout: [cols, [buckets, [sum_g, sum_h, count]]]
- Parameters:
data_frame_list – A list of data frames which contain grad and hess
bin_split_points – global split point dicts
valid_features – valid feature names Dict[id: bool]
use_missing – whether missing values participate in training
grad_key – unique column name for grad value
hess_key – unique column name for hess value
- Returns:
A list: [histogram1, histogram2, …]
- Return type:
node_histograms
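To make the documented nesting concrete, here is a small plain-Python sketch that accumulates the same [cols, [buckets, [sum_g, sum_h, count]]] structure; build_histogram is illustrative, not the library function:

```python
import numpy as np

def build_histogram(binned, grad, hess, n_buckets):
    # binned: (n_samples, n_features) array of bucket ids per feature.
    n_samples, n_features = binned.shape
    hist = [[[0.0, 0.0, 0] for _ in range(n_buckets)] for _ in range(n_features)]
    for i in range(n_samples):
        for f in range(n_features):
            b = binned[i, f]
            hist[f][b][0] += grad[i]   # sum_g
            hist[f][b][1] += hess[i]   # sum_h
            hist[f][b][2] += 1         # count
    return hist

binned = np.array([[0, 1], [1, 1], [2, 0]])
hist = build_histogram(binned, grad=[0.1, -0.2, 0.3], hess=[1.0, 0.9, 1.1], n_buckets=3)
# hist[0][1] -> [-0.2, 0.9, 1]: feature 0, bucket 1 holds one sample
```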
secretflow.ml.boost.homo_boost.tree_core.feature_importance#
Classes:
- FeatureImportance – Feature importance class
- class secretflow.ml.boost.homo_boost.tree_core.feature_importance.FeatureImportance(main_importance: float = 0, other_importance: float = 0, main_type: str = 'split')[source]#
Bases: object
Feature importance class
- main_importance#
main importance value, of the type given by main_type
- other_importance#
the other importance value, of the type opposite to main_type
- main_type#
type of the main importance, e.g. gain
Methods:
- __init__([main_importance, ...])
- add_gain(val)
- add_split(val)
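A short usage sketch following the documented constructor and methods; with main_type='split' the split count is the main importance and gain is tracked as the other type:

```python
from secretflow.ml.boost.homo_boost.tree_core.feature_importance import FeatureImportance

fi = FeatureImportance(main_importance=0, other_importance=0, main_type="split")
fi.add_split(1)     # one more split made on this feature
fi.add_gain(0.42)   # gain contributed by that split (tracked as the other type)
```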
secretflow.ml.boost.homo_boost.tree_core.loss_function#
Classes:
- LossFunction – Inner definitions for loss functions
- class secretflow.ml.boost.homo_boost.tree_core.loss_function.LossFunction(obj_name: str)[source]#
Bases: object
Inner definitions for loss functions
- obj_name#
Name of the loss function, one of:
- "binary:logistic" – logistic regression for binary classification, output probability
- "reg:logistic" – logistic regression
- "multi:softmax" – multiclass classification using the softmax objective, output class
- "multi:softprob" – multiclass classification via softmax, output per-class probability
- "reg:squarederror" – regression with squared loss
Methods:
- __init__(obj_name)
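A minimal construction sketch; obj_name must be one of the objective strings listed above:

```python
from secretflow.ml.boost.homo_boost.tree_core.loss_function import LossFunction

# Binary classification with probability output.
loss = LossFunction(obj_name="binary:logistic")
```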
secretflow.ml.boost.homo_boost.tree_core.node#
Classes:
- Node – Tree Node
- class secretflow.ml.boost.homo_boost.tree_core.node.Node(id: Optional[int] = None, fid: Optional[int] = None, bid: Optional[int] = None, weight: float = 0.0, is_leaf: bool = False, sum_grad: Optional[float] = None, sum_hess: Optional[float] = None, left_nodeid: int = -1, right_nodeid: int = -1, missing_dir: int = 1, sample_num: int = 0, parent_nodeid: Optional[int] = None, is_left_node: bool = False, sibling_nodeid: Optional[int] = None, loss_change: float = 0.0)[source]#
Bases: object
Tree Node
- id#
node id
- Type:
int
- fid#
feature id
- Type:
int
- bid#
bucket id
- Type:
int
- weight#
node weight
- Type:
float
- is_leaf#
whether this node is leaf
- Type:
bool
- sum_grad#
sum of grad
- Type:
float
- sum_hess#
sum of hess
- Type:
float
- left_nodeid#
left node id
- Type:
int
- right_nodeid#
right node id
- Type:
int
- missing_dir#
which branch to take when encountering a missing value; default 1 (right)
- Type:
int
- sample_num#
num of data sample
- Type:
int
- parent_nodeid#
parent node id
- Type:
int
- is_left_node#
whether this node is the left child of its parent
- Type:
bool
- sibling_nodeid#
sibling node id
- Type:
int
- loss_change#
the loss change.
- Type:
float
Attributes:
- id, fid, bid, weight, is_leaf, sum_grad, sum_hess, left_nodeid, right_nodeid, missing_dir, sample_num, parent_nodeid, is_left_node, sibling_nodeid, loss_change
Methods:
- __init__([id, fid, bid, weight, is_leaf, ...])
- id: int = None#
- fid: int = None#
- bid: int = None#
- weight: float = 0.0#
- is_leaf: bool = False#
- sum_grad: float = None#
- sum_hess: float = None#
- left_nodeid: int = -1#
- right_nodeid: int = -1#
- missing_dir: int = 1#
- sample_num: int = 0#
- parent_nodeid: int = None#
- is_left_node: bool = False#
- sibling_nodeid: int = None#
- loss_change: float = 0.0#
- __init__(id: Optional[int] = None, fid: Optional[int] = None, bid: Optional[int] = None, weight: float = 0.0, is_leaf: bool = False, sum_grad: Optional[float] = None, sum_hess: Optional[float] = None, left_nodeid: int = -1, right_nodeid: int = -1, missing_dir: int = 1, sample_num: int = 0, parent_nodeid: Optional[int] = None, is_left_node: bool = False, sibling_nodeid: Optional[int] = None, loss_change: float = 0.0) → None#
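Since Node is a plain data container, constructing a tiny two-level tree shows how the id fields link nodes together; all values are illustrative:

```python
from secretflow.ml.boost.homo_boost.tree_core.node import Node

# Root splits on feature 3 at bucket 7; missing values go right (missing_dir=1).
root = Node(id=0, fid=3, bid=7, sum_grad=10.0, sum_hess=20.0,
            left_nodeid=1, right_nodeid=2, sample_num=100)
left = Node(id=1, is_leaf=True, weight=-0.35, parent_nodeid=0,
            is_left_node=True, sibling_nodeid=2)
right = Node(id=2, is_leaf=True, weight=0.27, parent_nodeid=0,
             is_left_node=False, sibling_nodeid=1)
```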
secretflow.ml.boost.homo_boost.tree_core.splitter#
Classes:
- SplitInfo – Split info class
- Splitter – Split calculation class
- class secretflow.ml.boost.homo_boost.tree_core.splitter.SplitInfo(best_fid: Optional[int] = None, best_bid: Optional[int] = None, sum_grad: float = 0, sum_hess: float = 0, gain: Optional[float] = None, missing_dir: int = 1, sample_count: int = -1)[source]#
Bases: object
SplitInfo class.
- best_fid#
best split on feature id
- Type:
int
- best_bid#
best split on bucket id
- Type:
int
- sum_grad#
sum of grad
- Type:
float
- sum_hess#
sum of hess
- Type:
float
- gain#
split gain
- Type:
float
- missing_dir#
which branch to take when encountering a missing value; default 1 (right)
- Type:
int
- sample_count#
num of sample after split
- Type:
int
Attributes:
- best_fid, best_bid, sum_grad, sum_hess, gain, missing_dir, sample_count
Methods:
- __init__([best_fid, best_bid, sum_grad, ...])
- best_fid: int = None#
- best_bid: int = None#
- sum_grad: float = 0#
- sum_hess: float = 0#
- gain: float = None#
- missing_dir: int = 1#
- sample_count: int = -1#
- __init__(best_fid: Optional[int] = None, best_bid: Optional[int] = None, sum_grad: float = 0, sum_hess: float = 0, gain: Optional[float] = None, missing_dir: int = 1, sample_count: int = -1) → None#
- class secretflow.ml.boost.homo_boost.tree_core.splitter.Splitter(criterion_method: str, criterion_params: List = [0, 0, 10], min_impurity_split: float = 0.01, min_sample_split: int = 2, min_leaf_node: int = 1, min_child_weight: int = 1)[source]#
Bases: object
Split calculation class.
- criterion_method#
criterion method
- criterion_params#
criterion params, e.g. [l1: 0.1, l2: 0.2]
- min_impurity_split#
minimum gain threshold for a split
- min_sample_split#
minimum number of samples required to split a node, default 2
- min_leaf_node#
minimum number of samples required on a node to split
- min_child_weight#
minimum sum of hess after a split
Methods:
- __init__(criterion_method[, ...])
- find_split_once(histogram, valid_features, ...) – Find the best split info from a histogram
- find_split(histograms, valid_features[, ...]) – Find the optimal split points
- node_gain(grad, hess)
- node_weight(grad, hess)
- split_gain(sum_grad, sum_hess, sum_grad_l, ...)
- __init__(criterion_method: str, criterion_params: List = [0, 0, 10], min_impurity_split: float = 0.01, min_sample_split: int = 2, min_leaf_node: int = 1, min_child_weight: int = 1)[source]#
- find_split_once(histogram: List, valid_features: Dict, use_missing: bool) → SplitInfo[source]#
Find the best split info from a histogram.
- Parameters:
histogram – a three-dimensional matrix storing G, H, Count
valid_features – valid feature names Dict[id: bool]
use_missing – whether missing values participate in training
- Returns:
best split point info
- Return type:
SplitInfo
- find_split(histograms: List, valid_features: Dict, use_missing: bool = False) → List[SplitInfo][source]#
Find the optimal split points.
- Parameters:
histograms – a list of histograms
valid_features – valid feature names Dict[id: bool]
use_missing – whether missing values participate in training
- Returns:
best split info for each node
- Return type:
tree_node_splitinfo
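Finally, a hedged construction sketch for Splitter. The "xgboost" value for criterion_method and the [reg_lambda, reg_alpha, decimal] reading of criterion_params are assumptions inferred from the default [0, 0, 10] and the XgboostCriterion signature above; they are not confirmed by this page:

```python
from secretflow.ml.boost.homo_boost.tree_core.splitter import Splitter

splitter = Splitter(
    criterion_method="xgboost",       # assumed method name (see lead-in)
    criterion_params=[0.1, 0.0, 10],  # assumed [reg_lambda, reg_alpha, decimal]
    min_impurity_split=0.01,
    min_sample_split=2,
    min_leaf_node=1,
    min_child_weight=1,
)
# With a list of HistogramBag objects (see feature_histogram above):
# split_infos = splitter.find_split(histograms, valid_features={0: True},
#                                   use_missing=False)
```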