secretflow.preprocessing.binning#
secretflow.preprocessing.binning.homo_binning#
Driver-side program.
Classes:
- HomoBinning: entry point for federated binning.
- class secretflow.preprocessing.binning.homo_binning.HomoBinning(bin_num: int = 10, compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, bin_indexes: List[int] = [], bin_names: List[str] = [], abnormal_list: Optional[List[str]] = None, allow_duplicate: bool = False, max_iter: int = 10, aggregator=None)[source]#
Bases:
ActorProxy(HomoBinningBase)
Entry point for federated binning.
- bin_num#
Number of buckets to split each feature into.
- compress_thres#
Compression threshold. Compression is performed once the value exceeds this threshold.
- head_size#
Buffer size.
- error#
Error tolerance.
- bin_indexes#
Indexes of the features to bin.
- bin_names#
Names of the features to bin.
- abnormal_list#
List of anomalous features.
- allow_duplicate#
Whether duplicate bucket values are allowed.
- aggregator#
Aggregator used to aggregate values across parties.
- max_iter#
Maximum number of iteration rounds.
Methods:
__init__([bin_num, compress_thres, ...]): Abstraction device object base class.
fit_split_points(hdata): Entry point for federated binning.
setup_header_param(header, bin_names, ...)
- __init__(bin_num: int = 10, compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, bin_indexes: List[int] = [], bin_names: List[str] = [], abnormal_list: Optional[List[str]] = None, allow_duplicate: bool = False, max_iter: int = 10, aggregator=None)[source]#
Abstraction device object base class.
- Parameters:
device (Device) – Device where this object is located.
- fit_split_points(hdata: HDataFrame)[source]#
Entry point for federated binning.
- Parameters:
hdata – HDataFrame, the input data to bin.
- Returns:
A dict of binning results, as a PYUObject.
- Return type:
bin_result
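To make the role of bin_num concrete, the following is a minimal local sketch, assuming the split points target equal-frequency ranks. The helpers `aim_ranks` and `local_split_points` are hypothetical names introduced for illustration; they are not part of SecretFlow, and the real HomoBinning works on summaries aggregated across parties rather than on one sorted list.

```python
from typing import List, Sequence


def aim_ranks(total_count: int, bin_num: int) -> List[int]:
    """Target ranks of the bin_num - 1 interior split points.

    Assumption: buckets are equal-frequency, so the i-th split point
    should have about total_count * i / bin_num values at or below it.
    """
    return [total_count * i // bin_num for i in range(1, bin_num)]


def local_split_points(values: Sequence[float], bin_num: int) -> List[float]:
    """Pick split points from a single, fully visible column."""
    ordered = sorted(values)
    # A rank r means "r values are <= the split point",
    # so the split point is the r-th smallest value.
    return [ordered[r - 1] for r in aim_ranks(len(ordered), bin_num) if r > 0]


data = list(range(1, 21))            # 20 samples: 1, 2, ..., 20
ranks = aim_ranks(len(data), 4)      # [5, 10, 15]
splits = local_split_points(data, 4)  # [5, 10, 15]
print(ranks, splits)
```

With four buckets over twenty samples, each bucket holds five values, which is the equal-frequency property the target ranks encode.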
secretflow.preprocessing.binning.homo_binning_base#
Classes:
- SplitPointNode: dataclass of a split point node.
- HomoBinningBase
- class secretflow.preprocessing.binning.homo_binning_base.SplitPointNode(value: float, min_value: float, max_value: float, aim_rank: int = -1, allow_error_rank: int = 0, error: float = 0.0001, fixed: bool = False)[source]#
Bases:
object
Dataclass of a split point node.
- value#
value of the split point
- Type:
float
- min_value#
min value of the split point
- Type:
float
- max_value#
max value of the split point
- Type:
float
- aim_rank#
aim rank of the split point
- Type:
int
- allow_error_rank#
error tolerance on ranks
- Type:
int
- error#
create a new node if the difference is greater than error
- Type:
float
- fixed#
whether the split position converges
- Type:
bool
Methods:
Search the right half
Search the left half
__init__(value, min_value, max_value[, ...])
Attributes:
- value: float#
- min_value: float#
- max_value: float#
- aim_rank: int = -1#
- allow_error_rank: int = 0#
- error: float = 0.0001#
- fixed: bool = False#
- __init__(value: float, min_value: float, max_value: float, aim_rank: int = -1, allow_error_rank: int = 0, error: float = 0.0001, fixed: bool = False) → None#
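The fields above suggest a bisection search: a node holds a candidate value inside [min_value, max_value] and moves left or right until its rank is close enough to aim_rank. The sketch below illustrates that idea under stated assumptions. `SketchSplitPointNode`, `search_right`, `search_left` and `converge` are hypothetical names, the use of `error` as a minimum step size is a guess based on the field description, and none of this is the library's implementation.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class SketchSplitPointNode:
    """Hypothetical stand-in that mirrors the documented fields."""
    value: float
    min_value: float
    max_value: float
    aim_rank: int = -1
    allow_error_rank: int = 0
    error: float = 0.0001
    fixed: bool = False

    def search_right(self) -> None:
        # The current value ranks too low: keep the upper half.
        self.min_value = self.value
        self._move_to_midpoint()

    def search_left(self) -> None:
        # The current value ranks too high: keep the lower half.
        self.max_value = self.value
        self._move_to_midpoint()

    def _move_to_midpoint(self) -> None:
        new_value = (self.min_value + self.max_value) / 2
        # Assumption: a step no larger than `error` counts as converged.
        if abs(new_value - self.value) <= self.error:
            self.fixed = True
        self.value = new_value


def converge(node, sorted_values, max_iter=10):
    """Bisect until the rank is within allow_error_rank of aim_rank."""
    for _ in range(max_iter):
        if node.fixed:
            break
        rank = bisect_right(sorted_values, node.value)  # count of values <= value
        if abs(rank - node.aim_rank) <= node.allow_error_rank:
            node.fixed = True
        elif rank < node.aim_rank:
            node.search_right()
        else:
            node.search_left()
    return node


data = list(range(1, 101))  # 1, 2, ..., 100
node = converge(
    SketchSplitPointNode(value=50.5, min_value=1, max_value=100, aim_rank=25),
    data,
)
print(node.value, node.fixed)
```

In this run the node starts at 50.5 (rank 50), moves left to 25.75 (rank 25) and is then marked as fixed.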
- secretflow.preprocessing.binning.homo_binning_base.HomoBinningBase[source]#
Alias of ActorProxy(HomoBinningBase).
Methods:
__init__(*args, **kwargs): Abstraction device object base class.
get_missing_count(): Statistics of the missing counts of all parties.
set_missing_dict(missing_count)
cal_summary_dict(data)
init_query_points(split_num[, error_rank, ...]): Initialize the query points.
fit_split_points(data)
query_values(): Query the global rank of each current split point. Returns a dict, e.g. {col1: [g_rank1], col2: [g_rank2]}. Return type: global_rank.
query_table(summary, query_points): Query the rank of query_points in the local summary.
set_aim_rank()
set_header_param(bin_names, bin_indexes, ...)
get_split_points_dict()
renew_query_points(global_ranks): Update the query points.
check_converge(): Check convergence of federated binning.
get_bin_result()
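The query methods above describe a two-level rank lookup: each party ranks the query points in its local summary, and the local ranks are combined into global ranks. A minimal sketch of that step follows, assuming each local summary is simply a sorted list per column and that aggregation is a plain sum. The function names `local_rank_table` and `global_rank_table` are hypothetical; the real classes run as actors on each party and delegate aggregation to the configured aggregator.

```python
from bisect import bisect_right
from typing import Dict, List


def local_rank_table(
    summary: Dict[str, List[float]],
    query_points: Dict[str, List[float]],
) -> Dict[str, List[int]]:
    """Rank of each query point in one party's sorted columns.

    Assumption: the summary is the full sorted column, not a
    compressed quantile sketch.
    """
    return {
        col: [bisect_right(summary[col], q) for q in points]
        for col, points in query_points.items()
    }


def global_rank_table(
    summaries: List[Dict[str, List[float]]],
    query_points: Dict[str, List[float]],
) -> Dict[str, List[int]]:
    """Sum the local ranks of all parties (plain sum as the aggregator)."""
    tables = [local_rank_table(s, query_points) for s in summaries]
    return {
        col: [sum(ranks) for ranks in zip(*(t[col] for t in tables))]
        for col in query_points
    }


alice = {"col1": [1.0, 4.0, 7.0, 10.0]}
bob = {"col1": [2.0, 3.0, 8.0, 9.0, 12.0]}
query_points = {"col1": [3.5, 8.0]}

global_ranks = global_rank_table([alice, bob], query_points)
print(global_ranks)  # {'col1': [3, 6]}
```

Because ranks are counts, the summed local ranks equal the ranks the query points would have in the pooled data, so no party has to reveal its raw values to obtain them.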
secretflow.preprocessing.binning.vert_woe_binning#
Classes:
- VertWoeBinning: WOE binning for vertically partitioned datasets.
- class secretflow.preprocessing.binning.vert_woe_binning.VertWoeBinning(secure_device: Union[SPU, HEU])[source]#
Bases:
object
WOE binning for vertically partitioned datasets.
Splits all features into bins by equal frequency or ChiMerge, then calculates the WOE and IV values of each bin on a secret sharing (SS) or homomorphic encryption (HE) secure device so that the label Y stays protected.
Finally, this method outputs the binning rules that VertWOESubstitution uses to replace feature values with WOE values.
More details about WOE/IV values: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
- secure_device#
HEU or SPU for secure bucket summation.
Methods:
__init__(secure_device)
binning(vdata[, binning_method, bin_num, ...]): Build WOE substitution rules based on vdata.
- binning(vdata: VDataFrame, binning_method: str = 'quantile', bin_num: int = 10, bin_names: Dict[PYU, List[str]] = {}, label_name: str = '', positive_label: str = '1', chimerge_init_bins: int = 100, chimerge_target_bins: int = 10, chimerge_target_pvalue: float = 0.1, audit_log_path: Dict[str, str] = {})[source]#
Build WOE substitution rules based on vdata. Only datasets with a binary classification label are supported.
- vdata#
Vertically partitioned dataset. Numeric features are binned with {binning_method}; string features are binned by their categories; np.nan samples are counted in a separate "else" bin.
- binning_method#
How to bin numeric features. Options: "quantile" (equal frequency) / "chimerge" (ChiMerge from AAAI92-019). Default: "quantile"
- bin_num#
Maximum number of bins for one feature. Range: (0, ∞). Default: 10
- bin_names#
Which features should be binned.
- label_name#
Label column name.
- positive_label#
Which value in the label column represents the positive class.
- chimerge_init_bins#
Maximum number of bins for the initial binning in ChiMerge. Range: (2, ∞). Default: 100
- chimerge_target_bins#
Stop merging once the number of remaining bins is less than or equal to this value. Range: [2, {chimerge_init_bins}). Default: 10
- chimerge_target_pvalue#
Stop merging once the largest p-value among the remaining bins is greater than this value. Range: (0, 1). Default: 0.1
- audit_log_path#
Writes an audit log of HEU encryption to a local path on each device. Empty means disabled. Example: {'alice': '/path/to/alice/audit/filename', 'bob': 'bob/audit/filename'}
NOTICE: Do NOT touch this option. Leave it empty and disabled unless you really know what it means and accept its risk.
- Returns:
Dict[PYU, PYUObject]. Each PYUObject contains a dict with the rules for all features of that party:
{
    "variables": [
        {
            "name": str,                  # feature name
            "type": str,                  # "string" or "numeric", i.e. whether the feature is discrete or continuous
            "categories": list[str],      # categories of a discrete feature
            "split_points": list[float],  # left-open, right-closed split points
            "total_counts": list[int],    # total sample count in each bin
            "else_counts": int,           # count of np.nan samples
            "woes": list[float],          # WOE value of each bin
            "else_woe": float,            # WOE value of np.nan samples
            "ivs": list[float],           # IV value of each bin
            "else_iv": float,             # IV value of np.nan samples
        },
        # ... other features
    ]
}
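The "woes" and "ivs" entries of a rule can be illustrated with a plain, non-secure calculation. The sketch below assumes the common definition WOE = ln(share of positives in the bin / share of negatives in the bin) and IV = (share of positives - share of negatives) × WOE. The sign convention is an assumption and may be the opposite of what the library uses, and the helper `woe_iv` is a hypothetical name. The real computation sums the positive counts on the secure device instead of in the clear.

```python
import math
from typing import List, Tuple


def woe_iv(
    positive_counts: List[int],
    total_counts: List[int],
) -> Tuple[List[float], List[float]]:
    """Per-bin WOE and IV from plain counts.

    Assumes every bin contains at least one positive and one negative
    sample; otherwise the logarithm is undefined.
    """
    total_pos = sum(positive_counts)
    total_neg = sum(total_counts) - total_pos
    woes, ivs = [], []
    for pos, total in zip(positive_counts, total_counts):
        pos_share = pos / total_pos            # share of all positives in this bin
        neg_share = (total - pos) / total_neg  # share of all negatives in this bin
        woe = math.log(pos_share / neg_share)
        woes.append(woe)
        ivs.append((pos_share - neg_share) * woe)
    return woes, ivs


# Three bins of 100 samples each with 10, 30 and 60 positives.
woes, ivs = woe_iv([10, 30, 60], [100, 100, 100])
for w, iv in zip(woes, ivs):
    print(round(w, 4), round(iv, 4))
```

Each bin's IV is non-negative under either sign convention, because the difference of shares and the logarithm of their ratio always have the same sign.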
secretflow.preprocessing.binning.vert_woe_binning_pyu#
Classes:
- VertWoeBinningPyuWorker
- secretflow.preprocessing.binning.vert_woe_binning_pyu.VertWoeBinningPyuWorker[source]#
Alias of ActorProxy(VertWoeBinningPyuWorker).
Methods:
__init__(*args, **kwargs): Abstraction device object base class.
coordinator_work(data): The label holder builds the report for its own features and provides the label to the driver.
participant_build_sum_indices(data): Build sum indices for the driver to calculate positive samples by HE.
participant_build_sum_select(data): Build the select matrix for the driver to calculate positive samples by secret sharing.
participant_sum_bin(bins_positive): Build the bins stat tuple.
coordinator_calc_woe_for_peer(bins_stat): Calculate WOE/IV for the participant party.
participant_build_report(woe_ivs): Build the report based on the coordinator party's WOE/IV values.
secretflow.preprocessing.binning.vert_woe_substitution#
Classes:
- VertWOESubstitutionPyuWorker
- VertWOESubstitution
- secretflow.preprocessing.binning.vert_woe_substitution.VertWOESubstitutionPyuWorker[source]#
Alias of ActorProxy(VertWOESubstitutionPyuWorker).
Methods:
sub(data, r): PYU functions for WOE substitution.
__init__(*args, **kwargs): Abstraction device object base class.
- class secretflow.preprocessing.binning.vert_woe_substitution.VertWOESubstitution[source]#
Bases:
object
Methods:
substitution(vdata, woe_rules): Substitute the dataset's values using WOE substitution rules.
- substitution(vdata: VDataFrame, woe_rules: Dict[PYU, PYUObject]) → VDataFrame[source]#
Substitute the dataset's values using WOE substitution rules.
- Parameters:
vdata – vertically partitioned dataset to be substituted.
woe_rules – WOE substitution rules built by VertWoeBinning.
- Returns:
The vertically partitioned dataset after substitution.
- Return type:
new_vdata
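What substitution does to a single numeric column can be sketched locally. The example below assumes a rule shaped like the "variables" entries produced by VertWoeBinning, with left-open, right-closed split points and one more WOE value than there are split points (the last bin being open-ended). That length relationship, the helper `substitute_numeric` and the sample rule values are assumptions made for illustration; the actual work is performed per party by the PYU worker.

```python
import math
from bisect import bisect_left
from typing import List, Optional


def substitute_numeric(values: List[Optional[float]], rule: dict) -> List[float]:
    """Replace each value with the WOE of the bin it falls into."""
    out = []
    for v in values:
        if v is None or math.isnan(v):
            # Missing samples use the dedicated "else" WOE.
            out.append(rule["else_woe"])
        else:
            # bisect_left returns i such that split[i-1] < v <= split[i],
            # which matches left-open, right-closed bins.
            out.append(rule["woes"][bisect_left(rule["split_points"], v)])
    return out


# Hypothetical rule with made-up WOE values.
rule = {
    "name": "age",
    "type": "numeric",
    "split_points": [30.0, 50.0],
    "woes": [-0.4, 0.1, 0.6],
    "else_woe": 0.0,
}

result = substitute_numeric([25.0, 30.0, 30.5, 50.0, 72.0, float("nan")], rule)
print(result)  # [-0.4, -0.4, 0.1, 0.1, 0.6, 0.0]
```

The values 30.0 and 50.0 land in the bins they close, which is the practical effect of the right-closed convention.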