secretflow.preprocessing.binning.kernels#
secretflow.preprocessing.binning.kernels.base_binning#
Classes:
- class secretflow.preprocessing.binning.kernels.base_binning.BaseBinning(bin_names: List, bin_indexes: List, bin_num: int, abnormal_list: List)[source]#
Bases: ABC

Methods:

__init__(bin_names, bin_indexes, bin_num, ...)

fit_split_points(data)

Attributes:
- property split_points#
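BaseBinning defines the interface that concrete binning kernels implement: the constructor records which columns to bin, subclasses provide fit_split_points, and the learned boundaries are exposed through the split_points property. The sketch below mirrors that interface with a toy equal-width kernel; it is illustrative only, does not import secretflow, and every name beyond the documented constructor arguments, fit_split_points, and split_points (for example the _split_points backing attribute) is an assumption.

```python
# Illustrative sketch only: mirrors the documented BaseBinning interface with a
# toy equal-width kernel; not secretflow's actual implementation.
from abc import ABC, abstractmethod
from typing import Dict, List

import pandas as pd


class SketchBaseBinning(ABC):
    def __init__(self, bin_names: List, bin_indexes: List, bin_num: int,
                 abnormal_list: List):
        self.bin_names = bin_names
        self.bin_indexes = bin_indexes
        self.bin_num = bin_num
        self.abnormal_list = abnormal_list
        self._split_points: Dict[str, List[float]] = {}

    @abstractmethod
    def fit_split_points(self, data: pd.DataFrame) -> pd.DataFrame:
        """Compute split points for the selected columns of `data`."""

    @property
    def split_points(self) -> Dict[str, List[float]]:
        return self._split_points


class EqualWidthBinning(SketchBaseBinning):
    """Toy kernel: equal-width interior boundaries for each selected column."""

    def fit_split_points(self, data: pd.DataFrame) -> pd.DataFrame:
        for name in self.bin_names:
            lo, hi = float(data[name].min()), float(data[name].max())
            step = (hi - lo) / self.bin_num
            # Interior boundaries only: bin_num buckets need bin_num - 1 split points.
            self._split_points[name] = [lo + step * i for i in range(1, self.bin_num)]
        return pd.DataFrame(self._split_points)
```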
secretflow.preprocessing.binning.kernels.quantile_binning#
Classes:

QuantileBinning([bin_num, compress_thres, ...]): use the QuantileSummary algorithm for equal-frequency binning
- class secretflow.preprocessing.binning.kernels.quantile_binning.QuantileBinning(bin_num: int = 10, compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, bin_indexes: List[int] = [], bin_names: List[str] = [], local_only: bool = False, abnormal_list: Optional[List[str]] = None, allow_duplicate: bool = False)[source]#
Bases: BaseBinning

Use the QuantileSummary algorithm for equal-frequency binning.
- bin_num#
the number of buckets
- compress_thres#
if the size of the summary is greater than compress_thres, a compress operation is performed
- cols_dict#
mapping of column name to index: {key: col_name, value: index}
- head_size#
buffer size
- error#
error tolerance, 0 <= error < 1, default: 1e-4. The split point x for a target quantile p satisfies floor((p - 2 * error) * N) <= rank(x) <= ceil((p + 2 * error) * N); see the worked example after this list.
- abnormal_list#
list of abnormal features.
- summary_dict#
a dict storing the summary of each feature
- col_name_maps#
a dict mapping column index to column name
- bin_idx_name#
a dict mapping bin index to bin name
- allow_duplicate#
whether duplicate split points are allowed
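As a worked example of the error guarantee above: with N = 10000 samples, error = 0.001 and target quantile p = 0.5, the returned split point x only has to satisfy floor((0.5 - 0.002) * 10000) = 4980 <= rank(x) <= ceil((0.5 + 0.002) * 10000) = 5020, i.e. it may sit up to roughly 2 * error * N = 20 ranks away from the exact median.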
Methods:
__init__([bin_num, compress_thres, ...])

fit_split_points(data_frame): calculate bin split points based on the QuantileSummary algorithm

feature_summary(data_frame, compress_thres, ...): calculate the quantile summary for each column
- __init__(bin_num: int = 10, compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, bin_indexes: List[int] = [], bin_names: List[str] = [], local_only: bool = False, abnormal_list: Optional[List[str]] = None, allow_duplicate: bool = False)[source]#
- fit_split_points(data_frame: DataFrame) → DataFrame[source]#
calculate bin split points based on the QuantileSummary algorithm
- Parameters:
data_frame – input data
- Returns:
bin result returned as a DataFrame
- Return type:
bin_result
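A minimal usage sketch, assuming data_frame is a plain pandas.DataFrame whose columns match bin_names (in a federated pipeline this kernel is normally driven by secretflow's higher-level binning wrappers rather than called directly, so treat this as a local illustration):

```python
import pandas as pd

from secretflow.preprocessing.binning.kernels.quantile_binning import QuantileBinning

# Toy local data; the column names must match bin_names.
df = pd.DataFrame({
    "age": [23, 45, 31, 62, 18, 40, 55, 29],
    "income": [3.2, 7.5, 5.1, 9.8, 2.4, 6.0, 8.1, 4.3],
})

binning = QuantileBinning(
    bin_num=4,                    # 4 equal-frequency buckets per column
    bin_names=["age", "income"],  # columns to bin
    error=1e-4,                   # rank error tolerance of the quantile summary
)

# Returns the bin split result as a DataFrame (see fit_split_points above).
split_points = binning.fit_split_points(df)
print(split_points)
```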
- static feature_summary(data_frame: DataFrame, compress_thres: int, head_size: int, error: float, bin_dict: Dict[str, int], abnormal_list: List[str]) → Dict[source]#
calculate the quantile summary for each column
- Parameters:
data_frame – pandas.DataFrame, input data
compress_thres – int, compress the summary when its size exceeds this threshold
head_size – int, buffer size; the summary is built once the buffer reaches head_size
error – float, error tolerance
bin_dict – a dict mapping column name to index
abnormal_list – list of abnormal features
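A hedged sketch of calling the static helper directly; the returned dict is assumed to map each column name to its quantile summary (see quantile_summaries below), and the concrete value types should be checked against the implementation:

```python
import pandas as pd

from secretflow.preprocessing.binning.kernels.quantile_binning import QuantileBinning

df = pd.DataFrame({"age": [23, 45, 31, 62], "income": [3.2, 7.5, 5.1, 9.8]})

# bin_dict maps column name to column index; abnormal values are excluded from binning.
summary_dict = QuantileBinning.feature_summary(
    data_frame=df,
    compress_thres=10000,
    head_size=10000,
    error=1e-4,
    bin_dict={"age": 0, "income": 1},
    abnormal_list=[],
)
```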
secretflow.preprocessing.binning.kernels.quantile_summaries#
Classes:

Stats(value, w, delta): store information for each item in the summary

QuantileSummaries([compress_thres, head_size, ...]): QuantileSummary
- class secretflow.preprocessing.binning.kernels.quantile_summaries.Stats(value: float, w: int, delta: int)[source]#
Bases: object

Store information for each item in the summary.
- value#
value of this stat
- Type:
float
- w#
weight of this stat
- Type:
int
- delta#
delta = rmax - rmin
- Type:
int
Attributes:
Methods:
__init__(value, w, delta)

- value: float#
- w: int#
- delta: int#
- __init__(value: float, w: int, delta: int) → None#
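A small illustration of the Stats fields (in practice these records are created internally by QuantileSummaries rather than by callers): a value seen with weight 2 whose rank is known up to a slack of delta = rmax - rmin = 1 would be stored as follows.

```python
from secretflow.preprocessing.binning.kernels.quantile_summaries import Stats

# delta is the rank uncertainty of this entry: delta = rmax - rmin.
stat = Stats(value=3.7, w=2, delta=1)
print(stat.value, stat.w, stat.delta)
```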
- class secretflow.preprocessing.binning.kernels.quantile_summaries.QuantileSummaries(compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, abnormal_list: Optional[List] = None)[source]#
Bases: object

QuantileSummary

insert: insert data into the summary
merge: merge summaries
fast_init: a fast initialization that creates the summary with little performance loss
compress: compress the summary to a limited size
- compress_thres#
if the number of stats is greater than compress_thres, a compress operation is performed
- head_size#
buffer size for inserted data; once the buffer reaches head_size the summary is created
- error#
error tolerance for binning, 0 <= error < 1, default: 1e-4. floor((p - 2 * error) * N) <= rank(x) <= ceil((p + 2 * error) * N)
- abnormal_list#
list of abnormal features that will not participate in binning
Methods:
__init__([compress_thres, head_size, error, ...])

fast_init(col_data)

compress(): compress the summary so that summary.sample stays under compress_thres

query(quantile): query the value at the specified quantile location

value_to_rank(value)

batch_query_value(values): batch query function
- __init__(compress_thres: int = 10000, head_size: int = 10000, error: float = 0.0001, abnormal_list: Optional[List] = None)[source]#
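A minimal end-to-end sketch of the summary itself, assuming fast_init accepts a 1-D numpy array of column values (the exact accepted input type and the return semantics of batch_query_value are assumptions to verify against the source):

```python
import numpy as np

from secretflow.preprocessing.binning.kernels.quantile_summaries import QuantileSummaries

rng = np.random.default_rng(0)
col_data = rng.normal(loc=0.0, scale=1.0, size=50_000)

summary = QuantileSummaries(compress_thres=10000, head_size=10000, error=1e-4)
summary.fast_init(col_data)  # build the summary from the column in one pass
summary.compress()           # keep the number of stored stats under compress_thres

# Approximate quartiles; each answer's rank is within about 2 * error * N of the target.
q25, q50, q75 = summary.query(0.25), summary.query(0.5), summary.query(0.75)

# Batch query over several values at once (see batch_query_value above).
batch_result = summary.batch_query_value(np.array([-1.0, 0.0, 1.0]))
```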