secretflow.stats#
Classes:
| SSVertPearsonR | Alias of PearsonR. |
| SSVertVIF | Alias of VIF. |
| RegressionEval | Statistics Evaluation for a regression model on a dataset. |
| BiClassificationEval | Statistics Evaluation for a bi-classification model on a dataset. |
| ScoreCard | The component provides a mapping procedure from binary regression's probability value to an integer range score. |
Functions:
| pva_eval | Compute Prediction Vs Actual score. |
| table_statistics | Get table statistics for a pd.DataFrame or VDataFrame. |
| psi_eval | Calculate population stability index. |
- secretflow.stats.SSVertPearsonR#
Alias of PearsonR.
Methods:
__init__(device)
pearsonr(vdata[, standardize])
- secretflow.stats.SSVertVIF#
Alias of VIF.
Methods:
__init__(device)
vif(vdata[, standardize])
- class secretflow.stats.RegressionEval(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#
Bases: object
Statistics Evaluation for a regression model on a dataset.
- y_true#
FedNdarray. If y_true is from a single party, then each statistic is a PYUObject. If y_true is from multiple parties, then an SPU device is required and each statistic is an SPUObject.
- y_pred#
FedNdarray. y_true and y_pred must have the same device and partition shapes.
- r2_score#
Union[PYUObject, SPUObject]
- mean_abs_err#
Union[PYUObject, SPUObject]
- mean_abs_percent_err#
Union[PYUObject, SPUObject]
- sum_squared_errors#
Union[PYUObject, SPUObject]
- mean_squared_errors#
Union[PYUObject, SPUObject]
- root_mean_squared_errors#
Union[PYUObject, SPUObject]
- y_true_mean#
Union[PYUObject, SPUObject]
- y_pred_mean#
Union[PYUObject, SPUObject]
- residual_hist#
Union[PYUObject, SPUObject]
Methods:
__init__(y_true, y_pred[, spu_device, bins])
- __init__(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#
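The attribute names above correspond to standard regression metrics. The following is a minimal plaintext sketch of those metrics, not the SecretFlow implementation: the function name regression_stats is hypothetical, the exact definitions SecretFlow uses (for example whether mean_abs_percent_err is a fraction or a percentage) are assumptions, and residual_hist is omitted.

```python
import math


def regression_stats(y_true, y_pred):
    """Plaintext reference for the statistics named above (hypothetical helper)."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    y_true_mean = sum(y_true) / n
    y_pred_mean = sum(y_pred) / n
    # Sum of squared errors and total sum of squares around the true mean.
    sse = sum(r * r for r in residuals)
    sst = sum((t - y_true_mean) ** 2 for t in y_true)
    mse = sse / n
    return {
        "r2_score": 1.0 - sse / sst,
        "mean_abs_err": sum(abs(r) for r in residuals) / n,
        # Assumption: expressed as a fraction; requires non-zero y_true values.
        "mean_abs_percent_err": sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
        "sum_squared_errors": sse,
        "mean_squared_errors": mse,
        "root_mean_squared_errors": math.sqrt(mse),
        "y_true_mean": y_true_mean,
        "y_pred_mean": y_pred_mean,
    }


stats = regression_stats([1.0, 2.0, 4.0], [1.0, 3.0, 4.0])
print(stats["r2_score"], stats["mean_squared_errors"])
```

In the class itself these values are held as PYUObject or SPUObject, so they would need to be revealed before they can be compared with a plaintext reference like this one.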
- class secretflow.stats.BiClassificationEval(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#
Bases: object
Statistics Evaluation for a bi-classification model on a dataset.
- Attributes:
- y_true: Union[FedNdarray, VDataFrame]
input of labels
- y_score: Union[FedNdarray, VDataFrame]
input of prediction scores
- bucket_size: int
input of number of bins in report
Methods:
__init__(y_true, y_score, bucket_size)
get_all_reports(): Get all reports.
- __init__(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#
- get_all_reports() → PYUObject [source]#
Get all reports. The reports contain:
- summary_report: SummaryReport
- group_reports: List[GroupReport]
- eq_frequent_bin_report: List[EqBinReport]
- eq_range_bin_report: List[EqBinReport]
- head_report: List[PrReport], reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2
See core.biclassification_eval_core for more details.
- secretflow.stats.pva_eval(actual: Union[FedNdarray, VDataFrame], prediction: Union[FedNdarray, VDataFrame], target) → PYUObject [source]#
Compute Prediction Vs Actual score.
- Parameters:
actual – Union[FedNdarray, VDataFrame]
prediction – Union[FedNdarray, VDataFrame]
target – numeric. The target label in actual entries to consider.
- Returns:
- result: PYUObject
The underlying value is a float equal to abs(mean(prediction) - sum(actual == target)/count(actual)).
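The formula above can be checked against a small plaintext sketch. This is only an illustration on Python lists: pva_plain is a hypothetical name, and the real function operates on FedNdarray or VDataFrame inputs and returns a PYUObject.

```python
def pva_plain(actual, prediction, target):
    """abs(mean(prediction) - sum(actual == target) / count(actual)) on plain lists."""
    mean_prediction = sum(prediction) / len(prediction)
    # Fraction of actual entries equal to the target label.
    target_rate = sum(1 for a in actual if a == target) / len(actual)
    return abs(mean_prediction - target_rate)


# Mean prediction is 0.6 and 3 of 4 labels equal the target, so the score is about 0.15.
score = pva_plain([1, 0, 1, 1], [0.9, 0.2, 0.7, 0.6], target=1)
print(score)
```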
- secretflow.stats.table_statistics(table: Union[DataFrame, VDataFrame]) → DataFrame [source]#
Get table statistics for a pd.DataFrame or VDataFrame.
- Parameters:
table – Union[pd.DataFrame, VDataFrame]
- Returns:
- table_statistics: pd.DataFrame
A table including each column’s datatype, total_count, count, count_na, min, max, var, std, sem, skewness, kurtosis, q1, q2, q3, moment_2, moment_3, moment_4, central_moment_2, central_moment_3, central_moment_4, sum, sum_2, sum_3 and sum_4.
moment_2 means E[X^2].
central_moment_2 means E[(X - mean(X))^2].
sum_2 means sum(X^2).
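The three families of moment columns differ only in normalization and centering. A minimal plaintext sketch of the definitions given above, for one column and a general order k (the helper name column_moments is hypothetical and is not part of SecretFlow):

```python
def column_moments(xs, k=2):
    """Return sum_k, moment_k and central_moment_k for one column of values."""
    n = len(xs)
    mean = sum(xs) / n
    power_sum = sum(x ** k for x in xs)
    return {
        f"sum_{k}": power_sum,                                     # sum(X^k)
        f"moment_{k}": power_sum / n,                              # E[X^k]
        f"central_moment_{k}": sum((x - mean) ** k for x in xs) / n,  # E[(X - mean(X))^k]
    }


m = column_moments([1.0, 2.0, 3.0, 4.0])
print(m)
```

For k = 2, central_moment_2 is the population variance (divisor n), which may differ from the var column if that column uses the sample divisor n - 1.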
- secretflow.stats.psi_eval(X: Union[FedNdarray, VDataFrame], Y: Union[FedNdarray, VDataFrame], split_points) → PYUObject [source]#
Calculate population stability index.
- Parameters:
X – Union[FedNdarray, VDataFrame] a collection of samples
Y – Union[FedNdarray, VDataFrame] a collection of samples
split_points – array. An ordered sequence of split points.
- Returns:
- result: float
The population stability index.
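The population stability index is conventionally defined as the sum over bins of (p_i - q_i) * ln(p_i / q_i), where p_i and q_i are the fractions of X and Y falling in bin i. A minimal plaintext sketch follows. Several details are assumptions rather than documented SecretFlow behavior: the names psi_plain and bin_fractions are hypothetical, split_points is treated as interior cut points, and empty bins are floored at a small eps to keep the logarithm finite.

```python
import math
from bisect import bisect_right


def bin_fractions(samples, split_points):
    """Fraction of samples in each of the len(split_points) + 1 bins."""
    counts = [0] * (len(split_points) + 1)
    for s in samples:
        counts[bisect_right(split_points, s)] += 1
    return [c / len(samples) for c in counts]


def psi_plain(x, y, split_points, eps=1e-6):
    """Population stability index between two sample collections."""
    p = bin_fractions(x, split_points)
    q = bin_fractions(y, split_points)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = max(pi, eps), max(qi, eps)  # assumption: floor empty bins
        total += (pi - qi) * math.log(pi / qi)
    return total


psi = psi_plain([1, 2, 3, 4], [1, 1, 2, 4], split_points=[2.5])
print(psi)
```

Identical distributions give a PSI of 0, and the value grows as the two binned distributions diverge.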
- class secretflow.stats.ScoreCard(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#
Bases: object
The component provides a mapping procedure from binary regression’s probability value to an integer range score.
- The mapping process is as follows:
odds = pred / (1 - pred)
score = offset + factor * log(odds)
- The offset and factor in the formula come from the user’s settings. Usually users do not give offset and factor directly, but give three constraint parameters:
scaled_value: a score baseline
odd_base: the odds value at the given score baseline
pdo: how many points are needed to double the odds
- The offset and factor can be solved from these three constraint parameters:
factor = pdo / log(2)
offset = scaled_value - (factor * log(odd_base))
- odd_base / scaled_value / pdo
see above
- max_score#
upper limit for score
- min_score#
lower limit for score
- bad_label_value#
which label represents the negative sample
Methods:
__init__(odd_base, scaled_value, pdo[, ...])
transform(pred): Map predicted probabilities to scores.
- __init__(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#
- transform(pred: Union[FedNdarray, VDataFrame, HDataFrame]) → FedNdarray [source]#
Map predicted probabilities to scores.
- Parameters:
pred – Union[FedNdarray, VDataFrame, HDataFrame] predicted probability from binary regression
- Returns:
mapped scores.
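The ScoreCard formulas (odds, factor, offset, score) can be traced with a small plaintext sketch. It is illustrative only: score_from_probability is a hypothetical function, clamping to [min_score, max_score] is an assumption based on the attribute descriptions, and the handling of bad_label_value and any rounding to integers are left out because the documentation does not specify them.

```python
import math


def score_from_probability(pred, odd_base, scaled_value, pdo,
                           max_score=1000, min_score=0):
    """Map a probability in (0, 1) to a score using the ScoreCard formulas."""
    factor = pdo / math.log(2)
    offset = scaled_value - factor * math.log(odd_base)
    odds = pred / (1 - pred)
    score = offset + factor * math.log(odds)
    # Assumption: scores outside the limits are clamped.
    return min(max(score, min_score), max_score)


# With odd_base=1, odds of 1 (pred=0.5) land exactly on the baseline score.
base = score_from_probability(0.5, odd_base=1.0, scaled_value=600.0, pdo=20.0)
# Doubling the odds (pred=2/3, odds=2) adds pdo points.
doubled = score_from_probability(2 / 3, odd_base=1.0, scaled_value=600.0, pdo=20.0)
print(base, doubled)
```

This shows why the three constraint parameters suffice: scaled_value and odd_base pin one point of the line, and pdo fixes its slope in units of log2(odds).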
secretflow.stats.biclassification_eval#
Classes:
| BiClassificationEval | Statistics Evaluation for a bi-classification model on a dataset. |
- class secretflow.stats.biclassification_eval.BiClassificationEval(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#
Bases: object
Statistics Evaluation for a bi-classification model on a dataset.
- Attributes:
- y_true: Union[FedNdarray, VDataFrame]
input of labels
- y_score: Union[FedNdarray, VDataFrame]
input of prediction scores
- bucket_size: int
input of number of bins in report
Methods:
__init__(y_true, y_score, bucket_size)
get_all_reports(): Get all reports.
- __init__(y_true: Union[FedNdarray, VDataFrame], y_score: Union[FedNdarray, VDataFrame], bucket_size: int)[source]#
- get_all_reports() → PYUObject [source]#
Get all reports. The reports contain:
- summary_report: SummaryReport
- group_reports: List[GroupReport]
- eq_frequent_bin_report: List[EqBinReport]
- eq_range_bin_report: List[EqBinReport]
- head_report: List[PrReport], reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2
See core.biclassification_eval_core for more details.
secretflow.stats.psi_eval#
Functions:
| psi_eval | Calculate population stability index. |
- secretflow.stats.psi_eval.psi_eval(X: Union[FedNdarray, VDataFrame], Y: Union[FedNdarray, VDataFrame], split_points) → PYUObject [source]#
Calculate population stability index.
- Parameters:
X – Union[FedNdarray, VDataFrame] a collection of samples
Y – Union[FedNdarray, VDataFrame] a collection of samples
split_points – array. An ordered sequence of split points.
- Returns:
- result: float
The population stability index.
secretflow.stats.pva_eval#
Functions:
| pva_eval | Compute Prediction Vs Actual score. |
- secretflow.stats.pva_eval.pva_eval(actual: Union[FedNdarray, VDataFrame], prediction: Union[FedNdarray, VDataFrame], target) → PYUObject [source]#
Compute Prediction Vs Actual score.
- Parameters:
actual – Union[FedNdarray, VDataFrame]
prediction – Union[FedNdarray, VDataFrame]
target – numeric. The target label in actual entries to consider.
- Returns:
- result: PYUObject
The underlying value is a float equal to abs(mean(prediction) - sum(actual == target)/count(actual)).
secretflow.stats.regression_eval#
Classes:
| RegressionEval | Statistics Evaluation for a regression model on a dataset. |
- class secretflow.stats.regression_eval.RegressionEval(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#
Bases: object
Statistics Evaluation for a regression model on a dataset.
- y_true#
FedNdarray. If y_true is from a single party, then each statistic is a PYUObject. If y_true is from multiple parties, then an SPU device is required and each statistic is an SPUObject.
- y_pred#
FedNdarray. y_true and y_pred must have the same device and partition shapes.
- r2_score#
Union[PYUObject, SPUObject]
- mean_abs_err#
Union[PYUObject, SPUObject]
- mean_abs_percent_err#
Union[PYUObject, SPUObject]
- sum_squared_errors#
Union[PYUObject, SPUObject]
- mean_squared_errors#
Union[PYUObject, SPUObject]
- root_mean_squared_errors#
Union[PYUObject, SPUObject]
- y_true_mean#
Union[PYUObject, SPUObject]
- y_pred_mean#
Union[PYUObject, SPUObject]
- residual_hist#
Union[PYUObject, SPUObject]
Methods:
__init__(y_true, y_pred[, spu_device, bins])
- __init__(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None, bins=10)[source]#
secretflow.stats.score_card#
Classes:
| ScoreCard | The component provides a mapping procedure from binary regression's probability value to an integer range score. |
- class secretflow.stats.score_card.ScoreCard(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#
Bases: object
The component provides a mapping procedure from binary regression’s probability value to an integer range score.
- The mapping process is as follows:
odds = pred / (1 - pred)
score = offset + factor * log(odds)
- The offset and factor in the formula come from the user’s settings. Usually users do not give offset and factor directly, but give three constraint parameters:
scaled_value: a score baseline
odd_base: the odds value at the given score baseline
pdo: how many points are needed to double the odds
- The offset and factor can be solved from these three constraint parameters:
factor = pdo / log(2)
offset = scaled_value - (factor * log(odd_base))
- odd_base / scaled_value / pdo
see above
- max_score#
upper limit for score
- min_score#
lower limit for score
- bad_label_value#
which label represents the negative sample
Methods:
__init__(odd_base, scaled_value, pdo[, ...])
transform(pred): Map predicted probabilities to scores.
- __init__(odd_base: float, scaled_value: float, pdo: float, max_score: int = 1000, min_score: int = 0, bad_label_value: int = 0)[source]#
- transform(pred: Union[FedNdarray, VDataFrame, HDataFrame]) → FedNdarray [source]#
Map predicted probabilities to scores.
- Parameters:
pred – Union[FedNdarray, VDataFrame, HDataFrame] predicted probability from binary regression
- Returns:
mapped scores.
secretflow.stats.ss_pearsonr_v#
Classes:
| PearsonR | Calculate pearson product-moment correlation coefficient for vertical slice dataset by using secret sharing. |
- class secretflow.stats.ss_pearsonr_v.PearsonR(device: SPU)[source]#
Bases: object
Calculate the Pearson product-moment correlation coefficient for a vertical slice dataset by using secret sharing.
More detail on Pearson's r: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.
- device#
SPU Device
Methods:
__init__(device)
pearsonr(vdata[, standardize])
- pearsonr(vdata: VDataFrame, standardize: bool = True)[source]#
- vdata#
VDataFrame. Vertical slice dataset.
- standardize#
bool. Whether to standardize the dataset. The dataset must be standardized, so keep standardize=True unless the dataset is already standardized. Standardization serves two purposes:
- it reduces the magnitude of the entries of the matrix XTX, avoiding overflow in secret sharing;
- after standardization, the variance is 1 and the mean is 0, which simplifies the calculation.
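To illustrate why standardization simplifies the calculation: once every column has mean 0 and variance 1, the correlation matrix reduces to a scaled product of the data matrix with its transpose. The following plaintext sketch works on plain Python lists and is not the secret-sharing implementation; the function names are hypothetical and the use of the sample standard deviation (divisor n - 1) is an assumption.

```python
import math


def standardize(columns):
    """Rescale each column to mean 0 and sample variance 1."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        out.append([(v - mean) / std for v in col])
    return out


def pearson_matrix(columns):
    """Correlation matrix as (X^T X) / (n - 1) on standardized columns."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


r = pearson_matrix([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]])
print(r[0][1], r[0][2])
```

Because standardized entries are of order 1, the entries of XTX stay bounded by roughly the number of samples, which is what helps avoid overflow in a fixed-point representation.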
secretflow.stats.ss_pvalue_v#
Classes:
| PVlaue | Calculate P-Value for LR model training on vertical slice dataset by using secret sharing. |
- class secretflow.stats.ss_pvalue_v.PVlaue(spu: SPU)[source]#
Bases: object
Calculate P-Values for LR model training on a vertical slice dataset by using secret sharing.
More detail on P-Values: https://www.w3schools.com/datascience/ds_linear_regression_pvalue.asp
For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.
- device#
SPU Device
Methods:
__init__(spu)
pvalues(x, y, model)
- pvalues(x: VDataFrame, y: VDataFrame, model: Any) → ndarray [source]#
secretflow.stats.ss_vif_v#
Classes:
| VIF | Calculate variance inflation factor for vertical slice dataset by using secret sharing. |
- class secretflow.stats.ss_vif_v.VIF(device: SPU)[source]#
Bases: object
Calculate the variance inflation factor for a vertical slice dataset by using secret sharing.
See https://en.wikipedia.org/wiki/Variance_inflation_factor
For large datasets (more than 100,000 samples and 200 features), the [Ring size: 128, Fxp: 40] options are recommended for the SPU device.
NOTICE: The analytical solution of matrix inversion in secret sharing is very expensive, so this method uses Newton iteration to find an approximate solution.
When there is multicollinearity in the input dataset, the XTX matrix is not full rank, and the analytical solution for the inverse of the XTX matrix does not exist.
For such linearly correlated columns, statsmodels reports a VIF of INF, indicating that the correlation is infinite. This method instead returns a large VIF value (>> 1000) for these columns, which still correctly reflects their strong correlation.
When there are constant columns in the data, statsmodels reports a VIF of NAN, while this method again returns a large VIF value (>> 1000), which means these columns need to be removed before training.
Therefore, although the results of this method cannot be completely consistent with those that statsmodels calculates in plaintext, they still correctly reflect the correlation of the input data columns.
- device#
SPU Device
Methods:
__init__(device)
vif(vdata[, standardize])
- vif(vdata: VDataFrame, standardize: bool = True)[source]#
- vdata#
VDataFrame. Vertical slice dataset.
- standardize#
bool. Whether to standardize the dataset. The dataset must be standardized, so keep standardize=True unless the dataset is already standardized. Standardization serves two purposes:
- it reduces the magnitude of the entries of the matrix XTX, avoiding overflow in secret sharing;
- after standardization, the variance is 1 and the mean is 0, which simplifies the calculation.
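The idea of replacing an analytical inverse with a Newton iteration can be sketched in plaintext. For standardized data, the VIF of column i is the i-th diagonal entry of the inverse of the correlation matrix. The sketch below uses the Newton-Schulz update X <- X(2I - AX); whether SecretFlow uses this exact scheme, starting point, or iteration count is not stated in the documentation, so all of those are assumptions, and the function names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def newton_inverse(a, iterations=30):
    """Approximate the inverse of a square matrix with Newton-Schulz iteration."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    # Starting guess A^T / (||A||_1 * ||A||_inf), a standard convergent choice.
    x = [[a[j][i] / (norm_1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        correction = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x


def vif_from_correlation(corr):
    """VIF per column: diagonal of the (approximate) inverse correlation matrix."""
    inv = newton_inverse(corr)
    return [inv[i][i] for i in range(len(corr))]


# Two columns with correlation 0.5: VIF = 1 / (1 - 0.5**2) = 4/3 for both.
vif = vif_from_correlation([[1.0, 0.5], [0.5, 1.0]])
print(vif)
```

With perfectly collinear columns the correlation matrix is singular, so an iteration like this does not converge to a true inverse and the diagonal entries keep growing with the number of iterations; this is in line with the large-but-finite VIF values described in the NOTICE.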
secretflow.stats.table_statistics#
Functions:
| table_statistics | Get table statistics for a pd.DataFrame or VDataFrame. |
- secretflow.stats.table_statistics.table_statistics(table: Union[DataFrame, VDataFrame]) → DataFrame [source]#
Get table statistics for a pd.DataFrame or VDataFrame.
- Parameters:
table – Union[pd.DataFrame, VDataFrame]
- Returns:
- table_statistics: pd.DataFrame
A table including each column’s datatype, total_count, count, count_na, min, max, var, std, sem, skewness, kurtosis, q1, q2, q3, moment_2, moment_3, moment_4, central_moment_2, central_moment_3, central_moment_4, sum, sum_2, sum_3 and sum_4.
moment_2 means E[X^2].
central_moment_2 means E[(X - mean(X))^2].
sum_2 means sum(X^2).