secretflow.stats.core#

secretflow.stats.core.biclassification_eval_core#

Classes:

`Report`(eq_frequent_result_arr_list, ...)	Report containing all other reports for bi-classification evaluation
`PrReport`(arr)	Precision Related statistics Report.
`SummaryReport`(arr)	Summary Report for bi-classification evaluation.
`GroupReport`()	Report for each group
`EqBinReport`(arr)	Statistics Report for each bin.

Functions:

`gen_all_reports`(y_true, y_score, bin_size)	Generate all reports.
`create_sorted_label_score_pair`(y_true, y_score)	Produce an n * 2 array whose second column holds the scores sorted in decreasing order.
`eq_frequent_bin_evaluate`(sorted_pairs, ...)	Fill the equal-frequency bin report.
`eq_range_bin_evaluate`(sorted_pairs, ...)	Fill the equal-range bin report.
`evaluate_bins`(sorted_pairs, pos_count, ...)	Evaluate bins given sorted pairs, pos_count, and split_points (in decreasing order).
`bin_evaluate`(sorted_pairs, start_pos, ...)	Evaluate statistics for a bin.
`gen_pr_reports`(sorted_pairs, thresholds)	Generate pr report per specified threshold.
`precision_recall_false_positive_rate`(...)
`confusion_matrix_from_cum_counts`(...)	Compute the confusion matrix.
`binary_clf_curve`(sorted_pairs)	Calculate true and false positives per binary classification threshold (can be used for roc curve or precision/recall curve).
`roc_curve`(sorted_pairs)	Compute Receiver operating characteristic (ROC).
`auc`(x, y)	Compute Area Under the Curve (AUC) using the trapezoidal rule.
`binary_roc_auc`(sorted_pairs)	Compute Area Under the Curve (AUC) for ROC from labels and prediction scores in sorted_pairs.
`compute_f1_score`(true_positive, ...)	Calculate the F1 score.

class secretflow.stats.core.biclassification_eval_core.Report(eq_frequent_result_arr_list, eq_range_result_arr_list, summary_report_arr, head_prs)[source]#

Bases: object

Report containing all other reports for bi-classification evaluation

summary_report#: SummaryReport

group_reports#: List[GroupReport]

eq_frequent_bin_report#: List[EqBinReport]

eq_range_bin_report#: List[EqBinReport]

head_report#: List[PrReport] reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2

Methods:

__init__(eq_frequent_result_arr_list, ...)

__init__(eq_frequent_result_arr_list, eq_range_result_arr_list, summary_report_arr, head_prs)[source]#

class secretflow.stats.core.biclassification_eval_core.PrReport(arr)[source]#

Bases: object

Precision Related statistics Report.

fpr#: float FP/(FP+TN)

precision#: float TP/(TP+FP)

recall#: float TP/(TP+FN)

Methods:

__init__(arr)

__init__(arr)[source]#

class secretflow.stats.core.biclassification_eval_core.SummaryReport(arr)[source]#

Bases: object

Summary Report for bi-classification evaluation.

total_samples#: int

positive_samples#: int

negative_samples#: int

auc#: float Area under the ROC curve: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

ks#: float Kolmogorov-Smirnov statistic: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

f1_score#: float harmonic mean of precision and recall: https://en.wikipedia.org/wiki/F-score

Methods:

__init__(arr)

__init__(arr)[source]#

class secretflow.stats.core.biclassification_eval_core.GroupReport[source]#

Bases: object
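The ks field can be illustrated with a small pure-Python sketch. It assumes the common definition of the KS statistic for a binary classifier: the largest gap between the cumulative true positive rate and the cumulative false positive rate across thresholds. Whether SummaryReport computes exactly this is an assumption, and the helper name `ks_statistic` is hypothetical, not part of secretflow.

```python
def ks_statistic(tps, fps):
    """Max |TPR - FPR| over thresholds, given cumulative TP/FP counts.

    Assumes the last entries of tps and fps are the total positive and
    negative counts, and that both totals are non-zero.
    """
    total_pos, total_neg = tps[-1], fps[-1]
    return max(abs(tp / total_pos - fp / total_neg) for tp, fp in zip(tps, fps))


# Cumulative counts at four decreasing score thresholds (illustrative data).
tps = [1, 2, 3, 3]
fps = [0, 1, 1, 2]
ks = ks_statistic(tps, fps)
```

In this toy data the gap peaks at the third threshold, where all three positives but only one of the two negatives have been captured.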

Report for each group

Attributes:

`group_name`
`summary`

group_name: str#

summary: SummaryReport#

class secretflow.stats.core.biclassification_eval_core.EqBinReport(arr)[source]#

Bases: object

Statistics Report for each bin.

start_value#: float

end_value#: float

positive#: int

negative#: int

total#: int

precision#: float

recall#: float

false_positive_rate#: float

f1_score#: float

lift#: float see https://en.wikipedia.org/wiki/Lift_(data_mining)

predicted_positive_ratio#: float predicted positive samples / total samples.

predicted_negative_ratio#: float predicted negative samples / total samples.

cumulative_percent_of_positive#: float

cumulative_percent_of_negative#: float

total_cumulative_percent#: float

ks#: float

avg_score#: float

Methods:

__init__(arr)

__init__(arr)[source]#

secretflow.stats.core.biclassification_eval_core.gen_all_reports(y_true: Union[DataFrame, array], y_score: Union[DataFrame, array], bin_size: int)[source]#

Generate all reports.

Parameters:

y_true – Union[pd.DataFrame, jnp.array] Should be of shape n * 1 with binary entries; 1 means a positive sample.
y_score – Union[pd.DataFrame, jnp.array] Should be of shape n * 1 with each entry in [0, 1], the predicted probability of being positive.
bin_size – int Number of bins to evaluate.

Returns:

secretflow.stats.core.biclassification_eval_core.create_sorted_label_score_pair(y_true: array, y_score: array)[source]#: Produce an n * 2 array whose second column holds the scores sorted in decreasing order.
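A minimal pure-Python sketch of the pairing and sorting step follows. It assumes the first column holds the labels (suggested by the function name but not stated above) and uses plain lists instead of jnp.array; `sort_pairs_by_score_desc` is a hypothetical name, not the library function.

```python
def sort_pairs_by_score_desc(y_true, y_score):
    """Pair each label with its score and sort rows by score, highest first."""
    pairs = [[float(label), float(score)] for label, score in zip(y_true, y_score)]
    # Sort on the second column (the score) in decreasing order.
    pairs.sort(key=lambda row: row[1], reverse=True)
    return pairs


sorted_pairs = sort_pairs_by_score_desc([0, 1, 1, 0], [0.2, 0.9, 0.4, 0.7])
```

Sorting once up front lets the later evaluation functions derive cumulative counts in a single pass over the rows.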

secretflow.stats.core.biclassification_eval_core.eq_frequent_bin_evaluate(sorted_pairs: array, pos_count: int, bin_size: int) → List[array][source]#

Fill the equal-frequency bin report.

Parameters:

sorted_pairs – jnp.array Should be of shape n * 2 with the second column sorted.
pos_count – int Total number of positive samples.
bin_size – int Total number of bins.

Returns:

bin_reports: List[jnp.array]

secretflow.stats.core.biclassification_eval_core.eq_range_bin_evaluate(sorted_pairs: array, pos_count: int, bin_size: int) → List[array][source]#

Fill the equal-range bin report.

Parameters:

sorted_pairs – jnp.array Should be of shape n * 2 with the second column sorted.
pos_count – int Total number of positive samples.
bin_size – int Total number of bins.

Returns:

bin_reports: List[jnp.array]

secretflow.stats.core.biclassification_eval_core.evaluate_bins(sorted_pairs: array, pos_count: int, split_points) → List[array][source]#: Evaluate bins given sorted pairs, pos_count, and split_points (in decreasing order).

secretflow.stats.core.biclassification_eval_core.bin_evaluate(sorted_pairs, start_pos, end_pos, total_pos_count, total_neg_count, cumulative_pos_count, cumulative_neg_count) → Tuple[array, int, int][source]#

Evaluate statistics for a bin.

Returns:

bin_report_arr: jnp.array: an array of size BIN_REPORT_STATISTICS_ENTRY_COUNT

cumulative_pos_count: int

cumulative_neg_count: int

secretflow.stats.core.biclassification_eval_core.gen_pr_reports(sorted_pairs: array, thresholds: array) → List[array][source]#

Generate a PR report for each specified threshold.

Parameters:

sorted_pairs – jnp.array y_true y_score pairs sorted by y_score in increasing order, with shape n_samples * 2.
thresholds – 1d jnp.ndarray Prediction thresholds at which to evaluate.

Returns:

pr_report_arr: List[jnp.array]: a list of PR reports, each a jnp.array of shape 3 * 1; the list length equals len(thresholds).

secretflow.stats.core.biclassification_eval_core.precision_recall_false_positive_rate(true_positive, false_positive, false_negative, true_negative) → Tuple[float, float, float][source]#
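The three ratios named in this function are the ones defined on PrReport: precision = TP/(TP+FP), recall = TP/(TP+FN), fpr = FP/(FP+TN). The sketch below computes them from confusion-matrix counts in plain Python. The helper name `pr_metrics`, the return order, and the choice to return 0.0 for a zero denominator are all assumptions made for illustration, not the library's documented behavior.

```python
from typing import Tuple


def pr_metrics(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, fpr) from confusion-matrix counts.

    A zero denominator yields 0.0 here; that convention is an assumption.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


precision, recall, fpr = pr_metrics(tp=8, fp=2, fn=4, tn=86)
```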

secretflow.stats.core.biclassification_eval_core.confusion_matrix_from_cum_counts(cumulative_pos_count, cumulative_neg_count, total_neg_count, total_pos_count)[source]#

Compute the confusion matrix.

Parameters:

cumulative_pos_count – int
cumulative_neg_count – int
total_neg_count – int
total_pos_count – int

Returns:

true_positive: int

true_negative: int

false_positive: int

false_negative: int
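One plausible reading of the cumulative counts, sketched below, is that with pairs sorted by decreasing score, every sample seen so far is treated as predicted positive. Under that assumption the cumulative positives are the true positives and the cumulative negatives are the false positives. The helper name `confusion_from_cumulative` is hypothetical, and the mapping is an interpretation rather than a confirmed description of the library code.

```python
def confusion_from_cumulative(cum_pos, cum_neg, total_neg, total_pos):
    """Derive (tp, tn, fp, fn) assuming all samples seen so far are predicted positive."""
    tp = cum_pos              # positives already above the threshold
    fp = cum_neg              # negatives already above the threshold
    fn = total_pos - cum_pos  # positives still below the threshold
    tn = total_neg - cum_neg  # negatives still below the threshold
    return tp, tn, fp, fn


tp, tn, fp, fn = confusion_from_cumulative(cum_pos=3, cum_neg=1, total_neg=10, total_pos=5)
```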

secretflow.stats.core.biclassification_eval_core.binary_clf_curve(sorted_pairs: array) → Tuple[array, array, array][source]#

Calculate true and false positives per binary classification threshold (can be used for the ROC curve or the precision/recall curve).

Parameters:

sorted_pairs – jnp.array y_true y_score pairs sorted by y_score in decreasing order.

Returns:

fps: 1d ndarray: False positive counts; index i records the number of negative samples assigned a score >= thresholds[i]. The total number of negative samples equals fps[-1] (thus true negatives are given by fps[-1] - fps).
tps: 1d ndarray: True positive counts; index i records the number of positive samples assigned a score >= thresholds[i]. The total number of positive samples equals tps[-1] (thus false negatives are given by tps[-1] - tps).
thresholds: 1d ndarray: Distinct predicted scores sorted in decreasing order.

References

Github: scikit-learn _binary_clf_curve.
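The return values described above can be reproduced on a toy input with a single pass over the sorted pairs. This is a plain-Python sketch of the idea, not the library's vectorized jnp implementation; `clf_curve` is a hypothetical name, and labels are assumed to be exactly 0 or 1.

```python
def clf_curve(sorted_pairs):
    """Cumulative FP/TP counts at each distinct score, scores in decreasing order."""
    fps, tps, thresholds = [], [], []
    tp = fp = 0
    for i, (label, score) in enumerate(sorted_pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score,
        # so tied scores collapse into one threshold.
        is_last = i + 1 == len(sorted_pairs)
        if is_last or sorted_pairs[i + 1][1] != score:
            fps.append(fp)
            tps.append(tp)
            thresholds.append(score)
    return fps, tps, thresholds


pairs = [(1, 0.9), (0, 0.7), (1, 0.7), (1, 0.4), (0, 0.2)]
fps, tps, thresholds = clf_curve(pairs)
```

Note that fps[-1] and tps[-1] recover the total negative and positive counts, as stated in the return description.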

secretflow.stats.core.biclassification_eval_core.roc_curve(sorted_pairs: array) → Tuple[array, array, array][source]#

Compute the receiver operating characteristic (ROC).

Compared to the sklearn implementation, this implementation omits most conditional branches and checks for ill-conditioned input.

Parameters:

sorted_pairs – jnp.array y_true y_score pairs sorted by y_score in decreasing order.

Returns:

fpr: ndarray of shape (>2,): Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i].
tpr: ndarray of shape (>2,): Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].
thresholds: ndarray of shape (n_thresholds,): Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1.

References

Github: scikit-learn roc_curve.
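As a rough illustration of how the outputs relate to cumulative counts, the sketch below turns FP/TP counts into rates and prepends the "nothing predicted positive" point with threshold max(y_score) + 1, as described above. It is plain Python rather than jnp, `roc_points` is a hypothetical helper, and it assumes both classes are present so the totals are non-zero.

```python
def roc_points(fps, tps, thresholds):
    """Convert cumulative FP/TP counts into (fpr, tpr, thresholds) with a leading origin point."""
    total_neg, total_pos = fps[-1], tps[-1]
    fpr = [0.0] + [fp / total_neg for fp in fps]
    tpr = [0.0] + [tp / total_pos for tp in tps]
    # thresholds[0] is the largest score, so this is max(y_score) + 1.
    roc_thresholds = [thresholds[0] + 1] + list(thresholds)
    return fpr, tpr, roc_thresholds


fpr, tpr, roc_thresholds = roc_points(
    fps=[0, 1, 1, 2], tps=[1, 2, 3, 3], thresholds=[0.9, 0.7, 0.4, 0.2]
)
```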

secretflow.stats.core.biclassification_eval_core.auc(x, y)[source]#

Compute Area Under the Curve (AUC) using the trapezoidal rule.

Parameters:

x – ndarray of shape (n,) Monotonic X coordinates.
y – ndarray of shape (n,) Y coordinates.

Returns:

auc: float: Area Under the Curve
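The trapezoidal rule sums, for each pair of adjacent points, the width of the interval times the average of the two heights. A minimal plain-Python sketch follows; `trapezoid_area` is a hypothetical name, and x is assumed to be non-decreasing.

```python
def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through (x[i], y[i])."""
    area = 0.0
    for i in range(1, len(x)):
        width = x[i] - x[i - 1]
        mean_height = (y[i] + y[i - 1]) / 2.0
        area += width * mean_height
    return area


# A three-point ROC-like curve: rises to 1.0 at x = 0.5, then stays flat.
area = trapezoid_area([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])
```

Here the first segment contributes 0.5 * 0.5 and the second 0.5 * 1.0, for a total of 0.75.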

secretflow.stats.core.biclassification_eval_core.binary_roc_auc(sorted_pairs: array) → float[source]#

Compute Area Under the Curve (AUC) for ROC from the labels and prediction scores in sorted_pairs.

Compared to the sklearn implementation, this implementation is pared down: it offers fewer options and omits most conditional branches and checks for ill-conditioned input.

Parameters:

sorted_pairs – jnp.array y_true y_score pairs sorted by y_score in decreasing order, with shape n_samples * 2.

Returns:

roc_auc: float

References

Github: scikit-learn _binary_roc_auc_score.
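A useful cross-check for any ROC AUC value is its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive sample scores higher, counting ties as half. The sketch below computes that directly. It is a quadratic-time sanity check written for illustration, not the method this function uses; `pairwise_auc` is a hypothetical name.

```python
def pairwise_auc(sorted_pairs):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos_scores = [score for label, score in sorted_pairs if label == 1]
    neg_scores = [score for label, score in sorted_pairs if label != 1]
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


pairs = [(1, 0.9), (0, 0.7), (1, 0.7), (1, 0.4), (0, 0.2)]
roc_auc = pairwise_auc(pairs)
```

On this input four of the six pairs are ranked correctly and one is tied, giving 4.5 / 6 = 0.75, which matches the trapezoidal area under the corresponding ROC curve.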

secretflow.stats.core.biclassification_eval_core.compute_f1_score(true_positive: int, false_positive: int, false_negative: int) → float[source]#: Calculate the F1 score.
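The F1 score is the harmonic mean of precision and recall, which simplifies to 2TP / (2TP + FP + FN) when written in terms of counts. The sketch below uses that closed form. `f1_from_counts` is a hypothetical name, and returning 0.0 when the denominator is zero is an assumed convention.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined (assumed convention)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


f1 = f1_from_counts(tp=8, fp=2, fn=4)

# The same value via the harmonic mean of precision and recall.
precision, recall = 8 / (8 + 2), 8 / (8 + 4)
f1_harmonic = 2 * precision * recall / (precision + recall)
```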

secretflow.stats.core.psi_core#

Functions:

`psi_index`(a, b)	(a - b) * ln(a/b).
`psi_score`(A, B)	Computes the psi score.
`distribution_generation`(X, split_points)	Generate a distribution of X according to split points.
`psi`(X, Y, split_points)	Calculate population stability index.

secretflow.stats.core.psi_core.psi_index(a, b)[source]#

(a - b) * ln(a/b).

Parameters:

a – array or float
b – array or float. a and b must be of the same type: float, jnp.array, or np.array.

Returns:

result: array or float: same type as a and b.

secretflow.stats.core.psi_core.psi_score(A: array, B: array)[source]#

Compute the PSI score.

Parameters:

A – jnp.array Distribution of sample A.
B – jnp.array Distribution of sample B.

Returns:

result: float
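The per-bin term (a - b) * ln(a/b) and its aggregation can be sketched in plain Python as below. It assumes the PSI score is the sum of the per-bin terms, which is the usual definition of the population stability index, and that every bin proportion is strictly positive so the logarithm is defined. `psi_term` and `psi_total` are hypothetical names.

```python
import math


def psi_term(a: float, b: float) -> float:
    """Per-bin contribution (a - b) * ln(a / b); requires a > 0 and b > 0."""
    return (a - b) * math.log(a / b)


def psi_total(dist_a, dist_b) -> float:
    """Sum of per-bin contributions over two distributions with matching bins."""
    return sum(psi_term(a, b) for a, b in zip(dist_a, dist_b))


score = psi_total([0.5, 0.5], [0.25, 0.75])
```

Each term is non-negative because (a - b) and ln(a/b) always share a sign, so the total is zero only when the two distributions agree in every bin.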

secretflow.stats.core.psi_core.distribution_generation(X: array, split_points: array)[source]#

Generate a distribution of X according to split points.

Parameters:

X – jnp.array A collection of samples.
split_points – jnp.array An ordered sequence of split points.

Returns:

dist_X: jnp.array: the distribution, expressed as the percentage of counts in each bin. bin[0] is [split_points[0], split_points[1]).
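The binning described above can be sketched with the standard library's bisect module. The sketch assumes half-open bins [split_points[i], split_points[i+1]) throughout, returns fractions rather than percentages scaled to 100, and ignores samples that fall outside every bin; how the library treats the last edge and out-of-range values is not stated here, so those choices are assumptions. `bin_proportions` is a hypothetical name.

```python
from bisect import bisect_right


def bin_proportions(samples, split_points):
    """Fraction of samples in each half-open bin [split_points[i], split_points[i+1])."""
    counts = [0] * (len(split_points) - 1)
    for value in samples:
        # Index of the bin whose left edge is the largest split point <= value.
        idx = bisect_right(split_points, value) - 1
        if 0 <= idx < len(counts):
            counts[idx] += 1
    total = len(samples)
    return [count / total for count in counts]


dist = bin_proportions([0.1, 0.2, 0.6, 0.9], [0.0, 0.5, 1.0])
```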

secretflow.stats.core.psi_core.psi(X: Union[DataFrame, array], Y: Union[DataFrame, array], split_points: array)[source]#

Calculate the population stability index.

Parameters:

X – Union[pd.DataFrame, jnp.array] A collection of samples.
Y – Union[pd.DataFrame, jnp.array] A collection of samples.
split_points – jnp.array An ordered sequence of split points.

Returns:

result: float: population stability index

secretflow.stats.core.pva_core#

Functions:

pva(actual, prediction, target)

Compute Prediction Vs Actual score.

secretflow.stats.core.pva_core.pva(actual: Union[DataFrame, array], prediction: Union[DataFrame, array], target)[source]#

Compute the Prediction Vs Actual score.

Parameters:

actual – Union[pd.DataFrame, jnp.array]
prediction – Union[pd.DataFrame, jnp.array]
target – numeric The target label in actual entries to consider.

Returns:

result: float: abs(mean(prediction) - sum(actual == target)/count(actual))
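The formula in the return description can be checked by hand with a short plain-Python sketch. It works on lists rather than DataFrames or jnp arrays, and `pva_score` is a hypothetical name used only for this illustration.

```python
def pva_score(actual, prediction, target):
    """abs(mean(prediction) - share of actual entries equal to target)."""
    mean_prediction = sum(prediction) / len(prediction)
    actual_rate = sum(1 for value in actual if value == target) / len(actual)
    return abs(mean_prediction - actual_rate)


# Three of four actual labels equal the target, while the mean prediction is 0.6.
score = pva_score(actual=[1, 0, 1, 1], prediction=[0.9, 0.2, 0.6, 0.7], target=1)
```

A score near zero means the average predicted probability matches the observed rate of the target label.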

secretflow.stats.core.utils#

Functions:

`newton_matrix_inverse`(x[, iter_round])	Compute the inverse of a matrix by Newton iteration.
`equal_obs`(x, n_bin)	Equal-frequency split point search in x with n_bin bins; each bin holds an equal number of points.
`equal_range`(x, n_bin)	Equal-range split point search in x with n_bin bins; returns a jnp.array of size n_bin+1.

secretflow.stats.core.utils.newton_matrix_inverse(x: ndarray, iter_round: int = 20)[source]#: Compute the inverse of a matrix by Newton iteration. https://aalexan3.math.ncsu.edu/articles/mat-inv-rep.pdf
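A common form of Newton iteration for matrix inversion is the Newton-Schulz update X_{k+1} = X_k (2I - A X_k). The sketch below applies it to a small matrix using nested lists. The starting guess X_0 = A^T / (||A||_1 * ||A||_inf) is a standard choice that guarantees convergence for nonsingular A; whether the library uses the same initial guess is an assumption, and `newton_inverse` and `matmul` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def newton_inverse(a, iter_round=20):
    """Approximate the inverse of square matrix a via X <- X (2I - A X)."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))  # max column sum
    norm_inf = max(sum(abs(v) for v in row) for row in a)                # max row sum
    # Initial guess: scaled transpose of a (assumed choice, see lead-in).
    x = [[a[j][i] / (norm_1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iter_round):
        ax = matmul(a, x)
        correction = [
            [(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)] for i in range(n)
        ]
        x = matmul(x, correction)
    return x


inverse = newton_inverse([[4.0, 1.0], [2.0, 3.0]])
```

The iteration uses only matrix multiplication and subtraction, with no division or pivoting inside the loop, and the error roughly squares at each step once the iterate is close.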

secretflow.stats.core.utils.equal_obs(x, n_bin)[source]#

Equal-frequency split point search in x with n_bin bins. Each bin holds an equal number of points.

Parameters:

x – array
n_bin – int

Returns:

jnp.array with size n_bin+1
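Equal-frequency split points are quantiles of the data. The sketch below builds n_bin+1 points from the minimum, the interior quantiles, and the maximum using the standard library's statistics.quantiles. The interpolation method, the handling of ties, and the inclusion of both end points are assumptions; the library may pick its cut points differently. `equal_frequency_points` is a hypothetical name.

```python
import statistics


def equal_frequency_points(x, n_bin):
    """Return n_bin + 1 split points: min, the n_bin - 1 interior quantiles, max."""
    interior = statistics.quantiles(x, n=n_bin, method="inclusive")
    return [float(min(x))] + [float(q) for q in interior] + [float(max(x))]


points = equal_frequency_points([5, 1, 9, 3, 7, 2, 8, 4, 6], n_bin=4)
```

With nine evenly spaced values and four bins, the bins cannot hold exactly equal counts, so the split is only approximately even.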

secretflow.stats.core.utils.equal_range(x, n_bin)[source]#: Equal-range split point search in x with n_bin bins. Returns a jnp.array with size n_bin+1.
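Equal-range split points divide the interval between the smallest and largest value into n_bin bins of equal width, which yields n_bin+1 points. A minimal plain-Python sketch follows; it assumes the range runs from min(x) to max(x), and `equal_range_points` is a hypothetical name.

```python
def equal_range_points(x, n_bin):
    """Return n_bin + 1 evenly spaced split points from min(x) to max(x)."""
    low, high = min(x), max(x)
    step = (high - low) / n_bin
    return [low + i * step for i in range(n_bin + 1)]


points = equal_range_points([2.0, 10.0, 4.0, 6.0], n_bin=4)
```

Unlike equal-frequency binning, the bin widths are fixed here, so the number of samples per bin depends on how the data are distributed.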