secretflow.stats.core
secretflow.stats.core.biclassification_eval_core
Classes:
Report | Report containing all other reports for bi-classification evaluation.
PrReport | Precision-related statistics report.
SummaryReport | Summary report for bi-classification evaluation.
GroupReport | Report for each group.
EqBinReport | Statistics report for each bin.
Functions:
gen_all_reports | Generate all reports.
create_sorted_label_score_pair | Produce an n * 2 array whose second column holds the scores sorted in decreasing order.
eq_frequent_bin_evaluate | Fill the equal-frequency bin report.
eq_range_bin_evaluate | Fill the equal-range bin report.
evaluate_bins | Evaluate bins given sorted pairs, pos_count and split_points (in decreasing order).
bin_evaluate | Evaluate statistics for a bin.
gen_pr_reports | Generate a PR report for each specified threshold.
precision_recall_false_positive_rate |
confusion_matrix_from_cum_counts | Compute the confusion matrix.
binary_clf_curve | Calculate true and false positives per binary classification threshold (can be used for a ROC curve or a precision/recall curve).
roc_curve | Compute the receiver operating characteristic (ROC).
auc | Compute the area under the curve (AUC) using the trapezoidal rule.
binary_roc_auc | Compute the area under the ROC curve (AUC) from the labels and prediction scores in sorted_pairs.
| Calculate the F1 score.
- class secretflow.stats.core.biclassification_eval_core.Report(eq_frequent_result_arr_list, eq_range_result_arr_list, summary_report_arr, head_prs)
Bases:
object
Report containing all other reports for bi-classification evaluation.
- summary_report
SummaryReport
- group_reports
List[GroupReport]
- eq_frequent_bin_report
List[EqBinReport]
- eq_range_bin_report
List[EqBinReport]
- head_report
List[PrReport], the reports for fpr = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2
Methods:
__init__(eq_frequent_result_arr_list, ...)
- class secretflow.stats.core.biclassification_eval_core.PrReport(arr)
Bases:
object
Precision-related statistics report.
- fpr
float, FP/(FP+TN)
- precision
float, TP/(TP+FP)
- recall
float, TP/(TP+FN)
Methods:
__init__(arr)
- class secretflow.stats.core.biclassification_eval_core.SummaryReport(arr)
Bases:
object
Summary report for bi-classification evaluation.
- total_samples
int
- positive_samples
int
- negative_samples
int
- auc
float, area under the curve: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
- ks
float, Kolmogorov-Smirnov statistic: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
- f1_score
float, harmonic mean of precision and recall: https://en.wikipedia.org/wiki/F-score
Methods:
__init__(arr)
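The f1_score attribute is described as the harmonic mean of precision and recall. A minimal sketch of that formula in plain Python follows; the function name and the zero-denominator convention are assumptions made for illustration, not taken from the library.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall.

    Returning 0.0 when both inputs are zero is an assumed convention.
    """
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# precision = 0.5, recall = 1.0 gives 2 * 0.5 * 1.0 / 1.5
score = f1_from_precision_recall(0.5, 1.0)
```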
- class secretflow.stats.core.biclassification_eval_core.GroupReport
Bases:
object
Report for each group.
Attributes:
- group_name: str
- summary: SummaryReport
- class secretflow.stats.core.biclassification_eval_core.EqBinReport(arr)
Bases:
object
Statistics report for each bin.
- start_value
float
- end_value
float
- positive
int
- negative
int
- total
int
- precision
float
- recall
float
- false_positive_rate
float
- f1_score
float
- lift
- predicted_positive_ratio
float, predicted positive samples / total samples.
- predicted_negative_ratio
float, predicted negative samples / total samples.
- cumulative_percent_of_positive
float
- cumulative_percent_of_negative
float
- total_cumulative_percent
float
- ks
float
- avg_score
float
Methods:
__init__(arr)
- secretflow.stats.core.biclassification_eval_core.gen_all_reports(y_true: Union[DataFrame, array], y_score: Union[DataFrame, array], bin_size: int)
Generate all reports.
- Parameters:
y_true – Union[pd.DataFrame, jnp.array]. Should be of shape n * 1 with binary entries; 1 means a positive sample.
y_score – Union[pd.DataFrame, jnp.array]. Should be of shape n * 1 with each entry in [0, 1], the probability of being positive.
bin_size – int. Number of bins to evaluate.
Returns:
- secretflow.stats.core.biclassification_eval_core.create_sorted_label_score_pair(y_true: array, y_score: array)
Produce an n * 2 array whose second column holds the scores sorted in decreasing order.
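As a rough illustration of the layout described here, the sketch below pairs each label with its score and orders the rows by score, highest first. It uses plain Python lists rather than jnp arrays, and the function name is hypothetical.

```python
def sorted_label_score_pairs(y_true, y_score):
    """Return [label, score] rows ordered by score in decreasing order."""
    rows = [[label, score] for label, score in zip(y_true, y_score)]
    # Column 0 is the label, column 1 the score.
    return sorted(rows, key=lambda row: row[1], reverse=True)


pairs = sorted_label_score_pairs([1, 0, 1, 0], [0.9, 0.4, 0.65, 0.1])
```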
- secretflow.stats.core.biclassification_eval_core.eq_frequent_bin_evaluate(sorted_pairs: array, pos_count: int, bin_size: int) -> List[array]
Fill the equal-frequency bin report.
- Parameters:
sorted_pairs – jnp.array. Should be of shape n * 2 with the second column sorted.
pos_count – int. Total number of positive samples.
bin_size – int. Total number of bins.
- Returns:
bin_reports – List[jnp.array]
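Equal-frequency binning places roughly the same number of samples in each bin. The sketch below shows one way to pick such boundaries from a list of scores; it is an illustrative assumption about the idea, not the library's split-point search, and the function name is hypothetical.

```python
def equal_frequency_split_points(scores, n_bins):
    """Pick n_bins + 1 boundaries so each bin holds about len(scores) / n_bins values."""
    ordered = sorted(scores)
    n = len(ordered)
    # Boundary i is the value found i / n_bins of the way through the sorted data.
    points = [ordered[min(i * n // n_bins, n - 1)] for i in range(n_bins)]
    points.append(ordered[-1])  # closing boundary is the maximum
    return points


scores = [0.8, 0.1, 0.5, 0.3, 0.7, 0.2, 0.6, 0.4]
points = equal_frequency_split_points(scores, 4)
```

With eight samples and four bins, each interval between consecutive boundaries contains two samples, with the last interval closed on the right.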
- secretflow.stats.core.biclassification_eval_core.eq_range_bin_evaluate(sorted_pairs: array, pos_count: int, bin_size: int) -> List[array]
Fill the equal-range bin report.
- Parameters:
sorted_pairs – jnp.array. Should be of shape n * 2 with the second column sorted.
pos_count – int. Total number of positive samples.
bin_size – int. Total number of bins.
- Returns:
bin_reports – List[jnp.array]
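Equal-range binning divides the interval between the smallest and largest score into bins of equal width. A minimal sketch under that assumption, with a hypothetical function name and plain Python lists:

```python
def equal_range_split_points(scores, n_bins):
    """Return n_bins + 1 evenly spaced boundaries from min(scores) to max(scores)."""
    low, high = min(scores), max(scores)
    return [low + i * (high - low) / n_bins for i in range(n_bins + 1)]


points = equal_range_split_points([0.2, 1.0, 0.0, 0.6], 4)
```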
- secretflow.stats.core.biclassification_eval_core.evaluate_bins(sorted_pairs: array, pos_count: int, split_points) -> List[array]
Evaluate bins given sorted pairs, pos_count and split_points (in decreasing order).
- secretflow.stats.core.biclassification_eval_core.bin_evaluate(sorted_pairs, start_pos, end_pos, total_pos_count, total_neg_count, cumulative_pos_count, cumulative_neg_count) -> Tuple[array, int, int]
Evaluate statistics for a bin.
- Returns:
- bin_report_arr: jnp.array
an array of size BIN_REPORT_STATISTICS_ENTRY_COUNT
- cumulative_pos_count: int
- cumulative_neg_count: int
- secretflow.stats.core.biclassification_eval_core.gen_pr_reports(sorted_pairs: array, thresholds: array) -> List[array]
Generate a PR report for each specified threshold.
- Parameters:
sorted_pairs – jnp.array. y_true, y_score pairs sorted by y_score in increasing order, of shape n_samples * 2.
thresholds – 1d jnp.ndarray. Prediction thresholds at which to evaluate.
- Returns:
- pr_report_arr: List[jnp.array]
a list of PR reports, each a jnp.array of shape 3 * 1; the list length equals len(thresholds)
- secretflow.stats.core.biclassification_eval_core.precision_recall_false_positive_rate(true_positive, false_positive, false_negative, true_negative) -> Tuple[float, float, float]
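This entry carries no description. Going by the formulas listed under PrReport (precision = TP/(TP+FP), recall = TP/(TP+FN), fpr = FP/(FP+TN)), the computation can be sketched as follows. The function name and the handling of empty denominators are assumptions made for illustration.

```python
def precision_recall_fpr(tp, fp, fn, tn):
    """Return (precision, recall, false positive rate) from confusion counts.

    Falling back to 0.0 on an empty denominator is an assumed convention.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr


precision, recall, fpr = precision_recall_fpr(tp=8, fp=2, fn=2, tn=8)
```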
- secretflow.stats.core.biclassification_eval_core.confusion_matrix_from_cum_counts(cumulative_pos_count, cumulative_neg_count, total_neg_count, total_pos_count)
Compute the confusion matrix.
- Parameters:
cumulative_pos_count – int
cumulative_neg_count – int
total_neg_count – int
total_pos_count – int
- Returns:
- true_positive: int
- true_negative: int
- false_positive: int
- false_negative: int
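One plausible reading of the cumulative counts is that they tally the positive and negative samples scored at or above the current cut, that is, the samples predicted positive. Under that assumption (not confirmed by the text above), the four confusion entries follow by subtraction. The function name below is hypothetical.

```python
def confusion_from_cumulative(cum_pos, cum_neg, total_neg, total_pos):
    """Derive confusion counts, assuming the cumulative counts cover predicted positives."""
    true_positive = cum_pos
    false_positive = cum_neg
    false_negative = total_pos - cum_pos
    true_negative = total_neg - cum_neg
    # Same order as the documented return values.
    return true_positive, true_negative, false_positive, false_negative


counts = confusion_from_cumulative(cum_pos=3, cum_neg=2, total_neg=10, total_pos=5)
```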
- secretflow.stats.core.biclassification_eval_core.binary_clf_curve(sorted_pairs: array) -> Tuple[array, array, array]
Calculate true and false positives per binary classification threshold (can be used for a ROC curve or a precision/recall curve).
- Parameters:
sorted_pairs – jnp.array. y_true, y_score pairs sorted by y_score in decreasing order.
- Returns:
- fps: 1d ndarray
False positive counts. Index i records the number of negative samples assigned a score >= thresholds[i]. The total number of negative samples equals fps[-1] (thus true negatives are given by fps[-1] - fps).
- tps: 1d ndarray
True positive counts. Index i records the number of positive samples assigned a score >= thresholds[i]. The total number of positive samples equals tps[-1] (thus false negatives are given by tps[-1] - tps).
- thresholds: 1d ndarray
Distinct predicted scores sorted in decreasing order.
References
GitHub: scikit-learn _binary_clf_curve.
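The return values can be illustrated with a small loop over pairs that are already sorted by decreasing score. This is a plain-Python sketch of the idea, not the jnp implementation; the function name is hypothetical.

```python
def binary_clf_curve_sketch(sorted_pairs):
    """Cumulative false/true positive counts at each distinct score.

    sorted_pairs: (label, score) tuples sorted by score in decreasing order.
    """
    fps, tps, thresholds = [], [], []
    tp = fp = 0
    for i, (label, score) in enumerate(sorted_pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        is_last = i + 1 == len(sorted_pairs) or sorted_pairs[i + 1][1] != score
        if is_last:
            fps.append(fp)
            tps.append(tp)
            thresholds.append(score)
    return fps, tps, thresholds


fps, tps, thresholds = binary_clf_curve_sketch(
    [(1, 0.9), (0, 0.8), (1, 0.8), (0, 0.3)]
)
```

In this example fps[-1] is the number of negatives and tps[-1] the number of positives, matching the description above.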
- secretflow.stats.core.biclassification_eval_core.roc_curve(sorted_pairs: array) -> Tuple[array, array, array]
Compute the receiver operating characteristic (ROC).
Compared to the sklearn implementation, this implementation omits most of the conditional branches and the checks for ill-conditioned input.
- Parameters:
sorted_pairs – jnp.array. y_true, y_score pairs sorted by y_score in decreasing order.
- Returns:
- fpr: ndarray of shape (>2,)
Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i].
- tpr: ndarray of shape (>2,)
Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].
- thresholds: ndarray of shape (n_thresholds,)
Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1.
References
GitHub: scikit-learn roc_curve.
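Given cumulative false and true positive counts per threshold, the ROC points are obtained by prepending the "nothing predicted positive" point and dividing by the class totals. The sketch below assumes such count lists as input and that both classes are present; the function name is hypothetical.

```python
def roc_from_counts(fps, tps, thresholds):
    """Turn cumulative counts into (fpr, tpr, thresholds) with a leading origin point."""
    fps = [0] + list(fps)
    tps = [0] + list(tps)
    # thresholds[0] stands for "no instance predicted positive".
    thresholds = [max(thresholds) + 1] + list(thresholds)
    fpr = [fp / fps[-1] for fp in fps]  # fps[-1] is the number of negatives
    tpr = [tp / tps[-1] for tp in tps]  # tps[-1] is the number of positives
    return fpr, tpr, thresholds


fpr, tpr, thr = roc_from_counts([0, 1, 2], [1, 2, 2], [0.9, 0.8, 0.3])
```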
- secretflow.stats.core.biclassification_eval_core.auc(x, y)
Compute the area under the curve (AUC) using the trapezoidal rule.
- Parameters:
x – ndarray of shape (n,). Monotonic X coordinates.
y – ndarray of shape (n,). Y coordinates.
- Returns:
- auc: float
Area under the curve.
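The trapezoidal rule sums, for each pair of neighbouring points, the width of the interval times the average of the two heights. A minimal plain-Python sketch with a hypothetical function name:

```python
def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    area = 0.0
    for i in range(1, len(x)):
        area += (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    return area


# ROC points for a toy example with two positives and two negatives.
area = trapezoid_area([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0])
```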
- secretflow.stats.core.biclassification_eval_core.binary_roc_auc(sorted_pairs: array) -> float
Compute the area under the ROC curve (AUC) from the labels and prediction scores in sorted_pairs.
Compared to the sklearn implementation, this implementation is simplified: it offers fewer options and omits most of the conditional branches and the checks for ill-conditioned input.
- Parameters:
sorted_pairs – jnp.array. y_true, y_score pairs sorted by y_score in decreasing order, of shape n_samples * 2.
- Returns:
roc_auc – float
References
GitHub: scikit-learn _binary_roc_auc_score.
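A useful cross-check for any ROC AUC value is the pairwise definition: the fraction of (positive, negative) pairs in which the positive sample scores higher, counting ties as one half. The sketch below uses that alternative technique, not the curve-based method described above, and the function name is hypothetical. It assumes both classes are present.

```python
def pairwise_auc(pairs):
    """ROC AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [score for label, score in pairs if label == 1]
    neg = [score for label, score in pairs if label != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


value = pairwise_auc([(1, 0.9), (0, 0.8), (1, 0.8), (0, 0.3)])
```

The quadratic loop is only suitable for small inputs; it is meant for validating a result, not for production use.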
secretflow.stats.core.psi_core
Functions:
psi_index | (a - b) * ln(a/b).
psi_score | Compute the PSI score.
distribution_generation | Generate a distribution of X according to split points.
psi | Calculate the population stability index.
- secretflow.stats.core.psi_core.psi_index(a, b)
(a - b) * ln(a/b).
- Parameters:
a – array or float
b – array or float. a and b must be of the same type: float, jnp.array or np.array.
- Returns:
- result: array or float
same type as a and b.
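For scalar inputs, the formula (a - b) * ln(a/b) can be written directly with the standard library. The function name is hypothetical, and the sketch assumes both arguments are strictly positive so that the logarithm is defined.

```python
import math


def psi_index_sketch(a, b):
    """(a - b) * ln(a / b) for positive floats a and b."""
    return (a - b) * math.log(a / b)


term = psi_index_sketch(0.2, 0.1)
```

The term is zero when a equals b and non-negative otherwise, since (a - b) and ln(a/b) always share the same sign.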
- secretflow.stats.core.psi_core.psi_score(A: array, B: array)
Compute the PSI score.
- Parameters:
A – jnp.array. Distribution of sample A.
B – jnp.array. Distribution of sample B.
- Returns:
result – float
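Assuming the score follows the usual population stability index definition, it is the sum of the per-bin terms (a - b) * ln(a/b) over two distributions with matching bins. The sketch below makes that assumption explicit; the function name is hypothetical and every bin proportion is assumed to be strictly positive.

```python
import math


def psi_score_sketch(dist_a, dist_b):
    """Sum of (a - b) * ln(a / b) over matching bins of two distributions."""
    return sum((a - b) * math.log(a / b) for a, b in zip(dist_a, dist_b))


score = psi_score_sketch([0.5, 0.5], [0.25, 0.75])
```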
- secretflow.stats.core.psi_core.distribution_generation(X: array, split_points: array)
Generate a distribution of X according to split points.
- Parameters:
X – jnp.array. A collection of samples.
split_points – jnp.array. An ordered sequence of split points.
- Returns:
- dist_X: jnp.array
the distribution, given as the fraction of samples falling in each bin. bin[0] is [split_points[0], split_points[1]).
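The binning rule above, with bin[0] covering [split_points[0], split_points[1]), can be sketched in plain Python. Treating every bin as half-open and ignoring samples outside the split range are assumptions of this sketch, and the function name is hypothetical.

```python
def distribution_sketch(samples, split_points):
    """Fraction of samples in each half-open bin [split_points[i], split_points[i + 1])."""
    counts = [0] * (len(split_points) - 1)
    for value in samples:
        for i in range(len(counts)):
            if split_points[i] <= value < split_points[i + 1]:
                counts[i] += 1
                break  # a value belongs to at most one bin
    return [count / len(samples) for count in counts]


dist = distribution_sketch([0.1, 0.2, 0.6, 0.7], [0.0, 0.5, 1.0])
```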
- secretflow.stats.core.psi_core.psi(X: Union[DataFrame, array], Y: Union[DataFrame, array], split_points: array)
Calculate the population stability index.
- Parameters:
X – Union[pd.DataFrame, jnp.array]. A collection of samples.
Y – Union[pd.DataFrame, jnp.array]. A collection of samples.
split_points – jnp.array. An ordered sequence of split points.
- Returns:
- result: float
the population stability index.
secretflow.stats.core.pva_core
Functions:
pva | Compute the Prediction Vs Actual score.
- secretflow.stats.core.pva_core.pva(actual: Union[DataFrame, array], prediction: Union[DataFrame, array], target)
Compute the Prediction Vs Actual score.
- Parameters:
actual – Union[pd.DataFrame, jnp.array]
prediction – Union[pd.DataFrame, jnp.array]
target – numeric. The target label in the actual entries to consider.
- Returns:
- result: float
abs(mean(prediction) - sum(actual == target)/count(actual))
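The returned expression, abs(mean(prediction) - sum(actual == target)/count(actual)), compares the average predicted value with the observed share of the target label. A plain-Python sketch of that expression, with a hypothetical function name:

```python
def pva_sketch(actual, prediction, target):
    """Absolute gap between the mean prediction and the observed rate of `target`."""
    mean_prediction = sum(prediction) / len(prediction)
    target_rate = sum(1 for label in actual if label == target) / len(actual)
    return abs(mean_prediction - target_rate)


gap = pva_sketch(actual=[1, 0, 1, 1], prediction=[0.5, 0.25, 0.75, 0.5], target=1)
```

Here the mean prediction is 0.5 while three of the four actual labels equal the target, so the gap is 0.25.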
secretflow.stats.core.utils
Functions:
newton_matrix_inverse | Compute the inverse of a matrix by Newton iteration.
| Equal-frequency split point search in x with n_bins bins, so that each bin holds an equal number of points.
| Equal-range split point search in x with n_bins bins. Returns a jnp.array of size n_bins + 1.
- secretflow.stats.core.utils.newton_matrix_inverse(x: ndarray, iter_round: int = 20)
Compute the inverse of a matrix by Newton iteration. See https://aalexan3.math.ncsu.edu/articles/mat-inv-rep.pdf
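A common form of Newton iteration for matrix inversion is the Newton-Schulz update X_{k+1} = X_k (2I - A X_k). The sketch below assumes that update and the starting guess A^T / (||A||_1 ||A||_inf); neither is confirmed to be what this function uses. It works on nested Python lists, and all names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def newton_inverse_sketch(a, iter_round=20):
    """Approximate the inverse of square matrix `a` by Newton-Schulz iteration."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))  # max column sum
    norm_inf = max(sum(abs(v) for v in row) for row in a)  # max row sum
    # Starting guess: scaled transpose, which keeps the iteration convergent.
    x = [[a[j][i] / (norm_1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iter_round):
        ax = matmul(a, x)
        correction = [
            [(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)] for i in range(n)
        ]
        x = matmul(x, correction)
    return x


inverse = newton_inverse_sketch([[4.0, 1.0], [2.0, 3.0]])
```

The iteration uses only matrix multiplications and additions, and its error shrinks quadratically once the residual I - A X is small.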