SecretFlow End-to-End Capabilities for Financial Risk Control#
This tutorial is only available in Chinese.
Last updated: Nov 9, 2022
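Please use SecretFlow v0.7.11 or later for this experiment.
The code below is for demonstration only; do not use it directly in production.
This experiment shows how to use SecretFlow to develop the logistic regression and XGB models commonly used in risk control.
SecretFlow will next release model deployment and online/offline model prediction features. Stay tuned.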
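Goals#
In this experiment, we use an open dataset to train a logistic regression model and an XGB model, both common in financial risk control. The process consists of the following steps:
Sample alignment
Feature preprocessing
Data analysis
Model training
Model prediction
Model evaluation
Please run all steps in order so that the experiment completes successfully.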
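Preparation#
Initialize the SecretFlow framework#
This experiment involves two nodes, alice and bob. In a real business scenario they represent two different entities: their raw data may not be transferred directly to each other, yet both parties' raw data is used together to develop a model.
The code below sets up a SecretFlow cluster on the alice and bob nodes and creates three devices:
alice: PYU device, responsible for local computation on alice's side. The inputs, the computation, and the results are visible only to alice.
bob: PYU device, responsible for local computation on bob's side. The inputs, the computation, and the results are visible only to bob.
spu: SPU device, responsible for secure computation between alice and bob. Inputs and results are kept in secret-shared form, with alice and bob each holding one share. The computation is an MPC computation executed jointly by the SPU Runtimes of alice and bob.
If some of these concepts, such as the SPU device, are unfamiliar, please refer to this `document <../developer/design/architecture.md>`__.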
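To make "alice and bob each hold one share" concrete, here is a minimal sketch of two-party additive secret sharing in plain Python. It is a toy illustration and not the SPU's actual protocol or API: the ring size `MODULUS` and the helper names `share`, `reconstruct`, and `add_shared` are assumptions made for this example.

```python
import secrets

MODULUS = 2**64  # assumed ring size, chosen for illustration only


def share(value):
    """Split value into two additive shares; each one alone looks random."""
    alice_share = secrets.randbelow(MODULUS)
    bob_share = (value - alice_share) % MODULUS
    return alice_share, bob_share


def reconstruct(alice_share, bob_share):
    """Recover the value; this requires both shares."""
    return (alice_share + bob_share) % MODULUS


def add_shared(x, y):
    """Add two shared values: each party adds its own shares locally."""
    return (x[0] + y[0]) % MODULUS, (x[1] + y[1]) % MODULUS


x = share(1500)
y = share(668)
total = add_shared(x, y)
print(reconstruct(*total))  # 2168
```

Neither share reveals anything about the value on its own, which is why results stay hidden until both parties agree to combine their shares.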
[1]:
import secretflow as sf
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
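In the log output of this cell, you should see that creating the spu starts one SPURuntime on each of alice's and bob's sides and that the two runtimes connect to each other.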
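Dataset#
The raw data for this experiment is the Bank Marketing Data Set from UCI. It collects the results of a telemarketing campaign run by a Portuguese banking institution.
We add a uid column for the private set intersection experiment that follows.
Let's first look at what the dataset contains.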
[2]:
import pandas as pd
# secretflow.utils.simulation.datasets contains mirrors of some popular open datasets.
from secretflow.utils.simulation.datasets import dataset
df = pd.read_csv(dataset('bank_marketing_full'), sep=';')
df['uid'] = df.index + 1
df
[2]:
index | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | uid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no | 1 |
1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no | 2 |
2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no | 3 |
3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no | 4 |
4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45206 | 51 | technician | married | tertiary | no | 825 | no | no | cellular | 17 | nov | 977 | 3 | -1 | 0 | unknown | yes | 45207 |
45207 | 71 | retired | divorced | primary | no | 1729 | no | no | cellular | 17 | nov | 456 | 2 | -1 | 0 | unknown | yes | 45208 |
45208 | 72 | retired | married | secondary | no | 5715 | no | no | cellular | 17 | nov | 1127 | 5 | 184 | 3 | success | yes | 45209 |
45209 | 57 | blue-collar | married | secondary | no | 668 | no | no | telephone | 17 | nov | 508 | 4 | -1 | 0 | unknown | no | 45210 |
45210 | 37 | entrepreneur | married | secondary | no | 2971 | no | no | cellular | 17 | nov | 361 | 2 | 188 | 11 | other | no | 45211 |
45211 rows × 18 columns
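The dataset contains 45211 samples, each representing one target customer.
Each sample has 16 features. The table below briefly describes them, together with the uid column we added.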
feature | Description | Values |
---|---|---|
uid | Customer ID | numeric |
age | Age | numeric |
job | Job type | 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown' |
marital | Marital status | 'divorced', 'married', 'single', 'unknown' |
education | Education level | 'tertiary', 'secondary', 'unknown', 'primary' |
default | Has credit in default | 'no', 'yes', 'unknown' |
balance | Average yearly balance | numeric |
housing | Has a housing loan | 'no', 'yes', 'unknown' |
loan | Has a personal loan | 'no', 'yes', 'unknown' |
contact | Contact type | 'cellular', 'telephone', 'unknown' |
month | Month of last contact | 'jan', 'feb', 'mar', …, 'nov', 'dec' |
day | Day of month of last contact | numeric |
duration | Duration of last contact | numeric |
campaign | Number of contacts made during this campaign | numeric |
pdays | Days elapsed since the previous contact | numeric |
previous | Number of contacts made before this campaign | numeric |
poutcome | Outcome of the previous campaign | 'unknown', 'failure', 'other', 'success' |
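The label y of each sample is the marketing outcome for the target customer (whether the customer subscribed to a term deposit); its values are 'yes' and 'no'.
We assume the 16 features above are held by two institutions, as follows.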
alice: age, job, marital, education, default, balance, housing, loan
bob: contact, day, month, duration, campaign, pdays, previous, poutcome, y
In real business scenarios, the data held by alice and bob may not be aligned. To simulate this, we shuffle the dataset and randomly take 90% of the rows for each party.
[3]:
import numpy as np
df_alice = df.iloc[:, np.r_[0:8, -1]].sample(frac=0.9)
df_alice
[3]:
index | age | job | marital | education | default | balance | housing | loan | uid |
---|---|---|---|---|---|---|---|---|---|
30775 | 33 | technician | single | secondary | no | 10 | yes | no | 30776 |
38044 | 55 | admin. | divorced | secondary | no | -288 | yes | no | 38045 |
10075 | 60 | services | divorced | secondary | no | 47 | no | no | 10076 |
37076 | 33 | services | married | secondary | no | 56 | yes | no | 37077 |
11067 | 52 | management | married | tertiary | no | 7388 | no | yes | 11068 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9083 | 55 | retired | married | primary | no | 0 | no | no | 9084 |
39576 | 25 | student | single | tertiary | no | 241 | no | no | 39577 |
25186 | 40 | management | married | tertiary | no | 11 | yes | no | 25187 |
10436 | 36 | entrepreneur | married | primary | no | 6317 | no | no | 10437 |
1158 | 59 | retired | married | unknown | no | 0 | no | no | 1159 |
40690 rows × 9 columns
[4]:
df_bob = df.iloc[:, 8:].sample(frac=0.9)
df_bob
[4]:
index | contact | day | month | duration | campaign | pdays | previous | poutcome | y | uid |
---|---|---|---|---|---|---|---|---|---|---|
11614 | unknown | 19 | jun | 211 | 1 | -1 | 0 | unknown | no | 11615 |
24743 | cellular | 18 | nov | 150 | 1 | -1 | 0 | unknown | no | 24744 |
42588 | cellular | 30 | dec | 158 | 3 | -1 | 0 | unknown | no | 42589 |
4322 | unknown | 19 | may | 187 | 3 | -1 | 0 | unknown | no | 4323 |
15930 | cellular | 22 | jul | 76 | 1 | -1 | 0 | unknown | no | 15931 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
30143 | cellular | 4 | feb | 76 | 2 | 204 | 3 | other | no | 30144 |
16730 | cellular | 24 | jul | 149 | 2 | -1 | 0 | unknown | no | 16731 |
35775 | cellular | 8 | may | 36 | 1 | -1 | 0 | unknown | no | 35776 |
7050 | unknown | 28 | may | 318 | 2 | -1 | 0 | unknown | no | 7051 |
42846 | cellular | 3 | feb | 355 | 1 | 301 | 5 | success | yes | 42847 |
40690 rows × 10 columns
We save df_alice and df_bob to files, which serve as the raw inputs of alice and bob.
This completes all preparation for the experiment.
[5]:
import tempfile
_, alice_path = tempfile.mkstemp()
_, bob_path = tempfile.mkstemp()
df_alice.reset_index(drop=True).to_csv(alice_path, index=False)
df_bob.reset_index(drop=True).to_csv(bob_path, index=False)
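Sample alignment (private set intersection)#
The first step is to align the data of the two parties. Private set intersection (PSI) is a cryptographic technique that computes the intersection of two sets without revealing any other information. In SecretFlow, the SPU device supports three PSI protocols:
ECDH: semi-honest model, based on public-key cryptography. It was originally suited to small datasets, but with SecretFlow's optimizations it can handle data on the order of one billion items.
KKRT: semi-honest model, based on cuckoo hashing and efficient oblivious transfer extension (OT extension). Suited to large datasets (for example, tens of millions of items).
BC22PCG: semi-honest model, based on a pseudorandom correlation generator. Suited to large datasets.
Since our dataset is small, we use ECDH here.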
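The idea behind ECDH-style PSI can be sketched in a few lines: each party masks hashed items with its own secret exponent, and because the masking commutes, items that both parties hold end up with the same doubly masked value. The sketch below is a toy and makes several assumptions: it uses modular exponentiation over a prime field instead of an elliptic curve, runs both parties in one process, and is not secure. The names `P`, `hash_to_group`, `mask`, and `toy_ecdh_psi` are hypothetical and are not part of SecretFlow.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus; real ECDH-PSI works on an elliptic curve


def hash_to_group(item):
    """Map an item to an integer modulo P."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest, "big") % P


def mask(values, key):
    """Raise every value to a secret exponent; this operation commutes."""
    return [pow(v, key, P) for v in values]


def toy_ecdh_psi(alice_ids, bob_ids):
    a = secrets.randbelow(P - 2) + 1  # alice's secret exponent
    b = secrets.randbelow(P - 2) + 1  # bob's secret exponent

    alice_once = mask([hash_to_group(x) for x in alice_ids], a)  # sent to bob
    bob_once = mask([hash_to_group(x) for x in bob_ids], b)      # sent to alice

    alice_twice = mask(alice_once, b)    # bob masks alice's items, keeps order
    bob_twice = set(mask(bob_once, a))   # alice masks bob's items

    # alice keeps her own items whose doubly masked value also came from bob
    return [x for x, m in zip(alice_ids, alice_twice) if m in bob_twice]


print(toy_ecdh_psi([1, 2, 3, 4], [3, 4, 5]))  # [3, 4]
```

Each party only ever sees the other side's items in masked form, which is what keeps the non-intersecting items hidden.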
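Option 1: save the PSI result to files#
In some scenarios, alice and bob may save the PSI result directly to files and continue with later steps from there. In that case, call the psi_csv interface.
In the code below, we specify the key to intersect on and each party's input and output file paths. For ECDH the two parties play symmetric roles, so receiver has no practical meaning and can be set to either party. protocol must be set to the protocol you want to run. When sort is True, the joined result is sorted.
See the psi_csv documentation for details.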
[6]:
_, alice_psi_path = tempfile.mkstemp()
_, bob_psi_path = tempfile.mkstemp()
spu.psi_csv(
    key="uid",
    input_path={alice: alice_path, bob: bob_path},
    output_path={alice: alice_psi_path, bob: bob_psi_path},
    receiver="alice",
    protocol="ECDH_PSI_2PC",
    sort=True,
)
(SPURuntime pid=45105) I1110 15:04:56.489761 45105 external/com_github_brpc_brpc/src/brpc/server.cpp:1070] Server[yacl::link::internal::ReceiverServiceImpl] is serving on port=23711.
(SPURuntime pid=45105) I1110 15:04:56.489876 45105 external/com_github_brpc_brpc/src/brpc/server.cpp:1073] Check out http://k69b13338.eu95sqa:23711 in web browser.
(SPURuntime pid=45105) I1110 15:04:56.591135 47953 external/com_github_brpc_brpc/src/brpc/socket.cpp:2236] Checking Socket{id=0 addr=127.0.0.1:39345} (0x5604daa51200)
(SPURuntime pid=45106) I1110 15:04:56.641627 45106 external/com_github_brpc_brpc/src/brpc/server.cpp:1070] Server[yacl::link::internal::ReceiverServiceImpl] is serving on port=39345.
(SPURuntime pid=45106) I1110 15:04:56.641724 45106 external/com_github_brpc_brpc/src/brpc/server.cpp:1073] Check out http://k69b13338.eu95sqa:39345 in web browser.
(SPURuntime pid=45105) I1110 15:04:59.591784 47928 external/com_github_brpc_brpc/src/brpc/socket.cpp:2296] Revived Socket{id=0 addr=127.0.0.1:39345} (0x5604daa51200) (Connectable)
(SPURuntime pid=45105) [2022-11-10 15:05:00.520] [info] [bucket_psi.cc:169] bucket size set to 1048576
(SPURuntime pid=45105) [2022-11-10 15:05:00.521] [info] [bucket_psi.cc:77] Begin sanity check for input file: /tmp/tmp0jqbsdge, precheck_switch:true
(SPURuntime pid=45105) [2022-11-10 15:05:00.544] [info] [csv_checker.cc:125] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/tmp --stable selected-keys.1668063900521197637 | LC_ALL=C uniq -d > duplicate-keys.1668063900521197637
(SPURuntime pid=45105) [2022-11-10 15:05:00.568] [info] [bucket_psi.cc:90] End sanity check for input file: /tmp/tmp0jqbsdge, size=40690
(SPURuntime pid=45105) [2022-11-10 15:05:00.568] [info] [bucket_psi.cc:190] Run psi protocol=1, self_items_count=40690
(SPURuntime pid=45105) [2022-11-10 15:05:00.568] [info] [cryptor_selector.cc:50] Using libSodium
(SPURuntime pid=45105) [2022-11-10 15:05:00.568] [info] [cipher_store.cc:25] Disk cache choose num_bins=64
(SPURuntime pid=45106) [2022-11-10 15:05:00.520] [info] [bucket_psi.cc:169] bucket size set to 1048576
(SPURuntime pid=45106) [2022-11-10 15:05:00.520] [info] [bucket_psi.cc:77] Begin sanity check for input file: /tmp/tmphctf58aa, precheck_switch:true
(SPURuntime pid=45106) [2022-11-10 15:05:00.545] [info] [csv_checker.cc:125] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/tmp --stable selected-keys.1668063900520830875 | LC_ALL=C uniq -d > duplicate-keys.1668063900520830875
(SPURuntime pid=45106) [2022-11-10 15:05:00.567] [info] [bucket_psi.cc:90] End sanity check for input file: /tmp/tmphctf58aa, size=40690
(SPURuntime pid=45106) [2022-11-10 15:05:00.568] [info] [bucket_psi.cc:190] Run psi protocol=1, self_items_count=40690
(SPURuntime pid=45106) [2022-11-10 15:05:00.568] [info] [cryptor_selector.cc:50] Using libSodium
(SPURuntime pid=45106) [2022-11-10 15:05:00.568] [info] [cipher_store.cc:25] Disk cache choose num_bins=64
(SPURuntime pid=45105) [2022-11-10 15:05:00.602] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(SPURuntime pid=45106) [2022-11-10 15:05:00.602] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(SPURuntime pid=45105) [2022-11-10 15:05:01.827] [info] [ecdh_psi.cc:70] MaskSelf:root--finished, batch_count=10
(SPURuntime pid=45105) [2022-11-10 15:05:01.934] [info] [ecdh_psi.cc:134] root recv last batch finished, batch_count=10
(SPURuntime pid=45105) [2022-11-10 15:05:01.989] [info] [ecdh_psi.cc:109] MaskPeer:root--finished, batch_count=10
(SPURuntime pid=45106) [2022-11-10 15:05:01.927] [info] [ecdh_psi.cc:109] MaskPeer:root--finished, batch_count=10
(SPURuntime pid=45106) [2022-11-10 15:05:01.939] [info] [ecdh_psi.cc:70] MaskSelf:root--finished, batch_count=10
(SPURuntime pid=45106) [2022-11-10 15:05:01.992] [info] [ecdh_psi.cc:134] root recv last batch finished, batch_count=10
[6]:
[{'party': 'alice', 'original_count': 40690, 'intersection_count': 36593},
{'party': 'bob', 'original_count': 40690, 'intersection_count': 36593}]
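As a sanity check, the reported intersection_count is close to what two independent 90% samples of the same 45211 rows would be expected to share. The short calculation below is only back-of-the-envelope arithmetic; the variable names are chosen for this example.

```python
total_rows = 45211
sampled_rows = round(total_rows * 0.9)  # rows kept by each party

# A given row is kept by alice with probability sampled_rows / total_rows,
# and independently by bob with the same probability.
expected_intersection = sampled_rows * sampled_rows / total_rows

print(sampled_rows)                  # 40690
print(round(expected_intersection))  # 36621
```

The observed value of 36593 differs from this expectation only by ordinary sampling variation.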
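Option 2: load the PSI result into a VDataFrame#
VDataFrame is SecretFlow's data structure for vertically partitioned data, and we use it throughout the following tasks.
Because this experiment continues with further steps after PSI, we use data.vertical.read_csv here to run PSI on the raw data and convert the result directly into a VDataFrame.
See the data.vertical.read_csv documentation. Many of its parameters are the same as those of psi_csv and are not repeated here.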
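Conceptually, the aligned result is a pair of row-aligned partitions, one per party, with the join key removed. The plain-Python sketch below illustrates only that shape. It computes the intersection in the clear, which SecretFlow does not do (it uses PSI), and `align_vertically` is a hypothetical helper, not a SecretFlow function.

```python
def align_vertically(alice_rows, bob_rows, key):
    """Keep rows whose key exists on both sides, sort by key, drop the key."""
    common = sorted({r[key] for r in alice_rows} & {r[key] for r in bob_rows})
    alice_by_key = {r[key]: r for r in alice_rows}
    bob_by_key = {r[key]: r for r in bob_rows}

    def strip(row):
        return {k: v for k, v in row.items() if k != key}

    return (
        [strip(alice_by_key[k]) for k in common],
        [strip(bob_by_key[k]) for k in common],
    )


alice_rows = [{"uid": 3, "age": 33}, {"uid": 1, "age": 58}, {"uid": 2, "age": 44}]
bob_rows = [{"uid": 2, "y": "no"}, {"uid": 4, "y": "yes"}, {"uid": 3, "y": "no"}]

alice_part, bob_part = align_vertically(alice_rows, bob_rows, "uid")
print(alice_part)  # [{'age': 44}, {'age': 33}]
print(bob_part)    # [{'y': 'no'}, {'y': 'no'}]
```

Row i of alice's partition and row i of bob's partition describe the same customer, while each party still holds only its own columns.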
[7]:
from secretflow.data.vertical import read_csv as v_read_csv
vdf = v_read_csv(
    {alice: alice_path, bob: bob_path},
    spu=spu,
    keys="uid",
    drop_keys="uid",
    psi_protocl="ECDH_PSI_2PC",
)
vdf.columns
(SPURuntime pid=45105) [2022-11-10 15:05:02.036] [info] [bucket_psi.cc:119] Begin post filtering, indices.size=36593, should_sort=true
(SPURuntime pid=45105) [2022-11-10 15:05:02.044] [info] [utils.cc:86] Executing sort scripts: tail -n +2 /tmp/tmp-sort-in-1668063902037299176 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=9,9 >>/tmp/tmp-sort-out-1668063902037299176
(SPURuntime pid=45105) [2022-11-10 15:05:02.076] [info] [utils.cc:88] Finished sort scripts: tail -n +2 /tmp/tmp-sort-in-1668063902037299176 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=9,9 >>/tmp/tmp-sort-out-1668063902037299176, ret=0
(SPURuntime pid=45105) [2022-11-10 15:05:02.076] [info] [bucket_psi.cc:157] End post filtering, in=/tmp/tmp0jqbsdge, out=/tmp/tmprde_yg6t
(SPURuntime pid=45106) [2022-11-10 15:05:02.039] [info] [bucket_psi.cc:119] Begin post filtering, indices.size=36593, should_sort=true
(SPURuntime pid=45106) [2022-11-10 15:05:02.047] [info] [utils.cc:86] Executing sort scripts: tail -n +2 /tmp/tmp-sort-in-1668063902040108117 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=10,10 >>/tmp/tmp-sort-out-1668063902040108117
(SPURuntime pid=45106) [2022-11-10 15:05:02.080] [info] [utils.cc:88] Finished sort scripts: tail -n +2 /tmp/tmp-sort-in-1668063902040108117 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=10,10 >>/tmp/tmp-sort-out-1668063902040108117, ret=0
(SPURuntime pid=45106) [2022-11-10 15:05:02.080] [info] [bucket_psi.cc:157] End post filtering, in=/tmp/tmphctf58aa, out=/tmp/tmpp8yo12c6
(SPURuntime pid=45106) [2022-11-10 15:05:04.478] [info] [bucket_psi.cc:169] bucket size set to 1048576
(SPURuntime pid=45106) [2022-11-10 15:05:04.478] [info] [bucket_psi.cc:77] Begin sanity check for input file: .data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/psi-input.csv, precheck_switch:true
(SPURuntime pid=45106) [2022-11-10 15:05:04.502] [info] [csv_checker.cc:125] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8 --stable selected-keys.1668063904478962587 | LC_ALL=C uniq -d > duplicate-keys.1668063904478962587
(SPURuntime pid=45105) [2022-11-10 15:05:04.680] [info] [bucket_psi.cc:169] bucket size set to 1048576
(SPURuntime pid=45105) [2022-11-10 15:05:04.680] [info] [bucket_psi.cc:77] Begin sanity check for input file: .data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/psi-input.csv, precheck_switch:true
(SPURuntime pid=45105) [2022-11-10 15:05:04.706] [info] [csv_checker.cc:125] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505 --stable selected-keys.1668063904680819077 | LC_ALL=C uniq -d > duplicate-keys.1668063904680819077
(SPURuntime pid=45105) [2022-11-10 15:05:04.740] [info] [bucket_psi.cc:90] End sanity check for input file: .data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/psi-input.csv, size=40690
(SPURuntime pid=45105) [2022-11-10 15:05:04.741] [info] [bucket_psi.cc:190] Run psi protocol=1, self_items_count=40690
(SPURuntime pid=45105) [2022-11-10 15:05:04.741] [info] [cryptor_selector.cc:50] Using libSodium
(SPURuntime pid=45105) [2022-11-10 15:05:04.741] [info] [cipher_store.cc:25] Disk cache choose num_bins=64
(SPURuntime pid=45106) [2022-11-10 15:05:04.741] [info] [bucket_psi.cc:90] End sanity check for input file: .data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/psi-input.csv, size=40690
(SPURuntime pid=45106) [2022-11-10 15:05:04.741] [info] [bucket_psi.cc:190] Run psi protocol=1, self_items_count=40690
(SPURuntime pid=45106) [2022-11-10 15:05:04.741] [info] [cryptor_selector.cc:50] Using libSodium
(SPURuntime pid=45106) [2022-11-10 15:05:04.741] [info] [cipher_store.cc:25] Disk cache choose num_bins=64
(SPURuntime pid=45105) [2022-11-10 15:05:06.016] [info] [ecdh_psi.cc:70] MaskSelf:root--finished, batch_count=10
(SPURuntime pid=45106) [2022-11-10 15:05:06.202] [info] [ecdh_psi.cc:70] MaskSelf:root--finished, batch_count=10
(SPURuntime pid=45105) [2022-11-10 15:05:06.358] [info] [ecdh_psi.cc:109] MaskPeer:root--finished, batch_count=10
(SPURuntime pid=45105) [2022-11-10 15:05:06.373] [info] [ecdh_psi.cc:134] root recv last batch finished, batch_count=10
(SPURuntime pid=45106) [2022-11-10 15:05:06.362] [info] [ecdh_psi.cc:134] root recv last batch finished, batch_count=10
(SPURuntime pid=45106) [2022-11-10 15:05:06.370] [info] [ecdh_psi.cc:109] MaskPeer:root--finished, batch_count=10
(SPURuntime pid=45105) [2022-11-10 15:05:06.597] [info] [bucket_psi.cc:119] Begin post filtering, indices.size=36593, should_sort=true
(SPURuntime pid=45105) [2022-11-10 15:05:06.642] [info] [utils.cc:86] Executing sort scripts: tail -n +2 .data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/tmp-sort-in-1668063906598517121 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=9,9 >>.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/tmp-sort-out-1668063906598517121
(SPURuntime pid=45106) [2022-11-10 15:05:06.641] [info] [bucket_psi.cc:119] Begin post filtering, indices.size=36593, should_sort=true
(SPURuntime pid=45106) [2022-11-10 15:05:06.655] [info] [utils.cc:86] Executing sort scripts: tail -n +2 .data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/tmp-sort-in-1668063906641900342 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=10,10 >>.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/tmp-sort-out-1668063906641900342
(SPURuntime pid=45105) [2022-11-10 15:05:06.693] [info] [utils.cc:88] Finished sort scripts: tail -n +2 .data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/tmp-sort-in-1668063906598517121 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=9,9 >>.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/tmp-sort-out-1668063906598517121, ret=0
(SPURuntime pid=45105) [2022-11-10 15:05:06.693] [info] [bucket_psi.cc:157] End post filtering, in=.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/psi-input.csv, out=.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/psi-output.csv
(SPURuntime pid=45106) [2022-11-10 15:05:06.714] [info] [utils.cc:88] Finished sort scripts: tail -n +2 .data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/tmp-sort-in-1668063906641900342 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=10,10 >>.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/tmp-sort-out-1668063906641900342, ret=0
(SPURuntime pid=45106) [2022-11-10 15:05:06.714] [info] [bucket_psi.cc:157] End post filtering, in=.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/psi-input.csv, out=.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/psi-output.csv
[7]:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'y'],
dtype='object')
More#
Here we demonstrated two-party PSI on a single key. SecretFlow also supports three-party and multi-key PSI.
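Feature preprocessing#
In general, data used for modeling needs to be preprocessed, and sound preprocessing is critical to training quality.
Before starting, we use stats.table_statistics.table_statistics to get an overview of the features. The full-table statistics module is discussed separately later.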
[8]:
from secretflow.stats.table_statistics import table_statistics
pd.set_option('display.max_rows', None)
data_stats = table_statistics(vdf)
data_stats
[8]:
feature | datatype | total_count | count | count_na | min | max | mean | var | std | sem | ... | moment_2 | moment_3 | moment_4 | central_moment_2 | central_moment_3 | central_moment_4 | sum | sum_2 | sum_3 | sum_4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | int64 | 36593 | 36593 | 0 | 18.0 | 95.0 | 40.947039 | 1.128579e+02 | 10.623461 | 0.055535 | ... | 1.789515e+03 | 8.333137e+04 | 4.121843e+06 | 1.128548e+02 | 8.138900e+02 | 4.203357e+04 | 1498375.0 | 6.548372e+07 | 3.049345e+09 | 1.508306e+11 |
job | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
marital | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
education | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
default | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
balance | int64 | 36593 | 36593 | 0 | -6847.0 | 102127.0 | 1362.827563 | 9.160127e+06 | 3026.570165 | 15.821649 | ... | 1.101718e+07 | 2.688746e+11 | 4.396339e+15 | 9.159877e+06 | 2.288934e+11 | 1.211695e+16 | 49869949.0 | 4.031515e+11 | 9.838929e+15 | -5.145462e+18 |
housing | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
loan | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
contact | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
day | int64 | 36593 | 36593 | 0 | 1.0 | 31.0 | 15.811795 | 6.918886e+01 | 8.317984 | 0.043483 | ... | 3.191998e+02 | 7.287211e+03 | 1.788647e+05 | 6.918697e+01 | 5.214820e+01 | 9.274288e+03 | 578601.0 | 1.168048e+07 | 2.666609e+08 | 6.545197e+09 |
month | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
duration | int64 | 36593 | 36593 | 0 | 0.0 | 4918.0 | 257.190528 | 6.517951e+04 | 255.302773 | 1.334617 | ... | 1.313247e+05 | 1.187896e+08 | 1.717326e+11 | 6.517772e+04 | 5.148791e+07 | 8.852056e+10 | 9411373.0 | 4.805564e+09 | 4.346867e+12 | 6.284213e+15 |
campaign | int64 | 36593 | 36593 | 0 | 1.0 | 63.0 | 2.767005 | 9.729927e+00 | 3.119283 | 0.016306 | ... | 1.738598e+01 | 2.521050e+02 | 6.260525e+03 | 9.729661e+00 | 1.501539e+02 | 4.093040e+03 | 101253.0 | 6.362050e+05 | 9.225279e+06 | 2.290914e+08 |
pdays | int64 | 36593 | 36593 | 0 | -1.0 | 854.0 | 40.517175 | 1.014990e+04 | 100.746691 | 0.526662 | ... | 1.179126e+04 | 3.978166e+06 | 1.564747e+09 | 1.014962e+04 | 2.677950e+06 | 1.028068e+09 | 1482645.0 | 4.314776e+08 | 1.455730e+11 | 5.725880e+13 |
previous | int64 | 36593 | 36593 | 0 | 0.0 | 275.0 | 0.585276 | 5.783946e+00 | 2.404984 | 0.012572 | ... | 6.126336e+00 | 6.332842e+02 | 1.581319e+05 | 5.783788e+00 | 6.229283e+02 | 1.566616e+05 | 21417.0 | 2.241810e+05 | 2.317377e+07 | 5.786521e+09 |
poutcome | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
y | object | 36593 | 36593 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17 rows × 25 columns
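Several of the statistics columns are related by simple formulas, which can be checked against the age row. The sketch below assumes that var and std are the sample (n - 1) versions and that central_moment_2 divides by n. These conventions are inferred from the printed numbers, not taken from the table_statistics documentation.

```python
import math

n = 36593        # count for the age column
std = 10.623461  # std for the age column, as printed in the table

var = std ** 2                        # sample variance
sem = std / math.sqrt(n)              # standard error of the mean
central_moment_2 = var * (n - 1) / n  # variance with divisor n

print(var, sem, central_moment_2)
```

Up to rounding, the three results agree with the var (1.128579e+02), sem (0.055535), and central_moment_2 (1.128548e+02) entries of the age row.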
[9]:
pd.reset_option('display.max_rows')
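Next, we demonstrate the following feature preprocessing capabilities of SecretFlow:
Value replacement
Missing value imputation
WOE binning and substitution
One-hot encoding
Standardization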
Value replacement#
We first replace the values of the following features:
feature | Description | Values and replacement rules |
---|---|---|
education | Education level | 'tertiary' -> 3, 'secondary' -> 2, 'primary' -> 1, 'unknown' -> NaN |
default | Has credit in default | 'no' -> 0, 'yes' -> 1, 'unknown' -> NaN |
housing | Has a housing loan | 'no' -> 0, 'yes' -> 1, 'unknown' -> NaN |
loan | Has a personal loan | 'no' -> 0, 'yes' -> 1, 'unknown' -> NaN |
month | Month of last contact | 'jan' -> 1, 'feb' -> 2, 'mar' -> 3, …, 'nov' -> 11, 'dec' -> 12 |
y | label | 'yes' -> 1, 'no' -> 0 |
After the replacement, we use sf.reveal to inspect the result. Note that in production sf.reveal directly exposes the data, so its use must be strictly restricted and audited.
[10]:
vdf['education'] = vdf['education'].replace(
    {'tertiary': 3, 'secondary': 2, 'primary': 1, 'unknown': np.nan}
)
vdf['default'] = vdf['default'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})
vdf['housing'] = vdf['housing'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})
vdf['loan'] = vdf['loan'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})
vdf['month'] = vdf['month'].replace(
    {
        'jan': 1,
        'feb': 2,
        'mar': 3,
        'apr': 4,
        'may': 5,
        'jun': 6,
        'jul': 7,
        'aug': 8,
        'sep': 9,
        'oct': 10,
        'nov': 11,
        'dec': 12,
    }
)
vdf['y'] = vdf['y'].replace(
    {
        'no': 0,
        'yes': 1,
    }
)
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
age job marital education default balance housing loan
0 43 technician single 2.0 0 593 1 0
1 46 management married 3.0 0 229 1 0
2 42 technician married 2.0 0 8036 0 0
3 38 admin. married 1.0 0 1487 0 0
4 39 blue-collar married 2.0 0 138 0 0
... ... ... ... ... ... ... ... ...
36588 36 self-employed single 3.0 0 4844 0 0
36589 49 housemaid married 1.0 0 3376 0 0
36590 52 entrepreneur married 3.0 0 1115 1 0
36591 40 blue-collar married 1.0 0 48 0 0
36592 46 services married 3.0 0 474 0 0
[36593 rows x 8 columns]
contact day month duration campaign pdays previous poutcome y
0 unknown 5 5 55 1 -1 0 unknown 0
1 unknown 5 5 197 1 -1 0 unknown 0
2 unknown 9 6 948 5 -1 0 unknown 0
3 unknown 9 6 332 2 -1 0 unknown 0
4 unknown 9 6 61 2 -1 0 unknown 0
... ... ... ... ... ... ... ... ... ..
36588 unknown 9 6 1137 3 -1 0 unknown 1
36589 unknown 9 6 119 3 -1 0 unknown 0
36590 unknown 9 6 124 2 -1 0 unknown 0
36591 unknown 9 6 100 5 -1 0 unknown 0
36592 unknown 9 6 445 2 -1 0 unknown 0
[36593 rows x 9 columns]
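Security discussion#
Value replacement is executed by the data owner's PYU device and does not leak data.
Missing value imputation#
Next we fill in the missing values. Here we fill every column with its mode; other options include the mean and the median.
Other possible approaches include dropping the rows with missing values, or training on the complete rows to predict the missing values.
After filling, we use sf.reveal to inspect the result.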
[11]:
vdf["education"] = vdf["education"].fillna(vdf["education"].mode())
vdf["default"] = vdf["default"].fillna(vdf["default"].mode())
vdf["housing"] = vdf["housing"].fillna(vdf["housing"].mode())
vdf["loan"] = vdf["loan"].fillna(vdf["loan"].mode())
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
age job marital education default balance housing loan
0 43 technician single 2.0 0 593 1 0
1 46 management married 3.0 0 229 1 0
2 42 technician married 2.0 0 8036 0 0
3 38 admin. married 1.0 0 1487 0 0
4 39 blue-collar married 2.0 0 138 0 0
... ... ... ... ... ... ... ... ...
36588 36 self-employed single 3.0 0 4844 0 0
36589 49 housemaid married 1.0 0 3376 0 0
36590 52 entrepreneur married 3.0 0 1115 1 0
36591 40 blue-collar married 1.0 0 48 0 0
36592 46 services married 3.0 0 474 0 0
[36593 rows x 8 columns]
contact day month duration campaign pdays previous poutcome y
0 unknown 5 5 55 1 -1 0 unknown 0
1 unknown 5 5 197 1 -1 0 unknown 0
2 unknown 9 6 948 5 -1 0 unknown 0
3 unknown 9 6 332 2 -1 0 unknown 0
4 unknown 9 6 61 2 -1 0 unknown 0
... ... ... ... ... ... ... ... ... ..
36588 unknown 9 6 1137 3 -1 0 unknown 1
36589 unknown 9 6 119 3 -1 0 unknown 0
36590 unknown 9 6 124 2 -1 0 unknown 0
36591 unknown 9 6 100 5 -1 0 unknown 0
36592 unknown 9 6 445 2 -1 0 unknown 0
[36593 rows x 9 columns]
Security discussion#
The fill value is computed by the data owner's PYU device and is then used by that same PYU device in the fill operation, so no data is leaked.
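Because mode imputation needs only the owner's own column, it can be pictured as an ordinary local computation. The sketch below is plain Python rather than the VDataFrame API, and `fill_with_mode` is a hypothetical helper used only for illustration.

```python
import math
from statistics import mode


def is_missing(value):
    """Treat float NaN as a missing value."""
    return isinstance(value, float) and math.isnan(value)


def fill_with_mode(values):
    """Replace missing entries with the most frequent non-missing value."""
    fill_value = mode(v for v in values if not is_missing(v))
    return [fill_value if is_missing(v) else v for v in values]


education = [2.0, 3.0, float("nan"), 2.0, 1.0]
print(fill_with_mode(education))  # [2.0, 3.0, 2.0, 2.0, 1.0]
```

Swapping `mode` for `statistics.mean` or `statistics.median` gives the other strategies mentioned in this section.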
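WOE binning#
WOE binning replaces continuous values with discrete ones.
One benefit of discretizing continuous features is that it mitigates hidden flaws in the data and makes model results more stable. For example, extreme values are a major factor affecting model quality: they can push model parameters too high or too low, or cause the model to be misled by spurious patterns and learn nonexistent relationships as important ones. Discretization effectively reduces the influence of extreme values and outliers.
The 75th percentile of duration is far below its maximum, and its standard deviation is relatively large, so duration needs to be discretized.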
[12]:
from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning
from secretflow.preprocessing.binning.vert_woe_substitution import VertWOESubstitution
binning = VertWoeBinning(spu)
woe_rules = binning.binning(
    vdf,
    binning_method="chimerge",
    bin_num=4,
    bin_names={alice: [], bob: ["duration"]},
    label_name="y",
)
woe_sub = VertWOESubstitution()
vdf = woe_sub.substitution(vdf, woe_rules)
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
age job marital education default balance housing loan
0 43 technician single 2.0 0 593 1 0
1 46 management married 3.0 0 229 1 0
2 42 technician married 2.0 0 8036 0 0
3 38 admin. married 1.0 0 1487 0 0
4 39 blue-collar married 2.0 0 138 0 0
... ... ... ... ... ... ... ... ...
36588 36 self-employed single 3.0 0 4844 0 0
36589 49 housemaid married 1.0 0 3376 0 0
36590 52 entrepreneur married 3.0 0 1115 1 0
36591 40 blue-collar married 1.0 0 48 0 0
36592 46 services married 3.0 0 474 0 0
[36593 rows x 8 columns]
contact day month duration campaign pdays previous poutcome y
0 unknown 5 5 -3.983588 1 -1 0 unknown 0
1 unknown 5 5 -1.232426 1 -1 0 unknown 0
2 unknown 9 6 2.351786 5 -1 0 unknown 0
3 unknown 9 6 0.185882 2 -1 0 unknown 0
4 unknown 9 6 -3.983588 2 -1 0 unknown 0
... ... ... ... ... ... ... ... ... ..
36588 unknown 9 6 2.351786 3 -1 0 unknown 1
36589 unknown 9 6 -1.193344 3 -1 0 unknown 0
36590 unknown 9 6 -1.193344 2 -1 0 unknown 0
36591 unknown 9 6 -1.667386 5 -1 0 unknown 0
36592 unknown 9 6 0.474665 2 -1 0 unknown 0
[36593 rows x 9 columns]
Security discussion#
WOE binning uses data from both alice and bob, so the related computation must run on the SPU device to ensure that raw data is not leaked.
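For reference, the weight of evidence of a bin compares the share of positive samples that fall into the bin with the share of negative samples that do. The sketch below assumes the convention WOE = ln((pos_i / pos_total) / (neg_i / neg_total)), which is consistent with the positive values that long durations receive in the output above. The exact formula and the handling of empty bins in VertWoeBinning are not confirmed here, and `woe_per_bin` is a hypothetical helper.

```python
import math


def woe_per_bin(bins):
    """bins: list of (positives, negatives) counts, one pair per bin.

    Assumes every bin has at least one positive and one negative sample.
    """
    total_pos = sum(p for p, _ in bins)
    total_neg = sum(n for _, n in bins)
    return [math.log((p / total_pos) / (n / total_neg)) for p, n in bins]


bins = [(1, 9), (4, 6)]  # toy counts: (label 1, label 0) per bin
print([round(w, 4) for w in woe_per_bin(bins)])  # [-1.0986, 0.6931]
```

A bin with a larger share of positives than of negatives gets a positive WOE, and every value in that bin is replaced by this single number.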
One-hot encoding#
One-hot encoding converts categorical values into numeric ones. Features such as job and marital need to be one-hot encoded.
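As a minimal illustration of the transformation itself, the plain-Python sketch below turns one categorical column into one 0/1 column per category. The `one_hot` helper and its `prefix_category` naming are assumptions made for this example and are independent of SecretFlow's OneHotEncoder.

```python
def one_hot(values, prefix):
    """Return column names and rows of 0.0/1.0 indicators for each category."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return columns, rows


columns, rows = one_hot(["technician", "admin.", "technician"], "job")
print(columns)  # ['job_admin.', 'job_technician']
print(rows)     # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Each row contains exactly one 1.0, so the encoding carries the same information as the original column without implying any order among categories.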
[13]:
from secretflow.preprocessing.encoder import OneHotEncoder
encoder = OneHotEncoder()
categorical_cols = ["job", "marital", "contact", "month", "day", "poutcome"]
# vdf_hat drops the categorical columns; it is used only for VIF and correlation analysis
vdf_hat = vdf.drop(columns=categorical_cols)
# One-hot encode each categorical column and attach the new columns to vdf
for col in categorical_cols:
    transformed_df = encoder.fit_transform(vdf[col])
    vdf[transformed_df.dtypes.index] = transformed_df
# Remove the original categorical columns now that they are encoded
vdf = vdf.drop(columns=categorical_cols)
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
age education default balance housing loan job_admin. \
0 43 2.0 0 593 1 0 0.0
1 46 3.0 0 229 1 0 0.0
2 42 2.0 0 8036 0 0 0.0
3 38 1.0 0 1487 0 0 1.0
4 39 2.0 0 138 0 0 0.0
... ... ... ... ... ... ... ...
36588 36 3.0 0 4844 0 0 0.0
36589 49 1.0 0 3376 0 0 0.0
36590 52 3.0 0 1115 1 0 0.0
36591 40 1.0 0 48 0 0 0.0
36592 46 3.0 0 474 0 0 0.0
job_blue-collar job_entrepreneur job_housemaid ... job_retired \
0 0.0 0.0 0.0 ... 0.0
1 0.0 0.0 0.0 ... 0.0
2 0.0 0.0 0.0 ... 0.0
3 0.0 0.0 0.0 ... 0.0
4 1.0 0.0 0.0 ... 0.0
... ... ... ... ... ...
36588 0.0 0.0 0.0 ... 0.0
36589 0.0 0.0 1.0 ... 0.0
36590 0.0 1.0 0.0 ... 0.0
36591 1.0 0.0 0.0 ... 0.0
36592 0.0 0.0 0.0 ... 0.0
job_self-employed job_services job_student job_technician \
0 0.0 0.0 0.0 1.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 1.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
... ... ... ... ...
36588 1.0 0.0 0.0 0.0
36589 0.0 0.0 0.0 0.0
36590 0.0 0.0 0.0 0.0
36591 0.0 0.0 0.0 0.0
36592 0.0 1.0 0.0 0.0
job_unemployed job_unknown marital_divorced marital_married \
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 1.0
2 0.0 0.0 0.0 1.0
3 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 1.0
... ... ... ... ...
36588 0.0 0.0 0.0 0.0
36589 0.0 0.0 0.0 1.0
36590 0.0 0.0 0.0 1.0
36591 0.0 0.0 0.0 1.0
36592 0.0 0.0 0.0 1.0
marital_single
0 1.0
1 0.0
2 0.0
3 0.0
4 0.0
... ...
36588 1.0
36589 0.0
36590 0.0
36591 0.0
36592 0.0
[36593 rows x 21 columns]
duration campaign pdays previous y contact_cellular \
0 -3.983588 1 -1 0 0 0.0
1 -1.232426 1 -1 0 0 0.0
2 2.351786 5 -1 0 0 0.0
3 0.185882 2 -1 0 0 0.0
4 -3.983588 2 -1 0 0 0.0
... ... ... ... ... .. ...
36588 2.351786 3 -1 0 1 0.0
36589 -1.193344 3 -1 0 0 0.0
36590 -1.193344 2 -1 0 0 0.0
36591 -1.667386 5 -1 0 0 0.0
36592 0.474665 2 -1 0 0 0.0
contact_telephone contact_unknown month_1.0 month_2.0 ... \
0 0.0 1.0 0.0 0.0 ...
1 0.0 1.0 0.0 0.0 ...
2 0.0 1.0 0.0 0.0 ...
3 0.0 1.0 0.0 0.0 ...
4 0.0 1.0 0.0 0.0 ...
... ... ... ... ... ...
36588 0.0 1.0 0.0 0.0 ...
36589 0.0 1.0 0.0 0.0 ...
36590 0.0 1.0 0.0 0.0 ...
36591 0.0 1.0 0.0 0.0 ...
36592 0.0 1.0 0.0 0.0 ...
day_26.0 day_27.0 day_28.0 day_29.0 day_30.0 day_31.0 \
0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ...
36588 0.0 0.0 0.0 0.0 0.0 0.0
36589 0.0 0.0 0.0 0.0 0.0 0.0
36590 0.0 0.0 0.0 0.0 0.0 0.0
36591 0.0 0.0 0.0 0.0 0.0 0.0
36592 0.0 0.0 0.0 0.0 0.0 0.0
poutcome_failure poutcome_other poutcome_success poutcome_unknown
0 0.0 0.0 0.0 1.0
1 0.0 0.0 0.0 1.0
2 0.0 0.0 0.0 1.0
3 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 1.0
... ... ... ... ...
36588 0.0 0.0 0.0 1.0
36589 0.0 0.0 0.0 1.0
36590 0.0 0.0 0.0 1.0
36591 0.0 0.0 0.0 1.0
36592 0.0 0.0 0.0 1.0
[36593 rows x 55 columns]
Security Discussion#
One-hot encoding is performed on the data owner's PYU device and does not leak any data.
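To make the local computation concrete, here is a minimal plain-pandas sketch of what a single data owner does to one categorical column. The sample values are made up for illustration, and `pd.get_dummies` is only a stand-in for SecretFlow's `OneHotEncoder`, not its implementation.

```python
import pandas as pd

# Hypothetical local data held by a single party (illustrative values only).
local_df = pd.DataFrame({"job": ["technician", "management", "technician", "admin."]})

# One indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(local_df["job"], prefix="job", dtype=float)
print(encoded)
```

Each row contains exactly one 1.0, in the column that matches its original category, so no information from the other party is needed.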
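Standardization#
Large differences in scale between features make it hard for a model to converge, so we usually standardize the numeric values first.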
[14]:
from secretflow.preprocessing import StandardScaler
X = vdf.drop(columns=['y'])
y = vdf['y']
scaler = StandardScaler()
X = scaler.fit_transform(X)
vdf[X.columns] = X
print(sf.reveal(vdf.partitions[alice].data))
print(sf.reveal(vdf.partitions[bob].data))
(_run pid=57413) /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names
(_run pid=57413) warnings.warn(
(_run pid=57412) /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names
(_run pid=57412) warnings.warn(
age education default balance housing loan \
0 0.193250 -0.214015 -0.134056 -0.254360 0.890792 -0.437727
1 0.475648 1.317062 -0.134056 -0.374630 0.890792 -0.437727
2 0.099118 -0.214015 -0.134056 2.204893 -1.122596 -0.437727
3 -0.277412 -1.745093 -0.134056 0.041028 -1.122596 -0.437727
4 -0.183280 -0.214015 -0.134056 -0.404697 -1.122596 -0.437727
... ... ... ... ... ... ...
36588 -0.465677 1.317062 -0.134056 1.150219 -1.122596 -0.437727
36589 0.758046 -1.745093 -0.134056 0.665175 -1.122596 -0.437727
36590 1.040444 1.317062 -0.134056 -0.081885 0.890792 -0.437727
36591 -0.089147 -1.745093 -0.134056 -0.434434 -1.122596 -0.437727
36592 0.475648 1.317062 -0.134056 -0.293679 -1.122596 -0.437727
job_admin. job_blue-collar job_entrepreneur job_housemaid ... \
0 -0.361290 -0.524802 -0.185399 -0.168477 ...
1 -0.361290 -0.524802 -0.185399 -0.168477 ...
2 -0.361290 -0.524802 -0.185399 -0.168477 ...
3 2.767863 -0.524802 -0.185399 -0.168477 ...
4 -0.361290 1.905480 -0.185399 -0.168477 ...
... ... ... ... ... ...
36588 -0.361290 -0.524802 -0.185399 -0.168477 ...
36589 -0.361290 -0.524802 -0.185399 5.935545 ...
36590 -0.361290 -0.524802 5.393786 -0.168477 ...
36591 -0.361290 1.905480 -0.185399 -0.168477 ...
36592 -0.361290 -0.524802 -0.185399 -0.168477 ...
job_retired job_self-employed job_services job_student \
0 -0.229175 -0.188918 -0.317656 -0.145439
1 -0.229175 -0.188918 -0.317656 -0.145439
2 -0.229175 -0.188918 -0.317656 -0.145439
3 -0.229175 -0.188918 -0.317656 -0.145439
4 -0.229175 -0.188918 -0.317656 -0.145439
... ... ... ... ...
36588 -0.229175 5.293301 -0.317656 -0.145439
36589 -0.229175 -0.188918 -0.317656 -0.145439
36590 -0.229175 -0.188918 -0.317656 -0.145439
36591 -0.229175 -0.188918 -0.317656 -0.145439
36592 -0.229175 -0.188918 3.148056 -0.145439
job_technician job_unemployed job_unknown marital_divorced \
0 2.225530 -0.172045 -0.082437 -0.361145
1 -0.449331 -0.172045 -0.082437 -0.361145
2 2.225530 -0.172045 -0.082437 -0.361145
3 -0.449331 -0.172045 -0.082437 -0.361145
4 -0.449331 -0.172045 -0.082437 -0.361145
... ... ... ... ...
36588 -0.449331 -0.172045 -0.082437 -0.361145
36589 -0.449331 -0.172045 -0.082437 -0.361145
36590 -0.449331 -0.172045 -0.082437 -0.361145
36591 -0.449331 -0.172045 -0.082437 -0.361145
36592 -0.449331 -0.172045 -0.082437 -0.361145
marital_married marital_single
0 -1.231619 1.59589
1 0.811939 -0.62661
2 0.811939 -0.62661
3 0.811939 -0.62661
4 0.811939 -0.62661
... ... ...
36588 -1.231619 1.59589
36589 0.811939 -0.62661
36590 0.811939 -0.62661
36591 0.811939 -0.62661
36592 0.811939 -0.62661
[36593 rows x 21 columns]
duration campaign pdays previous y contact_cellular \
0 -1.990727 -0.566486 -0.4121 -0.243363 0 -1.357755
1 -0.300264 -0.566486 -0.4121 -0.243363 0 -1.357755
2 1.902071 0.715878 -0.4121 -0.243363 0 -1.357755
3 0.571222 -0.245895 -0.4121 -0.243363 0 -1.357755
4 -1.990727 -0.245895 -0.4121 -0.243363 0 -1.357755
... ... ... ... ... .. ...
36588 1.902071 0.074696 -0.4121 -0.243363 1 -1.357755
36589 -0.276250 0.074696 -0.4121 -0.243363 0 -1.357755
36590 -0.276250 -0.245895 -0.4121 -0.243363 0 -1.357755
36591 -0.567526 0.715878 -0.4121 -0.243363 0 -1.357755
36592 0.748666 -0.245895 -0.4121 -0.243363 0 -1.357755
contact_telephone contact_unknown month_1.0 month_2.0 ... \
0 -0.260775 1.572308 -0.179805 -0.249288 ...
1 -0.260775 1.572308 -0.179805 -0.249288 ...
2 -0.260775 1.572308 -0.179805 -0.249288 ...
3 -0.260775 1.572308 -0.179805 -0.249288 ...
4 -0.260775 1.572308 -0.179805 -0.249288 ...
... ... ... ... ... ...
36588 -0.260775 1.572308 -0.179805 -0.249288 ...
36589 -0.260775 1.572308 -0.179805 -0.249288 ...
36590 -0.260775 1.572308 -0.179805 -0.249288 ...
36591 -0.260775 1.572308 -0.179805 -0.249288 ...
36592 -0.260775 1.572308 -0.179805 -0.249288 ...
day_26.0 day_27.0 day_28.0 day_29.0 day_30.0 day_31.0 \
0 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
1 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
2 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
3 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
4 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
... ... ... ... ... ... ...
36588 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
36589 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
36590 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
36591 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
36592 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176
poutcome_failure poutcome_other poutcome_success poutcome_unknown
0 -0.348698 -0.206673 -0.18697 0.473513
1 -0.348698 -0.206673 -0.18697 0.473513
2 -0.348698 -0.206673 -0.18697 0.473513
3 -0.348698 -0.206673 -0.18697 0.473513
4 -0.348698 -0.206673 -0.18697 0.473513
... ... ... ... ...
36588 -0.348698 -0.206673 -0.18697 0.473513
36589 -0.348698 -0.206673 -0.18697 0.473513
36590 -0.348698 -0.206673 -0.18697 0.473513
36591 -0.348698 -0.206673 -0.18697 0.473513
36592 -0.348698 -0.206673 -0.18697 0.473513
[36593 rows x 55 columns]
Security Discussion#
Standardization is performed on the data owner's PYU device and does not leak any data.
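As a plaintext reference, standardization maps each value x to (x - mean) / std. The sketch below applies it with NumPy to the first five balance values printed earlier. It assumes the population standard deviation (ddof=0), which is scikit-learn's convention; the helper name `standardize` is illustrative.

```python
import numpy as np

def standardize(column):
    """Return (column - mean) / std, using the population std (ddof=0)."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

# First five balance values from the printed output above.
balance = np.array([593.0, 229.0, 8036.0, 1487.0, 138.0])
z = standardize(balance)

print(z.mean())        # approximately 0
print(z.std())         # approximately 1 (population std)
print(z.var(ddof=1))   # n / (n - 1) = 1.25 for n = 5 (sample variance)
```

Under this assumption, the sample variance of a standardized column is n / (n - 1) rather than exactly 1. That would be consistent with the table statistics in the data analysis section reporting a variance of 1.000027, which is about 36593 / 36592.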
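More#
SecretFlow supports many other feature preprocessing capabilities; please refer to the documentation.
At this point, we have completed all feature preprocessing.
The main purpose of this tutorial is to demonstrate SecretFlow's preprocessing capabilities. The choice of preprocessing methods here may be debatable, and we ask for your understanding.
Data Analysis#
Before modeling, we should analyze the data we are using to decide whether any feature preprocessing needs to be repeated.
Below we demonstrate the following data analysis capabilities of SecretFlow:
Table statistics
Correlation coefficient matrix
VIF calculation
Table Statistics#
We provide a function similar to pd.DataFrame.describe that shows basic statistics for all features.
During feature preprocessing, you can call table statistics repeatedly to monitor the effect of each step.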
[15]:
from secretflow.stats.table_statistics import table_statistics
pd.set_option('display.max_rows', None)
data_stats = table_statistics(vdf)
data_stats
[15]:
feature | datatype | total_count | count | count_na | min | max | mean | var | std | sem | ... | moment_2 | moment_3 | moment_4 | central_moment_2 | central_moment_3 | central_moment_4 | sum | sum_2 | sum_3 | sum_4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
duration | float64 | 36593 | 36593 | 0 | -1.990727 | 1.902071 | -6.601933e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.557851 | 2.899545 | 1.000000 | -0.557851 | 2.899545 | -2.415845e-13 | 36593.0 | -2.041344e+04 | 1.061031e+05 |
campaign | float64 | 36593 | 36593 | 0 | -0.566486 | 19.310148 | -7.611640e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.947549 | 43.236502 | 1.000000 | 4.947549 | 43.236502 | -2.785328e-12 | 36593.0 | 1.810457e+05 | 1.582153e+06 |
pdays | float64 | 36593 | 36593 | 0 | -0.412100 | 8.074647 | 3.728150e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.618954 | 9.979817 | 1.000000 | 2.618954 | 9.979817 | 1.364242e-12 | 36593.0 | 9.583538e+04 | 3.651915e+05 |
previous | float64 | 36593 | 36593 | 0 | -0.243363 | 114.104096 | 4.349509e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 44.783657 | 4683.146470 | 1.000000 | 44.783657 | 4683.146470 | 1.591616e-12 | 36593.0 | 1.638768e+06 | 1.713704e+08 |
y | int64 | 36593 | 36593 | 0 | 0.000000 | 1.000000 | 1.162517e-01 | 0.102740 | 0.320531 | 0.001676 | ... | 0.116252 | 0.116252 | 0.116252 | 0.102737 | 0.078851 | 0.071072 | 4.254000e+03 | 4254.0 | 4.254000e+03 | 4.254000e+03 |
contact_cellular | float64 | 36593 | 36593 | 0 | -1.357755 | 0.736510 | -7.456301e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.621246 | 1.385946 | 1.000000 | -0.621246 | 1.385946 | -2.728484e-12 | 36593.0 | -2.273325e+04 | 5.071593e+04 |
contact_telephone | float64 | 36593 | 36593 | 0 | -0.260775 | 3.834729 | 4.349509e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 3.573955 | 13.773154 | 1.000000 | 3.573955 | 13.773154 | 1.591616e-12 | 36593.0 | 1.307817e+05 | 5.040010e+05 |
contact_unknown | float64 | 36593 | 36593 | 0 | -0.636008 | 1.572308 | 0.000000e+00 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 0.936300 | 1.876657 | 1.000000 | 0.936300 | 1.876657 | 0.000000e+00 | 36593.0 | 3.426201e+04 | 6.867251e+04 |
month_1.0 | float64 | 36593 | 36593 | 0 | -0.179805 | 5.561570 | 9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.381765 | 29.963395 | 1.000000 | 5.381765 | 29.963395 | 3.410605e-13 | 36593.0 | 1.969349e+05 | 1.096450e+06 |
month_2.0 | float64 | 36593 | 36593 | 0 | -0.249288 | 4.011427 | 6.524263e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 3.762139 | 15.153690 | 1.000000 | 3.762139 | 15.153690 | 2.387424e-12 | 36593.0 | 1.376680e+05 | 5.545190e+05 |
month_3.0 | float64 | 36593 | 36593 | 0 | -0.103252 | 9.685067 | 9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 9.581815 | 92.811179 | 1.000000 | 9.581815 | 92.811179 | 3.410605e-13 | 36593.0 | 3.506274e+05 | 3.396239e+06 |
month_4.0 | float64 | 36593 | 36593 | 0 | -0.264224 | 3.784667 | -6.524263e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 3.520443 | 13.393516 | 1.000000 | 3.520443 | 13.393516 | -2.387424e-12 | 36593.0 | 1.288236e+05 | 4.901089e+05 |
month_5.0 | float64 | 36593 | 36593 | 0 | -0.662036 | 1.510493 | -1.864075e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 0.848457 | 1.719880 | 1.000000 | 0.848457 | 1.719880 | -6.821210e-12 | 36593.0 | 3.104760e+04 | 6.293557e+04 |
month_6.0 | float64 | 36593 | 36593 | 0 | -0.364617 | 2.742607 | -3.728150e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.377990 | 6.654836 | 1.000000 | 2.377990 | 6.654836 | -1.364242e-12 | 36593.0 | 8.701779e+04 | 2.435204e+05 |
month_7.0 | float64 | 36593 | 36593 | 0 | -0.424534 | 2.355525 | 0.000000e+00 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.930991 | 4.728726 | 1.000000 | 1.930991 | 4.728726 | 0.000000e+00 | 36593.0 | 7.066075e+04 | 1.730383e+05 |
month_8.0 | float64 | 36593 | 36593 | 0 | -0.400629 | 2.496075 | -1.242717e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.095446 | 5.390893 | 1.000000 | 2.095446 | 5.390893 | -4.547474e-12 | 36593.0 | 7.667865e+04 | 1.972689e+05 |
month_9.0 | float64 | 36593 | 36593 | 0 | -0.113697 | 8.795317 | -4.038830e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 8.681620 | 76.370529 | 1.000000 | 8.681620 | 76.370529 | -1.477929e-12 | 36593.0 | 3.176865e+05 | 2.794627e+06 |
month_10.0 | float64 | 36593 | 36593 | 0 | -0.129440 | 7.725601 | -4.660188e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 7.596161 | 58.701663 | 1.000000 | 7.596161 | 58.701663 | -1.705303e-13 | 36593.0 | 2.779663e+05 | 2.148070e+06 |
month_11.0 | float64 | 36593 | 36593 | 0 | -0.309774 | 3.228163 | 1.366988e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.918389 | 9.516996 | 1.000000 | 2.918389 | 9.516996 | 5.002221e-12 | 36593.0 | 1.067926e+05 | 3.482554e+05 |
month_12.0 | float64 | 36593 | 36593 | 0 | -0.067096 | 14.903961 | 6.213584e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 14.836865 | 221.132551 | 1.000000 | 14.836865 | 221.132551 | 2.273737e-13 | 36593.0 | 5.429254e+05 | 8.091903e+06 |
day_1.0 | float64 | 36593 | 36593 | 0 | -0.085571 | 11.686217 | 9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 11.600646 | 135.574992 | 1.000000 | 11.600646 | 135.574992 | 3.410605e-13 | 36593.0 | 4.245024e+05 | 4.961096e+06 |
day_2.0 | float64 | 36593 | 36593 | 0 | -0.170439 | 5.867198 | -1.242717e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.696758 | 33.453057 | 1.000000 | 5.696758 | 33.453057 | -4.547474e-13 | 36593.0 | 2.084615e+05 | 1.224148e+06 |
day_3.0 | float64 | 36593 | 36593 | 0 | -0.156425 | 6.392841 | 0.000000e+00 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.236416 | 39.892890 | 1.000000 | 6.236416 | 39.892890 | 0.000000e+00 | 36593.0 | 2.282092e+05 | 1.459801e+06 |
day_4.0 | float64 | 36593 | 36593 | 0 | -0.182141 | 5.490262 | -7.456301e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.308122 | 29.176154 | 1.000000 | 5.308122 | 29.176154 | -2.728484e-12 | 36593.0 | 1.942401e+05 | 1.067643e+06 |
day_5.0 | float64 | 36593 | 36593 | 0 | -0.207749 | 4.813497 | -7.456301e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.605747 | 22.212909 | 1.000000 | 4.605747 | 22.212909 | -2.728484e-12 | 36593.0 | 1.685381e+05 | 8.128370e+05 |
day_6.0 | float64 | 36593 | 36593 | 0 | -0.211302 | 4.732553 | 8.077659e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.521251 | 21.441708 | 1.000000 | 4.521251 | 21.441708 | 2.955858e-12 | 36593.0 | 1.654461e+05 | 7.846164e+05 |
day_7.0 | float64 | 36593 | 36593 | 0 | -0.206170 | 4.850375 | 9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.644206 | 22.568645 | 1.000000 | 4.644206 | 22.568645 | 3.410605e-13 | 36593.0 | 1.699454e+05 | 8.258544e+05 |
day_8.0 | float64 | 36593 | 36593 | 0 | -0.207749 | 4.813497 | -6.213584e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.605747 | 22.212909 | 1.000000 | 4.605747 | 22.212909 | -2.273737e-12 | 36593.0 | 1.685381e+05 | 8.128370e+05 |
day_9.0 | float64 | 36593 | 36593 | 0 | -0.190156 | 5.258844 | 6.213584e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.068688 | 26.691602 | 1.000000 | 5.068688 | 26.691602 | 2.273737e-12 | 36593.0 | 1.854785e+05 | 9.767258e+05 |
day_10.0 | float64 | 36593 | 36593 | 0 | -0.106317 | 9.405819 | -1.864075e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 9.299502 | 87.480741 | 1.000000 | 9.299502 | 87.480741 | -6.821210e-13 | 36593.0 | 3.402967e+05 | 3.201183e+06 |
day_11.0 | float64 | 36593 | 36593 | 0 | -0.182221 | 5.487850 | 7.145621e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.305629 | 29.149701 | 1.000000 | 5.305629 | 29.149701 | 2.614797e-12 | 36593.0 | 1.941489e+05 | 1.066675e+06 |
day_12.0 | float64 | 36593 | 36593 | 0 | -0.190079 | 5.260979 | -3.106792e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.070900 | 26.714030 | 1.000000 | 5.070900 | 26.714030 | -1.136868e-12 | 36593.0 | 1.855595e+05 | 9.775465e+05 |
day_13.0 | float64 | 36593 | 36593 | 0 | -0.191310 | 5.227117 | -2.174754e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.035807 | 26.359355 | 1.000000 | 5.035807 | 26.359355 | -7.958079e-13 | 36593.0 | 1.842753e+05 | 9.645679e+05 |
day_14.0 | float64 | 36593 | 36593 | 0 | -0.207032 | 4.830161 | 6.213584e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.623128 | 22.373315 | 1.000000 | 4.623128 | 22.373315 | 2.273737e-12 | 36593.0 | 1.691741e+05 | 8.187067e+05 |
day_15.0 | float64 | 36593 | 36593 | 0 | -0.198263 | 5.043811 | 3.106792e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.845548 | 24.479337 | 1.000000 | 4.845548 | 24.479337 | 1.136868e-12 | 36593.0 | 1.773131e+05 | 8.957724e+05 |
day_16.0 | float64 | 36593 | 36593 | 0 | -0.179886 | 5.559067 | -1.864075e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.379181 | 29.935585 | 1.000000 | 5.379181 | 29.935585 | -6.821210e-13 | 36593.0 | 1.968404e+05 | 1.095433e+06 |
day_17.0 | float64 | 36593 | 36593 | 0 | -0.212148 | 4.713694 | -5.592226e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.501546 | 21.263915 | 1.000000 | 4.501546 | 21.263915 | -2.046363e-12 | 36593.0 | 1.647251e+05 | 7.781105e+05 |
day_18.0 | float64 | 36593 | 36593 | 0 | -0.231608 | 4.317635 | -1.242717e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.086027 | 17.695618 | 1.000000 | 4.086027 | 17.695618 | -4.547474e-13 | 36593.0 | 1.495200e+05 | 6.475357e+05 |
day_19.0 | float64 | 36593 | 36593 | 0 | -0.200854 | 4.978743 | 8.699017e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.777889 | 23.828221 | 1.000000 | 4.777889 | 23.828221 | 3.183231e-12 | 36593.0 | 1.748373e+05 | 8.719461e+05 |
day_20.0 | float64 | 36593 | 36593 | 0 | -0.253222 | 3.949109 | 3.417471e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 3.695888 | 14.659586 | 1.000000 | 3.695888 | 14.659586 | 1.250555e-12 | 36593.0 | 1.352436e+05 | 5.364382e+05 |
day_21.0 | float64 | 36593 | 36593 | 0 | -0.216265 | 4.623964 | -5.902905e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.407699 | 20.427810 | 1.000000 | 4.407699 | 20.427810 | -2.160050e-12 | 36593.0 | 1.612909e+05 | 7.475149e+05 |
day_22.0 | float64 | 36593 | 36593 | 0 | -0.143765 | 6.955808 | 3.417471e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.812043 | 47.403934 | 1.000000 | 6.812043 | 47.403934 | 1.250555e-12 | 36593.0 | 2.492731e+05 | 1.734652e+06 |
day_23.0 | float64 | 36593 | 36593 | 0 | -0.146513 | 6.825333 | -4.349509e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.678820 | 45.606642 | 1.000000 | 6.678820 | 45.606642 | -1.591616e-12 | 36593.0 | 2.443981e+05 | 1.668884e+06 |
day_24.0 | float64 | 36593 | 36593 | 0 | -0.100513 | 9.948913 | 4.815528e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 9.848400 | 97.990977 | 1.000000 | 9.848400 | 97.990977 | 1.762146e-12 | 36593.0 | 3.603825e+05 | 3.585784e+06 |
day_25.0 | float64 | 36593 | 36593 | 0 | -0.138733 | 7.208092 | -9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 7.069359 | 50.975831 | 1.000000 | 7.069359 | 50.975831 | -3.410605e-13 | 36593.0 | 2.586890e+05 | 1.865359e+06 |
day_26.0 | float64 | 36593 | 36593 | 0 | -0.154952 | 6.453618 | 1.242717e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.298666 | 40.673194 | 1.000000 | 6.298666 | 40.673194 | 4.547474e-13 | 36593.0 | 2.304871e+05 | 1.488354e+06 |
day_27.0 | float64 | 36593 | 36593 | 0 | -0.159514 | 6.269024 | -4.349509e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.109509 | 38.326106 | 1.000000 | 6.109509 | 38.326106 | -1.591616e-12 | 36593.0 | 2.235653e+05 | 1.402467e+06 |
day_28.0 | float64 | 36593 | 36593 | 0 | -0.206026 | 4.853768 | -7.456301e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.647742 | 22.601507 | 1.000000 | 4.647742 | 22.601507 | -2.728484e-12 | 36593.0 | 1.700748e+05 | 8.270569e+05 |
day_29.0 | float64 | 36593 | 36593 | 0 | -0.201369 | 4.966014 | -6.213584e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.764645 | 23.701840 | 1.000000 | 4.764645 | 23.701840 | -2.273737e-13 | 36593.0 | 1.743526e+05 | 8.673214e+05 |
day_30.0 | float64 | 36593 | 36593 | 0 | -0.187673 | 5.328411 | 6.213584e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.140738 | 27.427189 | 1.000000 | 5.140738 | 27.427189 | 2.273737e-12 | 36593.0 | 1.881150e+05 | 1.003643e+06 |
day_31.0 | float64 | 36593 | 36593 | 0 | -0.118176 | 8.461983 | 2.485434e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 8.343808 | 70.619124 | 1.000000 | 8.343808 | 70.619124 | 9.094947e-13 | 36593.0 | 3.053249e+05 | 2.584166e+06 |
poutcome_failure | float64 | 36593 | 36593 | 0 | -0.348698 | 2.867813 | -6.834942e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.519115 | 7.345941 | 1.000000 | 2.519115 | 7.345941 | -2.501110e-12 | 36593.0 | 9.218198e+04 | 2.688100e+05 |
poutcome_other | float64 | 36593 | 36593 | 0 | -0.206673 | 4.838554 | -5.281546e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.631881 | 22.454322 | 1.000000 | 4.631881 | 22.454322 | -1.932676e-12 | 36593.0 | 1.694944e+05 | 8.216710e+05 |
poutcome_success | float64 | 36593 | 36593 | 0 | -0.186970 | 5.348457 | 1.242717e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.161487 | 27.640945 | 1.000000 | 5.161487 | 27.640945 | 4.547474e-13 | 36593.0 | 1.888743e+05 | 1.011465e+06 |
poutcome_unknown | float64 | 36593 | 36593 | 0 | -2.111874 | 0.473513 | -1.304853e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -1.638361 | 3.684227 | 1.000000 | -1.638361 | 3.684227 | -4.774847e-12 | 36593.0 | -5.995254e+04 | 1.348169e+05 |
age | float64 | 36593 | 36593 | 0 | -2.160064 | 5.088144 | -3.106792e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 0.678868 | 3.300319 | 1.000000 | 0.678868 | 3.300319 | -1.136868e-12 | 36593.0 | 2.484182e+04 | 1.207686e+05 |
education | float64 | 36593 | 36593 | 0 | -1.745093 | 1.317062 | -2.609705e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.150154 | 2.303197 | 1.000000 | -0.150154 | 2.303197 | -9.549694e-12 | 36593.0 | -5.494580e+03 | 8.428090e+04 |
default | float64 | 36593 | 36593 | 0 | -0.134056 | 7.459592 | -8.543678e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 7.325536 | 54.663482 | 1.000000 | 7.325536 | 54.663482 | -3.126388e-13 | 36593.0 | 2.680633e+05 | 2.000301e+06 |
balance | float64 | 36593 | 36593 | 0 | -2.712622 | 33.293644 | -1.553396e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 8.256555 | 144.415577 | 1.000000 | 8.256555 | 144.415577 | -5.684342e-13 | 36593.0 | 3.021321e+05 | 5.284599e+06 |
housing | float64 | 36593 | 36593 | 0 | -1.122596 | 0.890792 | -1.864075e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.231804 | 1.053733 | 1.000000 | -0.231804 | 1.053733 | -6.821210e-13 | 36593.0 | -8.482406e+03 | 3.855926e+04 |
loan | float64 | 36593 | 36593 | 0 | -0.437727 | 2.284528 | 6.058244e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.846801 | 4.410674 | 1.000000 | 1.846801 | 4.410674 | 2.216893e-12 | 36593.0 | 6.757999e+04 | 1.613998e+05 |
job_admin. | float64 | 36593 | 36593 | 0 | -0.361290 | 2.767863 | 2.019415e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.406573 | 6.791595 | 1.000000 | 2.406573 | 6.791595 | 7.389644e-13 | 36593.0 | 8.806374e+04 | 2.485248e+05 |
job_blue-collar | float64 | 36593 | 36593 | 0 | -0.524802 | 1.905480 | -8.699017e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.380677 | 2.906270 | 1.000000 | 1.380677 | 2.906270 | -3.183231e-12 | 36593.0 | 5.052313e+04 | 1.063492e+05 |
job_entrepreneur | float64 | 36593 | 36593 | 0 | -0.185399 | 5.393786 | 1.398056e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.208387 | 28.127300 | 1.000000 | 5.208387 | 28.127300 | 5.115908e-13 | 36593.0 | 1.905905e+05 | 1.029262e+06 |
job_housemaid | float64 | 36593 | 36593 | 0 | -0.168477 | 5.935545 | -3.728150e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.767068 | 34.259077 | 1.000000 | 5.767068 | 34.259077 | -1.364242e-12 | 36593.0 | 2.110343e+05 | 1.253642e+06 |
job_management | float64 | 36593 | 36593 | 0 | -0.511776 | 1.953980 | 6.213584e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.442204 | 3.079953 | 1.000000 | 1.442204 | 3.079953 | 2.273737e-13 | 36593.0 | 5.277458e+04 | 1.127047e+05 |
job_retired | float64 | 36593 | 36593 | 0 | -0.229175 | 4.363482 | 5.902905e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.134308 | 18.092499 | 1.000000 | 4.134308 | 18.092499 | 2.160050e-12 | 36593.0 | 1.512867e+05 | 6.620588e+05 |
job_self-employed | float64 | 36593 | 36593 | 0 | -0.188918 | 5.293301 | 2.019415e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.104383 | 27.054723 | 1.000000 | 5.104383 | 27.054723 | 7.389644e-13 | 36593.0 | 1.867847e+05 | 9.900135e+05 |
job_services | float64 | 36593 | 36593 | 0 | -0.317656 | 3.148056 | 2.796113e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.830400 | 9.011162 | 1.000000 | 2.830400 | 9.011162 | 1.023182e-12 | 36593.0 | 1.035728e+05 | 3.297455e+05 |
job_student | float64 | 36593 | 36593 | 0 | -0.145439 | 6.875735 | 1.553396e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.730296 | 46.296878 | 1.000000 | 6.730296 | 46.296878 | 5.684342e-14 | 36593.0 | 2.462817e+05 | 1.694142e+06 |
job_technician | float64 | 36593 | 36593 | 0 | -0.449331 | 2.225530 | 1.553396e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.776199 | 4.154884 | 1.000000 | 1.776199 | 4.154884 | 5.684342e-13 | 36593.0 | 6.499646e+04 | 1.520397e+05 |
job_unemployed | float64 | 36593 | 36593 | 0 | -0.172045 | 5.812420 | 3.495141e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.640374 | 32.813820 | 1.000000 | 5.640374 | 32.813820 | 1.278977e-13 | 36593.0 | 2.063982e+05 | 1.200756e+06 |
job_unknown | float64 | 36593 | 36593 | 0 | -0.082437 | 12.130532 | 4.504848e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 12.048095 | 146.156593 | 1.000000 | 12.048095 | 146.156593 | 1.648459e-12 | 36593.0 | 4.408759e+05 | 5.348308e+06 |
marital_divorced | float64 | 36593 | 36593 | 0 | -0.361145 | 2.768974 | -7.844650e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.407830 | 6.797645 | 1.000000 | 2.407830 | 6.797645 | -2.870593e-12 | 36593.0 | 8.810972e+04 | 2.487462e+05 |
marital_married | float64 | 36593 | 36593 | 0 | -1.231619 | 0.811939 | 1.198057e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.419680 | 1.176131 | 1.000000 | -0.419680 | 1.176131 | 4.384049e-12 | 36593.0 | -1.535734e+04 | 4.303817e+04 |
marital_single | float64 | 36593 | 36593 | 0 | -0.626610 | 1.595890 | -2.213589e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 0.969280 | 1.939504 | 1.000000 | 0.969280 | 1.939504 | -8.100187e-13 | 36593.0 | 3.546887e+04 | 7.097227e+04 |
76 rows × 25 columns
[16]:
pd.reset_option('display.max_rows')
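Security Discussion#
Note that table statistics expose aggregate statistics of the data. Under the hood they imply an sf.reveal, so use them with care.
Correlation Coefficient Matrix#
Next, we compute the correlation coefficient matrix between features, and between features and the label.
The one-hot encoded columns do not need to take part in this computation.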
[17]:
from secretflow.stats.ss_pearsonr_v import PearsonR
pearson_r_calculator = PearsonR(spu)
corr_matrix = pearson_r_calculator.pearsonr(vdf_hat)
import numpy as np
np.set_printoptions(formatter={'float': lambda x: "{0:0.3f}".format(x)})
corr_matrix
(_run pid=56064) /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names
(_run pid=56064) warnings.warn(
(_run pid=57413) /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names
(_run pid=57413) warnings.warn(
(_run pid=56064) [2022-11-10 15:05:24.416] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(_run pid=57413) [2022-11-10 15:05:24.835] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(_run pid=57413) WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
(_spu_compile pid=44652) WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[17]:
array([[1.000, -0.185, 0.013, 0.010, 0.331, -0.007, -0.001, -0.005,
0.016, 0.003, -0.008],
[-0.185, 1.000, -0.091, -0.032, -0.074, 0.004, 0.002, 0.013,
-0.012, -0.023, 0.013],
[0.013, -0.091, 1.000, 0.439, 0.104, -0.021, 0.005, -0.031, 0.003,
0.123, -0.022],
[0.010, -0.032, 0.439, 1.000, 0.091, 0.003, 0.022, -0.019, 0.015,
0.037, -0.009],
[0.331, -0.074, 0.104, 0.091, 1.000, 0.022, 0.069, -0.023, 0.053,
-0.135, -0.068],
[-0.007, 0.004, -0.021, 0.003, 0.022, 1.000, -0.169, -0.016,
0.096, -0.183, -0.014],
[-0.001, 0.002, 0.005, 0.022, 0.069, -0.169, 1.000, -0.012, 0.068,
-0.070, -0.030],
[-0.005, 0.013, -0.031, -0.019, -0.023, -0.016, -0.012, 1.000,
-0.067, -0.010, 0.079],
[0.016, -0.012, 0.003, 0.015, 0.053, 0.096, 0.068, -0.067, 1.000,
-0.072, -0.084],
[0.003, -0.023, 0.123, 0.037, -0.135, -0.183, -0.070, -0.010,
-0.072, 1.000, 0.043],
[-0.008, 0.013, -0.022, -0.009, -0.068, -0.014, -0.030, 0.079,
-0.084, 0.043, 1.000]], dtype=float32)
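Security Discussion#
Computing the correlation coefficient matrix uses data from both alice and bob, so the computation must run on the SPU device to ensure that the raw data is not leaked.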
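For reference, the statistic in question is the Pearson correlation, r_jk = cov(x_j, x_k) / (σ_j σ_k). The NumPy sketch below computes it in plaintext on synthetic data. The helper `pearson_matrix` and the data are illustrative assumptions; the SPU-based PearsonR produces this statistic without revealing the raw columns, and its internals are not shown here.

```python
import numpy as np

def pearson_matrix(x):
    """Pearson correlation matrix of the columns of x (one sample per row)."""
    x = np.asarray(x, dtype=float)
    # Standardize every column, then average the pairwise products.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    return z.T @ z / x.shape[0]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
# Column 1 is an exact linear function of column 0; column 2 is independent noise.
x = np.column_stack([a, 2.0 * a + 1.0, rng.normal(size=200)])
corr = pearson_matrix(x)
print(np.round(corr, 3))
```

The matrix is symmetric with ones on the diagonal, which matches the shape of the result printed by the tutorial cell.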
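VIF Calculation#
SecretFlow also supports computing the VIF (variance inflation factor) to test for multicollinearity.
The one-hot encoded columns do not need to take part in the VIF computation.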
[18]:
from secretflow.stats.ss_vif_v import VIF
vif_calculator = VIF(spu)
vif_results = vif_calculator.vif(vdf_hat)
print(vdf_hat.columns)
print(vif_results)
Index(['duration', 'campaign', 'pdays', 'previous', 'y', 'age', 'education',
'default', 'balance', 'housing', 'loan'],
dtype='object')
[1.162 1.044 1.276 1.243 1.177 1.082 1.052 1.011 1.030 1.089 1.018]
Security discussion#
Computing the VIF requires data from both alice and bob, so the computation must run on the SPU device to ensure that no raw data is leaked.
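To make the VIF numbers above easier to read, here is a minimal plaintext sketch. It relies on the standard identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix, which for two features reduces to 1 / (1 - r^2). The helper names are hypothetical, and this is not how the SPU-based VIF class computes its result; it only illustrates the quantity being estimated.

```python
def invert(matrix):
    """Invert a small, non-singular square matrix with Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I].
    aug = [
        [float(v) for v in row] + [float(i == j) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_from_correlation(corr):
    """VIF of each feature: the diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(corr))]


# Two features with correlation 0.5 (illustrative values, not from the dataset).
corr = [[1.0, 0.5], [0.5, 1.0]]
vif = vif_from_correlation(corr)
print(vif)
```

Values close to 1, like those printed by the tutorial, indicate little multicollinearity; larger values mean a feature is well explained by the others.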
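Model training#
Next, we will train a logistic regression model and an XGB model.
Random split#
Before training, we need to split the data into a training set and a test set (used for validation).
train_x and train_y are the features and labels of the training set; test_x and test_y are the features and labels of the test set.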
[19]:
from secretflow.data.split import train_test_split
random_state = 1234
train_vdf, test_vdf = train_test_split(vdf, train_size=0.8, random_state=random_state)
train_x = train_vdf.drop(columns=['y'])
train_y = train_vdf['y']
test_x = test_vdf.drop(columns=['y'])
test_y = test_vdf['y']
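Security discussion#
For the random split, the parties share a random seed. Each data owner splits its own data locally, and the shared seed ensures that the resulting splits remain aligned across parties.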
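The idea of a seed-aligned split can be sketched in plain Python. This is an illustration under the assumption that both parties hold the same number of aligned rows; split_indices is a hypothetical helper, not a SecretFlow function.

```python
import random


def split_indices(n_rows, train_size, seed):
    """Return (train_idx, test_idx) derived only from the row count and the seed."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_size)
    return sorted(indices[:cut]), sorted(indices[cut:])


# Each party runs the same function locally; only the seed is shared.
n_rows = 10
alice_train, alice_test = split_indices(n_rows, 0.8, seed=1234)
bob_train, bob_test = split_indices(n_rows, 0.8, seed=1234)

# Both parties select the same row positions, so the vertical partitions stay aligned.
print(alice_train == bob_train, alice_test == bob_test)
```

No row data crosses the party boundary in this scheme: the split is fully determined by the row count and the seed.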
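PSI (population stability index)#
The population stability index is an important measure of how far a sample distribution has shifted. It is commonly used to assess sample stability, for example whether a sample is stable between two months. As a rule of thumb, a PSI below 0.1 indicates no significant change, a PSI between 0.1 and 0.25 indicates a fairly significant change, and a PSI above 0.25 indicates a drastic change that needs special attention.
Next, we take balance as an example and check whether the two samples have similar distributions.
Depending on business needs, PSI analysis can also be done during data analysis or feature preprocessing.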
[20]:
stats_df = table_statistics(train_x['balance'])
[21]:
min_val, max_val = stats_df['min'], stats_df['max']
[22]:
from secretflow.stats import psi_eval
from secretflow.stats.core.utils import equal_range
import jax.numpy as jnp
split_points = equal_range(jnp.array([min_val, max_val]), 3)
balance_psi_score = psi_eval(train_x['balance'], test_x['balance'], split_points)
sf.reveal(balance_psi_score)
[22]:
DeviceArray(0.000, dtype=float32)
Security discussion#
PSI analysis is a single-party computation, executed by the data owner's PYU device.
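For reference, the population stability index discussed in this section is usually defined as the sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the fractions of the baseline and comparison samples falling into bin i. The following plaintext sketch implements that textbook definition. The function names, the eps smoothing term, and the treatment of edges are assumptions made for illustration and may differ from psi_eval.

```python
import math


def bin_fractions(values, edges):
    """Fraction of values in each bin; edges are ascending, the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            upper_ok = v < edges[i + 1] or (is_last and v == edges[-1])
            if edges[i] <= v and upper_ok:
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between a baseline and a comparison sample."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    # eps avoids log(0) for empty bins (an illustrative smoothing choice).
    return sum(
        (ai - ei) * math.log((ai + eps) / (ei + eps)) for ei, ai in zip(e, a)
    )


edges = [1, 4, 7, 10]
base = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shifted = [7, 8, 9, 10, 7, 8, 9, 10, 1, 4]

stable_psi = psi(base, base, edges)
shifted_psi = psi(base, shifted, edges)
print(stable_psi, shifted_psi)
```

Identical samples give a PSI of 0, in line with the value close to 0 reported for the random train/test split above, while the deliberately shifted sample exceeds the 0.25 threshold.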
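Logistic regression model#
Use ml.linear.ss_sgd.SSRegression to train a logistic regression model on data kept in secret-shared form.
Please refer to the API documentation for details.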
[23]:
from secretflow.ml.linear.ss_sgd import SSRegression
lr_model = SSRegression(spu)
lr_model.fit(
x=train_x,
y=train_y,
epochs=3,
learning_rate=0.1,
batch_size=1024,
sig_type='t1',
reg_type='logistic',
penalty='l2',
l2_norm=0.5,
)
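You may wonder why the statement above finished so quickly. The reason is that statements in SecretFlow are lazily evaluated: in the example above, lr_model.fit is not actually executed until lr_model is used.
Security discussion#
SSRegression training runs on the SPU device, so both parties' raw data is protected.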
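To show what the hyperparameters passed to fit control, here is a plaintext sketch of mini-batch SGD for logistic regression with an L2 penalty. It is an assumption-laden illustration: SSRegression runs under secret sharing and approximates the sigmoid (the sig_type argument), and the exact way l2_norm enters its update is not confirmed by this tutorial. The function name and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_sgd(xs, ys, epochs, learning_rate, batch_size, l2_norm):
    """Mini-batch SGD for logistic regression with an L2 penalty on the weights."""
    n_features = len(xs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for x, y in zip(batch_x, batch_y):
                z = sum(w * v for w, v in zip(weights, x)) + bias
                error = sigmoid(z) - y
                for j, v in enumerate(x):
                    grad_w[j] += error * v
                grad_b += error
            m = len(batch_x)
            # Averaged gradient plus the L2 term (assumed form: l2_norm * w).
            weights = [
                w - learning_rate * (g / m + l2_norm * w)
                for w, g in zip(weights, grad_w)
            ]
            bias -= learning_rate * grad_b / m
    return weights, bias


# Toy data: one feature, label 1 when the feature is positive.
rng = random.Random(0)
xs = [[rng.uniform(-1, 1)] for _ in range(200)]
ys = [1.0 if x[0] > 0 else 0.0 for x in xs]
weights, bias = fit_logistic_sgd(
    xs, ys, epochs=50, learning_rate=0.5, batch_size=32, l2_norm=0.01
)
print(weights, bias)
```

A larger l2_norm shrinks the weights more strongly, while batch_size and epochs together determine how many updates are performed.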
XGBoost model#
Use ml.boost.ss_xgb_v.Xgb to train an XGBoost model on data kept in secret-shared form.
Please refer to the API documentation for details.
[24]:
from secretflow.ml.boost.ss_xgb_v import Xgb
xgb = Xgb(spu)
params = {
'num_boost_round': 3,
'max_depth': 5,
'sketch_eps': 0.25,
'objective': 'logistic',
'reg_lambda': 0.2,
'subsample': 1,
'colsample_bytree': 1,
'base_score': 0.5,
}
xgb_model = xgb.train(params=params, dtrain=train_x, label=train_y)
Xgb.train executes immediately, so please wait patiently.
Security discussion#
Xgb training runs on the SPU device, so both parties' raw data is protected.
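As a rough guide to how reg_lambda and base_score in the params above interact, the sketch below evaluates the standard XGBoost leaf-weight formula w = -G / (H + lambda) for the logistic objective in the first boosting round. It is a plaintext illustration of the textbook formula; that the secret-shared Xgb implementation follows it exactly is an assumption, and leaf_weight is a hypothetical helper.

```python
def leaf_weight(labels, base_score, reg_lambda):
    """Optimal leaf weight -G / (H + lambda) for the logistic objective."""
    # In the first round every prediction equals base_score, so for each sample
    # the gradient is p - y and the hessian is p * (1 - p).
    grad_sum = sum(base_score - y for y in labels)
    hess_sum = len(labels) * base_score * (1.0 - base_score)
    return -grad_sum / (hess_sum + reg_lambda)


# A leaf holding three positive samples and one negative sample.
weight = leaf_weight([1, 1, 1, 0], base_score=0.5, reg_lambda=0.2)
print(weight)
```

Increasing reg_lambda pulls the leaf weight toward zero, which makes each tree's contribution more conservative.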
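Model prediction#
Next, we use the models we just trained to predict on the test set.
Logistic regression model#
In our scenario, bob holds the labels of the dataset, so we reveal the prediction results to bob.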
[25]:
lr_y_hat = lr_model.predict(x=test_x, batch_size=1024, to_pyu=bob)
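Security discussion#
Logistic regression prediction runs on the SPU device, so both parties' raw data is protected.
When to_pyu is set, the prediction results are revealed to that party; otherwise they remain secret-shared.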
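The difference between a secret-shared result and a revealed one can be illustrated with generic two-party additive secret sharing. This is a minimal sketch: the modulus, the fixed-point scale, and the function names are illustrative assumptions, and the tutorial does not state which protocol the SPU actually runs.

```python
import secrets

MODULUS = 2 ** 64
SCALE = 2 ** 16  # fixed-point scaling factor (an illustrative choice)


def share(value):
    """Split a real number into two additive shares modulo 2**64."""
    encoded = int(round(value * SCALE)) % MODULUS
    share_a = secrets.randbelow(MODULUS)
    share_b = (encoded - share_a) % MODULUS
    return share_a, share_b


def reveal(share_a, share_b):
    """Recombine both shares; only a party holding both learns the value."""
    encoded = (share_a + share_b) % MODULUS
    if encoded >= MODULUS // 2:  # map back to a signed value
        encoded -= MODULUS
    return encoded / SCALE


# Each party holds one share; a single share is a uniformly random number.
alice_share, bob_share = share(0.8125)
prediction = reveal(alice_share, bob_share)
print(prediction)
```

In this picture, setting to_pyu corresponds to sending the missing share to the chosen party so that it alone can reconstruct the prediction.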
XGBoost model#
In our scenario, bob holds the labels of the dataset, so we reveal the prediction results to bob.
[26]:
xgb_y_hat = xgb_model.predict(dtrain=test_x, to_pyu=bob)
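Security discussion#
XGBoost prediction runs on the SPU device, so both parties' raw data is protected.
When to_pyu is set, the prediction results are revealed to that party; otherwise they remain secret-shared.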
Model evaluation#
Next, we evaluate the models on the test set, covering:
Binary classification evaluation
PVA
P-Value
Scorecard transformation
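Binary classification evaluation#
SecretFlow has built-in support for evaluating binary classifiers.
BiClassificationEval computes statistics such as AUC, KS, F1 Score, Lift, Gain, Precision, and Recall, and provides equal-frequency and equal-width binning reports (based on the prediction score) as well as a summary report.
The threshold used to evaluate the model's predictions differs from bucket to bucket. In the summary report, the threshold-dependent statistics take the best value across the buckets.
See the API documentation for details.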
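For readers unfamiliar with these statistics, the following plaintext sketch computes three of them from their textbook definitions: AUC as the probability that a random positive outranks a random negative, KS as the maximum gap between the true-positive and false-positive rates, and F1 at a fixed threshold. The function names and the toy data are hypothetical, and the bucketing logic of BiClassificationEval is not reproduced here.

```python
def auc_and_ks(y_true, y_score):
    """AUC via pairwise ranking and KS as max |TPR - FPR| over all thresholds."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # A positive/negative pair counts 1 if ranked correctly and 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    ks = 0.0
    for threshold in sorted(set(y_score)):
        tpr = sum(s >= threshold for s in pos) / len(pos)
        fpr = sum(s >= threshold for s in neg) / len(neg)
        ks = max(ks, abs(tpr - fpr))
    return auc, ks


def f1_score(y_true, y_score, threshold):
    """F1 = 2TP / (2TP + FP + FN) for predictions score >= threshold."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for t, s in pairs if s >= threshold and t == 1)
    fp = sum(1 for t, s in pairs if s >= threshold and t == 0)
    fn = sum(1 for t, s in pairs if s < threshold and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc, ks = auc_and_ks(y_true, y_score)
f1 = f1_score(y_true, y_score, threshold=0.5)
print(auc, ks, f1)
```

Because F1 depends on the chosen threshold while AUC and KS do not, a summary report has to pick a threshold for F1, which is why the text above notes that threshold-dependent statistics take the best value across buckets.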
[27]:
from secretflow.stats.biclassification_eval import BiClassificationEval
biclassification_evaluator = BiClassificationEval(
y_true=test_y, y_score=lr_y_hat, bucket_size=20
)
lr_report = sf.reveal(biclassification_evaluator.get_all_reports())
[28]:
print(f'positive_samples: {lr_report.summary_report.positive_samples}')
print(f'negative_samples: {lr_report.summary_report.negative_samples}')
print(f'total_samples: {lr_report.summary_report.total_samples}')
print(f'auc: {lr_report.summary_report.auc}')
print(f'ks: {lr_report.summary_report.ks}')
print(f'f1_score: {lr_report.summary_report.f1_score}')
positive_samples: 860.0
negative_samples: 6459.0
total_samples: 7319.0
auc: 0.8958946466445923
ks: 0.6431037187576294
f1_score: 0.5415411591529846
[29]:
biclassification_evaluator = BiClassificationEval(
y_true=test_y, y_score=xgb_y_hat, bucket_size=20
)
xgb_report = sf.reveal(biclassification_evaluator.get_all_reports())
[30]:
print(f'positive_samples: {xgb_report.summary_report.positive_samples}')
print(f'negative_samples: {xgb_report.summary_report.negative_samples}')
print(f'total_samples: {xgb_report.summary_report.total_samples}')
print(f'auc: {xgb_report.summary_report.auc}')
print(f'ks: {xgb_report.summary_report.ks}')
print(f'f1_score: {xgb_report.summary_report.f1_score}')
positive_samples: 860.0
negative_samples: 6459.0
total_samples: 7319.0
auc: 0.8133864402770996
ks: 0.5009485483169556
f1_score: 0.4135618805885315
PVA (prediction vs. actual)#
The result is computed as abs(mean(Actual) - mean(Prediction)); the smaller the value, the better.
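The formula is simple enough to check by hand. The sketch below applies it to a tiny made-up sample; pva is a hypothetical helper that follows the formula as stated in this section, not the pva_eval implementation, whose third argument is not modelled here.

```python
def pva(actual, prediction):
    """abs(mean(actual) - mean(prediction)), as stated in the text."""
    mean_actual = sum(actual) / len(actual)
    mean_prediction = sum(prediction) / len(prediction)
    return abs(mean_actual - mean_prediction)


actual = [0, 0, 1, 0]                # observed labels
prediction = [0.1, 0.2, 0.7, 0.2]    # predicted probabilities
pva_score = pva(actual, prediction)
print(round(pva_score, 3))
```

A small score means the model's average predicted probability is close to the observed positive rate, which is a calibration check rather than a ranking check.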
[31]:
from secretflow.stats import pva_eval
lr_pva_score = pva_eval(test_y, lr_y_hat, 1)
sf.reveal(lr_pva_score)
[31]:
DeviceArray(0.051, dtype=float32)
[32]:
xgb_pva_score = pva_eval(test_y, xgb_y_hat, 1)
sf.reveal(xgb_pva_score)
[32]:
DeviceArray(0.065, dtype=float32)
P-Value#
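Both parties can use the p-values to judge whether a parameter is significant, that is, whether the independent variable effectively explains the variation of the dependent variable, and therefore whether the corresponding explanatory variable should be included in the model.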
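As background, a common way to obtain such p-values for a regression coefficient is the two-sided Wald test: divide the coefficient by its standard error to get a z statistic and look up the tail probability of the standard normal distribution. The sketch below shows this calculation in plaintext. Whether SSPValue uses exactly this test is an assumption, and the coefficient and standard-error values are made up.

```python
import math


def wald_p_value(coefficient, standard_error):
    """Two-sided p-value of z = coefficient / standard_error under N(0, 1)."""
    z = coefficient / standard_error
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


p_strong = wald_p_value(1.2, 0.2)   # z = 6: clearly significant
p_weak = wald_p_value(0.05, 0.2)    # z = 0.25: not significant
print(p_strong, p_weak)
```

Under the usual 0.05 cutoff, a feature like the first one would be kept, while one like the second would be a candidate for removal.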
[33]:
from secretflow.stats import SSPValue
model = lr_model.save_model()
sspv = SSPValue(spu)
pvalues = sspv.pvalues(test_x, test_y, model)
pvalues
[33]:
array([0.000, 0.682, 0.855, 0.755, 0.981, 0.989, 0.975, 0.964, 0.986,
0.773, 0.963, 0.981, 0.977, 0.973, 0.975, 0.859, 0.833, 0.982,
0.827, 0.946, 0.993, 0.994, 0.994, 0.987, 0.984, 0.979, 0.990,
0.986, 0.947, 0.997, 0.978, 0.970, 0.997, 0.976, 0.994, 0.961,
0.994, 0.967, 0.979, 0.997, 0.986, 0.975, 0.997, 0.987, 0.987,
0.969, 0.997, 0.975, 0.962, 0.990, 0.976, 0.992, 0.819, 0.977,
0.645, 0.416, 0.805, 0.452, 0.024, 0.201, 0.998, 0.992, 0.986,
0.974, 0.999, 0.965, 0.998, 0.988, 0.919, 0.998, 0.988, 0.987,
0.998, 0.990, 0.987, 0.000])
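Scorecard transformation#
Strictly speaking, scorecard transformation is post-processing of the prediction results rather than model evaluation.
Let p be the probability that y = 1 and odds = p / (1 - p). The scorecard's score scale can be defined by expressing the score as a linear function of the log odds:
Score = A - B log(odds)
where A and B are configurable constants. SecretFlow provides a scorecard transformation; see the API documentation for details.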
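One common way to choose A and B is the points-to-double-the-odds (PDO) convention: fix a base score at some base odds and require that doubling the odds lowers the score by pdo points. The sketch below follows that convention with the formula Score = A - B log(odds) from the text. The parameter names base_odds, base_score, and pdo are hypothetical, and the sketch does not claim to reproduce the output or the argument order of ScoreCard.

```python
import math


def scorecard_constants(base_odds, base_score, pdo):
    """Solve for A and B so that score(base_odds) = base_score and
    score(2 * base_odds) = base_score - pdo."""
    b = pdo / math.log(2)
    a = base_score + b * math.log(base_odds)
    return a, b


def to_score(p, a, b):
    """Score = A - B * log(odds), with odds = p / (1 - p)."""
    odds = p / (1 - p)
    return a - b * math.log(odds)


a, b = scorecard_constants(base_odds=1.0, base_score=600, pdo=20)
even_score = to_score(0.5, a, b)        # odds = 1, so the score is base_score
doubled_score = to_score(2 / 3, a, b)   # odds = 2, so the score drops by pdo
print(even_score, doubled_score)
```

With this scaling, higher predicted probabilities of y = 1 map to lower scores, and every doubling of the odds costs the same number of points.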
[34]:
from secretflow.stats import BiClassificationEval, ScoreCard
sc = ScoreCard(20, 600, 20)
score = sc.transform(xgb_y_hat)
sf.reveal(score.partitions[bob])
[34]:
array([[496.647],
[453.126],
[496.647],
...,
[451.232],
[494.179],
[466.241]])
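Security discussion#
Apart from the P-Value computation, which runs on the SPU device as its code shows, the model evaluation methods above are single-party computations, executed by the PYU device of the label owner.
End of the experiment#
Finally, we clean up the temporary files and shut down the SecretFlow cluster.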
[35]:
import os
try:
os.remove(alice_path)
os.remove(alice_psi_path)
os.remove(bob_path)
os.remove(bob_psi_path)
except OSError:
pass
sf.shutdown()
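Congratulations! You have completed the entire SecretFlow end-to-end financial risk control experiment.
If you have any suggestions or questions about this experiment, please contact us via GitHub Issues.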