Developing an MPC Algorithm for a Real-World Scenario with SecretFlow#
We recommend using conda.
Create a new environment: > conda create -n sf python=3.8
Alternatively, use pip directly.
Install secretflow: > pip install -U secretflow
This tutorial is based on secretflow 0.7.7b1.
This example shows how to develop a real-world application on top of SecretFlow and the SPU privacy-computing device. We recommend going through the previous tutorial, spu_basics, first to get familiar with the basic SPU concepts.
Task Introduction#
Vehicle Insurance Claim Fraud Detection
The dataset comes from Kaggle and contains:
- vehicle data: attributes, model, accident details, etc.
- policy details: policy type, validity period, etc.
The goal is to detect whether a claim application is fraudulent: the field FraudFound_P (0 or 1) is the prediction target, which makes this a typical binary classification task.
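Before any privacy-preserving processing, it can help to glance at the label distribution in plaintext. The snippet below is a minimal sketch using pandas; it assumes the Kaggle csv has already been downloaded to data/fraud_oracle.csv (the download step appears later in this tutorial).
import pandas as pd

df = pd.read_csv('data/fraud_oracle.csv')
# FraudFound_P is the binary target: 1 = fraudulent claim, 0 = legitimate
print(df['FraudFound_P'].value_counts(normalize=True))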
Experiment Goals#
In this experiment, we will use an open-source dataset to build privacy-preserving logistic regression, neural network, and XGB models on SecretFlow. The main steps are:
1. Data loading
2. Data exploration
3. Data preprocessing
4. Model building
5. Model training and prediction
Prerequisites#
Starting the Ray Cluster (Multi-Node Deployment)#
For a multi-node deployment, the Ray cluster needs to be started before SecretFlow. Run the commands below on the head node and the worker nodes respectively. > P.S. After the cluster is up, you can run ray status to check whether it started correctly.
Head node
RAY_DISABLE_REMOTE_CODE=true \
ray start --head --node-ip-address="head_ip" --port="head_port" --resources='{"alice": 20}' --include-dashboard=False
Worker node
RAY_DISABLE_REMOTE_CODE=true \
ray start --address="head_ip:head_port" --resources='{"bob": 20}'
[1]:
# The code below initializes SecretFlow in multi-node mode; the IP and port of the head node must be provided
# head_ip = "xxx"
# head_port = "xxx"
# sf.init(address=f'{head_ip}:{head_port}')
Standalone Deployment#
Here we use the standalone deployment mode as a demo. By calling sf.init() we instantiate a Ray cluster with 5 nodes, corresponding to 5 physical devices.
[2]:
import secretflow as sf
sf.shutdown()
# Standalone Mode
sf.init(
['alice', 'bob', 'carol', 'davy', 'eric'],
address='local'
)
2022-11-09 20:12:39.876005: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/rh-ruby25/root/usr/local/lib64:/opt/rh/rh-ruby25/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst
Defining the Plaintext Computing Device: PYU#
With the 5 nodes above started, we now define SecretFlow's logical devices. Here the three parties alice, bob, and carol act as data providers; each can run plaintext computation locally, i.e., each is a PYU (PYthon runtime Unit) device.
[3]:
alice = sf.PYU('alice')
bob = sf.PYU('bob')
carol = sf.PYU('carol')
print(alice)
alice_
Defining the Ciphertext Computing Device: SPU (3PC)#
Next, taking the SPU (Secure Processing Unit) as an example, we pick 3 physical nodes to form a privacy-computing device based on MPC (the example below uses the three-party ABY3 protocol).
[4]:
import spu
from secretflow.utils.testing import unused_tcp_port
aby3_cluster_def = {
'nodes': [
{
'party': 'alice',
'address': f'127.0.0.1:{unused_tcp_port()}',
},
{'party': 'bob', 'address': f'127.0.0.1:{unused_tcp_port()}'},
{
'party': 'carol',
'address': f'127.0.0.1:{unused_tcp_port()}',
},
],
'runtime_config': {
'protocol': spu.spu_pb2.ABY3,
'field': spu.spu_pb2.FM64,
},
}
my_spu = sf.SPU(aby3_cluster_def)
Data Loading#
Load Data (Mock)#
Having defined SecretFlow's logical devices, we now demonstrate how to load data. We start with a mock data-loading function, get_data_mock(), as a simple demonstration.
[5]:
def get_data_mock():
return 2
x_plaintext = get_data_mock()
print(f"x_plaintext: {x_plaintext}")
x_plaintext: 2
Load data on a designated PYU device
[6]:
x_alice_pyu = alice(get_data_mock)()
print(f"Plaintext Python Object: {x_plaintext}, PYU object: {x_alice_pyu}")
print(f"Reveal PYU object: {sf.reveal(x_alice_pyu)}")
Plaintext Python Object: 2, PYU object: <secretflow.device.device.pyu.PYUObject object at 0x7f0931996fd0>
Reveal PYU object: 2
PYU -> SPU data conversion
[7]:
x_alice_spu = x_alice_pyu.to(my_spu)
print(f"SPU object: {x_alice_spu}")
print(f"Reveal SPU object: {sf.reveal(x_alice_spu)}")
SPU object: <secretflow.device.device.spu.SPUObject object at 0x7f0a38d9b460>
Reveal SPU object: 2
Load Data (Distributed)#
Next we consider loading data for the real application scenario, where the full dataset is vertically partitioned across the participating parties. > For demonstration purposes, we vertically split the centralized plaintext data ourselves. First, let's look at the features of this dataset.
Reading the Full Plaintext Dataset#
[28]:
import os
"""
Create dir to save dataset files
This will create a directory `data` to store the dataset file
"""
if not os.path.exists('data'):
os.mkdir('data')
"""
The original data is from Kaggle: https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection.
The data is used here for demonstration purposes only.
"""
path = "https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/vehicle_nsurance_claim/fraud_oracle.csv"
if not os.path.exists('data/fraud_oracle.csv'):
res = os.system('cd data && wget {}'.format(path))
if res != 0:
raise Exception('File: {} download fails!'.format(path))
else:
print(f'File already downloaded.')
--2022-11-09 20:25:18-- https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/vehicle_nsurance_claim/fraud_oracle.csv
Resolving secretflow-data.oss-accelerate.aliyuncs.com (secretflow-data.oss-accelerate.aliyuncs.com)... 101.133.111.250
Connecting to secretflow-data.oss-accelerate.aliyuncs.com (secretflow-data.oss-accelerate.aliyuncs.com)|101.133.111.250|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3618564 (3.5M) [text/csv]
Saving to: ‘fraud_oracle.csv’
2022-11-09 20:25:18 (15.5 MB/s) - ‘fraud_oracle.csv’ saved [3618564/3618564]
[9]:
from sklearn.model_selection import train_test_split
import pandas as pd
"""
This should point to the data downloaded from Kaggle.
By default, the .csv file shall be in the data directory
"""
full_data_path = 'data/fraud_oracle.csv'
df = pd.read_csv(full_data_path)
df.head()
[9]:
Month | WeekOfMonth | DayOfWeek | Make | AccidentArea | DayOfWeekClaimed | MonthClaimed | WeekOfMonthClaimed | Sex | MaritalStatus | ... | AgeOfVehicle | AgeOfPolicyHolder | PoliceReportFiled | WitnessPresent | AgentType | NumberOfSuppliments | AddressChange_Claim | NumberOfCars | Year | BasePolicy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dec | 5 | Wednesday | Honda | Urban | Tuesday | Jan | 1 | Female | Single | ... | 3 years | 26 to 30 | No | No | External | none | 1 year | 3 to 4 | 1994 | Liability |
1 | Jan | 3 | Wednesday | Honda | Urban | Monday | Jan | 4 | Male | Single | ... | 6 years | 31 to 35 | Yes | No | External | none | no change | 1 vehicle | 1994 | Collision |
2 | Oct | 5 | Friday | Honda | Urban | Thursday | Nov | 2 | Male | Married | ... | 7 years | 41 to 50 | No | No | External | none | no change | 1 vehicle | 1994 | Collision |
3 | Jun | 2 | Saturday | Toyota | Rural | Friday | Jul | 1 | Male | Married | ... | more than 7 | 51 to 65 | Yes | No | External | more than 5 | no change | 1 vehicle | 1994 | Liability |
4 | Jan | 5 | Monday | Honda | Urban | Tuesday | Feb | 2 | Female | Single | ... | 5 years | 31 to 35 | No | No | External | none | no change | 1 vehicle | 1994 | Collision |
5 rows × 33 columns
Vertically Splitting the Data Across Three Parties#
We first split the data to simulate a three-party scenario with vertically partitioned data:
- alice holds the first 10 attributes
- bob holds the next 10 attributes
- carol holds all remaining attributes together with the label
To make it easy to align samples across the parties, we also add a new feature UID that identifies each data sample.
We split the full dataset into a training set and a test set with sklearn in advance, so that we can validate model quality later.
[10]:
train_alice_path = "data/alice_train.csv"
train_bob_path = "data/bob_train.csv"
train_carol_path = "data/carol_train.csv"
test_alice_path = "data/alice_test.csv"
test_bob_path = "data/bob_test.csv"
test_carol_path = "data/carol_test.csv"
def load_dataset_full(data_path):
df = pd.read_csv(data_path)
df = df.drop([0])
df = df.loc[df['DayOfWeekClaimed']!='0']
y = df['FraudFound_P']
X = df.drop(columns='FraudFound_P')
return X, y
def split_data():
x, y = load_dataset_full(full_data_path)
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.3, random_state=10
)
print(x_train.shape)
train_alice_csv = x_train.iloc[:, :10]
train_bob_csv = x_train.iloc[:, 10:20]
train_carol_csv = pd.concat([x_train.iloc[:, 20:], y_train], axis=1)
train_alice_csv.to_csv(train_alice_path, index_label='UID')
train_bob_csv.to_csv(train_bob_path, index_label='UID')
train_carol_csv.to_csv(train_carol_path, index_label='UID')
print(x_test.shape)
test_alice_csv = x_test.iloc[:, :10]
test_bob_csv = x_test.iloc[:, 10:20]
test_carol_csv = pd.concat([x_test.iloc[:, 20:], y_test], axis=1)
test_alice_csv.to_csv(test_alice_path, index_label='UID')
test_bob_csv.to_csv(test_bob_path, index_label='UID')
test_carol_csv.to_csv(test_carol_path, index_label='UID')
split_data()
(10792, 32)
(4626, 32)
[11]:
alice_train_df = pd.read_csv(train_alice_path)
alice_train_df.head()
[11]:
UID | Month | WeekOfMonth | DayOfWeek | Make | AccidentArea | DayOfWeekClaimed | MonthClaimed | WeekOfMonthClaimed | Sex | MaritalStatus | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2853 | Mar | 4 | Sunday | Toyota | Urban | Friday | Apr | 1 | Male | Married |
1 | 7261 | Apr | 4 | Saturday | Honda | Urban | Monday | Apr | 4 | Male | Married |
2 | 9862 | Jun | 4 | Sunday | Toyota | Rural | Monday | Jun | 4 | Female | Single |
3 | 14037 | Mar | 2 | Monday | Mazda | Urban | Monday | Mar | 2 | Male | Single |
4 | 10199 | Jun | 3 | Friday | Mazda | Urban | Tuesday | Jun | 4 | Female | Single |
[12]:
bob_train_df = pd.read_csv(train_bob_path)
bob_train_df.head()
[12]:
UID | Age | Fault | PolicyType | VehicleCategory | VehiclePrice | PolicyNumber | RepNumber | Deductible | DriverRating | Days_Policy_Accident | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2853 | 39 | Policy Holder | Sedan - All Perils | Sedan | 20000 to 29000 | 2854 | 8 | 400 | 2 | more than 30 |
1 | 7261 | 58 | Policy Holder | Sedan - Liability | Sport | 20000 to 29000 | 7262 | 4 | 400 | 4 | more than 30 |
2 | 9862 | 28 | Policy Holder | Sedan - All Perils | Sedan | less than 20000 | 9863 | 5 | 400 | 4 | more than 30 |
3 | 14037 | 28 | Policy Holder | Sedan - Collision | Sedan | 20000 to 29000 | 14038 | 11 | 400 | 4 | more than 30 |
4 | 10199 | 35 | Policy Holder | Sedan - Collision | Sedan | 20000 to 29000 | 10200 | 12 | 400 | 4 | more than 30 |
[13]:
carol_train_df = pd.read_csv(train_carol_path)
carol_train_df.head()
[13]:
UID | Days_Policy_Claim | PastNumberOfClaims | AgeOfVehicle | AgeOfPolicyHolder | PoliceReportFiled | WitnessPresent | AgentType | NumberOfSuppliments | AddressChange_Claim | NumberOfCars | Year | BasePolicy | FraudFound_P | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2853 | more than 30 | 1 | 7 years | 36 to 40 | No | No | External | more than 5 | no change | 1 vehicle | 1994 | All Perils | 0 |
1 | 7261 | more than 30 | none | more than 7 | 51 to 65 | No | No | External | 1 to 2 | no change | 1 vehicle | 1995 | Liability | 0 |
2 | 9862 | more than 30 | none | 7 years | 31 to 35 | No | No | External | none | no change | 1 vehicle | 1995 | All Perils | 0 |
3 | 14037 | more than 30 | 1 | 6 years | 31 to 35 | No | No | External | none | no change | 1 vehicle | 1996 | Collision | 0 |
4 | 10199 | more than 30 | none | 5 years | 31 to 35 | No | No | Internal | none | no change | 1 vehicle | 1995 | Collision | 0 |
Three-Party Data Loading#
Note: this API requires explicitly specifying the key used to align samples across the parties, as well as which device is used to run the PSI.
[14]:
from secretflow.data.vertical import read_csv as v_read_csv
train_ds = v_read_csv({alice: train_alice_path, bob: train_bob_path, carol: train_carol_path}, keys='UID', drop_keys='UID', spu=my_spu)
test_ds = v_read_csv({alice: test_alice_path, bob: test_bob_path, carol: test_carol_path}, keys='UID', drop_keys='UID', spu=my_spu)
print(train_ds)
print(train_ds.columns)
VDataFrame(partitions={<secretflow.device.device.pyu.PYU object at 0x7f09319baf70>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f09319a2910>), <secretflow.device.device.pyu.PYU object at 0x7f09319bac10>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38d1acd0>), <secretflow.device.device.pyu.PYU object at 0x7f09319bafd0>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38cfe190>)}, aligned=True)
Index(['Month', 'WeekOfMonth', 'DayOfWeek', 'Make', 'AccidentArea',
'DayOfWeekClaimed', 'MonthClaimed', 'WeekOfMonthClaimed', 'Sex',
'MaritalStatus', 'Age', 'Fault', 'PolicyType', 'VehicleCategory',
'VehiclePrice', 'PolicyNumber', 'RepNumber', 'Deductible',
'DriverRating', 'Days_Policy_Accident', 'Days_Policy_Claim',
'PastNumberOfClaims', 'AgeOfVehicle', 'AgeOfPolicyHolder',
'PoliceReportFiled', 'WitnessPresent', 'AgentType',
'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars', 'Year',
'BasePolicy', 'FraudFound_P'],
dtype='object')
Data Exploration#
On top of the VDataFrame abstraction, SecretFlow provides a variety of data-analysis APIs, for example computing statistics and reading or modifying specific columns.
[15]:
print(train_ds['WeekOfMonth'].count())
print(train_ds['WeekOfMonth'].max())
print(train_ds['WeekOfMonth'].min())
WeekOfMonth 10792
dtype: int64
WeekOfMonth 5
dtype: int64
WeekOfMonth 1
dtype: int64
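Other pandas-style reductions and multi-column selection follow the same pattern. A minimal sketch (assuming the pandas-like mean() reduction is available on VDataFrame in this version):
# select columns held by different parties and reduce over each of them
print(train_ds[['WeekOfMonth', 'Age']].max())
# assumed pandas-like mean() reduction on a single column
print(train_ds['Age'].mean())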
Data Preprocessing#
With the data loaded, we now demonstrate how to preprocess data that is genuinely held by multiple parties on SecretFlow.
Label Encoder#
For unordered binary values, we can use label encoding to convert them into a 0/1 representation.
[16]:
from secretflow.preprocessing import LabelEncoder
cols = ['AccidentArea', 'Sex', 'Fault', 'PoliceReportFiled', 'WitnessPresent', 'AgentType']
for col in cols:
print(f"Col name {col}: {df[col].unique()}")
train_ds_v1 = train_ds.copy()
test_ds_v1 = test_ds.copy()
label_encoder = LabelEncoder()
for col in cols:
label_encoder.fit(train_ds_v1[col])
train_ds_v1[col] = label_encoder.transform(train_ds_v1[col])
test_ds_v1[col] = label_encoder.transform(test_ds_v1[col])
Col name AccidentArea: ['Urban' 'Rural']
Col name Sex: ['Female' 'Male']
Col name Fault: ['Policy Holder' 'Third Party']
Col name PoliceReportFiled: ['No' 'Yes']
Col name WitnessPresent: ['No' 'Yes']
Col name AgentType: ['External' 'Internal']
(Ordinal) Categorical Features#
For ordered categorical data, we build explicit mappings that convert each category into a numeric value reflecting its order.
[17]:
cols1 = [
"Days_Policy_Accident",
"Days_Policy_Claim",
"AgeOfPolicyHolder",
"AddressChange_Claim",
"NumberOfCars",
]
col_disc = [
{
"Days_Policy_Accident": {
"more than 30": 31,
"15 to 30": 22.5,
"none": 0,
"1 to 7": 4,
"8 to 15": 11.5,
}
},
{
"Days_Policy_Claim": {
"more than 30": 31,
"15 to 30": 22.5,
"8 to 15": 11.5,
"none": 0,
}
},
{
"AgeOfPolicyHolder": {
"26 to 30": 28,
"31 to 35": 33,
"41 to 50": 45.5,
"51 to 65": 58,
"21 to 25": 23,
"36 to 40": 38,
"16 to 17": 16.5,
"over 65": 66,
"18 to 20": 19,
}
},
{
"AddressChange_Claim": {
"1 year": 1,
"no change": 0,
"4 to 8 years": 6,
"2 to 3 years": 2.5,
"under 6 months": 0.5,
}
},
{
"NumberOfCars": {
"3 to 4": 3.5,
"1 vehicle": 1,
"2 vehicles": 2,
"5 to 8": 6.5,
"more than 8": 9,
}
},
]
cols2 = [
"Month",
"DayOfWeek",
"DayOfWeekClaimed",
"MonthClaimed",
"PastNumberOfClaims",
"NumberOfSuppliments",
"VehiclePrice",
"AgeOfVehicle",
]
col_ordering = [
{
"Month": {
"Jan": 1,
"Feb": 2,
"Mar": 3,
"Apr": 4,
"May": 5,
"Jun": 6,
"Jul": 7,
"Aug": 8,
"Sep": 9,
"Oct": 10,
"Nov": 11,
"Dec": 12,
}
},
{
"DayOfWeek": {
"Monday": 1,
"Tuesday": 2,
"Wednesday": 3,
"Thursday": 4,
"Friday": 5,
"Saturday": 6,
"Sunday": 7,
}
},
{
"DayOfWeekClaimed": {
"Monday": 1,
"Tuesday": 2,
"Wednesday": 3,
"Thursday": 4,
"Friday": 5,
"Saturday": 6,
"Sunday": 7,
}
},
{
"MonthClaimed": {
"Jan": 1,
"Feb": 2,
"Mar": 3,
"Apr": 4,
"May": 5,
"Jun": 6,
"Jul": 7,
"Aug": 8,
"Sep": 9,
"Oct": 10,
"Nov": 11,
"Dec": 12,
}
},
{"PastNumberOfClaims": {"none": 0, "1": 1, "2 to 4": 2, "more than 4": 5}},
{"NumberOfSuppliments": {"none": 0, "1 to 2": 1, "3 to 5": 3, "more than 5": 6}},
{
"VehiclePrice": {
"more than 69000": 69001,
"20000 to 29000": 24500,
"30000 to 39000": 34500,
"less than 20000": 19999,
"40000 to 59000": 49500,
"60000 to 69000": 64500,
}
},
{
"AgeOfVehicle": {
"3 years": 3,
"6 years": 6,
"7 years": 7,
"more than 7": 8,
"5 years": 5,
"new": 0,
"4 years": 4,
"2 years": 2,
}
},
]
from secretflow.data.vertical import VDataFrame
def replace(df, col_maps):
df = df.copy()
def func_(df, col_map):
col_name = list(col_map.keys())[0]
col_dict = list(col_map.values())[0]
if col_name not in df.columns:
return
new_list = []
for i in df[col_name]:
new_list.append(col_dict[i])
df[col_name] = new_list
for col_map in col_maps:
func_(df, col_map)
return df
col_maps = col_disc + col_ordering
train_ds_v2 = train_ds_v1.copy()
test_ds_v2 = test_ds_v1.copy()
# NOTE: reveal() is used here for demonstration purposes only!
print(f"orig ds in alice:\n {sf.reveal(train_ds_v2.partitions[alice].data)}")
train_ds_v2.partitions[alice].data = alice(replace)(
train_ds_v2.partitions[alice].data, col_maps
)
train_ds_v2.partitions[bob].data = bob(replace)(
train_ds_v2.partitions[bob].data, col_maps
)
train_ds_v2.partitions[carol].data = carol(replace)(
train_ds_v2.partitions[carol].data, col_maps
)
print(f"orig ds in alice:\n {sf.reveal(train_ds_v2.partitions[alice].data)}")
test_ds_v2.partitions[alice].data = alice(replace)(
test_ds_v2.partitions[alice].data, col_maps
)
test_ds_v2.partitions[bob].data = bob(replace)(
test_ds_v2.partitions[bob].data, col_maps
)
test_ds_v2.partitions[carol].data = carol(replace)(
test_ds_v2.partitions[carol].data, col_maps
)
orig ds in alice:
Month WeekOfMonth DayOfWeek Make AccidentArea DayOfWeekClaimed \
0 Mar 4 Sunday Toyota 1 Friday
1 Apr 4 Saturday Honda 1 Monday
2 Jun 4 Sunday Toyota 0 Monday
3 Mar 2 Monday Mazda 1 Monday
4 Jun 3 Friday Mazda 1 Tuesday
... ... ... ... ... ... ...
10787 Sep 2 Wednesday Chevrolet 1 Monday
10788 Apr 4 Tuesday Honda 1 Tuesday
10789 Jul 1 Monday Ford 0 Wednesday
10790 Feb 3 Wednesday Pontiac 1 Monday
10791 Feb 5 Monday Mercury 1 Friday
MonthClaimed WeekOfMonthClaimed Sex MaritalStatus
0 Apr 1 1 Married
1 Apr 4 1 Married
2 Jun 4 0 Single
3 Mar 2 1 Single
4 Jun 4 0 Single
... ... ... ... ...
10787 Sep 2 0 Married
10788 May 1 1 Married
10789 Jul 1 1 Married
10790 Feb 3 1 Married
10791 Mar 1 1 Married
[10792 rows x 10 columns]
orig ds in alice:
Month WeekOfMonth DayOfWeek Make AccidentArea \
0 3 4 7 Toyota 1
1 4 4 6 Honda 1
2 6 4 7 Toyota 0
3 3 2 1 Mazda 1
4 6 3 5 Mazda 1
... ... ... ... ... ...
10787 9 2 3 Chevrolet 1
10788 4 4 2 Honda 1
10789 7 1 1 Ford 0
10790 2 3 3 Pontiac 1
10791 2 5 1 Mercury 1
DayOfWeekClaimed MonthClaimed WeekOfMonthClaimed Sex MaritalStatus
0 5 4 1 1 Married
1 1 4 4 1 Married
2 1 6 4 0 Single
3 1 3 2 1 Single
4 2 6 4 0 Single
... ... ... ... ... ...
10787 1 9 2 0 Married
10788 2 5 1 1 Married
10789 3 7 1 1 Married
10790 1 2 3 1 Married
10791 5 3 1 1 Married
[10792 rows x 10 columns]
(Nominal) Categorical Features#
For unordered categorical data, we directly apply a one-hot encoder to produce 0/1 encodings.
Onehot Encoder#
[18]:
from secretflow.preprocessing import OneHotEncoder
onehot_cols = ['Make','MaritalStatus','PolicyType','VehicleCategory','BasePolicy']
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(train_ds_v2[onehot_cols])
enc_feats = onehot_encoder.transform(train_ds_v2[onehot_cols])
feature_names = enc_feats.columns
train_ds_v3 = train_ds_v2.drop(columns=onehot_cols)
train_ds_v3[feature_names] = enc_feats
enc_feats = onehot_encoder.transform(test_ds_v2[onehot_cols])
test_ds_v3 = test_ds_v2.drop(columns=onehot_cols)
test_ds_v3[feature_names] = enc_feats
print(f"orig ds in alice:\n {sf.reveal(train_ds_v3.partitions[alice].data)}")
orig ds in alice:
Month WeekOfMonth DayOfWeek AccidentArea DayOfWeekClaimed \
0 3 4 7 1 5
1 4 4 6 1 1
2 6 4 7 0 1
3 3 2 1 1 1
4 6 3 5 1 2
... ... ... ... ... ...
10787 9 2 3 1 1
10788 4 4 2 1 2
10789 7 1 1 0 3
10790 2 3 3 1 1
10791 2 5 1 1 5
MonthClaimed WeekOfMonthClaimed Sex Make_Accura Make_BMW ... \
0 4 1 1 0.0 0.0 ...
1 4 4 1 0.0 0.0 ...
2 6 4 0 0.0 0.0 ...
3 3 2 1 0.0 0.0 ...
4 6 4 0 0.0 0.0 ...
... ... ... ... ... ... ...
10787 9 2 0 0.0 0.0 ...
10788 5 1 1 0.0 0.0 ...
10789 7 1 1 0.0 0.0 ...
10790 2 3 1 0.0 0.0 ...
10791 3 1 1 0.0 0.0 ...
Make_Pontiac Make_Porche Make_Saab Make_Saturn Make_Toyota \
0 0.0 0.0 0.0 0.0 1.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 1.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ...
10787 0.0 0.0 0.0 0.0 0.0
10788 0.0 0.0 0.0 0.0 0.0
10789 0.0 0.0 0.0 0.0 0.0
10790 1.0 0.0 0.0 0.0 0.0
10791 0.0 0.0 0.0 0.0 0.0
Make_VW MaritalStatus_Divorced MaritalStatus_Married \
0 0.0 0.0 1.0
1 0.0 0.0 1.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
... ... ... ...
10787 0.0 0.0 1.0
10788 0.0 0.0 1.0
10789 0.0 0.0 1.0
10790 0.0 0.0 1.0
10791 0.0 0.0 1.0
MaritalStatus_Single MaritalStatus_Widow
0 0.0 0.0
1 0.0 0.0
2 1.0 0.0
3 1.0 0.0
4 1.0 0.0
... ... ...
10787 0.0 0.0
10788 0.0 0.0
10789 0.0 0.0
10790 0.0 0.0
10791 0.0 0.0
[10792 rows x 31 columns]
[19]:
train_ds_final = train_ds_v3.copy()
test_ds_final = test_ds_v3.copy()
X_train = train_ds_final.drop(columns=['FraudFound_P'])
y_train = train_ds_final['FraudFound_P']
X_test = test_ds_final.drop(columns='FraudFound_P')
y_test = test_ds_final['FraudFound_P']
print("data load done")
data load done
Data Object Conversion#
Here we convert the PYUObjects into SPUObjects so that they can be fed into the SPU device for MPC-based privacy-preserving computation.
[20]:
import jax
import jax.numpy as jnp
"""
Convert the VDataFrame object to SPUObject
"""
def vdataframe_to_spu(vdf: VDataFrame):
spu_partitions = []
for device in [alice, bob, carol]:
spu_partitions.append(vdf.partitions[device].data.to(my_spu))
base_partition = spu_partitions[0]
for i in range(1, len(spu_partitions)):
base_partition = my_spu(lambda x, y: jnp.concatenate([x, y], axis=1))(
base_partition, spu_partitions[i]
)
return base_partition
X_train_spu = vdataframe_to_spu(X_train)
y_train_spu = y_train.partitions[carol].data.to(my_spu)
X_test_spu = vdataframe_to_spu(X_test)
y_test_spu = y_test.partitions[carol].data.to(my_spu)
print(f"X_train type: {X_train}\n\nX_train_spu type: {X_train_spu}")
"""
NOTE: reveal() is for demonstration purposes only and must not be used in production.
"""
X_train_plaintext = sf.reveal(X_train_spu)
y_train_plaintext = sf.reveal(y_train_spu)
X_test_plaintext = sf.reveal(X_test_spu)
y_test_plaintext = sf.reveal(y_test_spu)
print(f'X_train_plaintext: \n{X_train_plaintext}')
X_train type: VDataFrame(partitions={<secretflow.device.device.pyu.PYU object at 0x7f09319baf70>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38d1a8b0>), <secretflow.device.device.pyu.PYU object at 0x7f09319bac10>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38d1a850>), <secretflow.device.device.pyu.PYU object at 0x7f09319bafd0>: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7f0a38c98070>)}, aligned=True)
X_train_spu type: <secretflow.device.device.spu.SPUObject object at 0x7f0a38cd7850>
X_train_plaintext:
[[3. 4. 7. ... 1. 0. 0.]
[4. 4. 6. ... 0. 0. 1.]
[6. 4. 7. ... 1. 0. 0.]
...
[7. 1. 1. ... 0. 1. 0.]
[2. 3. 3. ... 0. 0. 1.]
[2. 5. 1. ... 0. 0. 1.]]
Model Building#
With the data prepared, we now build the models. This demo builds three kinds of models:
- LR: logistic regression
- NN: a neural network model
- XGB: an XGBoost tree model
Note that this example focuses on the algorithm-development workflow on SecretFlow; the LR and NN models were not tuned. We provide both plaintext and ciphertext results, and the two are essentially identical, showing that SecretFlow's secret computation matches plaintext computation in accuracy.
LR ( jax ) using SPU#
[21]:
from jax.example_libraries import optimizers, stax
from jax.example_libraries.stax import (
Conv,
MaxPool,
AvgPool,
Flatten,
Dense,
Relu,
Sigmoid,
LogSoftmax,
Softmax,
BatchNorm,
)
def sigmoid(x):
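# min-max normalize the input to [0, 1] before applying the logistic function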
x = (x - jnp.min(x)) / (jnp.max(x) - jnp.min(x))
return 1 / (1 + jnp.exp(-x))
# Outputs probability of a label being true.
def predict_lr(W, b, inputs):
return sigmoid(jnp.dot(inputs, W) + b)
# Training loss is the negative log-likelihood of the training examples.
def loss_lr(W, b, inputs, targets):
preds = predict_lr(W, b, inputs)
label_probs = preds * targets + (1 - preds) * (1 - targets)
return -jnp.mean(jnp.log(label_probs))
def train_step(W, b, X, y, learning_rate):
loss_value, Wb_grad = jax.value_and_grad(loss_lr, (0, 1))(W, b, X, y)
W -= learning_rate * Wb_grad[0]
b -= learning_rate * Wb_grad[1]
return loss_value, W, b
def fit(W, b, X, y, epochs=1, learning_rate=1e-2, batch_size=128):
losses = jnp.array([])
xs = jnp.array_split(X, len(X) / batch_size, axis=0)
ys = jnp.array_split(y, len(y) / batch_size, axis=0)
for _ in range(epochs):
for (batch_x, batch_y) in zip(xs, ys):
l, W, b = train_step(
W, b, batch_x, batch_y, learning_rate=learning_rate
)
losses = jnp.append(losses, l)
return losses, W, b
[22]:
from jax import random
import sys
import time
import logging
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)
from sklearn.metrics import roc_auc_score
# Hyperparameter
key = random.PRNGKey(42)
W = jax.random.normal(key, shape=(64,))
b = 0.0
epochs = 1
learning_rate = 1e-2
batch_size = 128
"""
CPU-version plaintext computation
"""
losses_cpu, W_cpu, b_cpu = fit(
W,
b,
X_train_plaintext,
y_train_plaintext,
epochs=epochs,
learning_rate=learning_rate,
batch_size=batch_size,
)
y_pred_cpu = predict_lr(W_cpu, b_cpu, X_test_plaintext)
print(f"\033[31m(Jax LR CPU) auc: {roc_auc_score(y_test_plaintext, y_pred_cpu)}\033[0m")
"""
SPU-version secure computation
"""
W_, b_ = (
sf.to(alice, W).to(my_spu),
sf.to(alice, b).to(my_spu),
)
losses_spu, W_spu, b_spu = my_spu(
fit,
static_argnames=["epochs", "learning_rate", "batch_size"],
num_returns_policy=sf.device.SPUCompilerNumReturnsPolicy.FROM_USER,
user_specified_num_returns=3,
)(
W_,
b_,
X_train_spu,
y_train_spu,
epochs=epochs,
learning_rate=learning_rate,
batch_size=batch_size,
)
y_pred_spu = my_spu(predict_lr)(W_spu, b_spu, X_test_spu)
y_pred = sf.reveal(y_pred_spu)
print(f"\033[31m(Jax LR SPU) auc: {roc_auc_score(y_test_plaintext, y_pred)}\033[0m")
INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
INFO:absl:Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: "cuda". Available platform names are: Interpreter Host
INFO:absl:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
(Jax LR CPU) auc: 0.5243556504755678
(Jax LR SPU) auc: 0.5249493212966679
NN ( jax + flax ) using SPU#
[ ]:
import sys
!{sys.executable} -m pip install flax==0.6.0
[23]:
from typing import Sequence
import flax.linen as nn
class MLP(nn.Module):
features: Sequence[int]
@nn.compact
def __call__(self, x):
for feat in self.features[:-1]:
x = nn.relu(nn.Dense(feat)(x))
x = nn.Dense(self.features[-1])(x)
return x
FEATURES = [1]
flax_nn = MLP(FEATURES)
def predict(params, x):
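# The model is re-defined inside predict so that the function stays self-contained
# when it is shipped to and traced on the SPU device.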
from typing import Sequence
import flax.linen as nn
class MLP(nn.Module):
features: Sequence[int]
@nn.compact
def __call__(self, x):
for feat in self.features[:-1]:
x = nn.relu(nn.Dense(feat)(x))
x = nn.Dense(self.features[-1])(x)
return x
FEATURES = [1]
flax_nn = MLP(FEATURES)
return flax_nn.apply(params, x)
def loss_func(params, x, y):
preds = predict(params, x)
label_probs = preds * y + (1 - preds) * (1 - y)
return -jnp.mean(jnp.log(label_probs))
def train_auto_grad(X, y, params, batch_size=10, epochs=10, learning_rate=0.01):
xs = jnp.array_split(X, len(X) / batch_size, axis=0)
ys = jnp.array_split(y, len(y) / batch_size, axis=0)
for _ in range(epochs):
for (batch_x, batch_y) in zip(xs, ys):
_, grads = jax.value_and_grad(loss_func)(params, batch_x, batch_y)
params = jax.tree_util.tree_map(
lambda p, g: p - learning_rate * g, params, grads
)
return params
epochs = 1
learning_rate = 1e-2
batch_size = 128
feature_dim = 64 # from the dataset
init_params = flax_nn.init(jax.random.PRNGKey(1), jnp.ones((batch_size, feature_dim)))
"""
CPU-version plaintext computation
"""
params = train_auto_grad(
X_train_plaintext, y_train_plaintext, init_params, batch_size, epochs, learning_rate
)
y_pred = predict(params, X_test_plaintext)
print(f"\033[31m(Flax NN CPU) auc: {roc_auc_score(y_test_plaintext, y_pred)}\033[0m")
"""
SPU-version secure computation
"""
params_spu = sf.to(alice, init_params).to(my_spu)
params_spu = my_spu(train_auto_grad, static_argnames=['batch_size', 'epochs', 'learning_rate'])(
X_train_spu, y_train_spu, params_spu, batch_size=batch_size, epochs=epochs, learning_rate=learning_rate
)
y_pred_spu = my_spu(predict)(params_spu, X_test_spu)
y_pred_ = sf.reveal(y_pred_spu)
print(f"\033[31m(Flax NN SPU) auc: {roc_auc_score(y_test_plaintext, y_pred_)}\033[0m")
(Flax NN CPU) auc: 0.5022025986877814
(Flax NN SPU) auc: 0.5022042816667214
XGB ( jax ) using SPU#
[24]:
from secretflow.ml.boost.ss_xgb_v import Xgb
import time
from sklearn.metrics import roc_auc_score
"""
SPU-version Secure computation
"""
xgb = Xgb(my_spu)
params = {
# <<< !!! >>> change args to your test settings.
# for more detail, see Xgb.train.__doc__
'num_boost_round': 10,
'max_depth': 4,
'learning_rate': 0.05,
'sketch_eps': 0.05,
'objective': 'logistic',
'reg_lambda': 1,
'subsample': 0.75,
'colsample_bytree': 1,
'base_score': 0.5,
}
start = time.time()
model = xgb.train(params, X_train, y_train)
print(f"train time: {time.time() - start}")
start =time.time()
spu_yhat = model.predict(X_test)
print(f"predict time: {time.time() - start}")
yhat = sf.reveal(spu_yhat)
print(f"\033[31m(SS-XGB) auc: {roc_auc_score(y_test_plaintext, yhat)}\033[0m")
INFO:root:global_setup time 5.707190036773682s
INFO:root:epoch 0 time 13.447269678115845s
INFO:root:epoch 1 time 11.487506866455078s
INFO:root:epoch 2 time 16.403863430023193s
INFO:root:epoch 3 time 10.947153091430664s
INFO:root:epoch 4 time 10.782062530517578s
INFO:root:epoch 5 time 10.983924865722656s
INFO:root:epoch 6 time 13.287509441375732s
INFO:root:epoch 7 time 10.768491506576538s
INFO:root:epoch 8 time 11.075066804885864s
INFO:root:epoch 9 time 11.336798429489136s
train time: 126.23692321777344
predict time: 0.0565030574798584
(SS-XGB) auc: 0.6917051858471569
[29]:
"""
Plaintext baseline
"""
import xgboost as SKxgb
params = {
# <<< !!! >>> change args to your test settings.
# for more detail, see Xgb.train.__doc__
"n_estimators": 10,
"max_depth": 4,
'eval_metric': 'auc',
"learning_rate": 0.05,
"sketch_eps": 0.05,
"objective": "binary:logistic",
"reg_lambda": 1,
"subsample": 0.75,
"colsample_bytree": 1,
"base_score": 0.5,
}
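# Note: the `params` dict above is not passed in here; XGBClassifier runs with its default hyperparameters.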
raw_xgb = SKxgb.XGBClassifier()
raw_xgb.fit(X_train_plaintext, y_train_plaintext)
y_pred = raw_xgb.predict(X_test_plaintext)
print(f"\033[31m(Sklearn-XGB) auc: {roc_auc_score(y_test_plaintext, y_pred)}\033[0m")
/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/xgboost/sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:98: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:133: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
[20:26:03] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
(Sklearn-XGB) auc: 0.7106100882806603
The End#
Explicitly call sf.shutdown() to shut down the instantiated cluster. > Note: if you run the code from a .py file, there is no need to call shutdown explicitly; the shutdown function is executed implicitly when the process exits.
[26]:
sf.shutdown()
Summary#
- We showed how to develop an application for a real-world scenario on SecretFlow and give it privacy-preserving capabilities.
- We walked through data loading, preprocessing, modeling, and training on SecretFlow.
- Next step: implement arbitrary computations of your own (written in jax); support for TF and PyTorch is work in progress. A small sketch follows below.
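As a pointer for that next step, here is a minimal sketch of wrapping your own jax computation in the SPU device. It reuses the alice, bob, and my_spu devices defined earlier, so it must run before sf.shutdown() (or after re-initializing the cluster); the weighted_sum function and its constants are made up for illustration.
import jax.numpy as jnp

def weighted_sum(x, y):
    # any pure jax function can be evaluated under MPC on the SPU
    return jnp.sum(0.3 * x + 0.7 * y)

# move each party's private array to the SPU, run the computation, then reveal
x_spu = sf.to(alice, jnp.arange(5.0)).to(my_spu)
y_spu = sf.to(bob, jnp.ones(5)).to(my_spu)
print(sf.reveal(my_spu(weighted_sum)(x_spu, y_spu)))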