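Split Learning: Bank Marketing#
The following code is for demonstration only. Do not use it directly in a production environment.
In this tutorial, we use a bank's marketing model as an example to show how to perform split learning in a vertical scenario with the SecretFlow framework. SecretFlow provides a user-friendly API that makes it easy to apply your existing Keras or PyTorch model to split learning, turning it into a split learning model that can carry out joint modeling tasks in the vertical scenario.
In the rest of this tutorial, we walk step by step through converting an existing Keras model into a split learning model under SecretFlow and completing a federated multi-party modeling task.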
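What is split learning?#
The core idea of split learning is to split the network structure. Each device (institution) keeps only a part of the network, and the sub-networks of all devices together form one complete model. During training, each device (institution) runs forward or backward computation only on its local sub-network and passes the result to the next device. The devices train jointly in this way until the model converges.
Alice runs her local data through model_base_alice to obtain hidden_0 and sends it to Bob.
Bob runs his local data through model_base_bob to obtain hidden_1.
hidden_0 and hidden_1 are fed into the AggLayer for aggregation, which outputs the aggregated hidden_merge.
Bob feeds hidden_merge into model_fuse, computes the gradient against the label, and propagates it back.
The AggLayer splits the gradient into two parts, g0 and g1, which are sent to Alice and Bob respectively.
Alice and Bob each update their own base net with g0 and g1.
In this generic flow Bob is the label holder. In the experiment below the label is held by Alice (device_y=alice), so the roles are mirrored.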
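The six steps above can be sketched with NumPy on a toy batch. This is a minimal illustration under stated assumptions, not SecretFlow's implementation: each base model and the fuse model is reduced to a single linear layer, aggregation is plain concatenation, the hidden size of 3 and the learning rate are arbitrary, and every name apart from hidden_0, hidden_1, hidden_merge, g0 and g1 is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vertically partitioned batch (assumption): 8 samples,
# 4 features on Alice's side and 12 on Bob's side.
x_alice = rng.normal(size=(8, 4))
x_bob = rng.normal(size=(8, 12))
label = rng.integers(0, 2, size=(8, 1)).astype(float)

HIDDEN = 3  # illustrative hidden size
params = {
    "base_alice": rng.normal(scale=0.1, size=(4, HIDDEN)),
    "base_bob": rng.normal(scale=0.1, size=(12, HIDDEN)),
    "fuse": rng.normal(scale=0.1, size=(2 * HIDDEN, 1)),
}


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_step(lr=0.2):
    # Steps 1-2: each party runs its base model on its own features.
    hidden_0 = x_alice @ params["base_alice"]
    hidden_1 = x_bob @ params["base_bob"]
    # Step 3: aggregate the hidden outputs (plain concatenation here).
    hidden_merge = np.concatenate([hidden_0, hidden_1], axis=1)
    # Step 4: the label holder runs the fuse model and gets the gradient.
    prob = sigmoid(hidden_merge @ params["fuse"])
    loss = -np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob))
    grad_logit = (prob - label) / len(label)
    grad_fuse = hidden_merge.T @ grad_logit
    grad_merge = grad_logit @ params["fuse"].T
    # Step 5: split the gradient into the part that belongs to each party.
    g0, g1 = grad_merge[:, :HIDDEN], grad_merge[:, HIDDEN:]
    # Step 6: every party updates its own sub-network.
    params["base_alice"] -= lr * (x_alice.T @ g0)
    params["base_bob"] -= lr * (x_bob.T @ g1)
    params["fuse"] -= lr * grad_fuse
    return loss, g0, g1


losses = []
for _ in range(30):
    loss, g0, g1 = train_step()
    losses.append(loss)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Only hidden outputs and gradients cross the party boundary in this sketch; the raw features x_alice and x_bob are used solely by the party that owns them.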
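Task#
Marketing is the set of business and sales activities through which a bank, in a constantly changing market, meets customer needs and achieves its business goals. With today's big data, data analysis gives banks more effective tools: it can inform customer demand analysis, the understanding of target-market trends, and broader market strategy.
The dataset comes from Kaggle. It is a classic bank marketing dataset that records the telephone direct marketing campaigns of a Portuguese banking institution. The target variable is whether the client subscribed to a deposit product.
Data#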
There are 4521 samples in total: 3616 in the training set and 905 in the test set.
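There are 16 features, and the label is binary.
The data has been split in advance: alice holds the 4 basic-attribute features, bob holds the 12 bank-transaction features, and only alice holds the corresponding label.
Let's first look at what the bank marketing data looks like.
The original data is split into bank_alice and bank_bob, which are stored on alice's and bob's side respectively. The CSV files here contain the original data after splitting only, without any preprocessing.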
[1]:
%load_ext autoreload
%autoreload 2
import secretflow as sf
import matplotlib.pyplot as plt
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')
2023-04-24 10:20:58,140 INFO worker.py:1538 -- Started a local Ray instance.
Data preparation#
[2]:
import pandas as pd
from secretflow.utils.simulation.datasets import dataset
df = pd.read_csv(dataset('bank_marketing'), sep=';')
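We assume Alice is a new bank that has only the users' basic information, plus a purchased label that indicates whether each user has bought the financial product.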
[3]:
alice_data = df[["age", "job", "marital", "education", "y"]]
alice_data
[3]:
| | age | job | marital | education | y |
|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no |
| 1 | 33 | services | married | secondary | no |
| 2 | 35 | management | single | tertiary | no |
| 3 | 30 | management | married | tertiary | no |
| 4 | 59 | blue-collar | married | secondary | no |
| ... | ... | ... | ... | ... | ... |
| 4516 | 33 | services | married | secondary | no |
| 4517 | 57 | self-employed | married | tertiary | no |
| 4518 | 57 | technician | married | secondary | no |
| 4519 | 28 | blue-collar | married | secondary | no |
| 4520 | 44 | entrepreneur | single | tertiary | no |
4521 rows × 5 columns
Bob is an established bank that has each user's account balance, housing status, loan status, and feedback from recent marketing campaigns.
[4]:
bob_data = df[["default", "balance", "housing", "loan", "contact",
"day","month","duration","campaign","pdays","previous","poutcome"]]
bob_data
[4]:
| | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown |
| 1 | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure |
| 2 | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure |
| 3 | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown |
| 4 | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4516 | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | -1 | 0 | unknown |
| 4517 | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | -1 | 0 | unknown |
| 4518 | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | -1 | 0 | unknown |
| 4519 | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211 | 3 | other |
| 4520 | no | 1136 | yes | yes | cellular | 3 | apr | 345 | 2 | 249 | 7 | other |
4521 rows × 12 columns
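Environment setup#
Create two entities, [Alice, Bob], in the SecretFlow environment, where 'Alice' and 'Bob' are two PYUs (they were created in the first cell with sf.PYU). Once these two objects are constructed, you can start split learning.
Import dependencies#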
[5]:
from secretflow.data.split import train_test_split
from secretflow.ml.nn import SLModel
2023-04-24 10:20:59.841732: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-04-24 10:21:00.576963: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-04-24 10:21:00.577064: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-04-24 10:21:00.577078: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
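Prepare the data#
Create the federated table
A federated table is a virtual concept that spans multiple parties. We define VDataFrame for the vertical setting.
The data of each party in a federated table is stored locally and is not allowed to leave its domain.
No one except the party that owns the data can access the data store.
Any operation on the federated table is scheduled by the driver to each worker, and the execution instruction is passed down layer by layer until it reaches the Python runtime of the specific worker. The framework ensures that a worker can operate on the data only when its worker.device matches the device of the Object. The federated table is designed to manage and operate on multi-party data from a central perspective.
The interface of the federated table is aligned with pandas.DataFrame to reduce the cost of multi-party data operations. The SecretFlow framework provides mixed plaintext and ciphertext programming. A vertical federated table is built with an SPU, and MPC-PSI is used to securely obtain the intersection of the parties' data and align it.
VDataFrame provides a read_csv interface similar to that of pandas. The difference is that secretflow's read_csv takes a dictionary that defines the data path of each party. We can use secretflow.data.vertical.read_csv to build a VDataFrame.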
read_csv(file_dict, delimiter, ppu, keys, drop_key)
file_dict: A dictionary that maps each participant to the path of its file. Each path can be a relative or an absolute path to a local file.
ppu: PPU device for PSI. If this parameter is not specified, the data must be pre-aligned.
keys: Key used for the intersection.
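To make the role of keys concrete, the sketch below shows what intersection and alignment mean for two vertically partitioned tables. It is a plaintext stand-in written with the standard library only: real PSI computes the same intersection without revealing the non-matching keys, which this toy does not attempt. The function name align_by_key, the uid column, and the sample rows are all hypothetical.

```python
def align_by_key(rows_a, rows_b, key):
    """Keep only rows whose key appears on both sides, in the same order."""
    index_a = {row[key]: row for row in rows_a}
    index_b = {row[key]: row for row in rows_b}
    shared = sorted(index_a.keys() & index_b.keys())
    aligned_a = [index_a[k] for k in shared]
    aligned_b = [index_b[k] for k in shared]
    return aligned_a, aligned_b


# Hypothetical rows: each party holds different columns for overlapping users.
alice_rows = [
    {"uid": "u3", "age": 35},
    {"uid": "u1", "age": 30},
    {"uid": "u2", "age": 33},
]
bob_rows = [
    {"uid": "u2", "balance": 4789},
    {"uid": "u4", "balance": 1476},
    {"uid": "u3", "balance": 1350},
]

aligned_alice, aligned_bob = align_by_key(alice_rows, bob_rows, key="uid")
print([row["uid"] for row in aligned_alice])  # ['u2', 'u3']
print([row["uid"] for row in aligned_bob])    # ['u2', 'u3']
```

After alignment, row i on Alice's side and row i on Bob's side describe the same user, which is the precondition for training on vertically partitioned features.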
Create the SPU object
[6]:
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
[7]:
from secretflow.utils.simulation.datasets import load_bank_marketing
# Alice holds the first four features,
# while Bob holds the remaining ones.
data = load_bank_marketing(parts={alice: (0, 4), bob: (4, 16)}, axis=1)
# Alice holds the label.
label = load_bank_marketing(parts={alice: (16, 17)}, axis=1)
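data is the vertical federated table we have built. From a global view, it holds only the schema of all the data.
Let's take a closer look at how a VDataFrame (VDF) manages its data.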
[8]:
data['age'].partitions[alice].data
[8]:
<secretflow.device.device.pyu.PYUObject at 0x7fb54bc6d250>
[9]:
# You can uncomment this and you will get a KeyError.
# data['age'].partitions[bob]
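The two cells above show that a column lives only in the partition of the party that owns it. The toy class below mimics that bookkeeping with plain dictionaries. It is a hypothetical illustration, not the VDataFrame implementation: ToyVDataFrame, the layout of its partitions attribute, and the sample values are assumptions made for this sketch, and parties are plain strings instead of PYU devices.

```python
class ToyVDataFrame:
    """Maps each party to the columns it stores locally."""

    def __init__(self, partitions):
        # partitions: {party: {column_name: list_of_values}}
        self.partitions = partitions

    @property
    def columns(self):
        # The global view exposes only the schema, party by party.
        return [col for part in self.partitions.values() for col in part]

    def __getitem__(self, column):
        # Selecting a column keeps only the party that actually owns it.
        selected = {
            party: {column: part[column]}
            for party, part in self.partitions.items()
            if column in part
        }
        if not selected:
            raise KeyError(column)
        return ToyVDataFrame(selected)


toy = ToyVDataFrame(
    {
        "alice": {"age": [30, 33], "job": ["unemployed", "services"]},
        "bob": {"balance": [1787, 4789]},
    }
)

age = toy["age"]
print(toy.columns)           # ['age', 'job', 'balance']
print(list(age.partitions))  # ['alice']
try:
    age.partitions["bob"]
except KeyError:
    print("bob holds no partition of 'age'")
```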
[10]:
from secretflow.preprocessing.scaler import MinMaxScaler
from secretflow.preprocessing.encoder import LabelEncoder
[11]:
encoder = LabelEncoder()
data['job'] = encoder.fit_transform(data['job'])
data['marital'] = encoder.fit_transform(data['marital'])
data['education'] = encoder.fit_transform(data['education'])
data['default'] = encoder.fit_transform(data['default'])
data['housing'] = encoder.fit_transform(data['housing'])
data['loan'] = encoder.fit_transform(data['loan'])
data['contact'] = encoder.fit_transform(data['contact'])
data['poutcome'] = encoder.fit_transform(data['poutcome'])
data['month'] = encoder.fit_transform(data['month'])
label = encoder.fit_transform(label)
(SPURuntime pid=240157) 2023-04-24 10:21:03.623 [error] [context.cc:operator():132] connect to rank=0 failed with error [external/yacl/yacl/link/transport/channel_brpc.cc:368] send, rpc failed=112, message=[E111]Fail to connect Socket{id=0 addr=127.0.0.1:34003} (0x0x4821fc0): Connection refused [R1][E112]Not connected to 127.0.0.1:34003 yet, server_id=0 [R2][E112]Not connected to 127.0.0.1:34003 yet, server_id=0 [R3][E112]Not connected to 127.0.0.1:34003 yet, server_id=0
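As a rough picture of what label encoding does to a categorical column, the sketch below maps each distinct category to an integer. It assumes the behaviour of sklearn's LabelEncoder, which numbers the sorted distinct values from 0. The helper name label_encode is hypothetical, and the sample values are the first five entries of the education column shown earlier.

```python
def label_encode(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[value] for value in values], classes


# First five values of the education column shown earlier.
education = ["primary", "secondary", "tertiary", "tertiary", "secondary"]
codes, classes = label_encode(education)
print(classes)  # ['primary', 'secondary', 'tertiary']
print(codes)    # [0, 1, 2, 2, 1]
```

In sklearn, fit_transform refits the mapping on every call, so reusing one encoder object across several columns still gives each column its own mapping.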
[12]:
print(f"label= {type(label)},\ndata = {type(data)}")
label= <class 'secretflow.data.vertical.dataframe.VDataFrame'>,
data = <class 'secretflow.data.vertical.dataframe.VDataFrame'>
Normalize the data with MinMaxScaler
[13]:
scaler = MinMaxScaler()
data = scaler.fit_transform(data)
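Min-max scaling rescales each column to the range [0, 1] with (x - min) / (max - min). The sketch below applies that formula to a single column in plain Python. The helper min_max_scale is hypothetical and the input codes are illustrative; mapping a constant column to zeros is a choice made for this sketch.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column becomes all zeros."""
    low, high = min(column), max(column)
    span = high - low
    if span == 0:
        return [0.0 for _ in column]
    return [(value - low) / span for value in column]


# Four encoded categories 0..3 end up evenly spaced between 0 and 1.
scaled = min_max_scale([0, 1, 2, 3])
print([round(value, 6) for value in scaled])  # [0.0, 0.333333, 0.666667, 1.0]
```

Values such as 0.333333 and 0.666667 in the scaled education column printed later in this tutorial are consistent with a column that has four encoded categories.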
Next, we split the dataset into a training set and a test set
[14]:
from secretflow.data.split import train_test_split
random_state = 1234
train_data,test_data = train_test_split(data, train_size=0.8, random_state=random_state)
train_label,test_label = train_test_split(label, train_size=0.8, random_state=random_state)
(_run pid=239859) /home/limingbo/.conda/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but MinMaxScaler was fitted without feature names
(_run pid=239859) warnings.warn(
(_run pid=239859) /home/limingbo/.conda/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but MinMaxScaler was fitted without feature names
(_run pid=239859) warnings.warn(
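Summary: at this point we have defined the federated table, preprocessed the data, and split it into a training set and a test set. The SecretFlow framework defines the concept of a federated table that spans multiple parties, a set of operations on it that are logically equivalent to pandas.DataFrame, and preprocessing operations for it that are logically equivalent to sklearn. If you run into problems, refer to our documentation and API reference to learn about the other features.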
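The split sizes follow from simple arithmetic. The sketch below assumes that a train_size of 0.8 is applied by rounding the training count down, and that a batch size of 128 (the value used for training later in this tutorial) is applied with a final partial batch. Both assumptions reproduce the row counts and the number of steps per epoch that appear in the outputs of this tutorial.

```python
import math

n_samples = 4521   # rows in the loaded dataset
train_size = 0.8
batch_size = 128   # value used for training later in the tutorial

n_train = math.floor(n_samples * train_size)
n_test = n_samples - n_train
train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil(n_test / batch_size)

print(n_train, n_test)          # 3616 905
print(train_steps, test_steps)  # 29 8
```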
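Model introduction#
Single-machine version: a basic DNN is enough for this task. It takes the 16-dimensional features as input, passes them through a DNN, and outputs the probability of the positive and negative classes.
Federated version:
* Alice:
  - base_net: takes the 4-dimensional features as input and passes them through a DNN to obtain hidden.
  - fuse_net: receives her own hidden_alice and the hidden features computed by Bob, feeds them into the fuse_net for feature fusion, and sends the result through the rest of the network to complete the whole forward and backward pass.
* Bob:
  - base_net: takes the 12-dimensional features as input, passes them through a DNN to obtain hidden, and then sends hidden to Alice, who completes the remaining computation.
Define the model#
Next we create the federated model. For the vertical scenario we define SLTFModel and SLTorchModel (WIP) to build split learning models. The interfaces are simple, easy to use, and extensible, so you can easily convert an existing model into an SF-Model and then carry out federated modeling in the vertical scenario.
Split learning splits one model apart: one part runs locally where the data lives, and the other part runs on the party that holds the label, or on the server. First, let's define the locally executed model, base_model.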
[15]:
def create_base_model(input_dim, output_dim, name='base_model'):
    # Create model
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf

        model = keras.Sequential(
            [
                # Pass the shape as a tuple: one vector of input_dim features per sample
                keras.Input(shape=(input_dim,)),
                layers.Dense(100, activation="relu"),
                layers.Dense(output_dim, activation="relu"),
            ]
        )
        # Compile model
        model.summary()
        model.compile(
            loss='binary_crossentropy',
            optimizer='adam',
            metrics=["accuracy", tf.keras.metrics.AUC()],
        )
        return model

    return create_model
We use create_base_model to create the base models for alice and bob
[16]:
# prepare model
hidden_size = 64
# get the number of features of each party.
# When the input data changes, the network automatically adjusts to the input data
alice_input_feature_num = train_data.values.partition_shape()[alice][1]
bob_input_feature_num = train_data.values.partition_shape()[bob][1]
model_base_alice = create_base_model(alice_input_feature_num, hidden_size)
model_base_bob = create_base_model(bob_input_feature_num, hidden_size)
[17]:
model_base_alice()
model_base_bob()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 100) 500
dense_1 (Dense) (None, 64) 6464
=================================================================
Total params: 6,964
Trainable params: 6,964
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_2 (Dense) (None, 100) 1300
dense_3 (Dense) (None, 64) 6464
=================================================================
Total params: 7,764
Trainable params: 7,764
Non-trainable params: 0
_________________________________________________________________
[17]:
<keras.engine.sequential.Sequential at 0x7fb54bc6d610>
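Next we define the model on the side that holds the label, or on the server: fuse_model. In the definition of fuse_model we need to specify the loss, optimizer, and metrics correctly. Any configuration of your existing Keras model is compatible here.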
[18]:
def create_fuse_model(input_dim, output_dim, party_nums, name='fuse_model'):
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf

        # One input per party, each receiving that party's hidden output
        input_layers = []
        for i in range(party_nums):
            input_layers.append(keras.Input(shape=(input_dim,)))
        merged_layer = layers.concatenate(input_layers)
        fuse_layer = layers.Dense(64, activation='relu')(merged_layer)
        output = layers.Dense(output_dim, activation='sigmoid')(fuse_layer)
        model = keras.Model(inputs=input_layers, outputs=output)
        model.summary()
        model.compile(
            loss='binary_crossentropy',
            optimizer='adam',
            metrics=["accuracy", tf.keras.metrics.AUC()],
        )
        return model

    return create_model
[19]:
model_fuse = create_fuse_model(
input_dim=hidden_size, party_nums=2, output_dim=1)
[20]:
model_fuse()
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_3 (InputLayer) [(None, 64)] 0 []
input_4 (InputLayer) [(None, 64)] 0 []
concatenate (Concatenate) (None, 128) 0 ['input_3[0][0]',
'input_4[0][0]']
dense_4 (Dense) (None, 64) 8256 ['concatenate[0][0]']
dense_5 (Dense) (None, 1) 65 ['dense_4[0][0]']
==================================================================================================
Total params: 8,321
Trainable params: 8,321
Non-trainable params: 0
__________________________________________________________________________________________________
[20]:
<keras.engine.functional.Functional at 0x7fb548506d00>
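The parameter counts in the three summaries above can be checked by hand: a Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. The helper dense_params below is a hypothetical name used only for this check.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


hidden_size = 64

# Base models: input -> Dense(100) -> Dense(hidden_size)
alice_base = dense_params(4, 100) + dense_params(100, hidden_size)
bob_base = dense_params(12, 100) + dense_params(100, hidden_size)
# Fuse model: concatenation of two hidden vectors -> Dense(64) -> Dense(1)
fuse = dense_params(2 * hidden_size, 64) + dense_params(64, 1)

print(alice_base, bob_base, fuse)  # 6964 7764 8321
```

The only difference between the two base models is the first layer, whose size depends on how many features each party holds (4 for alice, 12 for bob).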
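Create the split learning model#
SecretFlow provides the split learning model SLModel. Initializing SLModel requires three parameters:
* base_model_dict: a dictionary that maps every client taking part in training to its base_model
* device_y: PYU, the party that holds the label
* model_fuse: the fusion model; the optimizer and the loss function are defined in this model
Define base_model_dict
base_model_dict: Dict[PYU, model_fn]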
[21]:
base_model_dict = {
    alice: model_base_alice,
    bob: model_base_bob
}
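Note that the dictionary stores builder functions rather than model instances: create_base_model returns create_model without calling it. A plausible reading, stated here as an assumption, is that each party can then build its own model locally. The standard-library sketch below shows only the closure pattern itself; make_builder and the dictionary that stands in for a model are hypothetical.

```python
def make_builder(input_dim, output_dim):
    """Capture the configuration now, build the object later."""

    def build():
        # A plain dict stands in for a compiled Keras model.
        return {"input_dim": input_dim, "output_dim": output_dim}

    return build


builders = {"alice": make_builder(4, 64), "bob": make_builder(12, 64)}

# Nothing is built until a party calls its builder.
alice_model = builders["alice"]()
another_alice_model = builders["alice"]()
print(alice_model)                         # {'input_dim': 4, 'output_dim': 64}
print(alice_model is another_alice_model)  # False
```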
[22]:
from secretflow.security.privacy import DPStrategy, GaussianEmbeddingDP, LabelDP
# Define DP operations
train_batch_size = 128
gaussian_embedding_dp = GaussianEmbeddingDP(
    noise_multiplier=0.5,
    l2_norm_clip=1.0,
    batch_size=train_batch_size,
    num_samples=train_data.values.partition_shape()[alice][0],
    is_secure_generator=False,
)
dp_strategy_alice = DPStrategy(embedding_dp=gaussian_embedding_dp)
label_dp = LabelDP(eps=64.0)
dp_strategy_bob = DPStrategy(label_dp=label_dp)
dp_strategy_dict = {alice: dp_strategy_alice, bob: dp_strategy_bob}
dp_spent_step_freq = 10
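The parameter names above suggest two familiar mechanisms, sketched below with NumPy. Both are generic textbook versions given as assumptions, not SecretFlow's code: the embedding sketch clips each row to an L2 norm of l2_norm_clip and adds Gaussian noise with standard deviation noise_multiplier * l2_norm_clip, and the label sketch uses binary randomized response, which keeps a label with probability e^eps / (e^eps + 1). The function names are hypothetical.

```python
import math

import numpy as np


def clip_and_add_noise(embedding, l2_norm_clip, noise_multiplier, rng):
    """Clip each row to a maximum L2 norm, then add Gaussian noise."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    scale = np.minimum(1.0, l2_norm_clip / np.maximum(norms, 1e-12))
    clipped = embedding * scale
    noise = rng.normal(scale=noise_multiplier * l2_norm_clip, size=embedding.shape)
    return clipped, clipped + noise


def flip_probability(eps):
    """Chance that binary randomized response reports the wrong label."""
    return 1.0 / (1.0 + math.exp(eps))


rng = np.random.default_rng(0)
embedding = rng.normal(size=(128, 64))  # one batch of hidden outputs
clipped, noisy = clip_and_add_noise(embedding, 1.0, 0.5, rng)

print(np.linalg.norm(clipped, axis=1).max() <= 1.0 + 1e-9)  # True
print(flip_probability(64.0) < 1e-20)                       # True
```

Under this randomized-response assumption, eps=64.0 gives a flip probability of about 1.6e-28, so the labels would be left effectively unchanged; smaller eps values trade accuracy for stronger protection.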
[23]:
sl_model = SLModel(
    base_model_dict=base_model_dict,
    device_y=alice,
    model_fuse=model_fuse,
    dp_strategy_dict=dp_strategy_dict,
)
INFO:root:Create proxy actor <class 'secretflow.ml.nn.sl.backend.tensorflow.sl_base.PYUSLTFModel'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.ml.nn.sl.backend.tensorflow.sl_base.PYUSLTFModel'> with party bob.
[24]:
sf.reveal(test_data.partitions[alice].data), sf.reveal(test_label.partitions[alice].data)
[24]:
( age job marital education
1426 0.279412 0.181818 0.5 0.333333
416 0.176471 0.636364 1.0 0.333333
3977 0.264706 0.000000 0.5 0.666667
2291 0.338235 0.000000 0.5 0.333333
257 0.132353 0.909091 1.0 0.333333
... ... ... ... ...
1508 0.264706 0.818182 1.0 0.333333
979 0.544118 0.090909 0.0 0.000000
3494 0.455882 0.090909 0.5 0.000000
42 0.485294 0.090909 0.5 0.333333
1386 0.455882 0.636364 0.5 0.333333
[905 rows x 4 columns],
y
1426 0
416 0
3977 0
2291 0
257 0
... ..
1508 0
979 0
3494 0
42 0
1386 0
[905 rows x 1 columns])
[25]:
sf.reveal(train_data.partitions[alice].data), sf.reveal(train_label.partitions[alice].data)
[25]:
( age job marital education
1106 0.235294 0.090909 0.5 0.333333
1309 0.176471 0.363636 0.5 0.333333
2140 0.411765 0.272727 1.0 0.666667
2134 0.573529 0.454545 0.5 0.333333
960 0.485294 0.818182 0.5 0.333333
... ... ... ... ...
664 0.397059 0.090909 1.0 0.333333
3276 0.235294 0.181818 0.5 0.666667
1318 0.220588 0.818182 0.5 0.333333
723 0.220588 0.636364 0.5 0.333333
2863 0.176471 0.363636 1.0 0.666667
[3616 rows x 4 columns],
y
1106 0
1309 0
2140 1
2134 0
960 0
... ..
664 0
3276 0
1318 0
723 0
2863 0
[3616 rows x 1 columns])
[26]:
history = sl_model.fit(
    train_data,
    train_label,
    validation_data=(test_data, test_label),
    epochs=10,
    batch_size=train_batch_size,
    shuffle=True,
    verbose=1,
    validation_freq=1,
    dp_spent_step_freq=dp_spent_step_freq,
)
INFO:root:SL Train Params: {'self': <secretflow.ml.nn.sl.sl_model.SLModel object at 0x7fb548492910>, 'x': VDataFrame(partitions={alice: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc8ff10>), bob: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc8b970>)}, aligned=True), 'y': VDataFrame(partitions={alice: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc6cfd0>)}, aligned=True), 'batch_size': 128, 'epochs': 10, 'verbose': 1, 'callbacks': None, 'validation_data': (VDataFrame(partitions={alice: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc8f610>), bob: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc8b8e0>)}, aligned=True), VDataFrame(partitions={alice: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc6c970>)}, aligned=True)), 'shuffle': True, 'sample_weight': None, 'validation_freq': 1, 'dp_spent_step_freq': 10, 'dataset_builder': None, 'audit_log_dir': None, 'audit_log_params': {}, 'random_seed': 13780}
(pid=240300) 2023-04-24 10:21:13.143944: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(pid=240309) 2023-04-24 10:21:13.339222: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(pid=240300) 2023-04-24 10:21:13.923649: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
(pid=240300) 2023-04-24 10:21:13.923745: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
(pid=240300) 2023-04-24 10:21:13.923756: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(pid=240309) 2023-04-24 10:21:14.119980: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
(pid=240309) 2023-04-24 10:21:14.120083: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
(pid=240309) 2023-04-24 10:21:14.120095: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(PYUSLTFModel pid=240300) 2023-04-24 10:21:16.143007: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
(PYUSLTFModel pid=240309) 2023-04-24 10:21:16.357362: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
(PYUSLTFModel pid=240300) Model: "sequential"
(PYUSLTFModel pid=240300) _________________________________________________________________
(PYUSLTFModel pid=240300) Layer (type) Output Shape Param #
(PYUSLTFModel pid=240300) =================================================================
(PYUSLTFModel pid=240300) dense (Dense) (None, 100) 500
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) dense_1 (Dense) (None, 64) 6464
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) =================================================================
(PYUSLTFModel pid=240300) Total params: 6,964
(PYUSLTFModel pid=240300) Trainable params: 6,964
(PYUSLTFModel pid=240300) Non-trainable params: 0
(PYUSLTFModel pid=240300) _________________________________________________________________
(PYUSLTFModel pid=240300) Model: "model"
(PYUSLTFModel pid=240300) __________________________________________________________________________________________________
(PYUSLTFModel pid=240300) Layer (type) Output Shape Param # Connected to
(PYUSLTFModel pid=240300) ==================================================================================================
(PYUSLTFModel pid=240300) input_2 (InputLayer) [(None, 64)] 0 []
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) input_3 (InputLayer) [(None, 64)] 0 []
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) concatenate (Concatenate) (None, 128) 0 ['input_2[0][0]',
(PYUSLTFModel pid=240300) 'input_3[0][0]']
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) dense_2 (Dense) (None, 64) 8256 ['concatenate[0][0]']
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) dense_3 (Dense) (None, 1) 65 ['dense_2[0][0]']
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) ==================================================================================================
(PYUSLTFModel pid=240300) Total params: 8,321
(PYUSLTFModel pid=240300) Trainable params: 8,321
(PYUSLTFModel pid=240300) Non-trainable params: 0
(PYUSLTFModel pid=240300) __________________________________________________________________________________________________
7%|▋ | 2/29 [00:00<00:02, 13.20it/s]
(PYUSLTFModel pid=240309) Model: "sequential"
(PYUSLTFModel pid=240309) _________________________________________________________________
(PYUSLTFModel pid=240309) Layer (type) Output Shape Param #
(PYUSLTFModel pid=240309) =================================================================
(PYUSLTFModel pid=240309) dense (Dense) (None, 100) 1300
(PYUSLTFModel pid=240309)
(PYUSLTFModel pid=240309) dense_1 (Dense) (None, 64) 6464
(PYUSLTFModel pid=240309)
(PYUSLTFModel pid=240309) =================================================================
(PYUSLTFModel pid=240309) Total params: 7,764
(PYUSLTFModel pid=240309) Trainable params: 7,764
(PYUSLTFModel pid=240309) Non-trainable params: 0
(PYUSLTFModel pid=240309) _________________________________________________________________
100%|██████████| 29/29 [00:03<00:00, 9.22it/s, epoch: 1/10 - train_loss:0.4876196086406708 train_accuracy:0.779037594795227 train_auc_1:0.523348331451416 val_loss:0.3805972635746002 val_accuracy:0.8729282021522522 val_auc_1:0.6476444602012634 ]
100%|██████████| 29/29 [00:00<00:00, 42.60it/s, epoch: 2/10 - train_loss:0.3252045810222626 train_accuracy:0.8960129022598267 train_auc_1:0.6373960375785828 val_loss:0.3656504452228546 val_accuracy:0.8729282021522522 val_auc_1:0.6615355014801025 ]
100%|██████████| 29/29 [00:00<00:00, 42.80it/s, epoch: 3/10 - train_loss:0.33422568440437317 train_accuracy:0.8832964897155762 train_auc_1:0.7038192749023438 val_loss:0.358445405960083 val_accuracy:0.8729282021522522 val_auc_1:0.6819758415222168 ]
100%|██████████| 29/29 [00:00<00:00, 42.65it/s, epoch: 4/10 - train_loss:0.31387007236480713 train_accuracy:0.88606196641922 train_auc_1:0.7519825100898743 val_loss:0.3427862823009491 val_accuracy:0.8729282021522522 val_auc_1:0.7419042587280273 ]
100%|██████████| 29/29 [00:00<00:00, 45.85it/s, epoch: 5/10 - train_loss:0.2894230782985687 train_accuracy:0.8866150379180908 train_auc_1:0.8085392713546753 val_loss:0.33072948455810547 val_accuracy:0.870718240737915 val_auc_1:0.7843313217163086 ]
100%|██████████| 29/29 [00:00<00:00, 44.84it/s, epoch: 6/10 - train_loss:0.27044418454170227 train_accuracy:0.8869742751121521 train_auc_1:0.8391960859298706 val_loss:0.3120502531528473 val_accuracy:0.8674033284187317 val_auc_1:0.8096477389335632 ]
100%|██████████| 29/29 [00:00<00:00, 42.93it/s, epoch: 7/10 - train_loss:0.25070708990097046 train_accuracy:0.8962942361831665 train_auc_1:0.8619815707206726 val_loss:0.31437328457832336 val_accuracy:0.8718231916427612 val_auc_1:0.838728666305542 ]
100%|██████████| 29/29 [00:00<00:00, 43.73it/s, epoch: 8/10 - train_loss:0.25882866978645325 train_accuracy:0.8933189511299133 train_auc_1:0.8460560441017151 val_loss:0.2909625768661499 val_accuracy:0.8773480653762817 val_auc_1:0.8433351516723633 ]
100%|██████████| 29/29 [00:00<00:00, 45.62it/s, epoch: 9/10 - train_loss:0.254334032535553 train_accuracy:0.8940818309783936 train_auc_1:0.8722440004348755 val_loss:0.2853069305419922 val_accuracy:0.8828729391098022 val_auc_1:0.8439790606498718 ]
100%|██████████| 29/29 [00:00<00:00, 50.14it/s, epoch: 10/10 - train_loss:0.24358023703098297 train_accuracy:0.8957411646842957 train_auc_1:0.8758358359336853 val_loss:0.2825777232646942 val_accuracy:0.8784530162811279 val_auc_1:0.8505613803863525 ]
Let’s visualize the training process
[27]:
print(history)
print(history.keys())
{'train_loss': [0.4876196, 0.32520458, 0.33422568, 0.31387007, 0.28942308, 0.27044418, 0.2507071, 0.25882867, 0.25433403, 0.24358024], 'train_accuracy': [0.7790376, 0.8960129, 0.8832965, 0.88606197, 0.88661504, 0.8869743, 0.89629424, 0.89331895, 0.89408183, 0.89574116], 'train_auc_1': [0.52334833, 0.63739604, 0.7038193, 0.7519825, 0.8085393, 0.8391961, 0.8619816, 0.84605604, 0.872244, 0.87583584], 'val_loss': [0.38059726, 0.36565045, 0.3584454, 0.34278628, 0.33072948, 0.31205025, 0.31437328, 0.29096258, 0.28530693, 0.28257772], 'val_accuracy': [0.8729282, 0.8729282, 0.8729282, 0.8729282, 0.87071824, 0.8674033, 0.8718232, 0.87734807, 0.88287294, 0.878453], 'val_auc_1': [0.64764446, 0.6615355, 0.68197584, 0.74190426, 0.7843313, 0.80964774, 0.83872867, 0.84333515, 0.84397906, 0.8505614]}
dict_keys(['train_loss', 'train_accuracy', 'train_auc_1', 'val_loss', 'val_accuracy', 'val_auc_1'])
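Since history is a plain dictionary of per-epoch lists, it can be inspected without extra tooling. The sketch below copies the validation values printed above and picks the best epoch for each metric. The helper best_epoch is hypothetical.

```python
def best_epoch(values, higher_is_better=True):
    """Return the 1-based epoch with the best value, and that value."""
    pick = max if higher_is_better else min
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Validation metrics copied from the history printed above.
val_loss = [0.38059726, 0.36565045, 0.3584454, 0.34278628, 0.33072948,
            0.31205025, 0.31437328, 0.29096258, 0.28530693, 0.28257772]
val_auc = [0.64764446, 0.6615355, 0.68197584, 0.74190426, 0.7843313,
           0.80964774, 0.83872867, 0.84333515, 0.84397906, 0.8505614]

print(best_epoch(val_loss, higher_is_better=False))  # (10, 0.28257772)
print(best_epoch(val_auc))                           # (10, 0.8505614)
```

Both metrics reach their best value at the last epoch, which suggests that training for more than 10 epochs might still improve the model.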
[28]:
# Plot the change of loss during training
plt.plot(history['train_loss'])
plt.plot(history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train','Val'], loc='upper right')
plt.show()

[29]:
# Plot the change of accuracy during training
plt.plot(history['train_accuracy'])
plt.plot(history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper left')
plt.show()

[30]:
# Plot the change of the Area Under Curve (AUC) during training
plt.plot(history['train_auc_1'])
plt.plot(history['val_auc_1'])
plt.title('Model Area Under Curve')
plt.ylabel('Area Under Curve')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper left')
plt.show()

Let's call the evaluation function to see how well the model was trained
[31]:
global_metric = sl_model.evaluate(test_data, test_label, batch_size=128)
print(global_metric)
Evaluate Processing:: 100%|██████████| 8/8 [00:00<00:00, 175.93it/s, loss:0.28720757365226746 accuracy:0.8883978128433228 auc_1:0.8435608148574829]
{'loss': 0.28720757, 'accuracy': 0.8883978, 'auc_1': 0.8435608}
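Comparison with a single-party model#
The model structure is kept consistent with the split learning model above, but only the model structure of alice, the party with the label, is used here. See the code below for the model definition. The data is the same Kaggle bank marketing dataset, but the single-party model uses only the data of the new bank, alice.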
[32]:
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
from sklearn.model_selection import train_test_split


def create_model():
    model = keras.Sequential(
        [
            # Alice's 4 features only
            keras.Input(shape=(4,)),
            layers.Dense(100, activation="relu"),
            layers.Dense(64, activation='relu'),
            layers.Dense(64, activation='relu'),
            layers.Dense(1, activation='sigmoid'),
        ]
    )
    model.compile(
        loss='binary_crossentropy',
        optimizer='adam',
        metrics=["accuracy", tf.keras.metrics.AUC()],
    )
    return model


single_model = create_model()
Data processing
[33]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
single_part_data = alice_data.copy()
single_part_data['job'] = encoder.fit_transform(alice_data['job'])
single_part_data['marital'] = encoder.fit_transform(alice_data['marital'])
single_part_data['education'] = encoder.fit_transform(alice_data['education'])
single_part_data['y'] = encoder.fit_transform(alice_data['y'])
[34]:
single_part_label = single_part_data['y']
single_part_data_no_label = single_part_data.drop(columns=['y'],inplace=False)
[35]:
scaler = MinMaxScaler()
single_part_data_no_label = scaler.fit_transform(single_part_data_no_label)
[36]:
train_data,test_data = train_test_split(single_part_data_no_label, train_size=0.8,random_state=random_state)
train_label,test_label = train_test_split(single_part_label, train_size=0.8,random_state=random_state)
[37]:
test_data.shape
[37]:
(905, 4)
[38]:
history =single_model.fit(train_data,train_label,validation_data=(test_data,test_label),batch_size=128,epochs=10,shuffle=False)
Epoch 1/10
29/29 [==============================] - 2s 13ms/step - loss: 0.5258 - accuracy: 0.8653 - auc_3: 0.4494 - val_loss: 0.4046 - val_accuracy: 0.8729 - val_auc_3: 0.4320
Epoch 2/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3747 - accuracy: 0.8877 - auc_3: 0.4590 - val_loss: 0.4003 - val_accuracy: 0.8729 - val_auc_3: 0.4279
Epoch 3/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3684 - accuracy: 0.8877 - auc_3: 0.4383 - val_loss: 0.3941 - val_accuracy: 0.8729 - val_auc_3: 0.4223
Epoch 4/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3623 - accuracy: 0.8877 - auc_3: 0.4465 - val_loss: 0.3904 - val_accuracy: 0.8729 - val_auc_3: 0.4248
Epoch 5/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3600 - accuracy: 0.8877 - auc_3: 0.4533 - val_loss: 0.3877 - val_accuracy: 0.8729 - val_auc_3: 0.4401
Epoch 6/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3578 - accuracy: 0.8877 - auc_3: 0.4655 - val_loss: 0.3857 - val_accuracy: 0.8729 - val_auc_3: 0.4659
Epoch 7/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3562 - accuracy: 0.8877 - auc_3: 0.4869 - val_loss: 0.3841 - val_accuracy: 0.8729 - val_auc_3: 0.4851
Epoch 8/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3550 - accuracy: 0.8877 - auc_3: 0.4975 - val_loss: 0.3828 - val_accuracy: 0.8729 - val_auc_3: 0.4969
Epoch 9/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3539 - accuracy: 0.8877 - auc_3: 0.5105 - val_loss: 0.3816 - val_accuracy: 0.8729 - val_auc_3: 0.5166
Epoch 10/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3528 - accuracy: 0.8877 - auc_3: 0.5216 - val_loss: 0.3807 - val_accuracy: 0.8729 - val_auc_3: 0.5241
Referring to the above visualization code, the training process of a local model can also be visualized
Summary#
The two experiments above simulate a typical training problem in a vertical scenario. Alice and Bob share the same group of samples, but each side holds only part of the features. Training on Alice's data alone yields a model with a validation accuracy of 0.8729 and an AUC of 0.5241. Combining Bob's data through split learning yields a model with an accuracy of 0.8884 and an AUC of 0.8436.
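The gap between accuracy and AUC in the single-party result is what class imbalance produces. The sketch below assumes a test set of 905 samples with 115 positives, a split chosen because 790 / 905 matches the reported accuracy of 0.8729, and a model that gives every sample the same low score. Whether the single-party model really behaves this way is an assumption; its constant validation accuracy across epochs and its AUC close to 0.5 are consistent with it. The helper names are hypothetical.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of samples whose thresholded score equals the label."""
    hits = sum(
        (score >= threshold) == bool(label)
        for label, score in zip(labels, scores)
    )
    return hits / len(labels)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [0] * 790 + [1] * 115
scores = [0.1] * len(labels)  # every sample gets the same low score

print(round(accuracy(labels, scores), 4))  # 0.8729
print(auc(labels, scores))                 # 0.5
```

Accuracy alone rewards always predicting the majority class, so AUC is the more informative number when comparing the two experiments.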
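Conclusion#
In this tutorial we introduced what split learning is and how to do split learning with the SecretFlow framework.
The experimental results show that split learning has a clear advantage in expanding the feature dimensions of the samples and in improving model performance through joint multi-party training.
This document uses plaintext aggregation for demonstration and does not consider leakage from the hidden layer. SecretFlow provides AggLayer, which avoids leaking the hidden layer through plaintext transmission by means of MPC, TEE, HE, and DP. If you are interested, see the relevant documentation.
As a next step, you may want to try a different dataset. You need to split the dataset vertically first and then follow the flow of this tutorial.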