Split Learning: Bank Marketing#

The following code is for demonstration only. Do not use it directly in a production environment.

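In this tutorial we use a bank marketing model as an example to show how to carry out split learning in the vertical scenario under the SecretFlow framework. SecretFlow provides a user-friendly API that makes it easy to apply your Keras or PyTorch model to split learning, turning it into a split learning model that can complete joint modeling tasks in the vertical scenario.

In the rest of this tutorial we demonstrate, step by step, how to turn your existing Keras model into a split learning model under SecretFlow and complete a federated multi-party modeling task.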

What Is Split Learning?#

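The core idea of split learning is to split the network structure. Each device (institution) keeps only part of the network, and the sub-networks of all devices together form a complete model. During training, each device (institution) performs forward or backward computation only on its local part of the network and passes the result to the next device. The devices train the joint model in this way until it converges.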

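Alice: holds data_alice and model_base_alice

Bob: holds data_bob, model_base_bob, and model_fuse

Alice passes her data through model_base_alice to obtain hidden_0 and sends it to Bob
Bob passes his data through model_base_bob to obtain hidden_1
hidden_0 and hidden_1 are fed into the AggLayer for aggregation, which outputs the aggregated hidden_merge
Bob feeds hidden_merge into model_fuse, combines it with the label to obtain the gradient, and propagates it back
The gradient is split by the AggLayer into two parts, g0 and g1, which are sent to Alice and Bob respectively
The base networks of Alice and Bob update their own models according to g0 and g1 respectively

In this general description Bob holds the label and model_fuse. In the hands-on part of this tutorial that role is played by Alice (device_y=alice).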
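The steps above can be sketched in plain Python. This is only a toy illustration under stated assumptions, not SecretFlow code: each base model is a single linear unit, aggregation is plain concatenation, model_fuse is a logistic unit, and the sample values, `train`, `dot`, and `sigmoid` are all made up for the example. As in the description above, the label holder runs the fuse step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train(samples, epochs=200, lr=0.3):
    """Train a toy split model and return the mean loss of each epoch."""
    w_alice = [0.1, -0.1]  # model_base_alice: one linear unit (assumption)
    w_bob = [0.1]          # model_base_bob: one linear unit (assumption)
    v = [0.5, 0.5]         # model_fuse weights over [hidden_0, hidden_1]
    b = 0.0                # model_fuse bias
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x_a, x_b, y in samples:
            # Forward pass: each party only touches its own features.
            hidden_0 = dot(w_alice, x_a)
            hidden_1 = dot(w_bob, x_b)
            hidden_merge = [hidden_0, hidden_1]  # aggregation by concatenation
            p = sigmoid(dot(v, hidden_merge) + b)
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))

            # Backward pass: gradient of the loss w.r.t. the fuse logit,
            # then split into one gradient per party (g0, g1).
            dz = p - y
            g0, g1 = dz * v[0], dz * v[1]

            # The label holder updates model_fuse.
            v = [v[0] - lr * dz * hidden_0, v[1] - lr * dz * hidden_1]
            b -= lr * dz

            # Each party updates its base model from its own gradient only.
            w_alice = [w - lr * g0 * x for w, x in zip(w_alice, x_a)]
            w_bob = [w - lr * g1 * x for w, x in zip(w_bob, x_b)]
        losses.append(total / len(samples))
    return losses


# (alice features, bob features, label); values are invented for the demo.
toy_samples = [
    ([0.2, 0.1], [0.9], 1),
    ([0.8, 0.7], [0.1], 0),
    ([0.3, 0.2], [0.8], 1),
    ([0.9, 0.6], [0.2], 0),
]
losses = train(toy_samples)
print(f"first epoch loss={losses[0]:.4f}, last epoch loss={losses[-1]:.4f}")
```

Note that neither party ever reads the other party's raw features: only hidden_0, hidden_1, and the gradients g0 and g1 cross the boundary.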

Task#

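Marketing is the set of overall business and sales activities that the banking industry carries out to meet customer needs and achieve business goals in a constantly changing market. In today's big-data environment, data analysis gives the banking industry more effective analytical tools: it can provide evidence and direction for analyzing customer needs, understanding trends in the target market, and shaping broader market strategy.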

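This dataset comes from Kaggle. It is a classic bank marketing dataset that records the telephone direct-marketing campaigns of a Portuguese banking institution. The target variable is whether the client subscribes to a deposit product.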

Data#

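The Kaggle dataset is described as having 11162 samples in total, 8929 for training and 2233 for testing. Note that the copy loaded in the cells below contains 4521 rows, which are split into 3616 training rows and 905 test rows.
There are 16 features, and the label is binary.
We have split the data in advance: alice holds the 4 basic attribute features, bob holds the 12 bank transaction features, and only alice holds the corresponding label.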
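As a rough illustration of this vertical split, the sketch below slices the column list by the index ranges that are passed to load_bank_marketing later in this tutorial. The column names are taken from the tables shown in the following cells; `split_columns` is a hypothetical helper, and treating each `(start, end)` pair as a half-open range is an assumption that happens to match the 4 + 12 feature counts.

```python
# Column order as it appears in the tables of this tutorial.
columns = [
    "age", "job", "marital", "education",
    "default", "balance", "housing", "loan", "contact", "day",
    "month", "duration", "campaign", "pdays", "previous", "poutcome",
    "y",
]


def split_columns(columns, parts):
    """Assign columns[start:end] to each party (hypothetical helper)."""
    return {party: columns[start:end] for party, (start, end) in parts.items()}


features = split_columns(columns, {"alice": (0, 4), "bob": (4, 16)})
label = split_columns(columns, {"alice": (16, 17)})

for party, cols in features.items():
    print(party, len(cols), cols)
print("label holder:", label)
```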

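Let's first take a look at what the bank marketing data looks like.

After splitting, the original data is divided into bank_alice and bank_bob, which are stored by alice and bob respectively. The CSV here is the original data that has only been split, without any preprocessing.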

[1]:

%load_ext autoreload
%autoreload 2

import secretflow as sf
import matplotlib.pyplot as plt

sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice, bob = sf.PYU('alice'), sf.PYU('bob')

2023-04-24 10:20:58,140 INFO worker.py:1538 -- Started a local Ray instance.

Data Preparation#

[2]:

import pandas as pd
from secretflow.utils.simulation.datasets import dataset

df = pd.read_csv(dataset('bank_marketing'), sep=';')

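We assume that Alice is a new bank that has only the basic information of its users, plus a purchased label indicating whether each user has bought a wealth-management product.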

[3]:

alice_data = df[["age", "job", "marital", "education", "y"]]
alice_data

[3]:

	age	job	marital	education	y
0	30	unemployed	married	primary	no
1	33	services	married	secondary	no
2	35	management	single	tertiary	no
3	30	management	married	tertiary	no
4	59	blue-collar	married	secondary	no
...	...	...	...	...	...
4516	33	services	married	secondary	no
4517	57	self-employed	married	tertiary	no
4518	57	technician	married	secondary	no
4519	28	blue-collar	married	secondary	no
4520	44	entrepreneur	single	tertiary	no

4521 rows × 5 columns

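Bob is an established bank that has each user's account balance, whether they have housing, whether they have a loan, and their recent marketing feedback.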

[4]:

bob_data = df[["default", "balance", "housing", "loan", "contact",
             "day","month","duration","campaign","pdays","previous","poutcome"]]
bob_data

[4]:

	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome
0	no	1787	no	no	cellular	19	oct	79	1	-1	0	unknown
1	no	4789	yes	yes	cellular	11	may	220	1	339	4	failure
2	no	1350	yes	no	cellular	16	apr	185	1	330	1	failure
3	no	1476	yes	yes	unknown	3	jun	199	4	-1	0	unknown
4	no	0	yes	no	unknown	5	may	226	1	-1	0	unknown
...	...	...	...	...	...	...	...	...	...	...	...	...
4516	no	-333	yes	no	cellular	30	jul	329	5	-1	0	unknown
4517	yes	-3313	yes	yes	unknown	9	may	153	1	-1	0	unknown
4518	no	295	no	no	cellular	19	aug	151	11	-1	0	unknown
4519	no	1137	no	no	cellular	6	feb	129	4	211	3	other
4520	no	1136	yes	yes	cellular	3	apr	345	2	249	7	other

4521 rows × 12 columns

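Setting Up the Environment#

Create two entities [Alice, Bob] in the SecretFlow environment, where 'Alice' and 'Bob' are two PYUs. Once these two objects have been constructed, you can start split learning.

Import Dependencies#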

[5]:

from secretflow.data.split import train_test_split
from secretflow.ml.nn import SLModel

Preparing the Data#

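Create the federated table

A federated table is a virtual concept that spans multiple parties. We define VDataFrame for the vertical setting.

The data of each party in a federated table is stored locally and is not allowed to leave its domain.
No one except the party that owns the data can access the data store.
Any operation on the federated table is scheduled by the driver to each worker, and the execution instructions are passed down layer by layer until they reach the Python runtime of the specific worker. The framework ensures that data can be operated on only when worker.device and Object.device are the same.
The federated table is designed to manage and operate on multi-party data from a central perspective.
The interface of the federated table is aligned with pandas.DataFrame to reduce the cost of multi-party data operations.
The SecretFlow framework provides mixed plaintext and ciphertext programming. The vertical federated table is built with SPU, and MPC-PSI is used to securely obtain the intersection and the aligned data of all parties.

VDataFrame provides a read_csv interface similar to that of pandas, except that it receives a dictionary that defines the data path of each party. We can use secretflow.data.vertical.read_csv to build a VDataFrame.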

read_csv(file_dict,delimiter,ppu,keys,drop_key)
    filepath: Path of the participant file. The address can be a relative or absolute path to a local file
    ppu: PPU Device for PSI; If this parameter is not specified, data must be prealigned
    keys: Key for intersection
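To make the role of the intersection key more concrete, here is a small plain-Python sketch of what alignment produces. It is only an illustration of the result: it compares keys in plaintext, whereas a real PSI protocol is designed so that parties do not learn each other's non-matching keys. The `uid` column, the row values, and `align_by_key` are hypothetical.

```python
def align_by_key(tables, key):
    """Keep only rows whose key appears in every party's table, in one shared order.

    tables maps a party name to a list of row dicts (hypothetical layout).
    """
    key_sets = [set(row[key] for row in rows) for rows in tables.values()]
    ordered_keys = sorted(set.intersection(*key_sets))
    aligned = {}
    for party, rows in tables.items():
        by_key = {row[key]: row for row in rows}
        aligned[party] = [by_key[k] for k in ordered_keys]
    return aligned


alice_rows = [{"uid": 3, "age": 30}, {"uid": 1, "age": 33}, {"uid": 7, "age": 35}]
bob_rows = [{"uid": 1, "balance": 4789}, {"uid": 9, "balance": 0}, {"uid": 3, "balance": 1787}]

aligned = align_by_key({"alice": alice_rows, "bob": bob_rows}, key="uid")
print(aligned["alice"])
print(aligned["bob"])
```

After alignment, row i on Alice's side and row i on Bob's side describe the same user, which is what vertical training relies on.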

Create the SPU object

[6]:

spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))

[7]:

from secretflow.utils.simulation.datasets import load_bank_marketing

# Alice has the first four features,
# while bob has the remaining features
data = load_bank_marketing(parts={alice: (0, 4), bob: (4, 16)}, axis=1)
# Alice holds the label.
label = load_bank_marketing(parts={alice: (16, 17)}, axis=1)

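data is the vertical federated table that has been built. From the global view it holds only the schema of all the data.

Let's take a closer look at how a VDataFrame manages data.

As the example shows, the age field belongs to alice, so the corresponding column can be obtained from alice's partition, whereas trying to get age from bob's side raises a `KeyError`.

A Partition is a data shard that we define. Each partition belongs to its own device, and only the owning device can operate on the data.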

[8]:

data['age'].partitions[alice].data

[8]:

<secretflow.device.device.pyu.PYUObject at 0x7fb54bc6d250>

[9]:

# You can uncomment this and you will get a KeyError.
# data['age'].partitions[bob]

Next, we preprocess the federated table.

We use LabelEncoder and MinMaxScaler as examples. Both have counterparts in `sklearn` and are used in much the same way.
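For readers unfamiliar with these two transforms, the following plain-Python sketch shows what they compute on a single column. It follows the sklearn conventions (classes sorted before indexing, scaling to [0, 1]) and does not use SecretFlow; the helper names are made up for the example.

```python
def label_encode(values):
    """Map each distinct value to its index in the sorted list of classes."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def min_max_scale(values):
    """Rescale numbers linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column
    return [(v - lo) / (hi - lo) for v in values]


marital = ["married", "single", "divorced", "married"]
codes, classes = label_encode(marital)
scaled = min_max_scale(codes)

print(classes)  # ['divorced', 'married', 'single']
print(codes)    # [1, 2, 0, 1]
print(scaled)   # [0.5, 1.0, 0.0, 0.5]
```

This is why the marital column in the preprocessed tables further down only takes the values 0.0, 0.5, and 1.0.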

[10]:

from secretflow.preprocessing.scaler import MinMaxScaler
from secretflow.preprocessing.encoder import LabelEncoder

[11]:

encoder = LabelEncoder()
data['job'] = encoder.fit_transform(data['job'])
data['marital'] = encoder.fit_transform(data['marital'])
data['education'] = encoder.fit_transform(data['education'])
data['default'] = encoder.fit_transform(data['default'])
data['housing'] = encoder.fit_transform(data['housing'])
data['loan'] = encoder.fit_transform(data['loan'])
data['contact'] = encoder.fit_transform(data['contact'])
data['poutcome'] = encoder.fit_transform(data['poutcome'])
data['month'] = encoder.fit_transform(data['month'])
label = encoder.fit_transform(label)

[12]:

print(f"label= {type(label)},\ndata = {type(data)}")

label= <class 'secretflow.data.vertical.dataframe.VDataFrame'>,
data = <class 'secretflow.data.vertical.dataframe.VDataFrame'>

Scale the features to the [0, 1] range with MinMaxScaler

[13]:

scaler = MinMaxScaler()

data = scaler.fit_transform(data)

Next, we split the dataset into a training set and a test set.
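The cell that follows passes the same random_state to both train_test_split calls. The sketch below illustrates why that matters: with the same seed, features and labels are shuffled with the same permutation, so rows stay paired. It uses Python's `random` module and a hypothetical `split_indices` helper, so the actual row order differs from what sklearn or SecretFlow would produce.

```python
import random


def split_indices(n, train_size, seed):
    """Shuffle range(n) with a fixed seed and cut it into train/test indices."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_size)
    return indices[:n_train], indices[n_train:]


def take(rows, indices):
    return [rows[i] for i in indices]


features = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]

# Same seed for both calls, so both use the same permutation.
train_idx, test_idx = split_indices(len(features), 0.8, seed=1234)
idx_again, _ = split_indices(len(labels), 0.8, seed=1234)

train_x, train_y = take(features, train_idx), take(labels, idx_again)

# Row counts for the 4521-row table used in this tutorial.
big_train, big_test = split_indices(4521, 0.8, seed=1234)
print(len(big_train), len(big_test))
```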

[14]:

from secretflow.data.split import train_test_split
random_state = 1234
train_data,test_data = train_test_split(data, train_size=0.8, random_state=random_state)
train_label,test_label = train_test_split(label, train_size=0.8, random_state=random_state)

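Summary: at this point we have defined the federated table, preprocessed the data, and split it into a training set and a test set. The SecretFlow framework defines the concept of a federated table that spans multiple parties, together with a set of operations built on it that are logically equivalent to pandas.DataFrame, and a set of preprocessing operations for federated tables that are logically equivalent to sklearn. If you run into problems, refer to our documentation and API reference to learn about other features.

Model Overview#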

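Standalone version: a basic DNN is enough for this task. It takes the 16-dimensional features as input, passes them through a DNN, and outputs the probability of the positive and negative classes.

Federated version:
* Alice:
  - base_net: takes the 4-dimensional features as input and passes them through a DNN to obtain hidden
  - fuse_net: receives Alice's own hidden_alice and the hidden features computed by Bob, fuses them, and feeds them to the subsequent network to complete the whole forward and backward pass
* Bob:
  - base_net: takes the 12-dimensional features as input, passes them through a DNN to obtain hidden, and then sends hidden to Alice to complete the rest of the computation

Defining the Models#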

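Next we create the federated model. For the vertical scenario we define SLTFModel and SLTorchModel (WIP) to build split learning. We provide a simple, easy-to-use, and extensible interface that makes it easy to convert your existing model into an SF-Model and then carry out federated modeling in the vertical scenario.

Split learning splits a model apart: one part is executed locally where the data resides, and the other part is executed on the party that holds the label, or on the server side. First, we define the locally executed model, base_model.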

[15]:

def create_base_model(input_dim, output_dim,  name='base_model'):
    # Create model
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        model = keras.Sequential(
            [
                keras.Input(shape=input_dim),
                layers.Dense(100, activation="relu"),
                layers.Dense(output_dim, activation="relu"),
            ]
        )
        # Compile model
        model.summary()
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model
    return create_model

We use create_base_model to create the base models for alice and bob.

[16]:

# prepare model
hidden_size = 64
# get the number of features of each party.
# When the input data changes, the network automatically adjusts to the input data
alice_input_feature_num = train_data.values.partition_shape()[alice][1]
bob_input_feature_num = train_data.values.partition_shape()[bob][1]

model_base_alice = create_base_model(alice_input_feature_num, hidden_size)
model_base_bob = create_base_model(bob_input_feature_num, hidden_size)

[17]:

model_base_alice()
model_base_bob()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense (Dense)               (None, 100)               500

 dense_1 (Dense)             (None, 64)                6464

=================================================================
Total params: 6,964
Trainable params: 6,964
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense_2 (Dense)             (None, 100)               1300

 dense_3 (Dense)             (None, 64)                6464

=================================================================
Total params: 7,764
Trainable params: 7,764
Non-trainable params: 0
_________________________________________________________________

[17]:

<keras.engine.sequential.Sequential at 0x7fb54bc6d610>

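Next we define the model for the party that holds the label, or the server side: fuse_model. In the definition of fuse_model we need to correctly define the loss, optimizer, and metrics. All the configurations of your existing Keras model are compatible here.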

[18]:

def create_fuse_model(input_dim, output_dim, party_nums, name='fuse_model'):
    def create_model():
        from tensorflow import keras
        from tensorflow.keras import layers
        import tensorflow as tf
        # input
        input_layers = []
        for i in range(party_nums):
            input_layers.append(keras.Input(input_dim,))

        merged_layer = layers.concatenate(input_layers)
        fuse_layer = layers.Dense(64, activation='relu')(merged_layer)
        output = layers.Dense(output_dim, activation='sigmoid')(fuse_layer)

        model = keras.Model(inputs=input_layers, outputs=output)
        model.summary()

        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
        return model
    return create_model

[19]:

model_fuse = create_fuse_model(
    input_dim=hidden_size, party_nums=2, output_dim=1)

[20]:

model_fuse()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_3 (InputLayer)           [(None, 64)]         0           []

 input_4 (InputLayer)           [(None, 64)]         0           []

 concatenate (Concatenate)      (None, 128)          0           ['input_3[0][0]',
                                                                  'input_4[0][0]']

 dense_4 (Dense)                (None, 64)           8256        ['concatenate[0][0]']

 dense_5 (Dense)                (None, 1)            65          ['dense_4[0][0]']

==================================================================================================
Total params: 8,321
Trainable params: 8,321
Non-trainable params: 0
__________________________________________________________________________________________________

[20]:

<keras.engine.functional.Functional at 0x7fb548506d00>
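The parameter counts printed in the model summaries above can be checked by hand: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The short sketch below reproduces the totals; `dense_params` is a helper introduced only for this check.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Base models: input -> Dense(100) -> Dense(64)
alice_base = dense_params(4, 100) + dense_params(100, 64)   # 500 + 6464
bob_base = dense_params(12, 100) + dense_params(100, 64)    # 1300 + 6464

# Fuse model: concat(64, 64) -> Dense(64) -> Dense(1)
fuse = dense_params(64 + 64, 64) + dense_params(64, 1)      # 8256 + 65

print(alice_base, bob_base, fuse)
```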

Creating the Split Learning Model#

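SecretFlow provides the split learning model SLModel. Initializing an SLModel requires three parameters:
* base_model_dict: a dictionary that maps every client taking part in training to its base_model
* device_y: a PYU, the party that holds the label
* model_fuse: the fusion model; the optimizer and loss function are defined in this model

Define base_model_dict

base_model_dict: Dict[PYU, model_fn]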
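Note that the dictionary values are model-building functions rather than model instances. A plausible reason, inferred from the training logs in this tutorial where each model summary is printed by a per-party worker process, is that each party builds its own model locally. The toy sketch below shows the pattern with plain dictionaries standing in for models; every name in it is hypothetical.

```python
def create_toy_base_model(input_dim, output_dim):
    """Return a builder; nothing is allocated until the builder is called."""

    def create_model():
        # In a real setup this would run inside the owning party's worker.
        return {
            "input_dim": input_dim,
            "output_dim": output_dim,
            "weights": [[0.0] * output_dim for _ in range(input_dim)],
        }

    return create_model


toy_base_model_dict = {
    "alice": create_toy_base_model(4, 64),
    "bob": create_toy_base_model(12, 64),
}

# Each "party" calls its own builder and gets its own independent model.
built = {party: build() for party, build in toy_base_model_dict.items()}
second_alice_model = toy_base_model_dict["alice"]()

print({party: (m["input_dim"], m["output_dim"]) for party, m in built.items()})
```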

[21]:

base_model_dict = {
    alice: model_base_alice,
    bob:   model_base_bob
}

[22]:

from secretflow.security.privacy import DPStrategy, GaussianEmbeddingDP, LabelDP

# Define DP operations
train_batch_size = 128
gaussian_embedding_dp = GaussianEmbeddingDP(
    noise_multiplier=0.5,
    l2_norm_clip=1.0,
    batch_size=train_batch_size,
    num_samples=train_data.values.partition_shape()[alice][0],
    is_secure_generator=False,
)
dp_strategy_alice = DPStrategy(embedding_dp=gaussian_embedding_dp)
label_dp = LabelDP(eps=64.0)
dp_strategy_bob = DPStrategy(label_dp=label_dp)
dp_strategy_dict = {alice: dp_strategy_alice, bob: dp_strategy_bob}
dp_spent_step_freq = 10
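The parameter names in the cell above (l2_norm_clip, noise_multiplier, eps) follow the usual vocabulary of differential privacy. As a hedged sketch of the general technique, and not of what GaussianEmbeddingDP or LabelDP do internally, the code below clips a vector to a maximum L2 norm, adds Gaussian noise whose standard deviation is noise_multiplier * l2_norm_clip, and computes the probability of keeping a binary label under standard randomized response. All function names are invented for the example.

```python
import math
import random


def clip_l2(vec, l2_norm_clip):
    """Scale vec down so that its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def add_gaussian_noise(vec, noise_multiplier, l2_norm_clip, rng):
    """Clip, then add zero-mean Gaussian noise to every coordinate."""
    std = noise_multiplier * l2_norm_clip
    return [v + rng.gauss(0.0, std) for v in clip_l2(vec, l2_norm_clip)]


def label_keep_probability(eps):
    """Randomized response for a binary label: e^eps / (e^eps + 1)."""
    return 1.0 / (1.0 + math.exp(-eps))


rng = random.Random(0)
embedding = [3.0, 4.0]                      # L2 norm is 5
clipped = clip_l2(embedding, 1.0)           # rescaled to norm 1
noisy = add_gaussian_noise(embedding, 0.5, 1.0, rng)
no_noise = add_gaussian_noise(embedding, 0.0, 1.0, rng)

print(clipped, noisy)
print(label_keep_probability(64.0))  # practically 1: labels almost never flip
```

Under this reading, eps=64.0 is a very weak privacy setting for the labels, which suits a demonstration where model quality should stay close to the non-private case.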

[23]:

sl_model = SLModel(
    base_model_dict=base_model_dict,
    device_y=alice,
    model_fuse=model_fuse,
    dp_strategy_dict=dp_strategy_dict,)

INFO:root:Create proxy actor <class 'secretflow.ml.nn.sl.backend.tensorflow.sl_base.PYUSLTFModel'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.ml.nn.sl.backend.tensorflow.sl_base.PYUSLTFModel'> with party bob.

[24]:

sf.reveal(test_data.partitions[alice].data), sf.reveal(test_label.partitions[alice].data)

[24]:

(           age       job  marital  education
 1426  0.279412  0.181818      0.5   0.333333
 416   0.176471  0.636364      1.0   0.333333
 3977  0.264706  0.000000      0.5   0.666667
 2291  0.338235  0.000000      0.5   0.333333
 257   0.132353  0.909091      1.0   0.333333
 ...        ...       ...      ...        ...
 1508  0.264706  0.818182      1.0   0.333333
 979   0.544118  0.090909      0.0   0.000000
 3494  0.455882  0.090909      0.5   0.000000
 42    0.485294  0.090909      0.5   0.333333
 1386  0.455882  0.636364      0.5   0.333333

 [905 rows x 4 columns],
       y
 1426  0
 416   0
 3977  0
 2291  0
 257   0
 ...  ..
 1508  0
 979   0
 3494  0
 42    0
 1386  0

 [905 rows x 1 columns])

[25]:

sf.reveal(train_data.partitions[alice].data), sf.reveal(train_label.partitions[alice].data)

[25]:

(           age       job  marital  education
 1106  0.235294  0.090909      0.5   0.333333
 1309  0.176471  0.363636      0.5   0.333333
 2140  0.411765  0.272727      1.0   0.666667
 2134  0.573529  0.454545      0.5   0.333333
 960   0.485294  0.818182      0.5   0.333333
 ...        ...       ...      ...        ...
 664   0.397059  0.090909      1.0   0.333333
 3276  0.235294  0.181818      0.5   0.666667
 1318  0.220588  0.818182      0.5   0.333333
 723   0.220588  0.636364      0.5   0.333333
 2863  0.176471  0.363636      1.0   0.666667

 [3616 rows x 4 columns],
       y
 1106  0
 1309  0
 2140  1
 2134  0
 960   0
 ...  ..
 664   0
 3276  0
 1318  0
 723   0
 2863  0

 [3616 rows x 1 columns])

[26]:

history = sl_model.fit(
    train_data,
    train_label,
    validation_data=(test_data, test_label),
    epochs=10,
    batch_size=train_batch_size,
    shuffle=True,
    verbose=1,
    validation_freq=1,
    dp_spent_step_freq=dp_spent_step_freq,
)

INFO:root:SL Train Params: {'self': <secretflow.ml.nn.sl.sl_model.SLModel object at 0x7fb548492910>, 'x': VDataFrame(partitions={alice: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc8ff10>), bob: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc8b970>)}, aligned=True), 'y': VDataFrame(partitions={alice: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc6cfd0>)}, aligned=True), 'batch_size': 128, 'epochs': 10, 'verbose': 1, 'callbacks': None, 'validation_data': (VDataFrame(partitions={alice: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc8f610>), bob: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc8b8e0>)}, aligned=True), VDataFrame(partitions={alice: Partition(data=<secretflow.device.device.pyu.PYUObject object at 0x7fb54bc6c970>)}, aligned=True)), 'shuffle': True, 'sample_weight': None, 'validation_freq': 1, 'dp_spent_step_freq': 10, 'dataset_builder': None, 'audit_log_dir': None, 'audit_log_params': {}, 'random_seed': 13780}

(PYUSLTFModel pid=240300) Model: "sequential"
(PYUSLTFModel pid=240300) _________________________________________________________________
(PYUSLTFModel pid=240300)  Layer (type)                Output Shape              Param #
(PYUSLTFModel pid=240300) =================================================================
(PYUSLTFModel pid=240300)  dense (Dense)               (None, 100)               500
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300)  dense_1 (Dense)             (None, 64)                6464
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) =================================================================
(PYUSLTFModel pid=240300) Total params: 6,964
(PYUSLTFModel pid=240300) Trainable params: 6,964
(PYUSLTFModel pid=240300) Non-trainable params: 0
(PYUSLTFModel pid=240300) _________________________________________________________________
(PYUSLTFModel pid=240300) Model: "model"
(PYUSLTFModel pid=240300) __________________________________________________________________________________________________
(PYUSLTFModel pid=240300)  Layer (type)                   Output Shape         Param #     Connected to
(PYUSLTFModel pid=240300) ==================================================================================================
(PYUSLTFModel pid=240300)  input_2 (InputLayer)           [(None, 64)]         0           []
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300)  input_3 (InputLayer)           [(None, 64)]         0           []
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300)  concatenate (Concatenate)      (None, 128)          0           ['input_2[0][0]',
(PYUSLTFModel pid=240300)                                                                   'input_3[0][0]']
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300)  dense_2 (Dense)                (None, 64)           8256        ['concatenate[0][0]']
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300)  dense_3 (Dense)                (None, 1)            65          ['dense_2[0][0]']
(PYUSLTFModel pid=240300)
(PYUSLTFModel pid=240300) ==================================================================================================
(PYUSLTFModel pid=240300) Total params: 8,321
(PYUSLTFModel pid=240300) Trainable params: 8,321
(PYUSLTFModel pid=240300) Non-trainable params: 0
(PYUSLTFModel pid=240300) __________________________________________________________________________________________________

  7%|▋         | 2/29 [00:00<00:02, 13.20it/s]

(PYUSLTFModel pid=240309) Model: "sequential"
(PYUSLTFModel pid=240309) _________________________________________________________________
(PYUSLTFModel pid=240309)  Layer (type)                Output Shape              Param #
(PYUSLTFModel pid=240309) =================================================================
(PYUSLTFModel pid=240309)  dense (Dense)               (None, 100)               1300
(PYUSLTFModel pid=240309)
(PYUSLTFModel pid=240309)  dense_1 (Dense)             (None, 64)                6464
(PYUSLTFModel pid=240309)
(PYUSLTFModel pid=240309) =================================================================
(PYUSLTFModel pid=240309) Total params: 7,764
(PYUSLTFModel pid=240309) Trainable params: 7,764
(PYUSLTFModel pid=240309) Non-trainable params: 0
(PYUSLTFModel pid=240309) _________________________________________________________________

100%|██████████| 29/29 [00:03<00:00,  9.22it/s, epoch: 1/10 -  train_loss:0.4876196086406708  train_accuracy:0.779037594795227  train_auc_1:0.523348331451416  val_loss:0.3805972635746002  val_accuracy:0.8729282021522522  val_auc_1:0.6476444602012634 ]
100%|██████████| 29/29 [00:00<00:00, 42.60it/s, epoch: 2/10 -  train_loss:0.3252045810222626  train_accuracy:0.8960129022598267  train_auc_1:0.6373960375785828  val_loss:0.3656504452228546  val_accuracy:0.8729282021522522  val_auc_1:0.6615355014801025 ]
100%|██████████| 29/29 [00:00<00:00, 42.80it/s, epoch: 3/10 -  train_loss:0.33422568440437317  train_accuracy:0.8832964897155762  train_auc_1:0.7038192749023438  val_loss:0.358445405960083  val_accuracy:0.8729282021522522  val_auc_1:0.6819758415222168 ]
100%|██████████| 29/29 [00:00<00:00, 42.65it/s, epoch: 4/10 -  train_loss:0.31387007236480713  train_accuracy:0.88606196641922  train_auc_1:0.7519825100898743  val_loss:0.3427862823009491  val_accuracy:0.8729282021522522  val_auc_1:0.7419042587280273 ]
100%|██████████| 29/29 [00:00<00:00, 45.85it/s, epoch: 5/10 -  train_loss:0.2894230782985687  train_accuracy:0.8866150379180908  train_auc_1:0.8085392713546753  val_loss:0.33072948455810547  val_accuracy:0.870718240737915  val_auc_1:0.7843313217163086 ]
100%|██████████| 29/29 [00:00<00:00, 44.84it/s, epoch: 6/10 -  train_loss:0.27044418454170227  train_accuracy:0.8869742751121521  train_auc_1:0.8391960859298706  val_loss:0.3120502531528473  val_accuracy:0.8674033284187317  val_auc_1:0.8096477389335632 ]
100%|██████████| 29/29 [00:00<00:00, 42.93it/s, epoch: 7/10 -  train_loss:0.25070708990097046  train_accuracy:0.8962942361831665  train_auc_1:0.8619815707206726  val_loss:0.31437328457832336  val_accuracy:0.8718231916427612  val_auc_1:0.838728666305542 ]
100%|██████████| 29/29 [00:00<00:00, 43.73it/s, epoch: 8/10 -  train_loss:0.25882866978645325  train_accuracy:0.8933189511299133  train_auc_1:0.8460560441017151  val_loss:0.2909625768661499  val_accuracy:0.8773480653762817  val_auc_1:0.8433351516723633 ]
100%|██████████| 29/29 [00:00<00:00, 45.62it/s, epoch: 9/10 -  train_loss:0.254334032535553  train_accuracy:0.8940818309783936  train_auc_1:0.8722440004348755  val_loss:0.2853069305419922  val_accuracy:0.8828729391098022  val_auc_1:0.8439790606498718 ]
100%|██████████| 29/29 [00:00<00:00, 50.14it/s, epoch: 10/10 -  train_loss:0.24358023703098297  train_accuracy:0.8957411646842957  train_auc_1:0.8758358359336853  val_loss:0.2825777232646942  val_accuracy:0.8784530162811279  val_auc_1:0.8505613803863525 ]

Let’s visualize the training process

[27]:

print(history)
print(history.keys())

{'train_loss': [0.4876196, 0.32520458, 0.33422568, 0.31387007, 0.28942308, 0.27044418, 0.2507071, 0.25882867, 0.25433403, 0.24358024], 'train_accuracy': [0.7790376, 0.8960129, 0.8832965, 0.88606197, 0.88661504, 0.8869743, 0.89629424, 0.89331895, 0.89408183, 0.89574116], 'train_auc_1': [0.52334833, 0.63739604, 0.7038193, 0.7519825, 0.8085393, 0.8391961, 0.8619816, 0.84605604, 0.872244, 0.87583584], 'val_loss': [0.38059726, 0.36565045, 0.3584454, 0.34278628, 0.33072948, 0.31205025, 0.31437328, 0.29096258, 0.28530693, 0.28257772], 'val_accuracy': [0.8729282, 0.8729282, 0.8729282, 0.8729282, 0.87071824, 0.8674033, 0.8718232, 0.87734807, 0.88287294, 0.878453], 'val_auc_1': [0.64764446, 0.6615355, 0.68197584, 0.74190426, 0.7843313, 0.80964774, 0.83872867, 0.84333515, 0.84397906, 0.8505614]}
dict_keys(['train_loss', 'train_accuracy', 'train_auc_1', 'val_loss', 'val_accuracy', 'val_auc_1'])

[28]:

# Plot the change of loss during training
plt.plot(history['train_loss'])
plt.plot(history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train','Val'], loc='upper right')
plt.show()

../_images/tutorial_Split_Learning_for_bank_marketing_63_0.png

[29]:

# Plot the change of accuracy during training
plt.plot(history['train_accuracy'])
plt.plot(history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper left')
plt.show()

../_images/tutorial_Split_Learning_for_bank_marketing_64_0.png

[30]:

# Plot the change of AUC (Area Under Curve) during training
plt.plot(history['train_auc_1'])
plt.plot(history['val_auc_1'])
plt.title('Model Area Under Curve')
plt.ylabel('Area Under Curve')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc='upper left')
plt.show()

../_images/tutorial_Split_Learning_for_bank_marketing_65_0.png

Let's call the evaluation function to see how well the training went.

[31]:

global_metric = sl_model.evaluate(test_data, test_label, batch_size=128)
print(global_metric)

Evaluate Processing:: 100%|██████████| 8/8 [00:00<00:00, 175.93it/s, loss:0.28720757365226746 accuracy:0.8883978128433228 auc_1:0.8435608148574829]

{'loss': 0.28720757, 'accuracy': 0.8883978, 'auc_1': 0.8435608}

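Comparison with the Single-Party Model#

The model structure is consistent with the split learning model above, but here only the model structure of alice, the party holding the label, is used. The model definition is in the code below. The data is the same Kaggle bank marketing data; for the single-party model we use only the data of the new bank, alice.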

[32]:

from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
from sklearn.model_selection import train_test_split

def create_model():

    model = keras.Sequential(
        [
            keras.Input(shape=4),
            layers.Dense(100, activation="relu"),
            layers.Dense(64, activation='relu'),
            layers.Dense(64, activation='relu'),
            layers.Dense(1, activation='sigmoid')
        ]
    )
    model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=["accuracy",tf.keras.metrics.AUC()])
    return model

single_model = create_model()

Data processing

[33]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
single_part_data = alice_data.copy()

single_part_data['job'] = encoder.fit_transform(alice_data['job'])
single_part_data['marital'] = encoder.fit_transform(alice_data['marital'])
single_part_data['education'] = encoder.fit_transform(alice_data['education'])
single_part_data['y'] =  encoder.fit_transform(alice_data['y'])

[34]:

single_part_label = single_part_data['y']
single_part_data_no_label = single_part_data.drop(columns=['y'],inplace=False)

[35]:

scaler = MinMaxScaler()
single_part_data_no_label = scaler.fit_transform(single_part_data_no_label)

[36]:

train_data,test_data = train_test_split(single_part_data_no_label, train_size=0.8,random_state=random_state)
train_label,test_label = train_test_split(single_part_label, train_size=0.8,random_state=random_state)

[37]:

test_data.shape

[37]:

(905, 4)

[38]:

history = single_model.fit(train_data, train_label, validation_data=(test_data, test_label), batch_size=128, epochs=10, shuffle=False)

Epoch 1/10
29/29 [==============================] - 2s 13ms/step - loss: 0.5258 - accuracy: 0.8653 - auc_3: 0.4494 - val_loss: 0.4046 - val_accuracy: 0.8729 - val_auc_3: 0.4320
Epoch 2/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3747 - accuracy: 0.8877 - auc_3: 0.4590 - val_loss: 0.4003 - val_accuracy: 0.8729 - val_auc_3: 0.4279
Epoch 3/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3684 - accuracy: 0.8877 - auc_3: 0.4383 - val_loss: 0.3941 - val_accuracy: 0.8729 - val_auc_3: 0.4223
Epoch 4/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3623 - accuracy: 0.8877 - auc_3: 0.4465 - val_loss: 0.3904 - val_accuracy: 0.8729 - val_auc_3: 0.4248
Epoch 5/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3600 - accuracy: 0.8877 - auc_3: 0.4533 - val_loss: 0.3877 - val_accuracy: 0.8729 - val_auc_3: 0.4401
Epoch 6/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3578 - accuracy: 0.8877 - auc_3: 0.4655 - val_loss: 0.3857 - val_accuracy: 0.8729 - val_auc_3: 0.4659
Epoch 7/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3562 - accuracy: 0.8877 - auc_3: 0.4869 - val_loss: 0.3841 - val_accuracy: 0.8729 - val_auc_3: 0.4851
Epoch 8/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3550 - accuracy: 0.8877 - auc_3: 0.4975 - val_loss: 0.3828 - val_accuracy: 0.8729 - val_auc_3: 0.4969
Epoch 9/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3539 - accuracy: 0.8877 - auc_3: 0.5105 - val_loss: 0.3816 - val_accuracy: 0.8729 - val_auc_3: 0.5166
Epoch 10/10
29/29 [==============================] - 0s 4ms/step - loss: 0.3528 - accuracy: 0.8877 - auc_3: 0.5216 - val_loss: 0.3807 - val_accuracy: 0.8729 - val_auc_3: 0.5241

Referring to the above visualization code, the training process of a local model can also be visualized

Summary#

The two experiments above simulate a typical training problem in the vertical scenario. Alice and Bob share the same set of samples, but each holds only part of the features. If Alice trains on her own data alone, she obtains a model with a validation accuracy of 0.8729 and an AUC of 0.5241. When Bob's data is combined through split learning, the model reaches an accuracy of 0.8884 and an AUC of 0.8436 on the test set.
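The single-party numbers deserve a closer look: the validation accuracy stays at 0.8729 in every epoch while the AUC hovers around 0.5, which is consistent with a model that mostly predicts the majority class. The sketch below, on invented labels and scores, shows how accuracy can look good on imbalanced data while AUC reveals that the scores carry no ranking information. The `accuracy` and `auc` helpers are written from the textbook definitions and are not the Keras implementations.

```python
def accuracy(labels, scores, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half (pairwise definition of ROC AUC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0] * 9 + [1]           # 90% negatives, like an imbalanced dataset
constant = [0.1] * 10            # the same score for everyone
ranked = [0.1] * 9 + [0.4]       # the positive is ranked first, still below 0.5

print(accuracy(labels, constant), auc(labels, constant))  # 0.9 0.5
print(accuracy(labels, ranked), auc(labels, ranked))      # 0.9 1.0
```

Both score sets give the same accuracy, but only the second one separates the classes, which is why the AUC gain from 0.5241 to 0.8436 is the more telling comparison here.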

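Conclusion#

In this tutorial we introduced what split learning is and how to perform split learning in the SecretFlow framework.
The experimental results show that split learning, by expanding the feature dimensions of the samples and training jointly across parties, improves model performance: in this experiment the AUC rises from 0.5241 to 0.8436.
This document uses plaintext aggregation for demonstration and does not consider leakage from the hidden layer. SecretFlow provides AggLayer, which avoids leaking hidden-layer outputs transmitted in plaintext by means of MPC, TEE, HE, DP, and other techniques. If you are interested, see the related documentation.
As a next step, you may want to try a different dataset. You need to split the dataset vertically first and then follow the workflow of this tutorial.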