Horizontal Federated XGBoost#
The following code is for demonstration only. Do not use it directly in production.
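In this tutorial, we will learn how to use SecretFlow to train a horizontally federated tree model. SecretFlow provides tree modeling capability for horizontal scenarios (SFXgboost). SFXgboost is similar to XGBoost, so you can easily convert an existing XGBoost program into a SecretFlow federated model.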
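XGBoost#
XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework.
Official documentation: XGBoost tutorials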
Prepare SecretFlow devices#
%load_ext autoreload
%autoreload 2
import secretflow as sf
# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address='local')
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')
XGBoost example#
import xgboost as xgb
import pandas as pd
from secretflow.utils.simulation.datasets import dataset
df = pd.read_csv(dataset('dermatology'))
# fillna returns a new DataFrame, so assign the result back
df = df.fillna(value=0)
print(df.dtypes)
y = df['class']
# Shift labels to start at 0: multi:softmax expects labels in [0, num_class)
y = y - 1
x = df.drop(columns="class")
dtrain = xgb.DMatrix(x, y)
dtest = dtrain
params = {
'max_depth': 4,
'objective': 'multi:softmax',
'min_child_weight': 1,
'max_bin': 10,
'num_class': 6,
'eval_metric': 'merror',
}
num_round = 4
watchlist = [(dtrain, 'train')]
bst = xgb.train(params, dtrain, num_round, evals=watchlist, early_stopping_rounds=2)
erythema int64
scaling int64
definite_borders int64
itching int64
koebner_phenomenon int64
polygonal_papules int64
follicular_papules int64
oral_mucosal_involvement int64
knee_and_elbow_involvement int64
scalp_involvement int64
family_history int64
melanin_incontinence int64
eosinophils_in_the_infiltrate int64
pnl_infiltrate int64
fibrosis_of_the_papillary_dermis int64
exocytosis int64
acanthosis int64
hyperkeratosis int64
parakeratosis int64
clubbing_of_the_rete_ridges int64
elongation_of_the_rete_ridges int64
thinning_of_the_suprapapillary_epidermis int64
spongiform_pustule int64
munro_microabcess int64
focal_hypergranulosis int64
disappearance_of_the_granular_layer int64
vacuolisation_and_damage_of_basal_layer int64
spongiosis int64
saw-tooth_appearance_of_retes int64
follicular_horn_plug int64
perifollicular_parakeratosis int64
inflammatory_monoluclear_inflitrate int64
band-like_infiltrate int64
age float64
class int64
dtype: object
[0] train-merror:0.01913
[1] train-merror:0.01366
[2] train-merror:0.01366
[3] train-merror:0.00820
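The merror values logged above are XGBoost's multiclass classification error rate, that is, the number of wrong predictions divided by the number of samples. A minimal sketch of that calculation follows; the function name and the toy labels are illustrative and are not taken from the dermatology run.

```python
def merror(y_true, y_pred):
    """Multiclass error rate: fraction of samples whose predicted class differs from the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Toy data (hypothetical): 8 samples, exactly one misclassified.
labels = [0, 1, 2, 3, 4, 5, 0, 1]
preds = [0, 1, 2, 3, 4, 5, 1, 1]
print(merror(labels, preds))  # 0.125
```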
So how do we do federated XGBoost in SecretFlow?#
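1. Use an iteration-based federated binning method to compute global bin boundaries from all parties' data. These boundaries serve as the candidate splits for the subsequent tree-building process.
2. Feed the data into each client's XGBoost engine and compute the gradients and hessians (G & H).
3. Run the federated tree-building process:
   1. Reassign the data to the nodes that are about to be split.
   2. Compute sum_of_grad and sum_of_hess for each of the previously computed bins.
   3. Send them to the server. The server performs secure aggregation, selects the split information, and sends it back to the clients.
   4. The clients update their local splits.
4. Finish training and save the model.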
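The histogram part of the steps above can be sketched in plain Python. This is an illustration only, not SecretFlow's implementation: it assumes the cut points were already agreed on by federated binning, uses a plain sum where the server would use secure aggregation, and scores splits with the standard XGBoost gain formula. All function names and the toy data are hypothetical.

```python
from bisect import bisect_right


def local_histogram(values, grads, hesses, cut_points):
    """Client step: sum grad and hess of the local samples falling into each bin."""
    n_bins = len(cut_points) + 1
    sum_grad = [0.0] * n_bins
    sum_hess = [0.0] * n_bins
    for v, g, h in zip(values, grads, hesses):
        b = bisect_right(cut_points, v)  # values below cut_points[i] land in bins 0..i
        sum_grad[b] += g
        sum_hess[b] += h
    return sum_grad, sum_hess


def aggregate(histograms):
    """Server step: element-wise sum of client histograms (stand-in for secure aggregation)."""
    grads = [sum(col) for col in zip(*(h[0] for h in histograms))]
    hesses = [sum(col) for col in zip(*(h[1] for h in histograms))]
    return grads, hesses


def best_split(sum_grad, sum_hess, cut_points, lam=1.0, gamma=0.0):
    """Scan every candidate threshold and return (gain, threshold) with the highest gain."""
    total_g, total_h = sum(sum_grad), sum(sum_hess)

    def score(g, h):
        return g * g / (h + lam)

    best = (float("-inf"), None)
    left_g = left_h = 0.0
    for i, cut in enumerate(cut_points):
        left_g += sum_grad[i]
        left_h += sum_hess[i]
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = 0.5 * (score(left_g, left_h) + score(right_g, right_h)
                      - score(total_g, total_h)) - gamma
        if gain > best[0]:
            best = (gain, cut)
    return best


# Toy data for one feature, split across two clients (hypothetical values).
cut_points = [2.0, 4.0]
alice_hist = local_histogram([1.0, 3.0, 5.0], [-1.0, -1.0, 1.0], [1.0, 1.0, 1.0], cut_points)
bob_hist = local_histogram([1.5, 4.5], [-1.0, 1.0], [1.0, 1.0], cut_points)

global_grad, global_hess = aggregate([alice_hist, bob_hist])
gain, threshold = best_split(global_grad, global_hess, cut_points, lam=1.0, gamma=0.0)
print(threshold, round(gain, 4))  # 4.0 1.7083
```

Only the per-bin sums leave each client in this sketch, so the server never sees individual rows; the chosen threshold is what would be sent back for the clients to apply locally.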
Create 3 entities [Alice, Bob, Charlie] in the SecretFlow environment, where 'Alice' and 'Bob' are the clients and 'Charlie' is the server. Then you are ready to start Federate Boosting.
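To give an intuition for what secure aggregation between the clients and the server means, here is a toy pairwise-masking sketch. It is an assumption-laden illustration, not a description of how SecretFlow's SecureAggregator works internally: the masks come from a single seeded, non-cryptographic random generator, whereas a real protocol would derive each pairwise mask from a key agreed between the two clients. All names below are hypothetical.

```python
import random

MODULUS = 2 ** 32  # all arithmetic is done modulo this value


def pairwise_masks(client_names, length, seed=0):
    """Build one mask vector per client such that all masks sum to 0 mod MODULUS."""
    rng = random.Random(seed)  # NOT cryptographically secure; for illustration only
    masks = {name: [0] * length for name in client_names}
    for i, a in enumerate(client_names):
        for b in client_names[i + 1:]:
            shared = [rng.randrange(MODULUS) for _ in range(length)]
            # Client a adds the shared mask and client b subtracts it, so they cancel.
            masks[a] = [(m + s) % MODULUS for m, s in zip(masks[a], shared)]
            masks[b] = [(m - s) % MODULUS for m, s in zip(masks[b], shared)]
    return masks


def mask(values, mask_vec):
    """Client step: hide local values before uploading them."""
    return [(v + m) % MODULUS for v, m in zip(values, mask_vec)]


def server_sum(masked_vectors):
    """Server step: sum the uploads; the pairwise masks cancel out."""
    return [sum(col) % MODULUS for col in zip(*masked_vectors)]


# Hypothetical non-negative integer statistics held by each client.
local = {"alice": [3, 1, 4], "bob": [2, 7, 1]}
masks = pairwise_masks(list(local), length=3)
uploads = [mask(local[name], masks[name]) for name in local]
print(server_sum(uploads))  # [5, 8, 5]
```

The server recovers only the element-wise total; each individual upload looks random because of the mask.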
Prepare data#
from secretflow.data.horizontal import read_csv
from secretflow.security.aggregation import SecureAggregator
from secretflow.security.compare import SPUComparator
from secretflow.utils.simulation.datasets import load_dermatology
aggr = SecureAggregator(charlie, [alice, bob])
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
comp = SPUComparator(spu)
data = load_dermatology(parts=[alice, bob], aggregator=aggr, comparator=comp)
data.fillna(value=0, inplace=True)
Prepare hyperparameters#
params = {
    # XGBoost parameter tutorial
    # https://xgboost.readthedocs.io/en/latest/parameter.html
    'max_depth': 4,  # Maximum tree depth
    'eta': 0.3,  # Learning rate
    'objective': 'multi:softmax',  # Objective function; supported: "binary:logistic", "reg:logistic", "multi:softmax", "multi:softprob", "reg:squarederror"
    'min_child_weight': 1,  # Minimum sum of instance weights (hessian) required in a child
    'lambda': 0.1,  # L2 regularization term on weights (xgb's lambda)
    'alpha': 0,  # L1 regularization term on weights (xgb's alpha)
    'max_bin': 10,  # Maximum number of bins
    'num_class': 6,  # Only required for multi-class classification
    'gamma': 0,  # Minimum gain required to make a split (similar to min_impurity_split)
    'subsample': 1.0,  # Row subsampling rate
    'colsample_bytree': 1.0,  # Feature sampling rate per tree
    'colsample_bylevel': 1.0,  # Feature sampling rate per level
    'eval_metric': 'merror',  # Supported eval metrics:
    # 1. rmse
    # 2. rmsle
    # 3. mape
    # 4. logloss
    # 5. error
    # 6. error@t
    # 7. merror
    # 8. mlogloss
    # 9. auc
    # 10. aucpr
    # Special params in SFXgboost
    # Required
    'hess_key': 'hess',  # Required. Name of the hess column; pick a name that is not already in the dataset
    'grad_key': 'grad',  # Required. Name of the grad column; pick a name that is not already in the dataset
    'label_key': 'class',  # Required. Name of the label column
}
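For reference, the way lambda, gamma, and eta enter tree building can be written out using the standard XGBoost formulation with alpha = 0, as configured above. Here G and H denote the summed gradients and hessians of the samples in the left (L) and right (R) child. That SFXgboost applies these formulas unchanged is an assumption based on its stated similarity to XGBoost.

```latex
% Gain of splitting a node into left (L) and right (R) children
\mathrm{Gain} = \frac{1}{2}\left[
    \frac{G_L^2}{H_L + \lambda}
  + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma

% Optimal weight of a leaf with gradient sum G and hessian sum H,
% scaled by the learning rate \eta when added to the ensemble
w^{*} = -\frac{G}{H + \lambda}, \qquad
\hat{y}^{(t)} = \hat{y}^{(t-1)} + \eta \, w^{*}
```

A split is kept only when its gain is positive, so a larger gamma prunes more aggressively, while min_child_weight rejects splits that would leave a child with a hessian sum below the threshold.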
Create SFXgboost#
from secretflow.ml.boost.homo_boost import SFXgboost
bst = SFXgboost(server=charlie, clients=[alice, bob])
Run SFXgboost
bst.train(data, data, params=params, num_boost_round=6)
(HomoBooster pid=3817964) [0] train-merror:0.01366 valid-merror:0.01366
(HomoBooster pid=3817954) [0] train-merror:0.01366 valid-merror:0.01366
(HomoBooster pid=3817963) [0] train-merror:0.01366 valid-merror:0.01366
(HomoBooster pid=3817964) [1] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817954) [1] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817963) [1] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817964) [2] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817954) [2] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817963) [2] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817964) [3] train-merror:0.01093 valid-merror:0.01093
(HomoBooster pid=3817954) [3] train-merror:0.01093 valid-merror:0.01093
(HomoBooster pid=3817963) [3] train-merror:0.01093 valid-merror:0.01093
(HomoBooster pid=3817964) [4] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817954) [4] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817963) [4] train-merror:0.00820 valid-merror:0.00820
(HomoBooster pid=3817964) [5] train-merror:0.00546 valid-merror:0.00546
(HomoBooster pid=3817954) [5] train-merror:0.00546 valid-merror:0.00546
(HomoBooster pid=3817963) [5] train-merror:0.00546 valid-merror:0.00546
At this point our federated XGBoost training is complete, and bst is the FedBoost object we have built.
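Summary#
This tutorial showed how to train a tree model in the horizontal federated setting.
SFXgboost encapsulates the tree-building logic of the federated subtree model. A model trained with SFXgboost remains compatible with XGBoost, so existing infrastructure can be used directly for tasks such as online prediction.
As a next step, you can apply SFXgboost to your own data by following the steps in this document.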