# Vertically Federated XGB (SecureBoost) 

>The following code is for demonstration only. It is **NOT for production use** because of system security concerns; please **DO NOT** use it directly in production.

Welcome to this tutorial on SecureBoost!

In this tutorial, we use SecretFlow's tree modeling capabilities to perform vertical federated learning with the SecureBoost algorithm. SecureBoost is a classical algorithm that prioritizes protecting label information on vertically partitioned datasets. It does so with homomorphic encryption (HE): the labels are encrypted, and the key tree-boosting steps are carried out on ciphertext. The result is a distributed boosted-tree model made up of PYUObjects, in which each party knows only its own split points. The implementation uses both HEU and PYU devices to achieve high performance with ease.
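
To make the idea of "boosting steps in ciphertext" concrete, here is a minimal sketch of additively homomorphic encryption, using a toy Paillier scheme with tiny primes. It is an illustration only and makes several assumptions: it is not the HEU implementation, it uses Paillier rather than the scheme configured later in this tutorial, the function names (`keygen`, `encrypt`, `decrypt`, `add_encrypted`) are hypothetical, and the key size is far too small to be secure. It shows how a party without the secret key can sum encrypted integer values, which only the key holder can then decrypt.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    # With generator g = n + 1, L(g^lam mod n^2) equals lam mod n,
    # so mu is simply the modular inverse of lam modulo n.
    mu = pow(lam, -1, n)
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Encrypt a non-negative integer m < n under public key n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, sk, c):
    """Recover the plaintext with the secret key (lam, mu)."""
    lam, mu = sk
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


rng = random.Random(0)
n, sk = keygen()

# Hypothetical integer-encoded values held by the key owner.
values = [3, 5, 7]
ciphertexts = [encrypt(n, v, rng) for v in values]

# A party without the secret key can still aggregate them.
total = ciphertexts[0]
for c in ciphertexts[1:]:
    total = add_encrypted(n, total, c)

print(decrypt(n, sk, total))  # 15
```

Only the holder of `sk` can read the aggregated value; the aggregating party sees ciphertexts throughout.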

Let's dive into the details and learn how to use SecureBoost with SecretFlow!

### Set up the devices

As with other algorithms, SecureBoost requires setting up a secure cluster and specifying devices.

In particular, an HEU device must be designated so that labels are encrypted and sensitive information is protected.

In [1]:
import spu
from sklearn.metrics import roc_auc_score

import secretflow as sf
from secretflow.data import FedNdarray, PartitionWay
from secretflow.device.driver import reveal, wait
from secretflow.ml.boost.sgb_v import Sgb
from secretflow.ml.boost.sgb_v.model import load_model

In [2]:
alice_ip = '127.0.0.1'
bob_ip = '127.0.0.1'
ip_party_map = {bob_ip: 'bob', alice_ip: 'alice'}

_system_config = {'lineage_pinning_enabled': False}
sf.shutdown()
# init cluster
sf.init(
    ['alice', 'bob'],
    address='local',
    _system_config=_system_config,
    object_store_memory=5 * 1024 * 1024 * 1024,
)

# SPU settings
cluster_def = {
    'nodes': [
        {'party': 'alice', 'id': 'local:0', 'address': alice_ip + ':12945'},
        {'party': 'bob', 'id': 'local:1', 'address': bob_ip + ':12946'},
        # {'party': 'carol', 'id': 'local:2', 'address': '127.0.0.1:12347'},
    ],
    'runtime_config': {
        # SEMI2K supports 2PC and 3PC, ABY3 supports only 3PC, and CHEETAH supports only 2PC.
        # The number of nodes above must match the protocol's party count.
        'protocol': spu.spu_pb2.SEMI2K,
        'field': spu.spu_pb2.FM128,
    },
}

# HEU settings
heu_config = {
    'sk_keeper': {'party': 'alice'},
    'evaluators': [{'party': 'bob'}],
    'mode': 'PHEU',
    'he_parameters': {
        # 'ou' is a fast encryption scheme that is as secure as Paillier.
        'schema': 'ou',
        'key_pair': {
            'generate': {
                # the bit size should be 2048 to provide sufficient security.
                'bit_size': 2048,
            },
        },
    },
    'encoding': {
        'cleartext_type': 'DT_I32',
        'encoder': "IntegerEncoder",
        'encoder_args': {"scale": 1},
    },
}

2023-04-19 13:56:51,630	INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
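
The comments in the cluster definition note that the number of nodes has to match the chosen protocol. As a small sketch of that rule, the hypothetical helper below (`check_party_count` is not part of SecretFlow or SPU) checks a node list against the party counts stated in those comments. It works on plain strings and dictionaries, so it can be tried without a running cluster.

```python
# Party counts per protocol, taken from the comments in the cluster definition.
SUPPORTED_PARTY_COUNTS = {
    'SEMI2K': {2, 3},
    'ABY3': {3},
    'CHEETAH': {2},
}


def check_party_count(protocol_name, nodes):
    """Return the node count, or raise if the protocol does not support it."""
    allowed = SUPPORTED_PARTY_COUNTS[protocol_name]
    if len(nodes) not in allowed:
        raise ValueError(
            f"{protocol_name} supports {sorted(allowed)} parties, got {len(nodes)}"
        )
    return len(nodes)


# Stand-in for the 'nodes' list used in this tutorial.
example_nodes = [
    {'party': 'alice', 'id': 'local:0', 'address': '127.0.0.1:12945'},
    {'party': 'bob', 'id': 'local:1', 'address': '127.0.0.1:12946'},
]

print(check_party_count('SEMI2K', example_nodes))
```

Running the same check with `'ABY3'` and two nodes raises a `ValueError`, which is the kind of mismatch the comments warn about.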


In [3]:
alice = sf.PYU('alice')
bob = sf.PYU('bob')
heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])



### Prepare Data
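Here we prepare a vertically partitioned dataset: alice holds the first 15 feature columns and the labels, and bob holds the remaining feature columns.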

In [4]:
from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()
x, y = ds['data'], ds['target']

v_data = FedNdarray(
    {
        alice: (alice(lambda: x[:, :15])()),
        bob: (bob(lambda: x[:, 15:])()),
    },
    partition_way=PartitionWay.VERTICAL,
)
label_data = FedNdarray(
    {alice: (alice(lambda: y)())},
    partition_way=PartitionWay.VERTICAL,
)
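
To picture what "vertical" means here, the following sketch splits a small table by columns using plain Python lists. It is only an illustration: `vertical_split` is a hypothetical helper, the data is synthetic, and no SecretFlow devices are involved. Every party sees all rows (samples) but only its own columns (features), and joining the parts row by row recovers the original table.

```python
def vertical_split(rows, split_at):
    """Split each row at a column index, giving two column-wise partitions."""
    left = [row[:split_at] for row in rows]
    right = [row[split_at:] for row in rows]
    return left, right


# Synthetic table with 4 samples and 30 feature columns.
rows = [[float(10 * i + j) for j in range(30)] for i in range(4)]

alice_part, bob_part = vertical_split(rows, 15)

# Same number of samples on both sides, different feature columns.
print(len(alice_part), len(alice_part[0]))
print(len(bob_part), len(bob_part[0]))

# Concatenating the partitions row by row recovers the original table.
rejoined = [a + b for a, b in zip(alice_part, bob_part)]
```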

### Prepare Params

In [5]:
params = {
    'num_boost_round': 5,
    'max_depth': 5,
    # about 13 bins (roughly 1 / sketch_eps)
    'sketch_eps': 0.08,
    # use 'linear' for regression;
    # for classification, only binary classification is currently supported
    'objective': 'logistic',
    'reg_lambda': 0.3,
    'subsample': 0.9,
    'colsample_by_tree': 0.9,
    # pre-pruning parameter: splits with a gain less than this value are pruned.
    'gamma': 1,
}
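
The sketch below illustrates how three of these parameters are commonly interpreted in XGBoost-style boosting. It assumes the standard XGBoost split-gain formula and is not taken from SecretFlow's source; the function names and the gradient and hessian sums are hypothetical. `sketch_eps` determines the approximate bin count, `reg_lambda` regularizes the leaf scores, and `gamma` acts as the pruning threshold.

```python
import math


def approx_num_bins(sketch_eps):
    """Approximate number of bins implied by sketch_eps (about 1 / sketch_eps)."""
    return math.ceil(1 / sketch_eps)


def leaf_score(g_sum, h_sum, reg_lambda):
    """Structure score of a leaf from its gradient and hessian sums."""
    return g_sum * g_sum / (h_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda):
    """Standard XGBoost-style gain of splitting a node into left and right."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = leaf_score(g_left, h_left, reg_lambda) + leaf_score(
        g_right, h_right, reg_lambda
    )
    return 0.5 * (children - parent)


num_bins = approx_num_bins(0.08)

# Hypothetical gradient and hessian sums for one candidate split.
gain = split_gain(g_left=-4.0, h_left=2.0, g_right=3.0, h_right=1.7, reg_lambda=0.3)

# Assumed pruning rule: keep the split only if its gain reaches gamma.
gamma = 1
keep_split = gain >= gamma

print(num_bins, round(gain, 4), keep_split)
```

A larger `gamma` prunes more aggressively, and a larger `reg_lambda` shrinks every leaf score, so both push toward simpler trees.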

### Run Sgb
We create an Sgb object with the HEU device and train it on the data.

In [6]:
sgb = Sgb(heu)
model = sgb.train(params, v_data, label_data)

INFO:root:Create proxy actor with party alice.
INFO:root:Create proxy actor with party bob.
INFO:root:Create proxy actor with party alice.
INFO:root:global_setup time 1.6802808750071563s
(LabelHolder pid=3311307) INFO:jax._src.xla_bridge:Remote TPU is not linked into jax; skipping remote TPU.
(LabelHolder pid=3311307) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu_driver': Could not initialize backend 'tpu_driver'
(LabelHolder pid=3311307) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(LabelHolder pid=3311307) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(LabelHolder pid=3311307) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
(LabelHolder pid=3311307) INFO:jax._src.xla_b

(HEUSkKeeper pid=3309706) [2023-04-19 13:57:02.080] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(HEUEvaluator pid=3311083) [2023-04-19 13:57:02.127] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63

INFO:root:epoch 0 time 3.0825682659633458s

(_run pid=3303716) [2023-04-19 13:57:02.636] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63
(_run pid=3303717) [2023-04-19 13:57:02.630] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63

INFO:root:epoch 1 time 0.5092060890165158s
INFO:root:epoch 2 time 0.4486289119813591s
INFO:root:epoch 3 time 0.47898129001259804s
INFO:root:epoch 4 time 0.3945654829731211s
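
Each boosting round in the log above starts from per-sample statistics that the label holder derives from the labels. The sketch below shows the usual first- and second-order statistics for a logistic objective as used in XGBoost-style boosting. It assumes the standard formulation and is not taken from SecretFlow's source; `logistic_grad_hess` is a hypothetical name. These label-dependent values are the kind of information SecureBoost is designed to keep hidden from the other parties.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(labels, raw_scores):
    """Gradient (p - y) and hessian p * (1 - p) of the logistic loss per sample."""
    grads, hess = [], []
    for y, z in zip(labels, raw_scores):
        p = sigmoid(z)
        grads.append(p - y)
        hess.append(p * (1.0 - p))
    return grads, hess


# With all raw scores at 0, every predicted probability is 0.5.
labels = [0, 1]
raw_scores = [0.0, 0.0]
grads, hess = logistic_grad_hess(labels, raw_scores)
print(grads, hess)  # [0.5, -0.5] [0.25, 0.25]
```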


### Model Evaluation
Now we can compare the model outputs with true labels. 

In [7]:
yhat = model.predict(v_data)
yhat = reveal(yhat)
print(f"auc: {roc_auc_score(y, yhat)}")

(_run pid=3303717) INFO:jax._src.xla_bridge:Remote TPU is not linked into jax; skipping remote TPU.
(_run pid=3303717) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu_driver': Could not initialize backend 'tpu_driver'
(_run pid=3303717) INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=3303717) INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
(_run pid=3303717) INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
(_run pid=3303717) INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable

auc: 0.9994979123724961
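
For readers unfamiliar with the metric, here is a small pure-Python sketch of what AUC measures: the fraction of (positive, negative) sample pairs in which the positive sample gets the higher score, with ties counted as half. `pairwise_auc` is a hypothetical helper written for illustration; it is quadratic in the number of samples and is not how `roc_auc_score` is implemented.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Tiny hand-made example: 3 of the 4 positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

An AUC close to 1, as reported above, means almost every positive sample is scored above almost every negative one. Note that this score is computed on the training data itself.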


### Model Save and Load
We can now save the model and load it for later use. Note that the model is a distributed entity, so it is saved to and loaded from multiple parties.

Let's first define the paths.

In [8]:
# each participating party needs a location to store its part of the model
saving_path_dict = {
    # in production we may use remote OSS, for example.
    device: "./" + device.party
    for device in v_data.partitions.keys()
}

Then let's save the model.

In [9]:
r = model.save_model(saving_path_dict)
wait(r)

You can now check the files at the specified locations.

Finally, let's load the model and do a sanity check.

In [10]:
# alice is our label holder
model_loaded = load_model(saving_path_dict, alice)
fed_yhat_loaded = model_loaded.predict(v_data, alice)
yhat_loaded = reveal(fed_yhat_loaded.partitions[alice])

assert (
    yhat == yhat_loaded
).all(), "loaded model predictions should match original, yhat {} vs yhat_loaded {}".format(
    yhat, yhat_loaded
)
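
The sanity check above uses exact equality. If a save and load round trip ever changed the last bits of a floating-point value, a tolerance-based comparison would be a gentler alternative. The sketch below shows one way to write it with the standard library. `predictions_match` is a hypothetical helper that expects flat sequences of floats; for NumPy arrays, `numpy.allclose` serves the same purpose.

```python
import math


def predictions_match(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """True if both sequences have equal length and element-wise close values."""
    a, b = list(a), list(b)
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


original = [0.91, 0.07, 0.55]
reloaded = [0.91, 0.07 + 1e-12, 0.55]

print(predictions_match(original, reloaded))
print(predictions_match(original, [0.91, 0.08, 0.55]))
```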

## Conclusion

Great job on completing the tutorial!

In conclusion, we have learned how to train tree models in SecretFlow and explored SecureBoost, a high-performance boosting algorithm designed specifically for vertically partitioned datasets. SecureBoost is similar to XGBoost but focuses on protecting sensitive labels in vertical learning scenarios. By using homomorphic encryption and PYUObjects, SecureBoost lets us train powerful distributed boosted-tree models while preserving the privacy and security of our data.

Thank you for participating in this tutorial, and we hope you found it informative and helpful!
