{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "9d11321e", "metadata": {}, "source": [ "# Vertically Federated XGB (SecureBoost) " ] }, { "cell_type": "markdown", "id": "079c8a84", "metadata": {}, "source": [ ">The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production." ] }, { "attachments": {}, "cell_type": "markdown", "id": "89135d8c", "metadata": {}, "source": [ "Welcome to this tutorial on SecureBoost!\n", "\n", "In this tutorial, we will explore how to use SecretFlow's tree modeling capabilities to perform vertical federated learning using the SecureBoost algorithm. SecureBoost is a classical algorithm that prioritizes the protection of label information on vertically partitioned datasets. It accomplishes this using Homomorphic Encryption technology, which allows for the encryption of labels and the execution of key tree boosting steps in ciphertext. The outcome is a distributed boosted-trees model comprised of PYUObjects, with each party having knowledge only of their own split points. This implementation utilizes both HEU and PYU devices to achieve high performance with ease.\n", "\n", "Let's dive into the details and learn how to use SecureBoost with SecretFlow!" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6c3af201", "metadata": {}, "source": [ "### Set up the devices\n", "\n", "Similar to other algorithms, setting up a secure cluster and specifying devices is necessary for SecureBoost implementation. \n", "\n", "In particular, a HEU device must be designated for SecureBoost to ensure the encryption of labels and the protection of sensitive information." ] }, { "cell_type": "code", "execution_count": 1, "id": "991ffbd0", "metadata": {}, "outputs": [], "source": [ "import spu\n", "from sklearn.metrics import roc_auc_score\n", "\n", "import secretflow as sf\n", "from secretflow.data import FedNdarray, PartitionWay\n", "from secretflow.device.driver import reveal, wait\n", "from secretflow.ml.boost.sgb_v import Sgb\n", "from secretflow.ml.boost.sgb_v.model import load_model" ] }, { "cell_type": "code", "execution_count": 2, "id": "9705a245", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2023-04-19 13:56:51,630\tINFO worker.py:1544 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n" ] } ], "source": [ "alice_ip = '127.0.0.1'\n", "bob_ip = '127.0.0.1'\n", "ip_party_map = {bob_ip: 'bob', alice_ip: 'alice'}\n", "\n", "_system_config = {'lineage_pinning_enabled': False}\n", "sf.shutdown()\n", "# init cluster\n", "sf.init(\n", " ['alice', 'bob'],\n", " address='local',\n", " _system_config=_system_config,\n", " object_store_memory=5 * 1024 * 1024 * 1024,\n", ")\n", "\n", "# SPU settings\n", "cluster_def = {\n", " 'nodes': [\n", " {'party': 'alice', 'id': 'local:0', 'address': alice_ip + ':12945'},\n", " {'party': 'bob', 'id': 'local:1', 'address': bob_ip + ':12946'},\n", " # {'party': 'carol', 'id': 'local:2', 'address': '127.0.0.1:12347'},\n", " ],\n", " 'runtime_config': {\n", " # SEMI2K support 2/3 PC, ABY3 only support 3PC, CHEETAH only support 2PC.\n", " # pls pay attention to size of nodes above. nodes size need match to PC setting.\n", " 'protocol': spu.spu_pb2.SEMI2K,\n", " 'field': spu.spu_pb2.FM128,\n", " },\n", "}\n", "\n", "# HEU settings\n", "heu_config = {\n", " 'sk_keeper': {'party': 'alice'},\n", " 'evaluators': [{'party': 'bob'}],\n", " 'mode': 'PHEU',\n", " 'he_parameters': {\n", " # ou is a fast encryption schema that is as secure as paillier.\n", " 'schema': 'ou',\n", " 'key_pair': {\n", " 'generate': {\n", " # bit size should be 2048 to provide sufficient security.\n", " 'bit_size': 2048,\n", " },\n", " },\n", " },\n", " 'encoding': {\n", " 'cleartext_type': 'DT_I32',\n", " 'encoder': \"IntegerEncoder\",\n", " 'encoder_args': {\"scale\": 1},\n", " },\n", "}" ] }, { "cell_type": "code", "execution_count": 3, "id": "377039e0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/zoupeicheng.zpc/miniconda3/envs/py38/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py:56: UserWarning: `gpustat` package is not installed. GPU monitoring is not available. To have full functionality of the dashboard please install `pip install ray[default]`.)\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\n" ] } ], "source": [ "alice = sf.PYU('alice')\n", "bob = sf.PYU('bob')\n", "heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "67d6f007", "metadata": {}, "source": [ "### Prepare Data\n", "Basically we are preparing a vertical dataset." ] }, { "cell_type": "code", "execution_count": 4, "id": "54ac3a76", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_breast_cancer\n", "\n", "ds = load_breast_cancer()\n", "x, y = ds['data'], ds['target']\n", "\n", "v_data = FedNdarray(\n", " {\n", " alice: (alice(lambda: x[:, :15])()),\n", " bob: (bob(lambda: x[:, 15:])()),\n", " },\n", " partition_way=PartitionWay.VERTICAL,\n", ")\n", "label_data = FedNdarray(\n", " {alice: (alice(lambda: y)())},\n", " partition_way=PartitionWay.VERTICAL,\n", ")" ] }, { "cell_type": "markdown", "id": "baffdd20", "metadata": {}, "source": [ "### Prepare Params" ] }, { "cell_type": "code", "execution_count": 5, "id": "2d51d646", "metadata": {}, "outputs": [], "source": [ "params = {\n", " 'num_boost_round': 5,\n", " 'max_depth': 5,\n", " # about 13 bin numbers\n", " 'sketch_eps': 0.08,\n", " # use 'linear' if want to do regression\n", " # for classification, currently only supports binary classfication\n", " 'objective': 'logistic',\n", " 'reg_lambda': 0.3,\n", " 'subsample': 0.9,\n", " 'colsample_by_tree': 0.9,\n", " # pre-pruning parameter. splits with gain value less than it will be pruned.\n", " 'gamma': 1,\n", "}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "57bf92f0", "metadata": {}, "source": [ "### Run Sgb\n", "We create a Sgb object with heu device and fit the data." ] }, { "cell_type": "code", "execution_count": 6, "id": "4bde4412", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Create proxy actor with party alice.\n", "INFO:root:Create proxy actor with party bob.\n", "INFO:root:Create proxy actor with party alice.\n", "INFO:root:global_setup time 1.6802808750071563s\n", "\u001b[2m\u001b[36m(LabelHolder pid=3311307)\u001b[0m INFO:jax._src.xla_bridge:Remote TPU is not linked into jax; skipping remote TPU.\n", "\u001b[2m\u001b[36m(LabelHolder pid=3311307)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu_driver': Could not initialize backend 'tpu_driver'\n", "\u001b[2m\u001b[36m(LabelHolder pid=3311307)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(LabelHolder pid=3311307)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(LabelHolder pid=3311307)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "\u001b[2m\u001b[36m(LabelHolder pid=3311307)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "\u001b[2m\u001b[36m(LabelHolder pid=3311307)\u001b[0m WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(HEUSkKeeper pid=3309706)\u001b[0m [2023-04-19 13:57:02.080] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(HEUEvaluator pid=3311083)\u001b[0m [2023-04-19 13:57:02.127] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:epoch 0 time 3.0825682659633458s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3303716)\u001b[0m [2023-04-19 13:57:02.636] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_run pid=3303717)\u001b[0m [2023-04-19 13:57:02.630] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:epoch 1 time 0.5092060890165158s\n", "INFO:root:epoch 2 time 0.4486289119813591s\n", "INFO:root:epoch 3 time 0.47898129001259804s\n", "INFO:root:epoch 4 time 0.3945654829731211s\n" ] } ], "source": [ "sgb = Sgb(heu)\n", "model = sgb.train(params, v_data, label_data)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "56f1ee3c", "metadata": {}, "source": [ "### Model Evaluation\n", "Now we can compare the model outputs with true labels. " ] }, { "cell_type": "code", "execution_count": 7, "id": "13c24066", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=3303717)\u001b[0m INFO:jax._src.xla_bridge:Remote TPU is not linked into jax; skipping remote TPU.\n", "\u001b[2m\u001b[36m(_run pid=3303717)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu_driver': Could not initialize backend 'tpu_driver'\n", "\u001b[2m\u001b[36m(_run pid=3303717)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3303717)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "\u001b[2m\u001b[36m(_run pid=3303717)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "\u001b[2m\u001b[36m(_run pid=3303717)\u001b[0m INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "\u001b[2m\u001b[36m(_run pid=3303717)\u001b[0m WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "INFO:jax._src.xla_bridge:Remote TPU is not linked into jax; skipping remote TPU.\n", "INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu_driver': Could not initialize backend 'tpu_driver'\n", "INFO:jax._src.xla_bridge:Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "INFO:jax._src.xla_bridge:Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'\n", "INFO:jax._src.xla_bridge:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "INFO:jax._src.xla_bridge:Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.\n", "WARNING:jax._src.xla_bridge:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "auc: 0.9994979123724961\n" ] } ], "source": [ "yhat = model.predict(v_data)\n", "yhat = reveal(yhat)\n", "print(f\"auc: {roc_auc_score(y, yhat)}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ab3fa323", "metadata": {}, "source": [ "### Model Save and Load\n", "We can now save the model and load it to use later. Note that the model is a distributed identity, we will save to and load from multiple parties.\n", "\n", "Let's first define the paths." ] }, { "cell_type": "code", "execution_count": 8, "id": "544c1c73", "metadata": {}, "outputs": [], "source": [ "# each participant party needs a location to store \n", "saving_path_dict = {\n", " # in production we may use remote oss, for example.\n", " device: \"./\" + device.party \n", " for device in v_data.partitions.keys()\n", "}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "30ff1297", "metadata": {}, "source": [ "Then let's save the model." ] }, { "cell_type": "code", "execution_count": 9, "id": "2c6c5d75", "metadata": {}, "outputs": [], "source": [ "r = model.save_model(saving_path_dict)\n", "wait(r)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9559e3da", "metadata": {}, "source": [ "Now you can check the files at specified location.\n", "\n", "Finally, let's load the model and do a sanity check." ] }, { "cell_type": "code", "execution_count": 10, "id": "1210d5c7", "metadata": {}, "outputs": [], "source": [ "# alice is our label holder\n", "model_loaded = load_model(saving_path_dict, alice)\n", "fed_yhat_loaded = model_loaded.predict(v_data, alice)\n", "yhat_loaded = reveal(fed_yhat_loaded.partitions[alice])\n", "\n", "assert (\n", " yhat == yhat_loaded\n", ").all(), \"loaded model predictions should match original, yhat {} vs yhat_loaded {}\".format(\n", " yhat, yhat_loaded\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "fcebea9b", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Great job on completing the tutorial!\n", "\n", "In conclusion, we have learned how to use tree models for training in SecretFlow and explored SecureBoost, a high-performance boosting algorithm designed specifically for vertically partitioned datasets. SecureBoost is similar to XGBoost but has a key focus on protecting sensitive labels in vertical learning scenarios. By utilizing homomorphic encryption and PYUObjects, SecureBoost allows us to train powerful distributed forest models while maintaining the privacy and security of our data.\n", "\n", "Thank you for participating in this tutorial, and we hope you found it informative and helpful!\n" ] }, { "cell_type": "markdown", "id": "d0d88145", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('sf')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" }, "vscode": { "interpreter": { "hash": "db45a4cb4cd37a8de684dfb7fcf899b68fccb8bd32d97c5ad13e5de1245c0986" } } }, "nbformat": 4, "nbformat_minor": 5 }