Decision Trees
==============

With the help of secret sharing, a secure multi-party computation technique, SecretFlow implements the provably secure gradient boosting model :py:meth:`~secretflow.ml.boost.ss_xgb_v.model.Xgb`, which supports both regression and binary classification tasks.

Dataset Settings
----------------

Vertically partitioned dataset:

- samples are aligned among the participants
- different participants hold different features
- one participant owns the label

.. image:: resources/v_dataset.png

XGBoost Training Algorithm
--------------------------

Algorithm details can be found in `the official XGBoost documentation <https://xgboost.readthedocs.io/>`_. The main process of building a single tree is as follows:

- Statistics calculation: compute the first-order gradient :math:`g_{i}` and the second-order gradient :math:`h_{i}` of each sample from the current prediction and the label, according to the definition of the loss function.

- Node splitting: enumerate all possible split candidates and choose the one with the maximal gain. A split candidate consists of a split feature and a split value; it divides the samples of the current node :math:`I` into two child nodes :math:`I_{L}` and :math:`I_{R}` according to their feature values. The split gain is computed with the following formula:

  .. image:: resources/gain_formula.png
      :height: 120px
      :width: 992px
      :scale: 50 %

  where :math:`\lambda` and :math:`\gamma` are the regularizers for the leaf weights and the number of leaves, respectively. In this way, the nodes are split recursively until the leaves are reached.

- Weight calculation: compute the weights of the leaf nodes with the following formula:

  .. image:: resources/weight_formula.png
      :height: 138px
      :width: 382px
      :scale: 45 %

Regression and classification share the same training process, except that:

1. they employ different loss functions: MSE for regression and logloss for classification.
2. classification applies an extra sigmoid function to transform the prediction into a probability.

SS-XGB Training
---------------

SS-XGB :py:meth:`~secretflow.ml.boost.ss_xgb_v.model.Xgb` uses secret sharing to compute the split gains and the leaf weights. To achieve secure joint training, all the computations are replaced with secret sharing protocols, e.g. addition, multiplication, etc. In addition, special care has to be taken to accumulate the gradients without leaking the feature partial order of the samples. This problem can be solved by introducing an indicator vector :math:`S`:

.. image:: resources/indicator_vecto.jpg

The samples to be accumulated are marked as 1 in :math:`S`, and the others as 0. To preserve privacy, the indicator vector is also transformed into secret shares. In this way, the gradient sum of the selected samples can be computed as the inner product of the indicator vector and the gradient vector, which secret sharing protocols can evaluate securely. The same indicator trick also hides the instance distribution over the tree nodes, as illustrated by the sketch below.
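To make the reduction concrete, here is a minimal plaintext NumPy sketch of one split evaluation. It is an illustration only: the gradient values and the split candidate are made up, and in SS-XGB every vector and every arithmetic operation below is carried out on secret shares inside the SPU rather than in the clear. Because only inner products and elementwise arithmetic appear, the whole evaluation maps directly onto the addition and multiplication primitives of secret sharing protocols.

.. code-block:: python

    import numpy as np

    # Gradients of 6 samples (plaintext here; secret shares in SS-XGB).
    g = np.array([0.5, -0.3, 0.2, -0.1, 0.4, -0.2])
    h = np.array([0.25, 0.21, 0.24, 0.20, 0.24, 0.22])

    # Indicator vector S: 1 for samples falling into the left child
    # of a split candidate, 0 otherwise. Also secret-shared in SS-XGB.
    s_left = np.array([1, 1, 0, 1, 0, 0])
    s_node = np.ones_like(g)  # all samples belong to the current node

    # Gradient sums as inner products -- the key primitive of the trick.
    G, H = s_node @ g, s_node @ h
    G_L, H_L = s_left @ g, s_left @ h
    G_R, H_R = G - G_L, H - H_L

    # Split gain with regularizers lambda (leaf weights) and gamma (leaf number).
    reg_lambda, gamma = 0.1, 0.0
    gain = 0.5 * (
        G_L**2 / (H_L + reg_lambda)
        + G_R**2 / (H_R + reg_lambda)
        - G**2 / (H + reg_lambda)
    ) - gamma

    # Leaf weight of a node with gradient sums (G, H).
    weight = -G / (H + reg_lambda)
    print(f"gain={gain:.4f}, weight={weight:.4f}")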
Refer to our paper `Large-Scale Secure XGB for Vertical Federated Learning <https://arxiv.org/abs/2005.08479>`_ for more details about the SS-XGB algorithm and its security analysis.

Example
-------

A local cluster (standalone mode) needs to be initialized as the running environment for this example. See `Deployment <../../getting_started/deployment.html>`_ and refer to the 'Cluster Mode' section. For more details about the APIs, see :py:meth:`~secretflow.ml.boost.ss_xgb_v.model.Xgb`.

.. code-block:: python

    import sys
    import time
    import logging

    import numpy as np
    from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

    import secretflow as sf
    from secretflow.ml.boost.ss_xgb_v import Xgb
    from secretflow.device.driver import wait, reveal
    from secretflow.data import FedNdarray, PartitionWay
    from secretflow.data.split import train_test_split

    # init log
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)

    # init all nodes in local standalone mode.
    sf.init(['alice', 'bob', 'carol'], address='local')

    # init PYU, the Python Processing Unit; it processes plaintext in each node.
    alice = sf.PYU('alice')
    bob = sf.PYU('bob')
    carol = sf.PYU('carol')

    # init SPU, the Secure Processing Unit;
    # it processes ciphertext under the protection of a multi-party secure computing protocol.
    spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob', 'carol']))

    # read data in each party
    def read_x(start, end):
        from sklearn.datasets import load_breast_cancer
        x = load_breast_cancer()['data']
        return x[:, start:end]

    def read_y():
        from sklearn.datasets import load_breast_cancer
        return load_breast_cancer()['target']

    # alice / bob / carol each hold one third of the features
    v_data = FedNdarray(
        partitions={
            alice: alice(read_x)(0, 10),
            bob: bob(read_x)(10, 20),
            carol: carol(read_x)(20, 30),
        },
        partition_way=PartitionWay.VERTICAL,
    )
    # the label belongs to alice
    label_data = FedNdarray(
        partitions={alice: alice(read_y)()},
        partition_way=PartitionWay.VERTICAL,
    )
    # wait for the IO to finish
    wait([p.data for p in v_data.partitions.values()])
    wait([p.data for p in label_data.partitions.values()])

    # split the train data and the test data
    random_state = 1234
    split_factor = 0.8
    v_train_data, v_test_data = train_test_split(
        v_data, train_size=split_factor, random_state=random_state
    )
    v_train_label, v_test_label = train_test_split(
        label_data, train_size=split_factor, random_state=random_state
    )

    # run SS-XGB
    xgb = Xgb(spu)
    start = time.time()
    params = {
        # for more detail, see the Xgb API doc
        'num_boost_round': 5,
        'max_depth': 5,
        'learning_rate': 0.1,
        'sketch_eps': 0.08,
        'objective': 'logistic',
        'reg_lambda': 0.1,
        'subsample': 1,
        'colsample_bytree': 1,
        'base_score': 0.5,
    }
    model = xgb.train(params, v_train_data, v_train_label)
    logging.info(f"train time: {time.time() - start}")

    # predict
    start = time.time()
    # the result is stored in the SPU as ciphertext
    spu_yhat = model.predict(v_test_data)
    # reveal it for the AUC, accuracy and classification report below.
    yhat = reveal(spu_yhat)
    logging.info(f"predict time: {time.time() - start}")
    y = reveal(v_test_label.partitions[alice])

    # the area under curve (AUC) score of the classification
    logging.info(f"auc: {roc_auc_score(y, yhat)}")
    binary_class_results = np.where(yhat > 0.5, 1, 0)
    # the accuracy score of the classification
    logging.info(f"acc: {accuracy_score(y, binary_class_results)}")
    # the classification report
    print("classification report:")
    print(classification_report(y, binary_class_results))
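Note that ``reveal`` decrypts the predictions and the label; it is used above only to evaluate the model in this tutorial, and decrypting results like this should be handled with care in production. Once the experiment is finished, release the resources of the local cluster:

.. code-block:: python

    # release the resources held by the local standalone cluster
    sf.shutdown()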