{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 隐语SecretFlow实际场景MPC算法开发实践\n", "\n", "> This tutorial is only available in Chinese.\n", "\n", "推荐使用`conda`创建一个新环境\n", "> conda create -n sf python=3.8\n", "\n", "直接使用`pip`安装secretflow\n", "> pip install -U secretflow\n", "\n", "基于secretflow:**0.7.7b1**版本\n", "\n", "此代码示例主要是展示了如何基于secretflow以及SPU隐私计算设备完成一个实际的应用的开发,推荐先看前一个教程[spu_basics](./spu_basics.ipynb)熟悉基本的SPU概念。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 任务介绍" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Vehicle Insurance Claim Fraud Detection\n", "\n", "该数据集来源于[kaggle](https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection),包含\n", "- 车辆数据集-属性、模型、事故详细信息等\n", "- 保单详细信息-保单类型、有效期等\n", "\n", "目标是检测索赔申请是否欺诈: 字段`FraudFound_P` (0 or 1) 即为预测的target值,是一个典型的**二分类场景**。" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## 实验目标\n", "在本次实验中,我们将会利用一个开源数据集在隐语上完成隐私保护的逻辑回归、神经网络模型和XGB模型。主要涉及到如下的几个流程:\n", "1. 数据加载\n", "2. 数据洞察\n", "3. 数据预处理\n", "4. 模型构建\n", "5. 模型的训练与预测" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 前置工作" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ray集群启动(多机部署)\n", "考虑多机部署的情况,在启动secretflow之前需要先将ray集群启动。在header节点和worker节点上各自执行下述的指令。\n", "> P.S. 启动集群之后,可以执行`ray status`看一下集群是否正确启动完成" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Header节点**\n", "``` Bash\n", "RAY_DISABLE_REMOTE_CODE=true \\\n", "ray start --head --node-ip-address=\"head_ip\" --port=\"head_port\" --resources='{\"alice\": 20}' --include-dashboard=False\n", "```\n", "**Worker节点**\n", "``` Bash\n", "RAY_DISABLE_REMOTE_CODE=true \\\n", "ray start --address=\"head_ip:head_port\" --resources='{\"bob\": 20}'\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# 如下是多机版初始化secretflow的代码,需要给出header节点的IP和PORT\n", "# head_ip = \"xxx\"\n", "# head_port = \"xxx\" \n", "# sf.init(address=f'{head_ip}:{head_port}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 单机部署\n", "我们在此使用单机部署的方式做一个样例展示。\n", "通过调用`sf.init()`我们实例化了一个ray集群,有5个节点,也就对应了5个物理设备。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2022-11-09 20:12:39.876005: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/rh/rh-ruby25/root/usr/local/lib64:/opt/rh/rh-ruby25/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib64:/opt/rh/devtoolset-11/root/usr/lib:/opt/rh/devtoolset-11/root/usr/lib64/dyninst:/opt/rh/devtoolset-11/root/usr/lib/dyninst\n" ] } ], "source": [ "import secretflow as sf\n", "sf.shutdown()\n", "# Standalone Mode\n", "sf.init(\n", " ['alice', 'bob', 'carol', 'davy', 'eric'],\n", " address='local'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 定义明文计算设备PYU\n", "我们在启动了上述5个节点之后,明确隐语中的逻辑设备。这里我们将alice、bob、carol三方作为数据的提供方,可以本地执行明文计算,也就是 **PYU (PYthon runtime Unit)** 设备。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "alice_\n" ] } ], "source": [ "alice = sf.PYU('alice')\n", "bob = sf.PYU('bob')\n", "carol = sf.PYU('carol')\n", "\n", "print(alice)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 定义密文计算设备SPU (3PC)\n", "进一步,我们以**SPU (Secure Processing Unit)** 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Data Loading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Data (Mock)\n", "With the logical devices in place, we demonstrate how data is read in, using a mock data-loading function `get_data_mock()`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x_plaintext: 2\n" ] } ], "source": [ "def get_data_mock():\n", "    return 2\n", "\n", "x_plaintext = get_data_mock()\n", "print(f\"x_plaintext: {x_plaintext}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reading data on a designated PYU device" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Plaintext Python Object: 2, PYU object: \n", "Reveal PYU object: 2\n" ] } ], "source": [ "x_alice_pyu = alice(get_data_mock)()\n", "\n", "print(f\"Plaintext Python Object: {x_plaintext}, PYU object: {x_alice_pyu}\")\n", "print(f\"Reveal PYU object: {sf.reveal(x_alice_pyu)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PYU → SPU data conversion" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SPU object: \n", "Reveal SPU object: 2\n" ] } ], "source": [ "x_alice_spu = x_alice_pyu.to(my_spu)\n", "print(f\"SPU object: {x_alice_spu}\")\n", "\n", "print(f\"Reveal SPU object: {sf.reveal(x_alice_spu)}\")" ] }, 
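{ "cell_type": "markdown", "metadata": {}, "source": [ "Besides revealing, a device object can be moved between devices with `.to()`. As a hedged sketch (not in the original flow), the secret-shared value can be reconstructed at another party's PYU instead of being made public:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A hedged sketch: move the SPU object to bob's PYU instead of revealing it publicly.\n", "x_bob_pyu = x_alice_spu.to(bob)  # the value is reconstructed only at bob\n", "print(sf.reveal(x_bob_pyu))      # 2, revealed here for demonstration only" ] }, 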
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Load Data (Distributed)\n", "We now consider reading the data of a real application scenario, where the full dataset is vertically partitioned across the participants.\n", "> For demonstration purposes we vertically split a centralized plaintext dataset ourselves. Let's first look at the dataset's features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reading the full plaintext dataset" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "--2022-11-09 20:25:18--  https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/vehicle_nsurance_claim/fraud_oracle.csv\n", "Resolving secretflow-data.oss-accelerate.aliyuncs.com (secretflow-data.oss-accelerate.aliyuncs.com)... 101.133.111.250\n", "Connecting to secretflow-data.oss-accelerate.aliyuncs.com (secretflow-data.oss-accelerate.aliyuncs.com)|101.133.111.250|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 3618564 (3.5M) [text/csv]\n", "Saving to: ‘fraud_oracle.csv’\n", "\n", " [... wget progress bars trimmed ...]\n", "\n", "2022-11-09 20:25:18 (15.5 MB/s) - ‘fraud_oracle.csv’ saved [3618564/3618564]\n", "\n" ] } ], "source": [ "import os\n", "\n", "\"\"\"\n", "Create a dir to save the dataset files\n", "This will create a directory `data` to store the dataset file\n", "\"\"\"\n", "if not os.path.exists('data'):\n", "    os.mkdir('data')\n", "\n", "\"\"\"\n", "The original data is from Kaggle: https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection. \n", "We promise we only use the data for this demo.\n", "\"\"\"\n", "path = \"https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/vehicle_nsurance_claim/fraud_oracle.csv\"\n", "if not os.path.exists('data/fraud_oracle.csv'):\n", "    res = os.system('cd data && wget {}'.format(path))\n", "    if res != 0:\n", "        raise Exception('File: {} download fails!'.format(path))\n", "else:\n", "    print('File already downloaded.')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MonthWeekOfMonthDayOfWeekMakeAccidentAreaDayOfWeekClaimedMonthClaimedWeekOfMonthClaimedSexMaritalStatus...AgeOfVehicleAgeOfPolicyHolderPoliceReportFiledWitnessPresentAgentTypeNumberOfSupplimentsAddressChange_ClaimNumberOfCarsYearBasePolicy
0Dec5WednesdayHondaUrbanTuesdayJan1FemaleSingle...3 years26 to 30NoNoExternalnone1 year3 to 41994Liability
1Jan3WednesdayHondaUrbanMondayJan4MaleSingle...6 years31 to 35YesNoExternalnoneno change1 vehicle1994Collision
2Oct5FridayHondaUrbanThursdayNov2MaleMarried...7 years41 to 50NoNoExternalnoneno change1 vehicle1994Collision
3Jun2SaturdayToyotaRuralFridayJul1MaleMarried...more than 751 to 65YesNoExternalmore than 5no change1 vehicle1994Liability
4Jan5MondayHondaUrbanTuesdayFeb2FemaleSingle...5 years31 to 35NoNoExternalnoneno change1 vehicle1994Collision
\n", "

5 rows × 33 columns

\n", "
" ], "text/plain": [ " Month WeekOfMonth DayOfWeek Make AccidentArea DayOfWeekClaimed \\\n", "0 Dec 5 Wednesday Honda Urban Tuesday \n", "1 Jan 3 Wednesday Honda Urban Monday \n", "2 Oct 5 Friday Honda Urban Thursday \n", "3 Jun 2 Saturday Toyota Rural Friday \n", "4 Jan 5 Monday Honda Urban Tuesday \n", "\n", " MonthClaimed WeekOfMonthClaimed Sex MaritalStatus ... AgeOfVehicle \\\n", "0 Jan 1 Female Single ... 3 years \n", "1 Jan 4 Male Single ... 6 years \n", "2 Nov 2 Male Married ... 7 years \n", "3 Jul 1 Male Married ... more than 7 \n", "4 Feb 2 Female Single ... 5 years \n", "\n", " AgeOfPolicyHolder PoliceReportFiled WitnessPresent AgentType \\\n", "0 26 to 30 No No External \n", "1 31 to 35 Yes No External \n", "2 41 to 50 No No External \n", "3 51 to 65 Yes No External \n", "4 31 to 35 No No External \n", "\n", " NumberOfSuppliments AddressChange_Claim NumberOfCars Year BasePolicy \n", "0 none 1 year 3 to 4 1994 Liability \n", "1 none no change 1 vehicle 1994 Collision \n", "2 none no change 1 vehicle 1994 Collision \n", "3 more than 5 no change 1 vehicle 1994 Liability \n", "4 none no change 1 vehicle 1994 Collision \n", "\n", "[5 rows x 33 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "import pandas as pd\n", "\n", "\"\"\"\n", "This should point to the data downloaded from Kaggle.\n", "By default, the .csv file shall be in the data directory\n", "\"\"\"\n", "full_data_path = 'data/fraud_oracle.csv'\n", "df = pd.read_csv(full_data_path)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 数据三方垂直拆分\n", "我们首先对这个数据进行一个拆分的处理,来模拟一个数据垂直分割的三方场景:\n", "\n", "- alice持有前10个属性\n", "- bob持有中间的10个属性\n", "- carol持有剩下的所有属性以及标签值\n", "\n", "同时为了方便各方之间的样本做对齐,我们加了一个新的特征`UID`来标识数据样本。\n", "\n", "我们预先基于sklearn将全集数据拆分成训练集和测试集,方便后续进行模型训练效果的验证。" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(10792, 32)\n", "(4626, 32)\n" ] } ], "source": [ "train_alice_path = \"data/alice_train.csv\"\n", "train_bob_path = \"data/bob_train.csv\"\n", "train_carol_path = \"data/carol_train.csv\"\n", "\n", "test_alice_path = \"data/alice_test.csv\"\n", "test_bob_path = \"data/bob_test.csv\"\n", "test_carol_path = \"data/carol_test.csv\"\n", "\n", "def load_dataset_full(data_path):\n", " df = pd.read_csv(data_path)\n", " df = df.drop([0])\n", " df = df.loc[df['DayOfWeekClaimed']!='0']\n", " y = df['FraudFound_P']\n", " X = df.drop(columns='FraudFound_P')\n", " return X, y\n", "\n", "def split_data():\n", " x, y = load_dataset_full(full_data_path)\n", " x_train, x_test, y_train, y_test = train_test_split(\n", " x, y, test_size=0.3, random_state=10\n", " )\n", "\n", " print(x_train.shape)\n", " train_alice_csv = x_train.iloc[:, :10]\n", " train_bob_csv = x_train.iloc[:, 10:20]\n", " train_carol_csv = pd.concat([x_train.iloc[:, 20:], y_train], axis=1)\n", "\n", " train_alice_csv.to_csv(train_alice_path, index_label='UID')\n", " train_bob_csv.to_csv(train_bob_path, index_label='UID')\n", " train_carol_csv.to_csv(train_carol_path, index_label='UID')\n", "\n", " print(x_test.shape)\n", " test_alice_csv = x_test.iloc[:, :10]\n", " test_bob_csv = x_test.iloc[:, 10:20]\n", " test_carol_csv = pd.concat([x_test.iloc[:, 20:], y_test], axis=1)\n", "\n", " test_alice_csv.to_csv(test_alice_path, index_label='UID')\n", " test_bob_csv.to_csv(test_bob_path, index_label='UID')\n", " 
test_carol_csv.to_csv(test_carol_path, index_label='UID')\n", "\n", "split_data()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UIDMonthWeekOfMonthDayOfWeekMakeAccidentAreaDayOfWeekClaimedMonthClaimedWeekOfMonthClaimedSexMaritalStatus
02853Mar4SundayToyotaUrbanFridayApr1MaleMarried
17261Apr4SaturdayHondaUrbanMondayApr4MaleMarried
29862Jun4SundayToyotaRuralMondayJun4FemaleSingle
314037Mar2MondayMazdaUrbanMondayMar2MaleSingle
410199Jun3FridayMazdaUrbanTuesdayJun4FemaleSingle
\n", "
" ], "text/plain": [ " UID Month WeekOfMonth DayOfWeek Make AccidentArea DayOfWeekClaimed \\\n", "0 2853 Mar 4 Sunday Toyota Urban Friday \n", "1 7261 Apr 4 Saturday Honda Urban Monday \n", "2 9862 Jun 4 Sunday Toyota Rural Monday \n", "3 14037 Mar 2 Monday Mazda Urban Monday \n", "4 10199 Jun 3 Friday Mazda Urban Tuesday \n", "\n", " MonthClaimed WeekOfMonthClaimed Sex MaritalStatus \n", "0 Apr 1 Male Married \n", "1 Apr 4 Male Married \n", "2 Jun 4 Female Single \n", "3 Mar 2 Male Single \n", "4 Jun 4 Female Single " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alice_train_df = pd.read_csv(train_alice_path)\n", "alice_train_df.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UIDAgeFaultPolicyTypeVehicleCategoryVehiclePricePolicyNumberRepNumberDeductibleDriverRatingDays_Policy_Accident
0285339Policy HolderSedan - All PerilsSedan20000 to 29000285484002more than 30
1726158Policy HolderSedan - LiabilitySport20000 to 29000726244004more than 30
2986228Policy HolderSedan - All PerilsSedanless than 20000986354004more than 30
31403728Policy HolderSedan - CollisionSedan20000 to 2900014038114004more than 30
41019935Policy HolderSedan - CollisionSedan20000 to 2900010200124004more than 30
\n", "
" ], "text/plain": [ " UID Age Fault PolicyType VehicleCategory \\\n", "0 2853 39 Policy Holder Sedan - All Perils Sedan \n", "1 7261 58 Policy Holder Sedan - Liability Sport \n", "2 9862 28 Policy Holder Sedan - All Perils Sedan \n", "3 14037 28 Policy Holder Sedan - Collision Sedan \n", "4 10199 35 Policy Holder Sedan - Collision Sedan \n", "\n", " VehiclePrice PolicyNumber RepNumber Deductible DriverRating \\\n", "0 20000 to 29000 2854 8 400 2 \n", "1 20000 to 29000 7262 4 400 4 \n", "2 less than 20000 9863 5 400 4 \n", "3 20000 to 29000 14038 11 400 4 \n", "4 20000 to 29000 10200 12 400 4 \n", "\n", " Days_Policy_Accident \n", "0 more than 30 \n", "1 more than 30 \n", "2 more than 30 \n", "3 more than 30 \n", "4 more than 30 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bob_train_df = pd.read_csv(train_bob_path)\n", "bob_train_df.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UIDDays_Policy_ClaimPastNumberOfClaimsAgeOfVehicleAgeOfPolicyHolderPoliceReportFiledWitnessPresentAgentTypeNumberOfSupplimentsAddressChange_ClaimNumberOfCarsYearBasePolicyFraudFound_P
02853more than 3017 years36 to 40NoNoExternalmore than 5no change1 vehicle1994All Perils0
17261more than 30nonemore than 751 to 65NoNoExternal1 to 2no change1 vehicle1995Liability0
29862more than 30none7 years31 to 35NoNoExternalnoneno change1 vehicle1995All Perils0
314037more than 3016 years31 to 35NoNoExternalnoneno change1 vehicle1996Collision0
410199more than 30none5 years31 to 35NoNoInternalnoneno change1 vehicle1995Collision0
\n", "
" ], "text/plain": [ " UID Days_Policy_Claim PastNumberOfClaims AgeOfVehicle AgeOfPolicyHolder \\\n", "0 2853 more than 30 1 7 years 36 to 40 \n", "1 7261 more than 30 none more than 7 51 to 65 \n", "2 9862 more than 30 none 7 years 31 to 35 \n", "3 14037 more than 30 1 6 years 31 to 35 \n", "4 10199 more than 30 none 5 years 31 to 35 \n", "\n", " PoliceReportFiled WitnessPresent AgentType NumberOfSuppliments \\\n", "0 No No External more than 5 \n", "1 No No External 1 to 2 \n", "2 No No External none \n", "3 No No External none \n", "4 No No Internal none \n", "\n", " AddressChange_Claim NumberOfCars Year BasePolicy FraudFound_P \n", "0 no change 1 vehicle 1994 All Perils 0 \n", "1 no change 1 vehicle 1995 Liability 0 \n", "2 no change 1 vehicle 1995 All Perils 0 \n", "3 no change 1 vehicle 1996 Collision 0 \n", "4 no change 1 vehicle 1995 Collision 0 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "carol_train_df = pd.read_csv(train_carol_path)\n", "carol_train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 三方数据加载\n", "注意:这里的接口里面需要显示地指明用于多方之间样本对齐的key,以及明确使用何种设备来执行PSI。" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VDataFrame(partitions={: Partition(data=), : Partition(data=), : Partition(data=)}, aligned=True)\n", "Index(['Month', 'WeekOfMonth', 'DayOfWeek', 'Make', 'AccidentArea',\n", " 'DayOfWeekClaimed', 'MonthClaimed', 'WeekOfMonthClaimed', 'Sex',\n", " 'MaritalStatus', 'Age', 'Fault', 'PolicyType', 'VehicleCategory',\n", " 'VehiclePrice', 'PolicyNumber', 'RepNumber', 'Deductible',\n", " 'DriverRating', 'Days_Policy_Accident', 'Days_Policy_Claim',\n", " 'PastNumberOfClaims', 'AgeOfVehicle', 'AgeOfPolicyHolder',\n", " 'PoliceReportFiled', 'WitnessPresent', 'AgentType',\n", " 'NumberOfSuppliments', 'AddressChange_Claim', 'NumberOfCars', 'Year',\n", " 'BasePolicy', 'FraudFound_P'],\n", " dtype='object')\n" ] } ], "source": [ "\n", "from secretflow.data.vertical import read_csv as v_read_csv\n", "\n", "train_ds = v_read_csv({alice: train_alice_path, bob: train_bob_path, carol: train_carol_path}, keys='UID', drop_keys='UID', spu=my_spu)\n", "test_ds = v_read_csv({alice: test_alice_path, bob: test_bob_path, carol: test_carol_path}, keys='UID', drop_keys='UID', spu=my_spu)\n", "print(train_ds)\n", "print(train_ds.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据洞察\n", "基于上层封装的VDataFrame抽象,隐语提供了多种数据分析的API,例如统计信息、查改某些列的信息等。" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WeekOfMonth 10792\n", "dtype: int64\n", "WeekOfMonth 5\n", "dtype: int64\n", "WeekOfMonth 1\n", "dtype: int64\n" ] } ], "source": [ "print(train_ds['WeekOfMonth'].count())\n", "\n", "print(train_ds['WeekOfMonth'].max())\n", "print(train_ds['WeekOfMonth'].min())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 数据预处理\n", "在读取完数据之后,下面我们演示如何在隐语上对一个实际多方持有的数据进行数据预处理。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Label Encoder\n", "\n", "对无序且二值的值,我们可以使用label encoding,转化为0/1表示" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Col name AccidentArea: ['Urban' 'Rural']\n", "Col name Sex: ['Female' 'Male']\n", "Col name Fault: ['Policy Holder' 'Third Party']\n", "Col name PoliceReportFiled: ['No' 'Yes']\n", "Col name WitnessPresent: ['No' 
'Yes']\n", "Col name AgentType: ['External' 'Internal']\n" ] } ], "source": [ "from secretflow.preprocessing import LabelEncoder\n", "\n", "cols = ['AccidentArea', 'Sex', 'Fault', 'PoliceReportFiled', 'WitnessPresent', 'AgentType']\n", "for col in cols:\n", " print(f\"Col name {col}: {df[col].unique()}\")\n", "\n", "train_ds_v1 = train_ds.copy()\n", "test_ds_v1 = test_ds.copy()\n", "\n", "label_encoder = LabelEncoder()\n", "for col in cols:\n", " label_encoder.fit(train_ds_v1[col])\n", " train_ds_v1[col] = label_encoder.transform(train_ds_v1[col])\n", " test_ds_v1[col] = label_encoder.transform(test_ds_v1[col])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Ordinal) Categorical Features\n", "对于有序的类别数据,我们构建映射,将类别数据转化为0~n-1的整数" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "orig ds in alice:\n", " Month WeekOfMonth DayOfWeek Make AccidentArea DayOfWeekClaimed \\\n", "0 Mar 4 Sunday Toyota 1 Friday \n", "1 Apr 4 Saturday Honda 1 Monday \n", "2 Jun 4 Sunday Toyota 0 Monday \n", "3 Mar 2 Monday Mazda 1 Monday \n", "4 Jun 3 Friday Mazda 1 Tuesday \n", "... ... ... ... ... ... ... \n", "10787 Sep 2 Wednesday Chevrolet 1 Monday \n", "10788 Apr 4 Tuesday Honda 1 Tuesday \n", "10789 Jul 1 Monday Ford 0 Wednesday \n", "10790 Feb 3 Wednesday Pontiac 1 Monday \n", "10791 Feb 5 Monday Mercury 1 Friday \n", "\n", " MonthClaimed WeekOfMonthClaimed Sex MaritalStatus \n", "0 Apr 1 1 Married \n", "1 Apr 4 1 Married \n", "2 Jun 4 0 Single \n", "3 Mar 2 1 Single \n", "4 Jun 4 0 Single \n", "... ... ... ... ... \n", "10787 Sep 2 0 Married \n", "10788 May 1 1 Married \n", "10789 Jul 1 1 Married \n", "10790 Feb 3 1 Married \n", "10791 Mar 1 1 Married \n", "\n", "[10792 rows x 10 columns]\n", "orig ds in alice:\n", " Month WeekOfMonth DayOfWeek Make AccidentArea \\\n", "0 3 4 7 Toyota 1 \n", "1 4 4 6 Honda 1 \n", "2 6 4 7 Toyota 0 \n", "3 3 2 1 Mazda 1 \n", "4 6 3 5 Mazda 1 \n", "... ... ... ... ... ... \n", "10787 9 2 3 Chevrolet 1 \n", "10788 4 4 2 Honda 1 \n", "10789 7 1 1 Ford 0 \n", "10790 2 3 3 Pontiac 1 \n", "10791 2 5 1 Mercury 1 \n", "\n", " DayOfWeekClaimed MonthClaimed WeekOfMonthClaimed Sex MaritalStatus \n", "0 5 4 1 1 Married \n", "1 1 4 4 1 Married \n", "2 1 6 4 0 Single \n", "3 1 3 2 1 Single \n", "4 2 6 4 0 Single \n", "... ... ... ... ... ... 
\n", "10787 1 9 2 0 Married \n", "10788 2 5 1 1 Married \n", "10789 3 7 1 1 Married \n", "10790 1 2 3 1 Married \n", "10791 5 3 1 1 Married \n", "\n", "[10792 rows x 10 columns]\n" ] } ], "source": [ "cols1 = [\n", " \"Days_Policy_Accident\",\n", " \"Days_Policy_Claim\",\n", " \"AgeOfPolicyHolder\",\n", " \"AddressChange_Claim\",\n", " \"NumberOfCars\",\n", "]\n", "col_disc = [\n", " {\n", " \"Days_Policy_Accident\": {\n", " \"more than 30\": 31,\n", " \"15 to 30\": 22.5,\n", " \"none\": 0,\n", " \"1 to 7\": 4,\n", " \"8 to 15\": 11.5,\n", " }\n", " },\n", " {\n", " \"Days_Policy_Claim\": {\n", " \"more than 30\": 31,\n", " \"15 to 30\": 22.5,\n", " \"8 to 15\": 11.5,\n", " \"none\": 0,\n", " }\n", " },\n", " {\n", " \"AgeOfPolicyHolder\": {\n", " \"26 to 30\": 28,\n", " \"31 to 35\": 33,\n", " \"41 to 50\": 45.5,\n", " \"51 to 65\": 58,\n", " \"21 to 25\": 23,\n", " \"36 to 40\": 38,\n", " \"16 to 17\": 16.5,\n", " \"over 65\": 66,\n", " \"18 to 20\": 19,\n", " }\n", " },\n", " {\n", " \"AddressChange_Claim\": {\n", " \"1 year\": 1,\n", " \"no change\": 0,\n", " \"4 to 8 years\": 6,\n", " \"2 to 3 years\": 2.5,\n", " \"under 6 months\": 0.5,\n", " }\n", " },\n", " {\n", " \"NumberOfCars\": {\n", " \"3 to 4\": 3.5,\n", " \"1 vehicle\": 1,\n", " \"2 vehicles\": 2,\n", " \"5 to 8\": 6.5,\n", " \"more than 8\": 9,\n", " }\n", " },\n", "]\n", "\n", "cols2 = [\n", " \"Month\",\n", " \"DayOfWeek\",\n", " \"DayOfWeekClaimed\",\n", " \"MonthClaimed\",\n", " \"PastNumberOfClaims\",\n", " \"NumberOfSuppliments\",\n", " \"VehiclePrice\",\n", " \"AgeOfVehicle\",\n", "]\n", "col_ordering = [\n", " {\n", " \"Month\": {\n", " \"Jan\": 1,\n", " \"Feb\": 2,\n", " \"Mar\": 3,\n", " \"Apr\": 4,\n", " \"May\": 5,\n", " \"Jun\": 6,\n", " \"Jul\": 7,\n", " \"Aug\": 8,\n", " \"Sep\": 9,\n", " \"Oct\": 10,\n", " \"Nov\": 11,\n", " \"Dec\": 12,\n", " }\n", " },\n", " {\n", " \"DayOfWeek\": {\n", " \"Monday\": 1,\n", " \"Tuesday\": 2,\n", " \"Wednesday\": 3,\n", " \"Thursday\": 4,\n", " \"Friday\": 5,\n", " \"Saturday\": 6,\n", " \"Sunday\": 7,\n", " }\n", " },\n", " {\n", " \"DayOfWeekClaimed\": {\n", " \"Monday\": 1,\n", " \"Tuesday\": 2,\n", " \"Wednesday\": 3,\n", " \"Thursday\": 4,\n", " \"Friday\": 5,\n", " \"Saturday\": 6,\n", " \"Sunday\": 7,\n", " }\n", " },\n", " {\n", " \"MonthClaimed\": {\n", " \"Jan\": 1,\n", " \"Feb\": 2,\n", " \"Mar\": 3,\n", " \"Apr\": 4,\n", " \"May\": 5,\n", " \"Jun\": 6,\n", " \"Jul\": 7,\n", " \"Aug\": 8,\n", " \"Sep\": 9,\n", " \"Oct\": 10,\n", " \"Nov\": 11,\n", " \"Dec\": 12,\n", " }\n", " },\n", " {\"PastNumberOfClaims\": {\"none\": 0, \"1\": 1, \"2 to 4\": 2, \"more than 4\": 5}},\n", " {\"NumberOfSuppliments\": {\"none\": 0, \"1 to 2\": 1, \"3 to 5\": 3, \"more than 5\": 6}},\n", " {\n", " \"VehiclePrice\": {\n", " \"more than 69000\": 69001,\n", " \"20000 to 29000\": 24500,\n", " \"30000 to 39000\": 34500,\n", " \"less than 20000\": 19999,\n", " \"40000 to 59000\": 49500,\n", " \"60000 to 69000\": 64500,\n", " }\n", " },\n", " {\n", " \"AgeOfVehicle\": {\n", " \"3 years\": 3,\n", " \"6 years\": 6,\n", " \"7 years\": 7,\n", " \"more than 7\": 8,\n", " \"5 years\": 5,\n", " \"new\": 0,\n", " \"4 years\": 4,\n", " \"2 years\": 2,\n", " }\n", " },\n", "]\n", "\n", "from secretflow.data.vertical import VDataFrame\n", "\n", "def replace(df, col_maps):\n", " df = df.copy()\n", "\n", " def func_(df, col_map):\n", " col_name = list(col_map.keys())[0]\n", " col_dict = list(col_map.values())[0]\n", " if col_name not in df.columns:\n", " return\n", " new_list = []\n", " for i in 
df[col_name]:\n", "            new_list.append(col_dict[i])\n", "        df[col_name] = new_list\n", "\n", "    for col_map in col_maps:\n", "        func_(df, col_map)\n", "    return df\n", "\n", "col_maps = col_disc + col_ordering\n", "\n", "train_ds_v2 = train_ds_v1.copy()\n", "test_ds_v2 = test_ds_v1.copy()\n", "\n", "# NOTE: reveal is used here for demonstration only!!\n", "print(f\"orig ds in alice:\\n {sf.reveal(train_ds_v2.partitions[alice].data)}\")\n", "train_ds_v2.partitions[alice].data = alice(replace)(\n", "    train_ds_v2.partitions[alice].data, col_maps\n", ")\n", "train_ds_v2.partitions[bob].data = bob(replace)(\n", "    train_ds_v2.partitions[bob].data, col_maps\n", ")\n", "train_ds_v2.partitions[carol].data = carol(replace)(\n", "    train_ds_v2.partitions[carol].data, col_maps\n", ")\n", "\n", "print(f\"mapped ds in alice:\\n {sf.reveal(train_ds_v2.partitions[alice].data)}\")\n", "\n", "test_ds_v2.partitions[alice].data = alice(replace)(\n", "    test_ds_v2.partitions[alice].data, col_maps\n", ")\n", "test_ds_v2.partitions[bob].data = bob(replace)(\n", "    test_ds_v2.partitions[bob].data, col_maps\n", ")\n", "test_ds_v2.partitions[carol].data = carol(replace)(\n", "    test_ds_v2.partitions[carol].data, col_maps\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Nominal) Categorical Features\n", "For unordered categorical data we apply a one-hot encoder directly to obtain 0/1 encodings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Onehot Encoder" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "encoded ds in alice:\n", " Month WeekOfMonth DayOfWeek AccidentArea DayOfWeekClaimed \\\n", "0 3 4 7 1 5 \n", "1 4 4 6 1 1 \n", "2 6 4 7 0 1 \n", "3 3 2 1 1 1 \n", "4 6 3 5 1 2 \n", "... ... ... ... ... ... \n", "10787 9 2 3 1 1 \n", "10788 4 4 2 1 2 \n", "10789 7 1 1 0 3 \n", "10790 2 3 3 1 1 \n", "10791 2 5 1 1 5 \n", "\n", " MonthClaimed WeekOfMonthClaimed Sex Make_Accura Make_BMW ... \\\n", "0 4 1 1 0.0 0.0 ... \n", "1 4 4 1 0.0 0.0 ... \n", "2 6 4 0 0.0 0.0 ... \n", "3 3 2 1 0.0 0.0 ... \n", "4 6 4 0 0.0 0.0 ... \n", "... ... ... ... ... ... ... \n", "10787 9 2 0 0.0 0.0 ... \n", "10788 5 1 1 0.0 0.0 ... \n", "10789 7 1 1 0.0 0.0 ... \n", "10790 2 3 1 0.0 0.0 ... \n", "10791 3 1 1 0.0 0.0 ... \n", "\n", " Make_Pontiac Make_Porche Make_Saab Make_Saturn Make_Toyota \\\n", "0 0.0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 1.0 \n", "3 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... \n", "10787 0.0 0.0 0.0 0.0 0.0 \n", "10788 0.0 0.0 0.0 0.0 0.0 \n", "10789 0.0 0.0 0.0 0.0 0.0 \n", "10790 1.0 0.0 0.0 0.0 0.0 \n", "10791 0.0 0.0 0.0 0.0 0.0 \n", "\n", " Make_VW MaritalStatus_Divorced MaritalStatus_Married \\\n", "0 0.0 0.0 1.0 \n", "1 0.0 0.0 1.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "... ... ... ... \n", "10787 0.0 0.0 1.0 \n", "10788 0.0 0.0 1.0 \n", "10789 0.0 0.0 1.0 \n", "10790 0.0 0.0 1.0 \n", "10791 0.0 0.0 1.0 \n", "\n", " MaritalStatus_Single MaritalStatus_Widow \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 1.0 0.0 \n", "3 1.0 0.0 \n", "4 1.0 0.0 \n", "... ... ... \n", "10787 0.0 0.0 \n", "10788 0.0 0.0 \n", "10789 0.0 0.0 \n", "10790 0.0 0.0 \n", "10791 0.0 0.0 \n", "\n", "[10792 rows x 31 columns]\n" ] } ], "source": [ "from secretflow.preprocessing import OneHotEncoder\n", "onehot_cols = ['Make','MaritalStatus','PolicyType','VehicleCategory','BasePolicy']\n", "\n", "onehot_encoder = OneHotEncoder()\n", "onehot_encoder.fit(train_ds_v2[onehot_cols])\n", "\n", "enc_feats = onehot_encoder.transform(train_ds_v2[onehot_cols])\n", "feature_names = enc_feats.columns\n", "train_ds_v3 = train_ds_v2.drop(columns=onehot_cols)\n", "train_ds_v3[feature_names] = enc_feats\n", "\n", "\n", "enc_feats = onehot_encoder.transform(test_ds_v2[onehot_cols])\n", "test_ds_v3 = test_ds_v2.drop(columns=onehot_cols)\n", "test_ds_v3[feature_names] = enc_feats\n", "\n", "print(f\"encoded ds in alice:\\n {sf.reveal(train_ds_v3.partitions[alice].data)}\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data load done\n" ] } ], "source": [ "train_ds_final = train_ds_v3.copy()\n", "test_ds_final = test_ds_v3.copy()\n", "\n", "X_train = train_ds_final.drop(columns=['FraudFound_P'])\n", "y_train = train_ds_final['FraudFound_P']\n", "X_test = test_ds_final.drop(columns=['FraudFound_P'])\n", "y_test = test_ds_final['FraudFound_P']\n", "\n", "print(\"data load done\")" ] }, 
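{ "cell_type": "markdown", "metadata": {}, "source": [ "As a hedged sanity check (not part of the original flow), we can confirm that the features stay with their owners and that the label lives with carol; `sf.reveal` is again used for demonstration only." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A hedged sketch: check the preprocessed shapes per party (demo only).\n", "print(sf.reveal(X_train.partitions[alice].data).shape)  # alice's share of the features\n", "print(sf.reveal(y_train.partitions[carol].data).shape)  # the label stays with carol" ] }, 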
\n", "10787 0.0 0.0 \n", "10788 0.0 0.0 \n", "10789 0.0 0.0 \n", "10790 0.0 0.0 \n", "10791 0.0 0.0 \n", "\n", "[10792 rows x 31 columns]\n" ] } ], "source": [ "from secretflow.preprocessing import OneHotEncoder\n", "onehot_cols = ['Make','MaritalStatus','PolicyType','VehicleCategory','BasePolicy']\n", "\n", "onehot_encoder = OneHotEncoder()\n", "onehot_encoder.fit(train_ds_v2[onehot_cols])\n", "\n", "enc_feats = onehot_encoder.transform(train_ds_v2[onehot_cols])\n", "feature_names = enc_feats.columns\n", "train_ds_v3 = train_ds_v2.drop(columns=onehot_cols)\n", "train_ds_v3[feature_names] = enc_feats\n", "\n", "\n", "enc_feats = onehot_encoder.transform(test_ds_v2[onehot_cols])\n", "test_ds_v3 = test_ds_v2.drop(columns=onehot_cols)\n", "test_ds_v3[feature_names] = enc_feats\n", "\n", "print(f\"orig ds in alice:\\n {sf.reveal(train_ds_v3.partitions[alice].data)}\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data load done\n" ] } ], "source": [ "train_ds_final = train_ds_v3.copy()\n", "test_ds_final = test_ds_v3.copy()\n", "\n", "X_train = train_ds_v3.drop(columns=['FraudFound_P'])\n", "y_train = train_ds_final['FraudFound_P']\n", "X_test = test_ds_final.drop(columns='FraudFound_P')\n", "y_test = test_ds_final['FraudFound_P']\n", "\n", "print(\"data load done\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 数据对象转换\n", "此处我们将PYUObject 转化为 SPUObject,方便输入到SPU device执行基于MPC协议的隐私计算" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "X_train type: VDataFrame(partitions={: Partition(data=), : Partition(data=), : Partition(data=)}, aligned=True)\n", "\n", "X_train_spu type: \n", "X_train_plaintext: \n", "[[3. 4. 7. ... 1. 0. 0.]\n", " [4. 4. 6. ... 0. 0. 1.]\n", " [6. 4. 7. ... 1. 0. 0.]\n", " ...\n", " [7. 1. 1. ... 0. 1. 0.]\n", " [2. 3. 3. ... 0. 0. 1.]\n", " [2. 5. 1. ... 0. 0. 1.]]\n" ] } ], "source": [ "import jax\n", "import jax.numpy as jnp\n", "\n", "\"\"\"\n", "Convert the VDataFrame object to SPUObject\n", "\"\"\"\n", "def vdataframe_to_spu(vdf: VDataFrame):\n", " spu_partitions = []\n", " for device in [alice, bob, carol]:\n", " spu_partitions.append(vdf.partitions[device].data.to(my_spu))\n", " base_partition = spu_partitions[0]\n", " for i in range(1, len(spu_partitions)):\n", " base_partition = my_spu(lambda x, y: jnp.concatenate([x, y], axis=1))(\n", " base_partition, spu_partitions[i]\n", " )\n", " return base_partition\n", "\n", "\n", "X_train_spu = vdataframe_to_spu(X_train)\n", "y_train_spu = y_train.partitions[carol].data.to(my_spu)\n", "X_test_spu = vdataframe_to_spu(X_test)\n", "y_test_spu = y_test.partitions[carol].data.to(my_spu)\n", "print(f\"X_train type: {X_train}\\n\\nX_train_spu type: {X_train_spu}\")\n", "\n", "\"\"\"\n", "NOTE: This is only for demo only!! 
It must not be used in production.\n", "\"\"\"\n", "X_train_plaintext = sf.reveal(X_train_spu)\n", "y_train_plaintext = sf.reveal(y_train_spu)\n", "X_test_plaintext = sf.reveal(X_test_spu)\n", "y_test_plaintext = sf.reveal(y_test_spu)\n", "\n", "print(f'X_train_plaintext: \\n{X_train_plaintext}')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Model Construction\n", "With the data loaded, we now build the models. This demo covers three of them:\n", "- LR: logistic regression\n", "- NN: a neural network model\n", "- XGB: an XGBoost tree model\n", "\n", "> Note that this example demonstrates the algorithm-development workflow on SecretFlow; no hyperparameter tuning was done for the models (LR, NN). We provide both plaintext and secure-computation results, and the two outputs are essentially identical, showing that SecretFlow's secure computation matches plaintext computation in accuracy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LR ( jax ) using SPU" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from jax.example_libraries import optimizers, stax\n", "from jax.example_libraries.stax import (\n", "    Conv,\n", "    MaxPool,\n", "    AvgPool,\n", "    Flatten,\n", "    Dense,\n", "    Relu,\n", "    Sigmoid,\n", "    LogSoftmax,\n", "    Softmax,\n", "    BatchNorm,\n", ")\n", "\n", "\n", "def sigmoid(x):\n", "    # NOTE: the inputs are min-max normalized before the standard sigmoid,\n", "    # which keeps the values in a range friendly to SPU's fixed-point arithmetic\n", "    x = (x - jnp.min(x)) / (jnp.max(x) - jnp.min(x))\n", "    return 1 / (1 + jnp.exp(-x))\n", "\n", "# Outputs the probability of a label being true.\n", "def predict_lr(W, b, inputs):\n", "    return sigmoid(jnp.dot(inputs, W) + b)\n", "\n", "# Training loss is the negative log-likelihood of the training examples.\n", "def loss_lr(W, b, inputs, targets):\n", "    preds = predict_lr(W, b, inputs)\n", "    label_probs = preds * targets + (1 - preds) * (1 - targets)\n", "    return -jnp.mean(jnp.log(label_probs))\n", "\n", "def train_step(W, b, X, y, learning_rate):\n", "    loss_value, Wb_grad = jax.value_and_grad(loss_lr, (0, 1))(W, b, X, y)\n", "    W -= learning_rate * Wb_grad[0]\n", "    b -= learning_rate * Wb_grad[1]\n", "    return loss_value, W, b\n", "\n", "def fit(W, b, X, y, epochs=1, learning_rate=1e-2, batch_size=128):\n", "    losses = jnp.array([])\n", "\n", "    xs = jnp.array_split(X, len(X) / batch_size, axis=0)\n", "    ys = jnp.array_split(y, len(y) / batch_size, axis=0)\n", "\n", "    for _ in range(epochs):\n", "        for (batch_x, batch_y) in zip(xs, ys):\n", "            l, W, b = train_step(\n", "                W, b, batch_x, batch_y, learning_rate=learning_rate\n", "            )\n", "            losses = jnp.append(losses, l)\n", "    return losses, W, b\n" ] }, 
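{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, the training objective implemented by `loss_lr` above is the negative log-likelihood\n", "\n", "$$\\mathcal{L}(W, b) = -\\frac{1}{N}\\sum_{i=1}^{N}\\big[y_i \\log p_i + (1 - y_i)\\log(1 - p_i)\\big], \\qquad p_i = \\sigma(x_i^\\top W + b),$$\n", "\n", "where $\\sigma$ is the (min-max normalized) sigmoid defined above, minimized by plain mini-batch gradient descent in `fit`; the same jax code runs unchanged on CPU and on SPU." ] }, 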
{ "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:absl:Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: \n", "INFO:absl:Unable to initialize backend 'gpu': NOT_FOUND: Could not find registered platform with name: \"cuda\". Available platform names are: Interpreter Host\n", "INFO:absl:Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.\n", "WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[31m(Jax LR CPU) auc: 0.5243556504755678\u001b[0m\n", "\u001b[31m(Jax LR SPU) auc: 0.5249493212966679\u001b[0m\n" ] } ], "source": [ "from jax import random\n", "import sys\n", "import time\n", "import logging\n", "logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\n", "logging.getLogger().setLevel(logging.INFO)\n", "\n", "from sklearn.metrics import roc_auc_score\n", "\n", "\n", "# Hyperparameters\n", "key = random.PRNGKey(42)\n", "W = jax.random.normal(key, shape=(64,))\n", "b = 0.0\n", "epochs = 1\n", "learning_rate = 1e-2\n", "batch_size = 128\n", "\n", "\"\"\"\n", "CPU-version plaintext computation\n", "\"\"\"\n", "losses_cpu, W_cpu, b_cpu = fit(\n", "    W,\n", "    b,\n", "    X_train_plaintext,\n", "    y_train_plaintext,\n", "    epochs=epochs,\n", "    learning_rate=learning_rate,\n", "    batch_size=batch_size,\n", ")\n", "y_pred_cpu = predict_lr(W_cpu, b_cpu, X_test_plaintext)\n", "print(f\"\\033[31m(Jax LR CPU) auc: {roc_auc_score(y_test_plaintext, y_pred_cpu)}\\033[0m\")\n", "\n", "\"\"\"\n", "SPU-version secure computation\n", "\"\"\"\n", "W_, b_ = (\n", "    sf.to(alice, W).to(my_spu),\n", "    sf.to(alice, b).to(my_spu),\n", ")\n", "losses_spu, W_spu, b_spu = my_spu(\n", "    fit,\n", "    static_argnames=[\"epochs\", \"learning_rate\", \"batch_size\"],\n", "    num_returns_policy=sf.device.SPUCompilerNumReturnsPolicy.FROM_USER,\n", "    user_specified_num_returns=3,\n", ")(\n", "    W_,\n", "    b_,\n", "    X_train_spu,\n", "    y_train_spu,\n", "    epochs=epochs,\n", "    learning_rate=learning_rate,\n", "    batch_size=batch_size,\n", ")\n", "\n", "y_pred_spu = my_spu(predict_lr)(W_spu, b_spu, X_test_spu)\n", "y_pred = sf.reveal(y_pred_spu)\n", "print(f\"\\033[31m(Jax LR SPU) auc: {roc_auc_score(y_test_plaintext, y_pred)}\\033[0m\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NN ( jax + flax ) using SPU" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install flax==0.6.0" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[31m(Flax NN CPU) auc: 0.5022025986877814\u001b[0m\n", "\u001b[31m(Flax NN SPU) auc: 0.5022042816667214\u001b[0m\n" ] } ], "source": [ "from typing import Sequence\n", "import flax.linen as nn\n", "\n", "\n", "class MLP(nn.Module):\n", "    features: Sequence[int]\n", "\n", "    @nn.compact\n", "    def __call__(self, x):\n", "        for feat in self.features[:-1]:\n", "            x = nn.relu(nn.Dense(feat)(x))\n", "        x = nn.Dense(self.features[-1])(x)\n", "        return x\n", "\n", "FEATURES = [1]\n", "flax_nn = MLP(FEATURES)\n", "\n", "def predict(params, x):\n", "    # NOTE: MLP is redefined inside predict so that the function stays\n", "    # self-contained when it is serialized and shipped to the SPU device\n", "    from typing import Sequence\n", "    import flax.linen as nn\n", "    class MLP(nn.Module):\n", "        features: Sequence[int]\n", "\n", "        @nn.compact\n", "        def __call__(self, x):\n", "            for feat in self.features[:-1]:\n", "                x = nn.relu(nn.Dense(feat)(x))\n", "            x = nn.Dense(self.features[-1])(x)\n", "            return x\n", "    FEATURES = [1]\n", "    flax_nn = MLP(FEATURES)\n", "    return flax_nn.apply(params, x)\n", "\n", "\n", "def loss_func(params, x, y):\n", "    preds = predict(params, x)\n", "    label_probs = preds * y + (1 - preds) * (1 - y)\n", "    return -jnp.mean(jnp.log(label_probs))\n", "\n", "\n", "def train_auto_grad(X, y, params, batch_size=10, epochs=10, learning_rate=0.01):\n", "    xs = jnp.array_split(X, len(X) / batch_size, axis=0)\n", "    ys = jnp.array_split(y, len(y) / batch_size, axis=0)\n", "\n", "    for _ in 
range(epochs):\n", "        for (batch_x, batch_y) in zip(xs, ys):\n", "            _, grads = jax.value_and_grad(loss_func)(params, batch_x, batch_y)\n", "            params = jax.tree_util.tree_map(\n", "                lambda p, g: p - learning_rate * g, params, grads\n", "            )\n", "    return params\n", "\n", "epochs = 1\n", "learning_rate = 1e-2\n", "batch_size = 128\n", "\n", "feature_dim = 64  # from the dataset\n", "init_params = flax_nn.init(jax.random.PRNGKey(1), jnp.ones((batch_size, feature_dim)))\n", "\n", "\"\"\"\n", "CPU-version plaintext computation\n", "\"\"\"\n", "params = train_auto_grad(\n", "    X_train_plaintext, y_train_plaintext, init_params, batch_size, epochs, learning_rate\n", ")\n", "y_pred = predict(params, X_test_plaintext)\n", "print(f\"\\033[31m(Flax NN CPU) auc: {roc_auc_score(y_test_plaintext, y_pred)}\\033[0m\")\n", "\n", "\"\"\"\n", "SPU-version secure computation\n", "\"\"\"\n", "params_spu = sf.to(alice, init_params).to(my_spu)\n", "params_spu = my_spu(train_auto_grad, static_argnames=['batch_size', 'epochs', 'learning_rate'])(\n", "    X_train_spu, y_train_spu, params_spu, batch_size=batch_size, epochs=epochs, learning_rate=learning_rate\n", ")\n", "y_pred_spu = my_spu(predict)(params_spu, X_test_spu)\n", "y_pred_ = sf.reveal(y_pred_spu)\n", "print(f\"\\033[31m(Flax NN SPU) auc: {roc_auc_score(y_test_plaintext, y_pred_)}\\033[0m\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### XGB ( jax ) using SPU" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO:root:global_setup time 5.707190036773682s\n", "INFO:root:epoch 0 time 13.447269678115845s\n", "INFO:root:epoch 1 time 11.487506866455078s\n", "INFO:root:epoch 2 time 16.403863430023193s\n", "INFO:root:epoch 3 time 10.947153091430664s\n", "INFO:root:epoch 4 time 10.782062530517578s\n", "INFO:root:epoch 5 time 10.983924865722656s\n", "INFO:root:epoch 6 time 13.287509441375732s\n", "INFO:root:epoch 7 time 10.768491506576538s\n", "INFO:root:epoch 8 time 11.075066804885864s\n", "INFO:root:epoch 9 time 11.336798429489136s\n", "train time: 126.23692321777344\n", "predict time: 0.0565030574798584\n", "\u001b[31m(SS-XGB) auc: 0.6917051858471569\u001b[0m\n" ] } ], "source": [ "from secretflow.ml.boost.ss_xgb_v import Xgb\n", "import time\n", "from sklearn.metrics import roc_auc_score\n", "\n", "\"\"\"\n", "SPU-version secure computation\n", "\"\"\"\n", "xgb = Xgb(my_spu)\n", "params = {\n", "    # <<< !!! >>> change args to your test settings.\n", "    # for more detail, see Xgb.train.__doc__\n", "    'num_boost_round': 10,\n", "    'max_depth': 4,\n", "    'learning_rate': 0.05,\n", "    'sketch_eps': 0.05,\n", "    'objective': 'logistic',\n", "    'reg_lambda': 1,\n", "    'subsample': 0.75,\n", "    'colsample_bytree': 1,\n", "    'base_score': 0.5,\n", "}\n", "\n", "start = time.time()\n", "model = xgb.train(params, X_train, y_train)\n", "print(f\"train time: {time.time() - start}\")\n", "\n", "start = time.time()\n", "spu_yhat = model.predict(X_test)\n", "print(f\"predict time: {time.time() - start}\")\n", "\n", "yhat = sf.reveal(spu_yhat)\n", "print(f\"\\033[31m(SS-XGB) auc: {roc_auc_score(y_test_plaintext, yhat)}\\033[0m\")" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/xgboost/sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].\n", "  warnings.warn(label_encoder_deprecation_msg, UserWarning)\n", "/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:98: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", "  y = column_or_1d(y, warn=True)\n", "/home/haoqi.whq/miniconda3/envs/sf-demo/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:133: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", "  y = column_or_1d(y, warn=True)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[20:26:03] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "\u001b[31m(Sklearn-XGB) auc: 0.7106100882806603\u001b[0m\n" ] } ], "source": [ "\"\"\"\n", "Plaintext baseline\n", "\"\"\"\n", "import xgboost as SKxgb\n", "\n", "params = {\n", "    # <<< !!! >>> change args to your test settings.\n", "    # for more detail, see Xgb.train.__doc__\n", "    \"n_estimators\": 10,\n", "    \"max_depth\": 4,\n", "    'eval_metric': 'auc',\n", "    \"learning_rate\": 0.05,\n", "    \"sketch_eps\": 0.05,\n", "    \"objective\": \"binary:logistic\",\n", "    \"reg_lambda\": 1,\n", "    \"subsample\": 0.75,\n", "    \"colsample_bytree\": 1,\n", "    \"base_score\": 0.5,\n", "}\n", "raw_xgb = SKxgb.XGBClassifier(**params)\n", "raw_xgb.fit(X_train_plaintext, y_train_plaintext)\n", "# Use predicted probabilities, not hard labels, to compute AUC\n", "y_pred = raw_xgb.predict_proba(X_test_plaintext)[:, 1]\n", "print(f\"\\033[31m(Sklearn-XGB) auc: {roc_auc_score(y_test_plaintext, y_pred)}\\033[0m\")" ] }, 
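{ "cell_type": "markdown", "metadata": {}, "source": [ "Before shutting down, a minimal hedged sketch of the point made in the summary below: any jax-expressible computation can be handed to the SPU device unchanged. Here we compute the per-feature means of the secret-shared training matrix; `sf.reveal` is for demonstration only." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A hedged sketch: run a custom jax function on the SPU device.\n", "def feature_means(x):\n", "    return jnp.mean(x, axis=0)\n", "\n", "means_spu = my_spu(feature_means)(X_train_spu)  # computed on secret shares\n", "print(sf.reveal(means_spu)[:5])  # revealed for demonstration only" ] }, 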
{ "cell_type": "markdown", "metadata": {}, "source": [ "## The End\n", "Explicitly call `sf.shutdown()` to shut down the instantiated cluster.\n", "> Note: when the code runs in a .py file there is no need to call shutdown explicitly; the `shutdown` function is invoked implicitly when the process exits." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "sf.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "- We showed how to develop a privacy-preserving application for a real-world scenario on SecretFlow\n", "- We walked through SecretFlow's data loading, preprocessing, modeling, and training workflow\n", "- Next step: implement arbitrary computations of your own (written in jax); support for TF and PyTorch is WIP" ] } ], "metadata": { "kernelspec": { "display_name": "sf", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "a8435d1c2867c43d8e4505e2a7d45b70fc4003dc549050cc77e922e23eb0b16c" } } }, "nbformat": 4, "nbformat_minor": 2 }