{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 隐语SecretFlow金融风控全链路能力展示\n", "\n", "> This tutorial is only available in Chinese." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Last updated: Nov 9, 2022\n", ">\n", "> 请使用v0.7.11或以上版本的隐语进行实验。\n", ">\n", "> 以下代码仅作为示例,请勿在生产环境直接使用。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "本次实验将会展示如何使用隐语进行在风控领域常用的Logistic Regeression模型和XGB模型的模型研发工作。\n", "\n", "隐语接下来将会开放模型部署和在线/离线模型预测功能,敬请期待。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 实验目标\n", "\n", "在本次实验中,我们将会利用一个开源数据集训练一个金融风控场景常用的线性回归和XGB模型。在此过程中将包含以下步骤:\n", "\n", "- 样本对齐\n", "- 特征预处理\n", "- 数据分析\n", "- 模型训练\n", "- 模型预测\n", "- 模型评估\n", "\n", "请依次执行所有步骤确保实验可以顺利完成。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 实验前置工作" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 初始化隐语框架\n", "\n", "在本次实验中,我们将会包含两个节点:**alice** 和 **bob** . 在真实业务场景,他们将会代表两个不同实体,他们之间的原始数据不被允许直接相互传输,但是他们的原始数据将会被一起用以研发一个模型。\n", "\n", "在下面的代码中,我们建立了一个 **SecretFlow Cluster**, 基于 **alice** 和 **bob** 两个节点,我们还创建了三个device:\n", "\n", "- alice: PYU device, 负责在alice侧的本地计算,计算输入、计算过程和计算结果仅alice可见\n", "- bob: PYU device, 负责在bob侧的本地计算,计算输入、计算过程和计算结果仅bob可见\n", "- spu: SPU device, 负责alice和bob之间的密态计算,计算输入和计算结果为密态,由alice和bob各掌握一个分片,计算过程为MPC计算,由alice和bob各自的SPU Runtime一起执行。\n", "\n", ">  如果你尚未理解以上的一些概念,比如SPU设备,请参考这篇[文档](../developer/design/architecture.md).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", " warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n" ] } ], "source": [ "import secretflow as sf\n", "\n", "sf.shutdown()\n", "sf.init(['alice', 'bob'], 
address='local')\n", "alice, bob = sf.PYU('alice'), sf.PYU('bob')\n", "spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the log above you should see that, while **spu** is being created, one **SPURuntime** is started on each of alice's and bob's sides and the two connect to each other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset\n", "\n", "The raw data for this lab is the [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI. It collects the results of telephone marketing campaigns run by a Portuguese banking institution.\n", "\n", "We add a **uid** column for the private set intersection experiment that follows.\n", "\n", "Let's first look at what the dataset contains.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<p>45211 rows × 18 columns (table shown in the text/plain output)</p>
" ], "text/plain": [ " age job marital education default balance housing loan \\\n", "0 58 management married tertiary no 2143 yes no \n", "1 44 technician single secondary no 29 yes no \n", "2 33 entrepreneur married secondary no 2 yes yes \n", "3 47 blue-collar married unknown no 1506 yes no \n", "4 33 unknown single unknown no 1 no no \n", "... ... ... ... ... ... ... ... ... \n", "45206 51 technician married tertiary no 825 no no \n", "45207 71 retired divorced primary no 1729 no no \n", "45208 72 retired married secondary no 5715 no no \n", "45209 57 blue-collar married secondary no 668 no no \n", "45210 37 entrepreneur married secondary no 2971 no no \n", "\n", " contact day month duration campaign pdays previous poutcome \\\n", "0 unknown 5 may 261 1 -1 0 unknown \n", "1 unknown 5 may 151 1 -1 0 unknown \n", "2 unknown 5 may 76 1 -1 0 unknown \n", "3 unknown 5 may 92 1 -1 0 unknown \n", "4 unknown 5 may 198 1 -1 0 unknown \n", "... ... ... ... ... ... ... ... ... \n", "45206 cellular 17 nov 977 3 -1 0 unknown \n", "45207 cellular 17 nov 456 2 -1 0 unknown \n", "45208 cellular 17 nov 1127 5 184 3 success \n", "45209 telephone 17 nov 508 4 -1 0 unknown \n", "45210 cellular 17 nov 361 2 188 11 other \n", "\n", " y uid \n", "0 no 1 \n", "1 no 2 \n", "2 no 3 \n", "3 no 4 \n", "4 no 5 \n", "... ... ... 
\n", "45206 yes 45207 \n", "45207 yes 45208 \n", "45208 yes 45209 \n", "45209 no 45210 \n", "45210 no 45211 \n", "\n", "[45211 rows x 18 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# secretflow.utils.simulation.datasets contains mirrors of some popular open dataset.\n", "from secretflow.utils.simulation.datasets import dataset\n", "\n", "df = pd.read_csv(dataset('bank_marketing_full'), sep=';')\n", "df['uid'] = df.index + 1\n", "\n", "df\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "该数据集包含了45211个样本,每一个样本代表了一个目标客户。\n", "\n", "每个样本包含16个feature,我们这里简单描述一下这个数据集所有的feature。\n", "\n", "\n", "| feature | 描述 | 取值 |\n", "| :-----| :---- | :---- |\n", "| uid | 客户编码 | 数字 |\n", "| age | 年龄 | 数字 |\n", "| job | 工作类型 | 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown' |\n", "| marital | 婚姻状况 | 'divorced','married','single','unknown' |\n", "| education | 教育状况 | 'tertiary', 'secondary', 'unknown', 'primary' |\n", "| default | 是否有不良信用记录 | 'no','yes','unknown' |\n", "| housing | 是否有房贷 | 'no','yes','unknown' |\n", "| loan | 是否有个人贷款 | 'no','yes','unknown' |\n", "| contact | 联系方式 | 'cellular','telephone' |\n", "| month | 上次联系月份 | 'jan', 'feb', 'mar', ..., 'nov', 'dec' |\n", "| day | 上次联系月日 |数字|\n", "| duration | 上次沟通时间 | 数字 |\n", "| campaign | 本次活动已经沟通的次数 | 数字 |\n", "| pdays | 距离上次沟通经过的天数 | 数字 |\n", "| previous | 在本次活动之前已经沟通的次数 | 数字 |\n", "| poutcome | 之前活动的结果 | 'unknown', 'failure', 'other', 'success' | \n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "每个样本的label - y表示对于目标客户的营销结果(是否签订了定额存款合同),取值是'yes','no'。\n", "\n", "我们假定以上16个feature由两个机构分别掌握,具体如下。\n", "\n", "- alice: age, job, marital, education, default, balance, housing, loan\n", "- bob: contact, day, month, duration, campaign, pdays, previous, poutcome, y\n" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "在真实业务场景中, alice和bob所掌握的数据可能是没有对齐的,为了模拟这种情况,我们将数据集shuffle之后,再随机各取90%来模拟这个状况。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agejobmaritaleducationdefaultbalancehousingloanuid
3077533techniciansinglesecondaryno10yesno30776
3804455admin.divorcedsecondaryno-288yesno38045
1007560servicesdivorcedsecondaryno47nono10076
3707633servicesmarriedsecondaryno56yesno37077
1106752managementmarriedtertiaryno7388noyes11068
..............................
908355retiredmarriedprimaryno0nono9084
3957625studentsingletertiaryno241nono39577
2518640managementmarriedtertiaryno11yesno25187
1043636entrepreneurmarriedprimaryno6317nono10437
115859retiredmarriedunknownno0nono1159
\n", "

40690 rows × 9 columns

\n", "
" ], "text/plain": [ " age job marital education default balance housing loan \\\n", "30775 33 technician single secondary no 10 yes no \n", "38044 55 admin. divorced secondary no -288 yes no \n", "10075 60 services divorced secondary no 47 no no \n", "37076 33 services married secondary no 56 yes no \n", "11067 52 management married tertiary no 7388 no yes \n", "... ... ... ... ... ... ... ... ... \n", "9083 55 retired married primary no 0 no no \n", "39576 25 student single tertiary no 241 no no \n", "25186 40 management married tertiary no 11 yes no \n", "10436 36 entrepreneur married primary no 6317 no no \n", "1158 59 retired married unknown no 0 no no \n", "\n", " uid \n", "30775 30776 \n", "38044 38045 \n", "10075 10076 \n", "37076 37077 \n", "11067 11068 \n", "... ... \n", "9083 9084 \n", "39576 39577 \n", "25186 25187 \n", "10436 10437 \n", "1158 1159 \n", "\n", "[40690 rows x 9 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "df_alice = df.iloc[:, np.r_[0:8, -1]].sample(frac=0.9)\n", "\n", "df_alice\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
contactdaymonthdurationcampaignpdayspreviouspoutcomeyuid
11614unknown19jun2111-10unknownno11615
24743cellular18nov1501-10unknownno24744
42588cellular30dec1583-10unknownno42589
4322unknown19may1873-10unknownno4323
15930cellular22jul761-10unknownno15931
.................................
30143cellular4feb7622043otherno30144
16730cellular24jul1492-10unknownno16731
35775cellular8may361-10unknownno35776
7050unknown28may3182-10unknownno7051
42846cellular3feb35513015successyes42847
\n", "

40690 rows × 10 columns

\n", "
" ], "text/plain": [ " contact day month duration campaign pdays previous poutcome y \\\n", "11614 unknown 19 jun 211 1 -1 0 unknown no \n", "24743 cellular 18 nov 150 1 -1 0 unknown no \n", "42588 cellular 30 dec 158 3 -1 0 unknown no \n", "4322 unknown 19 may 187 3 -1 0 unknown no \n", "15930 cellular 22 jul 76 1 -1 0 unknown no \n", "... ... ... ... ... ... ... ... ... ... \n", "30143 cellular 4 feb 76 2 204 3 other no \n", "16730 cellular 24 jul 149 2 -1 0 unknown no \n", "35775 cellular 8 may 36 1 -1 0 unknown no \n", "7050 unknown 28 may 318 2 -1 0 unknown no \n", "42846 cellular 3 feb 355 1 301 5 success yes \n", "\n", " uid \n", "11614 11615 \n", "24743 24744 \n", "42588 42589 \n", "4322 4323 \n", "15930 15931 \n", "... ... \n", "30143 30144 \n", "16730 16731 \n", "35775 35776 \n", "7050 7051 \n", "42846 42847 \n", "\n", "[40690 rows x 10 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_bob = df.iloc[:, 8:].sample(frac=0.9)\n", "\n", "df_bob\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们这里将df_alice和df_bob保存为文件,作为alice和bob两方的原始输入。\n", "\n", "至此,我们完成了所有实验准备工作。" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n" ] } ], "source": [ "import tempfile\n", "\n", "_, alice_path = tempfile.mkstemp()\n", "_, bob_path = tempfile.mkstemp()\n", "df_alice.reset_index(drop=True).to_csv(alice_path, index=False)\n", "df_bob.reset_index(drop=True).to_csv(bob_path, index=False)\n" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "## 样本对齐(隐私求交)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "显然,第一步我们需要将两边的数据对齐。\n", "隐私求交([Private Set Intersection](https://en.wikipedia.org/wiki/Private_set_intersection))是一种密码学方法,可以获取两个集合的交集,而不泄露任何其他信息。\n", "在隐语中,SPU设备支持三种隐私求交算法:\n", "\n", "- [ECDH](https://ieeexplore.ieee.org/document/6234849/):半诚实模型, 基于公钥密码学,原本适用于小数据集,但是隐语优化后已经能支持10亿量级的数据。\n", "- [KKRT](https://eprint.iacr.org/2016/799.pdf):半诚实模型, 基于布谷鸟哈希(Cuckoo Hashing)以及高效不经意传输扩展(OT Extension),适用于大数据集(比如千万数据集)。\n", "- [BC22PCG](https://eprint.iacr.org/2022/334):半诚实模型, 基于随机相关函数生成器,适用于大数据集。\n", "\n", "由于我们这里的数据集较小,我们这里采用的是ECDH方法。\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 方式一:将隐私求交结果保存至文件\n", "\n", "在一些应用场景场景中,alice和bob可能在隐私求交之后将结果直接保存至文件中,之后再进行后续操作。这个时候,请调用**psi_csv**接口。\n", "\n", "在以下代码中,我们分别制定了两边需要求交的key以及输入和输出路径。\n", "\n", "我们需要指定双方的输入文件和输出文件路径。对于ECDH来说,由于双方的地位是平等的,receiver并没有实际含义,你可以任意指定。我们需要设定正确的protocol。sort设为true之后,join的结果将会被排序。\n", "\n", "> 请阅读 psi_csv 的文档。" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m I1110 15:04:56.489761 45105 external/com_github_brpc_brpc/src/brpc/server.cpp:1070] Server[yacl::link::internal::ReceiverServiceImpl] is serving on port=23711.\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m I1110 15:04:56.489876 45105 external/com_github_brpc_brpc/src/brpc/server.cpp:1073] Check out http://k69b13338.eu95sqa:23711 in web browser.\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m I1110 
15:04:56.591135 47953 external/com_github_brpc_brpc/src/brpc/socket.cpp:2236] Checking Socket{id=0 addr=127.0.0.1:39345} (0x5604daa51200)\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m I1110 15:04:56.641627 45106 external/com_github_brpc_brpc/src/brpc/server.cpp:1070] Server[yacl::link::internal::ReceiverServiceImpl] is serving on port=39345.\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m I1110 15:04:56.641724 45106 external/com_github_brpc_brpc/src/brpc/server.cpp:1073] Check out http://k69b13338.eu95sqa:39345 in web browser.\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m I1110 15:04:59.591784 47928 external/com_github_brpc_brpc/src/brpc/socket.cpp:2296] Revived Socket{id=0 addr=127.0.0.1:39345} (0x5604daa51200) (Connectable)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:00.520] [info] [bucket_psi.cc:169] bucket size set to 1048576\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:00.521] [info] [bucket_psi.cc:77] Begin sanity check for input file: /tmp/tmp0jqbsdge, precheck_switch:true\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:00.544] [info] [csv_checker.cc:125] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/tmp --stable selected-keys.1668063900521197637 | LC_ALL=C uniq -d > duplicate-keys.1668063900521197637\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:00.568] [info] [bucket_psi.cc:90] End sanity check for input file: /tmp/tmp0jqbsdge, size=40690\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:00.568] [info] [bucket_psi.cc:190] Run psi protocol=1, self_items_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:00.568] [info] [cryptor_selector.cc:50] Using libSodium\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:00.568] [info] [cipher_store.cc:25] 
Disk cache choose num_bins=64\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:00.520] [info] [bucket_psi.cc:169] bucket size set to 1048576\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:00.520] [info] [bucket_psi.cc:77] Begin sanity check for input file: /tmp/tmphctf58aa, precheck_switch:true\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:00.545] [info] [csv_checker.cc:125] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=/tmp --stable selected-keys.1668063900520830875 | LC_ALL=C uniq -d > duplicate-keys.1668063900520830875\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:00.567] [info] [bucket_psi.cc:90] End sanity check for input file: /tmp/tmphctf58aa, size=40690\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:00.568] [info] [bucket_psi.cc:190] Run psi protocol=1, self_items_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:00.568] [info] [cryptor_selector.cc:50] Using libSodium\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:00.568] [info] [cipher_store.cc:25] Disk cache choose num_bins=64\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:00.602] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:00.602] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:01.827] [info] [ecdh_psi.cc:70] MaskSelf:root--finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:01.934] [info] [ecdh_psi.cc:134] root recv last batch finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:01.989] [info] [ecdh_psi.cc:109] MaskPeer:root--finished, batch_count=10\n", 
"\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:01.927] [info] [ecdh_psi.cc:109] MaskPeer:root--finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:01.939] [info] [ecdh_psi.cc:70] MaskSelf:root--finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:01.992] [info] [ecdh_psi.cc:134] root recv last batch finished, batch_count=10\n" ] }, { "data": { "text/plain": [ "[{'party': 'alice', 'original_count': 40690, 'intersection_count': 36593},\n", " {'party': 'bob', 'original_count': 40690, 'intersection_count': 36593}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_, alice_psi_path = tempfile.mkstemp()\n", "_, bob_psi_path = tempfile.mkstemp()\n", "\n", "spu.psi_csv(\n", " key=\"uid\",\n", " input_path={alice: alice_path, bob: bob_path},\n", " output_path={alice: alice_psi_path, bob: bob_psi_path},\n", " receiver=\"alice\",\n", " protocol=\"ECDH_PSI_2PC\",\n", " sort=True,\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 方式二:将求交结果保存至VDataFrame\n", "\n", "VDataFrame是隐语中保存垂直切分数据的数据结构,在接下来的任务中,我们将会不断使用VDataFrame的数据结构。\n", "\n", "由于在本次实验中,经过隐私求交之后,我们还有后续操作,所以我们在这里使用 **data.vertical.read_csv** 来将原始数据隐私求交之后的结果直接转化为VDataFrame。\n", "\n", "> 请阅读data.vertical.read_csv的文档。很多参数和psi_csv是一致的,这里不再赘述。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:02.036] [info] [bucket_psi.cc:119] Begin post filtering, indices.size=36593, should_sort=true\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:02.044] [info] [utils.cc:86] Executing sort scripts: tail -n +2 /tmp/tmp-sort-in-1668063902037299176 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=9,9 
>>/tmp/tmp-sort-out-1668063902037299176\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:02.076] [info] [utils.cc:88] Finished sort scripts: tail -n +2 /tmp/tmp-sort-in-1668063902037299176 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=9,9 >>/tmp/tmp-sort-out-1668063902037299176, ret=0\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:02.076] [info] [bucket_psi.cc:157] End post filtering, in=/tmp/tmp0jqbsdge, out=/tmp/tmprde_yg6t\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:02.039] [info] [bucket_psi.cc:119] Begin post filtering, indices.size=36593, should_sort=true\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:02.047] [info] [utils.cc:86] Executing sort scripts: tail -n +2 /tmp/tmp-sort-in-1668063902040108117 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=10,10 >>/tmp/tmp-sort-out-1668063902040108117\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:02.080] [info] [utils.cc:88] Finished sort scripts: tail -n +2 /tmp/tmp-sort-in-1668063902040108117 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=10,10 >>/tmp/tmp-sort-out-1668063902040108117, ret=0\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:02.080] [info] [bucket_psi.cc:157] End post filtering, in=/tmp/tmphctf58aa, out=/tmp/tmpp8yo12c6\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:04.478] [info] [bucket_psi.cc:169] bucket size set to 1048576\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:04.478] [info] [bucket_psi.cc:77] Begin sanity check for input file: .data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/psi-input.csv, precheck_switch:true\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:04.502] [info] [csv_checker.cc:125] Executing duplicated 
scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8 --stable selected-keys.1668063904478962587 | LC_ALL=C uniq -d > duplicate-keys.1668063904478962587\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:04.680] [info] [bucket_psi.cc:169] bucket size set to 1048576\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:04.680] [info] [bucket_psi.cc:77] Begin sanity check for input file: .data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/psi-input.csv, precheck_switch:true\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:04.706] [info] [csv_checker.cc:125] Executing duplicated scripts: LC_ALL=C sort --buffer-size=1G --temporary-directory=.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505 --stable selected-keys.1668063904680819077 | LC_ALL=C uniq -d > duplicate-keys.1668063904680819077\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:04.740] [info] [bucket_psi.cc:90] End sanity check for input file: .data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/psi-input.csv, size=40690\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:04.741] [info] [bucket_psi.cc:190] Run psi protocol=1, self_items_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:04.741] [info] [cryptor_selector.cc:50] Using libSodium\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:04.741] [info] [cipher_store.cc:25] Disk cache choose num_bins=64\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:04.741] [info] [bucket_psi.cc:90] End sanity check for input file: .data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/psi-input.csv, size=40690\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:04.741] [info] [bucket_psi.cc:190] Run psi protocol=1, self_items_count=40690\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:04.741] [info] 
[cryptor_selector.cc:50] Using libSodium\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:04.741] [info] [cipher_store.cc:25] Disk cache choose num_bins=64\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:06.016] [info] [ecdh_psi.cc:70] MaskSelf:root--finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:06.202] [info] [ecdh_psi.cc:70] MaskSelf:root--finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:06.358] [info] [ecdh_psi.cc:109] MaskPeer:root--finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:06.373] [info] [ecdh_psi.cc:134] root recv last batch finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:06.362] [info] [ecdh_psi.cc:134] root recv last batch finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:06.370] [info] [ecdh_psi.cc:109] MaskPeer:root--finished, batch_count=10\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:06.597] [info] [bucket_psi.cc:119] Begin post filtering, indices.size=36593, should_sort=true\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:06.642] [info] [utils.cc:86] Executing sort scripts: tail -n +2 .data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/tmp-sort-in-1668063906598517121 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=9,9 >>.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/tmp-sort-out-1668063906598517121\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:06.641] [info] [bucket_psi.cc:119] Begin post filtering, indices.size=36593, should_sort=true\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:06.655] [info] [utils.cc:86] Executing sort scripts: tail -n +2 
.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/tmp-sort-in-1668063906641900342 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=10,10 >>.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/tmp-sort-out-1668063906641900342\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:06.693] [info] [utils.cc:88] Finished sort scripts: tail -n +2 .data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/tmp-sort-in-1668063906598517121 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=9,9 >>.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/tmp-sort-out-1668063906598517121, ret=0\n", "\u001b[2m\u001b[36m(SPURuntime pid=45105)\u001b[0m [2022-11-10 15:05:06.693] [info] [bucket_psi.cc:157] End post filtering, in=.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/psi-input.csv, out=.data/0-775a216f-7b52-44a0-8f56-e8ad66a83505/psi-output.csv\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:06.714] [info] [utils.cc:88] Finished sort scripts: tail -n +2 .data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/tmp-sort-in-1668063906641900342 | LC_ALL=C sort --buffer-size=1G --temporary-directory=./ --stable --field-separator=, --key=10,10 >>.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/tmp-sort-out-1668063906641900342, ret=0\n", "\u001b[2m\u001b[36m(SPURuntime pid=45106)\u001b[0m [2022-11-10 15:05:06.714] [info] [bucket_psi.cc:157] End post filtering, in=.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/psi-input.csv, out=.data/1-af16e420-b8a9-4996-a437-8e10dd3f9ea8/psi-output.csv\n" ] }, { "data": { "text/plain": [ "Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',\n", " 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',\n", " 'previous', 'poutcome', 'y'],\n", " dtype='object')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.data.vertical import read_csv as v_read_csv\n", "\n", "vdf = v_read_csv(\n", 
" {alice: alice_path, bob: bob_path},\n", " spu=spu,\n", " keys=\"uid\",\n", " drop_keys=\"uid\",\n", " psi_protocl=\"ECDH_PSI_2PC\",\n", ")\n", "vdf.columns\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 更多\n", "\n", "我们在这里展示的是两方单键的隐私求交,隐语也支持三方和多键的隐私求交技术,想要了解更多信息,你可以:\n", "\n", "- 阅读这篇[文档](https://www.secretflow.org.cn/docs/spu/en/development/psi.html)了解隐语SPU的隐私求交能力。\n", "- 阅读该[教程](./PSI_On_SPU.ipynb)了解使用的例子。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 特征预处理\n", "\n", "一般情况下,我们都需要对用于建模的数据进行预处理,合理的预处理对模型训练效果非常关键。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "在开始特征预处理之前,我们先使用 **stats.table_statistics.table_statistics** 来查看一下特征总体情况,我们会在后面专门讨论全表统计模块。" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " datatype total_count count count_na min max \\\n", "age int64 36593 36593 0 18.0 95.0 \n", "job object 36593 36593 0 NaN NaN \n", "marital object 36593 36593 0 NaN NaN \n", "education object 36593 36593 0 NaN NaN \n", "default object 36593 36593 0 NaN NaN \n", "balance int64 36593 36593 0 -6847.0 102127.0 \n", "housing object 36593 36593 0 NaN NaN \n", "loan object 36593 36593 0 NaN NaN \n", "contact object 36593 36593 0 NaN NaN \n", "day int64 36593 36593 0 1.0 31.0 \n", "month object 36593 36593 0 NaN NaN \n", "duration int64 36593 36593 0 0.0 4918.0 \n", "campaign int64 36593 36593 0 1.0 63.0 \n", "pdays int64 36593 36593 0 -1.0 854.0 \n", "previous int64 36593 36593 0 0.0 275.0 \n", "poutcome object 36593 36593 0 NaN NaN \n", "y object 36593 36593 0 NaN NaN \n", "\n", " mean var std sem ... \\\n", "age 40.947039 1.128579e+02 10.623461 0.055535 ... \n", "job NaN NaN NaN NaN ... \n", "marital NaN NaN NaN NaN ... \n", "education NaN NaN NaN NaN ... \n", "default NaN NaN NaN NaN ... \n", "balance 1362.827563 9.160127e+06 3026.570165 15.821649 ... \n", "housing NaN NaN NaN NaN ... \n", "loan NaN NaN NaN NaN ... \n", "contact NaN NaN NaN NaN ... \n", "day 15.811795 6.918886e+01 8.317984 0.043483 ... \n", "month NaN NaN NaN NaN ... \n", "duration 257.190528 6.517951e+04 255.302773 1.334617 ... \n", "campaign 2.767005 9.729927e+00 3.119283 0.016306 ... \n", "pdays 40.517175 1.014990e+04 100.746691 0.526662 ... \n", "previous 0.585276 5.783946e+00 2.404984 0.012572 ... \n", "poutcome NaN NaN NaN NaN ... \n", "y NaN NaN NaN NaN ... 
\n", "\n", " moment_2 moment_3 moment_4 central_moment_2 \\\n", "age 1.789515e+03 8.333137e+04 4.121843e+06 1.128548e+02 \n", "job NaN NaN NaN NaN \n", "marital NaN NaN NaN NaN \n", "education NaN NaN NaN NaN \n", "default NaN NaN NaN NaN \n", "balance 1.101718e+07 2.688746e+11 4.396339e+15 9.159877e+06 \n", "housing NaN NaN NaN NaN \n", "loan NaN NaN NaN NaN \n", "contact NaN NaN NaN NaN \n", "day 3.191998e+02 7.287211e+03 1.788647e+05 6.918697e+01 \n", "month NaN NaN NaN NaN \n", "duration 1.313247e+05 1.187896e+08 1.717326e+11 6.517772e+04 \n", "campaign 1.738598e+01 2.521050e+02 6.260525e+03 9.729661e+00 \n", "pdays 1.179126e+04 3.978166e+06 1.564747e+09 1.014962e+04 \n", "previous 6.126336e+00 6.332842e+02 1.581319e+05 5.783788e+00 \n", "poutcome NaN NaN NaN NaN \n", "y NaN NaN NaN NaN \n", "\n", " central_moment_3 central_moment_4 sum sum_2 \\\n", "age 8.138900e+02 4.203357e+04 1498375.0 6.548372e+07 \n", "job NaN NaN NaN NaN \n", "marital NaN NaN NaN NaN \n", "education NaN NaN NaN NaN \n", "default NaN NaN NaN NaN \n", "balance 2.288934e+11 1.211695e+16 49869949.0 4.031515e+11 \n", "housing NaN NaN NaN NaN \n", "loan NaN NaN NaN NaN \n", "contact NaN NaN NaN NaN \n", "day 5.214820e+01 9.274288e+03 578601.0 1.168048e+07 \n", "month NaN NaN NaN NaN \n", "duration 5.148791e+07 8.852056e+10 9411373.0 4.805564e+09 \n", "campaign 1.501539e+02 4.093040e+03 101253.0 6.362050e+05 \n", "pdays 2.677950e+06 1.028068e+09 1482645.0 4.314776e+08 \n", "previous 6.229283e+02 1.566616e+05 21417.0 2.241810e+05 \n", "poutcome NaN NaN NaN NaN \n", "y NaN NaN NaN NaN \n", "\n", " sum_3 sum_4 \n", "age 3.049345e+09 1.508306e+11 \n", "job NaN NaN \n", "marital NaN NaN \n", "education NaN NaN \n", "default NaN NaN \n", "balance 9.838929e+15 -5.145462e+18 \n", "housing NaN NaN \n", "loan NaN NaN \n", "contact NaN NaN \n", "day 2.666609e+08 6.545197e+09 \n", "month NaN NaN \n", "duration 4.346867e+12 6.284213e+15 \n", "campaign 9.225279e+06 2.290914e+08 \n", "pdays 1.455730e+11 
5.725880e+13 \n", "previous 2.317377e+07 5.786521e+09 \n", "poutcome NaN NaN \n", "y NaN NaN \n", "\n", "[17 rows x 25 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats.table_statistics import table_statistics\n", "\n", "pd.set_option('display.max_rows', None)\n", "data_stats = table_statistics(vdf)\n", "data_stats" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "pd.reset_option('display.max_rows')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will demonstrate the following feature preprocessing capabilities of SecretFlow:\n", "\n", "- Value replacement\n", "- Missing value filling\n", "- WOE binning and substitution\n", "- One-hot encoding\n", "- Standardization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Value Replacement\n", "\n", "We first replace the values of the following features:\n", "\n", "| Feature | Description | Values and replacement rules |\n", "| :-----| :---- | :---- |\n", "| education | Education level | 'primary' -> 1, 'secondary' -> 2, 'tertiary' -> 3, 'unknown' -> NaN |\n", "| default | Has credit in default | 'no' -> 0, 'yes' -> 1, 'unknown' -> NaN |\n", "| housing | Has a housing loan | 'no' -> 0, 'yes' -> 1, 'unknown' -> NaN |\n", "| loan | Has a personal loan | 'no' -> 0, 'yes' -> 1, 'unknown' -> NaN |\n", "| month | Month of last contact | 'jan' -> 1, 'feb' -> 2, 'mar' -> 3, ..., 'nov' -> 11, 'dec' -> 12 |\n", "| y | Label | 'yes' -> 1, 'no' -> 0 |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the replacement, we use **sf.reveal** to inspect the result. Note that in production **sf.reveal** exposes the data in plaintext, so its use must be strictly limited and audited.\n", "\n", "> In production, strictly limit the use of **sf.reveal**." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " age job marital education default balance housing loan\n", "0 43 technician single 2.0 0 593 1 0\n", "1 46 management married 3.0 0 229 1 0\n", "2 42 technician married 2.0 0 8036 0 0\n", "3 38 admin. married 1.0 0 1487 0 0\n", "4 39 blue-collar married 2.0 0 138 0 0\n", "... ... ... ... ... ... ... ... 
...\n", "36588 36 self-employed single 3.0 0 4844 0 0\n", "36589 49 housemaid married 1.0 0 3376 0 0\n", "36590 52 entrepreneur married 3.0 0 1115 1 0\n", "36591 40 blue-collar married 1.0 0 48 0 0\n", "36592 46 services married 3.0 0 474 0 0\n", "\n", "[36593 rows x 8 columns]\n", " contact day month duration campaign pdays previous poutcome y\n", "0 unknown 5 5 55 1 -1 0 unknown 0\n", "1 unknown 5 5 197 1 -1 0 unknown 0\n", "2 unknown 9 6 948 5 -1 0 unknown 0\n", "3 unknown 9 6 332 2 -1 0 unknown 0\n", "4 unknown 9 6 61 2 -1 0 unknown 0\n", "... ... ... ... ... ... ... ... ... ..\n", "36588 unknown 9 6 1137 3 -1 0 unknown 1\n", "36589 unknown 9 6 119 3 -1 0 unknown 0\n", "36590 unknown 9 6 124 2 -1 0 unknown 0\n", "36591 unknown 9 6 100 5 -1 0 unknown 0\n", "36592 unknown 9 6 445 2 -1 0 unknown 0\n", "\n", "[36593 rows x 9 columns]\n" ] } ], "source": [ "# Map ordered education levels to integers; 'unknown' becomes NaN and is filled later\n", "vdf['education'] = vdf['education'].replace(\n", "    {'tertiary': 3, 'secondary': 2, 'primary': 1, 'unknown': np.nan}\n", ")\n", "\n", "# Binary features: 'no' -> 0, 'yes' -> 1, 'unknown' -> NaN\n", "vdf['default'] = vdf['default'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})\n", "\n", "vdf['housing'] = vdf['housing'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})\n", "\n", "vdf['loan'] = vdf['loan'].replace({'no': 0, 'yes': 1, 'unknown': np.nan})\n", "\n", "# Month abbreviations to month numbers\n", "vdf['month'] = vdf['month'].replace(\n", "    {\n", "        'jan': 1,\n", "        'feb': 2,\n", "        'mar': 3,\n", "        'apr': 4,\n", "        'may': 5,\n", "        'jun': 6,\n", "        'jul': 7,\n", "        'aug': 8,\n", "        'sep': 9,\n", "        'oct': 10,\n", "        'nov': 11,\n", "        'dec': 12,\n", "    }\n", ")\n", "\n", "# Label: 'no' -> 0, 'yes' -> 1\n", "vdf['y'] = vdf['y'].replace(\n", "    {\n", "        'no': 0,\n", "        'yes': 1,\n", "    }\n", ")\n", "\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "Value replacement is executed by the PYU device of each data owner and does not leak data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Missing Value Filling\n", "\n", "Next we fill in the missing values. Here every affected column is filled with its mode; other possible strategies include the mean and the median.\n", "\n", 
"Other approaches include dropping the rows that contain missing values, or using the complete rows as a training set to predict the missing values.\n", "\n", "After filling, we use **sf.reveal** to inspect the result." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " age job marital education default balance housing loan\n", "0 43 technician single 2.0 0 593 1 0\n", "1 46 management married 3.0 0 229 1 0\n", "2 42 technician married 2.0 0 8036 0 0\n", "3 38 admin. married 1.0 0 1487 0 0\n", "4 39 blue-collar married 2.0 0 138 0 0\n", "... ... ... ... ... ... ... ... ... ...\n", "36588 36 self-employed single 3.0 0 4844 0 0\n", "36589 49 housemaid married 1.0 0 3376 0 0\n", "36590 52 entrepreneur married 3.0 0 1115 1 0\n", "36591 40 blue-collar married 1.0 0 48 0 0\n", "36592 46 services married 3.0 0 474 0 0\n", "\n", "[36593 rows x 8 columns]\n", " contact day month duration campaign pdays previous poutcome y\n", "0 unknown 5 5 55 1 -1 0 unknown 0\n", "1 unknown 5 5 197 1 -1 0 unknown 0\n", "2 unknown 9 6 948 5 -1 0 unknown 0\n", "3 unknown 9 6 332 2 -1 0 unknown 0\n", "4 unknown 9 6 61 2 -1 0 unknown 0\n", "... ... ... ... ... ... ... ... ... ... 
..\n", "36588 unknown 9 6 1137 3 -1 0 unknown 1\n", "36589 unknown 9 6 119 3 -1 0 unknown 0\n", "36590 unknown 9 6 124 2 -1 0 unknown 0\n", "36591 unknown 9 6 100 5 -1 0 unknown 0\n", "36592 unknown 9 6 445 2 -1 0 unknown 0\n", "\n", "[36593 rows x 9 columns]\n" ] } ], "source": [ "# Fill the NaN values introduced by the value replacement with each column's mode\n", "vdf[\"education\"] = vdf[\"education\"].fillna(vdf[\"education\"].mode())\n", "vdf[\"default\"] = vdf[\"default\"].fillna(vdf[\"default\"].mode())\n", "vdf[\"housing\"] = vdf[\"housing\"].fillna(vdf[\"housing\"].mode())\n", "vdf[\"loan\"] = vdf[\"loan\"].fillna(vdf[\"loan\"].mode())\n", "\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "The fill values are computed by the PYU device of the data owner and are then applied by that same PYU device in the filling step, so no data is leaked." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### WOE Binning\n", "\n", "WOE binning discretizes a continuous feature into bins and then replaces each value with the weight of evidence (WOE) of its bin.\n", "\n", "One benefit of discretizing continuous features is that it mitigates hidden flaws in the data and makes model results more stable. For example, extreme values are a major factor affecting model quality: they can push model parameters too high or too low, or \"mislead\" the model into learning a relationship that does not actually exist as an important pattern. Discretization effectively weakens the influence of extreme values and outliers.\n", "\n", "The 75th percentile of duration is far below its maximum (4918), and its standard deviation (about 255) is large relative to its mean (about 257). We therefore discretize duration." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=44647)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " age job marital education default balance housing loan\n", "0 43 technician 
single 2.0 0 593 1 0\n", "1 46 management married 3.0 0 229 1 0\n", "2 42 technician married 2.0 0 8036 0 0\n", "3 38 admin. married 1.0 0 1487 0 0\n", "4 39 blue-collar married 2.0 0 138 0 0\n", "... ... ... ... ... ... ... ... ...\n", "36588 36 self-employed single 3.0 0 4844 0 0\n", "36589 49 housemaid married 1.0 0 3376 0 0\n", "36590 52 entrepreneur married 3.0 0 1115 1 0\n", "36591 40 blue-collar married 1.0 0 48 0 0\n", "36592 46 services married 3.0 0 474 0 0\n", "\n", "[36593 rows x 8 columns]\n", " contact day month duration campaign pdays previous poutcome y\n", "0 unknown 5 5 -3.983588 1 -1 0 unknown 0\n", "1 unknown 5 5 -1.232426 1 -1 0 unknown 0\n", "2 unknown 9 6 2.351786 5 -1 0 unknown 0\n", "3 unknown 9 6 0.185882 2 -1 0 unknown 0\n", "4 unknown 9 6 -3.983588 2 -1 0 unknown 0\n", "... ... ... ... ... ... ... ... ... ..\n", "36588 unknown 9 6 2.351786 3 -1 0 unknown 1\n", "36589 unknown 9 6 -1.193344 3 -1 0 unknown 0\n", "36590 unknown 9 6 -1.193344 2 -1 0 unknown 0\n", "36591 unknown 9 6 -1.667386 5 -1 0 unknown 0\n", "36592 unknown 9 6 0.474665 2 -1 0 unknown 0\n", "\n", "[36593 rows x 9 columns]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=56064)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] } ], "source": [ "from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning\n", "from secretflow.preprocessing.binning.vert_woe_substitution import VertWOESubstitution\n", "\n", "# Compute WOE binning rules for bob's `duration` feature against the label `y`\n", "binning = VertWoeBinning(spu)\n", "woe_rules = binning.binning(\n", "    vdf,\n", "    binning_method=\"chimerge\",\n", "    bin_num=4,\n", "    bin_names={alice: [], bob: [\"duration\"]},\n", "    label_name=\"y\",\n", ")\n", "\n", "# Replace each value with the WOE of the bin it falls into\n", "woe_sub = VertWOESubstitution()\n", "vdf = woe_sub.substitution(vdf, woe_rules)\n", "\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "WOE binning makes use of data from both alice and bob, so the related computation runs on the **SPU device** to ensure that raw data is not leaked." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-Hot Encoding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One-hot encoding converts categorical values into numeric ones. We apply it to features such as job and marital." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 
age education default balance housing loan job_admin. \\\n", "0 43 2.0 0 593 1 0 0.0 \n", "1 46 3.0 0 229 1 0 0.0 \n", "2 42 2.0 0 8036 0 0 0.0 \n", "3 38 1.0 0 1487 0 0 1.0 \n", "4 39 2.0 0 138 0 0 0.0 \n", "... ... ... ... ... ... ... ... \n", "36588 36 3.0 0 4844 0 0 0.0 \n", "36589 49 1.0 0 3376 0 0 0.0 \n", "36590 52 3.0 0 1115 1 0 0.0 \n", "36591 40 1.0 0 48 0 0 0.0 \n", "36592 46 3.0 0 474 0 0 0.0 \n", "\n", " job_blue-collar job_entrepreneur job_housemaid ... job_retired \\\n", "0 0.0 0.0 0.0 ... 0.0 \n", "1 0.0 0.0 0.0 ... 0.0 \n", "2 0.0 0.0 0.0 ... 0.0 \n", "3 0.0 0.0 0.0 ... 0.0 \n", "4 1.0 0.0 0.0 ... 0.0 \n", "... ... ... ... ... ... \n", "36588 0.0 0.0 0.0 ... 0.0 \n", "36589 0.0 0.0 1.0 ... 0.0 \n", "36590 0.0 1.0 0.0 ... 0.0 \n", "36591 1.0 0.0 0.0 ... 0.0 \n", "36592 0.0 0.0 0.0 ... 0.0 \n", "\n", " job_self-employed job_services job_student job_technician \\\n", "0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 1.0 \n", "3 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... \n", "36588 1.0 0.0 0.0 0.0 \n", "36589 0.0 0.0 0.0 0.0 \n", "36590 0.0 0.0 0.0 0.0 \n", "36591 0.0 0.0 0.0 0.0 \n", "36592 0.0 1.0 0.0 0.0 \n", "\n", " job_unemployed job_unknown marital_divorced marital_married \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 1.0 \n", "2 0.0 0.0 0.0 1.0 \n", "3 0.0 0.0 0.0 1.0 \n", "4 0.0 0.0 0.0 1.0 \n", "... ... ... ... ... \n", "36588 0.0 0.0 0.0 0.0 \n", "36589 0.0 0.0 0.0 1.0 \n", "36590 0.0 0.0 0.0 1.0 \n", "36591 0.0 0.0 0.0 1.0 \n", "36592 0.0 0.0 0.0 1.0 \n", "\n", " marital_single \n", "0 1.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 \n", "... ... \n", "36588 1.0 \n", "36589 0.0 \n", "36590 0.0 \n", "36591 0.0 \n", "36592 0.0 \n", "\n", "[36593 rows x 21 columns]\n", " duration campaign pdays previous y contact_cellular \\\n", "0 -3.983588 1 -1 0 0 0.0 \n", "1 -1.232426 1 -1 0 0 0.0 \n", "2 2.351786 5 -1 0 0 0.0 \n", "3 0.185882 2 -1 0 0 0.0 \n", "4 -3.983588 2 -1 0 0 0.0 \n", "... ... ... ... ... .. 
... \n", "36588 2.351786 3 -1 0 1 0.0 \n", "36589 -1.193344 3 -1 0 0 0.0 \n", "36590 -1.193344 2 -1 0 0 0.0 \n", "36591 -1.667386 5 -1 0 0 0.0 \n", "36592 0.474665 2 -1 0 0 0.0 \n", "\n", " contact_telephone contact_unknown month_1.0 month_2.0 ... \\\n", "0 0.0 1.0 0.0 0.0 ... \n", "1 0.0 1.0 0.0 0.0 ... \n", "2 0.0 1.0 0.0 0.0 ... \n", "3 0.0 1.0 0.0 0.0 ... \n", "4 0.0 1.0 0.0 0.0 ... \n", "... ... ... ... ... ... \n", "36588 0.0 1.0 0.0 0.0 ... \n", "36589 0.0 1.0 0.0 0.0 ... \n", "36590 0.0 1.0 0.0 0.0 ... \n", "36591 0.0 1.0 0.0 0.0 ... \n", "36592 0.0 1.0 0.0 0.0 ... \n", "\n", " day_26.0 day_27.0 day_28.0 day_29.0 day_30.0 day_31.0 \\\n", "0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 \n", "... ... ... ... ... ... ... \n", "36588 0.0 0.0 0.0 0.0 0.0 0.0 \n", "36589 0.0 0.0 0.0 0.0 0.0 0.0 \n", "36590 0.0 0.0 0.0 0.0 0.0 0.0 \n", "36591 0.0 0.0 0.0 0.0 0.0 0.0 \n", "36592 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", " poutcome_failure poutcome_other poutcome_success poutcome_unknown \n", "0 0.0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 1.0 \n", "2 0.0 0.0 0.0 1.0 \n", "3 0.0 0.0 0.0 1.0 \n", "4 0.0 0.0 0.0 1.0 \n", "... ... ... ... ... 
\n", "36588 0.0 0.0 0.0 1.0 \n", "36589 0.0 0.0 0.0 1.0 \n", "36590 0.0 0.0 0.0 1.0 \n", "36591 0.0 0.0 0.0 1.0 \n", "36592 0.0 0.0 0.0 1.0 \n", "\n", "[36593 rows x 55 columns]\n" ] } ], "source": [ "from secretflow.preprocessing.encoder import OneHotEncoder\n", "\n", "encoder = OneHotEncoder()\n", "onehot_cols = [\"job\", \"marital\", \"contact\", \"month\", \"day\", \"poutcome\"]\n", "\n", "# vdf_hat keeps only the columns that are not one-hot encoded;\n", "# it is for VIF and correlation only\n", "vdf_hat = vdf.drop(columns=onehot_cols)\n", "\n", "# Encode each column and append the resulting indicator columns to vdf\n", "for col in onehot_cols:\n", "    transformed_df = encoder.fit_transform(vdf[col])\n", "    vdf[transformed_df.dtypes.index] = transformed_df\n", "\n", "# Drop the original, un-encoded columns\n", "vdf = vdf.drop(columns=onehot_cols)\n", "\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "One-hot encoding is executed by the PYU device of each data owner and does not leak data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Standardization\n", "\n", "Large differences in scale between features make it hard for a model to converge, so we usually standardize the values first." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=57413)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=57413)\u001b[0m 
warnings.warn(\n", "\u001b[2m\u001b[36m(_run pid=57412)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=57412)\u001b[0m warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " age education default balance housing loan \\\n", "0 0.193250 -0.214015 -0.134056 -0.254360 0.890792 -0.437727 \n", "1 0.475648 1.317062 -0.134056 -0.374630 0.890792 -0.437727 \n", "2 0.099118 -0.214015 -0.134056 2.204893 -1.122596 -0.437727 \n", "3 -0.277412 -1.745093 -0.134056 0.041028 -1.122596 -0.437727 \n", "4 -0.183280 -0.214015 -0.134056 -0.404697 -1.122596 -0.437727 \n", "... ... ... ... ... ... ... \n", "36588 -0.465677 1.317062 -0.134056 1.150219 -1.122596 -0.437727 \n", "36589 0.758046 -1.745093 -0.134056 0.665175 -1.122596 -0.437727 \n", "36590 1.040444 1.317062 -0.134056 -0.081885 0.890792 -0.437727 \n", "36591 -0.089147 -1.745093 -0.134056 -0.434434 -1.122596 -0.437727 \n", "36592 0.475648 1.317062 -0.134056 -0.293679 -1.122596 -0.437727 \n", "\n", " job_admin. job_blue-collar job_entrepreneur job_housemaid ... \\\n", "0 -0.361290 -0.524802 -0.185399 -0.168477 ... \n", "1 -0.361290 -0.524802 -0.185399 -0.168477 ... \n", "2 -0.361290 -0.524802 -0.185399 -0.168477 ... \n", "3 2.767863 -0.524802 -0.185399 -0.168477 ... \n", "4 -0.361290 1.905480 -0.185399 -0.168477 ... \n", "... ... ... ... ... ... \n", "36588 -0.361290 -0.524802 -0.185399 -0.168477 ... \n", "36589 -0.361290 -0.524802 -0.185399 5.935545 ... \n", "36590 -0.361290 -0.524802 5.393786 -0.168477 ... \n", "36591 -0.361290 1.905480 -0.185399 -0.168477 ... \n", "36592 -0.361290 -0.524802 -0.185399 -0.168477 ... 
\n", "\n", " job_retired job_self-employed job_services job_student \\\n", "0 -0.229175 -0.188918 -0.317656 -0.145439 \n", "1 -0.229175 -0.188918 -0.317656 -0.145439 \n", "2 -0.229175 -0.188918 -0.317656 -0.145439 \n", "3 -0.229175 -0.188918 -0.317656 -0.145439 \n", "4 -0.229175 -0.188918 -0.317656 -0.145439 \n", "... ... ... ... ... \n", "36588 -0.229175 5.293301 -0.317656 -0.145439 \n", "36589 -0.229175 -0.188918 -0.317656 -0.145439 \n", "36590 -0.229175 -0.188918 -0.317656 -0.145439 \n", "36591 -0.229175 -0.188918 -0.317656 -0.145439 \n", "36592 -0.229175 -0.188918 3.148056 -0.145439 \n", "\n", " job_technician job_unemployed job_unknown marital_divorced \\\n", "0 2.225530 -0.172045 -0.082437 -0.361145 \n", "1 -0.449331 -0.172045 -0.082437 -0.361145 \n", "2 2.225530 -0.172045 -0.082437 -0.361145 \n", "3 -0.449331 -0.172045 -0.082437 -0.361145 \n", "4 -0.449331 -0.172045 -0.082437 -0.361145 \n", "... ... ... ... ... \n", "36588 -0.449331 -0.172045 -0.082437 -0.361145 \n", "36589 -0.449331 -0.172045 -0.082437 -0.361145 \n", "36590 -0.449331 -0.172045 -0.082437 -0.361145 \n", "36591 -0.449331 -0.172045 -0.082437 -0.361145 \n", "36592 -0.449331 -0.172045 -0.082437 -0.361145 \n", "\n", " marital_married marital_single \n", "0 -1.231619 1.59589 \n", "1 0.811939 -0.62661 \n", "2 0.811939 -0.62661 \n", "3 0.811939 -0.62661 \n", "4 0.811939 -0.62661 \n", "... ... ... \n", "36588 -1.231619 1.59589 \n", "36589 0.811939 -0.62661 \n", "36590 0.811939 -0.62661 \n", "36591 0.811939 -0.62661 \n", "36592 0.811939 -0.62661 \n", "\n", "[36593 rows x 21 columns]\n", " duration campaign pdays previous y contact_cellular \\\n", "0 -1.990727 -0.566486 -0.4121 -0.243363 0 -1.357755 \n", "1 -0.300264 -0.566486 -0.4121 -0.243363 0 -1.357755 \n", "2 1.902071 0.715878 -0.4121 -0.243363 0 -1.357755 \n", "3 0.571222 -0.245895 -0.4121 -0.243363 0 -1.357755 \n", "4 -1.990727 -0.245895 -0.4121 -0.243363 0 -1.357755 \n", "... ... ... ... ... .. ... 
\n", "36588 1.902071 0.074696 -0.4121 -0.243363 1 -1.357755 \n", "36589 -0.276250 0.074696 -0.4121 -0.243363 0 -1.357755 \n", "36590 -0.276250 -0.245895 -0.4121 -0.243363 0 -1.357755 \n", "36591 -0.567526 0.715878 -0.4121 -0.243363 0 -1.357755 \n", "36592 0.748666 -0.245895 -0.4121 -0.243363 0 -1.357755 \n", "\n", " contact_telephone contact_unknown month_1.0 month_2.0 ... \\\n", "0 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "1 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "2 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "3 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "4 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "... ... ... ... ... ... \n", "36588 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "36589 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "36590 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "36591 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "36592 -0.260775 1.572308 -0.179805 -0.249288 ... \n", "\n", " day_26.0 day_27.0 day_28.0 day_29.0 day_30.0 day_31.0 \\\n", "0 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "1 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "2 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "3 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "4 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "... ... ... ... ... ... ... 
\n", "36588 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "36589 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "36590 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "36591 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "36592 -0.154952 -0.159514 -0.206026 -0.201369 -0.187673 -0.118176 \n", "\n", " poutcome_failure poutcome_other poutcome_success poutcome_unknown \n", "0 -0.348698 -0.206673 -0.18697 0.473513 \n", "1 -0.348698 -0.206673 -0.18697 0.473513 \n", "2 -0.348698 -0.206673 -0.18697 0.473513 \n", "3 -0.348698 -0.206673 -0.18697 0.473513 \n", "4 -0.348698 -0.206673 -0.18697 0.473513 \n", "... ... ... ... ... \n", "36588 -0.348698 -0.206673 -0.18697 0.473513 \n", "36589 -0.348698 -0.206673 -0.18697 0.473513 \n", "36590 -0.348698 -0.206673 -0.18697 0.473513 \n", "36591 -0.348698 -0.206673 -0.18697 0.473513 \n", "36592 -0.348698 -0.206673 -0.18697 0.473513 \n", "\n", "[36593 rows x 55 columns]\n" ] } ], "source": [ "from secretflow.preprocessing import StandardScaler\n", "\n", "# Standardize every feature column; the label y is left unchanged\n", "X = vdf.drop(columns=['y'])\n", "y = vdf['y']\n", "scaler = StandardScaler()\n", "X = scaler.fit_transform(X)\n", "vdf[X.columns] = X\n", "print(sf.reveal(vdf.partitions[alice].data))\n", "print(sf.reveal(vdf.partitions[bob].data))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security Discussion\n", "\n", "Standardization is executed by the PYU device of each data owner and does not leak data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More\n", "\n", "SecretFlow supports further feature preprocessing capabilities; see this [document](./data_preprocessing_with_data_frame.ipynb)." 
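] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, we have completed all of the feature preprocessing.\n", "\n", "> This tutorial is mainly intended to demonstrate SecretFlow's preprocessing capabilities. The choice of preprocessing methods used here may be debatable; please bear with us." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before modeling, we should analyze the data we are using to determine whether any feature preprocessing needs to be repeated.\n", "\n", "Below we demonstrate the following SecretFlow data analysis capabilities:\n", "\n", "- Full-table statistics\n", "- Correlation coefficient matrix\n", "- VIF (variance inflation factor) calculation\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Full-Table Statistics\n", "\n", "We provide a function similar to **pd.DataFrame.describe** that reports basic statistics for every feature.\n", "\n", "> During feature preprocessing, you can call the full-table statistics repeatedly to monitor the effect of each step." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "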
|  | datatype | total_count | count | count_na | min | max | mean | var | std | sem | ... | moment_2 | moment_3 | moment_4 | central_moment_2 | central_moment_3 | central_moment_4 | sum | sum_2 | sum_3 | sum_4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| duration | float64 | 36593 | 36593 | 0 | -1.990727 | 1.902071 | -6.601933e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.557851 | 2.899545 | 1.000000 | -0.557851 | 2.899545 | -2.415845e-13 | 36593.0 | -2.041344e+04 | 1.061031e+05 |
| campaign | float64 | 36593 | 36593 | 0 | -0.566486 | 19.310148 | -7.611640e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.947549 | 43.236502 | 1.000000 | 4.947549 | 43.236502 | -2.785328e-12 | 36593.0 | 1.810457e+05 | 1.582153e+06 |
| pdays | float64 | 36593 | 36593 | 0 | -0.412100 | 8.074647 | 3.728150e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.618954 | 9.979817 | 1.000000 | 2.618954 | 9.979817 | 1.364242e-12 | 36593.0 | 9.583538e+04 | 3.651915e+05 |
| previous | float64 | 36593 | 36593 | 0 | -0.243363 | 114.104096 | 4.349509e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 44.783657 | 4683.146470 | 1.000000 | 44.783657 | 4683.146470 | 1.591616e-12 | 36593.0 | 1.638768e+06 | 1.713704e+08 |
| y | int64 | 36593 | 36593 | 0 | 0.000000 | 1.000000 | 1.162517e-01 | 0.102740 | 0.320531 | 0.001676 | ... | 0.116252 | 0.116252 | 0.116252 | 0.102737 | 0.078851 | 0.071072 | 4.254000e+03 | 4254.0 | 4.254000e+03 | 4.254000e+03 |
| contact_cellular | float64 | 36593 | 36593 | 0 | -1.357755 | 0.736510 | -7.456301e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.621246 | 1.385946 | 1.000000 | -0.621246 | 1.385946 | -2.728484e-12 | 36593.0 | -2.273325e+04 | 5.071593e+04 |
| contact_telephone | float64 | 36593 | 36593 | 0 | -0.260775 | 3.834729 | 4.349509e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 3.573955 | 13.773154 | 1.000000 | 3.573955 | 13.773154 | 1.591616e-12 | 36593.0 | 1.307817e+05 | 5.040010e+05 |
| contact_unknown | float64 | 36593 | 36593 | 0 | -0.636008 | 1.572308 | 0.000000e+00 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 0.936300 | 1.876657 | 1.000000 | 0.936300 | 1.876657 | 0.000000e+00 | 36593.0 | 3.426201e+04 | 6.867251e+04 |
| month_1.0 | float64 | 36593 | 36593 | 0 | -0.179805 | 5.561570 | 9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.381765 | 29.963395 | 1.000000 | 5.381765 | 29.963395 | 3.410605e-13 | 36593.0 | 1.969349e+05 | 1.096450e+06 |
| month_2.0 | float64 | 36593 | 36593 | 0 | -0.249288 | 4.011427 | 6.524263e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 3.762139 | 15.153690 | 1.000000 | 3.762139 | 15.153690 | 2.387424e-12 | 36593.0 | 1.376680e+05 | 5.545190e+05 |
| month_3.0 | float64 | 36593 | 36593 | 0 | -0.103252 | 9.685067 | 9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 9.581815 | 92.811179 | 1.000000 | 9.581815 | 92.811179 | 3.410605e-13 | 36593.0 | 3.506274e+05 | 3.396239e+06 |
| month_4.0 | float64 | 36593 | 36593 | 0 | -0.264224 | 3.784667 | -6.524263e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 3.520443 | 13.393516 | 1.000000 | 3.520443 | 13.393516 | -2.387424e-12 | 36593.0 | 1.288236e+05 | 4.901089e+05 |
| month_5.0 | float64 | 36593 | 36593 | 0 | -0.662036 | 1.510493 | -1.864075e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 0.848457 | 1.719880 | 1.000000 | 0.848457 | 1.719880 | -6.821210e-12 | 36593.0 | 3.104760e+04 | 6.293557e+04 |
| month_6.0 | float64 | 36593 | 36593 | 0 | -0.364617 | 2.742607 | -3.728150e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.377990 | 6.654836 | 1.000000 | 2.377990 | 6.654836 | -1.364242e-12 | 36593.0 | 8.701779e+04 | 2.435204e+05 |
| month_7.0 | float64 | 36593 | 36593 | 0 | -0.424534 | 2.355525 | 0.000000e+00 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.930991 | 4.728726 | 1.000000 | 1.930991 | 4.728726 | 0.000000e+00 | 36593.0 | 7.066075e+04 | 1.730383e+05 |
| month_8.0 | float64 | 36593 | 36593 | 0 | -0.400629 | 2.496075 | -1.242717e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.095446 | 5.390893 | 1.000000 | 2.095446 | 5.390893 | -4.547474e-12 | 36593.0 | 7.667865e+04 | 1.972689e+05 |
| month_9.0 | float64 | 36593 | 36593 | 0 | -0.113697 | 8.795317 | -4.038830e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 8.681620 | 76.370529 | 1.000000 | 8.681620 | 76.370529 | -1.477929e-12 | 36593.0 | 3.176865e+05 | 2.794627e+06 |
| month_10.0 | float64 | 36593 | 36593 | 0 | -0.129440 | 7.725601 | -4.660188e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 7.596161 | 58.701663 | 1.000000 | 7.596161 | 58.701663 | -1.705303e-13 | 36593.0 | 2.779663e+05 | 2.148070e+06 |
| month_11.0 | float64 | 36593 | 36593 | 0 | -0.309774 | 3.228163 | 1.366988e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.918389 | 9.516996 | 1.000000 | 2.918389 | 9.516996 | 5.002221e-12 | 36593.0 | 1.067926e+05 | 3.482554e+05 |
| month_12.0 | float64 | 36593 | 36593 | 0 | -0.067096 | 14.903961 | 6.213584e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 14.836865 | 221.132551 | 1.000000 | 14.836865 | 221.132551 | 2.273737e-13 | 36593.0 | 5.429254e+05 | 8.091903e+06 |
| day_1.0 | float64 | 36593 | 36593 | 0 | -0.085571 | 11.686217 | 9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 11.600646 | 135.574992 | 1.000000 | 11.600646 | 135.574992 | 3.410605e-13 | 36593.0 | 4.245024e+05 | 4.961096e+06 |
| day_2.0 | float64 | 36593 | 36593 | 0 | -0.170439 | 5.867198 | -1.242717e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.696758 | 33.453057 | 1.000000 | 5.696758 | 33.453057 | -4.547474e-13 | 36593.0 | 2.084615e+05 | 1.224148e+06 |
| day_3.0 | float64 | 36593 | 36593 | 0 | -0.156425 | 6.392841 | 0.000000e+00 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.236416 | 39.892890 | 1.000000 | 6.236416 | 39.892890 | 0.000000e+00 | 36593.0 | 2.282092e+05 | 1.459801e+06 |
| day_4.0 | float64 | 36593 | 36593 | 0 | -0.182141 | 5.490262 | -7.456301e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.308122 | 29.176154 | 1.000000 | 5.308122 | 29.176154 | -2.728484e-12 | 36593.0 | 1.942401e+05 | 1.067643e+06 |
| day_5.0 | float64 | 36593 | 36593 | 0 | -0.207749 | 4.813497 | -7.456301e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.605747 | 22.212909 | 1.000000 | 4.605747 | 22.212909 | -2.728484e-12 | 36593.0 | 1.685381e+05 | 8.128370e+05 |
| day_6.0 | float64 | 36593 | 36593 | 0 | -0.211302 | 4.732553 | 8.077659e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.521251 | 21.441708 | 1.000000 | 4.521251 | 21.441708 | 2.955858e-12 | 36593.0 | 1.654461e+05 | 7.846164e+05 |
| day_7.0 | float64 | 36593 | 36593 | 0 | -0.206170 | 4.850375 | 9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.644206 | 22.568645 | 1.000000 | 4.644206 | 22.568645 | 3.410605e-13 | 36593.0 | 1.699454e+05 | 8.258544e+05 |
| day_8.0 | float64 | 36593 | 36593 | 0 | -0.207749 | 4.813497 | -6.213584e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.605747 | 22.212909 | 1.000000 | 4.605747 | 22.212909 | -2.273737e-12 | 36593.0 | 1.685381e+05 | 8.128370e+05 |
| day_9.0 | float64 | 36593 | 36593 | 0 | -0.190156 | 5.258844 | 6.213584e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.068688 | 26.691602 | 1.000000 | 5.068688 | 26.691602 | 2.273737e-12 | 36593.0 | 1.854785e+05 | 9.767258e+05 |
| day_10.0 | float64 | 36593 | 36593 | 0 | -0.106317 | 9.405819 | -1.864075e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 9.299502 | 87.480741 | 1.000000 | 9.299502 | 87.480741 | -6.821210e-13 | 36593.0 | 3.402967e+05 | 3.201183e+06 |
| day_11.0 | float64 | 36593 | 36593 | 0 | -0.182221 | 5.487850 | 7.145621e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.305629 | 29.149701 | 1.000000 | 5.305629 | 29.149701 | 2.614797e-12 | 36593.0 | 1.941489e+05 | 1.066675e+06 |
| day_12.0 | float64 | 36593 | 36593 | 0 | -0.190079 | 5.260979 | -3.106792e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.070900 | 26.714030 | 1.000000 | 5.070900 | 26.714030 | -1.136868e-12 | 36593.0 | 1.855595e+05 | 9.775465e+05 |
| day_13.0 | float64 | 36593 | 36593 | 0 | -0.191310 | 5.227117 | -2.174754e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.035807 | 26.359355 | 1.000000 | 5.035807 | 26.359355 | -7.958079e-13 | 36593.0 | 1.842753e+05 | 9.645679e+05 |
| day_14.0 | float64 | 36593 | 36593 | 0 | -0.207032 | 4.830161 | 6.213584e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.623128 | 22.373315 | 1.000000 | 4.623128 | 22.373315 | 2.273737e-12 | 36593.0 | 1.691741e+05 | 8.187067e+05 |
| day_15.0 | float64 | 36593 | 36593 | 0 | -0.198263 | 5.043811 | 3.106792e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.845548 | 24.479337 | 1.000000 | 4.845548 | 24.479337 | 1.136868e-12 | 36593.0 | 1.773131e+05 | 8.957724e+05 |
| day_16.0 | float64 | 36593 | 36593 | 0 | -0.179886 | 5.559067 | -1.864075e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.379181 | 29.935585 | 1.000000 | 5.379181 | 29.935585 | -6.821210e-13 | 36593.0 | 1.968404e+05 | 1.095433e+06 |
| day_17.0 | float64 | 36593 | 36593 | 0 | -0.212148 | 4.713694 | -5.592226e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.501546 | 21.263915 | 1.000000 | 4.501546 | 21.263915 | -2.046363e-12 | 36593.0 | 1.647251e+05 | 7.781105e+05 |
| day_18.0 | float64 | 36593 | 36593 | 0 | -0.231608 | 4.317635 | -1.242717e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.086027 | 17.695618 | 1.000000 | 4.086027 | 17.695618 | -4.547474e-13 | 36593.0 | 1.495200e+05 | 6.475357e+05 |
| day_19.0 | float64 | 36593 | 36593 | 0 | -0.200854 | 4.978743 | 8.699017e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.777889 | 23.828221 | 1.000000 | 4.777889 | 23.828221 | 3.183231e-12 | 36593.0 | 1.748373e+05 | 8.719461e+05 |
| day_20.0 | float64 | 36593 | 36593 | 0 | -0.253222 | 3.949109 | 3.417471e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 3.695888 | 14.659586 | 1.000000 | 3.695888 | 14.659586 | 1.250555e-12 | 36593.0 | 1.352436e+05 | 5.364382e+05 |
| day_21.0 | float64 | 36593 | 36593 | 0 | -0.216265 | 4.623964 | -5.902905e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.407699 | 20.427810 | 1.000000 | 4.407699 | 20.427810 | -2.160050e-12 | 36593.0 | 1.612909e+05 | 7.475149e+05 |
| day_22.0 | float64 | 36593 | 36593 | 0 | -0.143765 | 6.955808 | 3.417471e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.812043 | 47.403934 | 1.000000 | 6.812043 | 47.403934 | 1.250555e-12 | 36593.0 | 2.492731e+05 | 1.734652e+06 |
| day_23.0 | float64 | 36593 | 36593 | 0 | -0.146513 | 6.825333 | -4.349509e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.678820 | 45.606642 | 1.000000 | 6.678820 | 45.606642 | -1.591616e-12 | 36593.0 | 2.443981e+05 | 1.668884e+06 |
| day_24.0 | float64 | 36593 | 36593 | 0 | -0.100513 | 9.948913 | 4.815528e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 9.848400 | 97.990977 | 1.000000 | 9.848400 | 97.990977 | 1.762146e-12 | 36593.0 | 3.603825e+05 | 3.585784e+06 |
| day_25.0 | float64 | 36593 | 36593 | 0 | -0.138733 | 7.208092 | -9.320376e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 7.069359 | 50.975831 | 1.000000 | 7.069359 | 50.975831 | -3.410605e-13 | 36593.0 | 2.586890e+05 | 1.865359e+06 |
| day_26.0 | float64 | 36593 | 36593 | 0 | -0.154952 | 6.453618 | 1.242717e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.298666 | 40.673194 | 1.000000 | 6.298666 | 40.673194 | 4.547474e-13 | 36593.0 | 2.304871e+05 | 1.488354e+06 |
| day_27.0 | float64 | 36593 | 36593 | 0 | -0.159514 | 6.269024 | -4.349509e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.109509 | 38.326106 | 1.000000 | 6.109509 | 38.326106 | -1.591616e-12 | 36593.0 | 2.235653e+05 | 1.402467e+06 |
| day_28.0 | float64 | 36593 | 36593 | 0 | -0.206026 | 4.853768 | -7.456301e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.647742 | 22.601507 | 1.000000 | 4.647742 | 22.601507 | -2.728484e-12 | 36593.0 | 1.700748e+05 | 8.270569e+05 |
| day_29.0 | float64 | 36593 | 36593 | 0 | -0.201369 | 4.966014 | -6.213584e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.764645 | 23.701840 | 1.000000 | 4.764645 | 23.701840 | -2.273737e-13 | 36593.0 | 1.743526e+05 | 8.673214e+05 |
| day_30.0 | float64 | 36593 | 36593 | 0 | -0.187673 | 5.328411 | 6.213584e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.140738 | 27.427189 | 1.000000 | 5.140738 | 27.427189 | 2.273737e-12 | 36593.0 | 1.881150e+05 | 1.003643e+06 |
| day_31.0 | float64 | 36593 | 36593 | 0 | -0.118176 | 8.461983 | 2.485434e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 8.343808 | 70.619124 | 1.000000 | 8.343808 | 70.619124 | 9.094947e-13 | 36593.0 | 3.053249e+05 | 2.584166e+06 |
| poutcome_failure | float64 | 36593 | 36593 | 0 | -0.348698 | 2.867813 | -6.834942e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.519115 | 7.345941 | 1.000000 | 2.519115 | 7.345941 | -2.501110e-12 | 36593.0 | 9.218198e+04 | 2.688100e+05 |
| poutcome_other | float64 | 36593 | 36593 | 0 | -0.206673 | 4.838554 | -5.281546e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.631881 | 22.454322 | 1.000000 | 4.631881 | 22.454322 | -1.932676e-12 | 36593.0 | 1.694944e+05 | 8.216710e+05 |
| poutcome_success | float64 | 36593 | 36593 | 0 | -0.186970 | 5.348457 | 1.242717e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.161487 | 27.640945 | 1.000000 | 5.161487 | 27.640945 | 4.547474e-13 | 36593.0 | 1.888743e+05 | 1.011465e+06 |
| poutcome_unknown | float64 | 36593 | 36593 | 0 | -2.111874 | 0.473513 | -1.304853e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -1.638361 | 3.684227 | 1.000000 | -1.638361 | 3.684227 | -4.774847e-12 | 36593.0 | -5.995254e+04 | 1.348169e+05 |
| age | float64 | 36593 | 36593 | 0 | -2.160064 | 5.088144 | -3.106792e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 0.678868 | 3.300319 | 1.000000 | 0.678868 | 3.300319 | -1.136868e-12 | 36593.0 | 2.484182e+04 | 1.207686e+05 |
| education | float64 | 36593 | 36593 | 0 | -1.745093 | 1.317062 | -2.609705e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.150154 | 2.303197 | 1.000000 | -0.150154 | 2.303197 | -9.549694e-12 | 36593.0 | -5.494580e+03 | 8.428090e+04 |
| default | float64 | 36593 | 36593 | 0 | -0.134056 | 7.459592 | -8.543678e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 7.325536 | 54.663482 | 1.000000 | 7.325536 | 54.663482 | -3.126388e-13 | 36593.0 | 2.680633e+05 | 2.000301e+06 |
| balance | float64 | 36593 | 36593 | 0 | -2.712622 | 33.293644 | -1.553396e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 8.256555 | 144.415577 | 1.000000 | 8.256555 | 144.415577 | -5.684342e-13 | 36593.0 | 3.021321e+05 | 5.284599e+06 |
| housing | float64 | 36593 | 36593 | 0 | -1.122596 | 0.890792 | -1.864075e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.231804 | 1.053733 | 1.000000 | -0.231804 | 1.053733 | -6.821210e-13 | 36593.0 | -8.482406e+03 | 3.855926e+04 |
| loan | float64 | 36593 | 36593 | 0 | -0.437727 | 2.284528 | 6.058244e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.846801 | 4.410674 | 1.000000 | 1.846801 | 4.410674 | 2.216893e-12 | 36593.0 | 6.757999e+04 | 1.613998e+05 |
| job_admin. | float64 | 36593 | 36593 | 0 | -0.361290 | 2.767863 | 2.019415e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.406573 | 6.791595 | 1.000000 | 2.406573 | 6.791595 | 7.389644e-13 | 36593.0 | 8.806374e+04 | 2.485248e+05 |
| job_blue-collar | float64 | 36593 | 36593 | 0 | -0.524802 | 1.905480 | -8.699017e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.380677 | 2.906270 | 1.000000 | 1.380677 | 2.906270 | -3.183231e-12 | 36593.0 | 5.052313e+04 | 1.063492e+05 |
| job_entrepreneur | float64 | 36593 | 36593 | 0 | -0.185399 | 5.393786 | 1.398056e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.208387 | 28.127300 | 1.000000 | 5.208387 | 28.127300 | 5.115908e-13 | 36593.0 | 1.905905e+05 | 1.029262e+06 |
| job_housemaid | float64 | 36593 | 36593 | 0 | -0.168477 | 5.935545 | -3.728150e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.767068 | 34.259077 | 1.000000 | 5.767068 | 34.259077 | -1.364242e-12 | 36593.0 | 2.110343e+05 | 1.253642e+06 |
| job_management | float64 | 36593 | 36593 | 0 | -0.511776 | 1.953980 | 6.213584e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.442204 | 3.079953 | 1.000000 | 1.442204 | 3.079953 | 2.273737e-13 | 36593.0 | 5.277458e+04 | 1.127047e+05 |
| job_retired | float64 | 36593 | 36593 | 0 | -0.229175 | 4.363482 | 5.902905e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 4.134308 | 18.092499 | 1.000000 | 4.134308 | 18.092499 | 2.160050e-12 | 36593.0 | 1.512867e+05 | 6.620588e+05 |
| job_self-employed | float64 | 36593 | 36593 | 0 | -0.188918 | 5.293301 | 2.019415e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.104383 | 27.054723 | 1.000000 | 5.104383 | 27.054723 | 7.389644e-13 | 36593.0 | 1.867847e+05 | 9.900135e+05 |
| job_services | float64 | 36593 | 36593 | 0 | -0.317656 | 3.148056 | 2.796113e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.830400 | 9.011162 | 1.000000 | 2.830400 | 9.011162 | 1.023182e-12 | 36593.0 | 1.035728e+05 | 3.297455e+05 |
| job_student | float64 | 36593 | 36593 | 0 | -0.145439 | 6.875735 | 1.553396e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 6.730296 | 46.296878 | 1.000000 | 6.730296 | 46.296878 | 5.684342e-14 | 36593.0 | 2.462817e+05 | 1.694142e+06 |
| job_technician | float64 | 36593 | 36593 | 0 | -0.449331 | 2.225530 | 1.553396e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 1.776199 | 4.154884 | 1.000000 | 1.776199 | 4.154884 | 5.684342e-13 | 36593.0 | 6.499646e+04 | 1.520397e+05 |
| job_unemployed | float64 | 36593 | 36593 | 0 | -0.172045 | 5.812420 | 3.495141e-18 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 5.640374 | 32.813820 | 1.000000 | 5.640374 | 32.813820 | 1.278977e-13 | 36593.0 | 2.063982e+05 | 1.200756e+06 |
| job_unknown | float64 | 36593 | 36593 | 0 | -0.082437 | 12.130532 | 4.504848e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 12.048095 | 146.156593 | 1.000000 | 12.048095 | 146.156593 | 1.648459e-12 | 36593.0 | 4.408759e+05 | 5.348308e+06 |
| marital_divorced | float64 | 36593 | 36593 | 0 | -0.361145 | 2.768974 | -7.844650e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 2.407830 | 6.797645 | 1.000000 | 2.407830 | 6.797645 | -2.870593e-12 | 36593.0 | 8.810972e+04 | 2.487462e+05 |
| marital_married | float64 | 36593 | 36593 | 0 | -1.231619 | 0.811939 | 1.198057e-16 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | -0.419680 | 1.176131 | 1.000000 | -0.419680 | 1.176131 | 4.384049e-12 | 36593.0 | -1.535734e+04 | 4.303817e+04 |
| marital_single | float64 | 36593 | 36593 | 0 | -0.626610 | 1.595890 | -2.213589e-17 | 1.000027 | 1.000014 | 0.005228 | ... | 1.000000 | 0.969280 | 1.939504 | 1.000000 | 0.969280 | 1.939504 | -8.100187e-13 | 36593.0 | 3.546887e+04 | 7.097227e+04 |
\n", "

76 rows × 25 columns

\n", "
" ], "text/plain": [ " datatype total_count count count_na min \\\n", "duration float64 36593 36593 0 -1.990727 \n", "campaign float64 36593 36593 0 -0.566486 \n", "pdays float64 36593 36593 0 -0.412100 \n", "previous float64 36593 36593 0 -0.243363 \n", "y int64 36593 36593 0 0.000000 \n", "contact_cellular float64 36593 36593 0 -1.357755 \n", "contact_telephone float64 36593 36593 0 -0.260775 \n", "contact_unknown float64 36593 36593 0 -0.636008 \n", "month_1.0 float64 36593 36593 0 -0.179805 \n", "month_2.0 float64 36593 36593 0 -0.249288 \n", "month_3.0 float64 36593 36593 0 -0.103252 \n", "month_4.0 float64 36593 36593 0 -0.264224 \n", "month_5.0 float64 36593 36593 0 -0.662036 \n", "month_6.0 float64 36593 36593 0 -0.364617 \n", "month_7.0 float64 36593 36593 0 -0.424534 \n", "month_8.0 float64 36593 36593 0 -0.400629 \n", "month_9.0 float64 36593 36593 0 -0.113697 \n", "month_10.0 float64 36593 36593 0 -0.129440 \n", "month_11.0 float64 36593 36593 0 -0.309774 \n", "month_12.0 float64 36593 36593 0 -0.067096 \n", "day_1.0 float64 36593 36593 0 -0.085571 \n", "day_2.0 float64 36593 36593 0 -0.170439 \n", "day_3.0 float64 36593 36593 0 -0.156425 \n", "day_4.0 float64 36593 36593 0 -0.182141 \n", "day_5.0 float64 36593 36593 0 -0.207749 \n", "day_6.0 float64 36593 36593 0 -0.211302 \n", "day_7.0 float64 36593 36593 0 -0.206170 \n", "day_8.0 float64 36593 36593 0 -0.207749 \n", "day_9.0 float64 36593 36593 0 -0.190156 \n", "day_10.0 float64 36593 36593 0 -0.106317 \n", "day_11.0 float64 36593 36593 0 -0.182221 \n", "day_12.0 float64 36593 36593 0 -0.190079 \n", "day_13.0 float64 36593 36593 0 -0.191310 \n", "day_14.0 float64 36593 36593 0 -0.207032 \n", "day_15.0 float64 36593 36593 0 -0.198263 \n", "day_16.0 float64 36593 36593 0 -0.179886 \n", "day_17.0 float64 36593 36593 0 -0.212148 \n", "day_18.0 float64 36593 36593 0 -0.231608 \n", "day_19.0 float64 36593 36593 0 -0.200854 \n", "day_20.0 float64 36593 36593 0 -0.253222 \n", "day_21.0 float64 36593 36593 0 
-0.216265 \n", "day_22.0 float64 36593 36593 0 -0.143765 \n", "day_23.0 float64 36593 36593 0 -0.146513 \n", "day_24.0 float64 36593 36593 0 -0.100513 \n", "day_25.0 float64 36593 36593 0 -0.138733 \n", "day_26.0 float64 36593 36593 0 -0.154952 \n", "day_27.0 float64 36593 36593 0 -0.159514 \n", "day_28.0 float64 36593 36593 0 -0.206026 \n", "day_29.0 float64 36593 36593 0 -0.201369 \n", "day_30.0 float64 36593 36593 0 -0.187673 \n", "day_31.0 float64 36593 36593 0 -0.118176 \n", "poutcome_failure float64 36593 36593 0 -0.348698 \n", "poutcome_other float64 36593 36593 0 -0.206673 \n", "poutcome_success float64 36593 36593 0 -0.186970 \n", "poutcome_unknown float64 36593 36593 0 -2.111874 \n", "age float64 36593 36593 0 -2.160064 \n", "education float64 36593 36593 0 -1.745093 \n", "default float64 36593 36593 0 -0.134056 \n", "balance float64 36593 36593 0 -2.712622 \n", "housing float64 36593 36593 0 -1.122596 \n", "loan float64 36593 36593 0 -0.437727 \n", "job_admin. float64 36593 36593 0 -0.361290 \n", "job_blue-collar float64 36593 36593 0 -0.524802 \n", "job_entrepreneur float64 36593 36593 0 -0.185399 \n", "job_housemaid float64 36593 36593 0 -0.168477 \n", "job_management float64 36593 36593 0 -0.511776 \n", "job_retired float64 36593 36593 0 -0.229175 \n", "job_self-employed float64 36593 36593 0 -0.188918 \n", "job_services float64 36593 36593 0 -0.317656 \n", "job_student float64 36593 36593 0 -0.145439 \n", "job_technician float64 36593 36593 0 -0.449331 \n", "job_unemployed float64 36593 36593 0 -0.172045 \n", "job_unknown float64 36593 36593 0 -0.082437 \n", "marital_divorced float64 36593 36593 0 -0.361145 \n", "marital_married float64 36593 36593 0 -1.231619 \n", "marital_single float64 36593 36593 0 -0.626610 \n", "\n", " max mean var std sem \\\n", "duration 1.902071 -6.601933e-18 1.000027 1.000014 0.005228 \n", "campaign 19.310148 -7.611640e-17 1.000027 1.000014 0.005228 \n", "pdays 8.074647 3.728150e-17 1.000027 1.000014 0.005228 \n", "previous 
114.104096 4.349509e-17 1.000027 1.000014 0.005228 \n", "y 1.000000 1.162517e-01 0.102740 0.320531 0.001676 \n", "contact_cellular 0.736510 -7.456301e-17 1.000027 1.000014 0.005228 \n", "contact_telephone 3.834729 4.349509e-17 1.000027 1.000014 0.005228 \n", "contact_unknown 1.572308 0.000000e+00 1.000027 1.000014 0.005228 \n", "month_1.0 5.561570 9.320376e-18 1.000027 1.000014 0.005228 \n", "month_2.0 4.011427 6.524263e-17 1.000027 1.000014 0.005228 \n", "month_3.0 9.685067 9.320376e-18 1.000027 1.000014 0.005228 \n", "month_4.0 3.784667 -6.524263e-17 1.000027 1.000014 0.005228 \n", "month_5.0 1.510493 -1.864075e-16 1.000027 1.000014 0.005228 \n", "month_6.0 2.742607 -3.728150e-17 1.000027 1.000014 0.005228 \n", "month_7.0 2.355525 0.000000e+00 1.000027 1.000014 0.005228 \n", "month_8.0 2.496075 -1.242717e-16 1.000027 1.000014 0.005228 \n", "month_9.0 8.795317 -4.038830e-17 1.000027 1.000014 0.005228 \n", "month_10.0 7.725601 -4.660188e-18 1.000027 1.000014 0.005228 \n", "month_11.0 3.228163 1.366988e-16 1.000027 1.000014 0.005228 \n", "month_12.0 14.903961 6.213584e-18 1.000027 1.000014 0.005228 \n", "day_1.0 11.686217 9.320376e-18 1.000027 1.000014 0.005228 \n", "day_2.0 5.867198 -1.242717e-17 1.000027 1.000014 0.005228 \n", "day_3.0 6.392841 0.000000e+00 1.000027 1.000014 0.005228 \n", "day_4.0 5.490262 -7.456301e-17 1.000027 1.000014 0.005228 \n", "day_5.0 4.813497 -7.456301e-17 1.000027 1.000014 0.005228 \n", "day_6.0 4.732553 8.077659e-17 1.000027 1.000014 0.005228 \n", "day_7.0 4.850375 9.320376e-18 1.000027 1.000014 0.005228 \n", "day_8.0 4.813497 -6.213584e-17 1.000027 1.000014 0.005228 \n", "day_9.0 5.258844 6.213584e-17 1.000027 1.000014 0.005228 \n", "day_10.0 9.405819 -1.864075e-17 1.000027 1.000014 0.005228 \n", "day_11.0 5.487850 7.145621e-17 1.000027 1.000014 0.005228 \n", "day_12.0 5.260979 -3.106792e-17 1.000027 1.000014 0.005228 \n", "day_13.0 5.227117 -2.174754e-17 1.000027 1.000014 0.005228 \n", "day_14.0 4.830161 6.213584e-17 1.000027 
1.000014 0.005228 \n", "day_15.0 5.043811 3.106792e-17 1.000027 1.000014 0.005228 \n", "day_16.0 5.559067 -1.864075e-17 1.000027 1.000014 0.005228 \n", "day_17.0 4.713694 -5.592226e-17 1.000027 1.000014 0.005228 \n", "day_18.0 4.317635 -1.242717e-17 1.000027 1.000014 0.005228 \n", "day_19.0 4.978743 8.699017e-17 1.000027 1.000014 0.005228 \n", "day_20.0 3.949109 3.417471e-17 1.000027 1.000014 0.005228 \n", "day_21.0 4.623964 -5.902905e-17 1.000027 1.000014 0.005228 \n", "day_22.0 6.955808 3.417471e-17 1.000027 1.000014 0.005228 \n", "day_23.0 6.825333 -4.349509e-17 1.000027 1.000014 0.005228 \n", "day_24.0 9.948913 4.815528e-17 1.000027 1.000014 0.005228 \n", "day_25.0 7.208092 -9.320376e-18 1.000027 1.000014 0.005228 \n", "day_26.0 6.453618 1.242717e-17 1.000027 1.000014 0.005228 \n", "day_27.0 6.269024 -4.349509e-17 1.000027 1.000014 0.005228 \n", "day_28.0 4.853768 -7.456301e-17 1.000027 1.000014 0.005228 \n", "day_29.0 4.966014 -6.213584e-18 1.000027 1.000014 0.005228 \n", "day_30.0 5.328411 6.213584e-17 1.000027 1.000014 0.005228 \n", "day_31.0 8.461983 2.485434e-17 1.000027 1.000014 0.005228 \n", "poutcome_failure 2.867813 -6.834942e-17 1.000027 1.000014 0.005228 \n", "poutcome_other 4.838554 -5.281546e-17 1.000027 1.000014 0.005228 \n", "poutcome_success 5.348457 1.242717e-17 1.000027 1.000014 0.005228 \n", "poutcome_unknown 0.473513 -1.304853e-16 1.000027 1.000014 0.005228 \n", "age 5.088144 -3.106792e-17 1.000027 1.000014 0.005228 \n", "education 1.317062 -2.609705e-16 1.000027 1.000014 0.005228 \n", "default 7.459592 -8.543678e-18 1.000027 1.000014 0.005228 \n", "balance 33.293644 -1.553396e-17 1.000027 1.000014 0.005228 \n", "housing 0.890792 -1.864075e-17 1.000027 1.000014 0.005228 \n", "loan 2.284528 6.058244e-17 1.000027 1.000014 0.005228 \n", "job_admin. 
2.767863 2.019415e-17 1.000027 1.000014 0.005228 \n", "job_blue-collar 1.905480 -8.699017e-17 1.000027 1.000014 0.005228 \n", "job_entrepreneur 5.393786 1.398056e-17 1.000027 1.000014 0.005228 \n", "job_housemaid 5.935545 -3.728150e-17 1.000027 1.000014 0.005228 \n", "job_management 1.953980 6.213584e-18 1.000027 1.000014 0.005228 \n", "job_retired 4.363482 5.902905e-17 1.000027 1.000014 0.005228 \n", "job_self-employed 5.293301 2.019415e-17 1.000027 1.000014 0.005228 \n", "job_services 3.148056 2.796113e-17 1.000027 1.000014 0.005228 \n", "job_student 6.875735 1.553396e-18 1.000027 1.000014 0.005228 \n", "job_technician 2.225530 1.553396e-17 1.000027 1.000014 0.005228 \n", "job_unemployed 5.812420 3.495141e-18 1.000027 1.000014 0.005228 \n", "job_unknown 12.130532 4.504848e-17 1.000027 1.000014 0.005228 \n", "marital_divorced 2.768974 -7.844650e-17 1.000027 1.000014 0.005228 \n", "marital_married 0.811939 1.198057e-16 1.000027 1.000014 0.005228 \n", "marital_single 1.595890 -2.213589e-17 1.000027 1.000014 0.005228 \n", "\n", " ... moment_2 moment_3 moment_4 central_moment_2 \\\n", "duration ... 1.000000 -0.557851 2.899545 1.000000 \n", "campaign ... 1.000000 4.947549 43.236502 1.000000 \n", "pdays ... 1.000000 2.618954 9.979817 1.000000 \n", "previous ... 1.000000 44.783657 4683.146470 1.000000 \n", "y ... 0.116252 0.116252 0.116252 0.102737 \n", "contact_cellular ... 1.000000 -0.621246 1.385946 1.000000 \n", "contact_telephone ... 1.000000 3.573955 13.773154 1.000000 \n", "contact_unknown ... 1.000000 0.936300 1.876657 1.000000 \n", "month_1.0 ... 1.000000 5.381765 29.963395 1.000000 \n", "month_2.0 ... 1.000000 3.762139 15.153690 1.000000 \n", "month_3.0 ... 1.000000 9.581815 92.811179 1.000000 \n", "month_4.0 ... 1.000000 3.520443 13.393516 1.000000 \n", "month_5.0 ... 1.000000 0.848457 1.719880 1.000000 \n", "month_6.0 ... 1.000000 2.377990 6.654836 1.000000 \n", "month_7.0 ... 1.000000 1.930991 4.728726 1.000000 \n", "month_8.0 ... 
1.000000 2.095446 5.390893 1.000000 \n", "month_9.0 ... 1.000000 8.681620 76.370529 1.000000 \n", "month_10.0 ... 1.000000 7.596161 58.701663 1.000000 \n", "month_11.0 ... 1.000000 2.918389 9.516996 1.000000 \n", "month_12.0 ... 1.000000 14.836865 221.132551 1.000000 \n", "day_1.0 ... 1.000000 11.600646 135.574992 1.000000 \n", "day_2.0 ... 1.000000 5.696758 33.453057 1.000000 \n", "day_3.0 ... 1.000000 6.236416 39.892890 1.000000 \n", "day_4.0 ... 1.000000 5.308122 29.176154 1.000000 \n", "day_5.0 ... 1.000000 4.605747 22.212909 1.000000 \n", "day_6.0 ... 1.000000 4.521251 21.441708 1.000000 \n", "day_7.0 ... 1.000000 4.644206 22.568645 1.000000 \n", "day_8.0 ... 1.000000 4.605747 22.212909 1.000000 \n", "day_9.0 ... 1.000000 5.068688 26.691602 1.000000 \n", "day_10.0 ... 1.000000 9.299502 87.480741 1.000000 \n", "day_11.0 ... 1.000000 5.305629 29.149701 1.000000 \n", "day_12.0 ... 1.000000 5.070900 26.714030 1.000000 \n", "day_13.0 ... 1.000000 5.035807 26.359355 1.000000 \n", "day_14.0 ... 1.000000 4.623128 22.373315 1.000000 \n", "day_15.0 ... 1.000000 4.845548 24.479337 1.000000 \n", "day_16.0 ... 1.000000 5.379181 29.935585 1.000000 \n", "day_17.0 ... 1.000000 4.501546 21.263915 1.000000 \n", "day_18.0 ... 1.000000 4.086027 17.695618 1.000000 \n", "day_19.0 ... 1.000000 4.777889 23.828221 1.000000 \n", "day_20.0 ... 1.000000 3.695888 14.659586 1.000000 \n", "day_21.0 ... 1.000000 4.407699 20.427810 1.000000 \n", "day_22.0 ... 1.000000 6.812043 47.403934 1.000000 \n", "day_23.0 ... 1.000000 6.678820 45.606642 1.000000 \n", "day_24.0 ... 1.000000 9.848400 97.990977 1.000000 \n", "day_25.0 ... 1.000000 7.069359 50.975831 1.000000 \n", "day_26.0 ... 1.000000 6.298666 40.673194 1.000000 \n", "day_27.0 ... 1.000000 6.109509 38.326106 1.000000 \n", "day_28.0 ... 1.000000 4.647742 22.601507 1.000000 \n", "day_29.0 ... 1.000000 4.764645 23.701840 1.000000 \n", "day_30.0 ... 1.000000 5.140738 27.427189 1.000000 \n", "day_31.0 ... 
1.000000 8.343808 70.619124 1.000000 \n", "poutcome_failure ... 1.000000 2.519115 7.345941 1.000000 \n", "poutcome_other ... 1.000000 4.631881 22.454322 1.000000 \n", "poutcome_success ... 1.000000 5.161487 27.640945 1.000000 \n", "poutcome_unknown ... 1.000000 -1.638361 3.684227 1.000000 \n", "age ... 1.000000 0.678868 3.300319 1.000000 \n", "education ... 1.000000 -0.150154 2.303197 1.000000 \n", "default ... 1.000000 7.325536 54.663482 1.000000 \n", "balance ... 1.000000 8.256555 144.415577 1.000000 \n", "housing ... 1.000000 -0.231804 1.053733 1.000000 \n", "loan ... 1.000000 1.846801 4.410674 1.000000 \n", "job_admin. ... 1.000000 2.406573 6.791595 1.000000 \n", "job_blue-collar ... 1.000000 1.380677 2.906270 1.000000 \n", "job_entrepreneur ... 1.000000 5.208387 28.127300 1.000000 \n", "job_housemaid ... 1.000000 5.767068 34.259077 1.000000 \n", "job_management ... 1.000000 1.442204 3.079953 1.000000 \n", "job_retired ... 1.000000 4.134308 18.092499 1.000000 \n", "job_self-employed ... 1.000000 5.104383 27.054723 1.000000 \n", "job_services ... 1.000000 2.830400 9.011162 1.000000 \n", "job_student ... 1.000000 6.730296 46.296878 1.000000 \n", "job_technician ... 1.000000 1.776199 4.154884 1.000000 \n", "job_unemployed ... 1.000000 5.640374 32.813820 1.000000 \n", "job_unknown ... 1.000000 12.048095 146.156593 1.000000 \n", "marital_divorced ... 1.000000 2.407830 6.797645 1.000000 \n", "marital_married ... 1.000000 -0.419680 1.176131 1.000000 \n", "marital_single ... 
1.000000 0.969280 1.939504 1.000000 \n", "\n", " central_moment_3 central_moment_4 sum sum_2 \\\n", "duration -0.557851 2.899545 -2.415845e-13 36593.0 \n", "campaign 4.947549 43.236502 -2.785328e-12 36593.0 \n", "pdays 2.618954 9.979817 1.364242e-12 36593.0 \n", "previous 44.783657 4683.146470 1.591616e-12 36593.0 \n", "y 0.078851 0.071072 4.254000e+03 4254.0 \n", "contact_cellular -0.621246 1.385946 -2.728484e-12 36593.0 \n", "contact_telephone 3.573955 13.773154 1.591616e-12 36593.0 \n", "contact_unknown 0.936300 1.876657 0.000000e+00 36593.0 \n", "month_1.0 5.381765 29.963395 3.410605e-13 36593.0 \n", "month_2.0 3.762139 15.153690 2.387424e-12 36593.0 \n", "month_3.0 9.581815 92.811179 3.410605e-13 36593.0 \n", "month_4.0 3.520443 13.393516 -2.387424e-12 36593.0 \n", "month_5.0 0.848457 1.719880 -6.821210e-12 36593.0 \n", "month_6.0 2.377990 6.654836 -1.364242e-12 36593.0 \n", "month_7.0 1.930991 4.728726 0.000000e+00 36593.0 \n", "month_8.0 2.095446 5.390893 -4.547474e-12 36593.0 \n", "month_9.0 8.681620 76.370529 -1.477929e-12 36593.0 \n", "month_10.0 7.596161 58.701663 -1.705303e-13 36593.0 \n", "month_11.0 2.918389 9.516996 5.002221e-12 36593.0 \n", "month_12.0 14.836865 221.132551 2.273737e-13 36593.0 \n", "day_1.0 11.600646 135.574992 3.410605e-13 36593.0 \n", "day_2.0 5.696758 33.453057 -4.547474e-13 36593.0 \n", "day_3.0 6.236416 39.892890 0.000000e+00 36593.0 \n", "day_4.0 5.308122 29.176154 -2.728484e-12 36593.0 \n", "day_5.0 4.605747 22.212909 -2.728484e-12 36593.0 \n", "day_6.0 4.521251 21.441708 2.955858e-12 36593.0 \n", "day_7.0 4.644206 22.568645 3.410605e-13 36593.0 \n", "day_8.0 4.605747 22.212909 -2.273737e-12 36593.0 \n", "day_9.0 5.068688 26.691602 2.273737e-12 36593.0 \n", "day_10.0 9.299502 87.480741 -6.821210e-13 36593.0 \n", "day_11.0 5.305629 29.149701 2.614797e-12 36593.0 \n", "day_12.0 5.070900 26.714030 -1.136868e-12 36593.0 \n", "day_13.0 5.035807 26.359355 -7.958079e-13 36593.0 \n", "day_14.0 4.623128 22.373315 2.273737e-12 36593.0 
\n", "day_15.0 4.845548 24.479337 1.136868e-12 36593.0 \n", "day_16.0 5.379181 29.935585 -6.821210e-13 36593.0 \n", "day_17.0 4.501546 21.263915 -2.046363e-12 36593.0 \n", "day_18.0 4.086027 17.695618 -4.547474e-13 36593.0 \n", "day_19.0 4.777889 23.828221 3.183231e-12 36593.0 \n", "day_20.0 3.695888 14.659586 1.250555e-12 36593.0 \n", "day_21.0 4.407699 20.427810 -2.160050e-12 36593.0 \n", "day_22.0 6.812043 47.403934 1.250555e-12 36593.0 \n", "day_23.0 6.678820 45.606642 -1.591616e-12 36593.0 \n", "day_24.0 9.848400 97.990977 1.762146e-12 36593.0 \n", "day_25.0 7.069359 50.975831 -3.410605e-13 36593.0 \n", "day_26.0 6.298666 40.673194 4.547474e-13 36593.0 \n", "day_27.0 6.109509 38.326106 -1.591616e-12 36593.0 \n", "day_28.0 4.647742 22.601507 -2.728484e-12 36593.0 \n", "day_29.0 4.764645 23.701840 -2.273737e-13 36593.0 \n", "day_30.0 5.140738 27.427189 2.273737e-12 36593.0 \n", "day_31.0 8.343808 70.619124 9.094947e-13 36593.0 \n", "poutcome_failure 2.519115 7.345941 -2.501110e-12 36593.0 \n", "poutcome_other 4.631881 22.454322 -1.932676e-12 36593.0 \n", "poutcome_success 5.161487 27.640945 4.547474e-13 36593.0 \n", "poutcome_unknown -1.638361 3.684227 -4.774847e-12 36593.0 \n", "age 0.678868 3.300319 -1.136868e-12 36593.0 \n", "education -0.150154 2.303197 -9.549694e-12 36593.0 \n", "default 7.325536 54.663482 -3.126388e-13 36593.0 \n", "balance 8.256555 144.415577 -5.684342e-13 36593.0 \n", "housing -0.231804 1.053733 -6.821210e-13 36593.0 \n", "loan 1.846801 4.410674 2.216893e-12 36593.0 \n", "job_admin. 
2.406573 6.791595 7.389644e-13 36593.0 \n", "job_blue-collar 1.380677 2.906270 -3.183231e-12 36593.0 \n", "job_entrepreneur 5.208387 28.127300 5.115908e-13 36593.0 \n", "job_housemaid 5.767068 34.259077 -1.364242e-12 36593.0 \n", "job_management 1.442204 3.079953 2.273737e-13 36593.0 \n", "job_retired 4.134308 18.092499 2.160050e-12 36593.0 \n", "job_self-employed 5.104383 27.054723 7.389644e-13 36593.0 \n", "job_services 2.830400 9.011162 1.023182e-12 36593.0 \n", "job_student 6.730296 46.296878 5.684342e-14 36593.0 \n", "job_technician 1.776199 4.154884 5.684342e-13 36593.0 \n", "job_unemployed 5.640374 32.813820 1.278977e-13 36593.0 \n", "job_unknown 12.048095 146.156593 1.648459e-12 36593.0 \n", "marital_divorced 2.407830 6.797645 -2.870593e-12 36593.0 \n", "marital_married -0.419680 1.176131 4.384049e-12 36593.0 \n", "marital_single 0.969280 1.939504 -8.100187e-13 36593.0 \n", "\n", " sum_3 sum_4 \n", "duration -2.041344e+04 1.061031e+05 \n", "campaign 1.810457e+05 1.582153e+06 \n", "pdays 9.583538e+04 3.651915e+05 \n", "previous 1.638768e+06 1.713704e+08 \n", "y 4.254000e+03 4.254000e+03 \n", "contact_cellular -2.273325e+04 5.071593e+04 \n", "contact_telephone 1.307817e+05 5.040010e+05 \n", "contact_unknown 3.426201e+04 6.867251e+04 \n", "month_1.0 1.969349e+05 1.096450e+06 \n", "month_2.0 1.376680e+05 5.545190e+05 \n", "month_3.0 3.506274e+05 3.396239e+06 \n", "month_4.0 1.288236e+05 4.901089e+05 \n", "month_5.0 3.104760e+04 6.293557e+04 \n", "month_6.0 8.701779e+04 2.435204e+05 \n", "month_7.0 7.066075e+04 1.730383e+05 \n", "month_8.0 7.667865e+04 1.972689e+05 \n", "month_9.0 3.176865e+05 2.794627e+06 \n", "month_10.0 2.779663e+05 2.148070e+06 \n", "month_11.0 1.067926e+05 3.482554e+05 \n", "month_12.0 5.429254e+05 8.091903e+06 \n", "day_1.0 4.245024e+05 4.961096e+06 \n", "day_2.0 2.084615e+05 1.224148e+06 \n", "day_3.0 2.282092e+05 1.459801e+06 \n", "day_4.0 1.942401e+05 1.067643e+06 \n", "day_5.0 1.685381e+05 8.128370e+05 \n", "day_6.0 1.654461e+05 
7.846164e+05 \n", "day_7.0 1.699454e+05 8.258544e+05 \n", "day_8.0 1.685381e+05 8.128370e+05 \n", "day_9.0 1.854785e+05 9.767258e+05 \n", "day_10.0 3.402967e+05 3.201183e+06 \n", "day_11.0 1.941489e+05 1.066675e+06 \n", "day_12.0 1.855595e+05 9.775465e+05 \n", "day_13.0 1.842753e+05 9.645679e+05 \n", "day_14.0 1.691741e+05 8.187067e+05 \n", "day_15.0 1.773131e+05 8.957724e+05 \n", "day_16.0 1.968404e+05 1.095433e+06 \n", "day_17.0 1.647251e+05 7.781105e+05 \n", "day_18.0 1.495200e+05 6.475357e+05 \n", "day_19.0 1.748373e+05 8.719461e+05 \n", "day_20.0 1.352436e+05 5.364382e+05 \n", "day_21.0 1.612909e+05 7.475149e+05 \n", "day_22.0 2.492731e+05 1.734652e+06 \n", "day_23.0 2.443981e+05 1.668884e+06 \n", "day_24.0 3.603825e+05 3.585784e+06 \n", "day_25.0 2.586890e+05 1.865359e+06 \n", "day_26.0 2.304871e+05 1.488354e+06 \n", "day_27.0 2.235653e+05 1.402467e+06 \n", "day_28.0 1.700748e+05 8.270569e+05 \n", "day_29.0 1.743526e+05 8.673214e+05 \n", "day_30.0 1.881150e+05 1.003643e+06 \n", "day_31.0 3.053249e+05 2.584166e+06 \n", "poutcome_failure 9.218198e+04 2.688100e+05 \n", "poutcome_other 1.694944e+05 8.216710e+05 \n", "poutcome_success 1.888743e+05 1.011465e+06 \n", "poutcome_unknown -5.995254e+04 1.348169e+05 \n", "age 2.484182e+04 1.207686e+05 \n", "education -5.494580e+03 8.428090e+04 \n", "default 2.680633e+05 2.000301e+06 \n", "balance 3.021321e+05 5.284599e+06 \n", "housing -8.482406e+03 3.855926e+04 \n", "loan 6.757999e+04 1.613998e+05 \n", "job_admin. 
8.806374e+04 2.485248e+05 \n", "job_blue-collar 5.052313e+04 1.063492e+05 \n", "job_entrepreneur 1.905905e+05 1.029262e+06 \n", "job_housemaid 2.110343e+05 1.253642e+06 \n", "job_management 5.277458e+04 1.127047e+05 \n", "job_retired 1.512867e+05 6.620588e+05 \n", "job_self-employed 1.867847e+05 9.900135e+05 \n", "job_services 1.035728e+05 3.297455e+05 \n", "job_student 2.462817e+05 1.694142e+06 \n", "job_technician 6.499646e+04 1.520397e+05 \n", "job_unemployed 2.063982e+05 1.200756e+06 \n", "job_unknown 4.408759e+05 5.348308e+06 \n", "marital_divorced 8.810972e+04 2.487462e+05 \n", "marital_married -1.535734e+04 4.303817e+04 \n", "marital_single 3.546887e+04 7.097227e+04 \n", "\n", "[76 rows x 25 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats.table_statistics import table_statistics\n", "\n", "# Temporarily show every row so the full statistics table is printed.\n", "pd.set_option('display.max_rows', None)\n", "data_stats = table_statistics(vdf)\n", "data_stats\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Restore the default pandas row display limit.\n", "pd.reset_option('display.max_rows')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "Note that full-table statistics expose the overall statistical results of the data. Under the hood they implicitly call **sf.reveal**, so use them with caution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation coefficient matrix\n", "\n", "Next, we compute the correlation coefficient matrix between features, and between the features and the label.\n", "\n", "> The one-hot encoded columns do not need to take part in the correlation matrix computation." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=56064)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=56064)\u001b[0m warnings.warn(\n", "\u001b[2m\u001b[36m(_run pid=57413)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, 
but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=57413)\u001b[0m warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=56064)\u001b[0m [2022-11-10 15:05:24.416] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_run pid=57413)\u001b[0m [2022-11-10 15:05:24.835] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=57413)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_spu_compile pid=44652)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "data": { "text/plain": [ "array([[1.000, -0.185, 0.013, 0.010, 0.331, -0.007, -0.001, -0.005,\n", " 0.016, 0.003, -0.008],\n", " [-0.185, 1.000, -0.091, -0.032, -0.074, 0.004, 0.002, 0.013,\n", " -0.012, -0.023, 0.013],\n", " [0.013, -0.091, 1.000, 0.439, 0.104, -0.021, 0.005, -0.031, 0.003,\n", " 0.123, -0.022],\n", " [0.010, -0.032, 0.439, 1.000, 0.091, 0.003, 0.022, -0.019, 0.015,\n", " 0.037, -0.009],\n", " [0.331, -0.074, 0.104, 0.091, 1.000, 0.022, 0.069, -0.023, 0.053,\n", " -0.135, -0.068],\n", " [-0.007, 0.004, -0.021, 0.003, 0.022, 1.000, -0.169, -0.016,\n", " 0.096, -0.183, -0.014],\n", " [-0.001, 0.002, 0.005, 0.022, 0.069, -0.169, 1.000, -0.012, 0.068,\n", " -0.070, -0.030],\n", " [-0.005, 0.013, -0.031, -0.019, -0.023, -0.016, -0.012, 1.000,\n", " -0.067, -0.010, 0.079],\n", " [0.016, -0.012, 0.003, 0.015, 0.053, 0.096, 0.068, -0.067, 1.000,\n", " -0.072, -0.084],\n", " [0.003, -0.023, 0.123, 0.037, -0.135, -0.183, -0.070, -0.010,\n", " -0.072, 1.000, 0.043],\n", " [-0.008, 0.013, -0.022, -0.009, -0.068, -0.014, -0.030, 0.079,\n", " -0.084, 0.043, 1.000]], dtype=float32)" ] }, "execution_count": 17, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "from secretflow.stats.ss_pearsonr_v import PearsonR\n", "\n", "# Compute the Pearson correlation matrix on the SPU device.\n", "pearson_r_calculator = PearsonR(spu)\n", "corr_matrix = pearson_r_calculator.pearsonr(vdf_hat)\n", "\n", "# Print floats with three decimal places.\n", "np.set_printoptions(formatter={'float': lambda x: \"{0:0.3f}\".format(x)})\n", "corr_matrix\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "Computing the correlation coefficient matrix requires data from both alice and bob, so the computation must run on the **SPU device** to ensure that the raw data is not leaked." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### VIF calculation\n", "\n", "SecretFlow also supports computing the VIF (variance inflation factor) to check for multicollinearity.\n", "\n", "> The one-hot encoded columns do not need to take part in the VIF computation." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=44652)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=44652)\u001b[0m warnings.warn(\n", "\u001b[2m\u001b[36m(_run pid=57413)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/sklearn/base.py:443: UserWarning: X has feature names, but StandardScaler was fitted without feature names\n", "\u001b[2m\u001b[36m(_run pid=57413)\u001b[0m warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=44652)\u001b[0m [2022-11-10 15:05:25.323] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "Index(['duration', 'campaign', 'pdays', 'previous', 'y', 'age', 'education',\n", " 'default', 'balance', 'housing', 'loan'],\n", " dtype='object')\n", "[1.162 1.044 1.276 1.243 1.177 1.082 1.052 1.011 1.030 1.089 1.018]\n" ] } ], "source": [ "from secretflow.stats.ss_vif_v import VIF\n", "\n", "# Compute the VIF of each column on the SPU device.\n", "vif_calculator = VIF(spu)\n", "vif_results = vif_calculator.vif(vdf_hat)\n", "print(vdf_hat.columns)\n", "print(vif_results)\n" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "Computing the VIF requires data from both alice and bob, so the computation must run on the **SPU device** to ensure that the raw data is not leaked." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model training\n", "\n", "Next, we train a logistic regression model and an XGB model.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random split\n", "\n", "Before training, we need to split the data into a training set and a validation set.\n", "\n", "train_x and train_y are the features and labels of the training set; test_x and test_y are the features and labels of the validation set.\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from secretflow.data.split import train_test_split\n", "\n", "random_state = 1234\n", "\n", "# Use 80% of the rows for training and the rest for validation.\n", "train_vdf, test_vdf = train_test_split(vdf, train_size=0.8, random_state=random_state)\n", "\n", "train_x = train_vdf.drop(columns=['y'])\n", "train_y = train_vdf['y']\n", "\n", "test_x = test_vdf.drop(columns=['y'])\n", "test_y = test_vdf['y']\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "During a random split, all parties share the same random seed. Each data owner splits its own data locally, and the shared seed ensures that the resulting splits remain aligned across parties." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PSI (Population Stability Index)\n", "\n", "The Population Stability Index is an important metric for measuring the shift caused by changes in a sample. It is commonly used to gauge how stable a sample is, for example whether the sample stays stable between two months. As a rule of thumb, a PSI below 0.1 indicates no significant change, a value between 0.1 and 0.25 indicates a fairly significant change, and a value above 0.25 indicates a drastic change that needs special attention.\n", "\n", "Next, taking `balance` as an example, we check whether the distributions of the two sampled sets (training and validation) are close.\n", "\n", "> Depending on business needs, PSI analysis can also be performed during data analysis or feature preprocessing.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "stats_df = table_statistics(train_x['balance'])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "min_val, max_val = stats_df['min'], stats_df['max']" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "data": { "text/plain": [ "DeviceArray(0.000, dtype=float32)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats import psi_eval\n", "from secretflow.stats.core.utils import equal_range\n", "import jax.numpy as jnp\n", "\n", "# Build equal-range split points over [min_val, max_val].\n", "split_points = equal_range(jnp.array([min_val, max_val]), 3)\n", "balance_psi_score = psi_eval(train_x['balance'], test_x['balance'], split_points)\n", "\n", "# Reveal the PSI score as plaintext.\n", "sf.reveal(balance_psi_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "PSI analysis is a single-party computation, executed on the PYU device of the data owner." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic regression model\n", "\n", "Use **ml.linear.ss_sgd.SSRegression** to train a logistic regression model in ciphertext (secret-shared) form.\n", "\n", "Please refer to the relevant API documentation.\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_spu_compile pid=44652)\u001b[0m /* error: missing value */\n", "\u001b[2m\u001b[36m(_spu_compile pid=44652)\u001b[0m {}:task_name:_spu_compile\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=57412)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n" ] } ], "source": [ "from secretflow.ml.linear.ss_sgd import SSRegression\n", "\n", "# Train a logistic regression model on the SPU device.\n", "lr_model = SSRegression(spu)\n", "lr_model.fit(\n", " x=train_x,\n", " y=train_y,\n", " epochs=3,\n", " learning_rate=0.1,\n", " batch_size=1024,\n", " sig_type='t1',\n", " reg_type='logistic',\n", " penalty='l2',\n", " l2_norm=0.5,\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may wonder why the statement above finished so quickly. The reason is that statements in SecretFlow are lazily evaluated: in the example above, **lr_model.fit** is not executed until lr_model is actually used." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "SSRegression training runs on the SPU device, so the raw data of both parties is protected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### XGBoost model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use **ml.boost.ss_xgb_v.Xgb** to train an XGBoost model in ciphertext (secret-shared) form.\n", "\n", "Please refer to the relevant API documentation." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_spu_compile pid=57412)\u001b[0m /* error: missing value */\n", "\u001b[2m\u001b[36m(_spu_compile pid=57412)\u001b[0m {}\n" ] }, { "name": "stderr", 
"output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=68105)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=57412)\u001b[0m [2022-11-10 15:05:38.196] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=68077)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=68077)\u001b[0m [2022-11-10 15:05:40.288] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_spu_compile pid=68105)\u001b[0m /* error: missing value */\n", "\u001b[2m\u001b[36m(_spu_compile pid=68105)\u001b[0m {}:task_name:_run\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[1m\u001b[36m(scheduler +57s)\u001b[0m Tip: use `ray status` to view detailed cluster status. 
To disable these messages, set RAY_SCHEDULER_EVENTS=0.\n", "\u001b[2m\u001b[1m\u001b[33m(scheduler +57s)\u001b[0m Warning: The following resource request cannot be scheduled right now: {'alice': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_spu_compile pid=75168)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=72626)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=72627)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=72628)\u001b[0m 2022-11-10 15:05:59.676639: E external/org_tensorflow/tensorflow/core/tpu/tpu_initializer_helper.cc:230] Unable to open shared memory for GCS file system creator.\n", "\u001b[2m\u001b[36m(_run pid=72628)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=75170)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=72636)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=75169)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=72628)\u001b[0m [2022-11-10 15:06:10.337] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_run pid=75169)\u001b[0m [2022-11-10 15:06:10.540] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_spu_compile pid=72628)\u001b[0m /* error: missing value */\n", "\u001b[2m\u001b[36m(_spu_compile pid=72628)\u001b[0m {}:task_name:_run\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or 
chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[36m(_spu_compile pid=89969)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=89765)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=89767)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=89768)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=89971)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=89766)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=89972)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=89768)\u001b[0m [2022-11-10 15:06:41.189] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_run pid=89766)\u001b[0m [2022-11-10 15:06:41.335] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_spu_compile pid=89768)\u001b[0m /* error: missing value */\n", "\u001b[2m\u001b[36m(_spu_compile pid=89768)\u001b[0m {}:task_name:_run\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m 
/home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m 
/home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[36m(_spu_compile pid=102624)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=101745)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=101753)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=101755)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=102625)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=102626)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
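(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] } ], "source": [ "from secretflow.ml.boost.ss_xgb_v import Xgb\n", "\n", "xgb = Xgb(spu)\n", "params = {\n", "    'num_boost_round': 3,  # number of boosting rounds (trees)\n", "    'max_depth': 5,  # maximum depth of each tree\n", "    'sketch_eps': 0.25,  # roughly O(1 / sketch_eps) bins per feature\n", "    'objective': 'logistic',  # binary classification\n", "    'reg_lambda': 0.2,  # L2 regularization on leaf weights\n", "    'subsample': 1,  # row sampling ratio per tree\n", "    'colsample_bytree': 1,  # column sampling ratio per tree\n", "    'base_score': 0.5,  # initial prediction score\n", "}\n", "xgb_model = xgb.train(params=params, dtrain=train_x, label=train_y)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Xgb.train` executes immediately and may take some time to finish; please be patient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "Xgb training runs on the SPU device, so the raw data of both parties remains protected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model prediction\n", "\n", "Next, we use each of the models just trained to make predictions on the test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic regression model\n", "\n", "In our scenario bob holds the dataset labels, so we **reveal** the prediction results to bob." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "lr_y_hat = lr_model.predict(x=test_x, batch_size=1024, to_pyu=bob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "Logistic regression prediction runs on the SPU device, so the raw data of both parties remains protected.\n", "\n", "When **to_pyu** is set, the prediction results are revealed to that party; otherwise they remain secret-shared." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### XGBoost model\n", "\n", "In our scenario bob holds the dataset labels, so we **reveal** the prediction results to bob." 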
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=101755)\u001b[0m [2022-11-10 15:07:12.900] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n", "\u001b[2m\u001b[36m(_run pid=102626)\u001b[0m [2022-11-10 15:07:12.900] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n" ] } ], "source": [ "xgb_y_hat = xgb_model.predict(dtrain=test_x, to_pyu=bob)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Security discussion\n", "\n", "XGBoost prediction runs on the SPU device, so the raw data of both parties remains protected.\n", "\n", "When **to_pyu** is set, the prediction results are revealed to that party; otherwise they remain secret-shared." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model evaluation\n", "\n", "Next, we evaluate the models on the test dataset, covering:\n", "\n", "- Binary classification evaluation\n", "- PVA\n", "- P-Value\n", "- Scorecard transformation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Binary classification evaluation\n", "\n", "SecretFlow has built-in support for binary classification evaluation.\n", "\n", "`BiClassificationEval` computes statistics such as `AUC`, `KS`, `F1 Score`, `Lift`, `Gain`, `Precision`, and `Recall`, and provides per-bin reports (equal-frequency and equal-width bins, both based on the prediction score) as well as a summary report.\n", "\n", 
"Each bucket uses a different `threshold` when evaluating the model's predictions. For statistics that depend on `threshold`, the summary report takes the best value across all buckets.\n", "\n", "See the API documentation for details." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n", "\u001b[2m\u001b[36m(_run pid=117458)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=117896)\u001b[0m [2022-11-10 15:07:21.122] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=117896)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_spu_compile pid=101746)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_spu_compile pid=115860)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_spu_compile pid=115859)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=101746)\u001b[0m [2022-11-10 15:07:21.342] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] } ], "source": [ "from secretflow.stats.biclassification_eval import BiClassificationEval\n", "\n", "biclassification_evaluator = BiClassificationEval(\n", " y_true=test_y, y_score=lr_y_hat, bucket_size=20\n", ")\n", "lr_report = sf.reveal(biclassification_evaluator.get_all_reports())\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "positive_samples: 860.0\n", "negative_samples: 6459.0\n", "total_samples: 7319.0\n", "auc: 0.8958946466445923\n", "ks: 0.6431037187576294\n", "f1_score: 0.5415411591529846\n" ] } ], "source": [ "print(f'positive_samples: {lr_report.summary_report.positive_samples}')\n", "print(f'negative_samples: {lr_report.summary_report.negative_samples}')\n", "print(f'total_samples: {lr_report.summary_report.total_samples}')\n", "print(f'auc: {lr_report.summary_report.auc}')\n", "print(f'ks: {lr_report.summary_report.ks}')\n", "print(f'f1_score: {lr_report.summary_report.f1_score}')\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "biclassification_evaluator = BiClassificationEval(\n", " y_true=test_y, y_score=xgb_y_hat, bucket_size=20\n", ")\n", "xgb_report = sf.reveal(biclassification_evaluator.get_all_reports())\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "positive_samples: 860.0\n", "negative_samples: 6459.0\n", "total_samples: 7319.0\n", "auc: 0.8133864402770996\n", "ks: 0.5009485483169556\n", "f1_score: 0.4135618805885315\n" ] } ], "source": [ "print(f'positive_samples: {xgb_report.summary_report.positive_samples}')\n", "print(f'negative_samples: 
{xgb_report.summary_report.negative_samples}')\n", "print(f'total_samples: {xgb_report.summary_report.total_samples}')\n", "print(f'auc: {xgb_report.summary_report.auc}')\n", "print(f'ks: {xgb_report.summary_report.ks}')\n", "print(f'f1_score: {xgb_report.summary_report.f1_score}')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### PVA (comparison of predicted and actual means)\n", "\n", "The result is computed as `abs(mean(Actual) - mean(Prediction))`; the smaller the value, the better." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DeviceArray(0.051, dtype=float32)" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats import pva_eval\n", "\n", "lr_pva_score = pva_eval(test_y, lr_y_hat, 1)\n", "\n", "sf.reveal(lr_pva_score)\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DeviceArray(0.065, dtype=float32)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xgb_pva_score = pva_eval(test_y, xgb_y_hat, 1)\n", "\n", "sf.reveal(xgb_pva_score)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### P-Value\n", "\n", "Both parties can use the p-values to judge whether a parameter is significant, that is, whether the independent variable effectively explains the variation in the dependent variable, and therefore whether the corresponding explanatory variable should be included in the model.\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=117458)\u001b[0m [2022-11-10 15:08:18.378] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a 
supported \"\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m /home/fengjun.feng/miniconda3/envs/sf/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.9) or chardet (5.0.0)/charset_normalizer (2.0.12) doesn't match a supported version!\n", "\u001b[2m\u001b[33m(raylet)\u001b[0m warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=9587)\u001b[0m [2022-11-10 15:08:21.419] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=9587)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_spu_compile pid=117457)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n", "\u001b[2m\u001b[36m(_run pid=9615)\u001b[0m WARNING:absl:No GPU/TPU found, falling back to CPU. 
(Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2m\u001b[36m(_run pid=9615)\u001b[0m [2022-11-10 15:08:21.780] [info] [thread_pool.cc:30] Create a fixed thread pool with size 63\n" ] }, { "data": { "text/plain": [ "array([0.000, 0.682, 0.855, 0.755, 0.981, 0.989, 0.975, 0.964, 0.986,\n", " 0.773, 0.963, 0.981, 0.977, 0.973, 0.975, 0.859, 0.833, 0.982,\n", " 0.827, 0.946, 0.993, 0.994, 0.994, 0.987, 0.984, 0.979, 0.990,\n", " 0.986, 0.947, 0.997, 0.978, 0.970, 0.997, 0.976, 0.994, 0.961,\n", " 0.994, 0.967, 0.979, 0.997, 0.986, 0.975, 0.997, 0.987, 0.987,\n", " 0.969, 0.997, 0.975, 0.962, 0.990, 0.976, 0.992, 0.819, 0.977,\n", " 0.645, 0.416, 0.805, 0.452, 0.024, 0.201, 0.998, 0.992, 0.986,\n", " 0.974, 0.999, 0.965, 0.998, 0.988, 0.919, 0.998, 0.988, 0.987,\n", " 0.998, 0.990, 0.987, 0.000])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats import SSPValue\n", "\n", "model = lr_model.save_model()\n", "sspv = SSPValue(spu)\n", "pvalues = sspv.pvalues(test_x, test_y, model)\n", "\n", "pvalues" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scorecard transformation\n", "\n", "> Strictly speaking, scorecard transformation is post-processing of the prediction results and is not part of model evaluation.\n", "\n", "\n", "Let `p` be the probability that `y = 1`, and let `odds = p / (1 - p)`. The score scale of a scorecard can be defined by expressing the score as a linear function of the log-odds:\n", "\n", "`Score = A - B log(odds)`, where A and B are configurable constants. SecretFlow provides a scorecard transformation utility; see the API documentation for details." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[496.647],\n", " [453.126],\n", " [496.647],\n", " ...,\n", " [451.232],\n", " [494.179],\n", " [466.241]])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from secretflow.stats import ScoreCard\n", "\n", "sc = ScoreCard(20, 600, 20)\n", "score = sc.transform(xgb_y_hat)\n", "\n", "sf.reveal(score.partitions[bob])\n" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "### Security discussion\n", "\n", "All of the model evaluation methods above are single-party computations carried out on the PYU device of the label owner." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## End of the experiment\n", "\n", "Finally, we need to clean up the temporary files and shut down the SecretFlow cluster." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Remove each temporary file on its own so that one missing file does not skip the rest.\n", "for path in (alice_path, alice_psi_path, bob_path, bob_psi_path):\n", "    try:\n", "        os.remove(path)\n", "    except OSError:\n", "        pass\n", "\n", "sf.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Congratulations! You have completed the full SecretFlow financial risk control pipeline experiment.\n", "\n", "If you have any suggestions or questions about this experiment, please contact us via [GitHub Issues](https://github.com/secretflow/secretflow/issues)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "02db3bab010a384e41503da74327ad4dd04080832919be62bcff46931ddfd4bc" } } }, "nbformat": 4, "nbformat_minor": 2 }