{ "cells": [ { "cell_type": "markdown", "id": "a07356b3-744a-4319-9fec-cd62f37fa865", "metadata": {}, "source": [ "# Feature Engineering" ] }, { "cell_type": "markdown", "id": "7d72fa78", "metadata": {}, "source": [ ">The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production." ] }, { "cell_type": "markdown", "id": "946a36f5", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "e89e9dcf", "metadata": {}, "source": [ "It is recommended to use [jupyter](https://jupyter.org/) to run this tutorial." ] }, { "cell_type": "markdown", "id": "8ff05a38-2211-4240-a9db-0d79c813ab99", "metadata": {}, "source": [ "Secretflow provides a variety of tools to analyze the statistics of the dataset in order to improve the quality of results from the machine learning process." ] }, { "cell_type": "markdown", "id": "25ec1569-f9f7-4f27-90b8-a6c7feab28e2", "metadata": { "tags": [] }, "source": [ "## Preparation\n", "\n", "Initialize secretflow and create two parties alice and bob." ] }, { "cell_type": "markdown", "id": "83e1596d-8ca1-40ae-9681-7254c563ff7e", "metadata": {}, "source": [ "> 💡 Before using preprocessing, you may need to get to know secretflow's [DataFrame](../preprocessing/DataFrame.ipynb)." ] }, { "cell_type": "code", "execution_count": 13, "id": "9ad74320-2c3a-4c86-aea4-6688d96d2230", "metadata": {}, "outputs": [], "source": [ "import secretflow as sf\n", "\n", "# In case you have a running secretflow runtime already.\n", "sf.shutdown()\n", "\n", "sf.init(['alice', 'bob'], address='local')\n", "alice = sf.PYU('alice')\n", "bob = sf.PYU('bob')\n", "\n", "spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))" ] }, { "cell_type": "markdown", "id": "94c83c7b-417a-4772-9de1-2efc589cd89f", "metadata": { "tags": [] }, "source": [ "## Data Preparation" ] }, { "cell_type": "markdown", "id": "86168ad6-2fe0-4410-b59c-fd65cbe8ea9b", "metadata": {}, "source": [ "Here we use linear as example data." ] }, { "cell_type": "code", "execution_count": 14, "id": "9d7d70b8-2d12-40c0-891e-d42cbd567cab", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "orig ds in alice:\n", " x1 x2 x3\n", "0 -0.514226 0.730010 -0.730391\n", "1 -0.725537 0.482244 -0.823223\n", "2 0.608353 -0.071102 -0.775098\n", "3 -0.686642 0.160470 0.914477\n", "4 -0.198111 0.212909 0.950474\n", "... ... ... ...\n", "9995 -0.367246 -0.296454 0.558596\n", "9996 0.010913 0.629268 -0.384093\n", "9997 -0.238097 0.904069 -0.344859\n", "9998 0.453686 -0.375173 0.899238\n", "9999 -0.776015 -0.772112 0.012110\n", "\n", "[10000 rows x 3 columns]\n", "orig ds in bob:\n", " x18 x19 x20\n", "0 0.810261 0.048303 0.937679\n", "1 0.312728 0.526637 0.589773\n", "2 0.039087 -0.753417 0.516735\n", "3 -0.855979 0.250944 0.979465\n", "4 -0.238805 0.243109 -0.121446\n", "... ... ... ...\n", "9995 -0.847253 0.069960 0.786748\n", "9996 -0.502486 -0.076290 -0.604832\n", "9997 -0.424209 0.434947 0.998955\n", "9998 0.914291 -0.473056 0.616257\n", "9999 -0.602927 -0.021368 0.885519\n", "\n", "[10000 rows x 3 columns]\n" ] } ], "source": [ "from secretflow.utils.simulation.datasets import load_linear\n", "\n", "vdf = load_linear(parts={alice: (1, 4), bob: (18, 22)})\n", "\n", "label_data = vdf[\"y\"]\n", "vdf = vdf.drop(columns=\"y\")\n", "\n", "print(f\"orig ds in alice:\\n {sf.reveal(vdf.partitions[alice].data)}\")\n", "print(f\"orig ds in bob:\\n {sf.reveal(vdf.partitions[bob].data)}\")" ] }, { "cell_type": "markdown", "id": "b346a56c-6a5b-4b59-8c36-99344febdb51", "metadata": { "tags": [] }, "source": [ "## Pearson product-moment correlation coefficient\n", "\n", "The Pearson product-moment correlation coefficient is used to measure the degree of correlation (linear correlation) between two variables X and Y.\n", "\n", "The Pearson product-moment correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:\n", "\n", "$$ \\rho_{X,Y}=\\frac{cov(X, Y)}{\\sigma_X \\sigma_Y}=\\frac{(X-\\mu_X)(Y-\\mu_Y)}{\\sigma_X \\sigma_Y} $$\n", "\n", "$$ \\mu_X= \\mathbb{E}(X), \\sigma_X^2=\\mathbb{E}[(X-\\mathbb{E}(X))^2]=\\mathbb{E}(X^2)-\\mathbb{E}^2(X) $$\n", "\n", "The Pearson product-moment correlation coefficient for samples(features) from dataset, usually represented by the lowercase letter r:\n", "\n", "$$ r=\\frac{\\sum^n_{i=1}(X_i-\\bar{X})(Y_i-\\bar{Y})}{\\sqrt{\\sum^n_{i=1}(X_i-\\bar{X})^2} \\sqrt{\\sum^n_{i=1}(Y_i-\\bar{Y})^2}} $$\n", "\n", "Its value is between -1 and 1. $r>0$ corresponds to a positive correlation between features, otherwise it is a negative correlation; the larger the $|r|$, the greater the degree of correlation.\n", "\n", "### SSVertPearsonR\n", "\n", "SecretFlow provides `SSVertPearsonR` for calculating Pearson product-moment correlation coefficient of Vertical DataFrame using secret sharing.\n", "\n", "SSVertPearsonR will standardize input dataset first. After standardized, all features' variance is 1 and the mean is 0, the calculation is simplified to:\n", "\n", "$$ r=\\frac{1}{n-1}X^TX $$\n", "\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "b8c94841-73b3-4eb8-913d-af2f29c15ee2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "corr: \n", " [[ 1.0001000e+00 2.4262429e-03 -5.3907027e-03 -7.7360502e-04\n", " -7.4419454e-03 -1.0678579e-02]\n", " [ 2.4262429e-03 1.0001000e+00 -5.4155029e-03 -2.0672032e-03\n", " 4.3277428e-03 3.7399144e-03]\n", " [-5.3907027e-03 -5.4155029e-03 1.0001000e+00 6.2504560e-03\n", " -3.7937090e-03 7.3285382e-03]\n", " [-7.7360502e-04 -2.0672032e-03 6.2504560e-03 1.0001000e+00\n", " -1.0416691e-02 1.0044558e-02]\n", " [-7.4419454e-03 4.3277428e-03 -3.7937090e-03 -1.0416691e-02\n", " 1.0001000e+00 -1.3045982e-02]\n", " [-1.0678579e-02 3.7399144e-03 7.3285382e-03 1.0044558e-02\n", " -1.3045982e-02 1.0001000e+00]]\n" ] } ], "source": [ "from secretflow.stats import SSVertPearsonR\n", "\n", "v_pearsonr = SSVertPearsonR(spu)\n", "corr = v_pearsonr.pearsonr(vdf)\n", "print(f\"corr: \\n {corr}\")" ] }, { "cell_type": "markdown", "id": "a404fe97", "metadata": {}, "source": [ "## Variance inflation factor\n", "\n", "Variance Inflation Factor (VIF) was used to detect multicollinearity between variables. In a linear statistical model, the variance inflation factor (VIF) of a coefficient is equal to the quotient of the variance of that coefficient in a multivariate model and the variance of that coefficient in a model with only one variable. In simple terms, it refers to the ratio of the variance when there is multicollinearity among the explanatory variables (features) to the variance when there is no multicollinearity.\n", "\n", "### SSVertVIF\n", "SecretFlow provides `SSVertVIF` for calculating variance inflation factor of Vertical DataFrame using secret sharing.\n", "\n", "The vif value of the j-th feature is: $$ VIF_j = (X^TX)^{-1}_{jj}(n-1)var(X_j) $$\n", "\n", "Notice:\n", "\n", "The analytical solution of matrix inversion in secret sharing is very expensive,\n", "so this method uses Newton iteration to find approximate solution.\n", "\n", "When there is multicollinearity in the input dataset, the XTX matrix is not full rank,\n", "and the analytical solution for the inverse of the XTX matrix does not exist.\n", "\n", "The VIF results of these linear correlational columns calculated by statsmodels are INF,\n", "indicating that the correlation is infinite.\n", "However, this method will get a large VIF value (>>1000) on these columns,\n", "which can also correctly reflect the strong correlation of these columns.\n", "\n", "When there are constant columns in the data, the VIF result calculated by statsmodels is NAN,\n", "and the result of this method is also a large VIF value (>> 1000),\n", "means these columns need to be removed before training.\n", "\n", "Therefore, although the results of this method cannot be completely consistent with statemodels\n", "that calculations in plain text, but they can still correctly reflect the correlation of the input data columns." ] }, { "cell_type": "code", "execution_count": 16, "id": "2ea2c73d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vif: \n", "[0.9999169 0.9997679 0.9999169 0.9999169 1.0000659 1.0002149]\n" ] } ], "source": [ "from secretflow.stats import SSVertVIF \n", "\n", "v_vif = SSVertVIF(spu)\n", "vif = v_vif.vif(vdf)\n", "print(f\"vif: \\n{vif}\")" ] }, { "cell_type": "markdown", "id": "d0c396a9", "metadata": {}, "source": [ "## Hypothesis Testing for Regression Coefficients\n", "\n", "Linear / logistic regression variable significance test for all features (explanatory variables) use to judge whether the parameter is significant, that is, whether the independent variable can effectively predict the variation of the dependent variable, so as to determine whether the corresponding explanatory variable should be included in the model.\n", "\n", "### Hypothesis Testing for linear Regression Coefficients\n", "For linear regression $y=Xω$ (X contains a constant term), use the t-test to test whether the regression term coefficients have zero values.\n", "\n", "Assume:\n", "\n", "$$ \\hat{ω}=(X^T X)^{-1} X^T y=ω+(X^T X)^{-1} X^T ε $$\n", "$$ E(\\hat{ω})=ω $$\n", "$$ Var(\\hat{ω} )=σ^2 (X^T X)^{-1} $$\n", "\n", "In the case where the five assumptions made by OLS are all established, the statistic:\n", "$$ t_j=\\frac{\\hat{ω}_j- ω_j}{s.e.(ω_j )}=\\frac{\\hat{ω}_j - ω_j}{\\sqrt{\\hat{σ}^2 (X^T X)_{jj}^{-1}}} \\sim t_{n-k} $$\n", "where n is sample size, k is feature size.\n", "\n", "The null and alternative hypotheses tested are:\n", "$$ \\begin{aligned}\n", "& H_0:ω_j=0 (j=1,2,⋯,k) \\\\ & H_1:ω_j≠0\n", "\\end{aligned} $$\n", "\n", "Bring the null hypothesis of the test into $t_j$ :\n", "\n", "$$ t_j=\\frac{\\hat{ω}_j}{s.e.(ω_j )}=\\frac{\\hat{ω}_j}{\\sqrt{\\hat{σ}^2 (X^T X)_{jj}^{-1}}} \\sim t_{n-k} $$\n", "\n", "### Hypothesis Testing for Logistic Regression Coefficients\n", "For logistic regression\n", "$$ P(y=1|x)=π(x)=1/(1+e^{-ωx} ) \\\\\n", "P(y=0|x)=1-π(x)=1/(1+e^{ωx} ) $$\n", "\n", "The null and alternative hypotheses tested are:\n", "$$ \\begin{aligned}\n", "& H_0:ω_j=0 (j=1,2,⋯,k) \\\\\n", "& H_1:ω_j≠0\n", "\\end{aligned} $$\n", "\n", "Wald test $Wald=(\\hat{ω}_k/SE(\\hat{ω}_k ) )^2$ fits a chi-square distribution with 1 degree of freedom.\n", "\n", "Where $SE(\\hat{ω}_k )$ is the standard error of $ω_k$, which is the square root of the diagonal elements of the variance-covariance matrix:\n", "$$ SE(\\hat{ω}_k )={Cov(\\hat{ω}_{kk})}^{1/2} $$\n", "\n", "The variance and covariance matrices of the model parameters are the values ​​of the inverse of the Hessian matrix of the log-likelihood function at $\\hat{ω}$:\n", "$$ Cov(\\hat{ω}) =H^{-1}=\\frac{∂^2 l(ω)}{∂ω^2}|_{\\hat{ω}} $$\n", "\n", "Where:\n", "$$ H_{kr}=\\frac{∂^2l(ω)}{∂ω_k ∂ω_r}=∑_{i=1}^mx_{ik}π(x_i)[π(x_i )-1]x_{ir} $$\n", "\n", "The Hessian matrix is ​​expressed as $H=X^T A X$, A is a n*n matrix, where:\n", "$$ \\begin{aligned}\n", "& A_{ii} = π(x_{i})[π(x_{i}) - 1] \\\\\n", "& A_{ij} = 0 , i≠j\n", "\\end{aligned} $$\n", "\n", "Available:\n", "$$ \\begin{aligned}\n", "Wald & = (\\hat{ω}_k/SE(\\hat{ω}_k ) )^2 \\\\ \n", "& = \\hat{ω}_k^2 /Cov(\\hat{ω}_{kk}) \\\\\n", "& = \\hat{ω}_k^2 / H^{-1}_{kk} \\\\\n", "& = \\hat{ω}_k^2 / (X^T \\Lambda X )^{-1}_{kk}\n", "\\end{aligned} $$\n", "\n", "### SSPValue\n", "\n", "SecretFlow provides `SSPValue` for calculating P-Value of hypothesis testing using secret sharing.\n", "\n", "For linear regression:\n", "\n", " * calculate prediction residuals $\\hat{ε}=Xω - y$\n", " * calculate $\\hat{σ}^2=\\hat{ε}^T\\hat{ε} /(n-k)$\n", " * get $(X^T X)^{-1}$ by Newton iteration\n", " * $t^2= ω^2 / \\hat{σ}^2(X^T X)_{jj}^{-1}$ \n", " * $p =2 * (1 - t_{n-k}.cdf(|t|))$\n", "\n", "For logistic regression:\n", "\n", " * calculate $H=X^TAX$\n", " * get $H^{-1}$ by Newton iteration\n", " * calculate $z^2=ω^2/H^{-1}_{kk}$\n", " * $p = 2 * (1 - norm.cdf(|z|))$" ] }, { "cell_type": "code", "execution_count": 19, "id": "6134ca93", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "logistic pvalue: \n", "[9.46532264e-08 0.00000000e+00 0.00000000e+00 0.00000000e+00\n", " 8.21281902e-07 0.00000000e+00 0.00000000e+00]\n" ] } ], "source": [ "from secretflow.stats import SSPValue\n", "from secretflow.ml.linear.ss_sgd import SSRegression\n", "\n", "# first, get a LR model\n", "model = SSRegression(spu)\n", "model.fit(\n", " vdf,\n", " label_data,\n", " 3, # epochs\n", " 0.3, # Learning rate\n", " 64, # batch_size\n", " 't1', # sig_type\n", " 'logistic', # reg_type\n", " 'l2', # penalty\n", " 0.5, # l2_norm\n", ")\n", "spu_model = model.save_model()\n", "\n", "sspv = SSPValue(spu)\n", "pvalues = sspv.pvalues(vdf, label_data, spu_model)\n", "print(f\"logistic pvalue: \\n{pvalues}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('py3.8')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "66d1547304beaba725027c44e57cc46fc747862fe9496520910412a3187eb35f" } } }, "nbformat": 4, "nbformat_minor": 5 }