{ "cells": [ { "cell_type": "markdown", "id": "1701e451-b3a0-45b7-a6e4-50928fbf1636", "metadata": { "tags": [] }, "source": [ "# DataFrame" ] }, { "cell_type": "markdown", "id": "c19b35a5", "metadata": {}, "source": [ ">The following codes are demos only. It's **NOT for production** due to system security concerns, please **DO NOT** use it directly in production." ] }, { "cell_type": "markdown", "id": "d4bd0033", "metadata": {}, "source": [ "It is recommended to use [jupyter](https://jupyter.org/) to run this tutorial." ] }, { "cell_type": "markdown", "id": "7ad971ee-e595-42ab-a1af-60269a82c6f8", "metadata": {}, "source": [ "Secretflow provides federated data encapsulation in the form of DataFrame. DataFrame is composed of data blocks of multiple parties and supports horizontal or vertical partitioned data.\n", "\n", "\"dataframe.png\"\n", "\n", "Currently secretflow.DataFrame provides a subset of pandas operations, which are basically the same as pandas. During the calculation process, the original data is kept in the data holder and will not go out of the domain.\n", "\n", "\n", "\n", "The following will demonstrate how to use a DataFrame." ] }, { "cell_type": "markdown", "id": "2e1bee0f-115a-4ba1-8ffb-06fa9f7aac10", "metadata": { "tags": [] }, "source": [ "## Preparation\n", "\n", "Initialize secretflow and create three parties alice, bob and carol." ] }, { "cell_type": "code", "execution_count": 1, "id": "34143d30-4177-4470-88a1-b1d0fd96671d", "metadata": {}, "outputs": [], "source": [ "import secretflow as sf\n", "\n", "# In case you have a running secretflow runtime already.\n", "sf.shutdown()\n", "\n", "sf.init(['alice', 'bob', 'carol'], address='local')\n", "alice, bob, carol = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('carol')" ] }, { "cell_type": "markdown", "id": "f4a3923c-d722-4d77-a9c1-d405fd8800d3", "metadata": { "tags": [] }, "source": [ "## Data preparation" ] }, { "cell_type": "markdown", "id": "89e73a98-322e-4afe-a03a-5da7e9e30671", "metadata": {}, "source": [ "Here we use [iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) as example data." ] }, { "cell_type": "code", "execution_count": 2, "id": "372a0d20-e081-460f-9850-d62c8550c146", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)target
05.13.51.40.20
14.93.01.40.20
24.73.21.30.20
34.63.11.50.20
45.03.61.40.20
..................
1456.73.05.22.32
1466.32.55.01.92
1476.53.05.22.02
1486.23.45.42.32
1495.93.05.11.82
\n", "

150 rows × 5 columns

\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", ".. ... ... ... ... \n", "145 6.7 3.0 5.2 2.3 \n", "146 6.3 2.5 5.0 1.9 \n", "147 6.5 3.0 5.2 2.0 \n", "148 6.2 3.4 5.4 2.3 \n", "149 5.9 3.0 5.1 1.8 \n", "\n", " target \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", ".. ... \n", "145 2 \n", "146 2 \n", "147 2 \n", "148 2 \n", "149 2 \n", "\n", "[150 rows x 5 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris(as_frame=True)\n", "data = pd.concat([iris.data, iris.target], axis=1)\n", "data" ] }, { "cell_type": "markdown", "id": "e5ae1ad4-19c4-4e07-9f43-871053bff629", "metadata": {}, "source": [ "We partition the data according to horizontal (the same features, each holds some samples) and vertical mode (each holds some features) to facilitate subsequent display." ] }, { "cell_type": "code", "execution_count": 3, "id": "4072d720-1b65-4029-9c67-8a0e7c52d085", "metadata": {}, "outputs": [], "source": [ "# Horizontal partitioning.\n", "h_alice, h_bob, h_carol = data.iloc[:40, :], data.iloc[40:100, :], data.iloc[100:, :]\n", "\n", "# Save to temporary files.\n", "import tempfile\n", "import os\n", "\n", "temp_dir = tempfile.mkdtemp()\n", "\n", "h_alice_path = os.path.join(temp_dir, 'h_alice.csv')\n", "h_bob_path = os.path.join(temp_dir, 'h_bob.csv')\n", "h_carol_path = os.path.join(temp_dir, 'h_carol.csv')\n", "h_alice.to_csv(h_alice_path, index=False)\n", "h_bob.to_csv(h_bob_path, index=False)\n", "h_carol.to_csv(h_carol_path, index=False)" ] }, { "cell_type": "code", "execution_count": 4, "id": "3d408498-a91a-42a2-9c86-9ca52ed793d9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "( sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", " 0 5.1 3.5 1.4 0.2 \n", " 1 4.9 3.0 1.4 0.2 \n", " 2 4.7 3.2 1.3 0.2 \n", " 3 4.6 3.1 1.5 0.2 \n", " 4 5.0 3.6 1.4 0.2 \n", " \n", " target \n", " 0 0 \n", " 1 0 \n", " 2 0 \n", " 3 0 \n", " 4 0 ,\n", " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", " 40 5.0 3.5 1.3 0.3 \n", " 41 4.5 2.3 1.3 0.3 \n", " 42 4.4 3.2 1.3 0.2 \n", " 43 5.0 3.5 1.6 0.6 \n", " 44 5.1 3.8 1.9 0.4 \n", " \n", " target \n", " 40 0 \n", " 41 0 \n", " 42 0 \n", " 43 0 \n", " 44 0 ,\n", " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", " 100 6.3 3.3 6.0 2.5 \n", " 101 5.8 2.7 5.1 1.9 \n", " 102 7.1 3.0 5.9 2.1 \n", " 103 6.3 2.9 5.6 1.8 \n", " 104 6.5 3.0 5.8 2.2 \n", " \n", " target \n", " 100 2 \n", " 101 2 \n", " 102 2 \n", " 103 2 \n", " 104 2 )" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h_alice.head(), h_bob.head(), h_carol.head()" ] }, { "cell_type": "code", "execution_count": 5, "id": "ea2ce091-753a-4080-82e7-b5ce8037f933", "metadata": {}, "outputs": [], "source": [ "# Vertical partitioning.\n", "v_alice, v_bob, v_carol = data.iloc[:, :2], data.iloc[:, 2:4], data.iloc[:, 4:]\n", "\n", "# Save to temporary files.\n", "v_alice_path = os.path.join(temp_dir, 'v_alice.csv')\n", "v_bob_path = os.path.join(temp_dir, 'v_bob.csv')\n", "v_carol_path = os.path.join(temp_dir, 'v_carol.csv')\n", "v_alice.to_csv(v_alice_path, index=False)\n", "v_bob.to_csv(v_bob_path, index=False)\n", "v_carol.to_csv(v_carol_path, index=False)" ] }, { "cell_type": "code", "execution_count": 6, "id": "9cf4a0f1-299a-4945-b61d-0d79259255d4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "( sepal length (cm) sepal width (cm)\n", " 0 5.1 3.5\n", " 1 4.9 3.0\n", " 2 4.7 3.2\n", " 3 4.6 3.1\n", " 4 5.0 3.6\n", " .. ... ...\n", " 145 6.7 3.0\n", " 146 6.3 2.5\n", " 147 6.5 3.0\n", " 148 6.2 3.4\n", " 149 5.9 3.0\n", " \n", " [150 rows x 2 columns],\n", " petal length (cm) petal width (cm)\n", " 0 1.4 0.2\n", " 1 1.4 0.2\n", " 2 1.3 0.2\n", " 3 1.5 0.2\n", " 4 1.4 0.2\n", " .. ... ...\n", " 145 5.2 2.3\n", " 146 5.0 1.9\n", " 147 5.2 2.0\n", " 148 5.4 2.3\n", " 149 5.1 1.8\n", " \n", " [150 rows x 2 columns],\n", " target\n", " 0 0\n", " 1 0\n", " 2 0\n", " 3 0\n", " 4 0\n", " .. ...\n", " 145 2\n", " 146 2\n", " 147 2\n", " 148 2\n", " 149 2\n", " \n", " [150 rows x 1 columns])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v_alice, v_bob, v_carol" ] }, { "cell_type": "markdown", "id": "257723ee-f582-4fa4-b539-5361e19cdee2", "metadata": { "tags": [] }, "source": [ "## Creation" ] }, { "cell_type": "markdown", "id": "37534ec9-0de4-46c1-8226-b8ffd8e0eba1", "metadata": { "tags": [] }, "source": [ "### Horitontal DataFrame\n", "\n", "Create a DataFrame consisting of horizontally partitioned data.\n", "\n", "> 💡 The original data is still stored locally in the data holder and is not transmitted out of the domain.\n", "\n", "Here, as a simple show case, we choose secure aggregation and spu comparison. You can refer to [Secure Aggregation](../../developer/algorithm/secure_aggregation.ipynb) to learn more about secure aggregation solutions and implement appropriate security policies according to your needs." ] }, { "cell_type": "code", "execution_count": 7, "id": "3bbff8d2-02ca-4f86-a624-f6c89935dc25", "metadata": {}, "outputs": [], "source": [ "from secretflow.data.horizontal import read_csv as h_read_csv\n", "from secretflow.security.aggregation import SecureAggregator\n", "from secretflow.security.compare import SPUComparator\n", "\n", "# The aggregator and comparator are respectively used to aggregate \n", "# or compare data in subsequent data analysis operations.\n", "aggr = SecureAggregator(device=alice, participants=[alice, bob, carol])\n", "\n", "spu = sf.SPU(sf.utils.testing.cluster_def(parties=['alice', 'bob', 'carol']))\n", "comp = SPUComparator(spu)\n", "hdf = h_read_csv({alice: h_alice_path, bob: h_bob_path, carol: h_carol_path}, \n", " aggregator=aggr, \n", " comparator=comp)" ] }, { "cell_type": "markdown", "id": "68afe3b5-c90d-46c4-b102-77bce9af163e", "metadata": {}, "source": [ "### Vertical DataFrame\n", "\n", "Create a DataFrame consisting of vertically partitioned data.\n", "\n", "> 💡 The original data is still stored locally in the data holder and is not transmitted out of the domain." ] }, { "cell_type": "code", "execution_count": 8, "id": "a202553b-41c4-4186-9c73-85ffbed185ca", "metadata": {}, "outputs": [], "source": [ "from secretflow.data.vertical import read_csv as v_read_csv\n", "\n", "vdf = v_read_csv({alice: v_alice_path, bob: v_bob_path, carol: v_carol_path})" ] }, { "cell_type": "markdown", "id": "6056def7-9623-4dd1-82fc-aa942e9dd76f", "metadata": { "tags": [] }, "source": [ "## Data analysis\n", "\n", "For data privacy protection purposes, DataFrame does not allow the view of raw data. DataFrame provides an interface similar to pandas for users to analyze data. These interfaces usually support both horizontal and vertical partitioned data.\n", "\n", "> 💡 During the following operations, the original data of the DataFrame is still stored locally on the node and is not transmitted out of the domain." ] }, { "cell_type": "code", "execution_count": 9, "id": "152527e6-cdbc-4652-9a98-c32d2f5ba796", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n", " 'petal width (cm)', 'target'],\n", " dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hdf.columns" ] }, { "cell_type": "code", "execution_count": 10, "id": "5efe698d-0ef4-477f-8dd0-5d7cccbc39f8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',\n", " 'petal width (cm)', 'target'],\n", " dtype='object')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vdf.columns" ] }, { "cell_type": "markdown", "id": "fe6a7d37-d85d-4f4c-8386-7b306c0e0ae8", "metadata": {}, "source": [ "Get the minimum value, you can see that it is consistent with the original data." ] }, { "cell_type": "code", "execution_count": 11, "id": "57d88dcb-51e0-4bcf-9be0-dcc6a138bd7c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Horizontal df:\n", " sepal length (cm) 4.3\n", "sepal width (cm) 2.0\n", "petal length (cm) 1.0\n", "petal width (cm) 0.1\n", "target 0.0\n", "dtype: float64\n", "\n", "Vertical df:\n", " sepal length (cm) 4.3\n", "sepal width (cm) 2.0\n", "petal length (cm) 1.0\n", "petal width (cm) 0.1\n", "target 0.0\n", "dtype: float64\n", "\n", "Pandas:\n", " sepal length (cm) 4.3\n", "sepal width (cm) 2.0\n", "petal length (cm) 1.0\n", "petal width (cm) 0.1\n", "target 0.0\n", "dtype: float64\n" ] } ], "source": [ "print('Horizontal df:\\n', hdf.min())\n", "print('\\nVertical df:\\n', vdf.min())\n", "print('\\nPandas:\\n', data.min())" ] }, { "cell_type": "markdown", "id": "20b644b8-2b13-4fe3-980b-39aaa913bbdc", "metadata": {}, "source": [ "You can also view information such as maximum value, mean value, and quantity." ] }, { "cell_type": "code", "execution_count": 12, "id": "469bd791-74f6-487c-81f3-a8931d50c156", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) 7.9\n", "sepal width (cm) 4.4\n", "petal length (cm) 6.9\n", "petal width (cm) 2.5\n", "target 2.0\n", "dtype: float64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hdf.max()" ] }, { "cell_type": "code", "execution_count": 13, "id": "f7628417-a17e-4323-9490-07a42e5cd994", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) 7.9\n", "sepal width (cm) 4.4\n", "petal length (cm) 6.9\n", "petal width (cm) 2.5\n", "target 2.0\n", "dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vdf.max()" ] }, { "cell_type": "code", "execution_count": 14, "id": "5a3623c7-52ac-4df8-926a-5f70759daada", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) 5.843333\n", "sepal width (cm) 3.057333\n", "petal length (cm) 3.758000\n", "petal width (cm) 1.199333\n", "target 1.000000\n", "dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hdf.mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": 15, "id": "b28899eb-ec00-41d9-9630-9ac7c57ae523", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) 5.843333\n", "sepal width (cm) 3.057333\n", "petal length (cm) 3.758000\n", "petal width (cm) 1.199333\n", "target 1.000000\n", "dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vdf.mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": 16, "id": "9e2d83b7-1ee8-4db5-b59b-7149bac5e6fa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) 150\n", "sepal width (cm) 150\n", "petal length (cm) 150\n", "petal width (cm) 150\n", "target 150\n", "dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hdf.count()" ] }, { "cell_type": "code", "execution_count": 17, "id": "7aea6a98-a96f-4655-80dd-c4d62e0bc6f2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) 150\n", "sepal width (cm) 150\n", "petal length (cm) 150\n", "petal width (cm) 150\n", "target 150\n", "dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vdf.count()" ] }, { "cell_type": "markdown", "id": "5a51defd-69f4-4b2f-b5de-aa8a08e41d86", "metadata": { "tags": [] }, "source": [ "### Selection\n", "\n", "Get partial columns." ] }, { "cell_type": "code", "execution_count": 18, "id": "8008563e-73ca-4c19-ab0f-08b055893493", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) 5.843333\n", "target 1.000000\n", "dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hdf_part = hdf[['sepal length (cm)', 'target']]\n", "hdf_part.mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": 19, "id": "bbfb8fb0-c044-4e7b-a827-eb5858194de4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal width (cm) 3.057333\n", "target 1.000000\n", "dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vdf_part = hdf[['sepal width (cm)', 'target']]\n", "vdf_part.mean(numeric_only=True)" ] }, { "cell_type": "markdown", "id": "cb8dae0b-c7e7-4888-9e6b-d25bd6fc864b", "metadata": { "tags": [] }, "source": [ "### Modification" ] }, { "cell_type": "markdown", "id": "d7d332a2-0504-4e10-a886-d4729f70e72f", "metadata": {}, "source": [ "Horizontal DataFrame" ] }, { "cell_type": "code", "execution_count": 20, "id": "299fcc32-6e0d-4f1a-8344-1273cd562f13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min of target: 0.0\n", "Max of target: 2.0\n" ] } ], "source": [ "hdf_copy = hdf.copy()\n", "print('Min of target: ', hdf_copy['target'].min()[0])\n", "print('Max of target: ', hdf_copy['target'].max()[0])" ] }, { "cell_type": "code", "execution_count": 21, "id": "43a292b2-75fc-4d64-9b8a-8e2aa0720613", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min of target: 1.0\n", "Max of target: 1.0\n" ] } ], "source": [ "# Set target to 1。\n", "hdf_copy['target'] = 1\n", "\n", "# You can see that the value of target has become 1.\n", "print('Min of target: ', hdf_copy['target'].min()[0])\n", "print('Max of target: ', hdf_copy['target'].max()[0])" ] }, { "cell_type": "markdown", "id": "e0324286-98ca-4cc2-8435-138be15dddcb", "metadata": {}, "source": [ "Vertical DataFrame." ] }, { "cell_type": "code", "execution_count": 22, "id": "b9b96eaf-5523-45b5-841a-088a4c8662da", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min of sepal width (cm): 2.0\n", "Max of sepal width (cm): 4.4\n" ] } ], "source": [ "vdf_copy = vdf.copy()\n", "print('Min of sepal width (cm): ', vdf_copy['sepal width (cm)'].min()[0])\n", "print('Max of sepal width (cm): ', vdf_copy['sepal width (cm)'].max()[0])" ] }, { "cell_type": "code", "execution_count": 23, "id": "93a329c3-b265-4d20-9978-5aa5a1b0494b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Min of sepal width (cm): 20\n", "Max of sepal width (cm): 20\n" ] } ], "source": [ "# Set sepal width (cm) to 20。\n", "vdf_copy['sepal width (cm)'] = 20\n", "\n", "# You can see that the value of sepal width (cm) has become 20.\n", "print('Min of sepal width (cm): ', vdf_copy['sepal width (cm)'].min()[0])\n", "print('Max of sepal width (cm): ', vdf_copy['sepal width (cm)'].max()[0])" ] }, { "cell_type": "markdown", "id": "a84fd212-ee94-4c8c-9d9d-ba6b41c49e7b", "metadata": {}, "source": [ "## Ending" ] }, { "cell_type": "code", "execution_count": 24, "id": "2e665be8-e4b9-4202-afac-2cc8be8543fd", "metadata": {}, "outputs": [], "source": [ "# Clean up temporary files\n", "\n", "import shutil\n", "\n", "shutil.rmtree(temp_dir, ignore_errors=True)" ] }, { "cell_type": "markdown", "id": "164a2b4d", "metadata": {}, "source": [ "## What's Next?\n", "\n", "Learn how to do data preprocessing with DataFrame with [this tutorial](../../tutorial/data_preprocessing_with_data_frame.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('3.8')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "ae1fdd5fd034b7d694352220485921694ff89198520409089b4646721fce11ca" } } }, "nbformat": 4, "nbformat_minor": 5 }