DataFrame#

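The code below is for demonstration only. Do not use it directly in production.

We recommend running this tutorial in Jupyter.

SecretFlow provides a DataFrame abstraction over federated data. A DataFrame is made up of data blocks held by multiple parties, and supports both horizontal and vertical partitioning.

DataFrame currently implements a subset of the pandas API, so it feels much like pandas to use. Throughout the computation, the raw data stays local to its owner and never leaves its domain.

The rest of this tutorial shows how to use DataFrame.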

Preparation#

Initialize secretflow and create three parties: alice, bob, and carol.

[1]:

import secretflow as sf

# In case you have a running secretflow runtime already.
sf.shutdown()

sf.init(['alice', 'bob', 'carol'], address='local')
alice, bob, carol = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('carol')

Data preparation#

We use iris as the example dataset.

[2]:

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
data = pd.concat([iris.data, iris.target], axis=1)
data

[2]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	2
146	6.3	2.5	5.0	1.9	2
147	6.5	3.0	5.2	2.0	2
148	6.2	3.4	5.4	2.3	2
149	5.9	3.0	5.1	1.8	2

150 rows × 5 columns

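We partition the data both horizontally (every party holds the same features for different samples) and vertically (every party holds different features for the same samples) so that both cases can be demonstrated below.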

[3]:

# Horizontal partitioning.
h_alice, h_bob, h_carol = data.iloc[:40, :], data.iloc[40:100, :], data.iloc[100:, :]

# Save to temporary files.
import tempfile
import os

temp_dir = tempfile.mkdtemp()

h_alice_path = os.path.join(temp_dir, 'h_alice.csv')
h_bob_path = os.path.join(temp_dir, 'h_bob.csv')
h_carol_path = os.path.join(temp_dir, 'h_carol.csv')
h_alice.to_csv(h_alice_path, index=False)
h_bob.to_csv(h_bob_path, index=False)
h_carol.to_csv(h_carol_path, index=False)

[4]:

h_alice.head(), h_bob.head(), h_carol.head()

[4]:

(   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 0                5.1               3.5                1.4               0.2
 1                4.9               3.0                1.4               0.2
 2                4.7               3.2                1.3               0.2
 3                4.6               3.1                1.5               0.2
 4                5.0               3.6                1.4               0.2

    target
 0       0
 1       0
 2       0
 3       0
 4       0  ,
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 40                5.0               3.5                1.3               0.3
 41                4.5               2.3                1.3               0.3
 42                4.4               3.2                1.3               0.2
 43                5.0               3.5                1.6               0.6
 44                5.1               3.8                1.9               0.4

     target
 40       0
 41       0
 42       0
 43       0
 44       0  ,
      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
 100                6.3               3.3                6.0               2.5
 101                5.8               2.7                5.1               1.9
 102                7.1               3.0                5.9               2.1
 103                6.3               2.9                5.6               1.8
 104                6.5               3.0                5.8               2.2

      target
 100       2
 101       2
 102       2
 103       2
 104       2  )

[5]:

# Vertical partitioning.
v_alice, v_bob, v_carol = data.iloc[:, :2], data.iloc[:, 2:4], data.iloc[:, 4:]

# Save to temporary files.
v_alice_path = os.path.join(temp_dir, 'v_alice.csv')
v_bob_path = os.path.join(temp_dir, 'v_bob.csv')
v_carol_path = os.path.join(temp_dir, 'v_carol.csv')
v_alice.to_csv(v_alice_path, index=False)
v_bob.to_csv(v_bob_path, index=False)
v_carol.to_csv(v_carol_path, index=False)

[6]:

v_alice, v_bob, v_carol

[6]:

(     sepal length (cm)  sepal width (cm)
 0                  5.1               3.5
 1                  4.9               3.0
 2                  4.7               3.2
 3                  4.6               3.1
 4                  5.0               3.6
 ..                 ...               ...
 145                6.7               3.0
 146                6.3               2.5
 147                6.5               3.0
 148                6.2               3.4
 149                5.9               3.0

 [150 rows x 2 columns],
      petal length (cm)  petal width (cm)
 0                  1.4               0.2
 1                  1.4               0.2
 2                  1.3               0.2
 3                  1.5               0.2
 4                  1.4               0.2
 ..                 ...               ...
 145                5.2               2.3
 146                5.0               1.9
 147                5.2               2.0
 148                5.4               2.3
 149                5.1               1.8

 [150 rows x 2 columns],
      target
 0         0
 1         0
 2         0
 3         0
 4         0
 ..      ...
 145       2
 146       2
 147       2
 148       2
 149       2

 [150 rows x 1 columns])
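As a sanity check on the two partitioning schemes, the sketch below splits a small table both ways and rebuilds it with `pd.concat`. The `toy` table and its column names are hypothetical stand-ins for the iris data; only plain pandas is used, not secretflow.

```python
import pandas as pd

# Hypothetical toy table standing in for the iris data.
toy = pd.DataFrame({
    'f1': [1.0, 2.0, 3.0, 4.0],
    'f2': [10.0, 20.0, 30.0, 40.0],
    'label': [0, 0, 1, 1],
})

# Horizontal split: every part keeps all columns but a disjoint set of rows.
h_parts = [toy.iloc[:1, :], toy.iloc[1:3, :], toy.iloc[3:, :]]

# Vertical split: every part keeps all rows but a disjoint set of columns.
v_parts = [toy.iloc[:, :1], toy.iloc[:, 1:2], toy.iloc[:, 2:]]

# Stacking rows (axis=0) undoes the horizontal split;
# joining columns (axis=1) undoes the vertical split.
h_rebuilt = pd.concat(h_parts, axis=0)
v_rebuilt = pd.concat(v_parts, axis=1)

h_ok = h_rebuilt.equals(toy)
v_ok = v_rebuilt.equals(toy)
print(h_ok, v_ok)
```

In the federated setting this concatenation never actually happens in one place. The sketch only shows which axis each party's block belongs to.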

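Creation#

Horizontal DataFrame#

Create a DataFrame composed of horizontally partitioned data.

💡 The raw data remains local to its owner and does not leave its domain.

For this demonstration we use secure aggregation and SPU-based secure comparison. See Secure Aggregation to learn more about the available secure aggregation schemes and choose the security policy that suits you.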

[7]:

from secretflow.data.horizontal import read_csv as h_read_csv
from secretflow.security.aggregation import SecureAggregator
from secretflow.security.compare import SPUComparator

# The aggregator and comparator are respectively used to aggregate
# or compare data in subsequent data analysis operations.
aggr = SecureAggregator(device=alice, participants=[alice, bob, carol])

spu = sf.SPU(sf.utils.testing.cluster_def(parties=['alice', 'bob', 'carol']))
comp = SPUComparator(spu)
hdf = h_read_csv({alice: h_alice_path, bob: h_bob_path, carol: h_carol_path},
                 aggregator=aggr,
                 comparator=comp)
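To give an intuition for what a secure aggregator does, here is a minimal sketch of pairwise additive masking, one common approach to secure aggregation. It is not a description of how `SecureAggregator` is implemented. The function name `masked_inputs`, the modulus, and the seeded random generator (a stand-in for real pairwise key agreement) are all assumptions made for illustration.

```python
import random

MODULUS = 2**32


def masked_inputs(values, seed=0):
    """Hide each party's value behind pairwise masks that cancel in the sum."""
    rng = random.Random(seed)  # Stand-in for pairwise key agreement.
    masked = [v % MODULUS for v in values]
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            # Secret shared only by party i and party j.
            mask = rng.randrange(MODULUS)
            masked[i] = (masked[i] + mask) % MODULUS  # Party i adds it.
            masked[j] = (masked[j] - mask) % MODULUS  # Party j subtracts it.
    return masked


# Hypothetical per-party row counts, matching the 40/60/50 split above.
local_counts = [40, 60, 50]
masked = masked_inputs(local_counts)

# The aggregator only sees the masked values, yet their sum is exact
# because every mask is added once and subtracted once.
total = sum(masked) % MODULUS
print(total)  # 150
```

A production scheme would also need an encoding for floating-point values and a way to handle parties that drop out, both of which this sketch leaves out.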

Vertical DataFrame#

Create a DataFrame composed of vertically partitioned data.

💡 The raw data remains local to its owner and does not leave its domain.

[8]:

from secretflow.data.vertical import read_csv as v_read_csv

vdf = v_read_csv({alice: v_alice_path, bob: v_bob_path, carol: v_carol_path})

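Data analysis#

To protect data privacy, DataFrame does not allow the raw data to be viewed. Instead, it provides pandas-like interfaces for analyzing the data, and these interfaces work the same way for horizontally and vertically partitioned data.

In all of the operations below, the raw data remains local to its owner and is not transferred out of its domain.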

[9]:

hdf.columns

[9]:

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

[10]:

vdf.columns

[10]:

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

Get the minimum of each column. The results match those computed on the raw data with pandas.

[11]:

print('Horizontal df:\n', hdf.min())
print('\nVertical df:\n', vdf.min())
print('\nPandas:\n', data.min())

Horizontal df:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64

Vertical df:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64

Pandas:
 sepal length (cm)    4.3
sepal width (cm)     2.0
petal length (cm)    1.0
petal width (cm)     0.1
target               0.0
dtype: float64
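The reason both layouts can reproduce the pandas result is sketched below in plain Python. For horizontal data, the global minimum of a column is the minimum of the per-party minima. For vertical data, each column lives with one party, so its local minimum is already the global one. The party data and the column names `x` and `y` are hypothetical. In the real horizontal DataFrame the cross-party comparison is delegated to the configured comparator rather than done in the clear as it is here.

```python
# Horizontal layout: every party holds both columns for its own rows.
h_local = {
    'alice': {'x': [5.1, 4.9], 'y': [3.5, 3.0]},
    'bob': {'x': [4.3, 5.0], 'y': [2.3, 3.2]},
    'carol': {'x': [6.3, 5.8], 'y': [2.0, 2.7]},
}
columns = ['x', 'y']

# Each party reduces its own rows first; only the local minima are compared.
h_min = {
    col: min(min(part[col]) for part in h_local.values())
    for col in columns
}

# Vertical layout: each column is held entirely by one party.
v_local = {
    'alice': {'x': [5.1, 4.9, 4.3]},
    'bob': {'y': [3.5, 2.0, 3.2]},
}

# No cross-party step is needed; the per-column results are just collected.
v_min = {
    col: min(values)
    for part in v_local.values()
    for col, values in part.items()
}

print(h_min)
print(v_min)
```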

You can also compute other statistics such as the maximum, mean, and count.

[12]:

hdf.max()

[12]:

sepal length (cm)    7.9
sepal width (cm)     4.4
petal length (cm)    6.9
petal width (cm)     2.5
target               2.0
dtype: float64

[13]:

vdf.max()

[13]:

sepal length (cm)    7.9
sepal width (cm)     4.4
petal length (cm)    6.9
petal width (cm)     2.5
target               2.0
dtype: float64

[14]:

hdf.mean(numeric_only=True)

[14]:

sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64

[15]:

vdf.mean(numeric_only=True)

[15]:

sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64

[16]:

hdf.count()

[16]:

sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
target               150
dtype: int64

[17]:

vdf.count()

[17]:

sepal length (cm)    150
sepal width (cm)     150
petal length (cm)    150
petal width (cm)     150
target               150
dtype: int64
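For horizontally partitioned data with unequal partition sizes (40, 60, and 50 rows here), a global mean cannot be obtained by averaging the per-party means. A minimal sketch of the usual alternative, combining per-party sums and counts, follows. The values in `local_rows` are hypothetical, and the sketch makes no claim about how secretflow computes `mean` internally.

```python
# Hypothetical single-column data held by three parties of unequal size.
local_rows = {
    'alice': [5.0, 7.0],
    'bob': [1.0, 2.0, 3.0],
    'carol': [6.0],
}

# Each party shares only a (sum, count) pair, never its rows.
local_stats = {party: (sum(v), len(v)) for party, v in local_rows.items()}

total_sum = sum(s for s, _ in local_stats.values())
total_count = sum(n for _, n in local_stats.values())
global_mean = total_sum / total_count

# Averaging the local means weights every party equally, which is wrong
# when the parties hold different numbers of rows.
mean_of_means = sum(s / n for s, n in local_stats.values()) / len(local_stats)

print(global_mean)    # 4.0
print(mean_of_means)  # differs from the true mean
```

The sums and counts are exactly the kind of values an aggregator can add up without seeing any single party's contribution.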

Selecting data#

Select a subset of columns.

[18]:

hdf_part = hdf[['sepal length (cm)', 'target']]
hdf_part.mean(numeric_only=True)

[18]:

sepal length (cm)    5.843333
target               1.000000
dtype: float64

[19]:

vdf_part = vdf[['sepal width (cm)', 'target']]
vdf_part.mean(numeric_only=True)

[19]:

sepal width (cm)    3.057333
target              1.000000
dtype: float64
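With vertical partitioning, a column selection touches only the parties that hold the requested columns. The sketch below illustrates that routing idea in plain Python. The helper `route_columns` and the `ownership` mapping are hypothetical and are not part of the secretflow API.

```python
# Column ownership mirroring the vertical split used in this tutorial.
ownership = {
    'alice': ['sepal length (cm)', 'sepal width (cm)'],
    'bob': ['petal length (cm)', 'petal width (cm)'],
    'carol': ['target'],
}


def route_columns(requested, ownership):
    """Group the requested columns by the party that holds them."""
    known = {col for cols in ownership.values() for col in cols}
    missing = [col for col in requested if col not in known]
    if missing:
        raise KeyError(f'columns not found: {missing}')

    routed = {}
    for party, cols in ownership.items():
        picked = [col for col in requested if col in cols]
        if picked:  # Parties holding none of the columns are left out.
            routed[party] = picked
    return routed


routed = route_columns(['sepal width (cm)', 'target'], ownership)
print(routed)
```

Here bob holds neither requested column, so the resulting selection would involve only alice and carol.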

Modification#

Horizontal DataFrame

[20]:

hdf_copy = hdf.copy()
print('Min of target: ', hdf_copy['target'].min()[0])
print('Max of target: ', hdf_copy['target'].max()[0])

Min of target:  0.0
Max of target:  2.0

[21]:

# Set target to 1.
hdf_copy['target'] = 1

# You can see that the value of target has become 1.
print('Min of target: ', hdf_copy['target'].min()[0])
print('Max of target: ', hdf_copy['target'].max()[0])

Min of target:  1.0
Max of target:  1.0

Vertical DataFrame

[22]:

vdf_copy = vdf.copy()
print('Min of sepal width (cm): ', vdf_copy['sepal width (cm)'].min()[0])
print('Max of sepal width (cm): ', vdf_copy['sepal width (cm)'].max()[0])

Min of sepal width (cm):  2.0
Max of sepal width (cm):  4.4

[23]:

# Set sepal width (cm) to 20.
vdf_copy['sepal width (cm)'] = 20

# You can see that the value of sepal width (cm) has become 20.
print('Min of sepal width (cm): ', vdf_copy['sepal width (cm)'].min()[0])
print('Max of sepal width (cm): ', vdf_copy['sepal width (cm)'].max()[0])

Min of sepal width (cm):  20
Max of sepal width (cm):  20

Cleanup#

[24]:

# Clean up temporary files

import shutil

shutil.rmtree(temp_dir, ignore_errors=True)
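As an alternative to calling `shutil.rmtree` by hand, the standard library's `tempfile.TemporaryDirectory` removes the directory automatically when its `with` block exits. The sketch below uses a placeholder CSV file and no secretflow calls. In a notebook, where the files must outlive a single cell, the explicit `mkdtemp` plus `rmtree` pattern used above may still be the more practical choice.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder file standing in for one party's CSV.
    path = os.path.join(tmp, 'h_alice.csv')
    with open(path, 'w') as f:
        f.write('a,b\n1,2\n')
    existed_inside = os.path.exists(path)

# The directory and everything in it are gone once the block exits.
removed_after = not os.path.exists(tmp)
print(existed_inside, removed_after)
```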

Next steps#

Continue with the Preprocessing tutorial to learn how to preprocess data.
