Data Preprocessing#

The following code is for demonstration purposes only. Do not use it directly in production.

We recommend running this tutorial with Jupyter.

SecretFlow provides a variety of preprocessing utilities for data.

Preparation#

Initialize SecretFlow and create two parties, alice and bob.

💡 Before using the preprocessing utilities, you may want to learn about the SecretFlow DataFrame first.

[ ]:
import secretflow as sf

# In case you have a running secretflow runtime already.
sf.shutdown()

sf.init(['alice', 'bob'], address='local')
alice = sf.PYU('alice')
bob = sf.PYU('bob')

Data Preparation#

We use iris as the example dataset.

[2]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
data = pd.concat([iris.data, iris.target], axis=1)

# In order to facilitate the subsequent display,
# here we first set some data to None.
data.iloc[1, 1] = None
data.iloc[100, 1] = None

# Restore target to its original name.
data['target'] = data['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
data
[2]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 setosa
1 4.9 NaN 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Create a vertical DataFrame.

[3]:
import tempfile
from secretflow.data.vertical import read_csv as v_read_csv

# Vertical partitioning.
v_alice, v_bob = data.iloc[:, :2], data.iloc[:, 2:]

# Save to temporary files.
_, alice_path = tempfile.mkstemp()
_, bob_path = tempfile.mkstemp()
v_alice.to_csv(alice_path, index=False)
v_bob.to_csv(bob_path, index=False)


df = v_read_csv({alice: alice_path, bob: bob_path})

You can also create a horizontal DataFrame instead; the subsequent operations work the same for both horizontal and vertical DataFrames.

[4]:
# from secretflow.data.horizontal import read_csv as h_read_csv
# from secretflow.security.aggregation import PlainAggregator
# from secretflow.security.compare import PlainComparator

# # Horizontal partitioning.
# h_alice, h_bob = data.iloc[:70, :], data.iloc[70:, :]

# # Save to temporary files.
# _, h_alice_path = tempfile.mkstemp()
# _, h_bob_path = tempfile.mkstemp()
# h_alice.to_csv(h_alice_path, index=False)
# h_bob.to_csv(h_bob_path, index=False)

# df = h_read_csv(
#     {alice: h_alice_path, bob: h_bob_path},
#     aggregator=PlainAggregator(alice),
#     comparator=PlainComparator(alice),
# )

Preprocessing#

SecretFlow provides a variety of preprocessing functions, including missing value filling, standardization, categorical feature encoding, and discretization, used in the same way as sklearn's preprocessing.

Missing Value Filling#

DataFrame provides a fillna method that fills missing values just as pandas does.

[5]:
# Before filling, the sepal width (cm) is missing in two positions.
df.count()['sepal width (cm)']
[5]:
148
[6]:
# Fill sepal width (cm) with 10.
df.fillna(value={'sepal width (cm)': 10}).count()['sepal width (cm)']
[6]:
150
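Under the hood, each party fills its local partition with pandas semantics, so the before/after counts above can be reproduced with plain pandas alone. A minimal sketch (no SecretFlow devices involved; the toy column below is illustrative, not the real partitioned data):

```python
import pandas as pd

# A toy column with two missing values, mirroring the example above.
s = pd.DataFrame({'sepal width (cm)': [3.5, None, 3.2, None]})

before = s.count()['sepal width (cm)']           # non-missing values before filling
filled = s.fillna(value={'sepal width (cm)': 10})
after = filled.count()['sepal width (cm)']       # all values present after filling

print(before, after)  # 2 4
```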

Standardization#

Scaling Features to a Range#

SecretFlow provides MinMaxScaler to scale features to a range between the minimum and maximum. Both the input and output of MinMaxScaler are DataFrames.

Below is an example of scaling sepal length (cm) to the range [0, 1].

[7]:
from secretflow.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_sepal_len = scaler.fit_transform(df['sepal length (cm)'])

print('Min: ', scaled_sepal_len.min())
print('Max: ', scaled_sepal_len.max())
Min:  sepal length (cm)    0.0
dtype: float64
Max:  sepal length (cm)    1.0
dtype: float64
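The transformation follows the usual sklearn-style min-max formula, x' = (x - min) / (max - min), which maps the column onto [0, 1]. A pure-numpy sketch on toy values (illustrative, not the real federated computation):

```python
import numpy as np

x = np.array([4.3, 5.8, 7.9])  # toy sepal lengths (cm)

# Min-max scaling: x' = (x - min) / (max - min), so the smallest value
# maps to 0.0 and the largest to 1.0.
scaled = (x - x.min()) / (x.max() - x.min())

print(scaled.min(), scaled.max())  # 0.0 1.0
```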

Variance Scaling#

SecretFlow provides StandardScaler for variance scaling (standardization). Both the input and output of StandardScaler are DataFrames.

Below is an example of standardizing sepal length (cm).

[8]:
from secretflow.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_sepal_len = scaler.fit_transform(df['sepal length (cm)'])

print('Min: ', scaled_sepal_len.min())
print('Max: ', scaled_sepal_len.max())
Min:  sepal length (cm)   -1.870024
dtype: float64
Max:  sepal length (cm)    2.492019
dtype: float64
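Standardization follows the sklearn-style formula z = (x - mean) / std, producing a column with zero mean and unit variance, which is why the min/max above are no longer 0 and 1 but roughly symmetric around zero. A pure-numpy sketch on toy values:

```python
import numpy as np

x = np.array([4.3, 5.8, 7.9])  # toy sepal lengths (cm)

# Standardization: z = (x - mean) / std (population std, as sklearn uses).
z = (x - x.mean()) / x.std()

print(round(z.mean(), 12), z.std())  # ~0.0 1.0
```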

Categorical Feature Encoding#

One-Hot Encoding#

SecretFlow provides OneHotEncoder for one-hot encoding. Both the input and output of OneHotEncoder are DataFrames.

Below is an example of one-hot encoding the target column.

[9]:
from secretflow.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()
onehot_target = onehot_encoder.fit_transform(df['target'])

print('Columns: ', onehot_target.columns)
print('Min: \n', onehot_target.min())
print('Max: \n', onehot_target.max())
Columns:  Index(['target_setosa', 'target_versicolor', 'target_virginica'], dtype='object')
Min:
 target_setosa        0.0
target_versicolor    0.0
target_virginica     0.0
dtype: float64
Max:
 target_setosa        1.0
target_versicolor    1.0
target_virginica     1.0
dtype: float64
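One-hot encoding produces one 0/1 column per category, named `<column>_<category>` as in the output above. A plain-pandas sketch of the same idea using `pd.get_dummies` (a local stand-in, not the SecretFlow implementation):

```python
import pandas as pd

target = pd.DataFrame({'target': ['setosa', 'versicolor', 'virginica', 'setosa']})

# One 0/1 indicator column per category; exactly one indicator is set per row.
onehot = pd.get_dummies(target, columns=['target']).astype(float)

print(list(onehot.columns))
# ['target_setosa', 'target_versicolor', 'target_virginica']
```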

Label Encoding#

SecretFlow provides LabelEncoder for encoding a label column to integers in [0, number of classes - 1]. Both the input and output of LabelEncoder are DataFrames.

Below is an example of label encoding the target column.

[10]:
from secretflow.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
encoded_label = label_encoder.fit_transform(df['target'])

print('Columns: ', encoded_label.columns)
print('Min: \n', encoded_label.min())
print('Max: \n', encoded_label.max())
Columns:  Index(['target'], dtype='object')
Min:
 target    0
dtype: int64
Max:
 target    2
dtype: int64

Discretization#

SecretFlow provides KBinsDiscretizer for binning continuous data into discrete values. Both the input and output of KBinsDiscretizer are DataFrames.

Below is an example of splitting petal length (cm) into 5 bins.

[11]:
from secretflow.preprocessing import KBinsDiscretizer

estimator = KBinsDiscretizer(n_bins=5)
binned_petal_len = estimator.fit_transform(df['petal length (cm)'])

print('Min: \n', binned_petal_len.min())
print('Max: \n', binned_petal_len.max())
Min:
 petal length (cm)    0.0
dtype: float64
Max:
 petal length (cm)    4.0
dtype: float64
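With 5 bins, each value is replaced by its bucket index 0 through 4, matching the min/max above. A plain-pandas sketch of quantile-based binning via `pd.qcut` (one possible binning strategy; the actual strategy used by KBinsDiscretizer may differ):

```python
import pandas as pd

petal_len = pd.Series([1.0, 1.4, 3.0, 4.5, 5.1, 6.9])  # toy petal lengths (cm)

# Quantile binning into 5 buckets; labels=False yields bucket indices 0..4.
binned = pd.qcut(petal_len, q=5, labels=False)

print(binned.min(), binned.max())  # 0 4
```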

WOE Encoding#

SecretFlow provides VertWoeBinning, which bins features by quantile or by the chimerge method and computes the WOE and IV values of each bin. VertWOESubstitution can then replace feature values with their WOE values.

Below is an example of WOE encoding features.

[14]:
# WOE binning uses an SPU or HEU device to protect the label.
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))

# Only binary classification labels are supported for now.
# Use a linear dataset as the example.
from secretflow.utils.simulation.datasets import load_linear
vdf = load_linear(parts={alice: (1, 4), bob: (18, 22)})
print(f"orig ds in alice:\n {sf.reveal(vdf.partitions[alice].data)}")
print(f"orig ds in bob:\n {sf.reveal(vdf.partitions[bob].data)}")

from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning

binning = VertWoeBinning(spu)
woe_rules = binning.binning(
    vdf,
    binning_method="quantile",
    bin_num=5,
    bin_names={alice: ["x1", "x2", "x3"], bob: ["x18", "x19", "x20"]},
    label_name="y",
)

print(f"woe_rules for alice:\n {sf.reveal(woe_rules[alice])}")
print(f"woe_rules for bob:\n {sf.reveal(woe_rules[bob])}")

from secretflow.preprocessing.binning.vert_woe_substitution import VertWOESubstitution

woe_sub = VertWOESubstitution()
sub_data = woe_sub.substitution(vdf, woe_rules)

print(f"substituted ds in alice:\n {sf.reveal(sub_data.partitions[alice].data)}")
print(f"substituted ds in bob:\n {sf.reveal(sub_data.partitions[bob].data)}")
orig ds in alice:
             x1        x2        x3
0    -0.514226  0.730010 -0.730391
1    -0.725537  0.482244 -0.823223
2     0.608353 -0.071102 -0.775098
3    -0.686642  0.160470  0.914477
4    -0.198111  0.212909  0.950474
...        ...       ...       ...
9995 -0.367246 -0.296454  0.558596
9996  0.010913  0.629268 -0.384093
9997 -0.238097  0.904069 -0.344859
9998  0.453686 -0.375173  0.899238
9999 -0.776015 -0.772112  0.012110

[10000 rows x 3 columns]
orig ds in bob:
            x18       x19       x20  y
0     0.810261  0.048303  0.937679  1
1     0.312728  0.526637  0.589773  1
2     0.039087 -0.753417  0.516735  0
3    -0.855979  0.250944  0.979465  1
4    -0.238805  0.243109 -0.121446  1
...        ...       ...       ... ..
9995 -0.847253  0.069960  0.786748  1
9996 -0.502486 -0.076290 -0.604832  1
9997 -0.424209  0.434947  0.998955  1
9998  0.914291 -0.473056  0.616257  1
9999 -0.602927 -0.021368  0.885519  0

[10000 rows x 4 columns]
woe_rules for alice:
 {'variables': [{'name': 'x1', 'type': 'numeric', 'split_points': [-0.6048731088638305, -0.2093792676925656, 0.1864844083786014, 0.59245548248291], 'woes': [0.13818949789069251, 0.1043626580338657, 0.012473718947119546, -0.08312553911263658, -0.16055365315128886], 'ivs': [0.003719895277030594, 0.0021358508878795324, 3.104792224414912e-05, 0.001402356358949689, 0.005298781871642081], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}, {'name': 'x2', 'type': 'numeric', 'split_points': [-0.6180597543716427, -0.21352910995483343, 0.18739376068115243, 0.5941788196563724], 'woes': [-0.5795513521445242, -0.17800092651085536, 0.02175062133493428, 0.32061945260518093, 0.5508555713857505], 'ivs': [0.07282166470764677, 0.006530977061248972, 9.424153452104671e-05, 0.019271267842100412, 0.05393057944296773], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}, {'name': 'x3', 'type': 'numeric', 'split_points': [-0.5902724504470824, -0.19980529546737677, 0.2072824716567998, 0.6102998018264773], 'woes': [-0.5371125119817587, -0.25762552591997334, -0.022037294110497735, 0.3445721198562295, 0.6304998785437507], 'ivs': [0.062290057806557865, 0.013846189451494859, 9.751520288052276e-05, 0.022140413942886045, 0.06928418076454097], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}]}
woe_rules for bob:
 {'variables': [{'name': 'x18', 'type': 'numeric', 'split_points': [-0.595701837539673, -0.18646149635314926, 0.20281808376312258, 0.5969645977020259], 'woes': [0.7644870924575128, 0.3796894156855692, 0.09717493242210018, -0.3856750302449858, -0.6258460389655672], 'ivs': [0.0984553650514661, 0.026672043182215357, 0.0018543743703697353, 0.031572379289703925, 0.08527363286990977], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}, {'name': 'x19', 'type': 'numeric', 'split_points': [-0.5988080263137814, -0.2046342611312865, 0.1958462238311768, 0.6044608354568479], 'woes': [-0.24268812281101115, -0.18886157950622262, 0.061543825157264156, 0.15773711862524092, 0.24528753075504478], 'ivs': [0.012260322787780317, 0.0073647296376464335, 0.0007489127774465146, 0.0048277504221347, 0.011464529974064627], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}, {'name': 'x20', 'type': 'numeric', 'split_points': [-0.6013513207435608, -0.2053116083145139, 0.19144065380096467, 0.5987063169479374], 'woes': [1.1083043875152403, 0.5598579367731444, 0.15773711862524092, -0.4618210945247346, -0.9083164208649596], 'ivs': [0.1887116758575296, 0.055586120695474514, 0.0048277504221347, 0.04568212645465593, 0.18279475271010598], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}]}
substituted ds in alice:
             x1        x2        x3
0     0.104363  0.550856 -0.537113
1     0.138189  0.320619 -0.537113
2    -0.160554  0.021751 -0.537113
3     0.138189  0.021751  0.630500
4     0.012474  0.320619  0.630500
...        ...       ...       ...
9995  0.104363 -0.178001  0.344572
9996  0.012474  0.550856 -0.257626
9997  0.104363  0.550856 -0.257626
9998 -0.083126 -0.178001  0.630500
9999  0.138189 -0.579551 -0.022037

[10000 rows x 3 columns]
substituted ds in bob:
            x18       x19       x20  y
0    -0.625846  0.061544 -0.908316  1
1    -0.385675  0.157737 -0.461821  1
2     0.097175 -0.242688 -0.461821  0
3     0.764487  0.157737 -0.908316  1
4     0.379689  0.157737  0.157737  1
...        ...       ...       ... ..
9995  0.764487  0.061544 -0.908316  1
9996  0.379689  0.061544  1.108304  1
9997  0.379689  0.157737 -0.908316  1
9998 -0.625846 -0.188862 -0.908316  1
9999  0.764487  0.061544 -0.908316  0

[10000 rows x 4 columns]

Cleanup#

[13]:
# Clean up temporary files

import os

try:
    os.remove(alice_path)
    os.remove(bob_path)
except OSError:
    pass