Data Preprocessing#
The following code is a demo only. Please do not use it in production directly.
It is recommended to run this tutorial with Jupyter.
SecretFlow provides a variety of preprocessing utilities to process data.
Preparation#
Initialize SecretFlow and create two parties, alice and bob.
💡 Before using preprocessing, you may want to learn about the SecretFlow DataFrame first.
[ ]:
import secretflow as sf
# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob'], address='local')
alice = sf.PYU('alice')
bob = sf.PYU('bob')
Data Preparation#
We use iris as the example dataset.
[2]:
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
data = pd.concat([iris.data, iris.target], axis=1)
# In order to facilitate the subsequent display,
# here we first set some data to None.
data.iloc[1, 1] = None
data.iloc[100, 1] = None
# Restore target to its original name.
data['target'] = data['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica' })
data
[2]:
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | NaN | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
Create a vertical DataFrame.
[3]:
import tempfile
from secretflow.data.vertical import read_csv as v_read_csv
# Vertical partitioning.
v_alice, v_bob = data.iloc[:, :2], data.iloc[:, 2:]
# Save to temporary files.
_, alice_path = tempfile.mkstemp()
_, bob_path = tempfile.mkstemp()
v_alice.to_csv(alice_path, index=False)
v_bob.to_csv(bob_path, index=False)
df = v_read_csv({alice: alice_path, bob: bob_path})
Alternatively, you can create a horizontal DataFrame; the subsequent operations work the same for both horizontal and vertical DataFrames.
[4]:
# from secretflow.data.horizontal import read_csv as h_read_csv
# from secretflow.security.aggregation import PlainAggregator
# from secretflow.security.compare import PlainComparator
# # Horizontal partitioning.
# h_alice, h_bob = data.iloc[:70, :], data.iloc[70:, :]
# # Save to temporary files.
# _, h_alice_path = tempfile.mkstemp()
# _, h_bob_path = tempfile.mkstemp()
# h_alice.to_csv(h_alice_path, index=False)
# h_bob.to_csv(h_bob_path, index=False)
# df = h_read_csv(
# {alice: h_alice_path, bob: h_bob_path},
# aggregator=PlainAggregator(alice),
# comparator=PlainComparator(alice),
# )
Preprocessing#
SecretFlow provides multiple preprocessing utilities, including missing value filling, standardization, categorical feature encoding, and discretization. Their usage is consistent with sklearn's preprocessing.
Missing Value Filling#
DataFrame provides the fillna method, which can fill missing values just like pandas does.
[5]:
# Before filling, the sepal width (cm) is missing in two positions.
df.count()['sepal width (cm)']
[5]:
148
[6]:
# Fill sepal width (cm) with 10.
df.fillna(value={'sepal width (cm)': 10}).count()['sepal width (cm)']
[6]:
150
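The fillna semantics above can be sketched in plain Python, using None to stand in for NaN. This is a simplified single-column illustration with hypothetical helper names, not the SecretFlow implementation:

```python
def count_non_missing(values):
    """Count entries that are present, mirroring DataFrame.count()."""
    return sum(v is not None for v in values)

def fillna(values, fill_value):
    """Replace missing entries with fill_value, mirroring DataFrame.fillna()."""
    return [fill_value if v is None else v for v in values]

sepal_width = [3.5, None, 3.2, 3.1, None, 3.6]
print(count_non_missing(sepal_width))              # 4 before filling
print(count_non_missing(fillna(sepal_width, 10)))  # 6 after filling
```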
Standardization#
Scaling features to a range#
SecretFlow provides MinMaxScaler for scaling features to lie between a given minimum and maximum value. Both the input and output of MinMaxScaler are DataFrames.
Below is an example of scaling sepal length (cm) to the [0, 1] range.
[7]:
from secretflow.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_sepal_len = scaler.fit_transform(df['sepal length (cm)'])
print('Min: ', scaled_sepal_len.min())
print('Max: ', scaled_sepal_len.max())
Min: sepal length (cm) 0.0
dtype: float64
Max: sepal length (cm) 1.0
dtype: float64
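Under the hood, min-max scaling maps each value x to (x - min) / (max - min), which is why the result spans exactly [0, 1]. A plain-Python sketch of a single column (illustration only, not the SecretFlow implementation):

```python
def min_max_scale(values):
    """Scale values into [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([1.0, 2.0, 3.0]))  # [0.0, 0.5, 1.0]
```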
Variance Scaling#
SecretFlow provides StandardScaler for standardizing features by removing the mean and scaling to unit variance. Both the input and output of StandardScaler are DataFrames.
Below is an example of variance scaling on sepal length (cm).
[8]:
from secretflow.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_sepal_len = scaler.fit_transform(df['sepal length (cm)'])
print('Min: ', scaled_sepal_len.min())
print('Max: ', scaled_sepal_len.max())
Min: sepal length (cm) -1.870024
dtype: float64
Max: sepal length (cm) 2.492019
dtype: float64
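The output above corresponds to the transform z = (x - mean) / std, with the population standard deviation as in sklearn's StandardScaler. A plain-Python sketch (illustration only, not the SecretFlow implementation):

```python
import math

def standardize(values):
    """Transform values to zero mean and unit variance: (x - mean) / std."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

print(standardize([1.0, 2.0, 3.0]))  # roughly [-1.2247, 0.0, 1.2247]
```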
Categorical Feature Encoding#
One-hot Encoding#
SecretFlow provides OneHotEncoder for one-hot encoding. Both the input and output of OneHotEncoder are DataFrames.
Below is an example of one-hot encoding the target column.
[9]:
from secretflow.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder()
onehot_target = onehot_encoder.fit_transform(df['target'])
print('Columns: ', onehot_target.columns)
print('Min: \n', onehot_target.min())
print('Max: \n', onehot_target.max())
Columns: Index(['target_setosa', 'target_versicolor', 'target_virginica'], dtype='object')
Min:
target_setosa 0.0
target_versicolor 0.0
target_virginica 0.0
dtype: float64
Max:
target_setosa 1.0
target_versicolor 1.0
target_virginica 1.0
dtype: float64
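The column names and 0/1 values above follow the usual one-hot construction: one column per category, with 1.0 where the row's value matches that category. A plain-Python sketch (the `one_hot` helper is hypothetical, not a SecretFlow API):

```python
def one_hot(values):
    """Return one-hot column names and rows for a categorical column."""
    categories = sorted(set(values))
    # Mirrors the 'target_*' column naming in the output above.
    columns = [f'target_{c}' for c in categories]
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return columns, rows

cols, rows = one_hot(['setosa', 'virginica', 'setosa'])
print(cols)  # ['target_setosa', 'target_virginica']
print(rows)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```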
Label Encoding#
SecretFlow provides LabelEncoder for encoding a label column to integers in [0, number of classes - 1]. Both the input and output of LabelEncoder are DataFrames.
Below is an example of label encoding the target column.
[10]:
from secretflow.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
encoded_label = label_encoder.fit_transform(df['target'])
print('Columns: ', encoded_label.columns)
print('Min: \n', encoded_label.min())
print('Max: \n', encoded_label.max())
Columns: Index(['target'], dtype='object')
Min:
target 0
dtype: int64
Max:
target 2
dtype: int64
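Label encoding simply maps each distinct category to an integer code in [0, number of classes - 1]. A plain-Python sketch (illustration only, assuming codes are assigned in sorted category order as sklearn does):

```python
def label_encode(values):
    """Map each category to an integer code in [0, number of classes - 1]."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

print(label_encode(['versicolor', 'setosa', 'virginica']))  # [1, 0, 2]
```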
Discretization#
SecretFlow provides KBinsDiscretizer for partitioning continuous data into discrete values. Both the input and output of KBinsDiscretizer are DataFrames.
Below is an example of partitioning petal length (cm) into 5 bins.
[11]:
from secretflow.preprocessing import KBinsDiscretizer
estimator = KBinsDiscretizer(n_bins=5)
binned_petal_len = estimator.fit_transform(df['petal length (cm)'])
print('Min: \n', binned_petal_len.min())
print('Max: \n', binned_petal_len.max())
Min:
petal length (cm) 0.0
dtype: float64
Max:
petal length (cm) 4.0
dtype: float64
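With sklearn's default quantile strategy, KBinsDiscretizer places roughly the same number of samples in each bin and outputs the bin index, which is why the result ranges over 0.0 to 4.0 for 5 bins. A simplified plain-Python sketch of quantile binning (illustration only, using a naive index-based quantile; not the SecretFlow implementation):

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin index in [0, n_bins - 1] with ~equal-count bins."""
    ordered = sorted(values)
    # Naive split points at the k/n_bins quantiles.
    edges = [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]
    return [sum(v >= e for e in edges) for v in values]

print(quantile_bins(list(range(10)), 5))  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```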
WOE Encoding#
SecretFlow provides VertWoeBinning, which bins features by quantile or by the ChiMerge method and computes the WOE and IV values of each bin. VertWOESubstitution can then replace feature values with their WOE values.
Below is an example of WOE encoding features.
[14]:
# WOE binning uses an SPU or HEU device to protect the label.
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
# Only binary classification label datasets are supported for now.
# Use the linear dataset as an example.
from secretflow.utils.simulation.datasets import load_linear
vdf = load_linear(parts={alice: (1, 4), bob: (18, 22)})
print(f"orig ds in alice:\n {sf.reveal(vdf.partitions[alice].data)}")
print(f"orig ds in bob:\n {sf.reveal(vdf.partitions[bob].data)}")
from secretflow.preprocessing.binning.vert_woe_binning import VertWoeBinning
binning = VertWoeBinning(spu)
woe_rules = binning.binning(
vdf,
binning_method="quantile",
bin_num=5,
bin_names={alice: ["x1", "x2", "x3"], bob: ["x18", "x19", "x20"]},
label_name="y",
)
print(f"woe_rules for alice:\n {sf.reveal(woe_rules[alice])}")
print(f"woe_rules for bob:\n {sf.reveal(woe_rules[bob])}")
from secretflow.preprocessing.binning.vert_woe_substitution import VertWOESubstitution
woe_sub = VertWOESubstitution()
sub_data = woe_sub.substitution(vdf, woe_rules)
print(f"substituted ds in alice:\n {sf.reveal(sub_data.partitions[alice].data)}")
print(f"substituted ds in bob:\n {sf.reveal(sub_data.partitions[bob].data)}")
orig ds in alice:
x1 x2 x3
0 -0.514226 0.730010 -0.730391
1 -0.725537 0.482244 -0.823223
2 0.608353 -0.071102 -0.775098
3 -0.686642 0.160470 0.914477
4 -0.198111 0.212909 0.950474
... ... ... ...
9995 -0.367246 -0.296454 0.558596
9996 0.010913 0.629268 -0.384093
9997 -0.238097 0.904069 -0.344859
9998 0.453686 -0.375173 0.899238
9999 -0.776015 -0.772112 0.012110
[10000 rows x 3 columns]
orig ds in bob:
x18 x19 x20 y
0 0.810261 0.048303 0.937679 1
1 0.312728 0.526637 0.589773 1
2 0.039087 -0.753417 0.516735 0
3 -0.855979 0.250944 0.979465 1
4 -0.238805 0.243109 -0.121446 1
... ... ... ... ..
9995 -0.847253 0.069960 0.786748 1
9996 -0.502486 -0.076290 -0.604832 1
9997 -0.424209 0.434947 0.998955 1
9998 0.914291 -0.473056 0.616257 1
9999 -0.602927 -0.021368 0.885519 0
[10000 rows x 4 columns]
woe_rules for alice:
{'variables': [{'name': 'x1', 'type': 'numeric', 'split_points': [-0.6048731088638305, -0.2093792676925656, 0.1864844083786014, 0.59245548248291], 'woes': [0.13818949789069251, 0.1043626580338657, 0.012473718947119546, -0.08312553911263658, -0.16055365315128886], 'ivs': [0.003719895277030594, 0.0021358508878795324, 3.104792224414912e-05, 0.001402356358949689, 0.005298781871642081], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}, {'name': 'x2', 'type': 'numeric', 'split_points': [-0.6180597543716427, -0.21352910995483343, 0.18739376068115243, 0.5941788196563724], 'woes': [-0.5795513521445242, -0.17800092651085536, 0.02175062133493428, 0.32061945260518093, 0.5508555713857505], 'ivs': [0.07282166470764677, 0.006530977061248972, 9.424153452104671e-05, 0.019271267842100412, 0.05393057944296773], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}, {'name': 'x3', 'type': 'numeric', 'split_points': [-0.5902724504470824, -0.19980529546737677, 0.2072824716567998, 0.6102998018264773], 'woes': [-0.5371125119817587, -0.25762552591997334, -0.022037294110497735, 0.3445721198562295, 0.6304998785437507], 'ivs': [0.062290057806557865, 0.013846189451494859, 9.751520288052276e-05, 0.022140413942886045, 0.06928418076454097], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}]}
woe_rules for bob:
{'variables': [{'name': 'x18', 'type': 'numeric', 'split_points': [-0.595701837539673, -0.18646149635314926, 0.20281808376312258, 0.5969645977020259], 'woes': [0.7644870924575128, 0.3796894156855692, 0.09717493242210018, -0.3856750302449858, -0.6258460389655672], 'ivs': [0.0984553650514661, 0.026672043182215357, 0.0018543743703697353, 0.031572379289703925, 0.08527363286990977], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}, {'name': 'x19', 'type': 'numeric', 'split_points': [-0.5988080263137814, -0.2046342611312865, 0.1958462238311768, 0.6044608354568479], 'woes': [-0.24268812281101115, -0.18886157950622262, 0.061543825157264156, 0.15773711862524092, 0.24528753075504478], 'ivs': [0.012260322787780317, 0.0073647296376464335, 0.0007489127774465146, 0.0048277504221347, 0.011464529974064627], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}, {'name': 'x20', 'type': 'numeric', 'split_points': [-0.6013513207435608, -0.2053116083145139, 0.19144065380096467, 0.5987063169479374], 'woes': [1.1083043875152403, 0.5598579367731444, 0.15773711862524092, -0.4618210945247346, -0.9083164208649596], 'ivs': [0.1887116758575296, 0.055586120695474514, 0.0048277504221347, 0.04568212645465593, 0.18279475271010598], 'total_counts': [2000, 2000, 2000, 2000, 2000], 'else_woe': -0.7620562438001163, 'else_iv': 6.385923806287768e-05, 'else_counts': 0}]}
substituted ds in alice:
x1 x2 x3
0 0.104363 0.550856 -0.537113
1 0.138189 0.320619 -0.537113
2 -0.160554 0.021751 -0.537113
3 0.138189 0.021751 0.630500
4 0.012474 0.320619 0.630500
... ... ... ...
9995 0.104363 -0.178001 0.344572
9996 0.012474 0.550856 -0.257626
9997 0.104363 0.550856 -0.257626
9998 -0.083126 -0.178001 0.630500
9999 0.138189 -0.579551 -0.022037
[10000 rows x 3 columns]
substituted ds in bob:
x18 x19 x20 y
0 -0.625846 0.061544 -0.908316 1
1 -0.385675 0.157737 -0.461821 1
2 0.097175 -0.242688 -0.461821 0
3 0.764487 0.157737 -0.908316 1
4 0.379689 0.157737 0.157737 1
... ... ... ... ..
9995 0.764487 0.061544 -0.908316 1
9996 0.379689 0.061544 1.108304 1
9997 0.379689 0.157737 -0.908316 1
9998 -0.625846 -0.188862 -0.908316 1
9999 0.764487 0.061544 -0.908316 0
[10000 rows x 4 columns]
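For reference, the per-bin statistics in the rules above follow the standard definitions: WOE_i = ln((pos_i / pos_total) / (neg_i / neg_total)) and IV_i = (pos_i / pos_total - neg_i / neg_total) * WOE_i. Sign conventions vary between libraries; the sketch below is only the statistical definition in plain Python, not the SecretFlow protocol:

```python
import math

def woe_iv(bin_indices, labels):
    """Compute per-bin (WOE, IV); labels are 1 (positive) / 0 (negative)."""
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    stats = {}
    for b in sorted(set(bin_indices)):
        pos = sum(1 for bi, y in zip(bin_indices, labels) if bi == b and y == 1)
        neg = sum(1 for bi, y in zip(bin_indices, labels) if bi == b and y == 0)
        pos_rate, neg_rate = pos / pos_total, neg / neg_total
        woe = math.log(pos_rate / neg_rate)
        stats[b] = (woe, (pos_rate - neg_rate) * woe)
    return stats

# A bin where positives and negatives are balanced has WOE 0 and contributes no IV.
print(woe_iv([0, 0, 1, 1], [1, 0, 1, 0]))  # {0: (0.0, 0.0), 1: (0.0, 0.0)}
```

Note that this naive version assumes every bin contains both classes; real implementations smooth or special-case empty counts.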
Ending#
[13]:
# Clean up temporary files
import os
try:
os.remove(alice_path)
os.remove(bob_path)
except OSError:
pass