secretflow.data.vertical
Classes:
| VDataFrame | Federated dataframe that holds vertically partitioned data. |
Functions:
| read_csv | Read a comma-separated values (csv) file into a VDataFrame. |
| to_csv | Write object to a comma-separated values (csv) file. |
- class secretflow.data.vertical.VDataFrame(partitions: Dict[PYU, Partition], aligned: bool = True)
Federated dataframe that holds vertically partitioned data.
This dataframe is designed to provide a federated counterpart of the pandas DataFrame that is used in the same way as pandas. The original data stays with its holder and is not transmitted out of that party's domain while any method executes.
Methods with the prefix partition_ return a dict of {pyu of partition: result of partition}.
- partitions
a dict of pyu and partition.
- aligned
a boolean indicating whether the data is aligned.
- Type:
bool
Examples
>>> from secretflow.data.vertical import read_csv
>>> from secretflow import PYU
>>> alice = PYU('alice')
>>> bob = PYU('bob')
>>> v_df = read_csv({alice: 'alice.csv', bob: 'bob.csv'})
>>> v_df.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')
>>> v_df.mean(numeric_only=True)
sepal_length    5.827693
sepal_width     3.054000
petal_length    3.730000
petal_width     1.198667
dtype: float64
>>> v_df.min(numeric_only=True)
sepal_length    4.3
sepal_width     2.0
petal_length    1.0
petal_width     0.1
dtype: float64
>>> v_df.max(numeric_only=True)
sepal_length    7.9
sepal_width     4.4
petal_length    6.9
petal_width     2.5
dtype: float64
>>> v_df.count()
sepal_length    130
sepal_width     150
petal_length    120
petal_width     150
class           150
dtype: int64
>>> v_df.fillna({'sepal_length': 2})
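As a rough illustration of the vertical partitioning model described above, the following sketch uses plain Python dictionaries instead of SecretFlow devices. The party names, the sample values, and the helpers `partition_columns` and `column_means` are hypothetical; they only mirror the idea that each party keeps its own columns and that a `partition_`-style call returns one result per party.

```python
from statistics import mean

# Hypothetical stand-in for a vertically partitioned table:
# every party holds different columns for the same (aligned) rows.
partitions = {
    "alice": {"sepal_length": [5.1, 4.9, 4.7], "sepal_width": [3.5, 3.0, 3.2]},
    "bob": {"petal_length": [1.4, 1.4, 1.3]},
}


def partition_columns(parts):
    """Return {party: list of column names}, one entry per partition."""
    return {party: list(cols) for party, cols in parts.items()}


def column_means(parts):
    """Compute each column mean inside its owning partition, then merge
    only the per-column aggregates; raw rows are never combined."""
    merged = {}
    for cols in parts.values():
        for name, values in cols.items():
            merged[name] = mean(values)
    return merged


cols_by_party = partition_columns(partitions)
means = column_means(partitions)
```

Because column names are unique across parties, merging the per-party aggregates into one mapping cannot overwrite anything.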
Attributes:
| dtypes | Return the dtypes in the DataFrame. |
| columns | The column labels of the DataFrame. |
| shape | Return a tuple representing the dimensionality of the DataFrame. |
| values | Return a federated Numpy representation of the DataFrame. |
| partition_columns | Returns columns of each partition. |
Methods:
| mode([numeric_only, dropna]) | Return the mode of the values over the axis 0. |
| sum([numeric_only]) | Return the sum of the values over the axis 0. |
| min([numeric_only]) | Return the min of the values over the axis 0. |
| max([numeric_only]) | Return the max of the values over the axis 0. |
| pow(*args, **kwargs) | Gets Exponential power of dataframe and other, element-wise (binary operator pow). |
| round(*args, **kwargs) | Round the DataFrame to a variable number of decimal places. |
| select_dtypes(*args, **kwargs) | Returns a subset of the DataFrame's columns based on the column dtypes. |
| replace(*args, **kwargs) | Replace values given in to_replace with value. |
| subtract(*args, **kwargs) | Gets Subtraction of dataframe and other, element-wise (binary operator sub). |
| astype(dtype[, copy, errors]) | Cast object to a specified dtype. |
| mean([numeric_only]) | Return the mean of the values over the axis 0. |
| var([numeric_only]) | Return the var of the values over the axis 0. |
| std([numeric_only]) | Return the std of the values over the axis 0. |
| sem([numeric_only]) | Return the standard error of the mean over the axis 0. |
| skew([numeric_only]) | Return the skewness over the axis 0. |
| kurtosis([numeric_only]) | Return the kurtosis over the requested axis. |
| quantile([q]) | Returns values at the given quantile over axis 0. |
| count([numeric_only]) | Count non-NA cells for each column. |
| isna() | Detects missing values for an array-like object. Same as pandas.DataFrame.isna. Returns a mask of bool values for each element in the DataFrame that indicates whether the element is an NA value. |
| copy() | Shallow copy of this dataframe. |
| drop([labels, axis, index, columns, level, ...]) | Drop specified labels from rows or columns. |
| fillna([value, method, axis, inplace, ...]) | Fill NA/NaN values using the specified method. |
| to_csv(fileuris, **kwargs) | Write object to a comma-separated values (csv) file. |
Return shapes of each partition.
| __init__(partitions[, aligned]) | |
- aligned: bool = True
- mode(numeric_only=False, dropna=True) -> Series
Return the mode of the values over axis 0. The mode of a set of values is the value that appears most often. For data protection reasons, VDataFrame restricts mode to axis 0.
- Returns:
pd.Series
- sum(numeric_only=False) -> Series
Return the sum of the values over axis 0.
For data protection reasons, VDataFrame restricts sum to axis 0.
- Returns:
pd.Series
- min(numeric_only=False) -> Series
Return the min of the values over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts min to axis 0.
- Returns:
pd.Series
- max(numeric_only=False) -> Series
Return the max of the values over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts max to axis 0.
- Returns:
pd.Series
- pow(*args, **kwargs) -> VDataFrame
Get the exponential power of dataframe and other, element-wise (binary operator pow). Equivalent to dataframe ** other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is rpow. It is one of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.
- Returns:
VDataFrame
- Reference:
pd.DataFrame.pow
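The fill_value behaviour mentioned above can be sketched in plain Python. This follows the documented pandas rule (a value missing on one side is replaced before the computation, and the result stays missing if both sides are missing). The function name `pow_with_fill` and the use of None as the missing marker are assumptions made for illustration, not part of the VDataFrame API.

```python
def pow_with_fill(left, right, fill_value=None):
    """Element-wise left ** right over two equal-length lists.

    None marks a missing value. If fill_value is given, a value missing on
    exactly one side is replaced by it; if both sides are missing, the
    result stays missing.
    """
    result = []
    for a, b in zip(left, right):
        if a is None and b is None:
            result.append(None)
            continue
        if fill_value is not None:
            a = fill_value if a is None else a
            b = fill_value if b is None else b
        if a is None or b is None:
            result.append(None)
        else:
            result.append(a ** b)
    return result


base = [2, None, 3, None]
exponent = [3, 2, None, None]
plain = pow_with_fill(base, exponent)                 # no substitution
filled = pow_with_fill(base, exponent, fill_value=1)  # one-sided gaps become 1
```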
- round(*args, **kwargs) -> VDataFrame
Round the DataFrame to a variable number of decimal places.
- Returns:
a dataframe of the same shape with the values rounded.
- Return type:
VDataFrame
- Reference:
pd.DataFrame.round
- select_dtypes(*args, **kwargs) -> VDataFrame
Returns a subset of the DataFrame's columns based on the column dtypes.
- Reference:
pandas.DataFrame.select_dtypes
- replace(*args, **kwargs) -> VDataFrame
Replace values given in to_replace with value. Same as pandas.DataFrame.replace. Values of the DataFrame are replaced with other values dynamically.
- Returns:
a dataframe of the same shape with the values replaced.
- Return type:
VDataFrame
- subtract(*args, **kwargs) -> VDataFrame
Get the subtraction of dataframe and other, element-wise (binary operator sub). Equivalent to dataframe - other, but with support for substituting a fill_value for missing data in one of the inputs. The reverse version is rsub. It is one of the flexible wrappers (add, sub, mul, div, mod, pow) around the arithmetic operators +, -, *, /, //, %, **.
Note that each partition contains only its own columns.
- Reference:
pd.DataFrame.subtract
- property dtypes: Series
Return the dtypes in the DataFrame.
- Returns:
the data type of each column.
- Return type:
pd.Series
- astype(dtype, copy: bool = True, errors: str = 'raise')
Cast object to a specified dtype.
All arguments are the same as for pandas.DataFrame.astype().
- property columns
The column labels of the DataFrame.
- property shape
Return a tuple representing the dimensionality of the DataFrame.
- mean(numeric_only=False) -> Series
Return the mean of the values over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts mean to axis 0.
- Returns:
pd.Series
- var(numeric_only=False) -> Series
Return the variance of the values over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts var to axis 0.
- Returns:
pd.Series
- std(numeric_only=False) -> Series
Return the standard deviation of the values over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts std to axis 0.
- Returns:
pd.Series
- sem(numeric_only=False) -> Series
Return the standard error of the mean over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts sem to axis 0.
- Returns:
pd.Series
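For reference, the standard error of the mean of a single column is its sample standard deviation divided by the square root of the number of values. The sketch below assumes the n - 1 (sample) denominator, which is the pandas default; whether VDataFrame uses exactly the same convention is an assumption here, and the helper name `sem` and the sample column are illustrative only.

```python
import math
from statistics import stdev


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n).

    Uses the sample standard deviation (n - 1 in the denominator).
    """
    return stdev(values) / math.sqrt(len(values))


column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
result = sem(column)
```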
- skew(numeric_only=False) -> Series
Return the skewness over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts skew to axis 0.
- Returns:
pd.Series
- kurtosis(numeric_only=False) -> Series
Return the kurtosis over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts kurtosis to axis 0.
- Returns:
pd.Series
- quantile(q=0.5) -> Series
Return the values at the given quantile over axis 0.
Note: columns containing None values are ignored. Fill missing values before proceeding.
For data protection reasons, VDataFrame restricts quantile to axis 0.
- Returns:
pd.Series
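A per-column quantile can be sketched as follows. The linear interpolation between the two nearest ranks matches the pandas default; that VDataFrame interpolates the same way is an assumption, and the standalone `quantile` helper is hypothetical.

```python
import math


def quantile(values, q=0.5):
    """Quantile with linear interpolation between the two nearest ranks."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q  # fractional index into the sorted data
    lower = math.floor(pos)
    upper = math.ceil(pos)
    weight = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


column = [4.3, 7.9, 5.0, 6.1]
median = quantile(column)                # q defaults to 0.5
first_quartile = quantile(column, 0.25)
```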
- count(numeric_only=False) -> Series
Count non-NA cells for each column.
For data protection reasons, VDataFrame restricts count to axis 0.
- Returns:
pd.Series
- isna() -> VDataFrame
Detects missing values for an array-like object. Same as pandas.DataFrame.isna.
Returns a mask of bool values for each element in the DataFrame that indicates whether the element is an NA value.
- Returns:
VDataFrame
- Reference:
pd.DataFrame.isna
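The relationship between isna and count can be shown with a small stand-in table. The sketch uses a plain dict of lists with None as the missing marker; the helper names `isna_mask` and `count_non_na` are hypothetical and do not exist in SecretFlow.

```python
def isna_mask(columns):
    """Return a same-shaped mapping of booleans; True marks a missing cell."""
    return {name: [v is None for v in values] for name, values in columns.items()}


def count_non_na(columns):
    """Count the non-missing cells in every column."""
    return {name: sum(v is not None for v in values) for name, values in columns.items()}


table = {"sepal_length": [5.1, None, 4.7], "class": ["a", "b", "c"]}
mask = isna_mask(table)
counts = count_non_na(table)
```

This is why the counts in the class example differ per column: a column with missing cells reports fewer non-NA values than the number of rows.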
- property values
Return a federated Numpy representation of the DataFrame.
- Returns:
FedNdarray.
- copy() -> VDataFrame
Shallow copy of this dataframe.
- Returns:
VDataFrame.
- drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') -> Optional[VDataFrame]
Drop specified labels from rows or columns.
All arguments are the same as for pandas.DataFrame.drop().
- Returns:
VDataFrame without the removed index or column labels, or None if inplace=True.
- fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) -> Optional[VDataFrame]
Fill NA/NaN values using the specified method.
All arguments are the same as for pandas.DataFrame.fillna().
- Returns:
VDataFrame with missing values filled, or None if inplace=True.
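The dict form of `value`, as used in `v_df.fillna({'sepal_length': 2})`, fills only the named columns. A minimal sketch of that behaviour on a plain dict of lists follows; the standalone `fillna` function and the sample table are hypothetical and only cover the scalar and dict forms of `value`.

```python
def fillna(columns, value):
    """Return a new table with None replaced.

    `value` is either a scalar applied to every column or a dict that maps
    column names to their own fill value.
    """
    filled = {}
    for name, values in columns.items():
        if isinstance(value, dict):
            if name not in value:
                filled[name] = list(values)  # column not listed: leave as is
                continue
            fill = value[name]
        else:
            fill = value
        filled[name] = [fill if v is None else v for v in values]
    return filled


table = {"sepal_length": [5.1, None], "petal_length": [None, 1.4]}
result = fillna(table, {"sepal_length": 2})
```

The input table is left untouched, which corresponds to the inplace=False case.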
- to_csv(fileuris: Dict[PYU, str], **kwargs)
Write object to a comma-separated values (csv) file.
- Parameters:
fileuris – a dict of file uris specifying the file for each PYU.
kwargs – other arguments are the same as for pandas.DataFrame.to_csv().
- Returns:
A list of PYUObjects whose value is None. You can use secretflow.wait to wait for the save to complete.
- property partition_columns
Returns columns of each partition.
- Returns:
a dict of {pyu: columns}
- secretflow.data.vertical.read_csv(filepath: Dict[PYU, str], delimiter=',', dtypes: Optional[Dict[PYU, Dict[str, type]]] = None, spu: Optional[SPU] = None, keys: Optional[Union[str, List[str], Dict[Device, List[str]]]] = None, drop_keys: Optional[Union[str, List[str], Dict[Device, List[str]]]] = None, psi_protocl=None, no_header: bool = False) -> VDataFrame
Read a comma-separated values (csv) file into a VDataFrame.
When spu and keys are specified, the fields named by keys are used for PSI alignment. Fields used for alignment must be common to all parties, and other fields must not be repeated across parties. If spu and keys are not specified, the data of each party is assumed to be pre-aligned.
- Parameters:
filepath – The file path of each party. It can be a local file with a relative or absolute path, or a remote file starting with oss:// or http(s)://. E.g.
{ PYU('alice'): 'alice.csv', PYU('bob'): 'bob.csv' }
delimiter – the file separator.
dtypes – The field types of each participant. They are inferred from the file if not specified. E.g.
{ PYU('alice'): {'uid': str, 'age': np.int32}, PYU('bob'): {'uid': str, 'score': np.float32} }
spu – SPU device, used for PSI data alignment. The data of all parties is assumed to be pre-aligned if not specified.
keys – The field or fields used for PSI. This parameter is required when spu is specified.
drop_keys – The key or keys to remove. This parameter is required when spu is specified, since VDataFrame does not allow duplicate column names.
psi_protocl – The protocol to use for PSI. Defaults to 'KKRT_PSI_2PC' for 2 parties and 'ECDH_PSI_3PC' for 3 parties.
- Returns:
An aligned VDataFrame.
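To make the effect of keys and drop_keys more concrete, the sketch below aligns two parties' rows on a shared key and then drops that key. It is deliberately not PSI: a plain set intersection reveals the keys to whoever runs it, whereas the real alignment is performed privately on the SPU. The function `align_on_keys`, the sample rows, and the assumption that keys are unique within each party are all hypothetical.

```python
def align_on_keys(parties, key, drop_key=True):
    """Keep only rows whose key appears in every party and order them by key.

    WARNING: this is a plain set intersection, not PSI; it reveals the keys
    and only illustrates what the aligned result looks like.
    """
    common = set.intersection(*(set(r[key] for r in rows) for rows in parties.values()))
    aligned = {}
    for party, rows in parties.items():
        kept = sorted((r for r in rows if r[key] in common), key=lambda r: r[key])
        if drop_key:
            # Drop the shared key so column names stay unique across parties.
            kept = [{k: v for k, v in r.items() if k != key} for r in kept]
        aligned[party] = kept
    return aligned


parties = {
    "alice": [{"uid": "u1", "age": 30}, {"uid": "u2", "age": 41}, {"uid": "u3", "age": 25}],
    "bob": [{"uid": "u3", "score": 0.7}, {"uid": "u1", "score": 0.9}, {"uid": "u4", "score": 0.1}],
}
aligned = align_on_keys(parties, "uid")
```

After alignment, row i of every party refers to the same entity, which is the property the aligned flag of VDataFrame describes.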
- secretflow.data.vertical.to_csv(df: VDataFrame, file_uris: Dict[PYU, str], **kwargs)
Write object to a comma-separated values (csv) file.
- Parameters:
df – the VDataFrame to save.
file_uris – the file path of each PYU.
kwargs – all other arguments are the same as for pandas.DataFrame.to_csv().
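The one-file-per-party layout can be illustrated with the standard csv module. This is only a local sketch: in SecretFlow each party writes its own partition on its own device, while here the hypothetical helper `write_partitions` writes every file from one process into a temporary directory.

```python
import csv
import os
import tempfile


def write_partitions(partitions, file_uris):
    """Write every party's columns to that party's own CSV file."""
    for party, columns in partitions.items():
        names = list(columns)
        rows = zip(*(columns[n] for n in names))  # column lists -> row tuples
        with open(file_uris[party], "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(names)
            writer.writerows(rows)


partitions = {
    "alice": {"age": [30, 25]},
    "bob": {"score": [0.9, 0.7]},
}
out_dir = tempfile.mkdtemp()
file_uris = {party: os.path.join(out_dir, f"{party}.csv") for party in partitions}
write_partitions(partitions, file_uris)

# Read one file back to check the header and the rows.
with open(file_uris["alice"], newline="") as f:
    alice_rows = list(csv.reader(f))
```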
secretflow.data.vertical.dataframe
Classes:
| VDataFrame | Federated dataframe that holds vertically partitioned data. |
- class secretflow.data.vertical.dataframe.VDataFrame(partitions: Dict[PYU, Partition], aligned: bool = True)
Federated dataframe that holds vertically partitioned data.
secretflow.data.vertical.io
Functions:
| read_csv | Read a comma-separated values (csv) file into a VDataFrame. |
| to_csv | Write object to a comma-separated values (csv) file. |
- secretflow.data.vertical.io.read_csv(filepath: Dict[PYU, str], delimiter=',', dtypes: Optional[Dict[PYU, Dict[str, type]]] = None, spu: Optional[SPU] = None, keys: Optional[Union[str, List[str], Dict[Device, List[str]]]] = None, drop_keys: Optional[Union[str, List[str], Dict[Device, List[str]]]] = None, psi_protocl=None, no_header: bool = False) -> VDataFrame
Read a comma-separated values (csv) file into a VDataFrame.
When spu and keys are specified, the fields named by keys are used for PSI alignment. Fields used for alignment must be common to all parties, and other fields must not be repeated across parties. If spu and keys are not specified, the data of each party is assumed to be pre-aligned.
- secretflow.data.vertical.io.to_csv(df: VDataFrame, file_uris: Dict[PYU, str], **kwargs)
Write object to a comma-separated values (csv) file.