secretflow.data#
Classes:
FedNdarray(partitions, partition_way) | Horizontal or vertical partitioned Ndarray.
PartitionWay(value) | The partitioning.
- class secretflow.data.FedNdarray(partitions: Dict[PYU, PYUObject], partition_way: PartitionWay)[source]#
Bases: object
Horizontal or vertical partitioned Ndarray.
- partitions#
References to the local numpy.ndarray objects that make up the federated ndarray (a dict keyed by PYU device, per the constructor signature).
Attributes:
shape | Get shape of united ndarray.
Methods:
Get ndarray shapes of all partitions.
Get ndarray sizes of all partitions.
astype(dtype[, order, casting, subok, copy]) | Cast to a specified type.
__init__(partitions, partition_way)
- partition_way: PartitionWay#
- property shape: Tuple[int, int]#
Get shape of united ndarray.
- astype(dtype, order='K', casting='unsafe', subok=True, copy=True)[source]#
Cast to a specified type.
All arguments are the same as for numpy.ndarray.astype().
- __init__(partitions: Dict[PYU, PYUObject], partition_way: PartitionWay) → None#
- class secretflow.data.PartitionWay(value)[source]#
Bases: Enum
The partitioning. HORIZONTAL: horizontal partitioning. VERTICAL: vertical partitioning.
Attributes:
- HORIZONTAL = 'horizontal'#
- VERTICAL = 'vertical'#
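How the two partitioning modes relate to the shape of the united ndarray can be illustrated without any SecretFlow device. The sketch below is a plain-Python stand-in, not the library's implementation: `united_shape` is a hypothetical helper, the local `PartitionWay` only mirrors the two documented values, and the combination rule (horizontal partitions stack along rows, vertical partitions along columns) is the usual meaning of the terms, assumed here rather than quoted from the source.

```python
from enum import Enum
from typing import List, Tuple


class PartitionWay(Enum):
    """Local stand-in mirroring the two documented enum values."""
    HORIZONTAL = "horizontal"
    VERTICAL = "vertical"


def united_shape(part_shapes: List[Tuple[int, int]], way: PartitionWay) -> Tuple[int, int]:
    """Combine per-party 2-D shapes into the shape of the whole array (assumed rule)."""
    rows = [r for r, _ in part_shapes]
    cols = [c for _, c in part_shapes]
    if way is PartitionWay.HORIZONTAL:
        # Parties hold different rows of the same columns: row counts add up.
        if len(set(cols)) != 1:
            raise ValueError("horizontal partitions must share the column count")
        return sum(rows), cols[0]
    # Parties hold different columns of the same rows: column counts add up.
    if len(set(rows)) != 1:
        raise ValueError("vertical partitions must share the row count")
    return rows[0], sum(cols)


horizontal = united_shape([(2, 3), (4, 3)], PartitionWay.HORIZONTAL)
vertical = united_shape([(2, 3), (2, 5)], PartitionWay.VERTICAL)
print(horizontal, vertical)  # (6, 3) (2, 8)
```

In both cases the result is a single `Tuple[int, int]`, matching the annotated type of the `shape` property.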
secretflow.data.base#
Classes:
DataFrameBase() | Abstract base class for horizontal, vertical and mixed partitioned DataFrame.
Partition([data]) | Slice of data that makes up horizontal, vertical and mixed partitioned DataFrame.
- class secretflow.data.base.DataFrameBase[source]#
Bases: ABC
Abstract base class for horizontal, vertical and mixed partitioned DataFrame.
Methods:
min() | Gets minimum value of all columns.
max() | Gets maximum value of all columns.
count() | Gets number of rows.
values() | Get underlying ndarray.
- class secretflow.data.base.Partition(data: Optional[PYUObject] = None)[source]#
Bases: DataFrameBase
Slice of data that makes up horizontal, vertical and mixed partitioned DataFrame.
Attributes:
values | Returns the underlying ndarray.
index | Returns the index (row labels) of the DataFrame.
dtypes | Returns the dtypes in the DataFrame.
columns | Returns the column labels of the DataFrame.
shape | Returns a tuple representing the dimensionality of the DataFrame.
Methods:
mean(*args, **kwargs) | Returns the mean of the values over the requested axis.
var(*args, **kwargs) | Returns the variance of the values over the requested axis.
std(*args, **kwargs) | Returns the standard deviation of the values over the requested axis.
sem(*args, **kwargs) | Returns the standard error of the mean over the requested axis.
skew(*args, **kwargs) | Returns the skewness over the requested axis.
kurtosis(*args, **kwargs) | Returns the kurtosis over the requested axis.
sum(*args, **kwargs) | Returns the sum of the values over the requested axis.
replace(*args, **kwargs) | Replace values given in to_replace with value.
quantile([q, axis]) | Returns values at the given quantile over requested axis.
min(*args, **kwargs) | Returns the minimum of the values over the requested axis.
mode(*args, **kwargs) | Returns the mode of the values over the requested axis.
max(*args, **kwargs) | Returns the maximum of the values over the requested axis.
count(*args, **kwargs) | Counts non-NA cells for each column or row.
isna() | Detects missing values for an array-like object. Same as pandas.DataFrame.isna. Returns a mask of bool values for each element in the DataFrame that indicates whether the element is an NA value.
pow(*args, **kwargs) | Gets Exponential power of (partition of) dataframe and other, element-wise (binary operator pow).
subtract(*args, **kwargs) | Gets Subtraction of (partition of) dataframe and other, element-wise (binary operator sub).
round(*args, **kwargs) | Round the (partition of) DataFrame to a variable number of decimal places.
select_dtypes(*args, **kwargs) | Returns a subset of the DataFrame's columns based on the column dtypes.
astype(dtype[, copy, errors]) | Cast a pandas object to a specified dtype.
iloc(index) | Integer-location based indexing for selection by position.
drop([labels, axis, index, columns, level, ...]) | See pandas.DataFrame.drop().
fillna([value, method, axis, inplace, ...]) | See pandas.DataFrame.fillna().
rename([mapper, index, columns, axis, copy, ...]) | See pandas.DataFrame.rename().
value_counts(*args, **kwargs) | Return a Series containing counts of unique values.
to_csv(filepath, **kwargs) | Save DataFrame to csv file.
copy() | Shallow copy.
__init__([data])
- mean(*args, **kwargs) → Partition[source]#
Returns the mean of the values over the requested axis.
- Returns:
mean values series.
- Return type:
Partition
- var(*args, **kwargs) → Partition[source]#
Returns the variance of the values over the requested axis.
- Returns:
variance values series.
- Return type:
Partition
- std(*args, **kwargs) → Partition[source]#
Returns the standard deviation of the values over the requested axis.
- Returns:
standard deviation values series.
- Return type:
Partition
- sem(*args, **kwargs) → Partition[source]#
Returns the standard error of the mean over the requested axis.
- Returns:
standard error of the mean series.
- Return type:
Partition
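The variance, standard deviation and standard error of the mean described above are tied together by a simple formula. The sketch below works on a plain list with the standard library only; it assumes the sample (n - 1) convention that pandas applies by default, and it does not touch a Partition or any device.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 6.0]

# Sample variance and standard deviation (n - 1 in the denominator),
# matching the pandas default of ddof=1.
var = statistics.variance(values)
std = statistics.stdev(values)

# Standard error of the mean: sample standard deviation over sqrt(n).
sem = std / math.sqrt(len(values))

print(round(var, 4), round(std, 4), round(sem, 4))  # 2.6667 1.633 0.8165
```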
- skew(*args, **kwargs) → Partition[source]#
Returns the skewness over the requested axis.
- Returns:
skewness series.
- Return type:
Partition
- kurtosis(*args, **kwargs) → Partition[source]#
Returns the kurtosis over the requested axis.
- Returns:
kurtosis series.
- Return type:
Partition
- sum(*args, **kwargs) → Partition[source]#
Returns the sum of the values over the requested axis.
- Returns:
sum values series.
- Return type:
Partition
- replace(*args, **kwargs) → Partition[source]#
Replace values given in to_replace with value. Same as pandas.DataFrame.replace. Values of the DataFrame are replaced with other values dynamically.
- Returns:
A Partition of the same shape with the values replaced.
- Return type:
Partition
- quantile(q=0.5, axis=0) → Partition[source]#
Returns values at the given quantile over requested axis.
- Returns:
quantile values series.
- Return type:
Partition
- min(*args, **kwargs) → Partition[source]#
Returns the minimum of the values over the requested axis.
- Returns:
minimum values series.
- Return type:
Partition
- mode(*args, **kwargs) → Partition[source]#
Returns the mode of the values over the requested axis.
For data protection reasons, only one mode will be returned.
- Returns:
mode values series.
- Return type:
Partition
- max(*args, **kwargs) → Partition[source]#
Returns the maximum of the values over the requested axis.
- Returns:
maximum values series.
- Return type:
Partition
- count(*args, **kwargs) → Partition[source]#
Counts non-NA cells for each column or row.
- Returns:
count values series.
- Return type:
Partition
- isna() → Partition[source]#
Detects missing values for an array-like object. Same as pandas.DataFrame.isna.
- Returns:
DataFrame: Mask of bool values for each element in the DataFrame that indicates whether the element is an NA value.
- pow(*args, **kwargs) → Partition[source]#
Gets the exponential power of (a partition of) the dataframe and other, element-wise (binary operator pow). Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rpow. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Reference:
pd.DataFrame.pow
- subtract(*args, **kwargs) → Partition[source]#
Gets the subtraction of (a partition of) the dataframe and other, element-wise (binary operator sub). Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Reference:
pd.DataFrame.subtract
- round(*args, **kwargs) → Partition[source]#
Round the (partition of) DataFrame to a variable number of decimal places.
- Reference:
pd.DataFrame.round
- select_dtypes(*args, **kwargs) → Partition[source]#
Returns a subset of the DataFrame’s columns based on the column dtypes.
- Reference:
pandas.DataFrame.select_dtypes
- property values#
Returns the underlying ndarray.
- property index#
Returns the index (row labels) of the DataFrame.
- property dtypes#
Returns the dtypes in the DataFrame.
- astype(dtype, copy: bool = True, errors: str = 'raise')[source]#
Cast a pandas object to a specified dtype.
All arguments are the same as for pandas.DataFrame.astype().
- property columns#
Returns the column labels of the DataFrame.
- property shape#
Returns a tuple representing the dimensionality of the DataFrame.
- iloc(index: Union[int, slice, List[int]]) → Partition[source]#
Integer-location based indexing for selection by position.
- Parameters:
index (Union[int, slice, List[int]]) – row index or indices.
- Returns:
Selected DataFrame.
- Return type:
Partition
- drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise') → Optional[Partition][source]#
See pandas.DataFrame.drop().
- fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None) → Optional[Partition][source]#
See pandas.DataFrame.fillna().
- rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore') → Optional[Partition][source]#
See pandas.DataFrame.rename().
secretflow.data.math_utils#
secretflow.data.ndarray#
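Classes:
PartitionWay(value) | The partitioning.
FedNdarray(partitions, partition_way) | Horizontal or vertical partitioned Ndarray.
Functions:
subtract(y1, y2[, spu_device]) | Subtraction of two FedNdarray objects.
load(sources[, partition_way, ...]) | Load FedNdarray from data source.
shuffle(data) | Random shuffle data.
check_same_partition_shapes(a1, a2)
unary_op(handle_function, ...)
mean(y[, spu_device]) | Mean of all elements.
binary_op(handle_function, ...)
get_concat_axis(y)
rss(y1, y2[, spu_device]) | Residual Sum of Squares of all elements.
tss(y[, spu_device]) | Total Sum of Squares of all elements.
mean_squared_error(y_true, y_pred[, spu_device]) | Mean Squared Error of all elements.
root_mean_squared_error(y_true, y_pred[, spu_device]) | Root Mean Squared Error of all elements.
mean_abs_err(y_true, y_pred[, spu_device]) | Mean Absolute Error.
mean_abs_percent_err(y_true, y_pred[, spu_device]) | Mean Absolute Percentage Error.
r2_score(y_true, y_pred[, spu_device]) | R2 Score.
histogram(y[, bins, spu_device]) | Histogram of all elements; a restricted version of the counterpart in numpy.
residual_histogram(y1, y2[, bins, spu_device]) | Histogram of residuals of y1 - y2.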
- class secretflow.data.ndarray.PartitionWay(value)[source]#
Bases: Enum
The partitioning. HORIZONTAL: horizontal partitioning. VERTICAL: vertical partitioning.
Attributes:
- HORIZONTAL = 'horizontal'#
- VERTICAL = 'vertical'#
- class secretflow.data.ndarray.FedNdarray(partitions: Dict[PYU, PYUObject], partition_way: PartitionWay)[source]#
Bases: object
Horizontal or vertical partitioned Ndarray.
- partitions#
References to the local numpy.ndarray objects that make up the federated ndarray (a dict keyed by PYU device, per the constructor signature).
Attributes:
shape | Get shape of united ndarray.
Methods:
Get ndarray shapes of all partitions.
Get ndarray sizes of all partitions.
astype(dtype[, order, casting, subok, copy]) | Cast to a specified type.
__init__(partitions, partition_way)
- partition_way: PartitionWay#
- property shape: Tuple[int, int]#
Get shape of united ndarray.
- astype(dtype, order='K', casting='unsafe', subok=True, copy=True)[source]#
Cast to a specified type.
All arguments are the same as for numpy.ndarray.astype().
- __init__(partitions: Dict[PYU, PYUObject], partition_way: PartitionWay) → None#
- secretflow.data.ndarray.subtract(y1: FedNdarray, y2: FedNdarray, spu_device: Optional[SPU] = None)[source]#
Subtraction of two FedNdarray objects.
- Parameters:
y1 – FedNdarray
y2 – FedNdarray
spu_device – Optional SPU device
- Returns:
Result of the subtraction.
As long as y1 and y2 have the same overall shape, the result is computable; they may have different partition shapes.
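The remark about partition shapes can be made concrete with nested lists. This is a local sketch only: `unite_rows` is a hypothetical helper, the data is row-partitioned by assumption, and no PYU or SPU device is involved, so it says nothing about how the library performs the computation securely.

```python
from typing import List

Matrix = List[List[float]]


def unite_rows(partitions: List[Matrix]) -> Matrix:
    """Stack row-wise (horizontal) partitions into one matrix."""
    return [row for part in partitions for row in part]


# y1 is split into 1 + 2 rows and y2 into 2 + 1 rows: the partition shapes
# differ, but both unite to a 3 x 2 matrix.
y1_parts = [[[10.0, 20.0]], [[30.0, 40.0], [50.0, 60.0]]]
y2_parts = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]

y1, y2 = unite_rows(y1_parts), unite_rows(y2_parts)

# Element-wise subtraction on the united matrices.
diff = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(y1, y2)]
print(diff)  # [[9.0, 18.0], [27.0, 36.0], [45.0, 54.0]]
```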
- secretflow.data.ndarray.load(sources: Dict[PYU, Union[str, Callable[[], ndarray], PYUObject]], partition_way: PartitionWay = PartitionWay.VERTICAL, allow_pickle=False, encoding='ASCII') → FedNdarray[source]#
Load FedNdarray from data source.
Warning
Loading files that contain object arrays uses the pickle module, which is not secure against erroneous or maliciously constructed data. Consider passing allow_pickle=False to load data that is known not to contain object arrays for the safer handling of untrusted sources.
- Parameters:
sources – Data source in each partition. Shall be one of the following: 1) a loaded numpy.ndarray; 2) a local filepath, which should be a .npy or .npz file; 3) a callable that returns a numpy.ndarray.
allow_pickle – Allow loading pickled object arrays stored in npy files.
encoding – What encoding to use when reading Python 2 strings.
- Raises:
TypeError – illegal source.
- Returns:
A FedNdarray if the source is a PYU object or a .npy file, or a dict {key: FedNdarray} if the source is a .npz file.
Examples
>>> # alice and bob are assumed to be existing PYU devices.
>>> fed_arr = load({alice: 'example/alice.npy', bob: 'example/bob.npy'})
- secretflow.data.ndarray.shuffle(data: FedNdarray)[source]#
Random shuffle data.
- Parameters:
data – data to be shuffled.
- secretflow.data.ndarray.check_same_partition_shapes(a1: FedNdarray, a2: FedNdarray)[source]#
- secretflow.data.ndarray.unary_op(handle_function: Callable, len_1_handle_function: Callable, y: FedNdarray, spu_device: Optional[SPU] = None, simulate_double_value_replacer_handle: Optional[Callable] = None)[source]#
- secretflow.data.ndarray.mean(y: FedNdarray, spu_device: Optional[SPU] = None)[source]#
Mean of all elements.
- Parameters:
y – FedNdarray
spu_device – SPU
If y is from a single party, a PYUObject is returned. If y is from multiple parties, an SPU device is required and an SPUObject is returned.
If y is empty, 0 is returned.
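"Mean of all elements" refers to the united array, not to an average of per-party means. The sketch below shows the difference on plain lists; `global_mean` is a hypothetical helper and the computation runs in the clear, whereas the documented function returns a PYUObject or SPUObject.

```python
from typing import List


def global_mean(partitions: List[List[float]]) -> float:
    """Mean over every element held by any party; 0 for empty input."""
    total = sum(sum(p) for p in partitions)
    count = sum(len(p) for p in partitions)
    return total / count if count else 0.0


alice_values = [1.0, 2.0, 3.0]
bob_values = [10.0]

print(global_mean([alice_values, bob_values]))  # 4.0
print(global_mean([]))  # 0.0
```

Averaging the two per-party means (2.0 and 10.0) would give 6.0 instead, which is why the partial sums and element counts are combined before dividing.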
- secretflow.data.ndarray.binary_op(handle_function: Callable, len_1_handle_function: Callable, y1: FedNdarray, y2: FedNdarray, spu_device: Optional[SPU] = None)[source]#
- secretflow.data.ndarray.get_concat_axis(y: FedNdarray) → int[source]#
- secretflow.data.ndarray.rss(y1: FedNdarray, y2: FedNdarray, spu_device: Optional[SPU] = None)[source]#
Residual Sum of Squares of all elements.
More detail on RSS: https://en.wikipedia.org/wiki/Residual_sum_of_squares
- Parameters:
y1 – FedNdarray
y2 – FedNdarray
spu_device – SPU
y1 and y2 must have the same device and partition shapes.
If y1 is from a single party, a PYUObject is returned. If y1 is from multiple parties, an SPU device is required and an SPUObject is returned.
If y1 is empty, 0 is returned.
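For reference, the quantity itself is the sum of squared element-wise differences. A minimal plain-Python sketch follows; the function name reuses `rss` for readability but this is a local illustration on flat lists, not the library function.

```python
from typing import Sequence


def rss(y1: Sequence[float], y2: Sequence[float]) -> float:
    """Residual sum of squares: sum of (y1_i - y2_i) ** 2; 0 for empty input."""
    if len(y1) != len(y2):
        raise ValueError("y1 and y2 must have the same length")
    return sum((a - b) ** 2 for a, b in zip(y1, y2))


result = rss([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(result)  # 5.0
```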
- secretflow.data.ndarray.tss(y: FedNdarray, spu_device: Optional[SPU] = None)[source]#
Total Sum of Squares (proportional to the variance) of all elements.
More detail on TSS: https://en.wikipedia.org/wiki/Total_sum_of_squares
- Parameters:
y – FedNdarray
If y is from a single party, a PYUObject is returned. If y is from multiple parties, an SPU device is required and an SPUObject is returned.
If y is empty, 0 is returned.
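The total sum of squares measures the spread of the values around their own mean. The following sketch computes it on a flat list in plain Python; it is a local illustration of the formula and does not involve any device.

```python
from typing import Sequence


def tss(y: Sequence[float]) -> float:
    """Total sum of squares: sum of (y_i - mean(y)) ** 2; 0 for empty input."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


total = tss([1.0, 2.0, 3.0, 4.0])
print(total)  # 5.0
```

Dividing this value by the number of elements (or by n - 1) gives the population (or sample) variance.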
- secretflow.data.ndarray.mean_squared_error(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#
Mean Squared Error of all elements.
More detail on MSE: https://en.wikipedia.org/wiki/Mean_squared_error
- Parameters:
y_true – FedNdarray
y_pred – FedNdarray
spu_device – SPU
y_true and y_pred must have the same device and partition shapes.
If y_true is from a single party, a PYUObject is returned. If y_true is from multiple parties, an SPU device is required and an SPUObject is returned.
If y_true is empty, 0 is returned.
- secretflow.data.ndarray.root_mean_squared_error(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#
Root Mean Squared Error of all elements.
More detail on RMSE: https://en.wikipedia.org/wiki/Root-mean-square_deviation
- Parameters:
y_true – FedNdarray
y_pred – FedNdarray
spu_device – SPU
y_true and y_pred must have the same device and partition shapes.
If y_true is from a single party, a PYUObject is returned. If y_true is from multiple parties, an SPU device is required and an SPUObject is returned.
If y_true is empty, 0 is returned.
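The mean squared error and its root are related by a single square root. The sketch below shows both on flat lists; the function names mirror the documented ones for readability, but the bodies are a plain-Python illustration written for this page.

```python
import math
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of squared errors; 0 for empty input."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of the mean squared error, in the units of the data."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mse, rmse)  # 4.0 2.0
```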
- secretflow.data.ndarray.mean_abs_err(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#
Mean Absolute Error.
More detail on mean absolute error: https://en.wikipedia.org/wiki/Mean_absolute_error
- Parameters:
y_true – FedNdarray
y_pred – FedNdarray
y_true and y_pred must have the same device and partition shapes.
If y_true is from a single party, a PYUObject is returned. If y_true is from multiple parties, an SPU device is required and an SPUObject is returned.
If y_true is empty, 0 is returned.
- secretflow.data.ndarray.mean_abs_percent_err(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#
Mean Absolute Percentage Error.
More detail on mean absolute percentage error: https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
- Parameters:
y_true – FedNdarray
y_pred – FedNdarray
y_true and y_pred must have the same device and partition shapes.
If y_true is from a single party, a PYUObject is returned. If y_true is from multiple parties, an SPU device is required and an SPUObject is returned.
If y_true is empty, 0 is returned.
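The two absolute-error metrics differ only in whether each error is scaled by the true value. The sketch below computes both on flat lists in plain Python. It reports the percentage error as a fraction; whether the library multiplies by 100 is not stated in this section, so that choice is an assumption of the sketch.

```python
from typing import Sequence


def mean_abs_err(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |y_true_i - y_pred_i|; 0 for empty input."""
    if not y_true:
        return 0.0
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_abs_percent_err(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |y_true_i - y_pred_i| / |y_true_i|, as a fraction."""
    if not y_true:
        return 0.0
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
mae = mean_abs_err(y_true, y_pred)
mape = mean_abs_percent_err(y_true, y_pred)
print(mae, round(mape, 4))  # 10.0 0.0667
```

Note that the percentage form is undefined when any true value is 0; this sketch would raise ZeroDivisionError in that case.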
- secretflow.data.ndarray.r2_score(y_true: FedNdarray, y_pred: FedNdarray, spu_device: Optional[SPU] = None)[source]#
R2 Score.
More detail on the R2 score: https://en.wikipedia.org/wiki/Coefficient_of_determination
- Parameters:
y_true – FedNdarray
y_pred – FedNdarray
y_true and y_pred must have the same device and partition shapes.
If y_true is from a single party, a PYUObject is returned. If y_true is from multiple parties, an SPU device is required and an SPUObject is returned.
If y_true is empty, 0 is returned.
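The coefficient of determination combines the residual and total sums of squares as 1 - RSS / TSS. The sketch below is a plain-Python illustration on flat lists; the empty-input result of 0 follows the note above, while raising on a constant `y_true` is a choice made for this sketch only.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - RSS / TSS."""
    if not y_true:
        return 0.0
    mean = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean) ** 2 for t in y_true)
    if tss == 0:
        # Sketch-only choice: R2 is undefined when y_true is constant.
        raise ValueError("R2 is undefined for constant y_true")
    return 1.0 - rss / tss


y_true = [1.0, 2.0, 3.0, 4.0]
score = r2_score(y_true, [1.5, 2.0, 3.0, 3.5])
perfect = r2_score(y_true, y_true)
print(score, perfect)  # 0.9 1.0
```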
- secretflow.data.ndarray.histogram(y: FedNdarray, bins: int = 10, spu_device: Optional[SPU] = None)[source]#
Histogram of all elements; a restricted version of the counterpart in numpy.
More detail on histogram: https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
- Parameters:
y – FedNdarray
If y is from a single party, a PYUObject is returned. If y is from multiple parties, an SPU device is required and an SPUObject is returned.
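As a reminder of what an integer `bins` argument means in numpy.histogram, the sketch below builds equal-width bins between the minimum and maximum and counts the values in each, with the last bin closed on the right. It is a standard-library illustration of that numpy convention; which parts of numpy's behaviour the restricted SecretFlow version keeps is not specified here.

```python
from typing import List, Sequence, Tuple


def histogram(values: Sequence[float], bins: int = 10) -> Tuple[List[int], List[float]]:
    """Equal-width histogram: returns (counts, bin_edges)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Same convention as numpy: widen a zero-width range by 0.5 each side.
        lo, hi = lo - 0.5, hi + 0.5
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # Values equal to the right-most edge fall into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges


counts, edges = histogram(list(range(11)), bins=5)
print(counts)  # [2, 2, 2, 2, 3]
print(edges)   # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```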
- secretflow.data.ndarray.residual_histogram(y1: FedNdarray, y2: FedNdarray, bins: int = 10, spu_device: Optional[SPU] = None)[source]#
Histogram of residuals of y1 - y2.
Supports a function equivalent to histogram(y1 - y2) even if y1 and y2 have distinct partition shapes.
- Parameters:
y1 – FedNdarray
y2 – FedNdarray
If y1 and y2 are from a single party, a PYUObject is returned. If they are from multiple parties, an SPU device is required and an SPUObject is returned.
secretflow.data.split#
Functions:
train_test_split(data[, test_size, ...]) | Split data into train and test dataset.
- secretflow.data.split.train_test_split(data: Union[VDataFrame, HDataFrame, FedNdarray], test_size=None, train_size=None, random_state=1234, shuffle=True) → Tuple[object, object][source]#
Split data into train and test datasets.
- Parameters:
data – Data to split; supported types are VDataFrame, HDataFrame and FedNdarray.
test_size (float) – test dataset size, default is None.
train_size (float) – train dataset size, default is None.
random_state (int) – Controls the shuffling applied to the data before applying the split.
shuffle (bool) – Whether or not to shuffle the data before splitting, default is True.
- Returns:
splitting – a tuple of length 2 containing the train and test parts of data.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from secretflow.data.base import Partition
>>> from secretflow.data.ndarray import load
>>> from secretflow.data.split import train_test_split
>>> # alice and bob are assumed to be existing PYU devices.
>>> # FedNdarray
>>> alice_arr = alice(lambda: np.array([[1, 2, 3], [4, 5, 6]]))()
>>> bob_arr = bob(lambda: np.array([[11, 12, 13], [14, 15, 16]]))()
>>> fed_arr = load({alice: alice_arr, bob: bob_arr})
>>>
>>> X_train, X_test = train_test_split(
...     fed_arr, test_size=0.33, random_state=42)
...
>>> # VDataFrame (assumed to be imported already; its import path is not shown in this section)
>>> df_alice = pd.DataFrame({'a1': ['K5', 'K1', None, 'K6'],
...                          'a2': ['A5', 'A1', 'A2', 'A6'],
...                          'a3': [5, 1, 2, 6]})
>>> df_bob = pd.DataFrame({'b4': [10.2, 20.5, None, -0.4],
...                        'b5': ['B3', None, 'B9', 'B4'],
...                        'b6': [3, 1, 9, 4]})
>>> vdf = VDataFrame(
...     {alice: Partition(data=alice(lambda: df_alice)()),
...      bob: Partition(data=bob(lambda: df_bob)())})
>>> train_vdf, test_vdf = train_test_split(vdf, test_size=0.33, random_state=42)
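How `test_size`, `random_state` and `shuffle` interact can be sketched with the standard library alone. `split_rows` below is a hypothetical helper, not the library function: it shuffles row indices with a seeded generator and cuts them into two groups. Rounding the test count up mirrors the scikit-learn convention and is an assumption here, since the section does not state the rule SecretFlow applies.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_rows(rows: Sequence, test_size: float,
               random_state: int = 1234, shuffle: bool = True) -> Tuple[List, List]:
    """Split rows into (train, test) using a seeded index permutation."""
    indices = list(range(len(rows)))
    if shuffle:
        # A fixed seed yields the same permutation on every call.
        random.Random(random_state).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)  # assumed rounding rule
    cut = len(rows) - n_test
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = list(range(6))
train, test = split_rows(rows, test_size=0.33, random_state=42)
again = split_rows(rows, test_size=0.33, random_state=42)
ordered = split_rows(rows, test_size=0.33, shuffle=False)
print(len(train), len(test))  # 4 2
print(ordered)  # ([0, 1, 2, 3], [4, 5])
```

Because the permutation depends only on the seed and the row count, every party that holds a column slice of the same rows could apply identical indices, which keeps vertically partitioned data row-aligned after the split.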