Constructing Distributed (Federated) Missing Data Scenarios
In this section, we demonstrate how to construct federated missing data scenarios using the fedimpute.scenario.ScenarioBuilder module.
What Is a Distributed Missing Data Scenario?
A distributed missing data scenario represents a distributed network with a number of clients. Each client holds a local dataset (we consider the horizontal setting, in which all local datasets share the same feature space), and these local datasets contain missing values. The fedimpute.scenario.ScenarioBuilder module constructs all components needed for such a scenario (e.g., client-specific training and test data, and training data with missing values), which other modules then use to conduct federated imputation and prediction.
The input to this module is a dataset given as a <np.ndarray> or <pd.DataFrame>, or a list of datasets (for real federated data), together with a data configuration dictionary data_config.
Details on how to prepare the dataset and the data configuration dictionary are provided in the Data Preparation section.
fedimpute.scenario.ScenarioBuilder provides two approaches for scenario construction that cover the most common research settings in distributed imputation: (1) simulation-based scenarios, where data partitioning and missing values are systematically simulated according to user-specified parameters, and (2) real-world scenarios, where data is naturally partitioned across silos and already contains missing values.
Both approaches output the same standard data components: (1) clients_train_data: client-specific training datasets, (2) clients_train_data_ms: client-specific training datasets with simulated or existing missing values, (3) clients_test_data: client-specific test datasets for local evaluation, and (4) global_test_data: global test dataset for federated evaluation. These structured outputs serve as consistent building blocks for subsequent steps in the fedimpute workflow, including distributed environment initialization, algorithm execution, and performance evaluation.
Scenario Construction Interface
Constructing Simulated Scenario
The fedimpute.scenario.ScenarioBuilder module includes the following core functionalities for constructing a simulated scenario: (1) Data Partition: partition the dataset horizontally across multiple clients. (2) Missing Data Simulation: introduce missing values into each client's dataset. It takes a centralized <np.ndarray> or <pd.DataFrame> dataset and a data configuration as input, performs data partitioning and missing data simulation based on the parameters specified by the user, and outputs the client-specific data components (clients' local training data, test data, etc.).
The following example shows how to construct a simulated scenario. First, initialize the fedimpute.scenario.ScenarioBuilder class, then call the create_simulated_scenario method to simulate the federated missing data scenario.
from fedimpute.scenario import ScenarioBuilder

# data and data_config are prepared as described in the Data Preparation section
scenario_builder = ScenarioBuilder()
scenario_data = scenario_builder.create_simulated_scenario(
    data, data_config, num_clients=4, dp_strategy='iid-even', ms_scenario='mnar-heter'
)
print('Results Structure (Dict Keys):')
print(list(scenario_data.keys()))
scenario_builder.summarize_scenario()
Supported Data Partition Strategies
Data partitioning is controlled by the dp_strategy parameter, which takes the following options:
- iid-even: i.i.d. partition with even sample sizes.
- iid-dir@<alpha>: i.i.d. partition with sample sizes following a Dirichlet distribution; the parameter alpha controls sample size heterogeneity, e.g. iid-dir@0.1.
- niid-dir@<alpha>: non-i.i.d. partition based on selected columns using a Dirichlet distribution; the parameter alpha controls data heterogeneity, e.g. niid-dir@0.1.
- niid-path@<n>: non-i.i.d. partition based on selected columns using a pathological distribution (shard partition); the parameter n controls heterogeneity, e.g. with niid-path@2 each client owns 2 classes of a column's values.
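To make the role of alpha in niid-dir@<alpha> concrete, the following is a minimal sketch of a Dirichlet-based non-i.i.d. partition over a single label column. It is not fedimpute's implementation: the function names dirichlet_proportions and niid_dirichlet_partition are hypothetical, and the class-by-class splitting scheme is the commonly used one, assumed here for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def dirichlet_proportions(num_clients: int, alpha: float, rng: random.Random) -> List[float]:
    """Draw one sample from a symmetric Dirichlet(alpha) via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for very small alpha
        return [1.0 / num_clients] * num_clients
    return [d / total for d in draws]


def niid_dirichlet_partition(
    labels: Sequence[int], num_clients: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Assign sample indices to clients, class by class, using Dirichlet proportions.

    Hypothetical helper for illustration; smaller alpha gives more skewed clients.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        props = dirichlet_proportions(num_clients, alpha, rng)
        # Turn the proportions into cut points over this class's shuffled indices.
        start, cumulative = 0, 0.0
        for client_id, p in enumerate(props):
            cumulative += p
            if client_id == num_clients - 1:
                end = len(indices)  # the last client takes whatever remains
            else:
                end = round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


labels = [i % 3 for i in range(300)]  # toy label column with 3 classes
parts = niid_dirichlet_partition(labels, num_clients=4, alpha=0.1)
print([len(p) for p in parts])  # client sample sizes
```

With a small alpha such as 0.1, most samples of a class tend to land on one or two clients; larger values move the partition closer to an even split.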
Other Parameters for Data Partition
- num_clients (int) - Number of clients to partition the dataset into.
- dp_split_cols (Union[str, int, List[int]]) - Column index or name used to split the data samples. If the column is continuous, it is binned into categories according to dp_reg_bins. Named options:
  - target: split the data samples based on the target column.
  - feature: split the data samples based on the first feature column.
- dp_min_samples (int) - Minimum number of samples in each client.
- dp_max_samples (int) - Maximum number of samples in each client.
- dp_sample_iid_direct (bool) - Instead of partitioning the data i.i.d., sample data i.i.d. from the global population (the original data) for each client.
- dp_local_test_size (float) = 0.1 - Size of the local test set of each client, used for downstream local federated prediction evaluation.
- dp_global_test_size (float) = 0.1 - Size of the global test set, used for downstream federated prediction evaluation.
- dp_local_backup_size (float) = 0.05 - Size of the backup sample kept to avoid all samples in the data being missing.
- dp_reg_bins (int) = 50 - Used for non-i.i.d. data partitioning: if the column used for the non-i.i.d. partition is continuous, it is binned into categories so that the non-i.i.d. partition is meaningful.
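The binning step described for dp_reg_bins can be sketched as follows. This is only an illustration: the helper bin_continuous is hypothetical, and equal-width bins are an assumption, since the text above does not say which binning rule the library uses.

```python
from typing import List, Sequence


def bin_continuous(values: Sequence[float], n_bins: int = 50) -> List[int]:
    """Map continuous values to integer bin ids using equal-width bins (assumed rule)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: everything falls into a single bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [0.0, 1.0, 2.0, 3.0, 4.0]
bins = bin_continuous(ages, n_bins=2)
print(bins)
```

The resulting bin ids can then be treated like class labels by a non-i.i.d. partition strategy.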
Supported Missing Data Mechanism Type
The missing mechanism is set by the ms_mech_type parameter, which supports the three commonly used general types of missing mechanism. The options are shown below. Refer to "How to create missingness in Python?" for more details.
- MCAR Missing Mechanism
  - mcar: missing completely at random, implemented using a purely random mask.
- MAR Missing Mechanism
  - mar_quantile: missing at random based on quantiles of the values of other features.
  - mar_logit: missing at random based on logistic regression on the values of other features.
- MNAR Missing Mechanism
  - mnar_logit: missingness based on the values of the feature itself and other features.
  - mnar_sm_logit: self-masking missingness from logistic regression on the values of the feature itself.
  - mnar_sm_quantile: self-masking missingness based on quantiles of the values of the feature itself.
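The difference between a purely random mask and a self-masking quantile mechanism can be illustrated with a small sketch. The helpers mcar_mask, self_masking_quantile_mask, and apply_mask are hypothetical and only mirror the ideas behind mcar and mnar_sm_quantile; masking the largest values is one assumed direction, not a statement about the library's behavior.

```python
import math
import random
from typing import List, Sequence


def mcar_mask(n: int, missing_ratio: float, rng: random.Random) -> List[bool]:
    """True marks a missing entry; each entry is masked independently of the data."""
    return [rng.random() < missing_ratio for _ in range(n)]


def self_masking_quantile_mask(values: Sequence[float], missing_ratio: float) -> List[bool]:
    """Mask the largest values of a feature, so missingness depends on the value itself."""
    n_missing = round(missing_ratio * len(values))
    # Indices ordered from the largest value to the smallest.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    masked = set(order[:n_missing])
    return [i in masked for i in range(len(values))]


def apply_mask(values: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Replace masked entries with NaN."""
    return [math.nan if m else v for v, m in zip(values, mask)]


feature = [5.0, 1.0, 4.0, 2.0, 3.0]
sm_mask = self_masking_quantile_mask(feature, missing_ratio=0.4)
print(apply_mask(feature, sm_mask))

random_mask = mcar_mask(1000, missing_ratio=0.3, rng=random.Random(0))
print(sum(random_mask) / len(random_mask))  # close to the requested ratio
```

Under the self-masking mask the observed values are systematically smaller than the hidden ones, which is what makes MNAR data harder to impute than MCAR data.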
Missing Data Simulation Parameters
The missing data simulation component simulates missing data in each client's dataset. The core concept is missing data heterogeneity: each client can have different missing data characteristics in terms of missing ratio, missing features, and missing mechanism.
The core parameters for missing data simulation are:
- ms_cols (Union[str, List[int]]) - Features in which to introduce missing values.
  - all: introduce missing values in all features (default).
  - all-num: introduce missing values in all numerical features.
- ms_global_mechanism (bool) - If True, all clients have the same missing data mechanism. If False, each client has a different missing data mechanism. This controls whether the missing data scenario is homogeneous or heterogeneous.
- ms_mr_dist_clients (str) - Missing ratio distribution across clients. The available options:
  - fixed: the missing ratio is the same for all clients.
  - randu: each client's missing ratio is a random float drawn from a uniform distribution.
  - randn: each client's missing ratio is a random float drawn from a normal distribution.
- ms_mf_dist_clients (str) - Missing feature distribution across clients.
  - identity: each client has the same missing features.
- ms_mm_dist_clients (str) - Missing mechanism distribution across clients.
  - identity: each client has the same missing mechanism.
  - random: a random missing mechanism function is used for each client.
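As an illustration of how the ms_mr_dist_clients options could translate into per-client missing ratios, here is a minimal sketch. The function sample_missing_ratios is hypothetical; using the midpoint of the bounds for fixed, and a clipped normal centered on that midpoint for randn, are assumptions made for this example. The default bounds 0.3 and 0.7 are the documented defaults of ms_mr_lower and ms_mr_upper.

```python
import random
from typing import List


def sample_missing_ratios(
    num_clients: int, dist: str, lower: float = 0.3, upper: float = 0.7, seed: int = 0
) -> List[float]:
    """Return one missing ratio per client (illustrative, not fedimpute's code)."""
    rng = random.Random(seed)
    mid = (lower + upper) / 2
    if dist == "fixed":
        # Assumption: every client uses the midpoint of the bounds.
        return [mid] * num_clients
    if dist == "randu":
        return [rng.uniform(lower, upper) for _ in range(num_clients)]
    if dist == "randn":
        # Assumption: normal draws around the midpoint, clipped to the bounds.
        sd = (upper - lower) / 4
        return [min(upper, max(lower, rng.gauss(mid, sd))) for _ in range(num_clients)]
    raise ValueError(f"unknown missing ratio distribution: {dist!r}")


for option in ("fixed", "randu", "randn"):
    print(option, [round(r, 2) for r in sample_missing_ratios(4, option)])
```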
Another parameter, ms_scenario, simplifies the simulation of missing data heterogeneity by providing 5 predefined homogeneous and heterogeneous mechanism settings. It has the following options (note: when this parameter is set, you do not need to specify the parameters above for missing mechanism heterogeneity):
- mcar: MCAR setting
- mar-heter: heterogeneous MAR setting
- mar-homo: homogeneous MAR setting
- mnar-heter: heterogeneous MNAR setting
- mnar-homo: homogeneous MNAR setting
Other Parameters
- ms_mr_lower (float) = 0.3 - Lower bound of the missing ratio.
- ms_mr_upper (float) = 0.7 - Upper bound of the missing ratio.
- ms_mm_funcs_bank (str) = 'lr' - Bank of missing mechanism function directions for the MAR and MNAR mechanisms. It is a string made up of any of the four function types l, r, m, t:
  - l: left side missing
  - r: right side missing
  - m: middle missing
  - t: two sides missing
- ms_mm_strictness (bool) - If True, the missing mechanism function is strict; otherwise it is probabilistic.
- ms_mm_obs (bool) = False - For the MAR mechanism: if True, the missingness is related to some fully observed variables.
- ms_mm_feature_option (str) = 'allk=0.2' - For the MAR and MNAR mechanisms: strategy for selecting the features that the missingness is correlated with. allk=<ratio> selects the k most highly correlated features (k determined by ratio) from all features.
- ms_mm_beta_option (str) = None - Strategy for setting the coefficients of the logistic function for the mar_logit, mnar_sm_logit, and mnar_logit mechanism types.
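One way to read the four direction codes of ms_mm_funcs_bank is as regions of a feature's sorted values that get masked. The sketch below follows that reading in its strict (deterministic) form; direction_mask is a hypothetical helper, and the exact interpretation of l, r, m, and t is an assumption based on their short descriptions above.

```python
from typing import List, Sequence


def direction_mask(values: Sequence[float], direction: str, missing_ratio: float) -> List[bool]:
    """Mask a region of the sorted values: l=lowest, r=highest, m=middle, t=both tails."""
    n = len(values)
    k = round(missing_ratio * n)  # number of entries to mask
    order = sorted(range(n), key=lambda i: values[i])  # indices in ascending value order
    if direction == "l":
        chosen = order[:k]
    elif direction == "r":
        chosen = order[n - k:]
    elif direction == "m":
        start = (n - k) // 2
        chosen = order[start:start + k]
    elif direction == "t":
        left = k // 2
        right = k - left
        chosen = order[:left] + order[n - right:]
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    masked = set(chosen)
    return [i in masked for i in range(n)]


values = [float(v) for v in range(10)]
for code in "lrmt":
    mask = direction_mask(values, code, missing_ratio=0.4)
    print(code, "".join("x" if m else "." for m in mask))
```

A probabilistic variant would replace the hard cut with a missing probability that rises smoothly toward the chosen region.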
Constructing Real Scenario
In certain cases, real data is available that already contains naturally occurring missing values and has well-defined partitions across clients. To handle such cases, the module provides the create_real_scenario() method, which constructs the distributed missing data scenario corresponding to the given data. Unlike simulation-based construction, this method expects the input data as a Python list of <pandas.DataFrame> datasets, where each dataframe represents a client-specific local dataset. The method processes these distributed datasets to generate scenario components in the same standardized format described earlier (e.g., it splits each client's local data into training and test data and constructs a global test dataset for federated prediction), ensuring consistent interfaces for subsequent distributed imputation and evaluation.
Parameters:
- datas (List[pd.DataFrame]): input list of datasets
- data_config (Dict): data configuration
- seed (int): random seed for train-test splitting
- verbose (int): show processing information
Usage:
from fedimpute.data_prep import load_data
from fedimpute.scenario import ScenarioBuilder

# create_real_scenario expects a list of client DataFrames (see above)
data, data_config = load_data("fed_heart_disease")
scenario_builder = ScenarioBuilder()
scenario_data = scenario_builder.create_real_scenario(
    data, data_config,
)
scenario_builder.summarize_scenario()
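The per-client processing described above (split each client's local data into training and test parts, then build a global test set) can be sketched without the library as follows. The function split_clients is hypothetical, rows are plain tuples instead of DataFrames, and pooling the local test sets into the global test set is an assumption; the documentation does not state how the global test data is built. The dictionary keys reuse the component names listed earlier.

```python
import random
from typing import Dict, List, Sequence

Row = Sequence[float]


def split_clients(datas: List[List[Row]], test_size: float = 0.1, seed: int = 0) -> Dict[str, list]:
    """Split each client's rows into train/test and pool the test rows (assumed rule)."""
    rng = random.Random(seed)
    train: List[List[Row]] = []
    test: List[List[Row]] = []
    for rows in datas:
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        n_test = max(1, round(test_size * len(rows)))  # keep at least one test row
        test.append([rows[i] for i in idx[:n_test]])
        train.append([rows[i] for i in idx[n_test:]])
    # Assumption: the global test set is the union of all local test sets.
    global_test = [row for client_test in test for row in client_test]
    return {
        "clients_train_data": train,
        "clients_test_data": test,
        "global_test_data": global_test,
    }


client_a = [(float(i), float(2 * i)) for i in range(20)]
client_b = [(float(i), float(i + 1)) for i in range(30)]
components = split_clients([client_a, client_b], test_size=0.1, seed=42)
print({name: len(value) for name, value in components.items()})
```

Because the real data already contains missing values, no separate masking step appears in this sketch.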
Scenario Exploration and Summary
The module also provides tools for analyzing the data of any given distributed missing data scenario through a collection of visualization and analysis interfaces. Examples of these functions can be found in the tutorials. It includes the following APIs:
- summarize_scenario(log_to_file, file_path) provides a summary report of the scenario data components; users can choose whether to save the summary report to disk.
- visualize_missing_pattern(client_ids: List[int], data_type: str = 'train') visualizes the missing data pattern for client-specific local data.
  - client_ids (List[int]): client ids for which to show the pattern.
  - data_type (str): train or test, to show the pattern for training data or test data.
scenario_builder.visualize_missing_pattern(client_ids=[0, 1, 2, 3])
- visualize_missing_distribution(client_ids: List[int], feature_ids: List[int]) visualizes the distribution of missing and observed values for features within client-specific local data.
  - client_ids (List[int]): client ids for which to show the missing distribution.
  - feature_ids (List[int]): indices of the features whose missing data distribution is shown.
scenario_builder.visualize_missing_distribution(client_ids=[0, 1], feature_ids=[0, 1, 2, 3, 4])
- visualize_data_heterogeneity(client_ids: List[int], distance_method: str = 'swd') visualizes a heatmap to assess cross-client local data heterogeneity.
  - client_ids (List[int]): client ids for which to show the information.
  - distance_method (str): method used to calculate the pairwise client distance.
    - swd: sliced Wasserstein distance over local data.
    - correlation: Euclidean distance calculated on the feature correlation matrices.
scenario_builder.visualize_data_heterogeneity(client_ids=[0, 1, 2, 3], distance_method='swd')
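For readers unfamiliar with the swd option, the following sketch shows a basic Monte Carlo estimate of the sliced Wasserstein distance and the pairwise matrix that a heterogeneity heatmap would display. It is not the library's implementation: sliced_wasserstein and pairwise_distances are hypothetical names, the samples are assumed to be complete (no missing values) and of equal size, and the number of projections is an arbitrary choice.

```python
import math
import random
from typing import List, Sequence

Sample = Sequence[Sequence[float]]


def sliced_wasserstein(x: Sample, y: Sample, n_projections: int = 50, seed: int = 0) -> float:
    """Average 1-D Wasserstein-1 distance over random projection directions."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equally sized samples")
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_projections):
        # Random unit direction on the sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both samples to 1-D and compare their sorted values.
        px = sorted(sum(a * d for a, d in zip(row, direction)) for row in x)
        py = sorted(sum(a * d for a, d in zip(row, direction)) for row in y)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections


def pairwise_distances(clients: List[Sample]) -> List[List[float]]:
    """Distance matrix between all pairs of clients."""
    n = len(clients)
    return [[sliced_wasserstein(clients[i], clients[j]) for j in range(n)] for i in range(n)]


data_rng = random.Random(1)
client_0 = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
client_1 = [[a + 5.0, b + 5.0] for a, b in client_0]  # same shape, shifted distribution
matrix = pairwise_distances([client_0, client_1])
print([[round(v, 2) for v in row] for row in matrix])
```

The diagonal of the matrix is zero and larger off-diagonal entries indicate clients whose local data distributions differ more.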
Predefined ms_scenario settings and their parameter mappings:
- mcar - Missing Completely At Random (MCAR) mechanism.
  ms_mech_type = 'mcar'
  ms_global_mechanism = False
  ms_mr_dist_clients = 'randu-int'
  ms_mm_dist_clients = 'identity'
  ms_mm_beta_option = None
  ms_mm_obs = False
- mar-heter - Missing At Random (MAR) mechanism with heterogeneous missing data scenario.
  ms_mech_type = 'mar_sigmoid'
  ms_global_mechanism = False
  ms_mr_dist_clients = 'randu-int'
  ms_mm_dist_clients = 'identity'
  ms_mm_beta_option = 'randu'
  ms_mm_obs = True
- mar-homo - Missing At Random (MAR) mechanism with homogeneous missing data scenario.
  ms_mech_type = 'mar_sigmoid'
  ms_global_mechanism = True
  ms_mr_dist_clients = 'randu-int'
  ms_mm_dist_clients = 'identity'
  ms_mm_beta_option = 'fixed'
  ms_mm_obs = True
- mnar-heter - Missing Not At Random (MNAR) mechanism with heterogeneous missing data scenario.
  ms_mech_type = 'mnar_sigmoid'
  ms_global_mechanism = False
  ms_mr_dist_clients = 'randu-int'
  ms_mm_dist_clients = 'identity'
  ms_mm_beta_option = 'self'
  ms_mm_obs = False
- mnar-homo - Missing Not At Random (MNAR) mechanism with homogeneous missing data scenario.
  ms_mech_type = 'mnar_sigmoid'
  ms_global_mechanism = True
  ms_mr_dist_clients = 'randu-int'
  ms_mm_dist_clients = 'identity'
  ms_mm_beta_option = 'self'
  ms_mm_obs = False