Imputation Models

Non-NN Based Imputer

class BaseMLImputer(name: str, model_persistable: bool)

Abstract base class for non-NN-based imputers used in the federated imputation environment.

Methods

  • get_imp_model_params Return model parameters

  • set_imp_model_params Set model parameters

  • initialize Initialize the imputer (statistics, imputation models, etc.)

  • fit Fit imputer to train local imputation models

  • impute Impute missing values using an imputation model

  • get_fit_res

  • save_model Save the imputer model

  • load_model Load the imputer model

method BaseMLImputer.initialize(X: np.array, missing_mask: np.array, data_utils: dict, params: dict, seed: int) -> None

Initialize the imputer (statistics, imputation models, etc.)

Parameters

  • X : np.array data with initial imputed values

  • missing_mask : np.array missing mask of data

  • data_utils : dict data utils dictionary - contains information about data

  • params : dict params for initialization

  • seed : int seed for randomization

method BaseMLImputer.fit(X: np.array, y: np.array, missing_mask: np.array, params: dict) -> dict

Fit imputer to train local imputation models

Parameters

  • X : np.array float numpy array of features

  • y : np.array numpy array of target

  • missing_mask : np.array missing mask

  • params : dict parameters for local training

method BaseMLImputer.impute(X: np.array, y: np.array, missing_mask: np.array, params: dict) -> np.ndarray

Impute missing values using an imputation model

Parameters

  • X : np.array numpy array of features

  • y : np.array numpy array of target

  • missing_mask : np.array missing mask

  • params : dict parameters for imputation

Returns

  • np.ndarray imputed data - numpy array - same dimension as X
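The call order implied by the methods above (initialize, then fit, then impute) can be sketched with a toy class. This is a minimal sketch, not the library's code: ConstantImputer is a hypothetical stand-in that does not inherit from BaseMLImputer, the keys of the dict returned by fit are invented for illustration, and the mask is assumed to be a boolean array in which True marks a missing entry.

```python
import numpy as np


class ConstantImputer:
    """Hypothetical stand-in mirroring the documented method names only."""

    def __init__(self, fill_value: float = 0.0):
        self.fill_value = fill_value
        self.seed = None

    def initialize(self, X, missing_mask, data_utils, params, seed):
        # A constant fill has no statistics to prepare; only the seed is kept.
        self.seed = seed

    def fit(self, X, y, missing_mask, params):
        # Nothing is learned here; the returned keys are illustrative.
        return {"sample_size": X.shape[0]}

    def impute(self, X, y, missing_mask, params):
        # Return a copy with the same shape as X, filled where the mask is True.
        X_imp = X.copy()
        X_imp[missing_mask] = self.fill_value
        return X_imp


X = np.array([[1.0, np.nan], [3.0, 4.0]])
mask = np.isnan(X)  # assumption: True marks a missing entry

imputer = ConstantImputer(fill_value=-1.0)
imputer.initialize(X, mask, data_utils={}, params={}, seed=0)
fit_res = imputer.fit(X, y=None, missing_mask=mask, params={})
X_imp = imputer.impute(X, y=None, missing_mask=mask, params={})
```

Observed entries pass through unchanged, and the output keeps the dimensions of X, as the Returns entry above requires.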

Mean

class SimpleImputer(strategy: str = 'mean')

Bases : BaseMLImputer

Simple imputer class for imputing missing values in data using simple strategies such as the mean or median.

Attributes

  • strategy : str strategy for imputation - mean, median etc.

  • mean_params : np.array mean parameters for imputation

  • model_type : str type of the model - numpy or sklearn

  • model_persistable : bool whether model is persistable or not

  • name : str name of the imputer

Methods

  • get_imp_model_params

  • set_imp_model_params

  • initialize

  • fit

  • impute

  • get_fit_res
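For the 'mean' strategy, the role of mean_params can be illustrated with plain NumPy. This is a sketch under assumptions rather than the class's implementation: the helper names column_means and mean_impute are hypothetical, and the mask is assumed to be boolean with True marking a missing entry.

```python
import numpy as np


def column_means(X, missing_mask):
    """Per-column mean over observed entries only (hypothetical helper)."""
    observed = np.where(missing_mask, np.nan, X)
    return np.nanmean(observed, axis=0)


def mean_impute(X, missing_mask, mean_params):
    """Replace masked entries with the matching column mean."""
    # mean_params has shape (n_features,) and broadcasts across rows.
    return np.where(missing_mask, mean_params, X)


X = np.array([[1.0, 10.0],
              [3.0, 0.0],
              [0.0, 30.0]])
mask = np.array([[False, False],
                 [False, True],
                 [True, False]])

mean_params = column_means(X, mask)
X_imp = mean_impute(X, mask, mean_params)
```

The placeholder values stored under the mask do not influence the means, because they are excluded before averaging. A column with no observed entries would yield NaN here and would need separate handling.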

EM

class EMImputer(clip: bool = True, use_y: bool = False)

Bases : BaseMLImputer, ICEImputerMixin

EM imputer class for imputing missing values in data using Expectation Maximization algorithm.

Attributes

  • clip bool - whether to clip the imputed values

  • use_y bool - whether to use target variable in imputation

  • min_values np.array - minimum values for clipping

  • max_values np.array - maximum values for clipping

  • data_utils_info dict - information about data

  • seed int - seed for randomization

  • name str = 'em' - name of the imputer

  • model_type str = 'simple' - type of the imputer: 'simple' (not neural network based) or 'nn' (neural network based)

  • mu np.array - mean of the data

  • sigma np.array - covariance matrix of the data

  • miss np.array - missing values indices

  • obs np.array - observed values indices

  • model_persistable bool - whether model is persistable or not

Methods

  • initialize Initialize the imputer (statistics, imputation models, etc.)

  • set_imp_model_params

  • get_imp_model_params

  • fit Fit the imputer on the data.

  • impute Impute the missing values in the data.

  • get_fit_res
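Given a mean mu and covariance sigma, the expectation step of EM under a multivariate normal model fills each row's missing entries with their conditional mean given the observed ones. The sketch below shows that single step for one row. It is an illustration under assumptions, not the class's code: the function name is hypothetical, and miss is taken to be a boolean vector, whereas the attributes above describe miss and obs as indices.

```python
import numpy as np


def conditional_mean_impute(x, miss, mu, sigma):
    """Fill x[miss] with E[x_miss | x_obs] under N(mu, sigma)."""
    obs = ~miss
    x_imp = x.copy()
    if miss.any() and obs.any():
        sigma_oo = sigma[np.ix_(obs, obs)]    # observed-observed block
        sigma_mo = sigma[np.ix_(miss, obs)]   # missing-observed block
        # mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
        x_imp[miss] = mu[miss] + sigma_mo @ np.linalg.solve(
            sigma_oo, x[obs] - mu[obs]
        )
    elif miss.any():
        # Nothing observed in this row: fall back to the marginal mean.
        x_imp[miss] = mu[miss]
    return x_imp


mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
x = np.array([2.0, np.nan])

x_imp = conditional_mean_impute(x, np.isnan(x), mu, sigma)
```

With a correlation of 0.5 and an observed value of 2.0, the missing coordinate is filled with 1.0. A full EM loop would alternate this step with re-estimating mu and sigma from the completed data.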

ICE

class LinearICEImputer(estimator_num: str = 'ridge_cv', estimator_cat: str = 'ridge', mm_model: str = 'logistic', mm_model_params=None, clip: bool = True, use_y: bool = False)

Bases : BaseMLImputer, ICEImputerMixin

Linear ICE imputer class for imputing missing values in data using linear models.

Attributes

  • estimator_num : str estimator for numerical columns

  • estimator_cat : str estimator for categorical columns

  • mm_model : str missing mechanism model

  • mm_model_params : dict missing mechanism model parameters

  • clip : bool whether to clip the imputed values

  • use_y : bool whether to use target variable in imputation

  • imp_models : list list of imputation models

  • mm_model missing mechanism model

  • data_utils_info : dict information about data

  • seed : int seed for randomization

  • model_type : str type of the imputer, defaults to 'sklearn'

  • model_persistable : bool whether model is persistable or not, defaults to False

  • name : str name of the imputer, defaults to 'linear_ice'

Methods

  • initialize Initialize the imputer (statistics, imputation models, etc.)

  • set_imp_model_params

  • get_imp_model_params

  • fit Fit imputer to train local imputation models

  • impute Impute missing values using an imputation model

  • save_model

  • load_model

  • get_fit_res
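Reading ICE as imputation by chained equations, the idea of one linear model per column can be sketched with a closed-form ridge regression in NumPy. This is a sketch under assumptions, not the class's implementation: the function names, the n_rounds and alpha parameters, and the mean-based starting fill are all hypothetical, and only numerical columns are handled.

```python
import numpy as np


def ridge_fit(A, b, alpha):
    """Closed-form ridge regression with an unpenalised intercept."""
    A1 = np.hstack([A, np.ones((A.shape[0], 1))])
    penalty = alpha * np.eye(A1.shape[1])
    penalty[-1, -1] = 0.0  # leave the intercept unpenalised
    return np.linalg.solve(A1.T @ A1 + penalty, A1.T @ b)


def ridge_predict(A, coef):
    return A @ coef[:-1] + coef[-1]


def chained_ridge_impute(X, missing_mask, n_rounds=5, alpha=1.0):
    """Cycle over columns, regressing each on all the others."""
    X_imp = X.copy()
    # Start from column means computed on observed entries.
    col_means = np.nanmean(np.where(missing_mask, np.nan, X), axis=0)
    X_imp[missing_mask] = np.take(col_means, np.nonzero(missing_mask)[1])
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            miss_j = missing_mask[:, j]
            if not miss_j.any() or miss_j.all():
                continue  # nothing to impute, or nothing to train on
            others = np.delete(X_imp, j, axis=1)
            coef = ridge_fit(others[~miss_j], X_imp[~miss_j, j], alpha)
            X_imp[miss_j, j] = ridge_predict(others[miss_j], coef)
    return X_imp


rng = np.random.default_rng(0)
x0 = rng.normal(size=50)
X_true = np.column_stack([x0, 2.0 * x0 + 1.0])
missing_mask = np.zeros_like(X_true, dtype=bool)
missing_mask[0, 1] = True
X_obs = np.where(missing_mask, np.nan, X_true)

X_imp = chained_ridge_impute(X_obs, missing_mask, n_rounds=3, alpha=1e-6)
```

Because the second column is an exact linear function of the first, the imputed entry lands very close to the held-out true value. Real data would not be recovered this precisely.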

MissForest

class MissForestImputer(n_estimators: int = 200, bootstrap: bool = True, n_jobs: int = 2, clip: bool = True, use_y: bool = False)

Bases : BaseMLImputer, ICEImputerMixin

MissForest imputer class for the federated imputation environment

Attributes

  • n_estimators : int number of trees in the forest

  • bootstrap : bool whether bootstrap samples are used when building trees

  • n_jobs : int number of jobs to run in parallel

  • clip : bool whether to clip the imputed values

  • use_y : bool whether to use target values for imputation

  • imp_models : list list of imputation models

  • mm_model : object model for missing mask imputation

  • data_utils_info : dict data utils information

  • seed : int seed for randomization

  • model_type : str type of the model, defaults to 'sklearn'

  • model_persistable : bool whether the model is persistable, defaults to False

  • name : str name of the imputer, defaults to 'missforest'

Methods

  • initialize

  • set_imp_model_params

  • get_imp_model_params

  • fit

  • impute

  • save_model

  • load_model

  • get_fit_res

NN Based Imputer

class BaseNNImputer()

Abstract base class for NN-based imputers used in the federated imputation environment.

Methods

  • get_imp_model_params Return model parameters

  • set_imp_model_params Set model parameters

  • initialize Initialize the imputer (statistics, imputation models, etc.)

  • configure_model Fetch model for training

  • configure_optimizer Configure optimizer for training

  • impute Impute missing values using an imputation model

  • save_model Save the imputer model

  • load_model Load the imputer model

method BaseNNImputer.initialize(X: np.array, missing_mask: np.array, data_utils: dict, params: dict, seed: int) -> None

Initialize the imputer (statistics, imputation models, etc.)

Parameters

  • X : np.array data with initial imputed values

  • missing_mask : np.array missing mask of data

  • data_utils : dict data utils dictionary - contains information about data

  • params : dict params for initialization

  • seed : int seed for randomization

method BaseNNImputer.configure_model(params: dict, X: np.ndarray, y: np.ndarray, missing_mask: np.ndarray) -> Tuple[torch.nn.Module, torch.utils.data.DataLoader]

Fetch model for training

Parameters

  • params : dict parameters for training

  • X : np.ndarray imputed data

  • y : np.ndarray target

  • missing_mask : np.ndarray missing mask

Returns

  • Tuple[torch.nn.Module, torch.utils.data.DataLoader] model, train_dataloader

method BaseNNImputer.impute(X: np.array, y: np.array, missing_mask: np.array, params: dict) -> np.ndarray

Impute missing values using an imputation model

Parameters

  • X : np.array numpy array of features

  • y : np.array numpy array of target

  • missing_mask : np.array missing mask

  • params : dict parameters for imputation

Returns

  • np.ndarray imputed data - numpy array - same dimension as X

GAIN

class GAINImputer(h_dim: int = 20, n_layers: int = 2, activation: str = 'relu', initializer: str = 'kaiming', loss_alpha: float = 10, hint_rate: float = 0.9, clip: bool = True, batch_size: int = 256, learning_rate: int = 0.001, weight_decay: int = 0.0001, scheduler: str = 'step', optimizer: str = 'sgd')

Bases : BaseNNImputer, JMImputerMixin

GAIN imputer class for imputing missing values in data using Generative Adversarial Imputation Networks.

Attributes

  • h_dim : int dimension of hidden layers

  • n_layers : int number of layers

  • activation : str activation function

  • initializer : str initializer for weights

  • loss_alpha : float alpha parameter for loss

  • hint_rate : float hint rate for loss

  • clip : bool whether to clip the imputed values

  • batch_size : int batch size for training

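  • learning_rate : float learning rate for optimizer (annotated as int in the signature; the default 0.001 is a float)

  • weight_decay : float weight decay for optimizer (annotated as int in the signature; the default 0.0001 is a float)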

  • scheduler : str scheduler for optimizer

  • optimizer : str optimizer for training

  • scheduler_params : dict scheduler parameters

Methods

  • initialize

  • get_imp_model_params

  • set_imp_model_params

  • configure_model

  • configure_optimizer

  • impute
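The hint_rate attribute can be illustrated with the hint mechanism used in GAIN, where the discriminator is shown a random subset of the true observation pattern. The sketch below follows the variant in which the hint is the elementwise product of a Bernoulli(hint_rate) draw and the observed-entry indicator. It is an assumption that this class samples hints the same way; the function name sample_hint is hypothetical, and missing_mask is taken to be boolean with True marking a missing entry.

```python
import numpy as np


def sample_hint(missing_mask, hint_rate, rng):
    """Reveal each observed entry to the discriminator with prob. hint_rate."""
    observed = 1.0 - missing_mask.astype(float)  # 1 where a value is observed
    reveal = (rng.uniform(size=missing_mask.shape) < hint_rate).astype(float)
    # Missing entries always get a 0 hint; observed ones get 1 when revealed.
    return reveal * observed


rng = np.random.default_rng(0)
missing_mask = rng.uniform(size=(1000, 4)) < 0.3
hint = sample_hint(missing_mask, hint_rate=0.9, rng=rng)
```

A higher hint_rate gives the discriminator more of the true pattern, which makes its task easier and changes the signal the generator receives.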

MIWAE

class MIWAEImputer(name: str = 'miwae', latent_size: int = 5, n_hidden: int = 16, n_hidden_layers: int = 2, out_dist='studentt', K: int = 20, L: int = 100, activation='tanh', initializer='xavier', clip: bool = True, batch_size: int = 256, learning_rate: int = 0.001, weight_decay: int = 0.0001, scheduler: str = 'step', optimizer: str = 'sgd')

Bases : BaseNNImputer, JMImputerMixin

MIWAE imputer class for imputing missing values in data using the missing-data importance-weighted autoencoder (MIWAE).

Attributes

  • name : str name of the imputer

  • clip : bool whether to clip the imputed values

  • latent_size : int size of the latent space

  • n_hidden : int number of hidden units

  • n_hidden_layers : int number of hidden layers

  • out_dist : str output distribution

  • K : int number of samples

  • L : int number of MCMC samples

  • activation : str activation function

  • initializer : str initializer for weights

  • batch_size : int batch size for training

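  • learning_rate : float learning rate for optimizer (annotated as int in the signature; the default 0.001 is a float)

  • weight_decay : float weight decay for optimizer (annotated as int in the signature; the default 0.0001 is a float)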

  • scheduler : str scheduler for optimizer

  • optimizer : str optimizer for training

Methods

  • get_imp_model_params

  • set_imp_model_params

  • initialize

  • configure_model

  • configure_optimizer

  • fit

  • impute
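In the MIWAE method, a single imputation is formed by self-normalised importance sampling: several candidate completions are drawn and averaged with weights proportional to their importance ratios. The sketch below shows only that weighting step in NumPy. It assumes the candidate samples and their log importance weights have already been produced by a trained model, and the function name and array shapes are hypothetical rather than taken from this class.

```python
import numpy as np


def importance_weighted_impute(X, missing_mask, samples, log_weights):
    """Combine candidate completions with self-normalised weights.

    samples:     (n_samples, n_rows, n_features) candidate completions
    log_weights: (n_samples, n_rows) unnormalised log importance weights
    """
    # Subtract the per-row maximum before exponentiating for stability.
    shifted = log_weights - log_weights.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    estimate = np.einsum("ln,lnd->nd", weights, samples)
    # Keep observed entries; use the weighted estimate only where missing.
    return np.where(missing_mask, estimate, X)


X = np.array([[1.0, 0.0]])
missing_mask = np.array([[False, True]])
samples = np.array([[[1.0, 0.0]],
                    [[1.0, 10.0]]])
log_weights = np.log(np.array([[1.0], [3.0]]))

X_imp = importance_weighted_impute(X, missing_mask, samples, log_weights)
```

Here the two candidates receive normalised weights 0.25 and 0.75, so the missing entry becomes 7.5 while the observed entry is left untouched.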

NOTMIWAE

class NotMIWAEImputer(latent_size: int = 5, n_hidden: int = 16, n_hidden_layers: int = 2, out_dist='studentt', K: int = 20, L: int = 100, activation='tanh', initializer='xavier', mask_net_type: str = 'linear', clip: bool = True, batch_size: int = 256, learning_rate: int = 0.001, weight_decay: int = 0.0001, scheduler: str = 'step', optimizer: str = 'sgd')

Bases : BaseNNImputer, JMImputerMixin

not-MIWAE imputer class for imputing missing values in data using not-MIWAE, a variant of the missing-data importance-weighted autoencoder that also models the missing mechanism.

Attributes

  • name : str name of the imputer

  • clip : bool whether to clip the imputed values

  • latent_size : int size of the latent space

  • n_hidden : int number of hidden units

  • n_hidden_layers : int number of hidden layers

  • out_dist : str output distribution

  • K : int number of samples

  • L : int number of MCMC samples

  • activation : str activation function

  • initializer : str initializer for weights

  • batch_size : int batch size for training

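  • learning_rate : float learning rate for optimizer (annotated as int in the signature; the default 0.001 is a float)

  • weight_decay : float weight decay for optimizer (annotated as int in the signature; the default 0.0001 is a float)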

  • scheduler : str scheduler for optimizer

  • optimizer : str optimizer for training

Methods

  • get_imp_model_params

  • set_imp_model_params

  • initialize

  • configure_model

  • configure_optimizer

  • fit

  • impute

GNR

class GNRImputer(latent_size: int = 5, n_hidden: int = 16, n_hidden_layers: int = 2, K: int = 20, L: int = 100, activation='tanh', initializer='xavier', loss_coef=10, mr_loss_coef: bool = True, clip: bool = True, batch_size: int = 256, learning_rate: int = 0.001, weight_decay: int = 0.0001, scheduler: str = 'step', optimizer: str = 'sgd')

Bases : BaseNNImputer, JMImputerMixin

GNR imputer class for imputing missing values in data using a deep generative model.

Attributes

  • name : str name of the imputer

  • clip : bool whether to clip the imputed values

  • latent_size : int size of the latent space

  • n_hidden : int number of hidden units

  • n_hidden_layers : int number of hidden layers

  • out_dist : str output distribution

  • K : int number of samples

  • L : int number of MCMC samples

  • activation : str activation function

  • initializer : str initializer for weights

  • batch_size : int batch size for training

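  • learning_rate : float learning rate for optimizer (annotated as int in the signature; the default 0.001 is a float)

  • weight_decay : float weight decay for optimizer (annotated as int in the signature; the default 0.0001 is a float)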

  • scheduler : str scheduler for optimizer

  • optimizer : str optimizer for training

Methods

  • get_imp_model_params

  • set_imp_model_params

  • initialize

  • configure_model

  • configure_optimizer

  • fit

  • impute