10  Imputation Lab

In this lab, we will implement some of the techniques we have looked at in class for dealing with missing data. You will be using a dataset on paua, or abalone. The dataset includes the following variables:

Table 10.1: Abalone Dataset Variables

Name            Data Type   Measurement Unit   Description
Type            nominal     --                 M, F, and I (infant)
Length          continuous  mm                 Longest shell measurement
Diameter        continuous  mm                 Perpendicular to length
Height          continuous  mm                 With meat in shell
Whole.weight    continuous  grams              Whole abalone
Shucked.weight  continuous  grams              Weight of meat
Viscera.weight  continuous  grams              Gut weight (after bleeding)
Shell.weight    continuous  grams              After being dried
Rings           integer     --                 +1.5 gives the age in years

Start by downloading the dataset.

import pandas as pd
import numpy as np
from scipy import stats

# URL of the Abalone dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"

# Column names for the dataset
column_names = ["Type", "Length", "Diameter", "Height", "Whole weight", "Shucked weight", "Viscera weight", "Shell weight", "Rings"]

# Load the dataset into a Pandas DataFrame
abalone_data = pd.read_csv(url, header=None, names=column_names)

# Display the first few rows of the dataset
print(abalone_data.head())
  Type  Length  Diameter  Height  Whole weight  Shucked weight  \
0    M   0.455     0.365   0.095        0.5140          0.2245   
1    M   0.350     0.265   0.090        0.2255          0.0995   
2    F   0.530     0.420   0.135        0.6770          0.2565   
3    M   0.440     0.365   0.125        0.5160          0.2155   
4    I   0.330     0.255   0.080        0.2050          0.0895   

   Viscera weight  Shell weight  Rings  
0          0.1010         0.150     15  
1          0.0485         0.070      7  
2          0.1415         0.210      9  
3          0.1140         0.155     10  
4          0.0395         0.055      7  

We will investigate the age of the paua, so let's start by randomly dropping about 10% of the data. We will set a random seed so that our output is repeatable for debugging purposes, and make a copy of the original data in which 400 of the 4,177 recorded values of the Rings variable (just under 10%) are replaced with NAs.

data_missing = abalone_data.copy()
n_missing = 400

# Fix the seed so the pattern of missingness is reproducible
rng = np.random.default_rng(345678)

# Choose 400 rows at random (without replacement) and blank out Rings
missing_ind = rng.choice(data_missing.index, size=n_missing,
  replace=False)
data_missing.loc[missing_ind, 'Rings'] = np.nan
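As a quick sanity check (a one-liner sketch), confirm that exactly n_missing values are now missing:

# Should print 400
print(data_missing['Rings'].isna().sum())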

10.1 Complete Data

Start by computing a 95% confidence interval for the mean age of the paua in the population using only the complete (non-missing) data (be careful to account for the relationship between age and number of rings).

# .mean() skips NaNs by default; +1.5 converts rings to age in years
compl_mean = data_missing['Rings'].mean() + 1.5

# stats.sem does not skip NaNs, so drop them explicitly first
compl_sem = stats.sem(data_missing['Rings'].dropna())
ci = stats.norm.interval(0.95, loc=compl_mean,
  scale=compl_sem)

print(f"Complete data CI: {ci[0]:.3f}, {ci[1]:.3f}")
Complete data CI: 11.332, 11.539

10.2 Mean Imputation

Next, implement simple mean imputation to estimate the mean age: make a copy of the Rings column in data_missing, replace the NAs with the column mean, and recompute the 95% confidence interval. My output is below.

Mean imputation CI: 11.342, 11.529
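
One possible implementation that reproduces this output is sketched below (imp_simple is an illustrative name; note that filling with the mean leaves the point estimate unchanged but artificially shrinks the standard error):

# Copy the column and fill NAs with the mean of the observed values
imp_simple = data_missing['Rings'].copy()
imp_simple = imp_simple.fillna(imp_simple.mean())

mean_imp_mean = imp_simple.mean() + 1.5
mean_imp_sem = stats.sem(imp_simple)
ci = stats.norm.interval(0.95, loc=mean_imp_mean,
  scale=mean_imp_sem)

print(f"Mean imputation CI: {ci[0]:.3f}, {ci[1]:.3f}")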

10.3 Hotdeck Imputation

Now try using random draws, conditioning on Type. We can do this by looping over the paua types and replacing the missing values for each type with an appropriately sized draw, with replacement, from the non-missing values for that type. Reset the seed so that you get consistent output when re-running just this part.

rng = np.random.default_rng(678910)
imp_data = data_missing['Rings'].copy()

# Impute within each Type group (M, F, I)
for t in data_missing['Type'].unique():
  this_missing = ***A***
  num_draws = ***B***
  non_missing = data_missing.loc[imp_data.notna() & (data_missing['Type'] == t), 'Rings']
  hot_deck = ***C***
  imp_data[this_missing] = hot_deck

hd_imp_mean = imp_data.mean() + 1.5
hd_imp_sem = stats.sem(imp_data)
ci = stats.norm.interval(0.95, loc=hd_imp_mean,
  scale=hd_imp_sem)

print(f"Hot deck CI: {ci[0]:.3f}, {ci[1]:.3f}")

Note: code fragment ***A*** produces a Boolean mask that is True where the corresponding row is of type t and has missing data, fragment ***B*** returns the number of missing values for type t, and fragment ***C*** performs the appropriate draw from the non-missing data (using the function rng.choice). The output is shown below:

Hot deck CI: 11.346, 11.542
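
If you want to check your work, one possible set of completions is sketched below (a sketch only; other equivalent expressions are fine):

# ***A***: mask rows of type t whose Rings value is missing
this_missing = imp_data.isna() & (data_missing['Type'] == t)
# ***B***: the number of values to impute for this type
num_draws = this_missing.sum()
# ***C***: draw that many values, with replacement, from the observed ones
hot_deck = rng.choice(non_missing, size=num_draws, replace=True)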

10.4 Bootstrap Replication

After implementing a couple of standard imputation techniques, let's try bootstrap replication. We will generate 200 bootstrap replicates, impute the missing data in each via random draws conditioning on Type, and thus include the effect of the missing data in our 95% CI for the mean age of the population. In the code that follows, missing code fragment ***A*** samples the appropriate number of row indices with replacement, to be used to build the bootstrap replicate dataframe. Fragments ***B***, ***C*** and ***D*** then repeat the hotdeck imputation steps from the previous section; one possible set of completions is sketched after the code block.

rng = np.random.default_rng(891011)
num_boots = 200
boot_reps = np.zeros(num_boots)

for i in range(num_boots):
  this_boot_inds = ***A***
  this_boot = data_missing.loc[this_boot_inds].reset_index()

  for t in this_boot['Type'].unique():
    this_missing = ***B***
    num_draws = ***C***
    non_missing = this_boot.loc[this_boot['Rings'].notna() & (this_boot['Type'] == t), 'Rings']
    hot_deck = ***D***
    this_boot.loc[this_missing, 'Rings'] = hot_deck

  boot_reps[i] = this_boot['Rings'].mean() + 1.5
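
One possible set of completions (again a sketch; ***B***-***D*** mirror the hotdeck completions, but operate on this_boot):

# ***A***: resample row labels with replacement, one per original row
this_boot_inds = rng.choice(data_missing.index, size=len(data_missing), replace=True)
# ***B***: mask rows of type t whose Rings value is missing in this replicate
this_missing = this_boot['Rings'].isna() & (this_boot['Type'] == t)
# ***C***: the number of values to impute for this type
num_draws = this_missing.sum()
# ***D***: draw that many values, with replacement, from the observed ones
hot_deck = rng.choice(non_missing, size=num_draws, replace=True)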

To compute our confidence interval, we will use the mean of the bootstrapped estimates as our actual estimate, and build our confidence interval from that value using the quantiles of the bootstrap replicates, as shown in class.

bs_imp_mean = boot_reps.mean()

# Basic (empirical) bootstrap: reflect the replicate quantiles around the
# point estimate; the subtraction means the 0.975 quantile yields the
# lower endpoint of the interval
emp_ci = bs_imp_mean * 2 - np.quantile(boot_reps, [0.975, 0.025])
print(f"Empirical bootstrap CI: {emp_ci[0]:.3f}, {emp_ci[1]:.3f}")
Empirical bootstrap CI: 11.347, 11.562

Alternatively, we can build a confidence interval directly from our bootstrap replicates based on the normal distribution:

# The standard deviation of the bootstrap replicates is itself an estimate
# of the standard error of the mean, so it is used directly as the scale
norm_ci = stats.norm.interval(0.95, loc=bs_imp_mean, scale=boot_reps.std())

print(f"Normal bootstrap CI: {norm_ci[0]:.3f}, {norm_ci[1]:.3f}")
Normal bootstrap CI: 11.336, 11.561

10.5 Multiple Imputation

Our final approach to building a confidence interval for the mean of the variable with missing data is multiple imputation. We will condition on Type and create 500 imputed datasets (each missing value is imputed once per dataset), building a 95% CI for the mean age of the population that reflects the uncertainty due to the missing data.

rng = np.random.default_rng(101112)
n_imps = 500
mi_est = np.zeros(n_imps)   # point estimate from each imputation
mi_var = np.zeros(n_imps)   # estimated variance of that estimate

for i in range(n_imps):
  imp_data = data_missing['Rings'].copy()
  for t in data_missing['Type'].unique():
    this_missing = ***A***
    num_draws = ***B***
    non_missing = data_missing.loc[imp_data.notna() & (data_missing['Type'] == t), 'Rings']
    hot_deck = ***C***
    imp_data[this_missing] = hot_deck

  mi_est[i] = ***D***
  mi_var[i] = ***E***

Note that code fragments ***A***, ***B*** and ***C*** again repeat the hotdeck imputation steps performed in the previous sections. Fragment ***D*** computes the mean for iteration i, and fragment ***E*** computes the variance of our estimator (the mean) for that iteration. To compute the latter, note that (assuming independent observations):

\[\begin{aligned} Var[\overline{X}] &= Var\left[\frac{1}{n}\sum_{i=1}^n X_i \right]\\ &= \frac{1}{n^2}Var\left[\sum_{i=1}^n X_i \right]\\ &= \frac{1}{n^2}\sum_{i=1}^n Var\left[X_i \right]\\ &= \frac{1}{n^2}\sum_{i=1}^n \sigma^2\\ &= \frac{1}{n}\sigma^2 \end{aligned}\]

Of course we don't know \(\sigma^2\), as this is a population parameter, but an unbiased estimate for it is \(\frac{1}{n-1}\sum_{i=1}^n (X_i-\overline{X})^2\). We complete the analysis by computing our overall estimate (***F***), the within-imputation variance (***G***), the between-imputation variance (***H***), the total variance (***I***), the estimated fraction of information lost due to missing data (***J***), and the degrees of freedom (***K***). We can then compute the CI using a t distribution. One possible set of completions is sketched at the end of this section.

pt_est = ***F***
11.44127316255686
w_est= ***G***
0.0025160028417431446
b_est = ***H***
0.0002021039894415928
t_est = ***I***
0.0027185110391636205
gamma = ***J***
0.07449232116518455
df = ***K***
mi_ci = stats.t.interval(0.95, df, pt_est, np.sqrt(t_est))

print(f"Multiple imputation CI: {mi_ci[0]:.3f}, {mi_ci[1]:.3f}")
Multiple imputation CI: 11.339, 11.544
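
For reference, one possible set of completions is sketched below. Fragments ***A***, ***B*** and ***C*** are the same as in the hotdeck section; the combining step follows Rubin's rules, and other equivalent expressions are fine:

# ***D***: mean age (rings + 1.5) for imputation i
mi_est[i] = imp_data.mean() + 1.5
# ***E***: estimated variance of the mean: sample variance over n
mi_var[i] = imp_data.var(ddof=1) / len(imp_data)

# Combining the imputations (outside the loop):
pt_est = mi_est.mean()                  # ***F***: overall estimate
w_est = mi_var.mean()                   # ***G***: within-imputation variance
b_est = mi_est.var(ddof=1)              # ***H***: between-imputation variance
t_est = w_est + (1 + 1/n_imps) * b_est  # ***I***: total variance
gamma = (1 + 1/n_imps) * b_est / t_est  # ***J***: fraction of information lost
df = (n_imps - 1) / gamma**2            # ***K***: degrees of freedom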