In this lab, we will implement some of the techniques for dealing with missing data that we have looked at in class. You will be using a dataset on paua, or abalone. The dataset includes the following variables:
Table 10.1: Abalone Dataset Variables
| Name | Data Type | Measurement Unit | Description |
|---|---|---|---|
| Type | nominal | – | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole.weight | continuous | grams | whole abalone |
| Shucked.weight | continuous | grams | weight of meat |
| Viscera.weight | continuous | grams | gut weight (after bleeding) |
| Shell.weight | continuous | grams | after being dried |
| Rings | integer | – | +1.5 gives the age in years |
Start by downloading the dataset.
import pandas as pd
import numpy as np
from scipy import stats

# URL of the Abalone dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"

# Column names for the dataset
column_names = ["Type", "Length", "Diameter", "Height", "Whole weight", "Shucked weight", "Viscera weight", "Shell weight", "Rings"]

# Load the dataset into a Pandas DataFrame
abalone_data = pd.read_csv(url, header=None, names=column_names)

# Display the first few rows of the dataset
print(abalone_data.head())
We will investigate the age of the paua, so let’s start by randomly dropping about 10% of the Rings values. We will set a random seed, so that our output is repeatable for debugging purposes, and make a copy of the original data, data_missing, in which 400 of the recorded values of the Rings variable are replaced with NAs.
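One way this step might look is sketched below. The helper name make_missing, the seed value, and the small synthetic demo frame are assumptions made for illustration, so the rows dropped will not match the lab's own output.

```python
import numpy as np
import pandas as pd

def make_missing(data, column="Rings", n_missing=400, seed=42):
    """Return a copy of `data` with `n_missing` random values of `column` set to NaN.

    The function name and the default seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    data_missing = data.copy()
    # Cast to float so the integer column can hold NaN
    data_missing[column] = data_missing[column].astype(float)
    # Pick distinct row positions to blank out
    drop_pos = rng.choice(len(data_missing), size=n_missing, replace=False)
    data_missing.iloc[drop_pos, data_missing.columns.get_loc(column)] = np.nan
    return data_missing

# Small synthetic stand-in for the abalone data
demo = pd.DataFrame({"Type": list("MFI") * 10, "Rings": np.arange(30)})
demo_missing = make_missing(demo, n_missing=3, seed=0)
print(demo_missing["Rings"].isna().sum())
```

With the real data, the call would be along the lines of make_missing(abalone_data, n_missing=400), leaving abalone_data itself untouched.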
First, compute a 95% confidence interval for the mean age of the paua in the population using only the complete (non-missing) data. Be careful to account for the relationship between age and number of rings: age in years is Rings + 1.5.
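A minimal sketch of a complete-case interval follows, assuming a t-based interval on the observed values is what is wanted. The function name mean_ci and the toy ring counts are made up for illustration.

```python
import numpy as np
from scipy import stats

def mean_ci(values, level=0.95):
    """t-based confidence interval for the mean of a 1-D sample, ignoring NaNs."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                      # complete cases only
    n = x.size
    se = x.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    half = stats.t.ppf(0.5 + level / 2, df=n - 1) * se
    return x.mean() - half, x.mean() + half

# Toy ring counts with one missing value; age = Rings + 1.5
rings = np.array([9.0, 10.0, np.nan, 8.0, 11.0, 12.0])
low, high = mean_ci(rings + 1.5)
print(f"Complete-case CI: {low:.3f}, {high:.3f}")
```

Adding 1.5 shifts both endpoints by the same amount, so the interval for age can equally be obtained by computing the interval for Rings and then adding 1.5.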
Next, implement simple (mean) imputation to estimate the mean age: make a copy of the Rings column in data_missing, replace the NAs with the mean of the observed values, and recompute the 95% confidence interval. My output is below.
Mean imputation CI: 11.342, 11.529
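The mean-imputation step can be sketched as below on a small made-up series (an assumption for illustration; the numbers will not reproduce the output above, which comes from the abalone data and the lab's own seed).

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy Rings column with two missing values
rings = pd.Series([9.0, 10.0, np.nan, 8.0, np.nan, 12.0])

# Replace each NaN with the mean of the observed values
rings_imputed = rings.fillna(rings.mean())

# Convert to age and build a t-based 95% CI on the filled-in column
age = rings_imputed + 1.5
n = len(age)
se = age.std(ddof=1) / np.sqrt(n)
half = stats.t.ppf(0.975, df=n - 1) * se
print(f"Mean imputation CI: {age.mean() - half:.3f}, {age.mean() + half:.3f}")
```

Mean imputation leaves the point estimate equal to the complete-case mean, but the filled-in values add no variability, so the standard error (and hence the interval) tends to be too small.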
10.3 Hotdeck Imputation
Now try using random draws, conditioning on Type. We can do this by looping over the paua types, and replacing the missing values for each type with an appropriately sized draw, with replacement, from the non-missing values for that type. Reset the seed so that you get consistent output when re-running just this part.
Note that code fragment ***A*** evaluates to True for each row that is of type t and has a missing value, fragment ***B*** returns the number of missing values for type t, and fragment ***C*** performs the appropriate draw from the non-missing data (using the function rng.choice). The output is shown below:
Hot deck CI: 11.346, 11.542
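The loop described in this section could look something like the sketch below. The function name hotdeck_by_group, the seed, and the toy data frame are assumptions, and the comments only indicate which role each line plays relative to fragments A, B and C; it is not the lab's reference solution.

```python
import numpy as np
import pandas as pd

def hotdeck_by_group(data, value_col="Rings", group_col="Type", rng=None):
    """Fill NaNs in `value_col` with random draws from observed values of the same group."""
    rng = np.random.default_rng() if rng is None else rng
    values = data[value_col].to_numpy(dtype=float).copy()
    groups = data[group_col].to_numpy()
    for t in np.unique(groups):
        in_group = groups == t
        is_missing = in_group & np.isnan(values)   # role of A: type t and missing
        n_missing = int(is_missing.sum())          # role of B: count of missing for type t
        donors = values[in_group & ~np.isnan(values)]
        if n_missing > 0 and donors.size > 0:
            # role of C: draw with replacement from the observed values of type t
            values[is_missing] = rng.choice(donors, size=n_missing, replace=True)
    return pd.Series(values, index=data.index, name=value_col)

# Toy data: one missing value in each of two types
demo = pd.DataFrame({
    "Type": ["M", "M", "M", "F", "F", "F"],
    "Rings": [9, np.nan, 11, 7, 8, np.nan],
})
filled = hotdeck_by_group(demo, rng=np.random.default_rng(1))
print(filled.tolist())
```

Each imputed value is an actual observed value from the same type, so the filled-in column keeps more of the within-type spread than mean imputation does.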
10.4 Bootstrap Replication
After implementing a couple of standard imputation techniques, let’s try bootstrap replication. We will generate 200 bootstrap replicates, impute missing data in each replicate via random draws conditioning on Type, and thus include the effect of missing data in our 95% CI for the mean age of the population. In the code that follows, missing code fragment ***A*** samples the appropriate number of row indices with replacement; these are used to build the bootstrap replicate dataframe. Fragments ***B***, ***C*** and ***D*** then repeat the hotdeck imputation steps as performed in the previous section.
To compute our confidence interval, we will use the mean of the bootstrapped estimates as our actual estimate, and build our confidence interval from that value using the quantiles of the bootstrap replicates, as shown in class.
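A rough sketch of the bootstrap procedure follows. It assumes a plain percentile interval (the 2.5% and 97.5% quantiles of the replicate means), which may differ from the construction shown in class; the function names, seeds, and synthetic data are likewise illustrative assumptions.

```python
import numpy as np
import pandas as pd

def hotdeck(values, groups, rng):
    """Fill NaNs in `values` with random draws from observed values in the same group."""
    values = values.copy()
    for t in np.unique(groups):
        in_group = groups == t
        is_missing = in_group & np.isnan(values)
        donors = values[in_group & ~np.isnan(values)]
        if is_missing.any() and donors.size > 0:
            values[is_missing] = rng.choice(donors, size=int(is_missing.sum()), replace=True)
    return values

def bootstrap_mean_age(data, n_boot=200, seed=0):
    """Bootstrap the mean age, imputing by hot deck within each replicate."""
    rng = np.random.default_rng(seed)
    rings = data["Rings"].to_numpy(dtype=float)
    types = data["Type"].to_numpy()
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)       # resample row positions
        filled = hotdeck(rings[idx], types[idx], rng)   # impute inside the replicate
        estimates[b] = np.nanmean(filled) + 1.5         # age = Rings + 1.5
    low, high = np.quantile(estimates, [0.025, 0.975])  # percentile interval (assumed)
    return estimates.mean(), low, high

# Synthetic stand-in data: 300 rows, 30 missing Rings values
demo_rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Type": demo_rng.choice(["M", "F", "I"], size=300),
    "Rings": demo_rng.integers(5, 15, size=300).astype(float),
})
demo.loc[demo_rng.choice(300, size=30, replace=False), "Rings"] = np.nan
est, low, high = bootstrap_mean_age(demo)
print(f"Bootstrap estimate: {est:.3f}, CI: {low:.3f}, {high:.3f}")
```

Imputing inside each replicate, rather than once before resampling, is what lets the spread of the replicate means reflect the extra uncertainty from the missing values.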
Our final approach for building a confidence interval for the mean of the variable with missing data is to implement multiple imputation. We will condition on Type, and impute 500 values for each missing value, to build a 95% CI for mean age of the population that reflects the uncertainty due to missing data.
Note that code fragments ***A***, ***B*** and ***C*** again repeat the hotdeck imputation steps as performed in the previous sections. Fragment ***D*** computes the mean for iteration i. Fragment ***E*** computes the variance of our estimator (the mean). To compute this, note that for a sample of n independent observations with common variance \(\sigma^2\), \(\mathrm{Var}(\overline{X}) = \frac{\sigma^2}{n}\).
Of course we don’t know \(\sigma^2\), as this is a population parameter, but an unbiased estimate of it is \(\frac{1}{n-1}\sum_{i=1}^n (X_i-\overline{X})^2\). We complete the analysis by computing our estimate (***F***), the within-imputation variance (***G***), the between-imputation variance (***H***), the total variance (***I***), the estimated fraction of information lost due to missing data (***J***), and the degrees of freedom (***K***). We can then compute the CI using the appropriate quantile of a t distribution.
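The multiple-imputation steps can be sketched as below. The pooling formulas are the commonly used form of Rubin's rules (total variance W + (1 + 1/m)B, fraction lost (1 + 1/m)B / T, degrees of freedom (m - 1) / fraction²), which is an assumption about what was presented in class; function names, seeds, and the synthetic data are also illustrative, and the comments only indicate the role of each fragment.

```python
import numpy as np
import pandas as pd
from scipy import stats

def hotdeck(values, groups, rng):
    """Fill NaNs in `values` with random draws from observed values in the same group."""
    values = values.copy()
    for t in np.unique(groups):
        in_group = groups == t
        is_missing = in_group & np.isnan(values)
        donors = values[in_group & ~np.isnan(values)]
        if is_missing.any() and donors.size > 0:
            values[is_missing] = rng.choice(donors, size=int(is_missing.sum()), replace=True)
    return values

def multiple_imputation_ci(data, m=500, level=0.95, seed=0):
    """Pool m hot-deck imputations of mean age (assumed form of Rubin's rules)."""
    rng = np.random.default_rng(seed)
    rings = data["Rings"].to_numpy(dtype=float)
    types = data["Type"].to_numpy()
    n = len(rings)
    means = np.empty(m)
    variances = np.empty(m)
    for i in range(m):
        age = hotdeck(rings, types, rng) + 1.5
        means[i] = age.mean()                  # role of D: estimate for iteration i
        variances[i] = age.var(ddof=1) / n     # role of E: estimated Var(mean) = s^2 / n
    estimate = means.mean()                    # role of F: pooled estimate
    within = variances.mean()                  # role of G: within-imputation variance
    between = means.var(ddof=1)                # role of H: between-imputation variance
    total = within + (1 + 1 / m) * between     # role of I: total variance
    frac_lost = (1 + 1 / m) * between / total  # role of J: fraction of information lost
    dof = (m - 1) / frac_lost**2               # role of K: degrees of freedom
    half = stats.t.ppf(0.5 + level / 2, df=dof) * np.sqrt(total)
    return estimate, estimate - half, estimate + half, frac_lost

# Synthetic stand-in data: 300 rows, 30 missing Rings values
demo_rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Type": demo_rng.choice(["M", "F", "I"], size=300),
    "Rings": demo_rng.integers(5, 15, size=300).astype(float),
})
demo.loc[demo_rng.choice(300, size=30, replace=False), "Rings"] = np.nan
est, low, high, frac = multiple_imputation_ci(demo, m=500)
print(f"MI estimate: {est:.3f}, CI: {low:.3f}, {high:.3f}, fraction lost: {frac:.3f}")
```

The between-imputation term is what widens the interval relative to a single hot-deck imputation: it measures how much the estimate moves from one imputed dataset to the next.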