Chapter 9 Nonresponse
\(\DeclareMathOperator*{\argmin}{argmin}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}\) \(\newcommand{\rma}[2]{{\rm #1}[#2]}\) \(\newcommand{\estm}{\widehat}\)
Nonresponse is one of the most significant contributors to bias in surveys. By its nature nonresponse is difficult to adjust for without making assumptions which may or may not be justified.
There are two kinds of nonresponse to consider:
- Item nonresponse: Some measurements for a particular responding unit are missing;
- Unit nonresponse: The entire record for a particular sampled unit is missing.
Nonresponse can also be considered as a case of missing data.
9.1 Preventing nonresponse
The best way of dealing with nonresponse is to prevent it.
Like other kinds of non-sampling error, there are many ways that nonresponse bias can be minimised in the design stage of a sample survey.
Nonresponse can be reduced by:
- Public education and publicity before a survey;
- Individual pre-notification letters sent to respondents prior to the survey telling them to expect an interviewer to call;
- Providing incentives or compulsion for response;
- Providing clear information about the uses of the data and the confidentiality of the data;
- Interviewing at an appropriate time (of the day or of the year) – to avoid inconvenient times;
- Well trained interviewers/data collectors;
- Data collection method: personal interviews may achieve a better response rate than post-back questionnaires;
- Proxy responses – it may be appropriate to allow one person to respond on behalf of another;
- Questionnaire design: well designed questionnaires which look short, easy and clear may achieve better response rates than large intimidatingly complex ones. Poorly worded questions may lead to item nonresponse, or even to irritating a respondent into a total refusal;
- Call backs at different times to turn non-contacts into responses.
9.2 Types of nonresponse
Each unit in the sample may or may not respond. So we can create an indicator variable to code for response \[\begin{equation} R_i = \begin{cases} 1 & \text{if unit $i$ responds}\\ 0 & \text{if unit $i$ does not respond} \end{cases} \end{equation}\] and we can imagine that there is a probability \(\phi_i\) that population unit \(i\) would respond if an attempt was made to measure that unit.
We are interested in the values of outcome variable \(Y_i\) on each sample unit. We also have a set of auxiliary variables \({\bf X}_i\), which are known for every unit in the sample (whether or not that unit provided a response to \(Y_i\)). The variables \({\bf X}_i\) include the design variables from the frame.
Assume that we have drawn a sample of \(n\) units from the population of size \(N\), and that \(n_R\) of these have responded.
There are then three types of missingness:
Missing Completely at Random (MCAR). This is when the probability of response \(\phi_i\) is the same for all units, no matter what their value of \(Y_i\) and \({\bf X}_i\).
This is the best situation: the respondents and nonrespondents do not differ in any important way, and it is as if the nonrespondents had been selected at random from the sample. To analyse data where we have MCAR, we can simply ignore the nonresponding units, and proceed as if we had selected a sample of \(n_R\) units from the population.
Missing at Random (MAR). This is also known as missing at random given covariates or alternatively ignorable nonresponse. This occurs when the probability of response \(\phi_i\) depends on known quantities: the auxiliary covariates \({\bf X}_i\), but not on the unknown \(Y_i\). We therefore assume that if two units have the same values of \({\bf X}\), then their likelihood of response is the same.
This is a slightly more complex situation than MCAR, but we can still fully account for nonresponse.
Nonignorable Nonresponse. This is the unfortunate situation where the probability of response \(\phi_i\) depends on the outcome of interest \(Y_i\) (as well, possibly, as \({\bf X}_i\)).
MAR and MCAR are strong assumptions. If the data are MAR we can test whether they are MCAR by comparing nonresponse rates across subgroups defined by \({\bf X}\) (or by logistic regression if any of the \({\bf X}\) variables are continuous), testing for any dependence on \({\bf X}\). However it may not be possible to distinguish whether the data are MAR or whether there is nonignorable nonresponse.
We of course exclude from consideration any data which are missing by design. These are data items which are missing because the respondent is not asked that particular question (e.g. we do not ask for the voting record of children).
9.3 Response Rates
What is the response rate of a survey? Surprisingly, there are several answers to this question. Here are some possibilities:
- The number of respondents \(n_R\) divided by the number of sample members \(n\): \[\begin{equation} \frac{n_R}{n} \end{equation}\]
- The number of respondents \(n_R\) divided by the number of sample members contacted \(n_C\): \[\begin{equation} \frac{n_R}{n_C} \end{equation}\]
- The weighted number of respondents divided by an estimate of the total population \[\begin{equation} \frac{\sum_{k\in s_R}w_k}{\sum_{k\in s}w_k} \tag{9.1} \end{equation}\] Here \(s\) is the whole sample, and \(s_R\) is just the responding part of the sample.
The response rate for household surveys in Statistics NZ is calculated as follows. All sample members are classified into 5 categories:
| | Classification | Sum of weights |
|---|---|---|
| 1 | Ineligible pre-contact | \(A\) |
| 2 | Ineligible post-contact | \(B\) |
| 3 | Eligible non-responding | \(C\) |
| 4 | Eligible responding | \(D\) |
| 5 | Eligibility not established | \(E\) |
This classification makes clear that our sample frame may include ineligible units, and that there are some units whose eligibility we cannot establish because we didn’t make contact. Note that some units are ineligible pre-contact: these are cases where the house has been demolished, or is clearly not inhabited.
The eligibility rate among the units where eligibility was established post-contact is \[ \frac{C+D}{B+C+D} \] so that our estimate of the total number of eligible units is \[ C + D + E\times\left(\frac{C+D}{B+C+D}\right) \] The response rate is the ratio of the weighted number of responding units to the weighted number of eligible units: \[ \frac{D}{C+D+E\frac{C+D}{B+C+D}} \] If there are no units for which eligibility is unknown, this reduces to Equation (9.1).
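This calculation can be sketched in a few lines of Python; the category totals \(A\) to \(E\) passed in below are hypothetical inputs, not figures from any real survey.

```python
# Response-rate calculation following the five eligibility categories
# in the text: estimate the eligibility rate among contacted units,
# apportion the unknown-eligibility units, then divide responders by
# the estimated number of eligible units.

def response_rate(A, B, C, D, E):
    """Summed weights by category: A ineligible pre-contact,
    B ineligible post-contact, C eligible non-responding,
    D eligible responding, E eligibility not established."""
    elig_rate = (C + D) / (B + C + D)   # eligibility rate post-contact
    eligible = C + D + E * elig_rate    # estimated total eligible units
    return D / eligible

# With no unknown-eligibility units (E = 0) this reduces to D / (C + D):
print(response_rate(A=50, B=30, C=200, D=600, E=0))   # 0.75
```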
9.4 Weight adjustments
If we have MCAR or MAR missingness, then we can account for nonresponse by making adjustments to the sample weights. If the probability of selection under some sampling scheme is \(\pi_i\), and the probability of responding is \(\phi_i\) then assuming selection and response probability are independent the chances of a population unit ending up as a fully responding sample unit are: \[ \begin{split} \text{Prob(selected + responding)} &= \text{Prob(selected)}\times\text{Prob(responding)}\\ \tilde{\pi}_i &= \pi_i \times \phi_i \end{split} \] Hence our estimator for a total changes from \[ \estm{Y} = \sum_{k\in s} \frac{y_k}{\pi_k} = \sum_{k\in s} w_ky_k \] to \[\begin{eqnarray*} \estm{Y}_W &=& \sum_{k\in s_R} \frac{y_k}{\tilde{\pi}_k}\\ &=& \sum_{k\in s_R} \frac{y_k}{\pi_k\phi_k}\\ &=& \sum_{k\in s_R} \frac{w_k}{\phi_k}y_k\\ &=& \sum_{k\in s_R} \tilde{w}_ky_k\\ \end{eqnarray*}\] where \(s_R\) is the part of the sample which responds, and \[\begin{equation} \tilde{w}_k = \frac{w_k}{\phi_k} \end{equation}\] are the adjusted weights.
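As a sketch of the reweighted estimator, here is the adjusted total in Python; the weights, response probabilities and \(y\)-values are invented for illustration.

```python
# Nonresponse-adjusted estimator of a total: divide each responding
# unit's design weight w_k by its (estimated) response probability
# phi_k, then sum the weighted y-values over respondents.

def adjusted_total(weights, phis, ys):
    """Sum of (w_k / phi_k) * y_k over the responding units."""
    return sum(w / phi * y for w, phi, y in zip(weights, phis, ys))

# three respondents, each with design weight 10 and response
# probability 0.5, so each adjusted weight is 20
print(adjusted_total([10, 10, 10], [0.5, 0.5, 0.5], [1.0, 2.0, 3.0]))  # 120.0
```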
How do we estimate \(\phi_k\)? We will examine three approaches:
- Assume data are MCAR;
- Poststratification. Assume data are MAR, that the probability of nonresponse depends only on membership of known classes, and that the population totals in each class are known;
- Weighting Class Adjustment. Assume data are MAR, that the probability of nonresponse depends only on membership of known classes, but that the population totals in each class are unknown.
We will consider a single example to illustrate these three approaches.
Example - Surveying employees
From a staff of \(N=2721\) a sample of \(n=200\) employees is selected by simple random sampling without replacement. The staff members are given a questionnaire which rates, among other things, their degree of independence – on a scale from 0 to 40.
Only \(n_R = 96\) employees respond to the survey. The mean independence score of the respondents is \(\bar{y}_R=13.2\) with standard deviation \(s_R=4.3\).
Split by the management level of the employees, the data can be summarised as follows:
| Role | \(N_h\) | \(n_h\) | \(n_{hR}\) | \(\bar{y}_{hR}\) | \(s_{hR}\) | \(n_{hR}/n_h\) |
|---|---|---|---|---|---|---|
| Manager | 420 | 31 | 28 | 16.2 | 3.1 | 0.903 |
| Non-manager | 2301 | 169 | 68 | 12.0 | 4.2 | 0.402 |
| Total | 2721 | 200 | 96 | 13.2 | 4.3 | 0.480 |
9.4.1 MCAR treatment
If we assume MCAR, then every unit has the same probability of response \(\phi\), which is best estimated by the weighted response rate: \[\begin{eqnarray*} \estm{\phi} &=& \frac{\text{estimate of number of responders in population}}{ \text{estimate of population size}}\\ &=& \frac{\sum_{k\in s_R}w_k}{\sum_{k\in s}w_k} \end{eqnarray*}\] i.e. the proportion of units in the population that would respond if included in a sample.
For simple random sampling the weights are \[ w_k = \frac{N}{n} \] and the response rate is \[ \estm{\phi} = \frac{n_R}{n} \] so the adjusted weights are \[ \begin{split} \tilde{w}_k &= \frac{w_k}{\phi}\\ &= \frac{N}{n}\times\frac{n}{n_R}\\ &= \frac{N}{n_R} \end{split} \] These are the same as the weights we would calculate if the \(n_R\) responding units had been selected by SRSWOR from the population. For MCAR, we treat the sample in just that way, we reduce the sample size, and ignore the fact that nonresponse has occurred.
Example continued
In our example the weights \(w_k\) are \[ w_k = \frac{N}{n} = \frac{2721}{200} = 13.61 \] and the response rate is just: \[ \estm\phi=\frac{n_R}{n}=\frac{96}{200}=0.48 \] and so the adjusted weights are \[ \begin{split} \tilde{w}_k &= \frac{w_k}{\estm\phi} = \frac{13.605}{0.48} = 28.34\\ &= \frac{N}{n_R} = \frac{2721}{96} = 28.34 \end{split} \] Our best estimate of the population mean is therefore just the sample mean of the respondents: \[\begin{eqnarray*} \widehat{\bar{Y}}_{\rm MCAR} = \bar{y}_R = 13.2 \end{eqnarray*}\] and we calculate the variance using the SRSWOR formula with the reduced sample size \(n_R\): \[\begin{eqnarray*} \bfa{Var}{\widehat{\bar{Y}}_{\rm MCAR}} &=& \left(1-\frac{n_R}{N}\right)\frac{s_R^2}{n_R}\\ &=& \left(1-\frac{96}{2721}\right)\frac{4.3^2}{96} = 0.186 \end{eqnarray*}\]
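These MCAR calculations can be checked directly in Python:

```python
# MCAR treatment of the employee survey: treat the n_R respondents
# as an SRSWOR of size n_R from the population of N staff.
N, n, n_R = 2721, 200, 96
y_bar_R, s_R = 13.2, 4.3

w = N / n                    # original design weight
phi_hat = n_R / n            # estimated response probability, 0.48
w_adj = w / phi_hat          # adjusted weight = N / n_R

var_mcar = (1 - n_R / N) * s_R**2 / n_R   # SRSWOR variance with n_R units
print(round(w_adj, 2), round(var_mcar, 3))   # 28.34 0.186
```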
9.4.2 Poststratification
In many surveys we have access to population summaries from other sources. For example in a household survey we can compare the population estimate from the survey (the sum of the weights) with the known population totals from the census (or updated estimates for the present day). Our sample estimates are unlikely to come up with exactly the same values of the population totals. This is partly because of sampling variability, possibly due to undercoverage of the sampling frame, but it may also be due to nonresponse. Poststratification is a means of adjusting the weights so that the survey estimates of the population match exactly those from the external source.
Assume that we have population counts \(N_h\) for a set of \(H\) poststrata. Note that these poststrata need not be the same as any strata we may have used in the sample design. The variables that define the poststrata could be age, sex – variables we are unlikely to have on our frame, but which are very likely available in census counts.
We calculate the effective response rate by \[\begin{eqnarray*} \estm{\phi_h} &=& \frac{\text{estimate of number of responders in class $h$ in population}}{ \text{actual population size in class $h$}}\\ &=& \frac{\sum_{k\in s_{hR}}w_k}{N_h} = \frac{\estm{M}_h}{N_h} \end{eqnarray*}\] We then proceed as if the data are MCAR within the poststrata.
Our estimate of the population in poststratum \(h\) that would respond is the sum of the weights of the respondents in poststratum \(h\) \[ \estm{M}_h = \sum_{k\in s_{Rh}} w_k \] Now if we modify the weights by setting \[ \tilde{w}_{k} = w_{k}\frac{N_h}{\estm{M}_h} = \frac{w_k}{\estm\phi_h} \ \ \text{for unit $k$ in poststratum $h$} \] then the sum of these new weights over the respondents will be \(N_h\) as required. The effective response rate \(\estm\phi_h\) incorporates undercoverage and nonresponse. (And note that it may be greater or less than 1.)
For simple random sampling the weights are \[ w_k = \frac{N}{n} \] and the estimated response rate within poststratum \(h\) is \[ \begin{split} \estm{\phi_h} &= \frac{n_{hR}\times w_k}{N_h}\\ &= \frac{n_{hR}}{N_h}\frac{N}{n} \end{split} \] so the adjusted weights for units within stratum \(h\) are \[ \begin{split} \tilde{w}_k &= \frac{w_k}{\estm{\phi_h}}\\ &= \frac{N}{n}\times\frac{N_hn}{n_{hR}N}\\ &= \frac{N_h}{n_{hR}} \end{split} \] These are the same as the weights we would calculate if the \(n_{hR}\) responding units had been selected by SRSWOR from stratum \(h\) (which is of size \(N_h\)): it’s as if we planned a stratified SRS within the poststrata.
Example continued
When the sample of employees was selected it was not
stratified by management role; it was simply an SRSWOR drawn from all employees.
However we can use the population information on the sizes
of poststrata defined by management role (this information comes from the employee
records of the company) and adjust the weights:
| Role | \(N_h\) | \(F_h\) | \(w=N/n\) | \(n_{hR}\) | \(\tilde{w}_{hR}=N_{h}/n_{hR}\) |
|---|---|---|---|---|---|
| Manager | 420 | 0.154 | 13.61 | 28 | 15.00 |
| Non-manager | 2301 | 0.846 | 13.61 | 68 | 33.84 |
| Total | 2721 | 1.000 | | 96 | |
The approach in poststratification is to adjust the weights so that with those corrected weights the estimates of the population size within the poststrata match the known benchmark values \(N_h\).
We form the estimate and its variance using the SRSWOR formulae with the reduced sample sizes \(n_{hR}\).
The poststratified estimate of the mean is \[\begin{eqnarray*} \widehat{\bar{Y}}_{\rm post} &=& \sum_h F_h \bar{y}_{hR}\\ &=& (0.154)(16.2) +(0.846)(12)\\ &=& 12.6 \end{eqnarray*}\] with variance \[\begin{eqnarray*} \bfa{Var}{\widehat{\bar{Y}}_{\rm post}} &=& \left(1-\frac{n_R}{N}\right)\frac{1}{n_R}\sum_h F_h s_{hR}^2 + \frac{1}{n_R^2} \sum_h (1-F_h) s_{hR}^2\\ &=& \left(1-\frac{96}{2721}\right) \frac{1}{96} \left[(0.154)(3.1)^2 + (0.846)(4.2)^2\right]\\ && + \frac{1}{96^2}\left[(1-0.154)(3.1)^2 + (1-0.846)(4.2)^2\right]\\ &=& 0.165 + 0.001 \\ &=& 0.166 \end{eqnarray*}\] An alternative approximation to the variance simply treats the sample as if it were designed as a stratified sample in the first place: \[\begin{eqnarray*} \bfa{Var}{\widehat{\bar{Y}}_{\rm post}} &=& \sum_h F_h^2\left(1-\frac{n_{hR}}{N_h}\right)\frac{s_{hR}^2}{n_{hR}}\\ &=& (0.154)^2\left(1-\frac{28}{420}\right)\frac{3.1^2}{28} + (0.846)^2\left(1-\frac{68}{2301}\right)\frac{4.2^2}{68}\\ &=& 0.188 \end{eqnarray*}\] This is a very similar result – both formulae for \(\bfa{Var}{\bar{Y}_{\rm post}}\) are approximate, and it’s OK to use either.
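The poststratified estimate and both variance approximations can be reproduced in Python:

```python
# Poststratification for the employee example: weight each
# poststratum mean by its known population proportion F_h.
N, n_R = 2721, 96
Nh   = {"Manager": 420,  "Non-manager": 2301}
nhR  = {"Manager": 28,   "Non-manager": 68}
ybar = {"Manager": 16.2, "Non-manager": 12.0}
s    = {"Manager": 3.1,  "Non-manager": 4.2}

F = {h: Nh[h] / N for h in Nh}            # poststratum proportions
est = sum(F[h] * ybar[h] for h in Nh)     # poststratified mean

var1 = ((1 - n_R / N) / n_R * sum(F[h] * s[h]**2 for h in Nh)
        + sum((1 - F[h]) * s[h]**2 for h in Nh) / n_R**2)
# alternative: treat the respondents as a planned stratified SRS
var2 = sum(F[h]**2 * (1 - nhR[h] / Nh[h]) * s[h]**2 / nhR[h] for h in Nh)
print(round(est, 1), round(var1, 3), round(var2, 3))   # 12.6 0.166 0.188
```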
When the design is more complex than a SRS then the calculation of the variance of a post-stratified estimate becomes more complex. There are various approximations available, or alternatively one can use the computationally intensive resampling methods (Jackknife or bootstrap) covered later in the course.
One consequence of post-stratification is that individuals in the same household may end up with different weights, because they fall into, say, different age poststrata. This can make estimates of household-level characteristics more difficult to compute in a consistent way. There are approaches, such as integrated weighting, which modify post-stratification to avoid this kind of unwanted behaviour.
9.4.3 Weighting class adjustment
Even if we do not have access to benchmarks \(N_h\), we can still correct for MAR nonresponse using a weighting class adjustment. We do this by forming groups of respondents called estimation groups or weighting classes according to their values of auxiliary variables \({\bf X}\). We might use sex, age, region or other such variables to form the groups. However a key difference is that we must know to which group each non-respondent belongs. In that case we estimate \(\phi_h\), the probability of response in weighting class \(h\) by \[\begin{eqnarray*} \estm{\phi_h} &=& \frac{\text{estimate of number of responders in class $h$ in population}}{ \text{estimate of population size in class $h$}}\\ &=& \frac{\sum_{k\in s_{hR}}w_k}{\sum_{k\in s_h}w_k} = \frac{\estm{M}_h}{\estm{N}_h} \end{eqnarray*}\] We then proceed as if the data are MCAR within the weighting classes.
As above, our estimate of the population in class \(h\) that would respond is the sum of the weights of the respondents in class \(h\) \[ \estm{M}_h = \sum_{k\in s_{Rh}} w_k \] and we must also estimate the population in class \(h\) \[ \estm{N}_h = \sum_{k\in s_{h}} w_k \] where the sum is over all units, responding and nonresponding.
Now we modify the weights by setting \[ \tilde{w}_{k} = w_{k}\frac{\estm{N}_h}{\estm{M}_h} = \frac{w_k}{\estm\phi_h} \ \ \text{for unit $k$ in class $h$} \] so that the sum of these new weights over the respondents will match the estimates \(\estm{N}_h\).
For simple random sampling the weights are \[ w_k = \frac{N}{n} \] and the estimated response rate within estimation group \(h\) is \[ \begin{split} \estm{\phi_h} &= \frac{n_{hR}\times w_k}{n_h\times w_k}\\ &= \frac{n_{hR}}{n_h} \end{split} \] so the adjusted weights for units within stratum \(h\) are \[ \begin{split} \tilde{w}_k &= \frac{w_k}{\estm{\phi_h}}\\ &= \frac{N}{n}\times\frac{n_h}{n_{hR}}\\ \end{split} \] We also estimate the group sizes and proportions \[ \begin{split} \widehat{F}_h &= \frac{n_h}{n}\\ \estm{N}_h &= \estm{F_h}N = N\frac{n_h}{n} \end{split} \]
Example continued
Assume that we don’t actually know the population totals \(N_h\), but we want to adjust for possible differential nonresponse. Firstly we check that the data are not MCAR: do the response rates differ by management role?
Class-specific response rates are \[\begin{eqnarray*} \estm\phi_h &= \frac{n_{hR}}{n_h}&\\ \estm\phi_1 &= \frac{28}{31} &= 0.903\\ \estm\phi_2 &= \frac{68}{169} &= 0.402 \end{eqnarray*}\] These appear to differ strongly; we can test the significance of the difference in the usual way:
Hypotheses: \(H_0: \phi_1=\phi_2\) vs. \(H_1: \phi_1\neq\phi_2\)
Test statistic: \[\begin{eqnarray*} Z &=& \frac{(\estm\phi_1-\estm\phi_2)-(\phi_1-\phi_2)}{ \sqrt{ \frac{\estm\phi_1(1-\estm\phi_1)}{n_1} + \frac{\estm\phi_2(1-\estm\phi_2)}{n_2} }}\\ &=& \frac{(0.903-0.402)-(0)}{ \sqrt{ \frac{(0.903)(1-0.903)}{31} + \frac{(0.402)(1-0.402)}{169} }}\\ &=& \frac{0.501}{0.065} = 7.69 \end{eqnarray*}\]
The p-value of this test is \(<0.001\) so we conclude a difference in response rates. The data are not MCAR.
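The unpooled two-sample \(z\) test above can be reproduced in Python:

```python
# Two-sample z test for equality of the class response rates
# (managers vs non-managers) in the employee example.
from math import sqrt

n1, n2 = 31, 169          # class sample sizes
r1, r2 = 28, 68           # responders in each class
phi1, phi2 = r1 / n1, r2 / n2

# unpooled standard error of the difference in proportions
se = sqrt(phi1 * (1 - phi1) / n1 + phi2 * (1 - phi2) / n2)
z = (phi1 - phi2) / se
print(round(z, 2))   # 7.69
```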
We have established that the nonresponse is not MCAR. One possible model is that the nonresponse is MAR, and depends only on management role. However, with the data we have we cannot test whether this is the case: there could be nonignorable nonresponse occurring.
We will proceed here assuming that we have adequately described the nonresponse process, with the response probability depending only on management role.
| Role | \(\widehat{N}_h=N\widehat{F}_h\) | \(\widehat{F}_h=n_h/n\) | \(w=N/n\) | \(n_h\) | \(n_{hR}\) | \(\estm\phi_h\) | \(\tilde{w}_{hR}=\widehat{N}_{h}/n_{hR}=w/\estm\phi_h\) |
|---|---|---|---|---|---|---|---|
| Manager | 422 | 0.155 | 13.61 | 31 | 28 | 0.903 | 15.06 |
| Non-manager | 2299 | 0.845 | 13.61 | 169 | 68 | 0.402 | 33.81 |
| Total | 2721 | 1.000 | | 200 | 96 | | |
In the weighting class adjustment we use the estimates \(\widehat{F}_h\) and
\(\estm{N}_h\) defined above instead of supplied benchmarks. We form our estimate
using an expression very similar to that for stratified SRSWOR, but with \(F_h\)
replaced by \(\widehat{F}_h\):
\[\begin{eqnarray*}
\widehat{\bar{Y}}_{\rm wc}
&=& \sum_h \widehat{F}_h \bar{y}_{hR}\\
&=& (0.155)(16.2)
+(0.845)(12)\\
&=& 12.7
\end{eqnarray*}\]
The variance is a little more complex however, because the weighting class
estimator is biased. For this reason it is more appropriate to quote the
mean squared error (MSE) of the estimate, and use this when calculating
confidence intervals.
\[\begin{eqnarray*}
\bfa{MSE}{\widehat{\bar{Y}}_{\rm wc}}
&=& \bfa{Var}{\widehat{\bar{Y}}_{\rm wc}}
+\bfa{Bias}{\widehat{\bar{Y}}_{\rm wc}}^2\\
&=& \sum_h
\widehat{F}_h^2\left(1-\frac{n_{hR}}{\estm{N}_h}\right)\frac{s_{hR}^2}{n_{hR}}
+ \left(1-\frac{n_R}{N}\right)\frac{1}{n_R}
\sum_h \widehat{F}_h(\bar{y}_{hR}-\widehat{\bar{Y}}_{\rm wc})^2\\
&=& (0.155)^2\left(1-\frac{28}{422}\right)\frac{3.1^2}{28} +
(0.845)^2\left(1-\frac{68}{2299}\right)\frac{4.2^2}{68}\\
& & \qquad+
\left(1-\frac{96}{2721}\right)\frac{1}{96}
\left((0.155)(16.2-12.7)^2
+(0.845)(12-12.7)^2\right)\\
&=& 0.187 + 0.003
= 0.191
\end{eqnarray*}\]
Summary.
Here are the results of the three approaches.
| Adjustment | Estimate, \(\estm{\bar{Y}}\) | SE or RMSE | RSE | Conf. Int. |
|---|---|---|---|---|
| MCAR | 13.2 | 0.4 | 0.033 | (12.4, 14.1) |
| MAR – Poststratification | 12.6 | 0.4 | 0.034 | (11.8, 13.5) |
| MAR – Weighting Class | 12.7 | 0.4 | 0.035 | (11.8, 13.5) |
9.4.4 Notes
Poststratification is an effective means of correcting for MAR nonresponse if nonresponse depends only on the variables which define the poststrata. If we suspect that nonresponse depends on other (measured) covariates it may be most appropriate to make a weighting class adjustment before a final poststratification adjustment.
The variance of the estimator may be difficult to evaluate if poststratification or weighting class adjustments are made in a complex sample design, and in particular where the poststrata differ from the strata used in the sample selection (e.g. if we use geographical region to select the sample, but poststratify by sex). In these situations analytical formulae may not exist and numerical methods (such as the Jackknife) may be needed to estimate variances. (These methods will be discussed later.)
Note: Raking adjustments provide a way of poststratifying to two (or more) sets of population totals that are not available as cross-tabulated totals. For example we may know total numbers of students by sex, and separately total numbers of students by ethnic group, but not total numbers by sex and ethnicity simultaneously.
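A minimal sketch of raking in Python, via iterative proportional fitting: alternately scale the cell weights so the row margins, then the column margins, match their known totals. The function name, cells and margins are invented for illustration.

```python
# Raking (iterative proportional fitting): adjust cell weights so both
# sets of margins match known totals, without knowing the
# cross-tabulated totals.

def rake(cells, row_totals, col_totals, iters=50):
    """cells: dict {(row, col): weighted count}; returns adjusted cells."""
    for _ in range(iters):
        for r, target in row_totals.items():        # scale each row
            s = sum(v for (ri, c), v in cells.items() if ri == r)
            for key in cells:
                if key[0] == r:
                    cells[key] *= target / s
        for c, target in col_totals.items():        # scale each column
            s = sum(v for (r, ci), v in cells.items() if ci == c)
            for key in cells:
                if key[1] == c:
                    cells[key] *= target / s
    return cells

# sex x ethnic-group cells from a sample, with known separate margins
cells = rake({("F", "A"): 40, ("F", "B"): 60, ("M", "A"): 50, ("M", "B"): 30},
             row_totals={"F": 520, "M": 480}, col_totals={"A": 470, "B": 530})
print({k: round(v) for k, v in cells.items()})
```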
There are many modelling approaches to the treatment of nonresponse, in particular the modelling of response propensity – the probability of response. (see for example Little and Rubin (2002), which is a whole book devoted to the treatment of missing data.)
9.5 Imputation
Imputation is a process whereby missing data are replaced. This may be done for individual items where there has been nonresponse (or inconsistent response). Alternatively, whole records may be imputed.
Imputation is often done so that a clean-looking complete dataset is created for analysis. In the dataset every sample member has a response to every (relevant) question. There are several different ways of achieving this.
9.5.1 Purposive Imputation
Where an item is missing the surveyor chooses a value that s/he considers most likely. This relies on the surveyor having relevant knowledge.
This is rarely possible, and brings with it the risk of introducing biases in the form of the surveyor’s prejudices.
9.5.2 Deductive Imputation
Where an item is missing, but its value can be deduced unambiguously from responses to other questions, then the value can be imputed.
Conversely, it should be noted that in data editing we may delete inconsistent data – e.g. where a respondent says that he is married but is only 10 years old. We may choose to delete one or other or both of the inconsistent data items.
Deductive imputation should only very rarely be possible.
9.5.3 Cell Mean Imputation
Where a numerical item is missing for a particular respondent (e.g. age or income or number of cars owned etc.) we may impute the item value by using the mean value of that item for other respondents who resemble that respondent. Respondents are grouped into cells, just as in the weighting class adjustment in Section 9.4.3, and the mean item value of all respondents in that cell is taken.
Although this process preserves the mean item value in the sample, it deflates the variance. If we have to impute the item for many records we’ll end up with a concentration of observations at the mean value, and a smaller variability in the sample. Consequently it is common to impute the value with a random draw from a normal distribution with the same mean and variance as the true respondents.
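A minimal sketch in Python of cell-mean imputation with a stochastic residual, drawing from a normal distribution with the cell's mean and variance rather than inserting the mean itself; the cells and values are invented.

```python
# Stochastic cell-mean imputation: replace each missing value with a
# draw centred on the cell mean, preserving the cell's variability
# instead of deflating it.
import random
from statistics import mean, stdev

random.seed(1)
cells = {"manager": [16, 18, 15, 17], "staff": [12, 11, 14, 10, 13]}

def impute(cell, n_missing):
    m, s = mean(cells[cell]), stdev(cells[cell])
    return [random.gauss(m, s) for _ in range(n_missing)]

print([round(v, 1) for v in impute("staff", 3)])
```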
9.5.4 Hot-Deck Imputation
Once again we divide the sample members into cells using variables which are known for each sample member, whether or not they respond. Then we replace missing values with values which are copied across from other records within the cell. We can impute individual items this way, or even whole records.
How do we decide which record to copy from – i.e. how do we choose a donor record? There are various possibilities:
- Sequential Hot-Deck Imputation – use the most recent record
in the list from the same cell;
- Random Hot-Deck Imputation – use a random record from within
the cell as the donor;
- Nearest-Neighbour Hot-Deck Imputation – define a similarity measure between records (based on variables which are known for all records), and choose the record which is most similar. (e.g. the one where the age is closest, or where age and income are closest.)
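A minimal sketch of random hot-deck imputation in Python; the records and cell labels are invented.

```python
# Random hot-deck imputation: within each cell, fill a missing value
# by copying the value of a randomly chosen responding donor.
import random

random.seed(2)
records = [
    {"cell": "urban", "income": 40}, {"cell": "urban", "income": 55},
    {"cell": "urban", "income": None}, {"cell": "rural", "income": 30},
    {"cell": "rural", "income": None},
]

for rec in records:
    if rec["income"] is None:
        donors = [r["income"] for r in records
                  if r["cell"] == rec["cell"] and r["income"] is not None]
        rec["income"] = random.choice(donors)   # copy from a random donor

print([r["income"] for r in records])
```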
9.5.5 Cold-Deck Imputation
Similar to Hot-Deck imputation, but the donor records all come from another survey or some other data source. If this other data source is out of date, or if the questions asked were slightly different, then this procedure may introduce biases.
9.5.6 Regression Imputation
Replace the missing value with the prediction from a regression model. i.e. Use the respondents in the same cell to create a model of the way the outcome \(Y\) depends on the known measured covariates \({\bf X}\) and use that model to predict a value for the missing record, given its particular values of the covariates.
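A minimal sketch of regression imputation with a single covariate, using simple least squares on invented data:

```python
# Regression imputation: fit y on a known covariate x using the
# respondents, then predict y for a nonrespondent from its x-value.
from statistics import mean

x_resp = [20, 30, 40, 50]        # covariate, respondents
y_resp = [10, 14, 19, 23]        # outcome, respondents

xm, ym = mean(x_resp), mean(y_resp)
b = (sum((x - xm) * (y - ym) for x, y in zip(x_resp, y_resp))
     / sum((x - xm) ** 2 for x in x_resp))   # slope
a = ym - b * xm                              # intercept

x_missing = 35                   # covariate of a nonrespondent
print(round(a + b * x_missing, 2))   # 16.5, the imputed y-value
```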
9.5.7 Multiple Imputation
Instead of imputing values once, we impute them a number of times, using one of the methods above. That means we end up with, say, 10 datasets – each of which has had the missing values imputed.
We then calculate 10 estimates, one from each of the 10 imputed datasets, and average them. We then use the variability amongst those 10 estimates as an additional component of the variance of the estimator: the additional uncertainty introduced by the imputation process.
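One standard way of combining the per-dataset results (Rubin's rules) can be sketched as follows; the estimates and within-imputation variances below are invented.

```python
# Combining multiple-imputation results: average the per-dataset
# estimates, and add the between-imputation variability to the mean
# within-imputation variance (Rubin's rules).
from statistics import mean, variance

estimates = [12.4, 12.9, 12.6, 13.1, 12.5]   # one estimate per imputed dataset
within    = [0.18, 0.17, 0.19, 0.18, 0.18]   # per-dataset variance estimates

m = len(estimates)
est = mean(estimates)             # combined point estimate
B = variance(estimates)           # between-imputation variance
W = mean(within)                  # within-imputation variance
total_var = W + (1 + 1 / m) * B   # extra term for imputation uncertainty

print(round(est, 2), round(total_var, 3))   # 12.7 0.282
```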
9.5.8 Substitution
This method of imputation occurs in the field while the survey is in progress. The nonrespondent is replaced by a neighbour, or the next person who can be found to respond instead. This is very much like quota sampling, and the sample which results is not a probability sample.
9.5.9 Notes
Imputation for item nonresponse may weaken or distort correlation of variables within a record. Only methods where entire records are copied can avoid this, since only they represent true responses of individuals.
Where a lot of imputation is necessary there may be serious doubts about the validity of the conclusions of the survey. The assumptions which underlie the imputation process may start to appear in the results (e.g. we may simply see the regression relationships in the data which we put there during imputation). Moreover, if an analyst proceeds to produce estimates based on an imputed dataset without realising that the actual sample size is much smaller than it appears (due to the smaller number of actual responses) then the standard errors produced will be too small.
Ideally every imputed value should be flagged so that the analyst knows how much imputation has been done, and therefore whether it will affect the study conclusions. Typically a new indicator variable is created for every variable where imputation is done. The indicator takes the value 1 if the value for a unit is imputed, and 0 otherwise.
Theoretical expressions for variance estimates may break down when there has been correction for nonresponse. In these cases numerical resampling techniques may be the only way to get realistic standard errors.
In general imputation should be minimal – as few values as possible imputed – and implemented by a robust, defensible and appropriate methodology.
Imputation is not necessary for variables which are structurally missing. That is, where the person did not provide an answer to a question because that person was not asked that question.
e.g. If a question asks ‘Is your mortgage at a fixed interest rate?’, such a question should only have a non-missing answer if the person had answered yes to the question ‘Do you have a mortgage?’.
9.6 Capture-Recapture
A common way of determining the level of nonresponse in censuses is the post-enumeration survey. After a census has taken place, a subset of areas are resurveyed, with a shorter questionnaire, and using a different interviewing workforce, so that the results are independent. The post-enumeration survey makes a much more strenuous effort to achieve a complete response, although there will still be some nonresponse.
In carrying out the Census and PES we see three kinds of respondents:
- those in both the Census and PES
- those in the Census only
- those in the PES only
There is a fourth group of people who do not appear in either survey.
The numbers of people responding to the census and the post-enumeration survey can thus be summarised in the following table:
| In Census? | In PES: Yes | In PES: No | Total |
|---|---|---|---|
| Yes | \(n_{11}\) | \(n_{12}\) | \(n_{1+}\) |
| No | \(n_{21}\) | \(0\) | \(n_{21}\) |
| Total | \(n_{+1}\) | \(n_{12}\) | \(n\) |
i.e. we saw \(n_{1+}\) respondents in the census and \(n_{+1}\) respondents in the PES, of whom \(n_{11}\) responded to both.
The population values are the numbers of people in the whole population classified by whether they were found by the Census and whether they would be found by the PES if the PES were a full coverage survey:
| In Census? | Would be found by PES: Yes | Would be found by PES: No | Total |
|---|---|---|---|
| Yes | \(N_{11}\) | \(N_{12}\) | \(N_{1+}\) |
| No | \(N_{21}\) | \(N_{22}\) | \(N_{2+}\) |
| Total | \(N_{+1}\) | \(N_{+2}\) | \(N\) |
We are interested in the population total \(N\). We can use the census data and
the PES sample to make estimates of \(N_{11}\), \(N_{12}\) and \(N_{21}\).
Assuming independence of the response rates to the census and the PES (a big
assumption!), we can use these estimates to estimate the total
\[
\estm{N} = \frac{n_{1+}n_{+1}}{n_{11}}
\]
This estimation method is an example of capture-recapture estimation.
This type of estimation is common in biological settings where a sample of
animals is caught, marked, and then released. Some time later another sample
is caught and the ratio of marked to unmarked individuals is used to estimate
the total animal population. In the census example the census is the first
‘capture’, and the PES is the ‘recapture.’
Note: Complex capture-recapture models exist which relax the independence assumption: these models require more than two sources of information.
The estimator \(\estm{N}\) given above is biased: an approximately unbiased estimator is \[ \estm{N} = \frac{(n_{1+}+1)(n_{+1}+1)}{(n_{11}+1)} - 1 \] which has variance \[ \bfa{Var}{\estm{N}} = \frac{(n_{1+}+1)(n_{+1}+1) (n_{1+}-n_{11})(n_{+1}-n_{11})}{ (n_{11}+1)^2(n_{11}+2)} \]
Example
In a capture-recapture study of frogs on an island a sample of 50 frogs is captured, tagged and released on one night, and then a second sample of 30 frogs is captured on the second night. In the second sample there are 16 frogs with tags, and 14 without. What is the size of the population of frogs on the island?
| In 1st sample? | In 2nd sample: Yes | In 2nd sample: No | Total |
|---|---|---|---|
| Yes | 16 | 34 | 50 |
| No | 14 | | |
| Total | 30 | | |
We have \(n_{1+}=50\) frogs in the first sample, \(n_{+1}=30\) frogs in the second sample, and \(n_{11}=16\) frogs common to both. Our estimate of the population size is then \[ \estm{N} = \frac{(n_{1+}+1)(n_{+1}+1)}{(n_{11}+1)} - 1 = \frac{(51)(31)}{17} - 1 = 92 \] which has variance \[ \bfa{Var}{\estm{N}} = \frac{(n_{1+}+1)(n_{+1}+1) (n_{1+}-n_{11})(n_{+1}-n_{11})}{ (n_{11}+1)^2(n_{11}+2)} = \frac{(51)(31)(34)(14)}{(17)^2(18)} = 144.67 \] so the standard error and relative standard error are \[\begin{eqnarray*} \bfa{SE}{\estm{N}} &=& \sqrt{\bfa{Var}{\estm{N}}} = 12.03\\ \bfa{RSE}{\estm{N}} &=& \frac{\bfa{SE}{\estm{N}}}{\estm{N}} = \frac{12.03}{92} = 0.13 = 13\% \end{eqnarray*}\] A 95% confidence interval for \(N\) is therefore: \[ \estm{N} \pm 1.96 \bfa{SE}{\estm{N}} = 92 \pm (1.96)(12.03) = 92 \pm 24 = (68,116) \]
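The worked example can be verified in Python:

```python
# Approximately unbiased (bias-corrected) capture-recapture estimator
# and its variance, applied to the frog example.
from math import sqrt

def chapman(n1, n2, m):
    """n1, n2: first and second sample sizes; m: marked recaptures."""
    N_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)
           / ((m + 1) ** 2 * (m + 2)))
    return N_hat, var

N_hat, var = chapman(50, 30, 16)
se = sqrt(var)
print(round(N_hat), round(var, 2), round(se, 2))             # 92 144.67 12.03
print((round(N_hat - 1.96 * se), round(N_hat + 1.96 * se)))  # (68, 116)
```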