Chapter 6 Complex designs

The previous sections have covered the most common sampling schemes: simple random sampling, stratified designs and cluster designs.

In this section we show options for variance estimation in more complex designs, or in situations where we don’t have access to enough information to use the methods from the earlier sections.

6.1 Replicate Weights for Variance Estimation

Whenever we create an estimate of a population parameter we need to estimate its uncertainty, in the form of a variance (or its square root, the standard error). For some designs we have formulae which estimate the variance, but such formulae do not exist for every design.

The idea behind using replicate weights for estimation is that we see how variable our estimates are when we delete different parts of the sample. If the estimates from this procedure are all very similar, then we can conclude that the estimate using all the data may be quite precise. If the estimates differ wildly, we will be much less confident.

The sample weights in a data set from a sample survey add (approximately or exactly) to the size of the population. When we delete a section of the sample, we set the weights of that section to zero, and correspondingly increase the weights of all of the other units in the sample. The precise way we transfer the weights to those other units depends on the sample design.
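As a minimal sketch of this reweighting, consider the simplest case: an unstratified cluster sample in which one cluster is deleted per replicate and the remaining weights are scaled up by \(n/(n-1)\), where \(n\) is the number of clusters. The function name and the toy data are hypothetical, and this is an illustration of the idea rather than the code iNZight runs.

```python
def jk1_replicate_weights(weights, clusters):
    """Build delete-one-cluster replicate weights (hypothetical helper).

    weights  : sampling weight of each sampled unit
    clusters : cluster label of each sampled unit
    Returns one list of replicate weights per cluster.
    """
    cluster_ids = sorted(set(clusters))
    n = len(cluster_ids)
    scale = n / (n - 1)  # the retained clusters absorb the deleted weight
    replicates = []
    for dropped in cluster_ids:
        replicates.append([
            0.0 if c == dropped else w * scale
            for w, c in zip(weights, clusters)
        ])
    return replicates


# Toy data: four units in three clusters (assumed values).
weights = [10, 10, 20, 30]
clusters = ["a", "a", "b", "c"]
reps = jk1_replicate_weights(weights, clusters)
# The first replicate deletes cluster "a": both of its units get weight zero.
```

Every unit in the deleted cluster receives weight zero together, and every other unit has its weight multiplied by the same factor.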

The methods that are used to generate replicate weights are:

Balanced Repeated Replication (BRR) - the data are partitioned into groups of first stage units, and then at each replicate specific combinations of these groups are retained in the sample while the others are deleted. (These groupings are designed to be very efficient, getting at the variance with a minimal number of replicates.)
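Jackknife JK1 - one of the first stage units (i.e. clusters) is deleted in each replicate
Jackknife JKn - for stratified samples: in each replicate one first stage unit is deleted from one of the strata, and the weights of the remaining units in that stratum are increased to compensate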
Bootstrap - The original weights are used to resample with replacement from the sample - with some units being selected more than once, and having correspondingly higher weights

Note that in cluster designs the units within clusters are correlated with one another. To preserve that correlation we always retain or delete the whole of a cluster; we never delete just part of it.
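To show how replicate weights become a variance estimate, the sketch below computes a weighted mean under each set of replicate weights and combines the results. It assumes the usual JK1 formula \(\frac{R-1}{R}\sum_{r}(\hat\theta_r - \hat\theta)^2\), where \(R\) is the number of replicates, \(\hat\theta_r\) is the estimate from replicate \(r\) and \(\hat\theta\) is the full-sample estimate. Other replication types use different scale factors, and some software centres on the mean of the replicate estimates instead. The function names and toy data are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of the values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def jk1_variance(values, full_weights, replicate_weights):
    """JK1 variance of a weighted mean (assumed formula, see lead-in)."""
    n_reps = len(replicate_weights)
    theta = weighted_mean(values, full_weights)
    rep_estimates = [weighted_mean(values, rw) for rw in replicate_weights]
    return (n_reps - 1) / n_reps * sum((t - theta) ** 2 for t in rep_estimates)


# Toy data: three clusters of one unit each, equal weights (assumed values).
values = [1.0, 2.0, 3.0]
full_weights = [1.0, 1.0, 1.0]
replicate_weights = [
    [0.0, 1.5, 1.5],  # first cluster deleted
    [1.5, 0.0, 1.5],  # second cluster deleted
    [1.5, 1.5, 0.0],  # third cluster deleted
]
variance = jk1_variance(values, full_weights, replicate_weights)
standard_error = variance ** 0.5
```

For this toy sample the result agrees with the familiar \(s^2/n\) variance of a mean under simple random sampling, which is a useful sanity check on the formula.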

A dataset supplied with replicate weights will contain, in addition to the survey data:

A column with the actual sampling weights
A set of columns with replicate sets of weights (50-100 columns is common, though with the Bootstrap many more may be required)

When working with datasets with replicate weights we need to specify the names of these columns to iNZight, as well as the type of replicate.

6.1.1 Example

The data file apiclus2-jk1.csv is the same as the apiclus2 dataset inside iNZight, except that it has 40 columns of JK1 replicate weights in the columns repw01, repw02, ..., repw40. The sample design of apiclus2 is a two-stage cluster design, and at the first stage \(n=40\) school districts (clusters) are selected from the \(N=757\) available. At the second stage between 1 and 5 schools are selected from each of the selected clusters, with a total of \(m=126\) schools selected. The column pw contains the standard sampling weights, \(NM_k/(nm_k)\), where \(M_k\) is the size of cluster \(k\) and \(m_k\) is the number of schools selected from the cluster.
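The weight formula can be checked with a few lines of arithmetic. In the sketch below \(N=757\) and \(n=40\) come from the design described above, while the cluster size \(M_k=10\) and the number of schools selected \(m_k=5\) are made-up values for illustration; the function name is also hypothetical.

```python
def two_stage_weight(N, n, M_k, m_k):
    """Sampling weight N*M_k / (n*m_k) for a school in cluster k."""
    return (N * M_k) / (n * m_k)


# N and n are from the apiclus2 design; M_k and m_k are assumed values.
example_weight = two_stage_weight(N=757, n=40, M_k=10, m_k=5)
print(example_weight)  # 37.85
```

Each selected school in such a district would stand for about 38 schools in the population: 757/40 districts at the first stage, times 10/5 schools within the district at the second.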

If you take a look in the dataset at the replicate weights columns you’ll see that in each column a different block of weights is set to zero. Each block corresponds to all the schools within a particular cluster (school district, labelled by dnum).

To specify the two-stage cluster design with replicate weights to iNZight:

Go to Dataset > Survey design > Specify replicate design ...
Specify pw as the sampling weight
Leave ticked the box Replication weights incorporate sampling weights
Then in the Select replicate weights box select repw01-repw40 (click repw01 and then shift+click repw40)
At top right change Type of replication weights to JK1

Next, choose api00 as Variable 1 for analysis.

Note that versions of iNZight before 3.5.3 had a bug that prevented this from working.

Also note that iNZight can’t cope with missing values of analysis variables when replicate weights are used, so analysis of a variable like enroll (which has some missing values) will fail. The solution is some kind of imputation of the missing values. (Simply eliminating the rows with missing values would require reconstruction of the sample and replicate weights.)

Summary output:

====================================================================================================
                                  iNZight Summary - Survey Design
----------------------------------------------------------------------------------------------------
   Primary variable of interest: api00 (numeric)
                                 
   Total number of observations: 126
      Estimated population size: 5129
----------------------------------------------------------------------------------------------------
   Replicate weights design
   Unstratified cluster jacknife (JK1) with 40 replicates.
====================================================================================================

Summary of api00:
-----------------

Population estimates:

      25%   Median      75%      Mean        SD       Total   Est. Pop. Size   |   Sample Size   Min   Max
   544.95   652.90   803.57   670.812   136.830   3440375.8             5129   |           126   453   951

Standard error of estimates:

    33.93    48.92    43.14    34.928     9.099    951979.6             1514                              

Design effects:

                                8.417                 237.7                                               


====================================================================================================
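As a rough check on the design effect reported for the mean, the sketch below compares the squared standard error with the variance a simple random sample of the same size would give, using the figures printed in the summary. It assumes the comparison is with a without-replacement simple random sample from a population of the estimated size. That assumption reproduces the reported value, but it is a back-of-envelope reconstruction and not a statement of how iNZight computes it.

```python
# Figures copied from the summary output above (already rounded).
se_mean = 34.928   # standard error of the mean under the cluster design
sd = 136.830       # estimated population standard deviation
n = 126            # number of schools in the sample
pop_size = 5129    # estimated population size

# Variance of the mean under simple random sampling without replacement
# (assumed comparison design, including a finite population correction).
srs_variance = (sd ** 2 / n) * (1 - n / pop_size)

design_effect = se_mean ** 2 / srs_variance
print(round(design_effect, 3))  # about 8.417
```

A design effect of about 8 means the cluster sample of 126 schools carries roughly as much information about the mean as a simple random sample of around 15 schools.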

Inference output:

====================================================================================================
                               iNZight Inference using Normal Theory
----------------------------------------------------------------------------------------------------
   Primary variable of interest: api00 (numeric)
                                 
   Total number of observations: 126
      Estimated population size: 5129
----------------------------------------------------------------------------------------------------
   Replicate weights design
   Unstratified cluster jacknife (JK1) with 40 replicates.
====================================================================================================

Inference of api00:
-------------------

Population Mean with 95% Confidence Interval

   Lower   Estimate   Upper
   602.4      670.8   739.3


====================================================================================================
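The interval above can be reproduced from the summary output as estimate plus or minus a normal quantile times the standard error. The sketch below assumes a standard normal quantile, which is what the "Normal Theory" heading suggests; the variable names are illustrative.

```python
from statistics import NormalDist

estimate = 670.812      # population mean of api00 from the summary output
standard_error = 34.928  # its standard error from the summary output

z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
lower = estimate - z * standard_error
upper = estimate + z * standard_error
print(round(lower, 1), round(upper, 1))  # 602.4 739.3
```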

6.2 PPS designs

In a PPS design, the first stage sampling units (PSUs/clusters) are selected not by SRS, but with probability proportional to their size. In these designs sampling is with replacement - so that we don’t remove PSUs from the population during sampling, and if we happen to select a PSU twice (or more) we replicate its data in our sample as many times as it was selected. (Of course, we only actually collect the data once.)
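The selection step can be sketched as follows. The code draws PSUs with replacement, with probability proportional to size, and then estimates a population total with the Hansen-Hurwitz estimator, one standard estimator for with-replacement PPS samples. The source does not say which estimator iNZight uses, so treat this as an illustration only; the function names and toy data are hypothetical.

```python
import random


def pps_with_replacement(sizes, n, seed=0):
    """Draw n PSU indices with probability proportional to size.

    Sampling is with replacement, so an index may appear more than once.
    """
    rng = random.Random(seed)
    return rng.choices(range(len(sizes)), weights=sizes, k=n)


def hansen_hurwitz_total(psu_totals, sizes, draws):
    """Estimate the population total from a with-replacement PPS sample."""
    total_size = sum(sizes)
    # Each draw contributes its PSU total divided by its selection probability.
    contributions = [psu_totals[i] / (sizes[i] / total_size) for i in draws]
    return sum(contributions) / len(draws)


# Toy population of four PSUs (assumed values); totals proportional to size.
sizes = [10, 20, 30, 40]
psu_totals = [3 * s for s in sizes]  # true population total is 300
draws = pps_with_replacement(sizes, n=5)
estimate = hansen_hurwitz_total(psu_totals, sizes, draws)
```

A PSU drawn twice appears twice in `draws` and so contributes twice to the estimate, which mirrors replicating its data in the sample. In this toy population the PSU totals are exactly proportional to size, so every draw gives the same contribution and the estimate equals the true total whichever PSUs are selected; that is the situation in which PPS sampling is most efficient.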

iNZight can’t do these designs proper justice in all circumstances, but has suitable approximations in place for variance estimation.

To specify a PPS design, instead of supplying the population totals in the “Finite Population Correction” part of the sample design specification, just specify a column with the sample weights. Note that these are the full weights of each sample unit, combining all stages of the sample design.

If replicate weights are available then treating the sample as a without-replacement sample with supplied weights is a reasonable approximation.