Chapter 5 Cluster Sampling

In cluster sampling the population is first divided into \(M\) groups, known as clusters of Primary Sampling Units (PSUs), and a random sample of \(m\) clusters is selected. In one-stage cluster samples a census is taken of all units in each selected sample. In two-stage cluster samples a simple random sample of \(n_k\) units is taken from the \(N_k\) units in \(k^{\mathrm{th}}\) selected cluster.

5.1 One-Stage Cluster Sampling

In order for iNZight to carry out an appropriate analysis of a one-stage cluster design, we need to tell iNZight how many clusters there are in the population, and identify which units in the sample come from which cluster. We don’t need to specify the cluster sizes \(N_k\), since a census is taken of each selected cluster and thus iNZight can work out how big the clusters are. In iNZight this means

  • adding a column to the dataset with the label \(k\) of the cluster to which each sample unit belongs, and then
  • adding a column to the dataset with \(M\): the total number of clusters in the population
  • telling iNZight that the data should be treated as a one-stage cluster sampling, and specifying the columns with the cluster labels and number of clusters.

We’ll assume that the two extra columns are already in the data set under consideration.
To tell iNZight to use these columns:

  • Dataset > Survey Design > Specify Design
  • In the 1st stage clustering variable box choose the name of the cluster label variable
  • In the Finite population correction box choose the name of the variable containing the number of clusters \(M\)
  • Click OK

The probability of selection of the \(k^{\mathrm{th}}\) cluster in the sample is \[ \pi_k = \frac{m}{M} \] In a one-stage cluster sample all units in each selected cluster are selected, so the probability of selection of a sample unit is the same as the probability that the cluster is selected. So the probability that unit \(\ell\) in cluster \(k\) is selected is \[ \pi_{k\ell} = \frac{m}{M} \] So the survey weights are the same for all selected units \[ w_{k\ell} = \frac{M}{m} \]

5.1.1 Example

apiclus1 is a one-stage cluster sample of schools in California. The clusters are School Districts (stored in the variable dnum), and the variable fpc contains the value 757: the number \(M\) of School Districts. In the sample \(m=15\) districts are chosen, and then all of the schools (the units) are selected from those districts. A total of \(n=183\) schools were chosen in the 15 clusters.

To tell iNZight that this is one-stage cluster sample drawn from \(M=757\) districts, specify dnum as the 1st stage clustering variable, andfpc` as the Finite population correction.

Look at school size (Variable 1 = enroll) and school type (Variable 2 = stype).

The Elementary Schools are smallest, and dominate the sample. The Middle and High Schools have a wide range of sizes.

Summary output:

====================================================================================================
                                  iNZight Summary - Survey Design
----------------------------------------------------------------------------------------------------
   Primary variable of interest: enroll (numeric)
             Secondary variable: stype (categorical)
                                 
   Total number of observations: 183
      Estimated population size: 9235
----------------------------------------------------------------------------------------------------
   1 - level Cluster Sampling design
   With (15) clusters.
   survey::svydesign(id = ~ dnum, fpc = ~ fpc, data = dataSet)
====================================================================================================

Summary of enroll by stype:
---------------------------

Population estimates:

          25%   Median       75%       Mean       SD        Total   Est. Pop. Size   |   Sample Size   Min    Max
   E   332.00   425.00    527.00    432.854   141.11   3145637.80           7267.2   |           144   117    818
   H   323.00   467.00   1881.50   1130.286   843.10    798584.53            706.5   |            14   170   2181
   M   704.00   935.00   1021.00    897.720   355.20   1132623.40           1261.7   |            25   233   1573

Standard error of estimates:

   E    17.33    23.78     36.68     16.516    16.69    941356.77           1988.0                               
   H   326.13   425.23    378.74    357.122    52.24    338039.77            236.6                               
   M   158.23    95.37    148.87     99.532    54.31    318535.53            249.8                               

Design effects:

   E                                  2.013                123.80                                                
   H                                  2.563                  4.60                                                
   M                                  2.003                 12.89                                           

====================================================================================================

Note here that we estimate that there are \(Y_1=3.14\) million students in an estimated \(\widehat{N}_1=7267\) elementary schools, with a mean school size of \(\widehat{\bar{Y}_1}=433\) students.

5.1.2 Calibration in one-stage cluster samples

The sample size \(n\) is a random variable in a one-stage cluster sample, since the cluster sizes \(N_i\) may differ, and we don’t know in advance which clusters we’ll select. The selected sample size is \[ n = \sum_{k=1}^n N_k \] The consequence of this is that the weights of the \(n\) respondents add up to \(nM/m\), which is an estimate of the population size: \[ \widehat{N} = \sum_{k=1}^m\sum_{\ell=1}^{N_k} w_{k\ell} = \sum_{k=1}^m N_k \frac{M}{m} = n\frac{M}{m} \] If we happen to know that the true population size is \(N\) (maybe from a census, or some other data source) it makes sense to modify the weights, so that they properly add up to \(N\) \[ \tilde{w}_{k\ell} = \frac{N}{\widehat{N}} w_{k\ell} = \frac{N}{n\frac{M}{m}} \frac{M}{m} = \frac{N}{n} \] These are of course the same as the SRSWOR weights.

This process is a special case of calibration or post-stratification. It doesn’t have any effect on estimates of population means, but greatly affects estimates of population totals, since it uses the fact that we know the population size.

We can specify these weights to iNZight by storing the value \(N/n\) in a column in the data set, and

  • Dataset > Survey Design > Specify Design
  • In the 1st stage clustering variable box choose the name of the cluster label variable
  • In the Weighting Variable box choose the name of the column of rescaled weights
  • In the Finite population correction box choose the name of the variable containing the number of clusters
  • Click OK

5.1.3 Example - Calibration

Returning to the example, it is known that there are \(N=6194\) schools in the whole population. In the first analysis we estimated that there are 7267 Elementary Schools on their own, so our estimate of 3.1 million Elementary School students is likely to be an overestimate.

The column pw contains the scaled weights \(N/n=6194/183=33.847\) - we can add this to the survey design as the Weighting Variable (retaining dnum and fpc as the cluster label and population sizes).

Summary output:

====================================================================================================
                                  iNZight Summary - Survey Design
----------------------------------------------------------------------------------------------------
   Primary variable of interest: enroll (numeric)
             Secondary variable: stype (categorical)
                                 
   Total number of observations: 183
      Estimated population size: 6194
----------------------------------------------------------------------------------------------------
   1 - level Cluster Sampling design
   With (15) clusters.
   survey::svydesign(id = ~ dnum, weights = ~ pw, fpc = ~ fpc, data = dataSet)
====================================================================================================

Summary of enroll by stype:
---------------------------

Population estimates:

          25%   Median       75%       Mean       SD         Total   Est. Pop. Size   |   Sample Size   Min    Max
   E   332.00   425.00    527.00    432.854   141.11   2109717.127           4874.0   |           144   117    818
   H   323.00   467.00   1881.50   1130.286   843.10    535594.870            473.9   |            14   170   2181
   M   704.00   935.00   1021.00    897.720   355.20    759628.138            846.2   |            25   233   1573

Standard error of estimates:

   E    17.33    23.78     36.68     16.516    16.69    631349.386           1333.3                               
   H   326.13   425.23    378.74    357.122    52.24    226716.595            158.7                               
   M   158.23    95.37    148.87     99.532    54.31    213635.484            167.5                               

Design effects:

   E                                  2.033                125.039                                                
   H                                  2.588                  4.646                                                
   M                                  2.023                 13.015                                                

====================================================================================================

The graph looks identical, and the estimates of the mean sizes of schools are the same (433). However we now estimate only 2.1 million students in 4874 Elementary Schools, and have reduced thus our earlier over-estimation.

5.2 Two-stage Cluster sampling

In a two-stage cluster sample we draw \(m\) clusters from the \(M\) in the population by SRS. Then we draw \(n_k\) units from the \(N_k\) units in the \(k^{\mathrm{th}}\) selected cluster.

In order for iNZight to carry out an appropriate analysis of a one-stage cluster design, we need to tell iNZight how many clusters there are in the population, and identify which units in the sample come from which cluster. We also need to specify the cluster sizes \(N_k\), since a only a sample of \(n_k\) units is taken of each selected cluster and thus iNZight needs to know how big the clusters are. In iNZight this means

  • adding a column to the dataset with the label \(k\) of the cluster to which each sample unit belongs, and then
  • adding a column to the dataset with \(M\): the total number of clusters in the population
  • adding a column to the dataset with \(N_k\): the total number of units in the \(k^{\mathrm{th}}\) cluster from which the unit is taken
  • telling iNZight that the data should be treated as a two-stage cluster sampling, and specifying the columns with the cluster labels, total number of clusters, and cluster sizes.

We’ll assume that the three extra columns are already in the data set under consideration.
To tell iNZight to use these columns:

  • Dataset > Survey Design > Specify Design
  • In the 1st stage clustering variable box choose the name of the cluster label variable
  • In the 2nd stage clustering variable box choose the name of the variable labelling individual units
  • In the Finite population correction type var1+var2 where var1 is the column with the number of clusters \(M\), and var2 is the column with the number of units per cluster \(N_k\).
  • Click OK

The probability of selection of the \(k^{\mathrm{th}}\) cluster in the sample is \[ \pi_k = \frac{m}{M} \] In a two-stage cluster sample we select by SRS \(n_k\) of the \(N_k\) units in each selected cluster, so the probability of selection of a sample unit is the product of two SRS selection probabilities: the probability that unit \(\ell\) in cluster \(k\) is selected is \[ \pi_{k\ell} = \frac{m}{M}\times\frac{n_k}{N_k} \] So the survey weights are the same for all selected units in the same cluster, but differ between clusters: \[ w_{k\ell} = \frac{MN_k}{mn_k} \]

5.2.1 Example

apiclus2 is a two-stage cluster sample of schools in California. The clusters are School Districts (stored in the variable dnum), the variable snum identifies the Schools within clusters. The variable fpc1 contains the value 757: the number \(M\) of School Districts, and fpc2 gives \(N_k\), the number of schools in the cluster from which each school was selected. In the sample \(m=40\) districts are chosen, and then a sample of between 1 and 5 schools was taken by SRS from each selected cluster. A total of 126 schools were selected.

To tell iNZight that this is two-stage cluster sample drawn from \(M=757\) districts:

  • specify dnum as the 1st stage clustering variable
  • specify snum as the 2nd stage clustering variable,
  • type fpc1+fpc2 as the Finite population correction - to indicate the two population size columns

As before, look at school size (Variable 1 = enroll) and school type (Variable 2 = stype).

(Note iNZight may become significantly slow to run when specifying a complex design like this.)

Summary output:

====================================================================================================
                                  iNZight Summary - Survey Design
----------------------------------------------------------------------------------------------------
        Primary variable of interest: enroll (numeric)
                  Secondary variable: stype (categorical)
                                      
        Total number of observations: 126
   Number omitted due to missingness: 6 (6 in enroll)
   Total number of observations used: 120
           Estimated population size: 5129
----------------------------------------------------------------------------------------------------
   2 - level Cluster Sampling design
   With (40, 126) clusters.
   survey::svydesign(id = ~ dnum + snum, fpc = ~ fpc1+fpc2, data = dataSet)
====================================================================================================

Summary of enroll by stype:
---------------------------

Population estimates:

          25%    Median       75%        Mean       SD        Total   Est. Pop. Size   |   Sample Size   Min    Max
   E   219.75    295.60    437.43    339.7874   159.80   1161343.98           3493.6   |            83   113    840
   H   899.14   1114.75   1171.04   1038.6374   390.70    715486.12            688.9   |            20   112   2237
   M   720.00    800.92   1061.68    839.3250   279.00    762442.83            946.3   |            23   156   1211

Standard error of estimates:

   E    27.59     78.82    121.24     51.0810    28.88    334451.85           1119.7                               
   H   232.01    184.75     51.12     81.5152    97.21    338720.99            289.4                               
   M   149.18     78.66    105.74     67.5015    49.05    291735.04            311.8                               

Design effects:

   E                                   8.2586                 30.31                                                
   H                                   0.8966                 32.62                                                
   M                                   1.2529                 28.36                                                
====================================================================================================

5.2.2 Supplying Weights

As in the one-stage design we are free to specify weights rather than supplying the details of the PSUs - in that case we still need to specify the 1st stage clustering variable (cluster labels, \(k\)) and the Finite population correction (Number of clusters, \(M\)), but after that we can supply a weighting variable in place of the individual label and the cluster size (\(N_k\)) values.

5.3 Stratified Cluster Designs

iNZight allows for stratified cluster designs - we simply specify a stratification variable, and then specify the cluster design (one- or two-stage) that takes place within each selected cluster.