# Chapter 5 Cluster Sampling

In cluster sampling the population is first divided into \(M\) groups, known as **clusters** or **Primary Sampling Units** (PSUs), and a random sample of \(m\) clusters is selected. In **one-stage cluster samples** a census is taken of all units in each selected cluster. In **two-stage cluster samples** a simple random sample of \(n_k\) units is taken from the \(N_k\) units in the \(k^{\mathrm{th}}\) selected cluster.

## 5.1 One-Stage Cluster Sampling

In order for iNZight to carry out an appropriate analysis of a one-stage cluster design, we need to tell iNZight how many clusters there are in the population, and identify which units in the sample come from which cluster. We don’t need to specify the cluster sizes \(N_k\), since a census is taken of each selected cluster and thus iNZight can work out how big the clusters are. In iNZight this means

- adding a column to the dataset with the label \(k\) of the cluster to which each sample unit belongs,
- adding a column to the dataset with \(M\), the total number of clusters in the population, and
- telling iNZight that the data should be treated as a one-stage cluster sample, and specifying the columns with the cluster labels and the number of clusters.

We’ll assume that the two extra columns are already in the data set under consideration.

To tell iNZight to use these columns:

`Dataset > Survey Design > Specify Design`

- In the `1st stage clustering variable` box choose the name of the cluster label variable
- In the `Finite population correction` box choose the name of the variable containing the number of clusters \(M\)
- Click `OK`

The probability of selection of the \(k^{\mathrm{th}}\) cluster in the sample is
\[
\pi_k = \frac{m}{M}
\]
In a one-stage cluster sample all units in each selected cluster are selected, so the probability of selection of a sample unit is the same as the probability that the cluster is selected. So the probability that unit \(\ell\) in cluster \(k\) is selected is
\[
\pi_{k\ell} = \frac{m}{M}
\]
So the survey weights are the same for all selected units:
\[
w_{k\ell} = \frac{M}{m}
\]
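As a quick numerical check of these formulas, the following minimal Python sketch computes the selection probability and the weight for a one-stage design. The values \(M=757\) and \(m=15\) are those of the `apiclus1` example in Section 5.1.1; the function name `one_stage_weight` is purely illustrative and is not part of iNZight.

```python
from fractions import Fraction

def one_stage_weight(M: int, m: int) -> Fraction:
    """Survey weight M/m shared by every unit in a one-stage cluster
    sample (m clusters drawn by SRS from the M in the population)."""
    if not 0 < m <= M:
        raise ValueError("need 0 < m <= M")
    return Fraction(M, m)

M, m = 757, 15               # clusters in the population / in the sample
w = one_stage_weight(M, m)   # weight of every sampled unit
pi = 1 / w                   # selection probability m/M

print(f"pi = {float(pi):.4f}, w = {float(w):.3f}")
```

Each sampled unit therefore represents roughly 50 units in the population.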

### 5.1.1 Example

`apiclus1` is a one-stage cluster sample of schools in California. The clusters are School Districts (stored in the variable `dnum`), and the variable `fpc` contains the value 757: the number \(M\) of School Districts. In the sample \(m=15\) districts are chosen, and then all of the schools (the units) are selected from those districts. A total of \(n=183\) schools were chosen in the 15 clusters.

To tell iNZight that this is a one-stage cluster sample drawn from \(M=757\) districts, specify `dnum` as the `1st stage clustering variable`, and `fpc` as the `Finite population correction`.

Look at school size (Variable 1 = `enroll`) and school type (Variable 2 = `stype`).

The Elementary Schools are smallest, and dominate the sample. The Middle and High Schools have a wide range of sizes.

`Summary` output:

```
====================================================================================================
iNZight Summary - Survey Design
----------------------------------------------------------------------------------------------------
Primary variable of interest: enroll (numeric)
Secondary variable: stype (categorical)
Total number of observations: 183
Estimated population size: 9235
----------------------------------------------------------------------------------------------------
1 - level Cluster Sampling design
With (15) clusters.
survey::svydesign(id = ~ dnum, fpc = ~ fpc, data = dataSet)
====================================================================================================
Summary of enroll by stype:
---------------------------
Population estimates:
25% Median 75% Mean SD Total Est. Pop. Size | Sample Size Min Max
E 332.00 425.00 527.00 432.854 141.11 3145637.80 7267.2 | 144 117 818
H 323.00 467.00 1881.50 1130.286 843.10 798584.53 706.5 | 14 170 2181
M 704.00 935.00 1021.00 897.720 355.20 1132623.40 1261.7 | 25 233 1573
Standard error of estimates:
E 17.33 23.78 36.68 16.516 16.69 941356.77 1988.0
H 326.13 425.23 378.74 357.122 52.24 338039.77 236.6
M 158.23 95.37 148.87 99.532 54.31 318535.53 249.8
Design effects:
E 2.013 123.80
H 2.563 4.60
M 2.003 12.89
====================================================================================================
```

Note here that we estimate that there are \(\widehat{Y}_1=3.15\) million students in an **estimated** \(\widehat{N}_1=7267\) elementary schools, with a mean school size of \(\widehat{\bar{Y}}_1=433\) students.
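The estimated population sizes in this output can be reproduced by hand, since every sampled school carries the same weight \(M/m\). The sketch below is plain Python, not iNZight code; it copies the sample sizes per school type from the summary table and assumes nothing beyond \(M=757\) and \(m=15\).

```python
M, m = 757, 15
weight = M / m                      # same weight for every sampled school

# Sample sizes per school type, copied from the Summary output
sample_sizes = {"E": 144, "H": 14, "M": 25}

# Estimated number of schools of each type = (sample size) x (weight)
est_pop = {stype: n * weight for stype, n in sample_sizes.items()}
total_est = sum(est_pop.values())   # estimate of the overall population size

for stype, n_hat in est_pop.items():
    print(f"{stype}: {n_hat:.1f}")
print(f"All schools: {total_est:.1f}")
```

The results agree with the `Est. Pop. Size` column (7267.2, 706.5 and 1261.7) and with the overall estimated population size of 9235.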

### 5.1.2 Calibration in one-stage cluster samples

The sample size \(n\) is a **random variable** in a one-stage cluster sample, since the cluster sizes \(N_k\) may differ, and we don’t know in advance which clusters we’ll select. The selected sample size is
\[
n = \sum_{k=1}^m N_k
\]
The consequence of this is that the weights of the \(n\) respondents add up to \(nM/m\), which is an **estimate** of the population size:
\[
\widehat{N} = \sum_{k=1}^m\sum_{\ell=1}^{N_k} w_{k\ell} = \sum_{k=1}^m N_k \frac{M}{m} = n\frac{M}{m}
\]
If we happen to know that the true population size is \(N\) (maybe from a census, or some other data source) it makes sense to modify the weights, so that they properly add up to \(N\):
\[
\tilde{w}_{k\ell} = \frac{N}{\widehat{N}} w_{k\ell} = \frac{N}{n\frac{M}{m}} \frac{M}{m} = \frac{N}{n}
\]
These are of course the same as the SRSWOR weights.

This process is a special case of **calibration** or **post-stratification**. It doesn’t have any effect on estimates of population **means**, but greatly affects estimates of **population totals**, since it uses the fact that we know the population size.
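To make the effect of this rescaling concrete, here is a small Python sketch under stated assumptions: the figures \(M=757\), \(m=15\), \(n=183\) and \(N=6194\) are those of the `apiclus1` example used in this chapter, while the three `enroll` values are made up purely for illustration. It shows that multiplying every weight by the same constant changes a weighted total but leaves a weighted mean untouched.

```python
M, m = 757, 15         # clusters in the population / in the sample
n, N = 183, 6194       # sample size / known population size

w = M / m              # design weight of every sampled unit
N_hat = n * w          # the weights sum to the estimated population size
w_cal = w * N / N_hat  # calibrated weight, algebraically equal to N / n

# Hypothetical enrolments for three sampled schools (illustration only)
enroll = [250, 400, 610]

def weighted_total(y, weight):
    """Estimated total when every unit has the same weight."""
    return sum(weight * yi for yi in y)

def weighted_mean(y, weight):
    """Ratio of the weighted total to the sum of the weights."""
    return weighted_total(y, weight) / (weight * len(y))

print(f"N_hat = {N_hat:.1f}, calibrated weight = {w_cal:.3f}")
print("totals:", weighted_total(enroll, w), weighted_total(enroll, w_cal))
print("means: ", weighted_mean(enroll, w), weighted_mean(enroll, w_cal))
```

The constant weight cancels in the ratio that defines the mean, which is why only the totals respond to calibration in this design.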

We can specify these weights to iNZight by storing the value \(N/n\) in a column in the data set, and

`Dataset > Survey Design > Specify Design`

- In the `1st stage clustering variable` box choose the name of the cluster label variable
- In the `Weighting Variable` box choose the name of the column of rescaled weights
- In the `Finite population correction` box choose the name of the variable containing the number of clusters
- Click `OK`

### 5.1.3 Example - Calibration

Returning to the example, it is known that there are \(N=6194\) schools in the whole population. In the first analysis we estimated that there are 7267 Elementary Schools on their own, so our estimate of 3.1 million Elementary School students is likely to be an overestimate.

The column `pw` contains the scaled weights \(N/n=6194/183=33.847\). We can add this to the survey design as the `Weighting Variable` (retaining `dnum` and `fpc` as the cluster labels and the number of clusters).

`Summary` output:

```
====================================================================================================
iNZight Summary - Survey Design
----------------------------------------------------------------------------------------------------
Primary variable of interest: enroll (numeric)
Secondary variable: stype (categorical)
Total number of observations: 183
Estimated population size: 6194
----------------------------------------------------------------------------------------------------
1 - level Cluster Sampling design
With (15) clusters.
survey::svydesign(id = ~ dnum, weights = ~ pw, fpc = ~ fpc, data = dataSet)
====================================================================================================
Summary of enroll by stype:
---------------------------
Population estimates:
25% Median 75% Mean SD Total Est. Pop. Size | Sample Size Min Max
E 332.00 425.00 527.00 432.854 141.11 2109717.127 4874.0 | 144 117 818
H 323.00 467.00 1881.50 1130.286 843.10 535594.870 473.9 | 14 170 2181
M 704.00 935.00 1021.00 897.720 355.20 759628.138 846.2 | 25 233 1573
Standard error of estimates:
E 17.33 23.78 36.68 16.516 16.69 631349.386 1333.3
H 326.13 425.23 378.74 357.122 52.24 226716.595 158.7
M 158.23 95.37 148.87 99.532 54.31 213635.484 167.5
Design effects:
E 2.033 125.039
H 2.588 4.646
M 2.023 13.015
====================================================================================================
```

The graph looks identical, and the estimates of the mean sizes of schools are the same (433). However, we now estimate only 2.1 million students in 4874 Elementary Schools, and have thus reduced our earlier over-estimation.

## 5.2 Two-Stage Cluster Sampling

In a two-stage cluster sample we draw \(m\) clusters from the \(M\) in the population by SRS. Then we draw \(n_k\) units from the \(N_k\) units in the \(k^{\mathrm{th}}\) selected cluster.

In order for iNZight to carry out an appropriate analysis of a two-stage cluster design, we need to tell iNZight how many clusters there are in the population, and identify which units in the sample come from which cluster. We also need to specify the cluster sizes \(N_k\), since only a sample of \(n_k\) units is taken from each selected cluster and thus iNZight needs to be told how big the clusters are. In iNZight this means

- adding a column to the dataset with the label \(k\) of the cluster to which each sample unit belongs,
- adding a column to the dataset with \(M\), the total number of clusters in the population,
- adding a column to the dataset with \(N_k\), the total number of units in the \(k^{\mathrm{th}}\) cluster from which the unit is taken, and
- telling iNZight that the data should be treated as a two-stage cluster sample, and specifying the columns with the cluster labels, total number of clusters, and cluster sizes.

We’ll assume that the three extra columns are already in the data set under consideration.

To tell iNZight to use these columns:

`Dataset > Survey Design > Specify Design`

- In the `1st stage clustering variable` box choose the name of the cluster label variable
- In the `2nd stage clustering variable` box choose the name of the variable labelling individual units
- In the `Finite population correction` box type `var1+var2`, where `var1` is the column with the number of clusters \(M\), and `var2` is the column with the number of units per cluster \(N_k\)
- Click `OK`

The probability of selection of the \(k^{\mathrm{th}}\) cluster in the sample is
\[
\pi_k = \frac{m}{M}
\]
In a two-stage cluster sample we select by SRS \(n_k\) of the \(N_k\) units in each selected cluster, so the probability of selection of a sample unit is the product of two SRS selection probabilities: the probability that unit \(\ell\) in cluster \(k\) is selected is
\[
\pi_{k\ell} = \frac{m}{M}\times\frac{n_k}{N_k}
\]
So the survey weights are the same for all selected units in the same cluster, but differ between clusters:
\[
w_{k\ell} = \frac{MN_k}{mn_k}
\]
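A minimal Python sketch of this weight calculation follows. The cluster labels, the \(N_k\) values and the tiny sample of two clusters are hypothetical and chosen only to keep the example small; \(n_k\) is obtained by counting the sampled units in each cluster.

```python
from collections import Counter

M = 757   # clusters in the population (hypothetical)

# One row per sampled unit: (cluster label k, cluster size N_k); made-up data
rows = [("A", 10), ("A", 10), ("B", 4), ("B", 4), ("B", 4), ("B", 4)]

n_k = Counter(k for k, _ in rows)   # sampled units per cluster
m = len(n_k)                        # number of sampled clusters

def two_stage_weight(M, m, N_k, n_k):
    """w = (M / m) * (N_k / n_k): the inverse of the product of the
    two SRS selection probabilities."""
    return (M / m) * (N_k / n_k)

weights = [two_stage_weight(M, m, N_k, n_k[k]) for k, N_k in rows]
for (k, N_k), w in zip(rows, weights):
    print(k, N_k, n_k[k], round(w, 1))
```

In this toy data cluster B is fully enumerated (\(n_k = N_k\)), so its units receive the one-stage weight \(M/m\), while units in cluster A carry a larger weight because only 2 of its 10 units were sampled.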

### 5.2.1 Example

`apiclus2` is a two-stage cluster sample of schools in California. The clusters are School Districts (stored in the variable `dnum`), and the variable `snum` identifies the schools within clusters. The variable `fpc1` contains the value 757: the number \(M\) of School Districts, and `fpc2` gives \(N_k\), the number of schools in the cluster from which each school was selected. In the sample \(m=40\) districts are chosen, and then a sample of between 1 and 5 schools was taken by SRS from each selected cluster. A total of 126 schools were selected.

To tell iNZight that this is a two-stage cluster sample drawn from \(M=757\) districts:

- specify `dnum` as the `1st stage clustering variable`,
- specify `snum` as the `2nd stage clustering variable`,
- type `fpc1+fpc2` as the `Finite population correction`, to indicate the two population size columns.

As before, look at school size (Variable 1 = `enroll`) and school type (Variable 2 = `stype`).

(**Note:** iNZight may run significantly more slowly when a complex design like this is specified.)

`Summary` output:

```
====================================================================================================
iNZight Summary - Survey Design
----------------------------------------------------------------------------------------------------
Primary variable of interest: enroll (numeric)
Secondary variable: stype (categorical)
Total number of observations: 126
Number omitted due to missingness: 6 (6 in enroll)
Total number of observations used: 120
Estimated population size: 5129
----------------------------------------------------------------------------------------------------
2 - level Cluster Sampling design
With (40, 126) clusters.
survey::svydesign(id = ~ dnum + snum, fpc = ~ fpc1+fpc2, data = dataSet)
====================================================================================================
Summary of enroll by stype:
---------------------------
Population estimates:
25% Median 75% Mean SD Total Est. Pop. Size | Sample Size Min Max
E 219.75 295.60 437.43 339.7874 159.80 1161343.98 3493.6 | 83 113 840
H 899.14 1114.75 1171.04 1038.6374 390.70 715486.12 688.9 | 20 112 2237
M 720.00 800.92 1061.68 839.3250 279.00 762442.83 946.3 | 23 156 1211
Standard error of estimates:
E 27.59 78.82 121.24 51.0810 28.88 334451.85 1119.7
H 232.01 184.75 51.12 81.5152 97.21 338720.99 289.4
M 149.18 78.66 105.74 67.5015 49.05 291735.04 311.8
Design effects:
E 8.2586 30.31
H 0.8966 32.62
M 1.2529 28.36
====================================================================================================
```

### 5.2.2 Supplying Weights

As in the one-stage design we are free to specify weights rather than supplying the details of the second-stage units. In that case we still need to specify the `1st stage clustering variable` (cluster labels, \(k\)) and the `Finite population correction` (number of clusters, \(M\)), but after that we can supply a weighting variable in place of the individual unit labels and the cluster size (\(N_k\)) values.

## 5.3 Stratified Cluster Designs

iNZight allows for stratified cluster designs: we simply specify a stratification variable, and then specify the cluster design (one- or two-stage) that takes place within each stratum.