Chapter 11 Cluster sampling

\(\DeclareMathOperator*{\argmin}{argmin}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}\) \(\newcommand{\rma}[2]{{\rm #1}[#2]}\) \(\newcommand{\estm}{\widehat}\)

We have mentioned previously that implementing an SRSWOR sample design in practice requires a list frame of the population units. Clearly, in many practical sampling situations a list frame of the population doesn’t exist; or, if it does exist, the biases in the frame are unacceptably large; or the construction of a list frame would be very expensive, perhaps many times the cost of a sample survey of the population. What can we do?

It may be that we can form and list clusters of population units quite cheaply: e.g. we can divide the whole of New Zealand into small geographic areas on a large map, label these, and then select some of these areas by SRSWOR. (An example of such geographic areas is Statistics New Zealand’s meshblocks: New Zealand is split into about 34,000 meshblocks, each containing on average about 30 dwellings.)

What we then have is not an SRSWOR of dwellings, but an SRSWOR of clusters of dwellings. We then have the opportunity to list all the dwellings in each selected cluster and to take an SRSWOR of some of them, or indeed to sample all of them. This two-stage process of constructing a frame is likely to be quite cheap, since we never list the entire population.

Cluster sampling arises quite naturally in sampling biological data. For example if we are interested in determining the characteristics of a deep sea fish species, e.g. average age, average weight, etc, then it is likely that we collect the fish by trawl netting of suspected shoals of such fish. So we might randomly select some shoals of fish (first stage of clustering), then take some ‘random’ trawls through the selected shoal (second stage of clustering) and then finally we might randomly select some bins of fish and measure all the fish within the selected bins (third stage of clustering).

As you might expect, the properties of such sample designs are quite different from those of designs which do not select clusters. In particular, cluster designs are generally less efficient than SRSWOR designs: the design effect is often considerably more than 1. However, careful design of the clusters (making sure that the clusters are internally heterogeneous and that the final stage of clustering produces small clusters), together with careful choice of estimators and the use of stratification to group the first-stage clusters into homogeneous strata, can result in designs where the design effect is 1 or even less than 1.

In the following sections we discuss cluster sample designs where the sampling at each stage is SRSWOR. These are not the only possible designs, but they are the simplest. Recall that for any unusual design, provided we can work out the first and second order inclusion probabilities, we always have a way (via the HT estimator) of estimating totals or means and their variances.


11.1 Example


Suppose we have a primary school with 130 students in 12 classes. We want to estimate the total number of reading books taken home by the students. We do not have the time to count the number of books taken home by all 130 students so we are only going to look at a simple random sample of 12 students to calculate our estimates.

  • Let \(Y_i\) be the number of books taken home by the \(i^{\rm th}\) student in the school: \(i=1,\ldots,M\), with population size \(M=130\);
  • Let \(y_k\) be the number of books taken home by the \(k^{\rm th}\) student in the sample: \(k=1,\ldots,m\), with sample size \(m=12\).

(For reasons that will be clearer later in the chapter, we are using \(M\) and \(m\) for the population size and sample size, rather than the more usual \(N\) and \(n\).)

| Sample member \(k\) | Student (within school) \(i\) | Class | Class size | Student (within class) | Number of books \(y_k\) |
|---|---|---|---|---|---|
| 1 | 5 | 1 | 9 | 5 | 2 |
| 2 | 14 | 2 | 10 | 5 | 2 |
| 3 | 16 | 2 | 10 | 7 | 2 |
| 4 | 63 | 6 | 16 | 9 | 1 |
| 5 | 65 | 6 | 16 | 11 | 0 |
| 6 | 70 | 6 | 16 | 16 | 1 |
| 7 | 77 | 7 | 12 | 7 | 2 |
| 8 | 80 | 7 | 12 | 10 | 3 |
| 9 | 87 | 8 | 14 | 5 | 5 |
| 10 | 89 | 8 | 14 | 7 | 2 |
| 11 | 107 | 10 | 6 | 1 | 6 |
| 12 | 127 | 12 | 10 | 7 | 3 |

The sample mean is \(\bar{y}=2.417\) and the sample variance is \(s_y^2=2.811\) (\(s_y=1.68\)).
This is a case of simple random sampling: the SRSWOR estimator of the total is \[ \estm{Y} = M\bar{y} = (130)(2.417) = 314 \] with estimated variance \[ \bfa{Var}{\estm{Y}} = M^2\left(1-\frac{m}{M}\right)\frac{s_y^2}{m} = 130^2\left(1-\frac{12}{130}\right)\frac{2.811}{12} = 3592.9 \] so that the standard error \(\bfa{SE}{\estm{Y}}=\sqrt{3592.9}=59.9\).
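For reference, here is a minimal R sketch of this SRSWOR calculation, using the 12 sample values from the table above (base R only):

```r
# SRSWOR estimate of the total number of books, from the 12 sampled students
y <- c(2, 2, 2, 1, 0, 1, 2, 3, 5, 2, 6, 3)  # books taken home by each sampled student
M <- 130                                     # population size (students in the school)
m <- length(y)                               # sample size, 12

Y_hat   <- M * mean(y)                       # estimated total, about 314
var_hat <- M^2 * (1 - m / M) * var(y) / m    # estimated variance, about 3593
c(total = Y_hat, se = sqrt(var_hat))         # standard error about 59.9
```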

However, we might not have access to a list of all the students in the school – so we don’t have a frame from which we can select a simple random sample. Instead what we do have is a list of classes at the school. We can take a simple random sample of those classes, and then go to each selected class, and take our sample from the students we find there. In this case the classes are called clusters or PSUs (Primary Sampling Units). There are 12 classes at the primary school in question, each with a small number of students or SSUs (Secondary Sampling Units). We select \(n=3\) classes, and then select 4 students from each class.

  • Let \(N\) be the number of clusters: PSUs.
  • In cluster \(i\), there are \(M_i\) students: SSUs. There are \(M=\sum_{i=1}^N M_i\) SSUs in all.
  • Let \(Y_{ij}\) be the number of library books taken home by the \(j^{\rm th}\) student in class \(i\).
  • Let \(m_k\) be the number of students selected from the \(k^{\rm th}\) selected class.
  • Let \(y_{k\ell}\) be the number of library books taken home by the \(\ell^{\rm th}\) selected student in the \(k^{\rm th}\) selected class.
| Class \(i\) | Class size \(M_i\) | Sample indicator \(I_i\) | \(k\) | Sample size \(m_k\) | Sample data \(y_{k1},\ldots,y_{km_k}\) |
|---|---|---|---|---|---|
| 1 | 9 | 1 | 1 | 4 | 4,4,3,4 |
| 2 | 10 | 0 | | | |
| 3 | 20 | 0 | | | |
| 4 | 8 | 0 | | | |
| 5 | 7 | 0 | | | |
| 6 | 16 | 1 | 2 | 4 | 6,6,6,4 |
| 7 | 12 | 0 | | | |
| 8 | 14 | 0 | | | |
| 9 | 10 | 0 | | | |
| 10 | 6 | 0 | | | |
| 11 | 8 | 0 | | | |
| 12 | 10 | 1 | 3 | 4 | 1,2,3,5 |

We have a new sample of 12 students, but we need to be careful when we analyse these data: the 12 students are grouped into clusters. Students from the same class may be much more similar to each other than they are to the other students at the school, particularly if, say, the teacher of one class particularly encourages use of the library.

A measure of the similarity of units within clusters is given by the intra-cluster correlation coefficient \(\rho\), a quantity we would like to be as small as possible: this indicates that there is as much variability as possible within the clusters, so that each cluster can be regarded as a mini-population that reflects the properties of the whole population. If the clusters are very homogeneous, then sampling many SSUs within a selected cluster gives us less information about the population than when the clusters are very diverse. For a population whose \(N\) clusters all have the same size \(\bar{M}=M/N\), \[\begin{equation} \rho = \frac{ \sum_{i}\sum_{j\neq k}(Y_{ij}-\bar{\bar{Y}})(Y_{ik}-\bar{\bar{Y}})}{ (\bar{M}-1)(N\bar{M}-1)S_Y^2} \end{equation}\] where the inner sum runs over all pairs of distinct units \(j\neq k\) within cluster \(i\).
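As an illustration of this formula, here is a small R function (a sketch, not from the original text) that computes \(\rho\) for a population whose clusters all have the same size, supplied as an \(N\times\bar{M}\) matrix with one row per cluster:

```r
# Intra-cluster correlation for a population with equal-sized clusters.
# Y is an N x Mbar matrix: row i holds the Mbar population values in cluster i.
icc <- function(Y) {
  N    <- nrow(Y)
  Mbar <- ncol(Y)
  d    <- Y - mean(Y)                 # deviations from the overall mean
  S2   <- var(as.vector(Y))           # S_Y^2, with divisor N * Mbar - 1
  # sum over pairs j != k within each cluster: (row sum)^2 minus the sum of squares
  num  <- sum(rowSums(d)^2 - rowSums(d^2))
  num / ((Mbar - 1) * (N * Mbar - 1) * S2)
}
```

Rows that are internally constant but differ from one another give \(\rho\) close to 1 (homogeneous clusters), while rows that each look like a shuffled copy of the whole population give \(\rho\) near zero or slightly negative.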

Cluster sampling consists of two steps: first we select the PSUs (the clusters), and then we select the SSUs within them. In a one-stage cluster sample, we do a census of each selected cluster (e.g. we select a class, and then survey all of the students in the class). In a two-stage cluster sample we use some sampling method to select a sample of the SSUs in each selected cluster.

The example above is a two-stage cluster sample: we selected a sample of classes, and then took a sample within each selected class.

11.2 Comparison with stratified sampling

In both stratified and cluster sampling we break the population up into groups before drawing the sample. However beyond this superficial resemblance stratified and cluster sampling are very different.

When we set up stratified sampling our primary goal is to reduce the variance of estimators. We do this by placing units which are similar to each other in the same stratum. We thus attempt to put as much as possible of the variation present in the population into the differences between strata, and attempt to make each stratum as internally homogeneous as possible. After stratifying we sample units from within all of the strata.

In cluster sampling our primary goal is controlling the cost of creating the frame and collecting the sample. It is for this reason that the design effects of cluster designs may be worse than those of stratified and SRS designs. We do not survey all of the clusters, but only a random sample of them. It is more convenient to select a few clusters and survey those, since all of the units within a cluster will be close to one another (if the clustering is done geographically). Our hope in doing so is that the selected clusters still capture all of the variability present in the population. It is therefore in our interests that clusters be as internally variable as possible, and that the clusters resemble each other as much as possible.

11.3 Notation in cluster sampling

We need to extend our notation conventions to cope with cluster sampling. The population is divided into \(N\) clusters or primary sampling units (PSUs). The \(i^{\rm th}\) cluster contains \(M_i\) secondary sampling units (SSUs). Thus the total number of SSUs in the population is \(M=\sum_{i=1}^N M_i\). The mean cluster size is \[\begin{equation} \bar{M} = \frac{1}{N}\sum_{i=1}^N M_i = \frac{M}{N} \end{equation}\] so that \(M=N\bar{M}\).

11.3.1 Population Quantities

Within each cluster the SSUs are labelled \(j=1,\ldots,M_i\) and the value of variable \(Y\) on the \(j^{\rm th}\) unit in the \(i^{\rm th}\) cluster is \(Y_{ij}\). (Note that the SSUs may be our responding units, or we may need to have a further stage of sampling within selected SSUs to finally select the responding units.)

The total value of \(Y\) within cluster \(i\) is \[\begin{equation} Y_i=\sum_{j=1}^{M_i}Y_{ij} \end{equation}\] and the \(i^{\rm th}\) cluster mean is \[\begin{equation} \bar{Y}_i=\frac{1}{M_i}\sum_{j=1}^{M_i} Y_{ij} = \frac{Y_i}{M_i} \end{equation}\] The overall population mean is written \[\begin{equation} \bar{\bar{Y}} = \frac{\sum_{i=1}^N\sum_{j=1}^{M_i} Y_{ij}}{M} = \frac{\sum_{i=1}^N\sum_{j=1}^{M_i} Y_{ij}}{\sum_{i=1}^NM_i} = \frac{\sum_{i=1}^N Y_i}{N\bar{M}} \end{equation}\] which is different from the mean of cluster totals \[\begin{equation} \bar{Y}_T = \frac{1}{N}\sum_{i=1}^N Y_i = \bar{M}\bar{\bar{Y}} \end{equation}\] The population variance \[\begin{equation} S_Y^2 = \frac{1}{N\bar{M}-1}\sum_{i=1}^N\sum_{j=1}^{M_i} (Y_{ij}-\bar{\bar{Y}})^2 \end{equation}\] is a combination of within PSU variation \[\begin{equation} S_i^2 = \frac{1}{M_i-1}\sum_{j=1}^{M_i} (Y_{ij}-\bar{Y}_i)^2 \end{equation}\] and between PSU variation – captured by the population variance of PSU totals: \[\begin{equation} S_T^2 = \frac{1}{N-1}\sum_{i=1}^N (Y_i-\bar{Y}_T)^2 \tag{11.1} \end{equation}\] The quantity \(S_T^2\) is also known as the between cluster variance or variance of cluster totals.
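To make these definitions concrete, the following R sketch computes them for a small artificial population of three clusters (the values are invented purely for illustration):

```r
# A toy clustered population: N = 3 clusters with M_i = 3, 4, 2 units
clusters <- list(c(2, 3, 5), c(1, 1, 2, 4), c(6, 7))

N       <- length(clusters)
M_i     <- lengths(clusters)
M       <- sum(M_i)                     # total number of SSUs
Y_i     <- sapply(clusters, sum)        # cluster totals
ybarbar <- sum(Y_i) / M                 # overall population mean
Ybar_T  <- mean(Y_i)                    # mean of cluster totals
S2_Y    <- var(unlist(clusters))        # population variance, divisor M - 1
S2_i    <- sapply(clusters, var)        # within-PSU variances, divisor M_i - 1
S2_T    <- var(Y_i)                     # between-PSU variance of cluster totals
```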

11.3.2 Sample Quantities

In general we select \(n\) PSUs, and select \(m_k\) units from within the \(k^{\rm th}\) selected PSU. In this chapter we assume that both selection methods are SRSWOR, although any probabilistic sampling method could be used. The sample of PSUs is called the first stage sample \(s_I\). The SSUs chosen from within the selected PSUs form the second stage sample \(s_{II}\). The sample members within the \(k^{\rm th}\) selected PSU form the subsample \(s_k\).

Thus \(y_{k\ell}\) is the value of the variable \(Y\) on the \(\ell^{\rm th}\) selected unit (\(\ell=1,\ldots,m_k\)) from within the \(k^{\rm th}\) selected cluster (\(k=1,\ldots,n\)).

The sample total for the \(k^{\rm th}\) selected PSU is \[\begin{equation} y_k = \sum_{\ell\in s_k} y_{k\ell} \end{equation}\] and the sample mean is \[\begin{equation} \bar{y}_k = \frac{1}{m_k}\sum_{\ell\in s_k} y_{k\ell} = \frac{y_k}{m_k} \end{equation}\] This leads directly to the HT estimator of the cluster total under SRSWOR: \[\begin{equation} \widehat{Y}_k = \frac{M_k}{m_k}\sum_{\ell\in s_k} y_{k\ell} = M_k\bar{y}_k \end{equation}\] and hence to the HT estimators of the population total: \[\begin{equation} \widehat{Y}_{HT} = \frac{N}{n}\sum_{k=1}^n \widehat{Y}_k = \frac{N}{n}\sum_{k=1}^n \frac{M_k}{m_k}\sum_{\ell\in s_k} y_{k\ell} \end{equation}\] and the population mean: \[\begin{equation} \widehat{\bar{Y}}_{HT} = \frac{\widehat{Y}_{HT}}{M} = \frac{1}{n\bar{M}}\sum_{k=1}^n \widehat{Y}_k = \frac{1}{n\bar{M}}\sum_{k=1}^n \frac{M_k}{m_k}\sum_{\ell\in s_k} y_{k\ell} \end{equation}\] The within PSU sample variance is estimated by \[\begin{equation} s_k^2 = \frac{1}{m_k-1}\sum_{\ell\in s_k} (y_{k\ell}-\bar{y}_k)^2 \end{equation}\] the mean of cluster totals is estimated by \[\begin{equation} \bar{\estm{Y}}_T = \frac{1}{n}\sum_{k=1}^n \estm{Y}_k = \frac{\estm{Y}_{HT}}{N} \end{equation}\] and the estimated variance of PSU totals is \[\begin{equation} s_T^2 = \frac{1}{n-1}\sum_{k=1}^n (\widehat{Y}_k-\bar{\estm{Y}}_T)^2 \tag{11.2} \end{equation}\] NB: \(s_T^2\) is not in general an unbiased estimator of \(S_T^2\), although it is unbiased in the case of single stage cluster sampling.

11.4 Single Stage Cluster Sampling

Consider the situation where our population is formed into \(N\) clusters of sizes \(M_i\), we take an SRSWOR sample of \(n\) clusters from the \(N\) clusters, and we then sample all the population units in the \(n\) selected clusters (\(m_k=M_k\)). The total sample size (of SSUs) is then \(m=\sum_{k\in s}M_k\). Again the natural parameter to estimate is the total \(Y\).

Example - School library books

Returning to the school library book example: assume that we have selected \(n=3\) classes by SRSWOR out of the \(N=12\) in the school, and that we have done a census of each selected class. (Note that in this case we can’t know the sample size in advance, since the clusters all have different sizes.)

The data are:
| Class \(i\) | Class size \(M_i\) | Sample indicator \(I_i\) | \(k\) | Sample size \(m_k\) | Sample data \(y_{k1},\ldots,y_{km_k}\) | Cluster total \(y_k=\sum_\ell y_{k\ell}\) |
|---|---|---|---|---|---|---|
| 1 | 9 | 1 | 1 | 9 | 3,4,4,4,2,4,6,4,4 | 35 |
| 2 | 10 | 0 | | | | |
| 3 | 20 | 0 | | | | |
| 4 | 8 | 0 | | | | |
| 5 | 7 | 0 | | | | |
| 6 | 16 | 1 | 2 | 16 | 1,6,4,6,3,2,5,4,1,6,0,6,4,2,2,1 | 53 |
| 7 | 12 | 0 | | | | |
| 8 | 14 | 0 | | | | |
| 9 | 10 | 0 | | | | |
| 10 | 6 | 0 | | | | |
| 11 | 8 | 0 | | | | |
| 12 | 10 | 1 | 3 | 10 | 3,3,6,4,5,2,3,5,1,4 | 36 |

11.4.1 Totals

Since we are not subsampling within the selected clusters (i.e. we are taking a census within each selected cluster), the cluster total of the variable of interest is known exactly for each selected cluster. The situation is therefore equivalent to taking an SRSWOR of size \(n\) from \(N\) units and measuring a new variable, the cluster total, rather than the individual values of the population units within the cluster.

Hence the obvious estimator of the total under SRSWOR is \[\begin{equation} \widehat{Y}_{1SC,SRSWOR} = \sum_{k\in s} \frac{y_k}{\pi_k} = \frac{N}{n}\sum_{k\in s} y_k \end{equation}\] where \(y_k\) is the exact cluster total for the \(k^{\rm th}\) selected cluster (the clusters here playing the role of population units), and as usual \(\pi_k\) is the first order inclusion probability of the \(k^{\rm th}\) selected cluster. This estimator is unbiased (since it is an HT estimator), and its variance is just \[\begin{equation} \bfa{Var}{\widehat{Y}_{1SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{S_T^2}{n} \tag{11.3} \end{equation}\] where the variance of cluster totals \(S_T^2\) is defined in Equation (11.1) above.

An estimate of the variance can be made from the sample as follows \[\begin{equation} \bfa{\widehat{Var}}{\widehat{Y}_{1SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} \end{equation}\] which uses \(s_T^2\) (as defined in Equation (11.2)) as an unbiased estimate of the variance of cluster totals \(S_T^2\).

Example - School library books

We view our situation as having taken an SRSWOR of \(n=3\) units from \(N=12\). The total numbers of books taken home in the selected units are \((35,53,36)\), with sample mean \(\bar{y}=41.33\) and standard deviation \(s_T=s_y=10.12\) (the \(y\) values here are the cluster totals, so \(s_y\) is the standard deviation of the cluster totals), so our best estimate of the total number of books taken home is \[ \widehat{Y}_{1SC,SRSWOR} = N\bar{y} = (12)(41.33) = 496 \] with estimated variance \[ \bfa{\widehat{Var}}{\widehat{Y}_{1SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} = 12^2\left(1-\frac{3}{12}\right)\frac{10.12^2}{3} = 3684 \] so the standard error of our estimate is \(60.7\) (RSE=\(0.122\)).

In this estimate we aggregate up all the responses from the individual students, and do all our analyses at the level of the class.
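The same arithmetic in R, treating the three cluster totals as the observations, might look like this:

```r
# One-stage cluster estimate of the total number of books, school example
N  <- 12                     # classes in the school
n  <- 3                      # classes selected
yT <- c(35, 53, 36)          # observed totals of the three selected classes

Y_hat   <- N * mean(yT)                         # about 496
var_hat <- N^2 * (1 - n / N) * var(yT) / n      # about 3684
c(total = Y_hat, se = sqrt(var_hat))            # standard error about 60.7
```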

11.4.2 Means

Clearly if the population size \(M=N\bar{M}\) is known then an estimate of the population mean is given by \[\begin{equation} \widehat{\bar{Y}}_{1SC} = \frac{\widehat{Y}_{1SC}}{M} = \frac{\widehat{Y}_{1SC}}{N\bar{M}} \end{equation}\] with variance \[\begin{equation} \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{1SC,SRSWOR}} = \frac{N^2}{{M}^2}\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} \end{equation}\]

However it is commonly the case that the cluster sizes \(M_i\) are known only in the \(n\) selected clusters, in which case the true population size \(M\) is unknown and must be estimated. The obvious way of doing this is to use the total estimator but now applied to the cluster size, i.e. \[\begin{equation} \widehat{M}= \frac{N}{n}\sum_{k \in s} M_k \end{equation}\] where \(M_k\) is a within cluster total just like \(y_k\).

Notice that in this case the estimator of the population mean is now a ratio of two random variables \[\begin{equation} \widehat{\bar{Y}}_{1SC,SRSWOR,R} = \frac{\widehat{Y}_{1SC,SRSWOR}}{\widehat{M}} = \frac{\sum_{k\in s}y_k}{\sum_{k\in s}M_k} \end{equation}\] so the expression for the variance needs appropriate modification, as given below.

Example - School library books

In the above example we know the sizes \(M_i\) of all of the classes, so that we know the total population \(M=\sum_{i=1}^N M_i=130\) and can also calculate the mean class size \(\bar{M}=\frac{M}{N}=\frac{130}{12}=10.83\).

Our estimate of the population mean number of books per child is then just the estimate of the total divided by \(M\): \[ \widehat{\bar{Y}}_{1SC} = \frac{\widehat{Y}_{1SC}}{M} = \frac{496}{130} = 3.82 \] with variance divided by \({M}^2\): \[ \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{1SC,SRSWOR}} = \frac{N^2}{{M}^2}\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} = \frac{12^2}{130^2}\left(1-\frac{3}{12}\right)\frac{10.12^2}{3} = 0.218 \] i.e. a standard error of \(0.47\) (RSE=\(0.122\)).

11.4.3 Ratio estimator of the total

We have the HT estimator of the population total \[ \widehat{Y}_{1SC,SRSWOR} = \frac{N}{n}\sum_{k\in s} y_k \] Now we can expect the within-cluster totals \(y_k\) to be strongly correlated with the cluster sizes \(M_k\), which suggests using a ratio estimator, with \(M_i\) as the auxiliary variable, to improve the variance: \[\begin{equation} \widehat{Y}_{1SC,SRSWOR,R} = \widehat{Y}_{1SC,SRSWOR}\frac{M}{\widehat{M}} = \frac{\sum_{k\in s} y_k}{\sum_{k\in s} M_k} \sum_{i=1}^N M_i \end{equation}\] This estimator has approximate variance \[\begin{eqnarray*} \bfa{Var}{\widehat{Y}_{1SC,SRSWOR,R}} &=& N^2\left(1-\frac{n}{N}\right)\frac{S_e^2}{n}\\ &=& N^2\left(1-\frac{n}{N}\right)\frac{1}{n(n-1)} \sum_{k=1}^n \left(y_k-\frac{\sum_{k'}y_{k'}}{\sum_{k'}M_{k'}}M_k\right)^2 \end{eqnarray*}\] If the cluster means are all equal then this variance is zero! This is an important property of cluster samples: they can be just as efficient as SRSWOR designs provided that all of the clusters are very similar.
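Applied to the one-stage data above (cluster totals 35, 53 and 36 from classes of sizes 9, 16 and 10, with \(M=130\)), a sketch of the ratio calculation is:

```r
# Ratio estimator of the total, using class size as the auxiliary variable
yk <- c(35, 53, 36)          # cluster totals
Mk <- c(9, 16, 10)           # sizes of the selected clusters
N  <- 12; n <- 3; M <- 130

r_hat <- sum(yk) / sum(Mk)   # estimated books per student, about 3.54
Y_hat <- r_hat * M           # ratio estimate of the total, about 461
e     <- yk - r_hat * Mk     # residuals about the ratio line
v_hat <- N^2 * (1 - n / N) * sum(e^2) / (n * (n - 1))
c(total = Y_hat, se = sqrt(v_hat))
```

For these data the standard error comes out at roughly a third of the HT estimator’s 60.7, because the class totals track the class sizes quite closely.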

11.4.4 Ratio estimator of the mean

If the total number of SSUs \(M\) is unknown, then the HT estimator of the mean is unavailable, and the best estimator of the population mean is the ratio estimator \[\begin{eqnarray*} \widehat{\bar{Y}}_{1SC,SRSWOR,R} &=& \frac{\widehat{Y}_{1SC,SRSWOR}}{\widehat{M}}\\ &=& \frac{\frac{1}{n}\sum_{k\in s} y_k}{\frac{1}{n}\sum_{k\in s} M_k}\\ &=& \frac{\sum_{k\in s} y_k}{\sum_{k\in s} M_k} \end{eqnarray*}\] This estimator has approximate variance \[\begin{eqnarray*} \bfa{Var}{\widehat{\bar{Y}}_{1SC,SRSWOR,R}} &=& \frac{N^2}{\widehat{M}^2}\left(1-\frac{n}{N}\right)\frac{S_e^2}{n}\\ &=& \left(\frac{n}{\sum_kM_k}\right)^2\left(1-\frac{n}{N}\right)\frac{1}{n(n-1)} \sum_{k=1}^n \left(y_k-\frac{\sum_{k'}y_{k'}}{\sum_{k'}M_{k'}}M_k\right)^2 \end{eqnarray*}\]
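The corresponding sketch for the mean (note that \(M\) is not needed):

```r
# Ratio estimator of the mean number of books per student, single-stage data
yk <- c(35, 53, 36); Mk <- c(9, 16, 10)
N  <- 12; n <- 3

r_hat <- sum(yk) / sum(Mk)                      # about 3.54 books per student
e     <- yk - r_hat * Mk                        # residuals about the ratio line
v_hat <- (n / sum(Mk))^2 * (1 - n / N) * sum(e^2) / (n * (n - 1))
c(mean = r_hat, se = sqrt(v_hat))
```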

11.4.5 Remark

Where we do not know the sizes of the clusters at the time of sampling, the sample size in terms of population units is a random variable, and it is not well controlled if the cluster sizes are very unequal. So if most of the economic cost of the survey lies in collecting the data from the sampled units within selected clusters, we could be faced with a large and unpredictable cost when using a single stage cluster design.

11.4.6 Systematic Random Samples

The linear systematic random sample (LSRS) covered in Chapter 10 is actually a special case of a one-stage cluster sample. We divide the population (the list) into \(L\) groups: the \(r^{\rm th}\) group starts at the \(r^{\rm th}\) entry of the list and contains every \(L^{\rm th}\) entry thereafter. We then choose one of these groups at random.

The method of spacing out the members of each cluster along the list means that there is a good chance that any variations will be present in every cluster, so that the clusters will resemble each other strongly. This is the situation we want when designing a cluster sample: so that any cluster is a good proxy for the whole population.

11.5 Two stage Cluster Design

Now we consider the situation where we take a sample of clusters and then subsample within the cluster. Again we restrict ourselves to the case of SRSWOR at both stages of sampling.

Sampling therefore proceeds as follows:

  1. First Stage. A sample of \(n\) PSUs is selected by SRSWOR from the population of \(N\) PSUs;
  2. Second Stage. From the \(k^{\rm th}\) selected PSU a sample of \(m_k\) SSUs is drawn by SRSWOR.

The total sample size (of SSUs) is then \(m=\sum_{k\in s}m_k\).

Example - School library books

We select \(n=3\) classes from the \(N=12\) classes in the school, and then select \(m_k=4\) students from each of the selected clusters.

| Class \(i\) | Class size \(M_i\) | Sample indicator \(I_i\) | \(k\) | Sample size \(m_k\) | Sample data \(y_{k1},\ldots,y_{km_k}\) | Cluster mean \(\bar{y}_k\) | Cluster std. dev. \(s_k\) |
|---|---|---|---|---|---|---|---|
| 1 | 9 | 1 | 1 | 4 | 4,4,3,4 | 3.75 | 0.50 |
| 2 | 10 | 0 | | | | | |
| 3 | 20 | 0 | | | | | |
| 4 | 8 | 0 | | | | | |
| 5 | 7 | 0 | | | | | |
| 6 | 16 | 1 | 2 | 4 | 6,6,6,4 | 5.50 | 1.00 |
| 7 | 12 | 0 | | | | | |
| 8 | 14 | 0 | | | | | |
| 9 | 10 | 0 | | | | | |
| 10 | 6 | 0 | | | | | |
| 11 | 8 | 0 | | | | | |
| 12 | 10 | 1 | 3 | 4 | 1,2,3,5 | 2.75 | 1.71 |

Unlike in the case of single-stage cluster sampling, we know in advance that we will achieve a total sample size of \(m=12\) students from the \(n=3\) classes (unless we find a class with fewer than \(4\) students!).

11.5.1 Inclusion probabilities

We can define inclusion probabilities for each of the stages of sampling in an obvious way. We make the important assumption that the sampling at both stages is independent (i.e. the subsampling in one cluster is independent of the subsampling in any other cluster).

  1. First Stage. The probability that PSU \(i\) is selected into the sample is the familiar SRSWOR result: \[ \pi_i = \frac{n}{N} \] and the 2nd order inclusion probabilities are likewise \[ \pi_{ij} = \begin{cases} \frac{n(n-1)}{N(N-1)} & \text{if cluster $i\neq j$}\\ \frac{n}{N} & \text{if cluster $i=j$} \end{cases} \]
  2. Second Stage. At the second stage we need to consider probabilities which are conditional on the first stage of selection. Thus given that the PSU \(i\) has been selected the inclusion probability of SSU \(j\) in PSU \(i\) is \[ \pi_{j|i} = \frac{m_i}{M_i} \] since we are selecting \(m_i\) units from \(M_i\) in cluster \(i\). The total inclusion probability of SSU \(j\) in PSU \(i\) is then \[ \pi_{(i)j} = \pi_{j|i}\pi_i = \frac{n}{N}\frac{m_i}{M_i} \] The second order inclusion probabilities are a little more complex. We consider the joint inclusion of unit \(j\) from cluster \(i\), and unit \(\ell\) from cluster \(k\): \[ \pi_{(i)j(k)\ell} = \begin{cases} \frac{n(n-1)}{N(N-1)}\frac{m_i}{M_i}\frac{m_k}{M_k} & \text{if cluster $i\neq k$}\\ \frac{n}{N}\frac{m_i(m_i-1)}{M_i(M_i-1)} & \text{if cluster $i=k$ and unit $j\neq\ell$}\\ \frac{n}{N}\frac{m_i}{M_i} & \text{if cluster $i=k$ and unit $j=\ell$}\\ \end{cases} \]
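For the school example (selecting \(n=3\) of the \(N=12\) classes, and then \(4\) students from each selected class of sizes 9, 16 and 10), these inclusion probabilities and the resulting design weights work out as follows:

```r
# First-stage and overall inclusion probabilities, and design weights
N  <- 12; n <- 3
Mk <- c(9, 16, 10)              # sizes of the selected classes
mk <- c(4, 4, 4)                # students sampled within each selected class

pi_psu <- rep(n / N, 3)         # first-stage probability, 0.25 for every class
pi_ssu <- pi_psu * mk / Mk      # overall probability for a student in each class
w      <- 1 / pi_ssu            # design weight for each sampled student
cbind(pi_psu, pi_ssu, w)        # here w happens to equal Mk, since (N/n)(Mk/mk) = Mk
```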

11.5.2 Totals

Since we are subsampling by SRSWOR, the obvious estimator of the total for the \(k^{\rm th}\) sampled cluster is \[ \widehat{Y}_k = \frac{M_k}{m_k}\sum_{\ell\in s_k}y_{k\ell} \] where \(s_k\) denotes the sample of population units selected at the second stage of sampling within that cluster. The estimator of the total in a two stage cluster design is then \[ \widehat{Y}_{2SC,SRSWOR} = \frac{N}{n}\sum_{k\in s_I} \widehat{Y}_k \tag{11.4} \] where \(s_I\) denotes the sample of clusters selected at the first stage. In full,
\[ \widehat{Y}_{2SC,SRSWOR} = \frac{N}{n}\sum_{k\in s_I} \frac{M_k}{m_k}\sum_{\ell\in s_k}y_{k\ell} = \frac{N}{n}\sum_{k=1}^n M_k\bar{y}_k = \frac{N}{n}\sum_{k=1}^n \estm{Y}_k \tag{11.5} \] It has population variance \[\begin{equation} \bfa{Var}{\widehat{Y}_{2SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{S_T^2}{n} + \sum_{i=1}^{N}M_i^2 \left(1-\frac{m_i}{M_i}\right)\frac{S_i^2}{m_i} \tag{11.6} \end{equation}\] The first term is identical to the variance of a single stage design (Equation (11.3)) and so is a function of the between cluster variation, \(S_T^2\). The second is an additional term due to subsampling within the selected clusters, and depends on the within cluster variances \(S_i^2\).

To estimate the variance of the total estimator from a single sample, it turns out that we simply substitute the sample analogues of \(S_T^2\) and \(S_i^2\) into the above formula, specifically \[\begin{equation} \bfa{\widehat{Var}}{\widehat{Y}_{2SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} + \frac{N}{n}\sum_{k\in s_I}M_k^2 \left(1-\frac{m_k}{M_k}\right)\frac{s_k^2}{m_k} \end{equation}\] Recall that whereas \(s_k^2\) is an unbiased estimator of \(S_k^2\), \(s_T^2\) is not unbiased for \(S_T^2\): in two-stage sampling the cluster totals are not observed but estimated, and the extra variability introduced by subsampling within clusters inflates \(s_T^2\). The two terms of the formula together nevertheless give a valid estimate of the overall variance.

As in the single stage cluster design, variable cluster sizes considerably affect the variance of the estimator. We can use a ratio estimator, and this will reduce the first term of the variance formula (Equation (11.6)).

11.5.3 Means

Estimates for means with their variances come from dividing the estimates for the total \(\widehat{Y}\) and standard error \(\bfa{SE}{\widehat{Y}}\) by the population size (number of SSUs) \(M\): \[\begin{eqnarray*} \widehat{\bar{Y}}_{2SC,SRSWOR} &=& \frac{\widehat{Y}_{2SC,SRSWOR}}{M}\\ \bfa{SE}{\widehat{\bar{Y}}_{2SC,SRSWOR}} &=& \frac{\bfa{SE}{\widehat{Y}_{2SC,SRSWOR}}}{M} \end{eqnarray*}\] using Equations (11.5) and (11.6) respectively.

Example - School library books

Let’s apply these results to our example. It’s important to keep the notation straight here:

  • We have a population with \(N=12\) clusters containing \(M=130\) students
  • We have selected \(n=3\) clusters
  • The sizes of the selected clusters are \(M_k=(9,16,10)\)
  • The sample sizes in the selected clusters are \(m_k=(4,4,4)\) (i.e. the same in each cluster)
  • The cluster means are \(\bar{y}_k=(3.75,5.5,2.75)\)
  • We estimate the cluster totals using the standard HT estimator \[ \begin{split} \estm{Y}_k &= M_k\bar{y}_k\\ \estm{Y}_1 &= 9\times3.75 = 33.75\\ \estm{Y}_2 &= 16\times5.5 = 88\\ \estm{Y}_3 &= 10\times2.75 = 27.5 \end{split} \] and the sample variance of these estimated totals is \(s_T^2=1107.1\)
  • The estimate of the total number of books taken home by all students in all classes is then \[ \widehat{Y}_{2SC,SRSWOR} = \frac{N}{n}\sum_{k=1}^n \estm{Y}_k = \frac{12}{3}\left( 33.75 + 88 + 27.5 \right) = \frac{12}{3}\times 149.25 = 597 \]
  • The variance of this estimator is estimated by \[ \begin{split} \bfa{\widehat{Var}}{\widehat{Y}_{2SC,SRSWOR}} &= N^2\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} + \frac{N}{n}\sum_{k\in s_I}M_k^2 \left(1-\frac{m_k}{M_k}\right)\frac{s_k^2}{m_k}\\ &= (12)^2\left(1-\frac{3}{12}\right)\frac{1107.1}{3}\\ & \qquad + \frac{12}{3}\left[ 9^2\left(1-\frac{4}{9}\right)\frac{0.5^2}{4} +16^2\left(1-\frac{4}{16}\right)\frac{1^2}{4} \right.\\ & \qquad\qquad \left. + 10^2\left(1-\frac{4}{10}\right)\frac{1.71^2}{4} \right]\\ &= 39854.2 + 378.2 = 40232.5 \end{split} \] so that the standard error is \(200.6\). This variance is dominated by the first term, which depends on the variability between the cluster totals. (A short R sketch of this calculation follows the list.)
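Here is that sketch, using only base R and the figures from the table of selected classes:

```r
# Two-stage estimate of the total number of books and its variance
N  <- 12; n <- 3                                         # classes: population / selected
Mk <- c(9, 16, 10)                                       # sizes of the selected classes
yk <- list(c(4, 4, 3, 4), c(6, 6, 6, 4), c(1, 2, 3, 5))  # sampled values in each class

mk     <- lengths(yk)                                    # 4, 4, 4
ybar_k <- sapply(yk, mean)                               # 3.75, 5.50, 2.75
s2_k   <- sapply(yk, var)                                # within-class sample variances
Yhat_k <- Mk * ybar_k                                    # estimated class totals: 33.75, 88, 27.5

Y_hat <- (N / n) * sum(Yhat_k)                           # about 597
s2_T  <- var(Yhat_k)                                     # about 1107
v_hat <- N^2 * (1 - n / N) * s2_T / n +
         (N / n) * sum(Mk^2 * (1 - mk / Mk) * s2_k / mk)
c(total = Y_hat, se = sqrt(v_hat))                       # standard error about 200.6
```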
Collecting together the estimates we have of the total number of books taken home by students in this primary school:
| Method | Sample size \(m\) | \(\widehat{Y}\) | \(\bfa{Var}{\widehat{Y}}\) | \(\bfa{SE}{\widehat{Y}}\) | \(\bfa{RSE}{\widehat{Y}}\) | VarRatio | Deff |
|---|---|---|---|---|---|---|---|
| SRSWOR | 12 | 314 | 3592.9 | 59.9 | 0.19 | 1.00 | 1.00 |
| 1SC | 35 | 496 | 3684.0 | 60.7 | 0.12 | 1.03 | 3.08 |
| 2SC | 12 | 597 | 40232.5 | 200.6 | 0.34 | 11.20 | 10.13 |

(The Deff’s here are estimated using iNZight.)

Comparing 2SC with SRSWOR we see a very poor Deff (10.13). This is not a particularly reliable estimate of the Deff, because of our small sample size, but it does show that cluster designs can be much less efficient than SRSWOR.
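For comparison, the same estimate can be obtained with R’s survey package. The sketch below is one way of setting it up, assuming the data are arranged one row per sampled student with the class identifier and a finite population correction for each stage; it should reproduce the hand calculation above (total about 597, standard error about 201).

```r
library(survey)

# One row per sampled student: class id, a student id, the response, and the
# finite population corrections for the two stages
books2 <- data.frame(
  class   = rep(c(1, 6, 12), each = 4),
  student = 1:12,
  books   = c(4, 4, 3, 4,  6, 6, 6, 4,  1, 2, 3, 5),
  fpc1    = 12,                              # N: number of classes in the school
  fpc2    = rep(c(9, 16, 10), each = 4)      # M_k: size of the selected class
)

dclus2 <- svydesign(ids = ~class + student, fpc = ~fpc1 + fpc2, data = books2)
svytotal(~books, dclus2)
```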

11.6 Design of cluster samples

Both in single stage and two stage cluster designs we can try to overcome the problem of variable cluster size, without resorting to ratio estimators, by selecting the clusters with a PPSWR design that uses the cluster size (or at least a recent value of it) as the size measure. However, such designs also have problems. For example, if the first-stage sampling fraction is not very small, then the probability of selecting the same cluster twice is reasonably high, and the design may not be as efficient as the one we have been considering.

In such a two stage cluster design we can specify in advance the sample size of the population units. However, we have to decide how many clusters to select and how many units to subsample within each cluster. If, as is often the case in practice, the first term of the variance formula (Equation (11.6)) is considerably larger than the second term, then it makes sense to sample more clusters and subsample fewer units within each cluster. Against this is the fact that the cost of collecting the data generally rises with the number of clusters selected: e.g. if one is personally interviewing respondents in a household survey, then travel to a cluster of dwellings is generally more expensive than travel around a cluster of dwellings.

11.6.1 Formation of clusters

Clusters are often naturally occurring units (e.g. cities, nests, or shoals of fish). But sometimes we are able to form the clusters ourselves: e.g. we can amalgamate suburbs or census area units to form clusters of any desired size. Where we are able to form the clusters ourselves, we should try to form them so that the between-cluster variance \(S_T^2\) is as small as possible and the within-cluster variances \(S_i^2\) are as large as possible. This will generally make the variance of the estimator smaller.

The trade-off between within- and between-cluster variability can be summarised using an adjusted \(R^2\) statistic defined as \[ R_a^2 = 1 - \frac{N\bar{M}-1}{N(\bar{M}-1)} \frac{\sum_{i=1}^N(M_i-1)S_i^2}{\sum_i\sum_j(Y_{ij}-\bar{\bar{Y}})^2} = 1-\frac{\sum_{i=1}^N(M_i-1)S_i^2}{N(\bar{M}-1)S_Y^2} \] This definition compares a measure of within-cluster variability with the total variability \(S_Y^2\). If the clusters are very homogeneous then the \(S_i^2\) are all very small and \(R_a^2\) is close to 1. In this case cluster sampling will be very inefficient. If, however, almost all the variability of the whole population is seen within the clusters, then \(R_a^2\) will be close to zero, and cluster sampling will do almost as well as SRSWOR.
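A small R helper written directly from this formula (a sketch only; the clusters are supplied as a list of numeric vectors, so the sizes may differ):

```r
# Adjusted R^2 for a clustered population, following the formula above
adj_r2 <- function(clusters) {
  y   <- unlist(clusters)
  N   <- length(clusters)
  M   <- length(y)                                   # equals N * Mbar
  S2  <- var(y)                                      # S_Y^2, with divisor M - 1
  ssw <- sum(sapply(clusters, function(ci)           # sum of (M_i - 1) * S_i^2
    if (length(ci) > 1) (length(ci) - 1) * var(ci) else 0))
  1 - ssw / ((M - N) * S2)                           # M - N equals N * (Mbar - 1)
}
```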

If the clusters are all (nearly) the same size \(\bar{M}\), an estimate of the design effect for a one-stage cluster design is given by \[ \bfa{\widehat{Deff}}{\widehat{Y}_\text{cluster}} = 1 + \frac{N(\bar{M}-1)}{N-1}R_a^2 \] Hence a larger \(R_a^2\), corresponding to more internally homogeneous clusters, leads to a larger design effect (i.e. a larger variance). Where the cluster sizes are all equal, \(R_a^2\) is approximately the same as the intra-cluster correlation coefficient \(\rho\), which is sometimes quoted as a measure of within cluster homogeneity. If the average number of SSUs sampled per cluster is \(\bar{m}\) then the design effect for one-stage cluster sampling is approximately \[ \bfa{\widehat{Deff}}{\widehat{Y}_{1SC}} = 1 + \rho(\bar{m}-1) \]

It is possible for \(R_a^2\) and \(\rho\) to be negative, in which case the clusters are almost identical in their properties: i.e. there is no variability between clusters and all the population variability is within clusters. This situation can occur in systematic sampling. In systematic sampling we break the population into a set of \(L\) distinct samples, and then select one of them randomly – hence systematic sampling is just a special case of one-stage cluster sampling.

In general cluster sampling rarely does better than SRSWOR, but systematic sampling where an auxiliary variable is used to order the population is one case where cluster sampling does perform better.

In summary, the optimal choice of number of clusters sampled and number of units subsampled within clusters can be decided on variance grounds, or on economic cost grounds but ideally on a mixture of both.

11.7 Multistage designs

We may want to take a third stage of sampling (i.e. we may wish to sample within SSUs). The ideas presented in this chapter can be simply extended in this case.

For example, the Statistics NZ Household sampling frame is a multistage cluster design, where the sample is selected as follows:

  1. NZ is stratified into \(H=12\) regions (e.g. Northland, Auckland, Hawke’s Bay etc.);
  2. Each stratum (region) is broken into \(N_h\) PSUs: which are amalgamations of a few meshblocks. Each PSU contains (on average) 100 households.
    An SRSWOR of \(n_h\) PSUs is taken within each stratum \(h\). (First stage sample.)
  3. Within the \(k^{\rm th}\) selected PSU, which is of size \(M_{hk}\), a linear systematic random sample is taken of \(m_{hk}\) households; (Second stage sample.)
  4. Depending on the survey, all \(A_{hk\ell}\) eligible members of the \(\ell^{\rm th}\) selected household will be surveyed, or a SRSWOR of \(a_{hk\ell}\) members may be taken. (Third stage sample.)

The sample size is therefore \[ \text{Sample Size} = \sum_{h=1}^H\sum_{k\in s_I}\sum_{\ell\in s_k} a_{hk\ell} \] and the sample weights of individual \(j\) in household \(\ell\) of PSU \(k\) in stratum \(h\) are \[ w_{hk\ell j} = \frac{N_h}{n_h}\frac{M_{hk}}{m_{hk}} \frac{A_{hk\ell}}{a_{hk\ell}} \] A ratio estimator (based on cluster size as discussed above) can be used to improve the variance of estimators under this design. Standard errors are computed using numerical resampling methods (jackknife), since the theoretical estimates are not sufficiently accurate.
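As a small illustration of the weight formula (the stratum and household counts below are invented for illustration only, not Statistics NZ figures):

```r
# Design weight for a single respondent under the three-stage design
N_h   <- 150; n_h   <- 12   # PSUs in the stratum / PSUs selected (hypothetical counts)
M_hk  <- 98;  m_hk  <- 9    # households in the selected PSU / households sampled (hypothetical)
A_hkl <- 3;   a_hkl <- 1    # eligible people in the household / people interviewed (hypothetical)

w <- (N_h / n_h) * (M_hk / m_hk) * (A_hkl / a_hkl)
w   # each such respondent stands in for roughly this many people in the population
```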

Note that although an LSRS is taken within PSUs, the analysis is done as if it were an SRSWOR. The LSRS is taken to space out the sample within the PSU, but it is reasonable to expect that the properties of such a sample are similar to those of an SRSWOR of that PSU.