Chapter 10 Systematic Random Sampling

\(\DeclareMathOperator*{\argmin}{argmin}\) \(\newcommand{\var}{\mathrm{Var}}\) \(\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}\) \(\newcommand{\rma}[2]{{\rm #1}[#2]}\) \(\newcommand{\estm}{\widehat}\)

There are certain situations where it is inconvenient or impossible to draw a SRS. For example, we may wish to take a sample of passengers arriving at an airport. Here we don’t have a frame to sample from, but we can nevertheless conceive of ways of sampling at a constant rate – e.g. to take a sample where the probability of selection is one in 20, we can draw a random number between 0 and 1 for every passenger who arrives. If that number is less than 0.05 we select the passenger into the sample.

However, it may be too difficult to implement this scheme in practice, given the setup at the arrival gate in the airport. Instead, because we believe that the passengers arrive in random order we can take a Linear Systematic Random Sample (LSRS). At the start of sampling we choose a random number \(r\) between 1 and 20, and then take the \(r^{\rm th}\) passenger, and then every 20th passenger after that.
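The random-start scheme just described can be sketched in a few lines of Python (the function name and the stream of passenger labels are illustrative, not from the text):

```python
import random

def systematic_stream_sample(passengers, interval=20):
    """Linear systematic sample from a stream: pick a random start r in
    1..interval, then take passenger r and every interval-th one after."""
    r = random.randint(1, interval)                    # random start
    return [p for i, p in enumerate(passengers, start=1)
            if i >= r and (i - r) % interval == 0]

# With 100 arrivals and an interval of 20, exactly 5 passengers are chosen,
# whatever the random start turns out to be.
sample = systematic_stream_sample(list(range(1, 101)), interval=20)
print(len(sample))  # 5
```

Note that, unlike the coin-flip scheme, the achieved sample size here is fixed whenever the number of arrivals is a multiple of the interval.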

Another example: we want to estimate the number of books in a library card catalogue which have not yet been entered into the computer catalogue.

There are roughly 500,000 books in the card catalogue and we want a sample of 2000 books to check whether they exist on the computer catalogue. The card catalogue comprises 25 drawers with roughly 2000 cards in a drawer. After selecting one card randomly in the first 250 cards in the first drawer, we select every 250th card thereafter, and check whether the selected cards are on the catalogue.

This method is often used when we want to sample without constructing a list frame. If we believe that the population is randomly ordered and that units are uncorrelated, then this is equivalent to an SRS. However, the assumption of random order is very often violated.

10.1 Implementation of LSRS

Let \(U\) be a population of \(N\) units. Suppose we want a sample of size \(n\). The population units are visualized as being arranged in a line.

  1. Calculate the sampling interval \(L\): the integer closest to \(N/n\). Then write \(N=mL+c\), where \(m\) is the integer part of \(N/L\) and \(c\) is the remainder, with \(0 \leq c < L\).

  2. Choose a random number (called the random start) \(r\) from \(\{1,\ldots,L\}\).

  3. Calculate the numbers \(r, r+L, r+2L, \ldots\); that is, include \(r+(k-1)\times L\) for each \(k\) such that \(r+(k-1)\times L \leq N\). These numbers correspond to the selected population units.

    Note if \(r \leq c\) then there are \(m+1\) such numbers, otherwise \(m\) such numbers.

    Note also that \(m\) may be quite different from \(n\), e.g.:

    1. Suppose \(N=17\) and \(n=7\). Since \(17/7=2.43\), \(L=2\), so \(m=8\) and \(c=1\). Thus the possible samples are of size 9 or 8, but never 7 (there is no sampling interval \(L\) we could choose which would result in a sample of size 7).

    2. Suppose \(N=17\) and \(n=6\). Since \(17/6=2.83\), \(L=3\), so \(m=5\) and \(c=2\). Thus the possible samples are of size 6 or 5.

    3. Suppose \(N=69\) and \(n=28\). Since \(69/28=2.46\), \(L=2\), so \(m=34\) and \(c=1\). Thus the possible samples are of size 35 or 34, far from the target of 28.

    Thus our achieved sample size \(n'\) (which is always \(m\) or \(m+1\)) can differ from the target \(n\): under LSRS the sample size is a random variable. If a fixed sample size is required then we must either take a further systematic sample from the remaining unselected units, or randomly delete selected units. However, if \(N\) is large compared with \(n\) the problem is minor.
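The three steps above, and the worked examples, can be checked with a short Python sketch (the function names are our own):

```python
import random

def lsrs_indices(N, n, r=None):
    """Labels selected by a linear systematic sample: L is the integer
    closest to N/n, r is the random start in 1..L, and we take
    r, r+L, r+2L, ... while the label does not exceed N."""
    L = int(N / n + 0.5)              # integer closest to N/n (halves round up)
    if r is None:
        r = random.randint(1, L)      # random start
    return list(range(r, N + 1, L))

def achieved_sizes(N, n):
    """Sorted set of achievable sample sizes over all L possible starts."""
    L = int(N / n + 0.5)
    return sorted({len(lsrs_indices(N, n, r)) for r in range(1, L + 1)})

print(achieved_sizes(17, 7))   # [8, 9]   -- a sample of size 7 is impossible
print(achieved_sizes(17, 6))   # [5, 6]
print(achieved_sizes(69, 28))  # [34, 35] -- far from the target of 28
```

Each achievable size is \(m\) or \(m+1\), matching the note above: starts \(r \leq c\) give \(m+1\) units, the rest give \(m\).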

10.2 Inclusion Probabilities

There are \(L\) possible starting points, and hence there are \(L\) possible samples. Hence the probability of selecting any particular sample is \[ p_{LSRS}(s) = \begin{cases} \frac{1}{L} & \text{if $s$ is one of the $L$ possible systematic samples} \\ 0 & \text{otherwise} \end{cases} \] Because each element \(i\) belongs to one and only one of the \(L\) equally probable systematic samples, the 1st order inclusion probabilities are \[ \pi_{i} = \frac{1}{L} \] and, for every \(i\neq j\), \[ \pi_{ij} = \begin{cases} \frac{1}{L} & \text{if $i$ \& $j$ are in the same sample} \\ 0 & \text{otherwise} \end{cases} \] The fact that some of the 2nd order inclusion probabilities are zero means that valid variance estimates cannot be calculated from the sample.
(However we will see later that we can regard LSRS as a special case of cluster sampling.)
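The inclusion probabilities can be verified by brute force: enumerate all \(L\) samples and count how often each unit appears. A small illustrative sketch:

```python
from fractions import Fraction

def inclusion_probs(N, L):
    """First-order inclusion probabilities under LSRS, found by counting
    each unit's appearances across the L equally likely samples."""
    counts = [0] * (N + 1)
    for r in range(1, L + 1):            # one sample per random start
        for i in range(r, N + 1, L):
            counts[i] += 1
    return [Fraction(counts[i], L) for i in range(1, N + 1)]

# Every unit lies in exactly one of the L samples, so every pi_i = 1/L.
print(set(inclusion_probs(N=17, L=2)))  # {Fraction(1, 2)}
```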

10.3 HT Estimators in LSRS

10.3.1 Total

We observed that the 1st order inclusion probabilities for a LSRS sampling scheme were \[ \pi_{i}=\frac{1}{L} \qquad \text{where $L$ was the integer closest to $N/n$} \] so that the HT estimator for the total can be constructed easily. It is \[ \widehat{Y}_{HT,LSRS} = \sum_{k\in s} \frac{y_{k}}{\pi_k} = \sum_{k\in s} \frac{y_{k}}{1/L} = L\sum_{k\in s} y_{k} \] This estimator is of course unbiased.

Note that unless \(N/n\) is exactly an integer (in which case \(L=N/n\)), this estimator is not quite the same as that for the total in SRSWOR. If we used the HT estimator for SRSWOR in an LSRS design where \(N/n\) is not exactly an integer, then we would be using a biased estimator. Of course if \(N/n\) is large the bias is small.

The variance of this estimator is approximately \[ \bfa{Var}{\widehat{Y}_{HT,LSRS}} = \frac{N^2}{L}\sum_{j=1}^L ( \widehat{\bar{Y}}_j - \bar{\bar{Y}} )^2 = \frac{N^2}{L}\sum_{j=1}^L (\bar{y}_j - \bar{\bar{y}})^2 \] where \[ \bar{\bar{Y}} = \frac{1}{L}\sum_{j=1}^L\widehat{\bar{Y}}_j = \frac{1}{L}\sum_{j=1}^L\bar{y}_j = \bar{\bar{y}} \] is the average of all the possible sample averages. This variance is small if the sample averages are almost the same.
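Because the design has only \(L\) equally likely samples, this variance can be computed exactly when the whole population is known, by averaging over all random starts. A toy sketch with a made-up population of \(N=10\) (the values are purely illustrative):

```python
def lsrs_total_variance(y, L):
    """Design variance of Y_hat = L * sum(sample y's), computed by
    enumerating the L equally likely systematic samples; y is the full
    population list in frame order."""
    estimates = [L * sum(y[r - 1::L]) for r in range(1, L + 1)]
    mean_est = sum(estimates) / L     # equals the true total Y (unbiasedness)
    return sum((t - mean_est) ** 2 for t in estimates) / L

y = [3, 8, 1, 9, 4, 7, 2, 6, 5, 0]    # toy population, N = 10
print(lsrs_total_variance(y, L=2))    # 225.0
```

Here the two possible estimates are 30 and 60; their average, 45, is exactly the population total, illustrating the unbiasedness claimed above.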

Because not all \(\pi_{ij}>0\), we cannot use the HT estimator of this variance from one sample. In fact there is no unbiased estimator of the variance for systematic sampling, which is not surprising since we are really splitting the population into several subpopulations and then only sampling fully one such subpopulation.

In the absence of an unbiased estimator for the variance we may use the variance estimator for a SRS estimate of the population total: \[ \bfa{\widehat{Var}}{\widehat{Y}_{HT,LSRS}} \simeq N^2\left(1-\frac{n}{N}\right)\frac{s_y^2}{n} \] This approximation may be sufficient, although it can either overestimate or underestimate the actual sampling variance.
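A sketch of this approximation computed from a single sample (the function name is ours and the toy numbers are only for illustration):

```python
def srs_variance_approx(sample, N):
    """SRS-style approximation N^2 (1 - n/N) s_y^2 / n to the variance of
    the LSRS total estimator, for want of an unbiased alternative."""
    n = len(sample)
    ybar = sum(sample) / n
    s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)   # sample variance
    return N ** 2 * (1 - n / N) * s2 / n

print(srs_variance_approx([3, 1, 4, 2, 5], N=10))  # 25.0
```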

10.3.2 Mean

Using the natural estimator of the population mean from the sample, we have \[ \widehat{\bar{Y}}_{HT,LSRS}=\frac{L}{N}\sum_{k \in s} y_{k} \] which equals the sample mean only if \(L=N/n\) exactly. In other words, the sample mean is a biased estimator of the population mean under LSRS unless \(N/n\) is an integer. We can write down the variance of this estimator as before, but there is no unbiased estimator of that variance from a single sample.

10.4 Circular Systematic Random Sampling

In LSRS we could not control the sample size \(n\) exactly. Circular Systematic Random Sampling (CSRS) is a means of guaranteeing a particular sample size. Instead of regarding the population units as lying on a line, we wrap the list around on itself. We choose \(L\) as above, but keep selecting from the list until we have \(n\) units. If the list runs out we go back to the beginning.
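A sketch of CSRS in Python, using the common convention of choosing the random start anywhere in \(1,\ldots,N\) (the names are illustrative):

```python
import random

def csrs_indices(N, n):
    """Circular systematic sample of exactly n units: random start anywhere
    on the circle, step by L, wrapping past the end of the list."""
    L = int(N / n + 0.5)                 # same interval as in LSRS
    r = random.randint(1, N)             # start can be any unit
    return [((r - 1 + k * L) % N) + 1 for k in range(n)]

# N = 17, n = 7 gave samples of size 8 or 9 under LSRS; CSRS always gives 7.
print(len(csrs_indices(N=17, n=7)))  # 7
```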

10.5 Using Auxiliary Data: Ordered Populations

We have motivated the use of LSRS as being almost as good as SRS if the population list is randomly ordered. If, however, the population list is ordered with respect to some auxiliary variable \(X\), known to be correlated with the outcome variable \(Y\), then LSRS makes some efficiency gains over SRS.

Consider the example of countries and their military expenditures. We wish to estimate the total military expenditure. However this varies wildly between countries.

The large variance of the SRS estimator is caused by the fact that sometimes we include the large units and sometimes we do not: the possible samples are very different from each other.

We could achieve strong gains in precision with a stratified random sample.

However if we order the population by GNP (known for every country on the frame), and then take a LSRS, we will always have a sample covering a range of GNP values, and since GNP is strongly correlated with military expenditure, there will be a range of military expenditure values. All of the possible samples will therefore resemble each other much more strongly than they would under SRS, and each one strongly resembles the population.

We still don’t have a theoretical variance formula which can be used to compute the standard error from a single sample, however.

10.6 Example

Consider the case of data on the GDP and military expenditures of \(N=150\) countries, shown in Figure 10.1.

Military expenditure and GDP

Figure 10.1: Military expenditure and GDP

We want to estimate the mean military expenditure (in billions of dollars) using a sample of size \(n=30\). The true mean value is \(\bar{Y}=13.3\), the variance is \(S_Y^2=4578.32\) and the variance of an estimate of the mean from a SRS is \[ \bfa{Var}{\widehat{\bar{Y}}} = \left(1-\frac{n}{N}\right)\frac{S_Y^2}{n} = 122.09 \] How does a LSRS compare?

First we calculate the sampling interval: \[ L = \ \text{closest integer to}\ \ \frac{N}{n}=\frac{150}{30}=5 \] so \(L=5\). That means there are 5 possible samples, corresponding to the 5 possible random starting points. Each sample is of size \(n'=30\) since \(5\times30=150\).

If we draw all 5 samples from the original list (with its original ordering, alphabetical by country), we find the following results:

Sample   \(\bar{y}\)   \(\widehat{\bar{Y}}\)   \(s_y^2\)
1        6.00          6.00                    16.95
2        40.77         40.77                   146.94
3        6.66          6.66                    14.91
4        7.00          7.00                    17.36
5        6.23          6.23                    12.05
(NB – Our estimates of the mean \(\widehat{\bar{Y}}\) are in each case equal to the sample mean \(\bar{y}\), because \(N/n'\) is an integer. If it were not an integer these values would differ.) The second sample is very different from the rest; consequently the variance of \(\widehat{\bar{Y}}\) is very large: \[ \bfa{Var}{\widehat{\bar{Y}}} = 235.46 \] i.e. a design effect of 1.93.

This result may be a consequence of an unfortunate ordering of the countries in the list. Here are the 5 possible samples drawn by LSRS after we randomly reorder the list:

Sample   \(\bar{y}\)   \(\widehat{\bar{Y}}\)   \(s_y^2\)
1        34.18         34.18                   141.38
2        11.57         11.57                   45.82
3        7.64          7.64                    16.83
4        3.72          3.72                    10.31
5        9.57          9.57                    21.19

There is still a lot of variability between the samples, and in this case we find \(\bfa{Var}{\widehat{\bar{Y}}} = 144.18\), i.e. a design effect of 1.18.

We can improve matters if we reorder the list by GDP prior to sampling. In this case we find:

Sample   \(\bar{y}\)   \(\widehat{\bar{Y}}\)   \(s_y^2\)
1        6.22          6.22                    16.99
2        7.32          7.32                    14.91
3        6.15          6.15                    13.49
4        14.25         14.25                   46.99
5        32.73         32.73                   141.74

This has reduced the variability between the samples, and in this case we find \(\bfa{Var}{\widehat{\bar{Y}}} = 128.92\), i.e. a design effect of 1.06, which is competitive with SRS.

LSRS may perform significantly worse than SRS; in many cases, however, LSRS yields variances which are similar. Those are the cases when the convenience of an LSRS may make it a preferable sampling scheme, and in those cases the SRS variance estimates should also suffice.
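The effect of ordering the frame by a correlated auxiliary variable can be seen in a small simulation. This uses synthetic skewed data, not the chapter's military-expenditure dataset, and the exact numbers depend on the random seed; but ordering the list makes the \(L\) samples resemble each other and shrinks the design variance of the mean estimator:

```python
import random

random.seed(1)
N, n = 150, 30
L = N // n                             # 5, since 150/30 is exactly an integer

# Synthetic skewed population, sorted as if ordered by a correlated auxiliary
y = sorted(random.expovariate(1 / 10) for _ in range(N))

def lsrs_mean_variance(pop, L):
    """Exact variance of the LSRS mean estimator over all L samples."""
    ests = [sum(pop[r::L]) * L / len(pop) for r in range(L)]
    mu = sum(ests) / L
    return sum((e - mu) ** 2 for e in ests) / L

shuffled = y[:]
random.shuffle(shuffled)               # destroy the helpful ordering
print("ordered list :", round(lsrs_mean_variance(y, L), 2))
print("shuffled list:", round(lsrs_mean_variance(shuffled, L), 2))
```

The ordered list gives a much smaller variance than the shuffled one, mirroring the design-effect pattern in the example above.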