1 Introductory Remarks

These STAT 392 Lecture Notes are derived from notes created by Richard Arnold, Alistair Gray, and other present and former members of staff at Victoria University of Wellington and at Statistics New Zealand.

They should not be distributed without permission.

1.1 Recommended reading

There are lots of books on sample surveys.

Thomas Lumley’s book ‘Complex Surveys’ (T. S. Lumley (2010)) is a good introduction to sample survey analysis with R, and accompanies the survey package (T. Lumley (2004)).
Groves et al. ‘Survey Methodology’ (Groves et al. (2004)) is a good general summary of the important issues in survey design.
Särndal, Swensson and Wretman is a classic in the field of the theory of sampling (Särndal, Swensson, and Wretman (1992)).
Likewise the books by Cochran (1977) and Kish (1965).
Little and Rubin have a great book on missing data (Little and Rubin (2002)).
More recent are the books by Lohr (1999) and Scheaffer, Mendenhall, and Ott (2006).
And in 1994 StatsNZ published ‘A Guide to Good Survey Design’ (Statistics New Zealand (1995)).

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$

2 Sample Surveys

A sample survey is an exercise in which data are collected from a sample (a subset) of a population. The data collected are used to create estimates of the characteristics or parameters of the population.

Some examples:

A sample of voters is telephoned to ask their voting intentions, with the aim of predicting the outcome of the election;
A sample of sites in a river catchment is tested for the presence of algae to determine the spread of the algae throughout the entire catchment;
A sample of households is selected, and the residents asked about their employment status, in order to estimate the national unemployment rate.

A census is a special case of a sample survey in which every member of the population is surveyed.

2.1 The Survey Process

The different steps in the survey process are shown in Figure 2.1.

Figure 2.1: The Survey Process

Every survey has a set of objectives: the particular population parameters which are of interest;
These parameters are properties of the target population: the population about which estimates are to be made;
Not all members of the target population are able to be identified: the ones that can be surveyed make up the survey population;
The sample frame is a listing of all members of the survey population;
The sample design is the method of selecting the sample from the frame. The sample size is usually decided as a compromise between the required accuracy of estimates and the survey costs and other constraints (e.g. the time available for the survey);
The survey instrument is the means of data collection. It is usually a paper or electronic questionnaire, completed by the respondent or the interviewer/observer. The instrument aims to measure the properties of the sample members from which the desired population characteristic can be estimated. The instrument must be valid (it must actually measure what it intends to measure) and reliable (repeated measurements of the same sample member under identical circumstances should always yield similar results);
The sample members are contacted and recruited into the sample. Not all selected members will be able to be contacted (even after strenuous efforts), and some will not respond even if contacted;
The data are collected from the sample members in some mode (e.g. face-to-face, telephone, web, observational, …);
The data collected are captured (stored on a computer); coded (converted into standard classification systems); edited (checks for data consistency); and stored in a final dataset;
The data may be adjusted, e.g. imputation or weight adjustments for nonresponse may be done;
Estimates of the parameters of interest are constructed, and other analysis of the data is carried out (e.g. regression estimation, comparison with other data etc.);
The results are summarised in a report;
The original data may be archived in some appropriate form, or destroyed;
A post-survey evaluation may be made to determine how well the survey met its goals.

For example: The Household Labour Force Survey (HLFS).

One of the main objectives of the HLFS is to estimate the unemployment rate every quarter;
The target population is the working age population of New Zealand;
The survey population is the civilian, non-institutionalised usually resident population of adults aged 15+ living in permanent, private dwellings on the main islands of New Zealand (North Island, South Island and Waiheke Island);
The sample frame is a list of dwellings created at the most recent census and regularly updated to reflect changes;
The sample design is a stratified cluster design. Within each local government area a sample of small areas (PSUs) is selected. A sample of households is selected from within each PSU. Every adult from the selected households is surveyed. 15000 households and 30000 adults are surveyed every three months, in order to create unemployment estimates accurate to within $0.5$%.
The survey instrument is an electronic questionnaire.
The interviewers contact the households in person or by telephone, making up to 10 call backs to ensure contact is made. Proxy responses are permitted (i.e. one household member can respond on behalf of another). The interview takes place by computer assisted personal or telephone interviewing (CAPI or CATI).
The results are post-stratified to match the current population estimates in each local government area;
Estimates of the unemployment rate are published within about 6 weeks of the end of the quarter.

The following tables summarise the properties of three very different surveys (following a template by Groves et al., 2004).

2.1.1 Household Labour Force Survey

Title	Household Labour Force Survey (HLFS)
Country	New Zealand
Sponsor	Statistics New Zealand
Collector	Statistics New Zealand
Purpose	To produce each quarter, a comprehensive range of statistics relating to the employed, the unemployed and those not in the labour force who comprise New Zealand’s working-age population.
Year started	1985
Target Population	The civilian non-institutionalised usually resident New Zealand population aged 15 and over
Sample Frame	Dwellings enumerated at the previous census and grouped into areas
Sample Design	Multistage, stratified clustered area probability sample of primary sampling units (PSUs); sample of dwellings within PSU drawn, all eligible adults within selected households
Coverage	Excludes households on offshore islands
Sample Size	15,000 households and 30,000 adults
Use of Interviewer	Interviewer administered
Mode of Administration	Face-to-Face (first) and telephone (subsequent) interviews for each household and each person
Computer Assistance	Computer assisted personal interview (CAPI) or telephone interview (CATI)
Selection Unit	Household
Reporting Unit	Household, person
Time Dimension	Ongoing rotating panel survey of dwellings
Frequency	Conducted quarterly
Interviews per Round of Survey	Each household is surveyed every three months over two years (8 times in all)
Levels of Observation	Household, person
Response Rate	Usually 90%
Web link	http://www.stats.govt.nz/
Source	http://www.stats.govt.nz/datasets/work-income/household-labour-force-survey.htm

2.1.2 Quality of Life Survey

Title	Quality of Life Survey
Country	New Zealand
Sponsor	City Councils of NZ’s largest cities; Ministry of Social Development
Collector	TNS
Purpose	To provide information to decision-makers to improve the Quality of Life in major New Zealand urban areas
Year started	1999
Target Population	Residents of the largest New Zealand cities
Sampling Frame	Electoral Roll
Sample Design	People were selected from the electoral roll; addresses matched to phone numbers; rang phone numbers and asked for person with the next birthday. Quota sampling within electoral wards (quotas by age, sex, ethnicity)
Coverage	Excludes households with no landline telephone
Sample Size	7720 achieved interviews
Use of Interviewer	Interviewer administered
Mode of Administration	Telephone interview
Computer Assistance	Computer assisted telephone interview (CATI)
Selection Unit	Household
Reporting Unit	Person
Time Dimension	Repeated Cross sectional survey, (most recent 2006)
Frequency	Once
Interviews per Round of Survey	One
Levels of Observation	Person
Response Rate	22%
Web link	http://www.bigcities.govt.nz/
Source	http://www.bigcities.govt.nz/

2.1.3 Survey of Hector’s Dolphins between Motunau and Timaru

Title	Survey of Hector’s Dolphins between Motunau and Timaru
Country	New Zealand
Sponsor	Department of Conservation
Collector	Department of Conservation
Purpose	To measure the abundance of Hector’s Dolphin (Cephalorhynchus hectori) between Motunau and Timaru in 1998.
Year started	1998
Target Population	Hector’s Dolphin between Motunau and Timaru
Sampling Frame	Line transects taken within 4 nautical miles of the coast between Motunau and Timaru
Sample Design	Line transects in four strata (Akaroa Harbour; other large bays on Banks Peninsula; Inshore zone (<4nm from shore); offshore zone (4-10nm))
Coverage	Excludes dolphins far offshore
Sample Size	Transects 1 nm apart within harbours and bays; 2 nm apart in Marine Mammal Sanctuary; 4nm elsewhere. 4 replicate surveys.
Use of Interviewer	Left and Right Observers on 15m catamaran
Mode of Administration	Observational, using seven-power binoculars
Computer Assistance	Third observer enters into palmtop as collected
Selection Unit	Transects; Dolphin Groups
Reporting Unit	Dolphin Groups
Time Dimension	Two month observation period
Frequency	Once (may be repeated in future)
Interviews per Round of Survey	One
Levels of Observation	Dolphin Groups
Response Rate
Web link	http://www.doc.govt.nz/
Source	Dawson et al. (2000) ‘Line-transect survey of Hector’s dolphin abundance between Motunau and Timaru’; DoC report.

2.2 Survey Error

At the end of a sample survey analysis we will have an estimate $\widehat{T}$ of a population parameter of interest $T$. For example $T$ might be the unemployment rate in the December quarter, and we find from the HLFS the estimate $\widehat{T}=3.8\%$.

We can be pretty sure that the true unemployment rate isn’t exactly 3.8%, but we expect it to be close to this. The difference between our survey estimate and the (unknown) truth is called the survey error: \[ \text{Survey Error} = \text{Estimate} - \text{Truth} = \widehat{T}-T \] The value of the error is unknown to us (because the truth $T$ is unknown), but it is useful to think about what factors contribute to the error, and what effects those factors have. The points at which the different survey errors enter the survey process are shown in Figure 2.2.

Figure 2.2: Sources of Survey Error

The error sources in the diagram are usually divided into Sampling Errors and everything else: i.e. Non-sampling Errors. The reason for this is that only sampling error can be properly quantified and allowed for using statistical theory. All of the other types of error need to be controlled and minimised as far as possible, and these are generally unquantifiable. There are methods for reducing the effect of some of these errors, but most such adjustments rely on assumptions that are untestable.

2.2.1 Sampling Error

Sampling error is the error that results from collecting information from only a subset of the population, rather than the whole population. Thus censuses have zero sampling error by definition. Sampling error is caused by the variability in the responses across the set of possible samples from the population.

The extent of the sampling error depends on many factors, including:

Sample size: increasing the sample size reduces the sampling error, although there is a point beyond which little practical gain is made by further increasing the sample size.
Variability of the characteristic of interest: the greater the variation in the population, the greater the sampling error.
Sample design: designs which use known population characteristics may reduce the sampling error by targeting the sampling most efficiently.
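The first two factors can be illustrated with the familiar approximation that the standard error of a sample mean is the population standard deviation divided by $\sqrt{n}$. The sketch below is only an illustration: it assumes simple random sampling from a large population, and the standard deviation of 10 is a made-up value, not taken from any survey in these notes.

```python
import math

def standard_error_of_mean(population_sd, n):
    """Approximate standard error of a sample mean: SD / sqrt(n).

    Assumes simple random sampling from a large population.
    """
    return population_sd / math.sqrt(n)

# Hypothetical population standard deviation, chosen only for illustration.
population_sd = 10.0

# Each sample size is four times the previous one.
sample_sizes = [100, 400, 1600, 6400]
standard_errors = [standard_error_of_mean(population_sd, n) for n in sample_sizes]

for n, se in zip(sample_sizes, standard_errors):
    print(f"n = {n:5d}  standard error = {se:.3f}")
```

Under these assumptions, quadrupling the sample size only halves the standard error, which is why the gains from ever larger samples tail off. Doubling the population standard deviation doubles the standard error at every sample size.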

2.2.2 Non-Sampling Error

Non-sampling error includes all other sources of error. Almost every step in the survey process is a potential source of non-sampling error, but the size of the error is often difficult or impossible to measure, and may be larger than sampling error.

Non-sampling errors may be related to:

Frame bias/coverage error: the sample frame does not match the target population
Non-random sample selection
Non-response or false response
Poor questionnaire design, leading questions, measurement error
Interviewer error
Data entry, processing, coding, editing errors
Post-survey adjustment errors
Model misspecification in analysis
Incorrect treatment of data from a survey with a complex design

2.2.3 Examples

Invalid Instrument

Australian National Referendum question:

Do you approve the proposed law to alter the Constitution to establish the Commonwealth of Australia as a republic with the Queen and Governor-General being replaced by a President appointed by a two-thirds majority of the Members of the Commonwealth Parliament?

This is two questions in one – the question does not address support for a republic in Australia, but only support for a particular model. 45% said yes to this question.

Invalid Instrument

Referendum question at the 1999 New Zealand general election:

Should there be a reform of the justice system placing greater emphasis on the needs of victims, providing restitution and compensation for them and imposing minimum sentences and hard labour for all serious violent offences?

Almost 92% of the population answered yes. But what question were they answering? There are 5 questions here! In fact they were probably just answering the question ‘Are you worried about violent crime?’

Coverage Error and Non-response Bias

The Literary Digest magazine ran a postal poll of 10 million people selected from phone books and car registration lists before the 1936 US election. It received a response rate of 23% (2.3 million responses), and incorrectly predicted 55% support for Alf Landon (Rep.) over the incumbent F. D. Roosevelt (Dem.) (41%).

The actual result was 37% for Landon, and 61% for Roosevelt.

George Gallup polled 5,000 people, also by post, but balanced the demographics of his sample. He predicted 54% for Roosevelt, and also predicted that the Literary Digest would get the result wrong.

The Literary Digest made two mistakes:

Coverage Error - the sampling frame (car owners and phone users) was more affluent than the general population (more likely to vote Republican);
Non-response Bias/Non-random selection - people wanting change (i.e. a Republican victory) were more likely to respond.
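The non-response mechanism can be made concrete with a small calculation. The sketch below is not a reconstruction of the 1936 poll: the two groups, their sizes, response rates and levels of support are all invented for illustration. It computes the share of support for a candidate that we would expect to see among respondents when one group is more likely to reply than the other.

```python
def true_share(groups):
    """Share supporting the candidate among everyone who was sent a form."""
    return sum(share * support for share, _, support in groups)

def expected_respondent_share(groups):
    """Expected share supporting the candidate among those who respond.

    Each group is a tuple:
    (share of the mail-out, response rate, support for the candidate).
    """
    responding = sum(share * rate for share, rate, _ in groups)
    supporting = sum(share * rate * support for share, rate, support in groups)
    return supporting / responding

# Hypothetical figures, invented for illustration only (not the 1936 data).
groups = [
    (0.4, 0.30, 0.70),  # people wanting change: more likely to respond
    (0.6, 0.10, 0.20),  # people content with the incumbent: less likely
]

print(f"true support:             {true_share(groups):.3f}")
print(f"support among responders: {expected_respondent_share(groups):.3f}")
```

With these made-up figures the true support is 40%, but the expected support among respondents is about 53%. The number of forms mailed out does not appear anywhere in the calculation, so under this simple model a larger mail-out does nothing to remove the bias.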

The Literary Digest, having previously been a highly successful polling organisation, went bankrupt the next year.

In 1948 the Gallup organisation overconfidently stopped polling 2 weeks before an election, missed a dramatic late shift in public opinion, and called the election incorrectly: predicting Harry S Truman would be defeated.

3 Developing Objectives

Step by step guide to developing a statement of objectives

Important note: the word question in this guide never refers to the questions that respondents are to answer. It refers to:

the research questions the survey is to answer
policy questions that the survey information is to be used to address, or
questions about the survey that need to be addressed.

3.1 An overview

This section covers all the steps very briefly. Each step is explained in more detail in the following pages.

Before you start. Gather existing information, document the problem and how you view it, ask for advice, and make sure that you need the survey.
Where to start. Document the overall research question you want to answer with this survey, and what the information from the survey will be used for.
Who or what you want to ask your research questions about. Specify your ideal population.
Periodicity. Say whether you need the information at one point in time, or information at a number of points in time, or continuously over a period.
The important quantitative research questions. Document a very few specific research questions that you want the survey to answer, define the terms used, and state uses of the information.
Key outputs. Specify the outputs you must produce, and the accuracy with which you want to provide them. This specification is the basis for sample design.
Other outputs. Specify any outputs you want to produce. These must be justified by the uses the information will be put to.
List of variables. Specify all the variables needed to produce the outputs you have specified. Give each a priority rating. Make sure all the terms are defined. Make sure it is clear what all the information is to be used for.
Keep checking design work and be prepared to revisit your objectives. As the design work progresses, you need to check it. The development team will give you progress reports. It may turn out that some of your objectives cannot be met. If so, you need to decide whether and how to progress.

3.2 Before you start

Are you sure that you need a survey? Does the information you need already exist? Are there other ways of finding out what you need to know?
Find people to help you.
Do you have a background paper on the subject of the survey? Does it cover all of the following points?
- what is already known on the subject
- how you are thinking of the problem - your model or way of seeing it
- why a survey is needed - other sources such as administrative data, or other surveys are not available or appropriate
- any information on the incidence of the thing your survey is going to be looking at.
If you have such a paper prepared, include it as an introduction to your objectives statement: it is very useful to the designers. If you don’t, consider writing one. It will help you establish the need for the survey, and it will help you sort out your ideas on the data you need from the survey.
Do you have a group of users who have needs for this information? Do you have an external client or clients whose information needs you have to reflect?

You need to develop a way of working with the various parties whose ideas have to be reflected in the set of objectives. It will be easier if all parties understand what the set of objectives is for and what it has to do. You could use these guidelines to help you all reach a shared understanding of what has to be done and why.

3.3 Where to start

What is the overall research question you want to answer with this survey?

This part is allowed to be a little woolly, for example:
What problems are there with housing in New Zealand? This information is needed so that the government can plan policy to address the problems.
What will the information from the survey be used for?

It’s worth spending quite a lot of time on this before going any further. It serves two main purposes: it justifies the survey itself, and it helps you work out what information is really needed.

The sort of statement you should have at the end of this phase is:
We need the information we will get from this survey in order to:
- decide where to target funding for $X$
- decide whether to change our policy on $Y$
- see whether policy has brought about the change it was meant to
The more specific and concrete you can be, the easier it will be to work out your information needs, and ensure that you have them right.
You may only be able to talk about the information being used to ‘inform debate’ but you must be specific about what the debate is about.

e.g. don’t say
informing debate on provision of housing assistance
but rather
informing debate on how housing assistance ought to be delivered to households, in particular whether …

The statement ‘there is a lack of information on …’ is not useful unless you can add ‘and it is needed in order to …’, so you might as well leave out the first bit.

3.4 State who it is you want to ask your research questions about

This is called defining your target population.

This is the ideal population about whom you want to be able to draw inferences. It may be that later it will turn out not to be practicable to survey exactly this population. For example, you may have to decide later whether it is economic to include the off-shore islands, or people in non-private dwellings. But for now the important issue is: who is it that you need to answer your research question about?

First, say whether the research question is about households, or people, or businesses or dwellings, or something else altogether. In the housing example mentioned above there could be two populations – one of people and one of dwellings.

People. Except for the five-yearly census, you probably won’t want to know about all the people who are in NZ. In the housing example, tourists, people in prison or hospital and people in army camps would probably not be included.

You might define your population as:
People living permanently in private dwellings in NZ

Note that you still have to define what you mean by all those words. Or you might want to say:
People living long-term in NZ but in institutional housing of any kind

You then have to make clear, by definition or listing, what you mean by institutional housing, plus define ‘long-term.’
Dwellings. In the housing example, you probably wouldn’t want to know about all dwellings. For example, non-private dwellings and unoccupied dwellings may not be of interest.

Occupied private dwellings in NZ
Again, note that you will still have to define what you mean by all those words.
Other. In some surveys the population will be neither people nor dwellings. It might be a population of businesses, of schools, or of something else. In any case, you need to say what is to be included, and what you mean by your terms.

3.5 Periodicity

Do you need information at one point in time, or information at a number of points in time, or continuously over a period?

The answer to this should arise from your research question. The examples suggested above would require information at one point in time. But a question such as:

What is the relationship between the quality of NZ housing and the extent of government involvement in housing provision?

would require information at a number of points, possibly annual information.

Another research question might require a longitudinal survey. For example,

How much does the quality of housing vary over the lifetime of an individual?

3.6 What are the important quantitative research questions that you want the survey to answer?

This question or questions should be based on statements you produced above. More than one question may be needed, but you should not have a long list - you haven’t got to the detailed information needs yet.

Each question should have a statement about the intended use of the information. For example:

How many people in NZ live in housing of a poor standard?
Needed because if the number of people living in poor dwellings is greater than X, the government will have to increase the budget allocation to housing. (You are going to have to say how big X is and how accurately you need to estimate it. Survey designers will need that in order to design the sample.)
How does the number of people living in housing of poor standard change as the housing sector is progressively privatised?
Needed to decide whether the effect of privatisation is positive or negative and thus to decide whether to continue with the policy.
How many dwellings are of a poor standard?
Needed to calculate the amount of money needed to reduce the number to an acceptable level. (You are going to have to decide what is an acceptable level.)

Clarification and specification: clarify each of your research questions by:

putting each important word into statements like:
What I mean by people living in dwellings is …
It includes people who have been in a household for more than 3 nights a week for the past 12 weeks, except for any who have another place of residence to which they will return in the next …
expanding it with some statements like
What will count as poor housing is the existence of one or more of: crowding, leaks, dampness, lack of inside plumbing.

In the second case you still need to define the terms, e.g.

Crowding is to be measured by dividing the number of people by the number of rooms, and any dwelling with a measure greater than 1.3 is to be counted as crowded.
Rooms to be counted in this measure include … but exclude ….
People living there is to include people who … but exclude people who are ….
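A definition written this tightly can be turned directly into a rule that a computer can apply, which is a useful check that nothing has been left vague. The sketch below is one possible coding of the crowding definition above; the function names are illustrative, and it takes the counts of people and rooms as given, since the rules for which people and rooms to count are left open in the example.

```python
CROWDING_THRESHOLD = 1.3  # people per room, from the example definition

def crowding_measure(people, rooms):
    """Number of people divided by number of rooms."""
    if rooms <= 0:
        raise ValueError("rooms must be positive")
    return people / rooms

def is_crowded(people, rooms, threshold=CROWDING_THRESHOLD):
    """A dwelling counts as crowded if its measure is greater than the threshold."""
    return crowding_measure(people, rooms) > threshold

# Two hypothetical dwellings.
print(is_crowded(people=5, rooms=4))  # measure 1.25
print(is_crowded(people=4, rooms=3))  # measure about 1.33
```

Writing the rule out this way also exposes decisions the definition must make, such as what to do with a dwelling that has no countable rooms (here an error is raised).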

3.7 Key outputs

Now you need to think about the outputs from your survey that you see as absolutely essential, and how accurate you need those outputs to be. You will probably plan a larger number of outputs, and may think of some after the survey has been run, but you must have the absolutely vital ones sorted out well in advance so that the sample can be designed to ensure that those outputs are produced at the level of accuracy you desire.

The best way to describe your key outputs is to produce skeletons of tables - the title and the labels of the axes. Attached to each should be a statement about the use of the information and the accuracy desired.

Note that if there is a standard published by Statistics New Zealand for any variable in your output, you should consider using it unless there is a good reason not to.

	Key output 1
Area of the country	Proportion of dwellings that are poor
North of the North Island
South of the North Island
South Island
Total

Definitions: North of North Island – area north of a line from Y to Z. South of North Island – area south of that line.
Poor dwellings are defined above.
This output is important so that housing assistance can be targeted to areas according to need
Cells are to have a sampling error of plus or minus 5%. For instance, if the estimate of a proportion of dwellings in the North Island that are poor is 40%, this is 40 $\pm$ 5%.

	Key output 2
Ethnicity	Proportion of population living in crowded dwellings
NZ Māori
European/Pākehā or Other
Total

Ethnicity - use SNZ standard definition
This output is important so that housing assistance can if appropriate be channelled through Māori organisations
Cells are to have a sampling error of plus or minus 5%. For instance, if the estimate of a proportion of population living in crowded dwellings is 40%, this is 40 $\pm$ 5%.
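An accuracy statement like this is what lets the designers work out a sample size. As a rough sketch only, the calculation below assumes simple random sampling from a large population and reads ‘plus or minus 5%’ as the half-width of a 95% confidence interval, in percentage points. Both are assumptions made here for illustration; a real design, such as a clustered one, would need its own calculation.

```python
import math

def srs_sample_size_for_proportion(p, margin, z=1.96):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin.

    Assumes simple random sampling from a large population.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Anticipated proportion of 40% and a margin of 5 percentage points,
# as in the worked statements above.
n_per_cell = srs_sample_size_for_proportion(p=0.40, margin=0.05)

# The most conservative choice is p = 0.5, where p * (1 - p) is largest.
n_worst_case = srs_sample_size_for_proportion(p=0.50, margin=0.05)

print(n_per_cell, n_worst_case)
```

The requirement applies to each cell of the table, so under these assumptions every area or ethnic group reported separately would need a sample of roughly this size, not just the survey as a whole.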

If something more complex than tabular output is being planned as a key output, talk to whoever is working on the sampling about what they need from you in this case.

3.8 Other outputs

These are outputs that will be useful to you, but that you can if necessary get along without, if, for instance, having them would double the cost of the survey because:

you’d need a much larger sample to enable you to have them, or
you’d need a different, more expensive, interviewing method to collect the necessary information.

At some stage the survey designers will talk to you about what level of accuracy you could get for these outputs from a sample designed to provide your key outputs. At that stage, you will need to make decisions about whether to keep or drop those proposed outputs.

Again, the best way to show the outputs you want to produce is to do skeletons of tables. The other way is to write the statements you would, or will, put into a report after the survey, leaving out the numbers. For example,

It was found that …% of parents of children aged under 5 lived in crowded dwellings, while of parents with a youngest child aged 5 or older only …% live in crowded dwellings.

Again, you should say why it is useful to know this.

This is needed to decide whether it would be useful to target housing assistance by presence of children aged under 5, perhaps through the present family assistance provided through IRD.

Again, you should remember that if there is a standard published by Statistics New Zealand for any variable in your output, you should consider using it unless there is a good reason not to.

3.9 List of variables

At this stage you should be able to put together a list of the variables or bits of information that you need. This is very useful to designers.

It is worth putting a priority rating for each variable at this stage. Then if some variables need to be dropped because of cost or time constraints there will be no delay while decisions are made about which to drop.

Information needed	Population the information is needed about	Definition	Output Categories	Priority
Crowding	all dwellings in pop. (see definition of population)	see point 2 above	Need number for each dwelling for producing medians, quartiles, etc	A
Housing quality	ditto	see X, Y	index to be developed (see X, Y)	A
Persistent leaks	ditto	leaks whenever there is rain	two-value variable - has or does not have - contributes to housing quality index	A
Lack of inside plumbing	ditto	No running water or drainage inside main structure	ditto	A
Security of Tenure	ditto	see definitions	two-value variable: secure/not secure - see X, Y	B
Ethnic group	ditto	as in SNZ standard	Māori/other	A
Area of country	ditto	NZ divided as shown in point 2	see point 2	B
Age	ditto	as in SNZ standard	10 year age-groups as in SNZ standard, or further collapsed as necessary	B
Children under 5	ditto	whether respondent is in parental role to a child less than 5	dichotomous variable - is/is not in such a parent role	C

3.10 Keep checking

Keep checking design work done for you against objectives, and be prepared to revisit your objectives.

If the questions being developed do not seem to you to match your objectives, point this out to the designers and ask them to find remedies, rather than trying to find remedies yourself.

Similarly if there are any features of the proposed methodology that make it unlikely, in your opinion, that the survey will be able to meet your objectives, point that out to the designers.

It is quite likely that some of the information you think you need will not be able to be collected, or will not be able to be collected given your budget. So you may need to look at the objectives again and decide whether the survey is still justified, given the reduced list of objectives, and consequently the reduced set of uses of the information.

4 Sampling and Estimation


4.1 Sampling Error

Our aim is to estimate a population parameter $T$, such as the unemployment rate, using data from a sample of the population. In doing so we end up with an estimate $\widehat{T}$ of $T$, rather than the true value of $T$ itself.

Setting all other sources of survey error aside for the moment, we will concentrate on sampling error: the error caused by taking a sample rather than doing a census. If we were to do a census, then we would simply be measuring the value of $T$. Sampling error arises because the properties of a sample are (somewhat) different from the properties of the population from which the sample comes.

As an example, consider the small population of $N=10$ people listed in Table 4.1, and the sizes of the families (total numbers of siblings) that they come from.

Table 4.1: Family size data
Person, $i$	Family Size, $Y_i$
1	3
2	5
3	1
4	4
5	9
6	3
7	2
8	4
9	6
10	2

The mean family size and total number of siblings in this population are \[ \bar{Y} = \frac{1}{N} \sum_{i=1}^N Y_i = 3.9 \ \ \ \text{and}\ \ \ Y = \sum_{i=1}^N Y_i = N\bar{Y} = 39 \] If I want to estimate $\bar{Y}$, the mean family size, and I’m only allowed to take a sample of $n=5$, I would draw five numbers at random between 1 and $N=10$ (discarding duplicates) and then ask the five selected people how many siblings are in their families. I’d end up with five sample values $\{y_1, y_2, y_3, y_4, y_5\}$ and would use the mean of those values as an estimate of $\bar{Y}$. For example, if I draw random numbers $\{8, 5, 2, 3, 1\}$ then the sample data are \[ \{y_1, y_2, y_3, y_4, y_5\} = \{Y_8, Y_5, Y_2, Y_3, Y_1\} = \{4, 9, 5, 1, 3\} \] with mean $\bar{y}=4.4$, and standard deviation $s_y=2.966$. I take the sample mean $\bar{y}$ as my estimate $\widehat{\bar{Y}}=\bar{y}=4.4$ of the true value $\bar{Y}=3.9$.
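The arithmetic in this example is easy to check by computer. The sketch below uses Python, which is a choice made here for illustration and not one made by these notes; the variable names are likewise illustrative. It uses the ten values in Table 4.1 and the particular draw $\{8, 5, 2, 3, 1\}$.

```python
import random
from statistics import mean, stdev

# Table 4.1: person i -> family size Y_i.
family_size = {1: 3, 2: 5, 3: 1, 4: 4, 5: 9, 6: 3, 7: 2, 8: 4, 9: 6, 10: 2}

N = len(family_size)
population_total = sum(family_size.values())
population_mean = population_total / N

# The particular draw used in the text.
selected = [8, 5, 2, 3, 1]
sample = [family_size[i] for i in selected]
sample_mean = mean(sample)
sample_sd = stdev(sample)  # divisor n - 1

print(population_total, population_mean)
print(sample, sample_mean, round(sample_sd, 3))

# A fresh simple random sample without replacement; it will usually
# contain different people and so give a different estimate.
another_draw = random.sample(sorted(family_size), 5)
print(another_draw, mean(family_size[i] for i in another_draw))
```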

Note: the sample mean $\bar{y}=\frac{1}{n}\sum_k y_k$ is called an estimator of $\bar{Y}$, and the value that comes from a particular sample, e.g. $\bar{y}=4.4$, is called an estimate of $\bar{Y}$. That is, an estimator is a method of calculating an estimate.

Each sample will have different members, and a different mean. Thus the precision of the estimate $\bar{y}$ from any given sample depends on how different samples can be from one another: i.e. the precision depends on the sampling variability.

Figure 4.1 shows a histogram displaying the distribution of family sizes of the 10 members of the population. These range from 1 to 9. The thick vertical line shows the population mean 3.9.

Figure 4.1: Histogram of Family sizes

The variance of the data in this histogram is the variance of the population: \[ S_Y^2 = \bfa{Var}{Y_i} = \frac{1}{N-1}\sum_{i=1}^N (Y_i-\bar{Y})^2 = 5.4333 = 2.33^2 \] The histogram in Figure 4.2 has been made by drawing from the population all of the 252 possible distinct samples of size 5, and then calculating the mean of each one.

Figure 4.2: Histogram of estimates of mean family size in the 252 possible SRSWOR of size 5 drawn from the population

Note the following:

The estimates from the 252 possible samples cluster around the true mean 3.9;
The distribution appears roughly normal (unimodal, symmetric, few/no outliers) – even though the data are not Normal;
The variance of the distribution of sample means $\bar{y}$ is less than the variance of the population: \[ \bfa{Var}{\bar{y}} = 0.545 = 0.739^2 \] The standard deviation of the $\bar{y}$ values is called the standard error of the estimator $\bar{y}$, and it can be used in standard ways to compute a confidence interval for $\bar{Y}$. \[ \bfa{SE}{\bar{y}} = \sqrt{\bfa{Var}{\bar{y}}} = 0.739 \] This value comes from a consideration of all possible samples, and can only be calculated using full knowledge of all the population values.
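The enumeration behind Figure 4.2 is small enough to reproduce directly. The following is a minimal sketch in Python (our choice of language here, not something the notes prescribe) that lists all 252 samples of size 5 from Table 4.1 and summarises their means. It computes the variance in two ways: with equal weights $1/252$ on each sample, and with divisor 251. The second version seems to match the value 0.545 quoted above, although that reading is an assumption on our part.

```python
from itertools import combinations
from statistics import mean

# Family sizes Y_1..Y_10 from Table 4.1
Y = [3, 5, 1, 4, 9, 3, 2, 4, 6, 2]
N, n = len(Y), 5

# Mean of every distinct sample of size n (combinations works by position,
# so people with equal family sizes are still treated as distinct units)
sample_means = [mean(s) for s in combinations(Y, n)]
num_samples = len(sample_means)

# Under SRSWOR every sample is equally likely: p(s) = 1 / num_samples
expected = sum(sample_means) / num_samples
var_exact = sum((m - expected) ** 2 for m in sample_means) / num_samples

# The same sum of squares with divisor (num_samples - 1)
var_divisor_251 = var_exact * num_samples / (num_samples - 1)

print(num_samples, round(expected, 4))
print(round(var_exact, 4), round(var_exact ** 0.5, 4))
print(round(var_divisor_251, 4), round(var_divisor_251 ** 0.5, 4))
```

The mean of the 252 estimates equals the population mean 3.9, which is the unbiasedness property discussed in Section 4.3.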

However, based on our sample only, and ignoring the fact (that we’ll get to later) that this population is finite, we can approximate the standard error of our estimate of the mean from the sample standard deviation and sample size, using the familiar formula: \[ \bfa{SE}{\bar{y}} \simeq \frac{s_y}{\sqrt{n}} = \frac{2.966}{\sqrt{5}} = 1.326 \] So a 95% confidence interval for $\bar{Y}$ based on our sample only is then \[\begin{eqnarray*} &&\bar{y} \pm 1.96\times \bfa{SE}{\bar{y}}\\ &&\bar{y} \pm 1.96\times \frac{\sigma}{\sqrt{n}}\\ &&\bar{y} \pm 1.96\times \frac{s_y}{\sqrt{n}}\\ &&4.4 \pm 1.96\times \frac{2.966}{\sqrt{5}}\\ &&4.4 \pm 1.96\times 1.326\\ &&4.4 \pm 2.6\\ &&(1.8, 7.0) \end{eqnarray*}\]
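The interval calculation can be checked with a few lines of Python. This is only a sketch of the arithmetic above: it uses the infinite-population approximation $s_y/\sqrt{n}$ and the Normal multiplier 1.96, exactly as in the text, even though $n=5$ is far too small for the large-sample result to be trusted in practice.

```python
from math import sqrt
from statistics import mean, stdev

sample = [4, 9, 5, 1, 3]      # {Y_8, Y_5, Y_2, Y_3, Y_1} from the text
n = len(sample)

y_bar = mean(sample)          # sample mean
s_y = stdev(sample)           # sample standard deviation (divisor n - 1)
se = s_y / sqrt(n)            # ignores the finite size of the population
z = 1.96                      # multiplier for 95% confidence

ci = (y_bar - z * se, y_bar + z * se)
print(round(y_bar, 1), round(s_y, 3))
print(tuple(round(limit, 1) for limit in ci))
```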

Note that the properties of samples vary less as the sample size gets bigger - see Figure 4.3, where we can see the sampling error decrease as $n$ increases.
Ultimately when the sample size is $n=N$, i.e. a census, there’s only one possible sample.

Figure 4.3: Histogram of estimates of mean family size in the possible SRSWOR of (a) size 1, (b) size 5, (c) size 8 and (d) size 10 drawn from the population

The diagram in Figure 4.4 is a picture of what is going on. We have a population of interest, which has a parameter or parameters that we want to estimate. We draw a sample according to some selection procedure, well aware that the properties of that sample will differ somewhat from those of the other possible samples that we might have drawn. Our inferences about the characteristics of the population must take this sampling variability into account. We can only make proper inferences about the population if we can assign a selection probability to all possible samples.

Figure 4.4: Statistical Inference. We draw a sample from the population, well aware that the properties of that sample will differ somewhat from those of the other possible samples that we might have drawn. Our inferences about the characteristics of the population must take this sampling variability into account.

4.2 Statistical Review

In earlier courses you will already have come across estimators of population parameters and their standard errors. A brief review of these estimators and their properties is given below. These formulae assume that the sampling is being done from infinite populations – however one of the important differences between the theory of sample surveys and the rest of statistics is the incorporation of the finite size of populations, and we will see important modifications to these formulae later.

Estimation of a mean $\mu$ from a sample $\{y_1,\ldots,y_n\}$ of size $n$ taken from a population with variance $\sigma_y^2$: \[ \widehat{\mu} = \bar{y} = \frac{1}{n}\sum_{k=1}^n y_k \] with standard error \[ \bfa{SE}{\widehat{\mu}} = \frac{\sigma_y}{\sqrt{n}} \simeq \frac{s_y}{\sqrt{n}} \] When the Central Limit Theorem holds the sampling distribution of $\bar{y}$ is \[ \bar{y} \sim \text{Normal}\left(\mu,\frac{\sigma^2_y}{n}\right) \] and a large sample confidence interval is \[ \bar{y} \pm Z \times \frac{s_y}{\sqrt{n}} \]
Estimation of a proportion $p$ from a Binomial sample of size $n$, where $X$ = number of successes in $n$ identical trials.
\[ \widehat{p} = \frac{X}{n} \] with standard error \[ \bfa{SE}{\widehat{p}} = \sqrt{\frac{p(1-p)}{n}} \simeq \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \] When the Central Limit Theorem holds the sampling distribution of $\widehat{p}$ is \[ \widehat{p} \sim \text{Normal}\left(p,\frac{p(1-p)}{n}\right) \] (the second argument is the variance, as for the mean above) and a large sample confidence interval is \[ \widehat{p} \pm Z \times \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} \]

The large sample results rely on the Central Limit Theorem, and $Z=1.96$ for 95% confidence. A confidence interval is always stated as: \[ \begin{split} &\text{estimate} \pm \text{margin of error}\\ &\widehat{T} \pm \bfa{MOE}{\widehat{T}}\\ &\widehat{T} \pm Z\times\sqrt{\bfa{Var}{\widehat{T}}}\\ &\widehat{T} \pm Z\times\bfa{SE}{\widehat{T}} \end{split} \] Also note that a proportion is in fact just a special case of a mean. If in $n$ Binomial trials we record $y_k=1$ for a success and $y_k=0$ for a failure (i.e. $y_k$ is an indicator variable), then the proportion of successes is \[ \widehat{p} = \frac{1}{n} \sum_{k=1}^n y_k = \frac{1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + \ldots}{n} = \frac{X}{n} \] For this reason we do not have to treat means and proportions separately: one is just a special case of the other.
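The equivalence between a proportion and a mean of indicator values is easy to see numerically. In the sketch below the eight 0/1 outcomes are hypothetical, chosen only to illustrate the arithmetic. With so few trials the Normal-based margin of error is shown for the formula's sake rather than as a trustworthy interval.

```python
from math import sqrt

# Hypothetical outcomes of n = 8 trials: 1 = success, 0 = failure
y = [1, 0, 1, 0, 0, 1, 0, 0]
n = len(y)
X = sum(y)                    # number of successes

p_hat = X / n                 # proportion of successes
y_bar = sum(y) / n            # mean of the indicator variable

se = sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of p_hat
moe = 1.96 * se                      # margin of error at 95% confidence

print(p_hat, y_bar)
print(round(se, 4), round(moe, 4))
```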

4.3 Expected Value, Bias and Mean Squared Error

Given a population, the choice of a sample design defines a set of all possible samples: this set is called the sample space, $\mathbb{S}$, which may be enormous. (In the example above, a random sample of 5 people from a population of size 10 means that there are 252 possible samples.) In any given survey we observe just one element $s$ of the sample space, and we can calculate the probability $p(s)$ of obtaining that sample.

An estimator $\widehat{T}(s)$ of a population parameter $T$ (e.g. the total or mean of a variable $Y$) is a random variable. Under a probabilistic sampling scheme it takes on a different value for every sample $s$ that can be drawn under that scheme.

The mean (expected) value of an estimator $\widehat{T}(s)$ is given by \[ \bfa{E}{\widehat{T}} = \sum_{s\in{\mathbb{S}}} p(s) \widehat{T}(s) \] and its variance is \[ \bfa{Var}{\widehat{T}} = \sum_{s\in{\mathbb{S}}} p(s) \left(\widehat{T}(s)-\bfa{E}{\widehat{T}}\right)^2 \] The variance is a strong indicator of the quality of the estimator. The standard error is the square root of the variance: \[ \bfa{SE}{\widehat{T}} = \sqrt{\bfa{Var}{\widehat{T}}} \]

The sampling bias of an estimator is defined to be the difference between the mean value of the estimator, and the population parameter: \[ \bfa{Bias}{\widehat{T}} = \bfa{E}{\widehat{T}}-T \] (Note that the sampling bias excludes any biases due to non-sampling error - such as non-response, questionnaire bias, respondent error etc.)

The diagram in Figure 4.5 shows the situation: the error of an estimator is a combination of bias and variance.

Figure 4.5: An estimator $\widehat{T}$ of a population parameter $T$ has a sampling distribution with a mean $\bfa{E}{\widehat{T}}$, standard error $\bfa{SE}{\widehat{T}}$, variance $\bfa{Var}{\widehat{T}}=\bfa{SE}{\widehat{T}}^2$, and possible bias $\bfa{Bias}{\widehat{T}}$.

Where an estimator is biased, we use the mean squared error as a measure of quality, instead of the variance: \[ \bfa{MSE}{\widehat{T}} = \bfa{E}{(\widehat{T}-T)^2} = \bfa{Var}{\widehat{T}} + \bfa{Bias}{\widehat{T}}^2 \] The MSE combines the effects of sampling variance and sampling bias. A low variance estimator may be highly biased, and we may instead prefer an estimator with a higher variance but with lesser bias (Figure 4.6).

Figure 4.6: In general we prefer zero bias and low variance estimators. A low variance but highly biased estimate is usually unacceptable, and we may have to put up with high variance in order to eliminate bias. However a small bias is acceptable if this reduces the variance significantly.

4.3.1 Example

Under simple random sampling, the sample mean $\bar{y}$ is an unbiased estimator $\widehat{\mu}$ of the population mean $\mu$: \[\begin{eqnarray*} \bfa{E}{\widehat{\mu}} &=& \mu\\ \bfa{Var}{\widehat{\mu}} &=& \frac{\sigma^2}{n}\\ \bfa{Bias}{\widehat{\mu}} &=& \bfa{E}{\widehat{\mu}}-\mu=\mu-\mu=0\\ \bfa{MSE}{\widehat{\mu}} &=& \bfa{Var}{\widehat{\mu}}+\bfa{Bias}{\widehat{\mu}}^2 = \frac{\sigma^2}{n} + 0 = \frac{\sigma^2}{n} \end{eqnarray*}\] We will come across estimators which are biased later in the course (e.g. the ratio estimator is an example of a biased estimator).
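For a population as small as the one in Table 4.1, the definitions of expected value, variance, bias and MSE can be evaluated exactly by summing over the whole sample space. The sketch below does this for two estimators of the population mean: the sample mean and, purely as an illustrative comparison that we have introduced, the sample median. The function name `design_summary` is hypothetical. Because the samples are drawn without replacement from a finite population, the variance printed for the sample mean need not equal $\sigma^2/n$.

```python
from itertools import combinations
from statistics import mean, median

Y = [3, 5, 1, 4, 9, 3, 2, 4, 6, 2]   # family sizes from Table 4.1
T = mean(Y)                          # population mean, the target parameter


def design_summary(estimator, population, n, target):
    """Exact E, Var, Bias and MSE over all SRSWOR samples of size n."""
    samples = list(combinations(population, n))
    p = 1 / len(samples)                       # equal p(s) under SRSWOR
    values = [estimator(s) for s in samples]
    expected = sum(p * v for v in values)
    variance = sum(p * (v - expected) ** 2 for v in values)
    bias = expected - target
    mse = sum(p * (v - target) ** 2 for v in values)
    return {"E": expected, "Var": variance, "Bias": bias, "MSE": mse}


for name, estimator in [("mean", mean), ("median", median)]:
    summary = design_summary(estimator, Y, 5, T)
    print(name, {key: round(value, 4) for key, value in summary.items()})
```

In both cases the printed MSE agrees with Var plus the squared Bias, and the bias of the sample mean is zero up to rounding error.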

4.4 What are the advantages of sampling?

reduced cost;
quicker to carry out resulting in improved timeliness;
greater scope - sampling can allow for greater scope and flexibility due to it being less resource demanding. Often specialised personnel are scarce meaning that a complete census is impracticable;
improved accuracy and quality - often the time saved due to sampling can be put into more intensive and careful collection and checking of the data. More specialised training of the survey personnel is possible, allowing for an improved control of non-sampling error;
reduced burden on respondents.

4.5 What are the disadvantages of sampling?

sampling error;
less scope for detailed analysis of the data: may not obtain sufficient data for analysis of specific sub-populations;
technical knowledge required by survey staff (e.g. random selection of participants);
public perception – small groups may worry that they are not adequately represented in the results if a sample rather than a census is taken.

4.6 Approaches to sampling

Probability Sampling. Every sample in the sample space has a known chance of selection, and it follows that every member of the population also has a known nonzero probability of selection. These properties allow the probability distribution and other properties of estimators to be derived.
Purposive or Judgement Sampling. The surveyor chooses a sample which they believe, based on their knowledge of the population, to be ‘representative’. Alternatively the sample is chosen to cover the diversity present in the population.
Quota Sampling. Sample sizes or quotas are set for different types of unit in the population (e.g. age/sex groups). Units are selected until these quotas are met.
Haphazard Sample. Use whatever sample is available, e.g. ‘people I know’, or snowball samples (where sample members recruit other sample members).
Self-selected Sample. Volunteers phoning in, or writing letters.

Non-probability samples are entirely legitimate if a survey is qualitative. For example, a survey may aim to sample the full range of views held by all members of a population, in order to generate hypotheses which can be tested in a quantitative survey. A purposive sample is used to scatter the sample widely, rather than attempting to select a representative sample. Pilot surveys often have non-probability samples.

A purposive step in an otherwise probabilistic design may be acceptable: e.g. a random sample of people from Wellington may be justifiably thought to be similar on certain characteristics (e.g. reaction to using certain drugs) to people elsewhere. For practical reasons the survey can be justifiably restricted to Wellington, but the results can be generalised to the whole country. The justification for the generalisation needs to be sound, and can remain open to question.

4.7 Examples of sampling schemes

4.7.1 SRSWOR=Simple Random Sample WithOut Replacement.

Here we want to estimate the average number of hours spent by students majoring in Science on coursework at VUW.

We obtain a list of all students majoring in Science from the Registry. We assign a number from 1 to $N$, the total number of such students, to each student. We select $n$ distinct random numbers between 1 and $N$ from a set of random number tables. If the random number $i$ is chosen then we select the $i$th student. Then we collect the necessary information from the student.
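Today the random number tables would usually be replaced by a pseudo-random number generator. A minimal sketch of the selection step, in which the frame size of 1200 students, the sample size of 50 and the seed are all hypothetical values chosen for illustration:

```python
import random


def srswor(N, n, seed=None):
    """Select n distinct labels from 1..N, each subset being equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no duplicates can occur
    return sorted(rng.sample(range(1, N + 1), n))


selected = srswor(N=1200, n=50, seed=392)
print(selected[:10])
```

Fixing the seed makes the selection reproducible, which is useful when a sample has to be documented or audited.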

4.7.2 STSRS=Stratified Simple Random Sample.

We might want to estimate the number of employees in food retailers. We have a list of $N$ businesses to select from, and the earnings of each retailer in the past year.

We classify the retailers into several groups (called strata) based on their earnings: there are $H$ strata, with $N_h$ retailers in stratum $h$ (for $h=1,\ldots,H$). We take a SRSWOR of $n_h$ retailers from each stratum, and count the number of employees in each sampled retailer.
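The stratified selection can be sketched as below. Everything numerical here is hypothetical: the twelve earnings values, the cut points defining the three strata, and the stratum sample sizes $n_h$ are invented for illustration.

```python
import random

# Hypothetical frame: (retailer label, earnings last year in $000)
earnings = [40, 55, 70, 120, 150, 180, 260, 400, 650, 900, 1500, 2200]
frame = list(enumerate(earnings, start=1))


def stratum_of(value):
    """Assumed cut points: small < 100, medium < 500, large otherwise."""
    if value < 100:
        return "small"
    if value < 500:
        return "medium"
    return "large"


# Group the frame into strata of retailer labels
strata = {}
for label, value in frame:
    strata.setdefault(stratum_of(value), []).append(label)

n_h = {"small": 2, "medium": 2, "large": 3}   # assumed stratum sample sizes


def stratified_srswor(strata, n_h, seed=None):
    """Take an independent SRSWOR of size n_h[h] within each stratum h."""
    rng = random.Random(seed)
    return {h: sorted(rng.sample(units, n_h[h])) for h, units in strata.items()}


sample = stratified_srswor(strata, n_h, seed=392)
print(sample)
```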

4.7.3 LSRS=Linear Systematic Random Sample.

Here we want to estimate the proportion of visitors to New Zealand who intend to stay for more than 2 weeks. Visitors can be selected at the airport as they pass through customs. We choose a time to start sampling, and then select at random one passenger in the first twenty. Thereafter we sample every twentieth arriving passenger.
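A sketch of the systematic selection rule follows. The sampling interval $k=20$ comes from the text, while the total of 1000 arriving passengers and the seed are hypothetical; in a real airport survey the total would not be known in advance, and selection would simply continue until sampling stops.

```python
import random


def linear_systematic_sample(N, k, seed=None):
    """Pick a random start in 1..k, then every k-th unit up to N."""
    rng = random.Random(seed)
    start = rng.randint(1, k)          # randint includes both endpoints
    return list(range(start, N + 1, k))


arrivals = linear_systematic_sample(N=1000, k=20, seed=392)
print(arrivals[:5], len(arrivals))
```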

4.7.4 SRS1C=One-stage Cluster Sample.

Here we want to estimate the number of shellfish categorized by species, e.g. pipis, mussels, etc on a beach.

At low tide we divide the beach into 1m $\times$ 1m quadrats. We number each quadrat, from 1 to $N$. We select $n$ distinct random numbers between 1 and $N$ from a set of random number tables. If the random number $i$ is chosen then we select the $i$th quadrat. We dig up the quadrat and count the number of shellfish.

4.7.5 SRS2C=Two-stage Cluster Sample.

Again we want to estimate the number of shellfish.

This is a big beach, e.g. Titahi Bay. So at low tide we divide the beach into 10m $\times$ 10m quadrats. We number each quadrat, from 1 to $N$. We select $n$ distinct random numbers between 1 and $N$ from a set of random number tables. If the random number $i$ is chosen then we select the $i$th quadrat – this is the first stage of sampling. Now for each selected quadrat we further divide the quadrat into one hundred 1m $\times$ 1m (sub)quadrats. We number these from 1 to 100. We select $m$ distinct random numbers between 1 and 100 from a set of random number tables. If the random number $j$ is chosen then we select the $j$th (sub)quadrat, within the $i$th quadrat – this is the second stage of sampling. We dig up the (sub)quadrat and count the number of shellfish.
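The two stages can be sketched as nested random selections. The 100 sub-quadrats per quadrat come from the text. The number of large quadrats ($N=60$), the first-stage sample size ($n=6$), the second-stage sample size ($m=5$) and the seed are hypothetical.

```python
import random


def two_stage_sample(N, n, M, m, seed=None):
    """Stage 1: SRSWOR of n quadrats from N.
    Stage 2: SRSWOR of m sub-quadrats from M within each selected quadrat."""
    rng = random.Random(seed)
    quadrats = sorted(rng.sample(range(1, N + 1), n))
    return {i: sorted(rng.sample(range(1, M + 1), m)) for i in quadrats}


plan = two_stage_sample(N=60, n=6, M=100, m=5, seed=392)
for quadrat, subquadrats in plan.items():
    print(quadrat, subquadrats)
```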

4.7.6 STSRS1C=Stratified Cluster Sample.

Again we want to estimate shellfish, but now just mussels. The beach has some sandy strips and some rocky outcrops, and mussels are more likely to be found on the rocky outcrops.

We split the beach into areas which are either predominantly sandy, or predominantly rocky. At low tide within each of these areas, $h$, we divide the beach into 1m $\times$ 1m quadrats. We number each quadrat, from 1 to $N_{h}$. We select $n_{h}$ distinct random numbers between 1 and $N_{h}$ from a set of random number tables. If the random number $i$ is chosen then we select the $i$th quadrat. We dig up the quadrat and count the number of mussels. Note that the samples in the different areas of the beach are independent of one another.

4.7.7 PPSWR=Selection Probability Proportional to Size, With Replacement.

We want to estimate the number of calves born in spring on farms supplying milk for Town Milk Supply. We have information on the number of milking cows on each such farm and indeed each cow is registered on a computer database. Suppose the cows are labelled 1 to $N$. We select $n$ random numbers between 1 and $N$ from a set of random number tables. If the random number $i$ is chosen and the $i$th cow belongs to farm $j$, then we select farm $j$ and ask them to tell us how many calves were born.
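Selecting a cow at random and taking its farm gives each farm a selection probability proportional to its herd size on every draw. The sketch below implements this by mapping a cow number to a farm through cumulative herd sizes. The four farms, their herd sizes and the helper names `farm_of_cow` and `ppswr` are hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate

# Hypothetical herd sizes (number of milking cows) per farm
herd_sizes = {"A": 120, "B": 40, "C": 300, "D": 90}
farms = list(herd_sizes)
cumulative = list(accumulate(herd_sizes.values()))   # upper cow label per farm
N = cumulative[-1]                                   # total number of cows


def farm_of_cow(cow):
    """Cows 1..120 belong to farm A, 121..160 to farm B, and so on."""
    return farms[bisect_left(cumulative, cow)]


def ppswr(n, seed=None):
    """Draw n cow labels with replacement and return the farms they belong to."""
    rng = random.Random(seed)
    draws = [rng.randint(1, N) for _ in range(n)]
    return [farm_of_cow(cow) for cow in draws]


selected_farms = ppswr(n=5, seed=392)
print(selected_farms)
```

Because the draws are made with replacement, the same farm can appear more than once in the sample, and large farms are more likely to do so.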

4.8 Administrative Data - a common alternative to sampling

Administrative Data are data which are collected during the routine operations of an organisation as it manages its client population. Examples include

Hospital admissions
GP Prescriptions
Student records in a University
Tax records
Aircraft movements at an airport

Such data sets can help us understand the population of interest, and importantly:

They are complete - the entire managed population is represented;
They are often of high quality - since the data they contain are necessary to the management of the population;
They are up to date, containing all activities in the population right up to the current moment;
The data already exist, so are a cheap form of data collection, provided we can get permission to gain access;
The sample sizes can be very large.

These characteristics make admin data sets excellent sources of information about a population.

However, they have some distinct drawbacks:

Available administrative data sets may not exactly match our target population.

e.g. we may be interested in the health of all New Zealanders, whereas the hospital admissions data set only covers people with severe conditions which require hospitalisation, and it only covers public hospitals.
They are transactional - the fundamental unit of the data may be an interaction (e.g. an episode of hospital care) rather than a member of the population. This can make them complicated to analyse: some people may appear multiple times in the data set, and others not at all. We also need to be able to match people properly so we can identify which records refer to the same individual;
They only contain information relevant to the operation of the organisation that collects the data - the things we may want to know aren’t available.

e.g. although we may find out what operations a person had in hospital, we can’t know how they felt about it. For our analyses we may also want to know information about a person’s income or education - things that aren’t collected by the health system because they’re not needed to manage care;
We can see the most recent interaction of a person with the organisation - but we can’t know if that person is still alive now, or is still in the country, so we don’t know if they are part of our population anymore.
The definitions used to define data items may differ from ones we want to use, or may change over time.

e.g. Diagnostic codes regularly undergo revision, which means when comparing older with newer data we may have to use different sets of codes in different time periods. We may be interested in the rate of occurrence of complications up to one year after surgery, but the data may only record complications up to 30 days.

The power of administrative data sets can be enhanced by linkage: individuals in a number of data sets can be identified and their records linked, as a way of increasing the richness of the information available for analysis. Statistics New Zealand’s Integrated Data Infrastructure (IDI) is an example of a data linkage warehouse where many data sets (census, tax, health, education etc.) are linked, enabling more powerful analyses.

While administrative data may be suitable to answer many questions, some can really only be answered by a sample survey.

5 Populations and Frames

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$

The target population defines the scope of the survey: the set of units which make up the population in which we are interested.

The sample frame is a method of gaining access to the members of the target population. It is an actual or conceptual list. However it may not be perfect (Figure 5.1).

Not all members of the target population will appear on the sample frame: the missing units are out of coverage;
Not all units listed on the sample frame are members of the target population: the extra units are out of scope.
The survey population are those units which are both in the target population (in scope) and on the frame (in coverage).

Our aim is to have a sample frame which covers most if not all of the target population, and which has only very few out of scope units.
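The relationships between target population, frame, scope and coverage are just set operations. A toy sketch, with made-up unit labels, may help fix the vocabulary:

```python
# Hypothetical unit labels
target_population = {"A", "B", "C", "D", "E", "F"}
sample_frame = {"C", "D", "E", "F", "G", "H"}

survey_population = target_population & sample_frame   # in scope and in coverage
out_of_coverage = target_population - sample_frame     # in scope, missing from frame
out_of_scope = sample_frame - target_population        # on frame, not in scope

coverage_rate = len(survey_population) / len(target_population)

print(sorted(survey_population), sorted(out_of_coverage), sorted(out_of_scope))
print(round(coverage_rate, 2))
```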

Figure 5.1: Scope and Coverage

5.1 What is a sampling frame?

The sampling frame is a key element in the survey process which provides the basis from which a random (probability based) sample can be selected. The purpose of having a sampling frame is to give each member of the target population a nonzero probability of being selected to participate in the survey. The quality of the data from a survey which has as its basis a comprehensive frame can be measured and controlled.

The term sampling frame is sometimes misunderstood, often being thought of as a list containing each member of the population. In fact that is more a description of an ‘ideal’ sampling frame. The best way to show what a sampling frame can be in practice is to examine some examples:

List Frames. Good up-to-date lists make ideal frames. However lists of units (electoral rolls, telephone listings etc.) are generally formed for purposes other than for use as sampling frames. For this reason they are often not close to ‘ideal’ sampling frames. Some discussion of list frame problems comes later.
Multi-stage Sampling Frames. These frames consist of a set of lists which are arranged in a hierarchy of stages. For example a typical household survey frame will be organised like so: The first list is a set of non-overlapping geographic areas from which a sample of areas is taken. For example we may list all the local regional councils and take a sample of them. Within these areas we might then list all the households and take samples from the household lists. From each household we would then list the residents and interview selections of them. Note that we do not have to list the entire population. The only comprehensive list needed is the list of the areas, which will be quite small. By giving each area a chance of being selected we ensure each person living in each area has a chance of selection.
Continuous Sampling Frames. Consider the population of people entering a country from international flights. Suppose we choose a random number between one and ten ($=k$) and then interview the $k$th, $(k+10)$th, $(k+20)$th … persons who file through customs. We will have a sample where every person has had a 1 in 10 chance of being selected. But what is the frame? There is no list. We call such a situation a ‘conceptual list’, where the people filing into the country effectively formed a long queue of people which we have treated like a listing of the population, although a listing of names was never constructed.

One can find many different definitions of sampling frames. The following is a simplified definition that should serve our purposes:

A sampling frame is part of the system of rules and procedures used when determining who is to be surveyed and which ensure that each member of the population gets some measurable (non-zero) chance of participating in the survey.

The examples above show that we need to think beyond the concept of a list of units, whose construction is a separate exercise from the other survey stages.

5.2 What is a good sampling frame?

We will often have several frames available and will have to decide between different options. To help us with this decision we need to be a little more specific about what makes a good sampling frame. The first thing we should consider is the effective coverage of the sampling frame. This can be determined by testing it for the following properties:

Each unit should be counted at least once. If there are missing units (out of coverage) then the survey results may be biased if the excluded units tend to be different to the units which are included.
Each unit should be counted only once. The problem is that we usually can not tell how many times a unit is duplicated. This means we can not tell how many chances a unit has of being sampled. This can cause similar bias if the units which are duplicated are different to the other units. The results will be biased towards the duplicated sub-group of the population.
Each unit should be distinguishable from the other units on the frame. If we select a unit from the frame (say Jane Smith) but then can not tell who this refers to, when there are many Jane Smiths in the population, then the frame has not done its job.
Should provide up-to-date information about the units. For example, if it claims to provide Jane Smith’s address it should be the current address.
No units should be there if they are not meant to be. Extra units (out of scope) are not a serious problem unless we fall into the elementary error of replacing them whenever we come across them in the sample. This confuses selection probabilities. The correct procedure is to drop any ineligible unit found in the sample and not replace them.
Should allow direct access to each unit. Ideally the frame will ultimately supply us with a list of the units we are interested in analyzing. Sometimes a frame will only give samples of groups of units e.g. a list of addresses, each of which may contain many people. Further work is needed before a sample of people is realized.

5.3 Screening

Note that sometimes the units in the survey population do not have a natural frame and we need to use a frame with large numbers of ineligible units and perform screening. For example, many surveys in New Zealand require separate estimates for Māori people as well as for the whole population. There is no population register of Māori, and the only way to find a sample of sufficient size is to take a screening sample: i.e. a large number of households are selected and approached. A set of screening questions is administered to determine whether there are any eligible Māori living at the selected dwelling. If there are, then an interview is carried out, but if not the residents are thanked for their cooperation, but are not interviewed. Thus in general only a small proportion of a screening sample is actually used, since only a small proportion of units in the screening sample are eligible.

Screening may also apply to subsets of questions in a questionnaire: only certain respondents answer certain questions. For example, in the New Zealand Health Survey a set of screening questions about gambling behaviour establishes whether or not the respondent will be asked a set of questions related to the risk of problem gambling. When estimates are made about the characteristics of problem gamblers (e.g. proportion male/female; age distribution etc.) then only this restricted, screened, subsample is used because these estimates relate to a restricted subpopulation of the survey population of the whole NZ Health Survey.

Figure 5.2: Screening. The only available sample frame is dominated by ineligible (out of scope) units

5.4 Why does a frame need to be so comprehensive?

Good sampling frames help us make our surveys similar to good scientific experiments in the sense that they help us remove the effect of unknown factors on our conclusions. How?

We can show that if we miss units or count them twice our estimates derived from the survey data will be biased estimates of the population values and that the magnitude of the bias will be unknown. This means our inferences about the population may be incorrect.

To discuss this further it will help to define the target population, which contains all the units we want the survey to represent, and the survey population, which contains all the units covered by the survey frame. A sample frame will generally be able to cover only a subset of the target population. The error that results from the difference between the target and survey populations is called the frame bias, and it is most significant when those not covered by the frame are large in number and different to the remaining population. In fact we can show that frame bias is the product of two factors $P\times D$ where:

$P=$ The proportion of the target population not covered by the frame.
$D=$ The difference between those covered by the frame and those not. For instance this might be the difference between the average heights of those on the frame and those not on the frame.

In most cases $D$ is unknown, so the best approach to minimizing the bias is to keep $P$ as small as possible. In other words, unless we have a comprehensive frame our survey results will be biased by an unknown amount.

Example. Assume we are estimating the mean house price in New Zealand. We have an incomplete frame, containing $N_1$ units (houses), but there are $N_2$ units missing. The total number of houses in NZ is thus $N=N_1+N_2$.
Now assume that the mean price of the $N_1$ houses on the frame is $\bar{Y}_1$, and the mean price of the $N_2$ missing houses is $\bar{Y}_2$. The mean price of all houses is then

\[ \bar{Y} = \frac{N_1\bar{Y}_1+N_2\bar{Y}_2}{N} \]

If we let $P=N_2/N$ be the proportion of houses missing from the frame, and $D=\bar{Y}_1-\bar{Y}_2$ be the difference in the mean prices for houses on and off the frame we can write

\[\begin{eqnarray*} \bar{Y} &=& (1-P)\bar{Y}_1+P\bar{Y}_2\\ &=& \bar{Y}_1 - P(\bar{Y}_1-\bar{Y}_2)\\ &=& \bar{Y}_1 - PD\\ \end{eqnarray*}\]

Thus a sample estimator which estimates $\bar{Y}_1$ will have a bias: \[ {\rm Bias} = \bar{Y}_1-\bar{Y} = PD \]

Assume that $P=\frac13$ of the houses are missing, and that $\bar{Y}_1=\$750,000$ and $\bar{Y}_2=\$300,000$. The bias is then $PD=(\$750,000-\$300,000)\times\frac13=\$150,000$: i.e. we estimate the mean house price to be $\bar{Y}_1=\$750,000$, but this is an overestimate, which is too large by $\$150,000$. In a less extreme case, if $P=0.01$ (i.e. 1% of houses are missing from the frame), the bias is $PD=(\$750,000-\$300,000)\times0.01=\$4,500$.
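The house price example can be checked numerically. The sketch below computes the bias both as $PD$ and directly as $\bar{Y}_1-\bar{Y}$, using the figures from the example. The function name `frame_bias` is ours, not a standard one.

```python
def frame_bias(P, mean_on_frame, mean_off_frame):
    """Bias P * D of an estimator that targets the on-frame mean only."""
    D = mean_on_frame - mean_off_frame
    return P * D


Y1, Y2 = 750_000, 300_000          # mean prices on and off the frame

for P in (1 / 3, 0.01):
    overall_mean = (1 - P) * Y1 + P * Y2   # true mean over all houses
    direct_bias = Y1 - overall_mean        # bias computed from its definition
    print(round(P, 4), round(frame_bias(P, Y1, Y2)), round(direct_bias))
```

Both routes give the same answer for each value of $P$, confirming the identity $\bar{Y}_1-\bar{Y}=PD$ derived above.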

5.5 Costs, Design and Operation - further complications.

Apart from the effective coverage of the population, there are a number of other issues which may determine which frame we can or should use. These include cost, survey design and survey operations.

If a good (complete and up to date) list exists then using this list will be the cheapest alternative. Constructing a complete list from scratch will be expensive in terms of the time it will take and the amount of work required. However updating a list is often just as expensive as constructing it from scratch, thus old lists are often not of much use. In many circumstances an area-based hierarchical frame will be less expensive because only the members within the sample at each stage need to be listed. A complete listing will only be needed of the groups of units at the initial stage. This stage will generally only be a list of areas or maps and will not require costly detailed listing of individual units. An area-based frame can save costs, because it will limit sampling to a number of areas. This means, for example, the time and effort required to travel to each selected individual unit will be smaller than if the selected units are spread over the whole population.

The choice of sample design can be limited by the frame. Two aspects of the sample design involving the frame are: Stratification and Cluster Sampling. Stratification involves sorting the list of units into a set of distinct groups and then selecting samples independently from each group. This process requires that we have a frame which we can easily rearrange and separate into different groups. Thus we need some flexibility and control over the ordering of individual units. An area-based frame means we could only stratify the areas and could not stratify the households or people.

Frames which do not allow us to directly select individual units mean that we are using cluster sampling. An example is a two-stage selection of people, where we initially sample households and then obtain a selection of people from the selected households. Clustering of the units prior to selection tends to increase the variances of the resulting survey estimates. This is usually tolerated because of cost savings, which are discussed briefly above.

The choice of frame affects the operation of the survey. Consider the example of a continuous sampling frame used for a sample of overseas departures. We may know that by having each person file through customs we can conceive a valid comprehensive sampling frame, with each person receiving a fixed chance of being selected. However if the questionnaire we need to use takes up to an hour to complete and is impracticable to use at such a site then we will need to consider an alternative survey procedure (including an alternative frame). We can never consider the choice of sampling frame to be independent of the operational side of the data collection.

Table 5.1: Some problems with commonly used lists
Source	Undercoverage	Overcoverage	Duplicates
Telephone Register	No landline phone; unpaid bills	businesses and other non-private addresses	two phones in one dwelling; number listed against two names
Electoral Roll	Never registered; New arrivals to the electorate; aged <17	deceased people; people who have left	registered twice under two names; listed in two electorates; multiple electors per dwelling (if using to select addresses)
Address Lists	New houses; subdivided houses; hard to find; isolated rural dwellings	businesses and other non-private addresses; vacant sections; abandoned/empty dwellings; holiday homes	houses on two street lists (e.g. corners)

6 Simple Random Sampling

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$ $\newcommand{\estm}{\widehat}$

Much of sample design theory for complex sample designs rests on the properties of the simplest of all designs: simple random sampling without replacement (abbreviated SRSWOR or sometimes just SRS). We will investigate the properties of the SRSWOR later, but for the moment here is a working definition.

We have a population of size $N$, from which we want to draw a sample of size $n$ without replacement: each unit can be drawn at most once, and $n$ distinct units are drawn in such a way that every possible sample of size $n$ is equally likely.

6.1 Drawing a SRSWOR

One conceptual procedure for selecting a SRSWOR is the following:

Select the first unit by randomly selecting with equal probability from the $N$ units in the population
Select the second unit by randomly selecting with equal probability from the $N-1$ units which remain
…
Select the $n^{\rm th}$ unit by randomly selecting with equal probability from the $N-n+1$ units which remain

At the end we have a set of $n$ distinct units in the sample, and $N-n$ unselected units remaining in the population.
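The conceptual procedure can be sketched in Python as follows. This is only an illustration: the function name `draw_srswor_sequential` and the use of Python's `random` module are choices made here, not part of the notes.

```python
import random


def draw_srswor_sequential(N, n, rng=None):
    """Draw n of the units 1..N one at a time, each draw equally likely
    among the units that remain."""
    if rng is None:
        rng = random.Random()
    remaining = list(range(1, N + 1))
    sample = []
    for _ in range(n):
        pos = rng.randrange(len(remaining))  # equal probability over what is left
        sample.append(remaining.pop(pos))    # remove it so it cannot be drawn again
    return sample, remaining


sample, unselected = draw_srswor_sequential(N=8, n=4, rng=random.Random(1))
print(sample, unselected)
```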

In practice we use other methods.

6.1.1 Taking a SRSWOR using a calculator

Most scientific calculators have a random number function – which generates a random number between 0 and 1 to 3 decimal places: e.g. 0.277 or 0.913. For populations $N$ of a couple of hundred members, a SRSWOR can be drawn as follows:

Generate a random number $r$ between 0 and 1
Calculate the number $N\times r$ and no matter what the result round UP to the nearest whole number $k$.
Add the $k^{\rm th}$ unit to the sample, unless it has already been selected.
Go back to Step 1 and repeat until $n$ units have been selected.

For example, if we want to select $n=4$ units from a population of size $N=8$ we proceed as follows:

Generate $r=0.675$ so $Nr=(8)(0.675)=5.4$ round up to 6: select unit 6;
Generate $r=0.528$ so $Nr=(8)(0.528)=4.2$ round up to 5: select unit 5;
Generate $r=0.161$ so $Nr=(8)(0.161)=1.3$ round up to 2: select unit 2;
Generate $r=0.225$ so $Nr=(8)(0.225)=1.8$ round up to 2: unit 2 already selected, so continue;
Generate $r=0.107$ so $Nr=(8)(0.107)=0.9$ round up to 1: select unit 1.

So our sample is made up of units $\{1, 2, 5, 6\}$.
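The calculator method can be mimicked in Python. In this sketch the random numbers are supplied as a list so that the worked example can be reproduced; the function name `calculator_srswor` is hypothetical, and skipping a draw of exactly 0.000 (which would round up to unit 0) is an assumption added here.

```python
import math


def calculator_srswor(N, n, random_numbers):
    """Select n distinct units from 1..N by rounding N*r UP for each r."""
    sample = []
    for r in random_numbers:
        k = math.ceil(N * r)
        if k == 0:
            continue  # assumption: r = 0.000 gives no valid unit, so redraw
        if k not in sample:
            sample.append(k)  # new unit: add it
        if len(sample) == n:
            return sorted(sample)
    raise ValueError("ran out of random numbers before n units were selected")


draws = [0.675, 0.528, 0.161, 0.225, 0.107]
sample = calculator_srswor(N=8, n=4, random_numbers=draws)
print(sample)  # [1, 2, 5, 6]
```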

6.1.2 Taking a SRSWOR using a spreadsheet

If we have all the units of a population listed in a spreadsheet such as Excel, it is straightforward to select a SRSWOR.

Using Excel:

Insert a new column
Fill each cell in the column with random numbers using the =RAND() function. (To do this type =RAND() in the first cell of the column, and then copy this into all of the cells in the column.)
Make the values permanent by selecting the whole column, going Right-Click $>>$ Copy, then Right-Click $>>$ Paste Special. Select Paste Values and paste into the same column.
Sort the whole spreadsheet by this new column (using Data $>>$ Sort)
Select the first $n$ rows of the sorted spreadsheet and copy into a new spreadsheet.

This results in a SRSWOR of size $n$ from the population.
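The same idea (attach a random number to every unit, sort, keep the first $n$) can be sketched in Python. The function name `srswor_by_random_sort` is illustrative, not something defined in the notes.

```python
import random


def srswor_by_random_sort(units, n, rng=None):
    """Attach a random key to each unit, sort by the key, keep the first n."""
    if rng is None:
        rng = random.Random()
    keyed = [(rng.random(), unit) for unit in units]  # the 'new column'
    keyed.sort(key=lambda pair: pair[0])              # sort by the random column
    return [unit for _, unit in keyed[:n]]            # first n rows


population = list(range(1, 201))
sample = srswor_by_random_sort(population, n=20, rng=random.Random(392))
print(sample)
```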

6.1.3 Taking a SRSWOR by scanning

It is very efficient in some computer languages to scan down the list of all the units of the population, and make a decision about whether to include each unit in turn. The algorithm below is a means of drawing a SRSWOR of $n$ units from a list of size $N$ by scanning the list once.

Set $i=1$.
Generate a random number $r$ between 0 and 1;
If $r\leq \frac{n}{N}$ then select unit $i$ into the sample and decrease $n$ by 1: i.e. $n \longleftarrow n-1$.
Decrease $N$ by 1 and increase $i$ by 1: i.e. $N \longleftarrow N-1$ and $i\longleftarrow i+1$
If $N>0$ go to Step 2; otherwise stop.
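A minimal Python sketch of this single-pass algorithm follows. The function name `srswor_by_scanning` is hypothetical. One deliberate deviation is labelled in the code: the comparison is strict ($r<n/N$) because Python's `random()` can return exactly 0, which would otherwise allow an extra selection once $n$ has reached zero.

```python
import random


def srswor_by_scanning(units, n, rng=None):
    """Scan the list once, selecting each unit with probability
    (units still needed) / (units still to be scanned)."""
    if rng is None:
        rng = random.Random()
    remaining_N = len(units)
    remaining_n = n
    sample = []
    for unit in units:
        r = rng.random()  # r lies in [0, 1)
        # Assumption: strict '<' instead of '<=' so that nothing is
        # selected when remaining_n == 0, even if r is exactly 0.0.
        if r < remaining_n / remaining_N:
            sample.append(unit)
            remaining_n -= 1
        remaining_N -= 1
    return sample


sample = srswor_by_scanning(list(range(1, 201)), n=20, rng=random.Random(7))
print(sample)
```

When the number still needed equals the number left to scan, the ratio is 1 and every remaining unit is selected, so the sample always ends up with exactly $n$ units.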

6.2 Notation: properties of populations and samples

6.2.1 Survey Population

A population is a set of members (called population units). A population can be either finite (e.g. all students enrolled at Victoria University at the start of the first trimester) or infinite (e.g. all passengers arriving at NZ ports). We will usually consider only finite populations, in which there is a well-defined (even if unknown) population size $N$.

We give each unit (member) of the population a unique label $i=1,\ldots,N$.

The population units have characteristics, (e.g. individuals have height, weight, income, age, sex). A given characteristic is represented by a variable $Y$, and every (eligible) member $i$ of the population has a value $Y_i$ for that variable.

Table 6.1: Characteristics of individual population members
Person label $i$	Gender	Highest Qual.	Age	Weekly hours worked	Weekly income	Marital Status	Ethnicity
1	female	school	15	4	87	never	European
2	female	vocational	40	42	596	married	European
3	male	none	38	40	497	married	Maori
4	female	vocational	34	8	299	never	European
5	female	school	45	16	301	married	European
6	male	degree	45	50	1614	married	European
7	female	none	36	12	201	other	European
…	…	…	…	…	…	…	…

Recall that data items can be numerical (discrete or continuous), binary, or categorical (nominal or ordinal).

Population parameters are characteristics of the whole population. e.g. if $Y_i$ is the income of individual $i$, then \[\begin{equation}\label{ytot} Y = \sum_{i=1}^N Y_i \end{equation}\] is the total income of all individuals in the population, and \[\begin{equation}\label{ybar} \bar{Y} = \frac{1}{N}\sum_{i=1}^N Y_i \end{equation}\] is the mean income in the population.

Table 6.2 contains a list of standard definitions for properties of populations. The values of the population parameters are in general unknown to us, and we must estimate them using the properties of samples.

Table 6.2: Population and sample characteristics
Quantity	Population	Sample
size	$N$	$n$
total	$Y = \sum_{i=1}^N Y_i$	$y = \sum_{k=1}^n y_k$
mean	$\bar{Y} = \frac{1}{N}\sum_{i=1}^N Y_i$	$\bar{y} = \frac{1}{n}\sum_{k=1}^n y_k$
variance	$\sigma_Y^2=\frac{1}{N}\sum_{i=1}^N (Y_i-\bar{Y})^2$
adjusted variance	$S_Y^2=\frac{1}{N-1}\sum_{i=1}^N (Y_i-\bar{Y})^2$	$s_y^2=\frac{1}{n-1}\sum_{k=1}^n (y_k-\bar{y})^2$
adjusted variance for indicator variables	$S_Y^2=\frac{N}{N-1}P(1-P)$	$s_y^2=\frac{n}{n-1}\widehat{p}(1-\widehat{p})$
relative variance	$V_Y^2=\frac{S_Y^2}{\bar{Y}^2}$	$v_y^2=\frac{s_y^2}{\bar{y}^2}$
coefficient of variation	$V_Y=\sqrt{V_Y^2}=\frac{S_Y}{\bar{Y}}$	$v_y=\sqrt{v_y^2}=\frac{s_y}{\bar{y}}$
covariance	$S_{XY}=\frac{1}{N-1}\sum_{i=1}^N (X_i-\bar{X})(Y_i-\bar{Y})$	$s_{xy}=\frac{1}{n-1}\sum_{k=1}^n (x_k-\bar{x})(y_k-\bar{y})$
correlation coefficient	$\rho_{XY}=\frac{S_{XY}}{S_XS_Y}$	$r_{xy}=\frac{s_{xy}}{s_xs_y}$

Binary Indicator Variables

Binary variables are often coded by indicator variables, which take the value 0 or 1 only (e.g. $Y_i=0$ if Male, $Y_i=1$ if Female). The total of an indicator variable \[\begin{equation} Y = \sum_{i=1}^N Y_i = \text{Number of 1s} \end{equation}\] is just the number of units in the population for which the variable is 1 (e.g. the number of females in the population). The mean of an indicator variable \[\begin{equation} \bar{Y} = \frac{1}{N}\sum_{i=1}^N Y_i = \frac{\text{Number of 1s}}{N} = P \end{equation}\] is then just the proportion of units $P$ in the population that have the value 1 (e.g. the proportion of the population that is female).
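As a small check of these definitions, the sketch below treats the seven listed rows of Table 6.1 as if they were a complete population (an assumption made purely for illustration) and codes Gender as an indicator variable. It confirms numerically that the adjusted variance of an indicator equals $\frac{N}{N-1}P(1-P)$.

```python
import statistics

# Assumption: the seven listed rows of Table 6.1 form the whole population
gender = ["female", "female", "male", "female", "female", "male", "female"]
Y = [1 if g == "female" else 0 for g in gender]  # indicator: 1 = female

N = len(Y)
total = sum(Y)                    # number of 1s
P = total / N                     # mean of an indicator = proportion
sigma2 = statistics.pvariance(Y)  # divisor N
S2 = statistics.variance(Y)       # divisor N - 1 (adjusted variance)

print(total, P, sigma2, S2)
print(N / (N - 1) * P * (1 - P))  # agrees with S2 up to rounding
```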

6.2.2 Sample

A sample is a (non-empty) subset of the population containing $n$ elements. In samples drawn without replacement each element is a distinct unit, whereas samples drawn with replacement may contain repeated units.

We have a set of labels $k=1,\ldots,n$ for each element of the sample, and these labels are of course different from the labels $i=1,\ldots,N$ which we use to identify the population units.

We use lower case $y_k$ to represent the value that the characteristic $Y$ takes on the $k$th member of the sample. Thus if the $k$th member of the sample is the $i$th member of the population then $y_k=Y_i$.

Sample statistics are characteristics of the sample. e.g. if $y_k$ is the income of individual $k$ in the sample, then \[\begin{equation} y = \sum_{k=1}^n y_k \end{equation}\] is the total income of all individuals in the sample, and \[\begin{equation} \bar{y} = \frac{1}{n}\sum_{k=1}^n y_k \end{equation}\] is the mean income in the sample.

A statistic is by definition a quantity which can be calculated from a sample of observations from a population. Certain statistics can be used as estimators of population parameters. For example, the sample mean may sometimes, but not always, be a good estimator of the population mean. However the sample total is never a good estimator of the population total (unless the ‘sample’ is actually a census).

An estimator $\estm{T}$ is a particular type of statistic: its value is used to estimate the value of a population parameter $T$.

Alongside the corresponding population properties, Table 6.2 contains a list of standard definitions for properties of samples.

The sampling fraction is the proportion of the population that has been sampled: \[\begin{equation} f = \frac{n}{N} \end{equation}\]

6.2.3 Sample Probabilities

The set of all possible samples that can be drawn under a particular sampling scheme is called the sample space $\mathbb{S}$. Each possible sample $s$ in the sample space has a known probability $p(s)$.

Under simple random sampling there are \[\begin{eqnarray*} \binom{N}{n} &=& \frac{N!}{(N-n)!n!}\\ &=& \frac{N\times(N-1)\times(N-2)\times\ldots\times(N-n+1)}{ n\times(n-1)\times(n-2)\times\ldots\times 1} \end{eqnarray*}\] distinct possible samples of size $n$ which can be drawn from a population of size $N$, and each sample is equally likely. Therefore the probability of drawing any particular sample $s$ is \[ p(s) = \frac{1}{\binom{N}{n}} = \frac{(N-n)!n!}{N!} \]

6.2.3.1 Example

Consider the following population $U$:

Table 6.3: Small example population
Unit, $i$	Value, $Y_i$
1	0
2	1
3	3
4	5
5	6
6	9

There are $N=6$ units in the population. Assume that we want to draw a SRSWOR of size $n=2$. Then there are $\binom{6}{2}=15$ possible samples: there are 6 ways to choose the first sample member, 5 ways to choose the second – so there are $6\times5=30$ possible samples. However, the ordering of the elements in the sample is unimportant: sample $(i,j)$ is the same as sample $(j,i)$, hence we have $\frac{30}{2}=15$ distinct samples.

Table 6.4: The 15 possible samples, their members, means and selection probabilities.
Sample	$i_1$	$i_2$	$y_1=Y_{i_1}$	$y_2=Y_{i_2}$	Sample Mean	Probability
1	1	2	0	1	0.5	$\frac{1}{15}$
2	1	3	0	3	1.5	$\frac{1}{15}$
3	1	4	0	5	2.5	$\frac{1}{15}$
4	1	5	0	6	3.0	$\frac{1}{15}$
5	1	6	0	9	4.5	$\frac{1}{15}$
6	2	3	1	3	2.0	$\frac{1}{15}$
7	2	4	1	5	3.0	$\frac{1}{15}$
8	2	5	1	6	3.5	$\frac{1}{15}$
9	2	6	1	9	5.0	$\frac{1}{15}$
10	3	4	3	5	4.0	$\frac{1}{15}$
11	3	5	3	6	4.5	$\frac{1}{15}$
12	3	6	3	9	6.0	$\frac{1}{15}$
13	4	5	5	6	5.5	$\frac{1}{15}$
14	4	6	5	9	7.0	$\frac{1}{15}$
15	5	6	6	9	7.5	$\frac{1}{15}$

Figure 6.1: A population of size 6, and the means of all 15 samples of size 2. The mean of the population values, and the mean of all sample means are shown by a vertical line.

The population mean is $\bar{Y}=4$, which can be compared with the 15 possible sample means $\bar{y}$ listed in the table. The population variance is $\sigma_Y^2=9.33$, and the standard deviation is $\sigma_Y=3.06$.

The mean of the 15 sample means is $\bar{\bar{y}}=4$ (equal to $\bar{Y}$), with standard deviation 2 (computed using the divisor $15-1=14$), which is less than the population standard deviation.
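The figures in this example can be reproduced by enumerating every possible sample. The sketch below uses `itertools.combinations` from the Python standard library; the variable names are illustrative.

```python
from itertools import combinations
from math import comb
from statistics import mean, pstdev, stdev

Y = {1: 0, 2: 1, 3: 3, 4: 5, 5: 6, 6: 9}  # unit label -> value (Table 6.3)
n = 2

samples = list(combinations(sorted(Y), n))  # all unordered pairs of labels
sample_means = [mean([Y[i] for i in s]) for s in samples]

print(len(samples), comb(len(Y), n))      # 15 15
print(mean(list(Y.values())))             # population mean
print(pstdev(list(Y.values())))           # population sd (divisor N), about 3.06
print(mean(sample_means))                 # mean of the 15 sample means
print(stdev(sample_means))                # sd of sample means, divisor 14
print(pstdev(sample_means))               # sd of sample means, divisor 15, about 1.93
```

The divisor-15 value agrees with the theoretical standard error $\sqrt{(1-\frac{n}{N})\frac{S_Y^2}{n}}=\sqrt{\frac23\times\frac{11.2}{2}}\approx1.93$.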

The probability that any particular member $i$ of the population ends up in the sample is called its inclusion probability, written $\pi_i$. This probability must be nonzero for every unit in the survey population. In SRSWOR the inclusion probabilities are \[ \pi_i = \frac{n}{N} \] for every unit $i$. In other words each unit has an equal chance of ending up in the sample. Not all sampling schemes have this property (e.g. in stratified samples the chances of being selected may vary between strata). In the example above the inclusion probabilities are $\pi_i=\frac{2}{6}=\frac13$ for each unit.

The joint inclusion probabilities $\pi_{ij}$ are the probabilities that both unit $i$ and unit $j$ are in the sample. These probabilities can sometimes be zero: for example consider a stratified random sample where we select one unit per stratum. If units $i$ and $j$ are in the same stratum then we have to have $\pi_{ij}=0$, since if one is in the sample then the other can’t possibly be there too.

In SRSWOR the joint inclusion probabilities are \[ \pi_{ij} = \frac{n(n-1)}{N(N-1)} \qquad (i\neq j) \] for any pair of distinct units $i,j$.

In the example above the joint inclusion probabilities are $\pi_{ij}=\frac{2\times1}{6\times5}=\frac{1}{15}$ for any pair of distinct units, which is just the selection probability $p(s)$ of the sample in each case. (This is to be expected, since each sample is exactly one pair of units.)
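Inclusion probabilities can also be obtained directly from the sample space, by adding up $p(s)$ over the samples that contain the unit (or pair of units) in question. A minimal sketch for the $N=6$, $n=2$ example, using exact fractions:

```python
from fractions import Fraction
from itertools import combinations

N, n = 6, 2
units = range(1, N + 1)
samples = list(combinations(units, n))
p_s = Fraction(1, len(samples))  # every sample equally likely under SRSWOR

# pi_i: total probability of the samples that contain unit i
pi = {i: sum(p_s for s in samples if i in s) for i in units}

# pi_ij: total probability of the samples that contain both i and j
pi_joint = {
    (i, j): sum(p_s for s in samples if i in s and j in s)
    for i, j in combinations(units, 2)
}

print(pi[1], pi_joint[(1, 2)])  # 1/3 1/15
```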

6.2.4 Sample Weights

In a general sampling scheme each unit $i$ in the population has a known non-zero sample inclusion probability $\pi_i$ and can be assigned a sample weight \[\begin{equation} w_i = \frac{1}{\pi_i} \end{equation}\] e.g. In a sample of $n=2$ units from a population of size $N=6$ under SRSWOR, the sample inclusion probability is $\pi_i=\frac{2}{6}=\frac{1}{3}$ for each unit. The sample weights are therefore \[ w_i = \frac{1}{\pi_i} = 3 \] for each unit. i.e. each unit in the sample stands for three units: itself and two others.

One way to think about this is that we replicate each sample member by its weight, and thereby create a synthetic population. We then use the properties of the synthetic population to estimate the parameters of the actual population.

Example

Consider the small population of 6 units again:

Table 6.5: Small example population
Unit, $i$	Value, $Y_i$
1	0
2	1
3	3
4	5
5	6
6	9

When we sample $n=2$ units from this population we might get units 2 and 4, each with weight 3.

Table 6.6: Selected units 2 and 4
Unit, $k$	Value, $y_k$	Weight
1	1	3
2	5	3

So we conceptually replicate each observation 3 times to create a synthetic population.

Table 6.7: Synthetic population based on the sample
Unit, $i$	Value, $Y_i$
1	1
2	1
3	1
4	5
5	5
6	5

Certain of the properties of this synthetic population (e.g. the mean $\bar{Y}=3$, and total $Y=18$) can be used to estimate properties of the actual population.
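The synthetic population idea is easy to reproduce. The sketch below replicates each sampled value by its weight and shows that the total of the synthetic population is the same as the weighted sum of the sample values.

```python
sample_values = [1, 5]  # y_k for the selected units 2 and 4
weights = [3, 3]        # w_k = N / n = 6 / 2

# Replicate each observation w_k times
synthetic = [y for y, w in zip(sample_values, weights) for _ in range(w)]

synthetic_total = sum(synthetic)
synthetic_mean = synthetic_total / len(synthetic)
weighted_total = sum(w * y for w, y in zip(weights, sample_values))

print(synthetic)                        # [1, 1, 1, 5, 5, 5]
print(synthetic_total, synthetic_mean)  # 18 3.0
print(weighted_total)                   # 18
```

For comparison, the actual population in Table 6.5 has total 24 and mean 4.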

In many of Statistics NZ’s business surveys large businesses such as Telecom are always selected: i.e. they have a sample inclusion probability of 1. Their weights are therefore also 1: they represent themselves alone – which is appropriate since they are very different from all other businesses.

Small businesses may have much lower sampling inclusion probabilities e.g. $\frac{1}{200}$, and consequently have much higher weights: 200. These businesses represent themselves and 199 others from the population.

We give a low sampling inclusion probability and therefore high weight to units which are similar to each other. Units which are rare and highly influential in estimation are given high sampling inclusion probabilities and low weights.

6.3 Estimation in SRSWOR

The principal population parameters that are usually of interest to us are the population total: \[\begin{equation} Y = \sum_{i=1}^N Y_i \end{equation}\] the population mean: \[\begin{equation} \bar{Y} = \frac{1}{N}\sum_{i=1}^N Y_i \end{equation}\] the adjusted population variance: \[\begin{equation} S_Y^2 = \frac{1}{N-1}\sum_{i=1}^N (Y_i-\bar{Y})^2 \end{equation}\] and the population variance: \[\begin{eqnarray} \sigma_Y^2 &=& \frac{1}{N}\sum_{i=1}^N (Y_i-\bar{Y})^2\\ \nonumber &=& \frac{N-1}{N}S_Y^2 \end{eqnarray}\]

For example, if we were taking a survey of incomes in a population, then we’d want to know the total income earned $Y$, the mean income per head $\bar{Y}$, and the variability of income between individuals $S_Y$.

Example

Consider the survey of working hours and income from earlier: let’s treat these data as a population of size $N=200$ from which we can sample. Note that we have added an extra column: an indicator variable which recodes the Highest Qualification column: \[ Y_i=\left\{\begin{array}{ll} 1& \text{if the person has post-school qualifications,}\\ 0& \text{otherwise} \end{array}\right. \]

Table 6.8: Work and income survey
	Personid	Gender	Qualification	Age	Hours	Income	Marital	Ethnicity	PostSchool
1	1	female	school	15	4	87	never	European	0
2	2	female	vocational	40	42	596	married	European	1
3	3	male	none	38	40	497	married	Maori	0
4	4	female	vocational	34	8	299	never	European	1
5	5	female	school	45	16	301	married	European	0
6	6	male	degree	45	50	1614	married	European	1
7	7	female	none	36	12	201	other	European	0
8	…	…	…	…	…	…	…	…	…
200	200	male	school	31	50	954	never	European	0

Figure 6.2: Distributions of Hours Worked and Income

The population parameters for working hours, income and post-school qualifications are:

Table 6.9: Population parameters
Parameter	Hours	Income	PostSchool
Pop. size $N$	200.0000	200.0000	200.0000000
Mean $\bar{Y}$	33.7100	575.3600	0.4750000
Total, $Y$	6742.0000	115072.0000	95.0000000
Adj. Variance $S_Y^2$	261.0713	120137.6386	0.2506281
Std. Deviation $S_Y$	16.1577	346.6088	0.5006277
Unadj. Variance $\sigma_Y^2$	259.7659	119536.9504	0.2493750

Under SRSWOR we have estimators $\estm{T}$ for each of these parameters $T$: The SRSWOR estimator of the population total is \[\begin{equation} \estm{Y} = \frac{N}{n}\sum_{k=1}^n y_k = N\bar{y} \end{equation}\] the SRSWOR estimator of the population mean is the sample mean $\bar{y}$: \[\begin{equation} \estm{\bar{Y}} = \bar{y} = \frac{1}{n}\sum_{k=1}^n y_k \end{equation}\] the SRSWOR estimator of the adjusted population variance is the sample variance $s_y^2$: \[\begin{equation} \estm{S}_Y^2 = s_y^2 = \frac{1}{n-1}\sum_{k=1}^n (y_k-\bar{y})^2 \end{equation}\] and the SRSWOR estimator of the population variance is \[\begin{equation} \estm{\sigma}_Y^2 = \frac{N-1}{N}\estm{S}_Y^2 = \frac{N-1}{N}s_y^2 \end{equation}\] All four of these estimators are unbiased. That is to say, in the absence of non-sampling error, the mean value of their sampling distribution is equal to the population parameter of interest. In other words, if we take repeated samples and calculate the value of, say $\estm{Y}$, then the values from those samples will scatter evenly around the true population total $Y$.

For the special case where $Y_i$ is an indicator variable, the SRSWOR estimator of the population proportion $P$ is the sample proportion $\estm{p}$: \[\begin{equation} \estm{p} = \frac{1}{n}\sum_{k=1}^n y_k \end{equation}\] and the SRSWOR estimator of the adjusted population variance is the sample variance $s_y^2$, which takes the simple form: \[\begin{equation} \estm{S}_Y^2 = s_y^2 = \frac{n}{n-1}\estm{p}(1-\estm{p}). \end{equation}\]

Example continued

We have drawn a SRSWOR of $n=20$ from the population of $N=200$ people.

The probability of selection of each sample member is \[ \pi_k = \frac{n}{N}=\frac{20}{200} = 0.10 \] which is also the sampling fraction $f=10\%$. The weight of each sample member is \[ w_k = \frac{1}{\pi_k} = \frac{N}{n}=\frac{200}{20} = 10 \] i.e. each person stands for him/herself and 9 others in the population.

Table 6.10: Sample of size 20
	SampleID	Personid	Gender	Qualification	Age	Hours	Income	Marital	Ethnicity	PostSchool	Weight
57	1	57	male	vocational	40	39	525	previously	Maori	1	10
132	2	132	female	vocational	44	48	743	married	European	1	10
141	3	141	female	degree	23	45	658	never	other	1	10
86	4	86	female	vocational	35	40	501	previously	Maori	1	10
67	5	67	female	vocational	41	17	18	married	European	1	10
156	6	156	male	vocational	37	40	819	married	European	1	10
138	7	138	female	none	32	40	406	married	European	0	10
20	8	20	female	none	34	25	386	other	European	0	10
116	9	116	female	degree	36	40	925	previously	European	1	10
62	10	62	female	degree	28	50	544	married	European	1	10
151	11	151	male	vocational	18	40	1099	other	European	1	10
152	12	152	female	school	18	17	255	never	European	0	10
163	13	163	female	vocational	25	25	409	never	Maori	1	10
197	14	197	male	degree	26	50	1789	never	European	1	10
122	15	122	male	vocational	32	38	562	married	European	1	10
121	16	121	female	vocational	21	40	830	other	European	1	10
155	17	155	female	vocational	26	42	431	married	European	1	10
97	18	97	female	vocational	34	40	615	never	European	1	10
10	19	10	male	school	37	50	533	previously	European	0	10
165	20	165	female	none	45	40	439	never	European	0	10

Figure 6.3: Sample distributions of Hours Worked and Income (true population mean shown as a dashed line)

The sample statistics for working hours and income are:

Table 6.11: Sample statistics
Parameter	Hours	Income	PostSchool
Sample. size $n$	20.000000	20.0000	20.0000000
Mean $\bar{y}$	38.300000	624.3500	0.7500000
Adj. Variance $s_y^2$	97.273684	133580.3447	0.1973684
Std. Deviation $s_y$	9.862742	365.4864	0.4442617

Our best estimates of the mean numbers of hours worked and income earned are 38.3 hours and $624.35 respectively (compare these to the true values of 33.71 hours and $575.36). We estimate the proportion of people with post-school qualifications to be 75% (compared to the true value of 47.5%).

Note that the sample has missed people working very short and very long hours, and that even though the estimate of the mean is close to the population value (the dashed line in the histogram), the sample variance of hours worked ($s_y^2=97.3$) is much smaller than the population variance ($S_Y^2=261.1$). This sample also has a higher proportion of people with post-school qualifications than is the case in the general population.
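The sample statistics in Table 6.11 can be recomputed from the Hours, Income and PostSchool columns of Table 6.10. The sketch below uses Python's `statistics` module; the helper `summarise` is an illustrative name.

```python
import statistics

# Columns copied from Table 6.10, in SampleID order
hours = [39, 48, 45, 40, 17, 40, 40, 25, 40, 50,
         40, 17, 25, 50, 38, 40, 42, 40, 50, 40]
income = [525, 743, 658, 501, 18, 819, 406, 386, 925, 544,
          1099, 255, 409, 1789, 562, 830, 431, 615, 533, 439]
post_school = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
               1, 0, 1, 1, 1, 1, 1, 1, 0, 0]


def summarise(y):
    """Return (n, sample mean, adjusted variance s^2, standard deviation s)."""
    return len(y), statistics.mean(y), statistics.variance(y), statistics.stdev(y)


for name, y in [("Hours", hours), ("Income", income), ("PostSchool", post_school)]:
    n, ybar, s2, s = summarise(y)
    print(f"{name:10s} n={n} mean={ybar:.4f} s2={s2:.4f} s={s:.4f}")
```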

6.3.1 Using sample weights to make estimates in SRSWOR

To estimate the total of a quantitative variable, sum up the values of the variable, weighted by the sample weights: \[\begin{equation} \estm{Y} = \sum_{k=1}^n w_k y_k = \sum_{k=1}^n y_k \frac{N}{n} = \frac{N}{n}\times \sum_{k=1}^n y_k = N\bar{y} \end{equation}\] This estimator is sometimes called the rate-up estimator – we take the sample total $\sum_ky_k$ and ‘rate it up’ to the population by multiplying by the factor $N/n$.

Example. To estimate the total income earned by the population as a whole during a single week, multiply each value of income earned by the sample weight, and sum up the result. For SRSWOR, since the weights are all the same, an alternative is to multiply the sample mean by the population size: $N\bar{y}=(200)(624.35)=\$124870$ (true value $\$115072$).

To estimate a mean, make an estimate of the total and divide through by the population size: \[\begin{equation} \estm{\bar{Y}} = \frac{\estm{Y}}{N} = \frac{1}{N}\sum_{k=1}^n w_k y_k = \frac{1}{N}\sum_{k=1}^n \frac{N}{n} y_k = \frac{1}{n}\sum_{k=1}^n y_k = \bar{y} \end{equation}\] i.e. in SRSWOR the estimate of the population mean is just the sample mean.

To estimate the total number of people in a certain category, add up the weights multiplied by an indicator variable $y_k$: \[ y_k = \left\{\begin{array}{ll} 1,\ &\text{if unit $k$ is in the category}\\ 0, &\text{otherwise} \end{array}\right. \] which is the same as simply summing up the weights of survey respondents in that category: \[\begin{equation} \estm{Y} = \sum_{k=1}^n w_k y_k = \sum_{k\in \text{Category}} w_k = \sum_{k\in \text{Category}} \frac{N}{n} = \frac{N}{n}\times\text{(\# in the category in the sample)} \end{equation}\]

Example. To estimate the number of people with post-school qualifications, add up the weights for the sample members who have a post-school qualification. There are 15 such people, each with weight 10, so our estimate of the number of people with a post-school qualification is 150. We estimate that the proportion of people in the population with post-school qualifications is 150/200 = 75% (true proportion 47.5%).

The number of males in the sample is 6, each with weight 10, so we estimate that there are 60 males in the population.

To estimate the proportion of population members in a certain category we can either divide our estimate of the total number by the population size $N$, or simply use the sample proportion as an estimate of the population proportion: the result is the same: \[\begin{equation} \estm{p} = \frac{\estm{Y}}{N} = \frac{1}{N}\frac{N}{n}\times\text{(\# in the category)} = \frac{1}{n}\times\text{(\# in the category)} = \text{sample proportion} \end{equation}\]

Example. We estimate that the proportion of males in the population is 60/200 = 30%. Alternatively the sample proportion is just $\bar{y}=6/20=30\%$ (Note – the true proportion is 46.5%).
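The weighted estimates in this section can be reproduced from Table 6.10. The sketch below is a plain-Python illustration (variable names are assumptions); each estimate is a sum of weights, or of weights times values, over the sample.

```python
# Columns copied from Table 6.10, in SampleID order
income = [525, 743, 658, 501, 18, 819, 406, 386, 925, 544,
          1099, 255, 409, 1789, 562, 830, 431, 615, 533, 439]
gender = ["male", "female", "female", "female", "female", "male", "female",
          "female", "female", "female", "male", "female", "female", "male",
          "male", "female", "female", "female", "male", "female"]
post_school = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
               1, 0, 1, 1, 1, 1, 1, 1, 0, 0]

N = 200
n = len(income)
weights = [N / n] * n  # SRSWOR: every weight is N / n = 10

est_total_income = sum(w * y for w, y in zip(weights, income))
est_mean_income = est_total_income / N
est_post_school = sum(w * y for w, y in zip(weights, post_school))
est_males = sum(w for w, g in zip(weights, gender) if g == "male")
est_prop_males = est_males / N

print(est_total_income, est_mean_income)  # total and mean income
print(est_post_school, est_males, est_prop_males)
```

Writing the estimators as weighted sums means the same code would still apply if the weights differed between units, as they do in designs other than SRSWOR.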

6.4 Sampling Errors

Sampling theory gives us formulae for the variances of the estimators in the previous section. We will only be concerned with the variances of the estimators for the population total and population mean.

These variances are \[\begin{eqnarray} \bfa{Var}{\estm{Y}} &=& N^2\left(1-\frac{n}{N}\right)\frac{S_Y^2}{n}\\ \bfa{Var}{\estm{\bar{Y}}} &=& \left(1-\frac{n}{N}\right)\frac{S_Y^2}{n} \end{eqnarray}\] and for estimates of the population proportion, substituting $S_Y^2=\frac{N}{N-1}P(1-P)$ from Table 6.2, the variance is \[\begin{eqnarray} \bfa{Var}{\estm{p}} &=& \left(1-\frac{n}{N}\right)\frac{N}{N-1}\frac{P(1-P)}{n} \end{eqnarray}\] where the factor $\frac{N}{N-1}$ is close to 1 for large populations and is often dropped.

These formulae contain the finite population correction \[\begin{equation} \text{fpc} = 1-\frac{n}{N} = 1-f \end{equation}\] a factor which often appears in formulae in survey sampling. For small sampling fractions (i.e. small $f=n/N$) the fpc is approximately 1, and can be neglected. However, as $n$ approaches $N$ the fpc becomes closer and closer to zero. If we perform a census $n=N$ then the fpc is zero and the variances of the estimators of $Y$ and $\bar{Y}$ are also zero. This is as it should be since there is no uncertainty about the true values of $Y$ and $\bar{Y}$ when we carry out a census: a census has no sampling error.

The formulae for the variances given above contain $S_Y^2$: the adjusted population variance. This is a population parameter, and is unknown to us. However we can use the sample variance $s_y^2$ to get an estimate of these variances: \[\begin{eqnarray} \bfa{\mbox{$\widehat{\bf Var}$}}{\estm{Y}} &=& N^2\left(1-\frac{n}{N}\right)\frac{s_y^2}{n}\\ \bfa{\mbox{$\widehat{\bf Var}$}}{\estm{\bar{Y}}} &=& \left(1-\frac{n}{N}\right)\frac{s_y^2}{n} \end{eqnarray}\] For proportions the corresponding expression is \[\begin{eqnarray} \bfa{\mbox{$\widehat{\bf Var}$}}{\estm{p}} &=& \left(1-\frac{n}{N}\right)\frac{\estm{p}(1-\estm{p})}{n-1} \end{eqnarray}\]
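The estimated variances can be written as three small functions. This is a sketch with illustrative function names; it also checks two facts stated in this section: the variance vanishes for a census, and the proportion formula is the mean formula with $s_y^2=\frac{n}{n-1}\estm{p}(1-\estm{p})$.

```python
def est_var_mean(N, n, s2):
    """Estimated variance of the sample mean: (1 - n/N) * s^2 / n."""
    return (1 - n / N) * s2 / n


def est_var_total(N, n, s2):
    """Estimated variance of the estimated total: N^2 * (1 - n/N) * s^2 / n."""
    return N ** 2 * est_var_mean(N, n, s2)


def est_var_prop(N, n, p_hat):
    """Estimated variance of a sample proportion: (1 - n/N) * p(1-p) / (n-1)."""
    return (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)


# A census (n = N) has no sampling error
print(est_var_mean(N=200, n=200, s2=133580.34))  # 0.0

# The proportion formula agrees with the mean formula for an indicator variable
p_hat, n, N = 0.75, 20, 200
s2_indicator = n / (n - 1) * p_hat * (1 - p_hat)
print(est_var_prop(N, n, p_hat), est_var_mean(N, n, s2_indicator))
```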

6.5 Confidence Intervals

The central limit theorem states that sums and means of large samples of independent observations from a probability distribution (with finite variance) are approximately normally distributed.

In particular, if $\estm{T}$ is an unbiased estimator of the population parameter $T$ then (for sufficiently large sample sizes $n$) $\estm{T}$ is approximately Normally distributed with mean $T$ and variance $\bfa{Var}{\estm{T}}$.

This fact allows us to construct confidence intervals for our estimates.

The standard deviation of the sampling distribution of $\estm{T}$ is called the standard error of the estimator $\estm{T}$: \[\begin{equation} \bfa{SE}{\estm{T}} = \sqrt{\bfa{Var}{\estm{T}}} \end{equation}\] We can use the standard error to construct, say, a 95% confidence interval for the population parameter $T$ in the usual way: \[\begin{equation} \text{95\% CI for $T$} = \estm{T}\pm 1.96\times\bfa{SE}{\estm{T}} \end{equation}\]

Example continued

Let’s construct a confidence interval for the population total weekly income using the SRS of size 20.

The sample statistics are $n=20$, $\bar{y}=624.35$, $s_y^2=133580.34$, giving an estimate of the population total of \[\begin{eqnarray*} \estm{Y} &=& \frac{N}{n}\sum_{k=1}^n y_k = N\bar{y}\\ &=& 200\times 624.35 = 124870 \end{eqnarray*}\] and a standard error of \[\begin{eqnarray*} \bfa{SE}{\estm{Y}} &=& \sqrt{\bfa{Var}{\estm{Y}}} = \sqrt{N^2\left(1-\frac{n}{N}\right)\frac{s_y^2}{n}}\\ &=& \sqrt{200^2\left(1-\frac{20}{200}\right)\frac{133580.34}{20}}\\ &=& \sqrt{240444612} = 15506 \end{eqnarray*}\] This gives a margin of error for 95% confidence of: \[\begin{eqnarray*} \bfa{MOE}{\estm{Y}} &=& Z\times\bfa{SE}{\estm{Y}}\\ &=& 1.96\times 15506 = 30392 \end{eqnarray*}\] and hence a 95% confidence interval of \[\begin{eqnarray*} \estm{Y} \pm \bfa{MOE}{\estm{Y}} &=& 124870 \pm 30392\\ &=& (94478, 155262)\\ &\approx& (94500, 155300) \end{eqnarray*}\] This confidence interval is rather wide because the population is very variable, and a sample size of 20 is too small to give a good estimate. (Note – The true total value is 115072.)
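The calculation for the total can be reproduced in a few lines of Python. The variable names are illustrative; the inputs are the sample statistics quoted in the example.

```python
from math import sqrt

N, n = 200, 20
ybar, s2 = 624.35, 133580.34
Z = 1.96  # Normal quantile for 95% confidence

est_total = N * ybar
var_total = N ** 2 * (1 - n / N) * s2 / n
se_total = sqrt(var_total)
moe_total = Z * se_total
ci_total = (est_total - moe_total, est_total + moe_total)

print(round(est_total), round(var_total), round(se_total), round(moe_total))
print(tuple(round(limit) for limit in ci_total))
```

The true total, 115072, lies inside this interval.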

6.6 Quality of Estimates

The precision of an estimate can be characterised by the relative standard error (RSE). This is defined as \[\begin{equation} \bfa{RSE}{\estm{T}} = \frac{\bfa{SE}{\estm{T}}}{\bfa{E}{\estm{T}}} \end{equation}\] i.e. it measures how big the standard error is, compared to the estimate itself. The RSE is a particular case of a coefficient of variation (CV), which is the standard deviation of a quantity divided by its mean. Usually CVs are only useful for quantities which are strictly positive.

In general we don’t know the values of the standard error $\bfa{SE}{\estm{T}}$ or of the mean of the estimator $\bfa{E}{\estm{T}}$, but we can replace them as usual by estimates from our sample.

Example continued

The estimate of the population total is $\estm{Y}=124870$, with a standard error of 15506. Thus the RSE is \[ \bfa{RSE}{\estm{Y}} = \frac{\bfa{SE}{\estm{Y}}}{\estm{Y}} = \frac{15506}{124870} = 0.12 = 12\% \] This RSE of 12% indicates that the estimate is not of particularly high quality (we prefer RSEs close to 5% or 1% in general).

Let’s construct a confidence interval for the population mean weekly income using the SRS of size 20.

As before the sample statistics are $n=20$, $\bar{y}=624.35$, $s_y^2=133580.34$, giving an estimate of the population mean of \[\begin{eqnarray*} \estm{\bar{Y}} &=& \frac{1}{n}\sum_{k=1}^n y_k = \bar{y}\\ &=& 624.35 \end{eqnarray*}\] and a standard error of \[\begin{eqnarray*} \bfa{SE}{\estm{\bar{Y}}} &=& \sqrt{\bfa{Var}{\estm{\bar{Y}}}} = \sqrt{\left(1-\frac{n}{N}\right)\frac{s_y^2}{n}}\\ &=& \sqrt{\left(1-\frac{20}{200}\right)\frac{133580.34}{20}}\\ &=& \sqrt{6011.1} = 77.5 \end{eqnarray*}\] This gives a margin of error for 95% confidence of: \[\begin{eqnarray*} \bfa{MOE}{\estm{\bar{Y}}} &=& Z\times\bfa{SE}{\estm{\bar{Y}}}\\ &=& 1.96\times 77.5 = 151.9 \end{eqnarray*}\] and hence a 95% confidence interval of \[\begin{eqnarray*} \estm{\bar{Y}} \pm \bfa{MOE}{\estm{\bar{Y}}} &=& 624.4 \pm 151.9\\ &=& (472.5, 776.3)\\ &=& (470,776) \end{eqnarray*}\] and an RSE of \[ \bfa{RSE}{\estm{\bar{Y}}} = \frac{\bfa{SE}{\estm{\bar{Y}}}}{\estm{\bar{Y}}} = \frac{77.5}{624.4} = 0.12 = 12\% \] It’s no coincidence that this is also 12%: a total and mean of the same variable will always have the same RSE.
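The claim that a total and a mean share the same RSE is easy to confirm numerically. The sketch below uses the sample statistics from the example; the variable names are our own.

```python
import math

N, n, ybar, s2 = 200, 20, 624.35, 133580.34

se_mean = math.sqrt((1 - n / N) * s2 / n)   # SE of the estimated mean
se_total = N * se_mean                      # SE of the estimated total

rse_mean = se_mean / ybar                   # RSE of the mean
rse_total = se_total / (N * ybar)           # RSE of the total: the factor N cancels

print(f"RSE of mean  = {rse_mean:.3f}")
print(f"RSE of total = {rse_total:.3f}")
```

The factor $N$ appears in both the numerator and the denominator of the RSE of the total, which is why the two values agree.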

Finally let’s repeat this procedure for the proportion of people with a post-school qualification. The relevant sample statistics are $n=20$, and the sample proportion $\estm{p}=\bar{y}=0.75$, giving an estimate of the population proportion of \[\begin{eqnarray*} \estm{p} &=& \bar{y} = \frac{15}{20} = 0.75 \end{eqnarray*}\] and a standard error of \[\begin{eqnarray*} \bfa{SE}{\estm{p}} &=& \sqrt{\bfa{Var}{\estm{p}}} = \sqrt{\left(1-\frac{n}{N}\right)\frac{\estm{p}(1-\estm{p})}{n-1}}\\ &=& \sqrt{\left(1-\frac{20}{200}\right)\frac{(0.75)(0.25)}{19}}\\ &=& \sqrt{0.00888} = 0.094 \end{eqnarray*}\] This gives a margin of error for 95% confidence of: \[\begin{eqnarray*} \bfa{MOE}{\estm{p}} &=& Z\times\bfa{SE}{\estm{p}}\\ &=& 1.96\times 0.094 = 0.185 \end{eqnarray*}\] and hence a 95% confidence interval of \[\begin{eqnarray*} \estm{p} \pm \bfa{MOE}{\estm{p}} &=& 0.75 \pm 0.185\\ &=& (0.57,0.94)\\ &=& (57\%,94\%) \end{eqnarray*}\] RSEs are not usually calculated for estimates of proportions: it is sufficient to quote the SE. In this case an SE of $0.094$ or 9.4% is not very good. We really need a larger sample size for this and the earlier estimates.

Note. The above calculations for proportions are easily modified to give confidence intervals for total counts of individuals in a population with a particular property, since $Y=Np$, and hence $\widehat{Y}=N\widehat{p}$.

Thus in the above example the proportion of people with a post-school qualification is $\widehat{p}=0.75$, so our estimate of the total number of people with a post-school qualification is \[ \widehat{Y} = N\widehat{p} = (200)(0.75) = 150 \] with a standard error of \[\begin{eqnarray*} \bfa{SE}{\estm{Y}} = N\bfa{SE}{\estm{p}} &=& N\sqrt{\bfa{Var}{\estm{p}}} = \sqrt{N^2\left(1-\frac{n}{N}\right)\frac{\estm{p}(1-\estm{p})}{n-1}}\\ &=& \sqrt{200^2\left(1-\frac{20}{200}\right)\frac{(0.75)(0.25)}{19}}\\ &=& 18.8 \end{eqnarray*}\] This gives a margin of error for 95% confidence of: \[\begin{eqnarray*} \bfa{MOE}{\estm{Y}} &=& Z\times\bfa{SE}{\estm{Y}}\\ &=& 1.96\times 18.8 = 36.8 \end{eqnarray*}\] and hence a 95% confidence interval of \[\begin{eqnarray*} \estm{Y} \pm \bfa{MOE}{\estm{Y}} &=& 150 \pm 37\\ &=& (113,187) \end{eqnarray*}\] and an RSE of \[ \bfa{RSE}{\estm{{Y}}} = \frac{\bfa{SE}{\estm{{Y}}}}{\estm{{Y}}} = \frac{18.8}{150} = 0.13 = 13\% \]
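The calculations for a proportion and for the corresponding total count can be sketched as follows. The function name `srswor_proportion_ci` is hypothetical; the numbers are those of the example.

```python
import math

def srswor_proportion_ci(N, n, p_hat, z=1.96):
    """SE, margin of error and confidence interval for a proportion under SRSWOR."""
    se = math.sqrt((1 - n / N) * p_hat * (1 - p_hat) / (n - 1))
    moe = z * se
    return se, moe, (p_hat - moe, p_hat + moe)

N, n = 200, 20
p_hat = 15 / 20                            # 15 of the 20 sampled people have a qualification

se_p, moe_p, ci_p = srswor_proportion_ci(N, n, p_hat)

# Scale everything by N to get the estimated count Y = N p
count_hat = N * p_hat
se_count = N * se_p
ci_count = tuple(N * bound for bound in ci_p)

print(f"p: {p_hat:.2f}, SE {se_p:.3f}, CI ({ci_p[0]:.2f}, {ci_p[1]:.2f})")
print(f"Y: {count_hat:.0f}, SE {se_count:.1f}, CI ({ci_count[0]:.0f}, {ci_count[1]:.0f})")
```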

6.7 Sample Size Calculations

In the previous example we saw that the sample size was too small to give a good estimate. In the planning stage of a survey we almost always calculate the sample size that will be required to gain a specific accuracy. Such calculations are a crucial part of planning, and determine whether the survey can reach its objectives given the constraints of money and time.

6.7.1 Estimation of means

The required accuracy is usually specified in terms of a desired RSE or a desired margin of error. In general a confidence interval for a population mean takes the form \[ \text{CI} = \estm{\bar{Y}} \pm \bfa{MOE}{\estm{\bar{Y}}} \] For example for SRSWOR and the estimation of the population mean we have \[\begin{eqnarray*} \bfa{MOE}{\estm{\bar{Y}}} &=& Z\times\bfa{SE}{\estm{\bar{Y}}} \\ &=& Z\sqrt{\bfa{Var}{\estm{\bar{Y}}}}\\ &=& Z\sqrt{\left(1-\frac{n}{N}\right)\frac{S_Y^2}{n}}\\ &=& Z\sqrt{\left(1-\frac{n}{N}\right)}\frac{S_Y}{\sqrt{n}} \end{eqnarray*}\] In this formula we have

$m=\bfa{MOE}{\estm{\bar{Y}}}$ – we specify this, our desired MOE;
$Z$ = normal quantile for the appropriate level of confidence – we choose this: usually $Z=1.96$ for 95% confidence;
$N$ = population size: this is usually known, or we have some estimate;
$S_Y^2$ = population variance – we estimate this from previous data (in the planning stages of a survey we have no sample to estimate $S_Y^2$);
$n$ = sample size – this is what we want to deduce.

If $N$ is large enough the fpc can be neglected, and the MOE formula becomes \[ m = Z\times \frac{S_Y}{\sqrt{n}} \] which can be rearranged to give \[\begin{equation} n' = \left(\frac{Z}{m}\right)^2 S_Y^2 \end{equation}\] We denote this sample size $n'$, and note that it ignores the finite population correction. We can modify this first estimate of the sample size to take account of the fpc as follows: \[\begin{equation} n = \frac{n'}{1+\frac{n'}{N}} \end{equation}\] If $N$ is large enough compared to $n'$ then $n$ and $n'$ are roughly the same. The unadjusted sample size $n'$ is always larger than the adjusted one, and is therefore more conservative.

Example

I want to estimate the mean income of all New Zealanders to an accuracy of $\pm\$30$ in a 95% confidence interval. The population size is $N=4$ million and from the survey of 15-45 year olds I estimate the variance to be $S_Y^2=400^2$. How large a sample do I need?

For 95% confidence we have $Z=1.96$ so we first calculate \[ n' = \left(\frac{1.96}{30}\right)^2 400^2 = 683 \] and then adjust it \[ n = \frac{n'}{1+\frac{n'}{N}} = \frac{683}{1+\frac{683}{4000000}} = 683 \] Here the fpc is so small that the adjustment has no effect: the required sample size is still 683. Note that if we were estimating the mean income in the Chatham Islands, where the population is only $N=1000$ then the fpc does have a significant effect: \[ n = \frac{n'}{1+\frac{n'}{N}} = \frac{683}{1+\frac{683}{1000}} = 406 \]
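A small helper makes it easy to repeat this calculation for other inputs. This is a sketch under the same assumptions as the text (SRSWOR, normal approximation); the name `sample_size_mean` is our own, and the result is rounded up to a whole unit.

```python
import math

def sample_size_mean(moe, S, N=None, z=1.96):
    """Sample size needed to estimate a mean to within +/- moe.

    S is the assumed population standard deviation. If N is given,
    the finite population correction is applied.
    """
    n0 = (z / moe) ** 2 * S ** 2          # n', ignoring the fpc
    if N is not None:
        n0 = n0 / (1 + n0 / N)            # adjust for the fpc
    return math.ceil(n0 - 1e-9)           # round up; tolerance guards against float noise

print(sample_size_mean(30, 400))                 # no fpc
print(sample_size_mean(30, 400, N=4_000_000))    # New Zealand
print(sample_size_mean(30, 400, N=1000))         # Chatham Islands
```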

6.7.2 Allowing for non-response

If there is non-response expected in a sample survey, we need to allow for this when determining the sample size. If the response rate is expected to be $\phi$ and the sample size required with full response is $n$, then we should select \[\begin{equation} n_\text{select} = \frac{n}{\phi} \end{equation}\] units in order to achieve $n$ responses.

Example continued

In the example above a response rate of 90% is anticipated. How many units should be selected?

In this case we have $n=683$ and $\phi=0.90$ so \[ n_\text{select} = \frac{n}{\phi} = \frac{683}{0.9} = 759. \] i.e. we really need to select 760 units for the survey.

Note – we always round our sample sizes up rather than down, and usually to some round figure. Sample size calculations are often based on some very rough assumptions, and it is usually best to be conservative.
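The non-response adjustment is a one-line calculation. In the sketch below the function name `inflate_for_nonresponse` is hypothetical, and the result is rounded up to a whole unit before any further rounding to a convenient figure.

```python
import math

def inflate_for_nonresponse(n, response_rate):
    """Number of units to select so that about n of them respond."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return math.ceil(n / response_rate)

print(inflate_for_nonresponse(683, 0.90))
```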

6.7.3 Using RSEs in sample size calculations

If the desired RSE is specified, rather than the MOE, then we proceed as follows. For a mean, the RSE is \[\begin{eqnarray*} \bfa{RSE}{\estm{\bar{Y}}} &=& \frac{\bfa{SE}{\estm{\bar{Y}}}}{\bar{Y}} \\ &=& \frac{1}{\bar{Y}}\sqrt{\left(1-\frac{n}{N}\right)\frac{S_Y^2}{n}}\\ &=& \sqrt{\left(1-\frac{n}{N}\right)\frac{1}{n} \frac{S_Y^2}{\bar{Y}^2}}\\ &=& \sqrt{\left(1-\frac{n}{N}\right)\frac{c^2}{n}} \end{eqnarray*}\] where \[ c = \frac{S_Y}{\bar{Y}} \] is called the coefficient of variation (CV) of the population: the standard deviation divided by the mean. This is a measure of the variability of the population values: a low CV means the population values do not vary very much.

If the desired RSE is $r$, then if we ignore the fpc as before the formula above can be rearranged to give the simple expression \[ n' = \frac{c^2}{r^2} \] which we correct as before to take account of the fpc: \[\begin{equation} n = \frac{n'}{1+\frac{n'}{N}} \end{equation}\] As before if $N$ is very large (much larger than $n'$), then $n=n'$. This has the important consequence that the same sample size is required to survey similar populations (i.e. similar $c$) which are of different sizes $N$ (say New Zealand, Australia and the USA) to the same degree of accuracy (RSE) $r$.
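To see how little the population size matters, we can evaluate the formula for several values of $N$. In this sketch the function name, the CV of $c=0.8$, the target RSE of 5% and the population sizes are all illustrative assumptions, not figures from a real survey.

```python
import math

def sample_size_rse(cv, rse, N=None):
    """Sample size needed to reach a target RSE, given the population CV."""
    n0 = (cv / rse) ** 2                  # n' = c^2 / r^2, ignoring the fpc
    if N is not None:
        n0 = n0 / (1 + n0 / N)            # adjust for the fpc
    return math.ceil(n0 - 1e-9)           # round up; tolerance guards against float noise

# Rough, illustrative population sizes
for label, N in [("small town", 1_000), ("5 million", 5_000_000),
                 ("26 million", 26_000_000), ("330 million", 330_000_000)]:
    print(f"N = {label:>12}: n = {sample_size_rse(0.8, 0.05, N)}")
```

Only for the smallest population does the fpc reduce the required sample size noticeably.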

6.7.4 Estimation of totals

When determining the sample size required for a total, only minor modifications are required to the formulae used for means.

If the desired SE for the population total is specified, then divide by the population size $N$ to get the SE for the population mean, and then proceed as above.

If the desired RSE is specified, then proceed exactly as above – since the RSEs for totals and means are the same.

6.7.5 Estimation of proportions

When a population proportion is to be estimated the initial estimate of the required sample size is \[\begin{equation} n' = \left(\frac{Z}{m}\right)^2 p(1-p) \end{equation}\] and we require some prior guess at the proportion $p$. Here $m$ is the margin of error as before.

Example

Suppose we wish to estimate the proportion of individuals who having once been on the unemployment benefit return to another benefit. We expect the figure to be round about 30%. We wish to estimate this proportion to a level of $\pm3\%$, or 0.03.

We require a sample size of \[ n' = \left(\frac{1.96}{0.03}\right)^2 (0.3)(0.7) = 896 \] i.e. around 900 people.

The initial estimate $n'$ is modified in exactly the same way to adjust for the fpc: \[\begin{equation} n = \frac{n'}{1+\frac{n'}{N}} \end{equation}\]

If $p$ is unknown, then take $p=0.5$ to be conservative (gives the largest possible value for $n'$). And also note that in that case if we have $Z=1.96$ then \[\begin{equation} n' = \left(\frac{1.96}{m}\right)^2 (0.5)(0.5) \simeq \left(\frac{2}{m}\right)^2 (0.5)(0.5) = \frac{1}{m^2} \end{equation}\] a handy rule of thumb. Thus if we specify, say, that we want a margin of error of $m=5\%$ for a proportion, then we immediately estimate \[ n' = \frac{1}{m^2} = \frac{1}{(0.05)^2} = 400 \] Alternatively, if we are told the sample size $n'$, then we can quickly estimate the margin of error for an estimate of a proportion: \[\begin{equation} m = \frac{1}{\sqrt{n'}} \end{equation}\] For example, a sample size of $n'=1000$ gets a margin of error of 3.2%: \[ m = \frac{1}{\sqrt{n'}} = \frac{1}{\sqrt{1000}} = 0.032 = 3.2\% \] But note, however, that this margin of error applies only to estimates close to $p=0.5$. For estimates $\estm{p}$ close to zero or 1, we should use the correct formula for the MOE: \[ m = \bfa{MOE}{\estm{p}} = Z\times\sqrt{\left(1-\frac{n}{N}\right)\frac{\estm{p}(1-\estm{p})}{n-1}} \]
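The exact formula and the rule of thumb can be compared directly. The function names below are our own, and the values are left unrounded so that the two formulas can be compared.

```python
def sample_size_proportion(moe, p=0.5, z=1.96):
    """Initial sample size n' for estimating a proportion (no fpc, not rounded)."""
    return (z / moe) ** 2 * p * (1 - p)

def rule_of_thumb(moe):
    """Conservative approximation n' = 1 / m^2 (p = 0.5, Z taken as 2)."""
    return 1 / moe ** 2

print(f"p = 0.3, m = 3%: n' = {sample_size_proportion(0.03, p=0.3):.1f}")
print(f"p = 0.5, m = 5%: n' = {sample_size_proportion(0.05):.1f}")
print(f"rule of thumb, m = 5%: n' = {rule_of_thumb(0.05):.0f}")
```

The rule of thumb is slightly larger than the exact value at $p=0.5$ because it replaces $Z=1.96$ by 2.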

Mediawatch. It is a misunderstanding of this point that leads to the nonsensical comment that a certain political party is ‘polling beneath the margin of error.’ This is a statement that journalists are often heard to make when reporting political polls, which typically have a nominal margin of error of 3.2% because they are based on a sample of $n=1000$. If a party only polls 1%, then the journalists make the claim that the party is polling beneath the MOE: i.e. they think that the MOE is larger than the estimate.

What is really going on? The margin of error depends on the observed value of $\estm{p}$: for $n=1000$, and ignoring the fpc:

Figure 6.4: Dependence of the MOE for a proportion on the true population proportion (sample size n=1000).

While a party polling at 50% has an MOE of 3.2%, a party polling at 1% has an MOE of 0.62%, which is certainly not larger than 1%.
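The dependence of the MOE on the estimated proportion can be tabulated with a short sketch. The function name `moe_proportion` is our own; the fpc is ignored unless a population size is supplied.

```python
import math

def moe_proportion(p, n, N=None, z=1.96):
    """Margin of error for an estimated proportion p from a sample of size n."""
    fpc = 1.0 if N is None else 1 - n / N
    return z * math.sqrt(fpc * p * (1 - p) / (n - 1))

for p in (0.01, 0.05, 0.10, 0.30, 0.50):
    print(f"p = {p:.2f}: MOE = {100 * moe_proportion(p, 1000):.2f}%")
```

With $Z=1.96$ the value at $p=0.5$ comes out at about 3.1%; the nominal 3.2% comes from the rule of thumb $1/\sqrt{n}$, which takes $Z$ to be 2.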

6.8 Derivation of sampling errors

In this section we give the derivations of the expressions in Section 6.4. We start with the probability distribution of the sample membership under SRSWOR.

6.8.1 Inclusion Probabilities

There are $\binom{N}{n}$ samples in all. Of these, there are $\binom{N-1}{n-1}$ samples which contain unit $i$ and $\binom{N-2}{n-2}$ samples which contain both units $i$ and $j$, $i\neq j$. Since each sample has equal probability $1/\binom{N}{n}$, the probability that unit $i$ is included in the sample is: \[ \pi_{i} = \binom{N-1}{n-1}\left/ \binom{N}{n}\right. = \frac{n}{N} \qquad \text{for $i=1, \ldots , N$} \] This probability is called the 1st order inclusion probability.

The probability that both units $i$ and $j$ are included is: \[ \pi_{ij} = \binom{N-2}{n-2}\left/ \binom{N}{n}\right. = \frac{n\left(n-1\right)}{N\left(N-1\right)} \qquad \text{for $i,j=1, \ldots , N, i \neq j$} \] and this probability is called the joint inclusion probability or the 2nd order inclusion probability. Note that $\pi_{ii}=\pi_i$.

Note: There are sampling schemes which have 1st and 2nd order inclusion probabilities equal to those of SRSWOR, but which are not SRSWOR.

6.8.2 Sample Membership Indicator Variables

The sample membership indicator variable $I_{i}$ is a random variable which takes the value one or zero for each unit $i$ in the population. It is one if the unit is in the sample, and zero otherwise.

In a probability sampling scheme $I_i$ is a random variable. Its mean value is just the 1st order inclusion probability: \[\begin{eqnarray*} \bfa{E}{I_i} &=& 0\times{\rm Pr}(I_i=0) + 1\times{\rm Pr}(I_i=1)\\ &=& 0\times(1-\pi_i) + 1\times\pi_i\\ &=& \pi_i \end{eqnarray*}\] Also note that \[\begin{eqnarray*} \bfa{E}{I_iI_j} &=& 0\times0\times{\rm Pr}(I_i=0,I_j=0) + 1\times0\times{\rm Pr}(I_i=1,I_j=0)\\ & & +\ 0\times1\times{\rm Pr}(I_i=0,I_j=1) + 1\times1\times{\rm Pr}(I_i=1,I_j=1)\\ &=& {\rm Pr}(I_i=1,I_j=1)\\ &=& \pi_{ij} \end{eqnarray*}\] so that the covariance of $I_i$ and $I_j$ is \[\begin{eqnarray*} \bfa{Cov}{I_i,I_j} &=& \bfa{E}{I_iI_j} - \bfa{E}{I_i}\bfa{E}{I_j}\\ &=& \left\{\begin{array}{ll} \pi_{ij}-\pi_i\pi_j\hspace{1ex}& \text{if $i\neq j$},\\ \pi_i(1-\pi_i) & \text{if $i=j$} \end{array} \right. \end{eqnarray*}\]

For simple random sampling and $i\neq j$ we have \[\begin{eqnarray*} \pi_{ij} -\pi_i\pi_j &=& \left(\frac{n-1}{N-1}\right) \frac{n}{N} - \left(\frac{n}{N}\right)^2\\ &=& \frac{n}{N}\left[\frac{n-1}{N-1}-\frac{n}{N}\right]\\ &=& \frac{n}{N^2(N-1)}\left[(n-1)N-n(N-1)\right]\\ &=& \frac{n}{N^2(N-1)}(n-N)\\ &=& -\frac{n}{N(N-1)}\left(1-\frac{n}{N}\right) \end{eqnarray*}\] so that \[\begin{eqnarray*} \bfa{E}{I_i} &=& \frac{n}{N}\\ \bfa{Cov}{I_i,I_j} &=& \left\{\begin{array}{ll} -\frac{n}{N}\left(1-\frac{n}{N}\right)\frac{1}{N-1} \hspace{1ex}& \text{if $i\neq j$},\\ \frac{n}{N}\left(1-\frac{n}{N}\right) & \text{if $i=j$} \end{array} \right. \end{eqnarray*}\] In other words there is a (small) negative correlation between $I_i$ and $I_j$ for distinct units. If unit $i$ is already selected it slightly decreases the chances that unit $j$ will end up in the sample.
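For a small population these results can be verified by listing every possible sample. The sketch below assumes $N=6$ and $n=3$ (arbitrary choices) and uses exact fractions so that the comparison with the formulae is exact; the function name is our own.

```python
from fractions import Fraction
from itertools import combinations

def indicator_moments(N, n, i=0, j=1):
    """Return pi_i, pi_ij and Cov[I_i, I_j] under SRSWOR by full enumeration."""
    samples = list(combinations(range(N), n))    # all equally likely samples
    m = len(samples)
    pi_i = Fraction(sum(i in s for s in samples), m)
    pi_j = Fraction(sum(j in s for s in samples), m)
    pi_ij = Fraction(sum(i in s and j in s for s in samples), m)
    return pi_i, pi_ij, pi_ij - pi_i * pi_j

N, n = 6, 3
pi_i, pi_ij, cov = indicator_moments(N, n)
print(pi_i, pi_ij, cov)
```

The covariance is negative, as the derivation above shows it must be.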

6.8.3 Horvitz-Thompson Estimator

One of the reasons for introducing the machinery of sample membership indicator random variables $I_{i}$ and their expected values, is that it immediately gives us an estimator of the population total called the Horvitz-Thompson Estimator under any sampling scheme where the 1st order inclusion probabilities $\pi_i$ are known and are non-zero. The HT estimator is \[\begin{eqnarray} \estm{Y}_{HT} &\equiv& \sum_{k\in s} \frac{y_{k}}{\pi_k}\\ \nonumber &=& \sum_{i\in U} I_i\frac{Y_i}{\pi_i} \end{eqnarray}\] (Here $k\in s$ indicates the units in the sample, and $i\in U$ the units in the population.)

We can show that the HT estimator is unbiased for $Y$ by computing its mean: \[\begin{eqnarray*} \bfa{E}{\estm{Y}_{HT}} &=& \sum_{i\in U} \bfa{E}{I_i}\frac{Y_i}{\pi_i}\\ &=& \sum_{i\in U} \pi_i\frac{Y_i}{\pi_i}\\ &=& \sum_{i\in U} Y_i\\ &=& Y \end{eqnarray*}\] Its variance follows straightforwardly too: \[\begin{eqnarray} \nonumber \bfa{Var}{\estm{Y}_{HT}} &=& \bfa{E}{\estm{Y}_{HT}\estm{Y}_{HT}} - \bfa{E}{\estm{Y}_{HT}}\bfa{E}{\estm{Y}_{HT}}\\ \nonumber &=& \sum_{i\in U} \sum_{j\in U} \bfa{E}{I_iI_j} \frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j} - \sum_{i\in U} \sum_{j\in U} \bfa{E}{I_i}\bfa{E}{I_j} \frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j}\\ \nonumber &=& \sum_{i\in U} \sum_{j\in U} \bfa{Cov}{I_i,I_j} \frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j}\\ \nonumber &=& \sum_{i\in U} \frac{Y_i^2}{\pi_i^2}\bfa{Cov}{I_i,I_i} + \sum_{i\in U} \sum_{j\in U, j\neq i} \frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j} \bfa{Cov}{I_i,I_j}\\ &=& \sum_{i\in U} \frac{Y_i^2}{\pi_i^2}\pi_i(1-\pi_i) + \sum_{i\in U} \sum_{j\in U, j\neq i} \frac{Y_i}{\pi_i}\frac{Y_j}{\pi_j} (\pi_{ij}-\pi_i\pi_j) \end{eqnarray}\] In the case of SRSWOR this variance becomes \[\begin{eqnarray} \nonumber \bfa{Var}{\estm{Y}_{HT}} &=& \sum_{i\in U} Y_i^2\frac{N^2}{n^2} \frac{n}{N}\left(1-\frac{n}{N}\right) - \sum_{i\in U} \sum_{j\in U, j\neq i} Y_iY_j\frac{N^2}{n^2} \frac{n}{N}\left(1-\frac{n}{N}\right)\frac{1}{N-1}\\ \nonumber &=& \frac{N}{n}\left(1-\frac{n}{N}\right)\frac{1}{N-1} \left[ (N-1) \sum_{i\in U} Y_i^2 - \sum_{i\in U} \sum_{j\in U, j\neq i} Y_iY_j \right]\\ \nonumber &=& \frac{N}{n}\left(1-\frac{n}{N}\right)\frac{1}{N-1} \left[ N \sum_{i\in U} Y_i^2 - \sum_{i\in U} \sum_{j\in U} Y_iY_j \right]\\ \nonumber &=& \frac{N^2}{n}\left(1-\frac{n}{N}\right) \frac{\sum_{i\in U} Y_i^2 - N\bar{Y}^2}{N-1}\\ &=& N^2\left(1-\frac{n}{N}\right) \frac{S_Y^2}{n} \end{eqnarray}\]

If we know the population size $N$ then the estimator for the population mean $\estm{\bar{Y}}_{HT}$ follows immediately from the estimator for the total: \[\begin{equation} \estm{\bar{Y}}_{HT} \equiv \frac{\estm{Y}_{HT}}{N} \end{equation}\] This estimator is unbiased for $\bar{Y}$ since \[ \bfa{E}{\estm{\bar{Y}}_{HT}} = \frac{\bfa{E}{\estm{Y}_{HT}}}{N} = \frac{Y}{N} = \bar{Y} \] and its variance is \[\begin{eqnarray} \nonumber \bfa{Var}{\estm{\bar{Y}}_{HT}} &=& {\bf Var}\left[\frac{\estm{Y}_{HT}}{N}\right]\\ \nonumber &=& \frac{\bfa{Var}{\estm{Y}_{HT}}}{N^2}\\ &=& \left(1-\frac{n}{N}\right) \frac{S_Y^2}{n} \end{eqnarray}\]
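Both the unbiasedness and the variance formula can be checked by brute force on a toy population. In the sketch below the five population values are hypothetical, and every possible SRSWOR sample of size $n=3$ is enumerated.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

Y = [3.0, 7.0, 8.0, 12.0, 20.0]     # hypothetical population values
N, n = len(Y), 3
pi = n / N                          # 1st order inclusion probability under SRSWOR

# HT estimate of the total for each of the equally likely samples
estimates = [sum(Y[k] / pi for k in s) for s in combinations(range(N), n)]

ht_mean = mean(estimates)                            # E[Y_HT] over all samples
ht_var = pvariance(estimates)                        # exact sampling variance
formula_var = N**2 * (1 - n / N) * variance(Y) / n   # N^2 (1 - n/N) S_Y^2 / n

print(ht_mean, sum(Y))
print(ht_var, formula_var)
```

Here `variance` uses the divisor $N-1$, matching the definition of $S_Y^2$, while `pvariance` averages over the full list of samples.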

6.9 Sampling errors in SRSWR

We can carry out a similar procedure to estimate the sampling error in the case of simple random sampling with replacement.

In this case the indicator variables $I_i$, $i=1,\ldots,N$, become counts of the number of times that each unit is selected. In without replacement sampling $I_i$ can only be 0 or 1, but in with replacement sampling $I_i$ can take any value from 0 to $n$, subject to the values adding to the total sample size $n=\sum_{i=1}^N I_i$.

The full vector of ${\bf I}$ values follows a multinomial distribution: \[ {\bf I} \sim \text{Multinomial}(n,(1/N){\bf 1}) \] where each individual has a probability $\frac{1}{N}$ of being selected on each of the $n$ draws. Then: \[\begin{eqnarray*} \bfa{E}{I_i} &=& \frac{n}{N} = \pi_i\\ \bfa{E}{I_i^2} &=& \frac{n}{N}\left(1+\frac{n-1}{N}\right)\\ \bfa{E}{I_iI_j} &=& \frac{n(n-1)}{N^2} = \pi_{ij} \qquad \text{$i\neq j$} \end{eqnarray*}\] so that the covariance of $I_i$ and $I_j$ is \[\begin{eqnarray*} \bfa{Cov}{I_i,I_j} &=& \bfa{E}{I_iI_j} - \bfa{E}{I_i}\bfa{E}{I_j}\\ &=& \left\{\begin{array}{ll} -\frac{n}{N^2}& \text{if $i\neq j$},\\ \frac{n}{N}-\frac{n}{N^2} & \text{if $i=j$} \end{array} \right. \end{eqnarray*}\] As with SRSWOR there is a (small) negative correlation between $I_i$ and $I_j$ for distinct units. If unit $i$ is already selected it slightly decreases the chances that unit $j$ will end up in the sample.

We have the same unbiased HT estimator \[\begin{eqnarray} \estm{Y}_{HT} &\equiv& \sum_{k\in s} \frac{y_{k}}{\pi_k}\\ \nonumber &=& \sum_{i\in U} I_i\frac{Y_i}{\pi_i}\\ &=& \frac{N}{n} \sum_{i\in U} I_iY_i \end{eqnarray}\] with \[\begin{eqnarray*} \bfa{E}{\estm{Y}_{HT}} &=& \frac{N}{n} \sum_{i\in U} \bfa{E}{I_i}Y_i\\ &=& \frac{N}{n} \sum_{i\in U} \frac{n}{N} Y_i\\ &=& \sum_{i\in U} Y_i\\ &=& Y \end{eqnarray*}\]

In SRSWR the variance of this estimator is \[\begin{eqnarray*} \bfa{Var}{\estm{Y}_{HT}} &=& {\bf Var}\left[\frac{N}{n} \sum_{i\in U} I_iY_i\right]\\ &=& \frac{N^2}{n^2} \sum_{i\in U}\sum_{j\in U} {\bf Cov}\left[I_i,I_j\right] Y_iY_j\\ &=& \frac{N^2}{n^2} \sum_{i\in U}\sum_{j\in U} \left[-\frac{n}{N^2} + I(i=j)\frac{n}{N} \right] Y_iY_j\\ &=& -\frac{N^2}{n^2} \sum_{i\in U}\sum_{j\in U} \frac{n}{N^2}Y_iY_j + \frac{N^2}{n^2} \sum_{i\in U} \frac{n}{N} Y_i^2\\ &=& -\frac{1}{n}\sum_{i\in U}\sum_{j\in U} Y_iY_j + \frac{N}{n} \sum_{i\in U} Y_i^2\\ &=& -\frac{1}{n} N^2\bar{Y}^2 + \frac{N}{n} \left[(N-1)S_Y^2 + N\bar{Y}^2\right]\\ &=& N(N-1)\frac{S_Y^2}{n} \end{eqnarray*}\] (Here we’ve used the notation $I(i=j)$, known as the indicator function, which is 1 if $i=j$ and 0 if not.)

This SRSWR variance is very close to the SRSWOR variance \[\begin{eqnarray*} \bfa{Var}{\estm{Y}_{HT, SRSWOR}} &=& N^2\left(1-\frac{n}{N}\right)\frac{S_Y^2}{n} \end{eqnarray*}\] with the principal difference being the presence of the finite population correction factor $(1-\frac{n}{N})$ in the SRSWOR expression. In SRSWR even taking a sample of size $n=N$ doesn’t guarantee a census, so there is still some sampling error left.
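The with replacement variance can be checked by enumeration in the same way, since every ordered sequence of $n$ draws is equally likely. The population values below are hypothetical.

```python
from itertools import product
from statistics import mean, pvariance, variance

Y = [3.0, 7.0, 8.0, 12.0, 20.0]     # hypothetical population values
N, n = len(Y), 3

# Estimate (N/n) * sum of drawn values for every ordered sequence of n draws
estimates = [N / n * sum(Y[k] for k in draws)
             for draws in product(range(N), repeat=n)]

wr_mean = mean(estimates)
wr_var = pvariance(estimates)                        # exact SRSWR sampling variance
var_wr = N * (N - 1) * variance(Y) / n               # N (N-1) S_Y^2 / n
var_wor = N**2 * (1 - n / N) * variance(Y) / n       # SRSWOR variance for comparison

print(wr_mean, sum(Y))
print(wr_var, var_wr, var_wor)
```

For a sampling fraction as large as $3/5$ the SRSWOR variance is much smaller; the two only become close when $n$ is small relative to $N$.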

7 Probability Proportional to Size Sampling

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$ $\newcommand{\estm}{\widehat}$

We can use auxiliary information in a variety of ways to improve estimates over SRS. In particular we note that since the most influential units in an estimate are often the largest, it is often in our interest to make sure we select the largest units with higher probability than the smaller units. This idea leads naturally to probability proportional to size sampling, where each unit has a distinct probability of selection $\psi_i$ on any given draw, which is related to its size.

7.1 Cumulative selection method

Let’s say we want to estimate the number of calves born in spring on farms supplying milk for Town Milk Supply. We have information on the number of milking cows on each such farm and indeed each cow is registered on a computer database. Farms vary in size, and we want to make sure we give larger farms (those with more cows) a greater chance of selection, since they will contribute more variability to our estimates. Assume that there are $N$ farms, labelled $i=1,\ldots,N$, and that there are $X_i$ cows on farm $i$.

We give farm $i$ a probability of selection of $\psi_i=X_i/X$ on each draw from the population. Here $X=\sum_i X_i$ is the total number of cows on all the farms. We want to have a sample of $n$ farms. To achieve this we put the $N$ farms into a list, and then number all of the cows on all of the farms as $1,\ldots,X$.

There are $X_1$ cows on the first farm, so they are numbered $1,\ldots,X_1$. On the second farm where there are $X_2$ cows we number them $X_1+1,\ldots,X_1+X_2$. The cows on the third farm are numbered $X_1+X_2+1,\ldots,X_1+X_2+X_3$ and so on up to the $N^{\rm th}$ farm and the $X^{\rm th}$ cow.

To choose our random sample we:

Choose a random number $r\in\{1,\ldots,X\}$
If cow number $r$ is in farm $i$ then select that farm into the sample
Repeat until we have $n$ farms in the sample

Note that this sampling is probability proportional to size with replacement (PPSWR). We don’t remove a farm (or a cow) from the list once it has been selected. We sample with replacement because selecting a probability proportional to size (PPS) sample without replacement is considerably more complicated: once a unit is removed, the selection probabilities of the remaining units change on every draw. Inevitably, therefore, we will end up with some farms being selected more than once. What do we do with them? We do not drop the repeated observations – we must use the observations for those farms as many times as they appear.

We only collect the data once, of course: if a unit appears more than once in the sample, we simply use its data repeatedly in our estimator.

7.2 Rejection Method

An alternative way of drawing a PPSWR sample, which is simpler to implement than the cumulative method given above, is the Rejection Method. Here again we have $N$ units in the population, and a size measure $X_i$ for each one. The largest unit is of size $M_X={\rm max}(X_i)$.

1. Draw a random number $i\in\{1,\ldots,N\}$, which means we are considering whether or not to select unit $i$.
2. Draw another random number $r\in\{1,\ldots,M_X\}$.
3. If $r\leq X_i$ then select unit $i$ in the sample, otherwise go back to Step 1.
4. Repeat until $n$ units have been selected.

This is called a rejection method, since sometimes we consider a unit for inclusion in the sample, but then reject it. In general it takes more than $n$ passes through the sequence of steps to select $n$ sample members.

Example

Consider the following population of $N=6$ university classes:

Table 7.1: University Classes
Unit, $i$	Size, $X_i$	Cumulative size, ${\cal T}_i=\sum_{j=1}^iX_j$
1	10	10
2	15	25
3	40	65
4	40	105
5	80	185
6	100	285

We have a size measure for each unit $X_i$=the number of students in class $i$. We have calculated the cumulative running total ${\cal T}_i$ of $X_i$ for all units up to and including unit $i$. The total of the sizes is $X={\cal T}_N=285$. Now let’s select a sample of size $n=4$ using the cumulative method. We generate four random numbers in the range $1,\ldots,285$: say these numbers are $\{165, 205, 197, 12\}$. The $165^{\rm th}$ student is in class 5, so we select that class into the sample. The $205^{\rm th}$ student is in class 6, and so is the $197^{\rm th}$, so we select class 6 twice into the sample. Finally the $12^{\rm th}$ student is in class 2. Thus our final sample is $\{2,5,6,6\}$.
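The cumulative method amounts to looking up each random number in the list of running totals. A minimal sketch, using the class sizes from Table 7.1 and a hypothetical function name `cumulative_select`:

```python
import random
from bisect import bisect_left
from itertools import accumulate

def cumulative_select(sizes, draws):
    """Map each random number r in 1..X to the unit whose cumulative range contains it."""
    cum = list(accumulate(sizes))                      # running totals T_i
    # bisect_left finds the first unit i with T_i >= r; units are labelled from 1
    return [bisect_left(cum, r) + 1 for r in draws]

sizes = [10, 15, 40, 40, 80, 100]                      # class sizes from Table 7.1

# The four random numbers used in the example
print(cumulative_select(sizes, [165, 205, 197, 12]))   # [5, 6, 6, 2]

# In practice the numbers would be generated, for instance:
rng = random.Random(2024)                              # arbitrary seed
draws = [rng.randint(1, sum(sizes)) for _ in range(4)]
print(cumulative_select(sizes, draws))
```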

How does the rejection method work in this situation? We first note that the largest class has size $M_X=100$. We select a random number $i$ in the range $1,\ldots,6$, say $i=4$: thus we are considering class 4 for selection, which has size 40. We then draw another random number $r$ in the range $1,\ldots,100$: say $r=2$. We see that $r\leq X_4$ (i.e. $2\leq 40$), so we select class 4 into the sample as the first unit.

Next we select another $i$ value – say we get $i=4$ again. We draw $r$ again, but this time get $r=99$, which is greater than $X_4=40$. So we reject class 4 as the second unit.

We continue in this way, drawing pairs of random numbers $(i,r)$ and testing to see if $r\leq X_i$.

Table 7.2: Rejection method for selecting University classes by PPSWR
$i$	$r$	$X_i$	$r\leq X_i$?	Action
4	2	40	Yes	Accept class 4
4	99	40	No	Reject and continue
6	57	100	Yes	Accept class 6
5	63	80	Yes	Accept class 5
2	86	15	No	Reject and continue
2	57	15	No	Reject and continue
6	33	100	Yes	Accept class 6

So we end up with the sample $\{4,5,6,6\}$.
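The rejection method can be sketched as a loop over pairs $(i,r)$. In the code below the function names are our own; the pairs are passed in as an argument so that the draws in Table 7.2 can be replayed exactly, and a generator of random pairs is provided for ordinary use.

```python
import random

def random_pairs(sizes, rng):
    """Endless stream of (i, r) pairs: i uniform on 1..N, r uniform on 1..max size."""
    N, max_size = len(sizes), max(sizes)
    while True:
        yield rng.randint(1, N), rng.randint(1, max_size)

def rejection_select(sizes, n, pairs):
    """Select n units by PPSWR, accepting unit i whenever r <= X_i."""
    pairs = iter(pairs)
    sample = []
    while len(sample) < n:
        i, r = next(pairs)            # raises StopIteration if the pairs run out
        if r <= sizes[i - 1]:         # units are labelled from 1
            sample.append(i)
    return sample

sizes = [10, 15, 40, 40, 80, 100]     # class sizes from Table 7.1

# Replay the draws listed in Table 7.2
table_pairs = [(4, 2), (4, 99), (6, 57), (5, 63), (2, 86), (2, 57), (6, 33)]
print(rejection_select(sizes, 4, table_pairs))        # [4, 6, 5, 6]

# A fresh random sample (arbitrary seed)
print(rejection_select(sizes, 4, random_pairs(sizes, random.Random(1))))
```

The replayed sample is listed in order of selection; sorted, it is the sample $\{4,5,6,6\}$ found above.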

7.3 Inclusion probabilities

We have seen that the probability that a unit is selected on any given PPSWR draw is $\psi_i=X_i/X$. What is the probability $\pi_i$ that a unit is selected into a sample of size $n$?

The probability that a unit is selected is one minus the probability that on every one of the $n$ draws it is not selected: \[\begin{equation} \pi_i = 1 - (1-\psi_i)^n = 1-\left(1-\frac{X_i}{X}\right)^n \tag{7.1} \end{equation}\] By noting that \[\begin{eqnarray*} p(\text{$i$ and $j$}) &=& p(i) + p(j) - p(\text{$i$ or $j$})\\ &=& p(i) + p(j) - [1-p(\text{neither $i$ nor $j$})] \end{eqnarray*}\] it can easily be shown that the 2nd order inclusion probabilities are: \[\begin{equation} \pi_{ij} = 1-(1-\psi_i)^n-(1-\psi_j)^n+(1-\psi_i-\psi_j)^n \end{equation}\]

We have already noted that in with replacement sampling, there is a chance that a unit may be selected multiple times. In general unit $i$ is selected $Q_i$ times, where $Q_i$ is a Binomial random variable: \[\begin{equation} Q_i \sim {\rm Binomial}(n,\psi_i) \end{equation}\] and the full set of $N$ $Q_i$ values is distributed according to a multinomial distribution \[\begin{equation} {\bf Q} \sim {\rm Multinomial}(n,\psi) \end{equation}\]
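The inclusion probabilities are straightforward to evaluate. The sketch below applies them to the university classes of Table 7.1 with $n=4$; the function names are our own. It also prints $n\psi_i$, the expected number of times each class is selected, for comparison with $\pi_i$.

```python
def ppswr_inclusion(psi_i, n):
    """1st order inclusion probability: 1 - (1 - psi_i)^n."""
    return 1 - (1 - psi_i) ** n

def ppswr_joint_inclusion(psi_i, psi_j, n):
    """2nd order inclusion probability for two distinct units i and j."""
    return 1 - (1 - psi_i) ** n - (1 - psi_j) ** n + (1 - psi_i - psi_j) ** n

sizes = [10, 15, 40, 40, 80, 100]     # class sizes from Table 7.1
X, n = sum(sizes), 4

for i, x in enumerate(sizes, start=1):
    psi = x / X
    print(f"class {i}: psi = {psi:.3f}, pi = {ppswr_inclusion(psi, n):.3f}, "
          f"n*psi = {n * psi:.3f}")
```

For the small classes $\pi_i$ and $n\psi_i$ are close, but for the largest class $n\psi_i$ exceeds 1, so it cannot be read as a probability.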

7.4 Estimators in PPSWR

7.4.1 Total

The Horvitz-Thompson estimator of the total is, as always \[\begin{equation} \widehat{Y} = \sum_{k\in s} \frac{y_k}{\pi_k} \end{equation}\] where the inclusion probabilities $\pi_i$ are given by Equation (7.1). However we can use the following approximation which holds for small $\psi_i$: \[\begin{equation} \pi_i = 1-\left(1-\frac{X_{i}}{X}\right)^{n} \approx n\frac{X_{i}}{X} = n\psi_i \end{equation}\] This suggests another estimator for the total called the Hansen-Hurwitz estimator, where we replace the $\pi_i$ values with $n\psi_i$: \[\begin{equation} \widehat{Y}_{HH} = \frac{1}{n} \sum_{k \in s} \frac{y_k}{\psi_k} \end{equation}\] This estimator is unbiased, and has variance \[\begin{equation} \bfa{Var}{\widehat{Y}_{HH}} = \frac{1}{n}\sum_{i=1}^{N}\left(\frac{Y_{i}}{\psi_i}-Y\right)^{2}\psi_{i} \end{equation}\] An unbiased estimator of this variance from the sample is \[\begin{equation} \bfa{\widehat{Var}}{\widehat{Y}_{HH}} = \frac{1}{n(n-1)} \sum_{k \in s}\left(\frac{y_k}{\psi_k}-\widehat{Y}_{HH}\right)^{2} \end{equation}\] It is easily seen that the variance of the HH estimator is zero if all the values of the variable of interest are proportional to their size measure. Indeed, this is the motivation for such designs. However, in practice, this situation never holds.

The question as to whether the HH estimator or the HT estimator has the smaller variance is not straightforward and in fact depends on the particular configuration of the population values: how correlated are the variable of interest and the size measure, what are the coefficients of variation of the variable of interest and the size measure, etc.
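The Hansen-Hurwitz estimator and its estimated standard error take only a few lines to compute. In this sketch the function name is our own, the sample is the one drawn from the university classes with the cumulative method, $\{2,5,6,6\}$, and the $y$ values are hypothetical.

```python
import math

def hansen_hurwitz_total(y, psi):
    """HH estimate of the total and its estimated standard error.

    y and psi hold one entry per draw, so a unit selected twice appears twice.
    """
    n = len(y)
    z = [yk / pk for yk, pk in zip(y, psi)]      # each draw gives its own estimate of Y
    total = sum(z) / n
    var_hat = sum((zk - total) ** 2 for zk in z) / (n * (n - 1))
    return total, math.sqrt(var_hat)

X = 285
psi = [15 / X, 80 / X, 100 / X, 100 / X]         # draw probabilities for classes 2, 5, 6, 6

# Hypothetical y values that are not exactly proportional to class size
total, se = hansen_hurwitz_total([10, 70, 90, 90], psi)
print(f"estimate {total:.1f}, SE {se:.1f}")

# If y is exactly proportional to size, every draw gives the same estimate
total_prop, se_prop = hansen_hurwitz_total([12, 64, 80, 80], psi)
print(f"estimate {total_prop:.1f}, SE {se_prop:.1f}")
```

The second call illustrates the zero variance property: when $y_k/\psi_k$ is the same for every unit, the estimated standard error vanishes.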

7.4.2 Mean

The Hansen-Hurwitz estimator of the mean follows straightforwardly from that for the total: \[\begin{equation} \widehat{\bar{Y}}_{HH} = \frac{\widehat{Y}_{HH}}{N} = \frac{1}{nN} \sum_{k \in s} \frac{y_k}{\psi_k} \end{equation}\] This estimator is unbiased and has variance \[\begin{equation} \bfa{Var}{\widehat{\bar{Y}}_{HH}} = \frac{1}{nN^2}\sum_{i=1}^{N}\left(\frac{Y_{i}}{\psi_i}-Y\right)^{2}\psi_{i} \end{equation}\] which can be estimated from the sample by \[\begin{equation} \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{HH}} = \frac{1}{n(n-1)N^2} \sum_{k \in s}\left(\frac{y_k}{\psi_k}-\widehat{Y}_{HH}\right)^{2} \end{equation}\]

7.5 Example

We have a dataset of the 2020 population, GDP and military expenditure of $N=150$ countries.

Figure 7.1: 2020 Military Expenditure

The distribution is very strongly right skewed, with the USA dominating expenditure. The total value of all military expenditure in 2020 was Bn$2000.07.

If we take a SRSWOR of $n=30$ countries we can create an estimate of the total world military expenditure.

The data in the sample are listed in Table 7.3 and displayed in Figure 7.2.

Table 7.3: SRS of 30 countries
Country	Population	GDP (Bn$)	Military Expenditure (Bn$)
Angola	32866268	104.128681	0.9935944
Belarus	9379952	58.482353	0.8445129
Brazil	212559409	1749.104722	19.7363478
Canada	38005238	1600.331195	22.7548471
Central African Republic	4829764	2.001438	0.0413036
Congo, Dem. Rep.	89561404	45.259707	0.3620916
Gambia, The	2416664	1.672833	0.0148050
Ghana	31072945	62.724595	0.2398872
Guinea-Bissau	1967998	1.218760	0.0233067
Haiti	11402533	14.956795	0.0002638
Iraq	40222503	170.857728	7.0155588
Italy	59554023	1744.731952	28.9213428
Jamaica	2961161	13.440715	0.2444328
Korea, Rep.	51780579	1623.895081	45.7353926
Kosovo	1775378	7.144368	0.0789650
Liberia	5057677	3.115543	0.0169370
Malta	525285	12.886270	0.0806132
Moldova	2620495	8.517410	0.0445338
Montenegro	621306	4.046328	0.1020905
Morocco	36910558	105.726172	4.8309564
Mozambique	31255435	17.959222	0.1537419
Norway	5379475	403.779725	7.1125385
Paraguay	7132530	40.446809	0.3643422
Peru	32971846	190.979129	2.6331234
Romania	19286123	208.838847	5.7268442
Slovenia	2100126	48.124693	0.5748319
Somalia	15893219	5.151914	0.0983850
Tunisia	11818618	44.681504	1.1573724
Turkey	84339067	1015.326663	17.7246321
Zambia	18383956	23.418946	0.2121424

Figure 7.2: Military Expenditure of a SRSWOR of 30 countries

The mean expenditure in the sample is Bn$5.59 with standard deviation Bn$10.79. This leads to an estimate (computed from the unrounded sample mean) of \[ \widehat{Y}_{SRSWOR} = N\bar{y} = (150)(5.59) = 839.2 \] with standard error \[\begin{eqnarray*} \bfa{SE}{\widehat{Y}_{SRSWOR}} &=& \sqrt{N^2\left(1-\frac{n}{N}\right)\frac{s_y^2}{n}} \\ &=& \sqrt{(150)^2\left(1-\frac{30}{150}\right)\frac{(10.792)^2}{30}} \\ &=& 264.34 \end{eqnarray*}\] and RSE \[ \bfa{RSE}{\widehat{Y}_{SRSWOR}} = \frac{\bfa{SE}{\widehat{Y}_{SRSWOR}} }{\widehat{Y}_{SRSWOR}} = \frac{264.34}{839.2} = 0.315 \] This is clearly a worthless estimate: it is far below the true total of 2000.07, and the extreme skewness of the data has led to an RSE which is completely unacceptable.
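The SRSWOR calculation above can be reproduced directly from the expenditure column of Table 7.3. The following is a minimal sketch in Python using only the standard library; the variable names are our own choice and are not part of the notes.

```python
import math
import statistics

# Military expenditure (Bn$) of the 30 sampled countries, copied from Table 7.3
y = [0.9935944, 0.8445129, 19.7363478, 22.7548471, 0.0413036, 0.3620916,
     0.0148050, 0.2398872, 0.0233067, 0.0002638, 7.0155588, 28.9213428,
     0.2444328, 45.7353926, 0.0789650, 0.0169370, 0.0806132, 0.0445338,
     0.1020905, 4.8309564, 0.1537419, 7.1125385, 0.3643422, 2.6331234,
     5.7268442, 0.5748319, 0.0983850, 1.1573724, 17.7246321, 0.2121424]

N = 150                       # population size (number of countries on the frame)
n = len(y)                    # sample size
ybar = statistics.mean(y)     # sample mean
s2 = statistics.variance(y)   # sample variance s_y^2 (divisor n - 1)

total_hat = N * ybar                          # estimate of the population total
se = math.sqrt(N**2 * (1 - n / N) * s2 / n)   # standard error, including the fpc
rse = se / total_hat                          # relative standard error

print(f"total = {total_hat:.1f}, SE = {se:.2f}, RSE = {rse:.3f}")
```

The printed values should agree with the figures quoted above up to rounding.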

In a situation where we have a variable on the frame that is likely to correlate with the size of military expenditure, we can use PPSWR to do a lot better. Gross Domestic Product (GDP) does correlate strongly with military expenditure, as can be seen in Figure 7.3.

Figure 7.3: GDP and Military Expenditure

Note that we could use population size, but GDP is a better indicator of the size of the economy, which is what we would expect to correlate best with military expenditure.

If we select a sample of $n=30$ countries by PPSWR we get, for example, the units

8, 8, 11, 13, 31, 31, 31, 31, 31, 50, 50, 54, 65, 65, 73, 77, 98, 103, 133, 134, 147, 148, 148, 148, 148, 148, 148, 148, 148, 150

There are a few units which have been selected multiple times – most notably Unit 31 (China) and Unit 148 (USA), which respectively occur 5 and 8 times out of 30.

An estimate of the population total from this sample is \[\begin{eqnarray*} \widehat{Y}_{HH} &=& \frac{1}{n} \sum_{k \in s} \frac{y_k}{\psi_k} \\ &=& 2032.41 \end{eqnarray*}\] This is a very good estimate (recall that the true value is 2000.07).

The standard error is \[\begin{eqnarray*} \bfa{SE}{\widehat{Y}_{HH}} &=& \sqrt{\frac{1}{n(n-1)} \sum_{k \in s}\left(\frac{y_k}{\psi_k}-\widehat{Y}_{HH}\right)^{2}}\\ &=& 180.04 \end{eqnarray*}\] with RSE \[ \bfa{RSE}{\widehat{Y}_{HH}} = \frac{\bfa{SE}{\widehat{Y}_{HH}} }{\widehat{Y}_{HH}} = \frac{180.04}{2032.41} = 0.089 \] In other words PPSWR is a dramatic improvement over SRS.
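The Hansen-Hurwitz calculation can be sketched as a short Python function. The full frame of 150 countries is not reproduced in these notes, so the draws below are made-up toy values chosen purely for illustration; the function name `hansen_hurwitz` is also our own.

```python
import math

def hansen_hurwitz(y, psi):
    """HH estimate of a population total and its estimated standard error.

    y[k] and psi[k] are the value and single-draw selection probability
    of the k-th draw; a unit drawn several times appears several times.
    """
    n = len(y)
    z = [yk / pk for yk, pk in zip(y, psi)]      # expanded values y_k / psi_k
    total_hat = sum(z) / n                       # HH estimator of the total
    var_hat = sum((zk - total_hat) ** 2 for zk in z) / (n * (n - 1))
    return total_hat, math.sqrt(var_hat)

# Hypothetical toy sample of n = 4 draws (NOT the military expenditure data)
y_draws = [10.0, 10.0, 4.0, 1.0]
psi_draws = [0.5, 0.5, 0.25, 0.05]

total_hat, se = hansen_hurwitz(y_draws, psi_draws)
print(total_hat, se)   # expanded values are 20, 20, 16, 20
```

If $y_k$ were exactly proportional to $\psi_k$, every expanded value $y_k/\psi_k$ would be the same and the estimated standard error would be zero, which is why a size measure that tracks $Y$ closely works so well.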

7.6 Simple Random Sampling with Replacement

If we are carrying out a simple random sample with replacement (SRSWR), the probability a unit is selected on each draw is the same for every unit: \[ \psi_i= \frac{1}{N} \] So the HH estimator of the population total from a SRSWR of size $n$ is \[\begin{eqnarray*} \widehat{Y}_{HH} &=& \frac{1}{n}\sum_{k=1}^n \frac{y_k}{\psi_k}\\ &=& \frac{N}{n}\sum_{k=1}^n y_k\\ &=& N\bar{y} \end{eqnarray*}\] which is exactly the same as the HT estimator from SRSWOR. The weights are just the same: $w_k=N/n$.

However the variance estimator is different: \[\begin{eqnarray*} \bfa{\widehat{Var}}{\widehat{Y}_{HH}} &=& \frac{1}{n(n-1)} \sum_{k=1}^n \left(\frac{y_k}{\psi_k} - \widehat{Y}_{HH}\right)^2\\ &=& \frac{1}{n(n-1)} \sum_{k=1}^n \left(Ny_k - N\bar{y}\right)^2\\ &=& \frac{N^2}{n(n-1)} \sum_{k=1}^n \left(y_k - \bar{y}\right)^2\\ &=& N^2\frac{s_y^2}{n} \end{eqnarray*}\] which is almost the same as the HT variance estimator, except that it lacks the finite population correction (fpc): \[\begin{eqnarray*} \bfa{\widehat{Var}}{\widehat{Y}_{HT}} &=& N^2\left(1-\frac{n}{N}\right)\frac{s_y^2}{n} \end{eqnarray*}\] This means that even as $n\rightarrow N$ the variance of the estimator does not decrease to zero. That’s because in a with-replacement design, a sample of size $n=N$ is not guaranteed to be a census.

That makes the variance of the SRSWR estimator a bit larger than that of the SRSWOR estimator, so the SRSWR estimator is less efficient: for the same sample size $n$ we get a larger variance.
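To see the size of the effect, the two variance estimators can be compared numerically. This sketch reuses the summary figures from the example in Section 7.5 ($N=150$, $n=30$, $s_y=10.792$) and treats them as if the same sample variance had been observed under both designs, which is an assumption made only for the comparison.

```python
import math

N, n = 150, 30
s_y = 10.792   # sample standard deviation from the Section 7.5 example

var_wr = N**2 * s_y**2 / n       # SRSWR (HH) variance estimator: no fpc
var_wor = (1 - n / N) * var_wr   # SRSWOR (HT) variance estimator: with fpc

se_wr, se_wor = math.sqrt(var_wr), math.sqrt(var_wor)
print(f"SE without fpc = {se_wr:.2f}, SE with fpc = {se_wor:.2f}")
print(f"variance ratio = {var_wor / var_wr:.2f}")   # equals 1 - n/N
```

Here the sampling fraction is $n/N=0.2$, so ignoring the fpc inflates the variance by a factor of $1/0.8 = 1.25$.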

7.7 Remarks

The lack of efficiency due to the lack of the fpc is typical of with-replacement designs. However if $n$ is much smaller than $N$ the loss of efficiency isn’t important.

In PPSWR designs where the size measure $X_i$ is strongly correlated with the characteristic of interest $Y_i$ then (even without the fpc) PPSWR can have dramatic gains over SRSWOR.

Also note that if the fpc is ignored in any survey sampling analysis (and it often is) – then the analysis effectively treats the sampling as if it were with replacement. Moreover, if we are given data from a complex design – but are only given the survey weights (without enough information to calculate the joint selection probabilities) then we are forced into treating the sample as if it were drawn with replacement.

8 Stratified Sampling

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$ $\newcommand{\estm}{\widehat}$

This chapter introduces a useful technique called stratification, which is the process of splitting a finite population into subgroups and then taking independent samples from each of those subgroups. The sampling within strata may be a simple random sample, or another design such as cluster sampling. We will however concentrate on the case of simple random sampling as the within-stratum sampling scheme.

Stratification is an example of using auxiliary information about the population at the design stage. This information is used to put each unit into one of the strata.

There are several reasons for stratification. Some of these are:

Efficiency: we partition the sample space so that fewer extreme samples can be selected, or so that influential sampling units are isolated and selected with a high probability.
Using auxiliary information to improve the efficiency of the sample design by forming homogeneous groups with smaller coefficients of variation (CVs).
The need to form subpopulation estimates of sufficient accuracy.
Using auxiliary information to overcome nonsampling errors; e.g. regions may have differential response rates, so that it is useful to group these areas and try to use auxiliary information to improve the estimates.
Administrative reasons, e.g. recruitment and training of the field force might be carried out in certain centres and it would be natural for these to define strata.

We will briefly look at the first two of these. When we form strata for these reasons, we are trying to form subgroups of the population which are more homogeneous than the total population. One measure of homogeneity is the CV (coefficient of variation: $S_Y/\bar{Y}$) of the variable, and so we try to form subgroups whose CVs are much less than the population CV.

Example. $N=1000$ students enrol in a first year statistics course. We are given a list of their names, and for each person we are also told whether or not they have a degree: 100 do have a degree already and 900 do not. We are asked to estimate the mean age of students in the class, and we are only allowed to sample 20 students.

How should we proceed?

We could take a simple random sample of students, and calculate the mean age $\bar{y}$ of the sample, and use that as our estimate. However it’s very likely that the people who already have degrees are older, and moreover the spread of ages in that group is likely to be much wider than in the no degree group. It would therefore make sense to take two separate samples, one from each group, separately estimate the mean of each group, and then combine those estimates to form the overall mean estimate. This is called stratified sampling, and it can lead to estimates which are much more precise than those from simple random sampling.

8.1 Notation for stratified sampling

In stratified sampling we require prior information on every unit in the population (not just the sampled units). We use this prior auxiliary information to classify every population unit into one, and only one stratum. We’ll leave the method of deciding how to form the strata for later.

For the moment suppose that we have determined our strata, and there are $H$ of them. The $N$ population units are divided up with $N_h$ units in each stratum: \[\begin{equation} N = \sum_{h=1}^H N_h \end{equation}\] Every population unit belongs to one and only one stratum $h$. The proportion of the population in stratum $h$ is $F_h$ \[\begin{equation} F_h = \frac{N_h}{N} \qquad \text{and} \qquad \sum_{h=1}^H F_h = 1 \end{equation}\]

Example continued. Our class of $N=1000$ students can be split into $H=2$ strata – those without and those with degrees:

\(h\)	Stratum	Stratum Size	Stratum Proportion
1	No degree	\(N_1=900\)	\(F_1=\frac{N_1}{N}=\frac{900}{1000}=0.9\)
2	Has degree	\(N_2=100\)	\(F_2=\frac{N_2}{N}=\frac{100}{1000}=0.1\)
	Total	\(N=\sum_hN_h=1000\)	\(\sum_hF_h=1.0\)

We relabel each unit by its stratum $h$ and unit number $i$ within that stratum. So the within stratum total for stratum $h$ is the sum of all the $Y$ values for units in stratum $h$: \[\begin{equation} Y_h = \sum_{i=1}^{N_h} Y_{hi} \end{equation}\] Similarly we have the within stratum mean and variance: \[\begin{eqnarray} \bar{Y}_h &=& \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi} = \frac{Y_h}{N_h}\\ S_h^2 &=& \frac{1}{N_h-1} \sum_{i=1}^{N_h} (Y_{hi}-\bar{Y}_h)^2 \end{eqnarray}\] These formulae are identical to those we have had before, only we have the label $h$ to show that they are being calculated separately in each stratum.

Example continued. For our class of $N=1000$ students, the within stratum totals, means and variances of age in the $H=2$ strata are:

\(h\)	Stratum	Stratum Total	Stratum Mean	Stratum Variance
1	No degree	\(Y_1=18435.6\)	\(\bar{Y}_1=20.5\)	\(S_1^2=2.08\)
2	Has degree	\(Y_2=3161.6\)	\(\bar{Y}_2=31.6\)	\(S_2^2=50.03\)
	Total	\(Y=\sum_hY_h=21597.2\)

The overall population total is simply the sum of all the within stratum totals: \[\begin{equation} Y = \sum_{h=1}^H Y_h \tag{8.1} \end{equation}\] However we have to be careful when combining the within stratum results to form the overall population mean and variance. The population mean is given by \[\begin{eqnarray} \bar{Y} &=& \frac{Y}{N}\\ &=& \frac{\sum_{h=1}^{H} Y_h}{N}\\ \tag{8.2} &=& \frac{\sum_{h=1}^{H} N_h\bar{Y}_h}{N}\\ &=& \sum_{h=1}^{H} \frac{N_h}{N}\bar{Y}_h\\ &=& \sum_{h=1}^{H} F_h \bar{Y}_h \end{eqnarray}\] i.e. the population mean is a weighted sum of the stratum means $\bar{Y}_h$, with weights $F_h$.

Example continued. To find the mean age of the 1000 students, form the weighted sum: \[ \bar{Y} = \sum_{h=1}^{H} F_h \bar{Y}_h = (0.9)(20.5) + (0.1)(31.6) = 21.6 \]

The population variance is made up of two parts: \[\begin{eqnarray} S_Y^2 &=& \frac{1}{N-1}\sum_{h=1}^H\sum_{i=1}^{N_h} (Y_{hi}-\bar{Y})^2\\ &=& \frac{1}{N-1}\left( \sum_{h=1}^H\sum_{i=1}^{N_h} (Y_{hi}-\bar{Y}_h)^2 +\sum_{h=1}^H\sum_{i=1}^{N_h} (\bar{Y}_h-\bar{Y})^2 \right)\\ \tag{8.3} &=& \frac{1}{N-1}\left( \sum_{h=1}^H (N_h-1) S_h^2 +\sum_{h=1}^H N_h (\bar{Y}_h-\bar{Y})^2 \right)\\ &=& \text{within stratum variance} + \text{between stratum variance} \end{eqnarray}\]

Example continued. To find the variance of the ages of the 1000 students, form the two components:

\[\begin{eqnarray} \text{within stratum variance} &=& \frac{1}{N-1}\sum_{h=1}^H (N_h-1) S_h^2\\ &=& \frac{(899)(2.08)+(99)(50.03)}{999} = \frac{6823}{999} = 6.83\\ \text{between stratum variance} &=& \frac{1}{N-1}\sum_{h=1}^H N_h (\bar{Y}_h-\bar{Y})^2\\ &=& \frac{(900)(20.5-21.6)^2+(100)(31.6-21.6)^2}{999} = \frac{11089}{999} = 11.10\\ \text{total variance} &=& S_Y^2 = \frac{1}{N-1}\sum_{h=1}^H (N_h-1) S_h^2 + \frac{1}{N-1}\sum_{h=1}^H N_h (\bar{Y}_h-\bar{Y})^2\\ &=& 6.83 + 11.10 = 17.93 \end{eqnarray}\]

The total variance has been partitioned into two parts by the stratification – most of the total variance is between the strata, which is highly desirable.
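The weighted mean and the variance decomposition for the student example can be checked with a few lines of Python. This is a minimal sketch using the rounded stratum summaries from the tables above; the list names are our own.

```python
# Stratum summaries for the student example (1 = no degree, 2 = has degree)
N_h = [900, 100]       # stratum sizes
Ybar_h = [20.5, 31.6]  # stratum mean ages
S2_h = [2.08, 50.03]   # stratum variances

N = sum(N_h)
F_h = [Nh / N for Nh in N_h]   # stratum proportions

# Population mean as a weighted sum of stratum means (Equation 8.2)
Ybar = sum(F * m for F, m in zip(F_h, Ybar_h))

# Within and between stratum components of S_Y^2 (Equation 8.3)
within = sum((Nh - 1) * S2 for Nh, S2 in zip(N_h, S2_h)) / (N - 1)
between = sum(Nh * (m - Ybar) ** 2 for Nh, m in zip(N_h, Ybar_h)) / (N - 1)
S2_Y = within + between

print(f"mean = {Ybar:.2f}, within = {within:.2f}, "
      f"between = {between:.2f}, total = {S2_Y:.2f}")
```

Because the code uses the unrounded mean 21.61 rather than 21.6, the components may differ from the hand calculation in the last decimal place.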

The power of stratification lies in the separation of between and within stratum variance. All our sampling error comes from the within stratum variances $S_h^2$, so if we can push as much of $S_Y^2$ as possible into the between stratum variance, we won’t see that variance in our estimators and they will be much more accurate.

8.2 Estimation in Stratified Sampling

The key concept in stratified sampling is that we have divided the population into $H$ groups, and we take completely independent samples from each stratum: it’s as if we were running $H$ separate surveys.

This means that the sampling method can be different in each stratum: we could take a SRS in one stratum, a census in another, a cluster sample in another etc.

Then the natural estimator for the population total is \[\begin{equation} \estm{Y}_{ST} = \sum_{h=1}^{H} \estm{Y}_h \end{equation}\] where $\estm{Y}_h$ is an estimator (appropriate to the sampling scheme) for the total for stratum $h$. The variance of this estimator is easy to work out since the sampling is independent in each stratum. \[\begin{equation} \bfa{Var}{\estm{Y}_{ST}} = \sum_{h=1}^{H} \bfa{Var}{\estm{Y}_h} \end{equation}\] where $\bfa{Var}{\estm{Y}_h}$ is the variance of the estimator for the total for stratum $h$.

The natural estimator of the population mean is: \[\begin{eqnarray} \estm{\bar{Y}}_{ST} &=& \frac{\estm{Y}_{ST}}{N}\\ &=& \frac{1}{N} \sum_{h=1}^{H} \estm{Y}_h\\ &=& \frac{1}{N} \sum_{h=1}^{H} N_h\estm{\bar{Y}}_h\\ &=& \sum_{h=1}^{H} F_h\estm{\bar{Y}}_h \end{eqnarray}\] and its variance is \[\begin{equation} \bfa{Var}{\estm{\bar{Y}}_{ST}} = \sum_{h=1}^{H} F_h^2 \bfa{Var}{\estm{\bar{Y}}_h} \end{equation}\] This is a weighted sum of the variances of the stratum mean estimates, just as the mean was a weighted sum of the stratum means, but note that $F_h$ appears as $F_h^2$.

8.3 Stratified Simple Random Sampling

To take a specific example of a sampling scheme, suppose that we take a SRS of size $n_h$ from each stratum and these samples are independent. Then the within stratum estimates of totals, means and proportions are: \[\begin{eqnarray} \nonumber \estm{Y}_h &=& \frac{N_h}{n_h} \sum_{k=1}^{n_h} y_{hk} = N_h\bar{y}_h\\ \estm{\bar{Y}}_h &=& \frac{1}{n_h} \sum_{k=1}^{n_h} y_{hk} = \bar{y}_h\\ \nonumber \estm{p}_h &=& \frac{1}{n_h} \sum_{k=1}^{n_h} y_{hk} = \bar{y}_h \end{eqnarray}\] with variances \[\begin{eqnarray} \nonumber \bfa{Var}{\estm{Y}_h} &=& N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\\ \bfa{Var}{\estm{\bar{Y}}_h} &=& \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\\ \nonumber \bfa{Var}{\estm{p}_h} &=& \left(1-\frac{n_h}{N_h}\right)\frac{p_h(1-p_h)}{n_h-1} \end{eqnarray}\] which can be estimated from sample data by: \[\begin{eqnarray} \nonumber \bfa{\widehat{Var}}{\estm{Y}_h} &=& N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\\ \bfa{\widehat{Var}}{\estm{\bar{Y}}_h} &=& \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\\ \nonumber \bfa{\widehat{Var}}{\estm{p}_h} &=& \left(1-\frac{n_h}{N_h}\right)\frac{\estm{p}_h(1-\estm{p}_h)}{n_h-1} \end{eqnarray}\] These formulae are identical to the SRS formulae we have had earlier, with the single change that we have a subscript $h$ to indicate that the estimates and variances are specific to stratum $h$.

Example continued. Assume we have drawn a sample of size 10 from each of the two strata. The sample statistics are as given in the following table:

Stratum		Stratum Size	Stratum Fraction	Sample Size	Sampling Fraction	Sample Weight	Sample Mean Age	Sample Variance
\(h\)		\(N_h\)	\(F_h=\frac{N_h}{N}\)	\(n_h\)	\(f_h=\frac{n_h}{N_h}\)	\(w_{hk}\)	\(\bar{y}_h\)	\(s_h^2\)
1	No degree	900	0.9	10	0.0111	90	20.3	3.22
2	Has degree	100	0.1	10	0.1000	10	37.8	56.2
	Total	1000	1.0	20

Note that the sampling fractions and hence the sample weights are different in the two strata.

Using these data we can estimate the mean age in each stratum, and create a 95% confidence interval for each estimate:

Stratum		Estimated Mean Age	Variance of Estimate	Std. Error of Estimate	RSE	95% Conf. Int.
\(h\)		\(\estm{\bar{Y}}_h=\bar{y}_h\)	\(\bfa{Var}{\estm{\bar{Y}}_h}\)	\(\bfa{SE}{\estm{\bar{Y}}_h}\)	\(\bfa{RSE}{\estm{\bar{Y}}_h}\)
1	No degree	20.3	0.3181	0.56	0.028	(19.2, 21.4)
2	Has degree	37.8	5.0573	2.25	0.060	(33.4, 42.2)

Combined estimates for the population total, mean and proportion follow from Equations (8.1) and (8.2): \[\begin{eqnarray} \estm{Y}_{ST,SRS} &=& \sum_{h=1}^{H} \estm{Y}_h = \sum_{h=1}^{H} N_h\bar{y}_h\\ \estm{\bar{Y}}_{ST,SRS} &=& \sum_{h=1}^{H} F_h\estm{\bar{Y}}_h = \sum_{h=1}^{H} F_h\bar{y}_h\\ \estm{p}_{ST,SRS} &=& \sum_{h=1}^{H} F_h\estm{p}_h = \sum_{h=1}^{H} F_h\bar{y}_h \end{eqnarray}\] and these estimates have variances \[\begin{eqnarray} \bfa{Var}{\estm{Y}_{ST,SRS}} &=& \sum_{h=1}^{H} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\\ \bfa{Var}{\estm{\bar{Y}}_{ST,SRS}} &=& \sum_{h=1}^{H} F_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\\ \bfa{Var}{\estm{p}_{ST,SRS}} &=& \sum_{h=1}^{H} F_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{p_h(1-p_h)}{n_h-1} \end{eqnarray}\] which can be estimated from sample data by \[\begin{eqnarray} \bfa{\widehat{Var}}{\estm{Y}_{ST,SRS}} &=& \sum_{h=1}^{H} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\\ \bfa{\widehat{Var}}{\estm{\bar{Y}}_{ST,SRS}} &=& \sum_{h=1}^{H} F_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\\ \bfa{\widehat{Var}}{\estm{p}_{ST,SRS}} &=& \sum_{h=1}^{H} F_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{\estm{p}_h(1-\estm{p}_h)}{n_h-1} \end{eqnarray}\]

Example continued. An overall estimate of the mean age of the $N=1000$ students from the sample is given by \[ \estm{\bar{Y}} = \sum_{h=1}^H F_h\bar{y}_h = (0.9)(20.3) + (0.1)(37.8) = 22.1 \] with variance \[\begin{eqnarray*} \bfa{\widehat{Var}}{\estm{\bar{Y}}} &=& \sum_{h=1}^{H} F_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\\ &=& (0.9)^2\left(1-\frac{10}{900}\right)\frac{3.22}{10} +(0.1)^2\left(1-\frac{10}{100}\right)\frac{56.2}{10} = 0.3085 \end{eqnarray*}\] leading to the following 95% confidence interval for the mean student age in the whole class of 1000 students: \[ 22.1 \pm (1.96)\sqrt{0.3085} = 22.1\pm1.1 = (21.0,23.2) \]
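The combined estimate and its confidence interval can be reproduced as follows. This is a minimal Python sketch working from the rounded sample summaries in the table above; the variable names are our own.

```python
import math

# Sample summaries for the two strata (1 = no degree, 2 = has degree)
N_h = [900, 100]       # stratum sizes
n_h = [10, 10]         # stratum sample sizes
ybar_h = [20.3, 37.8]  # sample mean ages
s2_h = [3.22, 56.2]    # sample variances

N = sum(N_h)
F_h = [Nh / N for Nh in N_h]

# Stratified estimate of the mean: weighted sum of the stratum sample means
mean_st = sum(F * yb for F, yb in zip(F_h, ybar_h))

# Estimated variance: note the squared weights F_h^2 and the per-stratum fpc
var_st = sum(F**2 * (1 - n / Nh) * s2 / n
             for F, n, Nh, s2 in zip(F_h, n_h, N_h, s2_h))

se_st = math.sqrt(var_st)
ci = (mean_st - 1.96 * se_st, mean_st + 1.96 * se_st)
print(f"mean = {mean_st:.2f}, var = {var_st:.4f}, "
      f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

From these rounded inputs the mean comes out as 22.05, which the worked example reports as 22.1; the interval limits can therefore differ from those quoted above in the first decimal place.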

Example continued. Now assume that in the example above we had asked about car ownership, and found that 3 of the sampled students without degrees owned a car, whereas there were 8 car owners among those with degrees. What is the proportion of car owners in the whole population?

First compute the proportions in the two strata: \[\begin{eqnarray*} \widehat{p}_1 &=& \frac{3}{10} = 0.30\\ \widehat{p}_2 &=& \frac{8}{10} = 0.80 \end{eqnarray*}\] Then combine these estimates to form the full population estimate: \[ \widehat{p} = \sum_hF_h\widehat{p}_h = (0.9)(0.30) + (0.1)(0.80) = 0.35 \] with estimated variance \[\begin{eqnarray*} \bfa{\widehat{Var}}{\widehat{p}} &=& \sum_h F_h^2\left(1-\frac{n_h}{N_h}\right) \frac{\widehat{p}_h(1-\widehat{p}_h)}{n_h-1}\\ &=& (0.9)^2\left(1-\frac{10}{900}\right)\frac{(0.3)(0.7)}{9} +(0.1)^2\left(1-\frac{10}{100}\right)\frac{(0.8)(0.2)}{9} = 0.01885 \end{eqnarray*}\] leading to the following 95% confidence interval for the proportion of car owners in the whole class of 1000 students: \[ 0.35 \pm (1.96)\sqrt{0.01885} = 0.35\pm 0.27 = (0.08,0.62) \] Scaling these estimates by the population size $N=1000$ gives an estimate of the total number of car owners \[ \widehat{Y} = N\widehat{p} = (1000)(0.35) = 350 \] with 95% confidence interval \[ N\times(0.08,0.62) = 1000\times(0.08,0.62) = (80,620) \]
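The same pattern works for the proportion of car owners. The sketch below is a minimal Python check of the arithmetic above, with variable names of our own choosing.

```python
import math

N_h = [900, 100]   # stratum sizes (no degree, has degree)
n_h = [10, 10]     # stratum sample sizes
owners_h = [3, 8]  # car owners observed in each stratum sample

N = sum(N_h)
F_h = [Nh / N for Nh in N_h]
p_h = [a / n for a, n in zip(owners_h, n_h)]   # stratum sample proportions

# Stratified estimate of the population proportion
p_st = sum(F * p for F, p in zip(F_h, p_h))

# Estimated variance, with divisor n_h - 1 in each stratum
var_p = sum(F**2 * (1 - n / Nh) * p * (1 - p) / (n - 1)
            for F, n, Nh, p in zip(F_h, n_h, N_h, p_h))

moe = 1.96 * math.sqrt(var_p)   # 95% margin of error
total_owners = N * p_st         # estimated number of car owners
print(f"p = {p_st:.2f}, var = {var_p:.5f}, MOE = {moe:.2f}, "
      f"total = {total_owners:.0f}")
```

With only 10 units per stratum the interval is very wide, which is a reminder that small samples give imprecise proportion estimates.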

8.4 Comparison of sampling schemes: the Design Effect

When we evaluate a sampling scheme, our main concern is usually to see whether it results in better or worse estimates than those obtainable under other sampling schemes. By better we usually mean estimates with a smaller sampling error, although other factors such as cost and the need to form good subpopulation estimates may be important as well.

The standard comparison we make is to compare the variance of an estimator $\estm{T}_\text{complex}$ under the proposed complex sampling scheme, with the variance of the equivalent estimator under simple random sampling $\estm{T}_\text{SRS}$ with the same sample size. This comparison is made by forming the design effect – which is the ratio of the two variances: \[\begin{equation} \bfa{Deff}{\estm{T}_\text{complex}} = \frac{\bfa{Var}{\estm{T}_\text{complex}}}{\bfa{Var}{\estm{T}_\text{SRS}}} \tag{8.4} \end{equation}\] If the Deff is greater than 1, then the variance of $\estm{T}_{\rm complex}$ is greater than that of the SRS estimator $\estm{T}_{SRS}$. The estimator $\estm{T}_{\rm complex}$ is then said to be less efficient than $\estm{T}_{SRS}$. A desirable Deff is therefore less than one, indicating that the complex design is more efficient.

For example we can evaluate the Deff of the estimator of the mean $\estm{\bar{Y}}_{ST,SRS}$ from stratified SRS by calculating the Deff as follows: \[\begin{eqnarray*} \bfa{Deff}{\estm{\bar{Y}}_{ST,SRS}} &=& \frac{\bfa{Var}{\estm{\bar{Y}}_{ST,SRS}}}{\bfa{Var}{\estm{\bar{Y}}_{SRS}}}\\ &=& \frac{\sum_{h=1}^{H} F_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}}{ \left(1-\frac{n}{N}\right)\frac{S_Y^2}{n} } \end{eqnarray*}\]

Example continued. We want to know if the stratified estimator of the mean student age is an improvement over simple random sampling. For $n=20$ the SRS estimator of the mean has variance \[ \bfa{Var}{\estm{\bar{Y}}_{SRS}} = \left(1-\frac{n}{N}\right)\frac{S_Y^2}{n} = \left(1-\frac{20}{1000}\right)\frac{17.93}{20} = 0.8786 \] For a stratified sample of $n=20$ allocated with 10 units in each stratum \[\begin{eqnarray*} \bfa{Var}{\estm{\bar{Y}}_{ST,SRS}} &=& \sum_{h=1}^H F_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\\ &=& (0.9)^2\left(1-\frac{10}{900}\right)\frac{2.08}{10} + (0.1)^2\left(1-\frac{10}{100}\right)\frac{50.03}{10} = 0.2117 \end{eqnarray*}\] so the design effect is \[ \bfa{Deff}{\estm{\bar{Y}}_{ST,SRS}} = \frac{\bfa{Var}{\estm{\bar{Y}}_{ST,SRS}}}{\bfa{Var}{\estm{\bar{Y}}_{SRS}}} = \frac{0.2117}{0.8786} = 0.24 \] This is less than one, showing that the stratified estimator (for this allocation of a sample of size $n=20$) is more efficient.
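The design effect calculation for the student example can be sketched in Python as follows, using the (rounded) population quantities from Section 8.1. The variable names are our own.

```python
# Population quantities for the student example
N_h = [900, 100]      # stratum sizes
S2_h = [2.08, 50.03]  # true stratum variances
S2_Y = 17.93          # true overall population variance
n_h = [10, 10]        # allocation of the sample to the strata

N, n = sum(N_h), sum(n_h)
F_h = [Nh / N for Nh in N_h]

# Variance of the mean under SRSWOR with the same total sample size
var_srs = (1 - n / N) * S2_Y / n

# Variance of the mean under stratified SRSWOR with this allocation
var_st = sum(F**2 * (1 - nh / Nh) * S2 / nh
             for F, nh, Nh, S2 in zip(F_h, n_h, N_h, S2_h))

deff = var_st / var_srs
print(f"Var_SRS = {var_srs:.4f}, Var_ST = {var_st:.4f}, Deff = {deff:.2f}")
```

Changing `n_h` (while keeping its sum at 20) shows how the design effect depends on the allocation as well as on the stratification.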

8.4.1 Estimation of the Deff for Stratified SRSWOR

In general $S_Y^2$ and $S_h^2$, which are required in the formula for the Deff, are unknown and must be estimated using sample data. The within stratum variances $S_h^2$ are simply estimated by the corresponding sample variances $s_h^2$; however, the overall population variance is a little more complex. From Equation (8.3) we had \[ S_Y^2 = \frac{1}{N-1}\left( \sum_{h=1}^H (N_h-1) S_h^2 +\sum_{h=1}^H N_h (\bar{Y}_h-\bar{Y})^2\right) \] and, approximating both $(N_h-1)/(N-1)$ and $N_h/(N-1)$ by $F_h$ and replacing the population quantities by their sample estimates, it follows that an estimate of $S_Y^2$ is given by \[\begin{equation} \estm{S}_Y^2 = \sum_{h=1}^H F_h s_h^2 + \sum_{h=1}^H F_h(\bar{y}_h-\estm{\bar{Y}}_{ST,SRS})^2 \quad \text{where}\quad \estm{\bar{Y}}_{ST,SRS} = \sum_{h=1}^H F_h\bar{y}_h \end{equation}\] So to calculate the Deff of the estimator of the mean from sample data in stratified SRSWOR we calculate \[\begin{eqnarray*} \bfa{\widehat{Var}}{\estm{\bar{Y}}_{ST,SRS}} &=& \sum_{h=1}^H F_h^2\left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h}\\ \bfa{\widehat{Var}}{\estm{\bar{Y}}_{SRS}} &=& \left(1-\frac{n}{N}\right)\frac{1}{n} \left[ \sum_{h=1}^H F_hs_h^2 + \sum_{h=1}^H F_h (\bar{y}_h-\estm{\bar{Y}}_{ST,SRS})^2 \right] \end{eqnarray*}\] and then compute \[ \bfa{\widehat{Deff}}{\estm{\bar{Y}}_{ST,SRS}} = \frac{\bfa{\widehat{Var}}{\estm{\bar{Y}}_{ST,SRS}}}{ \bfa{\widehat{Var}}{\estm{\bar{Y}}_{SRS}}} \] Note: to get a good estimate of the variance of the estimator and its Design Effect we need sufficient sample size in each stratum to get reliable estimates of $\bar{y}_h$ and $s_h^2$. Small samples lead to very unreliable estimates of variances in general.

Example continued. We have already estimated $\bfa{\widehat{Var}}{\estm{\bar{Y}}_{ST,SRS}}=0.3085$ from the sample. It remains to estimate $\bfa{\widehat{Var}}{\estm{\bar{Y}}_{SRS}}$: \[\begin{eqnarray*} \bfa{\widehat{Var}}{\estm{\bar{Y}}_{SRS}} &=& \left(1-\frac{n}{N}\right)\frac{1}{n} \left[ \sum_{h=1}^H F_hs_h^2 + \sum_{h=1}^H F_h (\bar{y}_h-\estm{\bar{Y}}_{ST,SRS})^2 \right]\\ &=& \left(1-\frac{20}{1000}\right)\frac{1}{20} [ (0.9)(3.22)+(0.1)(56.2) \\ && \qquad\qquad +(0.9)(20.3-22.1)^2+(0.1)(37.8-22.1)^2 ]\\ &=& (0.049)[8.518+27.565] = 1.7681 \end{eqnarray*}\] leading to an estimate of the Deff of \[ \bfa{\widehat{Deff}}{\estm{\bar{Y}}_{ST,SRS}} = \frac{\bfa{\widehat{Var}}{\estm{\bar{Y}}_{ST,SRS}}}{ \bfa{\widehat{Var}}{\estm{\bar{Y}}_{SRS}}} = \frac{0.3085}{1.7681} = 0.17 \] which is similar to the true value of 0.24, and likewise indicates that the stratified design is much more efficient than SRSWOR.
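The estimated design effect can be computed from the sample summaries alone. The following Python sketch wraps the calculation in a function, `estimated_deff`, which is a name we have invented for illustration; it assumes stratified SRSWOR and uses the estimator of $S_Y^2$ given in this section.

```python
def estimated_deff(N_h, n_h, ybar_h, s2_h):
    """Estimated Deff of the stratified SRSWOR mean, from sample summaries."""
    N, n = sum(N_h), sum(n_h)
    F_h = [Nh / N for Nh in N_h]
    mean_st = sum(F * yb for F, yb in zip(F_h, ybar_h))

    # Estimated variance of the stratified estimator of the mean
    var_st = sum(F**2 * (1 - nh / Nh) * s2 / nh
                 for F, nh, Nh, s2 in zip(F_h, n_h, N_h, s2_h))

    # Estimate of S_Y^2: within stratum part plus between stratum part
    s2_y = (sum(F * s2 for F, s2 in zip(F_h, s2_h))
            + sum(F * (yb - mean_st) ** 2 for F, yb in zip(F_h, ybar_h)))

    # Estimated variance of the SRSWOR mean for the same total sample size
    var_srs = (1 - n / N) * s2_y / n
    return var_st / var_srs, var_st, var_srs

deff_hat, var_st_hat, var_srs_hat = estimated_deff(
    N_h=[900, 100], n_h=[10, 10], ybar_h=[20.3, 37.8], s2_h=[3.22, 56.2])
print(f"Deff = {deff_hat:.2f}")
```

The result agrees with the hand calculation up to rounding of the stratified mean (22.05 here, 22.1 in the worked example).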

8.4.2 Interpretation of the Design Effect

The Deff can be used in two important ways:

For a specified accuracy, the Design Effect tells us by what factor our sample size is reduced (or increased) by the use of a complex design. \[ n_{\rm complex} = {\bf\rm Deff}\times n_{\rm SRS} \] Example. For a variable with a Design Effect of 0.2 what size sample is required to achieve the same accuracy as a SRSWOR with $n_{SRS}=500$? \[ n_{\rm complex} = 0.2\times500 = 100 \] Only a sample of size 100.
For a specified sample size, the design effect tells us by what factor our margins of error are reduced (or increased) by use of a complex design. \[ {\bf\rm MOE}_{\rm complex} = \sqrt{{\bf\rm Deff}}\times {\bf\rm MOE}_{\rm SRS} \] Example. In a SRS we achieve a margin of error of $\pm25$ for a sample size of 1000. For the same sample size in a complex design where the design effect is 0.2, what will the margin of error be? \[ {\bf\rm MOE}_{\rm complex} = \sqrt{0.2}\times 25 = 11.2 \] Only a margin of error of $\pm11$.

If the Deff is less than 1, the complex design requires a smaller sample size for the same accuracy, OR achieves lower margin of error for the same sample size. i.e. the complex design is better. If the Deff is greater than 1, the SRS is better than the complex design.
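Both uses of the design effect amount to one line of arithmetic each. The helper functions below are a minimal Python sketch; their names are our own and they simply encode the two formulas above.

```python
import math

def complex_sample_size(deff, n_srs):
    """Sample size needed under the complex design for the same accuracy."""
    return deff * n_srs

def complex_moe(deff, moe_srs):
    """Margin of error under the complex design for the same sample size."""
    return math.sqrt(deff) * moe_srs

print(complex_sample_size(0.2, 500))   # sample size example: Deff 0.2, n_SRS 500
print(round(complex_moe(0.2, 25), 1))  # margin of error example: Deff 0.2, MOE 25
```

Note that the sample size scales with the Deff itself, whereas the margin of error scales with its square root.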

8.4.3 Details

In general the Design Effect is the ratio of the variance of an estimator under some complex design (such as stratified sampling), to the variance of an estimator under SRSWOR, with the same sample size: \[\begin{equation} \bfa{Deff}{\estm{T}_{\rm complex}} = \frac{\bfa{Var}{\estm{T}_{\rm complex}}}{\bfa{Var}{\estm{T}_{SRS}}} \end{equation}\]

For SRSWOR the variance of the estimator of the population mean is \[\begin{equation} \bfa{Var}{\estm{\bar{Y}}_{\rm SRS}} = \left(1-\frac{n}{N}\right)\frac{S_Y^2}{n} \end{equation}\] If the sampling fraction $f=n/N$ is small enough this becomes \[\begin{equation} \bfa{Var}{\estm{\bar{Y}}_{\rm SRS}} \simeq \frac{S_Y^2}{n} \end{equation}\] Now consider the ratio of variances under SRSWOR for two sample sizes $n_1$ and $n_2$: \[\begin{equation} \frac{\bfa{Var}{\estm{\bar{Y}}_{\rm SRS}}_{n_2}}{ \bfa{Var}{\estm{\bar{Y}}_{\rm SRS}}_{n_1}} \simeq \frac{n_1}{n_2} \end{equation}\] we can therefore write the Design Effect for samples of size $n_1$ \[\begin{equation} \bfa{Deff}{\estm{\bar{Y}}_{\rm complex}}_{n_1} = \frac{\bfa{Var}{\estm{\bar{Y}}_{\rm complex}}_{n_1}}{ \bfa{Var}{\estm{\bar{Y}}_{SRS}}_{n_1}} = \frac{\bfa{Var}{\estm{\bar{Y}}_{\rm complex}}_{n_1}}{ \bfa{Var}{\estm{\bar{Y}}_{SRS}}_{n_2}} \times\frac{n_1}{n_2} \end{equation}\]

Now assume we have a sample of size $n_1$ in the complex design, and a sample of size $n_2$ in a SRSWOR, and we have chosen $n_2$ so that the variances of the two estimators are equal: \[ \bfa{Var}{\estm{\bar{Y}}_{\rm complex}}_{n_1} = \bfa{Var}{\estm{\bar{Y}}_{SRS}}_{n_2} \] then the Deff becomes simply \[\begin{equation} \bfa{Deff}{\estm{\bar{Y}}_{\rm complex}}_{n_1} = \frac{n_1}{n_2} \end{equation}\] Thus if it takes a sample of size $n_{SRS}$ to achieve a certain accuracy under SRSWOR, then it will take a sample of size \[\begin{equation} n_{\rm complex} = n_{\rm SRS}\times \bfa{Deff}{\estm{\bar{Y}}_{\rm complex}} \end{equation}\] to achieve the same precision under the complex design.

Example. For a variable with a Design Effect of 0.2 what size sample is required to achieve the same accuracy as a SRSWOR with $n_{SRS}=500$? \[ n_{\rm complex} = 0.2\times500 = 100 \] Only a sample of size 100.

The Design Effect thus tells us by what factor our sample size is reduced (or increased) by the use of the complex design.

We can also use the design effect to see the effect of a design on confidence intervals. The SRS 95% confidence interval is \[ \widehat{T}_{\rm SRS} \pm 1.96\times \bfa{SE}{\widehat{T}_{\rm SRS}} \] Using the Deff the variance and standard error of $\estm{T}_{\rm complex}$ are \[ \begin{split} \bfa{Var}{\widehat{T}_{\rm complex}} &= \bfa{Deff}{\widehat{T}_{\rm complex}} \bfa{Var}{\widehat{T}_{\rm SRS}}\\ \bfa{SE}{\widehat{T}_{\rm complex}} &= \sqrt{\bfa{Deff}{\widehat{T}_{\rm complex}}} \bfa{SE}{\widehat{T}_{\rm SRS}}\\ \end{split} \] then under the complex design the equivalent 95% confidence interval \[ \widehat{T}_{\rm complex} \pm 1.96\times \bfa{SE}{\widehat{T}_{\rm complex}} \] can be written \[ \widehat{T}_{\rm complex} \pm 1.96\times \sqrt{\bfa{Deff}{\widehat{T}_{\rm complex}}}\bfa{SE}{\widehat{T}_{\rm SRS}} \]

8.5 Sample Weights in Stratified SRSWOR

The weight of sample member $k$ in stratum $h$ of a stratified simple random sample is \[ w_{hk} = \frac{N_h}{n_h} \] Estimates of totals in stratified SRSWOR are formed just as they are in SRS, but the weights can differ between sample members due to the differing sampling fractions in each of the strata.

The goal of stratification is to put units which are similar to each other, but different from the rest of the population, all together in a single stratum.

If a stratum contains a lot of highly unusual and influential units then we sample from that stratum with high probability, and consequently give each sample member from that stratum a low weight. A good example of this is in business surveys where a small number of big companies can dominate estimates of total revenues in certain sectors of the economy. Such companies are usually grouped together into a single stratum, and a census is taken in that stratum.

8.6 Steps in Stratified Sampling

Given a survey population of size $N$ form a stratified sample of size $n$ by the following steps.

Identify the stratification variable(s), ${\bf X}_i$.

${\bf X}_i$ must be known for every unit on the frame.
Form the strata.
Decide how many there will be ($H$), and which values of ${\bf X}$ will belong to which stratum.

Each unit $i$ is then assigned to one and only one of the strata on the basis of its ${\bf X}_i$ value. There are $N_h$ units from the population in stratum $h$, so that \[ N = \sum_{h=1}^H N_h \] The stratum fractions are \[ F_h = \frac{N_h}{N} \] and these sum up to 1: \[ \sum_{h=1}^H F_h = \sum_{h=1}^H \frac{N_h}{N} = \frac{1}{N} \sum_{h=1}^H N_h = 1 \]
Allocate the sample $n$ to the strata. i.e. decide how many units $n_h$ will be sampled from stratum $h$ (for $h=1,\ldots,H$).

These stratum sample sizes add to the total sample size: \[ n = \sum_{h=1}^H n_h \] The fractions of the sample allocated to each stratum are \[ p_h = \frac{n_h}{n} \] and these sum up to 1: \[ \sum_{h=1}^H p_h = \sum_{h=1}^H \frac{n_h}{n} = \frac{1}{n} \sum_{h=1}^H n_h = 1 \]
Draw the sample. Draw $n_h$ units from the $N_h$ in each stratum $h$, according to the chosen sampling scheme.
Calculate estimates and their variances. Make estimates both within strata as well as combined estimates for the whole population, together with their variances.
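The drawing step can be sketched in Python as follows, assuming SRSWOR within every stratum. The function name `stratified_srswor`, the dictionary layout of the frame and the unit labels are all hypothetical choices made for this illustration.

```python
import random

def stratified_srswor(frame, allocation, seed=None):
    """Draw an independent SRSWOR of size n_h from each stratum.

    frame:      dict mapping stratum label -> list of unit identifiers
    allocation: dict mapping stratum label -> stratum sample size n_h
    Returns a dict mapping stratum label -> list of (unit, weight) pairs.
    """
    rng = random.Random(seed)
    sample = {}
    for h, units in frame.items():
        n_h = allocation[h]
        drawn = rng.sample(units, n_h)   # sampling without replacement
        weight = len(units) / n_h        # w_hk = N_h / n_h
        sample[h] = [(u, weight) for u in drawn]
    return sample

# Hypothetical frame for the student example: units are labelled 0..999
frame = {"No degree": list(range(900)),
         "Has degree": list(range(900, 1000))}
sample = stratified_srswor(frame, {"No degree": 10, "Has degree": 10}, seed=1)

for h, members in sample.items():
    print(h, [u for u, _ in members], "weight =", members[0][1])
```

Within each stratum the weights are equal and sum to $N_h$, so across the whole sample the weights sum to the population size $N$.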

8.7 Formation of Strata

We have some auxiliary variable $X_i$ which we believe is correlated with the variable of interest $Y_i$. We have a (measured or estimated) value of $X_i$ for every unit in the population: this may have come from some previous survey or census.

How can we use this information to form strata?

We might use $X_i$ as a way of identifying those very few population units which need special treatment, and after putting them into their own full-coverage stratum we might decide that no further stratification is necessary.
Sometimes the strata are defined via ‘natural’ subpopulations or geographic areas – in which case $X_i$ is the region the unit belongs to.
We may use the stratification to produce strata which are as homogeneous as possible, with the aim of having improved efficiency of the final estimates.

One way of producing homogeneous strata is the cumulative $\sqrt{f}$ rule: form the stratum boundaries so that the intervals are equal on the cumulative $\sqrt{f}$ scale. This requires carrying out the following steps:

Get the frequency distribution $f(X)$ (histogram) for the auxiliary variable $X$. (The number of histogram bins $M$ should be much greater than the number of strata $H$ required.)
take the square root of the frequency in each bin: call this $\sqrt{f_{j}(X)}$
form cumulative $\sqrt{f}$ for each bin $j$ i.e. $\sum_{\ell=1}^{j}\sqrt{f_{\ell}(X)}$
split the total cumulative $\sqrt{f}$ into $H$ equal intervals:

i.e. divide the total cumulative $\sqrt{f}$ by $H$, call this $I=\frac{1}{H}\sum_{\ell=1}^{M}\sqrt{f_\ell(X)}$ and consider the $H-1$ numbers $I,\; 2\times I, \; \ldots, \; \left(H-1\right) \times I$

For each of these numbers $h\times I$, look for the histogram bin whose cumulative $\sqrt{f}$ is closest to $h\times I$; the stratum boundary is then the right hand end of that histogram bin. (Clearly the smallest and largest possible values of $X$ are also boundaries.)

This method is an approximation which can be done by hand calculation using, say, a published table showing the frequency distribution in certain intervals.

An example of applying the cumulative $\sqrt{f}$ rule

Suppose you wish to estimate the number of days injured people are off work by taking a sample of records: this is our variable of interest $Y$. Assume that you have information on the number of ACC claims that employers send in each week: this is your auxiliary information $X$ which you have for every employer – and we have good reason to expect that $X$ and $Y$ are correlated, which means that $X$ will be a good stratification variable.

Suppose that you wish to form four strata. For each of 5000 companies we know $X$, the number of claims that company makes per week, on average. Table 8.1 sets out the data needed to calculate the stratum boundaries.

(Note: These average numbers of claims are decimal values, and $(5,6]$ means the interval 5 to 6 excluding 5 but including 6.)

Table 8.1: $X$: Number of ACC claims per week
$j$	No. of ACC claims per week	$f_j$	$\sqrt{f_j}$	$\sum_{\ell=1}^j\sqrt{f_\ell}$
1	(0-1]	459	21.4	21.4
2	(1-2]	841	29.0	50.4
3	(2-3]	931	30.5	80.9
4	(3-4]	783	28.0	108.9
5	(4-5]	575	24.0	132.9
6	(5-6]	419	20.5	153.4
7	(6-7]	291	17.1	170.4
8	(7-8]	222	14.9	185.3
9	(8-9]	159	12.6	197.9
10	(9-10]	100	10.0	207.9
11	(10-11]	73	8.5	216.5
12	(11-12]	58	7.6	224.1
13	(12-13]	34	5.8	229.9
14	(13-14]	20	4.5	234.4
15	(14-15]	10	3.2	237.6
16	(15-16]	6	2.4	240.0
17	(16-17]	6	2.4	242.5
18	(17-18]	3	1.7	244.2
19	(18-19]	2	1.4	245.6
20	(19-20]	3	1.7	247.3
21	(20-21]	2	1.4	248.8
22	(21-22]	1	1.0	249.8
23	(22-23]	1	1.0	250.8
24	(23-24]	1	1.0	251.8

A histogram of the auxiliary data $X$ is:

Figure 8.1: ACC Claims Distribution

Since $\sum_{j=1}^{M}\sqrt{f_{j}(X)}= 251.8 \approx 252$ and we want $H=4$ strata, the interval boundaries on the cumulative $\sqrt{f}$ scale are $252/4, 2 \times 252/4, 3 \times 252/4$, i.e. $63, 126, 189$. The actual cumulative $\sqrt{f}$ numbers nearest these are $50.4, 132.9, 185.3$, so that the interval boundaries on the $X$ scale are $2, 5, 8$: i.e. the right hand ends of the bins for 1-2 claims a week, 4-5 claims a week, and 7-8 claims a week. Therefore the four strata we form are $(0,2], (2,5], (5,8], (8,24]$.
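The calculation can be checked with a short script. The sketch below uses the frequencies from Table 8.1. The function name `cum_sqrt_f_boundaries` is made up for this illustration, and the script assumes the bins are the unit-width intervals $(j-1, j]$ of the table.

```python
from math import sqrt

# Frequencies f_j from Table 8.1; bin j covers (j-1, j] claims per week
f = [459, 841, 931, 783, 575, 419, 291, 222, 159, 100, 73, 58,
     34, 20, 10, 6, 6, 3, 2, 3, 2, 1, 1, 1]
upper = list(range(1, len(f) + 1))          # right hand end of each bin

def cum_sqrt_f_boundaries(f, upper, H):
    """Return the cumulative sqrt(f) values and the H-1 interior stratum boundaries."""
    cum, total = [], 0.0
    for fj in f:
        total += sqrt(fj)
        cum.append(total)
    step = total / H                        # the interval I on the cumulative sqrt(f) scale
    boundaries = []
    for h in range(1, H):
        target = h * step
        # bin whose cumulative sqrt(f) is closest to h * I
        j = min(range(len(cum)), key=lambda i: abs(cum[i] - target))
        boundaries.append(upper[j])
    return cum, boundaries

cum, boundaries = cum_sqrt_f_boundaries(f, upper, H=4)

# Stratum sizes N_h implied by the boundaries
edges = [0] + boundaries + [upper[-1]]
sizes = [sum(fj for fj, u in zip(f, upper) if lo < u <= hi)
         for lo, hi in zip(edges, edges[1:])]
print(boundaries, sizes)
```

The script works with the unrounded total (about 251.8) rather than 252, which makes no difference to the boundaries chosen here.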

Where we have access to detailed information from a previous census, and given modern computing power, we can instead treat stratum formation as an optimization problem: minimize the variance of the stratified estimator subject to constraints such as stratum size. Also, in practice, when we survey we don’t collect just one variable. Different variables are very likely to call for different stratifications, which is not practicable. Hence we need to find compromise stratifications, and this is only practical using computers.

Alternatively, we can consider applying multivariate classification methods to form homogeneous groups within the population which will become our strata, or building blocks for them. Such methods appeal to infinite population models and hence may seem subject to criticisms about the reasonableness of such models. However, having formed the strata, under classical finite population sampling the inferential framework comes from the randomization of the independent samples and not any models used to form the strata. So our inference is still somewhat assumption free.

Moreover, whatever statistical methods we use for forming strata, our aim is to form strata which are robust to changes in the population. Hence we shouldn’t blindly optimize our design on historic data.

8.8 Allocation

Once the strata have been defined we have a total sample size $n$, which we want to allocate to each stratum in proportions $p_h$, so that the proportion of the sample allocated to stratum $h$ is \[\begin{equation} p_h = \frac{n_h}{n} \end{equation}\] and \[\begin{equation} \sum_{h=1}^H p_h=1 \end{equation}\] If we know the allocation proportions $p_h$ and the sample size $n$ then the number allocated to stratum $h$ is the nearest integer to: \[ n_h = p_h n \]

There are several methods for allocating the sample. Some of these are:

Equal Allocation \[ p_h = \frac{1}{H}\ \ \ \text{i.e.}\ n_h = \frac{n}{H} \] Put an equal number of units into each stratum, irrespective of the stratum properties.
Proportional to population size \[ p_h = \frac{N_h}{N} = F_h\ \ \ \text{i.e.}\ n_h = \frac{N_h}{N}\times n \] In large samples this is what you would expect if you took an SRS of the total population and then formed the strata. So this does not produce many gains in efficiency.

This is also called a self weighting design because the weights are the same for all sample members, no matter which stratum: \[ w_h = \frac{N_h}{n_h} = N_h\frac{N}{N_hn} = \frac{N}{n} \] which are the same weights we’d get if we took a SRSWOR. (However the design still needs to be analysed using the Stratified SRS formulae.)

Neyman Allocation \[ p_h = \frac{N_hS_h}{\sum_h N_hS_h} = \frac{F_hS_h}{\sum_h F_hS_h} \] Neyman allocation gives the lowest possible variance at a fixed sample size.

Here we are allocating more of the sample to the strata with the greatest variance, but also accounting for the size of the strata and hence how much each contributes to the overall estimate: recall \[ \bfa{Var}{\estm{\bar{Y}}_{ST,SRS}} = \sum_{h=1}^{H} F_h^{2}\left(1 - \frac{n_h}{N_h}\right)\frac{S_h^{2}}{n_h} \] where $F_h=N_h/N$.

If all the strata had the same or very nearly the same variances then Neyman allocation would be the same as proportional allocation.
Optimal Allocation \[ p_h = \frac{N_hS_h/\sqrt{c_h}}{\sum_h \left(N_hS_h/\sqrt{c_h}\right)} = \frac{F_hS_h/\sqrt{c_h}}{\sum_h \left(F_hS_h/\sqrt{c_h}\right)} \] where the cost of surveying is the sum of a base cost for the whole survey, and a varying cost per unit in each stratum ($c_h$) i.e. \[ \mbox{cost} = c_{0} + \sum_hc_hn_h \] Optimal allocation gives the lowest possible variance at a fixed survey cost.

Note Neyman allocation is the case where the cost is equal in each stratum.
Allocation for subpopulation estimates: make each $n_h$ big enough that you can form accurate stratum estimates, where the strata are subpopulations of interest.

Typically Neyman allocation leads to very accurate population estimates but poor subpopulation estimates. One way of achieving a good compromise between population and subpopulation estimates is due to Bankier, which he calls Power Allocation. Neyman allocation is a special case of this.

8.8.1 Allocation example

In a survey of students, 100 students are to be allocated across two strata: undergraduate and postgraduate. It costs twice as much to survey a postgraduate as an undergraduate, and the standard deviation of age (a key design variable) is three times higher amongst postgraduates than among undergraduates. 20% of the student body are postgraduates.

Allocate these 100 students across the two strata using each of the following methods:

Equal Allocation
Proportional Allocation
Neyman Allocation
Optimal Allocation

There are $H=2$ strata. The information we have is

$h$	Stratum	Stratum Fraction $F_h$	Cost, $c_h$	Std. Dev. $S_h$
1	Undergraduate	0.80	$C$	$S$
2	Postgraduate	0.20	$2C$	$3S$

Note that for these calculations we don’t actually have to know the value of the undergraduate cost $C$, or the standard deviation of undergraduate age $S$: just their relative sizes.

We have $n=100$ students to allocate.

Equal Allocation \[ p_h = \frac{1}{H} = \frac{1}{2} \]

$h$	Stratum	$p_h$	$n_h=np_h$	Cost, $n_hc_h$
1	Undergraduate	0.50	50	$50\times C=50C$
2	Postgraduate	0.50	50	$50\times 2C=100C$
	Total	1.00	100	$150C$

Proportional Allocation \[ p_h = F_h \]

$h$	Stratum	$p_h$	$n_h=np_h$	Cost, $n_hc_h$
1	Undergraduate	0.80	80	$80\times C=80C$
2	Postgraduate	0.20	20	$20\times 2C=40C$
	Total	1.00	100	$120C$

Neyman Allocation \[ p_h = \frac{N_hS_h}{\sum_h N_hS_h} = \frac{F_hS_h}{\sum_h F_hS_h} \]

$h$	Stratum	$F_hS_h$	$p_h=F_hS_h/\sum_k F_kS_k$	$n_h=np_h$	Cost, $n_hc_h$
1	Undergraduate	$(0.8)(S) = 0.8S$	$0.8S/1.4S = 0.57$	57	$57\times C=57C$
2	Postgraduate	$(0.2)(3S) = 0.6S$	$0.6S/1.4S = 0.43$	43	$43\times 2C=86C$
	Total	$1.4S$	1.00	100	$143C$

Optimal Allocation \[ p_h = \frac{N_hS_h/\sqrt{c_h}}{\sum_h N_hS_h/\sqrt{c_h}} = \frac{F_hS_h/\sqrt{c_h}}{\sum_h F_hS_h/\sqrt{c_h}} \]

$h$	Stratum	$F_hS_h/\sqrt{c_h}$	$p_h \propto F_hS_h/\sqrt{c_h}$	$n_h=np_h$	Cost, $n_hc_h$
1	Undergraduate	$(0.8)(S)/\sqrt{C} = 0.800S/\sqrt{C}$	$0.800/1.224 = 0.65$	65	$65\times C=65C$
2	Postgraduate	$(0.2)(3S)/\sqrt{2C} = 0.424S/\sqrt{C}$	$0.424/1.224 = 0.35$	35	$35\times 2C=70C$
	Total	$1.224S/\sqrt{C}$	1.00	100	$135C$

Total costs are \[ \text{Total Cost} = n_1c_1 + n_2c_2 = n_1C + n_22C = (n_1+2n_2)C \]

Relative costs, compared to Equal allocation:

Method	Allocation	Total Cost	Relative Cost
Equal	(50,50)	$150C$	1.00
Proportional	(80,20)	$120C$	0.80
Neyman	(57,43)	$143C$	0.95
Optimal	(65,35)	$135C$	0.90

At a fixed sample size, Neyman allocation always gives the lowest variance. Here it is also 5% cheaper than equal allocation. Proportional allocation is the cheapest of the four, but not the most efficient.
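The four allocation rules can all be written as $p_h \propto F_hS_h/\sqrt{c_h}$ with suitable choices of $F_h$, $S_h$ and $c_h$, which makes them easy to compute together. The following is a minimal sketch; the function name `allocate` is invented here, and costs and standard deviations are entered in units of $C$ and $S$ since only their relative sizes matter.

```python
from math import sqrt

def allocate(n, F, S=None, c=None):
    """Allocate n units with p_h proportional to F_h * S_h / sqrt(c_h).

    Omitting S (or c) treats the standard deviations (or costs) as equal.
    Results are rounded to the nearest integer, so they may not sum to n exactly.
    """
    H = len(F)
    S = S or [1.0] * H
    c = c or [1.0] * H
    score = [Fh * Sh / sqrt(ch) for Fh, Sh, ch in zip(F, S, c)]
    total = sum(score)
    return [round(n * s / total) for s in score]

n = 100
F = [0.8, 0.2]        # undergraduate, postgraduate stratum fractions
S = [1.0, 3.0]        # standard deviations in units of S
c = [1.0, 2.0]        # unit costs in units of C

allocations = {
    "Equal":        allocate(n, [0.5, 0.5]),
    "Proportional": allocate(n, F),
    "Neyman":       allocate(n, F, S),
    "Optimal":      allocate(n, F, S, c),
}
costs = {m: sum(nh * ch for nh, ch in zip(a, c)) for m, a in allocations.items()}
for m in allocations:
    print(m, allocations[m], costs[m], round(costs[m] / costs["Equal"], 2))
```

At full precision the optimal allocation is (65, 35); rounding the intermediate ratios to two decimals before dividing would instead give (66, 34), so it is worth carrying extra digits.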

8.8.1.1 An example of allocation rules

Table 8.2 compares Equal, Proportional and Neyman allocation for the ACC data and the four strata chosen by the cumulative $\sqrt{f}$ rule.

Table 8.2: Number of ACC claims per week: allocation rules
Stratum	Size	Std. Dev	Fraction		Equal	Proportional	Neyman
$h$	$N_h$	$S_h$	$F_h$	$N_hS_h$	$p_h=\frac{1}{H}$	$p_h=F_h$	$p_h=\frac{N_hS_h}{\sum_hN_hS_h}$
1	1300	0.496	0.26	644.8	0.25	0.26	0.14
2	2289	0.853	0.458	1952.5	0.25	0.46	0.43
3	932	0.852	0.186	794.1	0.25	0.19	0.17
4	479	2.441	0.096	1169.2	0.25	0.10	0.26
Total	5000		1.000	4560.6	1.00	1.00	1.00

8.8.1.2 Notes

Because of rounding error it is possible that the number of units allocated to the strata is less than the desired overall sample size. In this case it is probably best to add the extra unit to the stratum with the highest variance.
With Neyman allocation and optimal allocation it is possible that the allocated sample size for a stratum is greater than the population size for that stratum. If that happens then:
1. make that stratum a full coverage stratum i.e. select all population units with probability 1. Suppose there are $N_{1}$ of them.
2. remove this full stratum from the population so the population size is now $N-N_{1}$ and re-do the Neyman allocation or optimal allocation on the remaining strata so the sample size to be allocated is now $n-N_{1}$.
If the allocation results in a stratum having fewer than 20 to 30 units it is probably best to allocate more to this stratum by deleting a few units from the allocations in the other strata. Although this means that the allocation is no longer the best in terms of a variance criterion, the overall sample is likely to be more robust and the estimates of stratum variances (which are necessary to calculate the variances of the estimators) more reliable.
In practice forming strata and allocating the sample is approximate because we can only use data for some variable which we think is highly correlated with the variable of interest. Although it might seem reasonable to use old data for the variable of interest to form strata, this may lead to strata that are not robust, and to an over-optimistic view of the likely accuracy of your estimates.
In practice when we survey we don’t collect just one variable. This means that it is very likely that different variables will need different allocations of the overall sample size. This means we need to give our variables a priority ordering or find compromise allocations. Finding compromise allocations is again an optimizing problem only practically solvable with computers.
We can show the following relationship for the variance of the estimator of the mean (ignoring terms of $O\left(1/N\right)$): \[ \bfa{Var}{\estm{\bar{Y}}_{SRS}} = \overbrace{\bfa{Var}{\estm{\bar{Y}}_{ST,SRS,OPT}} + \frac{1}{n}\sum_hF_h\left(S_h-\bar{S}\right)^{2}}^{ \bfa{Var}{\estm{\bar{Y}}_{ST,SRS,PROP}}} + \frac{1-f}{n}\sum_hF_h\left(\bar{Y}_h - \bar{Y}\right)^{2} \] where $f=n/N$ is the sampling fraction, $\bar{S}=\sum_hF_hS_h$ is the weighted mean of the stratum standard deviations, and OPT here denotes Neyman allocation at fixed $n$.

Note that $\bfa{Var}{\estm{\bar{Y}}_{ST,SRS,PROP}}$ can be greater than $\bfa{Var}{\estm{\bar{Y}}_{SRS}}$, once the neglected terms are taken into account, if the variability between strata is less than the variability within strata.

Note also that, to this order of approximation, the decomposition shows that a stratified SRS sample design with proportional or Neyman allocation is at least as efficient as an SRS sample design. With a good choice of stratification, the design effect of stratified SRS under Neyman or optimal allocation is often considerably less than 1.
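The procedure in the second note above (turning an over-allocated stratum into a full-coverage stratum and re-allocating the rest) is naturally iterative. A minimal sketch follows, assuming all $S_h>0$; the function name and the three-stratum population are hypothetical and chosen only so that one stratum is over-allocated.

```python
def neyman_with_full_coverage(n, N, S):
    """Neyman allocation in which any stratum allocated more than N_h
    becomes a full-coverage stratum and the remainder is re-allocated.
    Returns unrounded allocations n_h."""
    H = len(N)
    alloc = [None] * H
    free = set(range(H))              # strata not yet fixed as full coverage
    n_left = n                        # sample still to be allocated
    while True:
        total = sum(N[h] * S[h] for h in free)
        trial = {h: n_left * N[h] * S[h] / total for h in free}
        over = [h for h in free if trial[h] > N[h]]
        if not over:                  # nothing over-allocated: accept this allocation
            for h in free:
                alloc[h] = trial[h]
            return alloc
        for h in over:                # take a census in each over-allocated stratum
            alloc[h] = float(N[h])
            free.remove(h)
            n_left -= N[h]

# Hypothetical population: the third stratum is small but highly variable
N = [1000, 500, 20]
S = [1.0, 2.0, 50.0]
alloc = neyman_with_full_coverage(100, N, S)
print(alloc)
```

In this made-up case plain Neyman allocation would put about 33 units into a stratum of only 20, so that stratum is fully enumerated and the remaining 80 units are split between the other two.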

8.9 Calculating Sample Sizes in Stratified SRSWOR

In general in the allocation of a total sample of size $n$ across strata we can write \[ n_h = p_hn \] where $p_h$ is the proportion of the sample allocated to stratum $h$, and $\sum_hp_h=1$.

Now assume that we want a particular standard error of an estimate of the total $\bfa{SE}{\estm{Y}}$. So we can write \[\begin{eqnarray*} \bfa{Var}{\estm{Y}} &=& \bfa{SE}{\estm{Y}}^2\\ &=& \sum_{h=1}^H N_h^2 \left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}\\ &=& \sum_{h=1}^H N_h^2 \left(1-\frac{p_hn}{N_h}\right)\frac{S_h^2}{p_hn}\\ &=& \frac{1}{n}\sum_{h=1}^H \frac{N_h^2S_h^2}{p_h} - \sum_{h=1}^HN_hS_h^2 \end{eqnarray*}\] which can be rearranged to \[\begin{equation} n = \frac{\sum_{h=1}^H \frac{N_h^2S_h^2}{p_h}}{ \bfa{SE}{\estm{Y}}^2 + \sum_{h=1}^HN_hS_h^2} \end{equation}\] This expression is particularly simple for proportional allocation where \[ p_h = \frac{N_h}{N} \] in which case \[\begin{equation} n = \frac{N\sum_{h=1}^H N_hS_h^2}{ \bfa{SE}{\estm{Y}}^2 + \sum_{h=1}^HN_hS_h^2} \end{equation}\] Neyman allocation, where \[ p_h = \frac{N_hS_h}{\sum_hN_hS_h} \] also results in a particularly simple form: \[\begin{equation} n = \frac{\left(\sum_{h=1}^H N_hS_h\right)^2}{ \bfa{SE}{\estm{Y}}^2 + \sum_{h=1}^HN_hS_h^2} \end{equation}\]
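As a numerical illustration, the sketch below evaluates the general sample size formula for the four ACC strata of Table 8.2 and compares proportional with Neyman allocation. The target standard error of 250 for the estimated total is an assumption made purely for this example, as is the function name `required_n`.

```python
def required_n(N, S, p, se_total):
    """Total sample size needed for a target SE of the estimated total,
    given stratum sizes N_h, standard deviations S_h and allocation fractions p_h."""
    numerator = sum(Nh**2 * Sh**2 / ph for Nh, Sh, ph in zip(N, S, p))
    denominator = se_total**2 + sum(Nh * Sh**2 for Nh, Sh in zip(N, S))
    return numerator / denominator

# Stratum sizes and standard deviations from Table 8.2
N = [1300, 2289, 932, 479]
S = [0.496, 0.853, 0.852, 2.441]
se_target = 250.0                     # hypothetical target SE for the total

N_total = sum(N)
NS_total = sum(Nh * Sh for Nh, Sh in zip(N, S))

p_prop = [Nh / N_total for Nh in N]                       # proportional allocation
p_neyman = [Nh * Sh / NS_total for Nh, Sh in zip(N, S)]   # Neyman allocation

n_prop = required_n(N, S, p_prop, se_target)
n_neyman = required_n(N, S, p_neyman, se_target)
print(round(n_prop, 1), round(n_neyman, 1))
```

Under these assumptions proportional allocation needs a sample of about 405 and Neyman allocation about 306. In practice the result would be rounded up to the next integer before allocating to strata.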

Note that strata may be used in a sample design not so much to control the variance, but to control the sample size in subpopulations. If that is the case it is best to calculate the minimum sample size required separately for each stratum, based on a required standard error for each stratum.

9 Nonresponse

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$ $\newcommand{\estm}{\widehat}$

Nonresponse is one of the most significant contributors to bias in surveys. By its nature nonresponse is difficult to adjust for without making assumptions which may or may not be justified.

There are two kinds of nonresponse to consider:

Item nonresponse: Some measurements for a particular responding unit are missing;
Unit nonresponse: The entire record for a particular sampled unit is missing.

Nonresponse can also be considered as a case of missing data.

9.1 Preventing nonresponse

The best way of dealing with nonresponse is to prevent it.

Like other kinds of non-sampling error, there are many ways that nonresponse bias can be minimised in the design stage of a sample survey.

Nonresponse can be reduced by:

Public education and publicity before a survey;
Individual pre-notification letters sent to respondents prior to the survey telling them to expect an interviewer to call;
Providing incentives or compulsion for response;
Providing clear information about the uses of the data and the confidentiality of the data;
Interviewing at an appropriate time (of the day or of the year) – to avoid inconvenient times;
Well trained interviewers/data collectors;
Data collection method: personal interviews may achieve a better response rate than post-back questionnaires;
Proxy responses – it may be appropriate to allow one person to respond on behalf of another;
Questionnaire design: well designed questionnaires which look short, easy and clear may achieve better response rates than large intimidatingly complex ones. Poorly worded questions may lead to item nonresponse, or even to irritating a respondent into a total refusal;
Call backs at different times to turn non-contacts into responses.

9.2 Types of nonresponse

Each unit in the sample may or may not respond. So we can create an indicator variable to code for response \[\begin{equation} R_i = \begin{cases} 1 & \text{if unit $i$ responds}\\ 0 & \text{if unit $i$ does not respond} \end{cases} \end{equation}\] and we can imagine that there is a probability $\phi_i$ that population unit $i$ would respond if an attempt was made to measure that unit.

We are interested in the values of the outcome variable $Y_i$ on each sample unit. We also have a set of auxiliary variables ${\bf X}_i$, which are known for every unit in the sample (whether or not that unit provided a response to $Y_i$). The variables ${\bf X}_i$ include the design variables from the frame.

Assume that we have drawn a sample of $n$ units from the population of size $N$, and that $n_R$ of these have responded.

There are then three types of missingness:

Missing Completely at Random (MCAR). This is when the probability of response $\phi_i$ is the same for all units, no matter what their value of $Y_i$ and ${\bf X}_i$.

This is the best situation: the respondents and nonrespondents do not differ in any important way, and it is as if the nonrespondents had been selected at random from the sample. To analyse data where we have MCAR, we can simply ignore the nonresponding units, and proceed as if we had selected a sample of $n_R$ units from the population.
Missing at Random (MAR). This is also known as missing at random given covariates or alternatively ignorable nonresponse. This occurs when the probability of response $\phi_i$ depends on known quantities: the auxiliary covariates ${\bf X}_i$, but not on the unknown $Y_i$. We therefore assume that if two units have the same values of ${\bf X}$, then their likelihood of response is the same.

This is a slightly more complex situation than MCAR, but we can still fully account for nonresponse.
Nonignorable Nonresponse. This is the unfortunate situation where the probability of response $\phi_i$ depends on the outcome of interest $Y_i$ (as well, possibly, as ${\bf X}_i$).

MAR and MCAR are strong assumptions. If the data are MAR we can test whether they are MCAR by comparing nonresponse rates in subgroups defined by ${\bf X}$ (or by logistic regression if any of the ${\bf X}$ variables are continuous) and test for any dependence on ${\bf X}$. However it may not be possible to distinguish whether the data are MAR or whether there is nonignorable nonresponse.

We of course exclude from consideration any data which is missing by design. These are data items which are missing because the respondent is not asked that particular question (e.g. we do not ask for the voting record of children.)

9.3 Response Rates

What is the response rate of a survey? Surprisingly there are several answers to this question. Here are some possibilities:

The number of respondents $n_R$ divided by the number of sample members $n$: \[\begin{equation} \frac{n_R}{n} \end{equation}\]
The number of respondents $n_R$ divided by the number of sample members contacted $n_C$: \[\begin{equation} \frac{n_R}{n_C} \end{equation}\]
The weighted number of respondents divided by an estimate of the total population \[\begin{equation} \frac{\sum_{k\in s_R}w_k}{\sum_{k\in s}w_k} \tag{9.1} \end{equation}\] Here $s$ is the whole sample, and $s_R$ is just the responding part of the sample.

The response rate for household surveys in Statistics NZ is calculated as follows. All sample members are classified into 5 categories:

	Classification	Sum of weights
1	Ineligible pre-contact	$A$
2	Ineligible post-contact	$B$
3	Eligible Non-Responding	$C$
4	Eligible Responding	$D$
5	Eligibility not established	$E$

This classification makes clear that our sample frame may include ineligible units, and that there are some units whose eligibility we cannot establish because we didn’t make contact. Note that some units are ineligible pre-contact: these are cases where the house has been demolished, or is clearly not inhabited.

The eligibility rate among the units where eligibility was established post-contact is \[ \frac{C+D}{B+C+D} \] so that our estimate of the total number of eligible units is \[ C + D + E\times\left(\frac{C+D}{B+C+D}\right) \] The response rate is the ratio of the weighted number of responding units to the weighted number of eligible units: \[ \frac{D}{C+D+E\frac{C+D}{B+C+D}} \] If there are no units for which eligibility is unknown ($E=0$), this reduces to $D/(C+D)$, which is Equation (9.1) with the sample $s$ restricted to the eligible units.
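A small sketch of this calculation is given below. The sums of weights $A$ to $E$ used here are invented for illustration and do not come from any real survey; the function name is also an assumption.

```python
def response_rate(A, B, C, D, E):
    """Weighted response rate from the five classification totals.

    A: ineligible pre-contact (not used in the rate), B: ineligible post-contact,
    C: eligible non-responding, D: eligible responding, E: eligibility not established.
    """
    eligibility_rate = (C + D) / (B + C + D)
    eligible_total = C + D + E * eligibility_rate   # estimated weighted eligible units
    return D / eligible_total

# Hypothetical sums of weights in each category
rate = response_rate(A=50, B=100, C=300, D=600, E=100)
rate_no_unknown = response_rate(A=50, B=100, C=300, D=600, E=0)
print(round(rate, 3), round(rate_no_unknown, 3))
```

With these made-up totals the eligibility rate is 0.9, so 90 of the 100 weighted units of unknown eligibility are treated as eligible, and the response rate is $600/990 \approx 0.606$. Setting $E=0$ gives $600/900 \approx 0.667$.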

9.4 Weight adjustments

If we have MCAR or MAR missingness, then we can account for nonresponse by making adjustments to the sample weights. If the probability of selection under some sampling scheme is $\pi_i$, and the probability of responding is $\phi_i$ then assuming selection and response probability are independent the chances of a population unit ending up as a fully responding sample unit are: \[ \begin{split} \text{Prob(selected + responding)} &= \text{Prob(selected)}\times\text{Prob(responding)}\\ \tilde{\pi}_i &= \pi_i \times \phi_i \end{split} \] Hence our estimator for a total changes from \[ \estm{Y} = \sum_{k\in s} \frac{y_k}{\pi_k} = \sum_{k\in s} w_ky_k \] to \[\begin{eqnarray*} \estm{Y}_W &=& \sum_{k\in s_R} \frac{y_k}{\tilde{\pi}_k}\\ &=& \sum_{k\in s_R} \frac{y_k}{\pi_k\phi_k}\\ &=& \sum_{k\in s_R} \frac{w_k}{\phi_k}y_k\\ &=& \sum_{k\in s_R} \tilde{w}_ky_k\\ \end{eqnarray*}\] where $s_R$ is the part of the sample which responds, and \[\begin{equation} \tilde{w}_k = \frac{w_k}{\phi_k} \end{equation}\] are the adjusted weights.

How do we estimate $\phi_k$? We will examine three approaches:

Assume data are MCAR;
Poststratification. Assume data are MAR, that the probability of nonresponse depends only on membership of known classes, and that the population totals in each class are known;
Weighting Class Adjustment. Assume data are MAR, that the probability of nonresponse depends only on membership of known classes, but that the population totals in each class are unknown.

We will consider a single example to illustrate these three approaches.

Example - Surveying employees

From a staff of $N=2721$ a sample of $n=200$ employees is selected by simple random sampling without replacement. The staff members are given a questionnaire which rates, among other things, their degree of independence – on a scale from 0 to 40.

Only $n_R = 96$ employees respond to the survey. The mean independence score of the respondents is $\bar{y}_R=13.2$ with standard deviation $s_R=4.3$.

Split by the management level of the employees, the data can be summarised as follows:

	Population	Selected	Responded
Role	$N_h$	$n_h$	$n_{hR}$	$\bar{y}_{hR}$	$s_{hR}$	$n_{hR}/n_h$
Manager	420	31	28	16.2	3.1	0.903
Non-manager	2301	169	68	12.0	4.2	0.402
Total	2721	200	96	13.2	4.3	0.480
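Section 9.2 noted that MCAR can be checked by comparing response rates across subgroups defined by ${\bf X}$. As a rough illustration on the table above, the sketch below computes a Pearson chi-square statistic for independence of response and management role. The choice of this particular test is an assumption of the sketch, not something prescribed by these notes, and with only 3 nonresponding managers the chi-square approximation is only a rough guide.

```python
from math import erfc, sqrt

# (responded, did not respond) for each role, from the table above
counts = {"Manager": (28, 31 - 28), "Non-manager": (68, 169 - 68)}

n = sum(sum(row) for row in counts.values())
col_totals = [sum(row[j] for row in counts.values()) for j in range(2)]

chi2 = 0.0
for row in counts.values():
    row_total = sum(row)
    for j, observed in enumerate(row):
        expected = row_total * col_totals[j] / n    # expected count under independence
        chi2 += (observed - expected) ** 2 / expected

# For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
p_value = erfc(sqrt(chi2 / 2))
print(round(chi2, 2), p_value)
```

The statistic is far above the 5% critical value of 3.84 for one degree of freedom, so response clearly depends on role and treating these data as MCAR looks doubtful.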

9.4.1 MCAR treatment

If we assume MCAR, then every unit has the same probability of response $\phi$, which is best estimated by the weighted response rate: \[\begin{eqnarray*} \estm{\phi} &=& \frac{\text{estimate of number of responders in population}}{ \text{estimate of population size}}\\ &=& \frac{\sum_{k\in s_R}w_k}{\sum_{k\in s}w_k} \end{eqnarray*}\] i.e. the proportion of units in the population that would respond if included in a sample.

For simple random sampling the weights are \[ w_k = \frac{N}{n} \] and the response rate is \[ \estm{\phi} = \frac{n_R}{n} \] so the adjusted weights are \[ \begin{split} \tilde{w}_k &= \frac{w_k}{\estm{\phi}}\\ &= \frac{N}{n}\times\frac{n}{n_R}\\ &= \frac{N}{n_R} \end{split} \] These are the same as the weights we would calculate if the $n_R$ responding units had been selected by SRSWOR from the population. Under MCAR we treat the sample in just that way: we reduce the sample size to $n_R$ and otherwise ignore the fact that nonresponse has occurred.

Example continued

In our example the weights $w_k$ are \[ w_k = \frac{N}{n} = \frac{2721}{200} = 13.61 \] and the response rate is just: \[ \estm\phi=\frac{n_R}{n}=\frac{96}{200}=0.48 \] and so the adjusted weights are \[ \begin{split} \tilde{w}_k &= \frac{w_k}{\estm\phi} = \frac{13.605}{0.48} = 28.34\\ &= \frac{N}{n_R} = \frac{2721}{96} = 28.34 \end{split} \] Our best estimate of the population mean is therefore just the sample mean of the respondents: \[\begin{eqnarray*} \widehat{\bar{Y}}_{\rm MCAR} = \bar{y}_R = 13.2 \end{eqnarray*}\] and we calculate the variance using the SRSWOR formula with the reduced sample size $n_R$: \[\begin{eqnarray*} \bfa{Var}{\widehat{\bar{Y}}_{\rm MCAR}} &=& \left(1-\frac{n_R}{N}\right)\frac{s_R^2}{n_R}\\ &=& \left(1-\frac{96}{2721}\right)\frac{4.3^2}{96} = 0.186 \end{eqnarray*}\]
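These figures are easy to reproduce. The sketch below uses only the summary values given in the example; the variable names are arbitrary.

```python
# Summary figures from the employee survey example
N, n, n_R = 2721, 200, 96
ybar_R, s_R = 13.2, 4.3

w = N / n                      # selection weight under SRSWOR
phi_hat = n_R / n              # estimated response probability under MCAR
w_adj = w / phi_hat            # adjusted weight, equal to N / n_R

mean_mcar = ybar_R             # the respondent mean is the MCAR estimate
var_mcar = (1 - n_R / N) * s_R**2 / n_R   # SRSWOR variance with sample size n_R
se_mcar = var_mcar ** 0.5

print(round(w, 3), phi_hat, round(w_adj, 2), round(var_mcar, 3), round(se_mcar, 2))
```

The standard error is the square root of the variance, about 0.43 on the 0 to 40 scale.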

9.4.2 Poststratification

In many surveys we have access to population summaries from other sources. For example in a household survey we can compare the population estimate from the survey (the sum of the weights) with the known population totals from the census (or updated estimates for the present day). Our sample estimates are unlikely to come up with exactly the same values of the population totals. This is partly because of sampling variability, possibly due to undercoverage of the sampling frame, but it may also be due to nonresponse. Poststratification is a means of adjusting the weights so that the survey estimates of the population match exactly those from the external source.

Assume that we have population counts $N_h$ for a set of $H$ poststrata. Note that these poststrata need not be the same as any strata we may have used in the sample design. The variables that define the poststrata could be age, sex – variables we are unlikely to have on our frame, but which are very likely available in census counts.

We calculate the effective response rate by \[\begin{eqnarray*} \estm{\phi_h} &=& \frac{\text{estimate of number of responders in class $h$ in population}}{ \text{actual population size in class $h$}}\\ &=& \frac{\sum_{k\in s_{hR}}w_k}{N_h} = \frac{\estm{M}_h}{N_h} \end{eqnarray*}\] We then proceed as if the data are MCAR within the poststrata.

Our estimate of the population in poststratum $h$ that would respond is the sum of the weights of the respondents in poststratum $h$ \[ \estm{M}_h = \sum_{k\in s_{Rh}} w_k \] Now if we modify the weights by setting \[ \tilde{w}_{k} = w_{k}\frac{N_h}{\estm{M}_h} = \frac{w_k}{\estm\phi_h} \ \ \text{for unit $k$ in poststratum $h$} \] then the sum of these new weights over the respondents will be $N_h$ as required. The effective response rate $\estm\phi_h$ incorporates undercoverage and nonresponse. (And note that it may be greater or less than 1.)

For simple random sampling the weights are \[ w_k = \frac{N}{n} \] and the estimated response rate within poststratum $h$ is \[ \begin{split} \estm{\phi_h} &= \frac{n_{hR}\times w_k}{N_h}\\ &= \frac{n_{hR}}{N_h}\frac{N}{n} \end{split} \] so the adjusted weights for units within poststratum $h$ are \[ \begin{split} \tilde{w}_k &= \frac{w_k}{\estm{\phi_h}}\\ &= \frac{N}{n}\times\frac{N_hn}{n_{hR}N}\\ &= \frac{N_h}{n_{hR}} \end{split} \] These are the same as the weights we would calculate if the $n_{hR}$ responding units had been selected by SRSWOR from poststratum $h$ (which is of size $N_h$): it’s as if we planned a stratified SRS within the poststrata.

Example continued

When the sample of employees was selected it was not stratified by management role, it was just a SRSWOR drawn from all employees.
However we can use the population information on the sizes of poststrata defined by management role (this information comes from the employee records of the company) and adjust the weights:

	Population	Stratum Fraction	Selection Weight	Responding sample	Adjusted weight
Role	$N_h$	$F_h$	$w=N/n$	$n_{hR}$	$\tilde{w}_{hR}=N_{h}/n_{hR}$
Manager	420	0.154	13.61	28	15.00
Non-manager	2301	0.846	13.61	68	33.84
Total	2721	1.000		96

The approach in poststratification is to adjust the weights so that with those corrected weights the estimates of the population size within the poststrata match the known benchmark values $N_h$.

We form the estimate and its variance using the SRSWOR formulae with the reduced sample sizes $n_{hR}$.

The poststratified estimate of the mean is \[\begin{eqnarray*} \widehat{\bar{Y}}_{\rm post} &=& \sum_h F_h \bar{y}_{hR}\\ &=& (0.154)(16.2) +(0.846)(12)\\ &=& 12.6 \end{eqnarray*}\] with variance \[\begin{eqnarray*} \bfa{Var}{\widehat{\bar{Y}}_{\rm post}} &=& \left(1-\frac{n_R}{N}\right)\frac{1}{n_R}\sum_h F_h s_{hR}^2 + \frac{1}{n_R^2} \sum_h (1-F_h) s_{hR}^2\\ &=& \left(1-\frac{96}{2721}\right) \frac{1}{96} \left[(0.154)(3.1)^2 + (0.846)(4.2)^2\right]\\ && + \frac{1}{96^2}\left[(1-0.154)(3.1)^2 + (1-0.846)(4.2)^2\right]\\ &=& 0.165 + 0.001 \\ &=& 0.166 \end{eqnarray*}\] An alternative approximation to the variance simply treats the sample as if it were designed as a stratified sample in the first place: \[\begin{eqnarray*} \bfa{Var}{\widehat{\bar{Y}}_{\rm post}} &=& \sum_h F_h^2\left(1-\frac{n_{hR}}{N_h}\right)\frac{s_{hR}^2}{n_{hR}}\\ &=& (0.154)^2\left(1-\frac{28}{420}\right)\frac{3.1^2}{28} + (0.846)^2\left(1-\frac{68}{2301}\right)\frac{4.2^2}{68}\\ &=& 0.188 \end{eqnarray*}\] This is a very similar result – both formulae for $\bfa{Var}{\bar{Y}_{\rm post}}$ are approximate, and it’s OK to use either.
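The poststratified calculations can be reproduced as follows. This sketch carries the stratum fractions $F_h$ at full precision rather than rounding them to three decimals, which does not change the results at the precision quoted; the variable names are arbitrary.

```python
# Known poststratum sizes and respondent summaries from the example
N_h = {"Manager": 420, "Non-manager": 2301}
n_hR = {"Manager": 28, "Non-manager": 68}
ybar_hR = {"Manager": 16.2, "Non-manager": 12.0}
s_hR = {"Manager": 3.1, "Non-manager": 4.2}

N = sum(N_h.values())
n_R = sum(n_hR.values())
F = {h: N_h[h] / N for h in N_h}                 # stratum fractions F_h
w_adj = {h: N_h[h] / n_hR[h] for h in N_h}       # adjusted weights N_h / n_hR

mean_post = sum(F[h] * ybar_hR[h] for h in N_h)

# First variance approximation (two terms)
var_post = ((1 - n_R / N) / n_R * sum(F[h] * s_hR[h] ** 2 for h in N_h)
            + sum((1 - F[h]) * s_hR[h] ** 2 for h in N_h) / n_R ** 2)

# Alternative: treat the respondents as a stratified SRS
var_strat = sum(F[h] ** 2 * (1 - n_hR[h] / N_h[h]) * s_hR[h] ** 2 / n_hR[h]
                for h in N_h)

print(round(mean_post, 1), round(var_post, 3), round(var_strat, 3))
```

By construction the adjusted weights summed over the respondents in each poststratum return the benchmark $N_h$.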

When the design is more complex than a SRS then the calculation of the variance of a post-stratified estimate becomes more complex. There are various approximations available, or alternatively one can use the computationally intensive resampling methods (Jackknife or bootstrap) covered later in the course.

One consequence of post-stratification is that individuals in the same household may end up with different weights because they come from, say, different age strata. This can make estimates of household level characteristics more difficult to compute in a consistent way. There are approaches such as integrated weighting which are modifications of post-stratification to avoid this kind of unwanted behaviour.

9.4.3 Weighting class adjustment

Even if we do not have access to benchmarks $N_h$, we can still correct for MAR nonresponse using a weighting class adjustment. We do this by forming groups of respondents called estimation groups or weighting classes according to their values of auxiliary variables ${\bf X}$. We might use sex, age, region or other such variables to form the groups. However a key difference is that we must know to which group each non-respondent belongs. In that case we estimate $\phi_h$, the probability of response in weighting class $h$ by \[\begin{eqnarray*} \estm{\phi_h} &=& \frac{\text{estimate of number of responders in class $h$ in population}}{ \text{estimate of population size in class $h$}}\\ &=& \frac{\sum_{k\in s_{hR}}w_k}{\sum_{k\in s_h}w_k} = \frac{\estm{M}_h}{\estm{N}_h} \end{eqnarray*}\] We then proceed as if the data are MCAR within the weighting classes.

As above, our estimate of the population in class $h$ that would respond is the sum of the weights of the respondents in class $h$ \[ \estm{M}_h = \sum_{k\in s_{hR}} w_k \] and we must also estimate the population in class $h$ \[ \estm{N}_h = \sum_{k\in s_{h}} w_k \] where the sum is over all sampled units in class $h$, responding and nonresponding.

Now we modify the weights by setting \[ \tilde{w}_{k} = w_{k}\frac{\estm{N}_h}{\estm{M}_h} = \frac{w_k}{\estm\phi_h} \ \ \text{for unit $k$ in class $h$} \] so that the sum of these new weights over the respondents will match the estimates $\estm{N}_h$.

For simple random sampling the weights are \[ w_k = \frac{N}{n} \] and the estimated response rate within estimation group $h$ is \[ \begin{split} \estm{\phi_h} &= \frac{n_{hR}\times w_k}{n_h\times w_k}\\ &= \frac{n_{hR}}{n_h} \end{split} \] so the adjusted weights for respondents within class $h$ are \[ \begin{split} \tilde{w}_k &= \frac{w_k}{\estm{\phi_h}}\\ &= \frac{N}{n}\times\frac{n_h}{n_{hR}} \end{split} \] We also estimate the group sizes and proportions \[ \begin{split} \widehat{F}_h &= \frac{n_h}{n}\\ \estm{N}_h &= \widehat{F}_h N = N\frac{n_h}{n} \end{split} \]

Example continued

Assume that we don’t actually know the population totals $N_h$, but we want to adjust for possible differential nonresponse. Firstly we check that the data are not MCAR: do the response rates differ by management role?

Class-specific response rates are \[\begin{eqnarray*} \estm\phi_h &=& \frac{n_{hR}}{n_h}\\ \estm\phi_1 &=& \frac{28}{31} = 0.903\\ \estm\phi_2 &=& \frac{68}{169} = 0.402 \end{eqnarray*}\] These seem to differ strongly, and we can test the significance of the difference in the usual way:

Hypotheses: $H_0: \phi_1=\phi_2$ vs. $H_1: \phi_1\neq\phi_2$
Test statistic: \[\begin{eqnarray*} Z &=& \frac{(\estm\phi_1-\estm\phi_2)-(\phi_1-\phi_2)}{ \sqrt{ \frac{\estm\phi_1(1-\estm\phi_1)}{n_1} + \frac{\estm\phi_2(1-\estm\phi_2)}{n_2} }}\\ &=& \frac{(0.903-0.402)-(0)}{ \sqrt{ \frac{(0.903)(1-0.903)}{31} + \frac{(0.402)(1-0.402)}{169} }}\\ &=& \frac{0.501}{0.065} = 7.7 \end{eqnarray*}\]
The p-value of this test is $<0.001$, so we conclude that the response rates differ. The data are not MCAR.

We have established that the nonresponse is not MCAR. One possible model is that the nonresponse is MAR, and depends only on management role. However with the data we have we cannot test whether this is the case. There could be non-ignorable nonresponse occurring.

We will proceed here assuming that we have adequately described the nonresponse process, with the response probability depending only on management role.

	Estimated Population	Estimated Stratum Fraction	Selection Weight	Selected sample	Responding sample	Response Rate	Adjusted weight
Role	$\widehat{N}_h=N\widehat{F}_h$	$\widehat{F}_h=n_h/n$	$w=N/n$	$n_h$	$n_{hR}$	$\widehat{\phi}_h$	$\tilde{w}_{hR}=\widehat{N}_{h}/n_{hR}=w/\widehat{\phi}_h$
Manager	422	0.155	13.61	31	28	0.903	15.06
Non-manager	2299	0.845	13.61	169	68	0.402	33.81
Total	2721	1.000		200	96
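The entries of this table can be reproduced with a few lines of Python. This is a minimal sketch using the figures of the example (the respondent means 16.2 and 12 are those quoted earlier); the dictionary layout and names are illustrative only.

```python
# Weighting class adjustment under SRSWOR (figures from the example table).
N, n = 2721, 200
n_h = {"Manager": 31, "Non-manager": 169}     # selected sample in each class
n_hR = {"Manager": 28, "Non-manager": 68}     # respondents in each class
ybar_hR = {"Manager": 16.2, "Non-manager": 12.0}

w = N / n   # selection weight, the same for every unit under SRSWOR

adjusted = {}
for h in n_h:
    phi_hat = n_hR[h] / n_h[h]      # estimated response rate in class h
    adjusted[h] = w / phi_hat       # adjusted weight for respondents in class h
    N_hat = w * n_h[h]              # estimated population size of class h
    # the adjusted weights, summed over respondents, reproduce N_hat
    assert abs(adjusted[h] * n_hR[h] - N_hat) < 1e-9
    print(h, round(N_hat), round(phi_hat, 3), round(adjusted[h], 2))

# Weighted mean of the respondents using the adjusted weights
numerator = sum(adjusted[h] * n_hR[h] * ybar_hR[h] for h in n_h)
denominator = sum(adjusted[h] * n_hR[h] for h in n_h)
ybar_wc = numerator / denominator
print(round(ybar_wc, 1))
```

The final weighted mean is the same quantity as $\sum_h \widehat{F}_h\bar{y}_{hR}$, because the adjusted weights in class $h$ sum to $\widehat{N}_h = N\widehat{F}_h$.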

In the weighting class adjustment we use the estimates $\widehat{F}_h$ and $\estm{N}_h$ defined above instead of supplied benchmarks. We form our estimate using an expression very similar to that for stratified SRSWOR, but with $F_h$ replaced by $\widehat{F}_h$: \[\begin{eqnarray*} \widehat{\bar{Y}}_{\rm wc} &=& \sum_h \widehat{F}_h \bar{y}_{hR}\\ &=& (0.155)(16.2) +(0.845)(12)\\ &=& 12.7 \end{eqnarray*}\] The variance is a little more complex however, because the weighting class estimator is biased. For this reason it is more appropriate to quote the mean squared error (MSE) of the estimate, and use this when calculating confidence intervals.
\[\begin{eqnarray*} \bfa{MSE}{\widehat{\bar{Y}}_{\rm wc}} &=& \bfa{Var}{\widehat{\bar{Y}}_{\rm wc}} +\bfa{Bias}{\widehat{\bar{Y}}_{\rm wc}}^2\\ &=& \sum_h \widehat{F}_h^2\left(1-\frac{n_{hR}}{\estm{N}_h}\right)\frac{s_{hR}^2}{n_{hR}} + \left(1-\frac{n_R}{N}\right)\frac{1}{n_R} \sum_h \widehat{F}_h(\bar{y}_{hR}-\widehat{\bar{Y}}_{\rm wc})^2\\ &=& (0.155)^2\left(1-\frac{28}{422}\right)\frac{3.1^2}{28} + (0.845)^2\left(1-\frac{68}{2299}\right)\frac{4.2^2}{68}\\ & & \qquad+ \left(1-\frac{96}{2721}\right)\frac{1}{96} \left((0.155)(16.2-12.7)^2 +(0.845)(12-12.7)^2\right)\\ &=& 0.187 + 0.003 = 0.191 \end{eqnarray*}\]

Summary.

Here are the results of the three approaches.

Adjustment	Estimate, $\estm{\bar{Y}}$	SE or RMSE	RSE	Conf. Int.
MCAR	13.2	0.4	0.033	(12.4,14.1)
MAR - Poststratification	12.6	0.4	0.034	(11.8,13.5)
MAR - Weighting Class	12.7	0.4	0.035	(11.8,13.5)

9.4.4 Notes

Poststratification is an effective means of correcting for MAR nonresponse if nonresponse depends only on the variables which define the poststrata. If we suspect that nonresponse depends on other (measured) covariates it may be most appropriate to make a weighting class adjustment before a final poststratification adjustment.

The variance of the estimator may be difficult to evaluate if poststratification or weighting class adjustments are made in a complex sample design, and in particular where the poststrata differ from the strata used in the sample selection (e.g. if we use geographical region to select the sample, but poststratify by sex). In these situations analytical formulae may not exist and numerical methods (such as the jackknife) may be needed to estimate variances. (These methods will be discussed later.)

Note: Raking adjustments provide a way of poststratifying to two (or more) sets of population totals when those totals are not available in crosstabulated form. For example we may know total numbers of students by sex, and separately total numbers of students by ethnic group, but not total numbers by sex and ethnicity simultaneously.
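Raking is usually implemented as iterative proportional fitting: the weighted sample counts are scaled to match the first set of totals, then the second, and so on until the margins stop changing. The sketch below illustrates the idea for a 2 by 2 table; all of the counts and benchmark totals are hypothetical, and a production implementation would adjust unit-level weights rather than a table of counts.

```python
# Raking (iterative proportional fitting) of a 2x2 table of weighted sample
# counts to two sets of marginal benchmarks. All numbers are hypothetical.
table = [[30.0, 20.0],    # rows: sex; columns: ethnic group
         [25.0, 25.0]]
row_targets = [600.0, 400.0]   # known totals by sex
col_targets = [550.0, 450.0]   # known totals by ethnic group

for _ in range(100):
    # scale each row so that it matches its benchmark
    for i, target in enumerate(row_targets):
        total = sum(table[i])
        table[i] = [x * target / total for x in table[i]]
    # scale each column so that it matches its benchmark
    for j, target in enumerate(col_targets):
        total = sum(row[j] for row in table)
        for row in table:
            row[j] *= target / total
    # stop once the row margins still match after the column step
    if all(abs(sum(table[i]) - t) < 1e-8 for i, t in enumerate(row_targets)):
        break

row_sums = [sum(row) for row in table]
col_sums = [sum(row[j] for row in table) for j in range(2)]
print([round(x, 1) for x in row_sums], [round(x, 1) for x in col_sums])
```

The ratio of each final cell to its starting value is the factor by which the weights of the sample units in that cell would be multiplied.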

There are many modelling approaches to the treatment of nonresponse, in particular the modelling of response propensity, the probability of response. See, for example, Little and Rubin (2002), a whole book devoted to the treatment of missing data.

9.5 Imputation

Imputation is a process whereby missing data are replaced. This may be done for individual items where there has been nonresponse (or inconsistent response). Alternatively, whole records may be imputed.

Imputation is often done so that a clean-looking complete dataset is created for analysis. In the dataset every sample member has a response to every (relevant) question. There are several different ways of achieving this.

9.5.1 Purposive Imputation

Where an item is missing the surveyor chooses a value that s/he considers most likely. This relies on the surveyor having relevant knowledge.

This is rarely possible, and brings with it the risk of introducing biases in the form of the surveyor’s prejudices.

9.5.2 Deductive Imputation

Where an item is missing, but its value can be deduced unambiguously from responses to other questions, then the value can be imputed.

Conversely, it should be noted that in data editing we may delete inconsistent data – e.g. where a respondent says that he is married but is only 10 years old. We may choose to delete one or other or both of the inconsistent data items.

In practice, deductive imputation is only rarely possible.

9.5.3 Cell Mean Imputation

Where a numerical item is missing for a particular respondent (e.g. age, income or number of cars owned) we may impute the item value by using the mean value of that item for other respondents who resemble that respondent. Respondents are grouped into cells, just as in the weighting class adjustment in Section 9.4.3, and the mean item value of all respondents in that cell is taken.

Although this process preserves the mean item value in the sample, it deflates the variance. If we have to impute the item for many records we’ll end up with a concentration of observations at the mean value, and a smaller variability in the sample. Consequently it is common to impute the value with a random draw from a normal distribution with the same mean and variance as the true respondents.
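The deflation of the variance is easy to see in a small experiment. The following Python sketch uses hypothetical income values for a single imputation cell and compares plain cell mean imputation with random draws from a normal distribution; the data and variable names are invented for illustration.

```python
import random
import statistics

random.seed(1)

# Hypothetical incomes ($000) in one imputation cell; None marks item nonresponse.
incomes = [42.0, 55.0, None, 61.0, None, 38.0, 47.0, None, 52.0, 58.0]

observed = [x for x in incomes if x is not None]
cell_mean = statistics.mean(observed)
cell_sd = statistics.stdev(observed)

# (a) plain cell mean imputation: every missing value becomes the cell mean
mean_imputed = [cell_mean if x is None else x for x in incomes]

# (b) random draw from a normal distribution with the respondents' mean and sd
random_imputed = [random.gauss(cell_mean, cell_sd) if x is None else x
                  for x in incomes]

print("respondents only :", round(statistics.variance(observed), 1))
print("cell mean imputed:", round(statistics.variance(mean_imputed), 1))
print("random imputed   :", round(statistics.variance(random_imputed), 1))
```

The mean is unchanged by approach (a), but its sample variance is smaller than that of the respondents because three observations sit exactly at the mean.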

9.5.4 Hot-Deck Imputation

Once again we divide the sample members into cells using variables which are known for each sample member, whether or not they respond. Then we replace missing values with values which are copied across from other records within the cell. We can impute individual items this way, or even whole records.

How do we decide which record to copy from – i.e. how do we choose a donor record? There are various possibilities:

Sequential Hot-Deck Imputation – use the most recent record in the list from the same cell;
Random Hot-Deck Imputation – use a random record from within the cell as the donor;
Nearest-Neighbour Hot-Deck Imputation – define a similarity measure between records (based on variables which are known for all records), and choose the record which is most similar. (e.g. the one where the age is closest, or where age and income are closest.)
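The three donor rules listed above can be sketched in Python as follows. The records, the choice of age as the matching variable, and the function names are all hypothetical; the sketch only shows how each rule picks a donor within a single cell.

```python
import random

random.seed(2)

# Hypothetical records in one imputation cell: age is known for everyone,
# income is missing (None) for some.
records = [
    {"id": 1, "age": 23, "income": 31.0},
    {"id": 2, "age": 45, "income": 62.0},
    {"id": 3, "age": 47, "income": None},
    {"id": 4, "age": 30, "income": 40.0},
    {"id": 5, "age": 25, "income": None},
]

def sequential_donor(records, pos):
    """Most recent earlier record in the list with an observed income."""
    for r in reversed(records[:pos]):
        if r["income"] is not None:
            return r
    return None   # no earlier respondent in this cell

def random_donor(records):
    """A respondent chosen at random from the cell."""
    donors = [r for r in records if r["income"] is not None]
    return random.choice(donors)

def nearest_donor(records, recipient):
    """The respondent whose age is closest to the recipient's age."""
    donors = [r for r in records if r["income"] is not None]
    return min(donors, key=lambda d: abs(d["age"] - recipient["age"]))

for pos, rec in enumerate(records):
    if rec["income"] is None:
        print("recipient", rec["id"],
              "| sequential:", sequential_donor(records, pos)["id"],
              "| random:", random_donor(records)["id"],
              "| nearest:", nearest_donor(records, rec)["id"])
```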

9.5.5 Cold-Deck Imputation

Similar to Hot-Deck imputation, but the donor records all come from another survey or some other data source. If this other data source is out of date, or if the questions asked were slightly different, then this procedure may introduce biases.

9.5.6 Regression Imputation

Replace the missing value with the prediction from a regression model. That is, use the respondents in the same cell to create a model of the way the outcome $Y$ depends on the known measured covariates ${\bf X}$, and use that model to predict a value for the missing record, given its particular values of the covariates.

9.5.7 Multiple Imputation

Instead of imputing values once, we impute them a number of times, using one of the methods above. That means we end up with, say, 10 datasets – each of which has had the missing values imputed.

We then calculate 10 estimates, one from each of the 10 imputed datasets, and average them. We then use the variability amongst those 10 estimates as an additional component of the variance of the estimator: the additional uncertainty introduced by the imputation process.
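One standard way of combining the results is known as Rubin's rules, which are not spelled out in these notes: the pooled estimate is the average of the estimates, and the total variance is the average within-imputation variance plus $(1+1/m)$ times the between-imputation variance, where $m$ is the number of imputed datasets. The Python sketch below applies these rules to hypothetical results from $m=10$ imputed datasets.

```python
import statistics

# Hypothetical results from m = 10 imputed datasets:
# each entry is (estimate, estimated variance of that estimate).
results = [(12.4, 0.19), (12.9, 0.18), (12.6, 0.20), (12.7, 0.19), (12.5, 0.21),
           (12.8, 0.18), (12.6, 0.19), (12.3, 0.20), (12.7, 0.20), (12.5, 0.19)]
m = len(results)

estimates = [est for est, _ in results]
variances = [var for _, var in results]

pooled_estimate = statistics.mean(estimates)

within = statistics.mean(variances)        # average within-imputation variance
between = statistics.variance(estimates)   # variability among the m estimates
total_variance = within + (1 + 1 / m) * between   # Rubin's combining rule

print(round(pooled_estimate, 2), round(within, 3),
      round(between, 3), round(total_variance, 3))
```

The term involving `between` is the additional uncertainty introduced by the imputation process; ignoring it would understate the standard error.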

9.5.8 Substitution

This method of imputation occurs in the field while the survey is in progress. The nonrespondent is replaced by a neighbour, or the next person who can be found to respond instead. This is very much like quota sampling, and the sample which results is not a probability sample.

9.5.9 Notes

Imputation for item nonresponse may weaken or distort correlation of variables within a record. Only methods where entire records are copied can avoid this, since only they represent true responses of individuals.
Where a lot of imputation is necessary there may be serious doubts about the validity of the conclusions of the survey. The assumptions which underlie the imputation process may start to appear in the results (e.g. we may simply see the regression relationships in the data which we put there during imputation). Moreover, if an analyst proceeds to produce estimates based on an imputed dataset without realising that the actual sample size is much smaller than it appears (due to the smaller number of actual responses) then the standard errors produced will be too small.
Ideally every imputed value should be flagged so that the analyst knows how much imputation has been done, and therefore whether it will affect the study conclusions. Typically a new indicator variable is created for every variable where imputation is done. The indicator takes the value 1 if the value for a unit is imputed, and 0 otherwise.
Theoretical expressions for variance estimates may break down when there has been correction for nonresponse. In these cases numerical resampling techniques may be the only way to get realistic standard errors.
In general imputation should be minimal – as few values as possible imputed – and implemented by a robust, defensible and appropriate methodology.
Imputation is not necessary for variables which are structurally missing. That is, where the person did not provide an answer to a question because that person was not asked that question.

e.g. If a question asks ‘Is your mortgage at a fixed interest rate?’, such a question should only have a non-missing answer if the person had answered yes to the question ‘Do you have a mortgage?’.

9.6 Capture-Recapture

A common way of determining the level of nonresponse in censuses is the post-enumeration survey (PES). After a census has taken place, a subset of areas is resurveyed, with a shorter questionnaire, and using a different interviewing workforce, so that the results are independent. The post-enumeration survey makes a much more strenuous effort to achieve a complete response, although there will still be some nonresponse.

In carrying out the Census and PES we see three kinds of respondents:

those in both the Census and PES
those in the Census only
those in the PES only

There is a fourth group of people who do not appear in either survey.

The numbers of people responding to the census and the post-enumeration survey can thus be summarised in the following table:

	In PES?
In Census?	Yes	No	Total
Yes	$n_{11}$	$n_{12}$	$n_{1+}$
No	$n_{21}$	$0$	$n_{21}$
Total	$n_{+1}$	$n_{12}$	$n$

i.e. we saw $n_{1+}$ respondents in the census and $n_{+1}$ respondents in the PES, of whom $n_{11}$ responded to both.

The population values are the numbers of people in the whole population classified by whether they were found by the Census and whether they would be found by the PES if the PES were a full coverage survey:

	Would be found by PES?
In Census?	Yes	No	Total
Yes	$N_{11}$	$N_{12}$	$N_{1+}$
No	$N_{21}$	$N_{22}$	$N_{2+}$
Total	$N_{+1}$	$N_{+2}$	$N$

We are interested in the population total $N$. We can use the census data and the PES sample to make estimates of $N_{11}$, $N_{12}$ and $N_{21}$.
Assuming independence of the response rates to the census and the PES (a big assumption!) we can use these estimates to estimate the total \[ \estm{N} = \frac{n_{1+}n_{+1}}{n_{11}} \] This estimation method is an example of capture-recapture estimation. This type of estimation is common in biological settings where a sample of animals is caught, marked, and then released. Some time later another sample is caught and the ratio of marked to unmarked individuals is used to estimate the total animal population. In the census example the census is the first ‘capture’, and the PES is the ‘recapture.’

Note: Complex capture-recapture models exist which relax the independence assumption: these models require more than two sources of information.

The estimator $\estm{N}$ given above is biased: an approximately unbiased estimator is \[ \estm{N} = \frac{(n_{1+}+1)(n_{+1}+1)}{(n_{11}+1)} - 1 \] which has variance \[ \bfa{Var}{\estm{N}} = \frac{(n_{1+}+1)(n_{+1}+1) (n_{1+}-n_{11})(n_{+1}-n_{11})}{ (n_{11}+1)^2(n_{11}+2)} \]

Example

In a capture-recapture study of frogs on an island a sample of 50 frogs is captured, tagged and released on one night, and then a second sample of 30 frogs is captured on the second night. In the second sample there are 16 frogs with tags, and 14 without. What is the size of the population of frogs on the island?

	In second sample?
In 1st sample?	Yes	No	Total
Yes	16	34	50
No	14
Total	30

We have $n_{1+}=50$ frogs in the first sample, $n_{+1}=30$ frogs in the second sample, and $n_{11}=16$ frogs common to both. Our estimate of the population size is then \[ \estm{N} = \frac{(n_{1+}+1)(n_{+1}+1)}{(n_{11}+1)} - 1 = \frac{(51)(31)}{17} - 1 = 92 \] which has variance \[ \bfa{Var}{\estm{N}} = \frac{(n_{1+}+1)(n_{+1}+1) (n_{1+}-n_{11})(n_{+1}-n_{11})}{ (n_{11}+1)^2(n_{11}+2)} = \frac{(51)(31)(34)(14)}{(17)^2(18)} = 144.67 \] so the standard error and relative standard error are \[\begin{eqnarray*} \bfa{SE}{\estm{N}} &=& \sqrt{\bfa{Var}{\estm{N}}} = 12.03\\ \bfa{RSE}{\estm{N}} &=& \frac{\bfa{SE}{\estm{N}}}{\estm{N}} = \frac{12.03}{92} = 0.13 = 13\% \end{eqnarray*}\] A 95% confidence interval for $N$ is therefore: \[ \estm{N} \pm 1.96 \bfa{SE}{\estm{N}} = 92 \pm (1.96)(12.03) = 92 \pm 24 = (68,116) \]
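The frog calculation can be reproduced with the short Python sketch below. The function name `chapman` is our own label (the bias-corrected estimator is often attributed to Chapman); the formulae are exactly those given above.

```python
import math

def chapman(n1, n2, m):
    """Approximately unbiased capture-recapture estimate and its variance.

    n1: size of the first sample, n2: size of the second sample,
    m: number of units appearing in both samples.
    """
    N_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)
           / ((m + 1) ** 2 * (m + 2)))
    return N_hat, var

N_hat, var = chapman(50, 30, 16)
se = math.sqrt(var)
ci = (N_hat - 1.96 * se, N_hat + 1.96 * se)

print(N_hat, round(var, 2), round(se, 2), [round(x) for x in ci])
# prints: 92.0 144.67 12.03 [68, 116]

# The simple (biased) ratio estimator, for comparison
print(50 * 30 / 16)
# prints: 93.75
```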

10 Systematic Random Sampling

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$ $\newcommand{\estm}{\widehat}$

There are certain situations where it is inconvenient or impossible to draw a SRS. For example, we may wish to take a sample of passengers arriving at an airport. Here we don’t have a frame to sample from, but we can nevertheless conceive of ways of sampling at a constant rate – e.g. to take a sample where the probability of selection is one in 20, we can draw a random number between 0 and 1 for every passenger who arrives. If that number is less than 0.05 we select the passenger into the sample.

However, it may be too difficult to implement this scheme in practice, given the setup at the arrival gate in the airport. Instead, because we believe that the passengers arrive in random order we can take a Linear Systematic Random Sample (LSRS). At the start of sampling we choose a random number $r$ between 1 and 20, and then take the $r^{\rm th}$ passenger, and then every 20th passenger after that.

Another example: we want to estimate the number of books on a Library card catalogue which have not yet been entered on the computer catalogue.

There are roughly 500,000 books in the card catalogue and we want a sample of 2000 books to check whether they exist on the computer catalogue. The card catalogue comprises 250 drawers with roughly 2000 cards in each drawer. After selecting one card at random from the first 250 cards in the first drawer, we select every 250th card thereafter, and check whether the selected cards are on the computer catalogue.

This method is often used when we want to sample without constructing a list frame. If we believe that the population is randomly ordered and that units are not correlated then this would be equivalent to an SRS. However, the assumptions of random order, etc are very often violated.

10.1 Implementation of LSRS

Let $U$ be a population of $N$ units. Suppose we want a sample of size $n$. The population units are visualized as being arranged in a line.

Calculate the sampling interval $L$, the integer closest to $N/n$. Then $N=mL+c$, where $m$ is the integer part of $N/L$ and the remainder $c$ is an integer with $0\leq c<L$.
Choose a random number (called the random start) $r$ from $\{1,\ldots,L\}$.
Calculate the numbers: $r, r+L, \ldots, r+(k-1)\times L, \dots$ so long as $r+(k-1)\times L \leq N$. These numbers correspond to the selected population units.

Note if $r \leq c$ then there are $m+1$ such numbers, otherwise $m$ such numbers.

Note also that $m$ may be quite different from $n$, e.g.
1. Suppose $N=17$ and $n=7$. Since $17/7=2.43$ we have $L=2$, and so $m=8$ and $c=1$. The possible samples are thus of size 9 or 8, but not 7. (In fact there is no sampling interval $L$ that we could choose which would result in a sample of size 7.)
2. Suppose $N=17$ and $n=6$. Since $17/6=2.83$ we have $L=3$, and so $m=5$ and $c=2$. The possible samples are thus of size 6 or 5.
3. Suppose $N=69$ and $n=28$. Since $69/28=2.46$ we have $L=2$, and so $m=34$ and $c=1$. The possible samples are thus of size 35 or 34, far away from 28.

Thus our achieved sample size $n'$ (which is always $m$ or $m+1$) can be different from $n$, and the sample size for LSRS is a random variable. If a fixed sample size is required then we must either draw a further systematic sample from the remaining unselected units, or randomly delete some of the selected units. However, if $N$ is large compared with $n$ the problem is minor.
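The selection procedure and the three examples above can be sketched in Python as follows. The function names are illustrative, and rounding $N/n$ to the closest integer with ties rounded up is an assumption, since the notes do not say how ties are broken.

```python
import random

def sampling_interval(N, n):
    """Integer closest to N/n (ties rounded up: an assumption)."""
    return max(1, int(N / n + 0.5))

def lsrs_indices(N, L, r):
    """Labels (1-based) of the units selected with interval L and random start r."""
    return list(range(r, N + 1, L))

def lsrs(N, n, rng=random):
    """Draw one linear systematic random sample of roughly n units from 1..N."""
    L = sampling_interval(N, n)
    r = rng.randint(1, L)          # random start from {1, ..., L}
    return lsrs_indices(N, L, r)

# Possible achieved sample sizes for the three examples
for N, n in [(17, 7), (17, 6), (69, 28)]:
    L = sampling_interval(N, n)
    sizes = sorted({len(lsrs_indices(N, L, r)) for r in range(1, L + 1)})
    print(N, n, L, sizes)
# prints:
# 17 7 2 [8, 9]
# 17 6 3 [5, 6]
# 69 28 2 [34, 35]
```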

10.2 Inclusion Probabilities

There are $L$ possible starting points, and hence there are $L$ possible samples. Hence the probability of selecting any particular sample is \[ p_{LSRS}(s) = \begin{cases} \frac{1}{L} & \text{if $s$ is LSRS} \\ 0 & \text{otherwise} \end{cases} \] Because each element $i$ belongs to one and only one of the $L$ equally probable systematic samples this means the 1st order inclusion probabilities are \[ \pi_{i} = \frac{1}{L} \] and also this means that for every $i\neq j$ \[ \pi_{ij} = \begin{cases} \frac{1}{L} & \text{if $i$ \& $j$ are in the same sample} \\ 0 & \text{otherwise} \end{cases} \] The fact that some of the 2nd order inclusion probabilities are zero means that valid variance estimates cannot be calculated from the sample.
(However we will see later that we can regard LSRS as a special case of cluster sampling.)

10.3 HT Estimators in LSRS

10.3.1 Total

We observed that the 1st order inclusion probabilities for a LSRS sampling scheme were \[ \pi_{i}=\frac{1}{L} \qquad \text{where $L$ was the integer closest to $N/n$} \] so that the HT estimator for the total can be constructed easily. It is \[ \widehat{Y}_{HT,LSRS} = \sum_{k\in s} \frac{y_{k}}{\pi_k} = \sum_{k\in s} \frac{y_{k}}{1/L} = L\sum_{k\in s} y_{k} \] This estimator is of course unbiased.

Note that unless $N/n$ is exactly an integer, in which case $L=N/n$, then this estimator is not quite the same as that for the total in SRSWOR. If we used the HT estimator for SRSWOR in an LSRS design, where $N/n$ is not exactly an integer then we would be using a biased estimator. Of course if $N/n$ is large the bias is small.

The variance of this estimator is approximately \[ \bfa{Var}{\widehat{Y}_{HT,LSRS}} = \frac{N^2}{L}\sum_{j=1}^L ( \widehat{\bar{Y}}_j - \bar{\bar{Y}} )^2 = \frac{N^2}{L}\sum_{j=1}^L (\bar{y}_j - \bar{\bar{y}})^2 \] where \[ \bar{\bar{Y}} = \frac{1}{L}\sum_{j=1}^L\widehat{\bar{Y}}_j = \frac{1}{L}\sum_{j=1}^L\bar{y}_j = \bar{\bar{y}} \] is the average of all the possible sample averages. This variance is small if the sample averages are almost the same.
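Because there are only $L$ possible systematic samples, the design properties of the HT estimator can be verified by enumerating them all. The Python sketch below does this for a small hypothetical population (the values are invented); it computes the exact design variance as the average squared deviation of the $L$ possible estimates from the true total, which requires knowing the whole population.

```python
# Hypothetical population of N = 11 values, sampled with interval L = 3.
y = [4, 7, 1, 9, 3, 8, 2, 6, 5, 10, 11]
N, L = len(y), 3
Y = sum(y)   # true population total

# HT estimate, L times the sample sum, for each of the L possible samples.
# r is the 0-based position of the first selected unit.
estimates = [L * sum(y[r::L]) for r in range(L)]

# Each sample has probability 1/L, so expectations are simple averages.
expected = sum(estimates) / L
variance = sum((e - Y) ** 2 for e in estimates) / L

print(estimates, expected, Y, variance)
# prints: [75, 81, 42] 66.0 66 294.0
```

The expected value equals the true total even though the samples are of different sizes (4, 4 and 3 units), which illustrates the unbiasedness of the HT estimator.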

Because not all $\pi_{ij}>0$, we cannot use the HT estimator of this variance from one sample. In fact there is no unbiased estimator of the variance for systematic sampling, which is not surprising since we are really splitting the population into several subpopulations and then only sampling fully one such subpopulation.

In the absence of an unbiased estimator for the variance we may use the variance for a SRS estimate of the population total: \[ \bfa{\widehat{Var}}{\widehat{Y}_{HT,LSRS}} \simeq N^2\left(1-\frac{n}{N}\right)\frac{s_y^2}{n} \] This approximation may be sufficient, although we should be aware that it may either overestimate or underestimate the actual sampling variance.

10.3.2 Mean

Using the natural estimator of the population mean from the sample, we have, \[ \widehat{\bar{Y}}_{HT,LSRS}=\frac{L}{N}\sum_{k \in s} y_{k} \] which will only be the sample mean if $L=N/n$. In other words, the sample mean is biased in LSRS unless $N/n$ is exactly an integer. We can form the obvious variance of the estimator, but there is no unbiased estimator for estimating this variance from a sample.

10.4 Circular Systematic Random Sampling

In LSRS we could not control the sample size $n$ exactly. Circular Systematic Random Sampling (CSRS) is a means of guaranteeing a particular sample size. Instead of regarding the population units as lying on a line, we wrap the list around on itself. We choose $L$ as above, but keep selecting from the list until we have $n$ units. If the list runs out we go back to the beginning.

10.5 Using Auxiliary Data: Ordered Populations

We have motivated the use of LSRS as being almost as good as SRS if the population list is randomly ordered. If, however, the population list is ordered with respect to some auxiliary variable $X$, known to be correlated with the outcome variable $Y$, then LSRS makes some efficiency gains over SRS.

Consider the example of countries and their military expenditures. We wish to estimate the total military expenditure. However this varies wildly between countries.

The large variance of the SRS estimator arises because some samples include the largest units and others do not: many of the possible samples are very different from each other.

We could control this with a stratified random sample, and would achieve strong gains in precision.

However if we order the population by GNP (known for every country on the frame), and then take a LSRS, we will always have a sample where there is a range of GNP values, and since GNP is strongly correlated with military expenditure, there will be a range of military expenditure values. All of the possible samples will therefore resemble each other much more strongly than they would under SRS, and each one strongly resembles the population.

We still don’t have a theoretical variance formula which can be used to compute the standard error from a single sample, however.

10.6 Example

Consider the case of data on the GDP and military expenditures of $N=150$ countries, shown in Figure 10.1.

Figure 10.1: Military expenditure and GDP

We want to estimate the mean military expenditure (in billions of dollars) using a sample of size $n=30$. The true mean value is $\bar{Y}=13.3$, the variance is $S_Y^2=4578.32$ and the variance of an estimate of the mean from a SRS is \[ \bfa{Var}{\widehat{\bar{Y}}} = \left(1-\frac{n}{N}\right)\frac{S_Y^2}{n} = 122.09 \] How does a LSRS compare?

First we calculate the sampling interval: \[ L = \ \text{closest integer to}\ \ \frac{N}{n}=\frac{150}{30}=5 \] so $L=5$. That means there are 5 possible samples, corresponding to the 5 possible random starting points. Each sample is of size $n'=30$ since $5\times30=150$.

If we draw all 5 samples from the original list (with its original ordering, alphabetical by country), we find the following results:

Sample	$\bar{y}$	$\widehat{\bar{Y}}$	$s_y$
1	6.00	6.00	16.95
2	40.77	40.77	146.94
3	6.66	6.66	14.91
4	7.00	7.00	17.36
5	6.23	6.23	12.05

(NB: our estimates of the mean $\widehat{\bar{Y}}$ are in each case equal to the sample mean $\bar{y}$, because $N/n'$ is an integer. If it were not an integer these values would differ.) The second sample is very different from the rest; consequently the variance of $\widehat{\bar{Y}}$ is very large: \[ \bfa{Var}{\widehat{\bar{Y}}} = 235.46 \] i.e. a design effect of 1.93.

This result may be a consequence of an unfortunate ordering of the countries in the list. Here are the 5 possible samples drawn by LSRS after we randomly reorder the list:

Sample	$\bar{y}$	$\widehat{\bar{Y}}$	$s_y$
1	34.18	34.18	141.38
2	11.57	11.57	45.82
3	7.64	7.64	16.83
4	3.72	3.72	10.31
5	9.57	9.57	21.19

There is still a lot of variability between the samples, and in this case we find $\bfa{Var}{\widehat{\bar{Y}}} = 144.18$, i.e. a design effect of 1.18.

We can improve matters if we reorder the list by GDP prior to sampling. In this case we find:

Sample	$\bar{y}$	$\widehat{\bar{Y}}$	$s_y$
1	6.22	6.22	16.99
2	7.32	7.32	14.91
3	6.15	6.15	13.49
4	14.25	14.25	46.99
5	32.73	32.73	141.74

This has reduced the variability between the samples, and in this case we find $\bfa{Var}{\widehat{\bar{Y}}} = 128.92$, i.e. a design effect of 1.06, which is competitive with SRS.
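The design effects quoted in this example are the ratios of the LSRS variances to the SRS variance. A minimal Python check, using only the figures stated in the text (the dictionary keys are our own labels for the three orderings):

```python
# Design effects for the three list orderings, relative to SRS.
N, n, S2 = 150, 30, 4578.32
var_srs = (1 - n / N) * S2 / n          # variance of the SRS estimate of the mean

# LSRS variances of the estimated mean, as quoted in the text
var_lsrs = {"alphabetical": 235.46, "random order": 144.18, "ordered by GDP": 128.92}

deff = {ordering: v / var_srs for ordering, v in var_lsrs.items()}

print(round(var_srs, 2))
for ordering, d in deff.items():
    print(ordering, round(d, 2))
# prints:
# 122.09
# alphabetical 1.93
# random order 1.18
# ordered by GDP 1.06
```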

LSRS may perform significantly worse than SRS; however, in many cases LSRS yields variances which are similar. Those are the cases when the convenience of an LSRS may make it a preferable sampling scheme. In those cases the SRS variance estimates should also suffice.

11 Cluster sampling

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$ $\newcommand{\estm}{\widehat}$

We have mentioned previously that implementing an SRSWOR sample design in practice requires a list frame of the population units. Clearly in many practical sampling situations a list frame of the population doesn’t exist, or if it does exist the biases in the frame are unacceptably large, or the construction of a list frame would be very expensive, indeed many times the cost of a sample survey of the population. What can we do?

It may be that we can form and list clusters of population units quite cheaply: e.g. we can divide the whole of New Zealand into small geographic areas on a large map, label these areas, and then select a sample of them by, say, SRSWOR. (An example of such geographic areas are Statistics New Zealand’s meshblocks: New Zealand is split into about 34000 meshblocks which contain on average about 30 dwellings.)

What we then have is not an SRSWOR of dwellings, but an SRSWOR of clusters of dwellings. We now have the opportunity of listing all the dwellings in a selected cluster and perhaps taking an SRSWOR of some of them or indeed sampling all of them. This two stage process of constructing a frame is likely to be quite cheap, since we never list the entire population.

Cluster sampling arises quite naturally in sampling biological data. For example if we are interested in determining the characteristics of a deep sea fish species, e.g. average age, average weight, etc, then it is likely that we collect the fish by trawl netting of suspected shoals of such fish. So we might randomly select some shoals of fish (first stage of clustering), then take some ‘random’ trawls through the selected shoal (second stage of clustering) and then finally we might randomly select some bins of fish and measure all the fish within the selected bins (third stage of clustering).

As you might expect, the properties of such sample designs are quite different from those which don’t select clusters. In particular, such designs are generally not as efficient as SRSWOR designs. For example, often the design effect is considerably more than 1. However, careful design of the clusters, including making sure that the clusters are very heterogeneous and the ultimate stage of clustering produces small clusters, together with careful choice of estimators and the use of stratification to form the first stage clusters into homogeneous groups, can result in designs where the design effect is 1 or even less than 1.

In the following sections we are going to discuss cluster sample designs where the sampling is SRSWOR. These are not the only possible designs, but they are the simplest. Recall that in any unusual design, provided that we can work out the first and second order inclusion probabilities, we always (via the HT estimator) have a way of estimating totals or means and their variances.

The remainder of this chapter follows the treatment given in Chapter 5 of Lohr (1999).

11.1 Example

Suppose we have a primary school with 130 students in 12 classes. We want to estimate the total number of reading books taken home by the students. We do not have the time to count the number of books taken home by all 130 students so we are only going to look at a simple random sample of 12 students to calculate our estimates.

Let $Y_i$ be the number of books taken home by the $i^{\rm th}$ student in the school: $i=1,\ldots,M$, with population size $M=130$;
Let $y_k$ be the number of books taken home by the $k^{\rm th}$ student in the sample: $k=1,\ldots,m$, with sample size $m=12$.

(For reasons that will be clearer later in the chapter, we are using $M$ and $m$ for the population size and sample size, rather than the more usual $N$ and $n$.)

Sample member	Student (within school)	Class	Class Size	Student (within class)	Number of books
$k$	$i$				$y_k$
1	5	1	9	5	2
2	14	2	10	5	2
3	16	2	10	7	2
4	63	6	16	9	1
5	65	6	16	11	0
6	70	6	16	16	1
7	77	7	12	7	2
8	80	7	12	10	3
9	87	8	14	5	5
10	89	8	14	7	2
11	107	10	6	1	6
12	127	12	10	7	3

The sample mean is $\bar{y}=2.417$ and variance is $s_y^2=1.68^2=2.811$.
This is a case of simple random sampling: the SRSWOR estimator of the total is \[ \estm{Y} = M\bar{y} = (130)(2.417) = 314 \] with variance \[ \bfa{Var}{\estm{Y}} = M^2\left(1-\frac{m}{M}\right)\frac{s_y^2}{m} = 130^2\left(1-\frac{12}{130}\right)\frac{1.68^2}{12} = 3592.891 \] so that the standard error $\bfa{SE}{\estm{Y}}=\sqrt{3592.891}=59.9$.
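The arithmetic above can be checked with a short script. This is only a sketch using the Python standard library; the variable names (`total_hat`, `var_hat`, `se_hat`) are ours and not part of the notes.

```python
from math import sqrt
from statistics import mean, variance

M = 130                                    # students in the school (population size)
y = [2, 2, 2, 1, 0, 1, 2, 3, 5, 2, 6, 3]   # books taken home by the sampled students
m = len(y)                                 # sample size, m = 12

y_bar = mean(y)       # sample mean
s2_y = variance(y)    # sample variance (divisor m - 1)

# SRSWOR estimator of the total, with its estimated variance and standard error
total_hat = M * y_bar
var_hat = M**2 * (1 - m / M) * s2_y / m
se_hat = sqrt(var_hat)

print(round(total_hat), round(var_hat, 1), round(se_hat, 1))
```

Up to rounding, this reproduces the estimate of 314 books with a standard error of 59.9.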

However, we might not have access to a list of all the students in the school – so we don’t have a frame from which we can select a simple random sample. Instead what we do have is a list of classes at the school. We can take a simple random sample of those classes, and then go to each selected class, and take our sample from the students we find there. In this case the classes are called clusters or PSUs (Primary Sampling Units). There are 12 classes at the primary school in question, each with a small number of students or SSUs (Secondary Sampling Units). We select $n=3$ classes, and then select 4 students from each class.

Let $N$ be the number of clusters: PSUs.
In cluster $i$, there are $M_i$ students: SSUs. There are $M=\sum_{i=1}^N M_i$ SSUs in all.
Let $Y_{ij}$ be the number of library books taken home by the $j^{\rm th}$ student in class $i$.
Let $m_k$ be the number of students selected from the $k^{\rm th}$ selected class.
Let $y_{k\ell}$ be the number of library books taken home by the $\ell^{\rm th}$ selected student in the $k^{\rm th}$ selected class.

Class	Class size	Sample Indicator		Sample Size	Sample Data
$i$	$M_i$	$I_i$	$k$	$m_k$	$y_{k1},\ldots,y_{km_k}$
1	9	1	1	4	4,4,3,4
2	10	0
3	20	0
4	8	0
5	7	0
6	16	1	2	4	6,6,6,4
7	12	0
8	14	0
9	10	0
10	6	0
11	8	0
12	10	1	3	4	1,2,3,5

We have a new sample of 12 students – but we need to be careful when we analyse these data: the 12 students are grouped into clusters. Students from the same class may be much more similar to each other than they are to all of the other students at the school: particularly if, say, one teacher in one class is particularly encouraging of use of the library.

A measure of how similar the units within a cluster are to one another is given by the intra-cluster correlation coefficient $\rho$. We want $\rho$ to be as small as possible: a small value indicates that there is as much variability as possible within the clusters, so that each cluster can be regarded as a mini-population that reflects the properties of the whole population. If the clusters are very homogeneous, then sampling lots of SSUs within a selected cluster gives us less information about the population than it would if the clusters were internally diverse. For $N$ clusters of equal size $\bar{M}$ (in the notation of Section 11.3), \[\begin{equation} \rho = \frac{ \sum_{i=1}^N\sum_{j\neq k}(Y_{ij}-\bar{\bar{Y}})(Y_{ik}-\bar{\bar{Y}})}{ (\bar{M}-1)(N\bar{M}-1)S_Y^2} \end{equation}\] where $\bar{\bar{Y}}$ is the overall population mean and $S_Y^2$ is the population variance.

Cluster sampling consists of two steps: first we select the PSUs (the clusters), and then we select the SSUs within them. In a one-stage cluster sample, we do a census of each selected cluster (e.g. we select a class, and then survey all of the students in the class). In a two-stage cluster sample we use some sampling method to select a sample of the SSUs in each selected cluster.

The example above is a two-stage cluster sample: we selected a sample of classes, and then took a sample within each selected class.

11.2 Comparison with stratified sampling

In both stratified and cluster sampling we break the population up into groups before drawing the sample. However beyond this superficial resemblance stratified and cluster sampling are very different.

When we set up stratified sampling our primary goal is to reduce the variance of estimators. We do this by placing units which are similar to each other in the same stratum. We thus attempt to put as much of the variation present in the population into the difference between strata, and attempt to make strata as internally homogeneous as possible. After stratifying we sample units from within all of the strata.

In cluster sampling our primary goal is controlling the cost of creating the frame and collecting the sample. It is for this reason that the design effects of cluster designs may be worse than stratified and SRS designs. We do not survey all of the clusters, but only a random sample of them. It is more convenient to select a few clusters and to survey those, since all of the units will be close to one another (if the clustering is done geographically). Our hope in doing so is that the selected clusters still capture all of the variability present in the population. It is therefore in our interests that clusters be as internally variable as possible, and that the clusters resemble each other as much as possible.

11.3 Notation in cluster sampling

We need to extend our notation conventions to cope with cluster sampling. The population is divided into $N$ clusters or primary sampling units (PSUs). The $i^{\rm th}$ cluster contains $M_i$ secondary sampling units (SSUs). Thus the total number of SSUs in the population is $M=\sum_{i=1}^N M_i$. The mean cluster size is \[\begin{equation} \bar{M} = \frac{1}{N}\sum_{i=1}^N M_i = \frac{M}{N} \end{equation}\] so that $M=N\bar{M}$.

11.3.1 Population Quantities

Within each cluster the SSUs are labelled $j=1,\ldots,M_i$ and the value of variable $Y$ on the $j^{\rm th}$ unit in the $i^{\rm th}$ cluster is $Y_{ij}$. (Note that the SSUs may be our responding units, or we may need to have a further stage of sampling within selected SSUs to finally select the responding units.)

The total value of $Y$ within cluster $i$ is \[\begin{equation} Y_i=\sum_{j=1}^{M_i}Y_{ij} \end{equation}\] and the $i^{\rm th}$ cluster mean is \[\begin{equation} \bar{Y}_i=\frac{1}{M_i}\sum_{j=1}^{M_i} Y_{ij} = \frac{Y_i}{M_i} \end{equation}\] The overall population mean is written \[\begin{equation} \bar{\bar{Y}} = \frac{\sum_{i=1}^N\sum_{j=1}^{M_i} Y_{ij}}{M} = \frac{\sum_{i=1}^N\sum_{j=1}^{M_i} Y_{ij}}{\sum_{i=1}^NM_i} = \frac{\sum_{i=1}^N Y_i}{N\bar{M}} \end{equation}\] which is different from the mean of cluster totals \[\begin{equation} \bar{Y}_T = \frac{1}{N}\sum_{i=1}^N Y_i = \bar{M}\bar{\bar{Y}} \end{equation}\] The population variance \[\begin{equation} S_Y^2 = \frac{1}{N\bar{M}-1}\sum_{i=1}^N\sum_{j=1}^{M_i} (Y_{ij}-\bar{\bar{Y}})^2 \end{equation}\] is a combination of within PSU variation \[\begin{equation} S_i^2 = \frac{1}{M_i-1}\sum_{j=1}^{M_i} (Y_{ij}-\bar{Y}_i)^2 \end{equation}\] and between PSU variation – captured by the population variance of PSU totals: \[\begin{equation} S_T^2 = \frac{1}{N-1}\sum_{i=1}^N (Y_i-\bar{Y}_T)^2 \tag{11.1} \end{equation}\] The quantity $S_T^2$ is also known as the between cluster variance or variance of cluster totals.

11.3.2 Sample Quantities

In general we select $n$ PSUs, and select $m_k$ units from within the $k^{\rm th}$ selected PSU. In this chapter we assume that both selection methods are SRSWOR, although any probabilistic sampling method could be used. The sample of PSUs is called the first stage sample $s_I$. The SSUs chosen from within the selected PSUs form the second stage sample $s_{II}$. The sample members within the $k^{\rm th}$ selected PSU form the subsample $s_k$.

Thus $y_{k\ell}$ is the value of the variable $Y$ on the $\ell^{\rm th}$ selected unit ($\ell=1,\ldots,m_k$) from within the $k^{\rm th}$ selected cluster ($k=1,\ldots,n$).

The sample total for the $k^{\rm th}$ selected PSU is \[\begin{equation} y_k = \sum_{\ell\in s_k} y_{k\ell} \end{equation}\] and the sample mean is \[\begin{equation} \bar{y}_k = \frac{1}{m_k}\sum_{\ell\in s_k} y_{k\ell} = \frac{y_k}{m_k} \end{equation}\] This leads directly to the HT estimator of the cluster total under SRSWOR: \[\begin{equation} \widehat{Y}_k = \frac{M_k}{m_k}\sum_{\ell\in s_k} y_{k\ell} = M_k\bar{y}_k \end{equation}\] and hence to the HT estimators of the population total: \[\begin{equation} \widehat{Y}_{HT} = \frac{N}{n}\sum_{k=1}^n \widehat{Y}_k = \frac{N}{n}\sum_{k=1}^n \frac{M_k}{m_k}\sum_{\ell\in s_k} y_{k\ell} \end{equation}\] and the population mean: \[\begin{equation} \widehat{\bar{Y}}_{HT} = \frac{\widehat{Y}_{HT}}{M} = \frac{1}{n\bar{M}}\sum_{k=1}^n \widehat{Y}_k = \frac{1}{n\bar{M}}\sum_{k=1}^n \frac{M_k}{m_k}\sum_{\ell\in s_k} y_{k\ell} \end{equation}\] The within PSU sample variance is estimated by \[\begin{equation} s_k^2 = \frac{1}{m_k-1}\sum_{\ell\in s_k} (y_{k\ell}-\bar{y}_k)^2 \end{equation}\] the mean of cluster totals is estimated by \[\begin{equation} \bar{\estm{Y}}_T = \frac{1}{n}\sum_{k=1}^n \estm{Y}_k = \frac{\estm{Y}_{HT}}{N} \end{equation}\] and the estimated variance of PSU totals is \[\begin{equation} s_T^2 = \frac{1}{n-1}\sum_{k=1}^n (\widehat{Y}_k-\bar{\estm{Y}}_T)^2 \tag{11.2} \end{equation}\] NB: $s_T^2$ is not in general an unbiased estimator of $S_T^2$, although it is unbiased in the case of single stage cluster sampling.

11.4 Single Stage Cluster Sampling

Consider the situation where our population is formed into $N$ clusters of size $M_i$, and we take an SRSWOR sample of $n$ clusters from the $N$ clusters, and then survey all the population units in the $n$ selected clusters ($m_k=M_k$). The total sample size (of SSUs) is then $m=\sum_{k\in s}M_k$. Again the natural parameter to estimate is the total $Y$.

Example - School library books

Returning to the school library book example: assume that we have selected $n=3$ classes by SRSWOR out of the $N=12$ in the school, and that we have done a census of each selected class. (Note that in this case we can’t know the sample size in advance, since the clusters all have different sizes.)

The data are:

Class	Class size	Sample Indicator		Sample Size	Sample Data	Cluster totals
$i$	$M_i$	$I_i$	$k$	$m_k$	$y_{k1},\ldots,y_{km_k}$	$y_k=\sum_\ell y_{k\ell}$
1	9	1	1	9	3,4,4,4,2,4,6,4,4	35
2	10	0
3	20	0
4	8	0
5	7	0
6	16	1	2	16	1,6,4,6,3,2,5,4,1,6,0,6,4,2,2,1	53
7	12	0
8	14	0
9	10	0
10	6	0
11	8	0
12	10	1	3	10	3,3,6,4,5,2,3,5,1,4	36

11.4.1 Totals

Since we are not subsampling within the selected clusters (i.e. we are taking a census within clusters), the cluster total for the variable of interest is known exactly for each selected cluster. So we are effectively taking a sample (e.g. by SRSWOR) of size $n$ from $N$ units and measuring a new variable, the cluster total, rather than the individual values of the population units in the cluster.

Hence the obvious estimator for the total under SRSWOR: \[\begin{equation} \widehat{Y}_{1SC,SRSWOR} = \sum_{k\in s} \frac{y_k}{\pi_k} = \frac{N}{n}\sum_{k\in s} y_k \end{equation}\] where $y_k$ is the exact value for the cluster total of the $k^{\rm th}$ selected cluster (which may be the $i^{\rm th}$ population unit), and as usual $\pi_k$ is the first order inclusion probability for the $k^{\rm th}$ selected cluster. This estimator is unbiased (since it is an HT estimator), and its variance is just \[\begin{equation} \bfa{Var}{\widehat{Y}_{1SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{S_T^2}{n} \tag{11.3} \end{equation}\] where the variance of cluster totals $S_T^2$ is defined in Equation (11.1) above.

An estimate of the variance can be made from the sample as follows \[\begin{equation} \bfa{\widehat{Var}}{\widehat{Y}_{1SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} \end{equation}\] which uses $s_T^2$ (as defined in Equation (11.2)) as an unbiased estimate of the variance of cluster totals $S_T^2$.

Example - School library books

We view our situation as having done a SRSWOR of $n=3$ units from $N=12$. The total numbers of books taken home in the selected units are $(35,53,36)$, with sample mean $\bar{y}=41.3$ and standard deviation $s_T=s_y=10.12$ (the standard deviation of $y$ is the standard deviation of the cluster totals), so our best estimate of the total number of books taken home is \[ \widehat{Y}_{1SC,SRSWOR} = N\bar{y} = (12)(41.3) = 496 \] with estimated variance \[ \bfa{\widehat{Var}}{\widehat{Y}_{1SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} = 12^2\left(1-\frac{3}{12}\right)\frac{10.12^2}{3} = 3684 \] so the standard error of our estimate is $60.7$ (RSE=$0.122$).
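As a check on this calculation, the following sketch treats the three class totals as an SRSWOR sample of size $n=3$ from $N=12$. It uses only the Python standard library, and the variable names are our own.

```python
from math import sqrt
from statistics import mean, variance

N = 12                          # classes (PSUs) in the school
cluster_totals = [35, 53, 36]   # exact totals y_k for the censused classes
n = len(cluster_totals)

s2_T = variance(cluster_totals)   # sample variance of the cluster totals

# One-stage cluster estimator of the total and its estimated variance
total_hat = N * mean(cluster_totals)
var_hat = N**2 * (1 - n / N) * s2_T / n
se_hat = sqrt(var_hat)
rse = se_hat / total_hat

print(round(total_hat), round(var_hat), round(se_hat, 1), round(rse, 3))
```

The analysis is carried out entirely at the level of the class: the individual responses enter only through the cluster totals.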

In this estimate we aggregate up all the responses from the individual students, and do all our analyses at the level of the class.

11.4.2 Means

Clearly if the population size $M=N\bar{M}$ is known then an estimate of the population mean is given by \[\begin{equation} \widehat{\bar{Y}}_{1SC} = \frac{\widehat{Y}_{1SC}}{M} = \frac{\widehat{Y}_{1SC}}{N\bar{M}} \end{equation}\] with variance \[\begin{equation} \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{1SC,SRSWOR}} = \frac{N^2}{{M}^2}\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} \end{equation}\]

However it is commonly the case that the cluster sizes $M_i$ are known only in the $n$ selected clusters, in which case the true population size $M$ is unknown and must be estimated. The obvious way of doing this is to use the total estimator but now applied to the cluster size, i.e. \[\begin{equation} \widehat{M}= \frac{N}{n}\sum_{k \in s} M_k \end{equation}\] where $M_k$ is a within cluster total just like $y_k$.

Notice that in this case the estimator of the population mean is now a ratio of two random variables \[\begin{equation} \widehat{\bar{Y}}_{1SC,SRSWOR,R} = \frac{\widehat{Y}_{1SC,SRSWOR}}{\widehat{M}} = \frac{\sum_{k\in s}y_k}{\sum_{k\in s}M_k} \end{equation}\] so the expression for the variance needs appropriate modifications, given below.

Example - School library books

In the above example we know the sizes $M_i$ of all of the classes, so that we know the total population $M=\sum_{i=1}^N M_i=130$ and can also calculate the mean class size $\bar{M}=\frac{M}{N}=\frac{130}{12}=10.83$.

Our estimate of the population mean number of books per child is then just the estimate of the total divided by $M$: \[ \widehat{\bar{Y}}_{1SC} = \frac{\widehat{Y}_{1SC}}{M} = \frac{496}{130} = 3.82 \] with variance divided by ${M}^2$: \[ \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{1SC,SRSWOR}} = \frac{N^2}{{M}^2}\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} = \frac{12^2}{130^2}\left(1-\frac{3}{12}\right)\frac{10.12^2}{3} = 0.218 \] i.e. a standard error of $0.47$ (RSE=$0.122$).

11.4.3 Ratio estimator of the total

We have the HT estimator of the population total \[ \widehat{Y}_{1SC,SRSWOR} = \frac{N}{n}\sum_{k\in s} y_k \] Now we can expect the value of the within-cluster totals $y_k$ to be strongly correlated with the cluster size $M_k$, and that suggests the use of a ratio estimator to improve the variance using $M_i$ as the auxiliary variable: \[\begin{equation} \widehat{Y}_{1SC,SRSWOR,R} = \widehat{Y}_{1SC,SRSWOR}\frac{M}{\widehat{M}} = \frac{\sum_{k\in s} y_k}{\sum_{k\in s} M_k} \sum_{i=1}^N M_i \end{equation}\] Its variance is estimated (approximately) by \[\begin{eqnarray*} \bfa{\widehat{Var}}{\widehat{Y}_{1SC,SRSWOR,R}} &=& N^2\left(1-\frac{n}{N}\right)\frac{s_e^2}{n}\\ &=& N^2\left(1-\frac{n}{N}\right)\frac{1}{n(n-1)} \sum_{k=1}^n \left(y_k-\frac{\sum_{k'}y_{k'}}{\sum_{k'}M_{k'}}M_k\right)^2 \end{eqnarray*}\] where $s_e^2$ is the sample variance of the residuals $y_k-\frac{\sum_{k'}y_{k'}}{\sum_{k'}M_{k'}}M_k$. If the cluster means $y_k/M_k$ are all equal then every residual, and hence the estimated variance, is zero! This is an important property of cluster samples: they can be just as efficient as SRSWOR designs provided that all of the clusters are very similar.

11.4.4 Ratio estimator of the mean

If the total number of SSUs $M$ is unknown, then the HT estimator of the mean is unavailable, and the best estimator of the population mean is the ratio estimator \[\begin{eqnarray*} \widehat{\bar{Y}}_{1SC,SRSWOR,R} &=& \frac{\widehat{Y}_{1SC,SRSWOR}}{\widehat{M}}\\ &=& \frac{\frac{1}{n}\sum_{k\in s} y_k}{\frac{1}{n}\sum_{k\in s} M_k}\\ &=& \frac{\sum_{k\in s} y_k}{\sum_{k\in s} M_k} \end{eqnarray*}\] Its variance is estimated (approximately) by \[\begin{eqnarray*} \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{1SC,SRSWOR,R}} &=& \frac{N^2}{\widehat{M}^2}\left(1-\frac{n}{N}\right)\frac{s_e^2}{n}\\ &=& \left(\frac{n}{\sum_kM_k}\right)^2\left(1-\frac{n}{N}\right)\frac{1}{n(n-1)} \sum_{k=1}^n \left(y_k-\frac{\sum_{k'}y_{k'}}{\sum_{k'}M_{k'}}M_k\right)^2 \end{eqnarray*}\] where $s_e^2$ is the sample variance of the residuals $y_k-\frac{\sum_{k'}y_{k'}}{\sum_{k'}M_{k'}}M_k$.
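The notes do not work the ratio estimators through for the school library data, so the following sketch does so as an illustration. It assumes the one-stage data given earlier (class totals 35, 53, 36 from classes of size 9, 16, 10, with $N=12$ and $M=130$), and estimates the variance from the sample residuals $y_k-\widehat{B}M_k$, where $\widehat{B}=\sum_k y_k/\sum_k M_k$. All variable names are ours.

```python
from math import sqrt

N, M = 12, 130        # number of classes and (known) number of students
y_k = [35, 53, 36]    # totals in the three censused classes
M_k = [9, 16, 10]     # sizes of those classes
n = len(y_k)

# Ratio of books to students in the sample: the ratio estimator of the mean
B_hat = sum(y_k) / sum(M_k)

# Ratio estimator of the total uses the known population size M
total_ratio = B_hat * M

# Residuals of the cluster totals about the fitted ratio line
resid = [y - B_hat * size for y, size in zip(y_k, M_k)]
s2_e = sum(e**2 for e in resid) / (n - 1)

var_total = N**2 * (1 - n / N) * s2_e / n   # estimated variance of the total
M_hat = N / n * sum(M_k)                    # estimated population size
var_mean = var_total / M_hat**2             # estimated variance of the mean

print(round(B_hat, 2), round(total_ratio, 1))
print(round(sqrt(var_total), 1), round(sqrt(var_mean), 3))
```

On these data the standard error of the ratio estimator of the total comes out at roughly 21, against 60.7 for the HT estimator, because the class totals track the class sizes closely. With only $n=3$ clusters this comparison should be read as illustrative only.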

11.4.5 Remark

Where we do not know the size of the clusters at the time of sampling, we have a sample design for which the sample size of the population units is a random variable which is not well controlled if the cluster sizes are very unequal. So if most of the economic cost of the survey is in collecting the data from the sampled units within selected clusters, we could be faced with a large unknown cost when using single stage cluster designs.

11.4.6 Systematic Random Samples

The linear systematic random sample (LSRS) covered in Chapter 10 is actually a special case of a one-stage cluster sample. We divide the population (the list) into $L$ groups, starting at each of the $L$ first entries in the list and then taking the observations at intervals of $L$ down the list. We choose one of these groups at random.

The method of spacing out the members of each cluster along the list means that there is a good chance that any variations will be present in every cluster, so that the clusters will resemble each other strongly. This is the situation we want when designing a cluster sample: so that any cluster is a good proxy for the whole population.
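The view of a systematic sample as a one-stage cluster sample with $n=1$ can be sketched as follows. The population labels and the helper name `systematic_clusters` are hypothetical, chosen only for illustration.

```python
import random

def systematic_clusters(units, L):
    """Split an ordered list into the L possible systematic samples.

    Cluster r holds the units at positions r, r + L, r + 2L, ... in the list.
    """
    return [units[start::L] for start in range(L)]

population = list(range(1, 21))              # hypothetical list of 20 unit labels
clusters = systematic_clusters(population, L=5)

# Selecting one cluster at random is the same as choosing a random start
rng = random.Random(1)
sample = rng.choice(clusters)

print(clusters)
print(sample)
```

Every unit belongs to exactly one of the $L$ clusters, and each cluster is spread evenly along the list.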

11.5 Two stage Cluster Design

Now we consider the situation where we take a sample of clusters and then subsample within the cluster. Again we restrict ourselves to the case of SRSWOR at both stages of sampling.

Sampling therefore proceeds as follows:

First Stage. A sample of $n$ PSUs is selected by SRSWOR from the population of $N$ PSUs;
Second Stage. From the $k^{\rm th}$ selected PSU a sample of $m_k$ SSUs is drawn by SRSWOR.

The total sample size (of SSUs) is then $m=\sum_{k\in s}m_k$.

Example - School library books

We select $n=3$ classes from the $N=12$ classes in the school, and then select $m_k=4$ students from each of the selected clusters.

Class	Class size	Sample Indicator		Sample Size	Sample Data	Cluster means	Cluster std. devs
$i$	$M_i$	$I_i$	$k$	$m_k$	$y_{k1},\ldots,y_{km_k}$	$\bar{y}_k$	$s_k$
1	9	1	1	4	4,4,3,4	3.75	0.50
2	10	0
3	20	0
4	8	0
5	7	0
6	16	1	2	4	6,6,6,4	5.50	1.00
7	12	0
8	14	0
9	10	0
10	6	0
11	8	0
12	10	1	3	4	1,2,3,5	2.75	1.71

Unlike in the case of single-stage cluster sampling, we know in advance that we will achieve a total sample size of $m=12$ students in the $n=3$ classes (unless we find a class with fewer than $4$ students!).

11.5.1 Inclusion probabilities

We can define inclusion probabilities for each of the stages of sampling in an obvious way. We make the important assumption that the sampling at both stages is independent (i.e. the subsampling in one cluster is independent of the subsampling in any other cluster).

First Stage. The probability that PSU $i$ is selected into the sample is the familiar SRSWOR result: \[ \pi_i = \frac{n}{N} \] and the 2nd order inclusion probabilities are likewise \[ \pi_{ij} = \begin{cases} \frac{n(n-1)}{N(N-1)} & \text{if cluster $i\neq j$}\\ \frac{n}{N} & \text{if cluster $i=j$} \end{cases} \]
Second Stage. At the second stage we need to consider probabilities which are conditional on the first stage of selection. Thus given that the PSU $i$ has been selected the inclusion probability of SSU $j$ in PSU $i$ is \[ \pi_{j|i} = \frac{m_i}{M_i} \] since we are selecting $m_i$ units from $M_i$ in cluster $i$. The total inclusion probability of SSU $j$ in PSU $i$ is then \[ \pi_{(i)j} = \pi_{j|i}\pi_i = \frac{n}{N}\frac{m_i}{M_i} \] The second order inclusion probabilities are a little more complex. We consider the joint inclusion of unit $j$ from cluster $i$, and unit $\ell$ from cluster $k$: \[ \pi_{(i)j(k)\ell} = \begin{cases} \frac{n(n-1)}{N(N-1)}\frac{m_i}{M_i}\frac{m_k}{M_k} & \text{if cluster $i\neq k$}\\ \frac{n}{N}\frac{m_i(m_i-1)}{M_i(M_i-1)} & \text{if cluster $i=k$ and unit $j\neq\ell$}\\ \frac{n}{N}\frac{m_i}{M_i} & \text{if cluster $i=k$ and unit $j=\ell$}\\ \end{cases} \]
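The inclusion probabilities above translate directly into code. The sketch below uses exact fractions; the function names are ours, and the example values come from the two-stage school library sample ($n=3$, $N=12$, $m_k=4$).

```python
from fractions import Fraction

def ssu_pi(n, N, m_i, M_i):
    """Overall inclusion probability of an SSU in cluster i."""
    return Fraction(n, N) * Fraction(m_i, M_i)

def joint_pi(n, N, m_i, M_i, m_k=None, M_k=None, same_unit=False):
    """Joint inclusion probability of two SSUs.

    Pass m_k and M_k only when the two units lie in different clusters.
    """
    if m_k is not None:    # units in different clusters i and k
        return (Fraction(n * (n - 1), N * (N - 1))
                * Fraction(m_i, M_i) * Fraction(m_k, M_k))
    if same_unit:          # a unit paired with itself
        return ssu_pi(n, N, m_i, M_i)
    # two distinct units in the same cluster
    return Fraction(n, N) * Fraction(m_i * (m_i - 1), M_i * (M_i - 1))

n, N = 3, 12
M_k = [9, 16, 10]          # sizes of the selected classes
m_k = [4, 4, 4]            # students sampled per class
sample_sums = [15, 22, 11] # sums of the sampled values in each class

# Sampling weights are the inverse inclusion probabilities
weights = [1 / ssu_pi(n, N, m, M) for m, M in zip(m_k, M_k)]
ht_total = sum(w * s for w, s in zip(weights, sample_sums))

print(weights, ht_total)
print(joint_pi(n, N, 4, 9, 4, 16), joint_pi(n, N, 4, 9))
```

Because $m_k=4$ and $n/N=1/4$ here, each sampled student's weight happens to equal the size of their class.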

11.5.2 Totals

Since we are subsampling SRSWOR the obvious estimator for the totals for the sampled clusters is \[ \widehat{Y}_k = \frac{M_k}{m_k}\sum_{\ell\in s_k}y_{k\ell} \] where $s_k$ means the sample of population units selected at the second stage of sampling, i.e. sampling of population within the selected clusters. That is, the estimator of the total in a two stage cluster design is \[ \widehat{Y}_{2SC,SRSWOR} = \frac{N}{n}\sum_{k\in s_I} \widehat{Y}_k \tag{11.4} \] where $s_I$ means the sample of clusters selected at the first stage. Or in full,
\[ \widehat{Y}_{2SC,SRSWOR} = \frac{N}{n}\sum_{k\in s_I} \frac{M_k}{m_k}\sum_{\ell\in s_k}y_{k\ell} = \frac{N}{n}\sum_{k=1}^n M_k\bar{y}_k = \frac{N}{n}\sum_{k=1}^n \widehat{Y}_k \tag{11.5} \] It has population variance \[\begin{equation} \bfa{Var}{\widehat{Y}_{2SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{S_T^2}{n} + \frac{N}{n}\sum_{i=1}^{N}M_i^2 \left(1-\frac{m_i}{M_i}\right)\frac{S_i^2}{m_i} \tag{11.6} \end{equation}\] where $m_i$ is the number of SSUs that would be subsampled from cluster $i$ if it were selected. The first term is identical to the variance of a single stage design (Equation (11.3)) and so is a function of the between cluster variation, $S_T^2$. The second is an additional term due to subsampling within the selected clusters, and depends on the within cluster variances $S_i^2$.

To get an estimator of the variance of the total estimator from one sample it turns out that we simply put in the sample analogues of $S_T^2$ and $S_i^2$ in the above formula, specifically \[\begin{equation} \bfa{\widehat{Var}}{\widehat{Y}_{2SC,SRSWOR}} = N^2\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} + \frac{N}{n}\sum_{k\in s_I}M_k^2 \left(1-\frac{m_k}{M_k}\right)\frac{s_k^2}{m_k} \end{equation}\] Recall that whereas $s_k^2$ is an unbiased estimator for $S_k^2$, $s_T^2$ is not unbiased for $S_T^2$. This is because the between cluster variance must depend on the within cluster variabilities $S_i^2$, which are unknown and must be estimated in two-stage sampling.

As in the single stage cluster design the variable cluster size affects considerably the variance of the estimator. We can use a ratio estimator and this will reduce the first term of the variance formula (Equation (11.6)).

11.5.3 Means

Estimates of means and their standard errors come from dividing the estimated total $\widehat{Y}$ and its standard error $\bfa{SE}{\widehat{Y}}$ by the population size (number of SSUs) $M$: \[\begin{eqnarray*} \widehat{\bar{Y}}_{2SC,SRSWOR} &=& \frac{\widehat{Y}_{2SC,SRSWOR}}{M}\\ \bfa{SE}{\widehat{\bar{Y}}_{2SC,SRSWOR}} &=& \frac{\bfa{SE}{\widehat{Y}_{2SC,SRSWOR}}}{M} \end{eqnarray*}\] using Equation (11.5) and the sample estimate of Equation (11.6) respectively.

Example - School library books

Let’s apply these results to our example. It’s important to keep the notation correct here:

We have a population with $N=12$ clusters containing $M=130$ students
We have selected $n=3$ clusters
The sizes of the selected clusters are $M_k=(9,16,10)$
The sample sizes in the selected clusters are $m_k=(4,4,4)$ (i.e. the same in each cluster)
The cluster means are $\bar{y}_k=(3.75,5.5,2.75)$
We estimate the cluster totals using the standard HT estimator \[ \begin{split} \widehat{Y}_k &= M_k\bar{y}_k\\ \widehat{Y}_1 &= 9\times3.75 = 33.75\\ \widehat{Y}_2 &= 16\times5.5 = 88\\ \widehat{Y}_3 &= 10\times2.75 = 27.5 \end{split} \] and the variance of these totals is $s_T^2=1107.1$
The estimate of the total number of books taken home by all students in all classes is then \[ \widehat{Y}_{2SC,SRSWOR} = \frac{N}{n}\sum_{k=1}^n \widehat{Y}_k = \frac{12}{3}\left( 33.75 + 88 + 27.5 \right) = \frac{12}{3}\times 149.25 = 597 \]
The variance of this estimator is given by \[ \begin{split} \bfa{\widehat{Var}}{\widehat{Y}_{2SC,SRSWOR}} &= N^2\left(1-\frac{n}{N}\right)\frac{s_T^2}{n} + \frac{N}{n}\sum_{k\in s_I}M_k^2 \left(1-\frac{m_k}{M_k}\right)\frac{s_k^2}{m_k}\\ &= (12)^2\left(1-\frac{3}{12}\right)\frac{1107.1}{3}\\ & \qquad + \frac{12}{3}\left[ 9^2\left(1-\frac{4}{9}\right)\frac{0.5^2}{4} +16^2\left(1-\frac{4}{16}\right)\frac{1^2}{4} \right.\\ & \qquad\qquad \left. + 10^2\left(1-\frac{4}{10}\right)\frac{1.71^2}{4} \right]\\ &=3.98542\times 10^{4} + 378.2 = 4.02325\times 10^{4} \end{split} \] so that the standard error is $200.6$. This variance is dominated by the first term, which depends on the variability between the cluster totals.
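The steps in this worked example can be reproduced with the sketch below, which uses only the Python standard library. The variable names (`between`, `within`, and so on) are ours.

```python
from math import sqrt
from statistics import mean, variance

N = 12                      # classes in the school
M_k = [9, 16, 10]           # sizes of the selected classes
samples = [[4, 4, 3, 4],    # sampled values in class 1
           [6, 6, 6, 4],    # class 6
           [1, 2, 3, 5]]    # class 12
n = len(samples)

# Estimated cluster totals M_k * ybar_k and their sample variance s_T^2
Y_hat_k = [size * mean(ys) for size, ys in zip(M_k, samples)]
s2_T = variance(Y_hat_k)

total_hat = N / n * sum(Y_hat_k)

# First term: variability between the estimated cluster totals
between = N**2 * (1 - n / N) * s2_T / n

# Second term: extra variability from subsampling within each class
within = N / n * sum(
    size**2 * (1 - len(ys) / size) * variance(ys) / len(ys)
    for size, ys in zip(M_k, samples)
)

var_hat = between + within
se_hat = sqrt(var_hat)

print(round(total_hat), round(between, 1), round(within, 1), round(se_hat, 1))
```

Printing the two terms separately makes it easy to see how strongly the between-cluster term dominates.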

Collecting together the estimates we have of the total number of books taken home by students in this primary school:

Method	Sample Size $m$	$\widehat{Y}$	$\bfa{Var}{\widehat{Y}}$	$\bfa{SE}{\widehat{Y}}$	$\bfa{RSE}{\widehat{Y}}$	VarRatio	Deff
SRSWOR	12	314	3592.9	59.9	0.19	1.00	1.00
1SC	35	496	3684.0	60.7	0.12	1.03	3.08
2SC	12	597	40232.5	200.6	0.34	11.20	10.13

(The Deff’s here are estimated using iNZight.)

Comparing 2SC with SRSWOR we can see a very poor Deff (10.13). This is not a particularly good estimate of the Deff, due to our small sample size, but does show that cluster designs can be much less efficient than SRSWOR.

11.6 Design of cluster samples

Both in single stage and two stage cluster designs we can try to overcome the problem of variable cluster size without resorting to ratio estimators by considering a design which selects the clusters using a PPSWR design using the cluster size, or at least a recent value of it, as the size measure. However, such designs also have problems. For example, if the sampling fraction of the first stage clusters is not very small, then the probability of selecting the same cluster twice is reasonably high, and the design may not be as efficient as the one we have been considering.

In such a two stage cluster design we can specify in advance the sample size of the population units. However, we have to decide how many clusters to select and how many units to subsample within each cluster. If, as is often the case in practice, the first term of the variance formula (Equation (11.6)) is considerably larger than the second term then it makes sense to sample more clusters and subsample fewer units within the cluster. But against this is the fact that generally the cost of collecting the data is higher the more clusters are selected: e.g. if one is personally interviewing respondents in a household survey, then travel to a cluster of dwellings is generally more expensive than travel around a cluster of dwellings.

11.6.1 Formation of clusters

Clusters are often naturally occurring units (e.g. cities or nests or shoals of fish). But sometimes we are able to form the clusters – e.g. we can amalgamate suburbs or census area units to form clusters of any desired size. Where we are able to form the clusters ourselves, we should try to form them so that $S_T^2$ is as small as possible and the heterogeneity of the clusters $S_i^2$ is as large as possible. This will generally make the variance of the estimator smaller.

The trade-off of within and between variability can be summarised using an adjusted $R^2$ statistic defined as \[ R_a^2 = 1 - \frac{N\bar{M}-1}{N(\bar{M}-1)} \frac{\sum_{i=1}^N(M_i-1)S_i^2}{\sum_i\sum_j(Y_{ij}-\bar{\bar{Y}})^2} = 1-\frac{\sum_{i=1}^N(M_i-1)S_i^2}{N(\bar{M}-1)S_Y^2} \] This definition contains a ratio between a measure of intra-cluster variability (the pooled within cluster variance) and the total variability $S_Y^2$. If the clusters are very homogeneous then the $S_i^2$ are all very small and $R_a^2$ is close to 1. In this case cluster sampling will be very inefficient. If, however, almost all the variability of the whole population is seen within the clusters, then $R_a^2$ will be close to zero, and cluster sampling will do almost as well as SRSWOR.

If the clusters are all (nearly) the same size $\bar{M}$, an estimate of the design effect for a cluster design is given by \[ \bfa{\widehat{Deff}}{\widehat{Y}_\text{cluster}} = 1 + \frac{N(\bar{M}-1)}{N-1}R_a^2 \] Hence a larger $R_a^2$, corresponding to more internally homogeneous clusters, leads to a larger design effect (i.e. a larger variance). Where the cluster sizes are all equal then $R_a^2$ is approximately the same as the intra-cluster correlation coefficient $\rho$, which is sometimes quoted as a measure of within cluster homogeneity. If the average number of SSUs sampled per cluster is $\bar{m}$ then the design effect for one-stage cluster sampling is approximately \[ \bfa{\widehat{Deff}}{\widehat{Y}_{1SC}} = 1 + \rho(\bar{m}-1) \]
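To see how $R_a^2$ drives the design effect, the sketch below evaluates both for two made-up populations of $N=3$ equal-sized clusters containing the same twelve values, arranged once into homogeneous clusters and once into heterogeneous ones. The data and function name are hypothetical. It assumes $R_a^2 = 1 - \text{MSW}/S_Y^2$, where MSW is the pooled within cluster variance $\sum_i(M_i-1)S_i^2/(N(\bar{M}-1))$.

```python
from statistics import variance

def adjusted_r2_and_deff(clusters):
    """R_a^2 and the implied design effect for equal-sized clusters."""
    N = len(clusters)
    M_bar = len(clusters[0])
    all_values = [y for c in clusters for y in c]
    S2_Y = variance(all_values)   # population variance, divisor N*M_bar - 1
    # Pooled within cluster variance (MSW)
    msw = sum((len(c) - 1) * variance(c) for c in clusters) / (N * (M_bar - 1))
    r2_a = 1 - msw / S2_Y
    deff = 1 + N * (M_bar - 1) / (N - 1) * r2_a
    return r2_a, deff

# Hypothetical populations: the same 12 values grouped in two different ways
homogeneous = [[1, 1, 2, 2], [5, 5, 6, 6], [9, 9, 10, 10]]
heterogeneous = [[1, 2, 9, 10], [1, 5, 6, 10], [2, 5, 6, 9]]

r2_hom, deff_hom = adjusted_r2_and_deff(homogeneous)
r2_het, deff_het = adjusted_r2_and_deff(heterogeneous)

print(round(r2_hom, 3), round(deff_hom, 2))
print(round(r2_het, 3), round(deff_het, 2))
```

In the heterogeneous arrangement every cluster has the same total, so $S_T^2=0$, $R_a^2$ is negative and the design effect is zero; in the homogeneous arrangement $R_a^2$ is close to 1 and the design effect is well above 1.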

It is possible for $R_a^2$ and $\rho$ to be negative, in which case the clusters are almost identical in their properties: i.e. there is no variability between clusters and all the population variability is within clusters. This situation can occur in systematic sampling. In systematic sampling we break the population into a set of $L$ distinct samples, and then select one of them randomly – hence systematic sampling is just a special case of one-stage cluster sampling.

In general cluster sampling rarely does better than SRSWOR, but systematic sampling where an auxiliary variable is used to order the population is one case where cluster sampling does perform better.

In summary, the optimal choice of number of clusters sampled and number of units subsampled within clusters can be decided on variance grounds, or on economic cost grounds but ideally on a mixture of both.

11.7 Multistage designs

We may want to take a third stage of sampling (i.e. we may wish to sample within SSUs). The ideas presented in this chapter can be simply extended in this case.

For example, the Statistics NZ Household sampling frame is a multistage cluster design, where the sample is selected as follows:

NZ is stratified into $H=12$ regions (e.g. Northland, Auckland, Hawke’s Bay etc.);
Each stratum (region) is broken into $N_h$ PSUs: which are amalgamations of a few meshblocks. Each PSU contains (on average) 100 households.
A SRSWOR of $n_h$ PSUs is taken within each stratum $h$; (First stage sample.)
Within the $k^{\rm th}$ selected PSU, which is of size $M_{hk}$, a linear systematic random sample is taken of $m_{hk}$ households; (Second stage sample.)
Depending on the survey, all $A_{hk\ell}$ eligible members of the $\ell^{\rm th}$ selected household will be surveyed, or a SRSWOR of $a_{hk\ell}$ members may be taken. (Third stage sample.)

The sample size is therefore \[ \text{Sample Size} = \sum_{h=1}^H\sum_{k\in s_I}\sum_{\ell\in s_k} a_{hk\ell} \] and the sample weights of individual $j$ in household $\ell$ of PSU $k$ in stratum $h$ are \[ w_{hk\ell j} = \frac{N_h}{n_h}\frac{M_{hk}}{m_{hk}} \frac{A_{hk\ell}}{a_{hk\ell}} \] A ratio estimator (based on cluster size as discussed above) can be used to improve the variance of estimators under this design. Standard errors are computed using numerical resampling methods (jackknife), since the theoretical estimates are not sufficiently accurate.
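The weight formula is simply a product of inverse sampling fractions, one per stage. A minimal sketch follows; the function name and the example numbers are hypothetical and are not taken from any actual Statistics NZ survey.

```python
def multistage_weight(N_h, n_h, M_hk, m_hk, A_hkl, a_hkl):
    """Sampling weight of a person selected through three stages.

    N_h / n_h     : PSUs in the stratum / PSUs sampled
    M_hk / m_hk   : households in the PSU / households sampled
    A_hkl / a_hkl : eligible people in the household / people sampled
    """
    return (N_h / n_h) * (M_hk / m_hk) * (A_hkl / a_hkl)

# Hypothetical example: 20 of 500 PSUs, 10 of 100 households, 1 of 3 people
w = multistage_weight(N_h=500, n_h=20, M_hk=100, m_hk=10, A_hkl=3, a_hkl=1)
print(w)
```

When every eligible household member is surveyed, $a_{hk\ell}=A_{hk\ell}$ and the last factor is 1.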

Note that although a LSRS is taken within PSUs, the analysis is done as if it were a SRSWOR. An LSRS is taken to space out the sample within the PSU, but it is reasonable to expect that the properties of such a sample are similar to those of a SRSWOR of that PSU.

12 Regression Estimation

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$ $\newcommand{\estm}{\widehat}$

Regression models are everywhere in statistics. When we are working with sample survey data we are often interested not only in population level estimates, but also in the relationships between variables in the data.

12.1 Regression models with survey data

In regression we assume a relationship between one variable – the outcome, $Y$ – and a set of explanatory variables – the predictors, $X$.

For example if we have just one predictor variable then the simple linear regression model is
\[\begin{equation} Y_i = \alpha + \beta X_i + \varepsilon_i \qquad \text{for $i=1,\ldots,N$} \tag{12.1} \end{equation}\] Here $\alpha$ and $\beta$ are the familiar intercept and slope of a linear regression problem with one variable. The errors $\varepsilon_i$ are assumed to be independent and identically distributed, with a Normal distribution with mean 0 and variance $\sigma^2$.

The focus of our interest is usually on the value of $\beta$ – which quantifies the relationship (if any exists) between $X$ and $Y$.

Fitting regression models to sample survey data is possible, but does require some careful thought: we need to be aware that the data may not necessarily be a simple random sample, and that the data may have unequal weights.

Some software packages can accept weights in regression, but one must be very careful to recognise that the word ‘weight’ can mean one of three different things.

Frequency Weights – the weight is the number of times the observation appears in the dataset – this is a compact way of representing data where there are many repetitions. An observation has high weight if it has been seen multiple times;
Precision Weights – when there is measurement error, an observation has high precision weight if it is measured very precisely, lower weight if it was measured imprecisely;
Sampling Weights – these are our survey sampling weights: the inverse of the probability of selection. An observation has high weight if it had a low probability of selection, and represents a large number of units in the population.

Statistical software that allows weights may be a bit vague in the documentation about what the weights mean. It’s imperative to be sure that you know which type of weights the software is using.

Through the specification of sample designs, iNZight treats sample weights correctly, and we can be confident about the results that it produces. The Advanced Modelling window allows the fitting of a variety of regression models, and returns parameter estimates, standard errors and confidence intervals allowing us to use survey data to investigate relationships.

There is some discussion about whether or not sampling weights actually need to be included in regression analyses, even with survey data. However the consensus is that it (almost) never harms your analysis to include the weights, and there are times when it can save you from biased estimates.

12.2 Regression estimation

We came across the idea of regression imputation in our analysis of non-response. We fitted a relationship between $X$ and $Y$ for the observed data, and imputed values for $Y$ where they were missing using the predicted values from the regression model: \[ \widehat{y}_k = \widehat{\alpha} + \widehat{\beta} x_k \] We could do this within the sample when we had measured values of $x_k$ for all of the respondents (including the ones where $y_k$ was missing).

This suggests a more general approach to survey sampling estimation. We can regard all of the population members that we didn’t select as non-respondents, and use a regression method to impute their unobserved $Y$ values. Then we can make estimates of the population mean and total of $Y$ using those imputed values.

This is a very powerful idea.

Consider the set of $N=16$ NZ regions and their volumes of retail trade in 2019, listed in Table 12.1 and also plotted in Figure 12.1.

Table 12.1: Retail trade volumes and populations in NZ regions in 2019
	Region	Volume ($M)	Population	Population (000)
1	Northland	827.2	188700	188.7
2	Auckland	10146.2	1642800	1642.8
3	Waikato	2445.5	482100	482.1
4	Bay of Plenty	1714.3	324200	324.2
5	Gisborne	201.8	49300	49.3
6	Hawke’s Bay	829.8	173700	173.7
7	Taranaki	549.6	122700	122.7
8	Manawatu-Wanganui	1150.7	249700	249.7
9	Wellington	2578.3	527800	527.8
10	Tasman	250.3	54800	54.8
11	Nelson	320.1	52900	52.9
12	Marlborough	256.4	49200	49.2
13	West Coast	180.6	32600	32.6
14	Canterbury	3409.7	628600	628.6
15	Otago	1470.3	236200	236.2
16	Southland	543.8	101200	101.2

Figure 12.1: Retail trade volumes and populations in NZ regions in 2019

It’s very clear, and also logical, that population and volume of retail trade are correlated with one another. And it’s also clear that if we took a random sample of regions and happened to miss out Auckland that we’d severely underestimate the amount of retail trade.

The true total is $Y=2.68746\times 10^{4}$, and the standard deviation across the 16 regions is $S_Y=2459.8$ (both in millions of dollars).

If we were to take a SRSWOR of $n=6$ regions then the standard error of an estimate of the population total retail trade derived from a sample of size $n=6$ would be \[\begin{eqnarray*} \bfa{SE}{\widehat{Y}} &=& \sqrt{N^2\left(1-\frac{n}{N}\right)\frac{S_Y^2}{n}}\\ &=& \sqrt{(16)^2\left(1-\frac{6}{16}\right)\frac{2459.8^2}{6}}\\ &=& 1.27023\times 10^{4} \end{eqnarray*}\] which would give an RSE of \[\begin{eqnarray*} \bfa{RSE}{\widehat{Y}} &=& \frac{\bfa{SE}{\widehat{Y}}}{Y}\\ &=& \frac{1.27023\times 10^{4}}{2.68746\times 10^{4}}\\ &=& 0.47 \end{eqnarray*}\] An appallingly bad 47%.
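The arithmetic above can be checked with a few lines of Python. This is only a sketch: the volumes are typed in from Table 12.1 and the variable names are illustrative.

```python
import math
import statistics

# Retail trade volumes ($M) for the N = 16 regions in Table 12.1
volume = [827.2, 10146.2, 2445.5, 1714.3, 201.8, 829.8, 549.6, 1150.7,
          2578.3, 250.3, 320.1, 256.4, 180.6, 3409.7, 1470.3, 543.8]

N = len(volume)
n = 6                              # planned sample size
Y = sum(volume)                    # true population total
S_Y = statistics.stdev(volume)     # standard deviation with divisor N - 1

# Standard error of the estimated total under SRSWOR, and its relative size
se_total = math.sqrt(N**2 * (1 - n / N) * S_Y**2 / n)
rse = se_total / Y
print(f"Y = {Y:.1f}, S_Y = {S_Y:.1f}, SE = {se_total:.0f}, RSE = {rse:.2f}")
```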

If we take an actual sample of $n=6$ observations (Table 12.2) then our sample statistics are $n=6$, $\bar{y}=691.8$ and $s_y=543.1$.

Table 12.2: Retail trade volumes and populations in NZ regions in 2019: SRS of 6 regions
	Region	Volume ($M)	Population	Population (000)
1	Northland	827.2	188700	188.7
5	Gisborne	201.8	49300	49.3
8	Manawatu-Wanganui	1150.7	249700	249.7
11	Nelson	320.1	52900	52.9
13	West Coast	180.6	32600	32.6
15	Otago	1470.3	236200	236.2

Figure 12.2: Sample of retail trade volumes and populations in NZ regions in 2019

This leads to an estimate of the total national volume of \[ \widehat{Y} = N\bar{y} = (16)(691.8) = 1.1069\times 10^{4} \] which – due to Auckland not being in the sample – is dramatically less than the truth ($2.6875\times 10^{4}$), and has standard error \[\begin{eqnarray*} \bfa{SE}{\widehat{Y}} &=& \sqrt{N^2\left(1-\frac{n}{N}\right)\frac{s_y^2}{n}}\\ &=& \sqrt{(16)^2\left(1-\frac{6}{16}\right)\frac{543.1^2}{6}}\\ &=& 2804 \end{eqnarray*}\] This standard error is also underestimated (again, due to the absence of Auckland), and we have a false sense of security. The estimated RSE ($2804/(1.1069\times 10^{4})=0.25$) is still very bad though (primarily due to the small sample size).
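The same calculation for the sample of six regions can be sketched as follows. The values are those of Table 12.2; results may differ from the text in the last digit because the text rounds $\bar{y}$ and $s_y$ before multiplying.

```python
import math
import statistics

N = 16
# Volumes ($M) of the six sampled regions in Table 12.2
y = [827.2, 201.8, 1150.7, 320.1, 180.6, 1470.3]
n = len(y)

y_bar = statistics.mean(y)
s_y = statistics.stdev(y)

total_hat = N * y_bar                                  # expansion estimate of the total
se_hat = math.sqrt(N**2 * (1 - n / N) * s_y**2 / n)    # estimated standard error
rse_hat = se_hat / total_hat
print(f"ybar = {y_bar:.1f}, s_y = {s_y:.1f}")
print(f"total = {total_hat:.0f}, SE = {se_hat:.0f}, RSE = {rse_hat:.2f}")
```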

Regression estimation is a way to protect us in situations like this. If the value of the population $X_i$ is known for all regions, but we only have data on retail trade volumes $Y_i$ for a sample of regions, then we can fit a regression relationship between $Y$ and $X$ in the sample. A sensible model is not the usual linear regression model but instead a no-intercept model: \[ Y_i = \beta X_i + \varepsilon_i \] in which retail trade $Y_i$ is proportional to regional population size $X_i$.

Fitting this model using iNZight, or a similar package that incorporates survey sample designs, we obtain estimates of $\beta$ and its standard error: \[ \widehat{\beta} = 5.145448 \qquad \bfa{SE}{\widehat{\beta}} = 0.401191 \] We can use this fitted model to predict the value of $Y_i$ for all of the unobserved regions.
\[ \widehat{Y}_i = \widehat{\beta} X_i \] For example, we estimate the retail trade in Auckland ($i=2$, population $X_2=1642.8$ thousand people) as \[ \widehat{Y}_2 = \widehat{\beta} X_2 = (5.145448)(1642.8) = 8452.9 \] This estimate has standard error \[ \bfa{SE}{\widehat{Y}_2} = \bfa{SE}{\widehat{\beta}X_2} = \bfa{SE}{\widehat{\beta}} X_2 = (0.401191)(1642.8) = 659 \]

All of the true data $Y_i$ (of which we have only sampled 6 points) and the estimated data $\widehat{Y}_i$ are plotted in Figure 12.3.
The model does a great job at estimating retail trade volumes using the population.

Figure 12.3: Regression estimation: at left is the sample and the fitted regression model. At right we have used the model to predict all of the data (+ symbols), and have plotted the true data (open circles) for comparison

We can make an estimate of the total retail trade over all regions using these estimated volumes: \[\begin{eqnarray*} \widehat{Y} &=& \sum_{i=1}^N \widehat{Y}_i\\ &=& \sum_{i=1}^N \widehat{\beta} X_i\\ &=& \widehat{\beta} \sum_{i=1}^N X_i\\ &=& \widehat{\beta} X \end{eqnarray*}\] where $X=\sum_{i=1}^N X_i=4916.5$ is the total population (in thousands) of all regions combined. Thus \[ \widehat{Y} = (5.145448)(4916.5) = 2.5298\times 10^{4} \] which is now much closer to the true value of $2.68746\times 10^{4}$. The standard error of the estimate is \[ \bfa{SE}{\widehat{Y}} = \bfa{SE}{\widehat{\beta}} X = (0.401191)(4916.5) = 1972 \] which gives an RSE of $1972/(2.5298\times 10^{4})=0.078=7.8$%. This reduction from an apparent RSE of 25% to 7.8% is a dramatic improvement in the quality of our estimate, which still comes from a very small sample.
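The regression estimate can be reproduced without survey software. The sketch below fits the no-intercept model by least squares (all weights are equal under SRSWOR). The standard error formula is an assumption: it is a design-based linearisation variance with the finite population correction, which reproduces the figure quoted above, but the notes do not state exactly how the software computes it.

```python
import math

N = 16
X_total = 4916.5   # total population (thousands) of all 16 regions, from the text
# Sampled regions (Table 12.2): population in thousands and volume in $M
x = [188.7, 49.3, 249.7, 52.9, 32.6, 236.2]
y = [827.2, 201.8, 1150.7, 320.1, 180.6, 1470.3]
n = len(x)

# Least-squares slope of the no-intercept model Y = beta * X + error
sxx = sum(xi * xi for xi in x)
beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx

# Assumed linearisation variance of the slope, including the fpc (1 - n/N)
resid = [yi - beta * xi for xi, yi in zip(x, y)]
var_beta = ((1 - n / N) * n / (n - 1)
            * sum((xi * ei) ** 2 for xi, ei in zip(x, resid)) / sxx**2)
se_beta = math.sqrt(var_beta)

# Regression estimate of the population total and its standard error
total_hat = beta * X_total
se_total = se_beta * X_total
print(f"beta = {beta:.6f}, SE(beta) = {se_beta:.4f}")
print(f"total = {total_hat:.0f}, SE = {se_total:.0f}, RSE = {se_total / total_hat:.3f}")
```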

12.3 More complex regression models

We can generalise this simple example by using a whole set of auxiliary variables $X^{(1)}_i, X^{(2)}_i, X^{(3)}_i, \ldots$ and modelling the dependence of the outcome $Y_i$ as a linear regression problem: \[\begin{eqnarray}\nonumber Y_i &=& \beta_0 + \beta_1 X^{(1)}_i + \beta_2 X^{(2)}_i + \beta_3 X^{(3)}_i + \ldots + \varepsilon_i\\ \tag{12.2} &=& \beta_0 + \sum_{j=1}^q \beta_j X^{(j)}_i + \varepsilon_i \end{eqnarray}\] or we can go further still and allow $Y$ to depend on the set of auxiliary variables in some more complex form \[\begin{equation} Y_i = f(X^{(1)}_i, X^{(2)}_i, \ldots; \theta) + \varepsilon_i \end{equation}\] where $f()$ is a function whose form we choose, and $\theta$ are parameters that are to be estimated.

The most common regression model used is the linear model (Equation (12.2)). We set up the model with the $q$ variables of interest. For example in the case of modelling military expenditure in a set of countries we might include GDP and population as sensible predictor variables. In that case our model would be \[ Y_i = \beta_0 + \beta_1 \text{(population)}_i + \beta_2 \text{(GDP)}_i + \varepsilon_i \] which in matrix notation can be written ${\bf Y}=X\boldsymbol{\beta}+\boldsymbol{\varepsilon}$.

Software such as iNZight (and more particularly the R survey package which iNZight uses) can be used for all of the familiar regression models that you will have seen in other settings. What is important though, is that the sample design must be appropriately incorporated. If the sample design is not included in the analysis we’ll end up with biased estimates, incorrect standard errors, and erroneous conclusions.

12.4 Ratio estimation

In the standard no-intercept regression model we assume that \[ Y_i = \beta X_i + \varepsilon_i \qquad\text{with}\qquad \varepsilon_i\overset{\rm iid}{\sim}N(0,\sigma^2) \] That is, the variance of the errors is constant $\sigma^2$.

If we assume that all the units are independent, but that the variance of each individual $Y_i$ value is $\sigma_i^2=\sigma^2X_i$ then we have the slightly different model \[ Y_i = \beta X_i + \varepsilon_i \qquad\text{with}\qquad \varepsilon_i\overset{\rm ind}{\sim}N(0,\sigma^2X_i) \] In this model the variance of the errors is $\sigma^2 X_i$: i.e. it grows with $X_i$. Thus where $X$ is larger there is a larger scatter of $Y$ around the model.

The best estimate of $\beta$ in this model is \[\begin{eqnarray*} \widehat{{\beta}}_R &=& \frac{\sum_k y_k}{\sum_k x_k}\\ &=& \frac{\bar{y}}{\bar{x}} \end{eqnarray*}\] This is what is known as the classical ratio estimator $\widehat{\beta}_R$. This estimator is very close to the constant variance estimator, but its variance is slightly different: \[ \bfa{Var}{\widehat{\beta}_R} = \frac{1}{\bar{x}^2}\left(1-\frac{n}{N}\right)\frac{s_e^2}{n} \] where $s_e^2$ is the variance of the residuals \[ s_e^2 = \frac{1}{n-1}\sum_{k=1}^n (y_k-\widehat{y}_k)^2 \] and $\bar{x}$ is the sample mean. It leads to a familiar looking variance estimator for the population mean: \[ \bfa{Var}{\widehat{\bar{Y}}_R} = \left(1-\frac{n}{N}\right)\frac{s_e^2}{n} \] the difference being that the sample variance $s_y^2$ is replaced by the residual variance $s_e^2$.

Continuing the retail trade example above: we have the sample mean of retail trade being $\bar{y}=691.8$ and the sample mean of the population $\bar{x}=134.9$ (in thousands of people).

First estimate $\beta$ (the value quoted uses the unrounded sample means): \[ \widehat{\beta}_R = \frac{\bar{y}}{\bar{x}} = \frac{691.8}{134.9} = 5.12812 \] Next compute the fitted values $\widehat{y}_k=\widehat{\beta}_Rx_k$ and the residuals \[\begin{eqnarray*} e_k &=& \text{Observed} - \text{Calculated}\\ &=& y_k - \widehat{y}_k\\ &=& y_k - \widehat{\beta}_Rx_k \end{eqnarray*}\] For example: \[\begin{eqnarray*} \widehat{y}_1 &=& \widehat{\beta}_Rx_1\\ &=& (5.12812)(188.7)\\ &=& 967.7\\ e_1 &=& y_1 - \widehat{y}_1\\ &=& 827.2 - 967.7\\ &=& -140.5 \end{eqnarray*}\]

All of the fitted values $\widehat{y}_k$ and the residuals $e_k$ are shown in Table 12.3. The standard deviation of the residuals is $s_e=147.53$.

Table 12.3: Sample with fitted values and residuals
Area, $i$	Region	Volume, $y_k$	Population (000), $x_k$	Fitted Volumes, $\widehat{y}_k$	Residuals, $e_k=y_k-\widehat{y}_k$
1	Northland	827.2	188.7	967.7	-140.5
5	Gisborne	201.8	49.3	252.8	-51.0
8	Manawatu-Wanganui	1150.7	249.7	1280.5	-129.8
11	Nelson	320.1	52.9	271.3	48.8
13	West Coast	180.6	32.6	167.2	13.4
15	Otago	1470.3	236.2	1211.3	259.0

The variance of the estimated population mean is then \[\begin{eqnarray*} \bfa{Var}{\widehat{\bar{Y}}_R} &=& \left(1-\frac{n}{N}\right)\frac{s_e^2}{n}\\ &=& \left(1-\frac{6}{16}\right)\frac{147.53^2}{6}\\ &=& 2267.32 = (47.62)^2 \end{eqnarray*}\] The variance of the ratio estimate of $\beta$ is (noting again that the mean size of the 6 sampled regions is $\bar{x}=134.9$): \[\begin{eqnarray*} \bfa{Var}{\widehat{\beta}_R} &=& \frac{1}{\bar{x}^2}\left(1-\frac{n}{N}\right)\frac{s_e^2}{n}\\ &=& \frac{1}{(134.9)^2}\left(1-\frac{6}{16}\right)\frac{147.53^2}{6}\\ &=& 0.1245916 = (0.352975)^2 \end{eqnarray*}\]

Our interest is in the total retail trade volume: \[\begin{eqnarray*} \widehat{Y} &=& N\widehat{\bar{Y}}\\ &=& N\widehat{\beta}_R\bar{X}\\ &=& \widehat{\beta}_R X \end{eqnarray*}\] where $X$ is the total population $X=4916.5$: \[\begin{eqnarray*} \widehat{Y} &=& (5.12812)(4916.5) = 2.5212\times 10^{4} \end{eqnarray*}\] (almost identical to our earlier regression estimate of $2.5298\times 10^{4}$).

The standard error of this estimate is \[ \bfa{SE}{\widehat{Y}} = \bfa{SE}{\widehat{\beta}_RX} = \bfa{SE}{\widehat{\beta}_R}X = (0.352975)(4916.5) = 1735.4 \] This is reduced from our regression SE of $1972$, with RSE $1735.4/(2.52124\times 10^{4}) = 0.069$, and is a better estimate of our uncertainty because the variance of retail trade really does increase with $X$.
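The whole ratio estimation calculation can be sketched in Python as below, using the sample in Table 12.2 and the formulas of this section. The variable names are illustrative. Small differences from the text in the last digit are to be expected, since the code works with unrounded residuals.

```python
import math
import statistics

N = 16
X_total = 4916.5   # total population (thousands) of all 16 regions, from the text
# Sampled regions (Table 12.2): population in thousands and volume in $M
x = [188.7, 49.3, 249.7, 52.9, 32.6, 236.2]
y = [827.2, 201.8, 1150.7, 320.1, 180.6, 1470.3]
n = len(x)

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
beta_r = y_bar / x_bar                      # classical ratio estimator

# Residuals about the fitted line through the origin, and their variance
resid = [yi - beta_r * xi for xi, yi in zip(x, y)]
s_e = math.sqrt(sum(e * e for e in resid) / (n - 1))

var_mean = (1 - n / N) * s_e**2 / n         # variance of the estimated mean
se_beta = math.sqrt(var_mean) / x_bar       # SE of the ratio estimate of beta

total_hat = beta_r * X_total
se_total = se_beta * X_total
print(f"beta_R = {beta_r:.5f}, s_e = {s_e:.2f}, SE(beta_R) = {se_beta:.4f}")
print(f"total = {total_hat:.0f}, SE = {se_total:.0f}, RSE = {se_total / total_hat:.3f}")
```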

13 Variance Estimation in Complex Sample Designs

$\DeclareMathOperator*{\argmin}{argmin}$ $\newcommand{\var}{\mathrm{Var}}$ $\newcommand{\bfa}[2]{{\rm\bf #1}[#2]}$ $\newcommand{\rma}[2]{{\rm #1}[#2]}$ $\newcommand{\estm}{\widehat}$

In this course we look at a number of straightforward sample designs (e.g. SRSWOR, Stratified SRSWOR, 1- and 2-stage Cluster sampling). These designs have well defined means of forming estimates and variance formulae. For example in STSRS estimates of a total and its variance are given by: \[\begin{eqnarray*} \widehat{Y} &=& \sum_h N_h \frac{1}{n_h}\sum_{k\in s_h}y_{hk}\\ \bfa{\widehat{Var}}{\widehat{Y}} &=& \sum_h N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{s_h^2}{n_h} \end{eqnarray*}\] These formulae assume that the sample weights of each unit within a stratum are the same: i.e. $\frac{N_h}{n_h}$. However where a post-stratification or nonresponse weighting class adjustment is made the weights will differ between units after the adjustment. In such circumstances the analytical formulae become at best merely approximations to the actual variances, or they may fail altogether.
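For reference, the stratified formulae can be written as a short function. This is a sketch only: the function name is hypothetical and the two strata in the usage example contain made-up values, not data from these notes.

```python
import statistics

def stsrs_total(strata):
    """Estimate a population total and its variance under stratified SRSWOR.

    strata is a list of (N_h, sample_values) pairs, one per stratum.
    """
    total = 0.0
    var = 0.0
    for N_h, ys in strata:
        n_h = len(ys)
        total += N_h * statistics.mean(ys)
        var += N_h**2 * (1 - n_h / N_h) * statistics.variance(ys) / n_h
    return total, var

# Made-up illustration: two strata with N_h = 100 and N_h = 50
est, var = stsrs_total([(100, [2, 4, 6, 8]), (50, [10, 12, 14])])
print(est, round(var, 1))
```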

Taylor Series Linearisation seeks analytical approximations to variance formulae, and this is the approach used when calculating expressions for the variance of the ratio estimator.

An alternative which is feasible is to use numerical resampling techniques to estimate the variance. There are various different ways this is done, the most common being the jackknife and bootstrap methods. These methods, as well as the Random Group method and Balanced Repeated Replication (BRR) share the property that $R$ subsets or replicates of the sample are taken. For each subset a separate estimate $\widehat{T}_{(r)}$ is calculated, and the variability of these estimates is used to calculate the variance of the estimator $\widehat{T}$ without the need for analytical formulae.

Example

We’ll take the following data as an example: $n=5$ observations of $y_k$ the number of times each person exercises per week in a simple random sample from a population of $N=500$ people. The sample weight of each person is $N/n=100$.

Unit	Sex	$y_k$	Weight, $w_k$	$w_k^\ast$
1	Male	2	100	83.33
2	Male	0	100	83.33
3	Male	3	100	83.33
4	Female	5	100	125.00
5	Female	9	100	125.00
Total			500	500.00

Also assume that we know that there are 250 males and 250 females in this population. The post-stratified version of the estimate requires us to adjust the weights so that they add up correctly to the population benchmarks. The adjusted weights are given above as $w_k^\ast$.

We want to estimate the mean number of times per week that people in this population exercise $\bar{Y}$.

The theoretical result for SRSWOR (without poststratification) is: \[\begin{eqnarray*} \widehat{\bar{Y}} &=& \frac{\sum_k w_ky_k}{\sum_k w_k} = \bar{y} = 3.8\\ \bfa{\widehat{Var}}{\widehat{\bar{Y}}} &=& \left(1-\frac{n}{N}\right) \frac{s_y^2}{n}\\ &\simeq& \frac{s_y^2}{n} = \frac{3.42^2}{5} = 2.34 \end{eqnarray*}\] If we use the poststratified weights then the estimate is: \[\begin{eqnarray*} \widehat{\bar{Y}}_{\text{post}} &=& \frac{\sum_k w_k^\ast y_k}{\sum_k w_k^\ast} = 4.3 \end{eqnarray*}\] However there is no exact variance formula for this estimate since the poststrata (sex) do not correspond to selection strata (there were no selection strata in this case: it was just a SRSWOR of the whole population). Approximate analytic expressions do exist for SRSWOR; however for complex designs, especially those involving clustering, there is no formula available. We will use numerical methods to estimate the variance in this simple example case to demonstrate the procedure which can be applied in those complex cases.
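The poststratification adjustment is easy to carry out by hand, and the sketch below shows one way to code it. The benchmarks are taken as 250 males and 250 females, the values consistent with $N=500$ and with the adjusted weights tabulated above. The helper name `weighted_mean` is illustrative.

```python
N = 500
sex = ["M", "M", "M", "F", "F"]
y = [2, 0, 3, 5, 9]
n = len(y)
w = [N / n] * n                          # design weights, all equal to 100
benchmarks = {"M": 250, "F": 250}        # assumed population counts by sex

# Sum of the design weights in each poststratum
w_sum = {g: sum(wk for wk, s in zip(w, sex) if s == g) for g in benchmarks}
# Scale the weights within each poststratum so they add up to its benchmark
w_star = [wk * benchmarks[s] / w_sum[s] for wk, s in zip(w, sex)]

def weighted_mean(weights, values):
    return sum(wk * v for wk, v in zip(weights, values)) / sum(weights)

mean_srs = weighted_mean(w, y)
mean_post = weighted_mean(w_star, y)
print([round(wk, 2) for wk in w_star])
print(round(mean_srs, 2), round(mean_post, 2))
```

The poststratified mean is the average of the male mean ($5/3$) and the female mean ($7$), because the two benchmarks are equal.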

13.1 Jackknife

When applied to SRSWOR the Jackknife method requires us to drop out a unit from the sample and recalculate the estimate from the reduced sample (adjusting the weights as necessary to match population benchmarks). Since there are $n$ sample members there are $n$ possible Jackknife replicate samples.

Ignoring the weights for the moment, there are 5 possible samples in our example:

Deleted Unit $r$	$y_k$ values of remaining units				Mean $\bar{y}_{(r)}$
1	0	3	5	9	4.25
2	2	3	5	9	4.75
3	2	0	5	9	4.00
4	2	0	3	9	3.50
5	2	0	3	5	2.50

Each replicate mean $\bar{y}_{(r)}$ is an estimate of the population mean $\bar{Y}$. We can calculate the overall mean of these replicate means: \[ \widetilde{\bar{Y}}_{\text{JK}} = \frac{1}{n}\sum_{r=1}^n \widehat{\bar{Y}}_{(r)} = \frac{1}{n}\sum_{r=1}^n \bar{y}_{(r)} = 3.8 \] In this simple case $\widetilde{\bar{Y}}$ is identical to $\widehat{\bar{Y}}=\bar{y}$.

Our variance estimate comes from the variability of the $n$ replicate means: \[ \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{\text{JK}}} = \frac{n-1}{n}\sum_{r=1}^n \left(\widehat{\bar{Y}}_{(r)}-\widetilde{\bar{Y}} \right)^2 = \frac{n-1}{n}\sum_{r=1}^n \left(\bar{y}_{(r)}-\widetilde{\bar{Y}} \right)^2 = 2.34 = (1.53)^2 \] (Note that this is not just the variance of the replicate means, but rather a scaled up version of it: there is an extra factor of $n-1$.) Again this is identical to the theoretical result in this simple case.
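A minimal sketch of the unweighted delete-one jackknife for this example, following the two formulas above (the variable names are illustrative):

```python
y = [2, 0, 3, 5, 9]
n = len(y)

# Mean of the remaining n - 1 units after deleting each unit in turn
rep_means = [(sum(y) - yk) / (n - 1) for yk in y]

jk_mean = sum(rep_means) / n
jk_var = (n - 1) / n * sum((m - jk_mean) ** 2 for m in rep_means)
print(rep_means)
print(round(jk_mean, 2), round(jk_var, 2), round(jk_var ** 0.5, 2))
```

For the sample mean, this variance equals $s_y^2/n$ exactly, which is why it agrees with the theoretical result without the finite population correction.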

However we see a difference when we poststratify each of the replicates. Dropping out a unit is equivalent to setting its weight to zero, and we adjust the weights within the poststrata (sex) so that the weights correctly add up to the benchmarks. Each replicate sample thus has its own set of replicate weights.

					Replicates
Unit	Sex	$y_k$	$w_k$	$w_k^\ast$	$w_k^{(1)}$	$w_k^{(2)}$	$w_k^{(3)}$	$w_k^{(4)}$	$w_k^{(5)}$
1	Male	2	100	83.33	0.00	125.00	125	83.33	83.33
2	Male	0	100	83.33	125.00	0.00	125	83.33	83.33
3	Male	3	100	83.33	125.00	125.00	0	83.33	83.33
4	Female	5	100	125	125.00	125.00	125	0.00	250.00
5	Female	9	100	125	125.00	125.00	125	250.00	0.00
				Estimates: $\widehat{\bar{Y}}_{(r)}$	4.25	4.75	4	5.33	3.33

We compute each replicate estimate $\widehat{\bar{Y}}_{(r)}$ using its own weights: \[\begin{eqnarray*} \widehat{\bar{Y}}_{(r)} &=& \frac{\sum_k w_k^{(r)} y_k}{\sum_k w_k^{(r)}} \end{eqnarray*}\] We can then calculate the overall mean of these replicate estimates: \[ \widetilde{\bar{Y}}_{\text{JK,post}} = \frac{1}{n}\sum_{r=1}^n \widehat{\bar{Y}}_{(r)} = 4.3 \] The variance estimate again comes from the variability of the $n$ replicate estimates: \[ \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{\text{JK,post}}} = \frac{n-1}{n}\sum_{r=1}^n \left(\widehat{\bar{Y}}_{(r)}-\widetilde{\bar{Y}}_{\text{JK,post}} \right)^2 = 1.83 = (1.35)^2 \] The variance is reduced here because males and females (in this population) are different in the frequency they exercise. Post-stratification controls for and removes this source of variability in the final estimate.
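The replicate weights and the poststratified jackknife variance can be generated as in the sketch below. It assumes benchmarks of 250 males and 250 females, consistent with the replicate weights tabulated above. The helper names are illustrative.

```python
sex = ["M", "M", "M", "F", "F"]
y = [2, 0, 3, 5, 9]
n = len(y)
benchmarks = {"M": 250, "F": 250}   # assumed population counts by sex

def poststratify(w):
    """Rescale weights within each sex so that they sum to the benchmark."""
    totals = {g: sum(wk for wk, s in zip(w, sex) if s == g) for g in benchmarks}
    return [wk * benchmarks[s] / totals[s] for wk, s in zip(w, sex)]

def weighted_mean(w):
    return sum(wk * yk for wk, yk in zip(w, y)) / sum(w)

rep_est = []
for r in range(n):
    # Dropping unit r is the same as setting its weight to zero
    w = [0.0 if k == r else 100.0 for k in range(n)]
    rep_est.append(weighted_mean(poststratify(w)))

jk_mean = sum(rep_est) / n
jk_var = (n - 1) / n * sum((e - jk_mean) ** 2 for e in rep_est)
print([round(e, 2) for e in rep_est])
print(round(jk_mean, 2), round(jk_var, 2))
```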

Note the Jackknife variance formula lacks any finite population correction (i.e. any terms like $(1-n/N)$), and moreover it effectively treats the sample design as having been with replacement. These effects are unimportant when sampling from large populations.

In principle the Jackknife can be applied to any estimator at all – we simply treat each replicate as if it were a real sample, evaluate the estimator for that replicate, and then look at the variance of the resulting set of estimates.

The Jackknife can be used in complex sample designs. In cluster designs we drop out whole PSUs rather than individual units, and in stratified designs the Jackknife can be applied separately to each stratum.

The number of replicates where we delete one sample member clearly increases with the sample size, and hence the computational load also increases with sample size. For very large samples it is sufficient to take a sample of the replicates when calculating the variance. Also note that the Jackknife does not perform well for some statistics (in particular quantiles).

In addition to an estimate of the variance of an estimator, the Jackknife also provides an estimate of its bias: i.e. the rescaled difference between the whole sample estimate $\widehat{T}$ and the mean of the replicates $\widetilde{T}$: \[ \bfa{\widehat{Bias}}{\widehat{T}}_{\text{JK}} = (n-1)\left[\frac{1}{n}\sum_{r=1}^n\widehat{T}_{(r)} - \widehat{T}\right] = (n-1)\left[\widetilde{T}_{\text{JK}} - \widehat{T}\right] \]

13.2 Bootstrap

The bootstrap differs from the Jackknife in that it treats the sample as a mini-population, and each replicate is drawn with replacement from the sample. As with the Jackknife, the method of resampling depends on the sample design.

In our example where the original sample was a SRSWOR we also select our replicates by SRS, but with replacement. We draw $R$ replicate samples where $R$ is a large number (usually hundreds or thousands), and recalculate the estimate each time.

For example, if we ignore the poststratification benchmarks we obtain the following samples each with their estimates:

$r$	Observations					$\bar{y}_{(r)}$
1	2	5	2	2	2	2.6
2	9	2	2	3	5	4.2
3	3	9	9	5	2	5.6
4	0	3	9	3	5	4
5	3	9	5	0	3	4
6	5	9	5	5	2	5.2
7	0	3	9	9	3	4.8
8	3	0	3	0	2	1.6
9	0	2	9	2	2	3
10	0	2	5	2	2	2.2
$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$	$\vdots$

Note that units can get into the same sample multiple times. This is intended.

The overall bootstrap estimate after $R=1000$ replicates is \[ \widetilde{\bar{Y}}_{\text{BS}} = \frac{1}{R}\sum_{r=1}^R \widehat{\bar{Y}}_{(r)} = \frac{1}{R}\sum_{r=1}^R \bar{y}_{(r)} = 3.8 \] with variance \[ \bfa{\widehat{Var}}{\widehat{\bar{Y}}_{\text{BS}}} = \frac{1}{R-1}\sum_{r=1}^R \left(\widehat{\bar{Y}}_{(r)}-\widetilde{\bar{Y}}_{\text{BS}} \right)^2 = 1.92 \] The bootstrap also provides a bias estimate \[ \bfa{\widehat{Bias}}{\widehat{T}}_{\text{BS}} = \frac{1}{R}\sum_{r=1}^R\widehat{T}_{(r)} - \widehat{T} = \widetilde{T}_{\text{BS}} - \widehat{T} \]
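A bootstrap of this kind takes only a few lines. The sketch below ignores poststratification; the seed is arbitrary and is fixed only so that a run can be repeated. The results depend on the random draws, so they will be close to, but not identical to, the figures quoted above.

```python
import random
import statistics

y = [2, 0, 3, 5, 9]
n = len(y)
R = 1000
rng = random.Random(392)   # arbitrary seed, for repeatability only

# Each replicate is a simple random sample of size n drawn WITH replacement
rep_means = [sum(rng.choices(y, k=n)) / n for _ in range(R)]

bs_mean = sum(rep_means) / R
bs_var = statistics.variance(rep_means)   # divisor R - 1
bs_bias = bs_mean - sum(y) / n            # bootstrap bias estimate
print(round(bs_mean, 2), round(bs_var, 2), round(bs_bias, 3))
```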

Poststratifying each replicate gives an estimate of the mean of 4.3 and bootstrap variance 0.73. (Although note: for this small sample this variance is a poor estimate, since there are many bootstrap replicate samples where there are no males and many where there are no females selected. These have to be deleted from the calculation – since we can’t poststratify them – and this artificially reduces the variances.)
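The poststratified bootstrap, including the deletion of replicates that cannot be poststratified, might be sketched as follows. The benchmarks of 250 per sex and the seed are assumptions, and the output varies with the random draws.

```python
import random
import statistics

sex = ["M", "M", "M", "F", "F"]
y = [2, 0, 3, 5, 9]
n = len(y)
R = 1000
benchmarks = {"M": 250, "F": 250}   # assumed population counts by sex
rng = random.Random(392)            # arbitrary seed, for repeatability only

rep_est = []
dropped = 0
for _ in range(R):
    idx = rng.choices(range(n), k=n)          # resample units with replacement
    males = [y[k] for k in idx if sex[k] == "M"]
    females = [y[k] for k in idx if sex[k] == "F"]
    if not males or not females:
        dropped += 1                          # cannot poststratify this replicate
        continue
    # Poststratified mean: group means weighted by the benchmark counts
    est = (benchmarks["M"] * sum(males) / len(males)
           + benchmarks["F"] * sum(females) / len(females)) / sum(benchmarks.values())
    rep_est.append(est)

bs_mean = sum(rep_est) / len(rep_est)
bs_var = statistics.variance(rep_est)
print(dropped, round(bs_mean, 2), round(bs_var, 2))
```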

The bootstrap can be even more computationally intensive than the Jackknife, but is even more widely applicable.

13.3 Complex Designs

The jackknife and bootstrap resampling always operate at the first stage of selection. So if we have a cluster sample, the jackknife drops out whole clusters, but we don’t then go on to drop out individual units from those clusters – even if the sample design selects a subsample from each selected cluster. In a stratified design, we would ideally run the jackknife separately in each stratum – i.e. we should separately drop out a single cluster from each stratum leaving all the other strata complete.

However in practice, a single cluster may be dropped from every stratum at once when constructing a set of replicate weights.

Once the clusters are deleted, all of the other clusters are reweighted to add up to the correct population totals, or benchmarks if poststratification is being done.
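As a sketch of the idealised scheme, the code below runs a delete-one-PSU jackknife stratum by stratum for an estimated total. Several things here are assumptions rather than the method of any particular survey: there is no poststratification, the remaining PSUs in the affected stratum are scaled up by $n_h/(n_h-1)$, and each stratum's squared deviations are multiplied by $(n_h-1)/n_h$. The weighted PSU totals are made-up values.

```python
# Made-up weighted PSU totals (weight times PSU sample total), by stratum
psu_totals = {
    "A": [120.0, 150.0, 90.0],
    "B": [300.0, 260.0],
}

full_est = sum(sum(z) for z in psu_totals.values())

jk_var = 0.0
for h, z in psu_totals.items():
    n_h = len(z)
    for i in range(n_h):
        # Drop PSU i and scale up the other PSUs in stratum h;
        # all other strata are left complete
        stratum_est = n_h / (n_h - 1) * (sum(z) - z[i])
        rep_est = full_est - sum(z) + stratum_est
        jk_var += (n_h - 1) / n_h * (rep_est - full_est) ** 2

print(full_est, jk_var)
```

For a total, this reproduces the usual with-replacement variance $\sum_h n_h s_{zh}^2$, where $s_{zh}^2$ is the sample variance of the weighted PSU totals in stratum $h$.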

It is common that unit record datasets will be provided with a set of 100 or more replicate weight columns which are to be used to calculate the variances of all estimates. These sets of weights are subsets of the full set of possible replicate weights – but are a sufficiently large number to give good variance estimates. They are calculated by the data provider who may be unwilling to release the information about the sample frame which is required for the practitioner to calculate such weights themselves. Moreover, release of a single set of replicate weights means that all users will calculate the same estimates from the dataset.

Sets of replicate weights provided in this way are usually jackknife (rather than bootstrap) replicates, since it is usually the case that a smaller number of jackknife replicates are required to give the same reliability.

14 Field Work

14.1 Timing

The timing of a survey operation is determined by the resources available, and (more importantly) the need for the information. There is no point reporting an opinion poll of voting intentions after the election has taken place. Results must be timely (released at a useful time).

This may mean that a large interviewing force is required, to get the information collected in a short time window before an important event, or before an important decision is made (a decision relying on the survey results).

Surveys with extended surveying periods can be affected by current events (e.g. the Canterbury earthquake, changes in the economy, or a political scandal) and also by the changing seasons through the year. People may respond differently depending on the time of year (think about a survey of agricultural workers, or of ski instructors). Surveying in the school holidays can lead to a lot of non-contact non-response.

14.2 Survey Mode

A major question in field work is of course the mode of the survey, since it determines to a large extent what kind of field work is required.

Interviewer administered, Self administered, Observational?
Face to face? Postal? Telephone? Video Call? Web form? Physical measurements?
Computer involvement? All on paper?

Some common modes:

Personal interview – a face-to-face encounter where the interviewer asks the questions, the respondent replies verbally. The interviewer writes down the responses on a paper questionnaire (or a tape recorder/video camera captures the responses for later analysis – the respondent must know if s/he is being recorded).

It is common for multiple choice questions to have show cards with the list of possible responses (e.g. ethnicities), and the respondent chooses one.
CAPI – Computer Assisted Personal Interview – as above, but the interviewer records the answers on a laptop. The laptop also provides the script which the interviewer reads.
CATI – Computer Assisted Telephone Interview – as above, but not face-to-face – instead over the phone. The interviewer records the answers on a laptop (show cards cannot be used of course).

For very short surveys, e.g. of customer satisfaction, the responses may be collected by touch tone phone or by voice recognition software.
CAVI – Computer Assisted Video Interview – like CAPI, but not in person – instead using video conferencing. The interviewer records the answers on a computer, and can if needed display show cards on screen.
CASI – Computer Assisted Self Interview – A face-to-face encounter with an interviewer, but the respondent enters data directly into the laptop the interviewer has brought. There may be a CASI section in a CAPI interview, for questions which are very sensitive.
CASQ – Computer assisted Self-administered Questionnaire – the questionnaire is provided in, say, a web application. No interviewer is involved.
Postal Survey – the questionnaire is posted to the respondent, with a self addressed (and post-paid) return envelope.

The mode of the survey affects the way that people respond – people will respond to the same question differently if it is asked in different modes. This is called the mode effect.

It can be difficult to get detailed or personal information over the phone, and such interviews are usually short. Questions in telephone questionnaires tend also to be short, since there is no opportunity to use visual cues or show cards, and no ability by the respondent to reread the questionnaire. Sensitive questions may be better in a face-to-face situation, although this increases the likelihood that people won’t give honest answers to questions that they feel might make them look bad (uncaring, unfeeling, judgemental, ungenerous), or show them to have views that are held only by a small minority. This latter effect is called Social Desirability Bias.

14.3 Field Work in Practice

Having selected a sample frame, a sample size, and a method of selection, the units from which data are to be collected can be identified and contacted.

The relevant steps should be clearly planned out, and relevant contingencies allowed for.

Example. If the sample is of students inside schools, I may only have a list of schools, but not of classes, nor of students. I’ve decided to do a sample of students selected from classes within schools. How does a questionnaire get from my office to the desk of a selected student, and back again? What actually has to take place?

Select a set of schools from the frame (e.g. by SRS or stratified SRS)
Write to the principal of each selected school, explain the survey (possibly in an appointment face to face), get agreement, schedule a time for sampling;
For respondents under the age of 16 I need to get parental permission – I need to write a letter home, and get them all back again. (Should this letter go to all students’ parents, or only to the parents of my selected ones?)
I’ll need a list of classes within the school to sample from (can I rely on the admin staff at the school to make the selection for me? probably not).
If I have multiple collectors covering different schools I need to train them all up first. I’ll need training sessions and a set of instructions. I may need to run role play sessions with them. They need to all act exactly the same way (it shouldn’t make any difference which collector goes to which school).
At each selected school I then need to select the classes randomly (and be sure that students belong to only one class). I need to contact the teacher of each selected class, and arrange a suitable time and place to survey.
I need to arrive with enough surveys to distribute, explain and get the students to carry out the survey. If I’m subsampling within the class I’ll need a means of randomly selecting from within the class. I need to decide what to do about students who are absent on the day.
I get the students to complete the survey.
I then collect all of the surveys back.
I need to be able to store the surveys securely until they can get back to the office. If I have lots of interviewers/collectors they’ll all need to have secure locations too.
I need a method of getting the surveys back to the office from multiple collectors. I may get the collectors to post their sets of surveys back. I may be able to send encrypted emails of electronic survey data, or better to physically transfer data sticks.
I need to capture the data from paper questionnaires into electronic form for analysis. I may get two or more people to enter the data independently and compare results for consistency. Electronic data need to be read from laptops or data sticks into my survey dataset.
After data capture – what do I do with the survey forms? Do I keep them? where? for how long? When and how do I destroy them?
I need to interpret the unit record data I have collected – coding responses to standard classification systems and perhaps doing text analysis of open ended question answers. I may need to edit or impute data.
I need to calculate the survey weights, making adjustments for unit non-response.
I finally have a unit record dataset that I can use for analysis.
Who will have access to the data? How long will I keep it? Can I give it to anyone else? Can I use it for research I hadn’t planned when I set the survey up?

There is a wide variety of surveys, and correspondingly a wide variety of considerations for field work. Here is a non-exhaustive list of some things to consider:

Surveys of animals (including humans) have to take account of the fact that animals move, and that they hide/sleep/migrate/are born and die. How do I adjust for these factors? are they problematic?

Short time frame surveys can usually avoid the problems of birth and death. Movement is more of a problem.
For surveys of animals a strict set of instructions is required – and an observation schedule, analogous to a questionnaire, is needed. The instructions may include the requirement that surveyors do not communicate with each other to ensure the independence of their results. (What do I do about differing levels of skill in observation? What if the weather is bad?)
Coverage rules: I need to ensure that each unit has only one chance of selection. If I’m surveying people in dwellings, but it takes 3 months to do all my surveying, what do I do (a) with people who have multiple dwellings, and (b) with people who move during the survey period? or who are on holiday?
The flow of documents: how do the surveys get to and from the respondents. Do I need lots of stamped addressed envelopes? How do I ensure the security of the data I am collecting?
Is any equipment required? (Laptops, weighing scales, pens and paper, mobile phones, cameras, recording devices, light meters, …)
How do I ensure the highest possible response rate? What kind of pre-notification is needed? What permissions are required (e.g. to approach children? or go on to private property?)

Surveys of humans and non-human animals conducted by Universities and by government usually require an ethics committee sign off. In an actual survey you’d need that all signed off before you start.
Am I offering incentives for response (a gift? a prize draw?) How do I distribute these?
What do I do about non-response? How do I even know if it is happening? Do I need serial numbers on all the questionnaires? Can I check to see how many people have responded so far?

In a postal survey it is usually better to send out new questionnaires and a reminder letter, instead of just a letter asking people to complete a questionnaire that they may have lost (what happens if they send back two questionnaires? will you even know if this has happened?).

How many reminders will you send?
In a postal survey – how will you address the envelope? If you are subsampling within a household you can’t ask the householders to randomly select someone – it won’t be random! (and it probably won’t happen at all).

You really need the selected person’s name and address – this information needs to be on the frame. (Face-to-face dwelling based surveys can find out names on the doorstep.)
What information needs to be supplied to the respondents? Some of this should go in a pre-notification letter if this is being used:
- Who is running the survey
- What the survey is for
- What will happen (e.g. an interviewer will call …)
- How long the survey will take, and other information about what the respondent will have to do (e.g. complete a paper questionnaire, or answer a set of questions face-to-face with an interviewer)
- The privacy and confidentiality issues relevant to the survey
- How the data will be used
- How to contact the survey organisers
- An opportunity to see the survey report?
How many call backs are required before a collector can finally code a sample member as a non-response? Will you collect information about whether a non-response was a non-contact or a refusal or some other reason?

What if the selected person cannot speak English? (or whatever other language(s) are being used?)
Large surveys may have a dress rehearsal where a small amount of data is collected using the methods and procedures that the survey will use – just to test the field work procedures.
Will proxy responses be allowed? (i.e. one person responding on behalf of another)? This is justified only in very particular circumstances. Proxies can work if the information is particularly simple (is this person still unemployed, asked of a family member), or if the person is a caregiver for the respondent, and knows a lot about their situation. But they do carry a significant risk that the proxy respondent can’t (or in some cases won’t) answer accurately.

15 Questionnaires

This Section concentrates on surveys of human subjects, but many of these stages still apply in studies of animals, plants and inanimate subjects.

15.1 Stages of Questionnaire Development

Formulate the survey objectives;
Convene a set of Focus Groups to investigate their experiences and inform the formulation/refinement of the objectives;
Ensure the appropriate conceptual framework exists for the data collection. Define key concepts and terms;
Decide on classification standards;
Identify constraints imposed by the sample design, identify relevant information available from the survey frame;
Decide the survey mode;
Develop a list of topic areas, and any relationships or dependencies among them;
Develop a list of items to be collected – justify each in relation to the objectives;
Develop question wording, sets of responses (for closed questions)
Develop connecting material, and instructions;
Develop the layout of the questionnaire – includes instructions, routing, contingency questions. This may mean programming the questionnaire into a software package;
Develop/obtain additional materials (including showcards, any measuring devices, envelopes etc.);
Design and test the data capture/editing/coding procedures;
Test (and revise) the questionnaire (not all of these are necessary):
- Desk Check – the designer and colleagues test the questionnaire;
- Cognitive Test – a small set of respondents are selected (not necessarily from the survey population) to complete the questionnaire, and asked to express their reactions and thought processes as they do so;
- Focus Group – use the questionnaire with members of the survey population, but in an office situation. Discuss with them their reactions to the questionnaire, wording, terms used, flow, instructions, layout etc.;
- Pilot Test – use the questionnaire in a real setting with people who are in the survey population, and who will give real responses;
- Dress Rehearsal – use the questionnaire in conjunction with the full sample selection process and full survey operation;
Collect the data!

15.2 Format

In general:

Start with interesting questions, relevant to the subject at hand, to hook in the respondent’s interest, and ensure a higher response rate;

This usually (though not always) means that the demographic questions (‘And now about you …’: age/sex/…) are found at the end;

[A notable exception: The first questions in the Census are Name, Sex, Address]
Don’t put the most important items at the end of the questionnaire;
Don’t put the most sensitive or difficult questions first;
Group items into logical sections;
Keep it short. Don’t ask questions you don’t need to.

For self completion questionnaires (paper and electronic):

The instructions must be especially clear;
There should be contact information in case of queries or difficulties;
Space items out on the page or screen;
Make it clear where the respondent should write, and HOW (e.g. ‘tick the boxes provided to indicate your answer’)

In phone questionnaires:

Brevity is vital to ensure the respondent doesn’t hang up
With no visual cues, and no possibility of looking back over the questionnaire, questions must be particularly simple

In all questionnaires:

Provide background information about the purpose of the survey, who is running it;
Make it clear what the respondent is agreeing to by providing their information;
Make it clear whether or not participation is voluntary (it usually is);
Explain that the data of individual respondents will not be released to others, and that no individual respondent will be identifiable when the survey results are published.

15.3 Questions

Questions can usually be divided into two types: open or closed.

Open questions allow the respondents to answer in any way they like. On paper questionnaires a space is left for the respondent to write one or more sentences.

e.g. Is there anything more you’d like to add?

What three things would improve this course?

Open questions are most suitable for qualitative and pilot surveys. They are time consuming to analyse in large surveys.

Closed questions have a prespecified set of possible responses.

What gender are you? $\square$ Male $\square$ Female $\square$ Another Gender;

These come in two important forms:

Tick one box only (single response)
Tick all that apply (multiple responses)

Electronic questionnaires can enforce these rules – but on paper the respondent can easily break them (inappropriately giving zero, one, two or more responses)

Focus groups and pilot surveys with many open questions administered to a small sample can be used to refine a set of frequent responses that are offered in a closed question in the main survey.

Example. In a focus group one might ask the open question: ‘What is your religion?’ and based on the distribution of responses develop the question:

‘What is your religion?’ (Tick any that apply):

$\square$   No religion
$\square$   Christian
$\square$   Buddhist
$\square$   Hindu
$\square$   Muslim
$\square$   Jewish
$\square$   Other: (please specify) ___________________

Optical Character Recognition (OCR) software can look at a scanned image of the open ‘Other’ answer, and try to convert it automatically to text. This may need human intervention however.

Things to consider when designing a closed question:

It is normal to put the most frequently chosen categories first;
The options provided should in principle be exhaustive – everyone should be able to tick one box;
The open option ‘Other’ is usually included to cover cases which are not frequently chosen (even if the surveyors are expecting these);
If a question is ‘tick one box only’ then the options should be mutually exclusive – it should really be the case that only one box can apply;
The order that the options are offered in affects how people answer
One may choose deliberately to exclude some valid categories, because they apply to too small a part of the population, or will encourage flippant answers.

Example. The sex question in the census (and in almost all questionnaires) is a good example of this. There are people who do not classify themselves as either Male or Female. Yet no ‘Other’ box is provided. Think about why surveyors do not allow an ‘Other’ response (and do not list other alternatives). (Note that a new question about gender is starting to be used in New Zealand, and it allows the options Male/Female/Another Gender.)

Example. A lot of people recently started putting ‘Jedi’ as their religion. Why isn’t this listed as an option?
Two boxes that may or may not be shown are the Don’t Know and Refused boxes. These are usually present on any questionnaire completed by an interviewer, but are rarely present in a self-completion questionnaire.

This is to discourage a lot of item non-response.
- Don’t Know – may be used on a paper questionnaire where it is of interest to know if people are undecided or uninformed.
  
  ‘Which party will you vote for in the next election?’
  
  The undecideds are an interesting group in this case. But not in the case
  
  ‘How many bedrooms are there in your house?’
  
  where we want people to think about this and give a good answer.
- Refused – may be used when the surveyors feel that the question may be an invasion of privacy for some people. Even having the option of ‘prefer not to say’ may encourage people to respond in one of the other categories.
A Not Applicable box should be shown if a respondent is being asked a question that may not apply to him/her.

It is preferable never to ask such a question (and to avoid it with a skip, as in ‘If Yes please go to Q5, if No please go to Q10’). But in short paper questionnaires it can be inevitable.
Think of the coding and classification system that you’re going to use. But be aware that you can collect data in one way, and then transform and analyse it in another.

Example. Can ask for a free text response – ‘How many years have you lived at this address?’ and then convert the number into ranges ($<$1 year, 1-4 years, 5-9 years, 10+ years).

Example. Can ask for the make and type of car, but then only analyse it by size (which we derive from the make and type).

Ask questions that are easy for the respondent to answer, but which still make your analysis possible.
Likert Scales are closed questions asking for a level of agreement. They are convenient for collecting responses to a lot of questions about opinions, and can be laid out in a grid. A disadvantage of the grid layout is that it encourages some respondents to answer in a single column throughout (e.g. all Excellent or all Awful).

Consider whether the direction of all statements is the same (e.g. all positive – so that 5 is always indicating approval).

$\square$   Strongly Disagree
$\square$   Disagree
$\square$   Neither agree nor disagree
$\square$   Agree
$\square$   Strongly Agree

An ODD number of response categories allows respondents to be neutral, whereas an EVEN number forces a response in one direction or other.

$\square$   Strongly Disagree
$\square$   Disagree
$\square$   Agree
$\square$   Strongly Agree

Notice there are some questions where the centre is the most positive response:

‘How was the workload in this course?’

$\square$   Far too little
$\square$   Too little
$\square$   About right
$\square$   A bit too much
$\square$   Far too much

Five or seven categories is usual.

Attention Checks - Some questionnaires offer rewards/incentives for completion. An unfortunate consequence of this is that some respondents are only completing the questionnaire to get the reward, and do not pay attention to the content of questions. These ‘donkey responses’ are typified by giving the same response to every question - e.g. always ticking ‘Strongly agree’ - so that they can complete the questionnaire with minimum thought or effort.

Some questionnaires include attention checking questions to detect this.

For example we might request a specific response to a multi-choice question:

‘To check that you are paying attention, please answer Strongly Disagree to this question’

$\square$   Strongly Disagree
$\square$   Disagree
$\square$   Neutral
$\square$   Agree
$\square$   Strongly Agree

or request a specific text response in an open question:

‘To check that you are paying attention, please enter the number 545 in the text box below’
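
Once the responses have been captured, checks of this kind are easy to automate. The following is a minimal sketch in Python, assuming each respondent’s Likert answers are stored as a list of integers coded 1 (Strongly Disagree) to 5 (Strongly Agree), with the attention check at a known position. The function names, the data layout and the example responses are all hypothetical.

```python
def is_straight_liner(answers):
    """True if every Likert item received the same response."""
    return len(answers) > 1 and len(set(answers)) == 1


def fails_attention_check(answers, check_index, required):
    """True if the attention-check item was not answered as instructed."""
    return answers[check_index] != required


# Hypothetical responses; the item at index 2 is the attention check
# and should have been answered 1 (Strongly Disagree).
responses = {
    "r1": [4, 2, 1, 5, 3],
    "r2": [5, 5, 5, 5, 5],
    "r3": [3, 4, 2, 4, 3],
}

flagged = {
    rid: {
        "straight_liner": is_straight_liner(ans),
        "failed_check": fails_attention_check(ans, check_index=2, required=1),
    }
    for rid, ans in responses.items()
}
```

What to do with flagged respondents (exclude them, down-weight them, or simply report their number) is a design decision that should be made before the data are collected.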

Other issues to consider:

The questionnaire needs a title (and possibly a date)
Use short sentences; (Two short sentences are better than one long one.)
Avoid unnecessary (double) negatives:

How strongly do you agree with:
ASK: ‘This course has good support for learning iNZight’
NOT: ‘This course does not have good support for learning iNZight’
Use terms consistently (e.g. refer consistently to household OR dwelling)
Use terms that will be understood (avoid technical jargon)
Don’t use abbreviations unless you are very sure they will be understood. ‘NZ’ is probably all right in New Zealand (though migrants and visitors won’t be familiar with it); ‘VUW’ would not be appropriate in most situations.
Terms should be unambiguous (not vaguely defined, or ambiguous)
Terms should be free from moral or other overtones – they should not be emotionally loaded.

Compare questions about ‘terrorists’ and ‘freedom fighters’.
Don’t ask leading questions (the respondent is guided to an answer by the question)
Don’t ask questions which assume a state of affairs (e.g. don’t assume anything about what the respondent thinks about the world)
Ask for only ONE piece of information – i.e. Don’t ask double-barrelled (or worse) questions:

NOT: Do you think that there should be longer sentences and hard physical labour for those convicted of violent offences?

Instead:
1. Do you think there should be longer sentences for those convicted of violent offences?
2. Do you think hard physical labour should be part of the sentences of those convicted of violent offences?
Be clear whether you are seeking facts (‘Do you …’) or opinions (‘Do you think …’) from the respondent.
Number the pages
Use colour to create contrasts between instructions and questions, and assist with the flow of the questionnaire
Ask information that the respondent can provide – and without too much difficulty.

It may be easier to ask what the respondent recently did, rather than generally does:

ASK: ‘How many times did you go to the cinema last week?’
NOT: ‘How many times do you go to the cinema per week?’

You’ll be averaging over lots of respondents – so it doesn’t matter if one respondent had a big movie week last week, others won’t have. (Just as long as you don’t ask this question after a film festival.)
Shorter recall periods lead to more precise answers – best not have a long recall period (e.g. ask about last month, rather than last year).

Recent events will be recalled more accurately than more distant events: this is called recall bias.
Respondents often agree with statements, however they are worded. If only one side of an argument is stated then respondents often support it.

Use a neutral approach.

In particular, avoid predisposing questions

ASK: ‘Do you think students should be helped with after school activities?’
NOT: ‘Do you think teachers should help with students’ after school activities?’

The second question suggests a particular solution to a problem, but it is not the only possible solution. The respondent should be able to agree that students need help, but not with the solution that you are offering.
The order questions are asked in makes a difference to the way people respond.

If you ask a lot of questions about how safe someone feels from criminal violence and then a question about tougher prison sentences, the pattern of responses will differ from the pattern obtained when the prison sentences question is asked at the beginning.
Don’t ask questions that will embarrass the respondent, or make him/her feel exposed.

Example. The Māori Language survey asks for self-assessed language ability, rather than actually testing the respondent’s language skill. There are two main reasons (i) to reduce the time the interview takes and (ii) not to embarrass a respondent who would feel that this was a test, and s/he might become nervous and unable to speak naturally.

Ask questions in a way that doesn’t make the respondent feel stupid if s/he can’t answer:

ASK: ‘Can you tell me if…?’
NOT: ‘Do you know if…?’
Indirect questions can soften an approach to a difficult subject:

ASK: ‘What do you think a student should do if…?’
THEN: ‘What do most students actually do when…?’
THEN: ‘What would you do if…?’
Finish the questionnaire with a thank you statement

15.4 Interviewers

What additional information can the interviewer provide if the respondent asks?

Note that there is a risk that if the respondent and interviewer become too chatty, then the respondent will start to treat the interviewer as a friend, and may succumb to social desirability bias – giving answers s/he thinks the interviewer will want to hear.
Can the interviewer use probes – asking further questions if the respondent does not seem to be answering the question in the right way?

Is the interviewer allowed to improvise in any way? or should the interviewer stick to a script?

The risk is that some interviewers may be more or less successful at this, and responses to the questionnaire will not be uniform across interviewers.

The interviewer also may make a mistake when re-explaining a concept in other words, and introduce a misconception.
Interviewers need training
- Survey aims, objectives and sample design
- Concepts related to the survey and what it is trying to measure (e.g. the concept of ethnicity, or general health, or …)
- Questionnaire content
- Field Work procedures – making contact, making an appointment, visiting a person in their home, document flow matters etc.
- Conducting an interview professionally. Keeping their own views separate.
Training may involve lectures, discussion groups and role playing.
Interviewers may need to be of a specific age, sex, ethnicity for some surveys. They may need specific language skills.
An important principle is that respondents should give the same answers to the questions, no matter which interviewer conducted the interview.

Think about how this can be ensured.

16 Data Release

When data are to be released in any form several questions need to be asked:

Are the data of sufficient quality to release?
Is there a risk of identifying individuals, or releasing any information given in confidence?
Are the data being used for the purpose for which they were collected? (What were the respondents told about the uses of the data at the time of collection?)
Are there any legal, commercial or ethical considerations?

16.1 Data and Statistics Act 2022

Link to the Act: https://www.legislation.govt.nz/act/public/2022/0039/latest/LMS418574.html

From the Statistics New Zealand website (www.stats.govt.nz):

https://www.stats.govt.nz/about-us/legislation-policies-and-guidelines/

Statistics New Zealand operates under the authority of the Data and Statistics Act 2022. The Act is in seven parts:

Part 1 - The Act ensures that high-quality, impartial, and objective official statistics are produced relating to New Zealand to inform the public and inform decision making.
Part 2 - Defines roles and responsibilities, including in particular the Government Statistician and creates the department known as Statistics New Zealand;
Part 3 provides for the collection of data and for matters concerning statistical confidentiality;
Part 4 relates to the production of official statistics including the powers of the Minister of Statistics, and obligations for publication;
Part 5 relates to access to data for research;
Part 6 relates to offences and enforcement related to official statistics;
Part 7 contains general provisions.

16.2 Data quality

If sample numbers are small then estimates from that sample may be very imprecise. This in particular applies to estimates for subpopulations which are reported as part of large surveys.

Where data are imprecise it is good practice to flag cells or estimates which are unreliable, or to suppress such cells altogether. Consequential cell suppression is not necessary when estimates are suppressed for quality reasons.

For example in the 2001 Māori Language Survey Statistics New Zealand published, by age and sex, counts of the numbers of people who could speak at differing levels of proficiency. The output estimates are given in Table 16.1.

Table 16.1: Speaking proficiency by age and sex (Maori Language Survey 2001)
	Proficiency
AgeGroup	Very Well	Well	Fairly Well	Not Very Well	Few Words or Phrases	Total
Males
15-24 years	**	**	3869	10862	27083	43373
25-34 years	**	**	*2668	9394	22572	35552
35-44 years	**	**	*2720	4982	23008	32633
45-54 years	*2164	**	*2137	4001	12134	21161
55+ years	4952	*1115	1928	2621	9142	19757
Total	9045	4309	13322	31860	93939	152476
Females
15-24 years	*1516	*2386	7441	10618	24247	46209
25-34 years	**	*1172	4881	10572	25232	42378
35-44 years	**	**	3465	10481	21731	37478
45-54 years	*1299	*1191	*2894	5138	12251	22773
55+ years	5184	*1258	2691	3629	9507	22269
Total	9449	6880	21372	40437	92969	171107
Total
15-24 years	*2145	3317	11310	21480	51330	89582
25-34 years	**	*1773	7549	19966	47804	77930
35-44 years	*1912	*1812	6185	15463	44739	70111
45-54 years	3463	*1915	5031	9139	24385	43934
55+ years	10136	2373	4619	6249	18649	42026
Total	18494	11190	34694	72297	186908	323583
Number of Maori language speakers at each proficiency level for Maori aged 15 years and over
* Sampling Error $>$30%
** Sampling Error $>$50%: these cells are suppressed
Source: Statistics NZ

Because of small sample sizes certain of the cells have been suppressed (margins of error greater than 50% of the size of the estimate) or flagged as very uncertain (margins of error greater than 30%).

The absolute and relative errors are shown in Tables 16.2 and 16.3.

Table 16.2: Absolute Sampling Errors (Maori Language Survey 2001)
	Proficiency
AgeGroup	Very Well/Well	Fairly Well	Not Very Well	Few Words or Phrases
Males
15-24 years	604	916	1646	1730
25-34 years	506	906	1464	1526
35-44 years	627	747	1236	1477
45-54 years	892	796	982	1158
55+ years	847	472	603	953
Total	1423	2005	3442	3677
Females
15-24 years	1172	1535	1906	2480
25-34 years	635	1098	1831	2011
35-44 years	753	1014	1846	1977
45-54 years	833	808	1091	1239
55+ years	751	516	682	999
Total	1939	2466	4524	5200
Total
15-24 years	1347	1801	2650	3257
25-34 years	831	1449	2575	2757
35-44 years	1030	1327	2414	2612
45-54 years	1279	1101	1487	1768
55+ years	1216	763	942	1599
Total	2560	3377	6487	7209
Number of Maori language speakers at each proficiency level for Maori aged 15 years and over

Table 16.3: Relative Sampling Errors (%) (Maori Language Survey 2001)
	Proficiency
AgeGroup	Very Well/Well	Fairly Well	Not Very Well	Few Words or Phrases
Males
15-24 years	39	24	15	6
25-34 years	55	34	16	7
35-44 years	33	28	25	6
45-54 years	31	37	24	10
55+ years	14	25	23	10
Total	11	15	11	4
Females
15-24 years	30	21	18	10
25-34 years	38	22	17	8
35-44 years	42	29	18	9
45-54 years	33	28	21	10
55+ years	12	19	19	11
Total	12	12	11	6
Total
15-24 years	25	16	12	6
25-34 years	32	19	13	6
35-44 years	28	22	16	6
45-54 years	24	22	16	7
55+ years	10	17	15	9
Total	9	10	9	4
Number of Maori language speakers at each proficiency level for Maori aged 15 years and over
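
The flagging and suppression rule used in Table 16.1 can be sketched in a few lines of Python. This sketch assumes that the relative sampling error is the absolute sampling error divided by the estimate, expressed as a percentage, which agrees with the published tables for the cells checked below. The function names are hypothetical, and the thresholds of 30% and 50% are those given in the table footnotes.

```python
def relative_sampling_error(estimate, absolute_error):
    """Relative sampling error as a percentage of the estimate."""
    return 100.0 * absolute_error / estimate


def flag_estimate(estimate, absolute_error):
    """Return the estimate as it would be published: plain, flagged or suppressed."""
    rse = relative_sampling_error(estimate, absolute_error)
    if rse > 50:
        return "**"            # suppressed: the value is not published
    if rse > 30:
        return f"*{estimate}"  # published, but flagged as unreliable
    return str(estimate)


# Males, Fairly Well: estimates from Table 16.1, absolute errors from Table 16.2.
cell_15_24 = flag_estimate(3869, 916)
cell_25_34 = flag_estimate(2668, 906)

# Hypothetical cell with a relative sampling error above 50%.
cell_small = flag_estimate(1000, 600)
```

The first cell is published unflagged and the second is flagged with a single asterisk, as in Table 16.1.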

16.3 Confidentialising Data

Survey data are usually collected with some assurances about confidentiality. For example all data collected by Statistics New Zealand are covered by the Data and Statistics Act 2022, which states in Section 39(1):

The Statistician must take all reasonable steps to ensure that the Statistician does not publish or otherwise disclose data in a form that could reasonably be expected to identify any individual or organisation.

There are a number of exceptions however.

When data are released there is a risk that individuals and their characteristics could be identified. Steps may need to be taken to prevent this disclosure risk when survey data are released.

There is a clear risk where there is an individual who is unique in the sample – a sample unique (they differ from everyone else in the sample, but not necessarily everyone in the population).
Such an individual will end up in a cell with a count of 1 in a crosstabulation of a (set of) categorical variable(s).

If that person is also a population unique (they differ from everyone else in the population), then publication of such data will lead to significant disclosure about that individual.

Where the sampling fraction is low (i.e. the sample weights are high) then the risk is small, and there is usually no need to confidentialise. Sample uniques are unlikely to be population uniques in this case.

However where the sampling fraction is high (especially in full coverage strata, in censuses and in populations where there are just a few influential units) there is a significant risk of disclosure since a sample unique is very likely to be a population unique. In a census a sample unique is always a population unique.
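
Sample uniques are straightforward to find in a unit record dataset. The sketch below, in Python, counts the records in each cell of a crosstabulation and returns those that are alone in their cell. The function name, the variables and the records are hypothetical and purely illustrative.

```python
from collections import Counter


def sample_uniques(records, keys):
    """Return the records that are alone in their cell of the crosstabulation on `keys`."""
    def cell(record):
        return tuple(record[k] for k in keys)

    counts = Counter(cell(r) for r in records)
    return [r for r in records if counts[cell(r)] == 1]


# Hypothetical unit records.
records = [
    {"id": 1, "age": "15-24", "sex": "F", "region": "North"},
    {"id": 2, "age": "15-24", "sex": "F", "region": "North"},
    {"id": 3, "age": "55+", "sex": "M", "region": "South"},
    {"id": 4, "age": "25-34", "sex": "M", "region": "North"},
    {"id": 5, "age": "25-34", "sex": "M", "region": "North"},
]

uniques = sample_uniques(records, ["age", "sex", "region"])
```

Whether a sample unique is also a population unique cannot be determined from the sample alone; that judgement depends on the sampling fraction and on what is known about the population.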

For rare members of the population there is a risk that they can identify each other. For example, if there are only two large companies operating, say, internet businesses, and if an Official Statistics Agency reports the total turnover of all internet businesses, there is a disclosure risk: each large company can subtract its own turnover from the published total, and deduce the size of its competitor’s business and its market share. This is an unacceptable disclosure from the point of view of those companies, even if no company apart from the two main companies could possibly deduce their separate turnovers.

16.4 Unit Records

Official Statistics Agencies, such as Statistics New Zealand, may produce Confidentialised Unit Record Files or CURFs. These are usually sample survey datasets released to researchers or other government departments.

Census data are rarely released in this way.

Researchers agree to use the data only for specific purposes, and to destroy the unit record data after use.

Where unit records are to be released the data can be confidentialised by:

Removing all personal identifiers such as name and address;
Replacing administrative identifiers (e.g. IRD number) with some other identifier so that the data provider can identify a person, but the researcher cannot;
Adding ‘noise’ to the data – e.g. adding a random amount to each person’s income before creating income bands in a table;
Swapping data between records (taking care that this does not significantly change the overall statistical properties of the data);
Replacing actual data with imputed data (e.g. by regression imputation);
Replacing data with bands (e.g. reporting income in bands rather than actual values);
Top coding data. For example, actual incomes are recorded except for all of the top earners, who are put into a single band together (e.g. more than $200,000).

These procedures mean that anyone looking at the data cannot be sure that the data in any particular record is a true set of data for that individual.
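
Two of these procedures, top coding and banding, can be sketched in Python as follows. The function names, the cap, the band edges and the incomes are hypothetical choices made for illustration, not values used by any agency.

```python
from bisect import bisect_right


def top_code(value, cap):
    """Replace any value above `cap` with `cap` itself."""
    return min(value, cap)


def to_band(value, edges, labels):
    """Map a value to the label of the band it falls in.

    `edges` holds the lower bound of every band after the first,
    so there must be one more label than there are edges.
    """
    return labels[bisect_right(edges, value)]


# Hypothetical band definitions and incomes.
edges = [20_000, 50_000, 100_000]
labels = ["<20,000", "20,000-49,999", "50,000-99,999", "100,000+"]
incomes = [18_500, 50_000, 99_999, 250_000, 1_200_000]

top_coded = [top_code(x, 200_000) for x in incomes]
banded = [to_band(x, edges, labels) for x in incomes]
```

After top coding, the two highest earners can no longer be distinguished from each other; after banding, only the income band of each person remains.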

16.5 Tables

Where cell counts in tables are small there is a risk of identifying individuals and their characteristics. One rule which is often used to determine whether or not a cell poses a disclosure risk is the $(n,k)$ rule:

A cell is a risk if $n$ respondents or fewer contribute $k$% or more to the value of a cell.

For example we might consider a cell a risk if $n$=3 respondents or fewer contribute $k$=80% or more of the cell value. In a table of counts each respondent contributes 1 to the cell, so a cell with count $y$ is a risk if $3\geq 0.8y$: i.e. where $y\leq 3/0.8=3.75$: i.e. counts of 3 or fewer.
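
The rule can be checked mechanically when the individual contributions to a cell are known. The following Python sketch assumes non-negative contributions and treats an empty cell as posing no risk; the function name and the example values are hypothetical.

```python
def nk_rule_risky(contributions, n, k):
    """True if the n largest contributions make up k% or more of the cell total."""
    total = sum(contributions)
    if total == 0:
        return False  # assumption: an empty cell discloses nothing
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100 * top_n >= k * total


# In a table of counts every respondent contributes 1 to the cell.
count_of_three = nk_rule_risky([1] * 3, n=3, k=80)
count_of_four = nk_rule_risky([1] * 4, n=3, k=80)

# Hypothetical magnitude cell dominated by one large contributor.
dominated = nk_rule_risky([900, 50, 30, 20], n=3, k=80)
```

With $n$=3 and $k$=80 a count of 3 is a risk and a count of 4 is not, in line with the calculation above.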

Another rule for deciding if a cell is risky is the $p$% rule:

A cell is a risk if the value of any contributor can be estimated to within $p$% of its true value.

This includes estimation by one of the other contributors to the cell. For example consider a cell with a value of $100,000. If one business contributed $40,000 to the cell, and knew that it had a larger competitor, then it could deduce that the competitor contributed between $40,000 and $60,000 to the cell. Estimating that competitor’s value as $50,000 would mean that the competitor’s value is being estimated to within $10,000, i.e. to within 20% of the estimated value.

In practice, applying the $p$% rule only requires checking whether the second largest contributor can estimate the value of the largest contributor to within $p$%. This is the ‘worst case’: if the second largest contributor cannot break the confidentiality protection, then no other contributor can either.
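
One common way to code this worst-case check, assumed here rather than taken from these notes, is to let the second largest contributor subtract its own value from the cell total and use the result as an estimate of the largest contributor. That estimate is too high by exactly the combined value of all the other contributors, so the cell is flagged when this remainder is less than $p$% of the largest value. The function name and the sample figures below are hypothetical.

```python
def is_risky_p_percent(contributions, p=20.0):
    """Flag a cell if the second largest contributor can estimate the largest to within p%."""
    values = sorted(contributions, reverse=True)
    if len(values) < 2:
        # A lone contributor's value is the published cell value itself.
        return len(values) == 1
    largest, second = values[0], values[1]
    remainder = sum(values) - largest - second   # what everyone else contributed
    # Estimate of the largest = total - second, so its error equals `remainder`.
    return 100.0 * remainder < p * largest

two_firms = is_risky_p_percent([60_000, 40_000], p=20.0)   # nothing else in the cell
spread_out = is_risky_p_percent([50, 30, 20], p=20.0)      # remainder 20 vs 10
dominated = is_risky_p_percent([80, 15, 5], p=10.0)        # remainder 5 vs 8
```

A cell with only two contributors is always flagged under this formulation, because each of them can recover the other's value exactly by subtraction.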

The values of $(n,k)$ and $p$% that are used by Statistics New Zealand are themselves confidential.

There are various options when tabular data are confidentialised.

Suppress risky cells. This means not publishing a value in those cells. This usually means consequential suppression of some other non-risky cells, in order that the value of the suppressed cell not be deducible;
Construct tables from confidentialised unit records;
Amalgamate rows and/or columns until all cells are large;
Random round each cell entry. Statistics New Zealand does this random rounding to base 3 for Census and other tables of counts.
The procedure is as follows:
- If a count $x$ is a multiple of 3, i.e. $x=3m$, it is left unchanged;
- If a count is one more than a multiple of 3, i.e. $x=3m+1$, then it is rounded down to $3m$ with probability $\frac{2}{3}$, and rounded up to $3m+3$ with probability $\frac{1}{3}$.
  
  Draw a random number $r$ between 0 and 1: if $r<\frac23$ round down, otherwise round up.
- If a count is two more than a multiple of 3, i.e. $x=3m+2$, then it is rounded down to $3m$ with probability $\frac{1}{3}$, and rounded up to $3m+3$ with probability $\frac{2}{3}$.
  
  Draw a random number $r$ between 0 and 1: if $r<\frac13$ round down, otherwise round up.
Counts in the margins of tables may be left unchanged, random rounded independently, or recalculated as the sums of the cell entries. Except in the last case this means the cells in a table may not add up to the published margins.
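
The rounding procedure described above can be sketched as a short Python function. This is a minimal illustration and not Statistics New Zealand's implementation: the function name, the `base` and `rng` parameters and the seed are assumptions made for the example.

```python
import random

def random_round(x, base=3, rng=random):
    """Random round a non-negative integer count to a multiple of `base`."""
    remainder = x % base
    if remainder == 0:
        return x                          # multiples of the base are left unchanged
    lower = x - remainder                 # nearest multiple below x
    r = rng.random()                      # r is uniform on [0, 1)
    # Round down with probability (base - remainder) / base, otherwise round up.
    if r < (base - remainder) / base:
        return lower
    return lower + base

rng = random.Random(392)                  # seeded only so the example is repeatable
rounded_counts = [random_round(x, rng=rng) for x in (10, 12, 6, 7, 1, 2)]
```

For base 3 the condition reduces to the two cases in the text: a count of the form $3m+1$ is rounded down when $r<\frac23$, and a count of the form $3m+2$ is rounded down when $r<\frac13$.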

Example 1. Consider the following table from the 2001 Census on populations by ethnicity in small areas. The data in the table have been random rounded to base 3. The columns of figures do not add up to the published totals because of the random rounding.

Table 16.4: Census counts for districts in the central South Island
Ethnic Group	Timaru	Mackenzie	Waimate
Pacific Peoples
Samoan	159	0	18
Cook Island Maori nfd	42	3	12
Tongan	66	0	9
Niuean	6	0	0
Fijian (except Fiji Indian/Indo-Fijian)	12	3	0
Tokelauan	12	0	0
Tuvalu Islander/Ellice Islander	6	0	0
Rarotongan	9	0	0
Society Islander (including Tahitian)	0	0	0
Other Pacific Peoples	18	3	3
All Pacific Peoples
All Pacific Peoples	297	12	39
All Ethnic Groups
All Ethnic Groups	41082	3546	6978

The data for the Mackenzie District might have originally looked like the values below. These data could be treated by suppressing cells or by random rounding. Since the counts are so small, cell suppression leads to an exaggerated loss of data: some non-risky cells are suppressed in order that the sensitive cells remain unidentifiable after suppression.

Table 16.5: Mackenzie District - Confidentialisation
Ethnic Group	Original	Cell Suppression	Random Rounding
Pacific Peoples
Samoan	0	0	0
Cook Island Maori nfd	5	5	3
Tongan	0	0	0
Niuean	0		0
Fijian (except Fiji Indian/Indo-Fijian)	3		3
Tokelauan	0		0
Tuvalu Islander/Ellice Islander	0	0	0
Rarotongan	1		0
Society Islander (including Tahitian)	0		0
Other Pacific Peoples	4	4	3
All Pacific Peoples
All Pacific Peoples	13	13	12
All Ethnic Groups
All Ethnic Groups	3544	3544	3546

Example 2. Here is a table with some small counts:

Table 16.6: Original data
	A	B	C	Total
P	10	12	6	28
Q	7	6	11	24
R	5	1	2	8
Total	22	19	19	60

If we take the view that cells with counts smaller than 4 are too risky to release, we have two cells that need to be confidentialised.

Solution 1: Amalgamation – We can combine rows Q and R:

Table 16.7: Amalgamated data
	A	B	C	Total
P	10	12	6	28
QR	12	7	13	32
Total	22	19	19	60

This eliminates any information about the different distributions for Q and R.

Solution 2: Cell suppression – We can suppress the risky cells:

Table 16.8: Cell Suppression
	A	B	C	Total
P	10	12	6	28
Q	7	6	11	24
R	5			8
Total	22	19	19	60

This leaves the data for all non-risky cells visible, and removes the data in the risky cells. However, since we have the column totals we can deduce the missing data: for example the suppressed cell in column B must be $19-12-6=1$. That means we have to suppress two further non-risky cells in order for the cell suppression to effectively disguise the risky cells:

Table 16.9: Cell suppression
	A	B	C	Total
P	10			28
Q	7	6	11	24
R	5			8
Total	22	19	19	60

This necessary suppression of non-risky cells is called consequential cell suppression.
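
Whether a pattern of suppressed cells can be undone from the published margins is easy to check mechanically. The sketch below, with a hypothetical function name, repeatedly fills in any suppressed cell (written as `None`) that is the only gap in its row or column, and is applied to Tables 16.8 and 16.9.

```python
def fill_from_margins(cells, row_totals, col_totals):
    """Recover suppressed cells (None) wherever a row or column has exactly one gap."""
    table = [row[:] for row in cells]               # work on a copy
    changed = True
    while changed:
        changed = False
        for i, row in enumerate(table):
            gaps = [j for j, v in enumerate(row) if v is None]
            if len(gaps) == 1:
                known = sum(v for v in row if v is not None)
                row[gaps[0]] = row_totals[i] - known
                changed = True
        for j in range(len(col_totals)):
            gaps = [i for i, row in enumerate(table) if row[j] is None]
            if len(gaps) == 1:
                known = sum(row[j] for row in table if row[j] is not None)
                table[gaps[0]][j] = col_totals[j] - known
                changed = True
    return table

row_totals, col_totals = [28, 24, 8], [22, 19, 19]
primary_only = [[10, 12, 6], [7, 6, 11], [5, None, None]]              # Table 16.8
with_consequential = [[10, None, None], [7, 6, 11], [5, None, None]]   # Table 16.9

recovered = fill_from_margins(primary_only, row_totals, col_totals)
still_hidden = fill_from_margins(with_consequential, row_totals, col_totals)
```

For Table 16.8 every suppressed value is recovered, while for Table 16.9 the four gaps remain. This simple check only looks for exact recovery; it does not examine how narrowly the hidden values can still be bounded using the margins.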

Solution 3: Random Rounding – We can random round each cell to base 3 (or some other base of our choice). Multiples of 3 are left undisturbed. Any other number is one unit away from one multiple of 3 and two units away from another. In random rounding we round either up or down, with a higher probability of rounding to the closer value.

Value	Rounds To
0	0 always
1	0 with probability $\frac23$, 3 with probability $\frac13$
2	0 with probability $\frac13$, 3 with probability $\frac23$
3	3 always
4	3 with probability $\frac23$, 6 with probability $\frac13$
5	3 with probability $\frac13$, 6 with probability $\frac23$
6	6 always
7	6 with probability $\frac23$, 9 with probability $\frac13$
8	6 with probability $\frac13$, 9 with probability $\frac23$
9	9 always
10	9 with probability $\frac23$, 12 with probability $\frac13$
11	9 with probability $\frac13$, 12 with probability $\frac23$
12	12 always
etc.

This means that a published random rounded value is consistent with several possible true counts:

Random Rounded Value	Could actually have been
0	0,1,2
3	1,2,3,4,5
6	4,5,6,7,8
9	7,8,9,10,11
12	10,11,12,13,14
etc.
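
Both tables above can be reproduced exactly, without any random draws, by working with the rounding probabilities as fractions. The sketch below uses hypothetical function names. It also checks a property that is implied by the probabilities but not stated explicitly in these notes: the expected published value equals the true count.

```python
from fractions import Fraction

def rounding_distribution(x, base=3):
    """Exact probability of each published value for a true count x."""
    remainder = x % base
    if remainder == 0:
        return {x: Fraction(1)}
    lower = x - remainder
    return {lower: Fraction(base - remainder, base),
            lower + base: Fraction(remainder, base)}

def possible_true_counts(published, base=3):
    """All non-negative true counts that could have been rounded to `published`."""
    candidates = range(max(0, published - base + 1), published + base)
    return [x for x in candidates if published in rounding_distribution(x, base)]

def expected_published(x, base=3):
    """Mean of the published value for a true count x."""
    return sum(value * prob for value, prob in rounding_distribution(x, base).items())

dist_for_4 = rounding_distribution(4)        # 3 with prob 2/3, 6 with prob 1/3
behind_a_3 = possible_true_counts(3)         # counts that can be published as 3
unbiased = all(expected_published(x) == x for x in range(0, 31))
```

Because the published value is right on average, totals built from many random rounded tables are not systematically pushed up or down by the rounding.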

Each time we random round a table, we end up with a slightly different version. We round each cell separately, which means that we round the margins from their original values rather than adding up the random rounded cell values.

Table 16.10: Random Rounded (base 3)
	A	B	C	Total
P	9	12	6	27
Q	6	6	12	24
R	6	0	0	9
Total	21	18	21	60

The good thing about rounding every cell separately is that the margin values are not far from their true values. The frustrating thing is that the cells in the table don’t necessarily add up to their margins any more (e.g. in the table above the R row gives 6+0+0=6, but the row total is stated to be 9).
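
The first of these points can be made a little more precise. As a sketch, write $e$ for the rounding error (published value minus true value) of a count that is not a multiple of 3, and assume that different cells are rounded independently; the symbol $e$ and the independence assumption are introduced here and are not part of the notes. For a count of the form $3m+1$ the error is $-1$ with probability $\frac23$ and $+2$ with probability $\frac13$, so

\[ \mathrm{E}[e] = \tfrac23(-1)+\tfrac13(+2) = 0, \qquad \mathrm{Var}[e] = \tfrac23(-1)^2+\tfrac13(+2)^2 = 2. \]

For a count of the form $3m+2$ the error is $-2$ with probability $\frac13$ and $+1$ with probability $\frac23$, which gives the same mean and variance:

\[ \mathrm{E}[e] = \tfrac13(-2)+\tfrac23(+1) = 0, \qquad \mathrm{Var}[e] = \tfrac13(-2)^2+\tfrac23(+1)^2 = 2. \]

A margin that is rounded directly is therefore never more than 2 away from its true value, with error variance at most 2. A margin formed by adding up $c$ separately rounded cells could be wrong by as much as $2c$, with error variance as large as $2c$.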

16.6 Other Types of Release

Graphical displays are equivalent in many ways to tables and unit records, and many of the confidentialising methods listed above apply to them.

A scatterplot is a visual subset of a unit record dataset: we see pairs of individual values of continuous variables displayed. Noise could be added to the data points to confidentialise the data. In a histogram the bands can be chosen to be wide enough to group many respondents together, and the lowest and highest bins could be left open-ended (bottom and/or top coded) to protect the outlying respondents.

References

Cochran, W. G. 1977. Sampling Techniques. Third. New York: Wiley.

Kish, L. 1965. Survey Sampling. New York: Wiley.

Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data. Second. Hoboken: Wiley.

Lohr, Sharon L. 1999. Sampling: Design and Analysis. Pacific Grove, CA, USA: Duxbury Press.

Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (1): 1–19.

Lumley, Thomas S. 2010. Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology). Wiley.

Särndal, C-E., B. Swensson, and J. Wretman. 1992. Model Assisted Survey Sampling. New York: Springer-Verlag.

Scheaffer, R. L., W. Mendenhall III, and R. L. Ott. 2006. Elementary Survey Sampling. Sixth. Belmont: Duxbury.

Statistics New Zealand. 1995. A Guide to Good Survey Design. Second. Wellington: Statistics New Zealand.

A Computing the Required Sample Size

Identify your design estimate – this is the estimate that will underlie a headline you could imagine being written about your survey.
Identify the desired MOE, $m$, for this design estimate, noting whether you want this MOE to apply to the overall estimate or to the estimate in each of, say, $H$ subgroups/strata.
Compute the sample size required, $n_1$, assuming that you’re going to select a Simple Random Sample and ignoring the fpc.

For estimation of a mean $\bar{Y}$ of some characteristic $y$: \[ n_1 = \left(\frac{Z}{m}\right)^2 s^2 \] where $Z$ is the normal critical value appropriate for the desired level of confidence (e.g. 1.96 for 95% confidence) and $s$ is the estimated population standard deviation of the characteristic $y$.

For estimation of a proportion $P$: \[ n_1 = \left(\frac{Z}{m}\right)^2 p(1-p) \] where $p$ is the estimated proportion, or 0.5 if really unknown.
If the sample size is a substantial proportion of the population or subgroup size $N$, then apply the fpc adjustment: \[ n_2 = \frac{n_1}{1+\frac{n_1}{N}} \]
If you have $H$ subgroups and are using equal allocation then multiply by $H$: \[ n_3 = Hn_2 \] or alternatively sum over the $n_2$ values in each of the strata you’ve defined.
Apply the design effect (Deff) appropriate for the actual design you’ll be using: \[ n_4 = \text{Deff}\times n_3 \]
Adjust for anticipated non-response: if response rate $\phi$ is expected then \[ n_5 = \frac{n_4}{\phi} \] (If non-response is likely to vary between strata, then apply this correction to $n_2$, within each stratum.)
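
The steps above can be strung together in a short Python function. This is a sketch under stated assumptions: the function and argument names are hypothetical, equal allocation is assumed whenever `H` is greater than 1 (so `N` is then the size of each subgroup), and rounding the final answer up to a whole number is a choice made here rather than part of the recipe.

```python
import math

def required_sample_size(moe, z=1.96, s=None, p=0.5, N=None, H=1, deff=1.0, phi=1.0):
    """Return the intermediate sample sizes n1..n5 and the final rounded-up size.

    moe  : desired margin of error m
    z    : normal critical value (1.96 corresponds to 95% confidence)
    s    : estimated standard deviation (for a mean); if None, a proportion p is used
    N    : population (or subgroup) size for the fpc adjustment; None skips it
    H    : number of subgroups under equal allocation
    deff : design effect of the actual design
    phi  : anticipated response rate
    """
    variance = s ** 2 if s is not None else p * (1 - p)
    n1 = (z / moe) ** 2 * variance                      # SRS, ignoring the fpc
    n2 = n1 / (1 + n1 / N) if N is not None else n1     # fpc adjustment
    n3 = H * n2                                         # equal allocation over H subgroups
    n4 = deff * n3                                      # design effect
    n5 = n4 / phi                                       # allow for non-response
    return {"n1": n1, "n2": n2, "n3": n3, "n4": n4, "n5": n5,
            "final": math.ceil(n5)}

# Hypothetical inputs: a proportion with unknown p, MOE of 5 percentage points,
# population of 2000, design effect 1.5 and an 80% response rate.
sizes = required_sample_size(0.05, N=2000, deff=1.5, phi=0.8)
```

If non-response is expected to differ between strata, the same function can be called once per stratum with `H=1` and that stratum's own `N` and `phi`, and the results summed.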

Quantity	Population	Sample
size	\(N\)	\(n\)
total	\(Y = \sum_{i=1}^N Y_i\)	\(y = \sum_{k=1}^n y_k\)
mean	\(\bar{Y} = \frac{1}{N}\sum_{i=1}^N Y_i\)	\(\bar{y} = \frac{1}{n}\sum_{k=1}^n y_k\)
variance	\(\sigma_Y^2=\frac{1}{N}\sum_{i=1}^N (Y_i-\bar{Y})^2\)
adjusted variance	\(S_Y^2=\frac{1}{N-1}\sum_{i=1}^N (Y_i-\bar{Y})^2\)	\(s_y^2=\frac{1}{n-1}\sum_{k=1}^n (y_k-\bar{y})^2\)
adjusted variance for indicator variables	\(S_Y^2=\frac{N}{N-1}p(1-p)\)	\(s_y^2=\frac{n}{n-1}\widehat{p}(1-\widehat{p})\)
relative variance	\(V_Y^2=\frac{S_Y^2}{\bar{Y}^2}\)	\(v_y^2=\frac{s_y^2}{\bar{y}^2}\)
coefficient of variation	\(V_Y=\sqrt{V_Y^2}=\frac{S_Y}{\bar{Y}}\)	\(v_y=\sqrt{v_y^2}=\frac{s_y}{\bar{y}}\)
covariance	\(S_{XY}=\frac{1}{N-1}\sum_{i=1}^N (X_i-\bar{X})(Y_i-\bar{Y})\)	\(s_{xy}=\frac{1}{n-1}\sum_{k=1}^n (x_k-\bar{x})(y_k-\bar{y})\)
correlation coefficient	\(\rho_{XY}=\frac{S_{XY}}{S_XS_Y}\)	\(r_{xy}=\frac{s_{xy}}{s_xs_y}\)

\(h\)	Stratum	Stratum Fraction \(F_h\)	Cost, \(c_h\)	Std. Dev. \(S_h\)
1	Undergraduate	0.80	\(C\)	\(S\)
2	Postgraduate	0.20	\(2C\)	\(3S\)

	In PES?
In Census?	Yes	No	Total
Yes	\(n_{11}\)	\(n_{12}\)	\(n_{1+}\)
No	\(n_{21}\)	\(0\)	\(n_{21}\)
Total	\(n_{+1}\)	\(n_{12}\)	\(n\)

	Would be found by PES?
In Census?	Yes	No	Total
Yes	\(N_{11}\)	\(N_{12}\)	\(N_{1+}\)
No	\(N_{21}\)	\(N_{22}\)	\(N_{2+}\)
Total	\(N_{+1}\)	\(N_{+2}\)	\(N\)

Sample	\(i_1\)	\(i_2\)	\(y_1=Y_{i_1}\)	\(y_2=Y_{i_2}\)	Sample Mean	Probability
1	1	2	0	1	0.5	\(\frac{1}{15}\)
2	1	3	0	3	1.5	\(\frac{1}{15}\)
3	1	4	0	5	2.5	\(\frac{1}{15}\)
4	1	5	0	6	3.0	\(\frac{1}{15}\)
5	1	6	0	9	4.5	\(\frac{1}{15}\)
6	2	3	1	3	2.0	\(\frac{1}{15}\)
7	2	4	1	5	3.0	\(\frac{1}{15}\)
8	2	5	1	6	3.5	\(\frac{1}{15}\)
9	2	6	1	9	5.0	\(\frac{1}{15}\)
10	3	4	3	5	4.0	\(\frac{1}{15}\)
11	3	5	3	6	4.5	\(\frac{1}{15}\)
12	3	6	3	9	6.0	\(\frac{1}{15}\)
13	4	5	5	6	5.5	\(\frac{1}{15}\)
14	4	6	5	9	7.0	\(\frac{1}{15}\)
15	5	6	6	9	7.5	\(\frac{1}{15}\)

Parameter	Hours	Income	PostSchool
Pop. size \(N\)	200.0000	200.0000	200.0000000
Mean \(\bar{Y}\)	33.7100	575.3600	0.4750000
Total, \(Y\)	6742.0000	115072.0000	95.0000000
Adj. Variance \(S_Y^2\)	261.0713	120137.6386	0.2506281
Std. Deviation \(S_Y\)	16.1577	346.6088	0.5006277
Unadj. Variance \(\sigma_Y^2\)	259.7659	119536.9504	0.2493750

Parameter	Hours	Income	PostSchool
Sample size \(n\)	20.000000	20.0000	20.0000000
Mean \(\bar{y}\)	38.300000	624.3500	0.7500000
Adj. Variance \(s_y^2\)	97.273684	133580.3447	0.1973684
Std. Deviation \(s_y\)	9.862742	365.4864	0.4442617

\(h\)	Stratum	\(F_hS_h\)	\(p_h=F_hS_h/\sum_k F_kS_k\)	\(n_h=np_h\)	Cost, \(n_hc_h\)
1	Undergraduate	\((0.8)(S) = 0.8S\)	\(0.8S/1.4S = 0.57\)	57	\(57\times C=57C\)
2	Postgraduate	\((0.2)(3S) = 0.6S\)	\(0.6S/1.4S = 0.43\)	43	\(43\times 2C=86C\)
	Total	\(1.4S\)	1.00	100	\(143C\)

\(h\)	Stratum	\(F_hS_h/\sqrt{c_h}\)	\(p_h \propto F_hS_h/\sqrt{c_h}\)	\(n_h=np_h\)	Cost, \(n_hc_h\)
1	Undergraduate	\((0.8)(S)/\sqrt{C} = 0.80S/\sqrt{C}\)	\(0.80/1.22 = 0.66\)	66	\(66\times C=66C\)
2	Postgraduate	\((0.2)(3S)/\sqrt{2C} = 0.42S/\sqrt{C}\)	\(0.42/1.22 = 0.34\)	34	\(34\times 2C=68C\)
	Total	\(1.22S/\sqrt{C}\)	1.00	100	\(134C\)

Stratum	Size	Std. Dev	Fraction		Equal	Proportional	Neyman
\(h\)	\(N_h\)	\(S_h\)	\(F_h\)	\(N_hS_h\)	\(p_h=\frac{1}{H}\)	\(p_h=F_h\)	\(p_h=\frac{N_hS_h}{\sum_hN_hS_h}\)
1	1300	0.496	0.26	644.8	0.25	0.26	0.14
2	2289	0.853	0.458	1952.5	0.25	0.46	0.43
3	932	0.852	0.186	794.1	0.25	0.19	0.17
4	479	2.441	0.096	1169.2	0.25	0.10	0.26
Total	5000		1.000	4560.6	1.00	1.00	1.00

	Classification	Sum of weights
1	Ineligible pre-contact	\(A\)
2	Ineligible post-contact	\(B\)
3	Eligible Non-Responding	\(C\)
4	Eligible Responding	\(D\)
5	Eligibility not established	\(E\)

	Estimated Population	Estimated Stratum Fraction	Selection Weight	Selected sample	Responding sample	Response Rate	Adjusted weight
Role	\(\widehat{N}_h=N\widehat{F}_h\)	\(\widehat{F}_h=n_h/n\)	\(w=N/n\)	\(n_h\)	\(n_{hR}\)	\(\phi_h\)	\(\tilde{w}_{hR}=\widehat{N}_{h}/n_{hR}=w/\phi_h\)
Manager	422	0.155	13.61	31	28	0.903	15.06
Non-manager	2299	0.845	13.61	169	68	0.402	33.81
Total	2721	1.000		200	96

Sample member	Student (within school)	Class	Class Size	Student (within class)	Number of books
\(k\)	\(i\)				\(y_k\)
1	5	1	9	5	2
2	14	2	10	5	2
3	16	2	10	7	2
4	63	6	16	9	1
5	65	6	16	11	0
6	70	6	16	16	1
7	77	7	12	7	2
8	80	7	12	10	3
9	87	8	14	5	5
10	89	8	14	7	2
11	107	10	6	1	6
12	127	12	10	7	3

Class	Class size	Sample Indicator		Sample Size	Sample Data
\(i\)	\(M_i\)	\(I_i\)	\(k\)	\(m_k\)	\(y_{k1},\ldots,y_{km_k}\)
1	9	1	1	4	4,4,3,4
2	10	0
3	20	0
4	8	0
5	7	0
6	16	1	2	4	6,6,6,4
7	12	0
8	14	0
9	10	0
10	6	0
11	8	0
12	10	1	3	4	1,2,3,5

					Replicates
Unit	Sex	\(y_k\)	\(w_k\)	\(w_k^\ast\)	\(w_k^{(1)}\)	\(w_k^{(2)}\)	\(w_k^{(3)}\)	\(w_k^{(4)}\)	\(w_k^{(5)}\)
1	Male	2	100	83.33	0.00	125.00	125	83.33	83.33
2	Male	0	100	83.33	125.00	0.00	125	83.33	83.33
3	Male	3	100	83.33	125.00	125.00	0	83.33	83.33
4	Female	5	100	125	125.00	125.00	125	0.00	250.00
5	Female	9	100	125	125.00	125.00	125	250.00	0.00
				Estimates: \(\widehat{\bar{Y}}_{(r)}\)	4.25	4.75	4	5.33	3.33

\(r\)	Observations					\(\bar{y}_{(r)}\)
1	2	5	2	2	2	2.6
2	9	2	2	3	5	4.2
3	3	9	9	5	2	5.6
4	0	3	9	3	5	4
5	3	9	5	0	3	4
6	5	9	5	5	2	5.2
7	0	3	9	9	3	4.8
8	3	0	3	0	2	1.6
9	0	2	9	2	2	3
10	0	2	5	2	2	2.2
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)

STAT392: Sample Surveys

Richard Arnold et al.

2025-05-29
