Chapter 16 Data Release | STAT392: Sample Surveys

16.1 Data and Statistics Act 2022

Link to the Act: https://www.legislation.govt.nz/act/public/2022/0039/latest/LMS418574.html

From the Statistics New Zealand website (www.stats.govt.nz):

https://www.stats.govt.nz/about-us/legislation-policies-and-guidelines/

Statistics New Zealand operates under the authority of the Data and Statistics Act 2022. The Act is in seven parts:

Part 1 ensures that high-quality, impartial, and objective official statistics relating to New Zealand are produced to inform the public and to inform decision making;
Part 2 defines roles and responsibilities, in particular those of the Government Statistician, and creates the department known as Statistics New Zealand;
Part 3 provides for the collection of data and for matters concerning statistical confidentiality;
Part 4 relates to the production of official statistics, including the powers of the Minister of Statistics and obligations for publication;
Part 5 relates to access to data for research;
Part 6 relates to offences and enforcement related to official statistics;
Part 7 contains general provisions.

16.2 Data quality

If sample numbers are small then estimates from that sample may be very imprecise. This in particular applies to estimates for subpopulations which are reported as part of large surveys.

Where data are imprecise it is good practice to flag cells or estimates which are unreliable, or to suppress such cells altogether. Consequential cell suppression is not necessary when estimates are suppressed for quality reasons.

For example, in the 2001 Māori Language Survey, Statistics New Zealand published counts, by age and sex, of the numbers of people who could speak at differing levels of proficiency. The output estimates are given in Table 16.1.

Table 16.1: Speaking proficiency by age and sex (Maori Language Survey 2001)
	Proficiency
AgeGroup	Very Well	Well	Fairly Well	Not Very Well	Few Words or Phrases	Total
Males
15-24 years	**	**	3869	10862	27083	43373
25-34 years	**	**	*2668	9394	22572	35552
35-44 years	**	**	*2720	4982	23008	32633
45-54 years	*2164	**	*2137	4001	12134	21161
55+ years	4952	*1115	1928	2621	9142	19757
Total	9045	4309	13322	31860	93939	152476
Females
15-24 years	*1516	*2386	7441	10618	24247	46209
25-34 years	**	*1172	4881	10572	25232	42378
35-44 years	**	**	3465	10481	21731	37478
45-54 years	*1299	*1191	*2894	5138	12251	22773
55+ years	5184	*1258	2691	3629	9507	22269
Total	9449	6880	21372	40437	92969	171107
Total
15-24 years	*2145	3317	11310	21480	51330	89582
25-34 years	**	*1773	7549	19966	47804	77930
35-44 years	*1912	*1812	6185	15463	44739	70111
45-54 years	3463	*1915	5031	9139	24385	43934
55+ years	10136	2373	4619	6249	18649	42026
Total	18494	11190	34694	72297	186908	323583
Number of Maori language speakers at each proficiency level for Maori aged 15 years and over
* Sampling Error $>$30%
** Sampling Error $>$50%: these cells are suppressed
Source: Statistics NZ

Because of small sample sizes, some of the cells have been suppressed (margins of error greater than 50% of the size of the estimate) or are flagged as very uncertain (margins of error greater than 30%).

The absolute and relative sampling errors are shown in Tables 16.2 and 16.3.

Table 16.2: Absolute Sampling Errors (Maori Language Survey 2001)
	Proficiency
AgeGroup	Very Well/Well	Fairly Well	Not Very Well	Few Words or Phrases
Males
15-24 years	604	916	1646	1730
25-34 years	506	906	1464	1526
35-44 years	627	747	1236	1477
45-54 years	892	796	982	1158
55+ years	847	472	603	953
Total	1423	2005	3442	3677
Females
15-24 years	1172	1535	1906	2480
25-34 years	635	1098	1831	2011
35-44 years	753	1014	1846	1977
45-54 years	833	808	1091	1239
55+ years	751	516	682	999
Total	1939	2466	4524	5200
Total
15-24 years	1347	1801	2650	3257
25-34 years	831	1449	2575	2757
35-44 years	1030	1327	2414	2612
45-54 years	1279	1101	1487	1768
55+ years	1216	763	942	1599
Total	2560	3377	6487	7209
Number of Maori language speakers at each proficiency level for Maori aged 15 years and over

Table 16.3: Relative Sampling Errors (%) (Maori Language Survey 2001)
	Proficiency
AgeGroup	Very Well/Well	Fairly Well	Not Very Well	Few Words or Phrases
Males
15-24 years	39	24	15	6
25-34 years	55	34	16	7
35-44 years	33	28	25	6
45-54 years	31	37	24	10
55+ years	14	25	23	10
Total	11	15	11	4
Females
15-24 years	30	21	18	10
25-34 years	38	22	17	8
35-44 years	42	29	18	9
45-54 years	33	28	21	10
55+ years	12	19	19	11
Total	12	12	11	6
Total
15-24 years	25	16	12	6
25-34 years	32	19	13	6
35-44 years	28	22	16	6
45-54 years	24	22	16	7
55+ years	10	17	15	9
Total	9	10	9	4
Number of Maori language speakers at each proficiency level for Maori aged 15 years and over
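
The flagging rule behind Table 16.1 can be sketched in a few lines of Python. This is only an illustration, not the procedure Statistics New Zealand actually runs. The function names `relative_sampling_error` and `flag_estimate` are hypothetical. The sketch assumes that the relative sampling error is the absolute sampling error divided by the estimate, which agrees with, for example, the Males 25-34 years 'Fairly Well' cell (906/2668 is about 34%).

```python
def relative_sampling_error(estimate, abs_error):
    """Relative sampling error as a percentage of the estimate (assumed definition)."""
    return 100.0 * abs_error / estimate


def flag_estimate(estimate, abs_error, flag_above=30.0, suppress_above=50.0):
    """Return the published form of an estimate under the rule in Table 16.1."""
    rse = relative_sampling_error(estimate, abs_error)
    if rse > suppress_above:
        return "**"              # too unreliable: suppress the value
    if rse > flag_above:
        return f"*{estimate}"    # publish, but flag as unreliable
    return str(estimate)


# Males, 'Fairly Well' column: estimates from Table 16.1, errors from Table 16.2.
print(flag_estimate(3869, 916))   # 15-24 years: published without a flag
print(flag_estimate(2668, 906))   # 25-34 years: published with a * flag
# Hypothetical cell (not from the survey) with an error above 50%.
print(flag_estimate(1000, 550))   # suppressed
```

Because the cells are suppressed here for quality reasons rather than for confidentiality, no further cells need to be suppressed as a consequence.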

16.3 Confidentialising Data

Survey data are usually collected with some assurances about confidentiality. For example, all data collected by Statistics New Zealand are covered by the Data and Statistics Act 2022, which states in Section 39(1):

The Statistician must take all reasonable steps to ensure that the Statistician does not publish or otherwise disclose data in a form that could reasonably be expected to identify any individual or organisation.

There are a number of exceptions however.

When data are released there is a risk that individuals and their characteristics could be identified. Steps may need to be taken to prevent this disclosure risk when survey data are released.

There is a clear risk where there is an individual who is unique in the sample – a sample unique (they differ from everyone else in the sample, but not necessarily everyone in the population).
Such an individual will end up in a cell with a count of 1 in a crosstabulation of a (set of) categorical variable(s).

If that person is also a population unique (they differ from everyone else in the population), then publication of such data will lead to significant disclosure about that individual.

Where the sampling fraction is low (i.e. the sample weights are high) then the risk is small, and there is usually no need to confidentialise. Sample uniques are unlikely to be population uniques in this case.

However, where the sampling fraction is high (especially in full coverage strata, in censuses and in populations where there are just a few influential units) there is a significant risk of disclosure, since a sample unique is very likely to be a population unique. In a census a sample unique is always a population unique.
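
Sample uniques can be found by counting how often each combination of categorical variables occurs among the unit records. The sketch below is a minimal illustration: the function name `sample_uniques`, the variable names and the records themselves are all made up. It cannot tell whether a sample unique is also a population unique, since that depends on information outside the sample.

```python
from collections import Counter


def sample_uniques(records, key_vars):
    """Return the records whose combination of key variables occurs only once."""
    combos = Counter(tuple(r[v] for v in key_vars) for r in records)
    return [r for r in records if combos[tuple(r[v] for v in key_vars)] == 1]


# Hypothetical sample: every value here is invented for illustration.
sample = [
    {"id": 1, "age_group": "15-24", "sex": "F", "district": "Timaru"},
    {"id": 2, "age_group": "15-24", "sex": "F", "district": "Timaru"},
    {"id": 3, "age_group": "55+", "sex": "M", "district": "Mackenzie"},
    {"id": 4, "age_group": "25-34", "sex": "M", "district": "Waimate"},
    {"id": 5, "age_group": "25-34", "sex": "M", "district": "Waimate"},
]

uniques = sample_uniques(sample, ["age_group", "sex", "district"])
print([r["id"] for r in uniques])  # only record 3 sits in a cell with a count of 1
```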

For rare members of the population there is a risk that they can identify each other. For example, if there are only two large companies operating, say, internet businesses, and if an Official Statistics Agency reports the total turnover of all internet businesses, there is a disclosure risk: each large company can subtract its own turnover from the published total, and deduce the size of its competitor’s business and its market share. This is an unacceptable disclosure from the point of view of those companies, even if no company apart from the two main companies could possibly deduce their separate turnovers.

16.4 Unit Records

Official Statistics Agencies, such as Statistics New Zealand, may produce Confidentialised Unit Record Files or CURFs. These are usually sample survey datasets released to researchers or other government departments.

Census data are rarely released in this way.

Researchers agree to use the data only for specific purposes, and to destroy the unit record data after use.

Where unit records are to be released the data can be confidentialised by:

Removing all personal identifiers such as name and address;
Replacing administrative identifiers (e.g. IRD number) with some other identifier so that the data provider can identify a person, but the researcher cannot;
Adding ‘noise’ to the data, e.g. adding a random amount to each person’s income before creating income bands in a table;
Swapping data between records (taking care that this does not significantly change the overall statistical properties of the data);
Replacing actual data with imputed data (e.g. by regression imputation);
Replacing data with bands (e.g. reporting income in bands rather than actual values);
Top coding data. For example, actual incomes are recorded except for those of the top earners, who are put into a single band together (e.g. over $200,000).

These procedures mean that anyone looking at the data cannot be sure that the data in any particular record is a true set of data for that individual.
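
Three of the methods listed above (top coding, banding and adding noise) can be sketched as follows. This is an illustration under stated assumptions, not an agency's actual procedure. The function names, the band width of $10,000, the noise scale of $500 and the example incomes are all hypothetical; the $200,000 cap is taken from the top coding example above.

```python
import random


def top_code(income, cap=200_000):
    """Report any income above the cap as the cap itself."""
    return min(income, cap)


def band(income, width=10_000):
    """Replace an income with the (lower, upper) limits of the band containing it."""
    lower = (income // width) * width
    return (lower, lower + width)


def add_noise(income, scale=500, rng=random):
    """Add a random amount drawn uniformly from [-scale, scale]."""
    return income + rng.uniform(-scale, scale)


# Hypothetical incomes, invented for illustration.
incomes = [41_250, 67_900, 350_000]
print([top_code(x) for x in incomes])      # the top earner is reported as 200000
print([band(x) for x in incomes[:2]])      # (40000, 50000) and (60000, 70000)
rng = random.Random(392)                   # fixed seed so the sketch is repeatable
print([round(add_noise(x, rng=rng)) for x in incomes])
```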

16.5 Tables

Where cell counts in tables are small there is a risk of identifying individuals and their characteristics. One rule which is often used to determine whether or not a cell poses a disclosure risk is the $(n,k)$ rule:

A cell is a risk if $n$ respondents or fewer contribute $k$% or more to the value of a cell.

For example, we might consider a cell a risk if $n=3$ respondents or fewer contribute $k=80$% or more of the cell value. In a table of counts each respondent contributes 1 to the cell, so a cell with count $y$ is a risk if $3\geq 0.8y$, i.e. if $y\leq 3/0.8=3.75$: that is, counts of 3 or fewer.
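
The $(n,k)$ rule is straightforward to check once the individual contributions to a cell are known. The sketch below is a minimal illustration: the function name `nk_rule_risky` is hypothetical, and the defaults $n=3$ and $k=80$ are simply the values used in the example above, not the confidential values used in practice.

```python
def nk_rule_risky(contributions, n=3, k=80.0):
    """True if the n largest contributions make up k% or more of the cell total."""
    total = sum(contributions)
    if total == 0:
        return False  # an empty cell has no contributors to protect
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total >= k


# Tables of counts: every respondent contributes 1 to the cell.
print(nk_rule_risky([1] * 3))   # True: a count of 3 is risky
print(nk_rule_risky([1] * 4))   # False: 3 respondents contribute only 75%

# Hypothetical turnover figures for a cell dominated by a few businesses.
print(nk_rule_risky([500, 300, 100, 50, 50]))   # True: the top 3 contribute 90%
```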

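Another rule for deciding if a cell is risky is the $p$% rule:

A cell is a risk if the value contributed by any respondent can be estimated to within $p$% of its true value.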

This includes estimation by one of the other contributors to the cell. For example, consider a cell with a value of $100,000. If one business contributed $40,000 to the cell, and knew that it had a larger competitor, then it could deduce that the competitor contributed between $40,000 and $60,000 to the cell. Estimating that competitor’s value as $50,000 would mean that the competitor’s value is being estimated to within $10,000, or 20%, of its true value.

In practice, applying the $p$% rule only requires checking whether the second largest contributor can find out about the largest contributor. This is the ‘worst case’: if the second largest contributor can’t break the confidentialising, then it follows that no other contributor can either.
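
One common way to carry out this worst-case check, assumed here rather than taken from these notes, is to let the second largest contributor estimate the largest contribution as the cell total minus its own value. The error in that estimate is the sum of all the remaining contributions, so the cell is treated as risky when that remainder is less than $p$% of the largest contribution. The function name `p_percent_risky`, the default $p=20$ and the figures below are all hypothetical.

```python
def p_percent_risky(contributions, p=20.0):
    """True if the second largest contributor can estimate the largest to within p%."""
    ordered = sorted(contributions, reverse=True)
    if not ordered:
        return False  # nothing to disclose in an empty cell
    if len(ordered) == 1:
        return True   # a single contributor is disclosed by the cell value itself
    largest, second = ordered[0], ordered[1]
    # What is left after removing the two largest contributors is the
    # second contributor's uncertainty about the largest contribution.
    remainder = sum(ordered) - largest - second
    return remainder < (p / 100.0) * largest


# Hypothetical cells, each with a total of 100,000.
print(p_percent_risky([60_000, 40_000]))           # True: exact disclosure
print(p_percent_risky([60_000, 35_000, 5_000]))    # True: error of 5,000 is under 12,000
print(p_percent_risky([60_000, 25_000, 15_000]))   # False: error of 15,000 exceeds 12,000
```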

The values of $(n,k)$ and $p$% that are used by Statistics New Zealand are themselves confidential.

There are various options when tabular data are confidentialised.

Suppress risky cells. This means not publishing a value in those cells. This usually means consequential suppression of some other non-risky cells, in order that the value of the suppressed cell not be deducible;
Construct tables from confidentialised unit records;
Amalgamate rows and/or columns until all cells are large;
Random round each cell entry. Statistics New Zealand random rounds to base 3 for Census and other tables of counts.
The procedure is as follows:
- If a count $x$ is a multiple of 3, i.e. $x=3m$, it is left unchanged;
- If a count is one more than a multiple of 3, i.e. $x=3m+1$, then it is rounded down to $3m$ with probability $\frac{2}{3}$, and rounded up to $3m+3$ with probability $\frac{1}{3}$.
  
  Draw a random number $r$ between 0 and 1: if $r<\frac23$ round down, otherwise round up.
- If a count is two more than a multiple of 3, i.e. $x=3m+2$, then it is rounded down to $3m$ with probability $\frac{1}{3}$, and rounded up to $3m+3$ with probability $\frac{2}{3}$.
  
  Draw a random number $r$ between 0 and 1: if $r<\frac13$ round down, otherwise round up.
Counts in the margins of tables may be left unchanged, random rounded independently, or recalculated as the sums of the cell entries. Except in the last case, this means the cells in a table may not add up to the published margins.
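
The random rounding procedure above can be sketched in Python as follows. This is an illustration only, not Statistics New Zealand's production code. The function name `random_round` and the example counts are hypothetical, and the `rng` argument is an added convenience so that a seeded generator can be supplied.

```python
import random


def random_round(x, base=3, rng=random):
    """Randomly round a non-negative integer count to a multiple of `base`."""
    remainder = x % base
    if remainder == 0:
        return x                      # multiples of the base are left unchanged
    lower = x - remainder
    # Round down with probability (base - remainder) / base, otherwise round up.
    # For base 3 this is 2/3 when the remainder is 1 and 1/3 when it is 2.
    if rng.random() < (base - remainder) / base:
        return lower
    return lower + base


# Hypothetical counts; the output changes from run to run unless the seed is fixed.
rng = random.Random(2001)
counts = [0, 1, 2, 4, 5, 13]
print([random_round(c, rng=rng) for c in counts])
```

With these probabilities the expected value of a rounded count equals the original count: for $x=3m+1$, for example, $\frac23\cdot 3m+\frac13\cdot(3m+3)=3m+1$.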

Example 1. Consider the following table from the 2001 Census on populations by ethnicity in small areas. The data in the table has been random rounded to base 3. The columns of figures do not add up to the published totals because of the random rounding.

Table 16.4: Census counts for districts in the central South Island
Ethnic Group	Timaru	Mackenzie	Waimate
Pacific Peoples
Samoan	159	0	18
Cook Island Maori nfd	42	3	12
Tongan	66	0	9
Niuean	6	0	0
Fijian (except Fiji Indian/Indo-Fijian)	12	3	0
Tokelauan	12	0	0
Tuvalu Islander/Ellice Islander	6	0	0
Rarotongan	9	0	0
Society Islander (including Tahitian)	0	0	0
Other Pacific Peoples	18	3	3
All Pacific Peoples
All Pacific Peoples	297	12	39
All Ethnic Groups
All Ethnic Groups	41082	3546	6978

The data for the Mackenzie District might have originally looked like the values below – these data could be treated by suppressing cells or by random rounding. Since the counts are so small, cell suppression leads to a disproportionate loss of data: some non-risky cells are suppressed in order that the sensitive cells remain unidentifiable after suppression.

Table 16.5: Mackenzie District - Confidentialisation
Ethnic Group	Original	CellSuppression	Random
Pacific Peoples
Samoan	0	0	0
Cook Island Maori nfd	5	5	3
Tongan	0	0	0
Niuean	0		0
Fijian (except Fiji Indian/Indo-Fijian)	3		3
Tokelauan	0		0
Tuvalu Islander/Ellice Islander	0	0	0
Rarotongan	1		0
Society Islander (including Tahitian)	0		0
Other Pacific Peoples	4	4	3
All Pacific Peoples
All Pacific Peoples	13	13	12
All Ethnic Groups
All Ethnic Groups	3544	3544	3546

Example 2. Here is a table with some small counts:

Table 16.6: Original data
	A	B	C	Total
P	10	12	6	28
Q	7	6	11	24
R	5	1	2	8
Total	22	19	19	60

If we take the view that cells with counts smaller than 4 are too risky to release, we have two cells that need to be confidentialised.

Solution 1: Amalgamation – We can combine rows Q and R:

Table 16.7: Amalgamated data
	A	B	C	Total
P	10	12	6	28
QR	12	7	13	32
Total	22	19	19	60

This eliminates any information about the different distributions for Q and R.

Solution 2: Cell suppression – We can suppress the risky cells

Table 16.8: Cell Suppression
	A	B	C	Total
P	10	12	6	28
Q	7	6	11	24
R	5			8
Total	22	19	19	60

This leaves the data for all non-risky cells visible, and removes the data in the risky cells. However, since we have the column totals we can deduce the missing data: for example, the suppressed value in column B must be $19-12-6=1$. That means we have to suppress two further non-risky cells in order for the cell suppression to effectively disguise the risky cells:

Table 16.9: Cell suppression
	A	B	C	Total
P	10			28
Q	7	6	11	24
R	5			8
Total	22	19	19	60

This necessary suppression of non-risky cells is called consequential cell suppression.
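
The need for consequential suppression can be checked mechanically. The sketch below is an illustration only, and the function name `recover_suppressed` is hypothetical. It repeatedly looks for a row or column with exactly one suppressed cell and recovers that cell by subtraction from the published margin. Applied to Table 16.8 it recovers both risky cells; applied to Table 16.9 it recovers nothing. It only tests exact recovery by subtraction, and does not check how narrowly the suppressed values could be bounded.

```python
def recover_suppressed(cells, row_totals, col_totals):
    """Fill in suppressed cells (None) that the published margins determine exactly."""
    table = [row[:] for row in cells]   # work on a copy
    changed = True
    while changed:
        changed = False
        for i, row in enumerate(table):
            missing = [j for j, v in enumerate(row) if v is None]
            if len(missing) == 1:
                row[missing[0]] = row_totals[i] - sum(v for v in row if v is not None)
                changed = True
        for j in range(len(col_totals)):
            column = [row[j] for row in table]
            missing = [i for i, v in enumerate(column) if v is None]
            if len(missing) == 1:
                table[missing[0]][j] = col_totals[j] - sum(v for v in column if v is not None)
                changed = True
    return table


row_totals = [28, 24, 8]
col_totals = [22, 19, 19]
table_16_8 = [[10, 12, 6], [7, 6, 11], [5, None, None]]
table_16_9 = [[10, None, None], [7, 6, 11], [5, None, None]]

print(recover_suppressed(table_16_8, row_totals, col_totals))  # risky cells 1 and 2 reappear
print(recover_suppressed(table_16_9, row_totals, col_totals))  # suppressed cells stay hidden
```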

Solution 3: Random Rounding – We can random round each cell to base 3 (or some other base of our choice). Multiples of 3 are left undisturbed. Any other number is next to a multiple of three, and two units away from another multiple of three. In random rounding we round either up or down – with a higher probability of rounding to the closer value.

Value	Rounds To
0	0 always
1	0 with probability $\frac23$, 3 with probability $\frac13$
2	0 with probability $\frac13$, 3 with probability $\frac23$
3	3 always
4	3 with probability $\frac23$, 6 with probability $\frac13$
5	3 with probability $\frac13$, 6 with probability $\frac23$
6	6 always
7	6 with probability $\frac23$, 9 with probability $\frac13$
8	6 with probability $\frac13$, 9 with probability $\frac23$
9	9 always
10	9 with probability $\frac23$, 12 with probability $\frac13$
11	9 with probability $\frac13$, 12 with probability $\frac23$
12	12 always
etc.

This means that if we see:

Random Round Value	Could actually have been:
0	0,1,2
3	1,2,3,4,5
6	4,5,6,7,8
9	7,8,9,10,11
12	10,11,12,13,14
etc.
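
The lookup above follows a simple pattern, sketched here with a hypothetical helper `possible_originals`: a random rounded value could have come from any count within two units of it (for base 3), except that counts cannot be negative.

```python
def possible_originals(rounded, base=3):
    """List the counts that could have been random rounded to `rounded`."""
    if rounded % base != 0:
        raise ValueError("a random rounded value must be a multiple of the base")
    lowest = max(0, rounded - (base - 1))   # counts cannot be negative
    return list(range(lowest, rounded + base))


print(possible_originals(0))    # [0, 1, 2]
print(possible_originals(3))    # [1, 2, 3, 4, 5]
print(possible_originals(12))   # [10, 11, 12, 13, 14]
```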

Each time we random round a table, we end up with a slightly different version. We round each cell separately, which means that we round the margins from their original values rather than adding up the random rounded cell values.

Table 16.10: Random Rounded (base 3)
	A	B	C	Total
P	9	12	6	27
Q	6	6	12	24
R	6	0	0	9
Total	21	18	21	60

The good thing about rounding every cell separately is that the margin values are not far from their true values. The frustrating thing is that the cells in the table don’t necessarily add up to their margins any more (e.g. in the table above, in the R row, 6+0+0=6 but the row total is stated to be 9).

16.6 Other Types of Release

Graphical displays are equivalent in many ways to tables and unit records, and many of the confidentialising methods listed above apply to them.

A scatterplot is a visual subset of a unit record dataset: we see pairs of individual values of continuous variables displayed. Noise could be added to the data points to confidentialise the data. In a histogram the bands can be chosen to be wide enough to group many respondents together, and the lowest and highest bins could be left open-ended (bottom and/or top coded) to protect the outlying respondents.
