Chapter 16 Data Release

When data are to be released in any form several questions need to be asked:

  • Are the data of sufficient quality to release?
  • Is there a risk of identifying individuals, or releasing any information given in confidence?
  • Are the data being used for the purpose for which they were collected? (What were the respondents told about the uses of the data at the time of collection?)
  • Are there any legal, commercial or ethical considerations?

16.1 Data and Statistics Act 2022

Link to the Act: https://www.legislation.govt.nz/act/public/2022/0039/latest/LMS418574.html

From the Statistics New Zealand website (www.stats.govt.nz):

https://www.stats.govt.nz/about-us/legislation-policies-and-guidelines/

Statistics New Zealand operates under the authority of the Data and Statistics Act 2022. The Act is in seven parts:

  • Part 1 ensures that high-quality, impartial, and objective official statistics relating to New Zealand are produced to inform the public and decision making;
  • Part 2 defines roles and responsibilities, in particular those of the Government Statistician, and creates the department known as Statistics New Zealand;
  • Part 3 provides for the collection of data and for matters concerning statistical confidentiality;
  • Part 4 relates to the production of official statistics, including the powers of the Minister of Statistics and obligations for publication;
  • Part 5 relates to access to data for research;
  • Part 6 relates to offences and enforcement related to official statistics;
  • Part 7 contains general provisions.

16.2 Data quality

If sample numbers are small then estimates from that sample may be very imprecise. This in particular applies to estimates for subpopulations which are reported as part of large surveys.

Where data are imprecise it is good practice to flag cells or estimates which are unreliable, or to suppress such cells altogether. Consequential cell suppression is not necessary when estimates are suppressed for quality reasons.

For example, in the 2001 Māori Language Survey, Statistics New Zealand published counts, by age and sex, of the number of people who could speak Māori at differing levels of proficiency. The output estimates are given in Table 16.1.

Table 16.1: Speaking proficiency by age and sex (Maori Language Survey 2001)

              Proficiency
Age group     Very Well    Well  Fairly Well  Not Very Well  Few Words or Phrases   Total
Males
15-24 years          **      **         3869          10862                 27083   43373
25-34 years          **      **        *2668           9394                 22572   35552
35-44 years          **      **        *2720           4982                 23008   32633
45-54 years       *2164      **        *2137           4001                 12134   21161
55+ years          4952   *1115         1928           2621                  9142   19757
Total              9045    4309        13322          31860                 93939  152476
Females
15-24 years       *1516   *2386         7441          10618                 24247   46209
25-34 years          **   *1172         4881          10572                 25232   42378
35-44 years          **      **         3465          10481                 21731   37478
45-54 years       *1299   *1191        *2894           5138                 12251   22773
55+ years          5184   *1258         2691           3629                  9507   22269
Total              9449    6880        21372          40437                 92969  171107
Total
15-24 years       *2145    3317        11310          21480                 51330   89582
25-34 years          **   *1773         7549          19966                 47804   77930
35-44 years       *1912   *1812         6185          15463                 44739   70111
45-54 years        3463   *1915         5031           9139                 24385   43934
55+ years         10136    2373         4619           6249                 18649   42026
Total             18494   11190        34694          72297                186908  323583

Number of Maori language speakers at each proficiency level for Maori aged 15 years and over
* Sampling error > 30%
** Sampling error > 50%: these cells are suppressed
Source: Statistics NZ

Because of small sample sizes, some cells have been suppressed (margins of error greater than 50% of the estimate) or flagged as very uncertain (margins of error greater than 30%).

The absolute and relative sampling errors are shown in Tables 16.2 and 16.3.

Table 16.2: Absolute Sampling Errors (Maori Language Survey 2001)

              Proficiency
Age group     Very Well/Well  Fairly Well  Not Very Well  Few Words or Phrases
Males
15-24 years              604          916           1646                  1730
25-34 years              506          906           1464                  1526
35-44 years              627          747           1236                  1477
45-54 years              892          796            982                  1158
55+ years                847          472            603                   953
Total                   1423         2005           3442                  3677
Females
15-24 years             1172         1535           1906                  2480
25-34 years              635         1098           1831                  2011
35-44 years              753         1014           1846                  1977
45-54 years              833          808           1091                  1239
55+ years                751          516            682                   999
Total                   1939         2466           4524                  5200
Total
15-24 years             1347         1801           2650                  3257
25-34 years              831         1449           2575                  2757
35-44 years             1030         1327           2414                  2612
45-54 years             1279         1101           1487                  1768
55+ years               1216          763            942                  1599
Total                   2560         3377           6487                  7209

Number of Maori language speakers at each proficiency level for Maori aged 15 years and over

Table 16.3: Relative Sampling Errors (%) (Maori Language Survey 2001)

              Proficiency
Age group     Very Well/Well  Fairly Well  Not Very Well  Few Words or Phrases
Males
15-24 years               39           24             15                     6
25-34 years               55           34             16                     7
35-44 years               33           28             25                     6
45-54 years               31           37             24                    10
55+ years                 14           25             23                    10
Total                     11           15             11                     4
Females
15-24 years               30           21             18                    10
25-34 years               38           22             17                     8
35-44 years               42           29             18                     9
45-54 years               33           28             21                    10
55+ years                 12           19             19                    11
Total                     12           12             11                     6
Total
15-24 years               25           16             12                     6
25-34 years               32           19             13                     6
35-44 years               28           22             16                     6
45-54 years               24           22             16                     7
55+ years                 10           17             15                     9
Total                      9           10              9                     4

Number of Maori language speakers at each proficiency level for Maori aged 15 years and over
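
The relationship between the estimates, the absolute sampling errors and the flags can be sketched in code. The relative sampling error is the absolute sampling error expressed as a percentage of the estimate, and the 30% and 50% thresholds then decide whether a cell is flagged or suppressed. The Python sketch below is illustrative only: the function names are hypothetical, and the published flags may rest on unrounded figures or on criteria not described here.

```python
def relative_error(estimate, absolute_error):
    """Relative sampling error as a percentage of the estimate."""
    return 100.0 * absolute_error / estimate


def publish_cell(estimate, absolute_error, flag_at=30.0, suppress_at=50.0):
    """Return the cell as it would be printed.

    '**' means suppressed, a leading '*' means flagged as unreliable,
    and anything else is the plain estimate.
    """
    rse = relative_error(estimate, absolute_error)
    if rse > suppress_at:
        return "**"
    if rse > flag_at:
        return f"*{estimate}"
    return str(estimate)


# Males, "Fairly Well": estimates from Table 16.1, errors from Table 16.2.
print(publish_cell(3869, 916))   # 15-24 years
print(publish_cell(2668, 906))   # 25-34 years
```

The first cell has a relative error of about 24% and is printed unchanged; the second has a relative error of about 34% and is flagged with an asterisk.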

16.3 Confidentialising Data

Survey data are usually collected with some assurances about confidentiality. For example, all data collected by Statistics New Zealand are covered by the Data and Statistics Act 2022, which states in Section 39(1):

The Statistician must take all reasonable steps to ensure that the Statistician does not publish or otherwise disclose data in a form that could reasonably be expected to identify any individual or organisation.

There are, however, a number of exceptions.

When data are released there is a risk that individuals and their characteristics could be identified. Steps may need to be taken to prevent this disclosure risk when survey data are released.

There is a clear risk where an individual is unique in the sample – a sample unique (they differ from everyone else in the sample, but not necessarily from everyone in the population).
Such an individual will end up in a cell with a count of 1 in a crosstabulation of a (set of) categorical variable(s).

If that person is also a population unique (they differ from everyone else in the population), then publication of such data will lead to significant disclosure about that individual.

Where the sampling fraction is low (i.e. the sample weights are high) then the risk is small, and there is usually no need to confidentialise. Sample uniques are unlikely to be population uniques in this case.

However, where the sampling fraction is high (especially in full coverage strata, in censuses and in populations where there are just a few influential units) there is a significant risk of disclosure, since a sample unique is very likely to be a population unique. In a census, a sample unique is always a population unique.
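
A minimal sketch of how sample uniques might be located in a unit record file, assuming the records are held as Python dictionaries. The records, the variable names and the function `sample_uniques` are invented for illustration. Whether a sample unique is also a population unique cannot be decided from the sample alone.

```python
from collections import Counter


def sample_uniques(records, keys):
    """Return the records whose combination of values on `keys`
    appears exactly once in the sample (cells with a count of 1)."""
    def cell(record):
        return tuple(record[k] for k in keys)

    counts = Counter(cell(r) for r in records)
    return [r for r in records if counts[cell(r)] == 1]


# Hypothetical unit records, invented for illustration only.
sample = [
    {"id": 1, "age": "15-24", "sex": "F", "occupation": "teacher"},
    {"id": 2, "age": "15-24", "sex": "F", "occupation": "teacher"},
    {"id": 3, "age": "55+", "sex": "M", "occupation": "teacher"},
    {"id": 4, "age": "55+", "sex": "M", "occupation": "judge"},
]

uniques = sample_uniques(sample, keys=("age", "sex", "occupation"))
print([r["id"] for r in uniques])
```

Crossing on fewer variables reduces the number of uniques: with only age and sex as keys, no record in this toy sample is unique.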

For rare members of the population there is a risk that they can identify each other. For example, if there are only two large companies operating, say, internet businesses, and if an Official Statistics Agency reports the total turnover of all internet businesses, there is a disclosure risk: each large company can subtract its own turnover from the published total, and deduce the size of its competitor’s business and its market share. This is an unacceptable disclosure from the point of view of those companies, even if no company apart from the two main companies could possibly deduce their separate turnovers.

16.4 Unit Records

Official Statistics Agencies, such as Statistics New Zealand, may produce Confidentialised Unit Record Files or CURFs. These are usually sample survey datasets released to researchers or other government departments.

Census data are rarely released in this way.

Researchers agree to use the data only for specific purposes, and to destroy the unit record data after use.

Where unit records are to be released the data can be confidentialised by:

  • Removing all personal identifiers such as name and address;
  • Replacing administrative identifiers (e.g. IRD number) with some other identifier so that the data provider can identify a person, but the researcher cannot;
  • Adding ‘noise’ to the data – e.g. adding a random amount to each person’s income before creating income bands in a table;
  • Swapping data between records (taking care that this does not significantly change the overall statistical properties of the data);
  • Replacing actual data with imputed data (e.g. by regression imputation);
  • Replacing data with bands (e.g. reporting income in bands rather than actual values);
  • Top coding data. For example, actual incomes are recorded except for all of the top earners, who are put into a single band together (e.g. >$200,000).

These procedures mean that anyone looking at the data cannot be sure that the data in any particular record is a true set of data for that individual.
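
Three of the treatments listed above (top coding, banding and adding noise) can be sketched as follows. This is a simplified illustration rather than any agency's actual procedure: the function names, the $200,000 cap, the $10,000 band width and the Gaussian noise model are all assumptions.

```python
import random


def top_code(value, cap=200_000):
    """Report any value above the cap as the cap itself (one top band)."""
    return min(value, cap)


def to_band(value, width=10_000):
    """Replace an exact value with the (lower, upper) limits of its band."""
    lower = (value // width) * width
    return (lower, lower + width)


def add_noise(value, sd, rng):
    """Add zero-mean Gaussian noise with standard deviation `sd`."""
    return value + rng.gauss(0, sd)


# Hypothetical incomes, invented for illustration only.
incomes = [43_250, 87_900, 350_000]
rng = random.Random(2022)  # seeded so the run is repeatable

print([top_code(x) for x in incomes])
print([to_band(x) for x in incomes])
print([round(add_noise(x, sd=1_000, rng=rng)) for x in incomes])
```

In practice the treatments are often combined, for example adding noise first and then reporting only the band.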

16.5 Tables

Where cell counts in tables are small there is a risk of identifying individuals and their characteristics. One rule which is often used to determine whether or not a cell poses a disclosure risk is the \((n,k)\) rule:

A cell is a risk if \(n\) respondents or fewer contribute \(k\)% or more to the value of a cell.

For example, we might consider a cell a risk if \(n=3\) respondents or fewer contribute \(k=80\)% or more of the cell value. In a table of counts every respondent contributes 1, so a cell with count \(y\) would be a risk if \(3 \geq 0.8y\), i.e. if \(y \leq 3/0.8 = 3.75\): counts of 3 or fewer.
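
A minimal sketch of the \((n,k)\) check, assuming the individual contributions to a cell are available as a list. The function name is hypothetical, and the defaults \(n=3\), \(k=80\) are simply the example values above, not the values used by any agency.

```python
def nk_risky(contributions, n=3, k=80.0):
    """True if the n largest contributions make up k% or more of the cell."""
    total = sum(contributions)
    if total == 0:
        return False  # assumption: an empty cell is not treated as a risk here
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n >= k * total


# In a table of counts, each respondent contributes 1 to the cell.
for count in range(1, 6):
    print(count, nk_risky([1] * count))
```

For counts, the loop reports cells of 1, 2 and 3 as risky and cells of 4 and 5 as not risky, in line with the calculation above.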

Another rule for deciding if a cell is risky is the \(p\)% rule:

A cell is a risk if the value of any contributor can be estimated to within \(p\)% of its true value.

This includes estimation by one of the other contributors to the cell. For example, consider a cell with a value of $100,000. If one business contributed $40,000 to the cell, and knew that it had a larger competitor, then it could deduce that the competitor contributed between $40,000 and $60,000 to the cell. Estimating that competitor’s value as $50,000 would mean that the competitor’s value is being estimated to within $10,000, or roughly 20%, of its true value.

In practice, applying the \(p\)% rule only requires checking whether the second largest contributor can estimate the value of the largest contributor. This is the ‘worst case’: if the second largest contributor cannot break the confidentialisation, then no other contributor can either.
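
One common way of making this check concrete, assumed here, is for the second largest contributor to subtract its own value from the published total and treat what is left as an upper estimate of the largest contributor's value. That estimate is too high by exactly the combined value of all the other contributors, so the cell is treated as risky if this remainder is less than \(p\)% of the largest value. The function name, the strict inequality and the value \(p=20\) are assumptions made for illustration.

```python
def p_percent_risky(contributions, p=20.0):
    """True if the second largest contributor could estimate the largest
    contributor's value to within p% of its true value."""
    values = sorted(contributions, reverse=True)
    if not values:
        return False
    if len(values) == 1:
        return True  # the published total is the contributor's own value
    largest, second = values[0], values[1]
    # What remains of the total once the two largest values are removed.
    remainder = sum(values) - largest - second
    return 100.0 * remainder < p * largest


# Hypothetical cells, both with a total of $100,000.
print(p_percent_risky([55_000, 40_000, 5_000]))    # small remainder
print(p_percent_risky([45_000, 40_000, 15_000]))   # larger remainder
```

In the first cell the remainder of $5,000 is well under 20% of $55,000, so the cell is risky; in the second the remainder of $15,000 exceeds 20% of $45,000, so it is not.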

The values of \((n,k)\) and \(p\)% that are used by Statistics New Zealand are themselves confidential.

There are various options when tabular data are confidentialised.

  • Suppress risky cells. This means not publishing a value in those cells. This usually means consequential suppression of some other non-risky cells, in order that the value of the suppressed cell not be deducible;

  • Construct tables from confidentialised unit records;

  • Amalgamate rows and/or columns until all cells are large;

  • Random round each cell entry. Statistics New Zealand does this random rounding to base 3 for Census and other tables of counts.
    The procedure is as follows:

    • If a count \(x\) is a multiple of 3, i.e. \(x=3m\), it is left unchanged;

    • If a count is a multiple of 3 + 1: i.e. \(x=3m+1\) then it is rounded down to \(3m\) with probability \(\frac{2}{3}\), and rounded up to \(3m+3\) with probability \(\frac{1}{3}\).

      Draw a random number \(r\) between 0 and 1: if \(r<\frac23\) round down otherwise round up.

    • If a count is a multiple of 3 + 2: i.e. \(x=3m+2\) then it is rounded down to \(3m\) with probability \(\frac{1}{3}\), and rounded up to \(3m+3\) with probability \(\frac{2}{3}\).

      Draw a random number \(r\) between 0 and 1: if \(r<\frac13\) round down otherwise round up.

    Counts in the margins of tables may be left unchanged, random rounded independently, or recalculated as the sums of the cell entries. Except in the last case, this means the cells in a table may not add up to the published margins.
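
The random rounding procedure described above can be sketched in Python as follows. The function name is hypothetical. For base 3 the probabilities are exactly those given above; the extension to other bases (rounding down with probability proportional to the distance from the upper multiple) is an assumption.

```python
import random


def random_round(x, base=3, rng=random):
    """Randomly round a non-negative integer count to a multiple of `base`."""
    remainder = x % base
    if remainder == 0:
        return x  # multiples of the base are left unchanged
    lower = x - remainder
    # Round down with probability (base - remainder) / base, so that the
    # nearer multiple is the more likely outcome; otherwise round up.
    if rng.random() < (base - remainder) / base:
        return lower
    return lower + base


rng = random.Random(16)  # seeded so the run is repeatable
row = [5, 1, 2, 8]       # a row of counts followed by its total
print([random_round(x, rng=rng) for x in row])
```

Because each entry, including the total, is rounded independently, the rounded cells need not sum to the rounded total.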

Example 1. Consider the following table from the 2001 Census on populations by ethnicity in small areas. The data in the table have been random rounded to base 3. The columns of figures do not add up to the published totals because of the random rounding.

Table 16.4: Census counts for districts in the central South Island

Ethnic Group                             Timaru  Mackenzie  Waimate
Pacific Peoples
Samoan                                      159          0       18
Cook Island Maori nfd                        42          3       12
Tongan                                       66          0        9
Niuean                                        6          0        0
Fijian (except Fiji Indian/Indo-Fijian)      12          3        0
Tokelauan                                    12          0        0
Tuvalu Islander/Ellice Islander               6          0        0
Rarotongan                                    9          0        0
Society Islander (including Tahitian)         0          0        0
Other Pacific Peoples                        18          3        3
All Pacific Peoples                         297         12       39
All Ethnic Groups                         41082       3546     6978

The data for the Mackenzie District might have originally looked like the values below – these data could be treated by suppressing cells or by random rounding. Since the counts are so small, cell suppression leads to an exaggerated loss of data. Some non-risky cells are suppressed in order that the sensitive cells remain unidentifiable after suppression.

Table 16.5: Mackenzie District - Confidentialisation

Ethnic Group                             Original  Cell suppression  Random rounding
Pacific Peoples
Samoan                                          0                 0                0
Cook Island Maori nfd                           5                 5                3
Tongan                                          0                 0                0
Niuean                                          0                 S                0
Fijian (except Fiji Indian/Indo-Fijian)         3                 S                3
Tokelauan                                       0                 S                0
Tuvalu Islander/Ellice Islander                 0                 0                0
Rarotongan                                      1                 S                0
Society Islander (including Tahitian)           0                 S                0
Other Pacific Peoples                           4                 4                3
All Pacific Peoples                            13                13               12
All Ethnic Groups                            3544              3544             3546

S = cell suppressed

Example 2. Here is a table with some small counts:

Table 16.6: Original data
        A   B   C  Total
P      10  12   6     28
Q       7   6  11     24
R       5   1   2      8
Total  22  19  19     60

If we take the view that cells with counts smaller than 4 are too risky to release, we have two cells that need to be confidentialised.

Solution 1: Amalgamation – We can combine rows Q and R:

Table 16.7: Amalgamated data
        A   B   C  Total
P      10  12   6     28
QR     12   7  13     32
Total  22  19  19     60

This eliminates any information about the different distributions for Q and R.

Solution 2: Cell suppression – We can suppress the risky cells

Table 16.8: Cell Suppression
        A   B   C  Total
P      10  12   6     28
Q       7   6  11     24
R       5   S   S      8
Total  22  19  19     60
S = cell suppressed

This leaves the data for all non-risky cells visible, and removes the data in the risky cells. However since we have the column totals we can deduce the missing data. That means we have to suppress two further non-risky cells in order for the cell suppression to effectively disguise the risky cells:

Table 16.9: Cell suppression
        A   B   C  Total
P      10   S   S     28
Q       7   6  11     24
R       5   S   S      8
Total  22  19  19     60
S = cell suppressed

This necessary suppression of non-risky cells is called consequential cell suppression.
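
The need for consequential suppression can be checked with a short sketch. It fills in any column that has exactly one suppressed cell by subtracting the visible cells from the published column total. The function name and the use of `None` for a suppressed cell are assumptions, and a fuller audit would also examine rows and combinations of rows and columns.

```python
def fill_from_column_totals(rows, column_totals):
    """Recover any suppressed cell (None) that is the only gap in its column."""
    filled = [list(r) for r in rows]
    for j, total in enumerate(column_totals):
        missing = [i for i, r in enumerate(filled) if r[j] is None]
        if len(missing) == 1:
            known = sum(r[j] for r in filled if r[j] is not None)
            filled[missing[0]][j] = total - known
    return filled


totals = [22, 19, 19]

# Table 16.8: only the two risky cells in row R are suppressed.
table_16_8 = [[10, 12, 6], [7, 6, 11], [5, None, None]]
print(fill_from_column_totals(table_16_8, totals))

# Table 16.9: two further cells in row P are suppressed.
table_16_9 = [[10, None, None], [7, 6, 11], [5, None, None]]
print(fill_from_column_totals(table_16_9, totals))
```

For Table 16.8 both suppressed values are recovered exactly; for Table 16.9 each affected column has two gaps, so this simple subtraction no longer recovers them.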

Solution 3: Random Rounding – We can random round each cell to base 3 (or some other base of our choice). Multiples of 3 are left undisturbed. Any other number is next to a multiple of three, and two units away from another multiple of three. In random rounding we round either up or down – with a higher probability of rounding to the closer value.

Value Rounds To
0 0 always
1 0 with probability \(\frac23\), 3 with probability \(\frac13\)
2 0 with probability \(\frac13\), 3 with probability \(\frac23\)
3 3 always
4 3 with probability \(\frac23\), 6 with probability \(\frac13\)
5 3 with probability \(\frac13\), 6 with probability \(\frac23\)
6 6 always
7 6 with probability \(\frac23\), 9 with probability \(\frac13\)
8 6 with probability \(\frac13\), 9 with probability \(\frac23\)
9 9 always
10 9 with probability \(\frac23\), 12 with probability \(\frac13\)
11 9 with probability \(\frac13\), 12 with probability \(\frac23\)
12 12 always
etc.

This means that if we see:

Random rounded value   Could actually have been:
0 0,1,2
3 1,2,3,4,5
6 4,5,6,7,8
9 7,8,9,10,11
12 10,11,12,13,14
etc.
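
The pattern in this list can be written as a small function, given here as an illustrative sketch with a hypothetical name. A published value could have come from any count within two units of it (for base 3), except that counts cannot be negative.

```python
def possible_originals(rounded, base=3):
    """List the original counts that could have produced a rounded value."""
    lowest = max(0, rounded - (base - 1))
    return list(range(lowest, rounded + base))


for value in (0, 3, 6, 9, 12):
    print(value, possible_originals(value))
```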

Each time we random round a table, we end up with a slightly different version. We round each cell separately, margins included: the margins are rounded from their original values rather than recalculated by adding up the random rounded cells.

Table 16.10: Random Rounded (base 3)
        A   B   C  Total
P       9  12   6     27
Q       6   6  12     24
R       6   0   0      9
Total  21  18  21     60

The good thing about rounding every cell separately is that the margin values are not far from their true values. The frustrating thing is that the cells in the table don’t necessarily add up to their margins any more (e.g. in the table above, the R row gives 6+0+0=6 but the row total is stated to be 9).

16.6 Other Types of Release

Graphical displays are equivalent in many ways to tables and unit records, and many of the confidentialising methods listed above apply to them.

A scatterplot is a visual subset of a unit record dataset: we see pairs of individual values of continuous variables displayed. Noise could be added to the data points to confidentialise the data. In a histogram the bands can be chosen to be wide enough to group many respondents together, and the lowest and highest bins could be left open-ended (bottom and/or top coded) to protect the outlying respondents.