Measuring risk of re-identification in microdata: State-of-the-art and new directions

We review the influential research carried out by Chris Skinner in the area of statistical disclosure control, and in particular quantifying the risk of re‐identification in sample microdata from a random survey drawn from a finite population. We use the sample microdata to infer population parameters when the population is unknown, and estimate the risk of re‐identification based on the notion of population uniqueness using probabilistic modelling. We also introduce a new approach to measure the risk of re‐identification for a subpopulation in a register that is not representative of the general population, for example a register of cancer patients. In addition, we can use the additional information from the register to measure the risk of re‐identification for the sample microdata. This new approach was developed by the two authors and is published here for the first time. We demonstrate this approach in an application study based on UK census data where we can compare the estimated risk measures to the known truth.


KEYWORDS
disclosure risks, key variables, log-linear models, model specification, probability scores estimation, registers

INTRODUCTION
Among Chris Skinner's many influential areas of interest that had a large impact on the dissemination of official statistics was statistical disclosure control (SDC), particularly focusing on measuring the risk of re-identification in sample microdata where the sample is drawn randomly from a finite population. This research was largely motivated by his collaborations with researchers at the University of Manchester in the early 1990s to convince the Office of Population Censuses and Surveys (which later merged with the Central Statistical Office to form the Office for National Statistics in 1996) to release a small sample of anonymized records from the 1991 census. This led to the important paper by Skinner et al. (1994), which resulted in the release of the Samples of Anonymised Records (SARs) in the United Kingdom for every census since 1991. It was during this time that Chris and others researched estimating the risk of re-identification in sample microdata by developing theory and statistical modelling frameworks, and conceptualized the disclosure risk in terms of population uniqueness given the observed sample microdata (Duncan & Lambert, 1989; Paass, 1988; Skinner, 1992). The disclosure risk scenario for the release of sample microdata containing records from a survey, where the sample is drawn randomly from a finite population, is based on the following assumptions: (1) there is an 'intruder' (someone with malicious intent to discredit the statistical office) who has access to the microdata and other auxiliary information from the population that allows him/her to link data sources in order to identify individuals in the sample microdata; (2) there is no 'response knowledge', meaning that the intruder does not know who was drawn into the sample of the survey. The basic definition of the risk of re-identification is therefore the probability of correctly being able to make this match.
If the characteristics of the population are known, such as when measured in a population register or census, this probability would be relatively straightforward to calculate. However, this is rarely the case since, within statistical agencies, samples are typically drawn from area or address-based sampling frames. Therefore, a statistical modelling framework is needed to estimate the probability of re-identification. This probability is conditional on the released data and the information available to the intruder, and is defined with respect to a probabilistic model and assumptions about how the data are generated (knowledge of the sampling process). The model is defined with respect to key variables, a set of quasi-identifiers present in both data sources, typically categorical (such as age, sex, location and ethnicity), that when cross-classified can be used to identify cells with small sample sizes; we particularly focus on the sample uniques. The risk of re-identification is based on the notion of population uniqueness on the set of key variables: given an observed sample unique in a table generated from the key variables, what is the probability that the cell is also a population unique?
The probabilistic modelling to estimate population uniqueness from the observed sample microdata was developed under two approaches: a full model-based framework taking into account all of the information available to intruders and modelling their behaviour (Duncan & Lambert, 1989; Lambert, 1993; and later Reiter, 2005), and a more simplified approach that restricts the information that would be known to intruders (Benedetti et al., 1998; Bethlehem et al., 1990; Fienberg & Makov, 1998; Skinner & Holmes, 1998).
In Section 2, we provide an overview of the research that Chris Skinner carried out on the probabilistic modelling approach to estimate the risk of re-identification based on population uniqueness. In Section 3 we present new research that was progressed during the Data Linkage and Anonymisation Programme at the Isaac Newton Institute, Cambridge, UK, from July to December 2016, some of which is published here for the first time based on the joint collaboration between the authors. The research extends the previous work of measuring the risk of re-identification in sample microdata to measuring the risk of re-identification in a publicly available register containing a subpopulation whose membership is unknown and may be sensitive, such as a register of cancer patients. The register is clearly not a random sample of the population and hence cannot be used in the original framework described in Section 2. Section 4 presents results from an application study. In Section 5 we present conclusions and future research directions for extending the probabilistic modelling framework to measure the risk of re-identification in non-probability samples, which have become more prevalent in recent years.

MEASURING THE RISK OF RE-IDENTIFICATION IN SAMPLE MICRODATA
In any released sample microdata from surveys of households and individuals, the direct identifying key variables, such as name, address or identification numbers, are removed. Nevertheless, disclosure risks can arise when there are small counts on a set of cross-classified indirect identifying key variables which are typically categorical, such as: age, sex, place of residence, marital status and occupation, as these can be used to identify an individual and further confidential information may be learnt from survey target variables. Under a probabilistic modelling approach, disclosure risk is assessed on the contingency table of sample counts spanned by these identifying key variables. The assumption is that the sample microdata contain responding individuals in a survey and the population counts are unknown (or only partially known through some marginal distributions). The risk of re-identification is therefore a function of both the population and the sample, and in particular the cell counts of the contingency table. Shlomo (2010) provides an overview of disclosure risk assessment in sample microdata which is summarized here to emphasize the contributions by Chris Skinner.
Individual per-record risk measures in the form of a probability of re-identification are estimated. These per-record risk measures are then aggregated to obtain global risk measures for the entire file. We denote by $F_k$ the population size in cell $k$ of a table spanned by the key variables having $K$ cells, by $f_k$ the sample size in cell $k$, with $\sum_k F_k = N$ and $\sum_k f_k = n$. The set of sample uniques is defined by $SU = \{k : f_k = 1\}$ and these are the high-risk records with the potential to be population uniques. Two global disclosure risk measures (where $I$ is the indicator function) are the following:

1. The number of sample uniques that are population uniques: $\tau_1 = \sum_k I(f_k = 1, F_k = 1)$.
2. The expected number of correct matches for sample uniques, assuming a random assignment within cell $k$. For example, if a sample unique matches to three individuals in the population, the match probability for that sample unique would be 1/3. Aggregating all match probabilities over the sample uniques leads us to: $\tau_2 = \sum_k I(f_k = 1)\, 1/F_k$.
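When the population table is known, the two global measures are simple tabulations. A minimal sketch (the cell keys and counts below are purely illustrative):

```python
# A minimal sketch of the two global risk measures: tau_1 counts sample
# uniques that are population uniques, tau_2 sums 1/F_k over sample uniques.
# The cell keys and counts below are purely illustrative.

def global_risk_measures(F, f):
    """F, f: dicts mapping a key-variable cell k to its population count F_k
    and sample count f_k. Returns (tau1, tau2)."""
    sample_uniques = [k for k, c in f.items() if c == 1]
    tau1 = sum(1 for k in sample_uniques if F[k] == 1)   # unique in both
    tau2 = sum(1.0 / F[k] for k in sample_uniques)       # match probabilities
    return tau1, tau2

# Example: cells "a" and "d" are population uniques; "a" and "b" are sample
# uniques, so tau1 = 1 and tau2 = 1 + 1/3.
F = {"a": 1, "b": 3, "c": 10, "d": 1}
f = {"a": 1, "b": 1, "c": 2, "d": 0}
```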
If the population frequencies $F_k$ are known, then we can easily calculate the global disclosure risk measures. However, this is rarely the case and we assume that the population frequencies $F_k$ are unknown and need to be estimated. Using a probabilistic model, the risk measures are estimated by:

$\hat{\tau}_1 = \sum_k I(f_k = 1)\, \hat{P}(F_k = 1 | f_k = 1)$ and $\hat{\tau}_2 = \sum_k I(f_k = 1)\, \hat{E}(1/F_k | f_k = 1)$. (1)

Skinner and Holmes (1998) and Elamir and Skinner (2006) propose a Poisson distribution and a log-linear model to estimate the disclosure risk measures in (1). In this model, they assume that $F_k \sim \mathrm{Pois}(\lambda_k)$ for each cell $k$. A sample is drawn by Poisson or Bernoulli sampling with a sampling fraction $\pi_k$ in cell $k$: $f_k | F_k \sim \mathrm{Bin}(F_k, \pi_k)$. It follows that:

$F_k | f_k \sim f_k + \mathrm{Pois}((1 - \pi_k)\lambda_k)$, (2)

where the population cell counts $F_k$ are assumed independent given the sample cell counts $f_k$. The parameters $\lambda_k$ are estimated using log-linear modelling. The sample frequencies $f_k$ are independently Poisson distributed with mean $\mu_k = \pi_k \lambda_k$. A log-linear model for the $\mu_k$ is expressed as:

$\log(\mu_k) = \mathbf{x}_k' \boldsymbol{\beta}$, (3)

where $\mathbf{x}_k$ is a design vector which denotes the main effects and interactions of the model for the key variables. The maximum likelihood estimator (MLE) $\hat{\boldsymbol{\beta}}$ is obtained by solving the score equations $\sum_k (f_k - \exp(\mathbf{x}_k'\boldsymbol{\beta}))\, \mathbf{x}_k = 0$. The fitted values are then calculated by $\hat{\mu}_k = \exp(\mathbf{x}_k'\hat{\boldsymbol{\beta}})$ and $\hat{\lambda}_k = \hat{\mu}_k / \pi_k$. The individual disclosure risk measures for cell $k$ are:

$P(F_k = 1 | f_k = 1) = \exp(-(1 - \pi_k)\lambda_k)$ and $E(1/F_k | f_k = 1) = \dfrac{1 - \exp(-(1 - \pi_k)\lambda_k)}{(1 - \pi_k)\lambda_k}$. (4)

Plugging $\hat{\lambda}_k$ for $\lambda_k$ in (4) leads to the estimates $\hat{P}(F_k = 1 | f_k = 1)$ and $\hat{E}(1/F_k | f_k = 1)$ and then to $\hat{\tau}_1$ and $\hat{\tau}_2$ of (1). Rinott and Shlomo (2007b) consider confidence intervals for these global risk measures. Skinner and Shlomo (2008) develop a method for selecting the main effects and interactions for the log-linear model based on estimating and (approximately) minimizing the bias of the risk estimates $\hat{\tau}_1$ and $\hat{\tau}_2$.
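The per-cell measures in (4) follow from the conditional distribution in (2): given a sample unique, $F_k$ is $1$ plus a Poisson count with mean $(1-\pi_k)\lambda_k$. A minimal sketch (in practice $\lambda_k$ would come from the fitted log-linear model; here it is simply an input):

```python
import math

# A minimal sketch of the per-cell measures in (4). Under the model,
# F_k | f_k = 1 is 1 + Pois((1 - pi_k) * lambda_k), so the probability of
# population uniqueness and the expected match probability have closed forms.

def per_record_risk(lam, pi):
    """Return (P(F_k = 1 | f_k = 1), E(1/F_k | f_k = 1)) for a sample unique."""
    m = (1.0 - pi) * lam          # mean of the unobserved remainder F_k - 1
    p_unique = math.exp(-m)       # no other population member falls in the cell
    e_inv = 1.0 if m == 0.0 else (1.0 - math.exp(-m)) / m
    return p_unique, e_inv

# Example: a sample unique in a cell with lambda_k = 2 and pi_k = 0.5 has
# P(population unique) = exp(-1) and E(1/F_k | f_k = 1) = 1 - exp(-1).
```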
Defining $h(\lambda_k) = P(F_k = 1 | f_k = 1)$ for $\tau_1$ and $h(\lambda_k) = E(1/F_k | f_k = 1)$ for $\tau_2$, they consider the bias expression:

$B = E\left[\sum_k I(f_k = 1)\, h(\hat{\lambda}_k) - \sum_k I(f_k = 1)\, h(\lambda_k)\right]$.

A Taylor expansion of $h$, together with the relations $E(f_k) = \mu_k$ and $E((f_k - \mu_k)^2) = \mu_k$ under the hypothesis of a Poisson distribution fit, leads to a further approximation of $B$ that can be estimated from the fitted model.
The method selects the model using a forward search algorithm which minimizes the standardized bias estimate $\hat{B}_i / \sqrt{\hat{v}_i}$ for $\hat{\tau}_i$, $i = 1, 2$, used as the goodness-of-fit criterion, where $\hat{v}_i$ is the variance estimate of $\hat{B}_i$. The criterion $\hat{B}_i / \sqrt{\hat{v}_i}$ has an approximate standard normal distribution under the hypothesis that the expected value of $\hat{B}_i$ is zero. Skinner and Shlomo (2008) also address the estimation of disclosure risk measures under complex survey designs with stratification, clustering and survey weights. While the method described assumes that all individuals within cell $k$ are selected independently using Bernoulli sampling, that is, $P(f_k = 1 | F_k) = F_k \pi_k (1 - \pi_k)^{F_k - 1}$, this may not be the case when sampling clusters (households). In practice, key variables typically include variables such as age, sex and occupation that tend to cut across clusters. Therefore, the above assumption holds in practice in most household surveys and does not cause bias in the estimation of the risk measures. Inclusion probabilities may vary across strata, where the most common stratification is on geography. Strata indicators should always be included in the key variables to take differential inclusion probabilities into account in the log-linear model. Under complex sampling, the $\lambda_k$ can be estimated consistently using pseudo-maximum likelihood estimation (Rao & Thomas, 2003), where $\hat{F}_k$ in the estimating equations is obtained by summing the survey weights in cell $k$: $\hat{F}_k = \sum_{i \in k} w_i$. The resulting estimates $\hat{\lambda}_k$ are plugged into the expressions in (4) and $\pi_k$ is replaced by the estimate $\hat{\pi}_k = f_k / \hat{F}_k$. The goodness-of-fit criterion $\hat{B}$ is also adapted to the pseudo-maximum likelihood approach.
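The inputs to the pseudo-maximum likelihood step under complex sampling can be assembled directly from the weighted sample records. A small sketch (the record tuples and weights are illustrative):

```python
from collections import defaultdict

# Sketch of the inputs used under complex sampling: cell totals F_hat_k from
# summed survey weights and plug-in sampling fractions pi_hat_k = f_k/F_hat_k.
# The (cell, weight) record tuples are illustrative.

def weighted_cell_inputs(records):
    """records: iterable of (cell_key, survey_weight), one per sampled unit."""
    f, F_hat = defaultdict(int), defaultdict(float)
    for k, w in records:
        f[k] += 1          # sample count in cell k
        F_hat[k] += w      # weighted population estimate for cell k
    pi_hat = {k: f[k] / F_hat[k] for k in f}
    return dict(f), dict(F_hat), pi_hat
```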
The probabilistic modelling presented here, and in other related work in the literature, assumes that there is no measurement error in the way the data are recorded. Besides typical errors in data capture, key variables can also purposely be misclassified as a means of masking the data, for example through record swapping or the post-randomization method (PRAM) (Gouweleeuw et al., 1998). Shlomo and Skinner (2010) adapt the estimation of the risk of re-identification $\tau_2$ in (1) to take measurement error into account. We denote the cross-classified key variables in the population and the microdata by $X$ and assume that $X$ in the microdata has undergone some misclassification or perturbation error, denoted by the value $\tilde{X}$ and determined independently by a misclassification matrix $M$ whose elements are the probabilities $P(\tilde{X} = j | X = k)$. Under assumptions of small sampling fractions and small misclassification errors, the record-level disclosure risk measure of a match with a sample unique can be approximated, and aggregating the per-record measures yields a global risk measure, their expression (10). Note that to calculate the measure in (10), only the diagonal of the misclassification matrix needs to be known, that is, the probabilities of not being perturbed. Population counts are generally not known, so the estimate in (10) can be obtained by probabilistic modelling on the misclassified sample as shown above. There have been many other contributions expanding the Poisson log-linear modelling framework for estimating the risk of re-identification in survey microdata. Ichim (2008) considers extensions by introducing the survey weights into the analysis of the contingency tables and also proposes a maximum penalized-likelihood approach to obtain smoother estimates of the risk of re-identification. Forster and Webb (2007) extend the log-linear modelling framework to a model averaging approach rather than requiring a single model to be chosen a priori.
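A PRAM-style perturbation step can be sketched as drawing each released category from the corresponding column of the misclassification matrix. The matrix and categories below are made up; a real PRAM matrix is chosen by the agency:

```python
import random

# Illustrative sketch of PRAM-style perturbation: each record's category is
# replaced by a draw from its column of a misclassification matrix, given
# here as M[true_value][released_value] = P(released | true).

def pram(values, M, rng):
    out = []
    for v in values:
        u, cum = rng.random(), 0.0
        for released, prob in M[v].items():
            cum += prob
            if u < cum:
                out.append(released)
                break
        else:
            out.append(v)   # guard against floating-point rounding of probs
    return out
```

With an identity matrix nothing changes; the diagonal entries (the probabilities of not being perturbed) are exactly the quantities the approximate risk measure above requires.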
They use a Bayesian model averaging technique over $M$ possible log-linear models but limit the models to decomposable graphical models. The posterior distribution under model uncertainty is hence obtained as a weighted average of the posterior distributions under the various models. Rinott and Shlomo (2006, 2007a) generalize the probabilistic modelling using the Negative Binomial distribution rather than the Poisson distribution and implement the probabilistic modelling framework on local 'neighbourhoods' of the sample uniques. Manrique-Vallier and Reiter (2012) propose an alternative to log-linear models for datasets with sparse contingency tables on the key variables, using a Bayesian version of Grade of Membership (GoM) models fitted with a Markov chain Monte Carlo algorithm. Carota et al. (2015) apply a Bayesian semi-parametric version of log-linear models, specifically a mixed effects log-linear model with a Dirichlet process (DP) prior.

MEASURING THE RISK OF RE-IDENTIFICATION IN A SUBPOPULATION
Up to now, the survey microdata under investigation have been a random sample subset of the population, and therefore they can be used to estimate population parameters for the probabilistic models underlying the disclosure risk measures based on the risk of re-identification. In addition, the intruder knows that it is possible for an individual in the sample microdata to be matched to the population, as all individuals have a non-zero chance of being selected into the sample. In this new setting, we assume that we have microdata that represent a subpopulation. The microdata are publicly available, but membership of the subpopulation is not known. For example, a subpopulation can refer to all persons with a medical condition, such as those on a cancer or HIV register, or all holders of a supermarket loyalty card. The subpopulation is not representative of the population, as is the case for a random sample. Similar to the case for sample microdata, we assume the same disclosure risk scenario: an intruder aims to match a record in the subpopulation to an individual in the population of which the subpopulation is a subset, and the population counts are unknown. We also assume that there are no measurement errors in the way the data are recorded in the register. In order to allow inference about population uniqueness in the subpopulation, we assume that there also exist survey microdata from a random sample and that there are categorical identifying variables X across all data sources that can be used to match a record in the subpopulation or sample microdata to the population.
As mentioned we assume that membership in the subpopulation register, denoted by the variable R, is unknown and may be sensitive and the primary concern is that an intruder can identify an individual in the subpopulation and disclose their value of R. In this case, it is reasonable to assume that the intruder cannot use R as a potential identifying key variable for the probabilistic modelling. Therefore, in order to make inference about population uniqueness the intruder makes use of the sample microdata file where the sample is drawn from the finite population. Note that the membership of the subpopulation R is also not known in the sample microdata. Thus, whereas previously the sample microdata file served two purposes, one as the file about which disclosure risk is a concern and one for inference about population uniqueness, we now suppose that the intruder must resort to using separate files for these two purposes.
As an illustration to this new setting, we assume that both survey microdata containing a random sample from the population, such as the Labour Force Survey microdata, and a Register of Cancer Patients are observed, but the inclusion into the sample and the membership of the register are not known. In addition, it is not known who in the sample is also included in the register. We can then estimate the risk of re-identification in the Register of Cancer Patients where we draw inference from the sample microdata to estimate the population parameters. Alternatively, we can estimate the risk of re-identification in the sample microdata using the additional information that is available in the observed Register of Cancer Patients.
In summary, the key additional complication is that another data source, a register of a subpopulation that is not a random subset of the population, is observed and introduced into the framework described in Section 2. This new framework allows for the estimation of the risk of re-identification through population uniqueness in two settings:
- estimating the risk of re-identification in the subpopulation microdata given the data in both the subpopulation microdata and the sample microdata;
- estimating the risk of re-identification in the sample microdata given the data in the sample microdata and the additional information that can be obtained from the publicly available subpopulation microdata.

Framework
Let U and U 1 denote the population and the subpopulation, respectively, with U 1 ⊂ U. We refer to members of U as individuals, although they could more generally be other types of units. Let R i be the subpopulation indicator variable for individual i with R i = 1 if i ∈ U 1 and R i = 0 otherwise. We suppose that a subpopulation microdata file has been constructed for members of U 1 . We are concerned about the possibility of an intruder matching a record in this file to a known individual in the population and thus disclosing the fact that R i = 1 for individual i. As discussed, we suppose that membership R i is a sensitive variable for which disclosure is undesirable. We suppose that any matching by the intruder makes use of a vector X of key variables which are included in the subpopulation microdata file and which the intruder may be able to determine for known individuals in the population. We suppose that R i is not included in X. We suppose that the key variables are categorical and focus our concern on an intruder who finds an exact match on X between a record in the subpopulation microdata and a known individual in the population (assuming no measurement errors). As before, we label the possible combinations of key variables by k, k = 1, … , K and refer to each combination as a cell in a multi-way contingency table. We denote the population frequency in cell k by F k so that the cell is population unique if F k = 1.
We denote the subpopulation frequency in cell $k$ by $F_{1k}$. The most high-risk records are in cells with $F_{1k} = 1$ and, analogous to the derivation presented in Skinner and Shlomo (2008), two alternative risk measures are given by

$\tau_1^* = \sum_k I(F_{1k} = 1, F_k = 1)$ and $\tau_2^* = \sum_k I(F_{1k} = 1)\, 1/F_k$. (12)

We can also convert (12) into proportions by dividing by $\sum_k I(F_{1k} = 1)$. There is no way that these measures can be estimated consistently from the subpopulation microdata alone. The microdata provide information about the $F_{1k}$ but not about the $F_k$ in $U$. We have in mind subpopulations $U_1$ where the distribution of $X$ may be quite different to that in $U$, so the subpopulation microdata carry no direct information about the $F_k$. We suppose, therefore, that in addition to the microdata for people in $U_1$, there is a random sample microdata file in which the values of $X$ are recorded for a probability sample $s$ from $U$. We suppose that the two microdata files are not linked. Let $f_k$ denote the frequency in cell $k$ in $s$. Note that the $f_k$ and $F_{1k}$ are observed, but the $F_k$ are not. If the intruder has access to the sample microdata file, then it may be advantageous to restrict attention to cells with $f_k = 1$, leading to the following risk measures:

$\tau_1 = \sum_k I(F_{1k} = 1, f_k = 1, F_k = 1)$ and $\tau_2 = \sum_k I(F_{1k} = 1, f_k = 1)\, 1/F_k$. (13)

An alternative approach, following Skinner and Elliot (2002), is to focus on the cells with one entry in cell $k$ of both the subpopulation and the sample microdata, where $I(F_{1k} = 1, f_k = 1)$ represents a unique in both sources of microdata, and to note that there are $\sum_k F_k I(F_{1k} = 1, f_k = 1)$ individuals in the population $U$ who could be matched to these individuals using $X$. An alternative measure of risk is thus given by

$\varphi = \sum_k I(F_{1k} = 1, f_k = 1) \Big/ \sum_k F_k I(F_{1k} = 1, f_k = 1)$, (14)

which may be interpreted as the probability that a match is correct if an intruder selects any one of the $\sum_k F_k I(F_{1k} = 1, f_k = 1)$ individuals at random with equal probability.
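When the population table is known, as in the application study later, these register-based measures are again simple tabulations. A minimal sketch (all counts below are illustrative): tau1_star counts register uniques that are population uniques, tau1 restricts to cells also unique in the sample, and phi is the correct-match probability for records unique in both sources.

```python
# A minimal sketch of the register-based measures described in the text,
# computed from known population (F), register (F1) and sample (f) counts.
# All counts below are illustrative.

def subpop_risk_measures(F, F1, f):
    """F, F1, f: dicts of population, register and sample counts per cell."""
    tau1_star = sum(1 for k in F if F1[k] == 1 and F[k] == 1)
    both = [k for k in F if F1[k] == 1 and f[k] == 1]   # unique in both files
    tau1 = sum(1 for k in both if F[k] == 1)
    phi = len(both) / sum(F[k] for k in both) if both else 0.0
    return tau1_star, tau1, phi
```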
Following Skinner and Shlomo (2008), suppose that $F_k$ is Poisson distributed, $F_k \sim \mathrm{Pois}(\lambda_k)$, where the parameter $\lambda_k$ obeys the log-linear model

$\log(\lambda_k) = \mathbf{x}_k'\boldsymbol{\beta}$. (15)

Suppose that within cell $k$ the unknown membership variable $R_i$ takes the value 1 with probability $p_k$, independently for each of the $F_k$ units, so that $F_{1k} \sim \mathrm{Pois}(\mu_k)$ where $\mu_k = \lambda_k p_k$, and $F_{1k}$ is binomially distributed, $F_{1k} | F_k \sim \mathrm{Bin}(F_k, p_k)$, conditional on the $F_k$. Furthermore, we assume that $p_k$ obeys the logistic model:

$\mathrm{logit}(p_k) = \mathbf{x}_k'\boldsymbol{\xi}$. (16)

For simplicity, the same vector $\mathbf{x}_k$ is used in both models here, but different specifications could apply. Suppose that the sample $s$ is obtained by Poisson or Bernoulli sampling with inclusion probability $\pi_k$ in cell $k$.

Expressions for risk of re-identification measures
In this section, we provide expressions for the risk measures from Section 3.1 in terms of the model parameters introduced there. We first introduce more notation. Let $f_{1k}$ denote the frequency in cell $k$ in $s \cap U_1$ and let $\bar{f}_{1k} = F_{1k} - f_{1k}$. Note that $s \cap U_1$ specifies the set of individuals appearing in both the sample and the subpopulation and that this set is not observed. Similarly, let $f_{0k} = f_k - f_{1k}$ denote the frequency in cell $k$ of sampled individuals who are not in the subpopulation, and let $\bar{F}_{0k} = F_k - F_{1k} - f_{0k}$ denote the frequency of individuals in neither the sample nor the subpopulation. Figure 1 shows the Venn diagram of the decomposition of the population count $F_k$ into these mutually exclusive sets.
Assuming that the selection of $s$ is independent of $R$, a convenient approximation for developing the risk measures is to assume that the quantities $f_{1k}$, $\bar{f}_{1k}$, $f_{0k}$ and $\bar{F}_{0k}$ are independently Poisson distributed:

$f_{1k} \sim \mathrm{Pois}(\pi_k \mu_k)$, $\bar{f}_{1k} \sim \mathrm{Pois}((1 - \pi_k)\mu_k)$, $f_{0k} \sim \mathrm{Pois}(\pi_k(\lambda_k - \mu_k))$, $\bar{F}_{0k} \sim \mathrm{Pois}((1 - \pi_k)(\lambda_k - \mu_k))$.

To obtain an expression for $\tau_1$ in (13), we write

$\tau_1 = \sum_k I(F_{1k} = 1, f_k = 1)\, P(F_k = 1 | F_{1k} = 1, f_k = 1)$. (18)

The only possible combination of values $(f_{1k}, \bar{f}_{1k}, f_{0k}, \bar{F}_{0k})$ which leads to $F_k = 1$, $F_{1k} = 1$ and $f_k = 1$ is given by $(1, 0, 0, 0)$. Hence,

$P(F_k = 1, F_{1k} = 1, f_k = 1) = \pi_k \mu_k \exp(-\lambda_k)$. (19)

Furthermore, the only possible combinations of values of $(f_{1k}, \bar{f}_{1k}, f_{0k})$ which lead to $F_{1k} = 1$ and $f_k = 1$ are given by $(1, 0, 0)$ and $(0, 1, 1)$. Hence

$P(F_{1k} = 1, f_k = 1) = \pi_k \mu_k \exp\{-\mu_k - \pi_k(\lambda_k - \mu_k)\}\,[1 + (1 - \pi_k)(\lambda_k - \mu_k)]$. (20)

Plugging (19) and (20) into (18) and simplifying gives

$\tau_1 = \sum_k I(F_{1k} = 1, f_k = 1)\, \dfrac{\exp\{-(1 - \pi_k)(\lambda_k - \mu_k)\}}{1 + (1 - \pi_k)(\lambda_k - \mu_k)}$. (21)

To evaluate $\tau_1^*$, we use $P(F_k = 1 | F_{1k} = 1) = \exp\{-(\lambda_k - \mu_k)\}$, which gives

$\tau_1^* = \sum_k I(F_{1k} = 1)\, \exp\{-(\lambda_k - \mu_k)\}$. (22)

The measure $\varphi$ in (14) can be estimated design-consistently without the need for modelling, as shown in Section 3.3.
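The combination argument can be checked by simulation: decompose each cell into four independent Poisson components and estimate the conditional probability that a cell unique in both the register ($F_{1k} = 1$) and the sample ($f_k = 1$) is also population unique. The closed form below is our reading of the simplified expression; the parameter values are made up.

```python
import math, random

# Monte Carlo sketch of the four-component Poisson decomposition described
# in this section. The closed form for the conditional probability is our
# reading of the simplified expression; parameter values are made up.

def rpois(mean, rng):
    # Knuth's multiplication method; adequate for the small means used here
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate(lam, mu, pi, trials, rng):
    hits = events = 0
    for _ in range(trials):
        f1 = rpois(pi * mu, rng)                  # in sample and register
        fb1 = rpois((1 - pi) * mu, rng)           # register only
        f0 = rpois(pi * (lam - mu), rng)          # sample only
        fb0 = rpois((1 - pi) * (lam - mu), rng)   # neither
        if f1 + fb1 == 1 and f1 + f0 == 1:        # F1_k = 1 and f_k = 1
            events += 1
            hits += (f1 + fb1 + f0 + fb0 == 1)    # F_k = 1
    return hits / events

lam, mu, pi = 1.2, 0.5, 0.3
nu = (1 - pi) * (lam - mu)
closed_form = math.exp(-nu) / (1 + nu)    # candidate per-cell probability
```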

Estimation of risk measures
We first consider the estimation of $\varphi$ in (14). Following the arguments of Skinner and Elliot (2002), a design-consistent estimator $\hat{\varphi}$ of $\varphi$ can be obtained, where it is assumed that Poisson or Bernoulli sampling is employed with inclusion probability $\pi_k$ in cell $k$. Design-consistent estimation is not feasible for the remaining risk measures in (12) and (13), and we adopt the probabilistic modelling approach. These measures depend on the unknown $\lambda_k$ and $\mu_k$ and the known $\pi_k$ via (21) and (22). Assuming the models in (15) and (16), the $\lambda_k$ and $\mu_k$ are known functions of the parameters $\boldsymbol{\beta}$ and $\boldsymbol{\xi}$. The data consist of the values $f_k$ and $F_{1k}$. In this section, we consider how to estimate $\boldsymbol{\beta}$ and $\boldsymbol{\xi}$ from the $f_k$ and $F_{1k}$. We then suppose that the risk of re-identification measures are estimated by plugging these parameter estimates into (21) and (22). If the $F_k$ were observed, it would be straightforward to factor the likelihood into two components, one dependent on $\boldsymbol{\beta}$ via $F_k \sim \mathrm{Pois}(\lambda_k)$ and (15), and one dependent on $\boldsymbol{\xi}$ via $F_{1k} | F_k \sim \mathrm{Bin}(F_k, p_k)$ and (16). However, we only observe $f_k$ and not $F_k$, and the conditional distribution of $F_{1k}$ given $f_k$ cannot in general be expressed as a function of $p_k$. We consider a two-step estimation procedure under two different approaches.

Approach A
In the first approach, Approach A, we first estimate $\boldsymbol{\beta}$ from $f_k \sim \mathrm{Pois}(\pi_k \lambda_k)$, combined with the log-linear model defined by (15), as in Skinner and Shlomo (2008). In the second step, we estimate $\boldsymbol{\xi}$, fixing $\lambda_k$ at the value implied by (15) with $\boldsymbol{\beta}$ set at its value estimated in the first step. We then use (16) and the fact that $\mu_k = \lambda_k p_k$ to write

$\mu_k = \lambda_k \dfrac{\exp(\mathbf{x}_k'\boldsymbol{\xi})}{1 + \exp(\mathbf{x}_k'\boldsymbol{\xi})}$, (24)

and then estimate $\boldsymbol{\xi}$ from the fact that $F_{1k} \sim \mathrm{Pois}(\mu_k)$ using maximum likelihood estimation and treating $\lambda_k$ as known. The log likelihood (ignoring a constant term) is given by

$\ell(\boldsymbol{\xi}) = \sum_k \left[F_{1k} \log(\mu_k) - \mu_k\right]$.

The score equations are then given by

$U(\boldsymbol{\xi}) = \sum_k \left(\dfrac{F_{1k}}{\mu_k} - 1\right) \dfrac{\partial \mu_k}{\partial \boldsymbol{\xi}} = 0$.

We obtain from (24) that $\partial \mu_k / \partial \boldsymbol{\xi} = \mu_k (1 - p_k)\, \mathbf{x}_k$. Hence the score equations can be written as

$U(\boldsymbol{\xi}) = \sum_k (F_{1k} - \mu_k)(1 - p_k)\, \mathbf{x}_k = 0$.

To obtain an estimator of $\boldsymbol{\xi}$, these equations can be solved by the Newton-Raphson method or the method of Fisher scoring, treating each of $\mu_k$ and $p_k$ as functions of $\boldsymbol{\xi}$ using (16) and (24) and treating the $\lambda_k$ as given.
The derivative of $U(\boldsymbol{\xi})$ is

$H(\boldsymbol{\xi}) = \dfrac{\partial U(\boldsymbol{\xi})}{\partial \boldsymbol{\xi}'} = -\sum_k \left[\mu_k (1 - p_k)^2 + (F_{1k} - \mu_k)\, p_k (1 - p_k)\right] \mathbf{x}_k \mathbf{x}_k'$.

Using the method of Fisher scoring, we replace $H(\boldsymbol{\xi})$ by its expectation. Thus, using $E(F_{1k}) = \mu_k$, we have

$E[H(\boldsymbol{\xi})] = -\sum_k \mu_k (1 - p_k)^2\, \mathbf{x}_k \mathbf{x}_k'$,

and, letting $\hat{\boldsymbol{\xi}}_r$ denote the estimate of $\boldsymbol{\xi}$ at the $r$th iteration and setting $\hat{\boldsymbol{\xi}}_0 = 0$, we have

$\hat{\boldsymbol{\xi}}_{r+1} = \hat{\boldsymbol{\xi}}_r + \left[\sum_k \mu_k (1 - p_k)^2\, \mathbf{x}_k \mathbf{x}_k'\right]^{-1} U(\hat{\boldsymbol{\xi}}_r)$,

with $\mu_k$ and $p_k$ evaluated at $\hat{\boldsymbol{\xi}}_r$.
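The second step of Approach A can be sketched for a single parameter: with the $\lambda_k$ treated as known from the first step, $\xi$ in $\mathrm{logit}(p_k) = x_k \xi$ is estimated by Fisher scoring from $F_{1k} \sim \mathrm{Pois}(\mu_k)$, $\mu_k = \lambda_k p_k$. The design values and lambdas below are illustrative.

```python
import math

# Sketch of Approach A's second step for a scalar parameter xi, estimated by
# Fisher scoring with the lambda_k treated as known. Values are illustrative.

def fit_xi(x, lam, F1, iters=50):
    """Fisher scoring for a scalar xi; x, lam, F1 are per-cell lists."""
    xi = 0.0                                      # starting value xi_0 = 0
    for _ in range(iters):
        score = info = 0.0
        for xk, lk, yk in zip(x, lam, F1):
            p = 1.0 / (1.0 + math.exp(-xk * xi))
            mu = lk * p
            score += (yk - mu) * (1.0 - p) * xk    # U(xi) contribution
            info += mu * (1.0 - p) ** 2 * xk * xk  # expected information
        xi += score / info
    return xi
```

As a consistency check, feeding in $F_{1k}$ equal to its model mean at some true $\xi$ recovers that $\xi$.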

Approach B
In the second approach, Approach B, we return to the microdata level and estimate, for each individual $i$ in the subpopulation file, the probability of membership $p_i$, $i = 1, \ldots, N_1$, where $N_1$ is the number of individuals in the subpopulation. The estimation is carried out by using the random sample microdata file as the reference sample, as described below. We assume that the sample microdata have survey weights $w_j$ for each individual $j$, $j = 1, \ldots, n$. Since the probability of membership $p_i$ corrects for the lack of representativeness of the subpopulation, we can calculate estimates of the population totals $F_k$ by inverse probability weighted (IPW) estimation: $\hat{F}_k = \sum_{i \in k} 1/\hat{p}_i$, where $\hat{p}_i$ is the estimate of $p_i$. Now treating the $\hat{p}_i$ as fixed from the first step, we then estimate $\lambda_k$ by the pseudo-maximum likelihood estimation shown in (7), as described in Skinner and Shlomo (2008). Defining $\hat{p}_k = F_{1k}/\hat{F}_k$, we estimate $\hat{\mu}_k = \hat{\lambda}_k \hat{p}_k$ and calculate the risk measures in (21) and (22).
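The IPW step can be sketched as follows: register members contribute $1/\hat{p}_i$ to their cell's population estimate, and the implied cell-level membership rate (our reading of the text) is $F_{1k}/\hat{F}_k$. The (cell, $\hat{p}_i$) pairs below are illustrative.

```python
from collections import defaultdict

# Sketch of the IPW step of Approach B: each register member contributes
# 1/p_hat_i to its cell's population estimate. The cell-level membership
# rate F1_k / F_hat_k is our reading of the text. Inputs are illustrative.

def ipw_cell_totals(register):
    """register: iterable of (cell_key, p_hat_i), one per register member."""
    F1, F_hat = defaultdict(int), defaultdict(float)
    for k, p in register:
        F1[k] += 1
        F_hat[k] += 1.0 / p      # inverse probability weighted total
    p_hat_k = {k: F1[k] / F_hat[k] for k in F1}
    return dict(F1), dict(F_hat), p_hat_k
```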
To estimate the probability of membership $p_i$ for the subpopulation microdata, we implement the method proposed in Chen et al. (2019), summarized below. We denote the subpopulation microdata as file A and the sample microdata as file B. We stack the two files and define $T_i = 1$ if $i \in A$ and $T_i = 0$ if $i \in B$. The probability of membership for the subpopulation microdata A is $p_i \equiv p(\mathbf{x}_i, \boldsymbol{\theta}) = P(T_i = 1 | \mathbf{x}_i, \boldsymbol{\theta})$, where $\mathbf{x}_i$ is the design vector denoting the main effects and interactions. The maximum likelihood estimator of $p_i$ is $p(\mathbf{x}_i, \hat{\boldsymbol{\theta}})$, where $\hat{\boldsymbol{\theta}}$ maximizes the log-likelihood function

$\ell(\boldsymbol{\theta}) = \sum_{i \in A} \log\left(\dfrac{p_i}{1 - p_i}\right) + \sum_{i \in U} \log(1 - p_i)$. (27)

Since we do not observe the whole population, Chen et al. (2019) replace the second term in (27) with the Horvitz-Thompson estimator obtained from the random reference sample having survey weights $w_j$ and information on $\mathbf{x}_i$, to maximize the pseudo log-likelihood function

$\ell^*(\boldsymbol{\theta}) = \sum_{i \in A} \log\left(\dfrac{p_i}{1 - p_i}\right) + \sum_{j \in B} w_j \log(1 - p_j)$. (28)

Under a logistic regression model, the pseudo log-likelihood function is

$\ell^*(\boldsymbol{\theta}) = \sum_{i \in A} \mathbf{x}_i'\boldsymbol{\theta} - \sum_{j \in B} w_j \log\left(1 + \exp(\mathbf{x}_j'\boldsymbol{\theta})\right)$.
The score equations are then

$U(\boldsymbol{\theta}) = \sum_{i \in A} \mathbf{x}_i - \sum_{j \in B} w_j\, p(\mathbf{x}_j, \boldsymbol{\theta})\, \mathbf{x}_j = 0$.

Chen et al. (2019) propose a Newton-Raphson procedure. Letting $\hat{\boldsymbol{\theta}}_r$ denote the estimate of $\boldsymbol{\theta}$ at the $r$th iteration, we have

$\hat{\boldsymbol{\theta}}_{r+1} = \hat{\boldsymbol{\theta}}_r + \left[\sum_{j \in B} w_j\, p(\mathbf{x}_j, \hat{\boldsymbol{\theta}}_r)\{1 - p(\mathbf{x}_j, \hat{\boldsymbol{\theta}}_r)\}\, \mathbf{x}_j \mathbf{x}_j'\right]^{-1} U(\hat{\boldsymbol{\theta}}_r)$,

setting $\hat{\boldsymbol{\theta}}_0 = 0$ for the first iteration.
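The pseudo-likelihood propensity estimation can be sketched for a one-parameter logistic model $p(x, \theta) = 1/(1 + \exp(-\theta x))$: the score adds $x$ over the non-probability file A and subtracts the weighted, propensity-scaled sum over the reference sample B. All data values below are made up for illustration.

```python
import math

# Sketch of the pseudo-score Newton-Raphson of Chen et al. (2019) for a
# one-parameter logistic propensity model. Data values are made up.

def fit_theta(xA, xB, wB, iters=50):
    """xA: covariate values in file A; xB, wB: values and weights in file B."""
    theta = 0.0                                   # theta_0 = 0
    for _ in range(iters):
        score = sum(xA)                           # sum of x over file A
        info = 0.0
        for xj, wj in zip(xB, wB):
            p = 1.0 / (1.0 + math.exp(-theta * xj))
            score -= wj * p * xj                  # weighted reference term
            info += wj * p * (1.0 - p) * xj * xj  # minus the score derivative
        theta += score / info
    return theta
```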

APPLICATION STUDY
From the UK Census 2001, cell proportions from published tables for ages 16 and over were calculated and cross-classified and, if necessary, complemented with iterative proportional fitting, to obtain joint probabilities for the following variables: Geography (6 categories), Age group (14 categories), Sex (2 categories), Marital Status (6 categories), Ethnicity (16 categories), Economic Activity (10 categories) and Ill Health (2 categories). We then multiplied the proportions by 1,000,000 individuals and, after rounding, obtained a synthetic census microdata dataset of N = 1,003,401 individuals. The subpopulation comprises those having ill health, where N_1 = 179,699. We produce a multi-way contingency table of size K = 161,280 cells defined by all variables except Ill Health. Table 1 compares the distributions in the population and the subpopulation microdata for the key variables Age Group, Sex and Economic Activity. Table 1 clearly shows that the subpopulation mainly contains the elderly population, as they are more likely to have Ill Health.

Simulation steps approach A
Step 1: Draw 100 random samples without replacement from the population using Bernoulli sampling with $\pi = 1/50$, resulting in a sample size of n = 20,068 on average.
Step 2: On each sample, run the log-linear model (3), where the model is the all two-way interaction model on the key variables Geography, Age group, Sex, Marital Status, Ethnicity and Economic Activity, to estimate $\hat{\lambda}_k = \exp(\mathbf{x}_k'\hat{\boldsymbol{\beta}})/\pi$.
Step 3: Estimate the probability p k according to Approach A in Section 3.3. Here we use the same key variables and the main effects design matrix.
Step 4: Define $\hat{\mu}_k = \hat{\lambda}_k \hat{p}_k$ and calculate the risk measures in (21) and (22), comparing them to the true values based on the known population.

Simulation steps approach B
Step 1: Draw 100 random samples without replacement from the population using Bernoulli sampling with $\pi = 1/50$, resulting in a sample size of n = 20,068 on average.
Step 2: Use each sample as a reference sample to combine with the subpopulation microdata and estimate $\hat{p}_i$ according to Approach B in Section 3.3 using the Chen et al. (2019) method.
Here we also use the key variables: Geography, Age group, Sex, Marital Status, Ethnicity and Economic Activity and the main effects design matrix.
Step 3: Calculate the inverse probability weighted (IPW) estimates $\hat{F}_k$. In addition, use the IPW estimates to calculate the marginal counts for all two-way interactions of the key variables, which are used for the log-linear modelling in (7).
Step 4: Run the log-linear model described in (7) under the pseudo-maximum likelihood estimation method, using the all two-way interaction model with the estimates from Step 3, to estimate $\hat{\lambda}_k$.
Step 5: Define $\hat{p}_k = 1/\hat{F}_k$ and estimate $\hat{\lambda}_{1k} = \hat{\lambda}_k \hat{p}_k$. Calculate the risk measures in (21) and (22) and compare them to the true values known from the population.
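A minimal sketch of the propensity step (Step 2) under the Chen et al. (2019) approach, assuming a logistic participation model: the pseudo-score equates the covariate totals over the non-probability (subpopulation) units with their design-weighted, propensity-weighted totals over the reference sample. The data, dimensions and weights below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: covariates for the subpopulation (non-probability) units and for
# a reference probability sample with design weights d_i = 1/pi_i
n_sub, n_ref = 500, 300
x_sub = np.column_stack([np.ones(n_sub), rng.normal(0.5, 1.0, n_sub)])
x_ref = np.column_stack([np.ones(n_ref), rng.normal(0.0, 1.0, n_ref)])
d_ref = np.full(n_ref, 50.0)

# Chen-Li-Wu pseudo-log-likelihood for logistic participation propensities
def pll(theta):
    z = x_ref @ theta
    return (x_sub @ theta).sum() - np.sum(d_ref * np.logaddexp(0.0, z))

theta = np.zeros(2)
for _ in range(50):                      # damped Newton ascent
    p = expit(x_ref @ theta)
    # pseudo-score: sum_B x_i - sum_R d_i p(x_i) x_i
    score = x_sub.sum(axis=0) - (d_ref * p) @ x_ref
    H = -(x_ref * (d_ref * p * (1 - p))[:, None]).T @ x_ref - 1e-8 * np.eye(2)
    step = np.linalg.solve(H, -score)
    t = 1.0
    while pll(theta + t * step) < pll(theta) and t > 1e-4:
        t /= 2
    theta += t * step

p_sub = expit(x_sub @ theta)             # estimated propensities for subpop units
F_hat_total = np.sum(1.0 / p_sub)        # IPW estimate of the population size
```

In Step 3, $\hat{F}_k$ would then be obtained by summing $1/\hat{p}_i$ over the subpopulation units falling in key-variable cell k, and those IPW counts (and their two-way margins) feed the pseudo-maximum-likelihood log-linear fit of Step 4.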

Results
We first describe the results of the log-linear modelling in Step 2 of Approach A of the application study and the justification for using the all two-way interaction model. We estimate the number of sample uniques that are population uniques, $\tau_1$, as shown in (1). The all two-way interaction model provides a good fit, and we therefore use it for the log-linear models in both Approaches A and B of the application study. We use the main-effects model for estimating the probability scores and discuss the implications of these models in Section 4.4, which summarizes the results of the application study. Future work will investigate other types of models and the development of goodness-of-fit criteria for this setting. Table 2 shows the results of the estimation of the risk measures in (21) and (22) for both Approach A and Approach B in the application study. Figure 2 shows box plots of the same measures.
From Table 2 and Figure 2, we see that Approach B outperforms Approach A, with estimated risk measures closer to their true values for both $\tau_1$ and $\tau_1^*$. We discuss these findings in Section 4.4. We also see that the risk measure in (14), as described in Skinner and Elliot (2002), does not require the use of models and can be estimated without bias. However, the interpretation of this measure may not be as useful as the $\tau_1$ measure in (21) for quantifying the risk of re-identification from the perspective of the statistical agency releasing the microdata.

Summary of findings
In any two-step estimation approach, the parameters estimated in the second step depend on the parameters estimated in the first step. In Approach A, if a parameter $\lambda_k$ in cell k is estimated as 0 (there is no estimated expected mean (population size) in that cell), then $\lambda_{1k}$ will also be zero because of the relationship $\lambda_{1k} = \lambda_k p_k$. As seen in Skinner and Shlomo (2008), there is monotonicity in the log-linear models used to estimate the expected mean parameters from the observed random sample. The main-effects model assumes there are no zeros in the contingency table defined by the key variables and generally spreads the population mass too thin, thus lowering the expected mean on the sample unique cells and overestimating the risk of re-identification. On the other hand, the saturated model assumes that all zeros in the contingency table are real zeros and therefore estimates the expected mean on the sample unique cells to be too high, thus underestimating the risk of re-identification. The $B$ goodness-of-fit criteria aim to find the right balance in the estimation of the population parameters between the zero cells that are random due to sampling and the zero cells that are structural (real) zeros. The all two-way interaction model shows a good fit under the log-linear model in Approach A, since any zero appearing in an all two-way marginal table is more likely to be a structural zero in the population. Nevertheless, in Approach A there are some cells of the contingency table with an estimated expected mean of zero even though there is evidence of population in that cell, given the presence of individuals from the subpopulation. Thus, we are not utilizing all the information available to estimate the disclosure risk measures. As a result of estimating $\hat{\lambda}_k = 0$ in cell k, we also obtain $\hat{\lambda}_{1k} = 0$. Therefore, $\tau_1^*$ in Approach A, based on the subpopulation uniques, is overestimated, owing to the fact that $\exp(0) = 1$.
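The contrast between the two extremes can be made concrete with a toy 2×2 table containing a single zero cell (our illustration, not from the paper): the saturated fit reproduces the zero exactly, treating it as structural, while the main-effects (independence) fit always spreads mass into it.

```python
import numpy as np

# observed sample counts with one zero cell
# (is it a sampling zero or a structural zero?)
f = np.array([[5.0, 0.0],
              [3.0, 2.0]])

# saturated model: fitted values equal the observed table,
# so the zero is treated as real and the cell's expected mean is 0
saturated = f.copy()

# independence (main-effects) model: mu_ij = row_i * col_j / total,
# so some population mass is always placed in the zero cell
row, col, total = f.sum(axis=1), f.sum(axis=0), f.sum()
independence = np.outer(row, col) / total
print(independence)        # -> [[4. 1.]
                           #     [4. 1.]]
```

Scaling the independence fit by $1/\pi$ would place population mass in the zero cell, whereas the saturated fit treats it as structurally empty; the all two-way interaction model sits between these extremes.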
The risk measure $\tau_1$ depends on both sample uniques and subpopulation uniques and performs better, with a smaller bias, than $\tau_1^*$. Approach B, on the other hand, starts with the larger subpopulation dataset, and those data carry more information about the population zero and unique cells than the smaller random sample. We first estimate the $\hat{p}_i$, $i = 1, \ldots, N_1$, using the main-effects model based on the individual units of the subpopulation. Future work will also look at the impact of introducing interactions in this model. The estimated propensities $\hat{p}_i$ enable the robust estimation of population counts to use in the pseudo-maximum-likelihood estimation of the log-linear model in (7). Approach B provides better estimates of both risk measures, with a slight downward bias for $\tau_1^*$ but an unbiased estimate of $\tau_1$.

Finally, we can demonstrate the approach of Elamir and Skinner (2006) described in Section 2 and estimate the number of subpopulation uniques that are population uniques, $\tau_1^*$, as in (1) with the formula in (4) under different log-linear models. Note that in this case we are not able to carry out a model search using the $B$ goodness-of-fit criteria because they are not valid for a non-probability subpopulation. The true value is $\tau_1^* = 2721$, as shown in Table 2. Under the independence log-linear model, $\hat{\tau}_1^* = 2783.9$, and under the all two-way interaction model, $\hat{\tau}_1^* = 1792.6$. Under a log-linear model with three main effects (Geography, Age Group and Sex) and three two-way interactions (Marital Status*Ethnicity, Marital Status*Economic Activity and Ethnicity*Economic Activity), we obtain $\hat{\tau}_1^* = 2682.8$.

CONCLUSIONS AND FUTURE WORK
The conclusion for assessing disclosure risk in microdata, based on the risk of re-identification for subpopulation registers $\tau_1^*$ and/or using the subpopulation register to estimate the risk of re-identification in sample microdata $\tau_1$, is to use Approach B. This assumes that the subpopulation is a large enough dataset to allow for more robust estimation of parameters and to compensate for the zero cells of the contingency table. An area of future research is to refine the goodness-of-fit criteria for determining both the correct model for the estimation of the probability scores and the log-linear model under the pseudo-maximum-likelihood approach, so as to produce unbiased estimates of the global risk measures. In addition, whereas Rinott and Shlomo (2007b) considered confidence intervals for the global risk measures defined in Section 2, future research is needed to adapt and develop confidence intervals for the global risk measures shown in Section 3.
As seen in the application study in Section 4, a two-step approach to estimating the parameters of the disclosure risk measures may not enable estimating the risk of re-identification for samples drawn from the subpopulation, and more generally for non-probability samples, since there is less information about the population to compensate for the zero cells in the contingency table due to sampling. Assessing disclosure risk for a non-probability sample has become more relevant in recent years with the increased use of non-probability samples to collect data on hard-to-capture populations. For this purpose, it is clear that we may need to develop a new approach where the parameters $\lambda_k$ and $p_k$ are estimated simultaneously, using the maximal amount of information from all available sources of data. Future research will focus on an alternative method. For example, one can treat the problem as estimation with incomplete data and use a fully Bayesian approach or an EM algorithm to estimate the unknown conditional distribution $F_{1k} \mid f_k$ (or $f_{1k} \mid f_k$, assuming a sample from the subpopulation or, more generally, a non-probability sample) under a complete-data likelihood of $\prod_k \Pr(f_{1k}, f_k, F_{1k})$.

ACKNOWLEDGEMENT
This research was progressed during the Data Linkage and Anonymization (DLA) Programme of the Isaac Newton Institute for Mathematical Sciences, Cambridge, United Kingdom (July–December 2016), EPSRC grant EP/K032208/1.