Measurement error in earnings data: Replication of Meijer, Rohwedder, and Wansbeek's mixture model approach to combining survey and register data

Summary
Meijer, Rohwedder, and Wansbeek (MRW, Journal of Business & Economic Statistics, 2012) developed methods for predicting a single earnings figure per worker from mixture factor models fitted to earnings data from multiple linked data sources. MRW applied their method using parameter estimates of Kapteyn and Ypma's mixture factor model (KY, Journal of Labor Economics, 2007) fitted to earnings data for Swedish workers aged 50+. First, we replicate MRW's empirical analysis using the Swedish model estimates. Second, we check the generality of their empirical findings with a new application. Using estimates of a KY model fitted to a linked dataset on earnings for UK employees of all ages, we confirm that MRW's principal findings about the performance of their various predictors of true earnings also hold in this different setting.


| INTRODUCTION
When you have observations for individuals' earnings from both survey and administrative data, which source will provide the best prediction of true earnings? The answer is straightforward if the linked administrative data are error free.
If not, combining the information from the two sources makes intuitive sense but leaves open the question of how to do this. Meijer, Rohwedder, and Wansbeek (2012, 'MRW') provided answers. They developed methods for predicting a single earnings measure using both data sources and compared the reliability and mean squared error (MSE) of their various predictors, considering the case in which the parameters describing the earnings generation and measurement error processes are given. MRW provided a detailed illustration of their prediction methods using parameter estimates reported by Kapteyn and Ypma (2007, 'KY'). We provide narrow and wide replications of MRW's empirical analysis.
First, we replicate, with one minor exception, MRW's empirical application. Although MRW's theory of factor score prediction methods for mixture factor analysis models is general, their illustration focuses on the specific case of KY's mixture factor model and uses KY's parameter estimates derived from survey and register data on earnings for 400 Swedish workers aged 50+ in 2003. KY's paper is important because it was the first to model error in administrative data (due to mismatch) in addition to two sources of error in survey data. They found regression to the mean in survey error to be negligible, thereby overturning conventional wisdom. Second, we investigate the generality of MRW's empirical findings. We apply their prediction methods using estimates from KY models fitted to data for nearly 6000 UK workers from across the full age range. Like MRW, we find that conditionally weighted and two-stage predictors, which combine information from both sources, have the best statistical properties.

| THE KAPTEYN-YPMA (KY) MODEL AND MULTIPLE PREDICTORS OF TRUE EARNINGS
2.1 | The Kapteyn-Ypma model for linked register-survey data
Suppose there are two sources of information on log earnings for each of a set of individuals: the survey measure (denoted s) and the register (administrative) data measure (r). We are interested in predicting true log earnings, ξ, which are unobserved.
KY assumed that when a survey response is linked to the wrong person in the register data ('mismatch'), an error-ridden measure ζ is observed instead of true earnings. Thus, the register data are a mixture of two types of observations, cases R1 and R2:

r = ξ (case R1, correct match, probability π_r), or r = ζ (case R2, mismatch, probability 1 − π_r).

For the survey data, KY assumed there are three types of observation depending on whether earnings are error free (case S1), have measurement error with a regression-to-the-mean component (S2), or measurement error plus contamination (S3):

s = ξ (S1), s = ξ + ρ(ξ − μ_ξ) + η (S2), or s = ξ + ρ(ξ − μ_ξ) + η + ω (S3).

Each earnings observation pair y = (r, s)′ belongs to one of six latent classes characterized by whether cases R1 and R2 are combined with S1, S2, or S3. The class membership probabilities are π_j, j = 1, …, 6, and are functions of π_r, π_s, and π_ω (see MRW's tab. 2 [Table S2]). For example, the probability of being in class 4, that is, having administrative mismatch (R2) and error-free survey earnings (S1), is (1 − π_r)π_s. KY assumed that true earnings and the error components are each independently normally distributed: ξ ~ N(μ_ξ, σ²_ξ), ζ ~ N(μ_ζ, σ²_ζ), η ~ N(μ_η, σ²_η), and ω ~ N(μ_ω, σ²_ω). They derived parameter estimates by maximizing the model log-likelihood (KY, 2007: Appendix B), having for identification set the size of the 'completely labelled' group, which comprises observations with error-free earnings (class 1: R1-S1).
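As a concrete illustration, the six class membership probabilities can be computed from (π_r, π_s, π_ω). This is a minimal sketch assuming the class ordering described above (R1 cases first, each paired with S1-S3); the parameter values are illustrative placeholders, not KY's estimates.

```python
# Class membership probabilities in the KY mixture model.
# Register: R1 (correct match) with prob pi_r, R2 (mismatch) with 1 - pi_r.
# Survey: S1 (error free) with prob pi_s, S2 (measurement error) with
# (1 - pi_s) * pi_w, S3 (error plus contamination) with (1 - pi_s) * (1 - pi_w).
# Class ordering assumed here: (R1,S1), (R1,S2), (R1,S3), (R2,S1), (R2,S2), (R2,S3).

def class_probabilities(pi_r, pi_s, pi_w):
    """Return the six class probabilities pi_1, ..., pi_6."""
    survey = [pi_s, (1 - pi_s) * pi_w, (1 - pi_s) * (1 - pi_w)]
    register = [pi_r, 1 - pi_r]
    return [pr * ps for pr in register for ps in survey]

# Illustrative values only (not KY's estimates).
pi = class_probabilities(pi_r=0.96, pi_s=0.20, pi_w=0.70)
print(pi, sum(pi))  # the six probabilities sum to one
```

As a check, the probability of class 4 (R2 combined with S1) equals (1 − π_r)π_s, matching the example in the text.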

2.2 | MRW's predictors of true earnings
MRW provided multiple approaches to the problem of how to combine administrative and survey data on earnings to obtain the best prediction of ξ, which we now summarize. MRW begin by deriving two predictors for the case of a single latent class. The first, ξ̂, is a linear function of y that minimizes the mean squared error (MSE). The second, ξ̂_U, is a linear combination that minimizes the MSE conditional on unbiasedness. If class memberships were manifest, these predictors could be applied in the multiple class case: for each class j, use class-specific versions of these predictors, ξ̂_j and ξ̂_U,j. However, because class membership is unobserved, MRW considered 'three natural ways to proceed: (1) compute the within-class predictors for each class and combine them in a weighted average; (2) predict class membership and then use the within-class predictor for the predicted class; and (3) derive predictors that minimize the total mean squared prediction error' (Meijer et al., 2012, p. 194).
Weighted average predictors are of two types, depending on whether unconditional latent class probabilities (the π_j) or class probabilities conditional on y, p_j(y), are used. Two-stage predictors first allocate each individual to the class with the largest probability conditional on y; second, given predicted class ĵ, true earnings are predicted using a within-class estimator. The idea is that, if class membership is predicted accurately, two-stage estimators inherit the desirable properties of the within-class predictor. Because either ξ̂_j or ξ̂_U,j could be the within-class predictor applied in each of the three approaches, there are six potential predictors. System-wide predictors minimize total MSE directly. MRW derive one predictor satisfying linearity, ξ̂_L, and another without restrictions, ξ̂, which is identical to the conditionally weighted predictor based on ξ̂_j.
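The mechanics of combining within-class predictors can be sketched as follows. The within-class predictors are linear in y, say a_j·r + b_j·s + c_j; the coefficients and posterior probabilities below are illustrative placeholders, not MRW's KY-model expressions.

```python
# Combining within-class linear predictors xi_j(y) = a_j*r + b_j*s + c_j.
# 'coefs' holds (a_j, b_j, c_j) per class; 'post' holds posterior class
# probabilities p_j(y). All numbers are illustrative, not MRW's.

def within_class(coefs, r, s):
    return [a * r + b * s + c for (a, b, c) in coefs]

def conditionally_weighted(coefs, post, r, s):
    # Weighted average of within-class predictions, weights p_j(y).
    preds = within_class(coefs, r, s)
    return sum(p * x for p, x in zip(post, preds))

def two_stage(coefs, post, r, s):
    # Stage 1: allocate to the most probable class given y.
    j_hat = max(range(len(post)), key=post.__getitem__)
    # Stage 2: apply that class's within-class predictor.
    return within_class(coefs, r, s)[j_hat]

coefs = [(1.0, 0.0, 0.0),    # e.g. register error free: predict with r
         (0.0, 1.01, -0.1),  # e.g. survey error only: adjusted s
         (0.0, 0.3, 8.0)]    # e.g. contamination: heavy shrinkage to a constant
post = [0.7, 0.2, 0.1]       # hypothetical p_j(y) for one observation
r, s = 12.1, 12.3
print(conditionally_weighted(coefs, post, r, s))
print(two_stage(coefs, post, r, s))  # here the most probable class uses r
```

The conditionally weighted predictor blends all classes' predictions, whereas the two-stage predictor commits to the single most probable class, which is why accurate class prediction matters for the latter.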
For each of the seven predictors and also predicted class membership probabilities, MRW derived specific expressions for the case of the KY model. Each expression is a function of the model parameters; hence, given a set of parameter estimates, the predictors can be calculated.

| REPLICATION OF MRW'S EMPIRICAL RESULTS
MRW used the parameter estimates reported by KY (Table S5), reproduced in the first column of Table 1, to illustrate their prediction methods. (The UK estimates in Table 1 are discussed in Section 4.) Taking the parameters as given, MRW abstract from estimation and identification issues. Although MRW reported statistics to two decimal places (d.p.), their calculations are based on parameter estimates reported to three d.p., as is our replication of their work. MRW used a combination of R and Stata; we use Stata for all analysis. The predictors were straightforward to calculate given MRW's expressions. More challenging to reproduce were their figs. 1 and 2 (Figures S1 and S2), used to summarize results (see below), as this task requires grid searches to find the points defining region boundaries. Our Stata code makes this task easier for future researchers.

Figure 1 shows predicted class memberships (ĵ) based on MRW's two-step approach, for each (r, s) combination excluding the region for class 1 (r = s). No observation is predicted to be in class 4. We find minor differences from MRW's fig. 2 (Figure S3), attributable to a typographical error in MRW's code that led to an incorrect error variance calculation and thence to incorrect boundaries between classes 5 and 6. As a consequence, the regions marked as class 5 in MRW's fig. 2 (Figure S3) are wider than in our Figure 1. But these differences are small and have no knock-on effects for the subsequent calculations to assess predictor performance. Mismatched observations lie further away from the mean of r (12.17) than do correctly matched ones. Observations with survey measurement error (class 5) tend to be relatively close to the mean of s (12.20); those also with contamination (class 6) are further away from it.

Table 2 shows estimates of the six unconditional class membership probabilities and the weights for the linear predictors, replicating MRW's tab. 5 (Table S3). For observations with mismatch (classes 4-6), the within-class predictors place zero weight on the administrative data, but different weights are placed on the survey data depending on the error type. For class 5, the weight of 1.01 makes a small adjustment for regression-to-the-mean error. The weights for the unbiased and unrestricted within-class predictors for class 6 differ markedly. The unrestricted predictor substantially down-weights the survey data (relative to the unbiased predictor): it is 'predominantly a constant, plus some effect from the survey, reflecting a large amount of shrinkage due to the large variance of the contamination term, so that more weight is given to the population mean' (MRW, p. 198). The two unconditional probability-weighted predictors have the same weights: 89% on the administrative data and 11% on the survey data (a ratio of about 8:1). The register data get a high weight because the mismatch probability is only 4%. In contrast, the linear projection predictor minimizing MSE (ξ̂_L) gives a relatively low weight to the register data (22%) and a relatively high weight to the survey data (55%), that is, a ratio of about 0.4:1.
For the conditionally weighted predictors and the two-stage predictors, the weights vary with r and s. For ξ̂ = ξ̂* = ã(y)r + b̃(y)s + c̃(y), the ã(y), b̃(y), and c̃(y) terms are weighted averages of the a, b, and c coefficients, and similarly for ξ̂*_U. MRW calculated the relative weight given to r compared with s, w_r(y) = ã(y)/[ã(y) + b̃(y)], say, and plotted contour lines for w_r(y) for different combinations of r and s. Figure 2 shows that w_r(y) falls very quickly as we move away from the mean. Thus, the two-stage predictors and the system-wide predictor ξ̂ have similar relative weights in practice. The two-stage predictors give zero weight to r in the case of mismatch, and Figure 1 shows that the regions where classes 5 and 6 locate correspond to areas of Figure 2 with low w_r(y).

MRW compared the overall performance of the seven predictors, and of r and s, in terms of their MSE and their reliability, defined as the squared correlation between the proxy and ξ. Because closed-form expressions do not exist for several predictors, MRW computed the performance statistics through simulation, using 10,000 draws based on the assumed model and the parameter estimates shown in Table 1. Table 3 corresponds to MRW's tab. 6 (Table S4), with our calculations based on a simulation strategy replicating theirs. Corresponding table entries are the same except for the four marked with '†', which each differ by one percentage point from MRW's. We attribute the differences to simulation variability and rounding. Table 3 shows, first, that 'r is a clear loser [relative to s]' (MRW, p. 199). Its reliability is less than one-half (0.49), whereas s's reliability is 0.69, and r's MSE is more than twice that for s. The probability of mismatch is small but has substantial adverse consequences in terms of statistical performance.
The two unconditionally weighted predictors have remarkably low reliability and high MSE, the latter driven by high variance rather than bias (see the rightmost two columns of Table 3).
The system-wide linear predictor ξ̂_L performs better (reliability = 0.76), but what stands out is the excellent performance of the remaining predictors, including the two two-stage predictors, which perform about as well as each other.
FIGURE 2 Relative weight of r in the conditionally weighted predictor ξ̂: replication of MRW's fig. 3 (Figure S4). Note: The lines connect points with the same relative weight given to r, w_r(y) = ã(y)/[ã(y) + b̃(y)].

Note to Table 3: Cells marked with '†' differ from the corresponding MRW estimates by one percentage point.
The unrestricted system-wide predictor ξ̂ = ξ̂* has the highest reliability (0.98), but this is virtually matched by ξ̂*_U and the two-stage predictors, each with a reliability of 0.97 and almost identical MSE (near zero). The excellent performance of the two-stage predictors reflects the high probability (96%) of correctly predicting class membership, with over half of the remaining 4% due to class 3 cases being misclassified as class 2. This misclassification has no effect on the precision of the two-stage predictors because the within-class predictor is r for both classes.
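The simulation-based evaluation of reliability and MSE can be sketched as follows. This deliberately simplified version draws true earnings and a mismatch indicator only (a two-class register model without survey error) and evaluates r itself as the predictor; the parameter values and the stripped-down design are ours, not MRW's.

```python
import random

random.seed(1)

# Simplified simulation of predictor performance: true log earnings xi;
# register r = xi with prob pi_r, otherwise an unrelated draw (mismatch).
# Reliability = squared correlation(predictor, xi); MSE = mean squared error.
# Parameter values are illustrative, not KY's estimates.
pi_r, mu, sigma = 0.96, 12.0, 0.5
n = 10_000

xi = [random.gauss(mu, sigma) for _ in range(n)]
r = [x if random.random() < pi_r else random.gauss(mu, 2.0 * sigma)
     for x in xi]

def reliability(pred, truth):
    """Squared correlation between a predictor and the truth."""
    m = len(pred)
    mp, mt = sum(pred) / m, sum(truth) / m
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth)) / m
    vp = sum((p - mp) ** 2 for p in pred) / m
    vt = sum((t - mt) ** 2 for t in truth) / m
    return cov ** 2 / (vp * vt)

mse = sum((p - t) ** 2 for p, t in zip(r, xi)) / n
print(f"reliability of r: {reliability(r, xi):.2f}, MSE: {mse:.3f}")
```

Even this toy version reproduces the qualitative point made above: a small mismatch probability inflates the MSE of r considerably, because the mismatched draws are far from the truth.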

| THE PERFORMANCE OF MRW'S PREDICTORS IN A DIFFERENT CONTEXT
It is useful to know whether MRW's results about the performance of different predictors hold when the application context differs. To investigate this issue, we use estimates of KY model parameters reported by Jenkins and Rios-Avila (2020) derived from UK data. The survey data are from the Family Resources Survey (FRS), the UK's main income survey, for financial year 2011/12. Information about gross employment earnings is derived by asking respondents what amount they received (for up to three jobs), followed by a question about the period to which that amount refers (a week, month, year, etc.). Responses are converted to annual amounts pro rata. Our measure of s is the logarithm of total gross earnings (the sum across all jobs reported). The administrative data are for the FRS respondents in employment who gave their consent for their survey responses to be linked to records held by the tax authorities (Her Majesty's Revenue and Customs [HMRC]). These P14 data are compiled from employers' returns on P14 forms to HMRC about wages and salaries paid to employees and taxes and National Insurance contributions withheld. Our measure of r is the logarithm of total gross earnings per year (the sum across all earnings spells reported for 2011/12).
The linked dataset contains 5971 men and women after excluding 420 observations with imputed or otherwise edited FRS earnings values. A scatterplot of r and s for the FRS-P14 data (Figure S1) is similar to MRW's fig. 1: the majority of observations lie on or close to the 45° ray from the origin, with relatively few observations above and below the line. However, we would expect differences in the model estimates derived from the two datasets, for example, differences in earnings and error variances, because our data refer to UK employees of all ages whereas KY's sample was aged 50+. The greater fraction of part-time (low-earning) workers in the Swedish data likely accounts for the dispersion of both r and s being larger there than in the United Kingdom (Table S1). Also, a larger fraction of the Swedish sample have r > s than r < s, whereas the fractions are approximately equal in the UK sample (Figure S1).

Table 1 displays KY model estimates from the UK data derived assuming the fraction with error-free earnings is 13.9% (i.e., close to KY's choice). We define completely labelled observations (class 1) as those for whom |r − s| < 0.02. The appendix reports UK estimates derived using a completely labelled fraction of 3.4% (those for whom |r − s| < 0.005). The appendix also shows that the two sets of estimates lead to similar conclusions about predictor performance, so they are not discussed further here. 1 For both the UK and Swedish estimates, the means of measurement error and of contamination error are negative, and regression to the mean in survey error is near zero. The standard deviation of earnings among mismatched observations is larger than the standard deviation of true earnings in both sets of estimates, but the differential is greater in KY's Swedish estimates (σ_ζ/σ_ξ is 2.5 by contrast with 1.7 for the UK data), which may reflect the differences in survey sample age range.
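Flagging the completely labelled observations used for identification is a simple filter on the earnings pairs. This sketch uses toy data rather than the FRS-P14 records, with the 0.02 tolerance from the text.

```python
# Flag 'completely labelled' observations (class 1 candidates): pairs
# with |r - s| below a tolerance (0.02 in the main UK specification).
# Toy (r, s) pairs, not the FRS-P14 records.
pairs = [(12.00, 12.01), (12.50, 11.90), (10.30, 10.30), (13.00, 12.20)]
tol = 0.02
labelled = [abs(r - s) < tol for r, s in pairs]
fraction = sum(labelled) / len(pairs)
print(fraction)  # 0.5 for this toy sample
```

Varying `tol` (e.g., 0.005 as in the appendix specification) changes the completely labelled fraction, which is how the alternative estimate set is constructed.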
In both sets of estimates, contamination error is more widely dispersed than measurement error but, again, the differential is greater for the Swedish estimates than the UK ones: σ_ω/σ_η is greater than 12 and less than 6, respectively. The UK probability of mismatch is 6%, greater than the 4% for Sweden, consistent with Sweden's greater experience with record linkage. Figure 3 shows the predicted class membership ĵ associated with each combination of r and s for models (a) and (b). The picture is similar to Figure 1. In both figures, class 2 observations are close to the 45° line. Class 3 observations are close to the mean of r but associated with a wide range of s values. Mismatched observations (classes 5 and 6) have r values distant from the mean.
There are interesting differences as well. In Figure 3, the lines demarcating the boundaries between the region where class 3 is located, on the one hand, and classes 5 and 6, on the other, have a distinctly smaller positive slope than do the corresponding lines in Figure 1. In the UK data, the probability of mismatch is larger than in the Swedish case and so, for observations around the mean of r (12.17), a relatively low or a relatively high value of s is more likely to imply membership of class 5 or 6 rather than class 3, by comparison with the Swedish case.

Table 4 shows the UK unconditional class membership probabilities (π_j) and the weights for the linear predictors of true earnings. The probabilities are broadly similar to the Swedish ones, the main differences being a lower probability of class 2 membership (0.61 versus 0.69) and a higher probability of class 3 membership (0.19 versus 0.13). There is a slightly higher probability of class 5 membership as well. Given the similarities in estimates cited so far, it is unsurprising that the weights for the linear predictors are similar to those for the Swedish case (cf. Table 2).
The largest cross-national difference is for the system-wide linear predictor (ξ̂_L), for which the UK case gives a larger relative weight to the register data than does the Swedish one: the UK weight on r is 0.32 by contrast with 0.22, whereas the weights on s are almost the same (0.54 and 0.55, respectively). Also, the UK within-class predictor weight on the survey data for class 6 (mismatch with survey error and contamination) is larger relative to the constant (c). In both contexts, there is substantial shrinkage to the population mean because the contamination error variance is large but, in the UK case, that variance is smaller relative to the measurement error variance and so there is less shrinkage.

Figure 4 shows contour lines for the register data relative weights, w_r(y), for the case of the unrestricted system-wide estimator, ξ̂. The UK chart is similar to Figure 2 for the Swedish case: relative weights fall very quickly as we move away from the mean. What differs across settings is the slope of the contour lines, which is less steep in the UK case. The explanation is the higher prevalence of mismatch in the United Kingdom.

TABLE 4 Unconditional class membership probabilities (π_j) for the UK earnings data model and weights for the linear predictors ar + bs + c

Table 5 summarizes the performance of MRW's seven predictors for the UK case. The patterns seen for the Swedish case (Table 3) are seen here as well. For example, first, the survey data are more reliable and have lower MSE than the register data. 2 There is an interesting difference, however. UK reliabilities for both r and s are higher than their Swedish counterparts and, moreover, the shortfall in reliability of r relative to s is smaller in the UK case. Put differently, r is no longer such a 'clear loser' (MRW, p. 199) in the UK context by comparison with the Swedish one (reliability of 0.70 compared with 0.48).
In current research using the same UK dataset, we find that error variances are higher for older than for younger workers, suggesting that the UK-Sweden differences in reliability are related to the different age compositions of the samples. Second, the unconditionally weighted estimators also perform better in the UK context than in the Swedish one, with a reliability of 0.75 rather than 0.54. So too does the system-wide linear predictor, with a reliability of 0.87 rather than 0.76.
Third, the remaining predictors have excellent statistical performance, with reliabilities of 0.95 or 0.96, only very slightly smaller than their Swedish counterparts (0.97 or 0.98). The excellent performance of the two-stage predictors is again due to the high accuracy with which class membership is predicted by the models (87%), though this rate is lower than the Swedish one (96%): contaminated observations prove harder to identify in the UK context. However, in both settings, at least one half of the misclassified observations are class 3 individuals wrongly placed in class 2.

| SUMMARY AND CONCLUSIONS
MRW analysed approaches to predicting individuals' true earnings when earnings observations are available from linked survey and administrative data. They provided general methods for factor score prediction in mixture factor analysis models and illustrated them using KY's model of the relationships between survey, register, and true earnings, taking KY's parameter estimates referring to Swedish data for older workers.
We have replicated MRW's empirical findings about the statistical performance of the various candidate earnings predictors based on estimates derived from models fitted to data for Swedish workers aged 50+. We have also shown that their findings about predictor performance hold in a different setting, UK employees of all ages, using Jenkins and Rios-Avila's (2020) KY model parameter estimates. The finding in common is that two-stage predictors and conditionally weighted predictors of true earnings have high reliability and low MSE. These perform much better in MSE and reliability terms than do the register or survey data used separately.
These findings about earnings predictors are conditional on the KY model used to describe the data generating process. The KY model does not allow for measurement error per se in the register data, ignores heterogeneity across subgroups of workers, and assumes each factor is normally distributed. In current work, we are fitting models that relax the first two assumptions; deriving and assessing the performance of earnings predictors for these and other models raises interesting challenges for future research.

1 See Jenkins and Rios-Avila's (2020) note.
2 Unfortunately, the reliabilities for r and s reported by Jenkins and Rios-Avila (2020, tab. 3) are incorrect. Corrected estimates show that the reliability of r is 0.69 over the full range of completely labelled fractions we considered, from 0.25% to 16.93%. Over the same range, the reliability of s varies only from 0.84 to 0.80, respectively. (These estimates are derived using analytical formulae; the estimates shown in Table 6 are derived via simulation.) In sum, our conclusions about the relative reliabilities of r and s, and how they vary with the completely labelled fraction, are unaffected.

OPEN RESEARCH BADGES
This article has been awarded the Open Data Badge for making publicly available the digitally shareable data necessary to reproduce the reported results. The data are available at http://qed.econ.queensu.ca/jae/datasets/jenkins001/