Peter-Paul de Wolf, Jan van der Laan and Daan Zult
A well-known problem in population size estimation using registers is that registers do not necessarily cover the whole population. This may be because they are intended to cover only part of the population (e.g., students), because of administrative delay, or because part of the target population is not registered by default (e.g., illegal persons). One of the methods to estimate the population size in the presence of undercount is the capture-recapture method, which combines the information of two or more samples. In the context of census estimation, registers are used instead of samples. However, the method assumes that perfect linkage between the registers can be achieved, and it is known that this assumption is often violated.
In the setting of evaluating the population coverage of a census using a post-enumeration survey, a correction for linkage error was proposed. That correction was later generalized by relaxing some of the newly introduced conditions. However, the new correction method still implicitly assumed that the two registers are of equal size. We introduce a further generalization that includes both previously mentioned correction methods and at the same time deals with registers of different sizes. Specific parameter settings will correspond to the different correction methods. We show that the parameters of each method can be chosen such that the resulting estimates all equal the traditional Petersen estimate (1896) that would theoretically be obtained under truly perfect linkage.
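To make the role of linkage concrete, the following minimal sketch computes the classical Petersen estimate, n1 * n2 / m, from two register sizes and the number of linked records. It does not implement any of the linkage-error corrections discussed above; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def petersen_estimate(n1: int, n2: int, m: int) -> float:
    """Classical Petersen (capture-recapture) estimate of population size.

    n1, n2: number of records in register 1 and register 2
    m:      number of records linked between the two registers
    """
    if m <= 0:
        raise ValueError("at least one linked record is required")
    return n1 * n2 / m


# Hypothetical register sizes and link counts (not taken from the article).
n1, n2 = 1000, 800
true_links = 400
missed_links = 40  # true matches that imperfect linkage failed to find

perfect = petersen_estimate(n1, n2, true_links)
with_errors = petersen_estimate(n1, n2, true_links - missed_links)

print(perfect)      # 2000.0
print(with_errors)  # larger than 2000: missed links inflate the estimate
```

In this toy setting, every missed link lowers m and therefore pushes the estimate upward, which is why an uncorrected estimate under imperfect linkage can overstate the population size.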
New data sources, namely big data and the Internet, have become an important issue in statistics and for official statistics in particular. However, before these sources can be used for statistics, it is necessary to conduct a thorough analysis of sources of nonrepresentativeness.
In this article, we focus on detecting correlates of the selection mechanism that underlies Internet data sources for the secondary real estate market in Poland and results in representation errors (frame and selection errors). In order to identify characteristics of properties offered online, we link data collected from the two largest advertisement services in Poland and the Register of Real Estate Prices and Values, which covers all transactions made in Poland. Quarterly data for 2016 were linked at a domain level defined by local administrative units (LAU1), the urban/rural distinction and usable floor area (UFA), categorized into four groups. To identify correlates of representation error, we used a generalized additive mixed model based on almost 5,500 domains including quarters.
Results indicate that properties not advertised online differ significantly from those shown on the Internet in terms of UFA and location. A non-linear relationship with the average price per m² can be observed, which diminishes after accounting for LAU1 units.
Statistical matching is the term for the integration of two or more data files that share a partially overlapping set of variables. Its aim is to obtain joint information on variables collected in different surveys based on different observation units. This naturally leads to an identification problem, since there is no observation that contains information on all variables of interest.
We develop the first statistical matching micro approach reflecting the natural uncertainty of statistical matching arising from the identification problem in the context of categorical data. A complete synthetic file is obtained by imprecise imputation, replacing missing entries by sets of suitable values. Altogether, we discuss three imprecise imputation strategies and propose ideas for potential refinements.
Additionally, we show how the results of imprecise imputation can be embedded into the theory of finite random sets, providing tight lower and upper bounds for probability statements. The results are based on a newly developed simulation design, which is customised to the specific requirements for assessing the quality of a statistical matching procedure for categorical data. They corroborate that the narrowness of these bounds is practically relevant and that these bounds almost always cover the true parameters.
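As an illustration of how set-valued imputations lead to probability bounds, the sketch below treats each imputed entry as a non-empty set of admissible categories and computes a lower and an upper bound for the relative frequency of an event, in the spirit of finite random sets. It is a simplified sketch under stated assumptions, not one of the three imputation strategies of the article; the function name and the toy records are hypothetical.

```python
def probability_bounds(imputed_sets, event):
    """Lower and upper bounds for the relative frequency of `event`.

    imputed_sets: one non-empty set of admissible categories per record
    event:        the set of categories whose probability is of interest

    A record counts towards the lower bound only if all of its admissible
    values lie in the event, and towards the upper bound if at least one does.
    """
    event = set(event)
    sets = [set(s) for s in imputed_sets]
    if not sets or any(len(s) == 0 for s in sets):
        raise ValueError("need at least one record and no empty sets")
    n = len(sets)
    lower = sum(1 for s in sets if s <= event) / n
    upper = sum(1 for s in sets if s & event) / n
    return lower, upper


# Hypothetical imprecisely imputed entries for a categorical variable.
records = [{"a"}, {"a", "b"}, {"b"}, {"a", "b", "c"}]
lower, upper = probability_bounds(records, {"a"})
print(lower, upper)  # 0.25 0.75
```

The width of the interval reflects how much the imputed sets leave undetermined: precise imputations (singletons) would make the two bounds coincide.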
Elena Dalla Chiara, Martina Menon and Federico Perali
This study generates an integrated database to measure living standards in Italy using propensity score matching. We follow the recommendations of the Commission on the Measurement of Economic Performance and Social Progress, which proposes that income, consumption of market goods and nonmarket activities, and wealth, rather than production, should be evaluated jointly in order to appropriately measure material welfare. Our integrated database is similar in design to the one built for the United States by the Levy Economics Institute to measure the multiple dimensions of well-being. In the United States, as is the case for Italy and most European countries, the state does not maintain a unified database to measure household economic well-being, and data from income and employment surveys and from other surveys on wealth and the use of time have to be statistically matched. The measure of well-being is therefore the result of a multidimensional evaluation process no longer associated with a single indicator, as is usually the case when measuring gross domestic product. The estimation of individual and social welfare, multidimensional poverty and inequality requires an integrated living standard database where information about consumption, income, time use and subjective well-being is jointly available. With this objective in mind, we combine information available in four different surveys: the European Union Statistics on Income and Living Conditions Survey, the Household Budget Survey, the Time Use Survey, and the Household Conditions and Social Capital Survey. We perform three different statistical matching procedures to link the relevant dimensions of living standards contained in each survey and report, at a high level of detail, both the statistical and economic tests carried out to evaluate the quality of the procedure.
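The matching step of a propensity score approach can be sketched as follows. The example assumes that propensity scores have already been estimated for every record in a recipient survey and a donor survey, and it pairs each recipient with the donor whose score is closest (nearest-neighbour matching with replacement). This is only one common variant and is not claimed to be the procedure used in the study; the function name, record identifiers and scores are hypothetical.

```python
def nearest_donor_match(recipient_scores, donor_scores):
    """Match each recipient to the donor with the closest propensity score.

    Both arguments map a record identifier to an estimated propensity score.
    Donors may be used more than once (matching with replacement).
    """
    if not donor_scores:
        raise ValueError("at least one donor record is required")
    matches = {}
    for rid, rscore in recipient_scores.items():
        matches[rid] = min(
            donor_scores,
            key=lambda did: abs(donor_scores[did] - rscore),
        )
    return matches


# Hypothetical scores, e.g. from a logistic regression on shared variables.
recipients = {"r1": 0.20, "r2": 0.71}
donors = {"d1": 0.25, "d2": 0.68, "d3": 0.90}

matches = nearest_donor_match(recipients, donors)
print(matches)  # {'r1': 'd1', 'r2': 'd2'}
```

Once the pairs are formed, the variables observed only in the donor survey can be attached to the matched recipient records to build the integrated file.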
The research question addressed here is whether the semantic value implicit in environmental terms in an activity description text string can be translated into economic value for firms in the construction sector. We address this question using a relatively new applied statistical method called Latent Dirichlet Allocation (LDA). We first identify a satellite register of firms in the construction sector that engage in some form of environmental work. From these we construct a vocabulary of meaningful words. Then, for each firm on this satellite register in turn, we take its activity description text string and process this string with LDA. This softly classifies the descriptions on the satellite register into just seven environmentally relevant topics. With this seven-topic classification we proceed to extract a statistically meaningful weight of evidence associated with environmental terms in each activity description. This weight is applied to the associated firm’s overall output value recorded on our national Business Register to arrive at a supply side estimate of the firm’s Environmental Goods and Services Sector (EGSS) value. On this basis we find that the EGSS estimate for construction in Ireland in 2013 is about EUR 229 million. We contrast this estimate with estimates from other countries obtained by demand side methods and show that it compares satisfactorily, thereby enhancing its credibility. Our method also has the advantage that it provides a breakdown of EGSS output by EU environmental classifications (CEPA/CReMA), as these align closely with the discovered topics. We stress that the success of this application of LDA relies greatly on our small vocabulary, which is constructed directly from the satellite register.
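The final step, applying a weight to a firm's output value, amounts to simple arithmetic and can be sketched as below. The sketch assumes the weight has already been derived and lies between 0 and 1; how the weight is obtained from the LDA topics is not reproduced here. The function name, firm identifiers, output values and weights are all hypothetical.

```python
def egss_value(output_value: float, environmental_weight: float) -> float:
    """Share of a firm's output attributed to environmental activity.

    output_value:         the firm's overall output value
    environmental_weight: assumed weight in [0, 1] derived from its
                          activity description (derivation not shown)
    """
    if not 0.0 <= environmental_weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return output_value * environmental_weight


# Hypothetical firms: (identifier, output value in EUR, weight).
firms = [
    ("firm_a", 2_000_000.0, 0.10),
    ("firm_b", 500_000.0, 0.40),
]

# Summing the firm-level values gives a supply side sector estimate.
total = sum(egss_value(output, weight) for _, output, weight in firms)
print(round(total))  # 400000
```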
The Current Population Survey (CPS) is the source of official US labor force statistics. The wording of the CPS employment questions may not always cue respondents to include informal work in their responses, especially when providing proxy reports about other household members. In a survey experiment conducted using a sample of Amazon Mechanical Turk respondents, additional probing identified a substantial amount of informal work activity not captured by the CPS employment questions, both among those with no employment and among those categorized as employed based on answers to the CPS questions. Among respondents providing a proxy report for another household member, the share identifying additional work was systematically greater among those receiving a detailed probe that offered examples of types of informal work than among those receiving a simpler global probe. When respondents answered for themselves, similar differences between the effects of the detailed and the global probe were observed only among those who had already reported multiple jobs. The findings suggest that additional probing could improve estimates of employment and multiple job holding in the CPS and other household surveys, but that the nature of the probe is likely to be important.
Joseph W. Sakshaug, Arkadiusz Wiśniowski, Diego Andres Perez Ruiz and Annelies G. Blom
Carefully designed probability-based sample surveys can be prohibitively expensive to conduct. As such, many survey organizations have shifted away from using expensive probability samples in favor of less expensive, but possibly less accurate, nonprobability web samples. However, the lower costs and abundant availability of nonprobability samples make them a potentially useful supplement to traditional probability-based samples. We examine this notion by proposing a method of supplementing small probability samples with nonprobability samples using Bayesian inference. We consider two semi-conjugate informative prior distributions for linear regression coefficients based on nonprobability samples, the first accounting for the distance between maximum likelihood coefficients derived from parallel probability and nonprobability samples, and the second depending on the variability and size of the nonprobability sample. The method is evaluated in comparison with a reference prior through simulations and a real-data application involving multiple probability and nonprobability surveys fielded simultaneously using the same questionnaire. We show that the method reduces the variance and mean-squared error (MSE) of coefficient estimates and model-based predictions relative to probability-only samples. Using actual and assumed cost data, we also show that the method can yield substantial cost savings (up to 55%) for a fixed MSE.
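The basic idea of borrowing information from a nonprobability sample through an informative prior can be illustrated with a deliberately simplified case: a single coefficient, a normal prior and a normal likelihood with known variance. This is not the semi-conjugate regression setup of the article. In the usage example, setting the prior variance to the squared distance between the two estimates is only one hypothetical way to mimic a distance-based prior; all names and numbers are assumptions.

```python
def posterior_normal(prior_mean, prior_var, estimate, estimate_var):
    """Normal-normal update for one coefficient with known variances.

    prior_mean, prior_var:  prior centred on the nonprobability estimate
    estimate, estimate_var: estimate and sampling variance from the
                            (small) probability sample
    Returns the posterior mean and posterior variance.
    """
    if prior_var <= 0 or estimate_var <= 0:
        raise ValueError("variances must be positive")
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / estimate_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * estimate)
    return post_mean, post_var


# Hypothetical coefficient estimates from parallel samples.
b_nonprob = 1.0   # estimate from the nonprobability sample
b_prob = 2.0      # estimate from the probability sample
var_prob = 1.0    # sampling variance of the probability estimate

# Assumed distance-based prior variance: a larger disagreement between
# the two samples gives a weaker (more diffuse) prior.
prior_var = (b_nonprob - b_prob) ** 2

post_mean, post_var = posterior_normal(b_nonprob, prior_var, b_prob, var_prob)
print(post_mean, post_var)  # 1.5 0.5
```

In this toy case the posterior variance (0.5) is below the variance of the probability-only estimate (1.0), at the price of pulling the estimate towards the nonprobability value.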
Li-Chun Zhang, Ingvild Johansen and Ragnhild Nygaard
There is generally a need to deal with quality change and new goods in the consumer price index due to the underlying dynamic item universe. Traditionally, axiomatic tests are defined for a fixed universe. We propose five tests explicitly formulated for a dynamic item universe, and motivate them from the perspectives of both a cost-of-goods index and a cost-of-living index. None of the indices that are currently available for making use of scanner data satisfies all the tests at the same time. The set of tests provides a rigorous diagnostic for whether an index is completely appropriate in a dynamic item universe, and also points towards possible remedies. We thus outline a large index family that potentially can satisfy all the tests.
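To illustrate why a dynamic item universe is a problem for conventional formulas, the sketch below computes a matched-item Jevons index, that is, the geometric mean of price relatives over items sold in both periods. It is a standard textbook index used here only as an example; it is not one of the five proposed tests or the index family outlined above. Item identifiers and prices are hypothetical.

```python
import math


def matched_jevons(prices_base, prices_current):
    """Geometric mean of price relatives over items present in both periods.

    Items that disappear after the base period or appear only in the
    current period are ignored, which is exactly the limitation that a
    dynamic item universe exposes.
    """
    matched = sorted(set(prices_base) & set(prices_current))
    if not matched:
        raise ValueError("no items are present in both periods")
    log_sum = sum(
        math.log(prices_current[item] / prices_base[item]) for item in matched
    )
    return math.exp(log_sum / len(matched))


# Hypothetical scanner data: item "c" disappears and item "d" is new.
base = {"a": 1.0, "b": 2.0, "c": 5.0}
current = {"a": 2.0, "b": 4.0, "d": 9.0}

index = matched_jevons(base, current)
print(round(index, 6))  # 2.0
```

The result depends only on items "a" and "b"; the price of the new item "d" and the exit of item "c" have no influence on it.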
Eleni Filippidou, Maria Koutsouba, Vassiliki Lalioti and Vassilis Lantzos
The research field of this paper is the area of Thrace, a large geopolitical-cultural unit that was divided, for political reasons, into three subareas distributed among three different countries: Bulgaria, Turkey and Greece. A dance event that used to take place before the border demarcation but is still performed in Greek and Turkish Thrace is that of “K’na”, a wedding dance event danced by the people of both border areas, despite the changes in their magical-religious beliefs and the changes brought by socio-economic and cultural development. In particular, the aim of this paper is the study of the “construction” of the national identity of inhabitants of both Greek and Turkish Thrace, as this is manifested through the dance practice within the wedding event of “K’na”, through the lens of sociocybernetics. Data was gathered through the ethnographic method as this is applied to the study of dance, while its interpretation was based on sociocybernetics according to Burke’s identity control theory. The data analysis shows that the “K’na” dance in Greek and Turkish Thrace constructs and reconstructs the national identity of the people who use it, as a response to the messages they receive via communication with “the national others”. In conclusion, the “construction” of the identity results from a continuous procedure of self-regulation and self-control through a cybernetic sequence of steps.