Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 183-197(2003)
Observed sample size indexes and estimated population size indexes are often used to assess the disclosure risk of microdata sampled from a population. The parametric method with superpopulation models, such as the Poisson-gamma model and the Pitman model, is currently the main approach to estimating population size indexes from sample size indexes. This article reviews several previously proposed nonparametric estimation methods and proposes a nonparametric maximum likelihood estimation method. The proposed method has two problems: the estimation result is unstable, and enormous computing time is required. To resolve the first problem, we impose simple restrictions on the population size indexes, for example that they be monotonically decreasing and convex; these restrictions enable stable estimation. To resolve the second, we approximate the likelihood function under Bernoulli sampling by a product of Poisson probability functions and propose convenient computational methods that avoid exhaustive computation. The target of estimation is almost always the indexes of the smaller sizes; thus, estimation for an actual population may be feasible using sample size indexes of the smaller sizes only.
Key words: Microdata, key variable, population size indexes, sample size indexes.
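The Poisson approximation mentioned in the abstract can be sketched as follows. Under Bernoulli sampling with inclusion probability pi, the expected sample size index E[f_i] is a binomial mixture of the population size indexes F_j; treating each f_i as an independent Poisson variable with that mean gives a tractable likelihood. This is a minimal sketch, not the paper's estimator; the function names and the plain-dictionary representation of F are illustrative assumptions.

```python
import math

def expected_sample_indexes(F, pi, max_i):
    """E[f_i] = sum_j F_j * C(j, i) * pi^i * (1-pi)^(j-i) under
    Bernoulli sampling. F maps population size j to the index F_j."""
    exp_f = []
    for i in range(1, max_i + 1):
        e = sum(Fj * math.comb(j, i) * pi**i * (1 - pi)**(j - i)
                for j, Fj in F.items() if j >= i)
        exp_f.append(e)
    return exp_f

def poisson_loglik(f_obs, exp_f):
    """Approximate log-likelihood: each observed index f_i is treated
    as Poisson with mean E[f_i]."""
    return sum(fi * math.log(mu) - mu - math.lgamma(fi + 1)
               for fi, mu in zip(f_obs, exp_f) if mu > 0)
```

Maximizing this log-likelihood over F subject to monotonicity and convexity constraints would correspond to the restricted estimation described above.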
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 199-222(2003)
The labor force survey in Japan has a unique characteristic in its sampling scheme, called rotation sampling. This paper proposes a time-series model for analyzing the results of this survey that takes the sampling scheme into consideration. In the model, individual unemployment is expressed by a probit model whose latent variable is related to a trend, a group effect and an individual effect. Because of the model's non-linearity, the maximum likelihood method using the Kalman filter is not applicable for estimating the model's parameters, so in this paper the estimation is carried out by the Gibbs sampling technique. Unfortunately, the published survey results do not include information on the rotation groups and, for this reason, an experiment to validate the model must rely on simulated data. Finally, a couple of important applications and extensions of the model are given to conclude the paper.
Key words: Labor force survey, rotation sampling, probit model, Gibbs sampling, state space representation.
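The Gibbs sampling approach for a probit model can be illustrated with the Albert-Chib data-augmentation scheme in a stripped-down single-covariate setting; this is a simplified stand-in for the paper's trend/group-effect/individual-effect model, and all names below are illustrative.

```python
import math
import random

def gibbs_probit(x, y, n_iter=500, seed=1):
    """Albert-Chib Gibbs sampler for a single-covariate probit model.
    Latent z_i | beta ~ N(beta * x_i, 1), truncated to (0, inf) when
    y_i = 1 and (-inf, 0) when y_i = 0; then, under a flat prior,
    beta | z ~ N(sum(x*z)/sum(x^2), 1/sum(x^2))."""
    rng = random.Random(seed)
    sxx = sum(xi * xi for xi in x)
    beta, draws = 0.0, []
    for _ in range(n_iter):
        # sample latent utilities by rejection from the untruncated normal
        z = []
        for xi, yi in zip(x, y):
            while True:
                zi = rng.gauss(beta * xi, 1.0)
                if (zi > 0) == (yi == 1):
                    z.append(zi)
                    break
        mean = sum(xi * zi for xi, zi in zip(x, z)) / sxx
        beta = rng.gauss(mean, 1.0 / math.sqrt(sxx))
        draws.append(beta)
    return draws
```

In the paper's setting the latent variable would additionally carry trend, group and individual components, but the augmentation step is the same in spirit.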
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 223-239(2003)
There is always a risk of identification disclosure when we release microdata. Some individuals or companies might be identified through sample-unique observations in the dataset, and their privacy would thereby be violated. Thus, the identification disclosure risk needs to be evaluated before microdata such as official statistics are released. Microdata usually contain many sample-unique observations, but most of them are not considered to be population unique. This article evaluates the probability of identification disclosure for an individual using the posterior probability of population uniqueness, and discusses how it should be used as a criterion of disclosure risk.
Key words: Identification disclosure risk, Markov chain Monte Carlo, microdata, population uniqueness, posterior probability.
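As a toy illustration of the idea (not the paper's MCMC-based estimator), suppose the population size indexes F_j were known exactly. Under Bernoulli sampling with inclusion probability pi, Bayes' rule gives the probability that a cell observed to be sample unique is in fact population unique; the function name is an assumption.

```python
def post_prob_pop_unique(F, pi):
    """P(cell is population unique | cell is sample unique).
    A cell of population size j yields a sample count of exactly 1
    with probability C(j,1) * pi * (1-pi)^(j-1), so by Bayes' rule:
    F_1 * pi / sum_j F_j * j * pi * (1-pi)^(j-1)."""
    denom = sum(Fj * j * pi * (1 - pi)**(j - 1) for j, Fj in F.items())
    return F.get(1, 0) * pi / denom
```

In practice F is unknown, which is why the paper works with the posterior probability of population uniqueness under a population model instead.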
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 241-260(2003)
We introduce the theory of statistical disclosure control and survey the current status and perspectives of theoretical research on statistical disclosure control. In addition to an overview of international research trends, we give some detailed treatments of the works of a group of Japanese researchers including the author.
Key words: Key variable, population unique, global recoding, local recoding.
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 261-295(2003)
A classical statistical problem is the study of a population with many categories, where the main concern is not the probability of each category but their behavior as a whole as the sample size increases. Typical examples are the ecological abundance of species, vocabularies in statistical linguistics, and patterns in archaeological artifacts. One aspect of statistical disclosure control (SDC), the estimation of the number of individuals who are unique in both the population and the sample, is related to this problem. This review discusses the problem of estimating, from the observed sample, the number of categories that have a unique element in both the sample and the population. The motivation for solving the problem in SDC is summarized in the opening sections. The problem is shown to be difficult because it has the character of an inverse problem. It is related to some classical problems in statistical abundance models, and the main results in this field are surveyed: Zipf's law, the central limit theorem of Karlin, and the Large Number of Rare Events theory of the Tbilisi school. New approaches are discussed in other papers of this special issue, in particular the use of infinitely divisible probability generating functions. Other approaches, an application of the Ewens-Pitman family of random partitions and a semi-parametric inference method, are related to Poisson mixtures.
Key words: Abundance models, infinitely divisible probability generating functions, large number of rare events, sample and population uniqueness, statistical disclosure control, Zipf's law.
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 297-319(2003)
Microdata identify each record's position in the corresponding contingency table; hence, anonymizing microdata amounts to coarsening the resolution of a contingency table. Because a finer table leaves fewer individuals in each cell, the frequencies of frequencies, that is, the number of cells containing a given number of individuals, play an important role in evaluating disclosure risk. However, estimating the frequencies of frequencies is practically impossible without an assumption about the population. Following the standard theory of finite population analysis, we employ superpopulation models as this assumption. Zipf's law then empirically validates the use of a Poisson distribution mixed by a heavy-tailed distribution. A basic population model assumes that the number of individuals in each cell independently follows an identical mixed Poisson distribution. The law of small numbers here refers to a limiting argument in which the number of cells tends to infinity while (the expectation of) the total number of individuals is fixed. A proper model arises by applying the law of small numbers to a basic model built from infinitely divisible mixed Poisson distributions. Because there are many kinds of infinitely divisible mixed Poisson distributions, useful new models may be derived with the law, and developing new models to describe various populations is of great importance.
Key words: Population unique, infinite divisibility, compound Poisson, mixed Poisson, random partitioning of the natural numbers.
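A minimal simulation of the basic population model described above, assuming a gamma mixing distribution (so cell counts are marginally negative binomial, a standard special case rather than the paper's general construction); all names are illustrative.

```python
import math
import random
from collections import Counter

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for small means."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

def freq_of_freq(n_cells, shape, scale, seed=0):
    """Each cell's count is Poisson with a gamma-distributed rate
    (a mixed Poisson); returns {j: number of cells with j individuals},
    i.e. the frequencies of frequencies, omitting empty cells."""
    rng = random.Random(seed)
    counts = (poisson_draw(rng, rng.gammavariate(shape, scale))
              for _ in range(n_cells))
    return Counter(c for c in counts if c > 0)
```

Letting n_cells grow while the expected total stays fixed mimics the law-of-small-numbers limit discussed in the abstract.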
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 321-335(2003)
The Post Randomization Method (PRAM), proposed by Kooiman et al. (1997), is a statistical disclosure control method for anonymized sample data. With this method, a data provider can reduce the risk of individual identification and information disclosure by perturbing the data of each record according to a pre-determined probability structure. This paper discusses possible problems in practical use and the method's influence on the results of statistical analysis, and proposes a software environment for PRAM.
Key words: Microdata, Markov chain, local recoding, EM algorithm.
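A minimal sketch of the PRAM mechanism itself (the transition matrix and category names below are hypothetical): each record's value in category i is replaced by category j with probability P[i][j], where each row of P sums to one. Since the expected perturbed frequency vector equals the transpose of P times the true one, frequency tables can be corrected afterwards by applying the inverse of that transpose when it exists.

```python
import random

def pram(values, categories, P, seed=0):
    """Perturb each categorical value: category i is replaced by
    category j with probability P[i][j] (rows of P sum to 1)."""
    rng = random.Random(seed)
    idx = {c: i for i, c in enumerate(categories)}
    out = []
    for v in values:
        u, cum = rng.random(), 0.0
        for j, pij in enumerate(P[idx[v]]):
            cum += pij
            if u < cum:
                out.append(categories[j])
                break
        else:
            out.append(categories[-1])  # guard against rounding error
    return out
```

With the identity matrix, PRAM leaves the data unchanged; the closer P is to the identity, the lower the perturbation and the higher the residual disclosure risk.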
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 337-350(2003)
We introduce the theory of statistical disclosure control (SDC) for tabular data and survey the current status and perspectives of theoretical research. The second section presents the idea of cell sensitivity and its measures; both the n-k percent rule and the prior-posterior rule are influential dominance rules. The third section outlines control methods for tabular data, which can be roughly classified into two groups, cell suppression and perturbation, and mainly treats cell suppression. Recently these methods have been actively researched and developed by groups in Europe and America; the fourth section gives a brief sketch of their activities and introduces SDC software.
Key words: Statistical disclosure control, tabular data, SDC-project, CASC-project, SDC software.
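The n-k percent dominance rule mentioned above can be stated compactly: a cell is sensitive when its n largest contributions exceed k percent of the cell total, since dominant contributors could then estimate the others' values too precisely. A minimal check (function and parameter names are illustrative):

```python
def nk_sensitive(contributions, n, k):
    """(n, k) percent dominance rule: the cell is sensitive if the sum
    of the n largest contributions exceeds k percent of the cell total."""
    top = sum(sorted(contributions, reverse=True)[:n])
    return top > (k / 100.0) * sum(contributions)
```

Sensitive cells identified this way are the candidates for the cell suppression methods treated in the third section.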
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 351-372(2003)
Associated with the disclosure of a microdata set is the problem of estimating the number of unique individuals in a population. As a stochastic model of random partitions of a finite population related to this problem, the Ewens sampling formula (Ewens (1972)) is well known; it is exchangeable and invariant under size-biased permutation. As a random partition having these two properties, Pitman (1995, 1996c) derived the Pitman sampling formula. We introduce this formula by means of a simple urn model and review its properties and related topics. For the disclosure of microdata sets and random partitions, see Takemura (2003) and Sibuya (2003) in this special issue.
Key words: Ewens sampling formula, exchangeability, Mittag-Leffler distribution, Pitman sampling formula, Poisson-Dirichlet distribution.
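The urn model behind the Pitman sampling formula admits a short simulation. In the standard (alpha, theta) urn, the (m+1)-th item joins an existing class of size n_i with probability (n_i - alpha)/(m + theta) and founds a new class with probability (theta + k*alpha)/(m + theta), where k is the current number of classes; alpha = 0 recovers the Ewens case. The sketch below uses this standard scheme, not necessarily the paper's notation.

```python
import random

def pitman_urn(n, alpha, theta, seed=0):
    """Simulate class sizes of a Pitman (alpha, theta) random partition
    of n items via the two-parameter urn scheme."""
    rng = random.Random(seed)
    sizes = []
    for m in range(n):
        if m == 0:
            sizes.append(1)
            continue
        k = len(sizes)
        u = rng.random() * (m + theta)
        if u < theta + k * alpha:
            sizes.append(1)           # found a new class
        else:
            u -= theta + k * alpha
            for i, ni in enumerate(sizes):
                u -= ni - alpha       # join class i w.p. prop. to n_i - alpha
                if u < 0:
                    sizes[i] += 1
                    break
            else:
                sizes[-1] += 1        # guard against rounding error
    return sizes
```

The resulting list of class sizes is a draw from the partition whose distribution the Pitman sampling formula describes.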
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 373-388(2003)
This paper surveys the use of official microdata for economic analyses in Japan. It covers 69 academic papers based on official microdata published before 1998, when the Grant-in-Aid for Scientific Research on Priority Areas (Ministry of Education, Science, Sports and Culture) was launched to explore the potential use of official microdata in Japan. According to the survey, the major fields of research using microdata were labor analysis and household consumption and savings. The research is classified into four groups in terms of the statistical methods or models applied: on-demand tabulation, linear models, nonlinear models, and identification of distribution patterns, according to why the microdata were needed in each study. Although researchers at certain national universities appear more often as users of official microdata, they are by no means the exclusive users.
Key words: Microdata, official statistics, econometrics, empirical research.
Proceedings of the Institute of Statistical Mathematics Vol. 51, No. 2, 389-406(2003)
Discriminant analysis aims at classifying an object into one of several given classes based on information from a set of characteristics. Among the many available methods, Fisher's linear discriminant analysis, the most popular approach, has contributed greatly to the development of science and society. With the advent of powerful computers and the information age, however, discriminant problems have exploded in both sample size and data complexity, and researchers have begun to tackle nonlinear discriminant analysis in a more realistic fashion. It is well known that Fisher's linear discriminant analysis is equivalent to multi-response linear regression using optimal scoring. We propose a nonlinear version of Fisher's discriminant analysis, Kernel Flexible Discriminant Analysis (KFDA), obtained by replacing the linear regression function with a nonlinear kernel function. Observing that the least-squares approach tends to yield poor results, we adopt a smoothing approach with a view to the predictive performance of the discriminant function. To determine the best model among the candidates, we investigate the likelihood of KFDA models and propose model selection criteria from information-theoretic and Bayesian points of view. Real data analysis and Monte Carlo experiments indicate that the proposed KFDA approach performs well in practical situations.
Key words: Fisher's linear discriminant analysis, kernel method, optimal scoring, smoothing, model selection.
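A simplified stand-in for the KFDA idea: kernel ridge regression on plus/minus-one class codes, i.e. a penalized least-squares fit that plays the role of the smoothed regression step (the paper's optimal-scoring formulation and model selection criteria are not reproduced; kernel choice, penalty, and function names are assumptions).

```python
import math

def rbf_kernel(X, Z, gamma):
    """Gaussian (RBF) kernel matrix between row lists X and Z."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))
             for z in Z] for x in X]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c]
                              for c in range(r + 1, n))) / M[r][r]
    return x

def kernel_fda_fit(X, y, gamma, lam):
    """Fit dual coefficients c of (K + lam * I) c = y; the ridge penalty
    lam stands in for the smoothing that curbs the poor plain
    least-squares fit noted in the abstract."""
    K = rbf_kernel(X, X, gamma)
    for i in range(len(K)):
        K[i][i] += lam
    return solve(K, y)

def kernel_fda_predict(X_train, c, x_new, gamma):
    """Discriminant score of x_new; classify by its sign."""
    k = rbf_kernel([x_new], X_train, gamma)[0]
    return sum(ci * ki for ci, ki in zip(c, k))
```

Choosing gamma and lam is exactly where model selection criteria of the kind proposed in the paper would enter.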