Proceedings of the Institute of Statistical Mathematics Vol.72, No.2, 153-173 (2024)

Total Quality Management in Official Statistics

Shigeru Kawasaki
(Japan Statistical Association, Inc./Data Science and AI Innovation Research Center, Shiga University)
Sei Ueda
(Statistics Bureau, Ministry of Internal Affairs and Communications)

This paper outlines the efforts in total quality management of official statistics by international organizations and the Government of Japan. Quality management in this field has been developed stepwise since the 1990s. Among international efforts, the Fundamental Principles of Official Statistics established by the United Nations Statistical Commission and the Special Data Dissemination Standard of the International Monetary Fund have led to substantial improvements in the quality of official statistics worldwide. In Japan, the Statistics Law, which was fully revised in 2007, has played a key role in the broad application of total quality management in official statistics. Recently, the government encountered two major cases of inappropriate processing in major statistical surveys. As a result, total quality management is now extensively applied to all official statistics. We conclude with a discussion of the necessary elements for promoting total quality management of official statistics.

Key words: Official statistics, Fundamental Principles of Official Statistics, Statistics Law, total quality management, United Nations Statistical Commission.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.2, 175-193 (2024)

Development of the Statistical Business Register in Japan
—“Establishment Frame Database” as an Important Information Base for Society—

Masao Takahashi
(Faculty of Business and Informatics, Nagano University)

This report provides an overview and possible future directions for developing the Establishment Frame Database, stipulated by the current Statistics Act, which positions official statistics as an information base for society. Databases such as the Establishment Frame Database in Japan are internationally referred to as statistical business registers, which are databases of economic units such as enterprises and establishments and are used for statistical purposes; they are typically maintained and managed by national statistical offices in various countries. The Establishment Frame Database is designed to utilize information sources such as the Economic Census, various statistical surveys, and administrative records to provide population information about establishments and enterprises for various statistical surveys and to create statistics related to establishments. In this report, after a brief history of the Establishment Frame Database is provided, an overview of the database is presented, followed by discussions of future possible directions, including further improvement of the coverage of the database and international cooperation.

Key words: Economic Census, administrative records information, sampling frame, register-based statistics, business demography.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.2, 195-215 (2024)

Econometric Analysis of Consumption Patterns and Working Hours in Japan: Based on National Survey of Family Income, Consumption and Wealth

Shinsuke Ito
(Faculty of Economics, Chuo University)
Takahisa Dejima
(Faculty of Economics, Sophia University)
Mariko Murata
(Statistical Information Institute for Consulting and Analysis)

The share of dual-earner households is increasing in developed countries, including Japan. With the accompanying changes in income composition and time allocation, examining their effects on consumption patterns and expenditures is worthwhile because doing so will enable the study of the effects of various economic policies. This paper examines the relationship between income composition and consumption patterns using individual data from the National Survey of Family Income, Consumption and Wealth, to which weekly working hours have been newly added as a survey item.
Specifically, after setting up a consumption function for the detailed expense items of household consumption that have been estimated in previous studies, we newly introduced working hours as an explanatory variable. To account more accurately for the effect of permanent income on consumption, we also introduced as an explanatory variable the difference between expected and realized wages, based on a wage function estimated from the Basic Survey on Wage Structure.
Consistent with previous studies, we observed that differences in the way couples work affect their spending on household consumables, lodging, and other items. In addition, we confirmed that working hours have an effect, albeit partial, on consumption expenditures. These results can be explained by the tendency among dual-earner households to increase time-saving consumption as spouses' working hours increase.

Key words: Household consumption, consumption function, dual-earner households, working hours.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.2, 217-231 (2024)

Study on Synthetic Data Generation Techniques for Official Statistics on Establishments and Enterprises: The Economic Census as an Example

Shinsuke Ito
(Faculty of Economics, Chuo University)
Shuji Yokomizo
(Statistical Research and Training Institute, Ministry of Internal Affairs and Communications)

In Japan, anonymous data are currently not available for statistical surveys of business establishments and enterprises, and producing publicly available microdata for such surveys is also difficult. A method for producing synthetic data would therefore meet the need for test data. In this paper, we quantitatively evaluate several techniques for generating synthetic data using individual data from the Economic Census of Activity.
We assess the usefulness and confidentiality of synthetic data generated by three approaches: the maximum distance to average vector (MDAV) method of microaggregation, which is a perturbation method; the classification and regression tree (CART); and the conditional tabular GAN (CTGAN), a deep learning model. The results confirmed that distributional properties such as summary statistics and correlation coefficients were reproduced when synthetic data were generated using CART. In addition, compared with the MDAV method, CART can potentially increase confidentiality while maintaining usefulness. For CTGAN, the degree of confidentiality was higher than for CART; however, the decrease in usefulness was also relatively greater.

Key words: Synthetic data, microaggregation, CART, CTGAN, Economic Census.
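
As an illustration of the microaggregation step named in the abstract above, the following is a minimal sketch of the generic textbook MDAV heuristic: records are grouped into clusters of at least k, and each record is replaced by its cluster mean. It is not the authors' implementation. The function names (mdav_groups, microaggregate), the toy lognormal data, and the choice k = 3 are all assumptions made for this example.

```python
import numpy as np


def mdav_groups(X, k):
    """Partition the rows of X into groups of at least k records (MDAV heuristic)."""
    X = np.asarray(X, dtype=float)
    remaining = np.arange(len(X))
    groups = []

    def farthest(point, idx):
        # Row index of the remaining record farthest from `point`.
        return idx[np.argmax(np.linalg.norm(X[idx] - point, axis=1))]

    def split_off(center, idx):
        # Take the k remaining records nearest to record `center`.
        dist = np.linalg.norm(X[idx] - X[center], axis=1)
        members = idx[np.argsort(dist, kind="stable")[:k]]
        return members, np.setdiff1d(idx, members)

    # Main loop: build two groups per pass, around the record r farthest
    # from the centroid and the record s farthest from r.
    while len(remaining) >= 3 * k:
        r = farthest(X[remaining].mean(axis=0), remaining)
        s = farthest(X[r], remaining)
        for center in (r, s):
            members, remaining = split_off(center, remaining)
            groups.append(members)
    # Between 2k and 3k-1 records left: one more group of k, then the rest.
    if len(remaining) >= 2 * k:
        r = farthest(X[remaining].mean(axis=0), remaining)
        members, remaining = split_off(r, remaining)
        groups.append(members)
    if len(remaining) > 0:
        groups.append(remaining)
    return groups


def microaggregate(X, k):
    """Replace every record by the mean of its MDAV group."""
    X = np.asarray(X, dtype=float)
    groups = mdav_groups(X, k)
    masked = X.copy()
    for g in groups:
        masked[g] = X[g].mean(axis=0)
    return masked, groups


# Hypothetical toy data standing in for two skewed business variables.
rng = np.random.default_rng(0)
toy = rng.lognormal(mean=3.0, sigma=1.0, size=(23, 2))
masked, groups = microaggregate(toy, k=3)
print([len(g) for g in groups])
```

In practice the variables would usually be standardized before distances are computed, so that no single variable dominates the grouping. Replacing records by group means preserves the column means exactly but shrinks the variances.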


Proceedings of the Institute of Statistical Mathematics Vol.72, No.2, 233-244 (2024)

Missing Value Imputation Using a Statistical Matching Method Based on a Multinomial Logit Model

Isao Takabe
(Faculty of Data Science, Rissho University)

Statistical matching is a technique for combining different data to construct useful data. Statistical matching enables the creation of useful data without additional research or data collection and has recently been used in various fields. In this study, we introduce the method of statistical matching based on the multinomial logit model proposed in Takabe and Yamashita (2018, 2020, 2021) and also discuss the use of matching probabilities obtained as a byproduct for missing value imputation.

Key words: Statistical matching, multinomial logit model, weighted distance function, missing value imputation.
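
To make the idea of matching probabilities used for imputation concrete, the sketch below computes, for each recipient record, a softmax over negative weighted squared distances to the donor records, and imputes the missing value as the probability-weighted mean of the donor values. This is a generic illustration only. It is not the estimator of Takabe and Yamashita; in a multinomial logit formulation the weights would be estimated from data, whereas here they are fixed by hand. All names and numbers are hypothetical.

```python
import numpy as np


def matching_probabilities(recipients, donors, weights):
    """Row i holds the probabilities that recipient i matches each donor."""
    diff = recipients[:, None, :] - donors[None, :, :]
    # Weighted squared distance between every recipient and every donor.
    dist = np.einsum("ijk,k->ij", diff ** 2, weights)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def impute(recipients, donors, donor_y, weights):
    """Impute the missing variable as a probability-weighted donor mean."""
    p = matching_probabilities(recipients, donors, weights)
    return p @ donor_y, p


# Hypothetical common variables observed in both files.
donors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
donor_y = np.array([10.0, 20.0, 30.0, 40.0])   # observed only for donors
recipients = np.array([[1.0, 0.0], [0.1, 1.9]])
weights = np.array([1.0, 0.5])                 # assumed, not estimated

y_hat, probs = impute(recipients, donors, donor_y, weights)
print(probs.round(3))
print(y_hat.round(2))
```

Drawing a single donor at random according to each row of probabilities, instead of averaging, would give a stochastic variant that preserves more of the donor variability.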


Proceedings of the Institute of Statistical Mathematics Vol.72, No.2, 245-260 (2024)

Dealing with Outliers in Official Statistics
—R Packages Implementing the MSD Estimators—

Kazumi Wada
(Statistical Research and Training Institute, Ministry of Internal Affairs and Communications (MIC))

In the field of official statistics, the univariate method known as range checking is still the predominant method for detecting outliers in continuous values; however, the importance of dealing with multivariate outliers is gradually being recognized because the products of statistical surveys increasingly include individual data, in addition to traditional statistical tables. This paper explains the difference between univariate and multivariate outliers. It also introduces the R packages RMSD and RMSDp, which implement the modified Stahel-Donoho (MSD) estimators as a multivariate outlier detection method that assumes a unimodal symmetric elliptic distribution.

Key words: Data cleaning, elliptical distribution.
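
The difference between univariate and multivariate outliers can be shown with a small example: a record that passes a range check on every variable but violates the correlation structure of the data. The sketch below uses the classical sample mean and covariance with the Mahalanobis distance. It does not implement the MSD estimators and does not reproduce the RMSD or RMSDp packages. The simulated data, the range limits, and the 0.999 cutoff are assumptions chosen for illustration.

```python
import numpy as np


def passes_range_check(x, lower, upper):
    """Univariate check: every variable lies inside [lower, upper]."""
    return bool(np.all((x >= lower) & (x <= upper)))


def mahalanobis_sq(x, sample):
    """Squared Mahalanobis distance of x from the sample mean."""
    mu = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)
    diff = x - mu
    return float(diff @ np.linalg.solve(cov, diff))


# Two strongly correlated variables plus one record that breaks the pattern.
rng = np.random.default_rng(1)
clean = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=500)
suspect = np.array([1.5, -1.5])
sample = np.vstack([clean, suspect])

in_range = passes_range_check(suspect, lower=-3.0, upper=3.0)
d2 = mahalanobis_sq(suspect, sample)
# 0.999 quantile of the chi-square distribution with 2 degrees of freedom,
# which has the closed form -2 * log(1 - p).
cutoff = -2.0 * np.log(0.001)
print(in_range, d2 > cutoff)
```

The classical mean and covariance used here are themselves distorted when outliers are numerous, which is the motivation for robust estimators of location and scatter.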


Proceedings of the Institute of Statistical Mathematics Vol.72, No.2, 261-271 (2024)

Practical Examples of Education Using Official Statistics

Masaaki Sato
(Faculty of Data Science, Shiga University)

This paper introduces a component of statistical education using official statistics and explains the method of analysis using the R survey package, which is necessary for that purpose.

Key words: Microdata, anonymous data from official statistical surveys, statistical education, survey package.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.2, 273-303 (2024)

Exploration of the Physical Properties of Molecular Gas in a Galaxy with High-dimensional Statistical Analysis and Future Prospects to Astronomy

Tsutomu T. Takeuchi
(Division of Particle and Astrophysical Science, Nagoya University/Research Center for Statistical Machine Learning, The Institute of Statistical Mathematics)
Kazuyoshi Yata
(Institute of Mathematics, University of Tsukuba)
Kento Egashira
(Department of Information Sciences, Tokyo University of Science)
Makoto Aoshima
(Institute of Mathematics, University of Tsukuba)
Kohji Yoshikawa
(Center for Computational Sciences, University of Tsukuba)
Aki Ishii
(Department of Information Sciences, Tokyo University of Science)
Ryusei R. Kano
(Division of Particle and Astrophysical Science, Nagoya University)
Wen E. Shi
(Division of Particle and Astrophysical Science, Nagoya University)
Aina May So
(Division of Particle and Astrophysical Science, Nagoya University/Department of Physics, Gakushuin University)
Hai-Xia Ma
(Division of Particle and Astrophysical Science, Nagoya University)
Sena A. Matsui
(Division of Particle and Astrophysical Science, Nagoya University)
Koichiro Nakanishi
(National Astronomical Observatory of Japan/Department of Astronomy, School of Science, Graduate University for Advanced Studies (SOKENDAI))
Sucheta Cooray
(National Astronomical Observatory of Japan/Research Fellow of the JSPS (PD))
Kotaro Kohno
(Institute of Astronomy, Graduate School of Science, The University of Tokyo)

If we denote the dimension of the data by d and the number of samples by n, we often encounter cases with n ≪ d. Traditionally in astronomy, such a situation has been regarded as ill-posed, and it was thought that there was no choice but to discard most of the information along the data dimension so that d < n. Data with n ≪ d are referred to as high-dimensional low sample size (HDLSS) data. To deal with HDLSS problems, a methodology called high-dimensional statistics has developed rapidly in the last decade. In this work, we first introduce high-dimensional statistical analysis. We then apply two representative methods of high-dimensional statistical analysis, the noise-reduction principal component analysis (NRPCA) and automatic sparse principal component analysis (A-SPCA), to a spectroscopic map of the nearby archetypal starburst galaxy NGC 253 taken by the Atacama Large Millimeter/Submillimeter Array (ALMA). The ALMA map is a typical HDLSS dataset. First, we analyzed the original data, which include the Doppler shift due to the systemic rotation. The high-dimensional PCA described the spatial structure of the rotation precisely. We then applied the methods to the Doppler-shift-corrected data to analyze more subtle spectral features. The NRPCA and A-SPCA quantified the very complicated characteristics of the ALMA spectra. In particular, we extracted information on the global outflow from the center of NGC 253. This method can be applied not only to spectroscopic survey data but also to any type of HDLSS data.

Key words: High-dimensional statistical analysis, principal component analysis, interstellar medium, molecular emission line, starburst, galaxy evolution.
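
When n ≪ d, PCA need not discard dimensions: the nonzero eigenvalues of the d × d sample covariance matrix coincide with those of the n × n dual matrix, which is cheap to decompose. The sketch below shows this standard linear-algebra device on simulated data. The eigenvalue correction in noise_reduced_eigenvalues is an assumed form of the noise-reduction step, in which the average of the remaining eigenvalues is subtracted from each leading eigenvalue; it should be checked against the definition used in the paper. A-SPCA is not shown, and the toy data are not ALMA data.

```python
import numpy as np


def dual_pca(X):
    """PCA for X with n rows (samples) and d columns (features), n << d."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    S_dual = Xc @ Xc.T / (n - 1)               # n x n dual covariance
    vals, vecs = np.linalg.eigh(S_dual)
    order = np.argsort(vals)[::-1][: n - 1]    # centering leaves rank <= n - 1
    vals, vecs = vals[order], vecs[:, order]
    # Map dual eigenvectors back to unit-norm directions in feature space.
    directions = Xc.T @ vecs / np.sqrt((n - 1) * vals)
    return vals, directions, S_dual


def noise_reduced_eigenvalues(vals, S_dual):
    """Assumed form: lam_j - (tr(S_dual) - sum_{i<=j} lam_i) / (n - 1 - j)."""
    n = S_dual.shape[0]
    j = np.arange(1, n - 1)                    # j = 1, ..., n - 2
    lam = vals[: n - 2]
    return lam - (np.trace(S_dual) - np.cumsum(lam)) / (n - 1 - j)


# Simulated HDLSS data: 12 samples, 500 features, one strong component.
rng = np.random.default_rng(2)
n, d = 12, 500
direction = np.ones(d) / np.sqrt(d)
X = rng.normal(size=(n, d)) + np.outer(rng.normal(scale=30.0, size=n), direction)

vals, directions, S_dual = dual_pca(X)
nr_vals = noise_reduced_eigenvalues(vals, S_dual)
print(vals[:3].round(1))
print(nr_vals[:3].round(1))
```

The correction matters because, in the HDLSS setting, the accumulated noise across the d dimensions inflates every sample eigenvalue, so the raw leading eigenvalues overstate the signal variance.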