Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 193-212 (2018)

Towards Agile Society

Hiroshi Maruyama
(Preferred Networks, Inc.)

The Institute of Statistical Mathematics operated the Service Science Research Center from April 2011 to March 2016. We conducted several seemingly unrelated projects during this period, but in retrospect they share a common theme: how to deal with change. In this manuscript we discuss insights, drawn from these projects, on how to make ourselves more agile in order to survive change. The projects concerned service science, a new programming paradigm, systems resilience, and individual career development.

Key words: Service science, statistical machine learning, sustainability, data scientists.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 213-224 (2018)

Probabilistic Modeling Technology Using Big Data: Activity for Social Implementation

Yoichi Motomura
(Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology)

Currently, the practical application of artificial intelligence is being dramatically advanced by machine learning using big data. These efforts are also expected to help realize industrial structural reform and the smart society (``Society 5.0''). In this paper, we introduce probabilistic modeling using probabilistic latent semantic analysis and Bayesian networks. To realize service value and improve productivity, users' behavior and preferences are predicted by probabilistic models constructed from service history data (ID-POS data, questionnaires with IDs, operation histories with IDs). Examples of real applications and efforts at social implementation are also discussed.
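As an illustration of the probabilistic latent semantic analysis component mentioned above, the following is a minimal sketch of PLSA fitted by the EM algorithm to a toy customer-by-item count matrix. The data, dimensions, and function name are invented for illustration; this is not the authors' implementation.

```python
import numpy as np

def plsa(counts, n_topics=2, n_iter=100, seed=0):
    """Minimal EM for probabilistic latent semantic analysis (PLSA).

    counts: (n_customers, n_items) matrix of purchase counts, e.g. from ID-POS data.
    Returns P(topic | customer) and P(item | topic).
    """
    rng = np.random.default_rng(seed)
    n_d, n_w = counts.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_d)     # P(z|d), shape (n_d, n_topics)
    p_w_z = rng.dirichlet(np.ones(n_w), size=n_topics)     # P(w|z), shape (n_topics, n_w)
    for _ in range(n_iter):
        # E-step: responsibility P(z|d,w) for every (customer, item) pair
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (n_d, n_topics, n_w)
        resp = joint / joint.sum(axis=1, keepdims=True)
        weighted = counts[:, None, :] * resp                # n(d,w) * P(z|d,w)
        # M-step: re-estimate P(w|z) and P(z|d)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z

# toy example: 5 customers x 4 items
counts = np.array([[5, 0, 1, 0],
                   [4, 1, 0, 0],
                   [0, 0, 6, 2],
                   [1, 0, 5, 3],
                   [2, 3, 0, 0]], dtype=float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
print(p_z_d.round(2))   # latent "preference" profile of each customer
```

The resulting P(topic | customer) distributions play the role of latent preference profiles from which behavior could be predicted.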

Key words: Service engineering, artificial intelligence technology, probabilistic modeling, Bayesian networks, probabilistic latent semantic analysis, big data.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 225-233 (2018)

Statistical Privacy Protection of Location Trajectories

Kazuhiro Minami
(The Institute of Statistical Mathematics)

Nowadays, trajectory location data collected from people's smartphones can be used for various analytic purposes, such as traffic monitoring and urban planning. However, because of significant concerns about location privacy, location data must be properly anonymized before being made available for secondary use. Unfortunately, trajectory location data are inherently difficult to anonymize because of their high dimensionality. Furthermore, additional measures are needed to prevent inference attacks that exploit the strong temporal and spatial correlations among data points. In this article, we present a dynamic pseudonymization technique that divides a location trace into multiple segments, and describe a state-space model for evaluating the safety of anonymized location data.
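A minimal sketch of the segmentation-and-pseudonymization idea is given below, assuming fixed-length segments, a toy grid-cell representation of locations, and an assumed first-order Markov mobility model for a crude linkage score. The paper's actual segmentation rule and state-space safety evaluation are not reproduced; all names and data are invented.

```python
import uuid
import numpy as np

def pseudonymize_trace(trace, segment_len=10):
    """Split one user's location trace into fixed-length segments and give each
    segment an unlinkable random pseudonym (a simplified illustration of dynamic
    pseudonym assignment; the paper's segmentation rule may differ).

    trace: list of (timestamp, cell_id) points for a single user.
    Returns a list of (pseudonym, segment) pairs.
    """
    segments = [trace[i:i + segment_len] for i in range(0, len(trace), segment_len)]
    return [(uuid.uuid4().hex, seg) for seg in segments]

def linkage_risk(last_cell, first_cell, transition):
    """Crude adversarial linkage score: how likely the first cell of a new
    pseudonym follows the last cell of the previous one under a first-order
    Markov mobility model (transition[i, j] = P(next cell j | current cell i))."""
    return transition[last_cell, first_cell]

# toy example: one trace over 4 grid cells
rng = np.random.default_rng(1)
trace = [(t, int(c)) for t, c in enumerate(rng.integers(0, 4, size=30))]
transition = rng.dirichlet(np.ones(4), size=4)    # assumed mobility model
pieces = pseudonymize_trace(trace, segment_len=10)
for (p1, s1), (p2, s2) in zip(pieces, pieces[1:]):
    risk = linkage_risk(s1[-1][1], s2[0][1], transition)
    print(p1[:6], "->", p2[:6], "linkage risk:", round(risk, 2))
```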

Key words: Location privacy, anonymization, Markov chain, state space model.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 235-247 (2018)

High-dimensional Sparse Modeling of Large-scale Aggregated POS Data

Yinxing Li
(Graduate School of Economics and Management, Tohoku University)
Nobuhiko Terui
(Graduate School of Economics and Management, Tohoku University)

Micro-marketing based on consumer heterogeneity using disaggregated ID-POS data has been well studied in the literature, but the implementation and operation of this approach remain limited, particularly in real store management. On the other hand, while aggregated POS data are collected by most retailers, it is widely recognized that these data have never been well utilized.
We discuss the possibility of using aggregated POS data by applying new techniques in high-dimensional sparse data modeling and machine learning. In particular, we propose a procedure comprising two sub-models: a topic model first decomposes the aggregate sales counts into several different market baskets, and then hierarchical factor regression reduces the dimensionality and ultimately maps back from the reduced dimension to the original space in order to detect the marginal effects among all products in each market basket.
The proposed model, which uses a large amount of product data, not only makes it possible to discover unexpected predictors, but also quantifies these relations in the form of elasticities, yielding managerial implications.
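As a rough illustration of the first sub-model, the sketch below uses scikit-learn's latent Dirichlet allocation as a stand-in topic model to decompose synthetic aggregated daily sales into latent market-basket profiles. The paper's own topic model and the subsequent hierarchical factor regression are not reproduced, and all data and dimensions are invented.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# toy aggregated POS data: daily sales counts of n_products at one store
rng = np.random.default_rng(0)
n_days, n_products, n_baskets = 60, 30, 4
# synthetic ground truth: each day mixes a few latent "market basket" profiles
basket_profiles = rng.dirichlet(np.ones(n_products) * 0.3, size=n_baskets)
day_mix = rng.dirichlet(np.ones(n_baskets), size=n_days)
sales = rng.poisson(day_mix @ basket_profiles * 200)

# first sub-model: decompose aggregate daily sales into latent basket types
lda = LatentDirichletAllocation(n_components=n_baskets, random_state=0)
day_topics = lda.fit_transform(sales)     # (n_days, n_baskets) mixing weights
topic_products = lda.components_          # (n_baskets, n_products) basket profiles

# the per-basket sales implied by this decomposition could then be passed to a
# dimension-reducing regression (the paper uses hierarchical factor regression)
print(day_topics[:3].round(2))
```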

Key words: Aggregated POS data, heterogeneity in shopping context, topic model, high-dimensional sparse data, hierarchical factor regression.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 249-265 (2018)

Possibility of Achieving Consumer Understanding Using Statistical Models

Tadahiko Sato
(Faculty of Business Sciences, University of Tsukuba)

The advanced use of statistical models is indispensable for the development of service science, and its importance will only increase in the future. The purpose of this paper is to explain how a statistical model (specifically, a Bayesian model) can be used to achieve the deep consumer understanding that is essential for improving services. Specifically, we organize related research on this issue and provide details of two of our own studies. Although these two studies were conducted in marketing research, they also provide suggestions relevant to service research.
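As a minimal, hypothetical sketch of how a Bayesian model can capture consumer heterogeneity through latent, consumer-specific parameters, the following hierarchical logit is written with the PyMC library on invented purchase data; it is not one of the models studied in the paper.

```python
import numpy as np
import pymc as pm

# toy purchase data: N consumers observed on T occasions with a price covariate
rng = np.random.default_rng(0)
N, T = 50, 20
price = rng.normal(size=(N, T))
consumer = np.repeat(np.arange(N), T)
true_beta = rng.normal(-1.0, 0.5, size=N)             # consumer-specific price sensitivity
logit = 0.2 + true_beta[consumer] * price.ravel()
y = (rng.random(N * T) < 1 / (1 + np.exp(-logit))).astype(int)

with pm.Model():
    # population-level distribution of price sensitivity (the heterogeneity layer)
    mu_beta = pm.Normal("mu_beta", 0.0, 1.0)
    sigma_beta = pm.HalfNormal("sigma_beta", 1.0)
    # latent consumer-specific coefficients drawn from the population distribution
    beta = pm.Normal("beta", mu_beta, sigma_beta, shape=N)
    alpha = pm.Normal("alpha", 0.0, 1.0)
    pm.Bernoulli("purchase", logit_p=alpha + beta[consumer] * price.ravel(), observed=y)
    idata = pm.sample(500, tune=500, chains=2, progressbar=False)

print(idata.posterior["mu_beta"].mean().item())        # should be near -1 on this toy data
```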

Key words: Bayesian modeling, consumer heterogeneity, time heterogeneity, latent variable.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 267-278 (2018)

Applicability of Bayesian Network to Regional Health Policy in Japan

Wataru Toriumi
(Graduate School of Systems and Information Engineering, University of Tsukuba)
Yuichi Ubukata
(Graduate School of Systems and Information Engineering, University of Tsukuba)
Shinya Kuno
(Faculty of Health and Sport Sciences, University of Tsukuba/Center for Artificial Intelligence Research, University of Tsukuba)
Yukihiko Okada
(Faculty of Engineering, Information, and Systems, University of Tsukuba/Center for Artificial Intelligence Research, University of Tsukuba)

This paper discusses service science for regional health policy and contributes to the development of service science as a data-centric science. Because municipal officials bear strong accountability toward residents, they need statistical methods when planning regional health policy, since such methods make it easier to explain how and why a particular policy was chosen. It is also necessary to establish a statistical method applicable to any municipality and any disease. In this paper, we propose that a Bayesian network learned with a Local-to-Global algorithm, which enables more efficient structural learning, is useful in meeting these needs. This algorithm is one of the constraint-based approaches and relies on conditional independence tests.
In addition, we develop a disease-causing Bayesian network applicable to any municipality and any disease.
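The following sketch illustrates the basic building block of such constraint-based structure learning: a chi-squared test of conditional independence on discrete data, applied to a toy data set with an invented exercise-weight-disease structure. It is a simplified illustration, not the paper's Local-to-Global implementation or its specific test.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def ci_test(df, x, y, cond, alpha=0.05):
    """Chi-squared test of X independent of Y given cond on discrete data, the
    building block of constraint-based structure learning (simplified sketch)."""
    stat, dof = 0.0, 0
    groups = df.groupby(list(cond)) if cond else [((), df)]
    for _, sub in groups:
        table = pd.crosstab(sub[x], sub[y])
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue
        expected = np.outer(table.sum(1), table.sum(0)) / table.values.sum()
        stat += ((table.values - expected) ** 2 / expected).sum()
        dof += (table.shape[0] - 1) * (table.shape[1] - 1)
    p_value = 1 - chi2.cdf(stat, dof) if dof > 0 else 1.0
    return p_value > alpha   # True = independence not rejected, so the edge is dropped

# toy health-record-style data: exercise -> weight -> disease
rng = np.random.default_rng(0)
n = 2000
exercise = rng.integers(0, 2, n)
weight = (rng.random(n) < 0.3 + 0.4 * (1 - exercise)).astype(int)
disease = (rng.random(n) < 0.1 + 0.3 * weight).astype(int)
df = pd.DataFrame({"exercise": exercise, "weight": weight, "disease": disease})

print(ci_test(df, "exercise", "disease", []))          # marginally dependent: usually False
print(ci_test(df, "exercise", "disease", ["weight"]))  # independent given weight: usually True
```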

Key words: Regional health policy, artificial intelligence, accountability, ease of explanation, Bayesian network.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 279-294 (2018)

Dissimilarity between Aggregated Symbolic Data Using Chi-squared Statistics and Its Application to Real Estate Data

Nobuo Shimizu
(The Institute of Statistical Mathematics)
Junji Nakano
(The Institute of Statistical Mathematics)
Yoshikazu Yamamoto
(Faculty of Science and Engineering, Tokushima Bunri University)

In recent service science research, we often have huge amounts of individual data with both continuous and categorical variables. These data sets can sometimes be divided into a rather small number of naturally defined groups. In such situations, we are interested in inference and analysis for these groups, not for the individual data. To describe these groups, we consider a set of descriptive statistics, which we call ``aggregated symbolic data'' (ASD). We propose using descriptive statistics up to second moments for both continuous and categorical variables as ASD, and define a dissimilarity as the sum of chi-squared statistics over all variables, including continuous ones. We apply our method to real estate data in the Tokyo metropolitan area: we treat the 23 cities of Tokyo as ASDs, calculate the dissimilarities among the 23 ASDs, and investigate characteristic relationships among them by using hierarchical clustering and multidimensional scaling.
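A simplified sketch of the overall workflow is given below: a chi-squared-based dissimilarity between groups is computed variable by variable (with continuous variables binned here rather than summarized by the paper's second-moment ASD representation), and the resulting dissimilarity matrix is fed to hierarchical clustering and multidimensional scaling. All data, group labels, and binning choices are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.stats import chi2_contingency
from sklearn.manifold import MDS

def pairwise_chi2_dissimilarity(df, group_col, value_cols, n_bins=5):
    """Dissimilarity between groups as the sum over variables of chi-squared
    statistics from group-vs-category contingency tables (a simplified stand-in
    for the paper's ASD dissimilarity; continuous variables are binned here)."""
    work = df.copy()
    for c in value_cols:
        if pd.api.types.is_numeric_dtype(work[c]):
            work[c] = pd.qcut(work[c], n_bins, duplicates="drop").cat.codes
    groups = sorted(work[group_col].unique())
    d = np.zeros((len(groups), len(groups)))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            if j <= i:
                continue
            pair = work[work[group_col].isin([gi, gj])]
            total = sum(chi2_contingency(pd.crosstab(pair[group_col], pair[c]))[0]
                        for c in value_cols)
            d[i, j] = d[j, i] = total
    return groups, d

# toy "real estate" records grouped by district
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "district": rng.integers(0, 5, 1000),
    "price": rng.lognormal(3, 0.3, 1000),
    "building": rng.choice(["wood", "concrete"], 1000),
})
df["price"] += df["district"] * 2    # make the districts differ
groups, d = pairwise_chi2_dissimilarity(df, "district", ["price", "building"])
print(linkage(d[np.triu_indices(len(groups), 1)], method="average"))   # hierarchical clustering
print(MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(d))
```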

Key words: Big data, Burt matrix, chi-square statistics, hierarchical clustering, multidimensional scaling.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 295-317 (2018)

Estimation of Default Probability Using Regularized Nonlinear Logit Model with B-spline and Adaptive Group LASSO

Isao Takabe
(Department of Statistical Science, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies/Consumer Statistics Division, Statistics Bureau, Ministry of Internal Affairs and Communications)
Satoshi Yamashita
(The Institute of Statistical Mathematics)

Linear binomial logit models are widely used to assess and evaluate a company's default probability based on a corporate default database. Previous studies have been criticized on the following grounds: (1) insufficient attention to nonlinear relationships between default probabilities and financial indicators; and (2) too much time required to select variables from the many candidate regressors. In this study, we aimed to solve these problems simultaneously by combining the following techniques: (1) a nonlinear, nonparametric logistic regression model based on B-splines; and (2) reasonable variable selection using the adaptive group LASSO. We constructed a default probability prediction model on data sets from multiple periods, based on our own database of Japanese bank data. The proposed model performed better than models in related studies. Compared with methods using t-statistics (p-values) or the simple LASSO, our method selected the smallest number of explanatory variables in every period and achieved more efficient variable selection. Moreover, estimation accuracy was improved in terms of the AR (accuracy ratio) value.
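The sketch below illustrates the two ingredients on toy data: each financial indicator is expanded into a B-spline basis (one group of columns per indicator), and a logistic regression with a group lasso penalty is fit by proximal gradient descent so that entire indicators can be dropped. Note that this uses a plain (non-adaptive) group lasso and invented data; the paper's adaptive weights and estimation details are not reproduced.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer, StandardScaler

def group_lasso_logit(X, y, groups, lam=0.05, lr=0.1, n_iter=2000):
    """Proximal gradient for logistic regression with a group lasso penalty:
    whole groups of B-spline coefficients are shrunk to zero together, which
    selects variables at the level of the original indicators."""
    n, p = X.shape
    beta, intercept = np.zeros(p), 0.0
    for _ in range(n_iter):
        resid = 1.0 / (1.0 + np.exp(-(intercept + X @ beta))) - y   # logistic-loss gradient
        intercept -= lr * resid.mean()
        beta -= lr * (X.T @ resid / n)
        for g in np.unique(groups):                                 # block soft-thresholding
            idx = groups == g
            norm = np.linalg.norm(beta[idx])
            beta[idx] = 0.0 if norm == 0 else max(0.0, 1 - lr * lam / norm) * beta[idx]
    return intercept, beta

# toy data: default indicator driven nonlinearly by the first of 5 financial ratios
rng = np.random.default_rng(0)
n, n_vars = 1000, 5
raw = rng.normal(size=(n, n_vars))
p_default = 1.0 / (1.0 + np.exp(-(raw[:, 0] ** 2 - 1.5)))
y = (rng.random(n) < p_default).astype(float)

spline = SplineTransformer(degree=3, n_knots=5, include_bias=False)
X = StandardScaler().fit_transform(spline.fit_transform(raw))
groups = np.repeat(np.arange(n_vars), X.shape[1] // n_vars)   # one group per indicator
intercept, beta = group_lasso_logit(X, y, groups)
for g in range(n_vars):
    print(f"indicator {g}: coefficient norm = {np.linalg.norm(beta[groups == g]):.3f}")
```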

Key words: Credit risk, B-spline, adaptive group LASSO.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 319-337 (2018)

Causal Inference for Marine Ecosystems Based on Total Power Contribution

Hiroko Kato Solvang
(Marine Mammals Research Group, Institute of Marine Research)
Subbey Sam
(Research Group on Fisheries Dynamics, Institute of Marine Research/Department of Natural Resources, Cornell University)

We introduce a statistical methodology that integrates Granger's pairwise causal analysis, its extension to causality based on the log-likelihood (partial pairwise causality), and Akaike's power contribution approach over the whole frequency domain (total causality). Although the initial idea was proposed by Ozaki (2012), it has hitherto not been applied to complex marine ecosystem dynamics. In this article, we implement the approach and add a criterion for assessing the significance of detected causal relationships. We perform a simulation study to verify the efficacy and sensitivity of the method, using data generated by three autoregressive models with three and five dimensions. We also apply the method to real observations to investigate the causal drivers of Barents Sea capelin population dynamics. The goal of this analysis is to explore inter-species relationships, which are important food-web drivers in the Barents Sea ecosystem. We present results demonstrating that the proposed methodology is a useful tool for early-stage causal analysis of complex feedback systems.
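As an illustration of the pairwise Granger-causality ingredient, the sketch below fits a multivariate autoregressive model with statsmodels to a toy three-series system and runs pairwise causality tests. The partial pairwise causality, the Akaike power contribution over the frequency domain, and the proposed significance criterion are not reproduced, and the series names are invented labels.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# toy three-species system: the third series is driven by the lagged first series
rng = np.random.default_rng(0)
T = 300
x = np.zeros((T, 3))
for t in range(1, T):
    x[t, 0] = 0.6 * x[t - 1, 0] + rng.normal(scale=0.5)
    x[t, 1] = 0.5 * x[t - 1, 1] + rng.normal(scale=0.5)
    x[t, 2] = 0.4 * x[t - 1, 2] + 0.7 * x[t - 1, 0] + rng.normal(scale=0.5)
data = pd.DataFrame(x, columns=["capelin", "herring", "cod"])

# fit a multivariate autoregressive model and run pairwise Granger tests
model = VAR(data)
result = model.fit(maxlags=3, ic="aic")
print(result.test_causality("cod", ["capelin"], kind="f").summary())   # expected: causal
print(result.test_causality("capelin", ["cod"], kind="f").summary())   # expected: not causal
```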

Key words: Multivariate auto-regressive model, multivariate time series data, feedback system, Granger's causality, marine ecosystem, Barents Sea.


Proceedings of the Institute of Statistical Mathematics Vol.66, No.2, 339-351 (2018)

P3: Python Parallelized Particle Filter Library

Shin'ya Nakano
(The Institute of Statistical Mathematics/School of Multidisciplinary Sciences, SOKENDAI)
Yuya Ariyoshi
(The Institute of Statistical Mathematics/Now at Faculty of Engineering, Nippon Bunri University)
Tomoyuki Higuchi
(The Institute of Statistical Mathematics/School of Multidisciplinary Sciences, SOKENDAI)

The particle filter (PF) is a class of state-estimation techniques based on Monte Carlo computation using a large number of particles. Because the PF is applicable even to nonlinear and/or non-Gaussian problems, it is used for a variety of purposes. One serious problem with the PF is its computational time, which grows exponentially with the degrees of freedom of the state vector. Parallel computing is an effective way to reduce the computational time, but it requires parallel programming skills. Even for experienced users, it is challenging to achieve high computational efficiency in PF computation because the PF algorithm contains a procedure that is difficult to parallelize. We have developed a Python library named P3 (Python Parallelized Particle Filter Library) that makes parallel-ready PF algorithms with high parallel efficiency readily available. In this paper, we describe the parallelized PF algorithms available in P3 and explain the design and characteristics of the library.
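For reference, the following is a generic bootstrap particle filter in plain NumPy on a toy one-dimensional random walk. It only illustrates the algorithm that P3 parallelizes (including the resampling step, which needs global communication among particles and is the part that is difficult to parallelize); it does not show the P3 API itself.

```python
import numpy as np

def bootstrap_particle_filter(observations, n_particles, transition, likelihood, init, rng):
    """Generic bootstrap particle filter (an illustration of the algorithm,
    NOT the P3 API).

    transition(particles, rng) -> propagated particles
    likelihood(y, particles)   -> observation weights
    init(n, rng)               -> initial particles
    """
    particles = init(n_particles, rng)
    means = []
    for y in observations:
        particles = transition(particles, rng)    # predict
        w = likelihood(y, particles)              # weight
        w /= w.sum()
        means.append((w * particles).sum())
        # resampling: requires global communication among all particles,
        # which is the step that limits parallel efficiency
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return np.array(means)

# toy 1-D random walk observed with Gaussian noise
rng = np.random.default_rng(0)
truth = np.cumsum(rng.normal(scale=0.3, size=100))
obs = truth + rng.normal(scale=1.0, size=100)
est = bootstrap_particle_filter(
    obs, n_particles=2000,
    transition=lambda p, r: p + r.normal(scale=0.3, size=p.shape),
    likelihood=lambda y, p: np.exp(-0.5 * (y - p) ** 2),
    init=lambda n, r: r.normal(scale=1.0, size=n),
    rng=rng,
)
print(np.mean((est - truth) ** 2))   # the filtered estimate tracks the hidden walk
```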

Key words: Particle filter, parallel computing, Python.