Proceedings of the Institute of Statistical Mathematics Vol. 58, No. 2, 141-155 (2010)

A New Approach to Machine Learning Based on Density Ratios

Masashi Sugiyama
(Department of Computer Science, Tokyo Institute of Technology)

This paper reviews a new framework for statistical machine learning that we introduced recently. A distinctive feature of this framework is that various machine learning problems are formulated in a unified way as the problem of estimating the ratio of two probability densities. The density ratio is then estimated directly, without going through the hard task of density estimation, which results in accurate estimation. This density ratio framework covers various machine learning tasks such as non-stationarity adaptation, outlier detection, dimensionality reduction, independent component analysis, and conditional density estimation. Thus, density ratio estimation is a highly versatile tool for machine learning.

Key words: Density ratio, non-stationarity adaptation, outlier detection, dimensionality reduction, independent component analysis, conditional density estimation.
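
As a concrete illustration, the following is a minimal sketch of one standard direct density-ratio estimator, the probabilistic classification approach (one of several estimators the paper surveys, not necessarily its main proposal); the Gaussian sample distributions and all names here are hypothetical.

```python
# Sketch: estimate w(x) = p_nu(x) / p_de(x) from samples of each density,
# without estimating either density, via a probabilistic classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_nu = rng.normal(0.0, 1.0, size=(500, 1))  # samples from numerator p_nu
x_de = rng.normal(0.5, 1.2, size=(500, 1))  # samples from denominator p_de

# Label numerator samples 1 and denominator samples 0, fit a classifier.
X = np.vstack([x_nu, x_de])
y = np.concatenate([np.ones(len(x_nu)), np.zeros(len(x_de))])
clf = LogisticRegression().fit(X, y)

# By Bayes' rule, p_nu(x)/p_de(x) = (n_de/n_nu) * P(y=1|x) / P(y=0|x),
# so the posterior odds of the classifier recover the density ratio.
post = clf.predict_proba(X)  # columns: [P(y=0|x), P(y=1|x)]
w = (len(x_de) / len(x_nu)) * post[:, 1] / post[:, 0]
print(w[:5])
```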


Proceedings of the Institute of Statistical Mathematics Vol. 58, No. 2, 157-166 (2010)

Combining Binary Machines for Multi-class: Statistical Model and Parameter Estimation

Shiro Ikeda
(The Institute of Statistical Mathematics)

Combining binary machines to solve multi-class classification problems is a popular idea, and many related methods have been proposed. One of the most popular methods is to use error correcting output codes (ECOC), while another interesting idea is to use the Bradley-Terry (BT) model. In this paper, these methods are reviewed from a viewpoint based on statistical models. As a result, a common framework is given and natural extensions are derived.

Key words: Bradley-Terry model, multi-class classification, maximum likelihood estimation.
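
As an illustration of the BT-model side of this idea, below is a minimal sketch of pairwise coupling in the Hastie-Tibshirani style, assuming one binary machine per class pair with equal sample counts; the function name and the toy probability matrix are hypothetical, and the paper's unified framework is more general.

```python
# Sketch: recover class probabilities p from pairwise binary outputs
# by fitting the Bradley-Terry model p_i / (p_i + p_j) to r[i, j].
import numpy as np

def couple_bt(r, n_iter=100, tol=1e-10):
    """r[i, j] ~ P(class i | class i or j), with r[j, i] = 1 - r[i, j]."""
    k = r.shape[0]
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        p_old = p.copy()
        for i in range(k):
            mu = p[i] / (p[i] + p)      # current BT-model predictions
            num = r[i].sum() - r[i, i]  # observed "wins" of class i
            den = mu.sum() - mu[i]      # predicted "wins" of class i
            p[i] *= num / den           # multiplicative ML update
            p /= p.sum()
        if np.abs(p - p_old).max() < tol:
            break
    return p

# Toy usage: 3 classes, with the binary machines favoring class 0.
r = np.array([[0.5, 0.8, 0.7],
              [0.2, 0.5, 0.6],
              [0.3, 0.4, 0.5]])
print(couple_bt(r))
```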


Proceedings of the Institute of Statistical Mathematics Vol. 58, No. 2, 167-183 (2010)

Data Analysis Method on Space of Exponential Family Distributions

Shotaro Akaho
(The National Institute of Advanced Industrial Science and Technology)
Kazuho Watanabe
(Nara Institute of Science and Technology)
Masato Okada
(Graduate School of Frontier Sciences, The University of Tokyo)

Principal component analysis (PCA) is widely used for dimension reduction, but it is optimal only for Gaussian distributed data and cannot extract a desired lower dimensional structure from non-Gaussian data. In this paper, we review research, from an information-geometrical point of view, on dimension reduction for data that are generated from an exponential family or are given as parameters of an exponential family. As an extension of conventional PCA, we propose dually coupled methods for dimension reduction called e-PCA and m-PCA, in which an affine (autoparallel) subspace in one of the dually coupled coordinate systems is extracted so as to minimize the sum of Kullback-Leibler divergences. We also consider the treatment of a mixture distribution, which does not belong to an exponential family; the basic idea is to embed the mixture distribution into an exponential family. Further, we introduce a probabilistic model for the proposed framework and derive a clustering algorithm constrained to a lower dimensional subspace. The variational Bayes method and the Laplace approximation are applied in order to keep the computation tractable.

Key words: Principal component analysis, information geometry, dimension reduction, clustering, Bayesian estimation.
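
In standard exponential-family notation, the dually coupled objectives can be sketched as follows; this is one common convention, and the paper gives the precise formulation, including which divergence direction pairs with which coordinate system.

```latex
% Natural (e-) parameter \theta, expectation (m-) parameter
% \eta = \nabla\psi(\theta), for the family
%   p(x; \theta) = \exp( \theta^\top t(x) - \psi(\theta) ).
%
% e-PCA: extract an autoparallel (affine) subspace in the e-coordinates,
% projecting each data distribution p_i onto it:
\min_{\theta_0,\, U,\, \{w_i\}} \sum_{i=1}^{n}
  \mathrm{KL}\bigl( p_i \,\big\|\, p(\cdot\,;\, \theta_0 + U w_i) \bigr)
%
% m-PCA is the dual procedure: an affine subspace in the m-coordinates
% \eta, with the divergence taken in the opposite direction:
\min_{\eta_0,\, V,\, \{w_i\}} \sum_{i=1}^{n}
  \mathrm{KL}\bigl( p(\cdot\,;\, \eta_0 + V w_i) \,\big\|\, p_i \bigr)
```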


Proceedings of the Institute of Statistical Mathematics Vol. 58, No. 2, 185-206 (2010)

Nonparametric Inference with Positive Definite Kernels

Kenji Fukumizu
(The Institute of Statistical Mathematics)

The methodology of data analysis with positive definite kernels, or reproducing kernel Hilbert spaces, is called the “kernel method” and has been developed in the machine learning field. A distinctive feature of this methodology is that data are mapped into the reproducing kernel Hilbert space given by a positive definite kernel, and linear methods of data analysis are applied to the mapped data in that Hilbert space. Although the mapped data may be infinite dimensional, the special property of the inner product makes the computation efficient. More recently, it has been revealed that basic statistics such as the mean and covariance, considered in reproducing kernel Hilbert spaces, are useful for analyzing statistical properties such as homogeneity, independence, and conditional independence of random variables. This paper explains the basic idea of this new approach to nonparametric inference and gives a brief survey of the results obtained so far, focusing particularly on nonparametric methods for assessing independence and conditional independence of variables. In discussing such properties, it is important to use a class of kernels for which the mean on the reproducing kernel Hilbert space determines the probability distribution uniquely. Kernels in this class are called characteristic, and some theoretical analysis of them is also shown. With a characteristic kernel, the squared distance between the means can be applied to the two-sample test for homogeneity. If the joint probability and the product of the marginals are compared, this distance equals the squared Hilbert-Schmidt norm of the cross-covariance operator, which can be used for an independence test. It is also shown that the squared Hilbert-Schmidt norm of the normalized cross-covariance operator equals the chi-square divergence, a well-known measure of dependence. The last part of this paper briefly surveys the kernel approach to conditional independence. The method is based on extending the characterization of conditional independence of Gaussian random variables to the general case by mapping variables into reproducing kernel Hilbert spaces. Some applications of conditional independence are also shown.

Key words: Positive definite kernel, reproducing kernel, Hilbert space, nonparametric, independence, conditional independence.
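
The two statistics mentioned above can be computed in a few lines; the following sketch uses the Gaussian kernel, which is characteristic, with biased V-statistic estimators for brevity (bandwidth selection and test calibration are omitted, and all names are hypothetical).

```python
# Sketch: kernel two-sample statistic (MMD^2) and HSIC independence
# statistic with a Gaussian kernel.
import numpy as np

def gauss_gram(a, b, sigma=1.0):
    """Gram matrix k(a_i, b_j) for the Gaussian kernel."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Squared distance between the kernel means of the two samples."""
    return (gauss_gram(x, x, sigma).mean()
            + gauss_gram(y, y, sigma).mean()
            - 2 * gauss_gram(x, y, sigma).mean())

def hsic(x, y, sigma=1.0):
    """Squared Hilbert-Schmidt norm of the empirical cross-covariance
    operator: zero (in population) iff x and y are independent."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K = gauss_gram(x, x, sigma)
    L = gauss_gram(y, y, sigma)
    return np.trace(H @ K @ H @ L) / n ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
print(mmd2(x, rng.normal(1.0, 1.0, size=(200, 1))))  # shifted sample: large
print(hsic(x, x ** 2))                                # dependent: positive
print(hsic(x, rng.normal(size=(200, 1))))             # independent: near zero
```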


Proceedings of the Institute of Statistical Mathematics Vol. 58, No. 2, 207-221 (2010)

A Multilevel Model Using 2nd and 3rd Order Moments

Koken Ozaki
(The Institute of Statistical Mathematics)
Kentaro Nakamura
(Faculty of Management, Saitama Gakuen University)
Hiroto Murohashi
(Ochanomizu Research Center for Human Development and Education, Ochanomizu University)

Multi-stage sampling is used to collect data such as social survey data and educational survey data. The multilevel model is appropriate for analyzing this kind of data and is often used in sociology and pedagogy. Structural Equation Modeling (SEM), which is widely used in the social sciences, can accommodate a multilevel model using multi-group analysis. In SEM, (means and) covariances are used as information. Recently, however, a method named non-normal Structural Equation Modeling (nnSEM) has been developed. The merits of nnSEM are that it can not only handle non-normally distributed variables but can also statistically judge the direction of causation between cross-sectional variables. In this paper, we develop a two-stage model within the framework of nnSEM. The model can judge the direction of causation between cross-sectional variables in both sampling units. Simulation studies were performed to examine the characteristics of the model.

Key words: Multilevel model, two-stage sampling, SEM, nnSEM.
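
As a small illustration of why third-order moments can identify the direction of causation (the classical cube-of-correlation identity of Dodge and Rousson, not the authors' nnSEM estimator itself), consider the following sketch: under y = b*x + e with skewed x and symmetric independent e, corr(x, y)^3 equals skew(y)/skew(x), and the reverse direction violates this identity.

```python
# Sketch: third moments distinguish x -> y from y -> x for non-normal data.
import numpy as np

def skew(v):
    """Population skewness E[(v - mu)^3] / sd^3."""
    v = v - v.mean()
    return (v ** 3).mean() / (v ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
x = rng.exponential(1.0, 100_000)  # skewed cause
e = rng.normal(0.0, 1.0, 100_000)  # symmetric noise
y = 0.8 * x + e                    # true model: x -> y

rho = np.corrcoef(x, y)[0, 1]
print(rho ** 3, skew(y) / skew(x))  # approximately equal: x -> y fits
print(rho ** 3, skew(x) / skew(y))  # mismatched: y -> x does not
```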