Workshop on Mathematical Approaches to Large-Dimensional Data Analysis

MEXT-ISM Coop-Math Program and ISM Research Center for Statistical Machine Learning
Dates: March 13 - 15, 2014. Place: The Institute of Statistical Mathematics, Tachikawa, Tokyo

Titles and Abstracts of Talks:

March 13th (Thu)

Michael Elad (Technion, Israel)
Sparse Modeling of Graph-Structured Data ... and ... Images.
slides: ppt, pdf

Abstract: Images, video, audio, text documents, financial data, medical information, traffic info - all these and many others are data sources that can be effectively processed. Why? Is it obvious? In this talk we will start by discussing "modeling" of data as a way to enable their actual processing, putting emphasis on sparsity-based models. We will turn our attention to graph-structured data and propose a tailored sparsifying transform for its dimensionality reduction and subsequent processing. We shall conclude by showing how this new transform becomes relevant and powerful in revisiting ... classical image processing tasks.
(*) This talk covers portions from a joint work with Idan Ram and Israel Cohen.
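
As a rough illustration of the sparse synthesis model behind such methods (not taken from the talk), the following sketch codes a signal over a random dictionary with orthogonal matching pursuit; the dictionary, sparsity level, and data are made-up choices.

import numpy as np

def omp(D, y, k):
    """Greedily pick k atoms of D that best explain y."""
    residual = y.copy()
    support = []
    coeffs = np.zeros(D.shape[1])
    for _ in range(k):
        # Atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        support.append(j)
        # Re-fit the coefficients on the selected support by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ sol
    coeffs[support] = sol
    return coeffs

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms
a_true = np.zeros(256)
a_true[rng.choice(256, 5, replace=False)] = rng.standard_normal(5)
y = D @ a_true
a_hat = omp(D, y, k=5)
print("recovered support:", np.flatnonzero(a_hat))
print("true support:     ", np.flatnonzero(a_true))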

Yoshiyuki Kabashima (Tokyo Institute of Technology)
Statistical mechanical approach to sample complexity of dictionary learning

Abstract: Finding a basis matrix (dictionary) by which objective signals are represented sparsely is of major relevance in various scientific and technological fields. We consider the problem of learning a dictionary from a set of training signals. We employ techniques from the statistical mechanics of disordered systems to evaluate the size of the training set necessary to typically succeed in dictionary learning. The results indicate that the necessary size is much smaller than previously estimated, which theoretically supports and/or encourages the use of dictionary learning in practical situations.
(*) This talk is based on a collaboration with Ayaka Sakata.
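
For concreteness, here is a minimal sketch of dictionary learning by alternating minimization, with a crude thresholding-based sparse-coding step and a method-of-optimal-directions dictionary update; the sizes and the synthetic training set are arbitrary, and this is an illustration only, not the statistical-mechanics analysis of the talk.

import numpy as np

rng = np.random.default_rng(1)
n, m, p, k = 32, 64, 400, 3              # signal dim, #atoms, #training signals, sparsity

D_true = rng.standard_normal((n, m))
D_true /= np.linalg.norm(D_true, axis=0)
A_true = np.zeros((m, p))
for j in range(p):
    supp = rng.choice(m, k, replace=False)
    A_true[supp, j] = rng.standard_normal(k)
Y = D_true @ A_true                      # training signals, one per column

D = rng.standard_normal((n, m))
D /= np.linalg.norm(D, axis=0)
for _ in range(30):
    # Sparse-coding step (crude: keep the k atoms most correlated with each signal, then refit).
    A = np.zeros((m, p))
    for j in range(p):
        supp = np.argsort(-np.abs(D.T @ Y[:, j]))[:k]
        sol, *_ = np.linalg.lstsq(D[:, supp], Y[:, j], rcond=None)
        A[supp, j] = sol
    # Dictionary-update step (method of optimal directions): least-squares fit of D.
    D = Y @ A.T @ np.linalg.pinv(A @ A.T)
    D /= np.linalg.norm(D, axis=0) + 1e-12

print("relative training error:", np.linalg.norm(Y - D @ A) / np.linalg.norm(Y))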

Toshiyuki Tanaka (Kyoto University)
TBA

Abstract:

Shiro Ikeda (Institute of Statistical Mathematics)
Some applications of sparse modelling in physical measurements
slides

Abstract: Sparsity-based methods are effective for many physical measurement techniques. Here we show some applications: two examples from astronomy (the interferometer and Compton camera imaging) and one from diffraction imaging. All of the image reconstruction methods are based on Bayesian methods.
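
As a generic illustration of sparsity-based reconstruction (not the Bayesian methods of the talk), the sketch below recovers a sparse signal from underdetermined linear measurements with ISTA; the random Gaussian matrix merely stands in for an actual instrument response, and the noise level and penalty are made up.

import numpy as np

rng = np.random.default_rng(2)
n, d, s = 80, 256, 8                       # measurements, signal length, nonzeros
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
A = rng.standard_normal((n, d)) / np.sqrt(n)   # stand-in for the measurement operator
y = A @ x_true + 0.01 * rng.standard_normal(n)

lam = 0.02
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
x = np.zeros(d)
for _ in range(500):                       # ISTA: gradient step + soft thresholding
    z = x + (A.T @ (y - A @ x)) / L
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

print("support found:", np.flatnonzero(np.abs(x) > 0.05))
print("true support: ", np.flatnonzero(x_true))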

Kenji Nagata (University of Tokyo)
Sparse modeling and variable selection with the exchange Monte Carlo method

Abstract: Feature selection is an important process for improving the generalization capability and interpretability of learned models through the selection of a relevant feature subset. In the last two decades, a number of feature selection methods, such as L1 regularization and automatic relevance determination, have been intensively developed and used in a wide range of areas, and they allow us to select a relevant subset of features. In this talk, we instead introduce an exhaustive search method based on the exchange Monte Carlo method. Moreover, we propose a new hypothesis test, based on exhaustive search, for judging whether the given data contain the information needed for the desired binary classification.
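
For a small number of features, the exhaustive search can be written down directly; the sketch below enumerates all feature subsets and scores each by cross-validated accuracy. The dataset, classifier, and scoring are hypothetical choices, and the exchange Monte Carlo machinery of the talk is what makes such a search feasible when plain enumeration is not.

import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, p = 200, 8                                   # small p, so full enumeration is feasible
X = rng.standard_normal((n, p))
y = (X[:, 0] - 2 * X[:, 3] + 0.5 * rng.standard_normal(n) > 0).astype(int)

scores = {}
for k in range(1, p + 1):
    for subset in itertools.combinations(range(p), k):
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X[:, list(subset)], y, cv=5).mean()
        scores[subset] = acc

best = max(scores, key=scores.get)
print("best subset:", best, "cv accuracy: %.3f" % scores[best])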

March 14th (Fri)

Noureddine El Karoui (UC Berkeley, USA)
On high-dimensional robust regression, penalized robust regression and GLMs

Abstract: I will discuss the behavior of widely used statistical methods in the high-dimensional setting where the number of observations, n, and the number of predictors, p, are both large. I will present limit theorems about the behavior of the corresponding estimators, their asymptotic risks, and so on. The results apply not only to robust regression estimators but also to Lasso-type estimators and to many more complicated problems. Some of the results answer a question raised by Huber in his seminal '73 paper on robust regression.

Many surprising statistical phenomena occur: for instance, maximum likelihood methods are shown to be (grossly) inefficient, and loss functions that should be used in regression are shown to depend on the ratio p/n. This means that dimensionality should be explicitly taken into account when performing simple tasks such as regression.

More generally, we'll see that intuition based on results obtained in the small p, large n setting leads to misconceptions and the use of suboptimal procedures. It also turns out that inference is possible in this setting. We'll also see that the geometry of the design matrix plays a key role in these problems and use this fact to disprove claims of universality of some of the results.

Mathematically, the tools needed mainly come from random matrix theory, measure concentration and convex analysis.
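
As a toy illustration of the regime in question (and only that; it does not reproduce the limit theorems of the talk), the sketch below fits ordinary least squares and a Huber-loss M-estimator when p/n = 0.5 with heavy-tailed errors; the data-generating model and tuning constants are made up.

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(4)
n, p = 400, 200                                   # p/n = 0.5: "large p, large n" regime
beta = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
noise = rng.standard_t(df=2, size=n)              # heavy-tailed errors
y = X @ beta + noise

ols = LinearRegression(fit_intercept=False).fit(X, y)
hub = HuberRegressor(fit_intercept=False, epsilon=1.35, alpha=1e-4, max_iter=500).fit(X, y)

for name, est in [("OLS", ols), ("Huber", hub)]:
    err = np.linalg.norm(est.coef_ - beta) / np.linalg.norm(beta)
    print(f"{name:5s} relative estimation error: {err:.3f}")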


Kei Kobayashi (Institute of Statistical Mathematics)
Curvature of empirical metrics on a data space and its deformation

Abstract: The talk is largely about empirical metrics and empirical geodesics and their application to data analysis. We introduce a novel deformation of a metric, called the alpha-beta-gamma metric, and give a way of computing it via metric graphs. This deformation of a metric relates to the curvature of the data space and to the robustness of estimation based on it. The means induced by the alpha and beta metrics correspond to an intrinsic and an extrinsic mean, respectively, which are usually used when the data lie on a specific manifold, though our method is not restricted to such cases. We show some theoretical results on the properties of the metric and give some examples to show how the metric works for data analysis.
This is a collaboration with Prof. Henry P. Wynn (London School of Economics).

Kengo Kato (University of Tokyo)
Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors
(joint work with V. Chernozhukov and D. Chetverikov)
slides

Abstract: We derive a Gaussian approximation result for the maximum of a sum of high-dimensional random vectors. Specifically, we establish conditions under which the distribution of the maximum is approximated by that of the maximum of a sum of the Gaussian random vectors with the same covariance matrices as the original vectors. This result applies when the dimension of random vectors (p) is large compared to the sample size (n); in fact, p can be much larger than n, without restricting correlations of the coordinates of these vectors. We also show that the distribution of the maximum of a sum of the random vectors with unknown covariance matrices can be consistently estimated by the distribution of the maximum of a sum of the conditional Gaussian random vectors obtained by multiplying the original vectors with i.i.d. Gaussian multipliers. This is the Gaussian multiplier (or wild) bootstrap procedure. Here too, p can be large or even much larger than n. These distributional approximations, either Gaussian or conditional Gaussian, yield a high-quality approximation to the distribution of the original maximum, often with approximation error decreasing polynomially in the sample size, and hence are of interest in many applications. We demonstrate how our Gaussian approximations and the multiplier bootstrap can be used for modern high-dimensional estimation, multiple hypothesis testing, and adaptive specification testing. All these results contain nonasymptotic bounds on approximation errors.
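
The multiplier bootstrap itself is short to write down. The sketch below computes the maximum statistic and its Gaussian-multiplier bootstrap distribution for a made-up data matrix with p much larger than n; the null model and the 95% level are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(5)
n, p, B = 100, 2000, 1000                 # p much larger than n
X = rng.standard_normal((n, p)) * np.exp(rng.standard_normal(p))  # heteroscedastic coordinates
Xc = X - X.mean(axis=0)                   # centered data for the bootstrap

# Statistic: the maximum over coordinates of |n^{-1/2} sum_i X_ij|.
T = np.max(np.abs(X.mean(axis=0)) * np.sqrt(n))

# Multiplier bootstrap: multiply the centered rows by i.i.d. N(0,1) weights and
# recompute the maximum; its conditional distribution approximates that of T.
T_boot = np.empty(B)
for b in range(B):
    e = rng.standard_normal(n)
    T_boot[b] = np.max(np.abs(e @ Xc) / np.sqrt(n))

print("statistic:", round(T, 3), " bootstrap 95% critical value:", round(np.quantile(T_boot, 0.95), 3))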

Arthur Gretton (University College London)
Kernel Distribution Embeddings and Tests for Three-Variable Interactions
slides

Abstract: We introduce nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space (RKHS). Such embeddings have been applied successfully in two-sample testing (of whether two distributions are the same), and independence testing (of whether two random variables are dependent). We show that test statistics for these simpler cases can be constructed from RKHS embeddings of measures, and generalize these ideas to test statistics for interactions among three variables - the two statistics obtained are the Lancaster interaction statistic, and the total independence statistic.

The Lancaster and total independence statistics are straightforward to compute, and the resulting tests are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.

Relevant paper: Sejdinovic, D., Gretton, A., and Bergsma, W., A Kernel Test for Three-Variable Interactions, NIPS, 2013
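
As a minimal sketch of the simpler two-variable building block mentioned in the abstract (an HSIC-style independence statistic built from centered Gram matrices, not the three-variable Lancaster statistic of the paper), with Gaussian kernels and an arbitrary bandwidth:

import numpy as np

def gram(x, sigma):
    """Gaussian-kernel Gram matrix of a one-dimensional sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K, L = gram(x, sigma), gram(y, sigma)
    return np.trace(K @ H @ L @ H) / n ** 2      # biased HSIC estimate

rng = np.random.default_rng(6)
x = rng.standard_normal(300)
y_dep = x ** 2 + 0.3 * rng.standard_normal(300)  # dependent but uncorrelated with x
y_ind = rng.standard_normal(300)
print("HSIC, dependent pair:  ", hsic(x, y_dep))
print("HSIC, independent pair:", hsic(x, y_ind))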


Shotaro Akaho (National Institute of Advanced Industrial Science and Technology)
On the robust nonlinear curve fitting
slides

Abstract: Curve and surface fitting is a basic procedure in computer vision and machine learning. In order to fit a nonlinear curve efficiently, a feature space mapping is often used, in which each point is mapped into a high-dimensional space by a nonlinear transformation. Such a mapping makes it possible to solve the fitting problem by linear computation. However, this simplification causes several problems, and I will point out some of these problems and talk about attempts to solve them. The first is the difference between the metrics of the input space and the feature space. The second is robustness against outliers, which is also a problem in linear fitting. The third arises when the feature space is extended to an infinite-dimensional space such as a reproducing kernel Hilbert space.
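
A small example of the feature-space-mapping idea (a plain algebraic circle fit, not the robust variants discussed in the talk): mapping each point (x, y) to (x^2 + y^2, x, y, 1) makes the circle equation linear in its parameters, so the fit reduces to a smallest-singular-vector problem. The noise level and circle parameters below are arbitrary.

import numpy as np

rng = np.random.default_rng(7)
theta = rng.uniform(0, 2 * np.pi, 200)
x = 2.0 + 1.5 * np.cos(theta) + 0.05 * rng.standard_normal(200)   # noisy circle,
y = -1.0 + 1.5 * np.sin(theta) + 0.05 * rng.standard_normal(200)  # center (2, -1), radius 1.5

# Feature-space mapping: a circle a(x^2 + y^2) + bx + cy + d = 0 is linear in (a, b, c, d).
Z = np.column_stack([x ** 2 + y ** 2, x, y, np.ones_like(x)])
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
a, b, c, d = Vt[-1]                          # minimizer of ||Z w|| subject to ||w|| = 1

cx, cy = -b / (2 * a), -c / (2 * a)
r = np.sqrt((b ** 2 + c ** 2) / (4 * a ** 2) - d / a)
print(f"estimated center ({cx:.3f}, {cy:.3f}), radius {r:.3f}")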

Kenji Fukumizu (Institute of Statistical Mathematics)
Nonparametric Bayesian Inference with Positive Definite Kernels
slides

Abstract: Kernel methods have been used successfully in many machine learning problems, with favorable performance in extracting nonlinear structure of high-dimensional data. Recently, nonparametric inference methods with positive definite kernels have been developed, employing the kernel mean expression of distributions. In this approach, the distribution of a variable is represented by the mean element of the random feature vector obtained by mapping the variable with the kernel function, and the relation between variables is expressed by a covariance operator. In this talk, I introduce Bayesian inference with this kernel mean expression; namely, Bayes' rule is realized with the kernel mean and covariance expressions to compute the kernel mean of the posterior. This approach provides a novel nonparametric way of Bayesian inference, expressing a distribution with a weighted sample and computing the posterior by matrix computations. Some theoretical results on the convergence of the estimators are also shown, and the usefulness is demonstrated with the example of nonparametric hidden Markov models.
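
A minimal sketch of the kernel-mean machinery (a conditional kernel mean embedding, which is a building block of, but not the same as, the kernel Bayes' rule of the talk): a conditional distribution is represented by sample weights computed from Gram matrices, and expectations become weighted sums over the training sample. The kernel width, regularization constant, and toy data are ad hoc choices.

import numpy as np

def gram(a, b, sigma):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(8)
n = 400
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.2 * rng.standard_normal(n)      # training pairs (x_i, y_i)

lam, sigma = 1e-3, 0.5
K = gram(x, x, sigma)
# Weights representing the conditional kernel mean of Y given X = 1.0.
w = np.linalg.solve(K + n * lam * np.eye(n), gram(x, np.array([1.0]), sigma))[:, 0]

# Expectations of functions of Y become weighted sums over the training y_i.
print("estimated E[Y | X = 1.0]:", float(w @ y), " (compare sin(1) =", round(np.sin(1.0), 3), ")")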

March 15th (Sat)

Milos Radovanovic (Univ. Novi Sad, Serbia)
Hubs in Nearest-Neighbor Graphs: Origins, Applications and Challenges
slides

Abstract: The tendency of k-nearest neighbor graphs constructed from tabular data using some distance measure to contain hubs, i.e. points with in-degree much higher than expected, has drawn a fair amount of attention in recent years due to the observed impact on techniques used in many application domains. This talk will be organized into three parts: (1) Origins, which will discuss the causes of the emergence of hubs (and their low in-degree counterparts, the anti-hubs), and their relationships with dimensionality, neighborhood size, distance concentration, and the notion of centrality; (2) Applications, where we will present some notable effects of (anti-)hubs on techniques for machine learning, data mining and information retrieval, identify two different approaches to handling hubs adopted by researchers - through fighting or embracing their existence - and review techniques and applications belonging to the two groups; and (3) Challenges, which will discuss work in progress, open problems, and areas with significant opportunities for hub-related research.
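
A small experiment in the spirit of part (1) fits in a few lines: build the kNN graph of i.i.d. Gaussian data and look at the in-degree (k-occurrence) distribution as the dimension grows; the sample sizes, dimensions, and k below are arbitrary.

import numpy as np

def k_occurrence(X, k=10):
    """In-degree of each point in the k-nearest-neighbor graph (Euclidean distance)."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T     # squared pairwise distances
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]                # the k neighbors of each point
    return np.bincount(knn.ravel(), minlength=len(X))

rng = np.random.default_rng(9)
for dim in (3, 100):
    X = rng.standard_normal((1000, dim))
    occ = k_occurrence(X)
    skew = np.mean((occ - occ.mean()) ** 3) / occ.std() ** 3
    print(f"dim={dim:4d}  max in-degree={occ.max():3d}  skewness={skew:.2f}")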

Ikumi Suzuki (National Institute of Genetics)
The effect of data centering for k-nearest neighbor
slides

Abstract: The k-nearest neighbor (kNN) method is simple and popular in various machine learning tasks. However, when the data dimension is high, kNN often performs poorly.

Recently, the hubness phenomenon has been reported as one aspect of the curse of dimensionality affecting kNN. A "hub" is a sample that is similar to many other samples in a dataset.

While the spectral properties of Laplacian-based kernels are useful for dimensionality reduction, spectral clustering, and so on, they also affect kNN, in particular by reducing hub samples.

Going further, simply centering the similarity measure, which can be seen as shifting the data origin to the data centroid in a feature space (data centering), also has the property of reducing hubs.
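
Operationally, such centering amounts to double-centering the similarity (Gram) matrix before taking nearest neighbors, which is the same as shifting the origin to the centroid in feature space. The sketch below does this for a made-up inner-product similarity on off-center data and compares the largest in-degree of the kNN graph before and after centering; the data and k are arbitrary.

import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal((500, 200)) + 0.5          # high-dimensional, off-center data
K = X @ X.T                                        # inner-product similarity (Gram) matrix

# Data centering: shift the origin to the centroid in feature space,
# i.e. double-center the similarity matrix.
n = K.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H

def max_k_occurrence(S, k=10):
    """Largest in-degree in the kNN graph built from a similarity matrix S."""
    S = S.copy()
    np.fill_diagonal(S, -np.inf)
    knn = np.argsort(-S, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=S.shape[0]).max()

print("max k-occurrence, raw similarity:     ", max_k_occurrence(K))
print("max k-occurrence, centered similarity:", max_k_occurrence(Kc))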


Ryota Tomioka (Toyota Technological Institute, Chicago)
Towards better computation-statistics trade-off in tensor decomposition
slides

Abstract: New approaches to tensor decomposition with worst-case performance guarantees have recently emerged. For one of these methods, the overlapped trace norm, it has been proved that O(rn^{d-1}) samples suffice to recover an n x n x ... x n d-way tensor that admits a Tucker decomposition with an r x r x ... x r core. Although this method is computationally very efficient, the rate is rather disappointing. A newer approach, called the square norm, achieves a lower sample complexity of O(r^{d/2}n^{d/2}) at the cost of higher computational demand. There is also a more theoretical approach that could achieve the optimal O(rnd) but seems to be computationally intractable. I will overview these recent developments and discuss how we could achieve a better trade-off between what we can provably achieve in terms of statistical accuracy and how much computation we need to spend.
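
The overlapped trace norm mentioned above is the sum over modes of the nuclear norms of the tensor's unfoldings; a minimal sketch computing it for a synthetic low-Tucker-rank tensor (with arbitrary sizes) is:

import numpy as np

def unfold(T, mode):
    """Mode-k unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def overlapped_trace_norm(T):
    """Sum over modes of the nuclear norm of each unfolding."""
    return sum(np.linalg.svd(unfold(T, k), compute_uv=False).sum() for k in range(T.ndim))

rng = np.random.default_rng(11)
n, r = 20, 3                                      # a 20 x 20 x 20 tensor of Tucker rank (3, 3, 3)
core = rng.standard_normal((r, r, r))
U = [rng.standard_normal((n, r)) for _ in range(3)]
T = np.einsum('abc,ia,jb,kc->ijk', core, U[0], U[1], U[2])

print("overlapped trace norm:", overlapped_trace_norm(T))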

Taiji Suzuki (Tokyo Institute of Technology)
PAC-Bayesian Bound for Gaussian Process Regression and Multiple Kernel Additive Model
slides

Abstract: We develop a PAC-Bayesian bound for the convergence rate of a Bayesian variant of Multiple Kernel Learning (MKL), which is an estimation method for the sparse additive model. Standard analyses of MKL require a strong condition on the design analogous to the restricted eigenvalue condition in the analysis of the Lasso and the Dantzig selector. In this talk, we show a PAC-Bayesian bound which states that the Bayesian variant of MKL achieves the optimal convergence rate without such strong conditions on the design. Our approach is basically a combination of PAC-Bayesian techniques and the theory of nonparametric Gaussian process regression. Our analysis includes existing results on Gaussian process regression as a special case. The analysis also yields the convergence rate of the Bayesian variant of the Group Lasso as a finite-dimensional special case.
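
As a minimal sketch of the additive model underlying MKL (a non-sparse, unweighted combination of one kernel per coordinate fit by kernel ridge regression, not the Bayesian estimator analyzed in the talk), with made-up toy data and regularization:

import numpy as np

def gram_1d(a, b, sigma=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(12)
n, p = 300, 10
X = rng.uniform(-2, 2, (n, p))
y = np.sin(2 * X[:, 0]) + X[:, 3] ** 2 + 0.1 * rng.standard_normal(n)   # additive truth

# One kernel per coordinate; the additive model uses their (here unweighted) sum.
K = sum(gram_1d(X[:, m], X[:, m]) for m in range(p))
lam = 1e-2
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

Xte = rng.uniform(-2, 2, (200, p))
Kte = sum(gram_1d(Xte[:, m], X[:, m]) for m in range(p))
pred = Kte @ alpha
truth = np.sin(2 * Xte[:, 0]) + Xte[:, 3] ** 2
print("test RMSE:", np.sqrt(np.mean((pred - truth) ** 2)))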


Sponsors: MEXT-ISM Coop-Math Program / Research Center for Statistical Machine Learning, ISM