Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 3-27(1999)

Recent Developments in Model Selection Theory

Hidetoshi Shimodaira
(The Institute of Statistical Mathematics)

Data analysis based on stochastic models has proved useful in many fields of application. However, it is often difficult to specify a single good model from prior knowledge alone, so we need a methodology for selecting models from data. Akaike proposed the information criterion to evaluate a model in terms of prediction, and he advocated the importance of modeling in data analysis. Several kinds of information criteria have since been proposed in the literature, and we have to choose an appropriate one according to our purpose and situation. In this article, we discuss the derivations of information criteria for several inference schemes. We also comment on the consistency of model selection. Consistency concerns the limit of large sample size, but the sample size is finite in actual applications. Thus, it is important to consider the sampling error of the information criterion in order to evaluate the reliability (or uncertainty) of model selection. Methods such as the bootstrap selection probability, the model selection test, and multiple comparisons of models are discussed for assessing this reliability. Further, we give a graphical method to visualize the relative locations of predictive densities for exploratory model building. Illuminating examples from variable selection in multiple regression, as well as practical examples from evolutionary tree reconstruction, are given to illustrate the methodology.

Key words: Information criterion, AIC, predictive density, variable selection, Bayes model, multiple comparisons.

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 29-48(1999)

Near Parametric Inference
—Towards Flexible Modeling—

Shinto Eguchi
(The Institute of Statistical Mathematics)

This paper introduces near-parametric inference, which extends the working range of the usual likelihood method to a wider region in which the proposed method performs well against slight departures, in possible directions, from the assumptions of a parametric model. A variety of semiparametric approaches have been established to bridge the gap between parametric and nonparametric methods. In this approach, which follows the semiparametric line, the key idea is to enlarge a parametric model into its tubular neighborhood so as to relax the rigid relation between the parametric model and the likelihood function.

Three typical applications of near-parametric inference are given. (1) Density estimation by the local likelihood method is discussed, where a given model is enlarged according to the data point at which the density is to be estimated. In effect, a structure of incomplete observation is imposed by the kernel function. This structure vanishes as the bandwidth tends to infinity. Large-bandwidth asymptotics are discussed in the near-parametric situation, where the underlying distribution asymptotically reduces to the parametric one. (2) For neural computation algorithms, we introduce a self-organizing rule into the likelihood method by considering a latent variable that indicates whether each observation conforms to the assumptions of the parametric setting. In particular, we present an application to principal component analysis. The proposed algorithm is of EM type: in the E step, the conditional probability, given each observation, that the observation is well controlled is imputed; in the M step, the principal component vector of the sample covariance matrix weighted by these conditional probabilities is calculated. (3) We introduce a sensitivity approach to observational bias by modeling a selectivity parameter. The key point is that the selectivity parameter is not estimated; instead, it is used to assess the influence of possible observational bias, that is, deviation from the pure randomness assumption under missing-data or allocation sampling. A selectivity index that is invariant under reparametrization of the selectivity gives a reasonable assessment of whether the observational assumption breaks down.

A common advantage emerges from these applications: near-parametric inference reasonably retains the efficiency of parametric inference while performing well against departures from the parametric setting.

Key words: Local likelihood, near parametrics, observational bias, principal component analysis, selectivity parameter.

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 49-61(1999)

Artificial Likelihoods for General Nonlinear Regressions

Jinfang Wang
(The Institute of Statistical Mathematics)

This paper concerns nonlinear regression models based on estimating functions. A general estimating function, g(\theta), is typically nonconservative, that is, g(\theta) is not the gradient of any scalar function. In such cases, neither the quasi-likelihood nor the quasi-likelihood ratio can be uniquely defined. In this paper we study the problem of nonconservative estimating functions and the associated difficulties in general linear regression. We propose a semi-parametric inference approach based on artificial likelihood functions derived from the vector field decomposition associated with estimating functions. Further properties of the Helmholtz-type quasi-likelihood proposed by Wang (1999) are studied. In particular, we propose a method for root selection based on the bootstrap quasi-likelihood ratio. The method is applied to a logistic regression model with measurement error.

Key words: Bootstrap, estimating function, generalized linear model, logistic regression with measurement error, multiple roots, vector field.
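
To make the notion of a nonconservative estimating function concrete: a smooth vector field g(\theta) on a simply connected domain is the gradient of a scalar function exactly when its Jacobian is symmetric. The sketch below checks this numerically by central differences. The two example fields and the function names are hypothetical and chosen only for illustration; they are not the estimating functions studied in the paper.

```python
def jacobian(g, theta, h=1e-6):
    """Central-difference Jacobian of the vector field g at theta."""
    n = len(theta)
    rows = len(g(theta))
    jac = [[0.0] * n for _ in range(rows)]
    for j in range(n):
        up = list(theta)
        dn = list(theta)
        up[j] += h
        dn[j] -= h
        gu, gd = g(up), g(dn)
        for i in range(rows):
            jac[i][j] = (gu[i] - gd[i]) / (2 * h)
    return jac

def asymmetry(g, theta):
    """Largest |J_ij - J_ji|; zero (up to rounding) for a gradient field."""
    jac = jacobian(g, theta)
    n = len(theta)
    return max(abs(jac[i][j] - jac[j][i]) for i in range(n) for j in range(n))

def conservative(theta):
    # Gradient of the scalar function -(a^2 + a*b + b^2).
    a, b = theta
    return [-(2 * a + b), -(a + 2 * b)]

def nonconservative(theta):
    # Contains a rotational part, so no scalar potential exists.
    a, b = theta
    return [-a + 2 * b, -b - 2 * a]

point = [0.3, -0.7]
asym_conservative = asymmetry(conservative, point)
asym_nonconservative = asymmetry(nonconservative, point)
```

For the second field the asymmetry is 4 at every point, so no quasi-likelihood can have it as its score, which is the difficulty the abstract refers to.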

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 63-69(1999)

Characterization of Invariant Probability Models

Hidehiko Kamiya
(The Institute of Statistical Mathematics)

The theory of statistical inference from the point of view of invariance is based on the assumption that the underlying distributions belong to so-called invariant probability models. This paper deals with the problem of characterizing invariant probability models. In the general setting where neither the sample space nor the parameter space is isomorphic to the acting group, a characterization is given in terms of the functional form of the densities.

Key words: Invariant probability model, group action, orbit, global cross section, orbital decomposition, maximal invariant.

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 71-79(1999)

A Study on Paulik-Seber Estimation in a Tag Experiment
—In Terms of Ancillarity and Sufficiency
in the Presence of a Nuisance Parameter—

Sakutaro Yamada and Toshihide Kitakado
(Department of Fisheries Resource Management, Tokyo University of Fisheries)

The problem of estimating the size of a fish population from a tag experiment with incomplete reporting is considered. In previous work, the two parameters in this problem, the reporting rate and the population size, were estimated from the conditional and marginal distributions, respectively. In this paper, justification for these estimation methods is given through some notions of ancillarity and sufficiency in the presence of a nuisance parameter.

Key words: Ancillarity, sufficiency, nuisance parameter, incomplete observation, tag experiment.

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 81-90(1999)

Limitations on the Use of Bayesian Test
under a Vague Prior Distribution

Takemi Yanagimoto
(The Institute of Statistical Mathematics)

One critique of the statistical test procedure is raised in connection with Lindley's paradox. In the test procedure, a true null hypothesis is rejected with probability \alpha, a prespecified value called the significance level. The critique asserts that this probability should tend to zero as the sample size tends to infinity. This requirement is called consistency of a test. The same controversy also arises in the model selection problem.

In this article we make clear that the problem appears when the amount of information in the data at hand is large while that in the prior distribution is relatively small. The variance of an estimator then becomes much smaller than that of the prior distribution. Such a prior distribution is called vague. We emphasize that: 1) such a prior distribution is not realistic in practical scientific reasoning, 2) careful consideration of the sample size is not taken into account, and 3) difficulties arise in interpreting the posterior distribution. We conclude that the Bayesian test will not be useful in scientific reasoning.

Key words: Confidence interval, consistency of test, model selection, posterior distribution, statistical test.
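
Lindley's paradox can be reproduced with a standard textbook setup, which is an assumption of this sketch rather than the paper's own formulation: the sample mean of n observations is N(\theta, \sigma^2/n), the null hypothesis is \theta = 0, and under the alternative \theta has a N(0, \tau^2) prior. For a fixed z-statistic the Bayes factor in favor of the null is then sqrt(1 + r) * exp(-(z^2/2) * r/(1 + r)) with r = n\tau^2/\sigma^2. The function name and the default values of tau2 and sigma2 are hypothetical.

```python
import math

def bayes_factor_null(z, n, tau2=1.0, sigma2=1.0):
    """Bayes factor of H0: theta = 0 against H1: theta ~ N(0, tau2),
    for a sample mean with z-statistic z based on n observations."""
    r = n * tau2 / sigma2  # prior variance relative to sampling variance
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z * z * r / (1.0 + r))

# Hold the z-statistic at the two-sided 5% boundary and let n grow.
sample_sizes = (10, 100, 10_000, 1_000_000)
factors = [bayes_factor_null(1.96, n) for n in sample_sizes]
for n, bf in zip(sample_sizes, factors):
    print(n, bf)
```

With z held fixed, the Bayes factor grows roughly like the square root of n, so data that are "significant at the 5% level" increasingly favor the null as the prior becomes vague relative to the data.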

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 91-104(1999)

Discrete Distribution Theory
in a Higher-order Markov Chain

Masayuki Uchida
(The Institute of Statistical Mathematics)

Let X_{-m+1}, X_{-m+2}, ..., X_0, X_1, X_2, ... be a time-homogeneous {0, 1}-valued m-th order Markov chain. The distribution of the number of trials until the first success, i.e., the geometric distribution, in the sequence X_1, X_2, ... is studied. The geometric distribution of order k in the sequence X_1, X_2, ... is also obtained. The probability distribution of the number of "1"s, i.e., the binomial distribution, in the sequence X_1, X_2, ..., X_n is studied. The probability distributions of the numbers of runs of "1" of exact length k (k > m) in the sequence X_1, X_2, ..., X_n are also considered. There are several ways of counting the number of runs of length k. This paper studies the distributions based on four ways of counting, i.e., the number of non-overlapping runs of length k, the number of runs of length greater than or equal to k, the number of overlapping runs of length k, and the number of runs of length exactly k. We obtain these four kinds of binomial distributions of order k by using unified expressions, and we clarify the relation between the binomial distribution and the binomial distributions of order k.

Key words: Geometric distribution, binomial distribution, geometric distribution of order k, binomial distribution of order k, probability generating function, Markov chain.
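
The four ways of counting runs can be made concrete on a small sequence. The sketch below uses the usual conventions, assumed here rather than quoted from the paper: a maximal run of "1" of length L contributes floor(L/k) non-overlapping runs of length k, L - k + 1 overlapping runs of length k when L >= k, one run of length at least k when L >= k, and one run of length exactly k when L = k. The function name run_counts is hypothetical.

```python
from itertools import groupby

def run_counts(seq, k):
    """Count runs of 1s of length k in a 0/1 sequence under four conventions."""
    # Lengths of the maximal runs of 1s.
    lengths = [sum(1 for _ in grp) for val, grp in groupby(seq) if val == 1]
    return {
        "non_overlapping": sum(length // k for length in lengths),
        "at_least_k": sum(1 for length in lengths if length >= k),
        "overlapping": sum(length - k + 1 for length in lengths if length >= k),
        "exactly_k": sum(1 for length in lengths if length == k),
    }

# Maximal runs of 1s have lengths 5, 2 and 3.
seq = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
counts = run_counts(seq, 2)
print(counts)
```

Applied to a sequence of length n, each of these four statistics has its own distribution, which is why there are four binomial distributions of order k.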

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 105-118(1999)

Use of Probability Generating Function
for Distribution Theory of Runs

Katuomi Hirano
(The Institute of Statistical Mathematics)
Sigeo Aki
(Department of Informatics and Mathematical Science, Osaka University)

Let X_1, X_2, ... be a sequence of independent and identically distributed {0, 1}-valued random variables, a {0, 1}-valued Markov chain, or a {0, 1}-valued higher order Markov chain. Let E_0 be the event that a run of "0" of length r occurs and let E_1 be the event that a run of "1" of length k occurs in the sequence X_1, X_2, ... .

In the case of Markov dependent trials, discrete distributions related to the events E_0 and E_1 are studied. The probability generating functions of the distributions of the waiting times of the sooner and later occurring events are given. To obtain the probability generating function of the distribution of the number of occurrences of E_1 in X_1, X_2, ..., X_n, we use the snake oil method.

The distributions of the numbers of overlapping and non-overlapping occurrences of success-runs of length l, and the distributions of the numbers of occurrences of success-runs of exact length l and of length l or more, until the first occurrence of a success-run of length k in m-th order (m < l < k) Markov dependent trials are studied. The distribution of overlapping occurrences is the geometric distribution of order (k - l + 1), and both those of exact length l and those of length l or more are the geometric distribution of order 1. To show these results, we describe the solution by the method of conditional probability generating functions. A finite-state Markov chain imbedding technique is also illustrated.

Key words: Probability generating function, distribution theory of run, Markov dependent trials, waiting time problems.
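
For the simplest case of independent trials with success probability p, the waiting time for the first success-run of length k can be computed by a finite-state Markov chain imbedding and checked against the classical probability generating function p^k t^k (1 - pt) / (1 - t + q p^k t^{k+1}) with q = 1 - p. The sketch below covers only this i.i.d. case; it does not reproduce the Markov dependent results of the paper, and the function names are hypothetical.

```python
def waiting_time_pmf(p, k, n_max):
    """P(T = n), n = 0..n_max, where T is the trial at which the first run of
    k consecutive successes is completed in i.i.d. Bernoulli(p) trials.
    The imbedded chain's state is the current length of the trailing success run."""
    q = 1.0 - p
    state = [0.0] * k          # state[j]: trailing run has length j, not yet absorbed
    state[0] = 1.0
    pmf = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        new = [0.0] * k
        new[0] = q * sum(state)            # a failure resets the run
        for j in range(k - 1):
            new[j + 1] = p * state[j]      # a success extends the run
        pmf[n] = p * state[k - 1]          # a success completes the run of length k
        state = new
    return pmf

def waiting_time_pgf(p, k, t):
    """Classical pgf of T for i.i.d. trials."""
    q = 1.0 - p
    return (p * t) ** k * (1.0 - p * t) / (1.0 - t + q * p ** k * t ** (k + 1))

pmf = waiting_time_pmf(0.5, 2, 400)
mean = sum(n * prob for n, prob in enumerate(pmf))
pgf_from_pmf = sum(prob * 0.5 ** n for n, prob in enumerate(pmf))
```

For p = 1/2 and k = 2 the mean waiting time is (1 - p^k)/(q p^k) = 6, which the imbedding reproduces numerically.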

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 119-142(1999)

Estimating Inequalities for Incomplete
Gamma Function Ratios

Tadashi Matsunawa
(The Institute of Statistical Mathematics)
Tomohiro Takei
(Graduate School of Science and Engineering, Chuo University)

Several approximations to the incomplete gamma function ratio \gamma(p, x), with parameter p > 0 and variable x > 0, are presented for some situations. The approximations are realized by giving double-sided inequalities of the form \underline{\gamma}(p, x) < \gamma(p, x) < \bar{\gamma}(p, x). The resulting bounds are expected to be useful for approximation problems in statistics and in related mathematical sciences. Our approaches to obtaining the bounds (a) when p is a positive integer and (b) when p is a general positive real number are fairly different. Numerical and graphical results on the approximations are also presented.

Key words: Incomplete gamma function ratio, approximation, double-sided estimating inequality, Maclaurin's formula, inverse factorial series, absolute convergent series, Ramanujan's conjecture.
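
For reference values against which bounds of this kind can be compared, the ratio \gamma(p, x) = (1/\Gamma(p)) \int_0^x t^{p-1} e^{-t} dt has a finite closed form when p is a positive integer and a convergent power series for general p > 0. The sketch below implements these two standard formulas; it does not implement the paper's inequalities, and the function names are hypothetical.

```python
import math

def ratio_integer(p, x):
    """Closed form for positive integer p:
    1 - exp(-x) * sum_{j=0}^{p-1} x^j / j!."""
    term, total = 1.0, 0.0
    for j in range(p):
        total += term
        term *= x / (j + 1)
    return 1.0 - math.exp(-x) * total

def ratio_series(p, x, terms=200):
    """Series for real p > 0:
    exp(-x) * x^p * sum_{n>=0} x^n / Gamma(p + n + 1)."""
    term = math.exp(p * math.log(x) - x - math.lgamma(p + 1))
    total = 0.0
    for n in range(terms):
        total += term
        term *= x / (p + n + 1)
    return total

value_closed = ratio_integer(3, 2.5)
value_series = ratio_series(3, 2.5)
value_half = ratio_series(0.5, 1.0)   # equals erf(1) since gamma(1/2, x) = erf(sqrt(x))
```

The fixed number of series terms is adequate for moderate x; for large x a continued fraction for the complementary ratio is the usual choice.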

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 143-156(1999)

Decomposition Problem of Distributions Characterized
by Regular Variation (II)

Takaaki Shimura
(The Institute of Statistical Mathematics)

The Mellin-Stieltjes convolution (MS-convolution) and the related decomposition of distributions in some classes characterized by regular variation are investigated. Maller shows that if X and Y are independent non-negative random variables with distributions \mu and \nu, respectively, and both \mu and \nu are in D_2, the domain of attraction of the Gaussian distribution, then the distribution of the product XY (that is, the MS-convolution \mu \circ \nu of \mu and \nu) also belongs to D_2. He also shows that if the distribution of the product of two independent random variables belongs to D_2 and one of them has finite variance, then the other is in D_2. He conjectures that, conversely, if \mu \circ \nu belongs to D_2, then both \mu and \nu (the factors of \mu \circ \nu) are in D_2. The first purpose of this paper is to deal with this problem in detail. It is well known that D_2 is identical with the class of distributions whose truncated variance \int_{|t| < x} t^2 \mu(dt) is slowly varying. We deal with the following class, which is an extension of D_2: the class M(\alpha) of distributions \mu on [0, \infty) with slowly varying \alpha-th truncated moments \int_0^x t^\alpha \mu(dt). Some subclasses of M(\alpha) are given with the property that if \mu \circ \nu belongs to the subclass, then \mu and \nu are in M(\alpha). In general, however, a distribution in M(\alpha) can have a factor that does not belong to M(\alpha): there exist distributions \mu \in D_2 and \nu \not\in D_2 such that \mu \circ \nu belongs to D_2. This implies that Maller's conjecture is not true. A non-negative non-decreasing (resp. non-increasing) function f is said to be decomposed into components f_1 and f_2 if both f_1 and f_2 are non-negative non-decreasing (resp. non-increasing) and f = f_1 + f_2. A component of a non-decreasing slowly varying function is not necessarily slowly varying. The proof depends on results on the decomposition of non-decreasing slowly varying functions.

The second purpose is to consider the same problem for D(\alpha), the class of distributions \mu on [0, \infty) with regularly varying tails \mu(x, \infty) with index -\alpha (\alpha \geq 0). These classes are related to various limit theorems for i.i.d. sequences. Within the classes D(\alpha) (\alpha \geq 0), there is a big difference between slowly varying tails and regularly varying tails with negative index. In the case of D(0), if \mu \circ \nu is in D(0) and \int_0^\infty t^\varepsilon \nu(dt) < \infty for some \varepsilon > 0, then \mu belongs to D(0). But in the case of D(\alpha) (\alpha > 0), there exist two distributions \mu and \nu such that neither \mu nor \nu has a regularly varying tail but their MS-convolution \mu \circ \nu belongs to D(\alpha). Further, under the assumption that the support of \nu is a finite geometric progression, a necessary and sufficient condition on the measure \nu is given for the existence of a distribution \mu whose tail is not regularly varying but whose MS-convolution \mu \circ \nu with \nu has a regularly varying tail.

Key words: Regularly varying function, Mellin-Stieltjes convolution, tail of distribution, truncated moment, decomposition of distribution.

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 157-174(1999)

Entropy Methods for Random Fields Generated
by Martingales and Their Statistical Applications

Yoichi Nishiyama
(The Institute of Statistical Mathematics)

The purpose of this study is to develop entropy methods, which were first introduced for empirical processes of I.I.D. data, in order to handle some martingales with applications to statistical inference for stochastic processes.

The motivation is as follows. Following the prominent work of Dudley in 1978, entropy methods were studied in the 1980s to establish laws of large numbers and central limit theorems for empirical processes indexed by classes of sets or functions. Furthermore, some recent works have shown that the methods are useful not only for those limit theorems but also for other problems in statistics. The 1996 book by van der Vaart and Wellner gives a nice exposition of the methods as well as many applications, with emphasis on I.I.D. data. However, although some parts of the methods have good potential for application to non-I.I.D. data, no systematic study has been done in the framework of martingales, which are known to be important for analyzing a rich class of statistical models. We intend to take a step toward filling this gap in the literature.

Section 1 contains an intuitive explanation of a generalization of Ossiander's central limit theorem. For simplicity, the rest of the paper is devoted only to continuous local martingales and applications to the Gaussian white noise model. Based on the maximal inequalities derived in Section 2, the highlight is Section 3, which gives a weak convergence theorem. Using these results, we derive the asymptotic behavior of local random fields of kernel estimators, the rate of convergence of some parametric and non-parametric M-estimators, and the asymptotic normality of integral type estimators.

Key words: Martingale, maximal inequality, central limit theorem, kernel estimator, change point, maximum likelihood estimator.

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 175-199(1999)

Malliavin Calculus and Statistical Asymptotic Theory

Yuji Sakamoto
(Nagoya University)
Nakahiro Yoshida
(University of Tokyo)

When we derive asymptotic expansions of random variables, the smoothness of their distributions becomes a subject of discussion. In the case where the random variables are functionals of continuous-time stochastic processes, we need an infinite dimensional analysis for the study of their analytic properties, and the Malliavin calculus provides the key to the problem of the smoothness of their distributions. In the Malliavin calculus, the integration-by-parts formula plays an important role. We will first derive it for the finite dimensional case from a well-known identity, and will illustrate the significance of the smoothness of the distribution in the derivation of asymptotic expansions on a finite dimensional space, in relation to the integration-by-parts formula. Next, we will introduce the foundations of the Malliavin calculus, and will explain the theory of asymptotic expansions of generalized Wiener functionals and their applications to statistics. Moreover, it will be shown that expansion formulas for shrinkage estimators also follow from this general theory.

Key words: Stein's identity, integration-by-parts formula, Sobolev space, generalized Wiener functional, diffusion process, shrinkage estimator.
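
Assuming the "well-known identity" is Stein's identity, as the key words suggest, the one-dimensional version already shows the integration-by-parts mechanism. For a standard normal Z with density \varphi, and a differentiable f for which both expectations are finite and f(z)\varphi(z) vanishes at infinity, the relation \varphi'(z) = -z\varphi(z) gives:

```latex
\begin{aligned}
\mathbb{E}\bigl[f'(Z)\bigr]
  &= \int_{-\infty}^{\infty} f'(z)\,\varphi(z)\,dz \\
  &= \bigl[f(z)\,\varphi(z)\bigr]_{-\infty}^{\infty}
     - \int_{-\infty}^{\infty} f(z)\,\varphi'(z)\,dz \\
  &= \int_{-\infty}^{\infty} z\,f(z)\,\varphi(z)\,dz
   = \mathbb{E}\bigl[Z\,f(Z)\bigr].
\end{aligned}
```

The derivative is moved from f onto the density, which is the step that requires smoothness of the underlying distribution. This is a sketch of the standard identity only, not the paper's own derivation.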

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 201-221(1999)

Distribution of the Maximum of Gaussian Random Field:
Tube Method and Euler Characteristic Method

Satoshi Kuriki
(The Institute of Statistical Mathematics)
Akimichi Takemura
(Faculty of Economics, University of Tokyo)

Let X(t), t \in I, be a Gaussian random field with mean 0 and variance 1. Assume that X(t) has a representation X(t) = \sum_{i=1}^p \phi_i(t) z_i, where the z_i's are independent standard normal random variables and the \phi_i(t)'s are smooth functions on I. Define another random field U(t) = \sum_{i=1}^p \phi_i(t) y_i, where (y_1, ..., y_p) is a random vector distributed uniformly on the unit sphere. In this paper we first elucidate the tube method for approximating the upper tail probabilities of the maxima of these random fields. Second, we explain the Euler characteristic method, which is another method for the same purpose. Moreover, in the cases of the random fields X(t) and U(t), the Euler characteristic method is shown to give the same result as the tube method. From this fact, a relation between the coefficients of the asymptotic expansion of the tail probability and the Euler characteristic of the index set is derived. Finally, as an example we discuss the asymptotic expansion for the largest eigenvalue of the multivariate symmetric normal random matrix.

Key words: Asymptotic expansion, Gauss-Bonnet theorem, integral geometry, Karhunen-Loeve expansion, tail probability, tube formula.
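
A small case in which the tail of the maximum is known exactly is useful as a check on any approximation of this kind. Take p = 2, \phi_1(t) = cos t and \phi_2(t) = sin t on I = [0, 2\pi), so that X(t) has variance 1 and max_t X(t) = sqrt(z_1^2 + z_2^2), whose upper tail probability is exp(-c^2/2). The sketch below compares a Monte Carlo estimate on a grid with this closed form. The example, the grid size and the function names are choices made here for illustration and are not taken from the paper.

```python
import math
import random

GRID = 90  # number of grid points on [0, 2*pi); an arbitrary choice
COS = [math.cos(2 * math.pi * j / GRID) for j in range(GRID)]
SIN = [math.sin(2 * math.pi * j / GRID) for j in range(GRID)]

def field_maximum(z1, z2):
    """Maximum over the grid of X(t) = z1*cos(t) + z2*sin(t)."""
    return max(z1 * c + z2 * s for c, s in zip(COS, SIN))

def simulated_tail(c, draws=10000, seed=0):
    """Monte Carlo estimate of P(max_t X(t) > c)."""
    rng = random.Random(seed)
    hits = sum(field_maximum(rng.gauss(0, 1), rng.gauss(0, 1)) > c
               for _ in range(draws))
    return hits / draws

def exact_tail(c):
    """P(sqrt(z1^2 + z2^2) > c) for independent standard normals."""
    return math.exp(-0.5 * c * c)

estimate = simulated_tail(2.0)
reference = exact_tail(2.0)
```

The grid maximum slightly underestimates the true maximum, so the simulated tail is biased downward by a small amount on top of the Monte Carlo error.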

Proceedings of the Institute of Statistical Mathematics Vol.47, No.1, 223-241(1999)

Real Time Statistical Discrimination of Foreshocks
from Other Earthquake Clusters

Yosihiko Ogata and Tokuji Utsu
(The Institute of Statistical Mathematics)

This paper reviews our papers (Ogata et al., 1995, Geophysical Journal International, 121, 233-254; 1996, Geophysical Journal International, 127, 17-30). When earthquake activity begins at some place, it may be a foreshock sequence of a larger earthquake, or it may be a swarm or a simple mainshock-aftershock sequence. This paper is concerned with the conditional probability that the activity will be foreshock activity of a later, larger earthquake, depending on the occurrence pattern of some early events in the sequence. The earthquake catalogue of the Japan Meteorological Agency (1926-1993, M_J > 4) is decomposed into a number of clusters in time and space to compare statistical features of foreshocks with those of swarms and aftershocks. Using this data set, we reveal some features that discriminate foreshocks from the other types of clusters, for example the events' stronger proximity in time and space and a tendency towards chronologically increasing magnitudes, which encouraged us to construct models that forecast the probability of the earthquakes being foreshocks. Specifically, the probability is a function of the history of magnitude differences, spans between origin times, and distances between epicentres within a cluster. For an illustrative implementation, the models were fitted to the early part of the data (1926-1975) and the validity of the forecasting procedure was checked on data from the later period (1976-1993). Two procedures for evaluating the performance of probability forecasts are suggested. Further, at the very beginning of an activity, where only a single event is available (i.e., either it is the first event in a cluster or a single isolated event), we also forecast the probability of the event being a foreshock as a function of its geographic location. The validation of this forecast is then demonstrated in a similar manner.
Finally, making use of the multi-element prediction formula, we see that the forecasting performance is enhanced by the joint use of the information in the location of the first event and that in the subsequent inter-event history in the cluster.

Key words: Magnitude differences, multi-element prediction formula, epicentre separations, logit models, origin time spans, probability forecasts.