Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 211-222(2006)
The Markov-chain Monte Carlo method is a computer simulation algorithm that reproduces statistical ensembles. It is usually based on the Boltzmann weight factor and realizes a fixed-temperature canonical ensemble. However, when the number of degrees of freedom of the system is large, there exists a huge number of local-minimum-energy states separated by high energy barriers. This causes the simulation to get trapped in local-minimum-energy states and makes it very difficult to reproduce an accurate low-temperature canonical ensemble. Generalized-ensemble algorithm is a generic term for methods that are based on non-Boltzmann weight factors and overcome the above difficulty by realizing a one-dimensional random walk in energy space. We review one of the generalized-ensemble algorithms, namely the replica-exchange method, and its extensions. As an example of its application, we present the results of a replica-exchange Monte Carlo simulation applied to the prediction of membrane protein structures.
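The replica-exchange idea can be sketched on a toy one-dimensional double-well potential; the potential, step size, and temperature schedule below are illustrative assumptions, not details from the paper. Each replica runs Metropolis updates at its own inverse temperature, and neighbouring replicas periodically attempt to exchange configurations with the standard acceptance probability min(1, exp[(β_i − β_j)(E_i − E_j)]).

```python
import math
import random

def energy(x):
    # Toy double-well potential: two minima at x = ±1 separated by a barrier
    return (x * x - 1.0) ** 2

def metropolis_step(x, beta, step=0.5):
    # Standard Metropolis update at inverse temperature beta
    y = x + random.uniform(-step, step)
    dE = energy(y) - energy(x)
    if dE <= 0.0 or random.random() < math.exp(-beta * dE):
        return y
    return x

def replica_exchange(betas, n_sweeps=2000):
    # One replica per temperature; after each sweep, attempt to swap the
    # configurations of a randomly chosen pair of neighbouring replicas.
    xs = [1.0 for _ in betas]
    for _ in range(n_sweeps):
        xs = [metropolis_step(x, b) for x, b in zip(xs, betas)]
        i = random.randrange(len(betas) - 1)
        delta = (betas[i] - betas[i + 1]) * (energy(xs[i]) - energy(xs[i + 1]))
        if random.random() < math.exp(min(0.0, delta)):
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs
```

The exchanges let a low-temperature replica inherit configurations that crossed the barrier at high temperature, which is exactly the random walk in temperature (and hence energy) space that the abstract describes.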
Key words: Generalized-ensemble algorithm, replica-exchange method, membrane protein, transmembrane helix, protein tertiary structure prediction.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 223-245(2006)
This review article presents several research developments in ocean data assimilation in the tropical Pacific, primarily regarding climate-scale phenomena. Aspects of ocean state estimation, reduced-space techniques, theoretical error estimation, diagnostics, and evaluation of the observing system are also briefly reported.
Key words: Data assimilation, Tropical Pacific, climate variability.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 247-264(2006)
A data assimilation model has been developed for the Japan Sea circulation as a pioneering work for estimation problems in geophysical fluid dynamics. Feasible assimilation methods are selected to reflect various measurement data in a high-resolution ocean circulation model considering computational cost and accuracy. Realistic surface and subsurface states are estimated and predicted by controlling not only the surface boundary conditions but also the bottom topography in the Japan Sea. Ocean current forecasting benefits many scientific and social applications including prediction of drifting oil spills and movement of giant jellyfish.
Key words: Japan Sea, ocean circulation model, remote-sensing observation data, data assimilation, ocean forecasting.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 265-280(2006)
This study aims to develop a spatiotemporal model for evaluating chlorophyll a (chl-a) distributions over the Sea of Japan, derived from satellite remote sensing data. Considering factors affecting the chl-a distributions, we focused on satellite-derived sea surface temperature (SST) and photosynthetically active radiation (PAR). In our preliminary spatial analysis, chl-a exhibited anisotropy, while SST and PAR were almost isotropic in the south-north and east-west directions. Furthermore, a time series analysis of the change in spatial correlation showed that chl-a and PAR had significant autocorrelations. We thus propose a spatiotemporal model to express the change in the chl-a distribution, and numerically evaluate its ability to predict the one-month-ahead change in the chl-a distribution.
Key words: Phytoplankton distribution, satellite remote sensing, Sea of Japan, spatiotemporal statistical modeling.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 281-297(2006)
We review a new approach to earthquake forecasting. This approach is based on a statistical-physics method that is more effective than probabilistic long-term seismic-hazard assessments. Our method, called Pattern Informatics (PI), quantifies temporal variations in seismicity. The output is a map of areas in a seismogenic region where earthquakes are likely to occur in a future 10-year span. This approach has been applied to the central part of Japan. Applications of the technique to California and worldwide have also forecast the locations of future earthquakes. We discuss the results reviewed in this paper from several geophysical viewpoints and indicate that the PI method shows considerable promise as an intermediate-term earthquake forecasting tool. It is of interest to understand how this PI approach can be applied to other regions.
Key words: Forecast, earthquake, seismicity, pattern informatics.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 299-314(2006)
We introduce our research project on the prevention of accidents involving children. In this research, injury surveillance systems, probabilistic modeling, and 3-D computer simulation are studied. We collect statistical data from hospitals and homes, and construct computational models that can be used for the prediction, evaluation, and control of infants' behavior and injuries using computer simulations. Computational models such as probabilistic networks are represented and utilized as knowledge by means of the probabilistic reasoning and information technologies described in this paper.
Key words: Bayesian network, statistical learning, probabilistic reasoning, injury prevention, human modeling.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 315-331(2006)
This report introduces graph mining techniques actively explored in recent data-mining research, and demonstrates their application to gene network analysis in conjunction with statistical modeling. The study of graph mining was initiated in the mid 1990s, and became widely explored after 2000 upon the proposal of a complete search algorithm. Graph mining is used to find characteristic substructures shared by graphs in a given massive graph data set. In particular, the exhaustive search for frequent subgraphs widely seen in the data is a representative graph mining task. As this task contains subgraph isomorphism problems, which are known to be NP-complete, its computational complexity is inherently high. Accordingly, the development of a practically fast algorithm for graph mining is a key issue in the study. A property characterizing the substructures is used to mine characteristic substructures from massive graph data. A naive way to search for characteristic substructures is to check the property on every substructure in the data. However, this approach faces a combinatorial explosion of substructures to check. For efficient mining, most graph mining approaches limit the property to a "Downward Closure Property (DCP)." A DCP $P$ is defined as $a \subseteq b \Rightarrow (P(b)\rightarrow P(a))$ for two structures $a$ and $b$, where $P(\cdot)$ means that the property $P$ holds on a structure. Representative DCPs are itemset frequency and graph frequency: every subitemset of a frequent itemset, and every subgraph of a frequent graph, is also frequent. By this definition, if $P$ does not hold on a substructure $a$, $P$ does not hold on any superstructure of $a$ either. Accordingly, given the set of all size-$k$ substructures satisfying a DCP $P$, the candidate size-$(k+1)$ substructures satisfying $P$ are limited to joins of substructures in that set. This strongly limits the search space of characteristic substructures, and enables practical and fast graph mining.
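The level-wise search enabled by a DCP can be sketched for the simpler itemset case (the same principle underlies Apriori-style Basket Analysis); this is an illustrative sketch, not code from the paper, and the data and threshold are ours:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    # Level-wise (Apriori-style) search.  By the downward closure property,
    # every subset of a frequent itemset is itself frequent, so size-(k+1)
    # candidates are generated only from frequent size-k sets.
    support = lambda s: sum(s <= t for t in transactions)
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = list(frequent)
    k = 1
    while frequent:
        # join step: unions of two frequent k-sets that form a (k+1)-set
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # prune step (DCP): drop candidates with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        frequent = {c for c in candidates if support(c) >= min_support}
        result.extend(frequent)
        k += 1
    return result
```

The prune step is where the DCP pays off: a candidate is counted against the data only if all of its size-$k$ subsets already passed the support threshold, which drastically shrinks the search space.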
The graph mining technique was introduced into the post-processing of statistical gene network models obtained from microarray gene expression data. Bayesian network and nonparametric regression models of the gene network were searched in the data in a greedy manner. We are interested in subnetworks that appear widely across the searched networks, since such subnetworks are considered highly credible. Basket Analysis and the connected induced subgraph mining algorithm AcGM were applied to mine frequent subnetworks over the searched networks, where each network contains 801 genes. In the results of both Basket Analysis and AcGM, the frequent subnetworks are limited to very small sizes compared with the total size of the gene networks. This indicates that a wide variety of interpretations of a gene network structure can be obtained from microarray gene expression data. In this study, graph mining was applied at the post-processing stage of statistical modeling to extract credible gene subnetworks. Other possibilities, such as the direct introduction of graph mining into the search process of statistical model structures, should be explored in future work.
Key words: Graph mining, data mining, statistical modeling, Bayesian network, gene network.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 333-356(2006)
We describe statistical methods for estimating gene networks from gene expression data and other biological information. Since the information contained in gene expression data is limited, it is very difficult to accurately estimate gene networks from microarray data alone. This paper introduces two methods for overcoming this limitation. One is to estimate gene networks along with promoter element detection. The other is to estimate the gene networks of two distinct organisms by utilizing evolutionarily conserved relationships between genes in the two organisms. The former method tries to detect consensus motifs from a set of genes according to the network estimation, and then re-estimates the network with the detected motifs embedded in a prior probability. The latter method simultaneously estimates the gene networks of two distinct organisms from gene expression data together with the evolutionary information. The evolutionary information is defined according to the similarity of the protein sequences of the genes. Both methods use Bayesian networks as models for gene networks and estimate them by maximizing the posterior probability of the networks. The prior probabilities are constructed based on promoter element detection and evolutionary information, respectively. We evaluate these methods through Monte Carlo simulations and real data analyses. We thus confirm that our methods can estimate gene networks more accurately than previously proposed methods.
Key words: Gene networks, gene expression data, promoter detection, evolutionary information.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 357-373(2006)
This paper presents a new method for inferring protein networks from multiple types of genomic data. Based on a variant of kernel canonical correlation analysis, its originality lies in the formalization of the protein network inference problem as a supervised graph learning problem, and in the integration of heterogeneous genomic data within this framework. Promising results are presented on prediction of the protein network for the yeast Saccharomyces cerevisiae from four types of available data: gene expressions, protein interaction data from yeast two-hybrid systems, protein localization data, and phylogenetic profiles. It is shown that the proposed method outperforms other unsupervised network inference methods. The comprehensive prediction of a global protein network enables estimation of unknown functional relationships between proteins.
Key words: Kernel methods, canonical correlation analysis, graph inference, genomic data, protein network.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 375-403(2006)
A class of minimum divergence methods is proposed to remedy the defects of the maximum likelihood method, with statistical discussion including applications to PCA, ICA, and pattern recognition. A challenging problem in genome data analysis is discussed, and minimum divergence methods are applied to genome data, including SNPs, proteome, and microarray data, as an approach to solving the problem.
Key words: U-divergence, U-model, U-loss function, gene expression, robust, information geometry.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 405-423(2006)
Gene expression profile data are becoming increasingly important in clinical and biological research, and statistical analyses are applied from many perspectives. In such research, it is necessary to draw appropriate conclusions from the data, and at the same time avoid drawing inappropriate ones. In recent years, many supervised analyses of cell diagnosis of human diseases, e.g., cancers, have tended toward overstatement. This review paper introduces some of the pitfalls into which ordinary analysts of gene expression profile data have tended to fall. It also introduces our research on cancer diagnosis as an example of a supervised analysis that has carefully avoided such pitfalls. In reaction to the many overstated studies, conservative studies have been increasing; in recent years, however, some aggressive yet clever methods have appeared that approach the border of overstatement while remaining within it. We also discuss future tasks in this field based on recent ideas.
Key words: Gene expression data analysis, supervised learning, semi-supervised learning, supervised feature selection.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 425-444(2006)
The ever-increasing expansion of the Internet constantly raises the importance of analyzing network traffic and making use of the results. To analyze network traffic in this study, we used a Bayesian time series model, which decomposes network traffic into trend, weekly and daily cycle, and colored noise components. The model is represented as a linear-Gaussian state space model consisting of system equations for each component and an observation equation, in which the sum of the components yields the observed network traffic. State estimation by the Kalman filter is used to obtain each component. The variance and AR parameters of the model are estimated by maximizing the likelihood. Comparison among candidate models and selection of the components then proceed based on the Akaike Information Criterion (AIC). As an example of an analysis using the proposed method, we report the decomposition results for dial-up access traffic data at Koganei Campus and at the top network domain of Hosei University, Tokyo, which connects to SINET.
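The filtering recursion at the heart of this approach can be sketched as a generic linear-Gaussian Kalman filter for scalar observations; the matrices and noise variances below are illustrative placeholders (a second-order trend component), not the paper's full trend-plus-cycles model.

```python
import numpy as np

def kalman_filter(y, F, G, H, Q, R, x0, P0):
    # Linear-Gaussian state space model:
    #   x_t = F x_{t-1} + G v_t,  v_t ~ N(0, Q)   (system equation)
    #   y_t = H x_t + w_t,        w_t ~ N(0, R)   (observation equation)
    # Returns filtered state means and the log-likelihood, which can be
    # maximized over variance/AR parameters and compared across models
    # via AIC, as described in the abstract.
    x, P = x0.copy(), P0.copy()
    means, loglik = [], 0.0
    for yt in y:
        # prediction step
        x = F @ x
        P = F @ P @ F.T + G @ Q @ G.T
        # innovation (one-step-ahead prediction error) and its variance
        e = yt - H @ x
        S = H @ P @ H.T + R
        loglik += -0.5 * (np.log(2 * np.pi * S[0, 0]) + e[0] ** 2 / S[0, 0])
        # update step
        K = P @ H.T / S[0, 0]
        x = x + K @ e
        P = P - K @ H @ P
        means.append(x.copy())
    return np.array(means), loglik
```

With a trend transition F = [[2, -1], [1, 0]] this state tracks a smoothly varying level; the weekly and daily cycle components would enter as additional blocks of the state vector in the same recursion.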
Key words: Network traffic, decomposition, Bayesian time series model, state space model, Kalman filter.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 445-459(2006)
In this study, we apply duration analysis to specify a model of hair salon customers' behavior regarding their visits and then predict their revisit rates. We estimate the intensity function by fitting the Cox model to the hair salon data to examine which demographic, geographic, and/or behavioral variables influence each customer's behavior. Our target customers use three hair services: haircut, color, and permanent wave. These customers were divided into non-loyal and loyal customers using the RFM method. Estimation results showed that the intensity functions of the three services were specified by different models, and we found differences in intensity between non-loyal and loyal customers. As a next step, we apply the Cox model estimation results to calculate the revisit rate for each customer within 100 days of the last visit, based on the 95% interval prediction of his/her next revisit. The prediction of revisit rates tells us when and how many customers will visit a salon and which services they will receive. This information can help achieve an effective direct mailing strategy.
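Fitting a Cox proportional hazards model can be sketched for the one-covariate case by applying Newton's method to the partial log-likelihood; this is an illustrative sketch only (no tie correction or censoring-weight refinements such as a real analysis of salon data would require).

```python
import math

def cox_fit(times, events, x, iters=25):
    # One-covariate Cox model, hazard(t | x) = h0(t) * exp(beta * x).
    # beta is found by Newton's method on the partial log-likelihood:
    # at each event time, the gradient compares the failing subject's
    # covariate with the risk-set weighted average.
    beta = 0.0
    n = len(times)
    for _ in range(iters):
        grad = hess = 0.0
        for i in range(n):
            if not events[i]:
                continue  # censored observations contribute via risk sets only
            risk = [j for j in range(n) if times[j] >= times[i]]
            w = [math.exp(beta * x[j]) for j in risk]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
            grad += x[i] - s1 / s0
            hess -= s2 / s0 - (s1 / s0) ** 2
        beta -= grad / hess  # Newton step (hess < 0 on non-degenerate data)
    return beta
```

A positive fitted beta means the covariate shortens the time to the next visit, i.e., raises the revisit intensity; the fitted intensity then yields interval predictions of the next revisit time.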
Key words: Duration analysis, Cox model, hair salon.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 461-480(2006)
This article reviews statistical issues that arise both in evaluation methods and in model building for directly transmitted infectious diseases, with the aim of predicting and managing actual outbreaks. To perform model-based predictions, it is essential to detail a priori information on the transmission potential and the latent and infectious periods. Population dynamics is frequently applied to the spread of disease and, in particular, a counting process is assumed in order to derive estimators for the key parameters and to quantify the precision of the estimates and the variation in the data. We therefore describe methods for estimating basic (and effective) reproduction numbers from observational epidemic records and how to interpret the intrinsic assumptions. Moreover, we introduce a backcalculation method for slowly progressive diseases (e.g., HIV/AIDS and BSE), which enables us to estimate the total number of infected individuals and obtain short-term projections. As the majority of epidemic records violate the homogeneously mixing assumption, it is fruitful to select the model structure and statistical method flexibly, depending on the infectious disease of interest and the characteristics of the available data. In all methods, the intrinsic assumptions and their validity play the most important role in prediction and estimation.
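As an illustrative sketch (not the paper's estimator), a crude effective reproduction number can be computed from an incidence series and an assumed generation-interval distribution, under the homogeneous-mixing and complete-reporting assumptions the abstract cautions about:

```python
def effective_reproduction_number(incidence, gen_interval):
    # Crude estimator R_t = I_t / sum_s w_s I_{t-s}, where w is the
    # generation-interval distribution (w_1, ..., w_L summing to 1).
    # Each case at time t is attributed, in expectation, to cases at
    # earlier times weighted by how long ago infectors were infected.
    R = []
    for t in range(len(gen_interval), len(incidence)):
        denom = sum(w * incidence[t - s - 1]
                    for s, w in enumerate(gen_interval))
        R.append(incidence[t] / denom if denom > 0 else float("nan"))
    return R
```

For a flat epidemic curve this estimator returns R ≈ 1, and for incidence doubling every generation it returns R ≈ 2, matching the threshold interpretation of the reproduction number.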
Key words: Infectious diseases, mathematical model, basic reproduction number, maximum likelihood method, martingale method.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 481-510(2006)
This paper considers model-selection problems in terms of mean squared error of predictors on linear time series models with a nested structure. It investigates asymptotic relative efficiency of prediction mean squared error (PMSE) of two predictors using estimated parameters, and then proposes new model selection procedures using the generalized likelihood ratio (GLR) test, whose critical regions are decided by percentile points of the noncentral chi-squared variables with the same degrees of freedom and noncentrality parameter. It also explains Akaike's information criterion (AIC) in terms of our testing procedures.
Key words: Generalized likelihood ratio test, AIC, noncentral chi squared distribution, linear time series models, PMSE, multiple tests.
Proceedings of the Institute of Statistical Mathematics Vol.54, No.2, 511-523(2006)
This paper surveys research clarifying the notion of randomness from computational viewpoints. It first explains Martin-Löf randomness for infinite sequences. Then, after reviewing self-delimiting Kolmogorov complexity, it explains Kolmogorov randomness for finite sequences and its relation to Martin-Löf randomness. Finally, it explains recent applications of the notion of Kolmogorov complexity to the definition of a universal similarity distance.
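The universal similarity distance mentioned above is made practical by letting a real compressor stand in for the (uncomputable) Kolmogorov complexity. The sketch below follows the standard normalized compression distance formula, with zlib as an illustrative choice of compressor:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance:
    #   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    # where C(.) is the compressed length, approximating K(.).
    # Values near 0 indicate similar objects; values near 1, dissimilar.
    c = lambda b: len(zlib.compress(b, 9))
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because compressing x together with a similar y adds little beyond compressing x alone, the numerator stays small for related objects; this is the compression-based analogue of the information distance defined via Kolmogorov complexity.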
Key words: Computability theory, Kolmogorov complexity, Martin-Löf randomness of infinite sequences, Kolmogorov randomness of finite sequences, information distance.