Proceedings of the Institute of Statistical Mathematics Vol.50, No.1, 3-15(2002)

Prediction of Eukaryotic Gene Structures Based
on Combined Information of Sequence Homology
and Statistical Features

Osamu Gotoh
(Computational Biology Research Center (CBRC),
National Institute of Advanced Industrial Science and Technology (AIST))

Draft sequences of the complete human genome were made publicly available in February 2001. This followed the publication of virtually complete genomic sequences of yeast, nematode, fruit fly, and cress. Since many bacterial and archaeal genomes have also been sequenced, we now have basic information about at least one organism in each representative phylogenetic branch. The first step toward extracting useful information from these blueprints of life is to identify the exact structures of individual genes. Biased distributions of k-tuple oligonucleotides in coding and non-coding regions are useful in deriving "coding potentials". In addition, specific sequence patterns around transcriptional, translational, and splicing boundaries can be converted to numerical signals that help to delineate exonic and intronic regions. Several mathematical methods, including neural networks, discriminant analyses, and hidden Markov models, have been developed to assemble these lines of information into predicted genes and gene structures. Over the last decade, these intensive efforts have dramatically improved the quality of predictions based on such statistical information. However, the success rate for correctly predicting an exon is reported to be only about 75% at the nucleotide level, so there is still considerable room for improvement. We have taken a slightly different approach. In addition to the statistical information mentioned above, we incorporate sequence-homology information to locate more accurately the coding regions conserved between the target gene and one or more reference cDNA or protein sequences. The most likely gene structure is inferred by optimizing an objective score by means of a dynamic programming algorithm. To assess the performance of our method, we compared the predicted gene structures with the known structures of about 300 C. elegans genes. The results indicate that the percentage of correctly predicted exons exceeded 90%, which was significantly better than the percentages obtained by other methods.

Key words: Gene-structure prediction, exon-intron organization, splicing, genome informatics, sequence homology, alignment.

Proceedings of the Institute of Statistical Mathematics Vol.50, No.1, 17-31(2002)

Bayesian Hierarchical Model of Rate
of Molecular Evolution

Hirohisa Kishino
(Graduate School of Agriculture and Life Sciences, University of Tokyo)
Jeffrey L. Thorne
(Bioinformatics Research Center, North Carolina State University)

In the course of evolution, organisms leave traces of diversification and adaptation in their genomes in the form of the evolutionary rate and its changes. This paper first examines how inference of rate change plays an important role in evolutionary research, with a few examples from recent work: tb1 of domesticated maize and the selection pressure on it, diversified species in the Hawaiian silversword alliance and the accelerated rate in their regulatory genes, the process of viral adaptation to hosts, and the fate of duplicated genes. It then introduces our hierarchical model, which describes stochastic change of the evolutionary rate, and briefly evaluates its performance by simulation. Finally, it discusses the potential of hierarchical models in genome database analysis.

Key words: Stochastic change of evolutionary rate, hierarchical model, Markov chain Monte Carlo (MCMC), detection of correlated evolution, models for multiple genes, genome database analysis.

Proceedings of the Institute of Statistical Mathematics Vol.50, No.1, 33-44(2002)

Assessing the Uncertainty of the Cluster Analysis
Using the Bootstrap Resampling

Hidetoshi Shimodaira
(The Institute of Statistical Mathematics)

This paper reviews a method of calculating the p-value for assessing the uncertainty of cluster analysis. Because the dendrogram and the clusters derived from it are subject to change under sampling fluctuation of the data or of the characters, the reliability of the result is expressed as a p-value between 0 and 1. The method is applicable to a wide class of problems and is not limited to cluster analysis, since it uses only bootstrap resampling and a 0/1-valued function indicating whether the data support the hypothesis. The p-value is calculated from the approximately unbiased test of the region in the parameter space that represents the hypothesis. The method is based on the theory of "signed distance" and "curvature" by Efron (1985) and Efron and Tibshirani (1998). The key idea for converting the theory into a practical algorithm is the multiscale bootstrap resampling of Shimodaira (2000, 2002). The method is illustrated with a phylogenetic analysis that infers evolutionary history from DNA sequences.
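
As a rough illustration of the two ingredients named in the abstract (bootstrap resampling and a 0/1-valued function), the following Python sketch estimates the ordinary bootstrap probability that the data support a hypothesis. It is not the approximately unbiased test: the multiscale method would, in addition, repeat the resampling at several replicate sizes and correct the resulting probabilities. The function names, the `sample_size` parameter, and the toy hypothesis "the sample mean is positive" are assumptions made for this example only.

```python
import random


def bootstrap_probability(data, supports, n_replicates=1000,
                          sample_size=None, seed=0):
    """Fraction of bootstrap replicates for which supports(replicate) is true.

    data        -- list of observations (e.g. sites or characters)
    supports    -- hypothetical 0/1-valued function of a resampled data set
    sample_size -- replicate size; defaults to len(data). Varying it is the
                   starting point of multiscale resampling (not done here).
    """
    rng = random.Random(seed)
    n = len(data) if sample_size is None else sample_size
    hits = 0
    for _ in range(n_replicates):
        # Resample the observations with replacement.
        replicate = rng.choices(data, k=n)
        if supports(replicate):
            hits += 1
    return hits / n_replicates


def mean_is_positive(sample):
    """Toy hypothesis used only for illustration."""
    return sum(sample) / len(sample) > 0


if __name__ == "__main__":
    toy = [0.8, -0.2, 1.1, 0.4, -0.5, 0.9, 0.3, 0.7]
    print(bootstrap_probability(toy, mean_is_positive))
```

In a phylogenetic setting, `data` would hold alignment columns and `supports` would re-estimate the tree from the replicate and report whether a given cluster appears in it.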

Key words: Cluster analysis, bootstrap, multiscale bootstrap, approximately unbiased test, molecular phylogeny.

Proceedings of the Institute of Statistical Mathematics Vol.50, No.1, 45-68(2002)

Application of Molecular Phylogenetic Inference
and Associated Problems : Illustrative Data Analysis
on Early Eukaryotic Evolution

Tetsuo Hashimoto
(The Institute of Statistical Mathematics;
Department of Biosystems Science,
The Graduate University for Advanced Studies)
Nobuko Arisue
(Department of Biosystems Science, The Graduate University for Advanced Studies)
Masami Hasegawa
(The Institute of Statistical Mathematics;
Department of Biosystems Science,
The Graduate University for Advanced Studies)

The maximum likelihood method of molecular phylogeny, which infers an evolutionary tree from sequence data of DNA, RNA and proteins, is briefly described and applied to a data analysis of early eukaryotic evolution. The possibility of a long branch attraction artefact is introduced. This artefact has recently come to be regarded as one of the most serious causes of misleading inferred trees. To overcome this problem, evolutionary rate heterogeneity across sites is modeled with a Γ-distribution. With this approach, the placement of microsporidia at the base of the eukaryotic tree in several previous analyses is shown to be an artefact caused by long branch attraction. The extremely high evolutionary rate of microsporidia in the molecules used in previous analyses may have been a major cause of the artefact. Re-analyses of the currently available molecular data with rate heterogeneity across sites, together with a combined analysis of these data, clearly demonstrate that microsporidia are not early branching eukaryotes but are closely related to fungi.
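
For reference, one widely used way to write rate heterogeneity across sites is sketched below. The abstract does not state its parameterization, so the mean-one form, the shape parameter \alpha, and the symbols D_i (data at site i) and T (tree with branch lengths) are assumptions of this sketch.

```latex
% Relative rate r at a site, gamma-distributed with shape \alpha and mean 1
f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}\, r^{\alpha-1} e^{-\alpha r},
\qquad r > 0, \qquad \mathrm{E}[r] = 1, \qquad \mathrm{Var}[r] = \frac{1}{\alpha}

% Site likelihood averaged over the unknown rate
L_i(T) = \int_0^{\infty} f(r)\, P(D_i \mid r, T)\, dr
```

A small \alpha corresponds to strong heterogeneity, with most sites evolving slowly and a few evolving very fast, while a large \alpha approaches a single rate shared by all sites.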

Key words: Maximum likelihood method of molecular phylogeny, long branch attraction, rate heterogeneity across sites, Γ-distribution, early eukaryotic evolution, microsporidia.

Proceedings of the Institute of Statistical Mathematics Vol.50, No.1, 69-85(2002)

Model Misspecification in Molecular Phylogenetic Inference
as Illustrated in Evolutionary Study of Vertebrates

Ying Cao and Masami Hasegawa
(The Institute of Statistical Mathematics)

Molecular phylogenetic inference depends on the assumed model that describes the substitution process of nucleotides or amino acids during evolution, and model misspecification can give a misleading estimate of molecular phylogeny.

In the course of our study of vertebrate phylogeny, we have encountered several putatively misleading trees that are probably due to misspecification of substitution models; these examples are presented in this article.

Key words: Molecular phylogeny, substitution model, maximum likelihood method, model misspecification, evolution of vertebrates.
