Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 3-18

## Theory and Implementation of Multiscale Bootstrap for Regions with Nonsmooth Boundaries

(Department of Mathematical and Computing Sciences, Tokyo Institute of Technology)

The bootstrap method has been widely used to compute confidence levels for the output of data analysis. This confidence level is called the bootstrap probability, and it is known to be biased when interpreted as a $p$-value in hypothesis testing. The multiscale bootstrap method was developed to compute confidence levels with high accuracy by correcting this bias. Previously, the boundary of the hypothesis region was assumed to be smooth, but the method has now been generalized to nonsmooth cases. The theory and implementation of the algorithm are explained using real examples from molecular phylogenetic inference and hierarchical clustering.

Key words: Approximately unbiased tests, bootstrap probability, bias correction, scaling-law, phylogenetic inference.
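
As a rough illustration of the bias correction mentioned in this abstract, the following sketch implements the classical smooth-boundary form of the scaling law, not the nonsmooth generalization that the paper develops. It assumes that bootstrap probabilities have already been computed at several scales $\sigma^2 = n/n'$ (with $n'$ the bootstrap sample size), fits $\sigma\,\Phi^{-1}(1 - \mathrm{BP}_{\sigma^2}) \approx v + c\,\sigma^2$ by unweighted least squares, and extrapolates to $\sigma^2 = -1$. The function name `au_pvalue` and the unweighted fit are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: provides Phi and its inverse


def au_pvalue(sigma2_values, bootstrap_probs):
    """Fit psi(sigma^2) = v + c * sigma^2 and return (AU p-value, v, c).

    sigma2_values: scales sigma^2 = n / n' used for the bootstrap replicates.
    bootstrap_probs: bootstrap probability of the hypothesis at each scale.
    Assumes the smooth-boundary (linear) model for psi.
    """
    # Normalized z-value at each scale: psi = sigma * Phi^{-1}(1 - BP).
    psi = [sqrt(s2) * STD_NORMAL.inv_cdf(1.0 - bp)
           for s2, bp in zip(sigma2_values, bootstrap_probs)]

    # Ordinary (unweighted) least squares for psi = v + c * sigma^2.
    n = len(psi)
    mean_x = sum(sigma2_values) / n
    mean_y = sum(psi) / n
    sxx = sum((x - mean_x) ** 2 for x in sigma2_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sigma2_values, psi))
    c = sxy / sxx
    v = mean_y - c * mean_x

    # Extrapolating the fitted line to sigma^2 = -1 gives psi(-1) = v - c.
    return 1.0 - STD_NORMAL.cdf(v - c), v, c
```

In this parameterization $v$ plays the role of a signed distance to the boundary and $c$ that of a curvature term. The ordinary bootstrap probability corresponds to $1 - \Phi(v + c)$, so the two values differ whenever $c \neq 0$.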

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 19-35

## Power and Pitfalls of Phylogenomics: Lessons from a Genome-scale Analysis with Respect to the Root of the Eutherian Tree

(Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology)
(Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology)
(School of Life Sciences, Fudan University)

In the post-genomic era, genome-scale approaches to phylogenetic inference (phylogenomics) are being applied extensively to overcome sampling errors. Sampling error vanishes as the number of genes provided for the analysis increases, but the fully resolved tree can still be wrong if the phylogenetic inference is biased (systematic error). In the present study, we collected 2,789 genes (1 Mbp) from 10 mammalian genome sequences by screening whole-genome data, and performed an extensive maximum likelihood (ML) analysis to determine the root of the eutherian tree. The conventional concatenate analysis of nucleotide sequences strongly supports a misleading monophyly of Afrotheria (e.g., elephant) and Xenarthra (e.g., armadillo). However, this tree is not supported by a “separate model” that takes into account the different tempos and modes of evolution among genes; instead, the basal Afrotheria tree is favored. This analysis demonstrates that the separate model, rather than the concatenate model, should be used for phylogenetic inference from genome-scale data.

Key words: Phylogenomics, maximum likelihood, separate model, mammalian phylogeny.

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 37-54

## Bayesian Divergence Time Estimation Using Codon Model

(Professional Programme for Agricultural Bioinformatics, University of Tokyo)
(Laboratory of Biometry and Bioinformatics, University of Tokyo)
(Bioinformatics Research Center, North Carolina State University)

Because evolutionary rates of molecular data can change over time, it is unreasonable to assume a molecular clock when estimating divergence times. Changes in mutation rate, effective population size and selective pressure may cause changes in either or both of the rates of synonymous and nonsynonymous substitutions. Recently, we developed a new Bayesian method to estimate divergence times and absolute rates of synonymous and nonsynonymous substitutions. Instead of assuming a molecular clock, we assume that both rates change over time following a log-normal process. By adopting a Markov chain Monte Carlo procedure, we can estimate the posterior probabilities of divergence times, synonymous and nonsynonymous rates, and rate variation parameters. This paper discusses the extension of our method to the analysis of multilocus sequence data, and presents an analysis of mammalian mitochondrial protein-coding genes.

Key words: Codon model, synonymous substitution, nonsynonymous substitution, molecular clock, divergence time.
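
To make the phrase "log-normal process" in this abstract concrete, the sketch below draws a descendant rate from an ancestral rate along one branch. The abstract does not give the parameterization, so the form used here is an assumption: the log rate at the end of the branch is normal with variance $\nu t$ and mean shifted by $-\nu t/2$, so that the expected descendant rate equals the ancestral rate. The name `sample_child_rate` and the parameter `nu` are hypothetical.

```python
import math
import random


def sample_child_rate(parent_rate, branch_time, nu, rng):
    """Draw a descendant rate given the ancestral rate on one branch.

    Assumed model: log(child) ~ Normal(log(parent) - nu * t / 2, nu * t),
    which keeps the expected child rate equal to the parent rate.
    """
    variance = nu * branch_time
    mean_log = math.log(parent_rate) - variance / 2.0
    return math.exp(rng.gauss(mean_log, math.sqrt(variance)))
```

Under this assumption, the same function could be applied separately to the synonymous and the nonsynonymous rate, each with its own `nu`. In a full analysis such draws would appear inside the prior of the Markov chain Monte Carlo procedure rather than as a forward simulation.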

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 55-66

## Modeling Evolution of Genotypes and Phenotypes

(Center of Medical Information Science, Kochi University)
(Graduate School of Agricultural and Life Sciences, University of Tokyo)
(Graduate School of Agricultural and Life Sciences, University of Tokyo)

Molecular evolution drives the evolution of phenotypes. The genetic diversity of a population depends on the balance between mutation and selection. This paper introduces our recent studies that aim to bridge molecular evolution and the adaptive evolution of phenotypes. First, we analyze the molecular evolution of HIV env sequences within hosts. Using a coalescent model of molecular evolution, we show a negative correlation between the effective population size and the evolutionary rate. This negative correlation is consistent with a nearly neutral model of molecular evolution. Using a model of sequence-structure fitness (SSF), we show that the rate of sequence evolution is correlated with the change in the micro-structure of the V3 loop after infection. Finally, we develop a model that predicts the binding ability of a protein complex in terms of the ratio of the SSF of the complexed protein to that of the free-state protein. The dynamics of the binding ability between influenza HA sequences and four types of antibodies imply a cost of adaptation.

Key words: Molecular evolution, structural evolution, viral evolution, coalescent process.

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 67-79

## Phylogenetic Analysis Using Torus Self-Organizing Map

(Nagahama Institute of Bio-science and Technology)

To clarify interspecies differences in genome sequences, a Self-Organizing Map (SOM) was used to classify sequences by species and phylotype family. To clarify the relation of clusters to species and phylotype families, we analyzed the position of each cluster on the SOM. We employed the torus map algorithm, which removes the dependence on position on the map, and compared its results with those obtained by the plane map algorithm. We performed SOM analysis on genome sequences of influenza virus (11,585 2-kb genomic fragments). The numerical results suggest that the torus map clarifies the relations between clusters and characterizes the features of the phylotype families after learning.

Key words: SOM, Self-Organizing Map, bioinformatics, genome informatics, oligonucleotide frequency.
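
The difference between the torus map and the plane map in this abstract comes down to how distances between map nodes are measured. The following minimal sketch assumes a rectangular grid whose opposite edges are joined, and the function names are hypothetical. On the torus every node has the same neighborhood structure, whereas on the plane nodes near the border have fewer neighbors.

```python
import math


def torus_distance(a, b, width, height):
    """Euclidean distance between grid nodes a and b on a wrapped map."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # On a torus the shorter way around each axis is used.
    dx = min(dx, width - dx)
    dy = min(dy, height - dy)
    return math.hypot(dx, dy)


def plane_distance(a, b):
    """Euclidean distance between grid nodes a and b on a flat map."""
    return math.hypot(a[0] - b[0], a[1] - b[1])
```

In a SOM update, such a distance would feed the neighborhood function that decides how strongly nodes near the best-matching unit are pulled toward an input vector.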

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 81-99

## Origin and Phylogenetic Evolution of Pinnipedia

(School of Life Sciences, Fudan University)
(The Department of Geology and Paleontology, National Museum of Nature and Science)
(School of Life Sciences, Fudan University)
(The Institute of Statistical Mathematics)

Pinnipedia is a clade of Carnivora (Mammalia) with paddle-like limbs acquired through aquatic adaptation. The origin of Pinnipedia and the phylogenetic relationships within it remain controversial. This article outlines the history of the phylogenetic studies of Pinnipedia and discusses phylogenetic problems from paleontological and geohistorical points of view. Additionally, we emphasize the importance of model selection in inferring phylogenetic trees.

Key words: Pinnipedia, Otariidae, Phocidae, Odobenidae, phylogenetic inference, divergence time estimation.

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 101-116

## Substitution Rate of Mitochondrial DNA of Primates

(The Institute of Statistical Mathematics)
(Faculty of Science, University of Antananarivo)
(Deceased August 10, 2004)
(School of Life Sciences, Fudan University)

We determined a new complete mitochondrial DNA (mtDNA) sequence of the sifaka (Propithecus verreauxii) from a fecal sample and aligned it with the sequences of 14 primate mtDNAs. Using protein-encoding gene sequences, we conducted phylogenetic analyses at the amino acid level. The branch lengths of Anthropoidea were clearly much longer than those of prosimians (Strepsirrhini and Tarsier).

The molecular clock hypothesis can be tested against the non-clock hypothesis by a likelihood ratio test. The clock model was rejected, and we detected a significant acceleration of the evolutionary rate at the amino acid level of mt-proteins in the anthropoid lineage after its divergence from the tarsier.

There are two possible explanations for this observation: (i) the mutation rate is higher in Anthropoidea than in prosimians, and accordingly both nonsynonymous and synonymous rate accelerations occurred in Anthropoidea, or (ii) the mutation rate remained unchanged and only the amino acid substitution rate (nonsynonymous rate) accelerated in Anthropoidea.

To distinguish between the two possibilities, we estimated the nonsynonymous/synonymous rate ratio $\omega = dN/dS$ for each branch, where $dN$ and $dS$ are the numbers of nonsynonymous and synonymous substitutions per site. An $\omega$ ratio in excess of 1 has been regarded as an important indicator of positive selection. Using the CodeML program in the PAML package with the F61 model of codon frequencies, we applied the codon-based likelihood method, which allows for variable $\omega$ ratios among lineages, to the data from the 15 primates given their evolutionary relationships. We could not find branches with $\omega$ ratios $>1$, but found that the $\omega$ ratios of Anthropoidea estimated by the free ratio model are more than twice those of prosimians.

We showed that the amino acid substitution rate of mt-proteins accelerated in Anthropoidea relative to prosimians, and that this is largely due to the increase in the nonsynonymous/synonymous rate ratio in Anthropoidea, although the increase of mutation rate of mtDNA in Anthropoidea could not be ruled out. This can be explained either by relaxation of selective constraints operating on the mt-proteins in Anthropoidea, or by adaptive evolution in Anthropoidea.

Key words: Primate, mitochondrial DNA, substitution rate, nonsynonymous substitution, synonymous substitution, adaptive evolution.
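
The likelihood ratio test of the molecular clock mentioned in this abstract can be sketched as follows. The sketch assumes a bifurcating tree and the same substitution model under both hypotheses, so that the non-clock model has $2s - 3$ branch lengths and the clock model has $s - 1$ node ages for $s$ taxa, giving $s - 2$ degrees of freedom. The function names are hypothetical, and the chi-square tail probability is computed with the standard recurrence for the regularized upper incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Starting values: Q(1/2, z) = erfc(sqrt(z)) and Q(1, z) = exp(-z).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def clock_lrt(lnl_clock, lnl_nonclock, n_taxa):
    """Return (test statistic, degrees of freedom, p-value).

    Assumes df = n_taxa - 2, i.e. (2s - 3) branch lengths versus
    (s - 1) node ages on a bifurcating tree.
    """
    stat = 2.0 * (lnl_nonclock - lnl_clock)
    df = n_taxa - 2
    return stat, df, chi2_sf(stat, df)
```

Under these assumptions, a data set of 15 taxa such as the one described here would give 13 degrees of freedom. The log-likelihoods themselves would come from a maximum likelihood program and are not reproduced in this listing.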

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 117-131

## The Debates on Toothed Whales Monophyly: Reassessment of the Position of Sperm Whales by Using SINE Insertion Analysis

(Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology)
(Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology)
(Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology)

Morphological data have indicated that toothed whales form a monophyletic group. However, research published in the last several years has made the issue of the monophyly or paraphyly of toothed whales a subject of debate. Our group previously characterized three independent loci in which SINE insertions were shared among dolphins and sperm whales, thus supporting the traditional, morphologically based hypothesis of toothed whale monophyly. However, there are few additional molecular data supporting this topology. Thus, this issue is not yet definitively resolved. When the phylogeny of rapidly radiated taxa is examined using the SINE method, it is important to consider the ascertainment bias that arises when choosing a particular taxon for SINE loci screening. To overcome this methodological problem specific to the SINE method, we examined all possible topologies among sperm whales, dolphins and baleen whales by extensively screening SINE loci from species of all three lineages. We characterized nine independent SINE loci from the genomes of sperm whales and dolphins, all of which cluster sperm whales and dolphins but exclude baleen whales. Furthermore, we characterized ten independent loci from baleen whales, all of which were amplified in a common ancestor of these whales. From these observations, we conclude that toothed whales form a monophyletic group and that no ancestral SINE polymorphisms hinder their phylogenetic assignment despite the short divergence times of the major lineages of extant whales during evolution.

Key words: SINE, Odontoceti, sperm whale, genome library screening.

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 133-144

## Phylogenetic Analysis Based on Complete Sequences of Mitochondrial DNA

(Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology)
(Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology)
(School of Life Sciences, Fudan University)

This article discusses the phylogenetic position of the order Sphenisciformes within Aves in light of a recent maximum likelihood analysis of the mitochondrial genomes of penguins and their relatives. This analysis suggests that ciconiiform birds are new candidates for the closest extant relatives of penguins (previously proposed candidates were gaviiform, podicipediform, or procellariiform birds). In addition, we elaborate on the problems that phylogenetic analyses face in identifying the species most closely related to extant penguins.

Key words: Phylogenetic inference, maximum likelihood analysis, mitochondrial DNA, penguins, storks.

Proceedings of the Institute of Statistical Mathematics Vol. 56, No. 1, 145-164

## Phylogenetic Inference Based on Combined Analysis of Multiple Genes —Illustrative Data Analysis of Higher-order Phylogeny of Eukaryota—

(Laboratory of Microbial Molecular Evolution, Institute of Biological Sciences, University of Tsukuba)
(Division of Global Environment and Biological Sciences, Center for Computational Sciences, University of Tsukuba)
(Department of Molecular Protozoology, Research Institute for Microbial Diseases, Osaka University)
(Laboratory of Microbial Molecular Evolution, Institute of Biological Sciences, University of Tsukuba)
(Laboratory of Microbial Molecular Evolution, Institute of Biological Sciences, University of Tsukuba)
(Division of Global Environment and Biological Sciences, Center for Computational Sciences, University of Tsukuba)

A maximum likelihood method for phylogenetic inference based on combined analysis of multiple genes is briefly introduced and applied to data analysis of higher-order eukaryotic phylogeny. Three models of branch length estimation are considered, assuming that all genes (or partitions of the full data set) have the same branch lengths (concatenate model), that each gene (partition) has a separate set of branch lengths (separate model), or that branch lengths are proportional among genes (partitions) (proportional model). Fifty-three ribosomal protein genes from 29 eukaryotic species were used for the analysis. The data set consisted of 5,842 amino acid positions. Six models, differing in the method for estimating branch lengths and in the partitioning of the data set, were compared by the Akaike Information Criterion (AIC). Comparison of the AIC values for the maximum likelihood tree demonstrated that a separate model with a partition between large- and small-subunit ribosomal proteins had the lowest AIC value, while a separate model with a partition among individual genes had the highest AIC value, suggesting that the former model best approximated the data set and the latter model was over-parameterized. The results also suggested that the tempo and mode of sequence evolution were relatively uniform across different ribosomal protein genes. Since no incongruence was observed among the six models in the selection of alternative trees, the present analysis was considered to be robust.

Key words: Phylogenetic inference, maximum likelihood method, combined analysis of multiple genes, eukaryotes, higher-order phylogeny, ribosomal protein.
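
The over-parameterization noted in this abstract can be made concrete by counting branch-length parameters under the three models. The sketch below assumes an unrooted bifurcating tree with $2s - 3$ branches for $s$ species and counts only branch-length parameters, with substitution-model parameters left out. The function names are hypothetical.

```python
def n_branches(n_taxa):
    """Number of branches in an unrooted bifurcating tree."""
    return 2 * n_taxa - 3


def branch_length_params(model, n_taxa, n_partitions):
    """Count branch-length parameters for a partitioned analysis."""
    b = n_branches(n_taxa)
    if model == "concatenate":
        return b                       # one shared set of branch lengths
    if model == "separate":
        return b * n_partitions        # one full set per partition
    if model == "proportional":
        return b + (n_partitions - 1)  # shared set plus rate multipliers
    raise ValueError(f"unknown model: {model}")


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 ln L + 2 k."""
    return -2.0 * log_likelihood + 2.0 * n_params
```

With 29 species, a separate model over 53 individual genes carries 2,915 branch-length parameters under this count, against 110 for a separate model with two partitions. The penalty term of the AIC therefore grows sharply unless the gain in log-likelihood keeps pace.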