Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 239-250 (2012)

## Statistical Methods for Inferring Network Structure

(Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo; PRESTO, Japan Science and Technology Agency)

Networks have been studied intensively over the past 15 years, in a field referred to as network science or complex networks research. The term “network” in this context is equivalent to the term “graph” in mathematical graph theory. Data on many real graphs are used in various fields. Independently of researchers in social network analysis, researchers in statistical physics, applied mathematics, and web engineering, in particular, began studying networks. Researchers from many other fields are now engaged in network analysis as well. Because network science deals with real network data, the statistical sciences will find various applications in network research. This review paper briefly surveys two maximum likelihood methods for inferring network structure from data. First, it explains the maximum likelihood method proposed by Newman and Leicht. They assumed that the nodes in an observed graph are partitioned into groups and that nodes in the same group tend to have similar connectivity to other nodes. They then developed an EM algorithm to estimate the partition of the nodes and the parameter values that determine the probability with which nodes in given groups are connected to each other. Second, it briefly introduces the maximum likelihood method based on a maximum entropy model. Although this is a classical approach, the method has been applied to the analysis of neural activity data.

Key words: Graph, network, community structure, EM algorithm, Ising model.
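
As a rough illustration of the EM scheme summarized in this abstract, the following sketch alternates between estimating soft group memberships and re-estimating the group parameters. It is a minimal sketch, not the authors' implementation: the function name `newman_leicht_em`, the random initialization, the fixed iteration count, and the small constant `eps` that guards against taking the logarithm of zero are all assumptions made for illustration. Here `pi[r]` is the fraction of nodes in group `r`, `theta[r][j]` is the probability that a link from a node in group `r` points to node `j`, and `q[i][r]` is the posterior probability that node `i` belongs to group `r`.

```python
import math
import random


def newman_leicht_em(adj, n_groups, n_iter=100, seed=0, eps=1e-12):
    """Fit a node mixture model by EM.

    adj[i][j] is 1 if there is a link from node i to node j, else 0.
    Returns (pi, theta, q) as described in the accompanying text.
    """
    rng = random.Random(seed)
    n = len(adj)
    # Random soft assignment to break the symmetry between groups.
    q = []
    for _ in range(n):
        row = [rng.random() + 0.1 for _ in range(n_groups)]
        total = sum(row)
        q.append([v / total for v in row])
    degree = [sum(row) for row in adj]
    pi, theta = [], []
    for _ in range(n_iter):
        # M-step: update group sizes and link-target probabilities.
        pi = [sum(q[i][r] for i in range(n)) / n for r in range(n_groups)]
        theta = []
        for r in range(n_groups):
            denom = sum(degree[i] * q[i][r] for i in range(n)) + eps
            theta.append(
                [sum(adj[i][j] * q[i][r] for i in range(n)) / denom
                 for j in range(n)]
            )
        # E-step: posterior membership of each node, computed in log space.
        for i in range(n):
            log_w = [
                math.log(pi[r] + eps)
                + sum(math.log(theta[r][j] + eps)
                      for j in range(n) if adj[i][j])
                for r in range(n_groups)
            ]
            top = max(log_w)
            w = [math.exp(v - top) for v in log_w]
            total = sum(w)
            q[i] = [v / total for v in w]
    return pi, theta, q
```

EM converges only to a local maximum of the likelihood, so in practice one would typically rerun the procedure from several random seeds and keep the best fit.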

Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 251-262 (2012)

## Recent Developments of Evolutionary Game Theory in Finite Populations

(Department of Evolutionary Studies of Biosystems, School of Advanced Sciences, The Graduate University for Advanced Studies; PRESTO, Japan Science and Technology Agency)

Traditional evolutionary game theory assumes infinitely large populations to derive deterministic game dynamics. Following the pioneering work of Nowak et al. (2004), evolutionary game theory in finite populations has been developed. Genetic drift plays a major role in this new theory, and the advantage of a strategy is measured by its fixation probability. Stochastic game dynamics yield new predictions. For example, when there are multiple ESSs (evolutionarily stable strategies) in the game, stochastic game dynamics predict the one that is most likely to occur.

This paper reviews the basic model by Nowak et al. (2004) and discusses its various theoretical extensions.

Key words: Evolutionary game, finite population, fixation probability, ESS, 1/3-rule.
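
To make the notion of fixation probability concrete, the sketch below computes the probability that a single A mutant takes over a population of B players in a frequency-dependent Moran process of the kind studied by Nowak et al. (2004). The function name, the argument names, and the payoff layout (A against A earns `a`, A against B earns `b`, B against A earns `c`, B against B earns `d`) are assumptions made for illustration, and payoffs are assumed positive so that every fitness value stays positive.

```python
def fixation_probability(a, b, c, d, n, w):
    """Fixation probability of one A mutant among n - 1 B players.

    w in [0, 1] is the intensity of selection; w = 0 is neutral drift.
    Requires n >= 2 and payoffs that keep all fitness values positive.
    """
    total = 1.0    # leading 1 in the denominator of the formula
    product = 1.0  # running product of g_i / f_i
    for i in range(1, n):
        # Average payoffs when i players use A (self-interaction excluded).
        payoff_a = (a * (i - 1) + b * (n - i)) / (n - 1)
        payoff_b = (c * i + d * (n - i - 1)) / (n - 1)
        f_i = 1 - w + w * payoff_a  # fitness of an A player
        g_i = 1 - w + w * payoff_b  # fitness of a B player
        product *= g_i / f_i
        total += product
    return 1.0 / total


def unstable_equilibrium(a, b, c, d):
    """Frequency of A at the interior equilibrium of a coordination game."""
    return (d - b) / (a - b - c + d)
```

Under neutrality the function returns 1/n. For a coordination game (`a > c` and `d > b`) with weak selection and a large population, the 1/3-rule listed in the key words says that selection favors the fixation of A, in the sense that the fixation probability exceeds 1/n, when `unstable_equilibrium` is below 1/3.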

Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 263-278 (2012)

## Measuring \alpha Diversity and the Theory of Random Strings

(Bioinformatics Center, Institute for Chemical Research, Kyoto University)

The diversity of species or individuals belonging to a biological community in a given area is termed \alpha diversity. In this paper, we first classify the various methods for measuring \alpha diversity and review them historically. We then describe a recent study by the author and his coworker in which a method for measuring \alpha diversity at the sequence level was proposed by developing a theory of probability on a set of strings equipped with the Levenshtein distance. We outline the theoretical basis of the proposed method: a consensus sequence is taken as the measure of location and the mean Levenshtein distance from a consensus sequence as the measure of dispersion, in place of the usual mean and variance, respectively, and asymptotic theory is developed for them. Lastly, we describe an application of our method to microbial communities, whose \alpha diversity is more difficult to measure than that of animals and plants.

Key words: \alpha diversity, 16S ribosomal RNA gene sequence, random string, consensus sequence, Levenshtein distance, hierarchical variance.
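
The two summary statistics named in this abstract can be illustrated on a toy sample of strings. The sketch below is hedged in one important respect: the abstract does not define how the consensus sequence is constructed, so the helper `medoid_string` simply picks the sample member with the smallest total Levenshtein distance to the others, as a stand-in and not as the paper's definition. All function names are hypothetical.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion from s
                            curr[j - 1] + 1,      # insertion into s
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def medoid_string(sample):
    """Sample member minimizing the total distance to the whole sample.

    Assumption: used here as a simple stand-in for a consensus sequence.
    """
    return min(sample, key=lambda c: sum(levenshtein(c, s) for s in sample))


def mean_distance(center, sample):
    """Mean Levenshtein distance from center: a measure of dispersion."""
    return sum(levenshtein(center, s) for s in sample) / len(sample)
```

For example, `mean_distance(medoid_string(reads), reads)` gives one dispersion value for a list of sequences `reads`, playing the role that the variance plays for numerical data.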

Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 279-288 (2012)

## Algebraic Methods for Molecular Phylogenetics

(Department of Statistics, University of Kentucky, Lexington, KY 4050, U.S.A.)

Recently there has been much work and collaboration between modern biology and higher mathematics. A number of important connections have been established between computational biology and the emerging field of “algebraic statistics”, which applies tools from combinatorics, computational algebra, and polyhedral geometry to statistical computational problems and statistical modeling. Phylogenetics has provided an abundant source of applications for algebraic statistics, with research areas including phylogenetic invariants, the geometry of tree space, and analysis of phylogenetic reconstruction. The purpose of this review is to provide the reader with an introduction to this subject, a noncomprehensive guide to further reading, and a collection of more detailed case studies that provide examples of how algebraic methods have been used in the context of molecular phylogeny.

Key words: Algebraic statistics, phylogenetics.

Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 289-303 (2012)

## Assessment of the Performance of Phylogenetic Inference Based on Simulated Protein-coding Sequences with Significant Compositional Heterogeneity

(Graduate School of Life and Environmental Sciences, University of Tsukuba)
(Graduate School of Life and Environmental Sciences, University of Tsukuba)

Phylogenetic analyses of molecular sequence data with commonly used ‘homogeneous’ substitution models assume the stationarity of nucleotide or amino acid composition across the tree, but real-world data sometimes violate this assumption. This report assesses how seriously the violation of compositional stationarity affects the performance of homogeneous model-based phylogenetic inference by using simulated protein-coding sequences. To estimate parameters for sequence simulation, we prepared a real-world sequence data set of seven plastid genome-encoded protein genes in which the adenine plus thymine content (AT content) at all of the 1st, 2nd, and 3rd codon positions is extraordinarily biased between species, and subjected it to a maximum-likelihood analysis on a given model tree. The analysis was carried out assuming a ‘non-homogeneous’ codon substitution model that can accommodate the heterogeneity of nucleotide composition at the three codon positions across the tree. Using the estimated parameters and the model tree, we simulated protein-coding sequence data with compositional heterogeneity between species by the Monte Carlo method. Finally, we tested the performance of homogeneous model-based phylogenetic analyses at both the nucleotide and amino acid sequence levels in recovering the model (‘correct’) tree. The results clearly demonstrated that both analyses mostly failed to recover the correct tree and instead strongly favored artifactual trees attracted by the parallel compositional convergence between distantly related species. This is de facto the first simulation study to assess the appropriateness of applying homogeneous substitution models in phylogenetic analyses to protein-coding sequence data containing significant compositional heterogeneity.

Key words: Phylogenetic inference, maximum-likelihood analysis, compositional heterogeneity, codon substitution model, simulation, model misspecification.

Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 305-316 (2012)

## Spatial Distribution of Selection Pressure on a Virus Protein Deriving Its Adaptation to the Environment

(Center of Medical Information Science, Kochi University)
(Graduate School of Agriculture and Life Science, University of Tokyo)

Proteins adapt to environments by gaining and/or obtaining functions. Adaptation to an environment is achieved through amino acid substitutions, which result from selection acting on mutations in a protein-coding gene. Hence, mutations in a protein-coding gene are under the selection pressure of the environment, and the strength and character of this selection pressure may vary among the temporal domains of an evolutionary process. Revealing the spatio-temporal fluctuation of the selection pressure therefore improves our knowledge of the adaptive evolution of the protein. We developed a method for detecting the spatial fluctuation of the selection pressure on a protein based on a hierarchical Bayesian model. The prior distribution for the spatial aggregation of selection pressure is described by the Ising model, whose theoretical framework was established in the physics of magnetic materials. The hyperparameters that define the strength and range of the spatial clustering are estimated by maximizing the marginal likelihood. Because the prior distribution is hard to normalize, we estimated the log marginal likelihood by thermodynamic integration. We applied the method to detect the spatial fluctuation of the selection pressure on the influenza hemagglutinin protein.

Key words: Molecular evolution, selection pressure, spatial distribution, hierarchical Bayesian model, Ising model.

Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 317-325 (2012)

## A Comparison of Statistical Power of EHH-based Methods for Detecting a Signature of Recent Positive Selection

(Molecular and Genetic Epidemiology, Faculty of Medicine, University of Tsukuba)

Extended haplotype homozygosity (EHH)-based methods such as the relative EHH (REHH) and integrated haplotype score (iHS) tests have been used for detecting a signature of recent positive selection in human populations. In this study, a revised REHH (rREHH) test is proposed in which, through a slight modification of the definition of EHH, divergence of the test statistic can be avoided. A comparison of the statistical power of the three EHH-based methods on haplotype data obtained by coalescent simulation revealed that the iHS test achieved the highest power for a selected allele with low frequency, while the rREHH test showed the highest power for one with high frequency. For most parameters, the REHH test showed the lowest power. The present results suggest that, to detect recent positive selection efficiently, the choice between the rREHH and iHS methods should be based on the population frequency of the core allele to be tested.

Key words: SNP, positive selection, extended haplotype homozygosity (EHH), integrated EHH (iHH), integrated haplotype score (iHS), relative EHH (REHH).
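
For readers unfamiliar with EHH, the sketch below computes the standard (unmodified) statistic: among chromosomes carrying a given core allele, the probability that two chromosomes drawn at random without replacement are identical over the interval from the core site to a more distant marker. It does not implement the revised definition proposed in this study, which the abstract does not spell out. The function name, the representation of haplotypes as equal-length strings, and the convention of returning 0.0 when fewer than two chromosomes carry the core allele are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, end_index):
    """Standard EHH for one core allele, from the core site to end_index.

    haplotypes: equal-length strings, one character per marker.
    """
    lo, hi = sorted((core_index, end_index))
    # Keep only chromosomes with the core allele, cut to the interval.
    carriers = [h[lo:hi + 1] for h in haplotypes
                if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # assumed convention: no pair can be formed
    counts = Counter(carriers)
    # Number of identical pairs divided by the number of all pairs.
    return sum(comb(k, 2) for k in counts.values()) / comb(n, 2)
```

Statistics built as ratios of such quantities can have a denominator equal to zero, which is the kind of divergence of the test statistic that the abstract refers to.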

Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 327-339 (2012)

## Generalizations of Wright-Fisher Diffusions and Related Topics

(Department of Mathematics, Saga University)

The Wright-Fisher diffusion model is one of the most basic models in population genetics. In the six decades since the publication of Feller's article on this process, it has been generalized in many ways and has turned out to have connections to various other contexts. This note describes some of these topics.

Key words: Population genetics, diffusion process, branching process, measure-valued diffusion, stationary distribution.

Proceedings of the Institute of Statistical Mathematics Vol. 60, No. 2, 341-352 (2012)

## A Population Genetics Study Using the Small Disturbance Asymptotic Theory

(The Institute of Statistical Mathematics)

To capture analytically how the transient frequency distribution of a mutant arising in a population changes, we apply the small disturbance asymptotic theory. This enables us to obtain an approximate formula for a model that does not have an analytical description. Models in population genetics always have finite support, whereas the asymptotic expansion is expressed in terms of a normal distribution. Hence the formula does not work well when the mutation rate is low. However, we are able to confirm that the formula gives a good approximation when little time has elapsed and also when the mutation rate is high.

Key words: Small disturbance asymptotic theory, frequency distribution, normal approximation.