Talk by Dr. Anand Bhaskar

11:00-11:40, September 11.

Admission Free, No Booking Necessary

Seminar Room 6 (A508) @ Institute of Statistical Mathematics
Dr. Anand Bhaskar, Ph.D., Sokendai
Identifiability and inference of population demographic models from genomic variation data

The recent torrent of genome sequence data has given us the unprecedented ability to very finely resolve the details of historical demographic processes that have shaped the genetic variation of modern human populations. Such understanding of population demography has wide-ranging applications from medical genetics to conservation biology to forensic science.

Several recent large-sample human genetics studies have found a massive excess of rare variants compared to the predictions of coalescent theory for a randomly mating population at equilibrium. A widely cited explanation is that these polymorphism patterns indicate explosive and accelerating population growth in recent human history.
The sample frequency spectrum (SFS), a summary of the allele frequency information in a sample of sequences, is a widely used statistic for inferring population demography from genome sequence data. However, it has recently been shown that very different population demographies can generate the same expected SFS for arbitrarily large sample sizes. Although in principle this non-identifiability poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not biologically realistic. We revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes belong to some family of biologically motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain general bounds on the sample size sufficient for identifiability, and for piecewise-constant, piecewise-exponential, and piecewise-generalized-exponential models, which are often assumed in population genomic studies, we provide explicit sample size bounds that depend only on the number of pieces.
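
For readers unfamiliar with the SFS, the following is a minimal illustrative sketch in Python and is not taken from the talk. It assumes the sample is given as a 0/1 matrix (one row per sequence, one column per site, 1 marking the derived allele), and the function names and toy data are hypothetical. Entry i of the SFS counts the sites at which exactly i of the n sequences carry the derived allele. For comparison, under the standard constant-size coalescent with infinite-sites mutation, the expected value of entry i is theta / i, where theta is the population-scaled mutation rate.

```python
def sample_frequency_spectrum(haplotypes):
    """Return [xi_1, ..., xi_{n-1}], where xi_i is the number of sites at
    which exactly i of the n sampled sequences carry the derived allele."""
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(num_sites):
        derived_count = sum(row[site] for row in haplotypes)
        # Sites where all sequences agree are not polymorphic and are skipped.
        if 0 < derived_count < n:
            sfs[derived_count - 1] += 1
    return sfs


def expected_sfs_constant_size(n, theta):
    """Expected SFS for a constant-size population under the infinite-sites
    model: E[xi_i] = theta / i for i = 1, ..., n - 1."""
    return [theta / i for i in range(1, n)]


# Hypothetical toy sample: 4 sequences, 5 sites.
toy_sample = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]

print(sample_frequency_spectrum(toy_sample))   # [3, 0, 1]
print(expected_sfs_constant_size(4, 6.0))      # [6.0, 3.0, 2.0]
```

In the toy sample, three sites are singletons and one site has the derived allele in three sequences. An excess of the first entries relative to the theta / i baseline is the kind of rare-variant excess described above.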

Time permitting, I will also discuss some preliminary work on the geometry of the SFS and its implications for demographic inference.

This talk will be completely self-contained and will not assume prior knowledge of coalescent theory. It is based on joint work with Yun S. Song and Sébastien Roch.