In this talk I will present a new approach to modelling sequence data called the sequence memoizer. Unlike most other sequence models, our model does not make any Markovian assumptions. Instead, we use a hierarchical Bayesian approach which enforces sharing of statistical strength across the different parts of the model. To make computations with the model efficient, and to better model the power-law statistics often observed in sequence data, we use Pitman-Yor processes, a class of Bayesian nonparametric priors, as the building blocks of the hierarchical model. We show state-of-the-art results on language modelling and text compression.
This is joint work with Frank Wood, Jan Gasthaus, Cedric Archambeau and Lancelot James.
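As a rough illustration of why a Pitman-Yor prior suits power-law data, the sketch below simulates the standard Chinese restaurant construction of a single Pitman-Yor process. It is not the hierarchical sequence memoizer itself; the function name and parameter names are hypothetical, chosen only for this example.

```python
import random

def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Seat customers one at a time and return the list of table sizes.

    Illustrative only: a single Pitman-Yor process, not a hierarchy.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for n in range(n_customers):
        # Unnormalised weight of opening a new table: theta + d * K.
        new_weight = concentration + discount * len(tables)
        # The weights of all options sum to n + theta.
        u = rng.random() * (n + concentration)
        if not tables or u < new_weight:
            tables.append(1)
            continue
        u -= new_weight
        for k, size in enumerate(tables):
            # Existing table k has weight (size - d).
            u -= size - discount
            if u < 0:
                tables[k] += 1
                break
        else:
            tables[-1] += 1  # guard against floating-point rounding
    return tables

# With discount 0 the process reduces to a Dirichlet process, where the
# number of tables grows logarithmically; a positive discount gives
# power-law growth in the number of tables.
few = pitman_yor_crp(1000, discount=0.0, concentration=1.0)
many = pitman_yor_crp(1000, discount=0.8, concentration=1.0)
print(len(few), len(many))
```

In a language-modelling reading, tables play the role of distinct word types, so a larger discount produces the long tail of rare types seen in natural text.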
Yee Whye Teh is a Lecturer (equivalent to an assistant professor in the US system) at the Gatsby Computational Neuroscience Unit, UCL. He is interested in machine learning and Bayesian statistics. His current focus is on developing Bayesian nonparametric methodologies for unsupervised learning, computational linguistics, and genetics. Prior to his appointment he was a Lee Kuan Yew Postdoctoral Fellow at the National University of Singapore and a postdoctoral fellow at the University of California, Berkeley.
He obtained his Ph.D. in Computer Science at the University of Toronto in 2003.
For unsegmented languages such as Japanese and Chinese, word segmentation is often the first step in natural language processing and has therefore long been an important problem. Recently, however, existing supervised methods have proven inadequate for the non-standard colloquial text found in blogs and on Twitter, which calls for a new method that automatically acquires new words in order to segment such text appropriately.
In this talk, I introduce the first nonparametric Bayesian generative model that can recognize words in an unsupervised fashion, even for a completely unknown language. This model, called the nested Pitman-Yor language model (NPYLM), infers "words" from both their spellings and the other "words" around them, which are also unknown in advance. MCMC with an efficient forward-backward algorithm is used for inference, enabling the model to be applied to very large real-world texts.
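The forward-backward step mentioned above can be sketched as forward filtering followed by backward sampling over possible word boundaries. This is a minimal sketch under strong simplifying assumptions: it uses a fixed unigram word model rather than the NPYLM, and `LEXICON`, `toy_word_prob`, `forward_probs`, and `sample_segmentation` are hypothetical names introduced only for illustration.

```python
import random

# Hypothetical toy word model: known words get a fixed probability,
# unknown strings get a small penalty per character.
LEXICON = {"the": 0.3, "cat": 0.2, "sat": 0.2}

def toy_word_prob(word):
    return LEXICON.get(word, 0.001 ** len(word))

def forward_probs(text, word_prob, max_len):
    """alpha[t] = total probability of all segmentations of text[:t]."""
    n = len(text)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        # The last word of text[:t] has length k, for k = 1..max_len.
        for k in range(1, min(t, max_len) + 1):
            alpha[t] += word_prob(text[t - k:t]) * alpha[t - k]
    return alpha

def sample_segmentation(text, word_prob, max_len, rng):
    """Draw one segmentation in proportion to its probability."""
    alpha = forward_probs(text, word_prob, max_len)
    if alpha[-1] == 0.0:
        raise ValueError("text has zero probability under the word model")
    words = []
    t = len(text)
    while t > 0:
        # Sample the length of the word ending at position t.
        lengths = list(range(1, min(t, max_len) + 1))
        weights = [word_prob(text[t - k:t]) * alpha[t - k] for k in lengths]
        k = rng.choices(lengths, weights=weights)[0]
        words.append(text[t - k:t])
        t -= k
    words.reverse()
    return words

rng = random.Random(0)
print(sample_segmentation("thecatsat", toy_word_prob, max_len=4, rng=rng))
```

Inside a sampler, a step like this would be applied to one sentence at a time, with the word model re-estimated from the sampled segmentations. For long inputs the forward probabilities underflow, so a practical version would work in log space.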
In the second part, I extend the NPYLM to semi-supervised learning using Conditional Random Fields (CRFs). Although the NPYLM can be regarded as a kind of semi-Markov model, a naive combination with a semi-Markov CRF is prohibitive and turns out to work poorly. To cope with this problem, we convert information between the Markov CRF and the semi-Markov NPYLM to yield a consistent combination of discriminative and generative models. We show results on segmenting tweets, speech transcripts, and dialects using only newspaper text as supervised data, as well as results on standard datasets for Chinese word segmentation.
* The latter half of the talk is joint work with Jun Suzuki and Akinori Fujino (NTT CS Labs).