The Institute of Statistical Mathematics

The 2nd Statistical Machine Learning Seminar (hosted by the Machine Learning NOE, The Institute of Statistical Mathematics)

Date and Time
Wednesday, January 19, 2011, 13:30-15:40
Venue
The Institute of Statistical Mathematics, Seminar Room 5 (3rd floor, D313 and D314)
* The talks will be given in English.
Speakers
Yee Whye Teh (University College London)
Daichi Mochihashi (NTT CS Labs)
Organizer
Kenji Fukumizu
Speaker 1
Yee Whye Teh (University College London)
Title
Hierarchical Bayesian Models of Language and Text
Abstract

In this talk I will present a new approach to modelling sequence data called the sequence memoizer. As opposed to most other sequence models, our model does not make any Markovian assumptions. Instead, we use a hierarchical Bayesian approach which enforces sharing of statistical strength across the different parts of the model. To make computations with the model efficient, and to better model the power-law statistics often observed in sequence data, we use Bayesian nonparametric priors called Pitman-Yor processes as the building blocks of the hierarchical model. We show state-of-the-art results on language modelling and text compression.

This is joint work with Frank Wood, Jan Gasthaus, Cedric Archambeau and Lancelot James.
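
The power-law behaviour mentioned above can be seen directly in the Chinese-restaurant representation of the Pitman-Yor process. The following Python fragment is a minimal sketch, not code from the talk; the parameter names d (discount) and theta (concentration) follow common usage.

    import random

    def pitman_yor_crp(n_customers, d=0.5, theta=1.0, seed=0):
        """Seat customers by the Pitman-Yor Chinese restaurant process.

        An existing table with c customers attracts the next customer with
        weight (c - d); a new table opens with weight (theta + d * t),
        where t is the current number of tables. Returns the table sizes.
        """
        rng = random.Random(seed)
        tables = []   # tables[k] = number of customers at table k
        seated = 0    # customers seated so far
        for _ in range(n_customers):
            r = rng.random() * (theta + seated)   # total weight is theta + seated
            if r < theta + d * len(tables):
                tables.append(1)                  # open a new table (new type)
            else:
                r -= theta + d * len(tables)
                for k, c in enumerate(tables):
                    r -= c - d                    # existing table k has weight c - d
                    if r < 0:
                        tables[k] += 1            # join existing table k
                        break
            seated += 1
        return tables

    # With d > 0 the number of occupied tables grows roughly as a power of the
    # number of customers, matching the power-law type/token statistics of text;
    # with d = 0 (the Dirichlet process) it grows only logarithmically.
    print(len(pitman_yor_crp(10000, d=0.8)))

In the sequence memoizer such processes are stacked: the base distribution of the restaurant for one context is the restaurant for its shortened suffix context, which is what shares statistical strength across contexts of all lengths.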

Bio

Yee Whye Teh is a Lecturer (equivalent to an assistant professor in the US system) at the Gatsby Computational Neuroscience Unit, UCL. He is interested in machine learning and Bayesian statistics. His current focus is on developing Bayesian nonparametric methodologies for unsupervised learning, computational linguistics, and genetics. Prior to his appointment he was a Lee Kuan Yew Postdoctoral Fellow at the National University of Singapore and a postdoctoral fellow at the University of California, Berkeley.

He obtained his Ph.D. in Computer Science at the University of Toronto in 2003.

Speaker 2
Daichi Mochihashi (NTT CS Labs)
Title
Unsupervised and Semi-supervised Learning of Nonparametric Bayesian Word Segmentation
Abstract

For unsegmented languages such as Japanese and Chinese, word segmentation is often the first step of natural language processing and has therefore long been an important problem. Recently, with the non-standard colloquial text found in blogs and on Twitter, existing supervised methods are no longer adequate, and a novel method is needed that automatically acquires new words so that such text can be segmented appropriately.

In this talk, I introduce the first nonparametric Bayesian generative model that can recognize words in an unsupervised fashion, even for a completely unknown language. This model, called the nested Pitman-Yor language model (NPYLM), can infer "words" from both their spellings and the surrounding "words", which are themselves unknown in advance. MCMC with an efficient forward-backward algorithm is used for inference, enabling the model to be applied to very large real-world texts.
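
For concreteness, the following is a minimal Python sketch of the forward-filtering, backward-sampling step that this kind of blocked Gibbs sampler uses to draw one segmentation exactly; it is not the authors' code. The word model here is a hypothetical unigram stand-in (geometric word length, i.i.d. characters); in NPYLM, word_prob would instead query the nested hierarchical Pitman-Yor model.

    import random

    def word_prob(w, char_prob=0.02, avg_len=3.0):
        # Hypothetical stand-in for the NPYLM word model: geometric length
        # distribution with mean avg_len, characters i.i.d. with prob char_prob.
        p_stop = 1.0 / avg_len
        return ((1 - p_stop) ** (len(w) - 1)) * p_stop * char_prob ** len(w)

    def sample_segmentation(s, max_len=8, seed=0):
        """Sample a segmentation of s with probability proportional to the
        product of word_prob over its words (unigram independence assumption)."""
        rng = random.Random(seed)
        n = len(s)
        alpha = [1.0] + [0.0] * n
        # Forward pass: alpha[t] sums the probability of s[:t] over all
        # segmentations, marginalizing the length k of the last word.
        for t in range(1, n + 1):
            alpha[t] = sum(word_prob(s[t - k:t]) * alpha[t - k]
                           for k in range(1, min(t, max_len) + 1))
        # Backward pass: sample the last word, then recurse, so the whole
        # segmentation is drawn from its exact posterior.
        words, t = [], n
        while t > 0:
            ks = list(range(1, min(t, max_len) + 1))
            weights = [word_prob(s[t - k:t]) * alpha[t - k] for k in ks]
            k = rng.choices(ks, weights=weights)[0]
            words.append(s[t - k:t])
            t -= k
        return words[::-1]

    print(sample_segmentation("nowisthetime"))

In the full model the forward variables roughly also track the previous word (a bigram dependency), and between sweeps the sampled words of each sentence are removed from and re-added to the hierarchical Pitman-Yor restaurants.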

In the second part, I extend the NPYLM to semi-supervised learning using Conditional Random Fields (CRFs). Although the NPYLM can be regarded as a kind of semi-Markov model, a naive combination with a semi-Markov CRF is computationally prohibitive and proves to work badly. To cope with this problem, we convert information between the Markov CRF and the semi-Markov NPYLM to yield a consistent combination of discriminative and generative models. We show results on segmenting Twitter text, speech transcripts, and dialect text using only supervised data from newspapers, as well as results for standard datasets on Chinese word segmentation.

* The latter half of the talk is joint work with Jun Suzuki and Akinori Fujino (NTT CS Labs).
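
To make the representational gap concrete: a semi-Markov model scores whole words (variable-length spans), while a Markov CRF scores one label per character. The standard one-to-one correspondence between the two views, sketched below in generic Python using B/I/E/S character tags, illustrates the two representations being bridged; it is not the specific conversion scheme of the talk.

    def words_to_tags(words):
        # Semi-Markov -> Markov: one tag per character.
        # B = word-begin, I = word-internal, E = word-end, S = single-char word.
        tags = []
        for w in words:
            tags += ["S"] if len(w) == 1 else ["B"] + ["I"] * (len(w) - 2) + ["E"]
        return tags

    def tags_to_words(chars, tags):
        # Markov -> semi-Markov: cut after every E or S tag.
        words, cur = [], ""
        for c, t in zip(chars, tags):
            cur += c
            if t in ("E", "S"):
                words.append(cur)
                cur = ""
        return words

    words = ["now", "is", "the", "time"]
    tags = words_to_tags(words)
    assert tags_to_words("".join(words), tags) == words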

Bio
Daichi Mochihashi is a Senior Research Associate (Research Specialist) at NTT Communication Science Laboratories, Kyoto. He received his BS from the University of Tokyo in 1998 and his PhD from the Nara Institute of Science and Technology in 2005. His research focuses primarily on Bayesian methods in natural language processing. After several years at ATR Spoken Language Communication Research Laboratories, he has been with NTT CS Labs since 2007.