Proc. Inst. Statist. Math. 64-2

Statistical Models to Induce Latent Syntactic Structures

Hiroshi Noji

(Graduate School of Information Science, Nara Institute of Science and Technology)

This article describes advances in unsupervised syntactic parsing over the past 20 years. Unsupervised parsing aims to obtain the grammar of a language automatically from input sentences, without manually created syntactic trees. The essential point in this task is how to exploit biases or prior knowledge about the grammar of the language. In this article, we compare several existing approaches from this perspective and discuss what kind of information we should provide to the model and what can be learned from such knowledge, in order to guide future research in this area.

Key words: Computational linguistics, unsupervised syntactic parsing.

Statistical Approaches to Language Change and Linguistic Phylogenies

Yugo Murawaki

(Graduate School of Informatics, Kyoto University)

Since around the turn of the twenty-first century, there has been a growing trend to employ computer-intensive statistical methods to answer historical linguistic questions, such as those concerning language change and the phylogenies of extant and documented languages. Although these questions have traditionally been addressed manually by linguists, manual analysis has limitations. Because human inference is based on logic, humans are unable to estimate continuous values (e.g., dating the common ancestor of extant languages). They also handle inherent uncertainty poorly because it leads to a combinatorial explosion. Computational statistics provides powerful ways to solve these problems.

The current trend can be characterized by the fact that key results have been achieved with statistical methods originally developed in the field of molecular biology. Although historical linguistics itself has a record of adopting statistical models, the new statistical techniques have been developed largely independently of historical linguistics. Therefore, their scientific foundations have yet to be fully understood by linguistic communities. We also observe that, since most recent statistical studies on linguistic questions depend on ready-to-use software packages designed to address biological questions, linguistic phenomena that lack exact counterparts in biology tend to be left untouched.

In light of this, we first give an overview of the new statistical models while relating them to the research history of historical linguistics. After reviewing the concept of evolution, the comparative method that exploits regular sound changes, and the ill-fated method of glottochronology, we explain the essence of recently developed Bayesian phylogenetic models.

Since most phylogenetic models use lexical traits, they can be applied only if the group of languages in question has a sufficient number of shared lexical traits. Unfortunately, this is not the case for Japanese, and we have no choice but to seek different kinds of traits. Later in this paper, we describe novel approaches based on typological traits, which we believe have the potential to trace the origin of the Japanese language.

Key words: Linguistic phylogeny, historical linguistics, linguistic typology, Bayesian statistics.

Theory and Practice of Conditional Random Fields

Naoaki Okazaki

(Graduate School of Information Sciences, Tohoku University)

Most tasks in natural language processing are formalized as the problem of predicting an output for a given input. Assuming that the input and output have a structure such as a sequence or a tree, which is a natural assumption for language, we can formalize more tasks as prediction problems. This paper explains Conditional Random Fields (CRFs), where the input and output are both sequences. In order to apply multi-class logistic regression to the sequential labeling problem, a CRF introduces feature functions that assume the Markov property for a label sequence, which enables efficient inference and parameter estimation by dynamic programming. Accordingly, this paper reviews the fundamental theories of logistic regression, feature functions, training with stochastic gradient descent, regularization, etc., and describes the overall theory of CRFs. In addition, it covers recent research topics and practices, including active learning for CRFs, learning from partially annotated supervision data, and models with deep neural networks.

Key words: Conditional Random Fields, logistic regression, stochastic gradient descent.

Statistical Analysis of Eye-movement Data and Reading Time Data in Language Comprehension Research

Manabu Arai

(Faculty of Economics, Seijo University)

Douglas Roland

(Graduate School of Arts and Sciences, The University of Tokyo)

Research on language comprehension has made significant advances over the last 30 years or so, largely owing to technological advances that have enabled researchers to conduct chronometric studies at little cost. Furthermore, eye-tracking devices, which have played an important role in advancing research on language comprehension, were once available only to well-funded laboratories but are now within many researchers' reach. Although the collection of time-encoded data is easier than ever, appropriate handling of such data often requires statistical modeling that is far from straightforward. In this paper, we discuss statistical methods for analyzing eye-movement data from visual world and reading studies, as well as reading times from the self-paced reading task. We argue that careful and reasonable application of Linear Mixed-Effects models and Generalized Linear Mixed-Effects models can offer great advantages over traditional analyses, such as ANOVA, that require data aggregation over participants or items.

Key words: Linear mixed effects models, generalized linear mixed effects models, eye-movements, visual world paradigm, self-paced reading, reading times.

Difference between Number of Tweets and Real World Statistics

Eiji Aramaki

(Nara Institute of Science and Technology (NAIST))

Shoko Wakamiya

(Nara Institute of Science and Technology (NAIST))

The prevalence of social media services has brought a new approach to surveying people and social conditions. So far, various systems, such as influenza surveillance systems and earthquake detection systems, have been proposed. However, information shared on social media does not always correspond to the real world. For example, social media services often suffer from rumors, which makes them less reliable than existing media. In addition, several studies have pointed out limitations in both the temporal and spatial accuracy of social media services. In this paper, we examine the differences between the number of tweets and real-world statistics from temporal and spatial perspectives, based on Twitter data collected using our influenza surveillance system. Furthermore, we discuss a bias behind the differences.

Key words: Social media, Twitter, natural language processing, social computing, influenza.