Proceedings of the Institute of Statistical Mathematics Vol.48, No.2, 271-287(2000)

On Natural Language Statistical Information Processing

Mingzhe Jin
(Department of Social Information, Sapporo Gakuin University)

This paper describes work done on statistical natural language processing. It is organized into two parts. Part One explains symbol- and word-centered work in language processing, including n-gram and Markov models. Part Two describes text processing: text classification (or categorization), authorship attribution (or stylometry), and information retrieval (or extraction).
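As a point of reference for the n-gram and Markov models mentioned above, the following is a minimal sketch of a bigram (first-order Markov) model estimated by relative frequency. The function name `bigram_model` and the toy English sentence are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter, defaultdict


def bigram_model(tokens):
    """Estimate P(next | previous) by relative frequency (hypothetical helper)."""
    # Count each adjacent pair and each token that has a successor.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    model = defaultdict(dict)
    for (prev, nxt), n in pair_counts.items():
        model[prev][nxt] = n / context_counts[prev]
    return model


# Toy input, assumed for illustration only.
tokens = "the cat sat on the mat".split()
model = bigram_model(tokens)
print(model["the"])
```

No smoothing is applied here, so unseen pairs have no entry at all; practical models usually add a smoothing step.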

Key words: Statistical method, natural language processing, text processing.

Proceedings of the Institute of Statistical Mathematics Vol.48, No.2, 289-310(2000)

Discovering Similar Poems from Anthologies
of Classical Japanese Poems

Masayuki Takeda
(Department of Informatics, Kyushu University)
Tomoko Fukuda
(Fukuoka Jo Gakuin College)
Ichiro Nanri
(Junshin Women's Junior College)
Mayumi Yamasaki and Koichi Tamari
(Department of Informatics, Kyushu University)

WAKA is a form of traditional Japanese poetry with a 1,300-year history. In this paper we attempt to discover similar poems in anthologies of WAKA poems semi-automatically. The key to success is how the similarity measure on poems is defined. We introduce a unifying framework that captures the essence of string similarity measures. This framework makes it easy to design new measures appropriate for discovering similar poems. We propose three types of similarity measures. Using them, we report successful results in finding similar poems between Kokinshu and Shinkokinshu, which are known as the best two of the twenty-one imperial anthologies. Most interestingly, we have found several instances of poetic allusion that have never been pointed out in the long history of WAKA research.
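To make the idea of a string similarity measure concrete, the sketch below scores two strings by the length of their longest common subsequence, normalized by the longer string, and ranks all cross-anthology pairs. This is a generic measure chosen for illustration; it is not claimed to be one of the three measures proposed in the paper, and the names `lcs_length`, `similarity`, and `most_similar_pairs` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalized LCS score in [0, 1]; an illustrative measure only."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def most_similar_pairs(anthology_a, anthology_b, top=3):
    """Rank every cross-anthology pair by similarity, highest first."""
    scored = [(similarity(p, q), p, q) for p in anthology_a for q in anthology_b]
    return sorted(scored, reverse=True)[:top]
```

In a semi-automatic setting, the top-ranked pairs returned by such a function would then be handed to a human expert for judgment.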

Key words: Classical Japanese poems, analysis of expressions, similarity measures, similar poems, machine discovery.

Proceedings of the Institute of Statistical Mathematics Vol.48, No.2, 311-326(2000)

Identify a Text's Genre by Multivariate Analysis
—Using Selected Conjunctive Words and Particle-phrases—

Minori Murata
(International Center, Keio University)

It is quite important for advanced students of Japanese for specific purposes to understand the underlying logical structure of a text, since grasping that structure enhances their ability to read and write technical papers.

Such items as the conjunctive words (i.e. the words which function as a conjunction in a sentence: Setsuzoku-goku) and particle-phrases (i.e. the phrases which function as a particle in a sentence: Jyoshi-sootoo-ku in Fukugo-ji) can provide important clues for understanding the logical structure of the text. The ultimate goal of this study is to clarify the logical structures of the technical texts in Japanese by focusing on the functions of conjunctive words and particle-phrases.

As a step toward achieving this objective, we chose 290 samples (14,134 sentences in total) from five genres. The five genres are (i) an introductory economics textbook, (ii) papers of the Journal of the Physical Society of Japan, (iii) papers of science and technology, (iv) editorial articles of 4 kinds of newspapers, and (v) modern novels. For each sample, we counted the rate of appearance (per sentence) of the 62 selected conjunctive words and particle-phrases. The analysis was conducted in the following two steps:

(a) We first examined the univariate distribution of the above 62 items and then applied canonical discriminant analysis to 108 samples ((i) 16 samples, (ii) 24 samples, (iii) 14 samples, (iv) 40 samples selected by random sampling out of 222, and (v) 14 samples).
(b) Second, we applied the same method to 236 samples, using all of the editorial articles of the 4 kinds of newspapers (222 samples) and the modern novels (14 samples), the two genres that were not well distinguished in the first step.
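The feature used in both steps, the rate of appearance per sentence, can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `appearance_rates` is hypothetical, matching is done by plain case-insensitive substring counting, and the English sentences and items are toy stand-ins for Japanese conjunctive words and particle-phrases, which would require proper segmentation in practice.

```python
def appearance_rates(sentences, items):
    """Occurrences per sentence of each item within one sample."""
    if not sentences:
        raise ValueError("sample has no sentences")
    lowered = [s.lower() for s in sentences]
    return {
        item: sum(s.count(item) for s in lowered) / len(lowered)
        for item in items
    }


# Toy sample, assumed for illustration only.
sample = [
    "However, it rained.",
    "Therefore we stayed; however, we worked.",
    "It was fine.",
]
rates = appearance_rates(sample, ["however", "therefore"])
print(rates)
```

Each sample then yields one vector of rates (62 values in the study), and these vectors are the input to the discriminant analysis.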

According to the result obtained in (a), these genres can be classified using 12 of the 62 conjunctive words and particle-phrases, with a high apparent correct classification rate (84%). According to the result obtained in (b), the words that distinguish the 2 genres (iv) and (v) were clearly selected. These results indicate the existence of common conjunctive words and particle-phrases both in texts having an explicit logical structure (as in (i), (ii) and (iii)) and in texts having an implicit logical structure (as in (iv) and (v)).

Key words: Japanese for specific purposes, logical structure of a text, text's genre, canonical discriminant analysis, conjunctive words (Setsuzoku-goku), particle-phrases (Jyoshi-sootoo-ku in Fukugo-ji), rate of appearance per sentence.

Proceedings of the Institute of Statistical Mathematics Vol.48, No.2, 327-337(2000)

Approaching to the Synoptic Problem by Factor Analysis

Maki Miyake and Hiroyuki Akama
(Graduate School of Decision Society & Technology,
Tokyo Institute of Technology)
Migaku Sato
(College of Community & Human Services,
Rikkyo University)
Masanobu Nakagawa
(Graduate School of Decision Society & Technology,
Tokyo Institute of Technology)

Our study addresses several questions in Biblical studies through statistical analysis in the manner of corpus linguistics. The topic of discussion is the so-called "synoptic problem" in the study of the Gospels in the New Testament, a theme that has been discussed constantly since the end of the 18th century. Our aim is to explain mathematically the mutual relationship of the four Gospels in order to clarify the nature of their interdependence and the process through which they came about.

First, all pericopes of the Synoptic Gospels are rearranged according to the parallel texts so that the entire "Synopsis" of the Four Gospels is created. Second, each text section is classified into 7 categories (4 categories with common words, 3 with uncommon words) and the number of words used in each section is carefully counted. Then, only the words that occur in more than two categories are selected and used as data for further analysis, so that the mutual relationship of the categories is demonstrated and the characteristics of each category are made manifest. The method applied is factor analysis.
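The word-selection step described above can be sketched as follows: count words per category, then keep only the words that occur in more than two categories. The function name, the category labels, and the toy tokens are all assumptions made for illustration; the resulting word-by-category table is the kind of data that would be passed on to factor analysis.

```python
from collections import Counter


def words_in_many_categories(category_tokens, min_categories=3):
    """Build a word-by-category count table for widely shared words.

    category_tokens maps a category label to its list of word tokens.
    min_categories=3 corresponds to "more than two categories".
    """
    counts = {cat: Counter(toks) for cat, toks in category_tokens.items()}
    # Number of categories in which each word appears at least once.
    spread = Counter(w for c in counts.values() for w in c)
    kept = sorted(w for w, n in spread.items() if n >= min_categories)
    return {w: {cat: counts[cat][w] for cat in counts} for w in kept}


# Toy data with hypothetical category labels and tokens.
toy = {
    "A": ["alpha", "beta", "beta"],
    "B": ["beta", "gamma"],
    "C": ["beta", "alpha"],
    "D": ["alpha"],
}
table = words_in_many_categories(toy)
print(table)
```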

At the same time, an application program, "Synoptic software", has been developed in order to help further statistical analyses of the Gospels. This program will also show that the same approach could be used in various fields outside Biblical studies.

Key words: Biblical studies, Gospels, statistical technique, PCA, factor analysis, software.

Proceedings of the Institute of Statistical Mathematics Vol.48, No.2, 339-376(2000)

Analyzing Open-ended Questions: Some Experimental Results
for Textual Data Analysis Based on InfoMiner

Noboru Ohsumi
(The Institute of Statistical Mathematics)
Ludovic Lebart
(Département Economie, Gestion, Sciences
Sociales et Humaines, ENST)

Interest in methods for the acquisition and analysis of textual data is increasing, as electronic processing of the Japanese language becomes possible, and research efforts advance in the analysis of natural languages and in related studies. Objective and reliable techniques are required for the impartial analysis of responses to open-ended questions, particularly in surveys of social attitude and public opinion.

First, we discuss the problems that should be solved during data acquisition, based on our experience in analyzing open-ended survey questions. Second, we summarize comparisons between the statistical data analysis thus evaluated and more conventional approaches to Japanese textual data analysis. Third, we introduce the statistical system "InfoMiner", developed to analyze textual and related data obtained from quantitative questionnaires. InfoMiner is based on the "data science" paradigm, which re-asserts the priority of data collection methodologies in any data analysis. In particular, InfoMiner includes some functions developed specifically for handling the agglutinative aspect of Japanese textual data. For example, there are functions for parsing any Japanese sentence into a set of linguistic units using a morphological analysis technique. These sets of units are used as dictionaries for statistical analyses. They are then amenable to analysis by multidimensional procedures such as correspondence analysis and clustering. Finally, we use InfoMiner to illustrate a partial analysis of textual data acquired from actual Internet-based surveys that were originally designed and conducted for data science. We provide a few examples to illustrate the practicality of this type of multidimensional data analysis.
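As a rough illustration of how segmented responses become input for correspondence analysis or clustering, the sketch below builds a group-by-word contingency table. It assumes the responses have already been split into linguistic units by some morphological analyzer; the function name `contingency_table`, the respondent groups, and the tokens are hypothetical and do not describe InfoMiner's actual interface.

```python
from collections import Counter


def contingency_table(responses):
    """Cross-tabulate respondent groups against words.

    responses is an iterable of (group, tokens) pairs, where tokens is the
    already-segmented list of linguistic units in one open-ended answer.
    Returns (groups, vocabulary, matrix) with rows ordered as groups.
    """
    counts = {}
    for group, tokens in responses:
        counts.setdefault(group, Counter()).update(tokens)
    groups = sorted(counts)
    vocabulary = sorted({w for c in counts.values() for w in c})
    matrix = [[counts[g][w] for w in vocabulary] for g in groups]
    return groups, vocabulary, matrix


# Toy responses, assumed for illustration only.
responses = [
    ("20s", ["price", "design"]),
    ("20s", ["price"]),
    ("60s", ["safety", "price"]),
]
groups, vocabulary, matrix = contingency_table(responses)
print(groups, vocabulary, matrix)
```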

Key words: Analyzing responses to open-ended questions, textual data analysis, InfoMiner, Internet survey, morphological analysis, word segmentation, text-mining, data science.