Form C-2-4

Fiscal Year Heisei 30 (2018) Priority Research Implementation Report

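Project number	30－共研－4201		Field classification			ISM internal field classification			b
						Major research field classification			7
Research title	Who wrote this paper? - Examination of authorship identification using fasttext
Priority theme	Deepening of academic literature data analysis and statistical model research for IR
Principal investigator (furigana / name)	ハットリ コウタ / 服部恒太				Romanized name		Hattori Kota
Affiliation	Tokushima University
Department	Faculty of Integrated Arts and Sciences
Position	Lecturer
Allocated funds	Research expenses	40 thousand yen		Travel expenses		0 thousand yen		Number of research participants		4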

Overview of research objectives and results (progress)

Introduction
One of the ongoing projects in this research theme, led by Fujino (e.g., Fujino, 2017), has been investigating whether topic modeling is an effective method for identifying authorship. Specifically, the project has been investigating whether topic modeling can distinguish authors who share an identical name and have similar research interests. In the present study, I investigated whether the same task can be handled in another way, using fastText (Bojanowski et al., 2016). I used academic papers written by three well-known researchers in my own research field (i.e., speech science) who have very similar research interests. In this way, I examined whether fastText could potentially identify researchers who have an identical name and share similar research interests.
Data
I chose three major researchers in speech science (i.e., Paul Iverson, Valerie Hazan, and Ann Bradlow), all of whom have similar research interests. I collected four academic papers for each scholar and created text data from them using GIMP (The GIMP Team, 2018), R (R Core Team, 2019), and the tesseract package (Ooms, 2018). I used full texts because I thought abstracts alone would not be enough to run analyses with fastText. I omitted all text in parentheses and brackets, since most of the information inside them consists of references or numbers related to statistical analyses. I also omitted the reference sections. The resulting data set contained 915, 1318, and 1131 sentences for Iverson, Hazan, and Bradlow, respectively (3364 sentences in total).
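The removal of parenthesized and bracketed text described above can be sketched as follows. This is a minimal Python illustration, not the script used in the study (which relied on R and the tesseract package); the helper name `strip_bracketed` and the regular expressions are assumptions made for this example.

```python
import re

# Matches one innermost (...) or [...] span that contains no further brackets.
_BRACKETED = re.compile(r"\([^()\[\]]*\)|\[[^()\[\]]*\]")


def strip_bracketed(text: str) -> str:
    """Remove text inside parentheses and square brackets (hypothetical helper)."""
    previous = None
    # Repeat until nothing changes so that nested spans are removed as well.
    while previous != text:
        previous = text
        text = _BRACKETED.sub("", text)
    # Collapse the whitespace left behind and drop spaces before punctuation.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([.,;:])", r"\1", text)
    return text.strip()


if __name__ == "__main__":
    print(strip_bracketed("Listeners improved (p < .05) after training [see 12]."))
```

Unbalanced brackets, which OCR output may well contain, are left untouched by this sketch and would still need manual checking.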
Analyses
To create a fastText model, I split the data set into a training set and a test set. Seventy percent of the data set was used to train the model, and the rest was used to test it (2355 sentences for the training set and 1009 sentences for the test set).
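A 70/30 split of this kind can be sketched as below. The report does not say how sentences were assigned to the two sets, so the random shuffle, the fixed seed, and the helper names `to_fasttext_line` and `split_train_test` are all assumptions for illustration. The `__label__` prefix is the default label convention that fastText expects in supervised training files.

```python
import random


def to_fasttext_line(author: str, sentence: str) -> str:
    """Format one sentence as a fastText supervised-training line (hypothetical helper)."""
    return f"__label__{author} {sentence}"


def split_train_test(lines, train_fraction=0.7, seed=0):
    """Shuffle a copy of the lines and split it into training and test sets.

    The shuffle and the seed are assumptions; the report only states the
    70/30 proportion.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Placeholder sentences standing in for the 3364 sentences of the data set.
    data = [to_fasttext_line("iverson", f"sentence {i}") for i in range(3364)]
    train, test = split_train_test(data)
    print(len(train), len(test))
```

With 3364 input lines, rounding 70% gives 2355 training lines and 1009 test lines, which agrees with the counts reported above.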
Results
The results demonstrated that, although the three researchers share similar research interests, fastText predicted authorship with just under 90% accuracy (n-gram = 2, 88.7%; n-gram = 3, 88.6%; n-gram = 4, 88.7%). Given that the three researchers in this data set share similar research interests, it seems fair to assume that it may be possible to identify authors with identical names in similar research fields.
Discussion
The present results suggest that fastText can be another approach to identifying researchers who have identical names and similar research interests. To verify that this is the case, I need to increase the size of the data set and rerun the fastText models. Because data collection requires some manual work, further automation of the data processing is needed. The current approach with the tesseract package requires editing PDF files, running spell checks, correcting typos, and splitting sections at headers. If I can speed up these processes, it will be easier to expand the present study.
The present study leaves room for refining the fastText approach. For example, it may be possible to predict authorship with less text. The present study used the full texts of the academic papers, but the text of the introduction or discussion sections alone may suffice to identify authors. Another approach is to create models without stop words, which may increase prediction accuracy.
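Removing stop words before training could look like the sketch below. This is only an illustration of the idea, which was not tested in the present study; the short stop word list and the helper name `remove_stop_words` are assumptions, and a real analysis would use a fuller list.

```python
# A deliberately tiny, illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "and", "to", "is", "are", "that"})


def remove_stop_words(sentence: str, stop_words=STOP_WORDS) -> str:
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    kept = [token for token in sentence.split() if token.lower() not in stop_words]
    return " ".join(kept)


if __name__ == "__main__":
    print(remove_stop_words("The listeners identified the vowels in noise"))
```

Whether this helps is an open question, since function words can themselves carry information about an author's style.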
Fujino (2017) demonstrated that topic modeling is effective in predicting authorship in cases where IR staff need to identify authors accurately. The present study demonstrated that text classification with fastText can also be an effective approach to help staff with this task.

Information sources related to this research (published papers, conference presentations, preprints, websites, etc.)

If a research meeting was held, please enter the theme, date and time, venue, and number of participants.

List of research participants
Name	Affiliation
武井美緒	The Institute of Statistical Mathematics
藤野友和	Fukuoka Women's University