FY 2018 (Heisei 30) Focused Research Implementation Report

 

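Project number

30−共研−4201

Field classification

Institute of Statistical Mathematics internal field classification

b

Major research field classification

7

Research project title

Who wrote this paper? - Examination of authorship identification using fasttext

Focus theme

Advancing academic literature data analysis and statistical modeling research for IR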

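Furigana (phonetic reading)

Principal investigator

ハットリ コウタ

服部 恒太

Romanized name

Hattori Kota

Affiliation

Tokushima University

Department

Faculty of Integrated Arts and Sciences

Position

Lecturer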

Allocated budget

Research expenses

40 thousand yen

Travel expenses

0 thousand yen

Number of research participants

4

 

Summary of research objectives and results (progress)

Introduction
One of the ongoing projects in this research theme, led by Fujino (e.g., Fujino, 2017), has been investigating whether topic modeling is an effective method for identifying authorship. Specifically, that project examines whether topic modeling can distinguish authors who share an identical name and have similar research interests. In the present study, I investigated whether the same task can be handled in another way, using fastText (Bojanowski et al., 2016). I used academic papers written by three well-known researchers in my own research field (speech science) who have very similar research interests. In this way, I examined whether fastText could potentially distinguish researchers who have an identical name and similar research interests.
Data
I chose three major researchers in speech science (Paul Iverson, Valerie Hazan, and Ann Bradlow), all of whom have similar research interests. I collected four academic papers for each scholar and created text data from them using GIMP (The GIMP Team, 2018), R (R Core Team, 2019), and the tesseract package (Ooms, 2018). I used the full texts because I thought abstracts alone would not provide enough text to run analyses with fastText. I omitted all text in parentheses and brackets, since most of it consists of references or numbers related to statistical analyses. I also omitted the reference sections. The resulting data set contained 915, 1,318, and 1,131 sentences for Iverson, Hazan, and Bradlow, respectively.
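The removal of parenthesized and bracketed text described above can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used in the study (which relied on R and manual editing); the function name `strip_bracketed` and the example sentence are hypothetical.

```python
import re

# Matches one innermost (...) or [...] group (no nested delimiters inside).
_GROUP = re.compile(r"\([^()\[\]]*\)|\[[^()\[\]]*\]")


def strip_bracketed(text: str) -> str:
    """Remove parenthesized and bracketed spans, including nested ones."""
    previous = None
    # Repeat until nothing changes, so nested groups are peeled off
    # from the inside out.
    while previous != text:
        previous = text
        text = _GROUP.sub("", text)
    # Collapse leftover whitespace and drop spaces before punctuation.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([.,;:])", r"\1", text)
    return text.strip()


# Hypothetical sentence with an in-text citation and a statistics report.
example = "Listeners improved after training (Author, 2001) [F(1, 20) = 4.5, p < .05]."
cleaned = strip_bracketed(example)
print(cleaned)  # Listeners improved after training.
```

A pattern like this also deletes parenthetical remarks that are neither citations nor statistics, which matches the report's choice to omit all such text.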
Analyses
To build a fastText model, I split the data set into a training set and a test set. Seventy percent of the data set was used to train the model, and the rest was used to test it (2,355 sentences for training and 1,009 sentences for testing).
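The 70/30 split can be sketched as below. This is an illustration under several assumptions: the report does not say how sentences were assigned to each set, so the shuffled, unstratified split, the seed, and the helper names `to_fasttext_line` and `split_train_test` are hypothetical. The `__label__` prefix is fastText's default marker for class labels in supervised input; the sentences here are placeholders sized to the reported per-author counts.

```python
import random


def to_fasttext_line(author: str, sentence: str) -> str:
    """Format one sentence in fastText's supervised input format."""
    return f"__label__{author} {sentence.strip()}"


def split_train_test(lines, train_fraction=0.7, seed=0):
    """Shuffle labeled lines and split them into training and test lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder sentences; only the counts come from the report.
counts = {"Iverson": 915, "Hazan": 1318, "Bradlow": 1131}
lines = [
    to_fasttext_line(author, f"placeholder sentence {i}")
    for author, n in counts.items()
    for i in range(n)
]

train, test = split_train_test(lines)
print(len(train), len(test))  # 2355 1009
```

With 3,364 sentences in total, a 70% cut reproduces the reported sizes of 2,355 and 1,009. The two lists could then be written to text files and passed to fastText's supervised trainer.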
Results
The results showed that, although the three researchers have similar research interests, fastText predicted authorship with nearly 90% accuracy (n-gram = 2, 88.7%; n-gram = 3, 88.6%; n-gram = 4, 88.7%). Given that the three researchers in this data set have similar research interests, it seems reasonable to expect that authors with identical names working in similar research fields could also be distinguished.
Discussion
The present results suggest that fastText can be another approach to identifying researchers who have identical names and similar research interests. To verify this, I need to increase the size of the data set and run fastText models on it. Because data collection requires some manual work, the data processing needs to be automated further. The current approach with the tesseract package requires editing PDF files, running spell checks, correcting typos, and splitting sections by their headers. If these steps can be sped up, it will be easier to expand the present study.
The present study leaves room for refining the fastText approach. For example, it may be possible to predict authorship from less text. The present study used the full texts of the academic papers, but the text of the introduction or discussion sections alone may suffice to identify authors. Another option is to build models after removing stop words, which may increase prediction accuracy.
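The stop-word idea could be tried with a preprocessing step like the following sketch. It was not part of the present study. The stop-word list is a tiny illustrative placeholder (a real experiment would use an established list), the function name `remove_stop_words` is hypothetical, and the input is assumed to be in fastText's supervised format, where each line begins with a `__label__` token.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "to", "and", "is", "are", "that"}
)


def remove_stop_words(line: str, label_prefix: str = "__label__") -> str:
    """Drop stop words from one labeled line, keeping label tokens intact."""
    kept = [
        token
        for token in line.split()
        if token.startswith(label_prefix) or token.lower() not in STOP_WORDS
    ]
    return " ".join(kept)


filtered = remove_stop_words(
    "__label__Hazan The clarity of speech is reduced in noise"
)
print(filtered)  # __label__Hazan clarity speech reduced noise
```

Training one model on the filtered lines and one on the original lines, then comparing test accuracy, would show whether the removal actually helps.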
Fujino (2017) demonstrated that topic modeling is effective in predicting authorship in cases where IR staff need to identify authors accurately. The present study suggests that text classification with fastText can also be an effective approach to help staff carry out this task.

 

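Sources of information on this research (published papers, conference presentations, preprints, websites, etc.)


If a research meeting was held, enter its theme, date and time, venue, and number of participants.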


 

List of research participants

Name

Affiliation

武井 美緒

The Institute of Statistical Mathematics

藤野 友和

Fukuoka Women's University