Introduction
One of the ongoing projects in this research theme, led by Fujino (e.g., Fujino, 2017), has been investigating whether topic modeling is an effective method for identifying authorship. Specifically, the project has examined whether topic modeling can distinguish authors who share an identical name and have similar research interests. In the present study, I investigated whether the same task can be handled in another way, using fastText (Bojanowski et al., 2016). I used academic papers written by three well-known researchers in my own research field (i.e., speech science) whose research interests are very similar. In this way, I examined whether fastText can potentially identify researchers who have an identical name and share similar research interests.
Data
I chose three major researchers in speech science (i.e., Paul Iverson, Valerie Hazan, and Ann Bradlow), all of whom have similar research interests. I collected four academic papers for each scholar and created text data from the full texts using Gimp (The GIMP Team, 2018), R (R Core Team, 2019), and the tesseract package (Ooms, 2018). I used full texts because I thought abstracts alone would not provide enough data to run analyses with fastText. I omitted all text in parentheses and brackets, since most of the information inside them consists of references or numbers related to statistical analyses. I also omitted the reference sections. The data set contained 915, 1318, and 1131 sentences for Iverson, Hazan, and Bradlow, respectively.
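The removal of parenthesized and bracketed text can be illustrated with a minimal Python sketch. This is not the procedure actually used in the study (the text data were created with R and the tesseract package); the function name `strip_bracketed` and the whitespace clean-up rules are assumptions made for illustration.

```python
import re

# Matches one innermost "(...)" or "[...]" group, i.e. a group that
# contains no further parentheses or brackets.
_BRACKETED = re.compile(r"\([^()\[\]]*\)|\[[^()\[\]]*\]")


def strip_bracketed(text: str) -> str:
    """Remove all text in parentheses and square brackets (hypothetical helper)."""
    previous = None
    # Repeat until nothing changes so that nested groups are also removed.
    while previous != text:
        previous = text
        text = _BRACKETED.sub("", text)
    # Collapse the whitespace left behind by the deletions.
    text = re.sub(r"\s+", " ", text)
    # Drop any space that now precedes a punctuation mark.
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return text.strip()


print(strip_bracketed("Listeners improved (p < .05) after training [12]."))
```

Unbalanced parentheses, which OCR output can easily contain, are left untouched by this sketch and would still need manual checking.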
Analyses
In order to create a fastText model, I split the data set into a training dataset and a test dataset. Seventy percent of the data set was used to train a model, and the remaining 30% was used to test it (2355 sentences for the training dataset and 1009 sentences for the test dataset).
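The split can be sketched as follows. fastText's supervised mode expects one example per line, with the class given as a `__label__` prefix. The report does not state how sentences were assigned to the two sets, so the random shuffle, the fixed seed, and the helper names `to_fasttext_lines` and `split_train_test` are all assumptions.

```python
import random


def to_fasttext_lines(sentences_by_author):
    """Turn {author: [sentence, ...]} into fastText lines: '__label__<author> <text>'."""
    lines = []
    for author, sentences in sentences_by_author.items():
        for sentence in sentences:
            # fastText reads one example per line, so flatten internal whitespace.
            text = " ".join(sentence.split())
            lines.append(f"__label__{author} {text}")
    return lines


def split_train_test(lines, train_fraction=0.7, seed=0):
    """Shuffle the lines and split them into training and test lists (assumed random split)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder sentences that only reproduce the per-author counts in the report.
data = {
    "iverson": ["placeholder sentence"] * 915,
    "hazan": ["placeholder sentence"] * 1318,
    "bradlow": ["placeholder sentence"] * 1131,
}
train, test = split_train_test(to_fasttext_lines(data))
print(len(train), len(test))
```

With the reported counts (3364 sentences in total), a 70% cut gives 2355 training and 1009 test sentences, which matches the figures above.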
Results
The results demonstrated that, although the three researchers share similar research interests, fastText predicted authorship with nearly 90% accuracy (n-gram = 2, 88.7%; n-gram = 3, 88.6%; n-gram = 4, 88.7%). Given that the three researchers in this data set share similar research interests, it seems fair to assume that authors with identical names working in similar research fields can potentially be distinguished in the same way.
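A comparison across n-gram settings could be run roughly as in the sketch below, which uses the fastText Python bindings. It assumes that "n-gram" refers to fastText's word n-gram setting (`wordNgrams`), that all other hyperparameters stay at their defaults, and that the training and test files are already in `__label__` format; none of this is confirmed by the report, and the function names and file paths are hypothetical.

```python
def evaluate_ngram_settings(train_path, test_path, ngram_sizes=(2, 3, 4)):
    """Train one supervised fastText model per n-gram size and return accuracies."""
    import fasttext  # third-party package, imported here so the helper below works without it

    results = {}
    for n in ngram_sizes:
        # Assumption: "n-gram = n" in the report corresponds to wordNgrams=n.
        model = fasttext.train_supervised(input=train_path, wordNgrams=n)
        # model.test returns (number of examples, precision@1, recall@1).
        # With exactly one label per sentence, precision@1 equals accuracy.
        _n_examples, precision_at_1, _recall_at_1 = model.test(test_path)
        results[n] = precision_at_1
    return results


def format_results(results):
    """Format {n: accuracy} as a short report string with percentages."""
    return "; ".join(f"n-gram = {n}: {acc * 100:.1f}%" for n, acc in sorted(results.items()))


# Example call (hypothetical file names):
# print(format_results(evaluate_ngram_settings("train.txt", "test.txt")))
```

Because fastText training is not fully deterministic across runs, small differences between settings, such as those reported here, should be interpreted with caution.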
Discussion
The present results suggest that fastText can be another approach to identifying researchers who have identical names and similar research interests. To verify that this is the case, I need to increase the size of the dataset and run fastText models on it. Because data collection currently requires some manual work, further automation of the data processing is needed. The current approach with the tesseract package requires editing PDF files, running spell checks, correcting typos, and splitting sections by their headers. If I can speed up these processes, it will be easier to expand the present study.
The present study leaves room for refining the fastText approach. For example, it may be possible to predict authorship with a smaller amount of text. The present study used the full texts of the academic papers, but the text of the introduction or discussion sections alone may suffice to identify authors. Another approach is to create models without stop words, which may increase prediction accuracy.
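The stop-word refinement could be prototyped by filtering each sentence before it is labeled and written to the training file. The sketch below is only an illustration: the stop-word list is a tiny hypothetical sample rather than a standard list, and the function name `remove_stop_words` is an assumption.

```python
# A deliberately small, hypothetical stop-word list for illustration only.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "was", "were"}
)


def remove_stop_words(sentence, stop_words=STOP_WORDS):
    """Return the sentence with stop words removed (case-insensitive match)."""
    kept = [token for token in sentence.split() if token.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("The listeners were trained on the contrast"))
```

Whether this helps is an empirical question, since function words can themselves carry information about an author's style.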
Fujino (2017) demonstrated that topic modeling is effective in predicting authorship in cases where IR staff need to identify authors accurately. The present study demonstrated that text classification with fastText can also be an effective approach to help staff carry out this task.