Proceedings of the Institute of Statistical Mathematics Vol.72, No.1, 3-21 (2024)

Review of Measurement Models for Technology-enhanced Testing

Kentaro Kato
(Benesse Educational Research & Development Institute)

As computer-based testing (CBT) has become a major mode of administering educational assessments, its potential advantages over traditional paper-based testing (PBT) have drawn attention. Among them are technology-enhanced items (TEIs) and their use for improved measurement of learning status. Given this situation, this paper reviews recent attempts and practices in the exploratory analysis and psychometric modeling of the various types of data generated by TEIs. The review identified the following actively pursued research topics: (a) effects of innovative response formats on psychometric properties, (b) exploratory analysis of the utility of process data, (c) equivalence between the traditional measurement scale and a new scale involving TEIs, and (d) psychometric modeling of process data to reveal response processes. Regarding the last point, it was noted that further consideration will be necessary to produce measurement results that are more useful for learning and to sustain stable operation of tests involving TEIs.

Key words: Computer-based testing, item response theory, technology-enhanced items, process data.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.1, 23-41 (2024)

Interpretable Natural Language Processing Models for Educational Applications

Yo Ehara
(Faculty of Education, Tokyo Gakugei University)

In educational applications, measuring the ability of a learner or the characteristics of a question, such as its difficulty, is a fundamental task with broad utility in learning support systems and other applications. If human teachers can interpret the abilities of examinees and the characteristics of questions, rather than simply using models to predict whether subjects will answer a given question correctly, those abilities and characteristics can also inform teaching. In statistics, item response theory (IRT) has been used to estimate interpretable parameters from the response patterns of test takers, although IRT does not use the natural language text of each question. By contrast, in natural language processing, there has been interest in extracting item characteristics such as difficulty from the text of questions. In particular, research on difficulty estimation from text has been active in applications such as language learning support, where features that are easy to obtain, such as word frequency, can explain much of the difficulty of a question. In the present paper, we explain how difficulty estimation from text is related to IRT, introducing research from various fields in addition to statistics, with a particular focus on vocabulary learning and reading in a second language. We then discuss neural methods such as self-supervised learning and Transformers, which have achieved high accuracy in recent years in analyses that take the meaning of the text into account.
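
As a minimal illustration of the connection described above (not taken from the paper; the variable names and toy data are hypothetical), the sketch below regresses IRT difficulty parameters, assumed to have been estimated beforehand from response patterns, on simple text features such as mean log word frequency and item length.

import numpy as np

# Hypothetical toy data: one row per item.
# Column 1: mean log corpus frequency of the item's words (higher = more common words)
# Column 2: number of words in the item text
text_features = np.array([
    [9.2, 12],
    [7.5, 18],
    [6.1, 25],
    [8.4, 15],
    [5.3, 30],
], dtype=float)

# IRT difficulty parameters b_i, assumed already estimated from response patterns
# (e.g., by fitting a 2PL model to examinee-by-item correctness data).
b = np.array([-1.2, 0.1, 1.4, -0.5, 2.0])

# Ordinary least squares: predict difficulty from the text features.
X = np.column_stack([np.ones(len(b)), text_features])
coef, *_ = np.linalg.lstsq(X, b, rcond=None)
predicted_b = X @ coef

print("regression coefficients:", coef)
print("predicted difficulties :", predicted_b)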

Key words: Item response theory, natural language processing, second language acquisition.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.1, 43-59 (2024)

Automated Parallel Test Assembly Using Integer Programming with Item Exposure Penalties

Kazuma Fuchimoto
(Graduate School of Informatics and Engineering, The University of Electro-Communications)
Maomi Ueno
(Graduate School of Informatics and Engineering, The University of Electro-Communications)

One feature of e-Testing is the automated assembly of parallel test forms, in which each form has equivalent measurement accuracy but a different set of items. Unfortunately, automated test assembly often causes biased item exposure, and this bias decreases item and test reliability. To resolve this difficulty, this study formulates the test assembly problem as an integer programming problem whose objective function incorporates two logistic item exposure penalties: a deterministic logistic item exposure penalty, and a stochastic logistic item exposure penalty based on the Big-M method, a standard technique in mathematical programming. Numerical experiments demonstrate that the proposed methods reduce the bias of item exposure without decreasing the number of assembled tests.
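
To make the general idea concrete, here is a minimal sketch (not the authors' formulation; the item pool and the simple linear exposure penalty are hypothetical stand-ins for the logistic penalties in the paper) of an integer programming test assembly in Python with the PuLP library: it selects a fixed-length form that maximizes Fisher information at a target ability while a penalty in the objective discourages reusing frequently exposed items.

from pulp import LpProblem, LpMaximize, LpVariable, lpSum

# Hypothetical item pool: Fisher information at the target ability and
# how many times each item has already appeared on previous forms.
information = [0.92, 0.85, 0.84, 0.80, 0.78, 0.61, 0.55]
exposure    = [5,    0,    4,    1,    0,    0,    3]
n_items, form_length, penalty_weight = len(information), 3, 0.1

prob = LpProblem("parallel_form_assembly", LpMaximize)
x = [LpVariable(f"x_{i}", cat="Binary") for i in range(n_items)]

# Objective: total information minus a simple exposure penalty
# (a stand-in for the deterministic/stochastic logistic penalties in the paper).
prob += lpSum(information[i] * x[i] - penalty_weight * exposure[i] * x[i]
              for i in range(n_items))

# Each form contains exactly `form_length` items.
prob += lpSum(x) == form_length

prob.solve()
selected = [i for i in range(n_items) if x[i].value() == 1]
print("selected items:", selected)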

Key words: Automated test assembly, item response theory, e-Testing, item exposure problem, integer programming.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.1, 61-78 (2024)

How to Obtain a Common Scale for a Psychological Concept: Methods and Practices of Equating and Linking

Haruhiko Mitsunaga
(Graduate School of Education and Human Development, Nagoya University)

Testing programs yield examinees' scores that reflect their ability or competency. To align the scales of different test administrations and obtain a common scale across these tests, researchers have proposed various equating and linking procedures. Equating is applied when multiple tests measure the same latent scale, whereas linking can be considered a collective term for placing tests on a common scale under weaker constraints such as unidimensionality. This article illustrates methods of equating and linking. In practice, a testing program that ensures equity requires a specification of how multiple tests are to be administered longitudinally. We propose a block diagram notation to visualize the test design and equating procedure, and describe an example of a large-scale assessment that equates scales across multiple grades.
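
As one concrete example of the kind of procedure discussed here (illustrative only, with hypothetical parameter values; the article itself is not limited to this method), the sketch below performs mean-sigma linking: difficulty parameters of common items, estimated separately on two test forms, determine a linear transformation that places the new form on the base-form scale.

import numpy as np

# Difficulty parameters of the common (anchor) items,
# estimated separately on the base form and on the new form.
b_base = np.array([-1.0, -0.2, 0.5, 1.3])
b_new  = np.array([-0.7,  0.1, 0.9, 1.6])

# Mean-sigma linking: theta_base = A * theta_new + B
A = b_base.std(ddof=1) / b_new.std(ddof=1)
B = b_base.mean() - A * b_new.mean()

# Transform all item parameters of the new form onto the base scale.
a_new_form = np.array([1.1, 0.8, 1.4])
b_new_form = np.array([-0.3, 0.6, 1.2])
a_linked = a_new_form / A
b_linked = A * b_new_form + B

print(f"A = {A:.3f}, B = {B:.3f}")
print("linked difficulties:", np.round(b_linked, 3))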

Key words: Test theory, item response theory, large-scale assessment, educational measurement.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.1, 79-92 (2024)

Current Status and Future Directions of Patient-Reported Outcome Item Banks Using Item Response Theory

Yoshihiko Kunisato
(School of Human Sciences, Senshu University)
Yoshitake Takebayashi
(School of Medicine, Fukushima Medical University)

Patient-reported outcomes are health status reports obtained directly from patients, for example through questionnaires, and are often used as outcomes in clinical trials. Although numerous patient-reported outcome measures have been developed based on classical test theory, the Patient-Reported Outcomes Measurement Information System (PROMIS®) has developed an item bank for computerized adaptive tests based on item response theory. Computerized adaptive testing enables the number of questions to be reduced while maintaining measurement precision, thereby lessening respondent burden. PROMIS® advances scale development in the order of item pool development, psychometric testing of items, and validity checking, setting scientific standards for scale development. These scientific standards involve (1) definition of the target construct, (2) composition of individual items, (3) item pool construction, (4) determination of item bank properties, (5) testing and instrument formats, (6) validity, (7) reliability, (8) interpretability, and (9) language translation and cultural adaptation. In this paper, we discuss the scale development process in PROMIS® and consider the requirements for developing patient-reported outcome item banks in Japan.
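
A minimal sketch of the adaptive logic behind such item banks follows (illustrative only: the item parameters and responses are hypothetical, and PROMIS® instruments typically use graded-response items rather than the binary 2PL items shown here). Each step selects the unanswered item with maximum Fisher information at the current ability estimate and updates the estimate from the accumulated responses, which is how the number of questions can be reduced without sacrificing precision.

import numpy as np

def p2pl(theta, a, b):
    """2PL probability of a correct (or endorsed) response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item bank (discrimination a, difficulty b) and canned responses.
a_par = np.array([1.2, 0.9, 1.5, 1.1, 0.7])
b_par = np.array([-1.0, -0.3, 0.2, 0.8, 1.5])
responses = {0: 1, 1: 1, 2: 0, 3: 0, 4: 0}      # would come from the respondent

grid = np.linspace(-4, 4, 161)                   # ability grid
prior = np.exp(-0.5 * grid ** 2)                 # standard normal prior (unnormalized)
likelihood = np.ones_like(grid)
administered = []

for _ in range(3):                               # stop after 3 items for brevity
    posterior = prior * likelihood
    posterior /= posterior.sum()
    theta_hat = float((grid * posterior).sum())  # EAP ability estimate
    # Select the unused item with maximum Fisher information at theta_hat.
    p_hat = p2pl(theta_hat, a_par, b_par)
    info = a_par ** 2 * p_hat * (1 - p_hat)
    info[administered] = -np.inf
    item = int(np.argmax(info))
    administered.append(item)
    u = responses[item]                          # observed response to that item
    p = p2pl(grid, a_par[item], b_par[item])
    likelihood *= p ** u * (1 - p) ** (1 - u)

posterior = prior * likelihood
theta_hat = float((grid * posterior / posterior.sum()).sum())
print("administered items:", administered, "final EAP estimate:", round(theta_hat, 2))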

Key words: Patient-reported outcome, item response theory, item bank, PROMIS®, COSMIN.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.1, 93-119 (2024)

Recent Development of Parameter Estimation Methods in Diagnostic Classification Models

Kazuhiro Yamaguchi
(Division of Psychology, Institute of Human Sciences, University of Tsukuba)

Diagnostic classification models (DCMs), or cognitive diagnostic models (CDMs), are a family of models that represent the elements of cognitive ability required to solve test items, called “attributes,” and classify test takers into attribute mastery patterns. DCMs are a useful tool for analyzing educational tests and have been applied to a variety of tests. However, estimation methods for DCMs have not been well examined in Japan. This study aimed to review the current state of development of estimation methods for DCMs and thereby contribute to improving the application and theoretical development of DCMs. The review found that estimation methods have been developed along three lines: maximum likelihood estimation, Bayesian estimation, and non-parametric estimation. In particular, relatively new approaches such as regularized maximum likelihood estimation and variational Bayes have been employed. Finally, the remaining estimation problems and future research topics in this area are discussed.
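
As a minimal sketch of what a DCM does (illustrative only; the Q-matrix and parameter values are hypothetical), the code below computes DINA correct-response probabilities for each attribute mastery pattern and the posterior probability of each pattern for one respondent, taking the item parameters as known. The estimation methods reviewed in the paper concern how such item parameters and pattern probabilities are estimated from data.

import itertools
import numpy as np

# Hypothetical DINA setup: 4 items, 2 attributes.
Q = np.array([[1, 0],      # Q-matrix: attributes required by each item
              [0, 1],
              [1, 1],
              [1, 0]])
slip  = np.array([0.10, 0.20, 0.15, 0.10])
guess = np.array([0.20, 0.25, 0.10, 0.20])

# All 2^K attribute mastery patterns.
patterns = np.array(list(itertools.product([0, 1], repeat=Q.shape[1])))

def dina_prob(alpha):
    """Correct-response probabilities for one attribute pattern under DINA."""
    eta = np.all(alpha >= Q, axis=1).astype(float)   # all required attributes mastered?
    return (1 - slip) ** eta * guess ** (1 - eta)

# Posterior classification of one respondent (uniform prior over patterns).
x = np.array([1, 0, 0, 1])                           # observed item responses
like = np.array([np.prod(dina_prob(a) ** x * (1 - dina_prob(a)) ** (1 - x))
                 for a in patterns])
posterior = like / like.sum()
for a, p in zip(patterns, posterior):
    print(a, round(float(p), 3))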

Key words: Diagnostic classification models, cognitive diagnostic models, parameter estimation methods.


Proceedings of the Institute of Statistical Mathematics Vol.72, No.1, 121-146 (2024)

Examining Estimation Accuracy of Cognitive Diagnostic Models in Classroom Contexts
—Focusing on the Impact of Model Misspecification and Different Estimation Methods—

Shun Saso
(Faculty of Education, University of Tokyo)
Motonori Oka
(Department of Statistics, London School of Economics and Political Science)
Satoshi Usami
(Faculty of Education, University of Tokyo)

Cognitive diagnostic models (CDMs) are promising psychometric models that can estimate students' mastery of specific learning elements referred to as attributes. Despite their potential to provide diagnostic information that aids students' learning and teachers' instruction, practical applications of CDMs for classroom assessment have been severely limited. This limitation is partly due to insufficient exploration of the estimation accuracy of CDMs and of the behavior of model selection based on information criteria in classroom applications. In the present study, we investigated the accuracy of different estimation methods and evaluated the corresponding information criteria under a simulation design that resembles classroom contexts. Whereas previous simulation studies fitted the same model that generated the data, the present study examines the effect of model misspecification. In particular, this study uses the log-linear cognitive diagnostic model (LCDM), one of the generalized CDMs, to generate data and evaluates the performance of not only the LCDM but also its submodels, including the DINA (deterministic inputs, noisy “and” gate) model, DINO (deterministic inputs, noisy “or” gate) model, RRUM (reduced reparameterized unified model), and CRUM (compensatory reparameterized unified model). The key findings are summarized as follows: (1) The CRUM, which assumes that mastery of each attribute independently affects the probability of a correct item response, exhibited the highest estimation accuracy for attribute mastery patterns; its accuracy was comparable to, or slightly less than, that of the true data-generating model under both maximum likelihood estimation (MLE) and Bayesian estimation. (2) Increasing the number of items and decreasing the number of attributes improved the estimation accuracy of the attribute mastery patterns, whereas increasing the sample size did not. (3) Under MLE, the Akaike information criterion (AIC) most frequently preferred the CRUM, whereas the Bayesian information criterion (BIC) preferred the most parsimonious CDMs, suggesting that model selection be based on the AIC. Under Bayesian estimation, the widely applicable information criterion (WAIC) most frequently preferred the CDMs that best approximated the true data-generating model.
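
To clarify the nesting relation exploited in the simulation, the following equations (standard LCDM notation, not reproduced from the paper) show the LCDM item response function for an item requiring two attributes, together with the constraints that yield two of the submodels examined:

\[
\operatorname{logit} P(X_{ij}=1 \mid \boldsymbol{\alpha}_j)
  = \lambda_{i,0}
  + \lambda_{i,1,(1)}\,\alpha_{j1}
  + \lambda_{i,1,(2)}\,\alpha_{j2}
  + \lambda_{i,2,(1,2)}\,\alpha_{j1}\alpha_{j2},
\]
\[
\text{DINA: } \lambda_{i,1,(1)} = \lambda_{i,1,(2)} = 0, \qquad
\text{CRUM: } \lambda_{i,2,(1,2)} = 0 .
\]

DINA retains only the intercept and the interaction term, so success requires mastery of all required attributes, whereas CRUM retains only additive main effects, consistent with the independence assumption noted in finding (1).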

Key words: Cognitive diagnostic models, small sample size, Bayesian estimation method, information criteria, formative assessment.