
Educational Assessment

EAPA Digital Event 2021


Machine Based Learning in Psychological Assessment (Keynote)

Christine DiStefano, University of South Carolina

As an analytical technique, machine learning has received attention in a variety of fields. This discussion is geared toward researchers interested in exploring machine learning and how it can be applied in psychological assessment. The talk will introduce the technique and briefly discuss different methodologies that can be used to uncover patterns in data. Applications of machine learning to psychological assessment and areas for future research will be reviewed.

Date: 18 May 2021, 6 p.m. (Berlin time)

Register: Send email to and state your name, the event(s) you want to attend and the email address we can contact you with.


Computer-Based Innovations in International Large-Scale Assessments (Symposium)

Lale Khorramdel (Boston College, Chestnut Hill, MA, USA) & Matthias von Davier (Boston College, Chestnut Hill, MA, USA)

Computer-based large-scale assessments, such as the Programme for International Student Assessment (PISA) or the Programme for the International Assessment of Adult Competencies (PIAAC), offer exciting opportunities for innovations in item development, test administration, data collection, scoring, and analysis. They not only allow features such as adaptive testing, automated scoring, or process data collection, but also the combination of these features to potentially improve the validity and accuracy of statistical analyses and scores. However, these innovations can challenge the use of traditional statistical models as well as the interpretation of trend measures when assessments move from a paper-based assessment (PBA) to a computer-based assessment (CBA).

The symposium’s studies cover the following recent developments:
1. Item response theory (IRT) model extensions to test and account for potential mode effects,
2. natural language processing (NLP) techniques for ensuring the efficiency of automatically coding text responses,
3. multistage adaptive testing for increasing test efficiency and accuracy,
4. mixture IRT and response time modeling to examine rapid guessing, and
5. a latent class response time model for identifying and modeling careless responding in noncognitive assessments.

The presented approaches and developments aim to enable a smooth transition from PBA to CBA, illustrate how new technologies and psychometric models can be used to advance the field of (large-scale) assessments, and investigate the usefulness of response time and process data for more accurate proficiency estimations or a better understanding of respondents’ behavior. Innovations and challenges are discussed and illustrated using empirical examples from PISA and PIAAC.


Modeling mode effects in PISA 2015 using IRT model extensions

Matthias von Davier & Lale Khorramdel (Boston College, Chestnut Hill, MA, USA)

International large-scale assessments (ILSAs) transitioned from paper-based assessments (PBA) to computer-based assessments (CBA), facilitating the use of new item types and more effective data collection tools. CBAs allow the implementation of complex test designs and the collection of process and response time (RT) data, which can be used to improve data quality and the accuracy of parameter estimates. However, the move to a CBA also poses challenges for the measurement of trends over time, as results of the same test administered in different modes might not be directly comparable. In addition, it must be established whether comparability of different countries' results within the same assessment cycle can be maintained if some countries are not ready to move to a CBA and continue to administer a PBA. Mode effects may manifest in the form of differential item functioning (DIF) observable on some items when comparing equivalent groups across assessment modes. This paper presents extensions of item response theory (IRT) models to test for mode effects and to address violations of measurement invariance if effects are present. The approach was used to establish the types of mode effects and to treat problematic items accordingly by allowing partial measurement invariance. The different models were compared using data from PISA 2015 (Programme for International Student Assessment), and the impact of different levels of measurement invariance on the comparability of results will be discussed.
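
As a rough illustration of what such an IRT extension can look like, the sketch below adds a mode shift to a two-parameter logistic model. The notation is an assumption made here for illustration and is not taken from the presentation: \theta_j is the proficiency of respondent j, a_i and b_i are the discrimination and difficulty of item i, m_j indicates the administration mode, and \delta_i is a hypothetical item-specific mode effect.

```latex
P(X_{ij} = 1 \mid \theta_j, m_j)
  = \frac{\exp\bigl(a_i(\theta_j - b_i) - m_j\,\delta_i\bigr)}
         {1 + \exp\bigl(a_i(\theta_j - b_i) - m_j\,\delta_i\bigr)},
\qquad
m_j =
\begin{cases}
  0 & \text{paper-based (PBA)} \\
  1 & \text{computer-based (CBA)}
\end{cases}
```

Under this sketch, \delta_i = 0 for all items corresponds to full measurement invariance across modes, a common \delta_i = \delta corresponds to a general mode effect, and freeing \delta_i only for selected items corresponds to partial measurement invariance.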


Efficient Automatic Coding of PISA Text Responses: Learning from Previous Assessments

Fabian Zehner (1), Frank Goldhammer (1, 2) & Nico Andersen (1)
1 DIPF | Leibniz Institute for Research and Information in Education, Frankfurt am Main, Germany
2 Zentrum für Internationale Bildungsvergleichsstudien (ZIB), Frankfurt am Main, Germany

Automatic coding (e.g., scoring) of short-text responses makes it possible to tackle many challenges associated with the constructed-response format in assessment. For example, it satisfies the need for on-the-fly scoring in adaptive testing. The presentation outlines historical developments in automatic coding and describes the methodology behind ReCo (Zehner, Sälzer, & Goldhammer, 2016), which draws on machine learning and natural language processing. Additionally, we report on a new study investigating whether classifiers, once trained, can be reused in repeated measurements. PISA's tests contain substantial numbers of items requiring human coding (26–45 percent in 2015; OECD, 2017). Human coding places heavy demands on resources, and coders bring varying subjective perspectives, coding experience, and stamina, which can harm data quality. However, the gain in consistency and the reduction in effort through automatic coding only hold if classifiers can be reused across repeated measurements with sufficient accuracy and robustness to changes in study design. We investigated the difference in accuracy of classifiers that had been trained on data from PISA 2012 (Zehner et al., 2016) when applied to PISA 2012 and PISA 2015, a sensitive comparison given the change to computer-based assessment in 2015. The data included about n = 50,000 responses to ten items. Results indicate a substantial decrease in accuracy for only one math item (Δk = -.188), while the other items ranged within Δk = [-.101, +.081]. Leaving the math item aside, no aggregated decrease was found. The presentation discusses the control mechanisms required for potential operational use.


Adaptive Testing in International Large-Scale Assessments – Considerations and Applications

Kentaro Yamamoto (1), Hyo Jeong Shin (1) & Lale Khorramdel (2)
1 Educational Testing Service, Princeton, USA
2 Boston College, Boston, USA

The move of international large-scale assessments (ILSAs) from a paper-based to a computer-based administration mode allowed the implementation of multistage adaptive testing (MST). In 2012, MST was implemented for the Programme for the International Assessment of Adult Competencies (PIAAC) for about 40 countries, and for the 2018 cycle of the Programme for International Student Assessment (PISA) for more than 80 countries. In ILSAs, the proficiency measures are based on a large item pool reflecting a broad construct framework and are assessed across multiple languages and heterogeneous populations. MST can reduce measurement error, especially at the high and low ends of the proficiency distribution, without increasing respondents' individual burden (number of items) while simultaneously enabling broad construct coverage. Moreover, both automatically coded items (e.g., multiple-choice items) and human-coded constructed-response items can be included in the design. Using examples from PISA and PIAAC, this presentation covers the advantages and considerations of an MST design in the context of ILSAs. We will illustrate and discuss the unique features of the implemented designs, the expected gains in test efficiency and accuracy, as well as the limitations and challenges of MST designs for cross-country surveys. Practical aspects and insights into utilizing MST to measure complex constructs in cross-cultural surveys will be provided.
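
The basic routing idea behind MST can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the operational PISA or PIAAC design: the function name `route_next_module`, the module labels, and the cut scores `low_cut` and `high_cut` are all hypothetical.

```python
def route_next_module(num_correct: int, num_items: int,
                      low_cut: float = 0.4, high_cut: float = 0.7) -> str:
    """Pick the difficulty of the next module from performance on the current one.

    low_cut and high_cut are hypothetical proportion-correct cut scores.
    """
    if num_items <= 0:
        raise ValueError("num_items must be positive")
    if not 0 <= num_correct <= num_items:
        raise ValueError("num_correct must lie between 0 and num_items")

    proportion_correct = num_correct / num_items
    if proportion_correct < low_cut:
        return "easy"
    if proportion_correct < high_cut:
        return "medium"
    return "hard"


# Example: three respondents after a ten-item first-stage module.
for score in (2, 5, 9):
    print(score, route_next_module(score, 10))
```

Each respondent answers the same number of items, but the later modules are matched to the performance shown so far, which is what allows precision to improve at the ends of the proficiency distribution without adding to individual burden.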


Mixture IRT Modeling Approaches and Response Time to Correct for Rapid Guessing

Artur Pokropek (1) & Lale Khorramdel (2)
1 Polish Academy of Sciences, Institute of Philosophy and Sociology, Warsaw, Poland
2 Boston College, Boston, USA

In low-stakes assessments like PIAAC (Programme for the International Assessment of Adult Competencies) and other studies that measure abilities, test-taking motivation is a crucial concern. Low test-taking motivation can be reflected in fast responses, omitted responses, or guessing in both cognitive and non-cognitive assessments. It is assumed that motivated examinees take a certain amount of time to respond to questions because they actively try to solve each item. Examinees with low motivation, on the other hand, might show rapid guessing, indicated by very short response times (too short to analyze the items fully). Computer-based testing and response filtering based on timing information have attracted the attention of researchers because these methods promise notable increases in the accuracy of estimates and are easy to apply (Wise & Kong, 2005). While simple item filtering brings practical and methodological problems, recent extensions of IRT mixture models that allow for the inclusion of collateral information promise a more accurate solution. In this presentation, we will discuss and compare the performance of the HYBRID model (Yamamoto, 1989), the grade of membership model (GoM; Erosheva, 2005), and their extensions (Pokropek, 2015) using a Monte Carlo simulation study. Based on data from PIAAC, we will show how information about different characteristics of items and respondents, together with response time, might help in detecting guessing behavior, correcting the ability estimates, and identifying items that are particularly prone to guessing.
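
For orientation, the simple response-time filtering that the mixture models are contrasted with can be sketched as follows. This is a minimal illustration, not the authors' method: the function names and the 3-second thresholds are hypothetical, and the effort index follows the general idea of response time effort (the share of items answered with solution behavior).

```python
from typing import List


def flag_rapid_guesses(response_times: List[float],
                       thresholds: List[float]) -> List[bool]:
    """Return True for each item whose response time is below its threshold."""
    if len(response_times) != len(thresholds):
        raise ValueError("response_times and thresholds must have equal length")
    return [rt < th for rt, th in zip(response_times, thresholds)]


def response_time_effort(response_times: List[float],
                         thresholds: List[float]) -> float:
    """Proportion of items NOT flagged as rapid guesses for one respondent."""
    flags = flag_rapid_guesses(response_times, thresholds)
    if not flags:
        raise ValueError("at least one item is required")
    return 1.0 - sum(flags) / len(flags)


# Example: one respondent, four items, hypothetical 3-second thresholds.
times = [1.0, 12.0, 2.5, 30.0]
cutoffs = [3.0, 3.0, 3.0, 3.0]
print(flag_rapid_guesses(times, cutoffs))
print(response_time_effort(times, cutoffs))
```

A hard threshold like this treats every fast response as a guess and every slow one as effortful, which is one reason model-based approaches that estimate class membership from responses and collateral information are of interest.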


A Latent Class Response Time Model for Identifying and Modeling Careless and Insufficient Effort Responding in Noncognitive Assessment Data

Esther Ulitzsch (1), Steffi Pohl (1), Ulf Kröhne (2) & Matthias von Davier (3)
1 Freie Universität Berlin, Berlin, Germany
2 DIPF | Leibniz Institute for Research and Information in Education, Frankfurt am Main, Germany
3 Boston College, Chestnut Hill, MA, USA

Careless and insufficient effort responding (C/IER) can pose a major threat to data quality and, as such, to the validity of inferences drawn from questionnaire data. A rich body of methods aimed at its detection has been developed. Most of these methods are tailored to detect only one specific type of C/IER pattern. However, different types of C/IER patterns typically occur within one data set, and all of them need to be accounted for simultaneously. We present a model-based approach for detecting manifold manifestations of C/IER at once. This is achieved by leveraging response time (RT) information available from computer-administered questionnaires and integrating theoretical considerations on C/IER with recent psychometric modeling approaches. The approach a) takes the specifics of attentive response behavior in noncognitive assessments into account by incorporating the distance-difficulty hypothesis, b) allows for different subpopulations of respondents that differ in whether they approach the assessment with no or full attentiveness, and c) can deal with various response patterns arising from C/IER. Two versions of the model, one for item-level RTs and one for aggregated RTs, are presented. The approach is illustrated in an empirical example comparing different RT measures. We discuss assumptions that aid in choosing the appropriate version of the model in practice.


Date: 18 May 2021, 4 p.m. (Berlin time)

Register: Send email to and state your name, the event(s) you want to attend and the email address we can contact you with.