FoL 2026
L@S

Measuring Simulation Fidelity via Statistical Detectability: A Diagnostic Framework for AI-Generated Tutoring Conversations

Mon Jun 29, 11:35 AM–12:00 PM · North 205
★ Notable speakers
Kevyn Collins-Thompson — Statistical text readability prediction; vocabulary learning; adaptive intelligent reading tutors

Large-scale, realistic simulations of learning interactions powered by large language models (LLMs) could have significant impact on educational research and practice at scale, from fast, inexpensive, and safe evaluation and pretesting of AI-assisted systems to scalable practice sessions in teacher preparation programs. However, if synthetic conversations do not preserve essential statistical properties of real interactions, especially those related to cognitive aspects of teaching and learning, conclusions drawn from them may be invalid. Basic methods for estimating the fidelity of synthetic data, such as non-parametric tests of distributional similarity or comparisons of marginal distributions for individual features, lack detailed diagnostic ability and interpretability, especially when real data are characterized by complex feature interactions. We propose measuring synthetic conversation quality in terms of interpretable statistical detectability, building on recent statistical work on the propensity score mean squared error (pMSE) ratio, a metric originally introduced for validating synthetic tabular data. We show how to adapt the pMSE approach to conversation data by first developing a feature extraction workflow that maps variable-length natural language dialogues to a rich representation vector capturing both surface patterns (message length, vocabulary, turn structure) and cognitive dynamics (confusion duration, resolution patterns, hint sensitivity). Using a sample from a dataset of authentic online tutoring dialogues as the comparison set, we then create and evaluate a series of increasingly sophisticated synthetic conversational datasets generated by iteratively improved prompt strategies used with a state-of-the-art LLM.
We show that our fidelity assessment framework is effective at detecting real vs. synthetic differences not only between marginal distributions of surface features (e.g., mean and variance of tutor message lengths), but also between joint distributions of cognitive-related features (e.g., confusion rate × confusion duration). The pMSE ratio reliably tracks the sophistication level of our prompt strategies. However, even the most advanced prompt strategy (V4) produces a pMSE ratio of 50.6% (lower is better, zero is ideal), showing that substantial room for improvement remains for LLM-based generation of synthetic conversations.
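
To make the detectability idea concrete, the following is a minimal sketch of the raw pMSE statistic, not the authors' implementation. It pools real and synthetic feature vectors, fits a classifier to tell them apart, and averages the squared distance between each propensity score p_i and the synthetic share c, i.e. pMSE = (1/N) Σ (p_i − c)². The logistic-regression classifier, the helper names (`sigmoid`, `fit_logistic`, `pmse`, `toy`), and the two-dimensional Gaussian toy features are all assumptions made for illustration; the abstract does not specify the classifier, the feature set, or how the ratio is normalized, so the normalization step is omitted here.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent (illustrative choice).

    Returns the weight list with the intercept stored last.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            # zip stops at the d feature weights; the intercept is w[-1].
            err = sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[-1] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def pmse(real, synth):
    """Raw pMSE: mean squared deviation of propensity scores from the synthetic share."""
    X = real + synth
    y = [0] * len(real) + [1] * len(synth)  # 1 marks a synthetic row
    c = len(synth) / len(X)
    w = fit_logistic(X, y)
    scores = [sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
    return sum((p - c) ** 2 for p in scores) / len(scores)


# Toy stand-ins for per-conversation feature vectors (hypothetical data).
rng = random.Random(0)


def toy(n, mean):
    return [[rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)] for _ in range(n)]


real = toy(200, 0.0)
pmse_close = pmse(real, toy(200, 0.0))  # synthetic drawn from the same distribution
pmse_far = pmse(real, toy(200, 2.0))    # synthetic with shifted features
print(pmse_close, pmse_far)
```

When the classifier cannot separate the two sources, every score sits near c and the statistic approaches zero; the easier the synthetic rows are to detect, the larger it grows (up to 0.25 when the two sets are the same size).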

Authors

Michael Ion, Kevyn Collins-Thompson