L@S

A Large Scale Randomized Control Trial Showing LLM Generated Feedback Helps Low-Knowledge Middle School Math Students with Short-Term Learning

Mon Jun 29, 3:45 PM–4:10 PM · North 205

Automated Feedback · Generative AI & Large Language Models · Mathematics Education · Learning at Scale & MOOCs · K-12 Education

★ Notable speakers

Neil Heffernan ★★ — ASSISTments intelligent tutoring platform; educational data mining at scale

Adam Sales ★ — Causal inference in educational data mining; combining machine learning with randomized trial analysis; principal stratification in ITS

The benefits of feedback on learning are well established. However, delivering effective feedback at scale remains a substantial challenge due to logistical and cost constraints. This paper reports on two randomized controlled trials (RCTs) evaluating the effectiveness of AI-generated feedback for middle school mathematics. The first RCT compares AI-generated feedback with business-as-usual (correctness feedback only), while the second compares AI-generated feedback with teacher-written feedback. Across both RCTs, we evaluated the benefits of AI-generated feedback using a combined 21,478 students working on 572 problems. We developed an automated pipeline using Qwen-235B to generate feedback for 3,862 commonly occurring wrong answers. Experienced educators initially validated the feedback; we subsequently automated the process using LLM-as-a-judge to verify the content at scale. Study 1 (n=16,846 students) compared AI-generated feedback with business-as-usual, finding that students receiving AI feedback were 12-19% more likely to correct their current answer and 7% more likely to succeed on the subsequent problem. Notably, this benefit was limited to lower-knowledge students (0-1 of 5 prior problems correct), who showed a 7% increase in next-problem success, while effects diminished for higher-performing students. Study 2 (n=6,055 students) directly compared AI-generated feedback with expert teacher-written feedback, revealing comparable learning outcomes. Our findings indicate that LLM-generated feedback can match the effectiveness of teacher-authored content at a fraction of the cost (~$0.05-0.10 per cached feedback message), with particularly strong benefits for lower-performing students. This work contributes empirical evidence for the effectiveness of human-AI collaboration in generating feedback at scale.
These results indicate that human-AI collaboration can scale personalized learning without compromising instructional quality, and they highlight the importance of feedback for lower-performing students.

Authors

Eamon Worden, Luca Dang, Wen Chiang Lim, Sarah Miller, Jiayi Zhang, Aaron Haim, Adam Sales, Ashish Gurung, Neil Heffernan