A Large-Scale Analysis of Student Behavior with Pedagogically Constrained LLM Tutors
Large language models (LLMs) are increasingly integrated into higher education as digital tutors, yet our understanding of how students actually behave when interacting with these systems remains limited. We present a large-scale empirical study of student behavioral patterns in interactions with an LLM-based educational system deployed across three computer science courses of varying difficulty. Our system uses prompt engineering and retrieval-augmented generation (RAG) to align LLM capabilities with educational goals, including a homework-detection mechanism that provides hints rather than direct answers. We collected over 12,000 conversations containing more than 20,000 student messages from 589 students and developed an LLM-based annotation framework, validated against human agreement, that evaluates student behaviors along four dimensions: self-regulated learning, help-seeking behaviors, cognitive engagement (grounded in the ICAP framework), and affective-motivational signals. Our analysis of 1,500 sampled conversations reveals that the majority of interactions exhibit low engagement quality (56.1%), with students rarely presenting prior work (30.0%), seldom reflecting on their learning states (8.2%), and frequently ignoring tutor hints (28.0%). We find statistically significant differences across course levels in self-regulated learning, help-seeking orientation, and cognitive engagement: entry-level students show more procedural focus and higher self-correction rates, while advanced students demonstrate greater conceptual depth but lower overall engagement with the tutoring system. These findings highlight critical gaps in students' ability to interact productively with LLM-based tutors and provide design implications for future educational AI systems.
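As an illustration of how a difference across course levels might be tested, the following is a minimal sketch of a Pearson chi-square test of independence on a 3 x 2 table (course level by presence or absence of an annotated behavior). The abstract does not name the statistical test used, so the choice of chi-square is an assumption, and the counts below are hypothetical placeholders rather than data from the study. The helper names `chi_square_independence` and `p_value_two_dof` are introduced here for illustration only.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_two_dof(stat):
    """Upper-tail p-value for a chi-square variable with exactly 2 degrees
    of freedom, where the survival function reduces to exp(-x / 2)."""
    return math.exp(-stat / 2.0)


# HYPOTHETICAL counts (not from the study).
# Rows: entry-level, intermediate, advanced course.
# Columns: behavior present, behavior absent.
counts = [
    [24, 176],
    [18, 182],
    [8, 192],
]

stat, dof = chi_square_independence(counts)
p = p_value_two_dof(stat)  # valid only because dof == 2 for a 3 x 2 table
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p:.4f}")
```

The closed-form p-value applies only to the two-degrees-of-freedom case that arises with three course levels and a binary behavior label; tables of other shapes would need a general chi-square survival function from a statistics library.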