How Well Do Large Language Models Recognize Instructional Moves? Establishing Baselines for Foundation Models in Educational Discourse
Large language models (LLMs) are increasingly used in educational contexts, yet their ability to interpret authentic instructional discourse \textit{out of the box} remains unclear. We benchmark six state-of-the-art LLMs on classifying instructional moves in K-12 mathematics classroom transcripts annotated by expert educators ($\kappa > 0.90$). We evaluate four prompting strategies, including zero-shot, one-shot, and few-shot prompts derived from the human coding manual. Zero-shot prompting achieved fair to moderate agreement with the expert annotations ($\kappa = 0.38$–$0.48$, F1 $= 0.45$–$0.53$). Providing comprehensive examples improved performance for some models (e.g., $\kappa$ rose from 0.48 to 0.58 for Claude 4.5 Opus and from 0.38 to 0.57 for Gemini 2.5 Pro), but gains were uneven and precision remained limited (best precision = 0.56, recall = 0.75). Errors were concentrated in constructs that require inference about instructor intent; for example, models confused Pressing for Reasoning with Pressing for Accuracy (false-positive rates of 42\%–53\%). Overall, LLMs demonstrate a meaningful but limited capacity to interpret instructional discourse. These results provide a baseline for benchmarking models on educational discourse and for designing more reliable annotation workflows.
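For readers less familiar with the agreement metrics reported above, the following is a brief reminder of their standard definitions. It assumes that $\kappa$ denotes Cohen's kappa, which the abstract does not state explicitly; the symbols $p_o$, $p_e$, $P$, and $R$ are introduced here for illustration only. With $p_o$ the observed proportion of items on which the model and the human coders agree, and $p_e$ the agreement expected by chance from the marginal label frequencies,
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]
and, with $P$ denoting precision and $R$ denoting recall,
\[
F_1 = \frac{2PR}{P + R}.
\]
As an arithmetic illustration only: if the reported precision of 0.56 and recall of 0.75 came from the same model and prompting condition (which the abstract does not state), the corresponding value would be $F_1 = 2(0.56)(0.75)/(0.56 + 0.75) \approx 0.64$.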