Measuring Creativity at Scale via Multimodal Large Language Models
Recent work on automated creativity measurement uses large language models to score creative tasks at scale, for example applying text-based large language models to text-based brainstorming activities or neural networks to images from a creative drawing task. This paper expands creativity measurement in several important ways. We leverage state-of-the-art multimodal large language models (MLLMs), trained on text, image, and other data, to model creativity tasks not only in a unitask approach (one model per task) but also in a multitask approach (one model for several tasks). We connect MLLMs to benchmarks established in psychological creativity research and demonstrate that some MLLMs (notably the natively multimodal Llama 3.2 and Qwen 3-VL models) surpass the best existing scoring methods by up to 5\%, while other MLLMs (notably Llama 4-109B) need additional training procedures to accommodate limited fine-tuning data. We offer evidence that MLLMs can measure creativity based on human ratings, and explore future opportunities to advance multimodal creativity assessment within learning environments.