Benchmarking AI-Generated Code: NVIDIA's ComputeEval 2025.2
NVIDIA has expanded ComputeEval, its benchmark for evaluating AI-generated CUDA code, with the release of version 2025.2. The update adds more than 100 new challenges, bringing the total to 232 problems that assess how well large language models (LLMs) handle CUDA programming tasks.
The updated benchmark deliberately raises the difficulty, incorporating modern CUDA features such as Tensor Cores and advanced shared memory patterns. New challenges test whether LLMs can correctly use CUDA Graphs, Streams, and Events in the context of real-world applications such as dynamic simulations, along the lines of the sketch below.
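To make that concrete, here is a minimal, illustrative CUDA sketch of the kind of Streams-and-Events pattern such a challenge might ask a model to produce. It is not an actual ComputeEval problem; the kernel, buffer size, and scaling factor are invented for illustration. The sketch enqueues an asynchronous copy-compute-copy pipeline on a non-default stream and times it with events:

#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: scales each element of a buffer in place.
__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Pinned host memory so asynchronous copies can overlap with other work.
    float* h_data;
    cudaMallocHost((void**)&h_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float* d_data;
    cudaMalloc((void**)&d_data, bytes);

    // A non-default stream for the pipeline, plus events to time it.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Enqueue copy -> kernel -> copy on the stream; the host does not block yet.
    cudaEventRecord(start, stream);
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 2.0f);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(stop, stream);

    // Wait for the stream to drain, then read back the elapsed time.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("result[0] = %f, elapsed = %.3f ms\n", h_data[0], ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}

A Graphs-oriented challenge would follow the same shape but capture the stream's work with cudaStreamBeginCapture/cudaStreamEndCapture and replay it via cudaGraphInstantiate and cudaGraphLaunch, while Tensor Core problems would exercise the WMMA or MMA intrinsics instead.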
NVIDIA's team has evaluated several leading LLMs using the ComputeEval framework to establish baseline performance metrics. The results were documented in a comparative table that presents the pass@1 accuracy of various models on ComputeEval 2025.2 and its predecessor, 2025.1.
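For context, pass@1 is the standard functional-correctness metric for code-generation benchmarks: each generated solution is compiled and run against the problem's tests, and the score is the expected fraction of problems solved by a single sample. The report does not spell out ComputeEval's exact scoring, but assuming the usual unbiased estimator from HumanEval-style evaluation, with n samples generated per problem of which c pass the tests,

\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right], \qquad \text{pass@}1 \;=\; \mathbb{E}_{\text{problems}}\!\left[\, \frac{c}{n} \,\right].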
Notably, every model's score declined in the move to the latest benchmark. This does not imply a decrease in model capability; rather, it reflects the increased difficulty of the new problems.
For example, GPT-5 achieved a pass@1 of 0.5819 on the new benchmark, down from 0.61 on the previous version, while on 2025.2 Claude Sonnet 4.0 scored 0.5517 and DeepSeek-R1 0.4397. According to the report, the latest benchmark is intended to push models to demonstrate a deeper understanding of the nuances of accelerated computing.
Looking forward, NVIDIA plans to expand ComputeEval's reach into additional CUDA-X libraries, including cuBLAS, CUTLASS, cuDNN, and RAPIDS. The company encourages contributions from the broader high-performance computing and AI communities, inviting developers to explore the code on GitHub and access the dataset via Hugging Face.
This initiative represents a significant step in measuring AI's role in software engineering, particularly in the realm of GPU programming and optimization.