Kaggle simplifies creation of AI benchmarks with local development

Developers can now build Kaggle Benchmarks in their local development environments. Using the Kaggle CLI and AI coding agents, they can write, push, run, and download tasks directly, which makes it faster to measure model capabilities.

As AI models advance from basic chatbots to more sophisticated reasoning agents that can write code, use tools, and solve complex problems, traditional benchmarks are no longer sufficient. This has created a need for dynamic, robust evaluations built by the community of users who apply these models in real-world settings.

Kaggle Benchmarks was launched to address this need, and the global AI community has since created over 10,000 evaluation tasks. These tasks feed reliable, transparent public leaderboards that help labs measure and accelerate AI progress.

Kaggle has now launched local development for Kaggle Benchmarks. Before this update, evaluation tasks could only be created in Kaggle’s web-based notebook editor, not in developers’ preferred environments.

With the update, developers can create, validate, push, run, and download tasks from local development environments such as Antigravity, VSCode, and Cursor, with help from coding agents. The change aligns task creation with developers’ existing workflows and shortens the path from concept to evaluation.

Local development also introduces a new workflow: using AI coding agents to write benchmark tasks through the ‘write-kaggle-benchmarks’ skill. The skill is a set of structured instructions that guide a coding agent in building tasks with the kaggle-benchmarks SDK and the Kaggle CLI.

To equip your agent with this skill, simply ask the agent to add it. Once the skill is in place, you can describe an evaluation in plain language and have the agent produce a working task on Kaggle. This workflow relies on new Benchmarks commands now available in the Kaggle CLI.
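To make the idea of an evaluation task concrete, here is a minimal Python sketch of what such a task amounts to: a prompt, a check on the model's reply, and an aggregate score. It deliberately does not use the kaggle-benchmarks SDK or the Kaggle CLI, whose exact interfaces are not described here. The names EvalTask, stub_model, and score are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a model: any callable mapping a prompt to a reply.
# This is NOT the kaggle-benchmarks SDK interface.
Model = Callable[[str], str]


@dataclass
class EvalTask:
    """Illustrative task: one prompt plus a pass/fail check on the reply."""
    name: str
    prompt: str
    check: Callable[[str], bool]

    def run(self, model: Model) -> bool:
        # Send the prompt to the model and apply the check to its reply.
        return self.check(model(self.prompt))


def stub_model(prompt: str) -> str:
    # Toy model so the sketch runs offline; it answers "4" to every prompt.
    return "4"


def score(tasks: list[EvalTask], model: Model) -> float:
    # Fraction of tasks the model passes; 0.0 for an empty task list.
    if not tasks:
        return 0.0
    return sum(task.run(model) for task in tasks) / len(tasks)


tasks = [
    EvalTask(
        "addition",
        "What is 2 + 2? Reply with the number only.",
        lambda reply: reply.strip() == "4",
    ),
    EvalTask(
        "capital",
        "What is the capital of France? Reply with one word.",
        lambda reply: reply.strip().lower() == "paris",
    ),
]

print(score(tasks, stub_model))
```

In this sketch the stub passes the arithmetic task and fails the geography task. A real task would replace stub_model with a call to an actual model and would be packaged according to the SDK's own conventions.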

Kaggle Benchmarks aims to democratize trustworthy AI evaluations, on the belief that once a capability can be measured, labs work to improve it. Clear, objective indicators are meant to help AI labs focus their improvements on the areas that matter most.

For AI to genuinely serve humanity, evaluations must encompass the full spectrum of real-world challenges. This launch represents a significant step towards enabling anyone, anywhere, to develop evaluations that will influence the future of AI. Ready to start? Try Kaggle Benchmarks today.