CyberGym: Evaluating AI Agents’ Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

UC Berkeley researchers have released CyberGym, a benchmark for evaluating AI agents' cybersecurity capabilities. The reproduction rate for identifying known bugs was low (only 11.9%), but this serves as a baseline for measuring improvements in AI agent performance over time.

More interestingly, the evaluation process discovered 15 new vulnerabilities that pose genuine security risks, a tangential benefit. As this is a new technique, I'd expect teams to find these tools increasingly helpful over the next few years.
