AI safety evals should map failure basins, not just jailbreaks
A new paper argues that safety testing treats jailbreaks as isolated bugs, when failures actually form large behavioral regions that persist across paraphrases. The authors propose mapping these “failure basins” with MAP-Elites, a quality-diversity search algorithm, to chart where models fail, how large those regions are, and where refusal flips to compliance. That shifts evals from counting incidents to mapping failure at the systems level. Via Agentic AI.
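
To make the idea concrete, here is a minimal sketch of a MAP-Elites loop on a toy problem. It is not the paper's implementation. Everything named here is an assumption for illustration: prompts are reduced to two hypothetical behavior descriptors in [0, 1], `toy_failure_score` stands in for a real model plus a judge, and `THRESHOLD` is an arbitrary cutoff for what counts as a failure.

```python
import random

GRID = 10          # cells per descriptor axis (assumed resolution)
THRESHOLD = 0.5    # score above this counts as a failure (assumed cutoff)

def toy_failure_score(x, y):
    """Hypothetical stand-in for model + judge; 1.0 means full compliance."""
    # One smooth basin centered at (0.7, 0.3); purely illustrative.
    d2 = (x - 0.7) ** 2 + (y - 0.3) ** 2
    return max(0.0, 1.0 - d2 / 0.09)

def cell_of(x, y):
    """Map a descriptor pair to its grid cell."""
    return (min(int(x * GRID), GRID - 1), min(int(y * GRID), GRID - 1))

def mutate(x, y, rng, sigma=0.1):
    """Gaussian perturbation, clamped to the unit square."""
    clamp = lambda v: min(1.0, max(0.0, v))
    return clamp(x + rng.gauss(0, sigma)), clamp(y + rng.gauss(0, sigma))

def map_elites(iterations=5000, n_init=200, seed=0):
    """Keep the highest-scoring candidate (the elite) found in each cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (score, (x, y))
    for i in range(iterations):
        if i < n_init or not archive:
            x, y = rng.random(), rng.random()          # random bootstrap
        else:
            _, parent = rng.choice(list(archive.values()))
            x, y = mutate(*parent, rng)                # vary an existing elite
        score = toy_failure_score(x, y)
        cell = cell_of(x, y)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, (x, y))
    return archive

def basin_stats(archive):
    """Return (fraction of grid cells that fail, set of boundary cells)."""
    failing = {c for c, (s, _) in archive.items() if s > THRESHOLD}
    # A boundary cell is a failing cell with an in-grid neighbor that is
    # not failing; unexplored cells are treated as not failing.
    boundary = {
        (i, j) for (i, j) in failing
        if any((i + di, j + dj) not in failing
               for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
               if 0 <= i + di < GRID and 0 <= j + dj < GRID)
    }
    return len(failing) / GRID ** 2, boundary

if __name__ == "__main__":
    fraction, boundary = basin_stats(map_elites())
    print(f"failing fraction of grid: {fraction:.2f}")
    print(f"boundary cells: {sorted(boundary)}")
```

The archive plays the role of the map: the failing fraction approximates how big the basin is, and the boundary cells approximate where refusal flips to compliance. In a real evaluation the descriptors and the scoring function would come from the prompts and the model under test, which this sketch does not attempt to model.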