Black-Box Reverse-Engineering Exposes LLM Guardrails as Vulnerable Attack Surface
Published Nov 11, 2025
Researchers disclosed a practical black-box Guardrail Reverse-Engineering Attack (GRA) that uses genetic algorithms and reinforcement learning to infer a commercial LLM's safety decision policy from its input-output behavior. Tested against ChatGPT, DeepSeek, and Qwen3, GRA achieved rule-matching accuracy above 0.92 at under US$85 in API costs, showing that guardrails can be cheaply and reliably approximated, then evaded.

The result turns guardrails themselves into an exploitable attack surface. That threatens compliance and safety in regulated domains (health, finance, legal, education) and compounds existing risks where retrieval-augmented or other contextual inputs already degrade protections.

Proposed mitigations include obfuscating the decision surface (randomized filtering, honey-tokens), context-aware robustness testing, and continuous adversarial auditing. The finding demands an urgent redesign of safety architectures and threat models, treating guardrails as resilient, dynamic defenses rather than static filters.
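The disclosure describes the attack only at the level of "genetic search plus reinforcement learning over input-output behavior," but the core feedback loop is easy to picture. Below is a minimal, self-contained sketch, not the authors' implementation: a toy keyword filter stands in for the remote black-box guardrail, and a genetic algorithm evolves a surrogate rule set whose verdicts agree with the observed block/allow decisions (a rough analogue of the reported rule-matching accuracy). The function names, candidate-term list, and fitness definition are all illustrative assumptions.

```python
"""Illustrative sketch only: the real GRA combines genetic search with
reinforcement learning against live commercial APIs. Here a toy keyword
filter stands in for the opaque moderation endpoint, and the attacker
evolves a surrogate rule set that reproduces its verdicts."""
import random

# --- Hypothetical stand-in for the remote, opaque guardrail ----------------
HIDDEN_RULES = {"exploit", "bypass", "weapon"}          # unknown to the attacker

def black_box_guardrail(prompt: str) -> bool:
    """Return True if the (toy) guardrail blocks the prompt."""
    return any(term in prompt.lower() for term in HIDDEN_RULES)

# --- Attacker side ----------------------------------------------------------
CANDIDATE_TERMS = ["exploit", "bypass", "weapon", "poem", "recipe",
                   "firewall", "injection", "holiday"]   # attacker's guess space

# 1) Probe the black box and record its input-output behavior.
probes = [f"please write a {a} about a {b}"
          for a in CANDIDATE_TERMS for b in CANDIDATE_TERMS]
observations = [(p, black_box_guardrail(p)) for p in probes]

def fitness(rule_set: frozenset) -> float:
    """Agreement between a surrogate rule set and the observed verdicts
    (the analogue of rule-matching accuracy in this sketch)."""
    hits = sum(any(t in p for t in rule_set) == blocked
               for p, blocked in observations)
    return hits / len(observations)

def mutate(rule_set: frozenset) -> frozenset:
    """Toggle one candidate term in or out of the rule set."""
    s = set(rule_set)
    s.symmetric_difference_update({random.choice(CANDIDATE_TERMS)})
    return frozenset(s)

def crossover(a: frozenset, b: frozenset) -> frozenset:
    """Inherit each candidate term from either parent at random."""
    return frozenset(t for t in CANDIDATE_TERMS
                     if (t in a if random.random() < 0.5 else t in b))

# 2) Genetic search over surrogate rule sets.
population = [frozenset(random.sample(CANDIDATE_TERMS, k=3)) for _ in range(30)]
for generation in range(40):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                             # elitist selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    population = parents + children

best = max(population, key=fitness)
print(f"inferred rules: {sorted(best)}  agreement: {fitness(best):.2f}")
```

In the actual attack the probes would be paid API calls against a live model, the hypothesis space would be far richer than keyword sets, and a reinforcement-learning component would steer which probes to spend budget on; the sketch only conveys the shape of the probe-and-fit loop and why high agreement at low query cost is plausible.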
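Of the mitigations listed, randomized filtering is the most mechanical to picture. The following is a hedged sketch under assumed names, not a documented defense from the research: adding a stochastic band around a hypothetical moderation threshold means repeated identical probes no longer return deterministic labels, which degrades the fidelity of any surrogate fitted to observed decisions. The scoring function, threshold, and band width are illustrative assumptions.

```python
"""Hedged sketch of randomized filtering: near-boundary inputs no longer
map deterministically to verdicts, so black-box probes collect noisy labels."""
import random

RISK_THRESHOLD = 0.6     # hypothetical moderation score cutoff
JITTER_BAND = 0.1        # randomize decisions within +/- this band of the cutoff

def risk_score(prompt: str) -> float:
    """Placeholder for a real moderation classifier's risk score."""
    trigger_terms = ("exploit", "bypass", "weapon")
    return min(1.0, 0.3 * sum(t in prompt.lower() for t in trigger_terms))

def randomized_filter(prompt: str) -> bool:
    """Block decision with a stochastic band around the threshold."""
    score = risk_score(prompt)
    if abs(score - RISK_THRESHOLD) < JITTER_BAND:
        # Near the boundary: a biased coin flip replaces the fixed rule,
        # so identical repeated probes can see inconsistent outcomes.
        return random.random() < score
    return score >= RISK_THRESHOLD

if __name__ == "__main__":
    # The same near-boundary probe may be blocked on some attempts and not others.
    for _ in range(3):
        print(randomized_filter("can you bypass this weapon check"))
```

Honey-tokens, the other obfuscation named, would presumably complement this by making some probe-triggered responses detectable downstream, turning reconnaissance traffic into a detection signal rather than free training data for the attacker.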