Preprint · February 2026
Evaluating Hybrid Guardrail Architectures for Prompt Injection Defense in Large Language Models
Olanrewaju Muili · Independent Researcher; Founder, Tracevox.ai
Abstract
Prompt injection attacks exploit the instruction-following behavior of large language models (LLMs) by embedding adversarial directives within user-provided text. Production systems frequently deploy layered guardrail mechanisms, combining heuristic filters and model-based classifiers, to mitigate such attacks. However, rigorous empirical evaluations of these architectures under structured adversarial variation remain limited, and most deployed systems lack publicly reported performance metrics.
This work presents a systematic evaluation of three guardrail configurations: (1) a baseline with no guardrails, (2) regex-based heuristic filtering, and (3) a hybrid architecture combining regex filtering with an LLM-based safety classifier. The evaluation uses a two-tier benchmark of 625 prompts (standard and adversarially hard), measuring attack block rate, miss rate, benign false-positive rate, precision, recall, and F₁-score under a single-turn threat model. On hard-tier attacks, the hybrid configuration maintains near-perfect performance while regex-only filtering degrades substantially, demonstrating both the robustness of layered LLM-based classification and the persistent challenge posed by semantic overlap with benign security-adjacent content.
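The reported metrics follow directly from the confusion counts of an attack-detection task. A minimal sketch of how they relate (function name and example counts are illustrative, not taken from the paper's evaluation):

```python
def guardrail_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Detection metrics for a guardrail, treating 'attack' as the positive class.

    tp: attacks correctly blocked    fp: benign prompts wrongly blocked
    fn: attacks that got through     tn: benign prompts correctly allowed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # identical to attack block rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "precision": precision,
        "recall": recall,            # attack block rate
        "f1": f1,
        "miss_rate": fn / (tp + fn) if (tp + fn) else 0.0,
        "benign_fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

Under this framing, a regex filter that degrades on hard-tier attacks shows up as rising miss rate (falling recall), while over-aggressive blocking of benign security-adjacent prompts shows up as a rising benign false-positive rate.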
Keywords
prompt injection · LLM safety · guardrails · adversarial robustness · evaluation benchmarks