GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks

Large language models (LLMs) have achieved significant success across various applications. However, despite numerous safety guardrails, concerns persist about their potential for misuse. Recent investigations show that LLMs are seriously exposed to prompt injection attacks, in which an attacker tricks the model into producing objectionable content by overriding its safety guardrails. Prompt injection attacks can generally be categorized into jailbreak attacks, goal hijacking attacks, and prompt leaking attacks, which can lead to issues such as illegal outputs, unauthorized privilege escalation, and privacy breaches. We introduce GenTel-Safe, a unified framework that includes a novel prompt injection attack detection method, GenTel-Shield, along with a comprehensive evaluation benchmark, GenTel-Bench, which comprises 84,812 prompt injection attacks spanning 3 major categories and 28 security scenarios.

Large Language Models (LLMs) like GPT-4, LLaMA, and Qwen have demonstrated remarkable success across a wide range of applications. However, these models remain inherently vulnerable to prompt injection attacks, which can bypass existing safety mechanisms. This highlights the urgent need for more robust attack detection methods and comprehensive evaluation benchmarks. To address these challenges, we introduce GenTel-Safe, a unified framework that includes a novel prompt injection attack detection method, GenTel-Shield, along with a comprehensive evaluation benchmark, GenTel-Bench, which comprises 84,812 prompt injection attacks spanning 3 major categories and 28 security scenarios. To assess the effectiveness of GenTel-Shield, we evaluate it together with vanilla safety guardrails on the GenTel-Bench dataset. Empirically, GenTel-Shield achieves state-of-the-art attack detection success rates, and the comparison reveals critical weaknesses in existing safeguarding techniques against harmful prompts.

Classification performance on Jailbreak Attack Scenarios. The results indicate that our model, GenTel-Shield, outperforms all other models across key metrics, particularly Accuracy, F1 score, and Recall, achieving 97.63%, 97.69%, and 97.34%, respectively.

Classification performance on Goal Hijacking Attack Scenarios. GenTel-Shield achieves the highest overall performance, with a best-in-class Accuracy of 96.81% and F1 score of 96.74%. Its high Precision (99.44%) and Recall (94.19%) suggest a strong balance between correctly detecting attack samples and minimizing false positives, making it more reliable for real-world applications.
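For readers less familiar with these metrics, the following is a minimal sketch of how Accuracy, Precision, Recall, and F1 are conventionally computed for a binary attack detector, assuming attack prompts are the positive class. The function names and the confusion-matrix counts are illustrative assumptions and are not taken from the GenTel-Safe code. As a sanity check, the standard F1 formula applied to the Precision and Recall reported for goal hijacking reproduces the reported F1 score.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Binary metrics, treating attack prompts as the positive class.

    tp: attacks flagged as attacks     fp: benign prompts flagged as attacks
    tn: benign prompts passed through  fn: attacks that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Consistency check using the goal hijacking Precision and Recall quoted above.
f1 = f1_from_precision_recall(0.9944, 0.9419)
print(f"{100 * f1:.2f}%")  # 96.74%
```

A high Precision means few benign prompts are wrongly blocked, while a high Recall means few attacks slip through; F1 summarizes both in one number.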

Classification performance on Prompt Leaking Attack Scenarios. Our model ranks among the best at detecting these injection attacks, including in complex attack scenarios.

To compare the models at a finer granularity, we conducted a series of experiments on the individual risk scenarios. GenTel-Shield outperformed the other models in most categories, achieving high accuracy across several risk scenarios.
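A per-scenario breakdown of this kind can be sketched as follows. This is an illustrative example only: the record layout, the function name, and the scenario names are assumptions made for the sketch and do not correspond to the actual GenTel-Bench scenarios or tooling.

```python
from collections import defaultdict


def accuracy_by_scenario(records):
    """Group predictions by scenario and compute accuracy within each group.

    records: iterable of (scenario, true_label, predicted_label) tuples,
    where labels are 1 for an attack prompt and 0 for a benign prompt.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for scenario, y_true, y_pred in records:
        total[scenario] += 1
        correct[scenario] += int(y_true == y_pred)
    return {scenario: correct[scenario] / total[scenario] for scenario in total}


# Made-up toy records with placeholder scenario names.
records = [
    ("scenario_a", 1, 1),
    ("scenario_a", 1, 0),
    ("scenario_b", 0, 0),
    ("scenario_b", 1, 1),
]
print(accuracy_by_scenario(records))  # {'scenario_a': 0.5, 'scenario_b': 1.0}
```

Reporting accuracy per scenario rather than only in aggregate makes it visible when a detector performs well overall but poorly on a specific risk category.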

Contact

Please feel free to email us at mhan@zju.edu.cn. If you find this work useful in your own research, please consider citing it:

Li, Rongchang, et al. "GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks." arXiv preprint arXiv:2409.19521 (2024).
