developer-roadmap/src/data/roadmaps/prompt-engineering/content/104-llm-pitfalls/104-prompt-hacking.md

# Prompt Hacking

Prompt hacking is a term used to describe a situation where a model, specifically a language model, is tricked or manipulated into generating outputs that violate safety guidelines or are off-topic. This could include content that's harmful, offensive, or not relevant to the prompt.

There are a few common techniques employed by users to attempt "prompt hacking," such as:

1. **Manipulating keywords**: Users may introduce specific keywords or phrases that are linked to controversial, inappropriate, or harmful content in order to trick the model into generating unsafe outputs.
2. **Playing with grammar**: Users could purposely use poor grammar, spelling, or punctuation to confuse the model and elicit responses that might not be detected by safety mitigations.
3. **Asking leading questions**: Users can try to manipulate the model by asking highly biased or loaded questions, hoping to get a similar response from the model.

To counteract prompt hacking, it's essential for developers and researchers to build in safety mechanisms such as content filters and carefully designed prompt templates to prevent the model from generating harmful or unwanted outputs. Constant monitoring, analysis, and improvement to the safety mitigations in place can help ensure the model's output aligns with the desired guidelines and behaves responsibly.

Read more about prompt hacking here [Prompt Hacking](https://learnprompting.org/docs/category/-prompt-hacking).