Add content for reliability

pull/3947/head
Kamran Ahmed 2 years ago
parent 42debdeab0
commit 591cac8bfa
  1. src/data/roadmaps/prompt-engineering/content/104-llm-pitfalls/100-citing-sources.md (4 changed lines)
  2. src/data/roadmaps/prompt-engineering/content/104-llm-pitfalls/101-bias.md (2 changed lines)
  3. src/data/roadmaps/prompt-engineering/content/104-llm-pitfalls/102-hallucinations.md (2 changed lines)
  4. src/data/roadmaps/prompt-engineering/content/104-llm-pitfalls/103-math.md (34 changed lines)
  5. src/data/roadmaps/prompt-engineering/content/104-llm-pitfalls/104-prompt-hacking.md (2 changed lines)
  6. src/data/roadmaps/prompt-engineering/content/104-llm-pitfalls/index.md (18 changed lines)
  7. src/data/roadmaps/prompt-engineering/content/105-reliability/100-debiasing.md (18 changed lines)
  8. src/data/roadmaps/prompt-engineering/content/105-reliability/101-ensembling.md (6 changed lines)
  9. src/data/roadmaps/prompt-engineering/content/105-reliability/102-self-evaluation.md (8 changed lines)
  10. src/data/roadmaps/prompt-engineering/content/105-reliability/103-calibrating-llms.md (7 changed lines)
  11. src/data/roadmaps/prompt-engineering/content/105-reliability/104-math.md (17 changed lines)
  12. src/data/roadmaps/prompt-engineering/content/105-reliability/index.md (8 changed lines)

@ -1 +1,5 @@
# Citing Sources
LLMs, for the most part, cannot accurately cite sources. This is because they do not have access to the Internet and do not exactly remember where their information came from. They will frequently generate sources that look plausible but are entirely inaccurate.
Strategies like search-augmented LLMs (LLMs that can search the Internet and other sources) can often mitigate this problem.
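As a rough illustration of the search-augmented approach, the sketch below retrieves snippets first and then asks the model to answer using only those snippets, citing them by number. `search` and `complete` are hypothetical placeholders for whatever search API and LLM client you use.

```python
# Sketch of search-augmented prompting: ground the answer in retrieved
# snippets so citations point at real sources rather than invented ones.
# `search` and `complete` are hypothetical stand-ins.

def search(query: str) -> list[dict]:
    # Call a web-search or vector-store API here.
    return [{"url": "https://example.com/doc", "snippet": "..."}]

def complete(prompt: str) -> str:
    # Call your LLM provider here.
    return "..."

def answer_with_citations(question: str) -> str:
    sources = search(question)
    context = "\n".join(
        f"[{i + 1}] {s['url']}: {s['snippet']}" for i, s in enumerate(sources)
    )
    prompt = (
        "Answer the question using ONLY the sources below. "
        f"Cite them as [n] after each claim.\n\n{context}\n\nQuestion: {question}"
    )
    return complete(prompt)
```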

@ -1,2 +1,4 @@
# Bias
LLMs are often biased towards generating stereotypical responses. Even with safeguards in place, they will sometimes produce sexist, racist, or homophobic content. Be careful when using LLMs in consumer-facing applications, and be equally careful when using them in research, as they can generate biased results.

@ -1,6 +1,6 @@
# Hallucinations
Hallucinations are a common pitfall in LLM outputs. Essentially, they occur when the model generates text that is factually incorrect, nonsensical, or disconnected from the input prompt. Such hallucinations are problematic because they can mislead users or cause misunderstandings. LLMs will frequently generate falsehoods when asked a question that they do not know the answer to; sometimes they will state that they do not know, but much of the time they will confidently give a wrong answer.
### Causes of Hallucinations

@ -1,35 +1,3 @@
# Math
When working with language models, it's essential to understand the challenges and limitations involved in handling mathematics. In this section, we'll discuss some common pitfalls related to math in the context of prompt engineering and provide suggestions for addressing them.
LLMs are often bad at math. They have difficulty solving simple math problems, and they are often unable to solve more complex ones.
## Numerical Reasoning Limitations
Language models like GPT-3 have limitations when it comes to numerical reasoning, especially with large numbers or complex calculations. They might not always provide accurate answers or interpret the numerical context correctly.
**Recommendation:** For tasks that require precise numerical answers or involve complex calculations, consider using specialized math software or verifying the model's output using other means.
## Ambiguous Math Questions
Ambiguous or ill-defined math questions are likely to receive incorrect or nonsensical answers. Vague inputs make it challenging for the model to understand the context and provide sensible responses.
**Recommendation**: Try to make math questions as clear and specific as possible. Provide sufficient context and use precise language to minimize ambiguities.
## Units and Conversion
Language models might not automatically take units into account or perform the necessary unit conversions when working with mathematical problems, which could result in incorrect answers.
**Recommendation**: Explicitly mention the desired units and, when needed, ask the model to perform unit conversions to ensure the output aligns with the expected format or measure.
## Incorrect Interpretation of Notation
Mathematics often uses specialized notation or symbols that the language model might misinterpret. Especially when inputting symbols or notation that differ from the standard plain text, the risk of misunderstanding increases.
**Recommendation**: Make sure to use clear and common notation when presenting math problems to the model. If necessary, explain the notation or provide alternative representations to minimize confusion.
## Building on Incorrect Responses
If a sequence of math problems depends on previous answers, the model might not correct its course after providing an incorrect response. This could cascade and result in multiple subsequent errors.
**Recommendation**: Be cautious when using the model's output as the basis for subsequent calculations or questions. Verify the correctness of the intermediate steps before proceeding.
By being aware of these math-related pitfalls and applying the recommendations, you can improve the effectiveness and accuracy of your prompts when engaging language models with mathematical tasks.
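To illustrate the first recommendation, here is a minimal sketch that re-computes an arithmetic expression locally and compares it against the model's answer. It assumes the model was asked to reply with a bare number; the helper names are hypothetical.

```python
# Sketch: don't trust the model's arithmetic -- recompute it locally and
# compare. Only plain +, -, *, / expressions are handled here.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a simple arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def model_answer_is_correct(expression: str, model_answer: str) -> bool:
    return abs(safe_eval(expression) - float(model_answer)) < 1e-6

# e.g. after asking the model "What is 1234 * 5678? Reply with just the number."
print(model_answer_is_correct("1234 * 5678", "7006652"))  # True
```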

@ -9,3 +9,5 @@ There are a few common techniques employed by users to attempt "prompt hacking,"
3. **Asking leading questions**: Users can try to manipulate the model by asking highly biased or loaded questions, hoping to get a similar response from the model.
To counteract prompt hacking, it's essential for developers and researchers to build in safety mechanisms such as content filters and carefully designed prompt templates to prevent the model from generating harmful or unwanted outputs. Constant monitoring, analysis, and improvement of the safety mitigations in place can help ensure the model's output aligns with the desired guidelines and behaves responsibly.
Read more about prompt hacking in the [Prompt Hacking](https://learnprompting.org/docs/category/-prompt-hacking) section of learnprompting.org.
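As a very rough sketch of the kind of content filter mentioned above, the snippet below screens model output against a keyword block list. This is purely illustrative; production systems typically rely on dedicated moderation models and prompt templates rather than simple keyword lists.

```python
# Sketch: a minimal output filter layered on top of the model.
# Real deployments use far more robust moderation than a block list.

BLOCKLIST = {"how to build a bomb", "credit card number"}

def is_allowed(text: str) -> bool:
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)

def guarded_reply(model_output: str) -> str:
    # Return the model output only if it passes the filter.
    return model_output if is_allowed(model_output) else "Sorry, I can't help with that."
```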

@ -1,31 +1,27 @@
# Pitfalls of LLMs
LLMs are extremely powerful, but they are by no means perfect. There are many pitfalls that you should be aware of when using them.
In this section, we'll discuss some of the common pitfalls that you might encounter when working with Large Language Models (LLMs), particularly in the context of prompt engineering. By understanding these pitfalls, you can more effectively develop prompts and avoid potential issues that may affect the performance and utility of your model.
### Model Guessing Your Intentions
Sometimes, LLMs might not fully comprehend the intent of your prompt and may generate generic or safe responses. To mitigate this, make your prompts more explicit or ask the model to think step-by-step before providing a final answer.
### Sensitivity to Prompt Phrasing
LLMs can be sensitive to the phrasing of your prompts, which might result in completely different or inconsistent responses. Ensure that your prompts are well-phrased and clear to minimize confusion.
### Model Generating Plausible but Incorrect Answers
In some cases, LLMs might generate answers that sound plausible but are actually incorrect. One way to deal with this is by adding a step for the model to verify the accuracy of its response or by prompting the model to provide evidence or a source for the given information.
### Verbose or Overly Technical Responses
LLMs, especially larger ones, may generate responses that are unnecessarily verbose or overly technical. To avoid this, explicitly guide the model by making your prompt more specific, asking for a simpler response, or requesting a particular format.
### LLMs Not Asking for Clarification
When faced with an ambiguous prompt, LLMs might try to answer it without asking for clarification. To encourage the model to seek clarification, you can prepend your prompt with "If the question is unclear, please ask for clarification."
### Model Failure to Perform Multi-part Tasks
Sometimes, LLMs might not complete all parts of a multi-part task or might only focus on one aspect of it. To avoid this, consider breaking the task into smaller, more manageable sub-tasks or ensure that each part of the task is clearly identified in the prompt.
By being mindful of these pitfalls and implementing the suggested solutions, you can create more effective prompts and optimize the performance of your LLM.
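As a small illustration, the sketch below wraps a task prompt with two of the mitigations mentioned above: a step-by-step instruction and an invitation to ask for clarification. The wording is only an example, not a fixed recipe.

```python
# Sketch: a prompt wrapper that applies two of the mitigations above --
# step-by-step reasoning and an explicit invitation to ask for clarification.

def build_prompt(task: str, response_format: str = "a short, plain-language answer") -> str:
    return (
        "If the question is unclear, ask for clarification instead of guessing.\n"
        "Think step by step before giving your final answer.\n"
        f"Respond with {response_format}.\n\n"
        f"Task: {task}"
    )

print(build_prompt("Summarise the attached contract and list any termination clauses."))
```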

@ -22,20 +22,4 @@ Here are a few strategies that can help you address biases in your prompts:
3. **Counter-balancing**: If a bias is unavoidable due to the context or nature of the prompt, consider counter-balancing it by providing an alternative perspective or side to the argument.
4. **Testing and Iterating**: Continuously test and iterate on your prompts, seeking feedback from a diverse group of reviewers to identify and correct potential biases.
## Examples of Debiasing
Here's an example to illustrate debiasing in prompt engineering:
### Biased Prompt
*Who are some popular male scientists?*
This prompt assumes that scientists are more likely to be men. It also reinforces the stereotype that scientific achievements are primarily attributed to male scientists.
### Debiased Prompt
*Who are some popular scientists from diverse backgrounds and genders?*
This prompt removes any implicit gender bias and encourages a more inclusive list of scientists, showcasing different backgrounds and genders while maintaining the focus on scientific achievements.
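One simple, programmatic way to apply the explicit-instruction strategy is to prepend a neutrality instruction to every prompt, as in the hypothetical sketch below; the wording is illustrative only.

```python
# Sketch: prepend an explicit neutrality instruction to each prompt,
# one lightweight application of the "explicit instructions" strategy above.

def debias(prompt: str) -> str:
    return (
        "Treat all genders, ethnicities, and backgrounds equally; "
        "do not rely on stereotypes.\n\n" + prompt
    )

print(debias("Who are some popular scientists?"))
```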
By incorporating debiasing strategies into your prompt engineering process, you promote fairness, accountability, and neutrality in AI-generated content, supporting a more inclusive and ethical AI environment.
Learn more at [learnprompting.org](https://learnprompting.org/docs/reliability/intro)

@ -5,13 +5,11 @@ Ensembling is a technique used to improve the reliability and accuracy of predic
There are several ensembling techniques that can be used, including:
- **Majority voting**: Each model votes for a specific output, and the one with the most votes is the final prediction.
- **Weighted voting**: Similar to majority voting, but each model has a predefined weight based on its performance, accuracy, or other criteria. The final prediction is based on the weighted sum of all model predictions.
- **Bagging**: Each model is trained on a slightly different dataset, typically generated by sampling with replacement (bootstrap) from the original dataset. The predictions are then combined, usually through majority voting or averaging.
- **Boosting**: A sequential ensemble method where each new model aims to correct the mistakes made by the previous models. The final prediction is a weighted combination of the outputs from all models.
- **Stacking**: Multiple base models predict the output, and these predictions are used as inputs for a second-layer model, which provides the final prediction.
Incorporating ensembling in your prompt engineering process can help produce more reliable results, but be mindful of factors such as increased computational complexity and potential overfitting. To achieve the best results, make sure to use diverse models in your ensemble and pay attention to tuning their parameters, balancing their weights, and selecting suitable ensembling techniques based on your specific problem and dataset.
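A minimal sketch of majority voting over sampled completions (in the style of self-consistency) is shown below. `complete` is a hypothetical stand-in for your LLM call, and a higher sampling temperature is assumed to produce diverse answers.

```python
# Sketch: majority voting over several sampled completions.
# `complete` is a hypothetical stand-in for your LLM call.
from collections import Counter
import random

def complete(prompt: str, temperature: float = 0.7) -> str:
    # Call your LLM provider here; this stub just fakes some variation.
    return random.choice(["42", "42", "41"])

def majority_vote(prompt: str, n: int = 5) -> str:
    answers = [complete(prompt, temperature=0.7).strip() for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

print(majority_vote("What is 6 * 7? Reply with just the number."))
```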
Learn more at [learnprompting.org](https://learnprompting.org/docs/reliability/intro)

@ -2,12 +2,6 @@
Self-evaluation is an essential aspect of the prompt engineering process. It involves the ability of an AI model to assess its own performance and determine the level of confidence it has in its responses. By properly incorporating self-evaluation, the AI can improve its reliability, as it will learn to identify its weaknesses and provide more accurate responses over time.
## Importance of Self-Evaluation
- **Identify weaknesses**: A good self-evaluation system helps the AI recognize areas where it provides less accurate or irrelevant responses, thus making it possible to improve the model during future training iterations.
- **Enhance reliability**: Users are more likely to trust an AI model that understands its limitations and adjusts its responses accordingly.
- **Continuous improvement**: As an AI model evaluates its performance, it becomes equipped with new information to learn from and improve upon, ultimately leading to better overall performance.
## Implementing Self-Evaluation
When incorporating self-evaluation into an AI model, you should consider the following elements:
@ -21,3 +15,5 @@ When incorporating self-evaluation into an AI model, you should consider the fol
4. **Error monitoring**: Establish a system that continuously monitors the AI model's performance by tracking errors, outliers, and other unexpected results. This monitoring process should inform the self-evaluation mechanism and help the AI model adapt over time.
By incorporating self-evaluation into your AI model, you can create a more reliable system that users will trust and appreciate. This, in turn, will lead to a greater sense of confidence in the AI model and its potential to solve real-world problems.
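One lightweight way to approximate self-evaluation at inference time is a second pass in which the model rates its own answer, as in the hypothetical sketch below; answers with a low self-reported confidence can then be flagged or retried.

```python
# Sketch: a second pass where the model scores its own answer, so
# low-confidence responses can be flagged for review or retried.
# `complete` is a hypothetical stand-in for your LLM call.

def complete(prompt: str) -> str:
    return "7"  # call your LLM provider here; fixed reply keeps the sketch runnable

def answer_with_self_check(question: str, threshold: int = 7) -> dict:
    answer = complete(question)
    critique = (
        f"Question: {question}\nProposed answer: {answer}\n"
        "On a scale of 1-10, how confident are you that the answer is correct? "
        "Reply with just the number."
    )
    try:
        confidence = int(complete(critique).strip())
    except ValueError:
        confidence = 0  # unparseable rating -> treat as low confidence
    return {"answer": answer, "confidence": confidence,
            "needs_review": confidence < threshold}
```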
Learn more at [learnprompting.org](https://learnprompting.org/docs/reliability/intro)

@ -1,6 +1,6 @@
# Calibrating LLMs
In the context of prompt engineering, calibrating Large Language Models (LLMs) is an essential step to ensure reliability and accuracy in the model's output. Calibration refers to the process of adjusting the model to produce responses that are consistent with human-defined ratings, rankings, or scores.
## Importance of Calibration
@ -15,11 +15,8 @@ Calibrating the LLMs helps to:
There are various techniques to calibrate LLMs that you can explore, including:
1. **Prompt Conditioning**: Modifying the prompt itself to encourage desired behavior. This involves using explicit instructions or specifying the format of the desired response.
2. **Response Rankings**: Presenting the model with multiple potential responses and asking it to rank them by quality or relevance. This technique encourages the model to eliminate inappropriate or low-quality responses by assessing them against other possible answers.
3. **Model Debiasing**: Applying debiasing techniques, such as counterfactual data augmentation or fine-tuning the model with diverse, bias-mitigating training data.
4. **Temperature Adjustment**: Dynamically controlling the randomness or 'temperature' parameter during inference to balance creativity and coherence of the output (see the sketch after this list).
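The sketch below combines two of these techniques, temperature adjustment and response ranking: it samples candidates at several temperatures and asks the model itself to pick the best one. `complete` is a hypothetical stand-in for your LLM call.

```python
# Sketch: sample candidates at different temperatures, then ask the model
# to rank them and return the preferred one.
# `complete` is a hypothetical stand-in for your LLM call.

def complete(prompt: str, temperature: float = 0.2) -> str:
    return "candidate response"  # call your LLM provider here

def best_of(prompt: str, temperatures=(0.2, 0.7, 1.0)) -> str:
    candidates = [complete(prompt, temperature=t) for t in temperatures]
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    ranking_prompt = (
        f"Prompt: {prompt}\n\nCandidate responses:\n{numbered}\n\n"
        "Which candidate is the most accurate and relevant? Reply with just its number."
    )
    choice = complete(ranking_prompt, temperature=0.0).strip()
    # Fall back to the first candidate if the ranking reply is unusable.
    index = int(choice) - 1 if choice.isdigit() and 0 < int(choice) <= len(candidates) else 0
    return candidates[index]
```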
### Iterative Calibration
@ -27,3 +24,5 @@ There are various techniques to calibrate LLMs that you can explore, including:
Calibration should be an iterative process, where improvements are consistently monitored and further adjustments made based on the data collected from users. Continual learning from user interactions can help increase the model's overall reliability and maintain its performance over time.
Remember, calibrating LLMs is an essential part of creating reliable, high-quality language models that effectively meet user needs and expectations. Through prompt conditioning, response ranking, model debiasing, temperature adjustment, and iterative improvements, you can successfully achieve well-calibrated and reliable LLMs.
Learn more at [learnprompting.org](https://learnprompting.org/docs/reliability/intro)

@ -1 +1,18 @@
# Math
As a prompt engineer, you can take the following steps to improve the reliability of Language Models (LMs) for mathematical tasks:
- Clear and specific prompts: Craft clear and specific prompts that provide the necessary context for the mathematical task. Specify the problem type, expected input format, and desired output format. Avoid ambiguous or vague instructions that can confuse the LM.
- Formatting cues: Include formatting cues in the prompts to guide the LM on how to interpret and generate mathematical expressions. For example, use LaTeX formatting or explicit notations for mathematical symbols, equations, or variables.
- Example-based prompts: Provide example-based prompts that demonstrate the desired input-output behavior. Show the model correct solutions for different problem types to help it understand the expected patterns and formats.
- Step-by-step instructions: Break down complex mathematical problems into step-by-step instructions. Provide explicit instructions on how the model should approach the problem, such as defining variables, applying specific rules or formulas, or following a particular sequence of operations.
- Error handling: Anticipate potential errors or misconceptions the LM might make, and explicitly instruct it on how to handle those cases. Provide guidance on common mistakes and offer corrective feedback to help the model learn from its errors.
- Feedback loop: Continuously evaluate the model's responses and iterate on the prompts based on user feedback. Identify areas where the LM is consistently making errors or struggling, and modify the prompts to address those specific challenges.
- Context injection: Inject additional context into the prompt to help the model better understand the problem. This can include relevant background information, specific problem constraints, or hints to guide the LM towards the correct solution.
- Progressive disclosure: Gradually reveal information or subtasks to the LM, rather than providing the entire problem at once. This can help the model focus on smaller subproblems and reduce the cognitive load, leading to more reliable outputs.
- Sanity checks: Include sanity checks in the prompt to verify the reasonableness of the model's output. For example, you can ask the model to show intermediate steps or validate the solution against known mathematical properties.
- Fine-tuning and experimentation: Fine-tune the LM on a dataset that specifically focuses on mathematical tasks. Experiment with different prompt engineering techniques and evaluate the impact on the model's reliability. Iterate on the fine-tuning process based on the results obtained.
By applying these prompt engineering strategies, you can guide the LM towards more reliable and accurate responses for mathematical tasks, improving the overall usability and trustworthiness of the model.
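As a small illustration, the sketch below builds a prompt that applies several of these strategies at once: an explicit output format, step-by-step instructions, and a built-in sanity check. The exact wording is illustrative, not a fixed recipe.

```python
# Sketch: a math prompt that bundles formatting cues, step-by-step
# instructions, and a sanity-check request.

def build_math_prompt(problem: str) -> str:
    return (
        "Solve the following problem.\n"
        "1. Define the variables you use.\n"
        "2. Show each step of the calculation on its own line.\n"
        "3. State the final answer as: ANSWER = <number> <unit>.\n"
        "4. Finally, substitute the answer back into the original problem to check it.\n\n"
        f"Problem: {problem}"
    )

print(build_math_prompt("A car travels 150 km in 2.5 hours. What is its average speed in km/h?"))
```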
Learn more at [learnprompting.org](https://learnprompting.org/docs/reliability/intro)

@ -1 +1,9 @@
# Improving Reliability
To a certain extent, most of the techniques covered previously have to do with improving completion accuracy, and thus reliability; self-consistency in particular. However, there are a number of other techniques that can be used to improve reliability beyond basic prompting strategies.
LLMs have been found to be more reliable than we might expect at interpreting what a prompt is trying to say when responding to misspelled, badly phrased, or even actively misleading prompts. Despite this ability, they still exhibit various problems, including hallucinations, flawed explanations with chain-of-thought (CoT) methods, and multiple biases such as majority label bias, recency bias, and common token bias. Additionally, zero-shot CoT can be particularly biased when dealing with sensitive topics.
Common solutions to some of these problems include calibrators to remove a priori biases, and verifiers to score completions, as well as promoting diversity in completions.
Learn more at [learnprompting.org](https://learnprompting.org/docs/reliability/intro)
