You are tasked with generating synthetic data on the topic of machine learning. Your goal is to create a diverse set of prompts, contexts, and responses that vary along the following aspects: accuracy, hallucination, groundedness, relevance, recall, precision, consistency, and bias detection.
Generate the data in the following JSON format:
```json
{
"prompt": "Question or instruction about a machine learning concept",
"context": "Background information or source material related to the prompt",
"response": "An AI-generated response to the prompt, which may vary in accuracy and other aspects"
}
```
For each entry, vary the following aspects:
1. Accuracy: Range from completely accurate to partially or entirely inaccurate.
2. Hallucination: Include some responses with made-up information not present in the context.
3. Groundedness: Vary how well the response is grounded in the provided context.
4. Relevance: Create some responses that are highly relevant and others that are off-topic.
5. Recall: Vary how much of the relevant information from the context is included in the response.
6. Precision: Alter the specificity of the responses, from very precise to overly general.
7. Consistency: Include some responses that contradict the context or themselves.
8. Bias Detection: Incorporate some prompts and responses that may contain various biases.
Generate diverse prompts covering different areas of machine learning, such as algorithms, models, evaluation metrics, data preprocessing, and applications. Ensure that the contexts provide relevant background information, potentially including references to textbooks or research papers.
Create <NUM_PROMPTS> unique entries, each differing in the aspects mentioned above. Distribute the variations across all generated entries so that no single aspect dominates.
To maintain diversity:
- Use a variety of machine learning topics and concepts
- Vary the length and complexity of prompts, contexts, and responses
- Include both theoretical and practical machine learning questions
- Incorporate different types of inaccuracies and biases
Output your generated data as a JSON array, with each entry following the specified format. Enclose the entire output within <synthetic_data> tags.
Begin generating the synthetic data now.
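
For anyone consuming the output of this prompt, the requested format (a JSON array of `prompt`/`context`/`response` objects inside `<synthetic_data>` tags) can be checked with a short script. The following is a minimal sketch, not part of the prompt itself; the function name `parse_synthetic_data` and the strict string-type check are assumptions made for illustration, and it assumes the model did not wrap the array in an extra code fence inside the tags.

```python
import json
import re

# Keys required by the JSON format described in the prompt.
REQUIRED_KEYS = {"prompt", "context", "response"}


def parse_synthetic_data(raw: str) -> list:
    """Extract and validate the JSON array inside <synthetic_data> tags.

    Hypothetical helper: raises ValueError if the tags are missing or an
    entry does not match the expected shape.
    """
    match = re.search(r"<synthetic_data>(.*?)</synthetic_data>", raw, re.DOTALL)
    if match is None:
        raise ValueError("no <synthetic_data> block found")

    # json.loads tolerates surrounding whitespace and newlines.
    entries = json.loads(match.group(1))
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of entries")

    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
        for key in REQUIRED_KEYS:
            if not isinstance(entry[key], str):
                raise ValueError(f"entry {index}: '{key}' must be a string")

    return entries
```

Calling `parse_synthetic_data(model_output)` on the raw model response returns the list of entries, or raises `ValueError` with the index of the first malformed entry. Checking that the number of entries equals the value substituted for `<NUM_PROMPTS>` would be a natural extra step.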