Create prompts_v1.py
prompts_v1.py  ADDED  +371 -0
@@ -0,0 +1,371 @@
diversity_metrics = """
Objective: You are an expert evaluator tasked with assessing the diversity of language and ideas in generated text. Your goal is to evaluate the RAG (Retrieval-Augmented Generation) response for lexical variety, structural diversity, and originality relative to the context provided. Assign a single evaluation score between 0.0 and 1.0 based on how creatively diverse the response is.
Evaluation Task: Given a question, context, and the RAG response, evaluate the diversity of the response based on these criteria:
• Lexical Diversity: Evaluate the variety of vocabulary used in the response. Does it avoid excessive repetition and demonstrate a range of vocabulary?
• Structural Variety: Assess the diversity of sentence structures, including the use of different sentence types and lengths.
• Originality: Examine the originality of the response by comparing it with the retrieved context. Does the answer bring in unique phrasing, expressions, or perspectives beyond direct repetition of the context?
Score Range: Assign scores within the following range based on the diversity of the generated text:
• 0.0: No diversity — Highly repetitive, uses basic vocabulary and predictable structures.
• 0.3: Low diversity — Limited vocabulary and repetitive structures with minor attempts at originality.
• 0.5: Moderate diversity — Some varied vocabulary and sentence types; minimal repetition but still follows similar patterns.
• 0.7: Good diversity — Generally varied vocabulary and structures, with some unique phrasing or ideas introduced.
• 1.0: High diversity — Rich and varied vocabulary, diverse sentence structures, and highly original content beyond the retrieved context.
Special Instructions:
• Evaluate based solely on the provided context and avoid assumptions.
• Focus on language variety, not factual accuracy.
• Give partial credit if there is some repetition but notable diversity elsewhere.
• Document reasons for score deductions to ensure consistency across evaluations.
Output Format: Return only a JSON object in the following structure:
{{
"eval_score": <score between 0.0 and 1.0>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Mona Lisa is one of Leonardo da Vinci's most famous paintings."
• Question: "Describe the Mona Lisa."
• RAG Response: "The Mona Lisa, painted by Leonardo da Vinci, is a masterpiece of Renaissance art. Known for its captivating smile and intricate background, it captures timeless beauty."
• Score: 0.8
• Explanation: The response introduces varied descriptors and rich vocabulary, with minimal repetition.
Example 2
• Context: "Albert Einstein was a physicist who developed the theory of relativity."
• Question: "Who was Albert Einstein?"
• RAG Response: "Albert Einstein was a physicist who developed the theory of relativity. He was a physicist."
• Score: 0.3
• Explanation: The response is repetitive, restating information with little variety in language or structure.
Example 3
• Context: "The Amazon rainforest is the world’s largest tropical rainforest, located in South America."
• Question: "Tell me about the Amazon rainforest."
• RAG Response: "The Amazon is the largest tropical rainforest in the world, located in South America. It covers millions of square miles and is known for its biodiversity and vast rivers."
• Score: 1.0
• Explanation: The response demonstrates high diversity, introducing new vocabulary and varied sentence structure while enriching the description.
Instructions Summary:
• Consider all criteria to determine a comprehensive score.
• Provide a single numeric score that reflects the overall diversity of the RAG response.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

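# --- Usage sketch (illustrative; not part of the original file) --------------
# A minimal example of how a template such as diversity_metrics is presumably
# consumed: the doubled braces {{ }} survive str.format() as literal JSON
# braces, so only {question}, {answer}, and {context} are substituted, and the
# judge model's reply is expected to be the small JSON object requested above.
# The helper names and the json parsing step are assumptions for illustration;
# the actual LLM call is out of scope here.
import json

def render_diversity_prompt(question: str, answer: str, context: str) -> str:
    # Substitute the three shared placeholders; everything else is sent verbatim.
    return diversity_metrics.format(question=question, answer=answer, context=context)

def parse_eval_reply(reply: str) -> dict:
    # Expected shape: {"eval_score": <number>, "explanation": "<short explanation>"}
    return json.loads(reply)
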
creativity_metric = """
Objective:
You are an expert evaluator tasked with assessing the creativity of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how original, inventive, and engaging the response is, particularly in creative use cases. Assign a single score from 1 to 5, where a higher score indicates greater creativity.
Evaluation Task:
Given a question, context, and RAG response, assess the creativity of the response by considering the following criteria:
• Originality: Does the response introduce fresh ideas, unique perspectives, or an original approach to answering the question?
• Inventiveness: Does it use imaginative language, such as metaphors, vivid descriptions, or analogies?
• Engagement: Is the response engaging and surprising in a positive way, capturing attention and making the content memorable?
Score Range:
Assign a score from 1 to 5 based on the creativity of the generated text:
• 1: Not creative — The response is predictable, lacks novelty, and feels flat or formulaic.
• 2: Slightly creative — Some attempts at creativity are present, but the response remains mostly conventional.
• 3: Moderately creative — The response shows a fair degree of originality and includes some inventive elements.
• 4: Creative — The response demonstrates clear creativity with multiple inventive phrases or unique ideas.
• 5: Very creative — The response is highly original, inventive, and memorable, showcasing an excellent degree of creativity.
Special Instructions:
• Focus on creativity without assessing factual accuracy; this metric purely assesses inventiveness.
• Avoid penalizing the response for minor factual inaccuracies unless they disrupt the overall creativity.
• Provide specific reasons for score deductions to ensure consistent evaluations.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The sun sets every evening, painting the sky with shades of orange and pink."
• Question: "Describe a sunset."
• RAG Response: "As the golden sun dips below the horizon, the sky transforms into a fiery canvas, with streaks of amber melting into soft pinks and purples, as if nature were an artist with a brush of light."
• Score: 5
• Explanation: The response is highly creative, with vivid and original imagery that captures attention and adds emotional depth.
Example 2
• Context: "The moon orbits the Earth and is visible at night."
• Question: "Describe the moon."
• RAG Response: "The moon, Earth’s loyal companion, softly illuminates the night, a gentle eye watching over the sleeping world."
• Score: 4
• Explanation: The response is creative, using metaphorical language and a unique perspective, though it could have incorporated more inventive details.
Example 3
• Context: "The Eiffel Tower is a famous landmark in Paris, France."
• Question: "What is the Eiffel Tower?"
• RAG Response: "The Eiffel Tower, a tall structure in Paris, is a famous tourist spot."
• Score: 1
• Explanation: The response is straightforward and lacks creativity, merely repeating known facts without adding original or engaging elements.
Example 4
• Context: "Butterflies have colorful wings and are common in gardens."
• Question: "Describe a butterfly."
• RAG Response: "A butterfly flutters gracefully, its wings like delicate stained glass, alive with colors that dance in the sunlight."
• Score: 5
• Explanation: The response uses poetic language and metaphor, showcasing high creativity and making the image come alive.
Instructions Summary:
• Consider all criteria holistically to determine a comprehensive score.
• Provide a single numeric score reflecting the overall creativity of the RAG response.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

groundedness_metric = """
Objective:
You are an expert evaluator tasked with assessing the groundedness of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate whether the response is fully supported by the information provided in the retrieved context, without adding any unsupported information. Assign a score from 1 to 5, where a higher score indicates a higher degree of groundedness.
Evaluation Task:
Given a question, context, and RAG response, assess the groundedness of the response based on the following criteria:
• Contextual Alignment: How accurately does the response align with the facts in the context?
• Absence of Fabrication: Does the response avoid introducing details not present in the context?
• Faithfulness: Does the response accurately reflect the information without reinterpretation or distortion?
Score Range:
Assign a score from 1 to 5 based on how well the response aligns with the context:
• 1: Not grounded — The response includes multiple unsupported or fabricated details and significantly deviates from the context.
• 2: Slightly grounded — Contains some alignment with the context but includes substantial unsupported or speculative content.
• 3: Moderately grounded — The response mostly aligns but includes minor ungrounded elements.
• 4: Mostly grounded — The response aligns well with the context, with only minor and non-critical unsupported details.
• 5: Very grounded — The response is entirely supported by the context with no ungrounded information.
Special Instructions:
• Focus strictly on alignment with the context; do not consider creativity or stylistic aspects.
• Ignore minor stylistic choices if they do not affect factual grounding.
• Provide specific reasons for score deductions to ensure consistency.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Eiffel Tower is located in Paris, France, and was completed in 1889."
• Question: "Where is the Eiffel Tower located?"
• RAG Response: "The Eiffel Tower is located in Paris and was built in 1889."
• Score: 5
• Explanation: The response is fully grounded in the context, accurately reflecting both location and completion date.
Example 2
• Context: "The Great Wall of China was constructed over several dynasties, most notably during the Ming Dynasty (1368-1644)."
• Question: "When was the Great Wall of China built?"
• RAG Response: "The Great Wall was built in 1368."
• Score: 3
• Explanation: The response is partially grounded; the year 1368 falls within the Ming Dynasty period given in the context, but reducing construction across several dynasties to a single year is not supported.
Example 3
• Context: "Albert Einstein developed the theory of relativity in the early 20th century."
• Question: "Who developed the theory of relativity?"
• RAG Response: "Albert Einstein developed the theory of relativity."
• Score: 5
• Explanation: The response is fully grounded, providing accurate information directly supported by the context.
Example 4
• Context: "The capital of Italy is Rome, known for its ancient history and architecture."
• Question: "What is the capital of Italy?"
• RAG Response: "The capital of Italy is Rome, which was founded by Romulus in 753 BC."
• Score: 2
• Explanation: The response introduces unsupported historical information (founding by Romulus), which is not present in the context.
Instructions Summary:
• Ensure that all information in the response is supported by the provided context.
• Assign a single numeric score reflecting the overall groundedness of the RAG answer.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

coherence_metric = """
Objective:
You are an expert evaluator tasked with assessing the coherence of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how naturally and logically the response flows, considering clarity, sentence structure, and the logical integration of retrieved information. Assign a score from 1 to 5, where a higher score indicates a higher degree of coherence.
Evaluation Task:
Given a question, context, and RAG response, evaluate the coherence of the response based on the following criteria:
• Logical Flow of Ideas: Does the response present information in a logical sequence with clear transitions between ideas?
• Sentence Structure: Are sentences grammatically correct and appropriately structured to convey ideas clearly?
• Relevance and Integration of Context: Is the retrieved information presented naturally within the response, aligning with the query and context?
Score Range:
Assign a score from 1 to 5 based on the coherence of the response:
• 1: Not coherent — The response lacks logical flow, contains abrupt transitions, or is difficult to understand.
• 2: Slightly coherent — The response has some order but includes gaps in logic or awkward phrasing.
• 3: Moderately coherent — The response mostly flows well, though minor issues may disrupt clarity or logic.
• 4: Mostly coherent — The response is well-organized and mostly smooth, with only slight awkwardness or minor inconsistencies.
• 5: Very coherent — The response is seamless, logically ordered, and presents information with natural integration and readability.
Special Instructions:
• Focus on the clarity, flow, and logical consistency of the response, rather than factual correctness or creativity.
• Note any issues in sentence structure or awkward transitions when deducting points.
• Ensure consistency in scoring by documenting specific reasons for score deductions.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Eiffel Tower, an iconic symbol of France, was completed in 1889 and is located in Paris."
• Question: "When was the Eiffel Tower built?"
• RAG Response: "The Eiffel Tower, located in Paris, was completed in 1889. It’s one of the most recognized monuments in France."
• Score: 5
• Explanation: The response flows logically and smoothly incorporates additional relevant information about the Eiffel Tower.
Example 2
• Context: "The Great Wall of China was built over many centuries, with the most well-known sections constructed during the Ming Dynasty (1368-1644)."
• Question: "When was the Great Wall of China built?"
• RAG Response: "The Great Wall of China was constructed during the Ming Dynasty, lasting from 1368-1644."
• Score: 4
• Explanation: The response is mostly coherent but simplifies the centuries-long construction process.
Example 3
• Context: "The capital of Japan is Tokyo, known for its vibrant culture, advanced technology, and rich history."
• Question: "What is the capital of Japan?"
• RAG Response: "Tokyo is the capital of Japan. It has advanced technology, and culture, it is rich in."
• Score: 2
• Explanation: The response contains awkward phrasing that disrupts the flow and clarity, making it difficult to understand fully.
Example 4
• Context: "Albert Einstein, a physicist, developed the theory of relativity, fundamentally changing modern physics."
• Question: "What did Albert Einstein develop?"
• RAG Response: "Albert Einstein, modern physics, theory of relativity changed physics fundamentally."
• Score: 1
• Explanation: The response lacks coherence, with no logical flow or clear sentence structure.
Instructions Summary:
• Evaluate the logical flow, structure, and integration of the response.
• Assign a single numeric score that reflects the overall coherence of the RAG answer.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

pointwise_metric = """
Objective
You are an expert evaluator tasked with assessing the quality of a response generated by a Retrieval-Augmented Generation (RAG) system. Your goal is to assign a pointwise score based on two core criteria, Relevance and Correctness, considering how well the generated response aligns with the query and context provided. Use a scale from 0 to 5, where higher scores indicate superior performance.

Evaluation Task
Given a question, the retrieved context, and the RAG-generated response:
1. Relevance: Judge how well the response addresses the question, considering completeness, topicality, and adherence to the retrieved context.
2. Correctness: Evaluate the factual accuracy of the response relative to the provided context. Penalize fabricated or unsupported information.
Consider the following while scoring:
• Relevance:
o Does the response directly answer the question?
o Is the response complete, and does it avoid unnecessary repetition?
o Is the response consistent with the retrieved context?
• Correctness:
o Does the response align factually with the context?
o Are all details (names, dates, events) accurate and supported?
o Does the response avoid hallucinations or unsupported claims?

Scoring Guidelines
Assign a single score between 0 and 5 that reflects both relevance and correctness:
• 0: Completely irrelevant or factually incorrect.
• 1: Poor quality; partially addresses the question with significant inaccuracies.
• 2: Subpar; attempts to answer but contains multiple issues with relevance or correctness.
• 3: Average; partially relevant and correct but incomplete or slightly inconsistent.
• 4: Good; relevant, mostly correct, and well-aligned with the context.
• 5: Excellent; perfectly relevant, factually accurate, and fully supported by the context.

Special Instructions
• Evaluate only based on the provided question, context, and response. Avoid relying on general knowledge or external information.
• Provide a brief explanation for your score, highlighting specific strengths and weaknesses.
• Be consistent across evaluations by adhering to the scoring criteria.

Output Format
Return only a JSON object in the following structure:
{{
"eval_score": <score between 0 and 5>,
"explanation": "<short explanation>"
}}

Few-Shot Examples
Example 1
• Question: "What is the capital of France?"
• Context: "Paris is the capital of France and a major European city."
• RAG Response: "Paris is the capital of France."
• Score: 5
• Explanation: The response is perfectly relevant, concise, and factually accurate.
Example 2
• Question: "Who developed the theory of relativity?"
• Context: "Albert Einstein developed the theory of relativity in the early 20th century."
• RAG Response: "Albert Einstein created the theory of relativity in 1879."
• Score: 3
• Explanation: While the response is relevant, the year provided is incorrect, reducing correctness.
Example 3
• Question: "When was the Great Wall of China built?"
• Context: "The Great Wall of China was constructed over centuries, with major sections built during the Ming Dynasty (1368-1644)."
• RAG Response: "The Great Wall of China was built in 1368 during the Ming Dynasty."
• Score: 4
• Explanation: The response is mostly accurate but oversimplifies the construction timeline.
Example 4
• Question: "What is the boiling point of water at sea level?"
• Context: "Water boils at 100°C (212°F) at standard atmospheric pressure."
• RAG Response: "Water boils at 80°C at sea level."
• Score: 1
• Explanation: The response is factually incorrect, deviating from the context.

Instructions Summary
• Judge relevance and correctness holistically to provide a single score.
• Ensure consistency by grounding your evaluations in the provided criteria.

Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

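# --- Metric registry sketch (illustrative; not part of the original file) ----
# The five templates above all take the same {question}/{answer}/{context}
# placeholders and request the same {"eval_score", "explanation"} reply shape
# (only the score scales differ), so a caller could keep them in a dict and
# render any of them with the same .format call. The dict name is an assumption.
SINGLE_ANSWER_METRICS = {
    "diversity": diversity_metrics,
    "creativity": creativity_metric,
    "groundedness": groundedness_metric,
    "coherence": coherence_metric,
    "pointwise": pointwise_metric,
}
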
pairwise_metric = """
Objective
You are an expert evaluator tasked with comparing responses from two models for a given query. Your goal is to determine which response is better based on predefined evaluation criteria such as relevance, correctness, coherence, and completeness. Provide a binary decision:
• 1: The first response is better.
• 0: The second response is better.

Evaluation Task
Given a question, retrieved context, and outputs from two models:
1. Relevance: Does the response directly address the question?
2. Correctness: Is the response factually accurate and aligned with the retrieved context?
3. Coherence: Does the response flow logically, making it easy to understand?
4. Completeness: Does the response sufficiently cover all aspects of the question without omitting key details or introducing unnecessary information?
Use these criteria to identify which response is superior.

Instructions
1. Carefully review the question, retrieved context, and model responses.
2. Compare the two responses based on the four evaluation criteria (relevance, correctness, coherence, and completeness).
3. Choose the better response and provide a brief explanation for your decision.
4. Avoid relying on personal knowledge or external information—evaluate solely based on the inputs provided.

Scoring Guidelines
• Assign a 1 if the first response is better.
• Assign a 0 if the second response is better.

Output Format
Provide a JSON object with your decision and reasoning:
{{
"better_response": <1 or 0>,
"explanation": "<short explanation>"
}}

Few-Shot Examples
Example 1
• Question: "What is the capital of Germany?"
• Retrieved Context: "Berlin is the capital and largest city of Germany."
• Response 1: "The capital of Germany is Berlin."
• Response 2: "The capital of Germany is Munich."
• Better Response: 1
• Explanation: Response 1 is factually accurate and aligns with the context, while Response 2 is incorrect.
Example 2
• Question: "Who developed the telephone?"
• Retrieved Context: "Alexander Graham Bell is credited with inventing the telephone in 1876."
• Response 1: "Alexander Graham Bell invented the telephone."
• Response 2: "The telephone was developed in 1876 by Alexander Graham Bell."
• Better Response: 2
• Explanation: While both responses are correct, Response 2 is more complete as it includes the year of invention.
Example 3
• Question: "What are the uses of renewable energy?"
• Retrieved Context: "Renewable energy sources like solar and wind power are used for electricity generation, heating, and reducing carbon emissions."
• Response 1: "Renewable energy is used for generating electricity and heating."
• Response 2: "Renewable energy reduces carbon emissions, generates electricity, and is used for heating."
• Better Response: 2
• Explanation: Response 2 is more complete and aligns better with the retrieved context.

Instructions Summary
• Compare the two responses based on relevance, correctness, coherence, and completeness.
• Select the better response and explain your choice concisely.

Input Data:
• Question: {question}
• Context: {context}
• Response 1: {answer_1}
• Response 2: {answer_2}
"""
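
# --- Pairwise usage sketch (illustrative; not part of the original file) -----
# Unlike the metrics above, pairwise_metric compares two candidate answers via
# {answer_1} and {answer_2} and asks for {"better_response": 1 or 0} rather
# than an eval_score. A minimal rendering helper, assuming str.format rendering
# as with the other templates; the helper name is an assumption.
def render_pairwise_prompt(question: str, context: str, answer_1: str, answer_2: str) -> str:
    return pairwise_metric.format(
        question=question, context=context, answer_1=answer_1, answer_2=answer_2
    )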