sango07 committed
Commit ed31fd5 · verified · 1 Parent(s): df3af2b

Create prompts_v1.py

Files changed (1):
  1. prompts_v1.py +371 -0

prompts_v1.py ADDED
@@ -0,0 +1,371 @@
+ diversity_metrics = """
+ Objective: You are an expert evaluator tasked with assessing the diversity of language and ideas in generated text. Your goal is to evaluate the RAG (Retrieval-Augmented Generation) response for lexical variety, structural diversity, and originality relative to the context provided. Assign a single evaluation score between 0.0 and 1.0 based on how creatively diverse the response is.
+ Evaluation Task: Given a question, context, and the RAG response, evaluate the diversity of the response based on these criteria:
+ • Lexical Diversity: Evaluate the variety of vocabulary used in the response. Does it avoid excessive repetition and demonstrate a range of vocabulary?
+ • Structural Variety: Assess the diversity of sentence structures, including the use of different sentence types and lengths.
+ • Originality: Examine the originality of the response by comparing it with the retrieved context. Does the answer bring in unique phrasing, expressions, or perspectives beyond direct repetition of the context?
+ Score Range: Assign scores within the following range based on the diversity in the generated text:
+ • 0.0: No diversity — Highly repetitive, uses basic vocabulary and predictable structures.
+ • 0.3: Low diversity — Limited vocabulary and repetitive structures with minor attempts at originality.
+ • 0.5: Moderate diversity — Some varied vocabulary and sentence types; minimal repetition but still follows similar patterns.
+ • 0.7: Good diversity — Generally varied vocabulary and structures, with some unique phrasing or ideas introduced.
+ • 1.0: High diversity — Rich and varied vocabulary, diverse sentence structures, and highly original content beyond the retrieved context.
+ Special Instructions:
+ • Evaluate based solely on the provided context and avoid assumptions.
+ • Focus on language variety, not factual accuracy.
+ • Give partial credit if there’s some repetition but notable diversity elsewhere.
+ • Document reasons for score deductions to ensure consistency across evaluations.
+ Output Format: Return only a JSON object in the following structure:
+ {{
+ "eval_score": <score between 0.0 and 1.0>,
+ "explanation": "<short explanation>"
+ }}
+ Few-Shot Examples:
+ Example 1
+ • Context: "The Mona Lisa is one of Leonardo da Vinci's most famous paintings."
+ • Question: "Describe the Mona Lisa."
+ • RAG Response: "The Mona Lisa, painted by Leonardo da Vinci, is a masterpiece of Renaissance art. Known for its captivating smile and intricate background, it captures timeless beauty."
+ • Score: 0.8
+ • Explanation: The response introduces varied descriptors and rich vocabulary, with minimal repetition.
+ Example 2
+ • Context: "Albert Einstein was a physicist who developed the theory of relativity."
+ • Question: "Who was Albert Einstein?"
+ • RAG Response: "Albert Einstein was a physicist who developed the theory of relativity. He was a physicist."
+ • Score: 0.3
+ • Explanation: The response is repetitive, restating information with little variety in language or structure.
+ Example 3
+ • Context: "The Amazon rainforest is the world’s largest tropical rainforest, located in South America."
+ • Question: "Tell me about the Amazon rainforest."
+ • RAG Response: "The Amazon is the largest tropical rainforest in the world, located in South America. It covers millions of square miles and is known for its biodiversity and vast rivers."
+ • Score: 1.0
+ • Explanation: The response demonstrates high diversity, introducing new vocabulary and varied sentence structure while enriching the description.
+ Instructions Summary:
+ • Consider all criteria to determine a comprehensive score.
+ • Provide a single numeric score that reflects the overall diversity of the RAG response.
+ Input Data:
+ • Question: {question}
+ • RAG’s Answer: {answer}
+ • Context: {context}
+ """
+
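A minimal usage sketch for the template above, assuming the {question}/{answer}/{context} placeholders are filled with Python's str.format (the doubled braces around the JSON example collapse to literal braces) and assuming a hypothetical llm_complete(prompt) helper that returns the model's raw text:

import json

def evaluate_diversity(question: str, answer: str, context: str) -> dict:
    # Render the template; {{ ... }} in the string becomes literal { ... } after .format().
    prompt = diversity_metrics.format(question=question, answer=answer, context=context)
    raw = llm_complete(prompt)  # hypothetical LLM call, not defined in this file
    # Assumes the model returns the bare JSON object requested by the prompt.
    verdict = json.loads(raw)   # {"eval_score": <0.0-1.0>, "explanation": "..."}
    return verdict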
+ creativity_metric = """
+ Objective:
+ You are an expert evaluator tasked with assessing the creativity of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how original, inventive, and engaging the response is, particularly in creative use cases. Assign a single score from 1 to 5, where a higher score indicates greater creativity.
+ Evaluation Task:
+ Given a question, context, and RAG response, assess the creativity of the response by considering the following criteria:
+ • Originality: Does the response introduce fresh ideas, unique perspectives, or an original approach to answering the question?
+ • Inventiveness: Does it use imaginative language, such as metaphors, vivid descriptions, or analogies?
+ • Engagement: Is the response engaging and surprising in a positive way, capturing attention and making the content memorable?
+ Score Range:
+ Assign a score from 1 to 5 based on the creativity in the generated text:
+ • 1: Not creative — The response is predictable, lacks novelty, and feels flat or formulaic.
+ • 2: Slightly creative — Some attempts at creativity are present, but the response remains mostly conventional.
+ • 3: Moderately creative — The response shows a fair degree of originality and includes some inventive elements.
+ • 4: Creative — The response demonstrates clear creativity with multiple inventive phrases or unique ideas.
+ • 5: Very creative — The response is highly original, inventive, and memorable, showcasing an excellent degree of creativity.
+ Special Instructions:
+ • Focus on creativity without assessing factual accuracy; this metric purely assesses inventiveness.
+ • Avoid penalizing the response for minor factual inaccuracies unless they disrupt the overall creativity.
+ • Provide specific reasons for score deductions to ensure consistent evaluations.
+ Output Format:
+ Return only a JSON object in the following structure:
+ {{
+ "eval_score": <score between 1 and 5>,
+ "explanation": "<short explanation>"
+ }}
+ Few-Shot Examples:
+ Example 1
+ • Context: "The sun sets every evening, painting the sky with shades of orange and pink."
+ • Question: "Describe a sunset."
+ • RAG Response: "As the golden sun dips below the horizon, the sky transforms into a fiery canvas, with streaks of amber melting into soft pinks and purples, as if nature were an artist with a brush of light."
+ • Score: 5
+ • Explanation: The response is highly creative, with vivid and original imagery that captures attention and adds emotional depth.
+ Example 2
+ • Context: "The moon orbits the Earth and is visible at night."
+ • Question: "Describe the moon."
+ • RAG Response: "The moon, Earth’s loyal companion, softly illuminates the night, a gentle eye watching over the sleeping world."
+ • Score: 4
+ • Explanation: The response is creative, using metaphorical language and a unique perspective, though it could have incorporated more inventive details.
+ Example 3
+ • Context: "The Eiffel Tower is a famous landmark in Paris, France."
+ • Question: "What is the Eiffel Tower?"
+ • RAG Response: "The Eiffel Tower, a tall structure in Paris, is a famous tourist spot."
+ • Score: 1
+ • Explanation: The response is straightforward and lacks creativity, merely repeating known facts without adding original or engaging elements.
+ Example 4
+ • Context: "Butterflies have colorful wings and are common in gardens."
+ • Question: "Describe a butterfly."
+ • RAG Response: "A butterfly flutters gracefully, its wings like delicate stained glass, alive with colors that dance in the sunlight."
+ • Score: 5
+ • Explanation: The response uses poetic language and metaphor, showcasing high creativity and making the image come alive.
+ Instructions Summary:
+ • Consider all criteria holistically to determine a comprehensive score.
+ • Provide a single numeric score reflecting the overall creativity of the RAG response.
+ Input Data:
+ • Question: {question}
+ • RAG’s Answer: {answer}
+ • Context: {context}
+ """
+
+
+ groundedness_metric = """
+ Objective:
+ You are an expert evaluator tasked with assessing the groundedness of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate whether the response is fully supported by the information provided in the retrieved context, without adding any unsupported information. Assign a score from 1 to 5, where a higher score indicates a higher degree of groundedness.
+ Evaluation Task:
+ Given a question, context, and RAG response, assess the groundedness of the response based on the following criteria:
+ • Contextual Alignment: How accurately does the response align with the facts in the context?
+ • Absence of Fabrication: Does the response avoid introducing details not present in the context?
+ • Faithfulness: Does the response accurately reflect the information without reinterpretation or distortion?
+ Score Range:
+ Assign a score from 1 to 5 based on how well the response aligns with the context:
+ • 1: Not grounded — The response includes multiple unsupported or fabricated details and significantly deviates from the context.
+ • 2: Slightly grounded — Contains some alignment with the context but includes substantial unsupported or speculative content.
+ • 3: Moderately grounded — The response mostly aligns but includes minor ungrounded elements.
+ • 4: Mostly grounded — The response aligns well with the context, with only minor and non-critical unsupported details.
+ • 5: Very grounded — The response is entirely supported by the context with no ungrounded information.
+ Special Instructions:
+ • Focus strictly on the alignment with the context; do not consider creativity or stylistic aspects.
+ • Ignore minor stylistic choices if they do not affect factual grounding.
+ • Provide specific reasons for score deductions to ensure consistency.
+ Output Format:
+ Return only a JSON object in the following structure:
+ {{
+ "eval_score": <score between 1 and 5>,
+ "explanation": "<short explanation>"
+ }}
+ Few-Shot Examples:
+ Example 1
+ • Context: "The Eiffel Tower is located in Paris, France, and was completed in 1889."
+ • Question: "Where is the Eiffel Tower located?"
+ • RAG Response: "The Eiffel Tower is located in Paris and was built in 1889."
+ • Score: 5
+ • Explanation: The response is fully grounded in the context, accurately reflecting both location and completion date.
+ Example 2
+ • Context: "The Great Wall of China was constructed over several dynasties, most notably during the Ming Dynasty (1368-1644)."
+ • Question: "When was the Great Wall of China built?"
+ • RAG Response: "The Great Wall was built in 1368."
+ • Score: 3
+ • Explanation: The response is partially grounded: the year 1368 appears in the context as the start of the Ming Dynasty, but presenting it as the single construction year misrepresents the multi-dynasty timeline.
+ Example 3
+ • Context: "Albert Einstein developed the theory of relativity in the early 20th century."
+ • Question: "Who developed the theory of relativity?"
+ • RAG Response: "Albert Einstein developed the theory of relativity."
+ • Score: 5
+ • Explanation: The response is fully grounded, providing accurate information directly supported by the context.
+ Example 4
+ • Context: "The capital of Italy is Rome, known for its ancient history and architecture."
+ • Question: "What is the capital of Italy?"
+ • RAG Response: "The capital of Italy is Rome, which was founded by Romulus in 753 BC."
+ • Score: 2
+ • Explanation: The response introduces unsupported historical information (founding by Romulus), which is not present in the context.
+ Instructions Summary:
+ • Ensure that all information in the response is supported by the provided context.
+ • Assign a single numeric score reflecting the overall groundedness of the RAG answer.
+ Input Data:
+ • Question: {question}
+ • RAG’s Answer: {answer}
+ • Context: {context}
+ """
+
+
+ coherence_metric = """
+ Objective:
+ You are an expert evaluator tasked with assessing the coherence of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how naturally and logically the response flows, considering clarity, sentence structure, and the logical integration of retrieved information. Assign a score from 1 to 5, where a higher score indicates a higher degree of coherence.
+ Evaluation Task:
+ Given a question, context, and RAG response, evaluate the coherence of the response based on the following criteria:
+ • Logical Flow of Ideas: Does the response present information in a logical sequence with clear transitions between ideas?
+ • Sentence Structure: Are sentences grammatically correct and appropriately structured to convey ideas clearly?
+ • Relevance and Integration of Context: Is the retrieved information presented naturally within the response, aligning with the query and context?
+ Score Range:
+ Assign a score from 1 to 5 based on the coherence of the response:
+ • 1: Not coherent — The response lacks logical flow, contains abrupt transitions, or is difficult to understand.
+ • 2: Slightly coherent — The response has some order but includes gaps in logic or awkward phrasing.
+ • 3: Moderately coherent — The response mostly flows well, though minor issues may disrupt clarity or logic.
+ • 4: Mostly coherent — The response is well-organized and mostly smooth, with only slight awkwardness or minor inconsistencies.
+ • 5: Very coherent — The response is seamless, logically ordered, and presents information with natural integration and readability.
+ Special Instructions:
+ • Focus on the clarity, flow, and logical consistency of the response, rather than factual correctness or creativity.
+ • Note any issues in sentence structure or awkward transitions when deducting points.
+ • Ensure consistency in scoring by documenting specific reasons for score deductions.
+ Output Format:
+ Return only a JSON object in the following structure:
+ {{
+ "eval_score": <score between 1 and 5>,
+ "explanation": "<short explanation>"
+ }}
+ Few-Shot Examples:
+ Example 1
+ • Context: "The Eiffel Tower, an iconic symbol of France, was completed in 1889 and is located in Paris."
+ • Question: "When was the Eiffel Tower built?"
+ • RAG Response: "The Eiffel Tower, located in Paris, was completed in 1889. It’s one of the most recognized monuments in France."
+ • Score: 5
+ • Explanation: The response flows logically and smoothly incorporates additional relevant information about the Eiffel Tower.
+ Example 2
+ • Context: "The Great Wall of China was built over many centuries, with the most well-known sections constructed during the Ming Dynasty (1368-1644)."
+ • Question: "When was the Great Wall of China built?"
+ • RAG Response: "The Great Wall of China was constructed during the Ming Dynasty, lasting from 1368-1644."
+ • Score: 4
+ • Explanation: The response is mostly coherent but simplifies the centuries-long construction process.
+ Example 3
+ • Context: "The capital of Japan is Tokyo, known for its vibrant culture, advanced technology, and rich history."
+ • Question: "What is the capital of Japan?"
+ • RAG Response: "Tokyo is the capital of Japan. It has advanced technology, and culture, it is rich in."
+ • Score: 2
+ • Explanation: The response contains awkward phrasing that disrupts the flow and clarity, making it difficult to understand fully.
+ Example 4
+ • Context: "Albert Einstein, a physicist, developed the theory of relativity, fundamentally changing modern physics."
+ • Question: "What did Albert Einstein develop?"
+ • RAG Response: "Albert Einstein, modern physics, theory of relativity changed physics fundamentally."
+ • Score: 1
+ • Explanation: The response lacks coherence, with no logical flow or clear sentence structure.
+ Instructions Summary:
+ • Evaluate the logical flow, structure, and integration of the response.
+ • Assign a single numeric score that reflects the overall coherence of the RAG answer.
+ Input Data:
+ • Question: {question}
+ • RAG’s Answer: {answer}
+ • Context: {context}
+ """
+
+
+
+
+ pointwise_metric = """
+ Objective
+ You are an expert evaluator tasked with assessing the quality of a response generated by a Retrieval-Augmented Generation (RAG) system. Your goal is to assign a pointwise score based on two core criteria: Relevance and Correctness, considering how well the generated response aligns with the query and context provided. Use a scale from 0 to 5, where higher scores indicate superior performance.
+
+ Evaluation Task
+ Given a question, the retrieved context, and the RAG-generated response:
+ 1. Relevance: Judge how well the response addresses the question, considering completeness, topicality, and adherence to the retrieved context.
+ 2. Correctness: Evaluate the factual accuracy of the response relative to the provided context. Penalize fabricated or unsupported information.
+ Consider the following while scoring:
+ • Relevance:
+   o Does the response directly answer the question?
+   o Is the response complete and free of unnecessary repetition?
+   o Is the response consistent with the retrieved context?
+ • Correctness:
+   o Does the response align factually with the context?
+   o Are all details (names, dates, events) accurate and supported?
+   o Does the response avoid hallucinations or unsupported claims?
+
+ Scoring Guidelines
+ Assign a single score between 0 and 5 that reflects both relevance and correctness:
+ • 0: Completely irrelevant or factually incorrect.
+ • 1: Poor quality; partially addresses the question with significant inaccuracies.
+ • 2: Subpar; attempts to answer but contains multiple issues with relevance or correctness.
+ • 3: Average; partially relevant and correct but incomplete or slightly inconsistent.
+ • 4: Good; relevant, mostly correct, and well-aligned with the context.
+ • 5: Excellent; perfectly relevant, factually accurate, and fully supported by the context.
+
+ Special Instructions
+ • Evaluate only based on the provided question, context, and response. Avoid relying on general knowledge or external information.
+ • Provide a brief explanation for your score, highlighting specific strengths and weaknesses.
+ • Be consistent across evaluations by adhering to the scoring criteria.
+
+ Output Format
+ Return only a JSON object in the following structure:
+ {{
+ "eval_score": <score between 0 and 5>,
+ "explanation": "<short explanation>"
+ }}
+
+ Few-Shot Examples
+ Example 1
+ • Question: "What is the capital of France?"
+ • Context: "Paris is the capital of France and a major European city."
+ • RAG Response: "Paris is the capital of France."
+ • Score: 5
+ • Explanation: The response is perfectly relevant, concise, and factually accurate.
+ Example 2
+ • Question: "Who developed the theory of relativity?"
+ • Context: "Albert Einstein developed the theory of relativity in the early 20th century."
+ • RAG Response: "Albert Einstein created the theory of relativity in 1879."
+ • Score: 3
+ • Explanation: While the response is relevant, the year provided is incorrect, reducing correctness.
+ Example 3
+ • Question: "When was the Great Wall of China built?"
+ • Context: "The Great Wall of China was constructed over centuries, with major sections built during the Ming Dynasty (1368-1644)."
+ • RAG Response: "The Great Wall of China was built in 1368 during the Ming Dynasty."
+ • Score: 4
+ • Explanation: The response is mostly accurate but oversimplifies the construction timeline.
+ Example 4
+ • Question: "What is the boiling point of water at sea level?"
+ • Context: "Water boils at 100°C (212°F) at standard atmospheric pressure."
+ • RAG Response: "Water boils at 80°C at sea level."
+ • Score: 1
+ • Explanation: The response is factually incorrect, deviating from the context.
+
+ Instructions Summary
+ • Judge relevance and correctness holistically to provide a single score.
+ • Ensure consistency by grounding your evaluations in the provided criteria.
+
+ Input Data:
+ • Question: {question}
+ • RAG’s Answer: {answer}
+ • Context: {context}
+ """
+
+
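The creativity, groundedness, coherence, and pointwise templates above share the same placeholders and differ only in their integer scale, so one loop could drive them all; a sketch under the same assumptions (hypothetical llm_complete helper, bare-JSON replies):

import json

FIVE_POINT_METRICS = {
    "creativity": creativity_metric,      # scored 1-5
    "groundedness": groundedness_metric,  # scored 1-5
    "coherence": coherence_metric,        # scored 1-5
    "pointwise": pointwise_metric,        # scored 0-5
}

def run_five_point_metrics(question: str, answer: str, context: str) -> dict:
    # Returns {metric_name: {"eval_score": ..., "explanation": ...}} for each template.
    results = {}
    for name, template in FIVE_POINT_METRICS.items():
        prompt = template.format(question=question, answer=answer, context=context)
        results[name] = json.loads(llm_complete(prompt))  # hypothetical LLM call
    return results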
+ pairwise_metric = """
+ Objective
+ You are an expert evaluator tasked with comparing responses from two models for a given query. Your goal is to determine which response is better based on predefined evaluation criteria such as relevance, correctness, coherence, and completeness. Provide a binary decision:
+ • 1: The first response is better.
+ • 0: The second response is better.
+
+ Evaluation Task
+ Given a question, retrieved context, and outputs from two models:
+ 1. Relevance: Does the response directly address the question?
+ 2. Correctness: Is the response factually accurate and aligned with the retrieved context?
+ 3. Coherence: Does the response flow logically, making it easy to understand?
+ 4. Completeness: Does the response sufficiently cover all aspects of the question without omitting key details or introducing unnecessary information?
+ Use these criteria to identify which response is superior.
+
+ Instructions
+ 1. Carefully review the question, retrieved context, and model responses.
+ 2. Compare the two responses based on the four evaluation criteria (relevance, correctness, coherence, and completeness).
+ 3. Choose the better response and provide a brief explanation for your decision.
+ 4. Avoid relying on personal knowledge or external information—evaluate solely based on the inputs provided.
+
+ Scoring Guidelines
+ • Assign a 1 if the first response is better.
+ • Assign a 0 if the second response is better.
+
+ Output Format
+ Provide a JSON object with your decision and reasoning:
+ {{
+ "better_response": <1 or 0>,
+ "explanation": "<short explanation>"
+ }}
+
+ Few-Shot Examples
+ Example 1
+ • Question: "What is the capital of Germany?"
+ • Retrieved Context: "Berlin is the capital and largest city of Germany."
+ • Response 1: "The capital of Germany is Berlin."
+ • Response 2: "The capital of Germany is Munich."
+ • Better Response: 1
+ • Explanation: Response 1 is factually accurate and aligns with the context, while Response 2 is incorrect.
+ Example 2
+ • Question: "Who developed the telephone?"
+ • Retrieved Context: "Alexander Graham Bell is credited with inventing the telephone in 1876."
+ • Response 1: "Alexander Graham Bell invented the telephone."
+ • Response 2: "The telephone was developed in 1876 by Alexander Graham Bell."
+ • Better Response: 0
+ • Explanation: While both responses are correct, Response 2 is more complete as it includes the year of invention.
+ Example 3
+ • Question: "What are the uses of renewable energy?"
+ • Retrieved Context: "Renewable energy sources like solar and wind power are used for electricity generation, heating, and reducing carbon emissions."
+ • Response 1: "Renewable energy is used for generating electricity and heating."
+ • Response 2: "Renewable energy reduces carbon emissions, generates electricity, and is used for heating."
+ • Better Response: 0
+ • Explanation: Response 2 is more complete and aligns better with the retrieved context.
+
+ Instructions Summary
+ • Compare the two responses based on relevance, correctness, coherence, and completeness.
+ • Select the better response and explain your choice concisely.
+
+ Input Data:
+ • Question: {question}
+ • Context: {context}
+ • Response 1: {answer_1}
+ • Response 2: {answer_2}
+ """
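The pairwise template differs from the others: it takes two candidate answers ({answer_1}, {answer_2}) and asks for a binary better_response flag instead of a score. A sketch under the same assumptions (hypothetical llm_complete helper):

import json

def compare_answers(question: str, context: str, answer_1: str, answer_2: str) -> dict:
    # Render the pairwise template with both candidates filled in.
    prompt = pairwise_metric.format(
        question=question, context=context, answer_1=answer_1, answer_2=answer_2
    )
    verdict = json.loads(llm_complete(prompt))  # hypothetical LLM call, not defined here
    # Expected shape per the prompt: {"better_response": 1 or 0, "explanation": "..."}
    return verdict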