diversity_metrics = """
Objective: You are an expert evaluator tasked with assessing the diversity of language and ideas in generated text. Your goal is to evaluate the RAG (Retrieval-Augmented Generation) response for lexical variety, structural diversity, and originality relative to the context provided. Assign a single evaluation score between 0.0 and 1.0 based on how linguistically and structurally diverse the response is.
Evaluation Task: Given a question, context, and the RAG response, evaluate the diversity of the response based on these criteria:
• Lexical Diversity: Evaluate the variety of vocabulary used in the response. Does it avoid excessive repetition and demonstrate a range of vocabulary?
• Structural Variety: Assess the diversity of sentence structures, including the use of different sentence types and lengths.
• Originality: Examine the originality of the response by comparing it with the retrieved context. Does the answer bring in unique phrasing, expressions, or perspectives beyond direct repetition of the context?
Score Range: Assign scores within the following range based on the diversity in the generated text:
• 0.0: No diversity — Highly repetitive, uses basic vocabulary and predictable structures.
• 0.3: Low diversity — Limited vocabulary and repetitive structures with minor attempts at originality.
• 0.5: Moderate diversity — Some varied vocabulary and sentence types; minimal repetition but still follows similar patterns.
• 0.7: Good diversity — Generally varied vocabulary and structures, with some unique phrasing or ideas introduced.
• 1.0: High diversity — Rich and varied vocabulary, diverse sentence structures, and highly original content beyond the retrieved context.
Special Instructions:
• Evaluate based solely on the provided context and avoid assumptions.
• Focus on language variety, not factual accuracy.
• Give partial credit if there’s some repetition but notable diversity elsewhere.
• Document reasons for score deductions to ensure consistency across evaluations.
Output Format: Return only a JSON object in the following structure:
{{
"eval_score": <score between 0.0 and 1.0>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Mona Lisa is one of Leonardo da Vinci's most famous paintings."
• Question: "Describe the Mona Lisa."
• RAG Response: "The Mona Lisa, painted by Leonardo da Vinci, is a masterpiece of Renaissance art. Known for its captivating smile and intricate background, it captures timeless beauty."
• Score: 0.8
• Explanation: The response introduces varied descriptors and rich vocabulary, with minimal repetition.
Example 2
• Context: "Albert Einstein was a physicist who developed the theory of relativity."
• Question: "Who was Albert Einstein?"
• RAG Response: "Albert Einstein was a physicist who developed the theory of relativity. He was a physicist."
• Score: 0.3
• Explanation: The response is repetitive, restating information with little variety in language or structure.
Example 3
• Context: "The Amazon rainforest is the world’s largest tropical rainforest, located in South America."
• Question: "Tell me about the Amazon rainforest."
• RAG Response: "The Amazon is the largest tropical rainforest in the world, located in South America. It covers millions of square miles and is known for its biodiversity and vast rivers."
• Score: 1.0
• Explanation: The response demonstrates high diversity, introducing new vocabulary and varied sentence structure while enriching the description.
Instructions Summary:
• Consider all criteria to determine a comprehensive score.
• Provide a single numeric score that reflects the overall diversity of the RAG response.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
creativity_metric = """
Objective:
You are an expert evaluator tasked with assessing the creativity of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how original, inventive, and engaging the response is, particularly in creative use cases. Assign a single score from 1 to 5, where a higher score indicates greater creativity.
Evaluation Task:
Given a question, context, and RAG response, assess the creativity of the response by considering the following criteria:
• Originality: Does the response introduce fresh ideas, unique perspectives, or an original approach to answering the question?
• Inventiveness: Does it use imaginative language, such as metaphors, vivid descriptions, or analogies?
• Engagement: Is the response engaging and surprising in a positive way, capturing attention and making the content memorable?
Score Range:
Assign a score from 1 to 5 based on the creativity in the generated text:
• 1: Not creative — The response is predictable, lacks novelty, and feels flat or formulaic.
• 2: Slightly creative — Some attempts at creativity are present, but the response remains mostly conventional.
• 3: Moderately creative — The response shows a fair degree of originality and includes some inventive elements.
• 4: Creative — The response demonstrates clear creativity with multiple inventive phrases or unique ideas.
• 5: Very creative — The response is highly original, inventive, and memorable, showcasing an excellent degree of creativity.
Special Instructions:
• Focus on creativity without assessing factual accuracy; this metric purely assesses inventiveness.
• Avoid penalizing the response for minor factual inaccuracies unless they disrupt the overall creativity.
• Provide specific reasons for score deductions to ensure consistent evaluations.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The sun sets every evening, painting the sky with shades of orange and pink."
• Question: "Describe a sunset."
• RAG Response: "As the golden sun dips below the horizon, the sky transforms into a fiery canvas, with streaks of amber melting into soft pinks and purples, as if nature were an artist with a brush of light."
• Score: 5
• Explanation: The response is highly creative, with vivid and original imagery that captures attention and adds emotional depth.
Example 2
• Context: "The moon orbits the Earth and is visible at night."
• Question: "Describe the moon."
• RAG Response: "The moon, Earth’s loyal companion, softly illuminates the night, a gentle eye watching over the sleeping world."
• Score: 4
• Explanation: The response is creative, using metaphorical language and a unique perspective, though it could have incorporated more inventive details.
Example 3
• Context: "The Eiffel Tower is a famous landmark in Paris, France."
• Question: "What is the Eiffel Tower?"
• RAG Response: "The Eiffel Tower, a tall structure in Paris, is a famous tourist spot."
• Score: 1
• Explanation: The response is straightforward and lacks creativity, merely repeating known facts without adding original or engaging elements.
Example 4
• Context: "Butterflies have colorful wings and are common in gardens."
• Question: "Describe a butterfly."
• RAG Response: "A butterfly flutters gracefully, its wings like delicate stained glass, alive with colors that dance in the sunlight."
• Score: 5
• Explanation: The response uses poetic language and metaphor, showcasing high creativity and making the image come alive.
Instructions Summary:
• Consider all criteria holistically to determine a comprehensive score.
• Provide a single numeric score reflecting the overall creativity of the RAG response.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
groundedness_metric = """
Objective:
You are an expert evaluator tasked with assessing the groundedness of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate whether the response is fully supported by the information provided in the retrieved context, without adding any unsupported information. Assign a score from 1 to 5, where a higher score indicates a higher degree of groundedness.
Evaluation Task:
Given a question, context, and RAG response, assess the groundedness of the response based on the following criteria:
• Contextual Alignment: How accurately does the response align with the facts in the context?
• Absence of Fabrication: Does the response avoid introducing details not present in the context?
• Faithfulness: Does the response accurately reflect the information without reinterpretation or distortion?
Score Range:
Assign a score from 1 to 5 based on how well the response aligns with the context:
• 1: Not grounded — The response includes multiple unsupported or fabricated details and significantly deviates from the context.
• 2: Slightly grounded — Contains some alignment with the context but includes substantial unsupported or speculative content.
• 3: Moderately grounded — The response mostly aligns but includes minor ungrounded elements.
• 4: Mostly grounded — The response aligns well with the context, with only minor and non-critical unsupported details.
• 5: Very grounded — The response is entirely supported by the context with no ungrounded information.
Special Instructions:
• Focus strictly on the alignment with the context; do not consider creativity or stylistic aspects.
• Ignore minor stylistic choices if they do not affect factual grounding.
• Provide specific reasons for score deductions to ensure consistency.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Eiffel Tower is located in Paris, France, and was completed in 1889."
• Question: "Where is the Eiffel Tower located?"
• RAG Response: "The Eiffel Tower is located in Paris and was built in 1889."
• Score: 5
• Explanation: The response is fully grounded in the context, accurately reflecting both location and completion date.
Example 2
• Context: "The Great Wall of China was constructed over several dynasties, most notably during the Ming Dynasty (1368-1644)."
• Question: "When was the Great Wall of China built?"
• RAG Response: "The Great Wall was built in 1368."
• Score: 3
• Explanation: The response is partially grounded: 1368 falls within the Ming Dynasty period given in the context, but presenting it as the single year of construction misrepresents the multi-dynasty timeline.
Example 3
• Context: "Albert Einstein developed the theory of relativity in the early 20th century."
• Question: "Who developed the theory of relativity?"
• RAG Response: "Albert Einstein developed the theory of relativity."
• Score: 5
• Explanation: The response is fully grounded, providing accurate information directly supported by the context.
Example 4
• Context: "The capital of Italy is Rome, known for its ancient history and architecture."
• Question: "What is the capital of Italy?"
• RAG Response: "The capital of Italy is Rome, which was founded by Romulus in 753 BC."
• Score: 2
• Explanation: The response introduces unsupported historical information (founding by Romulus), which is not present in the context.
Instructions Summary:
• Ensure that all information in the response is supported by the provided context.
• Assign a single numeric score reflecting the overall groundedness of the RAG answer.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
coherence_metric = """
Objective:
You are an expert evaluator tasked with assessing the coherence of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how naturally and logically the response flows, considering clarity, sentence structure, and the logical integration of retrieved information. Assign a score from 1 to 5, where a higher score indicates a higher degree of coherence.
Evaluation Task:
Given a question, context, and RAG response, evaluate the coherence of the response based on the following criteria:
• Logical Flow of Ideas: Does the response present information in a logical sequence with clear transitions between ideas?
• Sentence Structure: Are sentences grammatically correct and appropriately structured to convey ideas clearly?
• Relevance and Integration of Context: Is the retrieved information presented naturally within the response, aligning with the query and context?
Score Range:
Assign a score from 1 to 5 based on the coherence of the response:
• 1: Not coherent — The response lacks logical flow, contains abrupt transitions, or is difficult to understand.
• 2: Slightly coherent — The response has some order but includes gaps in logic or awkward phrasing.
• 3: Moderately coherent — The response mostly flows well, though minor issues may disrupt clarity or logic.
• 4: Mostly coherent — The response is well-organized and mostly smooth, with only slight awkwardness or minor inconsistencies.
• 5: Very coherent — The response is seamless, logically ordered, and presents information with natural integration and readability.
Special Instructions:
• Focus on the clarity, flow, and logical consistency of the response, rather than factual correctness or creativity.
• Note any issues in sentence structure or awkward transitions when deducting points.
• Ensure consistency in scoring by documenting specific reasons for score deductions.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Eiffel Tower, an iconic symbol of France, was completed in 1889 and is located in Paris."
• Question: "When was the Eiffel Tower built?"
• RAG Response: "The Eiffel Tower, located in Paris, was completed in 1889. It’s one of the most recognized monuments in France."
• Score: 5
• Explanation: The response flows logically and smoothly incorporates additional relevant information about the Eiffel Tower.
Example 2
• Context: "The Great Wall of China was built over many centuries, with the most well-known sections constructed during the Ming Dynasty (1368-1644)."
• Question: "When was the Great Wall of China built?"
• RAG Response: "The Great Wall of China was constructed during the Ming Dynasty, lasting from 1368-1644."
• Score: 4
• Explanation: The response is mostly coherent, though the trailing phrase 'lasting from 1368-1644' attaches loosely and slightly disrupts the otherwise smooth flow.
Example 3
• Context: "The capital of Japan is Tokyo, known for its vibrant culture, advanced technology, and rich history."
• Question: "What is the capital of Japan?"
• RAG Response: "Tokyo is the capital of Japan. It has advanced technology, and culture, it is rich in."
• Score: 2
• Explanation: The response contains awkward phrasing that disrupts the flow and clarity, making it difficult to understand fully.
Example 4
• Context: "Albert Einstein, a physicist, developed the theory of relativity, fundamentally changing modern physics."
• Question: "What did Albert Einstein develop?"
• RAG Response: "Albert Einstein, modern physics, theory of relativity changed physics fundamentally."
• Score: 1
• Explanation: The response lacks coherence, with no logical flow or clear sentence structure.
Instructions Summary:
• Evaluate the logical flow, structure, and integration of the response.
• Assign a single numeric score that reflects the overall coherence of the RAG answer.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
pointwise_metric = """
Objective
You are an expert evaluator tasked with assessing the quality of a response generated by a Retrieval-Augmented Generation (RAG) system. Your goal is to assign a pointwise score based on two core criteria: Relevance and Correctness, considering how well the generated response aligns with the query and context provided. Use a scale from 0 to 5 where higher scores indicate superior performance.
Evaluation Task
Given a question, the retrieved context, and the RAG-generated response:
1. Relevance: Judge how well the response addresses the question, considering completeness, topicality, and adherence to the retrieved context.
2. Correctness: Evaluate the factual accuracy of the response relative to the provided context. Penalize fabricated or unsupported information.
Consider the following while scoring:
• Relevance:
o Does the response directly answer the question?
o Is the response complete, without unnecessary repetition?
o Is the response consistent with the retrieved context?
• Correctness:
o Does the response align factually with the context?
o Are all details (names, dates, events) accurate and supported?
o Does the response avoid hallucinations or unsupported claims?
Scoring Guidelines
Assign a single score between 0 and 5 that reflects both relevance and correctness:
• 0: Completely irrelevant or factually incorrect.
• 1: Poor quality; partially addresses the question with significant inaccuracies.
• 2: Subpar; attempts to answer but contains multiple issues with relevance or correctness.
• 3: Average; partially relevant and correct but incomplete or slightly inconsistent.
• 4: Good; relevant, mostly correct, and well-aligned with the context.
• 5: Excellent; perfectly relevant, factually accurate, and fully supported by the context.
Special Instructions
• Evaluate only based on the provided question, context, and response. Avoid relying on general knowledge or external information.
• Provide a brief explanation for your score, highlighting specific strengths and weaknesses.
• Be consistent across evaluations by adhering to the scoring criteria.
Output Format
Return only a JSON object in the following structure:
{{
"eval_score": <score between 0 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples
Example 1
• Question: "What is the capital of France?"
• Context: "Paris is the capital of France and a major European city."
• RAG Response: "Paris is the capital of France."
• Score: 5
• Explanation: The response is perfectly relevant, concise, and factually accurate.
Example 2
• Question: "Who developed the theory of relativity?"
• Context: "Albert Einstein developed the theory of relativity in the early 20th century."
• RAG Response: "Albert Einstein created the theory of relativity in 1879."
• Score: 3
• Explanation: While the response is relevant, the year provided is incorrect, reducing correctness.
Example 3
• Question: "When was the Great Wall of China built?"
• Context: "The Great Wall of China was constructed over centuries, with major sections built during the Ming Dynasty (1368-1644)."
• RAG Response: "The Great Wall of China was built in 1368 during the Ming Dynasty."
• Score: 4
• Explanation: The response is mostly accurate but oversimplifies the construction timeline.
Example 4
• Question: "What is the boiling point of water at sea level?"
• Context: "Water boils at 100°C (212°F) at standard atmospheric pressure."
• RAG Response: "Water boils at 80°C at sea level."
• Score: 1
• Explanation: The response is factually incorrect, deviating from the context.
Instructions Summary
• Judge relevance and correctness holistically to provide a single score.
• Ensure consistency by grounding your evaluations in the provided criteria.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
pairwise_metric = """
Objective
You are an expert evaluator tasked with comparing responses from two models for a given query. Your goal is to determine which response is better based on predefined evaluation criteria such as relevance, correctness, coherence, and completeness. Provide a binary decision:
• 1: The first response is better.
• 0: The second response is better.
Evaluation Task
Given a question, retrieved context, and outputs from two models:
1. Relevance: Does the response directly address the question?
2. Correctness: Is the response factually accurate and aligned with the retrieved context?
3. Coherence: Does the response flow logically, making it easy to understand?
4. Completeness: Does the response sufficiently cover all aspects of the question without omitting key details or introducing unnecessary information?
Use these criteria to identify which response is superior.
Instructions
1. Carefully review the question, retrieved context, and model responses.
2. Compare the two responses based on the four evaluation criteria (relevance, correctness, coherence, and completeness).
3. Choose the better response and provide a brief explanation for your decision.
4. Avoid relying on personal knowledge or external information—evaluate solely based on the inputs provided.
Scoring Guidelines
• Assign a 1 if the first response is better.
• Assign a 0 if the second response is better.
Output Format
Provide a JSON object with your decision and reasoning:
{{
"better_response": <1 or 0>,
"explanation": "<short explanation>"
}}
Few-Shot Examples
Example 1
• Question: "What is the capital of Germany?"
• Retrieved Context: "Berlin is the capital and largest city of Germany."
• Response 1: "The capital of Germany is Berlin."
• Response 2: "The capital of Germany is Munich."
• Better Response: 1
• Explanation: Response 1 is factually accurate and aligns with the context, while Response 2 is incorrect.
Example 2
• Question: "Who developed the telephone?"
• Retrieved Context: "Alexander Graham Bell is credited with inventing the telephone in 1876."
• Response 1: "Alexander Graham Bell invented the telephone."
• Response 2: "The telephone was developed in 1876 by Alexander Graham Bell."
• Better Response: 0 (Response 2)
• Explanation: While both responses are correct, Response 2 is more complete as it includes the year of invention.
Example 3
• Question: "What are the uses of renewable energy?"
• Retrieved Context: "Renewable energy sources like solar and wind power are used for electricity generation, heating, and reducing carbon emissions."
• Response 1: "Renewable energy is used for generating electricity and heating."
• Response 2: "Renewable energy reduces carbon emissions, generates electricity, and is used for heating."
• Better Response: 0 (Response 2)
• Explanation: Response 2 is more complete and aligns better with the retrieved context.
Instructions Summary
• Compare the two responses based on relevance, correctness, coherence, and completeness.
• Select the better response and explain your choice concisely.
Input Data:
• Question: {question}
• Context: {context}
• Response 1: {answer_1}
• Response 2: {answer_2}
"""