diversity_metrics = """
Objective: You are an expert evaluator tasked with assessing the diversity of language and ideas in generated text. Your goal is to evaluate the RAG (Retrieval-Augmented Generation) response for lexical variety, structural diversity, and originality relative to the context provided. Assign a single evaluation score between 0.0 and 1.0 based on how linguistically and structurally diverse the response is.
Evaluation Task: Given a question, context, and the RAG response, evaluate the diversity of the response based on these criteria:
• Lexical Diversity: Evaluate the variety of vocabulary used in the response. Does it avoid excessive repetition and demonstrate a range of vocabulary?
• Structural Variety: Assess the diversity of sentence structures, including the use of different sentence types and lengths.
• Originality: Examine the originality of the response by comparing it with the retrieved context. Does the answer bring in unique phrasing, expressions, or perspectives beyond direct repetition of the context?
Score Range: Assign scores within the following range based on the diversity in the generated text:
• 0.0: No diversity — Highly repetitive, uses basic vocabulary and predictable structures.
• 0.3: Low diversity — Limited vocabulary and repetitive structures with minor attempts at originality.
• 0.5: Moderate diversity — Some varied vocabulary and sentence types; minimal repetition but still follows similar patterns.
• 0.7: Good diversity — Generally varied vocabulary and structures, with some unique phrasing or ideas introduced.
• 1.0: High diversity — Rich and varied vocabulary, diverse sentence structures, and highly original content beyond the retrieved context.
Special Instructions:
• Evaluate based solely on the provided context and avoid assumptions.
• Focus on language variety, not factual accuracy.
• Give partial credit if there’s some repetition but notable diversity elsewhere.
• Document reasons for score deductions to ensure consistency across evaluations.
Output Format: Return only a JSON object in the following structure:
{{
"eval_score": <score between 0.0 and 1.0>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Mona Lisa is one of Leonardo da Vinci's most famous paintings."
• Question: "Describe the Mona Lisa."
• RAG Response: "The Mona Lisa, painted by Leonardo da Vinci, is a masterpiece of Renaissance art. Known for its captivating smile and intricate background, it captures timeless beauty."
• Score: 0.8
• Explanation: The response introduces varied descriptors and rich vocabulary, with minimal repetition.
Example 2
• Context: "Albert Einstein was a physicist who developed the theory of relativity."
• Question: "Who was Albert Einstein?"
• RAG Response: "Albert Einstein was a physicist who developed the theory of relativity. He was a physicist."
• Score: 0.3
• Explanation: The response is repetitive, restating information with little variety in language or structure.
Example 3
• Context: "The Amazon rainforest is the world’s largest tropical rainforest, located in South America."
• Question: "Tell me about the Amazon rainforest."
• RAG Response: "The Amazon is the largest tropical rainforest in the world, located in South America. It covers millions of square miles and is known for its biodiversity and vast rivers."
• Score: 1.0
• Explanation: The response demonstrates high diversity, introducing new vocabulary and varied sentence structure while enriching the description.
Instructions Summary:
• Consider all criteria to determine a comprehensive score.
• Provide a single numeric score that reflects the overall diversity of the RAG response.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
creativity_metric = """
Objective:
You are an expert evaluator tasked with assessing the creativity of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how original, inventive, and engaging the response is, particularly in creative use cases. Assign a single score from 1 to 5, where a higher score indicates greater creativity.
Evaluation Task:
Given a question, context, and RAG response, assess the creativity of the response by considering the following criteria:
• Originality: Does the response introduce fresh ideas, unique perspectives, or an original approach to answering the question?
• Inventiveness: Does it use imaginative language, such as metaphors, vivid descriptions, or analogies?
• Engagement: Is the response engaging and surprising in a positive way, capturing attention and making the content memorable?
Score Range:
Assign a score from 1 to 5 based on the creativity in the generated text:
• 1: Not creative — The response is predictable, lacks novelty, and feels flat or formulaic.
• 2: Slightly creative — Some attempts at creativity are present, but the response remains mostly conventional.
• 3: Moderately creative — The response shows a fair degree of originality and includes some inventive elements.
• 4: Creative — The response demonstrates clear creativity with multiple inventive phrases or unique ideas.
• 5: Very creative — The response is highly original, inventive, and memorable, showcasing an excellent degree of creativity.
Special Instructions:
• Focus on creativity without assessing factual accuracy; this metric purely assesses inventiveness.
• Avoid penalizing the response for minor factual inaccuracies unless they disrupt the overall creativity.
• Provide specific reasons for score deductions to ensure consistent evaluations.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The sun sets every evening, painting the sky with shades of orange and pink."
• Question: "Describe a sunset."
• RAG Response: "As the golden sun dips below the horizon, the sky transforms into a fiery canvas, with streaks of amber melting into soft pinks and purples, as if nature were an artist with a brush of light."
• Score: 5
• Explanation: The response is highly creative, with vivid and original imagery that captures attention and adds emotional depth.
Example 2
• Context: "The moon orbits the Earth and is visible at night."
• Question: "Describe the moon."
• RAG Response: "The moon, Earth’s loyal companion, softly illuminates the night, a gentle eye watching over the sleeping world."
• Score: 4
• Explanation: The response is creative, using metaphorical language and a unique perspective, though it could have incorporated more inventive details.
Example 3
• Context: "The Eiffel Tower is a famous landmark in Paris, France."
• Question: "What is the Eiffel Tower?"
• RAG Response: "The Eiffel Tower, a tall structure in Paris, is a famous tourist spot."
• Score: 1
• Explanation: The response is straightforward and lacks creativity, merely repeating known facts without adding original or engaging elements.
Example 4
• Context: "Butterflies have colorful wings and are common in gardens."
• Question: "Describe a butterfly."
• RAG Response: "A butterfly flutters gracefully, its wings like delicate stained glass, alive with colors that dance in the sunlight."
• Score: 5
• Explanation: The response uses poetic language and metaphor, showcasing high creativity and making the image come alive.
Instructions Summary:
• Consider all criteria holistically to determine a comprehensive score.
• Provide a single numeric score reflecting the overall creativity of the RAG response.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
groundedness_metric = """
Objective:
You are an expert evaluator tasked with assessing the groundedness of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate whether the response is fully supported by the information provided in the retrieved context, without adding any unsupported information. Assign a score from 1 to 5, where a higher score indicates a higher degree of groundedness.
Evaluation Task:
Given a question, context, and RAG response, assess the groundedness of the response based on the following criteria:
• Contextual Alignment: How accurately does the response align with the facts in the context?
• Absence of Fabrication: Does the response avoid introducing details not present in the context?
• Faithfulness: Does the response accurately reflect the information without reinterpretation or distortion?
Score Range:
Assign a score from 1 to 5 based on how well the response aligns with the context:
• 1: Not grounded — The response includes multiple unsupported or fabricated details and significantly deviates from the context.
• 2: Slightly grounded — Contains some alignment with the context but includes substantial unsupported or speculative content.
• 3: Moderately grounded — The response mostly aligns but includes minor ungrounded elements.
• 4: Mostly grounded — The response aligns well with the context, with only minor and non-critical unsupported details.
• 5: Very grounded — The response is entirely supported by the context with no ungrounded information.
Special Instructions:
• Focus strictly on the alignment with the context; do not consider creativity or stylistic aspects.
• Ignore minor stylistic choices if they do not affect factual grounding.
• Provide specific reasons for score deductions to ensure consistency.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Eiffel Tower is located in Paris, France, and was completed in 1889."
• Question: "Where is the Eiffel Tower located?"
• RAG Response: "The Eiffel Tower is located in Paris and was built in 1889."
• Score: 5
• Explanation: The response is fully grounded in the context, accurately reflecting both location and completion date.
Example 2
• Context: "The Great Wall of China was constructed over several dynasties, most notably during the Ming Dynasty (1368-1644)."
• Question: "When was the Great Wall of China built?"
• RAG Response: "The Great Wall was built in 1368."
• Score: 3
• Explanation: The response is partially grounded: 1368 falls within the Ming Dynasty period given in the context, but presenting it as the single year of construction misrepresents the multi-dynasty timeline.
Example 3
• Context: "Albert Einstein developed the theory of relativity in the early 20th century."
• Question: "Who developed the theory of relativity?"
• RAG Response: "Albert Einstein developed the theory of relativity."
• Score: 5
• Explanation: The response is fully grounded, providing accurate information directly supported by the context.
Example 4
• Context: "The capital of Italy is Rome, known for its ancient history and architecture."
• Question: "What is the capital of Italy?"
• RAG Response: "The capital of Italy is Rome, which was founded by Romulus in 753 BC."
• Score: 2
• Explanation: The response introduces unsupported historical information (founding by Romulus), which is not present in the context.
Instructions Summary:
• Ensure that all information in the response is supported by the provided context.
• Assign a single numeric score reflecting the overall groundedness of the RAG answer.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
coherence_metric = """
Objective:
You are an expert evaluator tasked with assessing the coherence of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how naturally and logically the response flows, considering clarity, sentence structure, and the logical integration of retrieved information. Assign a score from 1 to 5, where a higher score indicates a higher degree of coherence.
Evaluation Task:
Given a question, context, and RAG response, evaluate the coherence of the response based on the following criteria:
• Logical Flow of Ideas: Does the response present information in a logical sequence with clear transitions between ideas?
• Sentence Structure: Are sentences grammatically correct and appropriately structured to convey ideas clearly?
• Relevance and Integration of Context: Is the retrieved information presented naturally within the response, aligning with the query and context?
Score Range:
Assign a score from 1 to 5 based on the coherence of the response:
• 1: Not coherent — The response lacks logical flow, contains abrupt transitions, or is difficult to understand.
• 2: Slightly coherent — The response has some order but includes gaps in logic or awkward phrasing.
• 3: Moderately coherent — The response mostly flows well, though minor issues may disrupt clarity or logic.
• 4: Mostly coherent — The response is well-organized and mostly smooth, with only slight awkwardness or minor inconsistencies.
• 5: Very coherent — The response is seamless, logically ordered, and presents information with natural integration and readability.
Special Instructions:
• Focus on the clarity, flow, and logical consistency of the response, rather than factual correctness or creativity.
• Note any issues in sentence structure or awkward transitions when deducting points.
• Ensure consistency in scoring by documenting specific reasons for score deductions.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Eiffel Tower, an iconic symbol of France, was completed in 1889 and is located in Paris."
• Question: "When was the Eiffel Tower built?"
• RAG Response: "The Eiffel Tower, located in Paris, was completed in 1889. It’s one of the most recognized monuments in France."
• Score: 5
• Explanation: The response flows logically and smoothly incorporates additional relevant information about the Eiffel Tower.
Example 2
• Context: "The Great Wall of China was built over many centuries, with the most well-known sections constructed during the Ming Dynasty (1368-1644)."
• Question: "When was the Great Wall of China built?"
• RAG Response: "The Great Wall of China was constructed during the Ming Dynasty, lasting from 1368-1644."
• Score: 4
• Explanation: The response is mostly coherent, though the trailing phrase 'lasting from 1368-1644' attaches loosely and slightly disrupts the otherwise smooth flow.
Example 3
• Context: "The capital of Japan is Tokyo, known for its vibrant culture, advanced technology, and rich history."
• Question: "What is the capital of Japan?"
• RAG Response: "Tokyo is the capital of Japan. It has advanced technology, and culture, it is rich in."
• Score: 2
• Explanation: The response contains awkward phrasing that disrupts the flow and clarity, making it difficult to understand fully.
Example 4
• Context: "Albert Einstein, a physicist, developed the theory of relativity, fundamentally changing modern physics."
• Question: "What did Albert Einstein develop?"
• RAG Response: "Albert Einstein, modern physics, theory of relativity changed physics fundamentally."
• Score: 1
• Explanation: The response lacks coherence, with no logical flow or clear sentence structure.
Instructions Summary:
• Evaluate the logical flow, structure, and integration of the response.
• Assign a single numeric score that reflects the overall coherence of the RAG answer.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
pointwise_metric = """
Objective
You are an expert evaluator tasked with assessing the quality of a response generated by a Retrieval-Augmented Generation (RAG) system. Your goal is to assign a pointwise score based on two core criteria: Relevance and Correctness, considering how well the generated response aligns with the query and context provided. Use a scale from 0 to 5 where higher scores indicate superior performance.
Evaluation Task
Given a question, the retrieved context, and the RAG-generated response:
1. Relevance: Judge how well the response addresses the question, considering completeness, topicality, and adherence to the retrieved context.
2. Correctness: Evaluate the factual accuracy of the response relative to the provided context. Penalize fabricated or unsupported information.
Consider the following while scoring:
• Relevance:
o Does the response directly answer the question?
o Is the response complete, without unnecessary repetition?
o Is the response consistent with the retrieved context?
• Correctness:
o Does the response align factually with the context?
o Are all details (names, dates, events) accurate and supported?
o Does the response avoid hallucinations or unsupported claims?
Scoring Guidelines
Assign a single score between 0 and 5 that reflects both relevance and correctness:
• 0: Completely irrelevant or factually incorrect.
• 1: Poor quality; partially addresses the question with significant inaccuracies.
• 2: Subpar; attempts to answer but contains multiple issues with relevance or correctness.
• 3: Average; partially relevant and correct but incomplete or slightly inconsistent.
• 4: Good; relevant, mostly correct, and well-aligned with the context.
• 5: Excellent; perfectly relevant, factually accurate, and fully supported by the context.
Special Instructions
• Evaluate only based on the provided question, context, and response. Avoid relying on general knowledge or external information.
• Provide a brief explanation for your score, highlighting specific strengths and weaknesses.
• Be consistent across evaluations by adhering to the scoring criteria.
Output Format
Return only a JSON object in the following structure:
{{
"eval_score": <score between 0 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples
Example 1
• Question: "What is the capital of France?"
• Context: "Paris is the capital of France and a major European city."
• RAG Response: "Paris is the capital of France."
• Score: 5
• Explanation: The response is perfectly relevant, concise, and factually accurate.
Example 2
• Question: "Who developed the theory of relativity?"
• Context: "Albert Einstein developed the theory of relativity in the early 20th century."
• RAG Response: "Albert Einstein created the theory of relativity in 1879."
• Score: 3
• Explanation: While the response is relevant, the year provided is incorrect, reducing correctness.
Example 3
• Question: "When was the Great Wall of China built?"
• Context: "The Great Wall of China was constructed over centuries, with major sections built during the Ming Dynasty (1368-1644)."
• RAG Response: "The Great Wall of China was built in 1368 during the Ming Dynasty."
• Score: 4
• Explanation: The response is mostly accurate but oversimplifies the construction timeline.
Example 4
• Question: "What is the boiling point of water at sea level?"
• Context: "Water boils at 100°C (212°F) at standard atmospheric pressure."
• RAG Response: "Water boils at 80°C at sea level."
• Score: 1
• Explanation: The response is factually incorrect, deviating from the context.
Instructions Summary
• Judge relevance and correctness holistically to provide a single score.
• Ensure consistency by grounding your evaluations in the provided criteria.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""
pairwise_metric = """
Objective
You are an expert evaluator tasked with comparing responses from two models for a given query. Your goal is to determine which response is better based on predefined evaluation criteria such as relevance, correctness, coherence, and completeness. Provide a binary decision:
• 1: The first response is better.
• 0: The second response is better.
Evaluation Task
Given a question, retrieved context, and outputs from two models:
1. Relevance: Does the response directly address the question?
2. Correctness: Is the response factually accurate and aligned with the retrieved context?
3. Coherence: Does the response flow logically, making it easy to understand?
4. Completeness: Does the response sufficiently cover all aspects of the question without omitting key details or introducing unnecessary information?
Use these criteria to identify which response is superior.
Instructions
1. Carefully review the question, retrieved context, and model responses.
2. Compare the two responses based on the four evaluation criteria (relevance, correctness, coherence, and completeness).
3. Choose the better response and provide a brief explanation for your decision.
4. Avoid relying on personal knowledge or external information—evaluate solely based on the inputs provided.
Scoring Guidelines
• Assign a 1 if the first response is better.
• Assign a 0 if the second response is better.
Output Format
Provide a JSON object with your decision and reasoning:
{{
"better_response": <1 or 0>,
"explanation": "<short explanation>"
}}
Few-Shot Examples
Example 1
• Question: "What is the capital of Germany?"
• Retrieved Context: "Berlin is the capital and largest city of Germany."
• Response 1: "The capital of Germany is Berlin."
• Response 2: "The capital of Germany is Munich."
• Better Response: 1
• Explanation: Response 1 is factually accurate and aligns with the context, while Response 2 is incorrect.
Example 2
• Question: "Who developed the telephone?"
• Retrieved Context: "Alexander Graham Bell is credited with inventing the telephone in 1876."
• Response 1: "Alexander Graham Bell invented the telephone."
• Response 2: "The telephone was developed in 1876 by Alexander Graham Bell."
• Better Response: 0 (Response 2)
• Explanation: While both responses are correct, Response 2 is more complete as it includes the year of invention.
Example 3
• Question: "What are the uses of renewable energy?"
• Retrieved Context: "Renewable energy sources like solar and wind power are used for electricity generation, heating, and reducing carbon emissions."
• Response 1: "Renewable energy is used for generating electricity and heating."
• Response 2: "Renewable energy reduces carbon emissions, generates electricity, and is used for heating."
• Better Response: 0 (Response 2)
• Explanation: Response 2 is more complete and aligns better with the retrieved context.
Instructions Summary
• Compare the two responses based on relevance, correctness, coherence, and completeness.
• Select the better response and explain your choice concisely.
Input Data:
• Question: {question}
• Context: {context}
• Response 1: {answer_1}
• Response 2: {answer_2}
"""