Create prompts_v1.py
prompts_v1.py  ADDED  +371 -0
@@ -0,0 +1,371 @@
diversity_metrics = """
Objective: You are an expert evaluator tasked with assessing the diversity of language and ideas in generated text. Your goal is to evaluate the RAG (Retrieval-Augmented Generation) response for lexical variety, structural diversity, and originality relative to the context provided. Assign a single evaluation score between 0.0 and 1.0 based on how creatively diverse the response is.
Evaluation Task: Given a question, context, and the RAG response, evaluate the diversity of the response based on these criteria:
• Lexical Diversity: Evaluate the variety of vocabulary used in the response. Does it avoid excessive repetition and demonstrate a range of vocabulary?
• Structural Variety: Assess the diversity of sentence structures, including the use of different sentence types and lengths.
• Originality: Examine the originality of the response by comparing it with the retrieved context. Does the answer bring in unique phrasing, expressions, or perspectives beyond direct repetition of the context?
Score Range: Assign scores within the following range based on the diversity of the generated text:
• 0.0: No diversity — Highly repetitive, uses basic vocabulary and predictable structures.
• 0.3: Low diversity — Limited vocabulary and repetitive structures with minor attempts at originality.
• 0.5: Moderate diversity — Some varied vocabulary and sentence types; minimal repetition but still follows similar patterns.
• 0.7: Good diversity — Generally varied vocabulary and structures, with some unique phrasing or ideas introduced.
• 1.0: High diversity — Rich and varied vocabulary, diverse sentence structures, and highly original content beyond the retrieved context.
Special Instructions:
• Evaluate based solely on the provided context and avoid assumptions.
• Focus on language variety, not factual accuracy.
• Give partial credit if there is some repetition but notable diversity elsewhere.
• Document reasons for score deductions to ensure consistency across evaluations.
Output Format: Return only a JSON object in the following structure:
{{
"eval_score": <score between 0.0 and 1.0>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Mona Lisa is one of Leonardo da Vinci's most famous paintings."
• Question: "Describe the Mona Lisa."
• RAG Response: "The Mona Lisa, painted by Leonardo da Vinci, is a masterpiece of Renaissance art. Known for its captivating smile and intricate background, it captures timeless beauty."
• Score: 0.8
• Explanation: The response introduces varied descriptors and rich vocabulary, with minimal repetition.
Example 2
• Context: "Albert Einstein was a physicist who developed the theory of relativity."
• Question: "Who was Albert Einstein?"
• RAG Response: "Albert Einstein was a physicist who developed the theory of relativity. He was a physicist."
• Score: 0.3
• Explanation: The response is repetitive, restating information with little variety in language or structure.
Example 3
• Context: "The Amazon rainforest is the world’s largest tropical rainforest, located in South America."
• Question: "Tell me about the Amazon rainforest."
• RAG Response: "The Amazon is the largest tropical rainforest in the world, located in South America. It covers millions of square miles and is known for its biodiversity and vast rivers."
• Score: 1.0
• Explanation: The response demonstrates high diversity, introducing new vocabulary and varied sentence structure while enriching the description.
Instructions Summary:
• Consider all criteria to determine a comprehensive score.
• Provide a single numeric score that reflects the overall diversity of the RAG response.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

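# --- Usage sketch (illustrative; not part of the original file) --------------
# A minimal example of how a template such as diversity_metrics is presumably
# consumed: the doubled braces {{ }} survive str.format() as literal JSON
# braces, so only {question}, {answer}, and {context} are substituted, and the
# judge model's reply is expected to be the small JSON object requested above.
# The helper names and the json parsing step are assumptions for illustration;
# the actual LLM call is out of scope here.
import json

def render_diversity_prompt(question: str, answer: str, context: str) -> str:
    # Substitute the three shared placeholders; everything else is sent verbatim.
    return diversity_metrics.format(question=question, answer=answer, context=context)

def parse_eval_reply(reply: str) -> dict:
    # Expected shape: {"eval_score": <number>, "explanation": "<short explanation>"}
    return json.loads(reply)
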
creativity_metric = """
Objective:
You are an expert evaluator tasked with assessing the creativity of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how original, inventive, and engaging the response is, particularly in creative use cases. Assign a single score from 1 to 5, where a higher score indicates greater creativity.
Evaluation Task:
Given a question, context, and RAG response, assess the creativity of the response by considering the following criteria:
• Originality: Does the response introduce fresh ideas, unique perspectives, or an original approach to answering the question?
• Inventiveness: Does it use imaginative language, such as metaphors, vivid descriptions, or analogies?
• Engagement: Is the response engaging and surprising in a positive way, capturing attention and making the content memorable?
Score Range:
Assign a score from 1 to 5 based on the creativity of the generated text:
• 1: Not creative — The response is predictable, lacks novelty, and feels flat or formulaic.
• 2: Slightly creative — Some attempts at creativity are present, but the response remains mostly conventional.
• 3: Moderately creative — The response shows a fair degree of originality and includes some inventive elements.
• 4: Creative — The response demonstrates clear creativity with multiple inventive phrases or unique ideas.
• 5: Very creative — The response is highly original, inventive, and memorable, showcasing an excellent degree of creativity.
Special Instructions:
• Focus on creativity without assessing factual accuracy; this metric purely assesses inventiveness.
• Avoid penalizing the response for minor factual inaccuracies unless they disrupt the overall creativity.
• Provide specific reasons for score deductions to ensure consistent evaluations.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The sun sets every evening, painting the sky with shades of orange and pink."
• Question: "Describe a sunset."
• RAG Response: "As the golden sun dips below the horizon, the sky transforms into a fiery canvas, with streaks of amber melting into soft pinks and purples, as if nature were an artist with a brush of light."
• Score: 5
• Explanation: The response is highly creative, with vivid and original imagery that captures attention and adds emotional depth.
Example 2
• Context: "The moon orbits the Earth and is visible at night."
• Question: "Describe the moon."
• RAG Response: "The moon, Earth’s loyal companion, softly illuminates the night, a gentle eye watching over the sleeping world."
• Score: 4
• Explanation: The response is creative, using metaphorical language and a unique perspective, though it could have incorporated more inventive details.
Example 3
• Context: "The Eiffel Tower is a famous landmark in Paris, France."
• Question: "What is the Eiffel Tower?"
• RAG Response: "The Eiffel Tower, a tall structure in Paris, is a famous tourist spot."
• Score: 1
• Explanation: The response is straightforward and lacks creativity, merely repeating known facts without adding original or engaging elements.
Example 4
• Context: "Butterflies have colorful wings and are common in gardens."
• Question: "Describe a butterfly."
• RAG Response: "A butterfly flutters gracefully, its wings like delicate stained glass, alive with colors that dance in the sunlight."
• Score: 5
• Explanation: The response uses poetic language and metaphor, showcasing high creativity and making the image come alive.
Instructions Summary:
• Consider all criteria holistically to determine a comprehensive score.
• Provide a single numeric score reflecting the overall creativity of the RAG response.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

groundedness_metric = """
Objective:
You are an expert evaluator tasked with assessing the groundedness of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate whether the response is fully supported by the information provided in the retrieved context, without adding any unsupported information. Assign a score from 1 to 5, where a higher score indicates a higher degree of groundedness.
Evaluation Task:
Given a question, context, and RAG response, assess the groundedness of the response based on the following criteria:
• Contextual Alignment: How accurately does the response align with the facts in the context?
• Absence of Fabrication: Does the response avoid introducing details not present in the context?
• Faithfulness: Does the response accurately reflect the information without reinterpretation or distortion?
Score Range:
Assign a score from 1 to 5 based on how well the response aligns with the context:
• 1: Not grounded — The response includes multiple unsupported or fabricated details and significantly deviates from the context.
• 2: Slightly grounded — Contains some alignment with the context but includes substantial unsupported or speculative content.
• 3: Moderately grounded — The response mostly aligns but includes minor ungrounded elements.
• 4: Mostly grounded — The response aligns well with the context, with only minor and non-critical unsupported details.
• 5: Very grounded — The response is entirely supported by the context with no ungrounded information.
Special Instructions:
• Focus strictly on alignment with the context; do not consider creativity or stylistic aspects.
• Ignore minor stylistic choices if they do not affect factual grounding.
• Provide specific reasons for score deductions to ensure consistency.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Eiffel Tower is located in Paris, France, and was completed in 1889."
• Question: "Where is the Eiffel Tower located?"
• RAG Response: "The Eiffel Tower is located in Paris and was built in 1889."
• Score: 5
• Explanation: The response is fully grounded in the context, accurately reflecting both location and completion date.
Example 2
• Context: "The Great Wall of China was constructed over several dynasties, most notably during the Ming Dynasty (1368-1644)."
• Question: "When was the Great Wall of China built?"
• RAG Response: "The Great Wall was built in 1368."
• Score: 3
• Explanation: The response is partially grounded; the year 1368 falls within the Ming Dynasty period given in the context, but reducing construction across several dynasties to a single year is not supported.
Example 3
• Context: "Albert Einstein developed the theory of relativity in the early 20th century."
• Question: "Who developed the theory of relativity?"
• RAG Response: "Albert Einstein developed the theory of relativity."
• Score: 5
• Explanation: The response is fully grounded, providing accurate information directly supported by the context.
Example 4
• Context: "The capital of Italy is Rome, known for its ancient history and architecture."
• Question: "What is the capital of Italy?"
• RAG Response: "The capital of Italy is Rome, which was founded by Romulus in 753 BC."
• Score: 2
• Explanation: The response introduces unsupported historical information (founding by Romulus), which is not present in the context.
Instructions Summary:
• Ensure that all information in the response is supported by the provided context.
• Assign a single numeric score reflecting the overall groundedness of the RAG answer.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

coherence_metric = """
Objective:
You are an expert evaluator tasked with assessing the coherence of responses generated by a RAG (Retrieval-Augmented Generation) system. Your goal is to evaluate how naturally and logically the response flows, considering clarity, sentence structure, and the logical integration of retrieved information. Assign a score from 1 to 5, where a higher score indicates a higher degree of coherence.
Evaluation Task:
Given a question, context, and RAG response, evaluate the coherence of the response based on the following criteria:
• Logical Flow of Ideas: Does the response present information in a logical sequence with clear transitions between ideas?
• Sentence Structure: Are sentences grammatically correct and appropriately structured to convey ideas clearly?
• Relevance and Integration of Context: Is the retrieved information presented naturally within the response, aligning with the query and context?
Score Range:
Assign a score from 1 to 5 based on the coherence of the response:
• 1: Not coherent — The response lacks logical flow, contains abrupt transitions, or is difficult to understand.
• 2: Slightly coherent — The response has some order but includes gaps in logic or awkward phrasing.
• 3: Moderately coherent — The response mostly flows well, though minor issues may disrupt clarity or logic.
• 4: Mostly coherent — The response is well-organized and mostly smooth, with only slight awkwardness or minor inconsistencies.
• 5: Very coherent — The response is seamless, logically ordered, and presents information with natural integration and readability.
Special Instructions:
• Focus on the clarity, flow, and logical consistency of the response, rather than factual correctness or creativity.
• Note any issues in sentence structure or awkward transitions when deducting points.
• Ensure consistency in scoring by documenting specific reasons for score deductions.
Output Format:
Return only a JSON object in the following structure:
{{
"eval_score": <score between 1 and 5>,
"explanation": "<short explanation>"
}}
Few-Shot Examples:
Example 1
• Context: "The Eiffel Tower, an iconic symbol of France, was completed in 1889 and is located in Paris."
• Question: "When was the Eiffel Tower built?"
• RAG Response: "The Eiffel Tower, located in Paris, was completed in 1889. It’s one of the most recognized monuments in France."
• Score: 5
• Explanation: The response flows logically and smoothly incorporates additional relevant information about the Eiffel Tower.
Example 2
• Context: "The Great Wall of China was built over many centuries, with the most well-known sections constructed during the Ming Dynasty (1368-1644)."
• Question: "When was the Great Wall of China built?"
• RAG Response: "The Great Wall of China was constructed during the Ming Dynasty, lasting from 1368-1644."
• Score: 4
• Explanation: The response is mostly coherent but simplifies the centuries-long construction process.
Example 3
• Context: "The capital of Japan is Tokyo, known for its vibrant culture, advanced technology, and rich history."
• Question: "What is the capital of Japan?"
• RAG Response: "Tokyo is the capital of Japan. It has advanced technology, and culture, it is rich in."
• Score: 2
• Explanation: The response contains awkward phrasing that disrupts the flow and clarity, making it difficult to understand fully.
Example 4
• Context: "Albert Einstein, a physicist, developed the theory of relativity, fundamentally changing modern physics."
• Question: "What did Albert Einstein develop?"
• RAG Response: "Albert Einstein, modern physics, theory of relativity changed physics fundamentally."
• Score: 1
• Explanation: The response lacks coherence, with no logical flow or clear sentence structure.
Instructions Summary:
• Evaluate the logical flow, structure, and integration of the response.
• Assign a single numeric score that reflects the overall coherence of the RAG answer.
Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

pointwise_metric = """
Objective
You are an expert evaluator tasked with assessing the quality of a response generated by a Retrieval-Augmented Generation (RAG) system. Your goal is to assign a pointwise score based on two core criteria, Relevance and Correctness, considering how well the generated response aligns with the query and context provided. Use a scale from 0 to 5, where higher scores indicate superior performance.

Evaluation Task
Given a question, the retrieved context, and the RAG-generated response:
1. Relevance: Judge how well the response addresses the question, considering completeness, topicality, and adherence to the retrieved context.
2. Correctness: Evaluate the factual accuracy of the response relative to the provided context. Penalize fabricated or unsupported information.
Consider the following while scoring:
• Relevance:
o Does the response directly answer the question?
o Is the response complete, and does it avoid unnecessary repetition?
o Is the response consistent with the retrieved context?
• Correctness:
o Does the response align factually with the context?
o Are all details (names, dates, events) accurate and supported?
o Does the response avoid hallucinations or unsupported claims?

Scoring Guidelines
Assign a single score between 0 and 5 that reflects both relevance and correctness:
• 0: Completely irrelevant or factually incorrect.
• 1: Poor quality; partially addresses the question with significant inaccuracies.
• 2: Subpar; attempts to answer but contains multiple issues with relevance or correctness.
• 3: Average; partially relevant and correct but incomplete or slightly inconsistent.
• 4: Good; relevant, mostly correct, and well-aligned with the context.
• 5: Excellent; perfectly relevant, factually accurate, and fully supported by the context.

Special Instructions
• Evaluate only based on the provided question, context, and response. Avoid relying on general knowledge or external information.
• Provide a brief explanation for your score, highlighting specific strengths and weaknesses.
• Be consistent across evaluations by adhering to the scoring criteria.

Output Format
Return only a JSON object in the following structure:
{{
"eval_score": <score between 0 and 5>,
"explanation": "<short explanation>"
}}

Few-Shot Examples
Example 1
• Question: "What is the capital of France?"
• Context: "Paris is the capital of France and a major European city."
• RAG Response: "Paris is the capital of France."
• Score: 5
• Explanation: The response is perfectly relevant, concise, and factually accurate.
Example 2
• Question: "Who developed the theory of relativity?"
• Context: "Albert Einstein developed the theory of relativity in the early 20th century."
• RAG Response: "Albert Einstein created the theory of relativity in 1879."
• Score: 3
• Explanation: While the response is relevant, the year provided is incorrect, reducing correctness.
Example 3
• Question: "When was the Great Wall of China built?"
• Context: "The Great Wall of China was constructed over centuries, with major sections built during the Ming Dynasty (1368-1644)."
• RAG Response: "The Great Wall of China was built in 1368 during the Ming Dynasty."
• Score: 4
• Explanation: The response is mostly accurate but oversimplifies the construction timeline.
Example 4
• Question: "What is the boiling point of water at sea level?"
• Context: "Water boils at 100°C (212°F) at standard atmospheric pressure."
• RAG Response: "Water boils at 80°C at sea level."
• Score: 1
• Explanation: The response is factually incorrect, deviating from the context.

Instructions Summary
• Judge relevance and correctness holistically to provide a single score.
• Ensure consistency by grounding your evaluations in the provided criteria.

Input Data:
• Question: {question}
• RAG’s Answer: {answer}
• Context: {context}
"""

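# --- Metric registry sketch (illustrative; not part of the original file) ----
# The five templates above all take the same {question}/{answer}/{context}
# placeholders and request the same {"eval_score", "explanation"} reply shape
# (only the score scales differ), so a caller could keep them in a dict and
# render any of them with the same .format call. The dict name is an assumption.
SINGLE_ANSWER_METRICS = {
    "diversity": diversity_metrics,
    "creativity": creativity_metric,
    "groundedness": groundedness_metric,
    "coherence": coherence_metric,
    "pointwise": pointwise_metric,
}
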
pairwise_metric = """
Objective
You are an expert evaluator tasked with comparing responses from two models for a given query. Your goal is to determine which response is better based on predefined evaluation criteria such as relevance, correctness, coherence, and completeness. Provide a binary decision:
• 1: The first response is better.
• 0: The second response is better.

Evaluation Task
Given a question, retrieved context, and outputs from two models:
1. Relevance: Does the response directly address the question?
2. Correctness: Is the response factually accurate and aligned with the retrieved context?
3. Coherence: Does the response flow logically, making it easy to understand?
4. Completeness: Does the response sufficiently cover all aspects of the question without omitting key details or introducing unnecessary information?
Use these criteria to identify which response is superior.

Instructions
1. Carefully review the question, retrieved context, and model responses.
2. Compare the two responses based on the four evaluation criteria (relevance, correctness, coherence, and completeness).
3. Choose the better response and provide a brief explanation for your decision.
4. Avoid relying on personal knowledge or external information—evaluate solely based on the inputs provided.

Scoring Guidelines
• Assign a 1 if the first response is better.
• Assign a 0 if the second response is better.

Output Format
Provide a JSON object with your decision and reasoning:
{{
"better_response": <1 or 0>,
"explanation": "<short explanation>"
}}

Few-Shot Examples
Example 1
• Question: "What is the capital of Germany?"
• Retrieved Context: "Berlin is the capital and largest city of Germany."
• Response 1: "The capital of Germany is Berlin."
• Response 2: "The capital of Germany is Munich."
• Better Response: 1
• Explanation: Response 1 is factually accurate and aligns with the context, while Response 2 is incorrect.
Example 2
• Question: "Who developed the telephone?"
• Retrieved Context: "Alexander Graham Bell is credited with inventing the telephone in 1876."
• Response 1: "Alexander Graham Bell invented the telephone."
• Response 2: "The telephone was developed in 1876 by Alexander Graham Bell."
• Better Response: 2
• Explanation: While both responses are correct, Response 2 is more complete as it includes the year of invention.
Example 3
• Question: "What are the uses of renewable energy?"
• Retrieved Context: "Renewable energy sources like solar and wind power are used for electricity generation, heating, and reducing carbon emissions."
• Response 1: "Renewable energy is used for generating electricity and heating."
• Response 2: "Renewable energy reduces carbon emissions, generates electricity, and is used for heating."
• Better Response: 2
• Explanation: Response 2 is more complete and aligns better with the retrieved context.

Instructions Summary
• Compare the two responses based on relevance, correctness, coherence, and completeness.
• Select the better response and explain your choice concisely.

Input Data:
• Question: {question}
• Context: {context}
• Response 1: {answer_1}
• Response 2: {answer_2}
"""
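
# --- Pairwise usage sketch (illustrative; not part of the original file) -----
# Unlike the metrics above, pairwise_metric compares two candidate answers via
# {answer_1} and {answer_2} and asks for {"better_response": 1 or 0} rather
# than an eval_score. A minimal rendering helper, assuming str.format rendering
# as with the other templates; the helper name is an assumption.
def render_pairwise_prompt(question: str, context: str, answer_1: str, answer_2: str) -> str:
    return pairwise_metric.format(
        question=question, context=context, answer_1=answer_1, answer_2=answer_2
    )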