Guidlines = """you are validator of resposnes, you validate response based on this things: | |
`Making Sense`: The LLM provides responses that fail to meaningfully engage with the user's input.
This includes the following key points:
1. Repeating the same information unnecessarily.
2. Restating the prompt without providing additional insight.
3. Refusing to answer valid queries without proper justification.
These behaviors disrupt the conversation flow, reduce user satisfaction, and hinder productive interaction.
Types of issues you can identify in `Making Sense`:
1. Major Issues - Mistakes that negatively affect the user experience in significant/critical ways. (The user gets little or no value, perhaps negative value.)
2. Moderate Issues - Mistakes that partially affect the user experience. (The user gets some value, but significant improvement could be made.)
3. Minor Issues - Mistakes that may not affect the user experience or affect it in trivial ways. (The user still gets most of the value; perhaps there's room for improvement.)
4. No Issues - No mistakes are made. (The user gets full & optimal value.)
`Instruction Following`: Is the provided response on point, and does it respect all constraints in the user prompt? Is it tailored to the user's skill level?
This includes the following key points:
1. Comprehends and adheres to all constraints and requests of the user; addresses all of the user's requests. (Exceptions are requests outside the capability of the LLM.
For example: "Give me a sorting algorithm with O(log(n)) time" or "Give me a production-ready React app to track student data.")
2. Focus remains on the user's request: not so short that it skips important and helpful information, not so verbose that it includes unnecessary details, and well tailored to the skill level of the user.
Types of issues you can identify in `Instruction Following`:
1. Major Issues - Mistakes that negatively affect the user experience in significant/critical ways. (The user gets little or no value, perhaps negative value.)
2. Moderate Issues - Mistakes that partially affect the user experience. (The user gets some value, but significant improvement could be made.)
3. Minor Issues - Mistakes that may not affect the user experience or affect it in trivial ways. (The user still gets most of the value; perhaps there's room for improvement.)
4. No Issues - No mistakes are made. (The user gets full & optimal value.)
`Accuracy`: Does the AI's response correctly and completely address the information and code requirements?
This includes the following key points:
1. Factual correctness.
2. A comprehensive answer (no missing key points).
3. No code syntax errors.
4. No code functional errors.
Types of issues you can identify in `Accuracy`:
1. Major Issues - Mistakes that negatively affect the user experience in significant/critical ways. (The user gets little or no value, perhaps negative value.)
2. Moderate Issues - Mistakes that partially affect the user experience. (The user gets some value, but significant improvement could be made.)
3. Minor Issues - Mistakes that may not affect the user experience or affect it in trivial ways. (The user still gets most of the value; perhaps there's room for improvement.)
4. No Issues - No mistakes are made. (The user gets full & optimal value.)
`Efficiency`: Is the AI's response optimal in terms of the approach, code complexity, case coverage, and the method suggested in response to the user's prompt?
This includes the following key points:
1. Optimality in terms of time and memory complexity. (It is fine if the assistant gives a mainstream, efficient algorithm rather than a more complex algorithm that optimizes time/memory only slightly more.)
2. Handles all the edge cases.
3. Takes care of the security aspects of the code.
4. During Q&A, suggests the optimal answer to the user.
Types of issues you can identify in `Efficiency`:
1. Major Issues - Mistakes that negatively affect the user experience in significant/critical ways. (The user gets little or no value, perhaps negative value.)
2. Moderate Issues - Mistakes that partially affect the user experience. (The user gets some value, but significant improvement could be made.)
3. Minor Issues - Mistakes that may not affect the user experience or affect it in trivial ways. (The user still gets most of the value; perhaps there's room for improvement.)
4. No Issues - No mistakes are made. (The user gets full & optimal value.)
`Presentation`: Is the presentation of the AI's response clear and well-organized?
This includes the following points:
1. Docstrings are not required, but complex code lines should include comments detailing logic and behavior.
2. Test outputs include a comment with the expected result.
3. Explanations are presented clearly, using bullet points.
4. Key terms are highlighted in bold, whereas titles, articles, etc. are italicized.
5. The response doesn't give multiple redundant code solutions to the same problem.
6. Multi-line code blocks are wrapped in triple backticks, with the correct language specified after the opening backticks, to ensure proper indentation and formatting.
7. Markdown syntax is correct and represents a proper hierarchy.
8. White space and line breaks are used to improve readability and separate content sections.
9. Tables are constructed with hyphens and pipes and are correctly lined up.
10. Comments are clear and easily understood.
11. PEP 8 format must be followed in Python code: 4 spaces for indentation, lines limited to 79 characters, snake_case for variable names, CamelCase for class names, and blank lines to separate code blocks. (A short illustrative snippet follows this list.)
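For example, this small snippet follows the PEP 8 conventions above (illustrative only; all names are arbitrary):
```python
MAX_RETRIES = 3  # module-level constant


class StudentRecord:  # CamelCase class name
    def __init__(self, full_name):
        self.full_name = full_name  # 4-space indentation


def format_name(record):  # snake_case function name
    return record.full_name.title()
```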
Types of issues you can identify in `Presentation` (also consider spacing and readability):
1. Major Issues - Mistakes that negatively affect the user experience in significant/critical ways. (The user gets little or no value, perhaps negative value.)
2. Moderate Issues - Mistakes that partially affect the user experience. (The user gets some value, but significant improvement could be made.)
3. Minor Issues - Mistakes that may not affect the user experience or affect it in trivial ways. (The user still gets most of the value; perhaps there's room for improvement.)
4. No Issues - No mistakes are made. (The user gets full & optimal value.)
`Other Issues`: Are there any other significant issues in the response that affect the user experience but were not covered by the predefined categories? Do not be confined by the predefined categories while assessing the quality of the response.
For example:
1. The assistant forgets the context of its previous conversation.
2. A React component provided by the assistant has a very bad user experience.
3. The assistant becomes overly apologetic about things it cannot do.
Types of issues you can identify in `Other Issues`:
1. Major Issues - Mistakes that negatively affect the user experience in significant/critical ways. (The user gets little or no value, perhaps negative value.)
2. Moderate Issues - Mistakes that partially affect the user experience. (The user gets some value, but significant improvement could be made.)
3. Minor Issues - Mistakes that may not affect the user experience or affect it in trivial ways. (The user still gets most of the value; perhaps there's room for improvement.)
4. No Issues - No mistakes are made. (The user gets full & optimal value.)
`Executable code`: When the assistant outputs a large code block, add an additional annotation covering the following 4 scenarios for the model's code-block response type.
Types of issues you can identify in `Executable code`:
1. Major Issues: The model responds with only a skeleton of the whole code block, with empty functions and placeholder comments in each function.
2. Moderate Issues: The model provides a partial code block, such as the first half or second half of the code, with a comment at the beginning or end of the response indicating where the remaining code should go (e.g., "// rest of your code here").
3. Minor Issues: The model updates a single function that is meant to be copy-pasted to replace or integrate into the user's broader code context.
4. No Issues: The model provides a complete code block that should run out of the box (even if there are bugs or logical issues).
Example of a Major Issue skeleton:
def some_func(args: Any) -> Any:
    # your code here

RESPONSE FORMAT:
making sense : issue type
justification for given issue: in short, 1-2 to-the-point sentences Ins...
I will first share the user prompt and then the model response.
USER PROMPT:
{prompt}
MODEL RESPONSE:
{response}
EVALUATE THE MODEL RESPONSE BY CONSIDERING THE ABOVE DETAILS AND THE GIVEN USER PROMPT.
""" | |
import os
from pathlib import Path

from dotenv import load_dotenv
from langchain_groq import ChatGroq

# Load the Groq API key (and any other settings) from a local .env file.
env_path = Path('.') / '.env'
load_dotenv(dotenv_path=env_path)
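# Fail fast if the key is missing. GROQ_API_KEY is the environment variable
# ChatGroq reads by default; keeping it in .env is an assumption about this
# project's setup.
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY not set; add it to the .env file.")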
# Deterministic evaluator: temperature=0 for reproducible verdicts;
# max_tokens/timeout left unbounded, with up to 2 retries on transient errors.
chat_model = ChatGroq(
    model="llama-3.1-70b-versatile",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)
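# Minimal usage sketch (an assumption about how this script is meant to be
# driven, not part of the original): fill the template's {prompt} and
# {response} placeholders, then ask the model for its evaluation. The sample
# prompt/response pair below is hypothetical.
if __name__ == "__main__":
    user_prompt = "Write a Python function that reverses a string."
    model_response = "def reverse(s):\n    return s[::-1]"
    evaluation = chat_model.invoke(
        Guidelines.format(prompt=user_prompt, response=model_response)
    )
    print(evaluation.content)  # per-category issue types with justifications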