Answered in 2 languages
Can I disable the 2 language behavior?
First the model answered in German, then gave the English translation in parentheses.
User:
Hi wie geht's? (Hi, how's it going?)
Granite:
Hallo! Ich bin Granite, ein Sprachmodell-AI entwickelt von IBM im Jahr 2024. Es freut mich, dass Sie mich kontaktieren. Wie kann ich Ihnen heute behilflich sein? (Hi! I am Granite, an AI language model developed by IBM in 2024. It's nice to meet you. How can I assist you today?)
Hi, I can't reproduce granite-3.2-8b-instruct's performance on AlpacaEval2.
Here's what I'm getting:
LC win-rate: 16.57 (thinking)
LC win-rate: 46.37 (no thinking)
@nightdude Hi, have you parsed the output correctly when evaluating AlpacaEval2 results with thinking? Only the response should be used for evaluation.
@ashish23 @ranarag: can you share the exact settings, including a sample parser, for evaluating AlpacaEval2?
@nightdude
Yes, you need to parse the results and submit only the response for evaluation. You can activate the thoughts for response generation, but only the responses must be sent for eval.
Also, we use 4096 max length (including prompt + response tokens).
Here is a simple parser:
if "Here is my response:" in generated_text:
generated_text = generated_text.split("Here is my response:")[1].strip()
elif "Here is my thought process:" in generated_text:
generated_text = generated_text.split("Here is my thought process:")[1].strip()
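For example, on a hypothetical generation containing both markers:

# Quick check of the parser above (hypothetical string):
generated_text = "Here is my thought process: ... Here is my response: Final answer."
if "Here is my response:" in generated_text:
    generated_text = generated_text.split("Here is my response:")[1].strip()
assert generated_text == "Final answer."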
Thanks! Indeed, I did not remove the "thinking" part from the LLM's response before evaluation. However, even after removing it, I still don't get over 0.46 LC win rate.
Here's a sample of Granite's outputs (as you can see, they all start with "Here is my thought process"):
[
{
"dataset":"helpful_base",
"instruction":"What are the names of some famous actors that started their careers on Broadway?",
"output":"Certainly, here are some renowned actors who kickstarted their careers on Broadway:\n\n1. **Robert De Niro** - Before becoming a Hollywood icon, De Niro honed his craft on the Broadway stage. His first role was in the 1965 production of \"The Action is the Juice,\" followed by \"A Hatful of Rain\" in 1965 and \"HBO Playhouse Presents: The Shrinking Bride\" in 1972.\n\n2. **Meryl Streep** - Streep began her professional acting career on Broadway, starting with \"Trelawny of the 'Wells'\" in 1975, and gaining fame through her performance in \"The Corn is Green\" in 1976.\n\n3. **Al Pacino** - Pacino started his theater career on Broadway in the late 1960s, appearing in plays like \"Does a Tiger Wear a Necktie?\" (1969) and \"The Basic Training of Pavlo Hummel\" (1971). \n\n4. **Dustin Hoffman** - Hoffman's Broadway debut was in \"Eh?,\" a 1961 play, and he later starred in \"Exit the King\" in 2009, for which he received a Tony nomination at age 72.\n\n5. **Natalie Portman** - Although relatively young in comparison to others, Portman made her Broadway debut in \"The Diary of Anne Frank\" in 2005 before transitioning to film roles.\n\n6. **Denzel Washington** - Washington began his acting journey on Broadway, appearing in \"Foreman at the Door\" (1972) and \"A Soldier's Play\" (1981), for which he won a Tony Award.\n\n7. **Viola Davis** - Davis started her career on Broadway in the early 1990s, notably in \"King Hedley II\" (1999) for which she received a Tony nomination. She's continued to balance Broadway and film work throughout her career.\n\n8. **Hugh Jackman** - Known for his role as Wolverine in the X-Men films, Jackman started on Broadway in the late 1990s, appearing in productions like \"The Boy from Oz\" for which he won a Tony Award.\n\nThese actors not only showcase the caliber of talent that Broadway has nurtured but also highlight how the stage can serve as a crucial stepping stone to stardom in cinema and television.",
"generator":"Granite-3.2-8b-instruct-think"
},
{
"dataset":"helpful_base",
"instruction":"How did US states get their names?",
"output":"The names of U.S. states are derived from a variety of sources, reflecting the rich history and cultural diversity of the nation. Here\u2019s a breakdown of some common origins:\n\n1. **Native American Languages and Tribes:** \n - A substantial number of state names originate from indigenous languages. For instance:\n - *Alabama*: Derived from the Choctaw word *alba* meaning \"vegetation\" or \"plants,\" combined with *amaha* meaning \"here we rest.\"\n - *Connecticut*: From the Mohegan word *quinnitukqut*, meaning \"long river place\" or \"at the long tidal river.\"\n - *Michigan*: Likely from the Ojibwe word *migis*, meaning \"great water\" or \"large lake.\"\n\n2. **European Explorers and Monarchs:**\n - Many states are named after European explorers or monarchs:\n - *Virginia*: Named by Sir Walter Raleigh in honor of Elizabeth I, the \"Virgin Queen.\"\n - *Florida*: Named \"La Florida\" by Spanish explorer Ponce de Le\u00f3n.\n - *Pennsylvania*: Means \"Penn's Woods,\" named after William Penn, the English proprietor.\n\n3. **Geographical Features:**\n - Several states take their names from physical characteristics:\n - *Montana*: From the Spanish word *monta\u00f1a*, meaning \"mountain.\"\n - *Vermont*: Derived from French, meaning \"Green Mountain.\"\n - *Tennessee*: From the Native American word *tanasi*, meaning \"meeting place\" or \"river of the great slope.\"\n\n4. **Notable People:**\n - Some states are named after prominent historical figures:\n - *Washington*: Named after George Washington.\n - *Jackson*: After Andrew Jackson.\n - *Jefferson*: Named in honor of Thomas Jefferson.\n\n5. **Other Miscellaneous Origins:**\n - A few states have more unique or less common origins:\n - *Illinois*: Possibly derived from the Algonquian word for \" superior men\" or \"brave men.\"\n - *Ohio*: Originated from the Iroquoian word *ohi-yo*, meaning \"great river\" or \"large creek.\"\n\nIt's worth noting that in some cases, the exact origin is debated or unclear, and names have evolved through time with shifts in language and cultural assimilation. This diversity in naming reflects the multifaceted heritage of the United States, encompassing Native American, European, and early American influences.",
"generator":"Granite-3.2-8b-instruct-think"
},
...
]
NOTE: I remove anything that comes between "Here is my thought process:" and "Here is my response:".
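For reference, here is roughly what that stripping step looks like (a minimal sketch; the marker strings are the ones from the parser above, and strip_thinking is a hypothetical helper, not the exact code I ran):

THOUGHT = "Here is my thought process:"
RESPONSE = "Here is my response:"

def strip_thinking(text: str) -> str:
    # Drop everything from the thought marker up to and including the
    # response marker, keeping only the final answer.
    if THOUGHT in text and RESPONSE in text:
        head, _, tail = text.partition(THOUGHT)
        _, _, answer = tail.partition(RESPONSE)
        return (head + answer).strip()
    return text.strip()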
The evaluator that I used is alpaca_eval_clf_cot_gpt4_turbo (if that's not the one you used, could you post the exact config.yaml for the evaluator?).
My full command:
alpaca_eval evaluate_from_model \
  --model_configs AlpacaEval/granite-3-2-8b-instruct-think-configs.yaml \
  --annotators_config alpaca_eval_clf_cot_gpt4_turbo
NOTE: in alpaca_eval_clf_cot_gpt4_turbo/config.yaml I set max_tokens: 4096 (instead of 300).
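For reference, the relevant part of that annotator config after my change (a sketch from memory; other keys elided and left at their defaults):

alpaca_eval_clf_cot_gpt4_turbo:
  ...
  completions_kwargs:
    ...
    max_tokens: 4096  # was 300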
- Can you please post your evaluator's config here? There are many evaluators, and each can be configured in many different ways.
- Can you also post the full command that you used to evaluate the model's outputs?
- Perhaps you could post the results you get with your configs on a minimal subsample of AlpacaEval2 (e.g. 64 samples with --max_instances 64)? Just so I can be sure that I'm on the right track.
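For example, reusing my command from above with a subsample:

alpaca_eval evaluate_from_model \
  --model_configs AlpacaEval/granite-3-2-8b-instruct-think-configs.yaml \
  --annotators_config alpaca_eval_clf_cot_gpt4_turbo \
  --max_instances 64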
For context, here is my model config (granite-3-2-8b-instruct-think-configs.yaml):

Granite-3.2-8b-instruct-think:
  prompt_template: "AlpacaEval/chatml_prompt_granite_thinking.txt"
  fn_completions: "openai_completions"
  completions_kwargs:
    model_name: "ibm-granite/granite-3.2-8b-instruct"
    max_tokens: 4096
    requires_chatml: True
    is_chat: True
    client_kwargs:
      base_url: "http://localhost:4000"
  pretty_name: "Granite 3.2 8b Instruct"
  link: "https://huggingface.co/ibm-granite/granite-3.2-8b-instruct"
and the prompt template (chatml_prompt_granite_thinking.txt):
<|im_start|>system
You are Granite, developed by IBM. You are a helpful AI assistant.
Respond to every user query in a comprehensive and detailed way. You can write down your thoughts and reasoning process before responding. In the thought process, engage in a comprehensive cycle of analysis, summarization, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. In the response section, based on various attempts, explorations, and reflections from the thoughts section, systematically present the final solution that you deem correct. The response should summarize the thought process. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query.
<|im_end|>
<|im_start|>user
{instruction}
<|im_end|>
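alpaca_eval fills in the {instruction} placeholder itself; conceptually it amounts to something like this (a sketch, not alpaca_eval's actual code):

# Hypothetical illustration of how the {instruction} placeholder gets filled:
with open("AlpacaEval/chatml_prompt_granite_thinking.txt") as f:
    template = f.read()
prompt = template.format(instruction="How did US states get their names?")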
Thanks!!!
Sure, let me provide more details about our evaluator and the configs used.
We evaluate the alpaca task using the olmes repo: https://github.com/allenai/olmes/tree/main
Here is the exact command that we use:

python oe_eval/launch.py --model $sub_model_path --task "alpaca_eval_v2::tulu" --output-dir $out_dir --model-type vllm --batch-size 1 --model-args '{"max_length": 4096}'
When using thinking, we parse the outputs in the code: https://github.com/allenai/olmes/blob/main/oe_eval/models/eleuther_vllm_causallms.py, specifically by changing "generated_text" in the "generate_until_verbose" function.
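The change is essentially the same parsing shown earlier in this thread; roughly (a sketch of the idea, not the exact diff):

# Inside generate_until_verbose, after decoding each generation
# (sketch; variable name as in the olmes code):
if "Here is my response:" in generated_text:
    generated_text = generated_text.split("Here is my response:")[1].strip()
elif "Here is my thought process:" in generated_text:
    generated_text = generated_text.split("Here is my thought process:")[1].strip()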
We do not change the default alpaca config here: https://github.com/allenai/olmes/blob/main/oe_eval/tasks/oe_eval_tasks/alpaca_eval.py
Also, I see that you are using an incorrect prompt template; it should instead be:

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024. Today's Date: March 08, 2025. You are Granite, developed by IBM. You are a helpful AI assistant. Respond to every user query in a comprehensive and detailed way. You can write down your thoughts and reasoning process before responding. In the thought process, engage in a comprehensive cycle of analysis, summarization, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. In the response section, based on various attempts, explorations, and reflections from the thoughts section, systematically present the final solution that you deem correct. The response should summarize the thought process. Write your thoughts after 'Here is my thought process:' and write your response after 'Here is my response:' for each user query.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{instruction}<|end_of_text|>
Without thinking enabled, the template will be:

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024. Today's Date: March 08, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{instruction}<|end_of_text|>
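If you use the Hugging Face tokenizer, the easiest way to get these templates right is to let the chat template build them. A minimal sketch, assuming your transformers version passes the thinking kwarg through to the Granite 3.2 chat template (verify the rendered string against the templates above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.2-8b-instruct")
conv = [{"role": "user", "content": "How did US states get their names?"}]
# thinking=True selects the extended system prompt shown above;
# omit it (or pass thinking=False) for the no-thinking template.
prompt = tokenizer.apply_chat_template(conv, tokenize=False,
                                       add_generation_prompt=True, thinking=True)
print(prompt)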
Also, the GPT model that we use as the judge is gpt-4-1106-preview.