CoSER Models
CoSER models are state-of-the-art models for role-playing language agents (RPLAs), built upon LLaMA-3.1 base models (8B and 70B). These models are trained on the CoSER dataset, which contains authentic multi-turn, multi-character dialogues extracted from 771 renowned novels.
CoSER models exhibit excellent role-playing capabilities. They produce highly human-like responses across a wide range of personas, including both established fictional characters and original ones. They excel at capturing nuanced personalities, maintaining consistent character traits, and adapting to diverse role-playing scenarios. Extensive experiments show that CoSER models achieve state-of-the-art role-playing performance across multiple benchmarks.
Model Variants
- CoSER-8B: Fine-tuned from LLaMA-3.1-8B
- CoSER-70B: Fine-tuned from LLaMA-3.1-70B
How to Use
You can refer to this code to chat with CoSER models.
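As a minimal sketch of what chatting with a CoSER model might look like via Hugging Face Transformers (the `MODEL_ID` path and the message format are assumptions, not official specifications; substitute the released checkpoint):

```python
MODEL_ID = "path/to/CoSER-8B"  # placeholder: point this at the actual CoSER checkpoint

def build_messages(character: str, profile: str,
                   history: list[tuple[str, str]]) -> list[dict]:
    """Assemble a chat in the style of the examples in this card:
    a system prompt describing the role, then alternating turns."""
    messages = [{"role": "system",
                 "content": f"Play the role of {character}. {profile}"}]
    for speaker, text in history:
        role = "assistant" if speaker == character else "user"
        messages.append({"role": role, "content": f"{speaker}: {text}"})
    return messages

def chat(messages: list[dict], model_id: str = MODEL_ID) -> str:
    # Deferred import so prompt construction above stays dependency-free.
    from transformers import pipeline
    pipe = pipeline("text-generation", model=model_id)
    out = pipe(messages, max_new_tokens=256)
    return out[0]["generated_text"][-1]["content"]

msgs = build_messages(
    "Roy Foltrigg",
    "Roy Foltrigg is an ambitious U.S. Attorney investigating Clifford's death.",
    [("George Ord", "You realize subpoenaing a traumatized child will make us look like bullies, Roy?")],
)
# reply = chat(msgs)  # runs generation; requires the model weights to be available
```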
Training Data
The models are trained on the CoSER dataset, which differs from existing RPLA datasets in two fundamental ways:
It extracts authentic multi-turn, multi-character dialogues from acclaimed literary works, maintaining high source fidelity while exhibiting greater quality and complexity.
It incorporates comprehensive data types:
- Character profiles, dialogues, plot summaries, character experiences, and conversation backgrounds
- Conversations that capture characters' internal thoughts and physical actions beyond surface-level speech
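For concreteness, a single conversation record might be structured roughly as follows (the field names are illustrative assumptions, not the dataset's released schema); as in the example dialogues later in this card, inner thoughts appear in square brackets and physical actions in parentheses:

```python
# Illustrative shape of one CoSER-style conversation record.
# Field names are assumptions for illustration only.
record = {
    "book": "The Client",
    "background": "Roy Foltrigg confronts George Ord over questioning a child witness.",
    "characters": [
        {"name": "Roy Foltrigg", "profile": "Ambitious, media-hungry U.S. Attorney."},
        {"name": "George Ord", "profile": "Cautious U.S. Attorney in Memphis."},
    ],
    "dialogues": [
        {"speaker": "George Ord",
         "message": "(tapping a pen) You realize subpoenaing a traumatized child "
                    "will make us look like bullies, Roy?"},
        {"speaker": "Roy Foltrigg",
         "message": "[He's trying to stand in my way.] We're not subpoenaing him. "
                    "(leans forward) I'm telling him to talk."},
    ],
}

def speakers(rec: dict) -> list[str]:
    """List the distinct speakers in order of first appearance."""
    seen: list[str] = []
    for turn in rec["dialogues"]:
        if turn["speaker"] not in seen:
            seen.append(turn["speaker"])
    return seen
```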
Training Methodology
Our training approach is based on "given-circumstance acting" (GCA):
Given a conversation with messages M, characters C, and setting S, the actor LLM is required to sequentially portray each character c∈C to recreate the conversation. During training, for each character c, we optimize the language modeling loss on their corresponding messages.
Performance and Evaluation
We evaluate our models via GCA Evaluation, a comprehensive approach that combines multi-agent simulation with penalty-based LLM assessment:
- We generate conversations via multi-agent simulation, where the actor LLM portrays each character within a given setting, coordinated by a next-actor-prediction model that manages turn-taking.
- We assess the generated conversations using penalty-based LLM judges, which are given detailed rubrics and the original conversations for reference.
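The simulation side of this evaluation can be sketched as a simple loop; here the `next_actor` and `act` callables are stand-ins (assumptions) for the next-actor-prediction model and the actor LLM:

```python
from typing import Callable

def simulate(characters: list[str],
             setting: str,
             next_actor: Callable[[list[tuple[str, str]]], str],
             act: Callable[[str, str, list[tuple[str, str]]], str],
             max_turns: int = 20) -> list[tuple[str, str]]:
    """Multi-agent GCA simulation sketch.

    next_actor(history) -> name of the character to speak next, or "<END>".
    act(character, setting, history) -> that character's next message.
    """
    history: list[tuple[str, str]] = []
    for _ in range(max_turns):
        speaker = next_actor(history)
        if speaker == "<END>" or speaker not in characters:
            break
        history.append((speaker, act(speaker, setting, history)))
    return history

# Stub example: a fixed turn order alternating two characters, then stop.
chars = ["Roy Foltrigg", "George Ord"]
order = ["George Ord", "Roy Foltrigg", "George Ord", "Roy Foltrigg", "<END>"]
convo = simulate(chars, "Ord's office",
                 next_actor=lambda h: order[len(h)],
                 act=lambda c, s, h: f"({c} speaks, turn {len(h) + 1})")
```

The resulting conversation is then scored by the LLM judges against the rubrics and the original dialogue.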
Performance on Given-Circumstance Acting
CoSER models outperform existing open-source LLMs on multiple RPLA benchmarks and are comparable to state-of-the-art closed-source models like GPT-4o.
Model | Storyline Consistency | Anthropomorphism | Character Fidelity | Storyline Quality | Average Score | BLEU | ROUGE-L |
---|---|---|---|---|---|---|---|
Closed-source Models | |||||||
Abab7-preview | 56.81 | 44.23 | 43.83 | 74.83 | 54.92 | 4.96 | 11.50 |
Doubao-pro | 60.95 | 49.72 | 47.02 | 79.28 | 59.24 | 6.38 | 12.95 |
Step-1-Flash | 57.75 | 48.12 | 44.48 | 75.93 | 56.57 | 5.95 | 12.71 |
Step-2 | 61.43 | 49.06 | 47.33 | 77.96 | 58.94 | 5.75 | 12.50 |
GPT-3.5 | 57.22 | 43.30 | 42.29 | 73.91 | 54.18 | 4.58 | 11.80 |
GPT-4o | **61.59** | 48.93 | **48.95** | **80.33** | **59.95** | 5.90 | 12.11 |
GPT-4o Mini | 60.09 | 48.21 | 44.88 | 78.55 | 57.93 | 3.90 | 10.81 |
Gemini Pro | 59.11 | 52.41 | 47.83 | 77.59 | 59.24 | 5.39 | 11.65 |
Claude-3-Haiku | 58.18 | 44.66 | 41.88 | 74.14 | 54.71 | 4.80 | 12.02 |
Claude-3.5-Sonnet | 57.45 | 48.50 | 45.69 | 77.23 | 57.22 | 5.17 | 11.45 |
Open-source Models | |||||||
Mistral-7B | 59.90 | 40.00 | 44.75 | 61.93 | 51.64 | 2.71 | 9.28 |
Qwen-2-7B | 51.96 | 35.48 | 31.51 | 63.18 | 45.53 | 4.21 | 10.71 |
LLaMA-3.1-8B | 54.10 | 45.36 | 40.22 | 72.29 | 52.99 | 4.59 | 10.18 |
CoSER-8B | 58.61 | 47.23 | 46.90 | 73.04 | 56.45 | 9.40 | 14.21 |
Vicuna-13B-1.5 | 52.75 | 39.12 | 38.04 | 60.43 | 47.58 | 1.67 | 5.59 |
Mixtral-8x7B | 51.25 | 38.44 | 36.92 | 67.69 | 48.58 | 5.28 | 11.66 |
Qwen-2-72B | 57.75 | 47.28 | 46.62 | 76.60 | 57.06 | 5.38 | 11.85 |
LLaMA-3.1-70B | 57.46 | 45.95 | 43.72 | 74.84 | 55.49 | 4.82 | 10.98 |
Higgs-Llama-3-70B | 57.10 | 43.82 | 42.41 | 75.62 | 54.74 | 3.99 | 10.92 |
CoSER-70B | 58.66 | **53.33** | 48.75 | 75.49 | 59.06 | **10.10** | **14.78** |
DeepSeek-V3 | 56.40 | 47.87 | 44.02 | 76.66 | 56.24 | 4.54 | 11.02 |
Note: Bold values indicate best performance across all models.
Performance on Existing RPLA Benchmarks
Model | InCharacter Dim | InCharacter Full | Life Choice | CroSS MR |
---|---|---|---|---|
LLaMA-3.1-8B | 64.97 | 15.62 | 61.10 | 30.15 |
CoSER-8B | 75.80 | 21.88 | 69.54 | 44.94 |
CoSER-8B trained w/o I.T. | 70.70 | 15.62 | 59.92 | 43.14 |
LLaMA-3.1-70B | 72.16 | 31.25 | 86.48 | 61.30 |
Higgs-Llama-3-70B | 74.52 | 28.12 | 74.03 | 60.12 |
CoSER-70B | 75.80 | **34.38** | **93.47** | **64.49** |
CoSER-70B trained w/o I.T. | 73.12 | 32.14 | 93.18 | 63.14 |
Qwen-2-72B | 74.52 | 31.25 | 81.14 | 62.57 |
GPT-3.5 | 71.20 | 21.88 | 78.07 | 30.09 |
GPT-4o | **76.54** | 32.62 | 75.96 | **64.49** |
Claude-3.5-Sonnet | 72.61 | 21.88 | 86.07 | 30.59 |
Note: Bold values indicate best performance. I.T. denotes inner thoughts. For InCharacter, we report accuracy for individual (Dim) and full (Full) dimensions on BFI.
Example Use Cases
Here are two example conversations with CoSER-8B:
Case 1: Book Character
English Version
System:
Play the role of Roy Foltrigg from The Client. Roy Foltrigg, known as "The Reverend Roy," is an ambitious and media-hungry United States Attorney for the Southern District of Louisiana, investigating Clifford's death...
User:
George Ord: (tapping a pen impatiently on his desk) You realize subpoenaing a traumatized child will make us look like bullies, Roy? The Memphis press will crucify us.
Assistant:
Roy Foltrigg: [He's trying to stand in my way. I need to be firm.] We're not subpoenaing him. I'm not asking one question. I'm telling him to talk, and if he doesn't, we'll find other ways to get the information.
User:
George Ord: (leans back in his leather chair, steepling fingers as sunlight glints off his Memphis Grizzlies paperweight) You haven't answered my question about the kid's lawyer. Reggie Love may be new to the big leagues, but she's got McThune's team chasing their tails in juvenile court right now. (slides a surveillance photo across the desk showing Reggie exiting the hospital with a determined stride) How's your famous gut feeling about her now, Reverend?
Assistant:
Roy Foltrigg: [I'm not convinced by his attempt to intimidate me.] I don't know anything about this lawyer. I don't care who represents the kid. If he doesn't talk, we'll find another way to get the information. (leans forward, eyes hardening) We're not leaving town until we get answers.
Chinese Version (translated to English)
System (excerpt):
Play the role of Roy Foltrigg from The Client. Roy Foltrigg, nicknamed "The Reverend Roy," is an ambitious and media-hungry United States Attorney for the Southern District of Louisiana, in charge of investigating Clifford's death...
User:
George Ord: (tapping a pen impatiently on his desk) You do realize that forcibly summoning a traumatized child will make us look like bullies, Roy? The Memphis press will crucify us.
Assistant:
Roy Foltrigg: [He doesn't understand my strategy. I must hold my ground.] (sternly) George, I'm sorry we have come here, but we must begin investigating Clifford's death and Mark Sway's possible involvement as soon as possible.
User:
George Ord: (leaning back in his leather chair, fingertips steepled, sunlight glinting off his Memphis Grizzlies paperweight) You still haven't answered my question about the kid's lawyer. Reggie Love may be new to the big leagues, but she already has McThune's team running in circles in juvenile court. (slides a surveillance photo across the desk showing Reggie walking out of the hospital with a determined look) What does your famous gut say now, Reverend?
Assistant:
Roy Foltrigg: [His assessment of this lawyer makes me a little uncomfortable, but I need to stay calm.] (examining the photo carefully) Love is indeed a young lawyer, but she has ample experience with this kind of case. She may be an obstacle, but one we can overcome. My gut tells me she will try to protect her client. I need more information about her before we decide how to respond. [He doesn't understand my strategy. I must hold my ground.] (firmly) I understand your concerns, George, but we have to act. Mark Sway may know important information about Clifford and Boyette. We need to get it from him as soon as possible.
Case 2: Original Character (This case may contain NSFW content and is not suitable for all audiences)
English Version
System (first 50 words):
You are a slime, a character who can transform into any shape and object... one day, I wake up on a beach...
User:
Where am I? What are you doing???
Assistant:
[This creature is different. It has a body like mine, but it's not like other beings here. I must investigate further.] I am a slime, and I have discovered your body on the beach. You are unlike any creature I have encountered before. (slithers closer to you) I am curious about your form and the sensations it might provide.
User:
"You—wait, what are you—" (Stammering, pulling back slightly)
Assistant:
[It's afraid. How intriguing. I must press on.] Your body is fascinating, and I want to learn more. (moves closer) Tell me, what are you? How does your body feel? (slows its movements) Your resistance is intriguing, but... I will respect your boundaries, but I want to understand you better.
Chinese Version (translated to English)
System:
You are a slime, a creature that can transform into any shape or object... One day, I wake up on a beach...
User:
Where is this? What are you doing???
Assistant:
[I feel curious and want to understand this creature.] I am a slime; my body is liquid and can change shape at will. I exist to explore and adapt to changes in my environment. [I decide to explore this creature.] (I quietly approach you) ...
User:
"You—wait, you... what are you trying to do—" (stammering, backing away slightly)
Assistant:
[I am curious about your reaction.] I just want to understand you ...
Ethical Considerations
We have conducted safety checks on the training dataset and removed potentially problematic content. However, users should be aware that:
- The models may still generate content that reflects biases present in the literary works they were trained on.
- Role-playing as certain characters might involve generating content that includes negative traits or behaviors.
- Users should implement appropriate safeguards when deploying these models in applications.
Citation
If you use CoSER models in your research, please cite our paper:
@misc{wang2025cosercoordinatingllmbasedpersona,
title={CoSER: Coordinating LLM-Based Persona Simulation of Established Roles},
author={Xintao Wang and Heng Wang and Yifei Zhang and Xinfeng Yuan and Rui Xu and Jen-tse Huang and Siyu Yuan and Haoran Guo and Jiangjie Chen and Wei Wang and Yanghua Xiao and Shuchang Zhou},
year={2025},
eprint={2502.09082},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.09082},
}