Model Card for Llama-3.1-8B-Instruct-NLRL-Breakthrough-Value
Model Details
Model Description
- Developed by: NLRL Team
- Model type: Language Value Function Model for Breakthrough
- Language(s): English
- License: MIT
- Finetuned from model: LLaMA-3.1-8B-Instruct
This model serves as a language value function in Natural Language Reinforcement Learning (NLRL) framework, specifically trained for the Breakthrough game. It evaluates the state through natural language description and provides value assessment.
Uses
Direct Use
This model can be used as a Breakthrough position evaluator that explains its evaluation through natural language before providing the final assessment. The model generates both reasoning chains and final value judgments.
Out-of-Scope Use
This model is specifically trained for Breakthrough board state evaluation and should not be used for other games or value assessment tasks.
Training Details
Training Data
Training data consists of TD data collected through NLRL value learning process, with language-based TD estimates serving as training targets for the value function.
Training Procedure
- Trained using FSDP (Fully Sharded Data Parallel) across 4 H100 GPUs
- Learning rate: 2e-5
- Training epochs per iteration: 2
- Max sequence length: 1024
Evaluation
- Demonstrates consistent evaluation capabilities across different game states
Model Architecture
- Base model: LLaMA-3.1-8B-Instruct
- Input: Text description of Breakthrough board state
- Output: Chain-of-thought evaluation followed by value assessment
Citation
@misc{feng2024naturallanguagereinforcementlearning,
title={Natural Language Reinforcement Learning},
author={Xidong Feng and Ziyu Wan and Haotian Fu and Bo Liu and Mengyue Yang and Girish A. Koushik and Zhiyuan Hu and Ying Wen and Jun Wang},
year={2024},
eprint={2411.14251},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.14251},
}
Model Card Contact
- Downloads last month
- 17
Model tree for Waterhorse/Llama-3.1-8B-Instruct-NLRL-Breakthrough-Value
Base model
meta-llama/Llama-3.1-8B