Update README.md
README.md CHANGED
@@ -19,7 +19,7 @@ This reward model is trained to predict human preferences between pairs of responses.
 - Base Model: Llama3-8B with SFT & DPO
 - Output: Single scalar reward value
 - Parameters: 8B
-- Training Framework:
+- Training Framework: DeepSpeed + TRL
 
 ## Example Usage
 
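The README describes a model that outputs a single scalar reward and is trained to predict human preferences between pairs of responses. Reward models of this kind are commonly trained with a Bradley-Terry style pairwise loss over the two scalar rewards; the sketch below illustrates that objective in isolation (the function name and values are illustrative, not taken from this repository's code):

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the chosen (preferred) response receives a
    clearly higher scalar reward than the rejected one, and approaches
    ln(2) when the two rewards are equal.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger positive margin between chosen and rejected rewards lowers the loss.
print(pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.0, 0.0))
```

Minimizing this loss over many labeled preference pairs is what pushes the model's single scalar output to rank preferred responses above rejected ones.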