tokyo-electron-device-ai/llama3-tedllm-8b-v0

tokyo-electron-device-ai committed 8 days ago

Commit 7277056 • 1 Parent(s): bd5bcbd

Update README.md

Files changed (1)

README.md +1 -1

@@ -75,7 +75,7 @@ Note: This released model was trained exclusively on open-source datasets. We al
 * **weight_decay**: 0.1
 * **annealing_steps**: 500
-Note: We created another model name, llama3-tedllm-8b-v0-annealing as the model with the annealing_step applied. Please check [Here](https://huggingface.co/tokyo-electron-device-ai/llama3-tedllm-8b-v0-annealing) if you can use.
+Note: We created another model name, llama3-tedllm-8b-v0-annealing as the model with the annealing_step applied. If you are interested, please check [here](https://huggingface.co/tokyo-electron-device-ai/llama3-tedllm-8b-v0-annealing).
 ### Training Infrastructure
 The model was trained on a Cerebras Wafer-Scale Cluster, using from 4 to 16 CS-3 systems during different phases of training. Training on the Cerebras Wafer-Scale Clusters leverages Cerebras' Weight Streaming execution paradigm, which simplifies the training of LLMs by disaggregating compute from memory used for model weights. This enables efficient scaling of training across multiple nodes using simple data parallelism. You can learn more about Cerebras Wafer-Scale clusters and Weight Streaming execution [here](https://8968533.fs1.hubspotusercontent-na1.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper.pdf).