sdadas committed on
Commit
464c73d
1 Parent(s): 5ec30f4

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -24,9 +24,9 @@ The final distribution of documents by topic is shown in the chart below:
  ## Model details

  The models were trained for one epoch on sequences of 4096 tokens. During training, we used many modern optimizations such as:
- - [torch.compile](pytorch.org/docs/stable/generated/torch.compile.html)
+ - [torch.compile](https://pytorch.org/docs/stable/generated/torch.compile.html)
  - [adamw_apex_fused](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#optimizer-choice) optimizer
- - [Flash Attention 2](github.com/Dao-AILab/flash-attention)
+ - [Flash Attention 2](https://github.com/Dao-AILab/flash-attention)
  - [Mixed precision](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#bf16) (`--bf16` and `--tf32` options)
  - [Gradient accumulation](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#gradient-accumulation)
  - [Fully Sharded Data Parallel (FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) with the SHARD_GRAD_OP mode
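
Most of the optimizations in this list correspond to command-line options of the Hugging Face `Trainer`. The sketch below shows one way such a flag set could be assembled. It is not the authors' actual launch command: the helper `build_trainer_flags`, the script name `train.py`, and the accumulation step count are hypothetical, and only `--bf16` and `--tf32` are named in the README itself. Flash Attention 2 is left out because it is usually selected when the model is loaded (for example with `attn_implementation="flash_attention_2"`) rather than through a Trainer flag.

```python
import shlex


def build_trainer_flags(grad_accum_steps: int = 8) -> list[str]:
    """Return Trainer-style CLI flags for the optimizations listed above.

    grad_accum_steps is an illustrative value; the README does not state
    the number of accumulation steps that was actually used.
    """
    return [
        "--torch_compile",                  # compile the model with torch.compile
        "--optim", "adamw_apex_fused",      # fused AdamW from NVIDIA Apex
        "--bf16",                           # bfloat16 mixed precision
        "--tf32",                           # TF32 matmuls on supporting GPUs
        "--gradient_accumulation_steps", str(grad_accum_steps),
        "--fsdp", "shard_grad_op",          # FSDP in SHARD_GRAD_OP mode
    ]


if __name__ == "__main__":
    # "train.py" is a placeholder for whatever training script is used.
    print(shlex.join(["torchrun", "train.py", *build_trainer_flags()]))
```

Keeping the flags in a single list makes it easy to check which optimizations are enabled before a job is launched.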