# tinyllamas_92M

## Model Details
```python
max_seq_len = 256
vocab_size = 8192
dim = 768
n_layers = 12
n_heads = 12
n_kv_heads = 12
```
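For reference, a minimal sketch of instantiating this configuration with the `ModelArgs` and `Transformer` classes from llama2.c's `model.py`; `multiple_of` is not listed above and is assumed here to be the repo's `train.py` default of 32:

```python
# Minimal sketch: build the ~92M-parameter model from the config above.
# Assumes llama2.c's model.py is importable; multiple_of=32 is an assumption.
from model import ModelArgs, Transformer

args = ModelArgs(
    dim=768,
    n_layers=12,
    n_heads=12,
    n_kv_heads=12,
    vocab_size=8192,
    multiple_of=32,   # assumed: FFN hidden size rounded up to a multiple of this
    max_seq_len=256,
    dropout=0.0,
)
model = Transformer(args)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```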
## Training Data
- https://huggingface.co/datasets/roneneldan/TinyStories
- Tokenized using: https://github.com/karpathy/llama2.c?tab=readme-ov-file#custom-tokenizers
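As a quick sanity check of the custom tokenizer, the sketch below round-trips a sample story through the resulting 8192-token SentencePiece model. The `data/tok8192.model` path follows the repo's `tok{vocab_size}.model` naming and is an assumption here:

```python
# Minimal sketch: encode and decode a sample with the custom SentencePiece model.
# The model file path is an assumption; adjust to wherever train_vocab wrote it.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="data/tok8192.model")
ids = sp.encode("Once upon a time, there was a little robot.", out_type=int)
print(len(ids), ids)
print(sp.decode(ids))
```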
## Training Hyperparameters
```python
batch_size = 64  # if gradient_accumulation_steps > 1, this is the micro-batch size
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 8  # used to simulate larger batch sizes
learning_rate = 1e-3  # max learning rate
max_iters = 34000  # total number of training iterations
weight_decay = 3e-4
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 1000  # how many steps to warm up for
```
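With `decay_lr = True`, the learning rate follows a linear warmup and then a cosine decay, as in the repo's `train.py`. A minimal sketch of that schedule is below; `lr_decay_iters` and `min_lr` are not listed above and are assumed to take the repo defaults (`max_iters` and 0.0):

```python
# Minimal sketch of the warmup + cosine learning-rate schedule implied above.
import math

learning_rate = 1e-3
warmup_iters = 1000
max_iters = 34000
lr_decay_iters = max_iters  # assumed: decay over the full run
min_lr = 0.0                # assumed: decay all the way to zero

def get_lr(it: int) -> float:
    # linear warmup for the first warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # past the decay horizon, hold the minimum learning rate
    if it > lr_decay_iters:
        return min_lr
    # cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

print([round(get_lr(i), 6) for i in (0, 500, 1000, 17000, 34000)])
```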
## Results
Trained on 4x NVIDIA V100 GPUs.
Run summary:

| metric     | value     |
|------------|-----------|
| iter       | 34000     |
| loss/train | 0.8704    |
| loss/val   | 0.9966    |
| tokens     | 983040000 |