bigscience-bot committed on
Commit
0a0a7d2
1 Parent(s): 511249d
Files changed (1)
  1. logs/main_log.txt +66 -0
logs/main_log.txt CHANGED
@@ -96857,3 +96857,69 @@ time (ms)
96857
  time (ms)
96858
  iteration 1965/ 292968 | consumed samples: 4024320 | consumed tokens: 459767808 | elapsed time per iteration (ms): 106873.1 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.921288E+00 | loss scale: 32768.0 | grad norm: 20923.170 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96859
  time (ms)
96860
+ iteration 1966/ 292968 | consumed samples: 4026368 | consumed tokens: 460111872 | elapsed time per iteration (ms): 103668.6 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.929842E+00 | loss scale: 32768.0 | grad norm: 19834.126 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96861
+ time (ms)
96862
+ iteration 1967/ 292968 | consumed samples: 4028416 | consumed tokens: 460455936 | elapsed time per iteration (ms): 106664.6 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.931633E+00 | loss scale: 32768.0 | grad norm: 19386.027 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96863
+ time (ms)
96864
+ iteration 1968/ 292968 | consumed samples: 4030464 | consumed tokens: 460800000 | elapsed time per iteration (ms): 110508.2 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.945953E+00 | loss scale: 32768.0 | grad norm: 19908.571 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96865
+ time (ms)
96866
+ iteration 1969/ 292968 | consumed samples: 4032512 | consumed tokens: 461144064 | elapsed time per iteration (ms): 110069.9 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.896821E+00 | loss scale: 32768.0 | grad norm: 15035.351 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96867
+ time (ms)
96868
+ iteration 1970/ 292968 | consumed samples: 4034560 | consumed tokens: 461488128 | elapsed time per iteration (ms): 107170.1 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.940769E+00 | loss scale: 32768.0 | grad norm: 13950.627 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96869
+ time (ms)
96870
+ iteration 1971/ 292968 | consumed samples: 4036608 | consumed tokens: 461832192 | elapsed time per iteration (ms): 106511.3 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.931390E+00 | loss scale: 32768.0 | grad norm: 19245.494 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96871
+ time (ms)
96872
+ iteration 1972/ 292968 | consumed samples: 4038656 | consumed tokens: 462176256 | elapsed time per iteration (ms): 104143.5 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.939216E+00 | loss scale: 32768.0 | grad norm: 23053.813 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96873
+ time (ms)
96874
+ iteration 1973/ 292968 | consumed samples: 4040704 | consumed tokens: 462520320 | elapsed time per iteration (ms): 106138.7 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.959975E+00 | loss scale: 32768.0 | grad norm: 22524.458 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96875
+ time (ms)
96876
+ iteration 1974/ 292968 | consumed samples: 4042752 | consumed tokens: 462864384 | elapsed time per iteration (ms): 105586.4 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.905755E+00 | loss scale: 32768.0 | grad norm: 19440.251 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96877
+ time (ms)
96878
+ iteration 1975/ 292968 | consumed samples: 4044800 | consumed tokens: 463208448 | elapsed time per iteration (ms): 106158.7 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.915691E+00 | loss scale: 32768.0 | grad norm: 17649.388 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96879
+ time (ms)
96880
+ iteration 1976/ 292968 | consumed samples: 4046848 | consumed tokens: 463552512 | elapsed time per iteration (ms): 106708.4 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.920288E+00 | loss scale: 32768.0 | grad norm: 20503.069 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96881
+ time (ms)
96882
+ iteration 1977/ 292968 | consumed samples: 4048896 | consumed tokens: 463896576 | elapsed time per iteration (ms): 105936.2 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.945108E+00 | loss scale: 32768.0 | grad norm: 16839.813 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96883
+ time (ms)
96884
+ iteration 1978/ 292968 | consumed samples: 4050944 | consumed tokens: 464240640 | elapsed time per iteration (ms): 105458.1 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.917942E+00 | loss scale: 32768.0 | grad norm: 15257.276 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96885
+ time (ms)
96886
+ iteration 1979/ 292968 | consumed samples: 4052992 | consumed tokens: 464584704 | elapsed time per iteration (ms): 107165.3 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.927221E+00 | loss scale: 32768.0 | grad norm: 15093.813 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96887
+ time (ms)
96888
+ iteration 1980/ 292968 | consumed samples: 4055040 | consumed tokens: 464928768 | elapsed time per iteration (ms): 113081.0 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.957678E+00 | loss scale: 32768.0 | grad norm: 13839.536 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96889
+ time (ms)
96890
+ iteration 1981/ 292968 | consumed samples: 4057088 | consumed tokens: 465272832 | elapsed time per iteration (ms): 108714.8 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.917398E+00 | loss scale: 32768.0 | grad norm: 14074.082 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96891
+ time (ms)
96892
+ iteration 1982/ 292968 | consumed samples: 4059136 | consumed tokens: 465616896 | elapsed time per iteration (ms): 107604.5 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.925085E+00 | loss scale: 32768.0 | grad norm: 13534.880 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96893
+ time (ms)
96894
+ iteration 1983/ 292968 | consumed samples: 4061184 | consumed tokens: 465960960 | elapsed time per iteration (ms): 112383.1 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.944923E+00 | loss scale: 32768.0 | grad norm: 13209.445 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96895
+ time (ms)
96896
+ iteration 1984/ 292968 | consumed samples: 4063232 | consumed tokens: 466305024 | elapsed time per iteration (ms): 112954.3 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.918631E+00 | loss scale: 32768.0 | grad norm: 19787.184 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96897
+ time (ms)
96898
+ iteration 1985/ 292968 | consumed samples: 4065280 | consumed tokens: 466649088 | elapsed time per iteration (ms): 111797.0 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.935518E+00 | loss scale: 32768.0 | grad norm: 17837.294 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96899
+ time (ms)
96900
+ iteration 1986/ 292968 | consumed samples: 4067328 | consumed tokens: 466993152 | elapsed time per iteration (ms): 110679.8 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.927701E+00 | loss scale: 32768.0 | grad norm: 24145.327 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96901
+ time (ms)
96902
+ iteration 1987/ 292968 | consumed samples: 4069376 | consumed tokens: 467337216 | elapsed time per iteration (ms): 106586.4 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.924149E+00 | loss scale: 32768.0 | grad norm: 19059.242 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96903
+ time (ms)
96904
+ iteration 1988/ 292968 | consumed samples: 4071424 | consumed tokens: 467681280 | elapsed time per iteration (ms): 104497.1 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.911625E+00 | loss scale: 32768.0 | grad norm: 15092.949 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96905
+ time (ms)
96906
+ iteration 1989/ 292968 | consumed samples: 4073472 | consumed tokens: 468025344 | elapsed time per iteration (ms): 104962.9 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.930661E+00 | loss scale: 32768.0 | grad norm: 19898.790 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96907
+ time (ms)
96908
+ iteration 1990/ 292968 | consumed samples: 4075520 | consumed tokens: 468369408 | elapsed time per iteration (ms): 104607.9 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.931398E+00 | loss scale: 32768.0 | grad norm: 18910.425 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96909
+ time (ms)
96910
+ iteration 1991/ 292968 | consumed samples: 4077568 | consumed tokens: 468713472 | elapsed time per iteration (ms): 103902.9 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.927662E+00 | loss scale: 32768.0 | grad norm: 16632.425 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96911
+ time (ms)
96912
+ iteration 1992/ 292968 | consumed samples: 4079616 | consumed tokens: 469057536 | elapsed time per iteration (ms): 106519.1 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.915715E+00 | loss scale: 32768.0 | grad norm: 13302.984 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96913
+ time (ms)
96914
+ iteration 1993/ 292968 | consumed samples: 4081664 | consumed tokens: 469401600 | elapsed time per iteration (ms): 105643.5 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.921783E+00 | loss scale: 32768.0 | grad norm: 16160.708 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96915
+ time (ms)
96916
+ iteration 1994/ 292968 | consumed samples: 4083712 | consumed tokens: 469745664 | elapsed time per iteration (ms): 104271.9 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.939743E+00 | loss scale: 32768.0 | grad norm: 19586.680 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96917
+ time (ms)
96918
+ iteration 1995/ 292968 | consumed samples: 4085760 | consumed tokens: 470089728 | elapsed time per iteration (ms): 105935.4 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.918940E+00 | loss scale: 32768.0 | grad norm: 18793.983 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96919
+ time (ms)
96920
+ iteration 1996/ 292968 | consumed samples: 4087808 | consumed tokens: 470433792 | elapsed time per iteration (ms): 105026.3 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.930414E+00 | loss scale: 32768.0 | grad norm: 16737.588 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96921
+ time (ms)
96922
+ iteration 1997/ 292968 | consumed samples: 4089856 | consumed tokens: 470777856 | elapsed time per iteration (ms): 104382.1 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.952893E+00 | loss scale: 32768.0 | grad norm: 13563.057 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96923
+ time (ms)
96924
+ iteration 1998/ 292968 | consumed samples: 4091904 | consumed tokens: 471121920 | elapsed time per iteration (ms): 106021.3 | learning rate: 1.000E-04 | global batch size: 2048 | lm loss: 3.901303E+00 | loss scale: 32768.0 | grad norm: 15104.265 | num zeros: 0.0 | curriculum seqlen: 168 | number of skipped iterations: 0 | number of nan iterations: 0 |
96925
+ time (ms)
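
The records above share a fixed `key: value | key: value | ...` layout, so a small parser can be used to sanity-check them. The following is a minimal sketch, not part of the training code: `parse_record` is a hypothetical helper, and the two input lines are copied from the log above. It checks that, between consecutive iterations, consumed samples advance by the global batch size and consumed tokens advance by global batch size times curriculum seqlen.

```python
import re

# Two consecutive records copied verbatim from the log above.
PREV = ("iteration 1965/ 292968 | consumed samples: 4024320 | consumed tokens: 459767808 | "
        "elapsed time per iteration (ms): 106873.1 | learning rate: 1.000E-04 | "
        "global batch size: 2048 | lm loss: 3.921288E+00 | loss scale: 32768.0 | "
        "grad norm: 20923.170 | num zeros: 0.0 | curriculum seqlen: 168 | "
        "number of skipped iterations: 0 | number of nan iterations: 0 |")
CURR = ("+ iteration 1966/ 292968 | consumed samples: 4026368 | consumed tokens: 460111872 | "
        "elapsed time per iteration (ms): 103668.6 | learning rate: 1.000E-04 | "
        "global batch size: 2048 | lm loss: 3.929842E+00 | loss scale: 32768.0 | "
        "grad norm: 19834.126 | num zeros: 0.0 | curriculum seqlen: 168 | "
        "number of skipped iterations: 0 | number of nan iterations: 0 |")


def parse_record(line):
    """Hypothetical helper: turn one 'iteration ...' record into a dict."""
    # Drop the leading diff marker ('+') and surrounding whitespace, if any.
    line = line.lstrip("+ ").strip()
    fields = {}
    for part in line.split("|"):
        part = part.strip()
        if not part:
            continue  # the trailing '|' leaves an empty last field
        m = re.match(r"iteration\s+(\d+)/\s*(\d+)$", part)
        if m:
            fields["iteration"] = int(m.group(1))
            fields["total iterations"] = int(m.group(2))
            continue
        key, _, value = part.partition(":")
        value = value.strip()
        try:
            fields[key.strip()] = int(value)
        except ValueError:
            fields[key.strip()] = float(value)
    return fields


prev, curr = parse_record(PREV), parse_record(CURR)
sample_delta = curr["consumed samples"] - prev["consumed samples"]
token_delta = curr["consumed tokens"] - prev["consumed tokens"]

# Samples advance by one global batch; tokens by batch size * sequence length.
assert sample_delta == curr["global batch size"]
assert token_delta == curr["global batch size"] * curr["curriculum seqlen"]
```

For these two records the sample delta is 2048 and the token delta is 2048 × 168 = 344064, which matches the logged counters.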