koichi12 committed on
Commit 04522a2 · verified · Parent: b38ed3f

Add files using upload-large-folder tool

This view is limited to 50 files because the commit contains too many changes.
Files changed (50):
  1. wandb/run-20240802_180656-l8nnlt0c/files/config.yaml +335 -0
  2. wandb/run-20240802_180656-l8nnlt0c/files/output.log +0 -0
  3. wandb/run-20240802_180656-l8nnlt0c/files/requirements.txt +271 -0
  4. wandb/run-20240802_180656-l8nnlt0c/files/wandb-metadata.json +215 -0
  5. wandb/run-20240802_180656-l8nnlt0c/files/wandb-summary.json +1 -0
  6. wandb/run-20240802_180656-l8nnlt0c/logs/debug.log +30 -0
  7. wandb/run-20240804_021608-l90yeme3/files/config.yaml +335 -0
  8. wandb/run-20240804_021608-l90yeme3/files/output.log +0 -0
  9. wandb/run-20240804_021608-l90yeme3/files/requirements.txt +271 -0
  10. wandb/run-20240804_021608-l90yeme3/files/wandb-metadata.json +215 -0
  11. wandb/run-20240804_021608-l90yeme3/files/wandb-summary.json +1 -0
  12. wandb/run-20240804_021608-l90yeme3/logs/debug-internal.log +0 -0
  13. wandb/run-20240804_021608-l90yeme3/logs/debug.log +29 -0
  14. wandb/run-20240804_035906-457c7q3q/files/config.yaml +335 -0
  15. wandb/run-20240804_035906-457c7q3q/files/output.log +130 -0
  16. wandb/run-20240804_035906-457c7q3q/files/requirements.txt +271 -0
  17. wandb/run-20240804_035906-457c7q3q/files/wandb-metadata.json +215 -0
  18. wandb/run-20240804_035906-457c7q3q/files/wandb-summary.json +1 -0
  19. wandb/run-20240804_035906-457c7q3q/logs/debug-internal.log +186 -0
  20. wandb/run-20240804_035906-457c7q3q/logs/debug.log +29 -0
  21. wandb/run-20240804_035906-457c7q3q/run-457c7q3q.wandb +0 -0
  22. wandb/run-20240804_143449-7tyiihss/files/config.yaml +335 -0
  23. wandb/run-20240804_143449-7tyiihss/files/output.log +135 -0
  24. wandb/run-20240804_143449-7tyiihss/files/requirements.txt +271 -0
  25. wandb/run-20240804_143449-7tyiihss/files/wandb-metadata.json +215 -0
  26. wandb/run-20240804_143449-7tyiihss/files/wandb-summary.json +1 -0
  27. wandb/run-20240804_143449-7tyiihss/logs/debug-internal.log +186 -0
  28. wandb/run-20240804_143449-7tyiihss/logs/debug.log +30 -0
  29. wandb/run-20240804_143449-7tyiihss/run-7tyiihss.wandb +0 -0
  30. wandb/run-20240804_153511-5ba5jbt6/files/config.yaml +335 -0
  31. wandb/run-20240804_153511-5ba5jbt6/files/output.log +135 -0
  32. wandb/run-20240804_153511-5ba5jbt6/files/requirements.txt +271 -0
  33. wandb/run-20240804_153511-5ba5jbt6/files/wandb-metadata.json +215 -0
  34. wandb/run-20240804_153511-5ba5jbt6/files/wandb-summary.json +1 -0
  35. wandb/run-20240804_153511-5ba5jbt6/logs/debug-internal.log +188 -0
  36. wandb/run-20240804_153511-5ba5jbt6/logs/debug.log +30 -0
  37. wandb/run-20240804_153511-5ba5jbt6/run-5ba5jbt6.wandb +0 -0
  38. wandb/run-20240812_052446-qrv0d6sp/files/config.yaml +314 -0
  39. wandb/run-20240812_052446-qrv0d6sp/files/output.log +12 -0
  40. wandb/run-20240812_052446-qrv0d6sp/files/requirements.txt +271 -0
  41. wandb/run-20240812_052446-qrv0d6sp/files/wandb-metadata.json +215 -0
  42. wandb/run-20240812_052446-qrv0d6sp/files/wandb-summary.json +1 -0
  43. wandb/run-20240812_052446-qrv0d6sp/logs/debug-internal.log +185 -0
  44. wandb/run-20240812_052446-qrv0d6sp/logs/debug.log +28 -0
  45. wandb/run-20240812_052446-qrv0d6sp/run-qrv0d6sp.wandb +0 -0
  46. wandb/run-20240812_072401-esew3nhv/files/config.yaml +335 -0
  47. wandb/run-20240812_072401-esew3nhv/files/requirements.txt +271 -0
  48. wandb/run-20240812_072401-esew3nhv/files/wandb-metadata.json +215 -0
  49. wandb/run-20240812_072401-esew3nhv/logs/debug-internal.log +240 -0
  50. wandb/run-20240812_072401-esew3nhv/logs/debug.log +29 -0
wandb/run-20240802_180656-l8nnlt0c/files/config.yaml ADDED
@@ -0,0 +1,335 @@
+ wandb_version: 1
+
+ sharding_strategy:
+   desc: null
+   value: FULL_SHARD
+ checkpoint_type:
+   desc: null
+   value: LOCAL_STATE_DICT
+ fsdp_activation_checkpointing:
+   desc: null
+   value: true
+ fsdp_cpu_offload:
+   desc: null
+   value: false
+ low_cpu_fsdp:
+   desc: null
+   value: false
+ no_meta_device:
+   desc: null
+   value: false
+ data_path:
+   desc: null
+   value: null
+ split:
+   desc: null
+   value: 969, 30, 1
+ train_data_path:
+   desc: null
+   value:
+   - '4013541'
+   - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ valid_data_path:
+   desc: null
+   value:
+   - '4013541'
+   - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ test_data_path:
+   desc: null
+   value:
+   - '4013541'
+   - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ data_cache_path:
+   desc: null
+   value: null
+ vocab_size:
+   desc: null
+   value: null
+ vocab_file:
+   desc: null
+   value: null
+ merge_file:
+   desc: null
+   value: null
+ seq_length:
+   desc: null
+   value: 512
+ num_workers:
+   desc: null
+   value: 2
+ tokenizer_type:
+   desc: null
+   value: Llama2Tokenizer
+ tokenizer_model:
+   desc: null
+   value: /share/pretrained_lm/custom/tiny-mistral/tokenizer.model.v3
+ reset_position_ids:
+   desc: null
+   value: false
+ reset_attention_mask:
+   desc: null
+   value: false
+ eod_mask_loss:
+   desc: null
+   value: false
+ retro_return_doc_ids:
+   desc: null
+   value: false
+ short_seq_prob:
+   desc: null
+   value: 0.1
+ vocab_extra_ids:
+   desc: null
+   value: 0
+ seed:
+   desc: null
+   value: 1234
+ use_mpi:
+   desc: null
+   value: false
+ wandb_entity:
+   desc: null
+   value: iwakawa-koichi-q5-tohoku-nlp6723
+ wandb_name:
+   desc: null
+   value: tiny-mistral-sample_train_2024-08-02-18:06:43
+ wandb_project:
+   desc: null
+   value: llm_tutorial
+ quantization:
+   desc: null
+   value: false
+ use_freeze_layers:
+   desc: null
+   value: false
+ freeze_layers:
+   desc: null
+   value: null
+ bf16:
+   desc: null
+   value: true
+ fp16:
+   desc: null
+   value: false
+ mixed_precision:
+   desc: null
+   value: true
+ param_dtype:
+   desc: null
+   value: null
+ load:
+   desc: null
+   value: /work/llm_recipes/models/tiny-mistral-sample
+ save:
+   desc: null
+   value: /work/llm_recipes/models/tiny-mistral-sample
+ base_model:
+   desc: null
+   value: /share/pretrained_lm/custom/tiny-mistral
+ use_better_transformer:
+   desc: null
+   value: false
+ grad_clip_norm:
+   desc: null
+   value: 1.0
+ eval_interval:
+   desc: null
+   value: 200
+ save_interval:
+   desc: null
+   value: 200
+ eval_iters:
+   desc: null
+   value: 10
+ optimizer:
+   desc: null
+   value: adam
+ lr:
+   desc: null
+   value: 2.0e-05
+ lr_decay_style:
+   desc: null
+   value: cosine
+ lr_decay_iters:
+   desc: null
+   value: 20000
+ lr_warmup_iters:
+   desc: null
+   value: 500
+ min_lr:
+   desc: null
+   value: 1.0e-06
+ train_iters:
+   desc: null
+   value: 20000
+ train_samples:
+   desc: null
+   value: null
+ global_batch_size:
+   desc: null
+   value: 320
+ micro_batch_size:
+   desc: null
+   value: 8
+ make_vocab_size_divisible_by:
+   desc: null
+   value: 128
+ sliding_window_size:
+   desc: null
+   value: 4096
+ skip_batch:
+   desc: null
+   value: null
+ no_save_optimizer_state:
+   desc: null
+   value: false
+ continual_pretraining:
+   desc: null
+   value: false
+ instruction_tuning:
+   desc: null
+   value: false
+ direct_preference_optimization:
+   desc: null
+   value: false
+ attention_dropout:
+   desc: null
+   value: 0.1
+ hidden_dropout:
+   desc: null
+   value: 0.1
+ weight_decay:
+   desc: null
+   value: 0.1
+ adam_beta1:
+   desc: null
+   value: 0.9
+ adam_beta2:
+   desc: null
+   value: 0.95
+ adam_eps:
+   desc: null
+   value: 1.0e-06
+ hf_transformer_model_dir:
+   desc: null
+   value: null
+ instruction_train_data_path:
+   desc: null
+   value: null
+ instruction_valid_data_path:
+   desc: null
+   value: null
+ epoch:
+   desc: null
+   value: null
+ instruction_dataset_size:
+   desc: null
+   value: null
+ save_sampler_state:
+   desc: null
+   value: false
+ label_smoothing:
+   desc: null
+   value: 0.0
+ save_n_checkpoints:
+   desc: null
+   value: 10
+ hf_repo_id:
+   desc: null
+   value: koichi12/tiny-mistral-sample
+ create_public_hf_repo:
+   desc: null
+   value: false
+ upload_all_checkpoints_to_hf:
+   desc: null
+   value: false
+ hf_upload_retry_limit:
+   desc: null
+   value: 2
+ exit_duration_in_mins:
+   desc: null
+   value: null
+ source_key:
+   desc: null
+   value: null
+ target_key:
+   desc: null
+   value: null
+ attn_implementation:
+   desc: null
+   value: flash_attention_2
+ efficient_instruction_tuning:
+   desc: null
+   value: false
+ remove_padding_masking:
+   desc: null
+   value: false
+ save_start_iter:
+   desc: null
+   value: null
+ rank:
+   desc: null
+   value: 0
+ world_size:
+   desc: null
+   value: 1
+ padded_vocab_size:
+   desc: null
+   value: 32768
+ gradient_accumulation_steps:
+   desc: null
+   value: 40
+ _wandb:
+   desc: null
+   value:
+     python_version: 3.10.12
+     cli_version: 0.16.3
+     framework: huggingface
+     huggingface_version: 4.43.3
+     is_jupyter_run: false
+     is_kaggle_kernel: false
+     start_time: 1722589616.489856
+     t:
+       1:
+       - 1
+       - 11
+       - 49
+       - 55
+       - 71
+       2:
+       - 1
+       - 11
+       - 49
+       - 55
+       - 71
+       3:
+       - 13
+       - 16
+       - 23
+       4: 3.10.12
+       5: 0.16.3
+       6: 4.43.3
+       8:
+       - 5
+       13: linux-x86_64
+ activation_function:
+   desc: null
+   value: silu
+ hidden_size:
+   desc: null
+   value: 256
+ model_type:
+   desc: null
+   value: mistral
+ max_position_embeddings:
+   desc: null
+   value: 512
+ num_attention_heads:
+   desc: null
+   value: 4
+ num_hidden_layers:
+   desc: null
+   value: 4
+ model_architecture:
+   desc: null
+   value: MistralForCausalLM
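Not part of the original run artifacts: a quick sanity check on how the batch-size fields in this config relate. Assuming the usual convention that the global batch is accumulated over micro-batches across all ranks, `gradient_accumulation_steps` should equal `global_batch_size / (micro_batch_size * world_size)`:

```python
# Values taken from the config.yaml above (world_size is 1 in this run).
global_batch_size = 320
micro_batch_size = 8
world_size = 1

# Assumed relation, not confirmed by the repo:
grad_accum_steps = global_batch_size // (micro_batch_size * world_size)
print(grad_accum_steps)  # 40, matching gradient_accumulation_steps in the config
```

The logged `gradient_accumulation_steps: 40` is consistent with this relation.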
wandb/run-20240802_180656-l8nnlt0c/files/output.log ADDED
The diff for this file is too large to render.
wandb/run-20240802_180656-l8nnlt0c/files/requirements.txt ADDED
@@ -0,0 +1,271 @@
+ absl-py==2.1.0
+ accelerate==0.33.0
+ aiohttp==3.9.1
+ aiosignal==1.3.1
+ annotated-types==0.6.0
+ apex==0.1
+ appdirs==1.4.4
+ argon2-cffi-bindings==21.2.0
+ argon2-cffi==23.1.0
+ asttokens==2.4.1
+ astunparse==1.6.3
+ async-timeout==4.0.3
+ attrs==23.2.0
+ audioread==3.0.1
+ beautifulsoup4==4.12.3
+ bleach==6.1.0
+ blis==0.7.11
+ cachetools==5.3.2
+ catalogue==2.0.10
+ certifi==2024.2.2
+ cffi==1.16.0
+ charset-normalizer==3.3.2
+ click==8.1.7
+ cloudpathlib==0.16.0
+ cloudpickle==3.0.0
+ cmake==3.28.1
+ colorama==0.4.6
+ comm==0.2.1
+ confection==0.1.4
+ contourpy==1.2.0
+ cubinlinker==0.3.0+2.g405ac64
+ cuda-python==12.3.0rc4+9.gdb8c48a.dirty
+ cudf==23.12.0
+ cugraph-dgl==23.12.0
+ cugraph-service-client==23.12.0
+ cugraph-service-server==23.12.0
+ cugraph==23.12.0
+ cuml==23.12.0
+ cupy-cuda12x==12.3.0
+ cycler==0.12.1
+ cymem==2.0.8
+ cython==3.0.8
+ dask-cuda==23.12.0
+ dask-cudf==23.12.0
+ dask==2023.11.0
+ debugpy==1.8.1
+ decorator==5.1.1
+ defusedxml==0.7.1
+ distributed==2023.11.0
+ dm-tree==0.1.8
+ docker-pycreds==0.4.0
+ einops==0.7.0
+ exceptiongroup==1.2.0
+ execnet==2.0.2
+ executing==2.0.1
+ expecttest==0.1.3
+ fastjsonschema==2.19.1
+ fastrlock==0.8.2
+ filelock==3.13.1
+ flash-attn==2.4.2
+ fonttools==4.48.1
+ frozenlist==1.4.1
+ fsspec==2023.12.2
+ gast==0.5.4
+ gitdb==4.0.11
+ gitpython==3.1.43
+ google-auth-oauthlib==0.4.6
+ google-auth==2.27.0
+ graphsurgeon==0.4.6
+ grpcio==1.60.1
+ huggingface-hub==0.24.5
+ hypothesis==5.35.1
+ idna==3.6
+ importlib-metadata==7.0.1
+ iniconfig==2.0.0
+ intel-openmp==2021.4.0
+ ipadic==1.0.0
+ ipykernel==6.29.2
+ ipython-genutils==0.2.0
+ ipython==8.21.0
+ jedi==0.19.1
+ jinja2==3.1.3
+ joblib==1.3.2
+ json5==0.9.14
+ jsonnet==0.19.1
+ jsonschema-specifications==2023.12.1
+ jsonschema==4.21.1
+ jupyter-client==8.6.0
+ jupyter-core==5.7.1
+ jupyter-tensorboard==0.2.0
+ jupyterlab-pygments==0.3.0
+ jupyterlab-server==1.2.0
+ jupyterlab==2.3.2
+ jupytext==1.16.1
+ kiwisolver==1.4.5
+ langcodes==3.3.0
+ lazy-loader==0.3
+ librosa==0.10.1
+ llvmlite==0.40.1
+ locket==1.0.0
+ logzero==1.7.0
+ lxml==5.2.2
+ markdown-it-py==3.0.0
+ markdown==3.5.2
+ markupsafe==2.1.4
+ matplotlib-inline==0.1.6
+ matplotlib==3.8.2
+ mdit-py-plugins==0.4.0
+ mdurl==0.1.2
+ mecab-python3==1.0.6
+ mistune==3.0.2
+ mkl-devel==2021.1.1
+ mkl-include==2021.1.1
+ mkl==2021.1.1
+ mock==5.1.0
+ more-itertools==9.1.0
+ mpmath==1.3.0
+ msgpack==1.0.7
+ multidict==6.0.4
+ murmurhash==1.0.10
+ nbclient==0.9.0
+ nbconvert==7.16.0
+ nbformat==5.9.2
+ nest-asyncio==1.6.0
+ networkx==2.6.3
+ ninja==1.11.1.1
+ nltk==3.8.1
+ notebook==6.4.10
+ numba==0.57.1+1.g1ff679645
+ numpy==1.24.4
+ nvfuser==0.1.4a0+d0bb811
+ nvidia-dali-cuda120==1.34.0
+ nvidia-pyindex==1.0.9
+ nvtx==0.2.5
+ oauthlib==3.2.2
+ onnx==1.15.0rc2
+ opencv==4.7.0
+ optree==0.10.0
+ packaging==23.2
+ pandas==1.5.3
+ pandocfilters==1.5.1
+ parso==0.8.3
+ partd==1.4.1
+ peft==0.11.1
+ pexpect==4.9.0
+ pillow==10.2.0
+ pip==24.0
+ platformdirs==4.2.0
+ pluggy==1.4.0
+ ply==3.11
+ polygraphy==0.49.4
+ pooch==1.8.0
+ portalocker==2.10.1
+ preshed==3.0.9
+ prettytable==3.9.0
+ prometheus-client==0.19.0
+ prompt-toolkit==3.0.43
+ protobuf==4.24.4
+ psutil==5.9.4
+ ptxcompiler==0.8.1+2.g0d406d6
+ ptyprocess==0.7.0
+ pure-eval==0.2.2
+ pyarrow==14.0.1.dev0+gba5374836.d20240125
+ pyasn1-modules==0.3.0
+ pyasn1==0.5.1
+ pybind11-global==2.11.1
+ pybind11==2.11.1
+ pycocotools==2.0+nv0.8.0
+ pycparser==2.21
+ pydantic-core==2.16.2
+ pydantic==2.6.1
+ pygments==2.17.2
+ pylibcugraph==23.12.0
+ pylibcugraphops==23.12.0
+ pylibraft==23.12.0
+ pynvml==11.4.1
+ pyparsing==3.1.1
+ pytest-flakefinder==1.1.0
+ pytest-rerunfailures==13.0
+ pytest-shard==0.1.2
+ pytest-xdist==3.5.0
+ pytest==8.0.0
+ python-dateutil==2.8.2
+ python-dotenv==1.0.0
+ python-hostlist==1.23.0
+ pytorch-quantization==2.1.2
+ pytz==2023.3.post1
+ pyyaml==6.0.1
+ pyzmq==25.1.2
+ raft-dask==23.12.0
+ rapids-dask-dependency==23.12.1
+ referencing==0.33.0
+ regex==2023.12.25
+ requests-oauthlib==1.3.1
+ requests==2.31.0
+ rich==13.7.0
+ rmm==23.12.0
+ rpds-py==0.17.1
+ rsa==4.9
+ sacrebleu==2.4.0
+ safetensors==0.4.3
+ scikit-learn==1.2.0
+ scipy==1.12.0
+ send2trash==1.8.2
+ sentencepiece==0.1.99
+ sentry-sdk==2.12.0
+ setproctitle==1.3.3
+ setuptools==68.2.2
+ six==1.16.0
+ smart-open==6.4.0
+ smmap==5.0.1
+ sortedcontainers==2.4.0
+ soundfile==0.12.1
+ soupsieve==2.5
+ soxr==0.3.7
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ spacy==3.7.2
+ sphinx-glpi-theme==0.6
+ srsly==2.4.8
+ stack-data==0.6.3
+ sympy==1.12
+ tabulate==0.9.0
+ tbb==2021.11.0
+ tblib==3.0.0
+ tensorboard-data-server==0.6.1
+ tensorboard-plugin-wit==1.8.1
+ tensorboard==2.9.0
+ tensorrt==8.6.3
+ terminado==0.18.0
+ termplotlib==0.3.9
+ thinc==8.2.3
+ threadpoolctl==3.2.0
+ thriftpy2==0.4.17
+ tinycss2==1.2.1
+ tokenizers==0.19.1
+ toml==0.10.2
+ tomli==2.0.1
+ toolz==0.12.1
+ torch-tensorrt==2.3.0a0
+ torch==2.3.0a0+ebedce2
+ torchdata==0.7.1a0
+ torchtext==0.17.0a0
+ torchvision==0.18.0a0
+ tornado==6.4
+ tqdm==4.66.1
+ traitlets==5.9.0
+ transformer-engine==1.3.0+5b90b7f
+ transformers==4.43.3
+ treelite-runtime==3.9.1
+ treelite==3.9.1
+ triton==2.2.0+e28a256
+ typer==0.9.0
+ types-dataclasses==0.6.6
+ typing-extensions==4.9.0
+ ucx-py==0.35.0
+ uff==0.6.9
+ ujson==5.8.0
+ urllib3==1.26.18
+ wandb==0.16.3
+ wasabi==1.1.2
+ wcwidth==0.2.13
+ weasel==0.3.4
+ webencodings==0.5.1
+ werkzeug==3.0.1
+ wheel==0.42.0
+ xdoctest==1.0.2
+ xgboost==1.7.6
+ yarl==1.9.4
+ zict==3.0.0
+ zipp==3.17.0
wandb/run-20240802_180656-l8nnlt0c/files/wandb-metadata.json ADDED
@@ -0,0 +1,215 @@
+ {
+   "os": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
+   "python": "3.10.12",
+   "heartbeatAt": "2024-08-02T09:06:57.070043",
+   "startedAt": "2024-08-02T09:06:56.476807",
+   "docker": null,
+   "cuda": null,
+   "args": [
+     "--seq-length",
+     "512",
+     "--sliding-window-size",
+     "4096",
+     "--micro-batch-size",
+     "8",
+     "--global-batch-size",
+     "320",
+     "--train-iters",
+     "20000",
+     "--tokenizer-type",
+     "Llama2Tokenizer",
+     "--tokenizer-model",
+     "/share/pretrained_lm/custom/tiny-mistral/tokenizer.model.v3",
+     "--train-data-path",
+     "4013541",
+     "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+     "--valid-data-path",
+     "4013541",
+     "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+     "--test-data-path",
+     "4013541",
+     "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+     "--lr",
+     "2e-5",
+     "--min-lr",
+     "1e-6",
+     "--lr-decay-style",
+     "cosine",
+     "--lr-warmup-iters",
+     "500",
+     "--lr-decay-iters",
+     "20000",
+     "--weight-decay",
+     "0.1",
+     "--grad-clip-norm",
+     "1.0",
+     "--optimizer",
+     "adam",
+     "--adam-beta1",
+     "0.9",
+     "--adam-beta2",
+     "0.95",
+     "--adam-eps",
+     "1e-6",
+     "--save-interval",
+     "200",
+     "--eval-interval",
+     "200",
+     "--eval-iters",
+     "10",
+     "--bf16",
+     "--mixed-precision",
+     "--base-model",
+     "/share/pretrained_lm/custom/tiny-mistral",
+     "--save",
+     "/work/llm_recipes/models/tiny-mistral-sample",
+     "--load",
+     "/work/llm_recipes/models/tiny-mistral-sample",
+     "--fsdp-activation-checkpointing",
+     "--sharding-strategy",
+     "FULL_SHARD",
+     "--checkpoint-type",
+     "LOCAL_STATE_DICT",
+     "--save-n-checkpoints",
+     "10",
+     "--hf-upload-retry-limit",
+     "2",
+     "--hf-repo-id",
+     "koichi12/tiny-mistral-sample",
+     "--wandb-entity",
+     "iwakawa-koichi-q5-tohoku-nlp6723",
+     "--wandb-project",
+     "llm_tutorial",
+     "--wandb-name",
+     "tiny-mistral-sample_train_2024-08-02-18:06:43"
+   ],
+   "state": "running",
+   "program": "/project/examples/finetuning.py",
+   "codePathLocal": "examples/finetuning.py",
+   "codePath": "examples/finetuning.py",
+   "git": {
+     "remote": "https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git",
+     "commit": "3be5353210a678dc7008f237fa16b99f2bdf36ea"
+   },
+   "email": null,
+   "root": "/project",
+   "host": "gpu-koiwa-00",
+   "username": "koiwa",
+   "executable": "/usr/bin/python",
+   "cpu_count": 18,
+   "cpu_count_logical": 18,
+   "cpu_freq": {
+     "current": 2400.0409999999997,
+     "min": 0.0,
+     "max": 0.0
+   },
+   "cpu_freq_per_core": [
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.041,
+       "min": 0.0,
+       "max": 0.0
+     }
+   ],
+   "disk": {
+     "/": {
+       "total": 0.0625,
+       "used": 1.1444091796875e-05
+     }
+   },
+   "gpu": "NVIDIA A100-SXM4-40GB",
+   "gpu_count": 1,
+   "gpu_devices": [
+     {
+       "name": "NVIDIA A100-SXM4-40GB",
+       "memory_total": 42949672960
+     }
+   ],
+   "memory": {
+     "total": 56.48782730102539
+   }
+ }
wandb/run-20240802_180656-l8nnlt0c/files/wandb-summary.json ADDED
@@ -0,0 +1 @@
+ {"training/loss": 8.714340209960938, "training/perplexity": 6089.615424118352, "utils/batch_size": 8, "utils/global_batch_size": 320, "utils/seq_len": 513, "utils/gradient_accumulation_steps": 40, "utils/iteration": 20000, "optimizer/lr": 1e-06, "optimizer/variance_l2": 0.01379441768482063, "optimizer/variance_sqrt_l2": 1.002313441281401, "optimizer/momentum_l2": 0.9743922417897144, "optimizer/weight_l2": 101.93656115447489, "optimizer/variance_l1": 1.003814697265625, "optimizer/variance_sqrt_l1": 592.75, "optimizer/momentum_l1": 429.375, "optimizer/weight_l1": 333120.0, "optimizer/variance_abs_max": 0.0012969970703125, "optimizer/variance_sqrt_abs_max": 0.0361328125, "optimizer/momentum_abs_max": 0.035400390625, "optimizer/weight_abs_max": 1.0, "stats/1_iteration_time": 1.1439334890019381, "stats/tokens_per_sec": 143504.84672253692, "stats/tokens_per_sec_per_gpu": 143504.84672253692, "stats/tflops": 10.15887571314936, "_timestamp": 1722611185.6210454, "_runtime": 21569.131189346313, "_step": 20000, "evaluation/val_loss": 8.702303886413574, "evaluation/val_ppl": 6016.75830078125, "_wandb": {"runtime": 21568}}
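Not part of the original run artifacts: the summary metrics above are internally consistent under the usual definitions, which can be checked in a few lines. Assuming `training/perplexity = exp(training/loss)` and `stats/tokens_per_sec = seq_len * global_batch_size / iteration_time` (standard conventions, not confirmed by the repo):

```python
import math

# Values copied from wandb-summary.json above.
loss = 8.714340209960938
seq_len = 513
global_batch_size = 320
iteration_time = 1.1439334890019381  # stats/1_iteration_time

perplexity = math.exp(loss)                              # training/perplexity
tokens_per_sec = seq_len * global_batch_size / iteration_time  # stats/tokens_per_sec

print(perplexity)      # close to the logged 6089.615424118352
print(tokens_per_sec)  # close to the logged 143504.84672253692
```

Both derived values reproduce the logged metrics, which also confirms that throughput was computed per optimizer step over the full global batch.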
wandb/run-20240802_180656-l8nnlt0c/logs/debug.log ADDED
@@ -0,0 +1,30 @@
+ 2024-08-02 18:06:56,482 INFO MainThread:14630 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
+ 2024-08-02 18:06:56,482 INFO MainThread:14630 [wandb_setup.py:_flush():76] Configure stats pid to 14630
+ 2024-08-02 18:06:56,482 INFO MainThread:14630 [wandb_setup.py:_flush():76] Loading settings from /singularity_home/.config/wandb/settings
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_setup.py:_flush():76] Loading settings from /project/wandb/settings
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'api_key': '***REDACTED***', 'run_notes': 'Train tuny llama sample'}
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'examples/finetuning.py', 'program_abspath': '/project/examples/finetuning.py', 'program': '/project/examples/finetuning.py'}
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_init.py:_log_setup():526] Logging user logs to /project/wandb/run-20240802_180656-l8nnlt0c/logs/debug.log
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_init.py:_log_setup():527] Logging internal logs to /project/wandb/run-20240802_180656-l8nnlt0c/logs/debug-internal.log
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_init.py:init():566] calling init triggers
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
+ config: {'sharding_strategy': 'FULL_SHARD', 'checkpoint_type': 'LOCAL_STATE_DICT', 'fsdp_activation_checkpointing': True, 'fsdp_cpu_offload': False, 'low_cpu_fsdp': False, 'no_meta_device': False, 'data_path': None, 'split': '969, 30, 1', 'train_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'valid_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'test_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'data_cache_path': None, 'vocab_size': None, 'vocab_file': None, 'merge_file': None, 'seq_length': 512, 'num_workers': 2, 'tokenizer_type': 'Llama2Tokenizer', 'tokenizer_model': '/share/pretrained_lm/custom/tiny-mistral/tokenizer.model.v3', 'reset_position_ids': False, 'reset_attention_mask': False, 'eod_mask_loss': False, 'retro_return_doc_ids': False, 'short_seq_prob': 0.1, 'vocab_extra_ids': 0, 'seed': 1234, 'use_mpi': False, 'wandb_entity': 'iwakawa-koichi-q5-tohoku-nlp6723', 'wandb_name': 'tiny-mistral-sample_train_2024-08-02-18:06:43', 'wandb_project': 'llm_tutorial', 'quantization': False, 'use_freeze_layers': False, 'freeze_layers': None, 'bf16': True, 'fp16': False, 'mixed_precision': True, 'param_dtype': None, 'load': '/work/llm_recipes/models/tiny-mistral-sample', 'save': '/work/llm_recipes/models/tiny-mistral-sample', 'base_model': '/share/pretrained_lm/custom/tiny-mistral', 'use_better_transformer': False, 'grad_clip_norm': 1.0, 'eval_interval': 200, 'save_interval': 200, 'eval_iters': 10, 'optimizer': 'adam', 'lr': 2e-05, 'lr_decay_style': 'cosine', 'lr_decay_iters': 20000, 'lr_warmup_iters': 500, 'min_lr': 1e-06, 'train_iters': 20000, 'train_samples': None, 'global_batch_size': 320, 'micro_batch_size': 8, 'make_vocab_size_divisible_by': 128, 'sliding_window_size': 4096, 'skip_batch': None, 'no_save_optimizer_state': False, 'continual_pretraining': False, 'instruction_tuning': False, 'direct_preference_optimization': False, 'attention_dropout': 0.1, 'hidden_dropout': 0.1, 'weight_decay': 0.1, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_eps': 1e-06, 'hf_transformer_model_dir': None, 'instruction_train_data_path': None, 'instruction_valid_data_path': None, 'epoch': None, 'instruction_dataset_size': None, 'save_sampler_state': False, 'label_smoothing': 0.0, 'save_n_checkpoints': 10, 'hf_repo_id': 'koichi12/tiny-mistral-sample', 'create_public_hf_repo': False, 'upload_all_checkpoints_to_hf': False, 'hf_upload_retry_limit': 2, 'exit_duration_in_mins': None, 'source_key': None, 'target_key': None, 'attn_implementation': 'flash_attention_2', 'efficient_instruction_tuning': False, 'remove_padding_masking': False, 'save_start_iter': None, 'rank': 0, 'world_size': 1, 'padded_vocab_size': 32768, 'gradient_accumulation_steps': 40}
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_init.py:init():616] starting backend
+ 2024-08-02 18:06:56,483 INFO MainThread:14630 [wandb_init.py:init():620] setting up manager
+ 2024-08-02 18:06:56,488 INFO MainThread:14630 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
+ 2024-08-02 18:06:56,489 INFO MainThread:14630 [wandb_init.py:init():628] backend started and connected
+ 2024-08-02 18:06:56,494 INFO MainThread:14630 [wandb_init.py:init():720] updated telemetry
+ 2024-08-02 18:06:56,505 INFO MainThread:14630 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
+ 2024-08-02 18:06:56,963 INFO MainThread:14630 [wandb_run.py:_on_init():2262] communicating current version
+ 2024-08-02 18:06:57,043 INFO MainThread:14630 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.17.5 is available! To upgrade, please run:\n $ pip install wandb --upgrade"
+
+ 2024-08-02 18:06:57,043 INFO MainThread:14630 [wandb_init.py:init():804] starting run threads in backend
+ 2024-08-02 18:06:57,104 INFO MainThread:14630 [wandb_run.py:_console_start():2241] atexit reg
+ 2024-08-02 18:06:57,105 INFO MainThread:14630 [wandb_run.py:_redirect():2096] redirect: wrap_raw
+ 2024-08-02 18:06:57,105 INFO MainThread:14630 [wandb_run.py:_redirect():2161] Wrapping output streams.
+ 2024-08-02 18:06:57,105 INFO MainThread:14630 [wandb_run.py:_redirect():2186] Redirects installed.
+ 2024-08-02 18:06:57,106 INFO MainThread:14630 [wandb_init.py:init():847] run started, returning control to user process
+ 2024-08-02 18:06:58,607 INFO MainThread:14630 [wandb_run.py:_config_callback():1343] config_cb None None {'activation_function': 'silu', 'hidden_size': 256, 'model_type': 'mistral', 'max_position_embeddings': 512, 'num_attention_heads': 4, 'num_hidden_layers': 4, 'model_architecture': 'MistralForCausalLM'}
+ 2024-08-02 18:06:58,607 INFO MainThread:14630 [wandb_run.py:_config_callback():1343] config_cb None None {'world_size': 1}
+ 2024-08-03 00:06:33,941 WARNING MsgRouterThr:14630 [router.py:message_loop():77] message_loop has been closed
wandb/run-20240804_021608-l90yeme3/files/config.yaml ADDED
@@ -0,0 +1,335 @@
+ wandb_version: 1
+
+ sharding_strategy:
+   desc: null
+   value: FULL_SHARD
+ checkpoint_type:
+   desc: null
+   value: LOCAL_STATE_DICT
+ fsdp_activation_checkpointing:
+   desc: null
+   value: true
+ fsdp_cpu_offload:
+   desc: null
+   value: false
+ low_cpu_fsdp:
+   desc: null
+   value: false
+ no_meta_device:
+   desc: null
+   value: false
+ data_path:
+   desc: null
+   value: null
+ split:
+   desc: null
+   value: 969, 30, 1
+ train_data_path:
+   desc: null
+   value:
+   - '4013541'
+   - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ valid_data_path:
+   desc: null
+   value:
+   - '4013541'
+   - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ test_data_path:
+   desc: null
+   value:
+   - '4013541'
+   - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ data_cache_path:
+   desc: null
+   value: null
+ vocab_size:
+   desc: null
+   value: null
+ vocab_file:
+   desc: null
+   value: null
+ merge_file:
+   desc: null
+   value: null
+ seq_length:
+   desc: null
+   value: 1024
+ num_workers:
+   desc: null
+   value: 2
+ tokenizer_type:
+   desc: null
+   value: Llama2Tokenizer
+ tokenizer_model:
+   desc: null
+   value: /share/pretrained_lm/custom/tiny-mistral/tokenizer.model.v3
+ reset_position_ids:
+   desc: null
+   value: false
+ reset_attention_mask:
+   desc: null
+   value: false
+ eod_mask_loss:
+   desc: null
+   value: false
+ retro_return_doc_ids:
+   desc: null
+   value: false
+ short_seq_prob:
+   desc: null
+   value: 0.1
+ vocab_extra_ids:
+   desc: null
+   value: 0
+ seed:
+   desc: null
+   value: 1234
+ use_mpi:
+   desc: null
+   value: false
+ wandb_entity:
+   desc: null
+   value: iwakawa-koichi-q5-tohoku-nlp6723
+ wandb_name:
+   desc: null
+   value: tiny-mistral-sample5_train_2024-08-04-02:15:57
+ wandb_project:
+   desc: null
+   value: llm_tutorial
+ quantization:
+   desc: null
+   value: false
+ use_freeze_layers:
+   desc: null
+   value: false
+ freeze_layers:
+   desc: null
+   value: null
+ bf16:
+   desc: null
+   value: true
+ fp16:
+   desc: null
+   value: false
+ mixed_precision:
+   desc: null
+   value: true
+ param_dtype:
+   desc: null
+   value: null
+ load:
+   desc: null
+   value: /work/llm_recipes/models/tiny-mistral-sample5
+ save:
+   desc: null
+   value: /work/llm_recipes/models/tiny-mistral-sample5
+ base_model:
+   desc: null
+   value: /share/pretrained_lm/custom/tiny-mistral
+ use_better_transformer:
+   desc: null
+   value: false
+ grad_clip_norm:
+   desc: null
+   value: 1.0
+ eval_interval:
+   desc: null
+   value: 200
+ save_interval:
+   desc: null
+   value: 200
+ eval_iters:
+   desc: null
+   value: 10
+ optimizer:
+   desc: null
+   value: adam
+ lr:
+   desc: null
+   value: 2.0e-05
+ lr_decay_style:
+   desc: null
+   value: cosine
+ lr_decay_iters:
+   desc: null
+   value: 20000
+ lr_warmup_iters:
+   desc: null
+   value: 500
+ min_lr:
+   desc: null
+   value: 1.0e-06
+ train_iters:
+   desc: null
+   value: 20000
+ train_samples:
+   desc: null
+   value: null
+ global_batch_size:
+   desc: null
+   value: 320
+ micro_batch_size:
+   desc: null
+   value: 8
+ make_vocab_size_divisible_by:
+   desc: null
+   value: 128
+ sliding_window_size:
+   desc: null
+   value: 4096
+ skip_batch:
+   desc: null
+   value: null
+ no_save_optimizer_state:
+   desc: null
+   value: false
+ continual_pretraining:
+   desc: null
+   value: false
+ instruction_tuning:
+   desc: null
+   value: false
+ direct_preference_optimization:
+   desc: null
+   value: false
+ attention_dropout:
+   desc: null
+   value: 0.1
+ hidden_dropout:
+   desc: null
+   value: 0.1
+ weight_decay:
+   desc: null
+   value: 0.1
+ adam_beta1:
+   desc: null
+   value: 0.9
+ adam_beta2:
+   desc: null
+   value: 0.95
+ adam_eps:
+   desc: null
+   value: 1.0e-06
+ hf_transformer_model_dir:
+   desc: null
+   value: null
+ instruction_train_data_path:
+   desc: null
+   value: null
+ instruction_valid_data_path:
+   desc: null
+   value: null
+ epoch:
+   desc: null
+   value: null
+ instruction_dataset_size:
+   desc: null
+   value: null
+ save_sampler_state:
+   desc: null
+   value: false
+ label_smoothing:
+   desc: null
+   value: 0.0
+ save_n_checkpoints:
+   desc: null
+   value: 10
+ hf_repo_id:
+   desc: null
+   value: koichi12/tiny-mistral-sample5
+ create_public_hf_repo:
+   desc: null
+   value: false
+ upload_all_checkpoints_to_hf:
+   desc: null
+   value: false
+ hf_upload_retry_limit:
+   desc: null
+   value: 2
+ exit_duration_in_mins:
+   desc: null
+   value: null
+ source_key:
+   desc: null
+   value: null
+ target_key:
+   desc: null
+   value: null
+ attn_implementation:
+   desc: null
+   value: flash_attention_2
+ efficient_instruction_tuning:
+   desc: null
+   value: false
+ remove_padding_masking:
+   desc: null
+   value: false
+ save_start_iter:
+   desc: null
+   value: null
+ rank:
+   desc: null
+   value: 0
+ world_size:
+   desc: null
+   value: 1
+ padded_vocab_size:
+   desc: null
+   value: 32768
+ gradient_accumulation_steps:
+   desc: null
+   value: 40
+ _wandb:
+   desc: null
+   value:
+     python_version: 3.10.12
+     cli_version: 0.16.3
+     framework: huggingface
+     huggingface_version: 4.43.3
+     is_jupyter_run: false
+     is_kaggle_kernel: false
+     start_time: 1722705368.213775
+     t:
+       1:
+       - 1
+       - 11
+       - 49
+       - 55
+       - 71
+       2:
+       - 1
+       - 11
+       - 49
+       - 55
+       - 71
+       3:
+       - 13
+       - 16
+       - 23
+       4: 3.10.12
+       5: 0.16.3
+       6: 4.43.3
+       8:
+       - 5
+       13: linux-x86_64
+ activation_function:
+   desc: null
+   value: silu
+ hidden_size:
+   desc: null
+   value: 256
+ model_type:
+   desc: null
+   value: mistral
+ max_position_embeddings:
+   desc: null
+   value: 1024
+ num_attention_heads:
+   desc: null
+   value: 4
+ num_hidden_layers:
+   desc: null
+   value: 4
+ model_architecture:
+   desc: null
+   value: MistralForCausalLM
wandb/run-20240804_021608-l90yeme3/files/output.log ADDED
The diff for this file is too large to render. See raw diff
 
wandb/run-20240804_021608-l90yeme3/files/requirements.txt ADDED
@@ -0,0 +1,271 @@
+ absl-py==2.1.0
+ accelerate==0.33.0
+ aiohttp==3.9.1
+ aiosignal==1.3.1
+ annotated-types==0.6.0
+ apex==0.1
+ appdirs==1.4.4
+ argon2-cffi-bindings==21.2.0
+ argon2-cffi==23.1.0
+ asttokens==2.4.1
+ astunparse==1.6.3
+ async-timeout==4.0.3
+ attrs==23.2.0
+ audioread==3.0.1
+ beautifulsoup4==4.12.3
+ bleach==6.1.0
+ blis==0.7.11
+ cachetools==5.3.2
+ catalogue==2.0.10
+ certifi==2024.2.2
+ cffi==1.16.0
+ charset-normalizer==3.3.2
+ click==8.1.7
+ cloudpathlib==0.16.0
+ cloudpickle==3.0.0
+ cmake==3.28.1
+ colorama==0.4.6
+ comm==0.2.1
+ confection==0.1.4
+ contourpy==1.2.0
+ cubinlinker==0.3.0+2.g405ac64
+ cuda-python==12.3.0rc4+9.gdb8c48a.dirty
+ cudf==23.12.0
+ cugraph-dgl==23.12.0
+ cugraph-service-client==23.12.0
+ cugraph-service-server==23.12.0
+ cugraph==23.12.0
+ cuml==23.12.0
+ cupy-cuda12x==12.3.0
+ cycler==0.12.1
+ cymem==2.0.8
+ cython==3.0.8
+ dask-cuda==23.12.0
+ dask-cudf==23.12.0
+ dask==2023.11.0
+ debugpy==1.8.1
+ decorator==5.1.1
+ defusedxml==0.7.1
+ distributed==2023.11.0
+ dm-tree==0.1.8
+ docker-pycreds==0.4.0
+ einops==0.7.0
+ exceptiongroup==1.2.0
+ execnet==2.0.2
+ executing==2.0.1
+ expecttest==0.1.3
+ fastjsonschema==2.19.1
+ fastrlock==0.8.2
+ filelock==3.13.1
+ flash-attn==2.4.2
+ fonttools==4.48.1
+ frozenlist==1.4.1
+ fsspec==2023.12.2
+ gast==0.5.4
+ gitdb==4.0.11
+ gitpython==3.1.43
+ google-auth-oauthlib==0.4.6
+ google-auth==2.27.0
+ graphsurgeon==0.4.6
+ grpcio==1.60.1
+ huggingface-hub==0.24.5
+ hypothesis==5.35.1
+ idna==3.6
+ importlib-metadata==7.0.1
+ iniconfig==2.0.0
+ intel-openmp==2021.4.0
+ ipadic==1.0.0
+ ipykernel==6.29.2
+ ipython-genutils==0.2.0
+ ipython==8.21.0
+ jedi==0.19.1
+ jinja2==3.1.3
+ joblib==1.3.2
+ json5==0.9.14
+ jsonnet==0.19.1
+ jsonschema-specifications==2023.12.1
+ jsonschema==4.21.1
+ jupyter-client==8.6.0
+ jupyter-core==5.7.1
+ jupyter-tensorboard==0.2.0
+ jupyterlab-pygments==0.3.0
+ jupyterlab-server==1.2.0
+ jupyterlab==2.3.2
+ jupytext==1.16.1
+ kiwisolver==1.4.5
+ langcodes==3.3.0
+ lazy-loader==0.3
+ librosa==0.10.1
+ llvmlite==0.40.1
+ locket==1.0.0
+ logzero==1.7.0
+ lxml==5.2.2
+ markdown-it-py==3.0.0
+ markdown==3.5.2
+ markupsafe==2.1.4
+ matplotlib-inline==0.1.6
+ matplotlib==3.8.2
+ mdit-py-plugins==0.4.0
+ mdurl==0.1.2
+ mecab-python3==1.0.6
+ mistune==3.0.2
+ mkl-devel==2021.1.1
+ mkl-include==2021.1.1
+ mkl==2021.1.1
+ mock==5.1.0
+ more-itertools==9.1.0
+ mpmath==1.3.0
+ msgpack==1.0.7
+ multidict==6.0.4
+ murmurhash==1.0.10
+ nbclient==0.9.0
+ nbconvert==7.16.0
+ nbformat==5.9.2
+ nest-asyncio==1.6.0
+ networkx==2.6.3
+ ninja==1.11.1.1
+ nltk==3.8.1
+ notebook==6.4.10
+ numba==0.57.1+1.g1ff679645
+ numpy==1.24.4
+ nvfuser==0.1.4a0+d0bb811
+ nvidia-dali-cuda120==1.34.0
+ nvidia-pyindex==1.0.9
+ nvtx==0.2.5
+ oauthlib==3.2.2
+ onnx==1.15.0rc2
+ opencv==4.7.0
+ optree==0.10.0
+ packaging==23.2
+ pandas==1.5.3
+ pandocfilters==1.5.1
+ parso==0.8.3
+ partd==1.4.1
+ peft==0.11.1
+ pexpect==4.9.0
+ pillow==10.2.0
+ pip==24.0
+ platformdirs==4.2.0
+ pluggy==1.4.0
+ ply==3.11
+ polygraphy==0.49.4
+ pooch==1.8.0
+ portalocker==2.10.1
+ preshed==3.0.9
+ prettytable==3.9.0
+ prometheus-client==0.19.0
+ prompt-toolkit==3.0.43
+ protobuf==4.24.4
+ psutil==5.9.4
+ ptxcompiler==0.8.1+2.g0d406d6
+ ptyprocess==0.7.0
+ pure-eval==0.2.2
+ pyarrow==14.0.1.dev0+gba5374836.d20240125
+ pyasn1-modules==0.3.0
+ pyasn1==0.5.1
+ pybind11-global==2.11.1
+ pybind11==2.11.1
+ pycocotools==2.0+nv0.8.0
+ pycparser==2.21
+ pydantic-core==2.16.2
+ pydantic==2.6.1
+ pygments==2.17.2
+ pylibcugraph==23.12.0
+ pylibcugraphops==23.12.0
+ pylibraft==23.12.0
+ pynvml==11.4.1
+ pyparsing==3.1.1
+ pytest-flakefinder==1.1.0
+ pytest-rerunfailures==13.0
+ pytest-shard==0.1.2
+ pytest-xdist==3.5.0
+ pytest==8.0.0
+ python-dateutil==2.8.2
+ python-dotenv==1.0.0
+ python-hostlist==1.23.0
+ pytorch-quantization==2.1.2
+ pytz==2023.3.post1
+ pyyaml==6.0.1
+ pyzmq==25.1.2
+ raft-dask==23.12.0
+ rapids-dask-dependency==23.12.1
+ referencing==0.33.0
+ regex==2023.12.25
+ requests-oauthlib==1.3.1
+ requests==2.31.0
+ rich==13.7.0
+ rmm==23.12.0
+ rpds-py==0.17.1
+ rsa==4.9
+ sacrebleu==2.4.0
+ safetensors==0.4.3
+ scikit-learn==1.2.0
+ scipy==1.12.0
+ send2trash==1.8.2
+ sentencepiece==0.1.99
+ sentry-sdk==2.12.0
+ setproctitle==1.3.3
+ setuptools==68.2.2
+ six==1.16.0
+ smart-open==6.4.0
+ smmap==5.0.1
+ sortedcontainers==2.4.0
+ soundfile==0.12.1
+ soupsieve==2.5
+ soxr==0.3.7
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ spacy==3.7.2
+ sphinx-glpi-theme==0.6
+ srsly==2.4.8
+ stack-data==0.6.3
+ sympy==1.12
+ tabulate==0.9.0
+ tbb==2021.11.0
+ tblib==3.0.0
+ tensorboard-data-server==0.6.1
+ tensorboard-plugin-wit==1.8.1
+ tensorboard==2.9.0
+ tensorrt==8.6.3
+ terminado==0.18.0
+ termplotlib==0.3.9
+ thinc==8.2.3
+ threadpoolctl==3.2.0
+ thriftpy2==0.4.17
+ tinycss2==1.2.1
+ tokenizers==0.19.1
+ toml==0.10.2
+ tomli==2.0.1
+ toolz==0.12.1
+ torch-tensorrt==2.3.0a0
+ torch==2.3.0a0+ebedce2
+ torchdata==0.7.1a0
+ torchtext==0.17.0a0
+ torchvision==0.18.0a0
+ tornado==6.4
+ tqdm==4.66.1
+ traitlets==5.9.0
+ transformer-engine==1.3.0+5b90b7f
+ transformers==4.43.3
+ treelite-runtime==3.9.1
+ treelite==3.9.1
+ triton==2.2.0+e28a256
+ typer==0.9.0
+ types-dataclasses==0.6.6
+ typing-extensions==4.9.0
+ ucx-py==0.35.0
+ uff==0.6.9
+ ujson==5.8.0
+ urllib3==1.26.18
+ wandb==0.16.3
+ wasabi==1.1.2
+ wcwidth==0.2.13
+ weasel==0.3.4
+ webencodings==0.5.1
+ werkzeug==3.0.1
+ wheel==0.42.0
+ xdoctest==1.0.2
+ xgboost==1.7.6
+ yarl==1.9.4
+ zict==3.0.0
+ zipp==3.17.0
wandb/run-20240804_021608-l90yeme3/files/wandb-metadata.json ADDED
@@ -0,0 +1,215 @@
+ {
+   "os": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
+   "python": "3.10.12",
+   "heartbeatAt": "2024-08-03T17:16:08.874656",
+   "startedAt": "2024-08-03T17:16:08.183460",
+   "docker": null,
+   "cuda": null,
+   "args": [
+     "--seq-length",
+     "1024",
+     "--sliding-window-size",
+     "4096",
+     "--micro-batch-size",
+     "8",
+     "--global-batch-size",
+     "320",
+     "--train-iters",
+     "20000",
+     "--tokenizer-type",
+     "Llama2Tokenizer",
+     "--tokenizer-model",
+     "/share/pretrained_lm/custom/tiny-mistral/tokenizer.model.v3",
+     "--train-data-path",
+     "4013541",
+     "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+     "--valid-data-path",
+     "4013541",
+     "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+     "--test-data-path",
+     "4013541",
+     "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+     "--lr",
+     "2e-5",
+     "--min-lr",
+     "1e-6",
+     "--lr-decay-style",
+     "cosine",
+     "--lr-warmup-iters",
+     "500",
+     "--lr-decay-iters",
+     "20000",
+     "--weight-decay",
+     "0.1",
+     "--grad-clip-norm",
+     "1.0",
+     "--optimizer",
+     "adam",
+     "--adam-beta1",
+     "0.9",
+     "--adam-beta2",
+     "0.95",
+     "--adam-eps",
+     "1e-6",
+     "--save-interval",
+     "200",
+     "--eval-interval",
+     "200",
+     "--eval-iters",
+     "10",
+     "--bf16",
+     "--mixed-precision",
+     "--base-model",
+     "/share/pretrained_lm/custom/tiny-mistral",
+     "--save",
+     "/work/llm_recipes/models/tiny-mistral-sample5",
+     "--load",
+     "/work/llm_recipes/models/tiny-mistral-sample5",
+     "--fsdp-activation-checkpointing",
+     "--sharding-strategy",
+     "FULL_SHARD",
+     "--checkpoint-type",
+     "LOCAL_STATE_DICT",
+     "--save-n-checkpoints",
+     "10",
+     "--hf-upload-retry-limit",
+     "2",
+     "--hf-repo-id",
+     "koichi12/tiny-mistral-sample5",
+     "--wandb-entity",
+     "iwakawa-koichi-q5-tohoku-nlp6723",
+     "--wandb-project",
+     "llm_tutorial",
+     "--wandb-name",
+     "tiny-mistral-sample5_train_2024-08-04-02:15:57"
+   ],
+   "state": "running",
+   "program": "/project/examples/finetuning.py",
+   "codePathLocal": "examples/finetuning.py",
+   "codePath": "examples/finetuning.py",
+   "git": {
+     "remote": "https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git",
+     "commit": "3be5353210a678dc7008f237fa16b99f2bdf36ea"
+   },
+   "email": null,
+   "root": "/project",
+   "host": "gpu-koiwa-00",
+   "username": "koiwa",
+   "executable": "/usr/bin/python",
+   "cpu_count": 18,
+   "cpu_count_logical": 18,
+   "cpu_freq": {
+     "current": 2400.034,
+     "min": 0.0,
+     "max": 0.0
+   },
+   "cpu_freq_per_core": [
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     },
+     {
+       "current": 2400.034,
+       "min": 0.0,
+       "max": 0.0
+     }
+   ],
+   "disk": {
+     "/": {
+       "total": 0.0625,
+       "used": 1.1444091796875e-05
+     }
+   },
+   "gpu": "NVIDIA A100-SXM4-40GB",
+   "gpu_count": 1,
+   "gpu_devices": [
+     {
+       "name": "NVIDIA A100-SXM4-40GB",
+       "memory_total": 42949672960
+     }
+   ],
+   "memory": {
+     "total": 56.48782730102539
+   }
+ }
wandb/run-20240804_021608-l90yeme3/files/wandb-summary.json ADDED
@@ -0,0 +1 @@
+ {"training/loss": 8.691668510437012, "training/perplexity": 5953.106781607083, "utils/batch_size": 8, "utils/global_batch_size": 320, "utils/seq_len": 1025, "utils/gradient_accumulation_steps": 40, "utils/iteration": 3169, "optimizer/lr": 1.91351934671402e-05, "optimizer/variance_l2": 0.01361168373992461, "optimizer/variance_sqrt_l2": 0.9996963278311211, "optimizer/momentum_l2": 0.9700213365567774, "optimizer/weight_l2": 101.93656115447489, "optimizer/variance_l1": 0.9979248046875, "optimizer/variance_sqrt_l1": 598.25, "optimizer/momentum_l1": 420.75, "optimizer/weight_l1": 332992.0, "optimizer/variance_abs_max": 0.00130462646484375, "optimizer/variance_sqrt_abs_max": 0.0361328125, "optimizer/momentum_abs_max": 0.03515625, "optimizer/weight_abs_max": 1.0, "stats/1_iteration_time": 1.5153617600008147, "stats/tokens_per_sec": 216449.96505641245, "stats/tokens_per_sec_per_gpu": 216449.96505641245, "stats/tflops": 16.684531271256578, "_timestamp": 1722710244.673399, "_runtime": 4876.459624052048, "_step": 3169, "evaluation/val_loss": 8.686346054077148, "evaluation/val_ppl": 5921.50634765625, "_wandb": {"runtime": 4876}}
wandb/run-20240804_021608-l90yeme3/logs/debug-internal.log ADDED
The diff for this file is too large to render. See raw diff
 
wandb/run-20240804_021608-l90yeme3/logs/debug.log ADDED
@@ -0,0 +1,29 @@
+ 2024-08-04 02:16:08,207 INFO MainThread:11734 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
+ 2024-08-04 02:16:08,207 INFO MainThread:11734 [wandb_setup.py:_flush():76] Configure stats pid to 11734
+ 2024-08-04 02:16:08,207 INFO MainThread:11734 [wandb_setup.py:_flush():76] Loading settings from /singularity_home/.config/wandb/settings
+ 2024-08-04 02:16:08,207 INFO MainThread:11734 [wandb_setup.py:_flush():76] Loading settings from /project/wandb/settings
+ 2024-08-04 02:16:08,207 INFO MainThread:11734 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'api_key': '***REDACTED***', 'run_notes': 'Train tuny llama sample'}
+ 2024-08-04 02:16:08,207 INFO MainThread:11734 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
+ 2024-08-04 02:16:08,207 INFO MainThread:11734 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'examples/finetuning.py', 'program_abspath': '/project/examples/finetuning.py', 'program': '/project/examples/finetuning.py'}
+ 2024-08-04 02:16:08,207 INFO MainThread:11734 [wandb_init.py:_log_setup():526] Logging user logs to /project/wandb/run-20240804_021608-l90yeme3/logs/debug.log
+ 2024-08-04 02:16:08,208 INFO MainThread:11734 [wandb_init.py:_log_setup():527] Logging internal logs to /project/wandb/run-20240804_021608-l90yeme3/logs/debug-internal.log
+ 2024-08-04 02:16:08,208 INFO MainThread:11734 [wandb_init.py:init():566] calling init triggers
+ 2024-08-04 02:16:08,208 INFO MainThread:11734 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
+ config: {'sharding_strategy': 'FULL_SHARD', 'checkpoint_type': 'LOCAL_STATE_DICT', 'fsdp_activation_checkpointing': True, 'fsdp_cpu_offload': False, 'low_cpu_fsdp': False, 'no_meta_device': False, 'data_path': None, 'split': '969, 30, 1', 'train_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'valid_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'test_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'data_cache_path': None, 'vocab_size': None, 'vocab_file': None, 'merge_file': None, 'seq_length': 1024, 'num_workers': 2, 'tokenizer_type': 'Llama2Tokenizer', 'tokenizer_model': '/share/pretrained_lm/custom/tiny-mistral/tokenizer.model.v3', 'reset_position_ids': False, 'reset_attention_mask': False, 'eod_mask_loss': False, 'retro_return_doc_ids': False, 'short_seq_prob': 0.1, 'vocab_extra_ids': 0, 'seed': 1234, 'use_mpi': False, 'wandb_entity': 'iwakawa-koichi-q5-tohoku-nlp6723', 'wandb_name': 'tiny-mistral-sample5_train_2024-08-04-02:15:57', 'wandb_project': 'llm_tutorial', 'quantization': False, 'use_freeze_layers': False, 'freeze_layers': None, 'bf16': True, 'fp16': False, 'mixed_precision': True, 'param_dtype': None, 'load': '/work/llm_recipes/models/tiny-mistral-sample5', 'save': '/work/llm_recipes/models/tiny-mistral-sample5', 'base_model': '/share/pretrained_lm/custom/tiny-mistral', 'use_better_transformer': False, 'grad_clip_norm': 1.0, 'eval_interval': 200, 'save_interval': 200, 'eval_iters': 10, 'optimizer': 'adam', 'lr': 2e-05, 'lr_decay_style': 'cosine', 'lr_decay_iters': 20000, 'lr_warmup_iters': 500, 'min_lr': 1e-06, 'train_iters': 20000, 'train_samples': None, 'global_batch_size': 320, 'micro_batch_size': 8, 'make_vocab_size_divisible_by': 128, 'sliding_window_size': 4096, 'skip_batch': None, 'no_save_optimizer_state': False, 'continual_pretraining': False, 'instruction_tuning': False, 'direct_preference_optimization': False, 'attention_dropout': 0.1, 'hidden_dropout': 0.1, 'weight_decay': 0.1, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_eps': 1e-06, 'hf_transformer_model_dir': None, 'instruction_train_data_path': None, 'instruction_valid_data_path': None, 'epoch': None, 'instruction_dataset_size': None, 'save_sampler_state': False, 'label_smoothing': 0.0, 'save_n_checkpoints': 10, 'hf_repo_id': 'koichi12/tiny-mistral-sample5', 'create_public_hf_repo': False, 'upload_all_checkpoints_to_hf': False, 'hf_upload_retry_limit': 2, 'exit_duration_in_mins': None, 'source_key': None, 'target_key': None, 'attn_implementation': 'flash_attention_2', 'efficient_instruction_tuning': False, 'remove_padding_masking': False, 'save_start_iter': None, 'rank': 0, 'world_size': 1, 'padded_vocab_size': 32768, 'gradient_accumulation_steps': 40}
+ 2024-08-04 02:16:08,208 INFO MainThread:11734 [wandb_init.py:init():616] starting backend
+ 2024-08-04 02:16:08,208 INFO MainThread:11734 [wandb_init.py:init():620] setting up manager
+ 2024-08-04 02:16:08,212 INFO MainThread:11734 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
+ 2024-08-04 02:16:08,213 INFO MainThread:11734 [wandb_init.py:init():628] backend started and connected
+ 2024-08-04 02:16:08,218 INFO MainThread:11734 [wandb_init.py:init():720] updated telemetry
+ 2024-08-04 02:16:08,228 INFO MainThread:11734 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
+ 2024-08-04 02:16:08,766 INFO MainThread:11734 [wandb_run.py:_on_init():2262] communicating current version
+ 2024-08-04 02:16:08,847 INFO MainThread:11734 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.17.5 is available!  To upgrade, please run:\n $ pip install wandb --upgrade"
+
+ 2024-08-04 02:16:08,847 INFO MainThread:11734 [wandb_init.py:init():804] starting run threads in backend
+ 2024-08-04 02:16:08,953 INFO MainThread:11734 [wandb_run.py:_console_start():2241] atexit reg
+ 2024-08-04 02:16:08,953 INFO MainThread:11734 [wandb_run.py:_redirect():2096] redirect: wrap_raw
+ 2024-08-04 02:16:08,954 INFO MainThread:11734 [wandb_run.py:_redirect():2161] Wrapping output streams.
+ 2024-08-04 02:16:08,954 INFO MainThread:11734 [wandb_run.py:_redirect():2186] Redirects installed.
+ 2024-08-04 02:16:08,954 INFO MainThread:11734 [wandb_init.py:init():847] run started, returning control to user process
+ 2024-08-04 02:16:09,857 INFO MainThread:11734 [wandb_run.py:_config_callback():1343] config_cb None None {'activation_function': 'silu', 'hidden_size': 256, 'model_type': 'mistral', 'max_position_embeddings': 1024, 'num_attention_heads': 4, 'num_hidden_layers': 4, 'model_architecture': 'MistralForCausalLM'}
+ 2024-08-04 02:16:09,857 INFO MainThread:11734 [wandb_run.py:_config_callback():1343] config_cb None None {'world_size': 1}
wandb/run-20240804_035906-457c7q3q/files/config.yaml ADDED
@@ -0,0 +1,335 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ wandb_version: 1
2
+
3
+ sharding_strategy:
4
+ desc: null
5
+ value: FULL_SHARD
6
+ checkpoint_type:
7
+ desc: null
8
+ value: LOCAL_STATE_DICT
9
+ fsdp_activation_checkpointing:
10
+ desc: null
11
+ value: true
12
+ fsdp_cpu_offload:
13
+ desc: null
14
+ value: false
15
+ low_cpu_fsdp:
16
+ desc: null
17
+ value: false
18
+ no_meta_device:
19
+ desc: null
20
+ value: false
21
+ data_path:
22
+ desc: null
23
+ value: null
24
+ split:
25
+ desc: null
26
+ value: 969, 30, 1
27
+ train_data_path:
28
+ desc: null
29
+ value:
30
+ - '4013541'
31
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
32
+ valid_data_path:
33
+ desc: null
34
+ value:
35
+ - '4013541'
36
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ test_data_path:
+ desc: null
+ value:
+ - '4013541'
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ data_cache_path:
+ desc: null
+ value: null
+ vocab_size:
+ desc: null
+ value: null
+ vocab_file:
+ desc: null
+ value: null
+ merge_file:
+ desc: null
+ value: null
+ seq_length:
+ desc: null
+ value: 512
+ num_workers:
+ desc: null
+ value: 2
+ tokenizer_type:
+ desc: null
+ value: Llama2Tokenizer
+ tokenizer_model:
+ desc: null
+ value: /share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model
+ reset_position_ids:
+ desc: null
+ value: false
+ reset_attention_mask:
+ desc: null
+ value: false
+ eod_mask_loss:
+ desc: null
+ value: false
+ retro_return_doc_ids:
+ desc: null
+ value: false
+ short_seq_prob:
+ desc: null
+ value: 0.1
+ vocab_extra_ids:
+ desc: null
+ value: 0
+ seed:
+ desc: null
+ value: 1234
+ use_mpi:
+ desc: null
+ value: false
+ wandb_entity:
+ desc: null
+ value: iwakawa-koichi-q5-tohoku-nlp6723
+ wandb_name:
+ desc: null
+ value: tiny-llama-sample_train_2024-08-04-03:58:55
+ wandb_project:
+ desc: null
+ value: llm_tutorial
+ quantization:
+ desc: null
+ value: false
+ use_freeze_layers:
+ desc: null
+ value: false
+ freeze_layers:
+ desc: null
+ value: null
+ bf16:
+ desc: null
+ value: true
+ fp16:
+ desc: null
+ value: false
+ mixed_precision:
+ desc: null
+ value: true
+ param_dtype:
+ desc: null
+ value: null
+ load:
+ desc: null
+ value: /work/llm_recipes/models/tiny-llama-sample
+ save:
+ desc: null
+ value: /work/llm_recipes/models/tiny-llama-sample
+ base_model:
+ desc: null
+ value: /share/pretrained_lm/meta-llama/TinyLlama_v1.1
+ use_better_transformer:
+ desc: null
+ value: false
+ grad_clip_norm:
+ desc: null
+ value: 1.0
+ eval_interval:
+ desc: null
+ value: 200
+ save_interval:
+ desc: null
+ value: 200
+ eval_iters:
+ desc: null
+ value: 10
+ optimizer:
+ desc: null
+ value: adam
+ lr:
+ desc: null
+ value: 2.0e-05
+ lr_decay_style:
+ desc: null
+ value: cosine
+ lr_decay_iters:
+ desc: null
+ value: 2000
+ lr_warmup_iters:
+ desc: null
+ value: 500
+ min_lr:
+ desc: null
+ value: 1.0e-06
+ train_iters:
+ desc: null
+ value: 2000
+ train_samples:
+ desc: null
+ value: null
+ global_batch_size:
+ desc: null
+ value: 320
+ micro_batch_size:
+ desc: null
+ value: 8
+ make_vocab_size_divisible_by:
+ desc: null
+ value: 128
+ sliding_window_size:
+ desc: null
+ value: 4096
+ skip_batch:
+ desc: null
+ value: null
+ no_save_optimizer_state:
+ desc: null
+ value: false
+ continual_pretraining:
+ desc: null
+ value: false
+ instruction_tuning:
+ desc: null
+ value: false
+ direct_preference_optimization:
+ desc: null
+ value: false
+ attention_dropout:
+ desc: null
+ value: 0.1
+ hidden_dropout:
+ desc: null
+ value: 0.1
+ weight_decay:
+ desc: null
+ value: 0.1
+ adam_beta1:
+ desc: null
+ value: 0.9
+ adam_beta2:
+ desc: null
+ value: 0.95
+ adam_eps:
+ desc: null
+ value: 1.0e-06
+ hf_transformer_model_dir:
+ desc: null
+ value: null
+ instruction_train_data_path:
+ desc: null
+ value: null
+ instruction_valid_data_path:
+ desc: null
+ value: null
+ epoch:
+ desc: null
+ value: null
+ instruction_dataset_size:
+ desc: null
+ value: null
+ save_sampler_state:
+ desc: null
+ value: false
+ label_smoothing:
+ desc: null
+ value: 0.0
+ save_n_checkpoints:
+ desc: null
+ value: 10
+ hf_repo_id:
+ desc: null
+ value: koichi12/tiny-llama-sample
+ create_public_hf_repo:
+ desc: null
+ value: false
+ upload_all_checkpoints_to_hf:
+ desc: null
+ value: false
+ hf_upload_retry_limit:
+ desc: null
+ value: 2
+ exit_duration_in_mins:
+ desc: null
+ value: null
+ source_key:
+ desc: null
+ value: null
+ target_key:
+ desc: null
+ value: null
+ attn_implementation:
+ desc: null
+ value: flash_attention_2
+ efficient_instruction_tuning:
+ desc: null
+ value: false
+ remove_padding_masking:
+ desc: null
+ value: false
+ save_start_iter:
+ desc: null
+ value: null
+ rank:
+ desc: null
+ value: 0
+ world_size:
+ desc: null
+ value: 1
+ padded_vocab_size:
+ desc: null
+ value: 32000
+ gradient_accumulation_steps:
+ desc: null
+ value: 40
+ _wandb:
+ desc: null
+ value:
+ python_version: 3.10.12
+ cli_version: 0.16.3
+ framework: huggingface
+ huggingface_version: 4.43.3
+ is_jupyter_run: false
+ is_kaggle_kernel: false
+ start_time: 1722711546.225609
+ t:
+ 1:
+ - 1
+ - 11
+ - 49
+ - 55
+ - 71
+ 2:
+ - 1
+ - 11
+ - 49
+ - 55
+ - 71
+ 3:
+ - 13
+ - 16
+ - 23
+ 4: 3.10.12
+ 5: 0.16.3
+ 6: 4.43.3
+ 8:
+ - 5
+ 13: linux-x86_64
+ activation_function:
+ desc: null
+ value: silu
+ hidden_size:
+ desc: null
+ value: 2048
+ model_type:
+ desc: null
+ value: llama
+ max_position_embeddings:
+ desc: null
+ value: 2048
+ num_attention_heads:
+ desc: null
+ value: 32
+ num_hidden_layers:
+ desc: null
+ value: 22
+ model_architecture:
+ desc: null
+ value: LlamaForCausalLM
wandb/run-20240804_035906-457c7q3q/files/output.log ADDED
@@ -0,0 +1,130 @@
+ Created Hugging Face repository with ID koichi12/tiny-llama-sample.
+ Clearing GPU cache for all ranks
+ --> Running with torch torch_distributed debug set to detail
+ File not found: /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ File not found: /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ File not found: /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ No checkpoint found in /work/llm_recipes/models/tiny-llama-sample, skipping model loading
+ --> Model /share/pretrained_lm/meta-llama/TinyLlama_v1.1
+ --> /share/pretrained_lm/meta-llama/TinyLlama_v1.1 has 1100.048384 Million params
+ You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
+ Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
+ Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
+ /usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_init_utils.py:441: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
+ warnings.warn(
+ BFloat16 enabled for mixed precision - using bfSixteen policy
+ --> applying fsdp activation checkpointing...
+ > datasets target sizes (minimum size):
+ train: 640000
+ validation: 35200
+ test: 3200
+ > building train, validation, and test datasets for GPT ...
+ > finished creating GPT datasets ...
+ File not found: /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ No checkpoint found in /work/llm_recipes/models/tiny-llama-sample, skipping optimizer loading
+ File not found: /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama-sample/latest_iteration.txt
+ model info: FullyShardedDataParallel(
+ (_fsdp_wrapped_module): LlamaForCausalLM(
+ (model): LlamaModel(
+ (embed_tokens): Embedding(32000, 2048)
+ (layers): ModuleList(
+ (0-21): 22 x FullyShardedDataParallel(
+ (_fsdp_wrapped_module): CheckpointWrapper(
+ (_checkpoint_wrapped_module): LlamaDecoderLayer(
+ (self_attn): LlamaFlashAttention2(
+ (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
+ (k_proj): Linear(in_features=2048, out_features=256, bias=False)
+ (v_proj): Linear(in_features=2048, out_features=256, bias=False)
+ (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
+ (rotary_emb): LlamaRotaryEmbedding()
+ )
+ (mlp): LlamaMLP(
+ (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
+ (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
+ (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
+ (act_fn): SiLU()
+ )
+ (input_layernorm): LlamaRMSNorm()
+ (post_attention_layernorm): LlamaRMSNorm()
+ )
+ )
+ )
+ )
+ (norm): LlamaRMSNorm()
+ (rotary_emb): LlamaRotaryEmbedding()
+ )
+ (lm_head): Linear(in_features=2048, out_features=32000, bias=False)
+ )
+ )
+ model config: LlamaConfig {
+ "_name_or_path": "/share/pretrained_lm/meta-llama/TinyLlama_v1.1",
+ "architectures": [
+ "LlamaForCausalLM"
+ ],
+ "attention_bias": false,
+ "attention_dropout": 0.0,
+ "bos_token_id": 1,
+ "eos_token_id": 2,
+ "hidden_act": "silu",
+ "hidden_size": 2048,
+ "initializer_range": 0.02,
+ "intermediate_size": 5632,
+ "label_smoothing": 0.0,
+ "max_position_embeddings": 2048,
+ "mlp_bias": false,
+ "model_type": "llama",
+ "num_attention_heads": 32,
+ "num_hidden_layers": 22,
+ "num_key_value_heads": 4,
+ "pretraining_tp": 1,
+ "rms_norm_eps": 1e-05,
+ "rope_scaling": null,
+ "rope_theta": 10000.0,
+ "tie_word_embeddings": false,
+ "torch_dtype": "float32",
+ "transformers_version": "4.43.3",
+ "use_cache": false,
+ "vocab_size": 32000
+ }
+ Let split = None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Traceback (most recent call last):
+ File "/project/examples/finetuning.py", line 13, in <module>
+ main()
+ File "/project/src/llama_recipes/finetuning.py", line 281, in main
+ train(
+ File "/project/src/llama_recipes/utils/train_utils.py", line 110, in train
+ loss: torch.Tensor = model(**batch).loss
+ File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
+ return forward_call(*args, **kwargs)
+ File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 849, in forward
+ output = self._fsdp_wrapped_module(*args, **kwargs)
+ File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
+ return forward_call(*args, **kwargs)
+ File "/project/lib/transformers/src/transformers/models/llama/modeling_llama.py", line 1141, in forward
+ outputs = self.model(
+ File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
+ return forward_call(*args, **kwargs)
+ File "/project/lib/transformers/src/transformers/models/llama/modeling_llama.py", line 908, in forward
+ cache_position = torch.arange(
+ RuntimeError: CUDA error: device-side assert triggered
+ CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
+ For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
+ Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
wandb/run-20240804_035906-457c7q3q/files/requirements.txt ADDED
@@ -0,0 +1,271 @@
+ absl-py==2.1.0
+ accelerate==0.33.0
+ aiohttp==3.9.1
+ aiosignal==1.3.1
+ annotated-types==0.6.0
+ apex==0.1
+ appdirs==1.4.4
+ argon2-cffi-bindings==21.2.0
+ argon2-cffi==23.1.0
+ asttokens==2.4.1
+ astunparse==1.6.3
+ async-timeout==4.0.3
+ attrs==23.2.0
+ audioread==3.0.1
+ beautifulsoup4==4.12.3
+ bleach==6.1.0
+ blis==0.7.11
+ cachetools==5.3.2
+ catalogue==2.0.10
+ certifi==2024.2.2
+ cffi==1.16.0
+ charset-normalizer==3.3.2
+ click==8.1.7
+ cloudpathlib==0.16.0
+ cloudpickle==3.0.0
+ cmake==3.28.1
+ colorama==0.4.6
+ comm==0.2.1
+ confection==0.1.4
+ contourpy==1.2.0
+ cubinlinker==0.3.0+2.g405ac64
+ cuda-python==12.3.0rc4+9.gdb8c48a.dirty
+ cudf==23.12.0
+ cugraph-dgl==23.12.0
+ cugraph-service-client==23.12.0
+ cugraph-service-server==23.12.0
+ cugraph==23.12.0
+ cuml==23.12.0
+ cupy-cuda12x==12.3.0
+ cycler==0.12.1
+ cymem==2.0.8
+ cython==3.0.8
+ dask-cuda==23.12.0
+ dask-cudf==23.12.0
+ dask==2023.11.0
+ debugpy==1.8.1
+ decorator==5.1.1
+ defusedxml==0.7.1
+ distributed==2023.11.0
+ dm-tree==0.1.8
+ docker-pycreds==0.4.0
+ einops==0.7.0
+ exceptiongroup==1.2.0
+ execnet==2.0.2
+ executing==2.0.1
+ expecttest==0.1.3
+ fastjsonschema==2.19.1
+ fastrlock==0.8.2
+ filelock==3.13.1
+ flash-attn==2.4.2
+ fonttools==4.48.1
+ frozenlist==1.4.1
+ fsspec==2023.12.2
+ gast==0.5.4
+ gitdb==4.0.11
+ gitpython==3.1.43
+ google-auth-oauthlib==0.4.6
+ google-auth==2.27.0
+ graphsurgeon==0.4.6
+ grpcio==1.60.1
+ huggingface-hub==0.24.5
+ hypothesis==5.35.1
+ idna==3.6
+ importlib-metadata==7.0.1
+ iniconfig==2.0.0
+ intel-openmp==2021.4.0
+ ipadic==1.0.0
+ ipykernel==6.29.2
+ ipython-genutils==0.2.0
+ ipython==8.21.0
+ jedi==0.19.1
+ jinja2==3.1.3
+ joblib==1.3.2
+ json5==0.9.14
+ jsonnet==0.19.1
+ jsonschema-specifications==2023.12.1
+ jsonschema==4.21.1
+ jupyter-client==8.6.0
+ jupyter-core==5.7.1
+ jupyter-tensorboard==0.2.0
+ jupyterlab-pygments==0.3.0
+ jupyterlab-server==1.2.0
+ jupyterlab==2.3.2
+ jupytext==1.16.1
+ kiwisolver==1.4.5
+ langcodes==3.3.0
+ lazy-loader==0.3
+ librosa==0.10.1
+ llvmlite==0.40.1
+ locket==1.0.0
+ logzero==1.7.0
+ lxml==5.2.2
+ markdown-it-py==3.0.0
+ markdown==3.5.2
+ markupsafe==2.1.4
+ matplotlib-inline==0.1.6
+ matplotlib==3.8.2
+ mdit-py-plugins==0.4.0
+ mdurl==0.1.2
+ mecab-python3==1.0.6
+ mistune==3.0.2
+ mkl-devel==2021.1.1
+ mkl-include==2021.1.1
+ mkl==2021.1.1
+ mock==5.1.0
+ more-itertools==9.1.0
+ mpmath==1.3.0
+ msgpack==1.0.7
+ multidict==6.0.4
+ murmurhash==1.0.10
+ nbclient==0.9.0
+ nbconvert==7.16.0
+ nbformat==5.9.2
+ nest-asyncio==1.6.0
+ networkx==2.6.3
+ ninja==1.11.1.1
+ nltk==3.8.1
+ notebook==6.4.10
+ numba==0.57.1+1.g1ff679645
+ numpy==1.24.4
+ nvfuser==0.1.4a0+d0bb811
+ nvidia-dali-cuda120==1.34.0
+ nvidia-pyindex==1.0.9
+ nvtx==0.2.5
+ oauthlib==3.2.2
+ onnx==1.15.0rc2
+ opencv==4.7.0
+ optree==0.10.0
+ packaging==23.2
+ pandas==1.5.3
+ pandocfilters==1.5.1
+ parso==0.8.3
+ partd==1.4.1
+ peft==0.11.1
+ pexpect==4.9.0
+ pillow==10.2.0
+ pip==24.0
+ platformdirs==4.2.0
+ pluggy==1.4.0
+ ply==3.11
+ polygraphy==0.49.4
+ pooch==1.8.0
+ portalocker==2.10.1
+ preshed==3.0.9
+ prettytable==3.9.0
+ prometheus-client==0.19.0
+ prompt-toolkit==3.0.43
+ protobuf==4.24.4
+ psutil==5.9.4
+ ptxcompiler==0.8.1+2.g0d406d6
+ ptyprocess==0.7.0
+ pure-eval==0.2.2
+ pyarrow==14.0.1.dev0+gba5374836.d20240125
+ pyasn1-modules==0.3.0
+ pyasn1==0.5.1
+ pybind11-global==2.11.1
+ pybind11==2.11.1
+ pycocotools==2.0+nv0.8.0
+ pycparser==2.21
+ pydantic-core==2.16.2
+ pydantic==2.6.1
+ pygments==2.17.2
+ pylibcugraph==23.12.0
+ pylibcugraphops==23.12.0
+ pylibraft==23.12.0
+ pynvml==11.4.1
+ pyparsing==3.1.1
+ pytest-flakefinder==1.1.0
+ pytest-rerunfailures==13.0
+ pytest-shard==0.1.2
+ pytest-xdist==3.5.0
+ pytest==8.0.0
+ python-dateutil==2.8.2
+ python-dotenv==1.0.0
+ python-hostlist==1.23.0
+ pytorch-quantization==2.1.2
+ pytz==2023.3.post1
+ pyyaml==6.0.1
+ pyzmq==25.1.2
+ raft-dask==23.12.0
+ rapids-dask-dependency==23.12.1
+ referencing==0.33.0
+ regex==2023.12.25
+ requests-oauthlib==1.3.1
+ requests==2.31.0
+ rich==13.7.0
+ rmm==23.12.0
+ rpds-py==0.17.1
+ rsa==4.9
+ sacrebleu==2.4.0
+ safetensors==0.4.3
+ scikit-learn==1.2.0
+ scipy==1.12.0
+ send2trash==1.8.2
+ sentencepiece==0.1.99
+ sentry-sdk==2.12.0
+ setproctitle==1.3.3
+ setuptools==68.2.2
+ six==1.16.0
+ smart-open==6.4.0
+ smmap==5.0.1
+ sortedcontainers==2.4.0
+ soundfile==0.12.1
+ soupsieve==2.5
+ soxr==0.3.7
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ spacy==3.7.2
+ sphinx-glpi-theme==0.6
+ srsly==2.4.8
+ stack-data==0.6.3
+ sympy==1.12
+ tabulate==0.9.0
+ tbb==2021.11.0
+ tblib==3.0.0
+ tensorboard-data-server==0.6.1
+ tensorboard-plugin-wit==1.8.1
+ tensorboard==2.9.0
+ tensorrt==8.6.3
+ terminado==0.18.0
+ termplotlib==0.3.9
+ thinc==8.2.3
+ threadpoolctl==3.2.0
+ thriftpy2==0.4.17
+ tinycss2==1.2.1
+ tokenizers==0.19.1
+ toml==0.10.2
+ tomli==2.0.1
+ toolz==0.12.1
+ torch-tensorrt==2.3.0a0
+ torch==2.3.0a0+ebedce2
+ torchdata==0.7.1a0
+ torchtext==0.17.0a0
+ torchvision==0.18.0a0
+ tornado==6.4
+ tqdm==4.66.1
+ traitlets==5.9.0
+ transformer-engine==1.3.0+5b90b7f
+ transformers==4.43.3
+ treelite-runtime==3.9.1
+ treelite==3.9.1
+ triton==2.2.0+e28a256
+ typer==0.9.0
+ types-dataclasses==0.6.6
+ typing-extensions==4.9.0
+ ucx-py==0.35.0
+ uff==0.6.9
+ ujson==5.8.0
+ urllib3==1.26.18
+ wandb==0.16.3
+ wasabi==1.1.2
+ wcwidth==0.2.13
+ weasel==0.3.4
+ webencodings==0.5.1
+ werkzeug==3.0.1
+ wheel==0.42.0
+ xdoctest==1.0.2
+ xgboost==1.7.6
+ yarl==1.9.4
+ zict==3.0.0
+ zipp==3.17.0
wandb/run-20240804_035906-457c7q3q/files/wandb-metadata.json ADDED
@@ -0,0 +1,215 @@
+ {
+ "os": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
+ "python": "3.10.12",
+ "heartbeatAt": "2024-08-03T18:59:06.856800",
+ "startedAt": "2024-08-03T18:59:06.213352",
+ "docker": null,
+ "cuda": null,
+ "args": [
+ "--seq-length",
+ "512",
+ "--sliding-window-size",
+ "4096",
+ "--micro-batch-size",
+ "8",
+ "--global-batch-size",
+ "320",
+ "--train-iters",
+ "2000",
+ "--tokenizer-type",
+ "Llama2Tokenizer",
+ "--tokenizer-model",
+ "/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model",
+ "--train-data-path",
+ "4013541",
+ "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+ "--valid-data-path",
+ "4013541",
+ "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+ "--test-data-path",
+ "4013541",
+ "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+ "--lr",
+ "2e-5",
+ "--min-lr",
+ "1e-6",
+ "--lr-decay-style",
+ "cosine",
+ "--lr-warmup-iters",
+ "500",
+ "--lr-decay-iters",
+ "2000",
+ "--weight-decay",
+ "0.1",
+ "--grad-clip-norm",
+ "1.0",
+ "--optimizer",
+ "adam",
+ "--adam-beta1",
+ "0.9",
+ "--adam-beta2",
+ "0.95",
+ "--adam-eps",
+ "1e-6",
+ "--save-interval",
+ "200",
+ "--eval-interval",
+ "200",
+ "--eval-iters",
+ "10",
+ "--bf16",
+ "--mixed-precision",
+ "--base-model",
+ "/share/pretrained_lm/meta-llama/TinyLlama_v1.1",
+ "--save",
+ "/work/llm_recipes/models/tiny-llama-sample",
+ "--load",
+ "/work/llm_recipes/models/tiny-llama-sample",
+ "--fsdp-activation-checkpointing",
+ "--sharding-strategy",
+ "FULL_SHARD",
+ "--checkpoint-type",
+ "LOCAL_STATE_DICT",
+ "--save-n-checkpoints",
+ "10",
+ "--hf-upload-retry-limit",
+ "2",
+ "--hf-repo-id",
+ "koichi12/tiny-llama-sample",
+ "--wandb-entity",
+ "iwakawa-koichi-q5-tohoku-nlp6723",
+ "--wandb-project",
+ "llm_tutorial",
+ "--wandb-name",
+ "tiny-llama-sample_train_2024-08-04-03:58:55"
+ ],
+ "state": "running",
+ "program": "/project/examples/finetuning.py",
+ "codePathLocal": "examples/finetuning.py",
+ "codePath": "examples/finetuning.py",
+ "git": {
+ "remote": "https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git",
+ "commit": "3be5353210a678dc7008f237fa16b99f2bdf36ea"
+ },
+ "email": null,
+ "root": "/project",
+ "host": "gpu-koiwa-00",
+ "username": "koiwa",
+ "executable": "/usr/bin/python",
+ "cpu_count": 18,
+ "cpu_count_logical": 18,
+ "cpu_freq": {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ "cpu_freq_per_core": [
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.034,
+ "min": 0.0,
+ "max": 0.0
+ }
+ ],
+ "disk": {
+ "/": {
+ "total": 0.0625,
+ "used": 1.1444091796875e-05
+ }
+ },
+ "gpu": "NVIDIA A100-SXM4-40GB",
+ "gpu_count": 1,
+ "gpu_devices": [
+ {
+ "name": "NVIDIA A100-SXM4-40GB",
+ "memory_total": 42949672960
+ }
+ ],
+ "memory": {
+ "total": 56.48782730102539
+ }
+ }
wandb/run-20240804_035906-457c7q3q/files/wandb-summary.json ADDED
@@ -0,0 +1 @@
+ {"_wandb": {"runtime": 3}}
wandb/run-20240804_035906-457c7q3q/logs/debug-internal.log ADDED
@@ -0,0 +1,186 @@
+ 2024-08-04 03:59:06,227 INFO StreamThr :13051 [internal.py:wandb_internal():86] W&B internal server running at pid: 13051, started at: 2024-08-04 03:59:06.226186
+ 2024-08-04 03:59:06,228 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: status
+ 2024-08-04 03:59:06,230 INFO WriterThread:13051 [datastore.py:open_for_write():87] open: /project/wandb/run-20240804_035906-457c7q3q/run-457c7q3q.wandb
+ 2024-08-04 03:59:06,231 DEBUG SenderThread:13051 [sender.py:send():382] send: header
+ 2024-08-04 03:59:06,244 DEBUG SenderThread:13051 [sender.py:send():382] send: run
+ 2024-08-04 03:59:06,745 INFO SenderThread:13051 [dir_watcher.py:__init__():211] watching files in: /project/wandb/run-20240804_035906-457c7q3q/files
+ 2024-08-04 03:59:06,745 INFO SenderThread:13051 [sender.py:_start_run_threads():1136] run started: 457c7q3q with start time 1722711546.225609
+ 2024-08-04 03:59:06,750 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: check_version
+ 2024-08-04 03:59:06,751 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: check_version
+ 2024-08-04 03:59:06,837 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: run_start
+ 2024-08-04 03:59:06,843 DEBUG HandlerThread:13051 [system_info.py:__init__():27] System info init
+ 2024-08-04 03:59:06,843 DEBUG HandlerThread:13051 [system_info.py:__init__():42] System info init done
+ 2024-08-04 03:59:06,843 INFO HandlerThread:13051 [system_monitor.py:start():194] Starting system monitor
+ 2024-08-04 03:59:06,843 INFO SystemMonitor:13051 [system_monitor.py:_start():158] Starting system asset monitoring threads
+ 2024-08-04 03:59:06,843 INFO HandlerThread:13051 [system_monitor.py:probe():214] Collecting system info
+ 2024-08-04 03:59:06,844 INFO SystemMonitor:13051 [interfaces.py:start():190] Started cpu monitoring
+ 2024-08-04 03:59:06,844 INFO SystemMonitor:13051 [interfaces.py:start():190] Started disk monitoring
+ 2024-08-04 03:59:06,845 INFO SystemMonitor:13051 [interfaces.py:start():190] Started gpu monitoring
+ 2024-08-04 03:59:06,846 INFO SystemMonitor:13051 [interfaces.py:start():190] Started memory monitoring
+ 2024-08-04 03:59:06,847 INFO SystemMonitor:13051 [interfaces.py:start():190] Started network monitoring
+ 2024-08-04 03:59:06,856 DEBUG HandlerThread:13051 [system_info.py:probe():151] Probing system
+ 2024-08-04 03:59:06,858 DEBUG HandlerThread:13051 [system_info.py:_probe_git():136] Probing git
+ 2024-08-04 03:59:06,869 DEBUG HandlerThread:13051 [system_info.py:_probe_git():144] Probing git done
+ 2024-08-04 03:59:06,869 DEBUG HandlerThread:13051 [system_info.py:probe():199] Probing system done
+ 2024-08-04 03:59:06,869 DEBUG HandlerThread:13051 [system_monitor.py:probe():223] {'os': 'Linux-5.15.0-91-generic-x86_64-with-glibc2.35', 'python': '3.10.12', 'heartbeatAt': '2024-08-03T18:59:06.856800', 'startedAt': '2024-08-03T18:59:06.213352', 'docker': None, 'cuda': None, 'args': ('--seq-length', '512', '--sliding-window-size', '4096', '--micro-batch-size', '8', '--global-batch-size', '320', '--train-iters', '2000', '--tokenizer-type', 'Llama2Tokenizer', '--tokenizer-model', '/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model', '--train-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--valid-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--test-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--lr', '2e-5', '--min-lr', '1e-6', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '500', '--lr-decay-iters', '2000', '--weight-decay', '0.1', '--grad-clip-norm', '1.0', '--optimizer', 'adam', '--adam-beta1', '0.9', '--adam-beta2', '0.95', '--adam-eps', '1e-6', '--save-interval', '200', '--eval-interval', '200', '--eval-iters', '10', '--bf16', '--mixed-precision', '--base-model', '/share/pretrained_lm/meta-llama/TinyLlama_v1.1', '--save', '/work/llm_recipes/models/tiny-llama-sample', '--load', '/work/llm_recipes/models/tiny-llama-sample', '--fsdp-activation-checkpointing', '--sharding-strategy', 'FULL_SHARD', '--checkpoint-type', 'LOCAL_STATE_DICT', '--save-n-checkpoints', '10', '--hf-upload-retry-limit', '2', '--hf-repo-id', 'koichi12/tiny-llama-sample', '--wandb-entity', 'iwakawa-koichi-q5-tohoku-nlp6723', '--wandb-project', 'llm_tutorial', '--wandb-name', 'tiny-llama-sample_train_2024-08-04-03:58:55'), 'state': 'running', 'program': '/project/examples/finetuning.py', 'codePathLocal': 'examples/finetuning.py', 'codePath': 'examples/finetuning.py', 'git': {'remote': 'https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git', 'commit': '3be5353210a678dc7008f237fa16b99f2bdf36ea'}, 'email': None, 'root': '/project', 'host': 'gpu-koiwa-00', 'username': 'koiwa', 'executable': '/usr/bin/python', 'cpu_count': 18, 'cpu_count_logical': 18, 'cpu_freq': {'current': 2400.034, 'min': 0.0, 'max': 0.0}, 'cpu_freq_per_core': [{'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}], 'disk': {'/': {'total': 0.0625, 'used': 1.1444091796875e-05}}, 'gpu': 'NVIDIA A100-SXM4-40GB', 'gpu_count': 1, 'gpu_devices': [{'name': 'NVIDIA A100-SXM4-40GB', 'memory_total': 42949672960}], 'memory': {'total': 56.48782730102539}}
+ 2024-08-04 03:59:06,870 INFO HandlerThread:13051 [system_monitor.py:probe():224] Finished collecting system info
+ 2024-08-04 03:59:06,870 INFO HandlerThread:13051 [system_monitor.py:probe():227] Publishing system info
+ 2024-08-04 03:59:06,871 INFO HandlerThread:13051 [system_monitor.py:probe():229] Finished publishing system info
+ 2024-08-04 03:59:06,876 DEBUG SenderThread:13051 [sender.py:send():382] send: files
+ 2024-08-04 03:59:06,877 INFO SenderThread:13051 [sender.py:_save_file():1403] saving file wandb-metadata.json with policy now
+ 2024-08-04 03:59:06,886 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: python_packages
+ 2024-08-04 03:59:06,886 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-04 03:59:06,886 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: python_packages
+ 2024-08-04 03:59:06,887 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-04 03:59:06,888 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-04 03:59:07,128 DEBUG SenderThread:13051 [sender.py:send():382] send: telemetry
+ 2024-08-04 03:59:07,553 INFO wandb-upload_0:13051 [upload_job.py:push():131] Uploaded file /tmp/tmpuq1rfkhgwandb/205blebe-wandb-metadata.json
+ 2024-08-04 03:59:07,747 INFO Thread-12 :13051 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_035906-457c7q3q/files/output.log
+ 2024-08-04 03:59:07,747 INFO Thread-12 :13051 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_035906-457c7q3q/files/wandb-metadata.json
+ 2024-08-04 03:59:07,747 INFO Thread-12 :13051 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_035906-457c7q3q/files/requirements.txt
+ 2024-08-04 03:59:09,511 DEBUG SenderThread:13051 [sender.py:send():382] send: config
+ 2024-08-04 03:59:09,512 DEBUG SenderThread:13051 [sender.py:send():382] send: config
+ 2024-08-04 03:59:09,747 INFO Thread-12 :13051 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_035906-457c7q3q/files/output.log
+ 2024-08-04 03:59:10,109 DEBUG SenderThread:13051 [sender.py:send():382] send: exit
+ 2024-08-04 03:59:10,109 INFO SenderThread:13051 [sender.py:send_exit():589] handling exit code: 1
+ 2024-08-04 03:59:10,109 INFO SenderThread:13051 [sender.py:send_exit():591] handling runtime: 3
+ 2024-08-04 03:59:10,123 INFO SenderThread:13051 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
48
+ 2024-08-04 03:59:10,123 INFO SenderThread:13051 [sender.py:send_exit():597] send defer
49
+ 2024-08-04 03:59:10,123 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
50
+ 2024-08-04 03:59:10,123 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 0
51
+ 2024-08-04 03:59:10,124 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
52
+ 2024-08-04 03:59:10,124 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 0
53
+ 2024-08-04 03:59:10,124 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 1
54
+ 2024-08-04 03:59:10,124 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
55
+ 2024-08-04 03:59:10,124 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 1
56
+ 2024-08-04 03:59:10,124 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
57
+ 2024-08-04 03:59:10,124 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 1
58
+ 2024-08-04 03:59:10,124 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 2
59
+ 2024-08-04 03:59:10,124 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
60
+ 2024-08-04 03:59:10,124 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 2
61
+ 2024-08-04 03:59:10,124 INFO HandlerThread:13051 [system_monitor.py:finish():203] Stopping system monitor
62
+ 2024-08-04 03:59:10,124 DEBUG SystemMonitor:13051 [system_monitor.py:_start():172] Starting system metrics aggregation loop
63
+ 2024-08-04 03:59:10,125 INFO HandlerThread:13051 [interfaces.py:finish():202] Joined cpu monitor
64
+ 2024-08-04 03:59:10,125 DEBUG SystemMonitor:13051 [system_monitor.py:_start():179] Finished system metrics aggregation loop
65
+ 2024-08-04 03:59:10,125 INFO HandlerThread:13051 [interfaces.py:finish():202] Joined disk monitor
66
+ 2024-08-04 03:59:10,125 DEBUG SystemMonitor:13051 [system_monitor.py:_start():183] Publishing last batch of metrics
67
+ 2024-08-04 03:59:10,159 INFO HandlerThread:13051 [interfaces.py:finish():202] Joined gpu monitor
68
+ 2024-08-04 03:59:10,159 INFO HandlerThread:13051 [interfaces.py:finish():202] Joined memory monitor
69
+ 2024-08-04 03:59:10,159 INFO HandlerThread:13051 [interfaces.py:finish():202] Joined network monitor
70
+ 2024-08-04 03:59:10,160 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
71
+ 2024-08-04 03:59:10,160 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 2
72
+ 2024-08-04 03:59:10,160 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 3
73
+ 2024-08-04 03:59:10,160 DEBUG SenderThread:13051 [sender.py:send():382] send: stats
74
+ 2024-08-04 03:59:10,160 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
75
+ 2024-08-04 03:59:10,160 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 3
76
+ 2024-08-04 03:59:10,161 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
77
+ 2024-08-04 03:59:10,161 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 3
78
+ 2024-08-04 03:59:10,161 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 4
79
+ 2024-08-04 03:59:10,161 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
80
+ 2024-08-04 03:59:10,161 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 4
81
+ 2024-08-04 03:59:10,161 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
82
+ 2024-08-04 03:59:10,161 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 4
83
+ 2024-08-04 03:59:10,161 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 5
84
+ 2024-08-04 03:59:10,161 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
85
+ 2024-08-04 03:59:10,161 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 5
86
+ 2024-08-04 03:59:10,161 DEBUG SenderThread:13051 [sender.py:send():382] send: summary
87
+ 2024-08-04 03:59:10,165 INFO SenderThread:13051 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
88
+ 2024-08-04 03:59:10,165 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
89
+ 2024-08-04 03:59:10,165 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 5
90
+ 2024-08-04 03:59:10,165 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 6
91
+ 2024-08-04 03:59:10,165 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
92
+ 2024-08-04 03:59:10,165 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 6
93
+ 2024-08-04 03:59:10,165 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
94
+ 2024-08-04 03:59:10,165 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 6
95
+ 2024-08-04 03:59:10,168 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: status_report
96
+ 2024-08-04 03:59:10,367 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 7
97
+ 2024-08-04 03:59:10,367 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
98
+ 2024-08-04 03:59:10,367 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 7
99
+ 2024-08-04 03:59:10,367 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
100
+ 2024-08-04 03:59:10,368 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 7
101
+ 2024-08-04 03:59:10,749 INFO Thread-12 :13051 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_035906-457c7q3q/files/config.yaml
102
+ 2024-08-04 03:59:10,749 INFO Thread-12 :13051 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_035906-457c7q3q/files/wandb-summary.json
103
+ 2024-08-04 03:59:11,109 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: poll_exit
104
+ 2024-08-04 03:59:11,580 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 8
105
+ 2024-08-04 03:59:11,580 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: poll_exit
106
+ 2024-08-04 03:59:11,580 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
107
+ 2024-08-04 03:59:11,581 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 8
108
+ 2024-08-04 03:59:11,581 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
109
+ 2024-08-04 03:59:11,581 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 8
110
+ 2024-08-04 03:59:11,581 INFO SenderThread:13051 [job_builder.py:build():296] Attempting to build job artifact
111
+ 2024-08-04 03:59:11,582 INFO SenderThread:13051 [job_builder.py:_get_source_type():426] is repo sourced job
112
+ 2024-08-04 03:59:11,595 INFO SenderThread:13051 [job_builder.py:build():402] adding wandb-job metadata file
113
+ 2024-08-04 03:59:11,631 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 9
114
+ 2024-08-04 03:59:11,631 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
115
+ 2024-08-04 03:59:11,631 DEBUG SenderThread:13051 [sender.py:send():382] send: artifact
116
+ 2024-08-04 03:59:11,631 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 9
117
+ 2024-08-04 03:59:11,749 INFO Thread-12 :13051 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_035906-457c7q3q/files/output.log
118
+ 2024-08-04 03:59:12,109 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: poll_exit
119
+ 2024-08-04 03:59:12,520 INFO SenderThread:13051 [sender.py:send_artifact():1494] sent artifact job-https___github.com_cl-tohoku_llm-recipes-failab-m1-yans.git_examples_finetuning.py - {'id': 'QXJ0aWZhY3Q6MTA5MTk2NTkzOA==', 'state': 'COMMITTED', 'artifactSequence': {'id': 'QXJ0aWZhY3RDb2xsZWN0aW9uOjM2MjY3MjMzNA==', 'latestArtifact': {'id': 'QXJ0aWZhY3Q6MTA5MzUzODM4NQ==', 'versionIndex': 3}}}
120
+ 2024-08-04 03:59:12,520 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
121
+ 2024-08-04 03:59:12,520 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 9
122
+ 2024-08-04 03:59:12,520 INFO SenderThread:13051 [dir_watcher.py:finish():358] shutting down directory watcher
123
+ 2024-08-04 03:59:12,750 INFO SenderThread:13051 [dir_watcher.py:finish():388] scan: /project/wandb/run-20240804_035906-457c7q3q/files
124
+ 2024-08-04 03:59:12,750 INFO SenderThread:13051 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_035906-457c7q3q/files/requirements.txt requirements.txt
125
+ 2024-08-04 03:59:12,751 INFO SenderThread:13051 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_035906-457c7q3q/files/config.yaml config.yaml
126
+ 2024-08-04 03:59:12,752 INFO SenderThread:13051 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_035906-457c7q3q/files/wandb-metadata.json wandb-metadata.json
127
+ 2024-08-04 03:59:12,752 INFO SenderThread:13051 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_035906-457c7q3q/files/wandb-summary.json wandb-summary.json
128
+ 2024-08-04 03:59:12,754 INFO SenderThread:13051 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_035906-457c7q3q/files/output.log output.log
129
+ 2024-08-04 03:59:12,755 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 10
130
+ 2024-08-04 03:59:12,755 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: poll_exit
131
+ 2024-08-04 03:59:12,755 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
132
+ 2024-08-04 03:59:12,757 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 10
133
+ 2024-08-04 03:59:12,757 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
134
+ 2024-08-04 03:59:12,757 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 10
135
+ 2024-08-04 03:59:12,757 INFO SenderThread:13051 [file_pusher.py:finish():172] shutting down file pusher
136
+ 2024-08-04 03:59:13,109 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: poll_exit
137
+ 2024-08-04 03:59:13,110 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: poll_exit
138
+ 2024-08-04 03:59:13,154 INFO wandb-upload_0:13051 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_035906-457c7q3q/files/requirements.txt
139
+ 2024-08-04 03:59:13,257 INFO wandb-upload_1:13051 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_035906-457c7q3q/files/config.yaml
140
+ 2024-08-04 03:59:13,334 INFO wandb-upload_2:13051 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_035906-457c7q3q/files/wandb-summary.json
141
+ 2024-08-04 03:59:13,368 INFO wandb-upload_3:13051 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_035906-457c7q3q/files/output.log
142
+ 2024-08-04 03:59:13,568 INFO Thread-11 (_thread_body):13051 [sender.py:transition_state():617] send defer: 11
143
+ 2024-08-04 03:59:13,569 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
144
+ 2024-08-04 03:59:13,569 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 11
145
+ 2024-08-04 03:59:13,569 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
146
+ 2024-08-04 03:59:13,569 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 11
147
+ 2024-08-04 03:59:13,569 INFO SenderThread:13051 [file_pusher.py:join():178] waiting for file pusher
148
+ 2024-08-04 03:59:13,569 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 12
149
+ 2024-08-04 03:59:13,570 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
150
+ 2024-08-04 03:59:13,570 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 12
151
+ 2024-08-04 03:59:13,570 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
152
+ 2024-08-04 03:59:13,570 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 12
153
+ 2024-08-04 03:59:13,570 INFO SenderThread:13051 [file_stream.py:finish():595] file stream finish called
154
+ 2024-08-04 03:59:13,759 INFO SenderThread:13051 [file_stream.py:finish():599] file stream finish is done
155
+ 2024-08-04 03:59:13,759 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 13
156
+ 2024-08-04 03:59:13,759 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
157
+ 2024-08-04 03:59:13,759 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 13
158
+ 2024-08-04 03:59:13,759 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
159
+ 2024-08-04 03:59:13,759 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 13
160
+ 2024-08-04 03:59:13,759 INFO SenderThread:13051 [sender.py:transition_state():617] send defer: 14
161
+ 2024-08-04 03:59:13,759 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: defer
162
+ 2024-08-04 03:59:13,759 DEBUG SenderThread:13051 [sender.py:send():382] send: final
163
+ 2024-08-04 03:59:13,759 INFO HandlerThread:13051 [handler.py:handle_request_defer():172] handle defer: 14
164
+ 2024-08-04 03:59:13,760 DEBUG SenderThread:13051 [sender.py:send():382] send: footer
165
+ 2024-08-04 03:59:13,760 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: defer
166
+ 2024-08-04 03:59:13,760 INFO SenderThread:13051 [sender.py:send_request_defer():613] handle sender defer: 14
167
+ 2024-08-04 03:59:13,760 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: poll_exit
168
+ 2024-08-04 03:59:13,760 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: poll_exit
169
+ 2024-08-04 03:59:13,761 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: server_info
170
+ 2024-08-04 03:59:13,761 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: get_summary
171
+ 2024-08-04 03:59:13,761 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: sampled_history
172
+ 2024-08-04 03:59:13,761 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: poll_exit
173
+ 2024-08-04 03:59:13,761 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: internal_messages
174
+ 2024-08-04 03:59:13,761 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: poll_exit
175
+ 2024-08-04 03:59:13,762 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: server_info
176
+ 2024-08-04 03:59:13,763 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: job_info
177
+ 2024-08-04 03:59:13,927 DEBUG SenderThread:13051 [sender.py:send_request():409] send_request: job_info
178
+ 2024-08-04 03:59:13,927 INFO MainThread:13051 [wandb_run.py:_footer_history_summary_info():3866] rendering history
179
+ 2024-08-04 03:59:13,927 INFO MainThread:13051 [wandb_run.py:_footer_history_summary_info():3898] rendering summary
180
+ 2024-08-04 03:59:13,928 INFO MainThread:13051 [wandb_run.py:_footer_sync_info():3825] logging synced files
181
+ 2024-08-04 03:59:13,928 DEBUG HandlerThread:13051 [handler.py:handle_request():146] handle_request: shutdown
182
+ 2024-08-04 03:59:13,928 INFO HandlerThread:13051 [handler.py:finish():869] shutting down handler
183
+ 2024-08-04 03:59:14,763 INFO WriterThread:13051 [datastore.py:close():296] close: /project/wandb/run-20240804_035906-457c7q3q/run-457c7q3q.wandb
184
+ 2024-08-04 03:59:14,927 INFO SenderThread:13051 [sender.py:finish():1572] shutting down sender
185
+ 2024-08-04 03:59:14,928 INFO SenderThread:13051 [file_pusher.py:finish():172] shutting down file pusher
186
+ 2024-08-04 03:59:14,928 INFO SenderThread:13051 [file_pusher.py:join():178] waiting for file pusher
wandb/run-20240804_035906-457c7q3q/logs/debug.log ADDED
@@ -0,0 +1,29 @@
1
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
2
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_setup.py:_flush():76] Configure stats pid to 12980
3
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_setup.py:_flush():76] Loading settings from /singularity_home/.config/wandb/settings
4
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_setup.py:_flush():76] Loading settings from /project/wandb/settings
5
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'api_key': '***REDACTED***', 'run_notes': 'Train tuny llama sample'}
6
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
7
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'examples/finetuning.py', 'program_abspath': '/project/examples/finetuning.py', 'program': '/project/examples/finetuning.py'}
8
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_init.py:_log_setup():526] Logging user logs to /project/wandb/run-20240804_035906-457c7q3q/logs/debug.log
9
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_init.py:_log_setup():527] Logging internal logs to /project/wandb/run-20240804_035906-457c7q3q/logs/debug-internal.log
10
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_init.py:init():566] calling init triggers
11
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
12
+ config: {'sharding_strategy': 'FULL_SHARD', 'checkpoint_type': 'LOCAL_STATE_DICT', 'fsdp_activation_checkpointing': True, 'fsdp_cpu_offload': False, 'low_cpu_fsdp': False, 'no_meta_device': False, 'data_path': None, 'split': '969, 30, 1', 'train_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'valid_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'test_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'data_cache_path': None, 'vocab_size': None, 'vocab_file': None, 'merge_file': None, 'seq_length': 512, 'num_workers': 2, 'tokenizer_type': 'Llama2Tokenizer', 'tokenizer_model': '/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model', 'reset_position_ids': False, 'reset_attention_mask': False, 'eod_mask_loss': False, 'retro_return_doc_ids': False, 'short_seq_prob': 0.1, 'vocab_extra_ids': 0, 'seed': 1234, 'use_mpi': False, 'wandb_entity': 'iwakawa-koichi-q5-tohoku-nlp6723', 'wandb_name': 'tiny-llama-sample_train_2024-08-04-03:58:55', 'wandb_project': 'llm_tutorial', 'quantization': False, 'use_freeze_layers': False, 'freeze_layers': None, 'bf16': True, 'fp16': False, 'mixed_precision': True, 'param_dtype': None, 'load': '/work/llm_recipes/models/tiny-llama-sample', 'save': '/work/llm_recipes/models/tiny-llama-sample', 'base_model': '/share/pretrained_lm/meta-llama/TinyLlama_v1.1', 'use_better_transformer': False, 'grad_clip_norm': 1.0, 'eval_interval': 200, 'save_interval': 200, 'eval_iters': 10, 'optimizer': 'adam', 'lr': 2e-05, 'lr_decay_style': 'cosine', 'lr_decay_iters': 2000, 'lr_warmup_iters': 500, 'min_lr': 1e-06, 'train_iters': 2000, 'train_samples': None, 'global_batch_size': 320, 'micro_batch_size': 8, 
'make_vocab_size_divisible_by': 128, 'sliding_window_size': 4096, 'skip_batch': None, 'no_save_optimizer_state': False, 'continual_pretraining': False, 'instruction_tuning': False, 'direct_preference_optimization': False, 'attention_dropout': 0.1, 'hidden_dropout': 0.1, 'weight_decay': 0.1, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_eps': 1e-06, 'hf_transformer_model_dir': None, 'instruction_train_data_path': None, 'instruction_valid_data_path': None, 'epoch': None, 'instruction_dataset_size': None, 'save_sampler_state': False, 'label_smoothing': 0.0, 'save_n_checkpoints': 10, 'hf_repo_id': 'koichi12/tiny-llama-sample', 'create_public_hf_repo': False, 'upload_all_checkpoints_to_hf': False, 'hf_upload_retry_limit': 2, 'exit_duration_in_mins': None, 'source_key': None, 'target_key': None, 'attn_implementation': 'flash_attention_2', 'efficient_instruction_tuning': False, 'remove_padding_masking': False, 'save_start_iter': None, 'rank': 0, 'world_size': 1, 'padded_vocab_size': 32000, 'gradient_accumulation_steps': 40}
13
+ 2024-08-04 03:59:06,219 INFO MainThread:12980 [wandb_init.py:init():616] starting backend
14
+ 2024-08-04 03:59:06,220 INFO MainThread:12980 [wandb_init.py:init():620] setting up manager
15
+ 2024-08-04 03:59:06,224 INFO MainThread:12980 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
16
+ 2024-08-04 03:59:06,225 INFO MainThread:12980 [wandb_init.py:init():628] backend started and connected
17
+ 2024-08-04 03:59:06,230 INFO MainThread:12980 [wandb_init.py:init():720] updated telemetry
18
+ 2024-08-04 03:59:06,240 INFO MainThread:12980 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
19
+ 2024-08-04 03:59:06,750 INFO MainThread:12980 [wandb_run.py:_on_init():2262] communicating current version
20
+ 2024-08-04 03:59:06,830 INFO MainThread:12980 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.17.5 is available! To upgrade, please run:\n $ pip install wandb --upgrade"
21
+
22
+ 2024-08-04 03:59:06,830 INFO MainThread:12980 [wandb_init.py:init():804] starting run threads in backend
23
+ 2024-08-04 03:59:06,885 INFO MainThread:12980 [wandb_run.py:_console_start():2241] atexit reg
24
+ 2024-08-04 03:59:06,885 INFO MainThread:12980 [wandb_run.py:_redirect():2096] redirect: wrap_raw
25
+ 2024-08-04 03:59:06,885 INFO MainThread:12980 [wandb_run.py:_redirect():2161] Wrapping output streams.
26
+ 2024-08-04 03:59:06,886 INFO MainThread:12980 [wandb_run.py:_redirect():2186] Redirects installed.
27
+ 2024-08-04 03:59:06,887 INFO MainThread:12980 [wandb_init.py:init():847] run started, returning control to user process
28
+ 2024-08-04 03:59:09,511 INFO MainThread:12980 [wandb_run.py:_config_callback():1343] config_cb None None {'activation_function': 'silu', 'hidden_size': 2048, 'model_type': 'llama', 'max_position_embeddings': 2048, 'num_attention_heads': 32, 'num_hidden_layers': 22, 'model_architecture': 'LlamaForCausalLM'}
29
+ 2024-08-04 03:59:09,511 INFO MainThread:12980 [wandb_run.py:_config_callback():1343] config_cb None None {'world_size': 1}
wandb/run-20240804_035906-457c7q3q/run-457c7q3q.wandb ADDED
Binary file (20.8 kB). View file
 
wandb/run-20240804_143449-7tyiihss/files/config.yaml ADDED
@@ -0,0 +1,335 @@
1
+ wandb_version: 1
2
+
3
+ sharding_strategy:
4
+ desc: null
5
+ value: FULL_SHARD
6
+ checkpoint_type:
7
+ desc: null
8
+ value: LOCAL_STATE_DICT
9
+ fsdp_activation_checkpointing:
10
+ desc: null
11
+ value: true
12
+ fsdp_cpu_offload:
13
+ desc: null
14
+ value: false
15
+ low_cpu_fsdp:
16
+ desc: null
17
+ value: false
18
+ no_meta_device:
19
+ desc: null
20
+ value: false
21
+ data_path:
22
+ desc: null
23
+ value: null
24
+ split:
25
+ desc: null
26
+ value: 969, 30, 1
27
+ train_data_path:
28
+ desc: null
29
+ value:
30
+ - '4013541'
31
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
32
+ valid_data_path:
33
+ desc: null
34
+ value:
35
+ - '4013541'
36
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
37
+ test_data_path:
38
+ desc: null
39
+ value:
40
+ - '4013541'
41
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
42
+ data_cache_path:
43
+ desc: null
44
+ value: null
45
+ vocab_size:
46
+ desc: null
47
+ value: null
48
+ vocab_file:
49
+ desc: null
50
+ value: null
51
+ merge_file:
52
+ desc: null
53
+ value: null
54
+ seq_length:
55
+ desc: null
56
+ value: 512
57
+ num_workers:
58
+ desc: null
59
+ value: 2
60
+ tokenizer_type:
61
+ desc: null
62
+ value: Llama2Tokenizer
63
+ tokenizer_model:
64
+ desc: null
65
+ value: /share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model
66
+ reset_position_ids:
67
+ desc: null
68
+ value: false
69
+ reset_attention_mask:
70
+ desc: null
71
+ value: false
72
+ eod_mask_loss:
73
+ desc: null
74
+ value: false
75
+ retro_return_doc_ids:
76
+ desc: null
77
+ value: false
78
+ short_seq_prob:
79
+ desc: null
80
+ value: 0.1
81
+ vocab_extra_ids:
82
+ desc: null
83
+ value: 0
84
+ seed:
85
+ desc: null
86
+ value: 1234
87
+ use_mpi:
88
+ desc: null
89
+ value: false
90
+ wandb_entity:
91
+ desc: null
92
+ value: iwakawa-koichi-q5-tohoku-nlp6723
93
+ wandb_name:
94
+ desc: null
95
+ value: tiny-llama_train_2024-08-04-14:34:38
96
+ wandb_project:
97
+ desc: null
98
+ value: llm_tutorial
99
+ quantization:
100
+ desc: null
101
+ value: false
102
+ use_freeze_layers:
103
+ desc: null
104
+ value: false
105
+ freeze_layers:
106
+ desc: null
107
+ value: null
108
+ bf16:
109
+ desc: null
110
+ value: true
111
+ fp16:
112
+ desc: null
113
+ value: false
114
+ mixed_precision:
115
+ desc: null
116
+ value: true
117
+ param_dtype:
118
+ desc: null
119
+ value: null
120
+ load:
121
+ desc: null
122
+ value: /work/llm_recipes/models/tiny-llama
123
+ save:
124
+ desc: null
125
+ value: /work/llm_recipes/models/tiny-llama
126
+ base_model:
127
+ desc: null
128
+ value: /share/pretrained_lm/meta-llama/TinyLlama_v1.1
129
+ use_better_transformer:
130
+ desc: null
131
+ value: false
132
+ grad_clip_norm:
133
+ desc: null
134
+ value: 1.0
135
+ eval_interval:
136
+ desc: null
137
+ value: 200
138
+ save_interval:
139
+ desc: null
140
+ value: 200
141
+ eval_iters:
142
+ desc: null
143
+ value: 10
144
+ optimizer:
145
+ desc: null
146
+ value: adam
147
+ lr:
148
+ desc: null
149
+ value: 2.0e-05
150
+ lr_decay_style:
151
+ desc: null
152
+ value: cosine
153
+ lr_decay_iters:
154
+ desc: null
155
+ value: 2000
156
+ lr_warmup_iters:
157
+ desc: null
158
+ value: 500
159
+ min_lr:
160
+ desc: null
161
+ value: 1.0e-06
162
+ train_iters:
163
+ desc: null
164
+ value: 2000
165
+ train_samples:
166
+ desc: null
167
+ value: null
168
+ global_batch_size:
169
+ desc: null
170
+ value: 320
171
+ micro_batch_size:
172
+ desc: null
173
+ value: 8
174
+ make_vocab_size_divisible_by:
175
+   desc: null
+   value: 128
+ sliding_window_size:
+   desc: null
+   value: 4096
+ skip_batch:
+   desc: null
+   value: null
+ no_save_optimizer_state:
+   desc: null
+   value: false
+ continual_pretraining:
+   desc: null
+   value: false
+ instruction_tuning:
+   desc: null
+   value: false
+ direct_preference_optimization:
+   desc: null
+   value: false
+ attention_dropout:
+   desc: null
+   value: 0.1
+ hidden_dropout:
+   desc: null
+   value: 0.1
+ weight_decay:
+   desc: null
+   value: 0.1
+ adam_beta1:
+   desc: null
+   value: 0.9
+ adam_beta2:
+   desc: null
+   value: 0.95
+ adam_eps:
+   desc: null
+   value: 1.0e-06
+ hf_transformer_model_dir:
+   desc: null
+   value: null
+ instruction_train_data_path:
+   desc: null
+   value: null
+ instruction_valid_data_path:
+   desc: null
+   value: null
+ epoch:
+   desc: null
+   value: null
+ instruction_dataset_size:
+   desc: null
+   value: null
+ save_sampler_state:
+   desc: null
+   value: false
+ label_smoothing:
+   desc: null
+   value: 0.0
+ save_n_checkpoints:
+   desc: null
+   value: 10
+ hf_repo_id:
+   desc: null
+   value: koichi12/tiny-llama
+ create_public_hf_repo:
+   desc: null
+   value: false
+ upload_all_checkpoints_to_hf:
+   desc: null
+   value: false
+ hf_upload_retry_limit:
+   desc: null
+   value: 2
+ exit_duration_in_mins:
+   desc: null
+   value: null
+ source_key:
+   desc: null
+   value: null
+ target_key:
+   desc: null
+   value: null
+ attn_implementation:
+   desc: null
+   value: flash_attention_2
+ efficient_instruction_tuning:
+   desc: null
+   value: false
+ remove_padding_masking:
+   desc: null
+   value: false
+ save_start_iter:
+   desc: null
+   value: null
+ rank:
+   desc: null
+   value: 0
+ world_size:
+   desc: null
+   value: 1
+ padded_vocab_size:
+   desc: null
+   value: 32000
+ gradient_accumulation_steps:
+   desc: null
+   value: 40
+ _wandb:
+   desc: null
+   value:
+     python_version: 3.10.12
+     cli_version: 0.16.3
+     framework: huggingface
+     huggingface_version: 4.43.3
+     is_jupyter_run: false
+     is_kaggle_kernel: false
+     start_time: 1722749689.905326
+     t:
+       1:
+       - 1
+       - 11
+       - 49
+       - 55
+       - 71
+       2:
+       - 1
+       - 11
+       - 49
+       - 55
+       - 71
+       3:
+       - 13
+       - 16
+       - 23
+       4: 3.10.12
+       5: 0.16.3
+       6: 4.43.3
+       8:
+       - 5
+       13: linux-x86_64
+ activation_function:
+   desc: null
+   value: silu
+ hidden_size:
+   desc: null
+   value: 2048
+ model_type:
+   desc: null
+   value: llama
+ max_position_embeddings:
+   desc: null
+   value: 2048
+ num_attention_heads:
+   desc: null
+   value: 32
+ num_hidden_layers:
+   desc: null
+   value: 22
+ model_architecture:
+   desc: null
+   value: LlamaForCausalLM
wandb/run-20240804_143449-7tyiihss/files/output.log ADDED
@@ -0,0 +1,135 @@
+ Created Hugging Face repository with ID koichi12/tiny-llama.
+ Clearing GPU cache for all ranks
+ --> Running with torch torch_distributed debug set to detail
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ No checkpoint found in /work/llm_recipes/models/tiny-llama, skipping model loading
+ --> Model /share/pretrained_lm/meta-llama/TinyLlama_v1.1
+ --> /share/pretrained_lm/meta-llama/TinyLlama_v1.1 has 1100.048384 Million params
+ You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
+ Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
+ Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
+ /usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_init_utils.py:441: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
+   warnings.warn(
+ BFloat16 enabled for mixed precision - using bfSixteen policy
+ --> applying fsdp activation checkpointing...
+ > datasets target sizes (minimum size):
+     train: 640000
+     validation: 35200
+     test: 3200
+ > building train, validation, and test datasets for GPT ...
+ > finished creating GPT datasets ...
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ No checkpoint found in /work/llm_recipes/models/tiny-llama, skipping optimizer loading
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ model info: FullyShardedDataParallel(
+   (_fsdp_wrapped_module): LlamaForCausalLM(
+     (model): LlamaModel(
+       (embed_tokens): Embedding(32000, 2048)
+       (layers): ModuleList(
+         (0-21): 22 x FullyShardedDataParallel(
+           (_fsdp_wrapped_module): CheckpointWrapper(
+             (_checkpoint_wrapped_module): LlamaDecoderLayer(
+               (self_attn): LlamaFlashAttention2(
+                 (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
+                 (k_proj): Linear(in_features=2048, out_features=256, bias=False)
+                 (v_proj): Linear(in_features=2048, out_features=256, bias=False)
+                 (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
+                 (rotary_emb): LlamaRotaryEmbedding()
+               )
+               (mlp): LlamaMLP(
+                 (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
+                 (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
+                 (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
+                 (act_fn): SiLU()
+               )
+               (input_layernorm): LlamaRMSNorm()
+               (post_attention_layernorm): LlamaRMSNorm()
+             )
+           )
+         )
+       )
+       (norm): LlamaRMSNorm()
+       (rotary_emb): LlamaRotaryEmbedding()
+     )
+     (lm_head): Linear(in_features=2048, out_features=32000, bias=False)
+   )
+ )
+ model config: LlamaConfig {
+   "_name_or_path": "/share/pretrained_lm/meta-llama/TinyLlama_v1.1",
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 5632,
+   "label_smoothing": 0.0,
+   "max_position_embeddings": 2048,
+   "mlp_bias": false,
+   "model_type": "llama",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 22,
+   "num_key_value_heads": 4,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": null,
+   "rope_theta": 10000.0,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.43.3",
+   "use_cache": false,
+   "vocab_size": 32000
+ }
+ Let split = None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Traceback (most recent call last):
+   File "/project/examples/finetuning.py", line 13, in <module>
+     main()
+   File "/project/src/llama_recipes/finetuning.py", line 281, in main
+     train(
+   File "/project/src/llama_recipes/utils/train_utils.py", line 104, in train
+     batch = next(train_dataloader)
+   File "/project/src/llama_recipes/utils/train_utils.py", line 24, in cyclic_iter
+     for x in iter:
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
+     data = self._next_data()
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
+     return self._process_data(data)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
+     data.reraise()
+   File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 705, in reraise
+     raise exception
+ RuntimeError: Caught RuntimeError in DataLoader worker process 0.
+ Original Traceback (most recent call last):
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
+     data = fetcher.fetch(index)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
+     return self.collate_fn(data)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 277, in default_collate
+     return collate(batch, collate_fn_map=default_collate_fn_map)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 129, in collate
+     return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 129, in <dictcomp>
+     return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 121, in collate
+     return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 174, in collate_tensor_fn
+     return torch.stack(batch, 0, out=out)
+ RuntimeError: stack expects each tensor to be equal size, but got [513] at entry 0 and [543] at entry 1
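Editor's note on the traceback above: the run dies because torch's `default_collate` tries to `torch.stack()` per-sample `input_ids` tensors of different lengths (513 vs 543). The repo's actual fix is not shown in this diff; below is a minimal sketch (the `pad_collate` name and the dict-of-1-D-LongTensors sample layout are assumptions) of a `collate_fn` that pads every field to the longest sequence in the batch before stacking.

```python
import torch


def pad_collate(batch, pad_token_id=0):
    """Pad each tensor field in a batch of dict samples to the batch max
    length, then stack into a single (batch, max_len) tensor per field."""
    max_len = max(sample["input_ids"].size(0) for sample in batch)
    out = {}
    for key in batch[0]:
        padded = []
        for sample in batch:
            t = sample[key]
            # Labels are conventionally padded with -100 so the cross-entropy
            # loss ignores the padding positions; other fields get the pad id.
            fill = -100 if key == "labels" else pad_token_id
            padded.append(
                torch.nn.functional.pad(t, (0, max_len - t.size(0)), value=fill)
            )
        out[key] = torch.stack(padded, dim=0)
    return out
```

Passing `collate_fn=pad_collate` to the `DataLoader` would make the batch above stack to shape `(batch, 543)`; the other common remedy is to have the dataset emit fixed-length chunks matching `--seq-length` so `default_collate` works unchanged.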
wandb/run-20240804_143449-7tyiihss/files/requirements.txt ADDED
@@ -0,0 +1,271 @@
+ absl-py==2.1.0
+ accelerate==0.33.0
+ aiohttp==3.9.1
+ aiosignal==1.3.1
+ annotated-types==0.6.0
+ apex==0.1
+ appdirs==1.4.4
+ argon2-cffi-bindings==21.2.0
+ argon2-cffi==23.1.0
+ asttokens==2.4.1
+ astunparse==1.6.3
+ async-timeout==4.0.3
+ attrs==23.2.0
+ audioread==3.0.1
+ beautifulsoup4==4.12.3
+ bleach==6.1.0
+ blis==0.7.11
+ cachetools==5.3.2
+ catalogue==2.0.10
+ certifi==2024.2.2
+ cffi==1.16.0
+ charset-normalizer==3.3.2
+ click==8.1.7
+ cloudpathlib==0.16.0
+ cloudpickle==3.0.0
+ cmake==3.28.1
+ colorama==0.4.6
+ comm==0.2.1
+ confection==0.1.4
+ contourpy==1.2.0
+ cubinlinker==0.3.0+2.g405ac64
+ cuda-python==12.3.0rc4+9.gdb8c48a.dirty
+ cudf==23.12.0
+ cugraph-dgl==23.12.0
+ cugraph-service-client==23.12.0
+ cugraph-service-server==23.12.0
+ cugraph==23.12.0
+ cuml==23.12.0
+ cupy-cuda12x==12.3.0
+ cycler==0.12.1
+ cymem==2.0.8
+ cython==3.0.8
+ dask-cuda==23.12.0
+ dask-cudf==23.12.0
+ dask==2023.11.0
+ debugpy==1.8.1
+ decorator==5.1.1
+ defusedxml==0.7.1
+ distributed==2023.11.0
+ dm-tree==0.1.8
+ docker-pycreds==0.4.0
+ einops==0.7.0
+ exceptiongroup==1.2.0
+ execnet==2.0.2
+ executing==2.0.1
+ expecttest==0.1.3
+ fastjsonschema==2.19.1
+ fastrlock==0.8.2
+ filelock==3.13.1
+ flash-attn==2.4.2
+ fonttools==4.48.1
+ frozenlist==1.4.1
+ fsspec==2023.12.2
+ gast==0.5.4
+ gitdb==4.0.11
+ gitpython==3.1.43
+ google-auth-oauthlib==0.4.6
+ google-auth==2.27.0
+ graphsurgeon==0.4.6
+ grpcio==1.60.1
+ huggingface-hub==0.24.5
+ hypothesis==5.35.1
+ idna==3.6
+ importlib-metadata==7.0.1
+ iniconfig==2.0.0
+ intel-openmp==2021.4.0
+ ipadic==1.0.0
+ ipykernel==6.29.2
+ ipython-genutils==0.2.0
+ ipython==8.21.0
+ jedi==0.19.1
+ jinja2==3.1.3
+ joblib==1.3.2
+ json5==0.9.14
+ jsonnet==0.19.1
+ jsonschema-specifications==2023.12.1
+ jsonschema==4.21.1
+ jupyter-client==8.6.0
+ jupyter-core==5.7.1
+ jupyter-tensorboard==0.2.0
+ jupyterlab-pygments==0.3.0
+ jupyterlab-server==1.2.0
+ jupyterlab==2.3.2
+ jupytext==1.16.1
+ kiwisolver==1.4.5
+ langcodes==3.3.0
+ lazy-loader==0.3
+ librosa==0.10.1
+ llvmlite==0.40.1
+ locket==1.0.0
+ logzero==1.7.0
+ lxml==5.2.2
+ markdown-it-py==3.0.0
+ markdown==3.5.2
+ markupsafe==2.1.4
+ matplotlib-inline==0.1.6
+ matplotlib==3.8.2
+ mdit-py-plugins==0.4.0
+ mdurl==0.1.2
+ mecab-python3==1.0.6
+ mistune==3.0.2
+ mkl-devel==2021.1.1
+ mkl-include==2021.1.1
+ mkl==2021.1.1
+ mock==5.1.0
+ more-itertools==9.1.0
+ mpmath==1.3.0
+ msgpack==1.0.7
+ multidict==6.0.4
+ murmurhash==1.0.10
+ nbclient==0.9.0
+ nbconvert==7.16.0
+ nbformat==5.9.2
+ nest-asyncio==1.6.0
+ networkx==2.6.3
+ ninja==1.11.1.1
+ nltk==3.8.1
+ notebook==6.4.10
+ numba==0.57.1+1.g1ff679645
+ numpy==1.24.4
+ nvfuser==0.1.4a0+d0bb811
+ nvidia-dali-cuda120==1.34.0
+ nvidia-pyindex==1.0.9
+ nvtx==0.2.5
+ oauthlib==3.2.2
+ onnx==1.15.0rc2
+ opencv==4.7.0
+ optree==0.10.0
+ packaging==23.2
+ pandas==1.5.3
+ pandocfilters==1.5.1
+ parso==0.8.3
+ partd==1.4.1
+ peft==0.11.1
+ pexpect==4.9.0
+ pillow==10.2.0
+ pip==24.0
+ platformdirs==4.2.0
+ pluggy==1.4.0
+ ply==3.11
+ polygraphy==0.49.4
+ pooch==1.8.0
+ portalocker==2.10.1
+ preshed==3.0.9
+ prettytable==3.9.0
+ prometheus-client==0.19.0
+ prompt-toolkit==3.0.43
+ protobuf==4.24.4
+ psutil==5.9.4
+ ptxcompiler==0.8.1+2.g0d406d6
+ ptyprocess==0.7.0
+ pure-eval==0.2.2
+ pyarrow==14.0.1.dev0+gba5374836.d20240125
+ pyasn1-modules==0.3.0
+ pyasn1==0.5.1
+ pybind11-global==2.11.1
+ pybind11==2.11.1
+ pycocotools==2.0+nv0.8.0
+ pycparser==2.21
+ pydantic-core==2.16.2
+ pydantic==2.6.1
+ pygments==2.17.2
+ pylibcugraph==23.12.0
+ pylibcugraphops==23.12.0
+ pylibraft==23.12.0
+ pynvml==11.4.1
+ pyparsing==3.1.1
+ pytest-flakefinder==1.1.0
+ pytest-rerunfailures==13.0
+ pytest-shard==0.1.2
+ pytest-xdist==3.5.0
+ pytest==8.0.0
+ python-dateutil==2.8.2
+ python-dotenv==1.0.0
+ python-hostlist==1.23.0
+ pytorch-quantization==2.1.2
+ pytz==2023.3.post1
+ pyyaml==6.0.1
+ pyzmq==25.1.2
+ raft-dask==23.12.0
+ rapids-dask-dependency==23.12.1
+ referencing==0.33.0
+ regex==2023.12.25
+ requests-oauthlib==1.3.1
+ requests==2.31.0
+ rich==13.7.0
+ rmm==23.12.0
+ rpds-py==0.17.1
+ rsa==4.9
+ sacrebleu==2.4.0
+ safetensors==0.4.3
+ scikit-learn==1.2.0
+ scipy==1.12.0
+ send2trash==1.8.2
+ sentencepiece==0.1.99
+ sentry-sdk==2.12.0
+ setproctitle==1.3.3
+ setuptools==68.2.2
+ six==1.16.0
+ smart-open==6.4.0
+ smmap==5.0.1
+ sortedcontainers==2.4.0
+ soundfile==0.12.1
+ soupsieve==2.5
+ soxr==0.3.7
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ spacy==3.7.2
+ sphinx-glpi-theme==0.6
+ srsly==2.4.8
+ stack-data==0.6.3
+ sympy==1.12
+ tabulate==0.9.0
+ tbb==2021.11.0
+ tblib==3.0.0
+ tensorboard-data-server==0.6.1
+ tensorboard-plugin-wit==1.8.1
+ tensorboard==2.9.0
+ tensorrt==8.6.3
+ terminado==0.18.0
+ termplotlib==0.3.9
+ thinc==8.2.3
+ threadpoolctl==3.2.0
+ thriftpy2==0.4.17
+ tinycss2==1.2.1
+ tokenizers==0.19.1
+ toml==0.10.2
+ tomli==2.0.1
+ toolz==0.12.1
+ torch-tensorrt==2.3.0a0
+ torch==2.3.0a0+ebedce2
+ torchdata==0.7.1a0
+ torchtext==0.17.0a0
+ torchvision==0.18.0a0
+ tornado==6.4
+ tqdm==4.66.1
+ traitlets==5.9.0
+ transformer-engine==1.3.0+5b90b7f
+ transformers==4.43.3
+ treelite-runtime==3.9.1
+ treelite==3.9.1
+ triton==2.2.0+e28a256
+ typer==0.9.0
+ types-dataclasses==0.6.6
+ typing-extensions==4.9.0
+ ucx-py==0.35.0
+ uff==0.6.9
+ ujson==5.8.0
+ urllib3==1.26.18
+ wandb==0.16.3
+ wasabi==1.1.2
+ wcwidth==0.2.13
+ weasel==0.3.4
+ webencodings==0.5.1
+ werkzeug==3.0.1
+ wheel==0.42.0
+ xdoctest==1.0.2
+ xgboost==1.7.6
+ yarl==1.9.4
+ zict==3.0.0
+ zipp==3.17.0
wandb/run-20240804_143449-7tyiihss/files/wandb-metadata.json ADDED
@@ -0,0 +1,215 @@
+ {
+     "os": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
+     "python": "3.10.12",
+     "heartbeatAt": "2024-08-04T05:34:50.487822",
+     "startedAt": "2024-08-04T05:34:49.889154",
+     "docker": null,
+     "cuda": null,
+     "args": [
+         "--seq-length",
+         "512",
+         "--sliding-window-size",
+         "4096",
+         "--micro-batch-size",
+         "8",
+         "--global-batch-size",
+         "320",
+         "--train-iters",
+         "2000",
+         "--tokenizer-type",
+         "Llama2Tokenizer",
+         "--tokenizer-model",
+         "/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model",
+         "--train-data-path",
+         "4013541",
+         "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+         "--valid-data-path",
+         "4013541",
+         "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+         "--test-data-path",
+         "4013541",
+         "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+         "--lr",
+         "2e-5",
+         "--min-lr",
+         "1e-6",
+         "--lr-decay-style",
+         "cosine",
+         "--lr-warmup-iters",
+         "500",
+         "--lr-decay-iters",
+         "2000",
+         "--weight-decay",
+         "0.1",
+         "--grad-clip-norm",
+         "1.0",
+         "--optimizer",
+         "adam",
+         "--adam-beta1",
+         "0.9",
+         "--adam-beta2",
+         "0.95",
+         "--adam-eps",
+         "1e-6",
+         "--save-interval",
+         "200",
+         "--eval-interval",
+         "200",
+         "--eval-iters",
+         "10",
+         "--bf16",
+         "--mixed-precision",
+         "--base-model",
+         "/share/pretrained_lm/meta-llama/TinyLlama_v1.1",
+         "--save",
+         "/work/llm_recipes/models/tiny-llama",
+         "--load",
+         "/work/llm_recipes/models/tiny-llama",
+         "--fsdp-activation-checkpointing",
+         "--sharding-strategy",
+         "FULL_SHARD",
+         "--checkpoint-type",
+         "LOCAL_STATE_DICT",
+         "--save-n-checkpoints",
+         "10",
+         "--hf-upload-retry-limit",
+         "2",
+         "--hf-repo-id",
+         "koichi12/tiny-llama",
+         "--wandb-entity",
+         "iwakawa-koichi-q5-tohoku-nlp6723",
+         "--wandb-project",
+         "llm_tutorial",
+         "--wandb-name",
+         "tiny-llama_train_2024-08-04-14:34:38"
+     ],
+     "state": "running",
+     "program": "/project/examples/finetuning.py",
+     "codePathLocal": "examples/finetuning.py",
+     "codePath": "examples/finetuning.py",
+     "git": {
+         "remote": "https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git",
+         "commit": "3be5353210a678dc7008f237fa16b99f2bdf36ea"
+     },
+     "email": null,
+     "root": "/project",
+     "host": "gpu-koiwa-00",
+     "username": "koiwa",
+     "executable": "/usr/bin/python",
+     "cpu_count": 18,
+     "cpu_count_logical": 18,
+     "cpu_freq": {
+         "current": 2400.0389999999993,
+         "min": 0.0,
+         "max": 0.0
+     },
+     "cpu_freq_per_core": [
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.039,
+             "min": 0.0,
+             "max": 0.0
+         }
+     ],
+     "disk": {
+         "/": {
+             "total": 0.0625,
+             "used": 1.1444091796875e-05
+         }
+     },
+     "gpu": "NVIDIA A100-SXM4-40GB",
+     "gpu_count": 1,
+     "gpu_devices": [
+         {
+             "name": "NVIDIA A100-SXM4-40GB",
+             "memory_total": 42949672960
+         }
+     ],
+     "memory": {
+         "total": 56.48781967163086
+     }
+ }
wandb/run-20240804_143449-7tyiihss/files/wandb-summary.json ADDED
@@ -0,0 +1 @@
+ {"_wandb": {"runtime": 3}}
wandb/run-20240804_143449-7tyiihss/logs/debug-internal.log ADDED
@@ -0,0 +1,186 @@
+ 2024-08-04 14:34:49,906 INFO StreamThr :11193 [internal.py:wandb_internal():86] W&B internal server running at pid: 11193, started at: 2024-08-04 14:34:49.905947
+ 2024-08-04 14:34:49,908 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: status
+ 2024-08-04 14:34:49,910 INFO WriterThread:11193 [datastore.py:open_for_write():87] open: /project/wandb/run-20240804_143449-7tyiihss/run-7tyiihss.wandb
+ 2024-08-04 14:34:49,911 DEBUG SenderThread:11193 [sender.py:send():382] send: header
+ 2024-08-04 14:34:49,924 DEBUG SenderThread:11193 [sender.py:send():382] send: run
+ 2024-08-04 14:34:50,371 INFO SenderThread:11193 [dir_watcher.py:__init__():211] watching files in: /project/wandb/run-20240804_143449-7tyiihss/files
+ 2024-08-04 14:34:50,372 INFO SenderThread:11193 [sender.py:_start_run_threads():1136] run started: 7tyiihss with start time 1722749689.905326
+ 2024-08-04 14:34:50,377 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: check_version
+ 2024-08-04 14:34:50,377 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: check_version
+ 2024-08-04 14:34:50,468 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: run_start
+ 2024-08-04 14:34:50,474 DEBUG HandlerThread:11193 [system_info.py:__init__():27] System info init
+ 2024-08-04 14:34:50,474 DEBUG HandlerThread:11193 [system_info.py:__init__():42] System info init done
+ 2024-08-04 14:34:50,474 INFO HandlerThread:11193 [system_monitor.py:start():194] Starting system monitor
+ 2024-08-04 14:34:50,474 INFO SystemMonitor:11193 [system_monitor.py:_start():158] Starting system asset monitoring threads
+ 2024-08-04 14:34:50,475 INFO HandlerThread:11193 [system_monitor.py:probe():214] Collecting system info
+ 2024-08-04 14:34:50,475 INFO SystemMonitor:11193 [interfaces.py:start():190] Started cpu monitoring
+ 2024-08-04 14:34:50,475 INFO SystemMonitor:11193 [interfaces.py:start():190] Started disk monitoring
+ 2024-08-04 14:34:50,477 INFO SystemMonitor:11193 [interfaces.py:start():190] Started gpu monitoring
+ 2024-08-04 14:34:50,477 INFO SystemMonitor:11193 [interfaces.py:start():190] Started memory monitoring
+ 2024-08-04 14:34:50,478 INFO SystemMonitor:11193 [interfaces.py:start():190] Started network monitoring
+ 2024-08-04 14:34:50,487 DEBUG HandlerThread:11193 [system_info.py:probe():151] Probing system
+ 2024-08-04 14:34:50,490 DEBUG HandlerThread:11193 [system_info.py:_probe_git():136] Probing git
+ 2024-08-04 14:34:50,504 DEBUG HandlerThread:11193 [system_info.py:_probe_git():144] Probing git done
+ 2024-08-04 14:34:50,504 DEBUG HandlerThread:11193 [system_info.py:probe():199] Probing system done
+ 2024-08-04 14:34:50,504 DEBUG HandlerThread:11193 [system_monitor.py:probe():223] {'os': 'Linux-5.15.0-91-generic-x86_64-with-glibc2.35', 'python': '3.10.12', 'heartbeatAt': '2024-08-04T05:34:50.487822', 'startedAt': '2024-08-04T05:34:49.889154', 'docker': None, 'cuda': None, 'args': ('--seq-length', '512', '--sliding-window-size', '4096', '--micro-batch-size', '8', '--global-batch-size', '320', '--train-iters', '2000', '--tokenizer-type', 'Llama2Tokenizer', '--tokenizer-model', '/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model', '--train-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--valid-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--test-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--lr', '2e-5', '--min-lr', '1e-6', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '500', '--lr-decay-iters', '2000', '--weight-decay', '0.1', '--grad-clip-norm', '1.0', '--optimizer', 'adam', '--adam-beta1', '0.9', '--adam-beta2', '0.95', '--adam-eps', '1e-6', '--save-interval', '200', '--eval-interval', '200', '--eval-iters', '10', '--bf16', '--mixed-precision', '--base-model', '/share/pretrained_lm/meta-llama/TinyLlama_v1.1', '--save', '/work/llm_recipes/models/tiny-llama', '--load', '/work/llm_recipes/models/tiny-llama', '--fsdp-activation-checkpointing', '--sharding-strategy', 'FULL_SHARD', '--checkpoint-type', 'LOCAL_STATE_DICT', '--save-n-checkpoints', '10', '--hf-upload-retry-limit', '2', '--hf-repo-id', 'koichi12/tiny-llama', '--wandb-entity', 'iwakawa-koichi-q5-tohoku-nlp6723', '--wandb-project', 'llm_tutorial', '--wandb-name', 'tiny-llama_train_2024-08-04-14:34:38'), 'state': 'running', 'program': '/project/examples/finetuning.py', 'codePathLocal': 'examples/finetuning.py', 'codePath': 'examples/finetuning.py', 'git': {'remote': 'https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git', 'commit': '3be5353210a678dc7008f237fa16b99f2bdf36ea'}, 'email': None, 'root': '/project', 'host': 'gpu-koiwa-00', 'username': 'koiwa', 'executable': '/usr/bin/python', 'cpu_count': 18, 'cpu_count_logical': 18, 'cpu_freq': {'current': 2400.0389999999993, 'min': 0.0, 'max': 0.0}, 'cpu_freq_per_core': [{'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}, {'current': 2400.039, 'min': 0.0, 'max': 0.0}], 'disk': {'/': {'total': 0.0625, 'used': 1.1444091796875e-05}}, 'gpu': 'NVIDIA A100-SXM4-40GB', 'gpu_count': 1, 'gpu_devices': [{'name': 'NVIDIA A100-SXM4-40GB', 'memory_total': 42949672960}], 'memory': {'total': 56.48781967163086}}
+ 2024-08-04 14:34:50,505 INFO HandlerThread:11193 [system_monitor.py:probe():224] Finished collecting system info
+ 2024-08-04 14:34:50,505 INFO HandlerThread:11193 [system_monitor.py:probe():227] Publishing system info
+ 2024-08-04 14:34:50,506 INFO HandlerThread:11193 [system_monitor.py:probe():229] Finished publishing system info
+ 2024-08-04 14:34:50,512 DEBUG SenderThread:11193 [sender.py:send():382] send: files
+ 2024-08-04 14:34:50,512 INFO SenderThread:11193 [sender.py:_save_file():1403] saving file wandb-metadata.json with policy now
+ 2024-08-04 14:34:50,521 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: python_packages
+ 2024-08-04 14:34:50,521 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-04 14:34:50,521 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-04 14:34:50,521 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: python_packages
+ 2024-08-04 14:34:50,523 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-04 14:34:50,781 DEBUG SenderThread:11193 [sender.py:send():382] send: telemetry
+ 2024-08-04 14:34:51,211 INFO wandb-upload_0:11193 [upload_job.py:push():131] Uploaded file /tmp/tmp2tpc65lqwandb/b71f3euv-wandb-metadata.json
+ 2024-08-04 14:34:51,373 INFO Thread-12 :11193 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_143449-7tyiihss/files/wandb-metadata.json
+ 2024-08-04 14:34:51,374 INFO Thread-12 :11193 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_143449-7tyiihss/files/requirements.txt
+ 2024-08-04 14:34:52,374 INFO Thread-12 :11193 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_143449-7tyiihss/files/output.log
+ 2024-08-04 14:34:53,774 DEBUG SenderThread:11193 [sender.py:send():382] send: config
+ 2024-08-04 14:34:53,774 DEBUG SenderThread:11193 [sender.py:send():382] send: config
+ 2024-08-04 14:34:53,858 DEBUG SenderThread:11193 [sender.py:send():382] send: exit
+ 2024-08-04 14:34:53,858 INFO SenderThread:11193 [sender.py:send_exit():589] handling exit code: 1
+ 2024-08-04 14:34:53,858 INFO SenderThread:11193 [sender.py:send_exit():591] handling runtime: 3
+ 2024-08-04 14:34:53,859 INFO SenderThread:11193 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-04 14:34:53,860 INFO SenderThread:11193 [sender.py:send_exit():597] send defer
+ 2024-08-04 14:34:53,860 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 14:34:53,860 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 0
+ 2024-08-04 14:34:53,860 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 14:34:53,860 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 0
+ 2024-08-04 14:34:53,860 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 1
+ 2024-08-04 14:34:53,860 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 14:34:53,860 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 1
+ 2024-08-04 14:34:53,861 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 14:34:53,861 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 1
+ 2024-08-04 14:34:53,861 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 2
+ 2024-08-04 14:34:53,861 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 14:34:53,861 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 2
+ 2024-08-04 14:34:53,861 INFO HandlerThread:11193 [system_monitor.py:finish():203] Stopping system monitor
+ 2024-08-04 14:34:53,861 DEBUG SystemMonitor:11193 [system_monitor.py:_start():172] Starting system metrics aggregation loop
+ 2024-08-04 14:34:53,861 INFO HandlerThread:11193 [interfaces.py:finish():202] Joined cpu monitor
+ 2024-08-04 14:34:53,861 DEBUG SystemMonitor:11193 [system_monitor.py:_start():179] Finished system metrics aggregation loop
+ 2024-08-04 14:34:53,861 INFO HandlerThread:11193 [interfaces.py:finish():202] Joined disk monitor
+ 2024-08-04 14:34:53,862 DEBUG SystemMonitor:11193 [system_monitor.py:_start():183] Publishing last batch of metrics
+ 2024-08-04 14:34:53,894 INFO HandlerThread:11193 [interfaces.py:finish():202] Joined gpu monitor
+ 2024-08-04 14:34:53,894 INFO HandlerThread:11193 [interfaces.py:finish():202] Joined memory monitor
+ 2024-08-04 14:34:53,894 INFO HandlerThread:11193 [interfaces.py:finish():202] Joined network monitor
+ 2024-08-04 14:34:53,894 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 14:34:53,895 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 2
+ 2024-08-04 14:34:53,895 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 3
+ 2024-08-04 14:34:53,895 DEBUG SenderThread:11193 [sender.py:send():382] send: stats
73
+ 2024-08-04 14:34:53,895 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
74
+ 2024-08-04 14:34:53,895 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 3
75
+ 2024-08-04 14:34:53,895 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
76
+ 2024-08-04 14:34:53,895 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 3
77
+ 2024-08-04 14:34:53,895 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 4
78
+ 2024-08-04 14:34:53,895 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
79
+ 2024-08-04 14:34:53,895 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 4
80
+ 2024-08-04 14:34:53,895 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
81
+ 2024-08-04 14:34:53,896 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 4
82
+ 2024-08-04 14:34:53,896 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 5
83
+ 2024-08-04 14:34:53,896 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
84
+ 2024-08-04 14:34:53,896 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 5
85
+ 2024-08-04 14:34:53,896 DEBUG SenderThread:11193 [sender.py:send():382] send: summary
86
+ 2024-08-04 14:34:53,897 INFO SenderThread:11193 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
87
+ 2024-08-04 14:34:53,897 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
88
+ 2024-08-04 14:34:53,897 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 5
89
+ 2024-08-04 14:34:53,897 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 6
90
+ 2024-08-04 14:34:53,897 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
91
+ 2024-08-04 14:34:53,897 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 6
92
+ 2024-08-04 14:34:53,897 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
93
+ 2024-08-04 14:34:53,897 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 6
94
+ 2024-08-04 14:34:53,900 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: status_report
95
+ 2024-08-04 14:34:54,104 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 7
96
+ 2024-08-04 14:34:54,104 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
97
+ 2024-08-04 14:34:54,104 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 7
98
+ 2024-08-04 14:34:54,104 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
99
+ 2024-08-04 14:34:54,104 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 7
100
+ 2024-08-04 14:34:54,376 INFO Thread-12 :11193 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_143449-7tyiihss/files/output.log
101
+ 2024-08-04 14:34:54,376 INFO Thread-12 :11193 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_143449-7tyiihss/files/config.yaml
102
+ 2024-08-04 14:34:54,376 INFO Thread-12 :11193 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_143449-7tyiihss/files/wandb-summary.json
103
+ 2024-08-04 14:34:54,858 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: poll_exit
104
+ 2024-08-04 14:34:56,041 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 8
105
+ 2024-08-04 14:34:56,041 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: poll_exit
106
+ 2024-08-04 14:34:56,041 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
107
+ 2024-08-04 14:34:56,042 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 8
108
+ 2024-08-04 14:34:56,042 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
109
+ 2024-08-04 14:34:56,042 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 8
110
+ 2024-08-04 14:34:56,042 INFO SenderThread:11193 [job_builder.py:build():296] Attempting to build job artifact
111
+ 2024-08-04 14:34:56,043 INFO SenderThread:11193 [job_builder.py:_get_source_type():426] is repo sourced job
112
+ 2024-08-04 14:34:56,056 INFO SenderThread:11193 [job_builder.py:build():402] adding wandb-job metadata file
113
+ 2024-08-04 14:34:56,064 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 9
114
+ 2024-08-04 14:34:56,065 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
115
+ 2024-08-04 14:34:56,065 DEBUG SenderThread:11193 [sender.py:send():382] send: artifact
116
+ 2024-08-04 14:34:56,065 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 9
117
+ 2024-08-04 14:34:56,380 INFO Thread-12 :11193 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_143449-7tyiihss/files/output.log
118
+ 2024-08-04 14:34:56,858 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: poll_exit
119
+ 2024-08-04 14:34:56,895 INFO SenderThread:11193 [sender.py:send_artifact():1494] sent artifact job-https___github.com_cl-tohoku_llm-recipes-failab-m1-yans.git_examples_finetuning.py - {'id': 'QXJ0aWZhY3Q6MTA5MTk2NTkzOA==', 'state': 'COMMITTED', 'artifactSequence': {'id': 'QXJ0aWZhY3RDb2xsZWN0aW9uOjM2MjY3MjMzNA==', 'latestArtifact': {'id': 'QXJ0aWZhY3Q6MTA5MzUzODM4NQ==', 'versionIndex': 3}}}
120
+ 2024-08-04 14:34:56,895 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
121
+ 2024-08-04 14:34:56,895 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 9
122
+ 2024-08-04 14:34:56,895 INFO SenderThread:11193 [dir_watcher.py:finish():358] shutting down directory watcher
123
+ 2024-08-04 14:34:57,381 INFO SenderThread:11193 [dir_watcher.py:finish():388] scan: /project/wandb/run-20240804_143449-7tyiihss/files
124
+ 2024-08-04 14:34:57,382 INFO SenderThread:11193 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_143449-7tyiihss/files/requirements.txt requirements.txt
125
+ 2024-08-04 14:34:57,382 INFO SenderThread:11193 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_143449-7tyiihss/files/config.yaml config.yaml
126
+ 2024-08-04 14:34:57,382 INFO SenderThread:11193 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_143449-7tyiihss/files/wandb-metadata.json wandb-metadata.json
127
+ 2024-08-04 14:34:57,384 INFO SenderThread:11193 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_143449-7tyiihss/files/wandb-summary.json wandb-summary.json
128
+ 2024-08-04 14:34:57,386 INFO SenderThread:11193 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_143449-7tyiihss/files/output.log output.log
129
+ 2024-08-04 14:34:57,387 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 10
130
+ 2024-08-04 14:34:57,388 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: poll_exit
131
+ 2024-08-04 14:34:57,388 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
132
+ 2024-08-04 14:34:57,388 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 10
133
+ 2024-08-04 14:34:57,389 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
134
+ 2024-08-04 14:34:57,390 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 10
135
+ 2024-08-04 14:34:57,390 INFO SenderThread:11193 [file_pusher.py:finish():172] shutting down file pusher
136
+ 2024-08-04 14:34:57,784 INFO wandb-upload_1:11193 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_143449-7tyiihss/files/config.yaml
137
+ 2024-08-04 14:34:57,859 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: poll_exit
138
+ 2024-08-04 14:34:57,859 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: poll_exit
139
+ 2024-08-04 14:34:57,882 INFO wandb-upload_0:11193 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_143449-7tyiihss/files/requirements.txt
140
+ 2024-08-04 14:34:57,946 INFO wandb-upload_3:11193 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_143449-7tyiihss/files/output.log
141
+ 2024-08-04 14:34:57,948 INFO wandb-upload_2:11193 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_143449-7tyiihss/files/wandb-summary.json
142
+ 2024-08-04 14:34:58,148 INFO Thread-11 (_thread_body):11193 [sender.py:transition_state():617] send defer: 11
143
+ 2024-08-04 14:34:58,149 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
144
+ 2024-08-04 14:34:58,149 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 11
145
+ 2024-08-04 14:34:58,149 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
146
+ 2024-08-04 14:34:58,149 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 11
147
+ 2024-08-04 14:34:58,149 INFO SenderThread:11193 [file_pusher.py:join():178] waiting for file pusher
148
+ 2024-08-04 14:34:58,149 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 12
149
+ 2024-08-04 14:34:58,150 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
150
+ 2024-08-04 14:34:58,150 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 12
151
+ 2024-08-04 14:34:58,150 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
152
+ 2024-08-04 14:34:58,150 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 12
153
+ 2024-08-04 14:34:58,150 INFO SenderThread:11193 [file_stream.py:finish():595] file stream finish called
154
+ 2024-08-04 14:34:58,337 INFO SenderThread:11193 [file_stream.py:finish():599] file stream finish is done
155
+ 2024-08-04 14:34:58,337 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 13
156
+ 2024-08-04 14:34:58,337 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
157
+ 2024-08-04 14:34:58,337 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 13
158
+ 2024-08-04 14:34:58,338 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
159
+ 2024-08-04 14:34:58,338 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 13
160
+ 2024-08-04 14:34:58,338 INFO SenderThread:11193 [sender.py:transition_state():617] send defer: 14
161
+ 2024-08-04 14:34:58,338 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: defer
162
+ 2024-08-04 14:34:58,338 DEBUG SenderThread:11193 [sender.py:send():382] send: final
163
+ 2024-08-04 14:34:58,338 DEBUG SenderThread:11193 [sender.py:send():382] send: footer
164
+ 2024-08-04 14:34:58,338 INFO HandlerThread:11193 [handler.py:handle_request_defer():172] handle defer: 14
165
+ 2024-08-04 14:34:58,339 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: defer
166
+ 2024-08-04 14:34:58,339 INFO SenderThread:11193 [sender.py:send_request_defer():613] handle sender defer: 14
167
+ 2024-08-04 14:34:58,339 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: poll_exit
168
+ 2024-08-04 14:34:58,339 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: poll_exit
169
+ 2024-08-04 14:34:58,340 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: poll_exit
170
+ 2024-08-04 14:34:58,340 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: server_info
171
+ 2024-08-04 14:34:58,340 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: poll_exit
172
+ 2024-08-04 14:34:58,340 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: get_summary
173
+ 2024-08-04 14:34:58,340 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: server_info
174
+ 2024-08-04 14:34:58,342 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: sampled_history
175
+ 2024-08-04 14:34:58,342 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: internal_messages
176
+ 2024-08-04 14:34:58,342 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: job_info
177
+ 2024-08-04 14:34:58,505 DEBUG SenderThread:11193 [sender.py:send_request():409] send_request: job_info
178
+ 2024-08-04 14:34:58,506 INFO MainThread:11193 [wandb_run.py:_footer_history_summary_info():3866] rendering history
179
+ 2024-08-04 14:34:58,506 INFO MainThread:11193 [wandb_run.py:_footer_history_summary_info():3898] rendering summary
180
+ 2024-08-04 14:34:58,506 INFO MainThread:11193 [wandb_run.py:_footer_sync_info():3825] logging synced files
181
+ 2024-08-04 14:34:58,506 DEBUG HandlerThread:11193 [handler.py:handle_request():146] handle_request: shutdown
182
+ 2024-08-04 14:34:58,506 INFO HandlerThread:11193 [handler.py:finish():869] shutting down handler
183
+ 2024-08-04 14:34:59,343 INFO WriterThread:11193 [datastore.py:close():296] close: /project/wandb/run-20240804_143449-7tyiihss/run-7tyiihss.wandb
184
+ 2024-08-04 14:34:59,506 INFO SenderThread:11193 [sender.py:finish():1572] shutting down sender
185
+ 2024-08-04 14:34:59,506 INFO SenderThread:11193 [file_pusher.py:finish():172] shutting down file pusher
186
+ 2024-08-04 14:34:59,506 INFO SenderThread:11193 [file_pusher.py:join():178] waiting for file pusher
wandb/run-20240804_143449-7tyiihss/logs/debug.log ADDED
@@ -0,0 +1,30 @@
+ 2024-08-04 14:34:49,898 INFO MainThread:11121 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_setup.py:_flush():76] Configure stats pid to 11121
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_setup.py:_flush():76] Loading settings from /singularity_home/.config/wandb/settings
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_setup.py:_flush():76] Loading settings from /project/wandb/settings
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'api_key': '***REDACTED***', 'run_notes': 'Train tiny llama sample'}
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'examples/finetuning.py', 'program_abspath': '/project/examples/finetuning.py', 'program': '/project/examples/finetuning.py'}
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_init.py:_log_setup():526] Logging user logs to /project/wandb/run-20240804_143449-7tyiihss/logs/debug.log
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_init.py:_log_setup():527] Logging internal logs to /project/wandb/run-20240804_143449-7tyiihss/logs/debug-internal.log
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_init.py:init():566] calling init triggers
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
+ config: {'sharding_strategy': 'FULL_SHARD', 'checkpoint_type': 'LOCAL_STATE_DICT', 'fsdp_activation_checkpointing': True, 'fsdp_cpu_offload': False, 'low_cpu_fsdp': False, 'no_meta_device': False, 'data_path': None, 'split': '969, 30, 1', 'train_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'valid_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'test_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'data_cache_path': None, 'vocab_size': None, 'vocab_file': None, 'merge_file': None, 'seq_length': 512, 'num_workers': 2, 'tokenizer_type': 'Llama2Tokenizer', 'tokenizer_model': '/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model', 'reset_position_ids': False, 'reset_attention_mask': False, 'eod_mask_loss': False, 'retro_return_doc_ids': False, 'short_seq_prob': 0.1, 'vocab_extra_ids': 0, 'seed': 1234, 'use_mpi': False, 'wandb_entity': 'iwakawa-koichi-q5-tohoku-nlp6723', 'wandb_name': 'tiny-llama_train_2024-08-04-14:34:38', 'wandb_project': 'llm_tutorial', 'quantization': False, 'use_freeze_layers': False, 'freeze_layers': None, 'bf16': True, 'fp16': False, 'mixed_precision': True, 'param_dtype': None, 'load': '/work/llm_recipes/models/tiny-llama', 'save': '/work/llm_recipes/models/tiny-llama', 'base_model': '/share/pretrained_lm/meta-llama/TinyLlama_v1.1', 'use_better_transformer': False, 'grad_clip_norm': 1.0, 'eval_interval': 200, 'save_interval': 200, 'eval_iters': 10, 'optimizer': 'adam', 'lr': 2e-05, 'lr_decay_style': 'cosine', 'lr_decay_iters': 2000, 'lr_warmup_iters': 500, 'min_lr': 1e-06, 'train_iters': 2000, 'train_samples': None, 'global_batch_size': 320, 'micro_batch_size': 8, 'make_vocab_size_divisible_by': 128, 'sliding_window_size': 4096, 'skip_batch': None, 'no_save_optimizer_state': False, 'continual_pretraining': False, 'instruction_tuning': False, 'direct_preference_optimization': False, 'attention_dropout': 0.1, 'hidden_dropout': 0.1, 'weight_decay': 0.1, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_eps': 1e-06, 'hf_transformer_model_dir': None, 'instruction_train_data_path': None, 'instruction_valid_data_path': None, 'epoch': None, 'instruction_dataset_size': None, 'save_sampler_state': False, 'label_smoothing': 0.0, 'save_n_checkpoints': 10, 'hf_repo_id': 'koichi12/tiny-llama', 'create_public_hf_repo': False, 'upload_all_checkpoints_to_hf': False, 'hf_upload_retry_limit': 2, 'exit_duration_in_mins': None, 'source_key': None, 'target_key': None, 'attn_implementation': 'flash_attention_2', 'efficient_instruction_tuning': False, 'remove_padding_masking': False, 'save_start_iter': None, 'rank': 0, 'world_size': 1, 'padded_vocab_size': 32000, 'gradient_accumulation_steps': 40}
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_init.py:init():616] starting backend
+ 2024-08-04 14:34:49,899 INFO MainThread:11121 [wandb_init.py:init():620] setting up manager
+ 2024-08-04 14:34:49,904 INFO MainThread:11121 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
+ 2024-08-04 14:34:49,905 INFO MainThread:11121 [wandb_init.py:init():628] backend started and connected
+ 2024-08-04 14:34:49,910 INFO MainThread:11121 [wandb_init.py:init():720] updated telemetry
+ 2024-08-04 14:34:49,920 INFO MainThread:11121 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
+ 2024-08-04 14:34:50,376 INFO MainThread:11121 [wandb_run.py:_on_init():2262] communicating current version
+ 2024-08-04 14:34:50,461 INFO MainThread:11121 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.17.5 is available! To upgrade, please run:\n $ pip install wandb --upgrade"
+
+ 2024-08-04 14:34:50,461 INFO MainThread:11121 [wandb_init.py:init():804] starting run threads in backend
+ 2024-08-04 14:34:50,520 INFO MainThread:11121 [wandb_run.py:_console_start():2241] atexit reg
+ 2024-08-04 14:34:50,520 INFO MainThread:11121 [wandb_run.py:_redirect():2096] redirect: wrap_raw
+ 2024-08-04 14:34:50,521 INFO MainThread:11121 [wandb_run.py:_redirect():2161] Wrapping output streams.
+ 2024-08-04 14:34:50,521 INFO MainThread:11121 [wandb_run.py:_redirect():2186] Redirects installed.
+ 2024-08-04 14:34:50,521 INFO MainThread:11121 [wandb_init.py:init():847] run started, returning control to user process
+ 2024-08-04 14:34:53,773 INFO MainThread:11121 [wandb_run.py:_config_callback():1343] config_cb None None {'activation_function': 'silu', 'hidden_size': 2048, 'model_type': 'llama', 'max_position_embeddings': 2048, 'num_attention_heads': 32, 'num_hidden_layers': 22, 'model_architecture': 'LlamaForCausalLM'}
+ 2024-08-04 14:34:53,774 INFO MainThread:11121 [wandb_run.py:_config_callback():1343] config_cb None None {'world_size': 1}
+ 2024-08-04 14:34:59,507 WARNING MsgRouterThr:11121 [router.py:message_loop():77] message_loop has been closed
wandb/run-20240804_143449-7tyiihss/run-7tyiihss.wandb ADDED
Binary file (20.4 kB).
 
wandb/run-20240804_153511-5ba5jbt6/files/config.yaml ADDED
@@ -0,0 +1,335 @@
+ wandb_version: 1
+
+ sharding_strategy:
+ desc: null
+ value: FULL_SHARD
+ checkpoint_type:
+ desc: null
+ value: LOCAL_STATE_DICT
+ fsdp_activation_checkpointing:
+ desc: null
+ value: true
+ fsdp_cpu_offload:
+ desc: null
+ value: false
+ low_cpu_fsdp:
+ desc: null
+ value: false
+ no_meta_device:
+ desc: null
+ value: false
+ data_path:
+ desc: null
+ value: null
+ split:
+ desc: null
+ value: 969, 30, 1
+ train_data_path:
+ desc: null
+ value:
+ - '4013541'
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ valid_data_path:
+ desc: null
+ value:
+ - '4013541'
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ test_data_path:
+ desc: null
+ value:
+ - '4013541'
+ - /work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document
+ data_cache_path:
+ desc: null
+ value: null
+ vocab_size:
+ desc: null
+ value: null
+ vocab_file:
+ desc: null
+ value: null
+ merge_file:
+ desc: null
+ value: null
+ seq_length:
+ desc: null
+ value: 512
+ num_workers:
+ desc: null
+ value: 2
+ tokenizer_type:
+ desc: null
+ value: Llama2Tokenizer
+ tokenizer_model:
+ desc: null
+ value: /share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model
+ reset_position_ids:
+ desc: null
+ value: false
+ reset_attention_mask:
+ desc: null
+ value: false
+ eod_mask_loss:
+ desc: null
+ value: false
+ retro_return_doc_ids:
+ desc: null
+ value: false
+ short_seq_prob:
+ desc: null
+ value: 0.1
+ vocab_extra_ids:
+ desc: null
+ value: 0
+ seed:
+ desc: null
+ value: 1234
+ use_mpi:
+ desc: null
+ value: false
+ wandb_entity:
+ desc: null
+ value: iwakawa-koichi-q5-tohoku-nlp6723
+ wandb_name:
+ desc: null
+ value: tiny-llama_train_2024-08-04-15:34:59
+ wandb_project:
+ desc: null
+ value: llm_tutorial
+ quantization:
+ desc: null
+ value: false
+ use_freeze_layers:
+ desc: null
+ value: false
+ freeze_layers:
+ desc: null
+ value: null
+ bf16:
+ desc: null
+ value: true
+ fp16:
+ desc: null
+ value: false
+ mixed_precision:
+ desc: null
+ value: true
+ param_dtype:
+ desc: null
+ value: null
+ load:
+ desc: null
+ value: /work/llm_recipes/models/tiny-llama
+ save:
+ desc: null
+ value: /work/llm_recipes/models/tiny-llama
+ base_model:
+ desc: null
+ value: /share/pretrained_lm/meta-llama/TinyLlama_v1.1
+ use_better_transformer:
+ desc: null
+ value: false
+ grad_clip_norm:
+ desc: null
+ value: 1.0
+ eval_interval:
+ desc: null
+ value: 200
+ save_interval:
+ desc: null
+ value: 200
+ eval_iters:
+ desc: null
+ value: 10
+ optimizer:
+ desc: null
+ value: adam
+ lr:
+ desc: null
+ value: 2.0e-05
+ lr_decay_style:
+ desc: null
+ value: cosine
+ lr_decay_iters:
+ desc: null
+ value: 2000
+ lr_warmup_iters:
+ desc: null
+ value: 500
+ min_lr:
+ desc: null
+ value: 1.0e-06
+ train_iters:
+ desc: null
+ value: 2000
+ train_samples:
+ desc: null
+ value: null
+ global_batch_size:
+ desc: null
+ value: 320
+ micro_batch_size:
+ desc: null
+ value: 8
+ make_vocab_size_divisible_by:
+ desc: null
+ value: 128
+ sliding_window_size:
+ desc: null
+ value: 4096
+ skip_batch:
+ desc: null
+ value: null
+ no_save_optimizer_state:
+ desc: null
+ value: false
+ continual_pretraining:
+ desc: null
+ value: false
+ instruction_tuning:
+ desc: null
+ value: false
+ direct_preference_optimization:
+ desc: null
+ value: false
+ attention_dropout:
+ desc: null
+ value: 0.1
+ hidden_dropout:
+ desc: null
+ value: 0.1
+ weight_decay:
+ desc: null
+ value: 0.1
+ adam_beta1:
+ desc: null
+ value: 0.9
+ adam_beta2:
+ desc: null
+ value: 0.95
+ adam_eps:
+ desc: null
+ value: 1.0e-06
+ hf_transformer_model_dir:
+ desc: null
+ value: null
+ instruction_train_data_path:
+ desc: null
+ value: null
+ instruction_valid_data_path:
+ desc: null
+ value: null
+ epoch:
+ desc: null
+ value: null
+ instruction_dataset_size:
+ desc: null
+ value: null
+ save_sampler_state:
+ desc: null
+ value: false
+ label_smoothing:
+ desc: null
+ value: 0.0
+ save_n_checkpoints:
+ desc: null
+ value: 10
+ hf_repo_id:
+ desc: null
+ value: koichi12/tiny-llama
+ create_public_hf_repo:
+ desc: null
+ value: false
+ upload_all_checkpoints_to_hf:
+ desc: null
+ value: false
+ hf_upload_retry_limit:
+ desc: null
+ value: 2
+ exit_duration_in_mins:
+ desc: null
+ value: null
+ source_key:
+ desc: null
+ value: null
+ target_key:
+ desc: null
+ value: null
+ attn_implementation:
+ desc: null
+ value: flash_attention_2
+ efficient_instruction_tuning:
+ desc: null
+ value: false
+ remove_padding_masking:
+ desc: null
+ value: false
+ save_start_iter:
+ desc: null
+ value: null
+ rank:
+ desc: null
+ value: 0
+ world_size:
+ desc: null
+ value: 1
+ padded_vocab_size:
+ desc: null
+ value: 32000
+ gradient_accumulation_steps:
+ desc: null
+ value: 40
+ _wandb:
+ desc: null
+ value:
+ python_version: 3.10.12
+ cli_version: 0.16.3
+ framework: huggingface
+ huggingface_version: 4.43.3
+ is_jupyter_run: false
+ is_kaggle_kernel: false
+ start_time: 1722753311.766293
+ t:
+ 1:
+ - 1
+ - 11
+ - 49
+ - 55
+ - 71
+ 2:
+ - 1
+ - 11
+ - 49
+ - 55
+ - 71
+ 3:
+ - 13
+ - 16
+ - 23
+ 4: 3.10.12
+ 5: 0.16.3
+ 6: 4.43.3
+ 8:
+ - 5
+ 13: linux-x86_64
+ activation_function:
+ desc: null
+ value: silu
+ hidden_size:
+ desc: null
+ value: 2048
+ model_type:
+ desc: null
+ value: llama
+ max_position_embeddings:
+ desc: null
+ value: 2048
+ num_attention_heads:
+ desc: null
+ value: 32
+ num_hidden_layers:
+ desc: null
+ value: 22
+ model_architecture:
+ desc: null
+ value: LlamaForCausalLM
wandb/run-20240804_153511-5ba5jbt6/files/output.log ADDED
@@ -0,0 +1,135 @@
+ Created Hugging Face repository with ID koichi12/tiny-llama.
+ Clearing GPU cache for all ranks
+ --> Running with torch torch_distributed debug set to detail
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ No checkpoint found in /work/llm_recipes/models/tiny-llama, skipping model loading
+ --> Model /share/pretrained_lm/meta-llama/TinyLlama_v1.1
+ --> /share/pretrained_lm/meta-llama/TinyLlama_v1.1 has 1100.048384 Million params
+ You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
+ You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
+ Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
+ Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in LlamaModel is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
+ BFloat16 enabled for mixed precision - using bfSixteen policy
+ --> applying fsdp activation checkpointing...
+ > datasets target sizes (minimum size):
+     train:      640000
+     validation: 35200
+     test:       3200
+ > building train, validation, and test datasets for GPT ...
+ > finished creating GPT datasets ...
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ No checkpoint found in /work/llm_recipes/models/tiny-llama, skipping optimizer loading
+ File not found: /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ Unable to read latest iteration from /work/llm_recipes/models/tiny-llama/latest_iteration.txt
+ model info: FullyShardedDataParallel(
+   (_fsdp_wrapped_module): LlamaForCausalLM(
+     (model): LlamaModel(
+       (embed_tokens): Embedding(32000, 2048)
+       (layers): ModuleList(
+         (0-21): 22 x FullyShardedDataParallel(
+           (_fsdp_wrapped_module): CheckpointWrapper(
+             (_checkpoint_wrapped_module): LlamaDecoderLayer(
+               (self_attn): LlamaFlashAttention2(
+                 (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
+                 (k_proj): Linear(in_features=2048, out_features=256, bias=False)
+                 (v_proj): Linear(in_features=2048, out_features=256, bias=False)
+                 (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
+                 (rotary_emb): LlamaRotaryEmbedding()
+               )
+               (mlp): LlamaMLP(
+                 (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
+                 (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
+                 (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
+                 (act_fn): SiLU()
+               )
+               (input_layernorm): LlamaRMSNorm()
+               (post_attention_layernorm): LlamaRMSNorm()
+             )
+           )
+         )
+       )
+       (norm): LlamaRMSNorm()
+       (rotary_emb): LlamaRotaryEmbedding()
+     )
+     (lm_head): Linear(in_features=2048, out_features=32000, bias=False)
+   )
+ )
+ model config: LlamaConfig {
+   "_name_or_path": "/share/pretrained_lm/meta-llama/TinyLlama_v1.1",
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": 5632,
+   "label_smoothing": 0.0,
+   "max_position_embeddings": 2048,
+   "mlp_bias": false,
+   "model_type": "llama",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 22,
+   "num_key_value_heads": 4,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": null,
+   "rope_theta": 10000.0,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.43.3",
+   "use_cache": false,
+   "vocab_size": 32000
+ }
+ /usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_init_utils.py:441: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.FULL_SHARD since the world size is 1.
+   warnings.warn(
+ Let split = None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Building a BlendedDataset for a single MegatronDataset
+ Unable to save the indexes because path_to_cache is None
+ Traceback (most recent call last):
+   File "/project/examples/finetuning.py", line 13, in <module>
+     main()
+   File "/project/src/llama_recipes/finetuning.py", line 281, in main
+     train(
+   File "/project/src/llama_recipes/utils/train_utils.py", line 104, in train
+     batch = next(train_dataloader)
+   File "/project/src/llama_recipes/utils/train_utils.py", line 24, in cyclic_iter
+     for x in iter:
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
+     data = self._next_data()
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
+     return self._process_data(data)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
+     data.reraise()
+   File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 705, in reraise
+     raise exception
+ RuntimeError: Caught RuntimeError in DataLoader worker process 0.
+ Original Traceback (most recent call last):
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
+     data = fetcher.fetch(index)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
+     return self.collate_fn(data)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 277, in default_collate
+     return collate(batch, collate_fn_map=default_collate_fn_map)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 129, in collate
+     return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 129, in <dictcomp>
+     return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 121, in collate
+     return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
+   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 174, in collate_tensor_fn
+     return torch.stack(batch, 0, out=out)
+ RuntimeError: stack expects each tensor to be equal size, but got [513] at entry 0 and [543] at entry 1
wandb/run-20240804_153511-5ba5jbt6/files/requirements.txt ADDED
@@ -0,0 +1,271 @@
+ absl-py==2.1.0
+ accelerate==0.33.0
+ aiohttp==3.9.1
+ aiosignal==1.3.1
+ annotated-types==0.6.0
+ apex==0.1
+ appdirs==1.4.4
+ argon2-cffi-bindings==21.2.0
+ argon2-cffi==23.1.0
+ asttokens==2.4.1
+ astunparse==1.6.3
+ async-timeout==4.0.3
+ attrs==23.2.0
+ audioread==3.0.1
+ beautifulsoup4==4.12.3
+ bleach==6.1.0
+ blis==0.7.11
+ cachetools==5.3.2
+ catalogue==2.0.10
+ certifi==2024.2.2
+ cffi==1.16.0
+ charset-normalizer==3.3.2
+ click==8.1.7
+ cloudpathlib==0.16.0
+ cloudpickle==3.0.0
+ cmake==3.28.1
+ colorama==0.4.6
+ comm==0.2.1
+ confection==0.1.4
+ contourpy==1.2.0
+ cubinlinker==0.3.0+2.g405ac64
+ cuda-python==12.3.0rc4+9.gdb8c48a.dirty
+ cudf==23.12.0
+ cugraph-dgl==23.12.0
+ cugraph-service-client==23.12.0
+ cugraph-service-server==23.12.0
+ cugraph==23.12.0
+ cuml==23.12.0
+ cupy-cuda12x==12.3.0
+ cycler==0.12.1
+ cymem==2.0.8
+ cython==3.0.8
+ dask-cuda==23.12.0
+ dask-cudf==23.12.0
+ dask==2023.11.0
+ debugpy==1.8.1
+ decorator==5.1.1
+ defusedxml==0.7.1
+ distributed==2023.11.0
+ dm-tree==0.1.8
+ docker-pycreds==0.4.0
+ einops==0.7.0
+ exceptiongroup==1.2.0
+ execnet==2.0.2
+ executing==2.0.1
+ expecttest==0.1.3
+ fastjsonschema==2.19.1
+ fastrlock==0.8.2
+ filelock==3.13.1
+ flash-attn==2.4.2
+ fonttools==4.48.1
+ frozenlist==1.4.1
+ fsspec==2023.12.2
+ gast==0.5.4
+ gitdb==4.0.11
+ gitpython==3.1.43
+ google-auth-oauthlib==0.4.6
+ google-auth==2.27.0
+ graphsurgeon==0.4.6
+ grpcio==1.60.1
+ huggingface-hub==0.24.5
+ hypothesis==5.35.1
+ idna==3.6
+ importlib-metadata==7.0.1
+ iniconfig==2.0.0
+ intel-openmp==2021.4.0
+ ipadic==1.0.0
+ ipykernel==6.29.2
+ ipython-genutils==0.2.0
+ ipython==8.21.0
+ jedi==0.19.1
+ jinja2==3.1.3
+ joblib==1.3.2
+ json5==0.9.14
+ jsonnet==0.19.1
+ jsonschema-specifications==2023.12.1
+ jsonschema==4.21.1
+ jupyter-client==8.6.0
+ jupyter-core==5.7.1
+ jupyter-tensorboard==0.2.0
+ jupyterlab-pygments==0.3.0
+ jupyterlab-server==1.2.0
+ jupyterlab==2.3.2
+ jupytext==1.16.1
+ kiwisolver==1.4.5
+ langcodes==3.3.0
+ lazy-loader==0.3
+ librosa==0.10.1
+ llvmlite==0.40.1
+ locket==1.0.0
+ logzero==1.7.0
+ lxml==5.2.2
+ markdown-it-py==3.0.0
+ markdown==3.5.2
+ markupsafe==2.1.4
+ matplotlib-inline==0.1.6
+ matplotlib==3.8.2
+ mdit-py-plugins==0.4.0
+ mdurl==0.1.2
+ mecab-python3==1.0.6
+ mistune==3.0.2
+ mkl-devel==2021.1.1
+ mkl-include==2021.1.1
+ mkl==2021.1.1
+ mock==5.1.0
+ more-itertools==9.1.0
+ mpmath==1.3.0
+ msgpack==1.0.7
+ multidict==6.0.4
+ murmurhash==1.0.10
+ nbclient==0.9.0
+ nbconvert==7.16.0
+ nbformat==5.9.2
+ nest-asyncio==1.6.0
+ networkx==2.6.3
+ ninja==1.11.1.1
+ nltk==3.8.1
+ notebook==6.4.10
+ numba==0.57.1+1.g1ff679645
+ numpy==1.24.4
+ nvfuser==0.1.4a0+d0bb811
+ nvidia-dali-cuda120==1.34.0
+ nvidia-pyindex==1.0.9
+ nvtx==0.2.5
+ oauthlib==3.2.2
+ onnx==1.15.0rc2
+ opencv==4.7.0
+ optree==0.10.0
+ packaging==23.2
+ pandas==1.5.3
+ pandocfilters==1.5.1
+ parso==0.8.3
+ partd==1.4.1
+ peft==0.11.1
+ pexpect==4.9.0
+ pillow==10.2.0
+ pip==24.0
+ platformdirs==4.2.0
+ pluggy==1.4.0
+ ply==3.11
+ polygraphy==0.49.4
+ pooch==1.8.0
+ portalocker==2.10.1
+ preshed==3.0.9
+ prettytable==3.9.0
+ prometheus-client==0.19.0
+ prompt-toolkit==3.0.43
+ protobuf==4.24.4
+ psutil==5.9.4
+ ptxcompiler==0.8.1+2.g0d406d6
+ ptyprocess==0.7.0
+ pure-eval==0.2.2
+ pyarrow==14.0.1.dev0+gba5374836.d20240125
+ pyasn1-modules==0.3.0
+ pyasn1==0.5.1
+ pybind11-global==2.11.1
+ pybind11==2.11.1
+ pycocotools==2.0+nv0.8.0
+ pycparser==2.21
+ pydantic-core==2.16.2
+ pydantic==2.6.1
+ pygments==2.17.2
+ pylibcugraph==23.12.0
+ pylibcugraphops==23.12.0
+ pylibraft==23.12.0
+ pynvml==11.4.1
+ pyparsing==3.1.1
+ pytest-flakefinder==1.1.0
+ pytest-rerunfailures==13.0
+ pytest-shard==0.1.2
+ pytest-xdist==3.5.0
+ pytest==8.0.0
+ python-dateutil==2.8.2
+ python-dotenv==1.0.0
+ python-hostlist==1.23.0
+ pytorch-quantization==2.1.2
+ pytz==2023.3.post1
+ pyyaml==6.0.1
+ pyzmq==25.1.2
+ raft-dask==23.12.0
+ rapids-dask-dependency==23.12.1
+ referencing==0.33.0
+ regex==2023.12.25
+ requests-oauthlib==1.3.1
+ requests==2.31.0
+ rich==13.7.0
+ rmm==23.12.0
+ rpds-py==0.17.1
+ rsa==4.9
+ sacrebleu==2.4.0
+ safetensors==0.4.3
+ scikit-learn==1.2.0
+ scipy==1.12.0
+ send2trash==1.8.2
+ sentencepiece==0.1.99
+ sentry-sdk==2.12.0
+ setproctitle==1.3.3
+ setuptools==68.2.2
+ six==1.16.0
+ smart-open==6.4.0
+ smmap==5.0.1
+ sortedcontainers==2.4.0
+ soundfile==0.12.1
+ soupsieve==2.5
+ soxr==0.3.7
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ spacy==3.7.2
+ sphinx-glpi-theme==0.6
+ srsly==2.4.8
+ stack-data==0.6.3
+ sympy==1.12
+ tabulate==0.9.0
+ tbb==2021.11.0
+ tblib==3.0.0
+ tensorboard-data-server==0.6.1
+ tensorboard-plugin-wit==1.8.1
+ tensorboard==2.9.0
+ tensorrt==8.6.3
+ terminado==0.18.0
+ termplotlib==0.3.9
+ thinc==8.2.3
+ threadpoolctl==3.2.0
+ thriftpy2==0.4.17
+ tinycss2==1.2.1
+ tokenizers==0.19.1
+ toml==0.10.2
+ tomli==2.0.1
+ toolz==0.12.1
+ torch-tensorrt==2.3.0a0
+ torch==2.3.0a0+ebedce2
+ torchdata==0.7.1a0
+ torchtext==0.17.0a0
+ torchvision==0.18.0a0
+ tornado==6.4
+ tqdm==4.66.1
+ traitlets==5.9.0
+ transformer-engine==1.3.0+5b90b7f
+ transformers==4.43.3
+ treelite-runtime==3.9.1
+ treelite==3.9.1
+ triton==2.2.0+e28a256
+ typer==0.9.0
+ types-dataclasses==0.6.6
+ typing-extensions==4.9.0
+ ucx-py==0.35.0
+ uff==0.6.9
+ ujson==5.8.0
+ urllib3==1.26.18
+ wandb==0.16.3
+ wasabi==1.1.2
+ wcwidth==0.2.13
+ weasel==0.3.4
+ webencodings==0.5.1
+ werkzeug==3.0.1
+ wheel==0.42.0
+ xdoctest==1.0.2
+ xgboost==1.7.6
+ yarl==1.9.4
+ zict==3.0.0
+ zipp==3.17.0
wandb/run-20240804_153511-5ba5jbt6/files/wandb-metadata.json ADDED
@@ -0,0 +1,215 @@
+ {
+     "os": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
+     "python": "3.10.12",
+     "heartbeatAt": "2024-08-04T06:35:12.365765",
+     "startedAt": "2024-08-04T06:35:11.753150",
+     "docker": null,
+     "cuda": null,
+     "args": [
+         "--seq-length",
+         "512",
+         "--sliding-window-size",
+         "4096",
+         "--micro-batch-size",
+         "8",
+         "--global-batch-size",
+         "320",
+         "--train-iters",
+         "2000",
+         "--tokenizer-type",
+         "Llama2Tokenizer",
+         "--tokenizer-model",
+         "/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model",
+         "--train-data-path",
+         "4013541",
+         "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+         "--valid-data-path",
+         "4013541",
+         "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+         "--test-data-path",
+         "4013541",
+         "/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document",
+         "--lr",
+         "2e-5",
+         "--min-lr",
+         "1e-6",
+         "--lr-decay-style",
+         "cosine",
+         "--lr-warmup-iters",
+         "500",
+         "--lr-decay-iters",
+         "2000",
+         "--weight-decay",
+         "0.1",
+         "--grad-clip-norm",
+         "1.0",
+         "--optimizer",
+         "adam",
+         "--adam-beta1",
+         "0.9",
+         "--adam-beta2",
+         "0.95",
+         "--adam-eps",
+         "1e-6",
+         "--save-interval",
+         "200",
+         "--eval-interval",
+         "200",
+         "--eval-iters",
+         "10",
+         "--bf16",
+         "--mixed-precision",
+         "--base-model",
+         "/share/pretrained_lm/meta-llama/TinyLlama_v1.1",
+         "--save",
+         "/work/llm_recipes/models/tiny-llama",
+         "--load",
+         "/work/llm_recipes/models/tiny-llama",
+         "--fsdp-activation-checkpointing",
+         "--sharding-strategy",
+         "FULL_SHARD",
+         "--checkpoint-type",
+         "LOCAL_STATE_DICT",
+         "--save-n-checkpoints",
+         "10",
+         "--hf-upload-retry-limit",
+         "2",
+         "--hf-repo-id",
+         "koichi12/tiny-llama",
+         "--wandb-entity",
+         "iwakawa-koichi-q5-tohoku-nlp6723",
+         "--wandb-project",
+         "llm_tutorial",
+         "--wandb-name",
+         "tiny-llama_train_2024-08-04-15:34:59"
+     ],
+     "state": "running",
+     "program": "/project/examples/finetuning.py",
+     "codePathLocal": "examples/finetuning.py",
+     "codePath": "examples/finetuning.py",
+     "git": {
+         "remote": "https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git",
+         "commit": "3be5353210a678dc7008f237fa16b99f2bdf36ea"
+     },
+     "email": null,
+     "root": "/project",
+     "host": "gpu-koiwa-00",
+     "username": "koiwa",
+     "executable": "/usr/bin/python",
+     "cpu_count": 18,
+     "cpu_count_logical": 18,
+     "cpu_freq": {
+         "current": 2400.034,
+         "min": 0.0,
+         "max": 0.0
+     },
+     "cpu_freq_per_core": [
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         },
+         {
+             "current": 2400.034,
+             "min": 0.0,
+             "max": 0.0
+         }
+     ],
+     "disk": {
+         "/": {
+             "total": 0.0625,
+             "used": 1.1444091796875e-05
+         }
+     },
+     "gpu": "NVIDIA A100-SXM4-40GB",
+     "gpu_count": 1,
+     "gpu_devices": [
+         {
+             "name": "NVIDIA A100-SXM4-40GB",
+             "memory_total": 42949672960
+         }
+     ],
+     "memory": {
+         "total": 56.48781967163086
+     }
+ }
wandb/run-20240804_153511-5ba5jbt6/files/wandb-summary.json ADDED
@@ -0,0 +1 @@
+ {"_wandb": {"runtime": 3}}
wandb/run-20240804_153511-5ba5jbt6/logs/debug-internal.log ADDED
@@ -0,0 +1,188 @@
+ 2024-08-04 15:35:11,766 INFO StreamThr :10035 [internal.py:wandb_internal():86] W&B internal server running at pid: 10035, started at: 2024-08-04 15:35:11.765926
+ 2024-08-04 15:35:11,768 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: status
+ 2024-08-04 15:35:11,771 INFO WriterThread:10035 [datastore.py:open_for_write():87] open: /project/wandb/run-20240804_153511-5ba5jbt6/run-5ba5jbt6.wandb
+ 2024-08-04 15:35:11,772 DEBUG SenderThread:10035 [sender.py:send():382] send: header
+ 2024-08-04 15:35:11,786 DEBUG SenderThread:10035 [sender.py:send():382] send: run
+ 2024-08-04 15:35:12,256 INFO SenderThread:10035 [dir_watcher.py:__init__():211] watching files in: /project/wandb/run-20240804_153511-5ba5jbt6/files
+ 2024-08-04 15:35:12,256 INFO SenderThread:10035 [sender.py:_start_run_threads():1136] run started: 5ba5jbt6 with start time 1722753311.766293
+ 2024-08-04 15:35:12,259 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: check_version
+ 2024-08-04 15:35:12,260 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: check_version
+ 2024-08-04 15:35:12,346 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: run_start
+ 2024-08-04 15:35:12,352 DEBUG HandlerThread:10035 [system_info.py:__init__():27] System info init
+ 2024-08-04 15:35:12,352 DEBUG HandlerThread:10035 [system_info.py:__init__():42] System info init done
+ 2024-08-04 15:35:12,352 INFO HandlerThread:10035 [system_monitor.py:start():194] Starting system monitor
+ 2024-08-04 15:35:12,352 INFO SystemMonitor:10035 [system_monitor.py:_start():158] Starting system asset monitoring threads
+ 2024-08-04 15:35:12,352 INFO HandlerThread:10035 [system_monitor.py:probe():214] Collecting system info
+ 2024-08-04 15:35:12,353 INFO SystemMonitor:10035 [interfaces.py:start():190] Started cpu monitoring
+ 2024-08-04 15:35:12,353 INFO SystemMonitor:10035 [interfaces.py:start():190] Started disk monitoring
+ 2024-08-04 15:35:12,354 INFO SystemMonitor:10035 [interfaces.py:start():190] Started gpu monitoring
+ 2024-08-04 15:35:12,354 INFO SystemMonitor:10035 [interfaces.py:start():190] Started memory monitoring
+ 2024-08-04 15:35:12,354 INFO SystemMonitor:10035 [interfaces.py:start():190] Started network monitoring
+ 2024-08-04 15:35:12,365 DEBUG HandlerThread:10035 [system_info.py:probe():151] Probing system
+ 2024-08-04 15:35:12,367 DEBUG HandlerThread:10035 [system_info.py:_probe_git():136] Probing git
+ 2024-08-04 15:35:12,379 DEBUG HandlerThread:10035 [system_info.py:_probe_git():144] Probing git done
+ 2024-08-04 15:35:12,379 DEBUG HandlerThread:10035 [system_info.py:probe():199] Probing system done
+ 2024-08-04 15:35:12,379 DEBUG HandlerThread:10035 [system_monitor.py:probe():223] {'os': 'Linux-5.15.0-91-generic-x86_64-with-glibc2.35', 'python': '3.10.12', 'heartbeatAt': '2024-08-04T06:35:12.365765', 'startedAt': '2024-08-04T06:35:11.753150', 'docker': None, 'cuda': None, 'args': ('--seq-length', '512', '--sliding-window-size', '4096', '--micro-batch-size', '8', '--global-batch-size', '320', '--train-iters', '2000', '--tokenizer-type', 'Llama2Tokenizer', '--tokenizer-model', '/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model', '--train-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--valid-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--test-data-path', '4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document', '--lr', '2e-5', '--min-lr', '1e-6', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '500', '--lr-decay-iters', '2000', '--weight-decay', '0.1', '--grad-clip-norm', '1.0', '--optimizer', 'adam', '--adam-beta1', '0.9', '--adam-beta2', '0.95', '--adam-eps', '1e-6', '--save-interval', '200', '--eval-interval', '200', '--eval-iters', '10', '--bf16', '--mixed-precision', '--base-model', '/share/pretrained_lm/meta-llama/TinyLlama_v1.1', '--save', '/work/llm_recipes/models/tiny-llama', '--load', '/work/llm_recipes/models/tiny-llama', '--fsdp-activation-checkpointing', '--sharding-strategy', 'FULL_SHARD', '--checkpoint-type', 'LOCAL_STATE_DICT', '--save-n-checkpoints', '10', '--hf-upload-retry-limit', '2', '--hf-repo-id', 'koichi12/tiny-llama', '--wandb-entity', 'iwakawa-koichi-q5-tohoku-nlp6723', '--wandb-project', 'llm_tutorial', '--wandb-name', 'tiny-llama_train_2024-08-04-15:34:59'), 'state': 'running', 'program': '/project/examples/finetuning.py', 'codePathLocal': 'examples/finetuning.py', 'codePath': 'examples/finetuning.py', 'git': {'remote': 'https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git', 'commit': '3be5353210a678dc7008f237fa16b99f2bdf36ea'}, 'email': None, 'root': '/project', 'host': 'gpu-koiwa-00', 'username': 'koiwa', 'executable': '/usr/bin/python', 'cpu_count': 18, 'cpu_count_logical': 18, 'cpu_freq': {'current': 2400.034, 'min': 0.0, 'max': 0.0}, 'cpu_freq_per_core': [{'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}, {'current': 2400.034, 'min': 0.0, 'max': 0.0}], 'disk': {'/': {'total': 0.0625, 'used': 1.1444091796875e-05}}, 'gpu': 'NVIDIA A100-SXM4-40GB', 'gpu_count': 1, 'gpu_devices': [{'name': 'NVIDIA A100-SXM4-40GB', 'memory_total': 42949672960}], 'memory': {'total': 56.48781967163086}}
+ 2024-08-04 15:35:12,379 INFO HandlerThread:10035 [system_monitor.py:probe():224] Finished collecting system info
+ 2024-08-04 15:35:12,379 INFO HandlerThread:10035 [system_monitor.py:probe():227] Publishing system info
+ 2024-08-04 15:35:12,380 INFO HandlerThread:10035 [system_monitor.py:probe():229] Finished publishing system info
+ 2024-08-04 15:35:12,392 DEBUG SenderThread:10035 [sender.py:send():382] send: files
+ 2024-08-04 15:35:12,392 INFO SenderThread:10035 [sender.py:_save_file():1403] saving file wandb-metadata.json with policy now
+ 2024-08-04 15:35:12,401 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: python_packages
+ 2024-08-04 15:35:12,401 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-04 15:35:12,401 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: python_packages
+ 2024-08-04 15:35:12,402 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-04 15:35:12,403 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-04 15:35:12,635 DEBUG SenderThread:10035 [sender.py:send():382] send: telemetry
+ 2024-08-04 15:35:13,069 INFO wandb-upload_0:10035 [upload_job.py:push():131] Uploaded file /tmp/tmpekww83l_wandb/2um60osn-wandb-metadata.json
+ 2024-08-04 15:35:13,258 INFO Thread-12 :10035 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_153511-5ba5jbt6/files/requirements.txt
+ 2024-08-04 15:35:13,258 INFO Thread-12 :10035 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_153511-5ba5jbt6/files/output.log
+ 2024-08-04 15:35:13,259 INFO Thread-12 :10035 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_153511-5ba5jbt6/files/wandb-metadata.json
+ 2024-08-04 15:35:15,255 DEBUG SenderThread:10035 [sender.py:send():382] send: config
+ 2024-08-04 15:35:15,255 DEBUG SenderThread:10035 [sender.py:send():382] send: config
+ 2024-08-04 15:35:15,259 INFO Thread-12 :10035 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_153511-5ba5jbt6/files/output.log
+ 2024-08-04 15:35:15,561 DEBUG SenderThread:10035 [sender.py:send():382] send: exit
+ 2024-08-04 15:35:15,561 INFO SenderThread:10035 [sender.py:send_exit():589] handling exit code: 1
+ 2024-08-04 15:35:15,561 INFO SenderThread:10035 [sender.py:send_exit():591] handling runtime: 3
+ 2024-08-04 15:35:15,562 INFO SenderThread:10035 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-04 15:35:15,563 INFO SenderThread:10035 [sender.py:send_exit():597] send defer
+ 2024-08-04 15:35:15,563 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 15:35:15,563 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 0
+ 2024-08-04 15:35:15,563 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 15:35:15,563 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 0
+ 2024-08-04 15:35:15,563 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 1
+ 2024-08-04 15:35:15,563 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 15:35:15,563 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 1
+ 2024-08-04 15:35:15,563 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 15:35:15,563 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 1
+ 2024-08-04 15:35:15,564 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 2
+ 2024-08-04 15:35:15,564 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 15:35:15,564 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 2
+ 2024-08-04 15:35:15,564 INFO HandlerThread:10035 [system_monitor.py:finish():203] Stopping system monitor
+ 2024-08-04 15:35:15,564 INFO HandlerThread:10035 [interfaces.py:finish():202] Joined cpu monitor
+ 2024-08-04 15:35:15,564 DEBUG SystemMonitor:10035 [system_monitor.py:_start():172] Starting system metrics aggregation loop
+ 2024-08-04 15:35:15,564 INFO HandlerThread:10035 [interfaces.py:finish():202] Joined disk monitor
+ 2024-08-04 15:35:15,564 DEBUG SystemMonitor:10035 [system_monitor.py:_start():179] Finished system metrics aggregation loop
+ 2024-08-04 15:35:15,565 DEBUG SystemMonitor:10035 [system_monitor.py:_start():183] Publishing last batch of metrics
+ 2024-08-04 15:35:15,597 INFO HandlerThread:10035 [interfaces.py:finish():202] Joined gpu monitor
+ 2024-08-04 15:35:15,597 INFO HandlerThread:10035 [interfaces.py:finish():202] Joined memory monitor
+ 2024-08-04 15:35:15,597 INFO HandlerThread:10035 [interfaces.py:finish():202] Joined network monitor
+ 2024-08-04 15:35:15,598 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 15:35:15,598 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 2
+ 2024-08-04 15:35:15,598 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 3
+ 2024-08-04 15:35:15,598 DEBUG SenderThread:10035 [sender.py:send():382] send: stats
+ 2024-08-04 15:35:15,598 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 15:35:15,598 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 3
+ 2024-08-04 15:35:15,598 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 15:35:15,598 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 3
+ 2024-08-04 15:35:15,598 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 4
+ 2024-08-04 15:35:15,598 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 15:35:15,599 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 4
+ 2024-08-04 15:35:15,599 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 15:35:15,599 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 4
+ 2024-08-04 15:35:15,599 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 5
+ 2024-08-04 15:35:15,599 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 15:35:15,599 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 5
+ 2024-08-04 15:35:15,599 DEBUG SenderThread:10035 [sender.py:send():382] send: summary
+ 2024-08-04 15:35:15,600 INFO SenderThread:10035 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-04 15:35:15,600 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 15:35:15,600 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 5
+ 2024-08-04 15:35:15,600 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 6
+ 2024-08-04 15:35:15,600 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-04 15:35:15,600 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 6
+ 2024-08-04 15:35:15,601 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
+ 2024-08-04 15:35:15,601 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 6
+ 2024-08-04 15:35:15,603 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-04 15:35:15,791 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 7
97
+ 2024-08-04 15:35:15,791 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
98
+ 2024-08-04 15:35:15,791 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 7
99
+ 2024-08-04 15:35:15,791 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
100
+ 2024-08-04 15:35:15,791 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 7
101
+ 2024-08-04 15:35:16,260 INFO Thread-12 :10035 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_153511-5ba5jbt6/files/output.log
102
+ 2024-08-04 15:35:16,260 INFO Thread-12 :10035 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_153511-5ba5jbt6/files/config.yaml
103
+ 2024-08-04 15:35:16,260 INFO Thread-12 :10035 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240804_153511-5ba5jbt6/files/wandb-summary.json
104
+ 2024-08-04 15:35:16,561 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: poll_exit
105
+ 2024-08-04 15:35:17,260 INFO Thread-12 :10035 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_153511-5ba5jbt6/files/output.log
106
+ 2024-08-04 15:35:17,299 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 8
107
+ 2024-08-04 15:35:17,299 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: poll_exit
108
+ 2024-08-04 15:35:17,299 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
109
+ 2024-08-04 15:35:17,299 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 8
110
+ 2024-08-04 15:35:17,299 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
111
+ 2024-08-04 15:35:17,299 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 8
112
+ 2024-08-04 15:35:17,299 INFO SenderThread:10035 [job_builder.py:build():296] Attempting to build job artifact
113
+ 2024-08-04 15:35:17,300 INFO SenderThread:10035 [job_builder.py:_get_source_type():426] is repo sourced job
114
+ 2024-08-04 15:35:17,314 INFO SenderThread:10035 [job_builder.py:build():402] adding wandb-job metadata file
115
+ 2024-08-04 15:35:17,322 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 9
116
+ 2024-08-04 15:35:17,322 DEBUG SenderThread:10035 [sender.py:send():382] send: artifact
117
+ 2024-08-04 15:35:17,322 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
118
+ 2024-08-04 15:35:17,323 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 9
119
+ 2024-08-04 15:35:17,561 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: poll_exit
120
+ 2024-08-04 15:35:18,177 INFO SenderThread:10035 [sender.py:send_artifact():1494] sent artifact job-https___github.com_cl-tohoku_llm-recipes-failab-m1-yans.git_examples_finetuning.py - {'id': 'QXJ0aWZhY3Q6MTA5MTk2NTkzOA==', 'state': 'COMMITTED', 'artifactSequence': {'id': 'QXJ0aWZhY3RDb2xsZWN0aW9uOjM2MjY3MjMzNA==', 'latestArtifact': {'id': 'QXJ0aWZhY3Q6MTA5MzUzODM4NQ==', 'versionIndex': 3}}}
121
+ 2024-08-04 15:35:18,177 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
122
+ 2024-08-04 15:35:18,177 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 9
123
+ 2024-08-04 15:35:18,177 INFO SenderThread:10035 [dir_watcher.py:finish():358] shutting down directory watcher
124
+ 2024-08-04 15:35:18,261 INFO SenderThread:10035 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240804_153511-5ba5jbt6/files/output.log
125
+ 2024-08-04 15:35:18,261 INFO SenderThread:10035 [dir_watcher.py:finish():388] scan: /project/wandb/run-20240804_153511-5ba5jbt6/files
126
+ 2024-08-04 15:35:18,262 INFO SenderThread:10035 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_153511-5ba5jbt6/files/requirements.txt requirements.txt
127
+ 2024-08-04 15:35:18,262 INFO SenderThread:10035 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_153511-5ba5jbt6/files/config.yaml config.yaml
128
+ 2024-08-04 15:35:18,263 INFO SenderThread:10035 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_153511-5ba5jbt6/files/wandb-metadata.json wandb-metadata.json
129
+ 2024-08-04 15:35:18,263 INFO SenderThread:10035 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_153511-5ba5jbt6/files/wandb-summary.json wandb-summary.json
130
+ 2024-08-04 15:35:18,265 INFO SenderThread:10035 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240804_153511-5ba5jbt6/files/output.log output.log
131
+ 2024-08-04 15:35:18,266 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 10
132
+ 2024-08-04 15:35:18,267 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: poll_exit
133
+ 2024-08-04 15:35:18,267 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
134
+ 2024-08-04 15:35:18,268 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 10
135
+ 2024-08-04 15:35:18,268 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
136
+ 2024-08-04 15:35:18,268 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 10
137
+ 2024-08-04 15:35:18,268 INFO SenderThread:10035 [file_pusher.py:finish():172] shutting down file pusher
138
+ 2024-08-04 15:35:18,562 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: poll_exit
139
+ 2024-08-04 15:35:18,562 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: poll_exit
140
+ 2024-08-04 15:35:18,679 INFO wandb-upload_0:10035 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_153511-5ba5jbt6/files/requirements.txt
141
+ 2024-08-04 15:35:18,797 INFO wandb-upload_1:10035 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_153511-5ba5jbt6/files/config.yaml
142
+ 2024-08-04 15:35:18,860 INFO wandb-upload_2:10035 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_153511-5ba5jbt6/files/wandb-summary.json
143
+ 2024-08-04 15:35:18,877 INFO wandb-upload_3:10035 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240804_153511-5ba5jbt6/files/output.log
144
+ 2024-08-04 15:35:19,077 INFO Thread-11 (_thread_body):10035 [sender.py:transition_state():617] send defer: 11
145
+ 2024-08-04 15:35:19,077 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
146
+ 2024-08-04 15:35:19,078 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 11
147
+ 2024-08-04 15:35:19,078 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
148
+ 2024-08-04 15:35:19,078 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 11
149
+ 2024-08-04 15:35:19,078 INFO SenderThread:10035 [file_pusher.py:join():178] waiting for file pusher
150
+ 2024-08-04 15:35:19,078 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 12
151
+ 2024-08-04 15:35:19,078 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
152
+ 2024-08-04 15:35:19,078 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 12
153
+ 2024-08-04 15:35:19,078 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
154
+ 2024-08-04 15:35:19,078 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 12
155
+ 2024-08-04 15:35:19,078 INFO SenderThread:10035 [file_stream.py:finish():595] file stream finish called
156
+ 2024-08-04 15:35:19,260 INFO SenderThread:10035 [file_stream.py:finish():599] file stream finish is done
157
+ 2024-08-04 15:35:19,260 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 13
158
+ 2024-08-04 15:35:19,261 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
159
+ 2024-08-04 15:35:19,261 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 13
160
+ 2024-08-04 15:35:19,261 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
161
+ 2024-08-04 15:35:19,261 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 13
162
+ 2024-08-04 15:35:19,261 INFO SenderThread:10035 [sender.py:transition_state():617] send defer: 14
163
+ 2024-08-04 15:35:19,261 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: defer
164
+ 2024-08-04 15:35:19,261 DEBUG SenderThread:10035 [sender.py:send():382] send: final
165
+ 2024-08-04 15:35:19,261 INFO HandlerThread:10035 [handler.py:handle_request_defer():172] handle defer: 14
166
+ 2024-08-04 15:35:19,261 DEBUG SenderThread:10035 [sender.py:send():382] send: footer
167
+ 2024-08-04 15:35:19,262 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: defer
168
+ 2024-08-04 15:35:19,262 INFO SenderThread:10035 [sender.py:send_request_defer():613] handle sender defer: 14
169
+ 2024-08-04 15:35:19,262 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: poll_exit
170
+ 2024-08-04 15:35:19,262 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: poll_exit
171
+ 2024-08-04 15:35:19,262 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: poll_exit
172
+ 2024-08-04 15:35:19,263 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: server_info
173
+ 2024-08-04 15:35:19,263 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: poll_exit
174
+ 2024-08-04 15:35:19,263 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: server_info
175
+ 2024-08-04 15:35:19,263 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: get_summary
176
+ 2024-08-04 15:35:19,264 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: sampled_history
177
+ 2024-08-04 15:35:19,265 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: internal_messages
178
+ 2024-08-04 15:35:19,265 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: job_info
179
+ 2024-08-04 15:35:19,429 DEBUG SenderThread:10035 [sender.py:send_request():409] send_request: job_info
180
+ 2024-08-04 15:35:19,429 INFO MainThread:10035 [wandb_run.py:_footer_history_summary_info():3866] rendering history
181
+ 2024-08-04 15:35:19,429 INFO MainThread:10035 [wandb_run.py:_footer_history_summary_info():3898] rendering summary
182
+ 2024-08-04 15:35:19,429 INFO MainThread:10035 [wandb_run.py:_footer_sync_info():3825] logging synced files
183
+ 2024-08-04 15:35:19,429 DEBUG HandlerThread:10035 [handler.py:handle_request():146] handle_request: shutdown
184
+ 2024-08-04 15:35:19,429 INFO HandlerThread:10035 [handler.py:finish():869] shutting down handler
185
+ 2024-08-04 15:35:20,265 INFO WriterThread:10035 [datastore.py:close():296] close: /project/wandb/run-20240804_153511-5ba5jbt6/run-5ba5jbt6.wandb
186
+ 2024-08-04 15:35:20,429 INFO SenderThread:10035 [sender.py:finish():1572] shutting down sender
187
+ 2024-08-04 15:35:20,429 INFO SenderThread:10035 [file_pusher.py:finish():172] shutting down file pusher
188
+ 2024-08-04 15:35:20,429 INFO SenderThread:10035 [file_pusher.py:join():178] waiting for file pusher
wandb/run-20240804_153511-5ba5jbt6/logs/debug.log ADDED
@@ -0,0 +1,30 @@
+ 2024-08-04 15:35:11,758 INFO MainThread:9964 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_setup.py:_flush():76] Configure stats pid to 9964
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_setup.py:_flush():76] Loading settings from /singularity_home/.config/wandb/settings
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_setup.py:_flush():76] Loading settings from /project/wandb/settings
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'api_key': '***REDACTED***', 'run_notes': 'Train tiny llama sample'}
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'examples/finetuning.py', 'program_abspath': '/project/examples/finetuning.py', 'program': '/project/examples/finetuning.py'}
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_init.py:_log_setup():526] Logging user logs to /project/wandb/run-20240804_153511-5ba5jbt6/logs/debug.log
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_init.py:_log_setup():527] Logging internal logs to /project/wandb/run-20240804_153511-5ba5jbt6/logs/debug-internal.log
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_init.py:init():566] calling init triggers
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
+ config: {'sharding_strategy': 'FULL_SHARD', 'checkpoint_type': 'LOCAL_STATE_DICT', 'fsdp_activation_checkpointing': True, 'fsdp_cpu_offload': False, 'low_cpu_fsdp': False, 'no_meta_device': False, 'data_path': None, 'split': '969, 30, 1', 'train_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'valid_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'test_data_path': ['4013541', '/work/llm_recipes/datasets/bin/common_crawl_and_extended_common_crawl.doc_extracted.200.sorted.uniq.filtered.shuf.head/data_text_document'], 'data_cache_path': None, 'vocab_size': None, 'vocab_file': None, 'merge_file': None, 'seq_length': 512, 'num_workers': 2, 'tokenizer_type': 'Llama2Tokenizer', 'tokenizer_model': '/share/pretrained_lm/meta-llama/TinyLlama_v1.1/tokenizer.model', 'reset_position_ids': False, 'reset_attention_mask': False, 'eod_mask_loss': False, 'retro_return_doc_ids': False, 'short_seq_prob': 0.1, 'vocab_extra_ids': 0, 'seed': 1234, 'use_mpi': False, 'wandb_entity': 'iwakawa-koichi-q5-tohoku-nlp6723', 'wandb_name': 'tiny-llama_train_2024-08-04-15:34:59', 'wandb_project': 'llm_tutorial', 'quantization': False, 'use_freeze_layers': False, 'freeze_layers': None, 'bf16': True, 'fp16': False, 'mixed_precision': True, 'param_dtype': None, 'load': '/work/llm_recipes/models/tiny-llama', 'save': '/work/llm_recipes/models/tiny-llama', 'base_model': '/share/pretrained_lm/meta-llama/TinyLlama_v1.1', 'use_better_transformer': False, 'grad_clip_norm': 1.0, 'eval_interval': 200, 'save_interval': 200, 'eval_iters': 10, 'optimizer': 'adam', 'lr': 2e-05, 'lr_decay_style': 'cosine', 'lr_decay_iters': 2000, 'lr_warmup_iters': 500, 'min_lr': 1e-06, 'train_iters': 2000, 'train_samples': None, 'global_batch_size': 320, 'micro_batch_size': 8, 'make_vocab_size_divisible_by': 128, 'sliding_window_size': 4096, 'skip_batch': None, 'no_save_optimizer_state': False, 'continual_pretraining': False, 'instruction_tuning': False, 'direct_preference_optimization': False, 'attention_dropout': 0.1, 'hidden_dropout': 0.1, 'weight_decay': 0.1, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_eps': 1e-06, 'hf_transformer_model_dir': None, 'instruction_train_data_path': None, 'instruction_valid_data_path': None, 'epoch': None, 'instruction_dataset_size': None, 'save_sampler_state': False, 'label_smoothing': 0.0, 'save_n_checkpoints': 10, 'hf_repo_id': 'koichi12/tiny-llama', 'create_public_hf_repo': False, 'upload_all_checkpoints_to_hf': False, 'hf_upload_retry_limit': 2, 'exit_duration_in_mins': None, 'source_key': None, 'target_key': None, 'attn_implementation': 'flash_attention_2', 'efficient_instruction_tuning': False, 'remove_padding_masking': False, 'save_start_iter': None, 'rank': 0, 'world_size': 1, 'padded_vocab_size': 32000, 'gradient_accumulation_steps': 40}
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_init.py:init():616] starting backend
+ 2024-08-04 15:35:11,759 INFO MainThread:9964 [wandb_init.py:init():620] setting up manager
+ 2024-08-04 15:35:11,764 INFO MainThread:9964 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
+ 2024-08-04 15:35:11,766 INFO MainThread:9964 [wandb_init.py:init():628] backend started and connected
+ 2024-08-04 15:35:11,770 INFO MainThread:9964 [wandb_init.py:init():720] updated telemetry
+ 2024-08-04 15:35:11,782 INFO MainThread:9964 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
+ 2024-08-04 15:35:12,259 INFO MainThread:9964 [wandb_run.py:_on_init():2262] communicating current version
+ 2024-08-04 15:35:12,339 INFO MainThread:9964 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.17.5 is available! To upgrade, please run:\n $ pip install wandb --upgrade"
+
+ 2024-08-04 15:35:12,339 INFO MainThread:9964 [wandb_init.py:init():804] starting run threads in backend
+ 2024-08-04 15:35:12,400 INFO MainThread:9964 [wandb_run.py:_console_start():2241] atexit reg
+ 2024-08-04 15:35:12,400 INFO MainThread:9964 [wandb_run.py:_redirect():2096] redirect: wrap_raw
+ 2024-08-04 15:35:12,400 INFO MainThread:9964 [wandb_run.py:_redirect():2161] Wrapping output streams.
+ 2024-08-04 15:35:12,400 INFO MainThread:9964 [wandb_run.py:_redirect():2186] Redirects installed.
+ 2024-08-04 15:35:12,401 INFO MainThread:9964 [wandb_init.py:init():847] run started, returning control to user process
+ 2024-08-04 15:35:15,253 INFO MainThread:9964 [wandb_run.py:_config_callback():1343] config_cb None None {'activation_function': 'silu', 'hidden_size': 2048, 'model_type': 'llama', 'max_position_embeddings': 2048, 'num_attention_heads': 32, 'num_hidden_layers': 22, 'model_architecture': 'LlamaForCausalLM'}
+ 2024-08-04 15:35:15,253 INFO MainThread:9964 [wandb_run.py:_config_callback():1343] config_cb None None {'world_size': 1}
+ 2024-08-04 15:35:20,430 WARNING MsgRouterThr:9964 [router.py:message_loop():77] message_loop has been closed
wandb/run-20240804_153511-5ba5jbt6/run-5ba5jbt6.wandb ADDED
Binary file (20.4 kB).
wandb/run-20240812_052446-qrv0d6sp/files/config.yaml ADDED
@@ -0,0 +1,314 @@
1
+ wandb_version: 1
2
+
3
+ sharding_strategy:
4
+ desc: null
5
+ value: FULL_SHARD
6
+ checkpoint_type:
7
+ desc: null
8
+ value: LOCAL_STATE_DICT
9
+ fsdp_activation_checkpointing:
10
+ desc: null
11
+ value: true
12
+ fsdp_cpu_offload:
13
+ desc: null
14
+ value: false
15
+ low_cpu_fsdp:
16
+ desc: null
17
+ value: false
18
+ no_meta_device:
19
+ desc: null
20
+ value: false
21
+ data_path:
22
+ desc: null
23
+ value: null
24
+ split:
25
+ desc: null
26
+ value: 969, 30, 1
27
+ train_data_path:
28
+ desc: null
29
+ value:
30
+ - '304771887'
31
+ - /work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document
32
+ valid_data_path:
33
+ desc: null
34
+ value:
35
+ - '304771887'
36
+ - /work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document
37
+ test_data_path:
38
+ desc: null
39
+ value:
40
+ - '304771887'
41
+ - /work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document
42
+ data_cache_path:
43
+ desc: null
44
+ value: null
45
+ vocab_size:
46
+ desc: null
47
+ value: null
48
+ vocab_file:
49
+ desc: null
50
+ value: null
51
+ merge_file:
52
+ desc: null
53
+ value: null
54
+ seq_length:
55
+ desc: null
56
+ value: 4096
57
+ num_workers:
58
+ desc: null
59
+ value: 2
60
+ tokenizer_type:
61
+ desc: null
62
+ value: HFPreTrainedTokenizer
63
+ tokenizer_model:
64
+ desc: null
65
+ value: /share/pretrained_lm/Qwen/Qwen2-0.5B
66
+ reset_position_ids:
67
+ desc: null
68
+ value: false
69
+ reset_attention_mask:
70
+ desc: null
71
+ value: false
72
+ eod_mask_loss:
73
+ desc: null
74
+ value: false
75
+ retro_return_doc_ids:
76
+ desc: null
77
+ value: false
78
+ short_seq_prob:
79
+ desc: null
80
+ value: 0.1
81
+ vocab_extra_ids:
82
+ desc: null
83
+ value: 0
84
+ seed:
85
+ desc: null
86
+ value: 1234
87
+ use_mpi:
88
+ desc: null
89
+ value: false
90
+ wandb_entity:
91
+ desc: null
92
+ value: iwakawa-koichi-q5-tohoku-nlp6723
93
+ wandb_name:
94
+ desc: null
95
+ value: yans-qwen2-0.5B_train_2024-08-12-05:24:35
96
+ wandb_project:
97
+ desc: null
98
+ value: llm_tutorial
99
+ quantization:
100
+ desc: null
101
+ value: false
102
+ use_freeze_layers:
103
+ desc: null
104
+ value: false
105
+ freeze_layers:
106
+ desc: null
107
+ value: null
108
+ bf16:
109
+ desc: null
110
+ value: true
111
+ fp16:
112
+ desc: null
113
+ value: false
114
+ mixed_precision:
115
+ desc: null
116
+ value: true
117
+ param_dtype:
118
+ desc: null
119
+ value: null
120
+ load:
121
+ desc: null
122
+ value: /work/llm_recipes/models/yans-qwen2-0.5B
123
+ save:
124
+ desc: null
125
+ value: /work/llm_recipes/models/yans-qwen2-0.5B
126
+ base_model:
127
+ desc: null
128
+ value: /share/pretrained_lm/Qwen/Qwen2-0.5B
129
+ use_better_transformer:
130
+ desc: null
131
+ value: false
132
+ grad_clip_norm:
133
+ desc: null
134
+ value: 1.0
135
+ eval_interval:
136
+ desc: null
137
+ value: 200
138
+ save_interval:
139
+ desc: null
140
+ value: 5
141
+ eval_iters:
142
+ desc: null
143
+ value: 10
144
+ optimizer:
145
+ desc: null
146
+ value: adam
147
+ lr:
148
+ desc: null
149
+ value: 2.0e-05
150
+ lr_decay_style:
151
+ desc: null
152
+ value: cosine
153
+ lr_decay_iters:
154
+ desc: null
155
+ value: 20000
156
+ lr_warmup_iters:
157
+ desc: null
158
+ value: 500
159
+ min_lr:
160
+ desc: null
161
+ value: 1.0e-06
162
+ train_iters:
163
+ desc: null
164
+ value: 20000
165
+ train_samples:
166
+ desc: null
167
+ value: null
168
+ global_batch_size:
169
+ desc: null
170
+ value: 320
171
+ micro_batch_size:
172
+ desc: null
173
+ value: 1
174
+ make_vocab_size_divisible_by:
175
+ desc: null
176
+ value: 128
177
+ sliding_window_size:
178
+ desc: null
179
+ value: 4096
180
+ skip_batch:
181
+ desc: null
182
+ value: null
183
+ no_save_optimizer_state:
184
+ desc: null
185
+ value: false
186
+ continual_pretraining:
187
+ desc: null
188
+ value: false
189
+ instruction_tuning:
190
+ desc: null
191
+ value: false
192
+ direct_preference_optimization:
193
+ desc: null
194
+ value: false
195
+ attention_dropout:
196
+ desc: null
197
+ value: 0.1
198
+ hidden_dropout:
199
+ desc: null
200
+ value: 0.1
201
+ weight_decay:
202
+ desc: null
203
+ value: 0.1
204
+ adam_beta1:
205
+ desc: null
206
+ value: 0.9
207
+ adam_beta2:
208
+ desc: null
209
+ value: 0.95
210
+ adam_eps:
211
+ desc: null
212
+ value: 1.0e-06
213
+ hf_transformer_model_dir:
214
+ desc: null
215
+ value: null
216
+ instruction_train_data_path:
217
+ desc: null
218
+ value: null
219
+ instruction_valid_data_path:
220
+ desc: null
221
+ value: null
222
+ epoch:
223
+ desc: null
224
+ value: null
225
+ instruction_dataset_size:
226
+ desc: null
227
+ value: null
228
+ save_sampler_state:
229
+ desc: null
230
+ value: false
231
+ label_smoothing:
232
+ desc: null
233
+ value: 0.0
234
+ save_n_checkpoints:
235
+ desc: null
236
+ value: 10
237
+ hf_repo_id:
238
+ desc: null
239
+ value: koichi12//yans-qwen2-0.5B
240
+ create_public_hf_repo:
241
+ desc: null
242
+ value: false
243
+ upload_all_checkpoints_to_hf:
244
+ desc: null
245
+ value: false
246
+ hf_upload_retry_limit:
247
+ desc: null
248
+ value: 2
249
+ exit_duration_in_mins:
250
+ desc: null
251
+ value: null
252
+ source_key:
253
+ desc: null
254
+ value: null
255
+ target_key:
256
+ desc: null
257
+ value: null
258
+ attn_implementation:
259
+ desc: null
260
+ value: flash_attention_2
261
+ efficient_instruction_tuning:
262
+ desc: null
263
+ value: false
264
+ remove_padding_masking:
265
+ desc: null
266
+ value: false
267
+ save_start_iter:
268
+ desc: null
269
+ value: null
270
+ rank:
271
+ desc: null
272
+ value: 0
273
+ world_size:
274
+ desc: null
275
+ value: 1
276
+ padded_vocab_size:
277
+ desc: null
278
+ value: 151680
279
+ gradient_accumulation_steps:
280
+ desc: null
281
+ value: 320
282
+ _wandb:
283
+ desc: null
284
+ value:
285
+ python_version: 3.10.12
286
+ cli_version: 0.16.3
287
+ framework: huggingface
288
+ huggingface_version: 4.43.3
289
+ is_jupyter_run: false
290
+ is_kaggle_kernel: false
291
+ start_time: 1723407886.294165
292
+ t:
293
+ 1:
294
+ - 1
295
+ - 11
296
+ - 49
297
+ - 55
298
+ - 71
299
+ 2:
300
+ - 1
301
+ - 11
302
+ - 49
303
+ - 55
304
+ - 71
305
+ 3:
306
+ - 13
307
+ - 16
308
+ - 23
309
+ 4: 3.10.12
310
+ 5: 0.16.3
311
+ 6: 4.43.3
312
+ 8:
313
+ - 5
314
+ 13: linux-x86_64
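The batch-size fields in this config are internally consistent: with `world_size: 1` and `micro_batch_size: 1`, reaching `global_batch_size: 320` requires 320 gradient-accumulation steps, which matches the recorded `gradient_accumulation_steps: 320` (and, in the earlier tiny-llama run, 320 / (8 × 1) = 40). A minimal sketch of that relation, with a helper name that is illustrative rather than taken from the training code:

```python
def gradient_accumulation_steps(global_batch_size: int,
                                micro_batch_size: int,
                                world_size: int) -> int:
    """Steps needed so micro_batch_size * world_size * steps == global_batch_size."""
    per_step = micro_batch_size * world_size
    if global_batch_size % per_step != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * world_size")
    return global_batch_size // per_step

# Values from this run's config.yaml: 320 / (1 * 1)
print(gradient_accumulation_steps(320, 1, 1))  # → 320
```

The same check reproduces the tiny-llama run's value: `gradient_accumulation_steps(320, 8, 1)` gives 40.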
wandb/run-20240812_052446-qrv0d6sp/files/output.log ADDED
@@ -0,0 +1,12 @@
+ Traceback (most recent call last):
+   File "/project/examples/finetuning.py", line 13, in <module>
+     main()
+   File "/project/src/llama_recipes/finetuning.py", line 85, in main
+     setup_huggingface_repository(args)
+   File "/project/src/llama_recipes/utils/hf_hub_utils.py", line 10, in setup_huggingface_repository
+     create_repo(
+   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
+     validate_repo_id(arg_value)
+   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
+     raise HFValidationError(
+ huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'koichi12//yans-qwen2-0.5B'. Use `repo_type` argument if needed.
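This run aborted because the configured `hf_repo_id` contained a doubled slash (`koichi12//yans-qwen2-0.5B`), which `huggingface_hub`'s repo-id validation rejects: only `repo_name` or `namespace/repo_name` is accepted. A hedged sketch of a guard one could run on the id before calling `create_repo` (the helper name is hypothetical, not part of the project's code):

```python
def normalize_repo_id(raw: str) -> str:
    """Collapse empty path segments so 'ns//name' becomes 'ns/name'.

    huggingface_hub requires repo ids of the form 'repo_name' or
    'namespace/repo_name'; a doubled slash fails validation as seen above.
    """
    parts = [p for p in raw.split("/") if p]
    if not 1 <= len(parts) <= 2:
        raise ValueError(f"cannot recover a valid repo id from {raw!r}")
    return "/".join(parts)

print(normalize_repo_id("koichi12//yans-qwen2-0.5B"))  # → koichi12/yans-qwen2-0.5B
```

With the id normalized this way, the `create_repo(...)` call in the traceback would receive a validly formed `namespace/repo_name` string.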
wandb/run-20240812_052446-qrv0d6sp/files/requirements.txt ADDED
@@ -0,0 +1,271 @@
1
+ absl-py==2.1.0
2
+ accelerate==0.33.0
3
+ aiohttp==3.9.1
4
+ aiosignal==1.3.1
5
+ annotated-types==0.6.0
6
+ apex==0.1
7
+ appdirs==1.4.4
8
+ argon2-cffi-bindings==21.2.0
9
+ argon2-cffi==23.1.0
10
+ asttokens==2.4.1
11
+ astunparse==1.6.3
12
+ async-timeout==4.0.3
13
+ attrs==23.2.0
14
+ audioread==3.0.1
15
+ beautifulsoup4==4.12.3
16
+ bleach==6.1.0
17
+ blis==0.7.11
18
+ cachetools==5.3.2
19
+ catalogue==2.0.10
20
+ certifi==2024.2.2
21
+ cffi==1.16.0
22
+ charset-normalizer==3.3.2
23
+ click==8.1.7
24
+ cloudpathlib==0.16.0
25
+ cloudpickle==3.0.0
26
+ cmake==3.28.1
27
+ colorama==0.4.6
28
+ comm==0.2.1
29
+ confection==0.1.4
30
+ contourpy==1.2.0
31
+ cubinlinker==0.3.0+2.g405ac64
32
+ cuda-python==12.3.0rc4+9.gdb8c48a.dirty
33
+ cudf==23.12.0
34
+ cugraph-dgl==23.12.0
35
+ cugraph-service-client==23.12.0
36
+ cugraph-service-server==23.12.0
37
+ cugraph==23.12.0
38
+ cuml==23.12.0
39
+ cupy-cuda12x==12.3.0
+ cycler==0.12.1
+ cymem==2.0.8
+ cython==3.0.8
+ dask-cuda==23.12.0
+ dask-cudf==23.12.0
+ dask==2023.11.0
+ debugpy==1.8.1
+ decorator==5.1.1
+ defusedxml==0.7.1
+ distributed==2023.11.0
+ dm-tree==0.1.8
+ docker-pycreds==0.4.0
+ einops==0.7.0
+ exceptiongroup==1.2.0
+ execnet==2.0.2
+ executing==2.0.1
+ expecttest==0.1.3
+ fastjsonschema==2.19.1
+ fastrlock==0.8.2
+ filelock==3.13.1
+ flash-attn==2.4.2
+ fonttools==4.48.1
+ frozenlist==1.4.1
+ fsspec==2023.12.2
+ gast==0.5.4
+ gitdb==4.0.11
+ gitpython==3.1.43
+ google-auth-oauthlib==0.4.6
+ google-auth==2.27.0
+ graphsurgeon==0.4.6
+ grpcio==1.60.1
+ huggingface-hub==0.24.5
+ hypothesis==5.35.1
+ idna==3.6
+ importlib-metadata==7.0.1
+ iniconfig==2.0.0
+ intel-openmp==2021.4.0
+ ipadic==1.0.0
+ ipykernel==6.29.2
+ ipython-genutils==0.2.0
+ ipython==8.21.0
+ jedi==0.19.1
+ jinja2==3.1.3
+ joblib==1.3.2
+ json5==0.9.14
+ jsonnet==0.19.1
+ jsonschema-specifications==2023.12.1
+ jsonschema==4.21.1
+ jupyter-client==8.6.0
+ jupyter-core==5.7.1
+ jupyter-tensorboard==0.2.0
+ jupyterlab-pygments==0.3.0
+ jupyterlab-server==1.2.0
+ jupyterlab==2.3.2
+ jupytext==1.16.1
+ kiwisolver==1.4.5
+ langcodes==3.3.0
+ lazy-loader==0.3
+ librosa==0.10.1
+ llvmlite==0.40.1
+ locket==1.0.0
+ logzero==1.7.0
+ lxml==5.2.2
+ markdown-it-py==3.0.0
+ markdown==3.5.2
+ markupsafe==2.1.4
+ matplotlib-inline==0.1.6
+ matplotlib==3.8.2
+ mdit-py-plugins==0.4.0
+ mdurl==0.1.2
+ mecab-python3==1.0.6
+ mistune==3.0.2
+ mkl-devel==2021.1.1
+ mkl-include==2021.1.1
+ mkl==2021.1.1
+ mock==5.1.0
+ more-itertools==9.1.0
+ mpmath==1.3.0
+ msgpack==1.0.7
+ multidict==6.0.4
+ murmurhash==1.0.10
+ nbclient==0.9.0
+ nbconvert==7.16.0
+ nbformat==5.9.2
+ nest-asyncio==1.6.0
+ networkx==2.6.3
+ ninja==1.11.1.1
+ nltk==3.8.1
+ notebook==6.4.10
+ numba==0.57.1+1.g1ff679645
+ numpy==1.24.4
+ nvfuser==0.1.4a0+d0bb811
+ nvidia-dali-cuda120==1.34.0
+ nvidia-pyindex==1.0.9
+ nvtx==0.2.5
+ oauthlib==3.2.2
+ onnx==1.15.0rc2
+ opencv==4.7.0
+ optree==0.10.0
+ packaging==23.2
+ pandas==1.5.3
+ pandocfilters==1.5.1
+ parso==0.8.3
+ partd==1.4.1
+ peft==0.11.1
+ pexpect==4.9.0
+ pillow==10.2.0
+ pip==24.0
+ platformdirs==4.2.0
+ pluggy==1.4.0
+ ply==3.11
+ polygraphy==0.49.4
+ pooch==1.8.0
+ portalocker==2.10.1
+ preshed==3.0.9
+ prettytable==3.9.0
+ prometheus-client==0.19.0
+ prompt-toolkit==3.0.43
+ protobuf==4.24.4
+ psutil==5.9.4
+ ptxcompiler==0.8.1+2.g0d406d6
+ ptyprocess==0.7.0
+ pure-eval==0.2.2
+ pyarrow==14.0.1.dev0+gba5374836.d20240125
+ pyasn1-modules==0.3.0
+ pyasn1==0.5.1
+ pybind11-global==2.11.1
+ pybind11==2.11.1
+ pycocotools==2.0+nv0.8.0
+ pycparser==2.21
+ pydantic-core==2.16.2
+ pydantic==2.6.1
+ pygments==2.17.2
+ pylibcugraph==23.12.0
+ pylibcugraphops==23.12.0
+ pylibraft==23.12.0
+ pynvml==11.4.1
+ pyparsing==3.1.1
+ pytest-flakefinder==1.1.0
+ pytest-rerunfailures==13.0
+ pytest-shard==0.1.2
+ pytest-xdist==3.5.0
+ pytest==8.0.0
+ python-dateutil==2.8.2
+ python-dotenv==1.0.0
+ python-hostlist==1.23.0
+ pytorch-quantization==2.1.2
+ pytz==2023.3.post1
+ pyyaml==6.0.1
+ pyzmq==25.1.2
+ raft-dask==23.12.0
+ rapids-dask-dependency==23.12.1
+ referencing==0.33.0
+ regex==2023.12.25
+ requests-oauthlib==1.3.1
+ requests==2.31.0
+ rich==13.7.0
+ rmm==23.12.0
+ rpds-py==0.17.1
+ rsa==4.9
+ sacrebleu==2.4.0
+ safetensors==0.4.3
+ scikit-learn==1.2.0
+ scipy==1.12.0
+ send2trash==1.8.2
+ sentencepiece==0.1.99
+ sentry-sdk==2.12.0
+ setproctitle==1.3.3
+ setuptools==68.2.2
+ six==1.16.0
+ smart-open==6.4.0
+ smmap==5.0.1
+ sortedcontainers==2.4.0
+ soundfile==0.12.1
+ soupsieve==2.5
+ soxr==0.3.7
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ spacy==3.7.2
+ sphinx-glpi-theme==0.6
+ srsly==2.4.8
+ stack-data==0.6.3
+ sympy==1.12
+ tabulate==0.9.0
+ tbb==2021.11.0
+ tblib==3.0.0
+ tensorboard-data-server==0.6.1
+ tensorboard-plugin-wit==1.8.1
+ tensorboard==2.9.0
+ tensorrt==8.6.3
+ terminado==0.18.0
+ termplotlib==0.3.9
+ thinc==8.2.3
+ threadpoolctl==3.2.0
+ thriftpy2==0.4.17
+ tinycss2==1.2.1
+ tokenizers==0.19.1
+ toml==0.10.2
+ tomli==2.0.1
+ toolz==0.12.1
+ torch-tensorrt==2.3.0a0
+ torch==2.3.0a0+ebedce2
+ torchdata==0.7.1a0
+ torchtext==0.17.0a0
+ torchvision==0.18.0a0
+ tornado==6.4
+ tqdm==4.66.1
+ traitlets==5.9.0
+ transformer-engine==1.3.0+5b90b7f
+ transformers==4.43.3
+ treelite-runtime==3.9.1
+ treelite==3.9.1
+ triton==2.2.0+e28a256
+ typer==0.9.0
+ types-dataclasses==0.6.6
+ typing-extensions==4.9.0
+ ucx-py==0.35.0
+ uff==0.6.9
+ ujson==5.8.0
+ urllib3==1.26.18
+ wandb==0.16.3
+ wasabi==1.1.2
+ wcwidth==0.2.13
+ weasel==0.3.4
+ webencodings==0.5.1
+ werkzeug==3.0.1
+ wheel==0.42.0
+ xdoctest==1.0.2
+ xgboost==1.7.6
+ yarl==1.9.4
+ zict==3.0.0
+ zipp==3.17.0
wandb/run-20240812_052446-qrv0d6sp/files/wandb-metadata.json ADDED
@@ -0,0 +1,215 @@
+ {
+ "os": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
+ "python": "3.10.12",
+ "heartbeatAt": "2024-08-11T20:24:46.917714",
+ "startedAt": "2024-08-11T20:24:46.281353",
+ "docker": null,
+ "cuda": null,
+ "args": [
+ "--seq-length",
+ "4096",
+ "--sliding-window-size",
+ "4096",
+ "--micro-batch-size",
+ "1",
+ "--global-batch-size",
+ "320",
+ "--train-iters",
+ "20000",
+ "--tokenizer-type",
+ "HFPreTrainedTokenizer",
+ "--tokenizer-model",
+ "/share/pretrained_lm/Qwen/Qwen2-0.5B",
+ "--train-data-path",
+ "304771887",
+ "/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document",
+ "--valid-data-path",
+ "304771887",
+ "/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document",
+ "--test-data-path",
+ "304771887",
+ "/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document",
+ "--lr",
+ "2e-5",
+ "--min-lr",
+ "1e-6",
+ "--lr-decay-style",
+ "cosine",
+ "--lr-warmup-iters",
+ "500",
+ "--lr-decay-iters",
+ "20000",
+ "--weight-decay",
+ "0.1",
+ "--grad-clip-norm",
+ "1.0",
+ "--optimizer",
+ "adam",
+ "--adam-beta1",
+ "0.9",
+ "--adam-beta2",
+ "0.95",
+ "--adam-eps",
+ "1e-6",
+ "--save-interval",
+ "5",
+ "--eval-interval",
+ "200",
+ "--eval-iters",
+ "10",
+ "--bf16",
+ "--mixed-precision",
+ "--base-model",
+ "/share/pretrained_lm/Qwen/Qwen2-0.5B",
+ "--save",
+ "/work/llm_recipes/models/yans-qwen2-0.5B",
+ "--load",
+ "/work/llm_recipes/models/yans-qwen2-0.5B",
+ "--fsdp-activation-checkpointing",
+ "--sharding-strategy",
+ "FULL_SHARD",
+ "--checkpoint-type",
+ "LOCAL_STATE_DICT",
+ "--save-n-checkpoints",
+ "10",
+ "--hf-upload-retry-limit",
+ "2",
+ "--hf-repo-id",
+ "koichi12//yans-qwen2-0.5B",
+ "--wandb-entity",
+ "iwakawa-koichi-q5-tohoku-nlp6723",
+ "--wandb-project",
+ "llm_tutorial",
+ "--wandb-name",
+ "yans-qwen2-0.5B_train_2024-08-12-05:24:35"
+ ],
+ "state": "running",
+ "program": "/project/examples/finetuning.py",
+ "codePathLocal": "examples/finetuning.py",
+ "codePath": "examples/finetuning.py",
+ "git": {
+ "remote": "https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git",
+ "commit": "6da01327e78c302bc0cfdb335f3ca297e2a19c8c"
+ },
+ "email": null,
+ "root": "/project",
+ "host": "gpu-koiwa-00",
+ "username": "koiwa",
+ "executable": "/usr/bin/python",
+ "cpu_count": 18,
+ "cpu_count_logical": 18,
+ "cpu_freq": {
+ "current": 2400.0429999999997,
+ "min": 0.0,
+ "max": 0.0
+ },
+ "cpu_freq_per_core": [
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ }
+ ],
+ "disk": {
+ "/": {
+ "total": 0.0625,
+ "used": 1.1444091796875e-05
+ }
+ },
+ "gpu": "NVIDIA A100-SXM4-40GB",
+ "gpu_count": 1,
+ "gpu_devices": [
+ {
+ "name": "NVIDIA A100-SXM4-40GB",
+ "memory_total": 42949672960
+ }
+ ],
+ "memory": {
+ "total": 56.487823486328125
+ }
+ }
wandb/run-20240812_052446-qrv0d6sp/files/wandb-summary.json ADDED
@@ -0,0 +1 @@
+ {"_wandb": {"runtime": 0}}
wandb/run-20240812_052446-qrv0d6sp/logs/debug-internal.log ADDED
@@ -0,0 +1,185 @@
+ 2024-08-12 05:24:46,295 INFO StreamThr :10279 [internal.py:wandb_internal():86] W&B internal server running at pid: 10279, started at: 2024-08-12 05:24:46.294899
+ 2024-08-12 05:24:46,297 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: status
+ 2024-08-12 05:24:46,299 INFO WriterThread:10279 [datastore.py:open_for_write():87] open: /project/wandb/run-20240812_052446-qrv0d6sp/run-qrv0d6sp.wandb
+ 2024-08-12 05:24:46,300 DEBUG SenderThread:10279 [sender.py:send():382] send: header
+ 2024-08-12 05:24:46,314 DEBUG SenderThread:10279 [sender.py:send():382] send: run
+ 2024-08-12 05:24:46,803 INFO SenderThread:10279 [dir_watcher.py:__init__():211] watching files in: /project/wandb/run-20240812_052446-qrv0d6sp/files
+ 2024-08-12 05:24:46,803 INFO SenderThread:10279 [sender.py:_start_run_threads():1136] run started: qrv0d6sp with start time 1723407886.294165
+ 2024-08-12 05:24:46,808 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: check_version
+ 2024-08-12 05:24:46,809 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: check_version
+ 2024-08-12 05:24:46,897 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: run_start
+ 2024-08-12 05:24:46,903 DEBUG HandlerThread:10279 [system_info.py:__init__():27] System info init
+ 2024-08-12 05:24:46,903 DEBUG HandlerThread:10279 [system_info.py:__init__():42] System info init done
+ 2024-08-12 05:24:46,903 INFO HandlerThread:10279 [system_monitor.py:start():194] Starting system monitor
+ 2024-08-12 05:24:46,903 INFO SystemMonitor:10279 [system_monitor.py:_start():158] Starting system asset monitoring threads
+ 2024-08-12 05:24:46,904 INFO HandlerThread:10279 [system_monitor.py:probe():214] Collecting system info
+ 2024-08-12 05:24:46,904 INFO SystemMonitor:10279 [interfaces.py:start():190] Started cpu monitoring
+ 2024-08-12 05:24:46,904 INFO SystemMonitor:10279 [interfaces.py:start():190] Started disk monitoring
+ 2024-08-12 05:24:46,905 INFO SystemMonitor:10279 [interfaces.py:start():190] Started gpu monitoring
+ 2024-08-12 05:24:46,906 INFO SystemMonitor:10279 [interfaces.py:start():190] Started memory monitoring
+ 2024-08-12 05:24:46,907 INFO SystemMonitor:10279 [interfaces.py:start():190] Started network monitoring
+ 2024-08-12 05:24:46,917 DEBUG HandlerThread:10279 [system_info.py:probe():151] Probing system
+ 2024-08-12 05:24:46,919 DEBUG HandlerThread:10279 [system_info.py:_probe_git():136] Probing git
+ 2024-08-12 05:24:46,932 DEBUG HandlerThread:10279 [system_info.py:_probe_git():144] Probing git done
+ 2024-08-12 05:24:46,932 DEBUG HandlerThread:10279 [system_info.py:probe():199] Probing system done
+ 2024-08-12 05:24:46,932 DEBUG HandlerThread:10279 [system_monitor.py:probe():223] {'os': 'Linux-5.15.0-91-generic-x86_64-with-glibc2.35', 'python': '3.10.12', 'heartbeatAt': '2024-08-11T20:24:46.917714', 'startedAt': '2024-08-11T20:24:46.281353', 'docker': None, 'cuda': None, 'args': ('--seq-length', '4096', '--sliding-window-size', '4096', '--micro-batch-size', '1', '--global-batch-size', '320', '--train-iters', '20000', '--tokenizer-type', 'HFPreTrainedTokenizer', '--tokenizer-model', '/share/pretrained_lm/Qwen/Qwen2-0.5B', '--train-data-path', '304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document', '--valid-data-path', '304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document', '--test-data-path', '304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document', '--lr', '2e-5', '--min-lr', '1e-6', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '500', '--lr-decay-iters', '20000', '--weight-decay', '0.1', '--grad-clip-norm', '1.0', '--optimizer', 'adam', '--adam-beta1', '0.9', '--adam-beta2', '0.95', '--adam-eps', '1e-6', '--save-interval', '5', '--eval-interval', '200', '--eval-iters', '10', '--bf16', '--mixed-precision', '--base-model', '/share/pretrained_lm/Qwen/Qwen2-0.5B', '--save', '/work/llm_recipes/models/yans-qwen2-0.5B', '--load', '/work/llm_recipes/models/yans-qwen2-0.5B', '--fsdp-activation-checkpointing', '--sharding-strategy', 'FULL_SHARD', '--checkpoint-type', 'LOCAL_STATE_DICT', '--save-n-checkpoints', '10', '--hf-upload-retry-limit', '2', '--hf-repo-id', 'koichi12//yans-qwen2-0.5B', '--wandb-entity', 'iwakawa-koichi-q5-tohoku-nlp6723', '--wandb-project', 'llm_tutorial', '--wandb-name', 'yans-qwen2-0.5B_train_2024-08-12-05:24:35'), 'state': 'running', 'program': '/project/examples/finetuning.py', 'codePathLocal': 'examples/finetuning.py', 'codePath': 'examples/finetuning.py', 'git': {'remote': 'https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git', 'commit': '6da01327e78c302bc0cfdb335f3ca297e2a19c8c'}, 'email': None, 'root': '/project', 'host': 'gpu-koiwa-00', 'username': 'koiwa', 'executable': '/usr/bin/python', 'cpu_count': 18, 'cpu_count_logical': 18, 'cpu_freq': {'current': 2400.0429999999997, 'min': 0.0, 'max': 0.0}, 'cpu_freq_per_core': [{'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}], 'disk': {'/': {'total': 0.0625, 'used': 1.1444091796875e-05}}, 'gpu': 'NVIDIA A100-SXM4-40GB', 'gpu_count': 1, 'gpu_devices': [{'name': 'NVIDIA A100-SXM4-40GB', 'memory_total': 42949672960}], 'memory': {'total': 56.487823486328125}}
+ 2024-08-12 05:24:46,932 INFO HandlerThread:10279 [system_monitor.py:probe():224] Finished collecting system info
+ 2024-08-12 05:24:46,932 INFO HandlerThread:10279 [system_monitor.py:probe():227] Publishing system info
+ 2024-08-12 05:24:46,934 INFO HandlerThread:10279 [system_monitor.py:probe():229] Finished publishing system info
+ 2024-08-12 05:24:46,940 DEBUG SenderThread:10279 [sender.py:send():382] send: files
+ 2024-08-12 05:24:46,940 INFO SenderThread:10279 [sender.py:_save_file():1403] saving file wandb-metadata.json with policy now
+ 2024-08-12 05:24:46,949 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: python_packages
+ 2024-08-12 05:24:46,949 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 05:24:46,950 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: python_packages
+ 2024-08-12 05:24:46,950 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 05:24:46,951 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 05:24:47,180 DEBUG SenderThread:10279 [sender.py:send():382] send: telemetry
+ 2024-08-12 05:24:47,182 DEBUG SenderThread:10279 [sender.py:send():382] send: exit
+ 2024-08-12 05:24:47,182 INFO SenderThread:10279 [sender.py:send_exit():589] handling exit code: 1
+ 2024-08-12 05:24:47,182 INFO SenderThread:10279 [sender.py:send_exit():591] handling runtime: 0
+ 2024-08-12 05:24:47,183 INFO SenderThread:10279 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-12 05:24:47,184 INFO SenderThread:10279 [sender.py:send_exit():597] send defer
+ 2024-08-12 05:24:47,184 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:47,184 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 0
+ 2024-08-12 05:24:47,184 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:47,184 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 0
+ 2024-08-12 05:24:47,184 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 1
+ 2024-08-12 05:24:47,184 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:47,184 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 1
+ 2024-08-12 05:24:47,184 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:47,184 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 1
+ 2024-08-12 05:24:47,184 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 2
+ 2024-08-12 05:24:47,184 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:47,185 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 2
+ 2024-08-12 05:24:47,185 INFO HandlerThread:10279 [system_monitor.py:finish():203] Stopping system monitor
+ 2024-08-12 05:24:47,185 DEBUG SystemMonitor:10279 [system_monitor.py:_start():172] Starting system metrics aggregation loop
+ 2024-08-12 05:24:47,185 INFO HandlerThread:10279 [interfaces.py:finish():202] Joined cpu monitor
+ 2024-08-12 05:24:47,185 DEBUG SystemMonitor:10279 [system_monitor.py:_start():179] Finished system metrics aggregation loop
+ 2024-08-12 05:24:47,185 INFO HandlerThread:10279 [interfaces.py:finish():202] Joined disk monitor
+ 2024-08-12 05:24:47,185 DEBUG SystemMonitor:10279 [system_monitor.py:_start():183] Publishing last batch of metrics
+ 2024-08-12 05:24:47,218 INFO HandlerThread:10279 [interfaces.py:finish():202] Joined gpu monitor
+ 2024-08-12 05:24:47,218 INFO HandlerThread:10279 [interfaces.py:finish():202] Joined memory monitor
+ 2024-08-12 05:24:47,218 INFO HandlerThread:10279 [interfaces.py:finish():202] Joined network monitor
+ 2024-08-12 05:24:47,219 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:47,219 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 2
+ 2024-08-12 05:24:47,219 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 3
+ 2024-08-12 05:24:47,219 DEBUG SenderThread:10279 [sender.py:send():382] send: stats
+ 2024-08-12 05:24:47,219 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:47,219 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 3
+ 2024-08-12 05:24:47,219 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:47,219 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 3
+ 2024-08-12 05:24:47,219 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 4
+ 2024-08-12 05:24:47,219 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:47,219 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 4
+ 2024-08-12 05:24:47,220 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:47,220 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 4
+ 2024-08-12 05:24:47,220 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 5
+ 2024-08-12 05:24:47,220 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:47,220 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 5
+ 2024-08-12 05:24:47,220 DEBUG SenderThread:10279 [sender.py:send():382] send: summary
+ 2024-08-12 05:24:47,221 INFO SenderThread:10279 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-12 05:24:47,221 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:47,221 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 5
+ 2024-08-12 05:24:47,221 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 6
+ 2024-08-12 05:24:47,221 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:47,221 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 6
+ 2024-08-12 05:24:47,221 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:47,222 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 6
+ 2024-08-12 05:24:47,224 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 05:24:47,422 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 7
+ 2024-08-12 05:24:47,422 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:47,422 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 7
+ 2024-08-12 05:24:47,422 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:47,422 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 7
+ 2024-08-12 05:24:47,581 INFO wandb-upload_0:10279 [upload_job.py:push():131] Uploaded file /tmp/tmppaigcwc7wandb/d7sbkpsh-wandb-metadata.json
+ 2024-08-12 05:24:47,805 INFO Thread-12 :10279 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_052446-qrv0d6sp/files/config.yaml
+ 2024-08-12 05:24:47,805 INFO Thread-12 :10279 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240812_052446-qrv0d6sp/files/requirements.txt
+ 2024-08-12 05:24:47,806 INFO Thread-12 :10279 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240812_052446-qrv0d6sp/files/output.log
+ 2024-08-12 05:24:47,806 INFO Thread-12 :10279 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240812_052446-qrv0d6sp/files/wandb-metadata.json
+ 2024-08-12 05:24:47,806 INFO Thread-12 :10279 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240812_052446-qrv0d6sp/files/wandb-summary.json
+ 2024-08-12 05:24:47,995 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: poll_exit
+ 2024-08-12 05:24:49,187 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 8
+ 2024-08-12 05:24:49,187 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: poll_exit
+ 2024-08-12 05:24:49,187 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:49,187 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 8
+ 2024-08-12 05:24:49,188 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:49,188 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 8
+ 2024-08-12 05:24:49,188 INFO SenderThread:10279 [job_builder.py:build():296] Attempting to build job artifact
+ 2024-08-12 05:24:49,189 INFO SenderThread:10279 [job_builder.py:_get_source_type():426] is repo sourced job
+ 2024-08-12 05:24:49,203 INFO SenderThread:10279 [job_builder.py:build():402] adding wandb-job metadata file
+ 2024-08-12 05:24:49,211 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 9
+ 2024-08-12 05:24:49,212 DEBUG SenderThread:10279 [sender.py:send():382] send: artifact
+ 2024-08-12 05:24:49,212 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:49,213 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 9
+ 2024-08-12 05:24:49,806 INFO Thread-12 :10279 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_052446-qrv0d6sp/files/output.log
+ 2024-08-12 05:24:49,996 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: poll_exit
+ 2024-08-12 05:24:50,071 INFO SenderThread:10279 [sender.py:send_artifact():1494] sent artifact job-https___github.com_cl-tohoku_llm-recipes-failab-m1-yans.git_examples_finetuning.py - {'id': 'QXJ0aWZhY3Q6MTEzOTgzMzc4Mw==', 'state': 'COMMITTED', 'artifactSequence': {'id': 'QXJ0aWZhY3RDb2xsZWN0aW9uOjM2MjY3MjMzNA==', 'latestArtifact': {'id': 'QXJ0aWZhY3Q6MTEzOTgzMzc4Mw==', 'versionIndex': 6}}}
+ 2024-08-12 05:24:50,071 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:50,072 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 9
+ 2024-08-12 05:24:50,072 INFO SenderThread:10279 [dir_watcher.py:finish():358] shutting down directory watcher
+ 2024-08-12 05:24:50,807 INFO SenderThread:10279 [dir_watcher.py:finish():388] scan: /project/wandb/run-20240812_052446-qrv0d6sp/files
+ 2024-08-12 05:24:50,808 INFO SenderThread:10279 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_052446-qrv0d6sp/files/requirements.txt requirements.txt
+ 2024-08-12 05:24:50,808 INFO SenderThread:10279 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_052446-qrv0d6sp/files/config.yaml config.yaml
+ 2024-08-12 05:24:50,808 INFO SenderThread:10279 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_052446-qrv0d6sp/files/wandb-metadata.json wandb-metadata.json
+ 2024-08-12 05:24:50,808 INFO SenderThread:10279 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_052446-qrv0d6sp/files/wandb-summary.json wandb-summary.json
+ 2024-08-12 05:24:50,808 INFO SenderThread:10279 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_052446-qrv0d6sp/files/output.log output.log
+ 2024-08-12 05:24:50,808 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 10
+ 2024-08-12 05:24:50,809 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: poll_exit
+ 2024-08-12 05:24:50,809 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:50,812 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 10
+ 2024-08-12 05:24:50,814 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:50,815 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 10
+ 2024-08-12 05:24:50,815 INFO SenderThread:10279 [file_pusher.py:finish():172] shutting down file pusher
+ 2024-08-12 05:24:50,997 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: poll_exit
+ 2024-08-12 05:24:50,997 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: poll_exit
+ 2024-08-12 05:24:51,206 INFO wandb-upload_1:10279 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240812_052446-qrv0d6sp/files/config.yaml
+ 2024-08-12 05:24:51,307 INFO wandb-upload_0:10279 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240812_052446-qrv0d6sp/files/requirements.txt
+ 2024-08-12 05:24:51,390 INFO wandb-upload_3:10279 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240812_052446-qrv0d6sp/files/output.log
+ 2024-08-12 05:24:51,401 INFO wandb-upload_2:10279 [upload_job.py:push():131] Uploaded file /project/wandb/run-20240812_052446-qrv0d6sp/files/wandb-summary.json
+ 2024-08-12 05:24:51,602 INFO Thread-11 (_thread_body):10279 [sender.py:transition_state():617] send defer: 11
+ 2024-08-12 05:24:51,602 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:51,602 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 11
+ 2024-08-12 05:24:51,603 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:51,603 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 11
+ 2024-08-12 05:24:51,603 INFO SenderThread:10279 [file_pusher.py:join():178] waiting for file pusher
+ 2024-08-12 05:24:51,603 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 12
+ 2024-08-12 05:24:51,603 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:51,603 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 12
+ 2024-08-12 05:24:51,603 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:51,603 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 12
+ 2024-08-12 05:24:51,603 INFO SenderThread:10279 [file_stream.py:finish():595] file stream finish called
+ 2024-08-12 05:24:51,998 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: poll_exit
+ 2024-08-12 05:24:52,287 INFO SenderThread:10279 [file_stream.py:finish():599] file stream finish is done
+ 2024-08-12 05:24:52,287 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 13
+ 2024-08-12 05:24:52,287 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: poll_exit
+ 2024-08-12 05:24:52,287 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:52,288 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 13
+ 2024-08-12 05:24:52,288 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:52,288 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 13
+ 2024-08-12 05:24:52,288 INFO SenderThread:10279 [sender.py:transition_state():617] send defer: 14
+ 2024-08-12 05:24:52,288 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 05:24:52,288 DEBUG SenderThread:10279 [sender.py:send():382] send: final
+ 2024-08-12 05:24:52,288 INFO HandlerThread:10279 [handler.py:handle_request_defer():172] handle defer: 14
+ 2024-08-12 05:24:52,289 DEBUG SenderThread:10279 [sender.py:send():382] send: footer
+ 2024-08-12 05:24:52,289 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 05:24:52,289 INFO SenderThread:10279 [sender.py:send_request_defer():613] handle sender defer: 14
+ 2024-08-12 05:24:52,289 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: poll_exit
+ 2024-08-12 05:24:52,289 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: poll_exit
+ 2024-08-12 05:24:52,290 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: poll_exit
+ 2024-08-12 05:24:52,290 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: poll_exit
+ 2024-08-12 05:24:52,290 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: server_info
+ 2024-08-12 05:24:52,290 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: server_info
+ 2024-08-12 05:24:52,292 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: get_summary
+ 2024-08-12 05:24:52,292 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: sampled_history
+ 2024-08-12 05:24:52,292 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 05:24:52,293 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: job_info
+ 2024-08-12 05:24:52,456 DEBUG SenderThread:10279 [sender.py:send_request():409] send_request: job_info
+ 2024-08-12 05:24:52,457 INFO MainThread:10279 [wandb_run.py:_footer_history_summary_info():3866] rendering history
+ 2024-08-12 05:24:52,457 INFO MainThread:10279 [wandb_run.py:_footer_history_summary_info():3898] rendering summary
+ 2024-08-12 05:24:52,457 INFO MainThread:10279 [wandb_run.py:_footer_sync_info():3825] logging synced files
+ 2024-08-12 05:24:52,457 DEBUG HandlerThread:10279 [handler.py:handle_request():146] handle_request: shutdown
+ 2024-08-12 05:24:52,457 INFO HandlerThread:10279 [handler.py:finish():869] shutting down handler
+ 2024-08-12 05:24:53,293 INFO WriterThread:10279 [datastore.py:close():296] close: /project/wandb/run-20240812_052446-qrv0d6sp/run-qrv0d6sp.wandb
+ 2024-08-12 05:24:53,457 INFO SenderThread:10279 [sender.py:finish():1572] shutting down sender
184
+ 2024-08-12 05:24:53,457 INFO SenderThread:10279 [file_pusher.py:finish():172] shutting down file pusher
185
+ 2024-08-12 05:24:53,457 INFO SenderThread:10279 [file_pusher.py:join():178] waiting for file pusher
wandb/run-20240812_052446-qrv0d6sp/logs/debug.log ADDED
@@ -0,0 +1,28 @@
+ 2024-08-12 05:24:46,287 INFO MainThread:10208 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
+ 2024-08-12 05:24:46,287 INFO MainThread:10208 [wandb_setup.py:_flush():76] Configure stats pid to 10208
+ 2024-08-12 05:24:46,287 INFO MainThread:10208 [wandb_setup.py:_flush():76] Loading settings from /singularity_home/.config/wandb/settings
+ 2024-08-12 05:24:46,287 INFO MainThread:10208 [wandb_setup.py:_flush():76] Loading settings from /project/wandb/settings
+ 2024-08-12 05:24:46,287 INFO MainThread:10208 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'api_key': '***REDACTED***', 'run_notes': 'Train Qwen2'}
+ 2024-08-12 05:24:46,287 INFO MainThread:10208 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
+ 2024-08-12 05:24:46,287 INFO MainThread:10208 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'examples/finetuning.py', 'program_abspath': '/project/examples/finetuning.py', 'program': '/project/examples/finetuning.py'}
+ 2024-08-12 05:24:46,288 INFO MainThread:10208 [wandb_init.py:_log_setup():526] Logging user logs to /project/wandb/run-20240812_052446-qrv0d6sp/logs/debug.log
+ 2024-08-12 05:24:46,288 INFO MainThread:10208 [wandb_init.py:_log_setup():527] Logging internal logs to /project/wandb/run-20240812_052446-qrv0d6sp/logs/debug-internal.log
+ 2024-08-12 05:24:46,288 INFO MainThread:10208 [wandb_init.py:init():566] calling init triggers
+ 2024-08-12 05:24:46,288 INFO MainThread:10208 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
+ config: {'sharding_strategy': 'FULL_SHARD', 'checkpoint_type': 'LOCAL_STATE_DICT', 'fsdp_activation_checkpointing': True, 'fsdp_cpu_offload': False, 'low_cpu_fsdp': False, 'no_meta_device': False, 'data_path': None, 'split': '969, 30, 1', 'train_data_path': ['304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document'], 'valid_data_path': ['304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document'], 'test_data_path': ['304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document'], 'data_cache_path': None, 'vocab_size': None, 'vocab_file': None, 'merge_file': None, 'seq_length': 4096, 'num_workers': 2, 'tokenizer_type': 'HFPreTrainedTokenizer', 'tokenizer_model': '/share/pretrained_lm/Qwen/Qwen2-0.5B', 'reset_position_ids': False, 'reset_attention_mask': False, 'eod_mask_loss': False, 'retro_return_doc_ids': False, 'short_seq_prob': 0.1, 'vocab_extra_ids': 0, 'seed': 1234, 'use_mpi': False, 'wandb_entity': 'iwakawa-koichi-q5-tohoku-nlp6723', 'wandb_name': 'yans-qwen2-0.5B_train_2024-08-12-05:24:35', 'wandb_project': 'llm_tutorial', 'quantization': False, 'use_freeze_layers': False, 'freeze_layers': None, 'bf16': True, 'fp16': False, 'mixed_precision': True, 'param_dtype': None, 'load': '/work/llm_recipes/models/yans-qwen2-0.5B', 'save': '/work/llm_recipes/models/yans-qwen2-0.5B', 'base_model': '/share/pretrained_lm/Qwen/Qwen2-0.5B', 'use_better_transformer': False, 'grad_clip_norm': 1.0, 'eval_interval': 200, 'save_interval': 5, 'eval_iters': 10, 'optimizer': 'adam', 'lr': 2e-05, 'lr_decay_style': 'cosine', 'lr_decay_iters': 20000, 'lr_warmup_iters': 500, 'min_lr': 1e-06, 'train_iters': 20000, 'train_samples': None, 'global_batch_size': 320, 'micro_batch_size': 1, 'make_vocab_size_divisible_by': 128, 'sliding_window_size': 4096, 'skip_batch': None, 'no_save_optimizer_state': False, 'continual_pretraining': False, 'instruction_tuning': False, 'direct_preference_optimization': False, 'attention_dropout': 0.1, 'hidden_dropout': 0.1, 'weight_decay': 0.1, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_eps': 1e-06, 'hf_transformer_model_dir': None, 'instruction_train_data_path': None, 'instruction_valid_data_path': None, 'epoch': None, 'instruction_dataset_size': None, 'save_sampler_state': False, 'label_smoothing': 0.0, 'save_n_checkpoints': 10, 'hf_repo_id': 'koichi12//yans-qwen2-0.5B', 'create_public_hf_repo': False, 'upload_all_checkpoints_to_hf': False, 'hf_upload_retry_limit': 2, 'exit_duration_in_mins': None, 'source_key': None, 'target_key': None, 'attn_implementation': 'flash_attention_2', 'efficient_instruction_tuning': False, 'remove_padding_masking': False, 'save_start_iter': None, 'rank': 0, 'world_size': 1, 'padded_vocab_size': 151680, 'gradient_accumulation_steps': 320}
+ 2024-08-12 05:24:46,288 INFO MainThread:10208 [wandb_init.py:init():616] starting backend
+ 2024-08-12 05:24:46,288 INFO MainThread:10208 [wandb_init.py:init():620] setting up manager
+ 2024-08-12 05:24:46,293 INFO MainThread:10208 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
+ 2024-08-12 05:24:46,293 INFO MainThread:10208 [wandb_init.py:init():628] backend started and connected
+ 2024-08-12 05:24:46,298 INFO MainThread:10208 [wandb_init.py:init():720] updated telemetry
+ 2024-08-12 05:24:46,310 INFO MainThread:10208 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
+ 2024-08-12 05:24:46,808 INFO MainThread:10208 [wandb_run.py:_on_init():2262] communicating current version
+ 2024-08-12 05:24:46,889 INFO MainThread:10208 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.17.6 is available! To upgrade, please run:\n $ pip install wandb --upgrade"
+
+ 2024-08-12 05:24:46,889 INFO MainThread:10208 [wandb_init.py:init():804] starting run threads in backend
+ 2024-08-12 05:24:46,948 INFO MainThread:10208 [wandb_run.py:_console_start():2241] atexit reg
+ 2024-08-12 05:24:46,949 INFO MainThread:10208 [wandb_run.py:_redirect():2096] redirect: wrap_raw
+ 2024-08-12 05:24:46,949 INFO MainThread:10208 [wandb_run.py:_redirect():2161] Wrapping output streams.
+ 2024-08-12 05:24:46,949 INFO MainThread:10208 [wandb_run.py:_redirect():2186] Redirects installed.
+ 2024-08-12 05:24:46,950 INFO MainThread:10208 [wandb_init.py:init():847] run started, returning control to user process
+ 2024-08-12 05:24:53,458 WARNING MsgRouterThr:10208 [router.py:message_loop():77] message_loop has been closed
wandb/run-20240812_052446-qrv0d6sp/run-qrv0d6sp.wandb ADDED
Binary file (7.11 kB).
wandb/run-20240812_072401-esew3nhv/files/config.yaml ADDED
@@ -0,0 +1,335 @@
+ wandb_version: 1
+
+ sharding_strategy:
+ desc: null
+ value: FULL_SHARD
+ checkpoint_type:
+ desc: null
+ value: LOCAL_STATE_DICT
+ fsdp_activation_checkpointing:
+ desc: null
+ value: true
+ fsdp_cpu_offload:
+ desc: null
+ value: false
+ low_cpu_fsdp:
+ desc: null
+ value: false
+ no_meta_device:
+ desc: null
+ value: false
+ data_path:
+ desc: null
+ value: null
+ split:
+ desc: null
+ value: 969, 30, 1
+ train_data_path:
+ desc: null
+ value:
+ - '304771887'
+ - /work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document
+ valid_data_path:
+ desc: null
+ value:
+ - '304771887'
+ - /work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document
+ test_data_path:
+ desc: null
+ value:
+ - '304771887'
+ - /work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document
+ data_cache_path:
+ desc: null
+ value: null
+ vocab_size:
+ desc: null
+ value: null
+ vocab_file:
+ desc: null
+ value: null
+ merge_file:
+ desc: null
+ value: null
+ seq_length:
+ desc: null
+ value: 4096
+ num_workers:
+ desc: null
+ value: 2
+ tokenizer_type:
+ desc: null
+ value: HFPreTrainedTokenizer
+ tokenizer_model:
+ desc: null
+ value: /share/pretrained_lm/Qwen/Qwen2-0.5B
+ reset_position_ids:
+ desc: null
+ value: false
+ reset_attention_mask:
+ desc: null
+ value: false
+ eod_mask_loss:
+ desc: null
+ value: false
+ retro_return_doc_ids:
+ desc: null
+ value: false
+ short_seq_prob:
+ desc: null
+ value: 0.1
+ vocab_extra_ids:
+ desc: null
+ value: 0
+ seed:
+ desc: null
+ value: 1234
+ use_mpi:
+ desc: null
+ value: false
+ wandb_entity:
+ desc: null
+ value: iwakawa-koichi-q5-tohoku-nlp6723
+ wandb_name:
+ desc: null
+ value: yans-qwen2-0.5B_train_2024-08-12-07:23:49
+ wandb_project:
+ desc: null
+ value: llm_tutorial
+ quantization:
+ desc: null
+ value: false
+ use_freeze_layers:
+ desc: null
+ value: false
+ freeze_layers:
+ desc: null
+ value: null
+ bf16:
+ desc: null
+ value: true
+ fp16:
+ desc: null
+ value: false
+ mixed_precision:
+ desc: null
+ value: true
+ param_dtype:
+ desc: null
+ value: null
+ load:
+ desc: null
+ value: /work/llm_recipes/models/yans-qwen2-0.5B
+ save:
+ desc: null
+ value: /work/llm_recipes/models/yans-qwen2-0.5B
+ base_model:
+ desc: null
+ value: /share/pretrained_lm/Qwen/Qwen2-0.5B
+ use_better_transformer:
+ desc: null
+ value: false
+ grad_clip_norm:
+ desc: null
+ value: 1.0
+ eval_interval:
+ desc: null
+ value: 5
+ save_interval:
+ desc: null
+ value: 5
+ eval_iters:
+ desc: null
+ value: 10
+ optimizer:
+ desc: null
+ value: adam
+ lr:
+ desc: null
+ value: 2.0e-05
+ lr_decay_style:
+ desc: null
+ value: cosine
+ lr_decay_iters:
+ desc: null
+ value: 20000
+ lr_warmup_iters:
+ desc: null
+ value: 500
+ min_lr:
+ desc: null
+ value: 1.0e-06
+ train_iters:
+ desc: null
+ value: 20000
+ train_samples:
+ desc: null
+ value: null
+ global_batch_size:
+ desc: null
+ value: 320
+ micro_batch_size:
+ desc: null
+ value: 1
+ make_vocab_size_divisible_by:
+ desc: null
+ value: 128
+ sliding_window_size:
+ desc: null
+ value: 4096
+ skip_batch:
+ desc: null
+ value: null
+ no_save_optimizer_state:
+ desc: null
+ value: false
+ continual_pretraining:
+ desc: null
+ value: false
+ instruction_tuning:
+ desc: null
+ value: false
+ direct_preference_optimization:
+ desc: null
+ value: false
+ attention_dropout:
+ desc: null
+ value: 0.1
+ hidden_dropout:
+ desc: null
+ value: 0.1
+ weight_decay:
+ desc: null
+ value: 0.1
+ adam_beta1:
+ desc: null
+ value: 0.9
+ adam_beta2:
+ desc: null
+ value: 0.95
+ adam_eps:
+ desc: null
+ value: 1.0e-06
+ hf_transformer_model_dir:
+ desc: null
+ value: null
+ instruction_train_data_path:
+ desc: null
+ value: null
+ instruction_valid_data_path:
+ desc: null
+ value: null
+ epoch:
+ desc: null
+ value: null
+ instruction_dataset_size:
+ desc: null
+ value: null
+ save_sampler_state:
+ desc: null
+ value: false
+ label_smoothing:
+ desc: null
+ value: 0.0
+ save_n_checkpoints:
+ desc: null
+ value: 10
+ hf_repo_id:
+ desc: null
+ value: koichi12/yans-qwen2-0.5B
+ create_public_hf_repo:
+ desc: null
+ value: false
+ upload_all_checkpoints_to_hf:
+ desc: null
+ value: false
+ hf_upload_retry_limit:
+ desc: null
+ value: 2
+ exit_duration_in_mins:
+ desc: null
+ value: null
+ source_key:
+ desc: null
+ value: null
+ target_key:
+ desc: null
+ value: null
+ attn_implementation:
+ desc: null
+ value: flash_attention_2
+ efficient_instruction_tuning:
+ desc: null
+ value: false
+ remove_padding_masking:
+ desc: null
+ value: false
+ save_start_iter:
+ desc: null
+ value: null
+ rank:
+ desc: null
+ value: 0
+ world_size:
+ desc: null
+ value: 1
+ padded_vocab_size:
+ desc: null
+ value: 151680
+ gradient_accumulation_steps:
+ desc: null
+ value: 320
+ _wandb:
+ desc: null
+ value:
+ python_version: 3.10.12
+ cli_version: 0.16.3
+ framework: huggingface
+ huggingface_version: 4.43.3
+ is_jupyter_run: false
+ is_kaggle_kernel: false
+ start_time: 1723415041.503914
+ t:
+ 1:
+ - 1
+ - 11
+ - 49
+ - 55
+ - 71
+ 2:
+ - 1
+ - 11
+ - 49
+ - 55
+ - 71
+ 3:
+ - 13
+ - 16
+ - 23
+ 4: 3.10.12
+ 5: 0.16.3
+ 6: 4.43.3
+ 8:
+ - 5
+ 13: linux-x86_64
+ model_architecture:
+ desc: null
+ value: Qwen2ForCausalLM
+ activation_function:
+ desc: null
+ value: silu
+ hidden_size:
+ desc: null
+ value: 896
+ model_type:
+ desc: null
+ value: qwen2
+ max_position_embeddings:
+ desc: null
+ value: 4096
+ num_attention_heads:
+ desc: null
+ value: 14
+ num_hidden_layers:
+ desc: null
+ value: 24
wandb/run-20240812_072401-esew3nhv/files/requirements.txt ADDED
@@ -0,0 +1,271 @@
+ absl-py==2.1.0
+ accelerate==0.33.0
+ aiohttp==3.9.1
+ aiosignal==1.3.1
+ annotated-types==0.6.0
+ apex==0.1
+ appdirs==1.4.4
+ argon2-cffi-bindings==21.2.0
+ argon2-cffi==23.1.0
+ asttokens==2.4.1
+ astunparse==1.6.3
+ async-timeout==4.0.3
+ attrs==23.2.0
+ audioread==3.0.1
+ beautifulsoup4==4.12.3
+ bleach==6.1.0
+ blis==0.7.11
+ cachetools==5.3.2
+ catalogue==2.0.10
+ certifi==2024.2.2
+ cffi==1.16.0
+ charset-normalizer==3.3.2
+ click==8.1.7
+ cloudpathlib==0.16.0
+ cloudpickle==3.0.0
+ cmake==3.28.1
+ colorama==0.4.6
+ comm==0.2.1
+ confection==0.1.4
+ contourpy==1.2.0
+ cubinlinker==0.3.0+2.g405ac64
+ cuda-python==12.3.0rc4+9.gdb8c48a.dirty
+ cudf==23.12.0
+ cugraph-dgl==23.12.0
+ cugraph-service-client==23.12.0
+ cugraph-service-server==23.12.0
+ cugraph==23.12.0
+ cuml==23.12.0
+ cupy-cuda12x==12.3.0
+ cycler==0.12.1
+ cymem==2.0.8
+ cython==3.0.8
+ dask-cuda==23.12.0
+ dask-cudf==23.12.0
+ dask==2023.11.0
+ debugpy==1.8.1
+ decorator==5.1.1
+ defusedxml==0.7.1
+ distributed==2023.11.0
+ dm-tree==0.1.8
+ docker-pycreds==0.4.0
+ einops==0.7.0
+ exceptiongroup==1.2.0
+ execnet==2.0.2
+ executing==2.0.1
+ expecttest==0.1.3
+ fastjsonschema==2.19.1
+ fastrlock==0.8.2
+ filelock==3.13.1
+ flash-attn==2.4.2
+ fonttools==4.48.1
+ frozenlist==1.4.1
+ fsspec==2023.12.2
+ gast==0.5.4
+ gitdb==4.0.11
+ gitpython==3.1.43
+ google-auth-oauthlib==0.4.6
+ google-auth==2.27.0
+ graphsurgeon==0.4.6
+ grpcio==1.60.1
+ huggingface-hub==0.24.5
+ hypothesis==5.35.1
+ idna==3.6
+ importlib-metadata==7.0.1
+ iniconfig==2.0.0
+ intel-openmp==2021.4.0
+ ipadic==1.0.0
+ ipykernel==6.29.2
+ ipython-genutils==0.2.0
+ ipython==8.21.0
+ jedi==0.19.1
+ jinja2==3.1.3
+ joblib==1.3.2
+ json5==0.9.14
+ jsonnet==0.19.1
+ jsonschema-specifications==2023.12.1
+ jsonschema==4.21.1
+ jupyter-client==8.6.0
+ jupyter-core==5.7.1
+ jupyter-tensorboard==0.2.0
+ jupyterlab-pygments==0.3.0
+ jupyterlab-server==1.2.0
+ jupyterlab==2.3.2
+ jupytext==1.16.1
+ kiwisolver==1.4.5
+ langcodes==3.3.0
+ lazy-loader==0.3
+ librosa==0.10.1
+ llvmlite==0.40.1
+ locket==1.0.0
+ logzero==1.7.0
+ lxml==5.2.2
+ markdown-it-py==3.0.0
+ markdown==3.5.2
+ markupsafe==2.1.4
+ matplotlib-inline==0.1.6
+ matplotlib==3.8.2
+ mdit-py-plugins==0.4.0
+ mdurl==0.1.2
+ mecab-python3==1.0.6
+ mistune==3.0.2
+ mkl-devel==2021.1.1
+ mkl-include==2021.1.1
+ mkl==2021.1.1
+ mock==5.1.0
+ more-itertools==9.1.0
+ mpmath==1.3.0
+ msgpack==1.0.7
+ multidict==6.0.4
+ murmurhash==1.0.10
+ nbclient==0.9.0
+ nbconvert==7.16.0
+ nbformat==5.9.2
+ nest-asyncio==1.6.0
+ networkx==2.6.3
+ ninja==1.11.1.1
+ nltk==3.8.1
+ notebook==6.4.10
+ numba==0.57.1+1.g1ff679645
+ numpy==1.24.4
+ nvfuser==0.1.4a0+d0bb811
+ nvidia-dali-cuda120==1.34.0
+ nvidia-pyindex==1.0.9
+ nvtx==0.2.5
+ oauthlib==3.2.2
+ onnx==1.15.0rc2
+ opencv==4.7.0
+ optree==0.10.0
+ packaging==23.2
+ pandas==1.5.3
+ pandocfilters==1.5.1
+ parso==0.8.3
+ partd==1.4.1
+ peft==0.11.1
+ pexpect==4.9.0
+ pillow==10.2.0
+ pip==24.0
+ platformdirs==4.2.0
+ pluggy==1.4.0
+ ply==3.11
+ polygraphy==0.49.4
+ pooch==1.8.0
+ portalocker==2.10.1
+ preshed==3.0.9
+ prettytable==3.9.0
+ prometheus-client==0.19.0
+ prompt-toolkit==3.0.43
+ protobuf==4.24.4
+ psutil==5.9.4
+ ptxcompiler==0.8.1+2.g0d406d6
+ ptyprocess==0.7.0
+ pure-eval==0.2.2
+ pyarrow==14.0.1.dev0+gba5374836.d20240125
+ pyasn1-modules==0.3.0
+ pyasn1==0.5.1
+ pybind11-global==2.11.1
+ pybind11==2.11.1
+ pycocotools==2.0+nv0.8.0
+ pycparser==2.21
+ pydantic-core==2.16.2
+ pydantic==2.6.1
+ pygments==2.17.2
+ pylibcugraph==23.12.0
+ pylibcugraphops==23.12.0
+ pylibraft==23.12.0
+ pynvml==11.4.1
+ pyparsing==3.1.1
+ pytest-flakefinder==1.1.0
+ pytest-rerunfailures==13.0
+ pytest-shard==0.1.2
+ pytest-xdist==3.5.0
+ pytest==8.0.0
+ python-dateutil==2.8.2
+ python-dotenv==1.0.0
+ python-hostlist==1.23.0
+ pytorch-quantization==2.1.2
+ pytz==2023.3.post1
+ pyyaml==6.0.1
+ pyzmq==25.1.2
+ raft-dask==23.12.0
+ rapids-dask-dependency==23.12.1
+ referencing==0.33.0
+ regex==2023.12.25
+ requests-oauthlib==1.3.1
+ requests==2.31.0
+ rich==13.7.0
+ rmm==23.12.0
+ rpds-py==0.17.1
+ rsa==4.9
+ sacrebleu==2.4.0
+ safetensors==0.4.3
+ scikit-learn==1.2.0
+ scipy==1.12.0
+ send2trash==1.8.2
+ sentencepiece==0.1.99
+ sentry-sdk==2.12.0
+ setproctitle==1.3.3
+ setuptools==68.2.2
+ six==1.16.0
+ smart-open==6.4.0
+ smmap==5.0.1
+ sortedcontainers==2.4.0
+ soundfile==0.12.1
+ soupsieve==2.5
+ soxr==0.3.7
+ spacy-legacy==3.0.12
+ spacy-loggers==1.0.5
+ spacy==3.7.2
+ sphinx-glpi-theme==0.6
+ srsly==2.4.8
+ stack-data==0.6.3
+ sympy==1.12
+ tabulate==0.9.0
+ tbb==2021.11.0
+ tblib==3.0.0
+ tensorboard-data-server==0.6.1
+ tensorboard-plugin-wit==1.8.1
+ tensorboard==2.9.0
+ tensorrt==8.6.3
+ terminado==0.18.0
+ termplotlib==0.3.9
+ thinc==8.2.3
+ threadpoolctl==3.2.0
+ thriftpy2==0.4.17
+ tinycss2==1.2.1
+ tokenizers==0.19.1
+ toml==0.10.2
+ tomli==2.0.1
+ toolz==0.12.1
+ torch-tensorrt==2.3.0a0
+ torch==2.3.0a0+ebedce2
+ torchdata==0.7.1a0
+ torchtext==0.17.0a0
+ torchvision==0.18.0a0
+ tornado==6.4
+ tqdm==4.66.1
+ traitlets==5.9.0
+ transformer-engine==1.3.0+5b90b7f
+ transformers==4.43.3
+ treelite-runtime==3.9.1
+ treelite==3.9.1
+ triton==2.2.0+e28a256
+ typer==0.9.0
+ types-dataclasses==0.6.6
+ typing-extensions==4.9.0
+ ucx-py==0.35.0
+ uff==0.6.9
+ ujson==5.8.0
+ urllib3==1.26.18
+ wandb==0.16.3
+ wasabi==1.1.2
+ wcwidth==0.2.13
+ weasel==0.3.4
+ webencodings==0.5.1
+ werkzeug==3.0.1
+ wheel==0.42.0
+ xdoctest==1.0.2
+ xgboost==1.7.6
+ yarl==1.9.4
+ zict==3.0.0
+ zipp==3.17.0
wandb/run-20240812_072401-esew3nhv/files/wandb-metadata.json ADDED
@@ -0,0 +1,215 @@
+ {
+ "os": "Linux-5.15.0-91-generic-x86_64-with-glibc2.35",
+ "python": "3.10.12",
+ "heartbeatAt": "2024-08-11T22:24:02.142128",
+ "startedAt": "2024-08-11T22:24:01.491031",
+ "docker": null,
+ "cuda": null,
+ "args": [
+ "--seq-length",
+ "4096",
+ "--sliding-window-size",
+ "4096",
+ "--micro-batch-size",
+ "1",
+ "--global-batch-size",
+ "320",
+ "--train-iters",
+ "20000",
+ "--tokenizer-type",
+ "HFPreTrainedTokenizer",
+ "--tokenizer-model",
+ "/share/pretrained_lm/Qwen/Qwen2-0.5B",
+ "--train-data-path",
+ "304771887",
+ "/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document",
+ "--valid-data-path",
+ "304771887",
+ "/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document",
+ "--test-data-path",
+ "304771887",
+ "/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document",
+ "--lr",
+ "2e-5",
+ "--min-lr",
+ "1e-6",
+ "--lr-decay-style",
+ "cosine",
+ "--lr-warmup-iters",
+ "500",
+ "--lr-decay-iters",
+ "20000",
+ "--weight-decay",
+ "0.1",
+ "--grad-clip-norm",
+ "1.0",
+ "--optimizer",
+ "adam",
+ "--adam-beta1",
+ "0.9",
+ "--adam-beta2",
+ "0.95",
+ "--adam-eps",
+ "1e-6",
+ "--save-interval",
+ "5",
+ "--eval-interval",
+ "5",
+ "--eval-iters",
+ "10",
+ "--bf16",
+ "--mixed-precision",
+ "--base-model",
+ "/share/pretrained_lm/Qwen/Qwen2-0.5B",
+ "--save",
+ "/work/llm_recipes/models/yans-qwen2-0.5B",
+ "--load",
+ "/work/llm_recipes/models/yans-qwen2-0.5B",
+ "--fsdp-activation-checkpointing",
+ "--sharding-strategy",
+ "FULL_SHARD",
+ "--checkpoint-type",
+ "LOCAL_STATE_DICT",
+ "--save-n-checkpoints",
+ "10",
+ "--hf-upload-retry-limit",
+ "2",
+ "--hf-repo-id",
+ "koichi12/yans-qwen2-0.5B",
+ "--wandb-entity",
+ "iwakawa-koichi-q5-tohoku-nlp6723",
+ "--wandb-project",
+ "llm_tutorial",
+ "--wandb-name",
+ "yans-qwen2-0.5B_train_2024-08-12-07:23:49"
+ ],
+ "state": "running",
+ "program": "/project/examples/finetuning.py",
+ "codePathLocal": "examples/finetuning.py",
+ "codePath": "examples/finetuning.py",
+ "git": {
+ "remote": "https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git",
+ "commit": "6da01327e78c302bc0cfdb335f3ca297e2a19c8c"
+ },
+ "email": null,
+ "root": "/project",
+ "host": "gpu-koiwa-00",
+ "username": "koiwa",
+ "executable": "/usr/bin/python",
+ "cpu_count": 18,
+ "cpu_count_logical": 18,
+ "cpu_freq": {
+ "current": 2400.0429999999997,
+ "min": 0.0,
+ "max": 0.0
+ },
+ "cpu_freq_per_core": [
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ },
+ {
+ "current": 2400.043,
+ "min": 0.0,
+ "max": 0.0
+ }
+ ],
+ "disk": {
+ "/": {
+ "total": 0.0625,
+ "used": 1.1444091796875e-05
+ }
+ },
+ "gpu": "NVIDIA A100-SXM4-40GB",
+ "gpu_count": 1,
+ "gpu_devices": [
+ {
+ "name": "NVIDIA A100-SXM4-40GB",
+ "memory_total": 42949672960
+ }
+ ],
+ "memory": {
+ "total": 56.487823486328125
+ }
+ }
wandb/run-20240812_072401-esew3nhv/logs/debug-internal.log ADDED
@@ -0,0 +1,240 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ 2024-08-12 07:24:01,505 INFO StreamThr :14117 [internal.py:wandb_internal():86] W&B internal server running at pid: 14117, started at: 2024-08-12 07:24:01.504656
2
+ 2024-08-12 07:24:01,507 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status
3
+ 2024-08-12 07:24:01,508 INFO WriterThread:14117 [datastore.py:open_for_write():87] open: /project/wandb/run-20240812_072401-esew3nhv/run-esew3nhv.wandb
4
+ 2024-08-12 07:24:01,509 DEBUG SenderThread:14117 [sender.py:send():382] send: header
5
+ 2024-08-12 07:24:01,545 DEBUG SenderThread:14117 [sender.py:send():382] send: run
6
+ 2024-08-12 07:24:02,027 INFO SenderThread:14117 [dir_watcher.py:__init__():211] watching files in: /project/wandb/run-20240812_072401-esew3nhv/files
7
+ 2024-08-12 07:24:02,028 INFO SenderThread:14117 [sender.py:_start_run_threads():1136] run started: esew3nhv with start time 1723415041.503914
8
+ 2024-08-12 07:24:02,033 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: check_version
9
+ 2024-08-12 07:24:02,033 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: check_version
10
+ 2024-08-12 07:24:02,121 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: run_start
11
+ 2024-08-12 07:24:02,127 DEBUG HandlerThread:14117 [system_info.py:__init__():27] System info init
12
+ 2024-08-12 07:24:02,128 DEBUG HandlerThread:14117 [system_info.py:__init__():42] System info init done
13
+ 2024-08-12 07:24:02,128 INFO HandlerThread:14117 [system_monitor.py:start():194] Starting system monitor
14
+ 2024-08-12 07:24:02,128 INFO SystemMonitor:14117 [system_monitor.py:_start():158] Starting system asset monitoring threads
15
+ 2024-08-12 07:24:02,128 INFO HandlerThread:14117 [system_monitor.py:probe():214] Collecting system info
16
+ 2024-08-12 07:24:02,129 INFO SystemMonitor:14117 [interfaces.py:start():190] Started cpu monitoring
17
+ 2024-08-12 07:24:02,129 INFO SystemMonitor:14117 [interfaces.py:start():190] Started disk monitoring
18
+ 2024-08-12 07:24:02,130 INFO SystemMonitor:14117 [interfaces.py:start():190] Started gpu monitoring
19
+ 2024-08-12 07:24:02,131 INFO SystemMonitor:14117 [interfaces.py:start():190] Started memory monitoring
20
+ 2024-08-12 07:24:02,131 INFO SystemMonitor:14117 [interfaces.py:start():190] Started network monitoring
21
+ 2024-08-12 07:24:02,142 DEBUG HandlerThread:14117 [system_info.py:probe():151] Probing system
22
+ 2024-08-12 07:24:02,144 DEBUG HandlerThread:14117 [system_info.py:_probe_git():136] Probing git
23
+ 2024-08-12 07:24:02,156 DEBUG HandlerThread:14117 [system_info.py:_probe_git():144] Probing git done
24
+ 2024-08-12 07:24:02,156 DEBUG HandlerThread:14117 [system_info.py:probe():199] Probing system done
25
+ 2024-08-12 07:24:02,156 DEBUG HandlerThread:14117 [system_monitor.py:probe():223] {'os': 'Linux-5.15.0-91-generic-x86_64-with-glibc2.35', 'python': '3.10.12', 'heartbeatAt': '2024-08-11T22:24:02.142128', 'startedAt': '2024-08-11T22:24:01.491031', 'docker': None, 'cuda': None, 'args': ('--seq-length', '4096', '--sliding-window-size', '4096', '--micro-batch-size', '1', '--global-batch-size', '320', '--train-iters', '20000', '--tokenizer-type', 'HFPreTrainedTokenizer', '--tokenizer-model', '/share/pretrained_lm/Qwen/Qwen2-0.5B', '--train-data-path', '304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document', '--valid-data-path', '304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document', '--test-data-path', '304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document', '--lr', '2e-5', '--min-lr', '1e-6', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '500', '--lr-decay-iters', '20000', '--weight-decay', '0.1', '--grad-clip-norm', '1.0', '--optimizer', 'adam', '--adam-beta1', '0.9', '--adam-beta2', '0.95', '--adam-eps', '1e-6', '--save-interval', '5', '--eval-interval', '5', '--eval-iters', '10', '--bf16', '--mixed-precision', '--base-model', '/share/pretrained_lm/Qwen/Qwen2-0.5B', '--save', '/work/llm_recipes/models/yans-qwen2-0.5B', '--load', '/work/llm_recipes/models/yans-qwen2-0.5B', '--fsdp-activation-checkpointing', '--sharding-strategy', 'FULL_SHARD', '--checkpoint-type', 'LOCAL_STATE_DICT', '--save-n-checkpoints', '10', '--hf-upload-retry-limit', '2', '--hf-repo-id', 'koichi12/yans-qwen2-0.5B', '--wandb-entity', 'iwakawa-koichi-q5-tohoku-nlp6723', '--wandb-project', 'llm_tutorial', '--wandb-name', 'yans-qwen2-0.5B_train_2024-08-12-07:23:49'), 'state': 'running', 'program': '/project/examples/finetuning.py', 'codePathLocal': 'examples/finetuning.py', 'codePath': 'examples/finetuning.py', 'git': {'remote': 
'https://github.com/cl-tohoku/llm-recipes-failab-m1-yans.git', 'commit': '6da01327e78c302bc0cfdb335f3ca297e2a19c8c'}, 'email': None, 'root': '/project', 'host': 'gpu-koiwa-00', 'username': 'koiwa', 'executable': '/usr/bin/python', 'cpu_count': 18, 'cpu_count_logical': 18, 'cpu_freq': {'current': 2400.0429999999997, 'min': 0.0, 'max': 0.0}, 'cpu_freq_per_core': [{'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}, {'current': 2400.043, 'min': 0.0, 'max': 0.0}], 'disk': {'/': {'total': 0.0625, 'used': 1.1444091796875e-05}}, 'gpu': 'NVIDIA A100-SXM4-40GB', 'gpu_count': 1, 'gpu_devices': [{'name': 'NVIDIA A100-SXM4-40GB', 'memory_total': 42949672960}], 'memory': {'total': 56.487823486328125}}
+ 2024-08-12 07:24:02,156 INFO HandlerThread:14117 [system_monitor.py:probe():224] Finished collecting system info
+ 2024-08-12 07:24:02,156 INFO HandlerThread:14117 [system_monitor.py:probe():227] Publishing system info
+ 2024-08-12 07:24:02,158 INFO HandlerThread:14117 [system_monitor.py:probe():229] Finished publishing system info
+ 2024-08-12 07:24:02,164 DEBUG SenderThread:14117 [sender.py:send():382] send: files
+ 2024-08-12 07:24:02,164 INFO SenderThread:14117 [sender.py:_save_file():1403] saving file wandb-metadata.json with policy now
+ 2024-08-12 07:24:02,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: python_packages
+ 2024-08-12 07:24:02,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: python_packages
+ 2024-08-12 07:24:02,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:24:02,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:24:02,217 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:24:02,505 DEBUG SenderThread:14117 [sender.py:send():382] send: telemetry
+ 2024-08-12 07:24:02,825 INFO wandb-upload_0:14117 [upload_job.py:push():131] Uploaded file /tmp/tmpynfca8juwandb/hnmvl8ac-wandb-metadata.json
+ 2024-08-12 07:24:03,029 INFO Thread-12 :14117 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240812_072401-esew3nhv/files/wandb-metadata.json
+ 2024-08-12 07:24:03,030 INFO Thread-12 :14117 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240812_072401-esew3nhv/files/output.log
+ 2024-08-12 07:24:03,030 INFO Thread-12 :14117 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240812_072401-esew3nhv/files/requirements.txt
+ 2024-08-12 07:24:05,030 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/output.log
+ 2024-08-12 07:24:05,380 DEBUG SenderThread:14117 [sender.py:send():382] send: config
+ 2024-08-12 07:24:05,381 DEBUG SenderThread:14117 [sender.py:send():382] send: config
+ 2024-08-12 07:24:06,031 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/output.log
+ 2024-08-12 07:24:07,031 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/output.log
+ 2024-08-12 07:24:07,381 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:08,032 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/output.log
+ 2024-08-12 07:24:12,382 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:17,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:24:17,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:24:17,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:24:17,384 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:22,385 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:27,385 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:32,173 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:24:32,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:24:32,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:24:33,387 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:34,049 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/config.yaml
+ 2024-08-12 07:24:38,589 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:43,590 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:47,173 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:24:47,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:24:47,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:24:49,433 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:54,434 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:24:59,434 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:02,132 DEBUG SystemMonitor:14117 [system_monitor.py:_start():172] Starting system metrics aggregation loop
+ 2024-08-12 07:25:02,133 DEBUG SenderThread:14117 [sender.py:send():382] send: stats
+ 2024-08-12 07:25:02,173 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:25:02,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:25:02,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:25:05,393 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:10,394 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:15,394 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:17,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:25:17,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:25:17,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:25:21,046 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: partial_history
+ 2024-08-12 07:25:21,089 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:23,081 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/output.log
+ 2024-08-12 07:25:26,090 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:31,091 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:32,134 DEBUG SenderThread:14117 [sender.py:send():382] send: stats
+ 2024-08-12 07:25:32,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:25:32,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:25:32,175 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:25:36,423 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:41,424 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:46,425 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:47,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:25:47,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:25:47,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:25:52,370 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:25:57,371 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:02,135 DEBUG SenderThread:14117 [sender.py:send():382] send: stats
+ 2024-08-12 07:26:02,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:26:02,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:26:02,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:26:02,441 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:07,441 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:12,442 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:17,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:26:17,174 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:26:17,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:26:18,440 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:23,440 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:28,441 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:32,136 DEBUG SenderThread:14117 [sender.py:send():382] send: stats
+ 2024-08-12 07:26:32,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:26:32,175 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:26:32,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:26:34,377 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:36,068 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: partial_history
+ 2024-08-12 07:26:36,070 DEBUG SenderThread:14117 [sender.py:send():382] send: history
+ 2024-08-12 07:26:36,071 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: summary_record
+ 2024-08-12 07:26:36,072 INFO SenderThread:14117 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-12 07:26:36,128 INFO Thread-12 :14117 [dir_watcher.py:_on_file_created():271] file/dir created: /project/wandb/run-20240812_072401-esew3nhv/files/wandb-summary.json
+ 2024-08-12 07:26:37,129 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/output.log
+ 2024-08-12 07:26:40,110 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:45,111 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:47,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:26:47,175 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:26:47,176 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:26:50,379 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:26:55,380 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:00,381 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:02,137 DEBUG SenderThread:14117 [sender.py:send():382] send: stats
+ 2024-08-12 07:27:02,174 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: stop_status
+ 2024-08-12 07:27:02,175 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: stop_status
+ 2024-08-12 07:27:02,216 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: internal_messages
+ 2024-08-12 07:27:06,378 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:11,379 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:13,948 DEBUG SenderThread:14117 [sender.py:send():382] send: exit
+ 2024-08-12 07:27:13,948 INFO SenderThread:14117 [sender.py:send_exit():589] handling exit code: 255
+ 2024-08-12 07:27:13,948 INFO SenderThread:14117 [sender.py:send_exit():591] handling runtime: 191
+ 2024-08-12 07:27:13,950 INFO SenderThread:14117 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-12 07:27:13,950 INFO SenderThread:14117 [sender.py:send_exit():597] send defer
+ 2024-08-12 07:27:13,950 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:13,950 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 0
+ 2024-08-12 07:27:13,950 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:13,950 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 0
+ 2024-08-12 07:27:13,950 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 1
+ 2024-08-12 07:27:13,951 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:13,951 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 1
+ 2024-08-12 07:27:13,951 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:13,951 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 1
+ 2024-08-12 07:27:13,951 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 2
+ 2024-08-12 07:27:13,951 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:13,951 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 2
+ 2024-08-12 07:27:13,951 INFO HandlerThread:14117 [system_monitor.py:finish():203] Stopping system monitor
+ 2024-08-12 07:27:13,951 DEBUG SystemMonitor:14117 [system_monitor.py:_start():179] Finished system metrics aggregation loop
+ 2024-08-12 07:27:13,951 INFO HandlerThread:14117 [interfaces.py:finish():202] Joined cpu monitor
+ 2024-08-12 07:27:13,951 DEBUG SystemMonitor:14117 [system_monitor.py:_start():183] Publishing last batch of metrics
+ 2024-08-12 07:27:13,952 INFO HandlerThread:14117 [interfaces.py:finish():202] Joined disk monitor
+ 2024-08-12 07:27:13,986 INFO HandlerThread:14117 [interfaces.py:finish():202] Joined gpu monitor
+ 2024-08-12 07:27:13,986 INFO HandlerThread:14117 [interfaces.py:finish():202] Joined memory monitor
+ 2024-08-12 07:27:13,986 INFO HandlerThread:14117 [interfaces.py:finish():202] Joined network monitor
+ 2024-08-12 07:27:13,987 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:13,987 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 2
+ 2024-08-12 07:27:13,987 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 3
+ 2024-08-12 07:27:13,987 DEBUG SenderThread:14117 [sender.py:send():382] send: stats
+ 2024-08-12 07:27:13,987 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:13,987 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 3
+ 2024-08-12 07:27:13,989 DEBUG SenderThread:14117 [sender.py:send():382] send: history
+ 2024-08-12 07:27:13,989 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: summary_record
+ 2024-08-12 07:27:13,990 INFO SenderThread:14117 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-12 07:27:13,990 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:13,990 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 3
+ 2024-08-12 07:27:13,990 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 4
+ 2024-08-12 07:27:13,990 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:13,990 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 4
+ 2024-08-12 07:27:13,990 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:13,990 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 4
+ 2024-08-12 07:27:13,991 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 5
+ 2024-08-12 07:27:13,991 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:13,991 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 5
+ 2024-08-12 07:27:13,991 DEBUG SenderThread:14117 [sender.py:send():382] send: summary
+ 2024-08-12 07:27:13,992 INFO SenderThread:14117 [sender.py:_save_file():1403] saving file wandb-summary.json with policy end
+ 2024-08-12 07:27:13,992 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:13,992 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 5
+ 2024-08-12 07:27:13,993 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 6
+ 2024-08-12 07:27:13,993 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:13,993 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 6
+ 2024-08-12 07:27:13,993 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:13,993 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 6
+ 2024-08-12 07:27:13,993 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 7
+ 2024-08-12 07:27:13,993 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:13,993 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:13,993 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 7
+ 2024-08-12 07:27:13,993 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:13,993 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 7
+ 2024-08-12 07:27:14,154 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/wandb-summary.json
+ 2024-08-12 07:27:14,948 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: poll_exit
+ 2024-08-12 07:27:15,265 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 8
+ 2024-08-12 07:27:15,265 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: poll_exit
+ 2024-08-12 07:27:15,265 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:15,265 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 8
+ 2024-08-12 07:27:15,265 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:15,265 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 8
+ 2024-08-12 07:27:15,266 INFO SenderThread:14117 [job_builder.py:build():296] Attempting to build job artifact
+ 2024-08-12 07:27:15,266 INFO SenderThread:14117 [job_builder.py:_get_source_type():426] is repo sourced job
+ 2024-08-12 07:27:15,281 INFO SenderThread:14117 [job_builder.py:build():402] adding wandb-job metadata file
+ 2024-08-12 07:27:15,289 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 9
+ 2024-08-12 07:27:15,290 DEBUG SenderThread:14117 [sender.py:send():382] send: artifact
+ 2024-08-12 07:27:15,290 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:15,291 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 9
+ 2024-08-12 07:27:15,948 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: poll_exit
+ 2024-08-12 07:27:16,156 INFO Thread-12 :14117 [dir_watcher.py:_on_file_modified():288] file/dir modified: /project/wandb/run-20240812_072401-esew3nhv/files/output.log
+ 2024-08-12 07:27:16,288 INFO SenderThread:14117 [sender.py:send_artifact():1494] sent artifact job-https___github.com_cl-tohoku_llm-recipes-failab-m1-yans.git_examples_finetuning.py - {'id': 'QXJ0aWZhY3Q6MTEzOTg5OTc5MQ==', 'state': 'COMMITTED', 'artifactSequence': {'id': 'QXJ0aWZhY3RDb2xsZWN0aW9uOjM2MjY3MjMzNA==', 'latestArtifact': {'id': 'QXJ0aWZhY3Q6MTE0MDA5NDY1MQ==', 'versionIndex': 9}}}
+ 2024-08-12 07:27:16,288 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:16,288 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 9
+ 2024-08-12 07:27:16,288 INFO SenderThread:14117 [dir_watcher.py:finish():358] shutting down directory watcher
+ 2024-08-12 07:27:17,157 INFO SenderThread:14117 [dir_watcher.py:finish():388] scan: /project/wandb/run-20240812_072401-esew3nhv/files
+ 2024-08-12 07:27:17,157 INFO SenderThread:14117 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_072401-esew3nhv/files/requirements.txt requirements.txt
+ 2024-08-12 07:27:17,157 INFO SenderThread:14117 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_072401-esew3nhv/files/config.yaml config.yaml
+ 2024-08-12 07:27:17,158 INFO SenderThread:14117 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_072401-esew3nhv/files/wandb-metadata.json wandb-metadata.json
+ 2024-08-12 07:27:17,158 INFO SenderThread:14117 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_072401-esew3nhv/files/wandb-summary.json wandb-summary.json
+ 2024-08-12 07:27:17,158 INFO SenderThread:14117 [dir_watcher.py:finish():402] scan save: /project/wandb/run-20240812_072401-esew3nhv/files/output.log output.log
+ 2024-08-12 07:27:17,158 INFO SenderThread:14117 [sender.py:transition_state():617] send defer: 10
+ 2024-08-12 07:27:17,158 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: poll_exit
+ 2024-08-12 07:27:17,158 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: defer
+ 2024-08-12 07:27:17,159 INFO HandlerThread:14117 [handler.py:handle_request_defer():172] handle defer: 10
+ 2024-08-12 07:27:17,159 DEBUG SenderThread:14117 [sender.py:send_request():409] send_request: defer
+ 2024-08-12 07:27:17,159 INFO SenderThread:14117 [sender.py:send_request_defer():613] handle sender defer: 10
+ 2024-08-12 07:27:17,159 INFO SenderThread:14117 [file_pusher.py:finish():172] shutting down file pusher
+ 2024-08-12 07:27:22,160 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:27,160 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:32,161 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:37,162 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:42,162 DEBUG HandlerThread:14117 [handler.py:handle_request():146] handle_request: status_report
+ 2024-08-12 07:27:46,742 WARNING StreamThr :14117 [internal.py:is_dead():414] Internal process exiting, parent pid 14046 disappeared
+ 2024-08-12 07:27:46,742 ERROR StreamThr :14117 [internal.py:wandb_internal():152] Internal process shutdown.
+ 2024-08-12 07:27:47,163 INFO SenderThread:14117 [sender.py:finish():1572] shutting down sender
+ 2024-08-12 07:27:47,163 INFO HandlerThread:14117 [handler.py:finish():869] shutting down handler
+ 2024-08-12 07:27:47,163 INFO SenderThread:14117 [file_pusher.py:finish():172] shutting down file pusher
+ 2024-08-12 07:27:47,163 INFO SenderThread:14117 [file_pusher.py:join():178] waiting for file pusher
+ 2024-08-12 07:27:47,163 INFO SenderThread:14117 [file_stream.py:finish():595] file stream finish called
+ 2024-08-12 07:27:47,163 INFO WriterThread:14117 [datastore.py:close():296] close: /project/wandb/run-20240812_072401-esew3nhv/run-esew3nhv.wandb
+ 2024-08-12 07:27:47,333 INFO SenderThread:14117 [file_stream.py:finish():599] file stream finish is done
wandb/run-20240812_072401-esew3nhv/logs/debug.log ADDED
@@ -0,0 +1,29 @@
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_setup.py:_flush():76] Configure stats pid to 14046
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_setup.py:_flush():76] Loading settings from /singularity_home/.config/wandb/settings
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_setup.py:_flush():76] Loading settings from /project/wandb/settings
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'api_key': '***REDACTED***', 'run_notes': 'Train Qwen2'}
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'examples/finetuning.py', 'program_abspath': '/project/examples/finetuning.py', 'program': '/project/examples/finetuning.py'}
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_init.py:_log_setup():526] Logging user logs to /project/wandb/run-20240812_072401-esew3nhv/logs/debug.log
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_init.py:_log_setup():527] Logging internal logs to /project/wandb/run-20240812_072401-esew3nhv/logs/debug-internal.log
+ 2024-08-12 07:24:01,497 INFO MainThread:14046 [wandb_init.py:init():566] calling init triggers
+ 2024-08-12 07:24:01,498 INFO MainThread:14046 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
+ config: {'sharding_strategy': 'FULL_SHARD', 'checkpoint_type': 'LOCAL_STATE_DICT', 'fsdp_activation_checkpointing': True, 'fsdp_cpu_offload': False, 'low_cpu_fsdp': False, 'no_meta_device': False, 'data_path': None, 'split': '969, 30, 1', 'train_data_path': ['304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document'], 'valid_data_path': ['304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document'], 'test_data_path': ['304771887', '/work/llm_recipes/datasets/bin/sample/llm_jp_corpus_v2_ja_wiki_train_0/data_text_document'], 'data_cache_path': None, 'vocab_size': None, 'vocab_file': None, 'merge_file': None, 'seq_length': 4096, 'num_workers': 2, 'tokenizer_type': 'HFPreTrainedTokenizer', 'tokenizer_model': '/share/pretrained_lm/Qwen/Qwen2-0.5B', 'reset_position_ids': False, 'reset_attention_mask': False, 'eod_mask_loss': False, 'retro_return_doc_ids': False, 'short_seq_prob': 0.1, 'vocab_extra_ids': 0, 'seed': 1234, 'use_mpi': False, 'wandb_entity': 'iwakawa-koichi-q5-tohoku-nlp6723', 'wandb_name': 'yans-qwen2-0.5B_train_2024-08-12-07:23:49', 'wandb_project': 'llm_tutorial', 'quantization': False, 'use_freeze_layers': False, 'freeze_layers': None, 'bf16': True, 'fp16': False, 'mixed_precision': True, 'param_dtype': None, 'load': '/work/llm_recipes/models/yans-qwen2-0.5B', 'save': '/work/llm_recipes/models/yans-qwen2-0.5B', 'base_model': '/share/pretrained_lm/Qwen/Qwen2-0.5B', 'use_better_transformer': False, 'grad_clip_norm': 1.0, 'eval_interval': 5, 'save_interval': 5, 'eval_iters': 10, 'optimizer': 'adam', 'lr': 2e-05, 'lr_decay_style': 'cosine', 'lr_decay_iters': 20000, 'lr_warmup_iters': 500, 'min_lr': 1e-06, 'train_iters': 20000, 'train_samples': None, 'global_batch_size': 320, 'micro_batch_size': 1, 'make_vocab_size_divisible_by': 128, 'sliding_window_size': 4096, 'skip_batch': None, 'no_save_optimizer_state': False, 'continual_pretraining': False, 'instruction_tuning': 
False, 'direct_preference_optimization': False, 'attention_dropout': 0.1, 'hidden_dropout': 0.1, 'weight_decay': 0.1, 'adam_beta1': 0.9, 'adam_beta2': 0.95, 'adam_eps': 1e-06, 'hf_transformer_model_dir': None, 'instruction_train_data_path': None, 'instruction_valid_data_path': None, 'epoch': None, 'instruction_dataset_size': None, 'save_sampler_state': False, 'label_smoothing': 0.0, 'save_n_checkpoints': 10, 'hf_repo_id': 'koichi12/yans-qwen2-0.5B', 'create_public_hf_repo': False, 'upload_all_checkpoints_to_hf': False, 'hf_upload_retry_limit': 2, 'exit_duration_in_mins': None, 'source_key': None, 'target_key': None, 'attn_implementation': 'flash_attention_2', 'efficient_instruction_tuning': False, 'remove_padding_masking': False, 'save_start_iter': None, 'rank': 0, 'world_size': 1, 'padded_vocab_size': 151680, 'gradient_accumulation_steps': 320}
+ 2024-08-12 07:24:01,498 INFO MainThread:14046 [wandb_init.py:init():616] starting backend
+ 2024-08-12 07:24:01,498 INFO MainThread:14046 [wandb_init.py:init():620] setting up manager
+ 2024-08-12 07:24:01,503 INFO MainThread:14046 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
+ 2024-08-12 07:24:01,503 INFO MainThread:14046 [wandb_init.py:init():628] backend started and connected
+ 2024-08-12 07:24:01,508 INFO MainThread:14046 [wandb_init.py:init():720] updated telemetry
+ 2024-08-12 07:24:01,540 INFO MainThread:14046 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
+ 2024-08-12 07:24:02,032 INFO MainThread:14046 [wandb_run.py:_on_init():2262] communicating current version
+ 2024-08-12 07:24:02,113 INFO MainThread:14046 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.17.6 is available! To upgrade, please run:\n $ pip install wandb --upgrade"
+
+ 2024-08-12 07:24:02,114 INFO MainThread:14046 [wandb_init.py:init():804] starting run threads in backend
+ 2024-08-12 07:24:02,173 INFO MainThread:14046 [wandb_run.py:_console_start():2241] atexit reg
+ 2024-08-12 07:24:02,173 INFO MainThread:14046 [wandb_run.py:_redirect():2096] redirect: wrap_raw
+ 2024-08-12 07:24:02,174 INFO MainThread:14046 [wandb_run.py:_redirect():2161] Wrapping output streams.
+ 2024-08-12 07:24:02,174 INFO MainThread:14046 [wandb_run.py:_redirect():2186] Redirects installed.
+ 2024-08-12 07:24:02,174 INFO MainThread:14046 [wandb_init.py:init():847] run started, returning control to user process
+ 2024-08-12 07:24:05,379 INFO MainThread:14046 [wandb_run.py:_config_callback():1343] config_cb None None {'model_architecture': 'Qwen2ForCausalLM', 'activation_function': 'silu', 'hidden_size': 896, 'model_type': 'qwen2', 'max_position_embeddings': 4096, 'num_attention_heads': 14, 'num_hidden_layers': 24}
+ 2024-08-12 07:24:05,380 INFO MainThread:14046 [wandb_run.py:_config_callback():1343] config_cb None None {'world_size': 1}