|
1: 2023-04-27 15:54:46.865633: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-27 15:54:46.865658: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-27 15:54:46.865639: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-27 15:54:46.865670: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-27 15:54:46.865698: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-27 15:54:46.865712: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-27 15:54:46.865722: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-27 15:54:46.865732: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:46.866334: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:46.866381: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:46.866405: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:46.866422: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:46.866423: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:46.866376: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:46.866429: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:46.866443: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-27 15:54:54.275706: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:54:54.275726: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:54:54.275736: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:54:54.275754: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:54:54.275770: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:54:54.275778: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:54:54.275774: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:54:54.275780: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:54:54.280268: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-27 15:54:54.280299: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-27 15:54:54.280321: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-27 15:54:54.280348: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-27 15:54:54.280358: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-27 15:54:54.280366: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-27 15:54:54.280378: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-27 15:54:54.280589: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-27 15:54:54.295518: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:54:54.295543: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:54:54.295595: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:54:54.295601: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:54:54.295558: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:54:54.295598: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:54:54.295613: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:54:54.295641: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:54:54.296314: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-27 15:54:54.296327: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-27 15:54:54.296339: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-27 15:54:54.296344: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-27 15:54:54.296367: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-27 15:54:54.296370: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-27 15:54:54.296376: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-27 15:54:54.296384: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-27 15:55:19.995941: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.995971: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.995987: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.996017: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.996024: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.996036: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.996037: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.996243: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997343: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997350: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997349: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997356: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997367: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997395: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-27 15:55:19.997394: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-27 15:55:19.997395: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-27 15:55:19.997396: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-27 15:55:19.997384: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997405: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-27 15:55:19.997429: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-27 15:55:19.997453: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997475: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-27 15:55:19.997484: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-27 15:55:19.997506: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-27 15:55:20.004632: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.004664: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.004687: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.004697: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.004716: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.004722: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.004736: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.004964: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006347: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006347: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006347: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006350: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006351: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006349: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006354: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006352: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-27 15:55:20.006367: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-27 15:55:20.006368: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-27 15:55:20.006370: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-27 15:55:20.006369: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-27 15:55:20.006369: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-27 15:55:20.006371: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-27 15:55:20.006372: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-27 15:55:20.006372: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: Loading extension module scaled_upper_triang_masked_softmax_cuda... |
|
0: Successfully preprocessed all matching files. |
|
0: Detected CUDA files, patching ldflags |
|
0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... |
|
0: Building extension module scaled_masked_softmax_cuda... |
|
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
0: Loading extension module scaled_masked_softmax_cuda... |
|
0: Successfully preprocessed all matching files. |
|
0: Detected CUDA files, patching ldflags |
|
0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... |
|
0: Building extension module fused_mix_prec_layer_norm_cuda... |
|
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
0: Loading extension module fused_mix_prec_layer_norm_cuda... |
|
0: Successfully preprocessed all matching files. |
|
0: Successfully preprocessed all matching files. |
|
1: Successfully preprocessed all matching files. |
|
1: Successfully preprocessed all matching files. |
|
1: Successfully preprocessed all matching files. |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Emitting ninja build file /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu/utils/build.ninja... |
|
0: Building extension module utils... |
|
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: Traceback (most recent call last): |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: Traceback (most recent call last): |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: main() |
|
1: main()Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: main()pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: return f(*args, **kwargs) |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: IndexError: list index out of range |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: IndexError: list index out of range |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: Traceback (most recent call last): |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: main() |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: return f(*args, **kwargs) |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: current.data.copy_(src_tensor.data) |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: current.data.copy_(src_tensor.data) |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 42284) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python |
|
1: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 71550) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python |
|
1: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/0/error.json) |
|
1: Traceback (most recent call last): |
|
1: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main |
|
1: return _run_code(code, main_globals, None, |
|
1: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code |
|
1: exec(code, run_globals) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in <module> |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main |
|
1: run(args) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run |
|
0: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/0/error.json) |
|
1: elastic_launch( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
0: Traceback (most recent call last): |
|
0: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main |
|
0: return _run_code(code, main_globals, None, |
|
0: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code |
|
0: exec(code, run_globals) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in <module> |
|
1: return launch_agent(self._config, self._entrypoint, list(args)) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent |
|
1: raise ChildFailedError( |
|
1: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
1: ============================================================ |
|
1: Megatron-DeepSpeed/pretrain_gpt.py FAILED |
|
1: ------------------------------------------------------------ |
|
1: Failures: |
|
1: [1]: |
|
1: time : 2023-04-27_15:57:30 |
|
1: host : nid007281 |
|
1: rank : 9 (local_rank: 1) |
|
1: exitcode : 1 (pid: 71551) |
|
1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/1/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [2]: |
|
1: time : 2023-04-27_15:57:30 |
|
1: host : nid007281 |
|
1: rank : 10 (local_rank: 2) |
|
1: exitcode : 1 (pid: 71552) |
|
1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/2/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [3]: |
|
1: time : 2023-04-27_15:57:30 |
|
1: host : nid007281 |
|
1: rank : 11 (local_rank: 3) |
|
1: exitcode : 1 (pid: 71553) |
|
1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/3/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [4]: |
|
1: time : 2023-04-27_15:57:30 |
|
1: host : nid007281 |
|
1: rank : 12 (local_rank: 4) |
|
1: exitcode : 1 (pid: 71554) |
|
1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/4/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [5]: |
|
1: time : 2023-04-27_15:57:30 |
|
1: host : nid007281 |
|
1: rank : 13 (local_rank: 5) |
|
1: exitcode : 1 (pid: 71555) |
|
1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/5/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [6]: |
|
1: time : 2023-04-27_15:57:30 |
|
1: host : nid007281 |
|
1: rank : 14 (local_rank: 6) |
|
1: exitcode : 1 (pid: 71556) |
|
1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/6/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [7]: |
|
1: time : 2023-04-27_15:57:30 |
|
1: host : nid007281 |
|
1: rank : 15 (local_rank: 7) |
|
1: exitcode : 1 (pid: 71557) |
|
1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/7/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
0: run(args) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: ------------------------------------------------------------ |
|
1: Root Cause (first observed failure): |
|
1: [0]: |
|
1: time : 2023-04-27_15:57:30 |
|
1: host : nid007281 |
|
1: rank : 8 (local_rank: 0) |
|
1: exitcode : 1 (pid: 71550) |
|
1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/0/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
0: elastic_launch( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: ============================================================ |
|
0: return launch_agent(self._config, self._entrypoint, list(args)) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent |
|
0: raise ChildFailedError( |
|
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
0: ============================================================ |
|
0: Megatron-DeepSpeed/pretrain_gpt.py FAILED |
|
0: ------------------------------------------------------------ |
|
0: Failures: |
|
0: [1]: |
|
0: time : 2023-04-27_15:57:31 |
|
0: host : nid007280 |
|
0: rank : 1 (local_rank: 1) |
|
0: exitcode : 1 (pid: 42285) |
|
0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/1/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [2]: |
|
0: time : 2023-04-27_15:57:31 |
|
0: host : nid007280 |
|
0: rank : 2 (local_rank: 2) |
|
0: exitcode : 1 (pid: 42286) |
|
0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/2/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [3]: |
|
0: time : 2023-04-27_15:57:31 |
|
0: host : nid007280 |
|
0: rank : 3 (local_rank: 3) |
|
0: exitcode : 1 (pid: 42287) |
|
0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/3/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [4]: |
|
0: time : 2023-04-27_15:57:31 |
|
0: host : nid007280 |
|
0: rank : 4 (local_rank: 4) |
|
0: exitcode : 1 (pid: 42288) |
|
0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/4/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [5]: |
|
0: time : 2023-04-27_15:57:31 |
|
0: host : nid007280 |
|
0: rank : 5 (local_rank: 5) |
|
0: exitcode : 1 (pid: 42289) |
|
0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/5/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [6]: |
|
0: time : 2023-04-27_15:57:31 |
|
0: host : nid007280 |
|
0: rank : 6 (local_rank: 6) |
|
0: exitcode : 1 (pid: 42290) |
|
0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/6/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [7]: |
|
0: time : 2023-04-27_15:57:31 |
|
0: host : nid007280 |
|
0: rank : 7 (local_rank: 7) |
|
0: exitcode : 1 (pid: 42291) |
|
0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/7/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: ------------------------------------------------------------ |
|
0: Root Cause (first observed failure): |
|
0: [0]: |
|
0: time : 2023-04-27_15:57:31 |
|
0: host : nid007280 |
|
0: rank : 0 (local_rank: 0) |
|
0: exitcode : 1 (pid: 42284) |
|
0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/0/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: ============================================================ |
|
srun: error: nid007281: task 1: Exited with exit code 1 |
|
srun: launch/slurm: _step_signal: Terminating StepId=3423781.0 |
|
srun: error: nid007280: task 0: Exited with exit code 1 |
|
|