When I set the dataset path on the sub-node to the mounted NFS shared path [model/sharedata], the sub-node looks for a cache file that doesn't exist on the main node:

FileNotFoundError: [Errno 2] No such file or directory: '/workspace/megatron-lm/models/gpt3/dataset/BookCorpusDataset_text_document/cache/GPTDataset_indices/2bcc0d01e685e944ad9c0c8ea43d126f-GPTDataset-train-document_index.npy'
However, when I set the dataset path on the sub-node to the non-shared path [model/data], the sub-node looks for a cache file whose name does exist on the main node, but the file is still missing locally:

FileNotFoundError: [Errno 2] No such file or directory: '/workspace/megatron-lm/models/gpt3/dataset/BookCorpusDataset_text_document/cache/GPTDataset_indices/fa7397310cb8333b787979cb7c45c55f-GPTDataset-train-document_index.npy'
Could it be due to different seeds? Why is this happening?
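For context on those hash prefixes: Megatron-LM builds each cache filename from an MD5 hash of a "unique description" of the dataset, which folds in the dataset path prefix along with config values such as the split and random seed. A minimal sketch of the idea, assuming recent megatron/core behavior; the field names here are illustrative, not the exact ones the library uses:

```python
import hashlib
import json

def cache_hash(path_prefix: str, index_split: str, random_seed: int) -> str:
    # The description includes the dataset path prefix, so mounting the same
    # data under a different path yields a different hash (and filename).
    unique_description = json.dumps(
        {
            "class": "GPTDataset",
            "path_prefix": path_prefix,  # differs between [model/data] and [model/sharedata]
            "index_split": index_split,
            "random_seed": random_seed,
        },
        indent=4,
    )
    return hashlib.md5(unique_description.encode("utf-8")).hexdigest()

# Same seed and split, different mount path -> different cache filename prefix:
print(cache_hash("/workspace/.../data/BookCorpusDataset_text_document", "train", 1234))
print(cache_hash("/workspace/.../sharedata/BookCorpusDataset_text_document", "train", 1234))
```

So a different hash on the sub-node usually points at a different effective dataset path, not a different seed.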
I encountered the same issue.
Here are the cache files generated by the main node:
data/BookCorpusDataset_text_document/cache/GPTDataset_indices:
6d84f3595d97dfece2364c3950a26906-GPTDataset-valid-description.txt
6d84f3595d97dfece2364c3950a26906-GPTDataset-valid-document_index.npy
6d84f3595d97dfece2364c3950a26906-GPTDataset-valid-sample_index.npy
6d84f3595d97dfece2364c3950a26906-GPTDataset-valid-shuffle_index.npy
d296b3899150edfd9092c34b30fa03c1-GPTDataset-test-description.txt
d296b3899150edfd9092c34b30fa03c1-GPTDataset-test-document_index.npy
d296b3899150edfd9092c34b30fa03c1-GPTDataset-test-sample_index.npy
d296b3899150edfd9092c34b30fa03c1-GPTDataset-test-shuffle_index.npy
fa7397310cb8333b787979cb7c45c55f-GPTDataset-train-description.txt
fa7397310cb8333b787979cb7c45c55f-GPTDataset-train-document_index.npy
fa7397310cb8333b787979cb7c45c55f-GPTDataset-train-sample_index.npy
fa7397310cb8333b787979cb7c45c55f-GPTDataset-train-shuffle_index.npy
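The `*-description.txt` files in that listing record the description string each hash was computed from, so diffing the train descriptions from the two nodes should show exactly which field diverges. A small sketch, assuming you have copied the sub-node's file next to the main node's (the two filenames below are hypothetical placeholders):

```python
import difflib
from pathlib import Path

# Hypothetical local copies of the train description files from each node.
main = Path("main-train-description.txt").read_text().splitlines()
sub = Path("sub-train-description.txt").read_text().splitlines()

# Any differing line (typically the data path) is what changed the MD5 hash.
for line in difflib.unified_diff(main, sub, fromfile="main", tofile="sub", lineterm=""):
    print(line)
```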
I copied the index cache generated by the master node [master/data/BookCorpusDataset_text_document/cache/GPTDataset_indices] to the same path on the sub-node, and the model was able to run. Could you please explain the reason for this?
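That behavior is consistent with how the cache seems intended to work: once the description (and therefore the hash) matches, each node only tries to read the index files, and they are built once rather than per node, so on non-shared storage every other node needs its own copy. A hedged pre-flight sketch that checks the cache before launching; the hash prefix is taken from the listing above and the path from the error message:

```python
from pathlib import Path

# Verify, on each node, that the index cache the training job will look for
# actually exists locally (path and hash taken from the logs in this thread).
cache_dir = Path(
    "/workspace/megatron-lm/models/gpt3/dataset/"
    "BookCorpusDataset_text_document/cache/GPTDataset_indices"
)
prefix = "fa7397310cb8333b787979cb7c45c55f-GPTDataset-train"
parts = ["description.txt", "document_index.npy", "sample_index.npy", "shuffle_index.npy"]

missing = [f"{prefix}-{p}" for p in parts if not (cache_dir / f"{prefix}-{p}").exists()]
if missing:
    raise SystemExit(f"Missing index cache files on this node: {missing}")
print("Index cache complete; safe to launch training.")
```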
Originally posted by @stay88 in #907