I encountered a RuntimeError with an internal assertion failure when trying to resume training of a custom model from a checkpoint:
RuntimeError: t == DeviceType::CUDAINTERNAL ASSERT FAILED at "../c10/cuda/impl/CUDAGuardImpl.h":24, please report a bug to PyTorch.
This error occurred during the execution of an estimate_loss() function, which runs before actual training resumes on CUDA. It appears to be triggered when the resumed iteration number happens to be a multiple of 2000, presumably the evaluation interval.
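For reference, a minimal sketch of the kind of trigger described above; names such as iter_num, eval_interval, and estimate_loss are assumptions based on this report, not the exact training script:

```python
import torch

eval_interval = 2000  # evaluation runs whenever iter_num % eval_interval == 0

@torch.no_grad()
def estimate_loss():
    # Placeholder standing in for the real evaluation, which averages the loss
    # over a number of batches for the train/val splits.
    return {"train": torch.tensor(2.10), "val": torch.tensor(2.17)}

iter_num = 4000  # e.g. the iteration number restored from the checkpoint
if iter_num % eval_interval == 0:
    losses = estimate_loss()
    print(f"step {iter_num}: val loss {losses['val'].item():.4f}")
```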
I'm happy to help resolve this issue if I can be of any help.
After further investigation, I've identified the source of the assertion failure:
The comparison if losses["val"] < best_val_loss or always_save_checkpoint: fails because losses["val"] lives on the CPU, while best_val_loss is loaded from the checkpoint directly onto the CUDA device by checkpoint = torch.load(ckpt_path, map_location=device).
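One possible way to avoid the mismatch is to drop the loaded value to a plain Python float right after loading the checkpoint, so the later comparison no longer mixes devices. A minimal sketch, assuming the checkpoint stores best_val_loss as a tensor; the path and key names here are illustrative:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt_path = "ckpt.pt"  # hypothetical path for illustration

# Simulate a checkpoint whose best_val_loss was saved as a tensor.
torch.save({"best_val_loss": torch.tensor(2.05)}, ckpt_path)

# map_location=device places all loaded tensors on the training device.
checkpoint = torch.load(ckpt_path, map_location=device)

# Converting to a plain Python float makes the later comparison
#   if losses["val"] < best_val_loss or always_save_checkpoint: ...
# device-agnostic, since a float compares cleanly with a CPU tensor.
best_val_loss = float(checkpoint["best_val_loss"])
print(best_val_loss)
```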