I encountered a RuntimeError with an internal assertion failure when trying to resume training of a custom model from a checkpoint:
RuntimeError: t == DeviceType::CUDAINTERNAL ASSERT FAILED at "../c10/cuda/impl/CUDAGuardImpl.h":24, please report a bug to PyTorch.
This error occurred during the execution of an estimate_loss() function, which runs before actual training resumes on CUDA. It appears to be triggered when the resumed iteration number happens to be a multiple of 2000, presumably the evaluation interval.
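For reference, a minimal sketch of the kind of trigger described above; names such as iter_num, eval_interval, and estimate_loss are assumptions based on this report, not the exact training script:

```python
import torch

eval_interval = 2000  # evaluation runs whenever iter_num % eval_interval == 0

@torch.no_grad()
def estimate_loss():
    # Placeholder standing in for the real evaluation, which averages the loss
    # over a number of batches for the train/val splits.
    return {"train": torch.tensor(2.10), "val": torch.tensor(2.17)}

iter_num = 4000  # e.g. the iteration number restored from the checkpoint
if iter_num % eval_interval == 0:
    losses = estimate_loss()
    print(f"step {iter_num}: val loss {losses['val'].item():.4f}")
```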
I'm happy to help resolve this issue if I can be of any help.
After further investigation, I've identified the source of the assertion failure:
The comparison if losses["val"] < best_val_loss or always_save_checkpoint: fails because losses["val"] lives on the CPU, while best_val_loss is loaded from the checkpoint directly onto the CUDA device by checkpoint = torch.load(ckpt_path, map_location=device).
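One possible way to avoid the mismatch is to drop the loaded value to a plain Python float right after loading the checkpoint, so the later comparison no longer mixes devices. A minimal sketch, assuming the checkpoint stores best_val_loss as a tensor; the path and key names here are illustrative:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt_path = "ckpt.pt"  # hypothetical path for illustration

# Simulate a checkpoint whose best_val_loss was saved as a tensor.
torch.save({"best_val_loss": torch.tensor(2.05)}, ckpt_path)

# map_location=device places all loaded tensors on the training device.
checkpoint = torch.load(ckpt_path, map_location=device)

# Converting to a plain Python float makes the later comparison
#   if losses["val"] < best_val_loss or always_save_checkpoint: ...
# device-agnostic, since a float compares cleanly with a CPU tensor.
best_val_loss = float(checkpoint["best_val_loss"])
print(best_val_loss)
```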