
Exemplar MAE with DDP does not work #1775

Open

mcleod-matthew-gene opened this issue Jan 7, 2025 · 2 comments

@mcleod-matthew-gene
Hello all,

Thanks for the great open source package. I noticed that the MAE example does not work when trained with PyTorch Lightning and DDP. There appears to be an issue with unused parameters, i.e.

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets()

If you print which parameters do not have a gradient, you'll see they are vit.pos_embed, vit.head.weight, and vit.head.bias. The unused head parameters make sense, but I don't see why vit.pos_embed would be unused.
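(For reference, a minimal sketch of how such a check might look after a backward pass; model here is just a placeholder for the LightningModule:)

import torch

# After loss.backward() / a training step, list parameters without a gradient.
for name, param in model.named_parameters():
    if param.grad is None:
        print(name)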

I'd really appreciate it if you could confirm that this is an issue with the example on main and whether a fix is on the roadmap.

Thanks!

@guarin
Contributor

guarin commented Jan 8, 2025

Hi, thanks for raising the issue! This is indeed wrong in the example. You have to set strategy="ddp_find_unused_parameters_true" for it to work.
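For reference, a minimal sketch of how that looks in the Trainer (assuming a recent PyTorch Lightning; both forms below are equivalent):

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Either pass the string shorthand ...
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp_find_unused_parameters_true")

# ... or configure the strategy object explicitly.
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy=DDPStrategy(find_unused_parameters=True))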

If for some reason you cannot use ddp_find_unused_parameters_true and have to use ddp, you can also drop the unused classifier weights from the backbone. I believe this is possible with:

import pytorch_lightning as pl
from timm.models.vision_transformer import vit_base_patch32_224


class MAE(pl.LightningModule):
    def __init__(self):
        super().__init__()

        decoder_dim = 512
        vit = vit_base_patch32_224()
        # Remove the classification head so its unused weights are no longer
        # parameters that DDP expects gradients for.
        vit.reset_classifier(0, '')
        ...

Finally, if you want to reproduce the results from the paper, I suggest you follow the more complete implementation here:

Regarding the positional embedding: MAE uses a fixed 2D sin-cos positional embedding, and the corresponding parameter has requires_grad=False. See:

If I remember correctly, DDP expects all parameters to receive an update even if requires_grad=False (I might be wrong there, though). We have an open issue regarding this: #1434
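For context, a minimal sketch of the pattern that causes this (illustrative, not lightly's exact code): the positional embedding is registered as a parameter but frozen, so it never receives a gradient:

import torch
import torch.nn as nn


class EncoderWithFixedPosEmbed(nn.Module):
    def __init__(self, seq_len: int, embed_dim: int):
        super().__init__()
        # The parameter is part of the module (and therefore visible to DDP),
        # but requires_grad=False means it never gets a gradient.
        self.pos_embed = nn.Parameter(
            torch.zeros(1, seq_len, embed_dim), requires_grad=False
        )
        # The fixed 2D sin-cos values would be copied into pos_embed here.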

@liopeer
Contributor

liopeer commented Jan 8, 2025

@mcleod-matthew-gene In the paper you will see that they also use sinusoidal positional embeddings; see Masked Autoencoders Are Scalable Vision Learners, Appendix A.1 (ViT architecture):

Our MAE adds positional embeddings [57] (the sine-cosine version) to both the encoder and decoder inputs.

Therefore I would also suggest proceeding as suggested above.
