[BUG] Using fp16 uses more memory than using fp32 #1349

Open
eliird opened this issue Jan 8, 2025 · 6 comments

eliird commented Jan 8, 2025

Describe the bug
Using fp16 or bf16 uses more memory than using fp32

To Reproduce
Here are the training parameters I am using to train the model. When I comment out --fp16, the memory usage decreases.
My setup is 8xH100.

GPT_MODEL_ARGS=(
    --num-layers 32
    --hidden-size 4096
    --num-attention-heads 32
    --seq-length 4096
    --no-position-embedding
    --no-masked-softmax-fusion
    --use-rotary-position-embeddings
    --max-position-embeddings 8192
    --attention-dropout 0
    --hidden-dropout 0
    --normalization RMSNorm
    --ffn-hidden-size 14336
    --num-query-groups 8
    --swiglu
    --group-query-attention
    --tokenizer-type HuggingFaceTokenizer
    # --untie-embeddings-and-output-weights
    --position-embedding-type rope
    --disable-bias-linear
    --tokenizer-model $TOKENIZER_SAVE_PATH
)

TRAINING_ARGS=(
    --micro-batch-size $MICRO_BATCH_SIZE
    --global-batch-size $GLOBAL_BATCH_SIZE
    --train-iters 500000
    --weight-decay 0.1
    --adam-beta1 0.9
    --adam-beta2 0.95
    --init-method-std 0.006
    --clip-grad 1.0
    --fp16 # disabling this parameter should use fp32, and it reduces memory usage.
    --lr 6.0e-5
    --lr-decay-style cosine
    --min-lr 6.0e-6
    --lr-warmup-fraction .001
    --lr-decay-iters 430000
    --optimizer sgd
    --empty-unused-memory-level 2
    --recompute-granularity "full"
    --recompute-method uniform
    --recompute-num-layers 1
    --transformer-impl "transformer_engine"

)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 1
    --sequence-parallel
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 949,50,1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 10
    --save-interval 10000
    --eval-interval 1000
    --save $CHECKPOINT_SAVE_PATH
    # --load $CHECKPOINT_LOAD_PATH
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH
    --log-throughput
)

# Quote the array expansions so arguments containing spaces (e.g. paths) stay intact.
python pretrain_gpt.py \
    "${GPT_MODEL_ARGS[@]}" \
    "${TRAINING_ARGS[@]}" \
    "${MODEL_PARALLEL_ARGS[@]}" \
    "${DATA_ARGS[@]}" \
    "${EVAL_AND_LOGGING_ARGS[@]}"

Expected behavior
FP16 should use less memory than FP32.

Stack trace/logs
FP16 memory usage: [image]

FP32 memory usage: [image]

Environment (please complete the following information):

  • Megatron-LM commit ID 1ce944c
  • PyTorch version 2.4
  • CUDA version 12.1
  • NCCL version
@eliird eliird changed the title [BUG] [BUG] Using fp16 uses more memory than using fp32 Jan 8, 2025

eliird commented Jan 8, 2025

I looked at the model-loading code, and it seems the model is moved to the GPU first and only then converted to fp16. Wouldn't that consume more memory while the model is being loaded? It probably has nothing to do with the memory usage I am seeing, but still...

Megatron-LM/megatron/training/training.py line 535

# GPU allocation.
for model_module in model:
    model_module.cuda(torch.cuda.current_device())

# Fp16 conversion.
if args.fp16 or args.bf16:
    model = [Float16Module(model_module, args) for model_module in model]

eliird commented Jan 8, 2025

I am still looking through the code, but the main difference I see is that the fp16 optimizer has groups with both fp32 and fp16 parameters, so memory is probably being duplicated somewhere. I will investigate a bit more, but some feedback would be appreciated, especially if someone can confirm that their memory usage also increases with fp16.

eliird commented Jan 8, 2025

Maybe the cause of the increased memory is that each parameter is detached and cloned in the initialization of the FP16Optimizer class. I am adding the relevant line below, but it is probably better to refer to the code itself. I will do some profiling later.

main_param = param.detach().clone().float()
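
For a rough sense of scale, here is a sketch of how large that fp32 copy could be per GPU for the config in the first post. It counts only the attention and MLP linear weights, assumes they shard evenly across the 8 tensor-parallel ranks, and leaves out embeddings and norms (the vocabulary size is not given in this issue). The function names are made up for illustration and are not Megatron-LM APIs.

```python
def linear_params_per_layer(hidden, ffn_hidden, num_heads, num_query_groups):
    """Linear-layer weights in one transformer layer (no biases, GQA, SwiGLU)."""
    head_dim = hidden // num_heads
    kv_dim = num_query_groups * head_dim
    # Q and output projections are hidden x hidden; K and V are hidden x kv_dim.
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim
    # SwiGLU MLP has three matrices: gate, up and down projections.
    mlp = 3 * hidden * ffn_hidden
    return attn + mlp


def linear_params_per_rank(num_layers, tp_size, **layer_kwargs):
    """Assumption: linear weights are split evenly across tensor-parallel ranks."""
    return num_layers * linear_params_per_layer(**layer_kwargs) // tp_size


# Values taken from GPT_MODEL_ARGS and MODEL_PARALLEL_ARGS in the first post.
params = linear_params_per_rank(
    num_layers=32, tp_size=8,
    hidden=4096, ffn_hidden=14336, num_heads=32, num_query_groups=8,
)
fp32_copy_gib = params * 4 / 2**30  # 4 bytes per fp32 element

print(f"linear params per rank: {params:,}")
print(f"fp32 copy per rank: {fp32_copy_gib:.2f} GiB")
```

Under these assumptions the copy comes to about 3.25 GiB per GPU, before counting any fp32 copy of the gradients.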

JieSor commented Jan 22, 2025

The Distributed Optimizer section in the README explains that, when ZeRO-1 (the distributed optimizer) is not used, the model state for fp16 is 4 bytes per parameter (20-16=4) larger than that for fp32.
This is because fp16 training additionally keeps a copy of the parameters and gradients in fp16 format.

It appears that your launch script does not enable ZeRO-1, and dp_size is 1 (with tp_size set to 8).
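
As a sketch of where the 20 and 16 come from, the per-parameter byte counts can be added up as below. The breakdown (fp32 main copies of parameters and gradients plus two fp32 Adam states) is my reading of the README table, and the helper names are hypothetical, not part of Megatron-LM.

```python
BYTES = {"fp16": 2, "fp32": 4}


def adam_bytes_per_param(model_dtype, dp_size=None):
    """Bytes of model state per parameter with Adam.

    dp_size=None means no distributed optimizer; otherwise the fp32 main
    copies and Adam states are assumed to be sharded over dp_size ranks.
    """
    # Model parameters and their gradients, stored in the model dtype.
    model_states = 2 * BYTES[model_dtype]
    # Mixed precision keeps extra fp32 main params and main grads.
    main_copy = 2 * BYTES["fp32"] if model_dtype != "fp32" else 0
    # Adam keeps two fp32 states per parameter (momentum and variance).
    adam_states = 2 * BYTES["fp32"]
    shard = 1 if dp_size is None else dp_size
    return model_states + (main_copy + adam_states) / shard


print(adam_bytes_per_param("fp16"))             # 20.0
print(adam_bytes_per_param("fp32"))             # 16.0
print(adam_bytes_per_param("fp16", dp_size=1))  # 20.0, no saving when dp_size is 1
print(adam_bytes_per_param("fp16", dp_size=8))  # 6.0
```

With dp_size equal to 1, as in the script above, sharding the optimizer state would not change these numbers.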

eliird commented Jan 22, 2025

I am using TP=8 because I was trying to reduce memory usage for a model that barely fits on a single node, so that I could increase the batch size.

Sorry, I don't understand what the 20 and the 16 being subtracted are. I would be grateful if you could explain.

JieSor commented Jan 23, 2025

> I am using TP=8 because I was trying to reduce memory usage for a model that barely fits on a single node, so that I could increase the batch size.
>
> Sorry, I don't understand what the 20 and the 16 being subtracted are. I would be grateful if you could explain.

The example above calculates the memory usage of the model state with the Adam optimizer: 20 and 16 are the bytes per parameter for fp16 and fp32 respectively. Your script uses the SGD optimizer, so the numbers may be different.

SGD optimizer:
With fp32, it stores parameters and gradients in fp32 format.
With fp16, I guess it additionally stores parameters and gradients in fp16 format, which leads to the increased memory usage.
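
To make that guess concrete, here is a back-of-the-envelope sketch for SGD. It assumes the fp16 path keeps the same fp32 main parameters and gradients as in the Adam case, which this thread has not confirmed. Whether a momentum buffer is allocated depends on the SGD settings, so it is left as a parameter. The helper name is hypothetical.

```python
BYTES = {"fp16": 2, "fp32": 4}


def sgd_bytes_per_param(model_dtype, momentum=False):
    """Estimated bytes of model state per parameter with SGD (assumption-based)."""
    # Model parameters and their gradients, stored in the model dtype.
    model_states = 2 * BYTES[model_dtype]
    # Assumed: mixed precision adds fp32 main params and main grads.
    main_copy = 2 * BYTES["fp32"] if model_dtype != "fp32" else 0
    # Optional fp32 momentum buffer.
    momentum_buf = BYTES["fp32"] if momentum else 0
    return model_states + main_copy + momentum_buf


for use_momentum in (False, True):
    fp16 = sgd_bytes_per_param("fp16", momentum=use_momentum)
    fp32 = sgd_bytes_per_param("fp32", momentum=use_momentum)
    print(f"momentum={use_momentum}: fp16={fp16}, fp32={fp32}, extra={fp16 - fp32}")
```

Either way, under this assumption fp16 comes out 4 bytes per parameter larger than fp32, the same gap as in the Adam numbers.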
