[BUG] Using fp16 uses more memory than using fp32 #1349
Comments
I tried looking at the internal code for loading the model, and it seems the model is moved to the GPU first and only then converted to fp16. Wouldn't that consume more memory while the model is being loaded? It probably has nothing to do with the steady-state memory usage, but still... Megatron-LM/megatron/training/training.py, line 535:

```python
# GPU allocation.
for model_module in model:
    model_module.cuda(torch.cuda.current_device())

# Fp16 conversion.
if args.fp16 or args.bf16:
    model = [Float16Module(model_module, args) for model_module in model]
```
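To illustrate the ordering concern, here is a minimal sketch (my own illustration, not Megatron code; `build_model`, the layer sizes, and the use of `.half()` as a stand-in for `Float16Module` are all assumptions) comparing peak GPU memory when the module is moved to the GPU in fp32 and then cast to half, versus casting to half on the CPU first:

```python
import torch
import torch.nn as nn

def build_model():
    # Hypothetical stand-in for a real model; layer count and sizes are arbitrary.
    return nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])

def peak_mib(load_fn):
    # Measure the peak GPU memory allocated while loading the model one way.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = load_fn()
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated() / 2**20
    del model
    return peak

# Ordering used in training.py: move to GPU in fp32, then convert to half.
print(f"cuda() then half(): {peak_mib(lambda: build_model().cuda().half()):.0f} MiB peak")
# Alternative ordering: convert to half on the CPU, then move to GPU.
print(f"half() then cuda(): {peak_mib(lambda: build_model().half().cuda()):.0f} MiB peak")
```

The first ordering transiently holds the full fp32 copy on the GPU before the fp16 copy replaces it, so its peak is higher even though the final allocation is the same.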
I am still looking through the code, but the main difference is that the fp16 optimizer has parameter groups with both fp32 and fp16 parameters, so duplicate memory is probably being used somewhere. I will try to investigate a bit more, but some feedback on this would be appreciated, especially if someone can confirm that their memory usage also increases with fp16.
Maybe the cause of the increased memory is the parameters being detached and cloned in the initialization of the FP16Optimizer class.
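For context, the standard mixed-precision recipe keeps an fp32 "master" copy of every fp16 parameter for the optimizer to update. A minimal sketch of that pattern (my illustration, not the actual Megatron-LM FP16Optimizer code) looks like this, and it is the detach-and-clone referred to above:

```python
import torch

def build_fp32_master_params(fp16_params):
    """Create fp32 master copies of fp16 parameters, as mixed-precision
    optimizers typically do. The fp16 copies are still needed for the
    forward/backward pass, so parameter memory is held twice
    (2 + 4 bytes per element instead of just 4)."""
    master_params = []
    for p in fp16_params:
        master = p.detach().clone().float()  # fp32 copy the optimizer will step
        master.requires_grad_(True)
        master_params.append(master)
    return master_params
```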
The Distributed Optimizer section in the README explains that, when not using zero-1, the per-parameter model state for fp16 is 4 bytes larger (20 - 16 = 4) than that for fp32. It appears that your parameter script does not enable zero-1, and dp_size is set to 1 (with tp_size set to 8).
I am using TP=8 because I was trying to reduce memory usage for a model that barely fits on a single node, so that I could increase the batch size. I am sorry, I don't understand what the 20 and the 16 being subtracted refer to. I would be grateful if you could explain.
The example mentioned above calculates the memory usage of the optimizer state using the Adam optimizer. Your script uses the SGD optimizer, so the calculation method may be different. SGD optimizer:
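To unpack where the 20 and 16 come from, here is my own back-of-the-envelope reading of that accounting (not an authoritative breakdown; the SGD numbers in particular are my assumption, not taken from the README): with mixed precision plus Adam, each parameter carries an fp16 copy, an fp16 gradient, an fp32 main gradient, an fp32 master copy, and fp32 momentum/variance, while pure fp32 training carries the parameter, gradient, momentum, and variance in fp32.

```python
# Rough persistent bytes per parameter, with no optimizer-state sharding.
fp16_adam = 2 + 2 + 4 + 4 + 4 + 4  # fp16 param + fp16 grad + fp32 main grad
                                   # + fp32 master param + fp32 momentum + fp32 variance
fp32_adam = 4 + 4 + 4 + 4          # fp32 param + grad + momentum + variance
print(fp16_adam, fp32_adam, fp16_adam - fp32_adam)  # 20 16 4

# Plain SGD has no momentum/variance state, but the fp16 path still keeps
# the fp32 master copy (and main grad), so it can still come out larger.
fp16_sgd = 2 + 2 + 4 + 4           # fp16 param + fp16 grad + fp32 main grad + fp32 master param
fp32_sgd = 4 + 4                   # fp32 param + fp32 grad
print(fp16_sgd, fp32_sgd)          # 12 8
```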
Describe the bug
Using fp16 or bf16 uses more memory than using fp32
To Reproduce
Here are the training parameters I am using to train the model. When I comment out `--fp16`, the memory usage increases. My setup is 8xH100.
Expected behavior
FP16 should use less memory than FP32.
Stack trace/logs
FP16 MEMORY USAGE
FP32 MEMORY USAGE
Environment (please complete the following information):