
auto_find_batch_size isn't yet supported with DeepSpeed/FSDP. Raise error accordingly. #29058

Merged (1 commit) Feb 16, 2024

Conversation

pacman100 (Contributor)

What does this PR do?

  1. While examining whether the auto_find_batch_size issue with DeepSpeed was solved by Zach's previous PR, since someone commented on that PR that the issue is still there: Support DeepSpeed when using auto find batch size #28088 (comment)

When I try https://github.com/pacman100/DHS-LLM-Workshop/tree/main/chat_assistant/sft/training with the following command:

```shell
accelerate launch --config_file "configs/deepspeed_config.yaml" train.py \
--seed 100 \
--model_name_or_path "mistralai/Mistral-7B-v0.1" \
--dataset_name "smangrul/code-chat-assistant-v1" \
--chat_template_format "none" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 2e-5 \
--lr_scheduler_type "cosine" \
--weight_decay 0.0 \
--warmup_ratio 0.1 \
--max_grad_norm 1.0 \
--output_dir "mistral-sft-ds" \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--dataset_text_field "content" \
--use_flash_attn True \
--auto_find_batch_size True
```

I get a different error:

```
File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1328, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1477, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1510, in _partition_param
    free_param(param)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/fsx/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 285, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 26, 'status': 'AVAILABLE', 'numel': 4096, 'ds_numel': 4096, 'shape': (4096,), 'ds_shape': (4096,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {44}, 'ds_tensor.shape': torch.Size([512])}
[2024-02-09 14:50:57,113] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 230181 closing signal SIGTERM
[2024-02-09 14:50:57,646] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 230182) of binary: /fsx/sourab/miniconda3/envs/hf/bin/python
```
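For background, `auto_find_batch_size` relies on Accelerate's `find_executable_batch_size` utility, which reruns the training function with a halved batch size whenever an out-of-memory error is raised. The sketch below mimics that retry loop in plain Python (simplified for illustration, not the actual Accelerate implementation); `train` is a hypothetical training step that only fits when `batch_size <= 16`:

```python
import functools


def find_executable_batch_size(function, starting_batch_size=128):
    """Retry `function`, halving the batch size on each memory error.

    Simplified sketch of accelerate.utils.find_executable_batch_size.
    """

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        batch_size = starting_batch_size
        while batch_size > 0:
            try:
                return function(batch_size, *args, **kwargs)
            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    batch_size //= 2  # halve and retry
                else:
                    raise
        raise RuntimeError("No executable batch size found, reached zero.")

    return wrapper


# Hypothetical training step that only fits when batch_size <= 16
def train(batch_size):
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory")
    return batch_size


tuned_train = find_executable_batch_size(train, starting_batch_size=64)
print(tuned_train())  # -> 16 (after retries at 64 and 32)
```

The traceback above shows why this pattern is fragile under ZeRO-3: the retry re-enters training with DeepSpeed's parameter partitioning still holding state from the failed attempt.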

As auto_find_batch_size is a nice-to-have feature rather than a necessity, and given the obscure errors observed with DeepSpeed/FSDP, we don't want to spend more time on this at present. Hence, this PR raises an error when auto_find_batch_size is used with DeepSpeed/FSDP.
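The guard could look roughly like the sketch below (the function name and flag parameters are illustrative, not the exact Trainer attributes; the real check lives inside `Trainer`):

```python
def check_auto_find_batch_size(auto_find_batch_size: bool,
                               is_deepspeed_enabled: bool,
                               is_fsdp_enabled: bool) -> None:
    """Raise early instead of failing with an obscure partitioning error.

    Illustrative sketch of the validation this PR adds; names are hypothetical.
    """
    if auto_find_batch_size and (is_deepspeed_enabled or is_fsdp_enabled):
        raise ValueError(
            "`auto_find_batch_size` isn't yet supported with DeepSpeed or FSDP. "
            "Please pass a fixed `per_device_train_batch_size` instead."
        )
```

Failing fast here turns the AssertionError deep inside DeepSpeed's partitioning code into an actionable message at configuration time.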

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@muellerzr (Contributor) left a comment

What's annoying with this is that at the time it was working and that test was passing, but if it's niche enough (and we definitely don't have the bandwidth right now), raising this error makes sense. If enough users vocalize concern we can look at reverting it.

@amyeroberts (Collaborator) left a comment

Thanks for adding this!

Agreed with @muellerzr - raising an error makes sense, and we can work on adding support if it's something that is requested a lot.

@pacman100 pacman100 merged commit 4c18ddb into main Feb 16, 2024
21 checks passed
@pacman100 pacman100 deleted the smangrul/throw-error-auto-batch-size-for-ds-fsdp branch February 16, 2024 12:41
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Feb 19, 2024
itazap pushed a commit that referenced this pull request May 14, 2024