Here are all the configs used. Note that this was a while ago, so some parameter names may have changed since.
deepspeed.yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  deepspeed_config_file: ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
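For context: this Accelerate config only describes the launch topology (8 local processes) and defers the actual DeepSpeed settings to the ds_config.json below. A config like this is normally handed to the launcher along the lines of `accelerate launch --config_file deepspeed.yaml <train_script>.py --config experiment_config.yaml` (the script name is a placeholder here, not the exact command used).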
ds_config.json
{
  "bf16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 1e6,
    "stage3_prefetch_bucket_size": 0.94e6,
    "stage3_param_persistence_threshold": 1e4,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_fp16_weights_on_model_save": true
  },
  "train_batch_size": "auto",
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
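The "auto" entries are resolved at runtime from the Hugging Face training arguments, so the global batch size comes from the experiment config below rather than from this file. A quick sanity check of the resolved value (a sketch, assuming the 8 processes from deepspeed.yaml):

```python
# Sketch: how DeepSpeed's "auto" train_batch_size resolves from the HF training
# arguments (numbers taken from the configs in this post).
per_device_train_batch_size = 4   # experiment_config.yaml
gradient_accumulation_steps = 1   # experiment_config.yaml
num_processes = 8                 # deepspeed.yaml

train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)
print(train_batch_size)  # 32 completions per optimizer step
# With num_generations=4 in GRPO, those 32 completions correspond to
# 8 unique prompts per step (4 sampled completions per prompt).
```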
experiment_config.yaml
# Model arguments
model_name_or_path: Qwen/Qwen2.5-Math-72B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
# Data training arguments
dataset_name: DigitalLearningGmbH/MATH-lighteval
dataset_config: default
dataset_prompt_column: problem
system_prompt: "You are a helpful AI Assistant, designed to provide well-reasoned and detailed responses. You FIRST think about the reasoning process as an internal monologue and then provide the user with the answer. The reasoning process MUST BE enclosed within <think> and </think> tags."
# GRPO trainer config
bf16: true
use_vllm: true
vllm_mode: colocate
vllm_tensor_parallel_size: 8
vllm_gpu_memory_utilization: 0.5
vllm_enable_prefix_caching: false
vllm_max_model_len: 4096
do_eval: false
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 3.0e-06
log_completions: false
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 512
max_completion_length: 3584
max_steps: -1
num_generations: 4
num_train_epochs: 1
overwrite_output_dir: true
per_device_train_batch_size: 4
push_to_hub: false
reward_funcs:
- accuracy
- format
reward_weights:
- 1.0
- 1.0
eval_strategy: "no"
save_strategy: "steps"
save_steps: 30
save_total_limit: 3
report_to:
- wandb
seed: 42
temperature: 0.7
warmup_ratio: 0.1
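For completeness, here is a minimal sketch of how a config like this gets consumed by a TRL-style GRPO script. It is not the exact script used: the format_reward below is a hypothetical stand-in for the accuracy/format reward functions named above, the output_dir is a placeholder, and the prompt-column renaming mimics dataset_prompt_column by hand.

```python
# Minimal sketch, not the exact training script: load the dataset, map the
# prompt column, and hand everything to TRL's GRPOTrainer with a placeholder reward.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer


def format_reward(completions, **kwargs):
    # Hypothetical stand-in for the "format" reward: check for the <think>
    # tags demanded by the system prompt.
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]


dataset = load_dataset("DigitalLearningGmbH/MATH-lighteval", "default", split="train")
dataset = dataset.rename_column("problem", "prompt")  # dataset_prompt_column by hand

training_args = GRPOConfig(
    output_dir="qwen2.5-math-72b-grpo",  # placeholder output dir
    bf16=True,
    use_vllm=True,
    num_generations=4,
    per_device_train_batch_size=4,
    gradient_checkpointing=True,
    learning_rate=3.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    max_prompt_length=512,
    max_completion_length=3584,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Math-72B",
    reward_funcs=[format_reward],  # accuracy reward omitted for brevity
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

The actual run wires these values in from experiment_config.yaml via the script's argument parser rather than hard-coding them.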