Commit e2f4d85

[megatron] Support dpo adapters (#5451)
1 parent c4c2705 commit e2f4d85

23 files changed: +146 −80 lines changed

docs/source/GetStarted/SWIFT安装.md

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ pip install ms-swift==2.*

  ## Mirror

+ You can check Docker [here](https://github.com/modelscope/modelscope/blob/master/docker/build_image.py#L345).
  ```
  # swift3.7.1
  modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.6.3-py311-torch2.7.1-vllm0.10.0-modelscope1.28.2-swift3.7.1

docs/source/Instruction/Megatron-SWIFT训练.md

Lines changed: 1 addition & 0 deletions
@@ -453,6 +453,7 @@ LoRA training:

  **DPO Parameters**:
  - ref_load: The path to load the ref_model. Defaults to None, i.e., it is set to `load`.
+ - ref_adapter_load: The path to load the ref_adapter weights; defaults to None. If you want to use the LoRA weights produced by SFT for DPO, use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. For resuming training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
  - beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model; a higher beta means less deviation from the reference model. For the IPO loss function (loss_type="ipo"), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
  - rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e., the SFT loss) in the loss, where `loss = dpo_loss + rpo_alpha * sft_loss`; the paper recommends setting it to `1.`. Default is `None`, i.e., the SFT loss is not included by default.
    - Note: In "ms-swift<3.8" the default was `1.`; in "ms-swift>=3.8" the default has been changed to `None`.

docs/source/Instruction/命令行参数.md

Lines changed: 1 addition & 2 deletions
@@ -433,8 +433,7 @@ RLHF arguments inherit from the [training arguments](#训练参数).

  - 🔥rlhf_type: Type of human alignment algorithm, supporting 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo' and 'gkd'. Default is 'dpo'.
  - ref_model: Required for full-parameter training with the dpo, kto, ppo or grpo algorithms. Default is None.
- - ref_adapters: Default is `[]`.
-   - Note: In "ms-swift>=3.8", you can set `--adapters sft_ckpt --ref_adapters sft_ckpt` during training to continue training from that LoRA, which is convenient for following LoRA SFT with DPO/KTO/GRPO. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
+ - ref_adapters: Default is `[]`. If you want to use the LoRA weights produced by SFT for DPO/KTO/GRPO, use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt` during training. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
  - ref_model_type: Same as model_type. Default is None.
  - ref_model_revision: Same as model_revision. Default is None.
  - 🔥beta: Coefficient of the KL regularization term. Default is `None`: the `simpo` algorithm defaults to `2.`, GRPO to `0.04`, GKD to `0.5`, and other algorithms to `0.1`. See the [documentation](./人类对齐.md) for details.

docs/source_en/GetStarted/SWIFT-installation.md

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ pip install ms-swift==2.*

  ## Mirror

+ You can check Docker [here](https://github.com/modelscope/modelscope/blob/master/docker/build_image.py#L345).
  ```
  # swift3.7.1
  modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.6.3-py311-torch2.7.1-vllm0.10.0-modelscope1.28.2-swift3.7.1
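
To try this image locally, a minimal sketch (assuming a GPU host with Docker and the NVIDIA container toolkit; only the image tag comes from the docs above, the run flags are illustrative):

```shell
# Pull the swift3.7.1 image listed above and start an interactive container.
docker pull modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.6.3-py311-torch2.7.1-vllm0.10.0-modelscope1.28.2-swift3.7.1
docker run --gpus all -it --rm \
  modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.6.3-py311-torch2.7.1-vllm0.10.0-modelscope1.28.2-swift3.7.1 \
  bash
```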

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 2 deletions
@@ -443,8 +443,7 @@ RLHF arguments inherit from the [training arguments](#training-arguments).

  - 🔥rlhf_type: Type of human alignment algorithm, supporting 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo' and 'gkd'. Default is 'dpo'.
  - ref_model: Required for full parameter training when using the dpo, kto, ppo or grpo algorithms. Default is None.
- - ref_adapters: Default is `[]`.
-   - Note: In "ms-swift>=3.8", you can set `--adapters sft_ckpt --ref_adapters sft_ckpt` during training to continue training from that LoRA, which is convenient for scenarios where DPO/KTO/GRPO follows LoRA SFT. For resuming training from a checkpoint in such scenarios, use `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
+ - ref_adapters: Default is `[]`. If you want to use the LoRA weights generated from SFT for DPO/KTO/GRPO, please use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt`. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
  - ref_model_type: Same as model_type. Default is None.
  - ref_model_revision: Same as model_revision. Default is None.
  - 🔥beta: Coefficient for the KL regularization term. Default is `None`, meaning the `simpo` algorithm defaults to `2.`, `grpo` defaults to `0.04`, `gkd` defaults to `0.5`, and other algorithms default to `0.1`. For more details, refer to the [documentation](./RLHF.md).
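
A minimal sketch of the ref_adapters flow described above (the model and dataset names are placeholders for illustration and not part of this commit; `sft_ckpt` stands for your own LoRA SFT checkpoint directory):

```shell
# Sketch: follow LoRA SFT with DPO, reusing the SFT LoRA both as the trainable
# adapter and as the frozen reference adapter (requires ms-swift>=3.8).
# "sft_ckpt", the model and the dataset below are examples only.
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
    --rlhf_type dpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --adapters sft_ckpt \
    --ref_adapters sft_ckpt \
    --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji#2000 \
    --num_train_epochs 1 \
    --output_dir output
```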

docs/source_en/Instruction/Megatron-SWIFT-Training.md

Lines changed: 1 addition & 0 deletions
@@ -475,6 +475,7 @@ LoRA Training:

  **DPO Parameters**
  - ref_load: The path to load the reference model. Defaults to `None`, which means it will be set to `load`.
+ - ref_adapter_load: The path to load the ref_adapter weights; default is `None`. If you want to use LoRA weights generated from SFT for DPO, please use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. For resuming training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
  - beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model; a higher beta value indicates less deviation from the reference model. For the IPO loss function (`loss_type="ipo"`), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
  - rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e., the SFT loss) in the loss function, where `loss = dpo_loss + rpo_alpha * sft_loss`. The paper recommends setting it to `1.`. The default value is `None`, meaning the SFT loss is not included by default.
    - Note: In "ms-swift<3.8", the default value was `1.`. Starting from "ms-swift>=3.8", the default has been changed to `None`.
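
A minimal sketch of how ref_adapter_load might be wired into a Megatron-SWIFT DPO run (the `megatron rlhf` entry point, the converted-weights path, dataset and parallelism settings are assumptions for illustration; only the adapter flags come from the docs above):

```shell
# Sketch: Megatron LoRA DPO on top of an SFT LoRA checkpoint (ms-swift>=3.8).
# "Qwen2.5-7B-mcore" and "sft_ckpt" are placeholders for your converted
# Megatron weights and your SFT LoRA output; parallelism sizes are examples.
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron rlhf \
    --rlhf_type dpo \
    --load Qwen2.5-7B-mcore \
    --train_type lora \
    --adapter_load sft_ckpt \
    --ref_adapter_load sft_ckpt \
    --finetune true \
    --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji#2000 \
    --tensor_model_parallel_size 2 \
    --micro_batch_size 1 \
    --global_batch_size 8
```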

examples/infer/demo_mllm.py

Lines changed: 3 additions & 2 deletions
@@ -104,9 +104,10 @@ def get_data(mm_type: Literal['text', 'image', 'video', 'audio']):
  infer_backend = 'pt'

  if infer_backend == 'pt':
-     model = 'Qwen/Qwen2-Audio-7B-Instruct'
+     # test env: transformers==4.55.2
+     model = 'Qwen/Qwen2.5-Omni-7B'
      mm_type = 'audio'
-     engine = PtEngine(model, max_batch_size=64)
+     engine = PtEngine(model, max_batch_size=64, attn_impl='flash_attention_2')
  elif infer_backend == 'vllm':
      # test env: vllm==0.8.5.post1, transformers==4.51.3
      # The meaning of environment variables can be found at:

examples/infer/vllm/mllm_ddp.sh

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
  NPROC_PER_NODE=2 \
  CUDA_VISIBLE_DEVICES=0,1 \
  swift infer \
-     --model Qwen/Qwen2-Audio-7B-Instruct \
+     --model Qwen/Qwen2.5-Omni-7B \
      --infer_backend vllm \
      --val_dataset speech_asr/speech_asr_aishell1_trainsets:validation#1000 \
      --vllm_gpu_memory_utilization 0.9 \

examples/train/megatron/lora/new_special_tokens.sh

Lines changed: 3 additions & 0 deletions
@@ -1,4 +1,7 @@
  # 2 * 60GiB, 2.7s/it
+ # Note: The conversion script has no differences.
+ # It will read the new_special_tokens parameter from args.json.
+
  PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
  NPROC_PER_NODE=2 \
  CUDA_VISIBLE_DEVICES=0,1 \
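
For reference, a hedged sketch of the conversion step the note refers to; the `swift export` flags shown (notably `--mcore_adapters` and `--to_hf`) are assumptions based on the Megatron-SWIFT documentation rather than part of this commit, and `new_special_tokens` is expected to be read from the checkpoint's args.json:

```shell
# Sketch: convert the trained Megatron LoRA checkpoint back to HF format.
# The checkpoint path is a placeholder; verify the exact flags against the
# Megatron-SWIFT docs for your ms-swift version.
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --mcore_adapters megatron_output/vx-xxx/checkpoint-xxx \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/vx-xxx/checkpoint-xxx-hf
```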

examples/train/moe/llama4.sh

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,5 @@
- # Manually select `target_modules` to avoid 'all-linear' selecting 'router'
+ # If you don't want to train the router, set:
+ # `--target_regex '^(language_model).*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$'`
  NPROC_PER_NODE=4 \
  USE_HF=1 \
  CUDA_VISIBLE_DEVICES=0,1,2,3 \

@@ -14,7 +15,6 @@ swift sft \
      --learning_rate 1e-4 \
      --lora_rank 8 \
      --lora_alpha 32 \
-     --target_regex '^(language_model).*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$' \
      --router_aux_loss_coef 1e-3 \
      --freeze_vit true \
      --gradient_accumulation_steps 4 \
