Commit e2f4d85

[megatron] Support dpo adapters (#5451)
1 parent c4c2705 commit e2f4d85

23 files changed: +146 −80 lines changed

docs/source/GetStarted/SWIFT安装.md

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ pip install ms-swift==2.*

  ## Mirror

+ You can check Docker [here](https://github.com/modelscope/modelscope/blob/master/docker/build_image.py#L345).
  ```
  # swift3.7.1
  modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.6.3-py311-torch2.7.1-vllm0.10.0-modelscope1.28.2-swift3.7.1

docs/source/Instruction/Megatron-SWIFT训练.md

Lines changed: 1 addition & 0 deletions
@@ -453,6 +453,7 @@ LoRA training:

  **DPO Parameters**:
  - ref_load: The path to load the ref_model. Defaults to None, i.e., it is set to `load`.
+ - ref_adapter_load: The path to load the ref_adapter weights; defaults to None. If you want to use the LoRA weights produced by SFT for DPO, use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. For resuming training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
  - beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model; a higher beta means less deviation from the reference model. For the IPO loss function (loss_type="ipo"), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
  - rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e., the SFT loss) in the loss, where `loss = dpo_loss + rpo_alpha * sft_loss`; the paper recommends setting it to `1.`. Default is `None`, i.e., the SFT loss is not included by default.
    - Note: In "ms-swift<3.8" the default was `1.`; in "ms-swift>=3.8" the default has been changed to `None`.

docs/source/Instruction/命令行参数.md

Lines changed: 1 addition & 2 deletions
@@ -433,8 +433,7 @@ RLHF arguments inherit from the [training arguments](#训练参数).

  - 🔥rlhf_type: Type of human alignment algorithm, supporting 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo' and 'gkd'. Default is 'dpo'.
  - ref_model: Required for full-parameter training with the dpo, kto, ppo or grpo algorithms. Default is None.
- - ref_adapters: Default is `[]`.
-   - Note: In "ms-swift>=3.8", you can set `--adapters sft_ckpt --ref_adapters sft_ckpt` during training to continue training from that LoRA, which is convenient for following LoRA SFT with DPO/KTO/GRPO. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
+ - ref_adapters: Default is `[]`. If you want to use the LoRA weights produced by SFT for DPO/KTO/GRPO, use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt` during training. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
  - ref_model_type: Same as model_type. Default is None.
  - ref_model_revision: Same as model_revision. Default is None.
  - 🔥beta: Coefficient of the KL regularization term. Default is `None`: the `simpo` algorithm defaults to `2.`, GRPO to `0.04`, GKD to `0.5`, and other algorithms to `0.1`. See the [documentation](./人类对齐.md) for details.

docs/source_en/GetStarted/SWIFT-installation.md

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ pip install ms-swift==2.*

  ## Mirror

+ You can check Docker [here](https://github.com/modelscope/modelscope/blob/master/docker/build_image.py#L345).
  ```
  # swift3.7.1
  modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.6.3-py311-torch2.7.1-vllm0.10.0-modelscope1.28.2-swift3.7.1
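
To try this image locally, a minimal sketch (assuming a GPU host with Docker and the NVIDIA container toolkit; only the image tag comes from the docs above, the run flags are illustrative):

```shell
# Pull the swift3.7.1 image listed above and start an interactive container.
docker pull modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.6.3-py311-torch2.7.1-vllm0.10.0-modelscope1.28.2-swift3.7.1
docker run --gpus all -it --rm \
  modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.6.3-py311-torch2.7.1-vllm0.10.0-modelscope1.28.2-swift3.7.1 \
  bash
```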

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 2 deletions
@@ -443,8 +443,7 @@ RLHF arguments inherit from the [training arguments](#training-arguments).

  - 🔥rlhf_type: Type of human alignment algorithm, supporting 'dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo', 'grpo' and 'gkd'. Default is 'dpo'.
  - ref_model: Required for full parameter training when using the dpo, kto, ppo or grpo algorithms. Default is None.
- - ref_adapters: Default is `[]`.
-   - Note: In "ms-swift>=3.8", you can set `--adapters sft_ckpt --ref_adapters sft_ckpt` during training to continue training from that LoRA, which is convenient for scenarios where DPO/KTO/GRPO follows LoRA SFT. For resuming training from a checkpoint in such scenarios, use `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
+ - ref_adapters: Default is `[]`. If you want to use the LoRA weights generated from SFT for DPO/KTO/GRPO, please use "ms-swift>=3.8" and set `--adapters sft_ckpt --ref_adapters sft_ckpt`. For resuming training from a checkpoint in this scenario, set `--resume_from_checkpoint rlhf_ckpt --ref_adapters sft_ckpt`.
  - ref_model_type: Same as model_type. Default is None.
  - ref_model_revision: Same as model_revision. Default is None.
  - 🔥beta: Coefficient for the KL regularization term. Default is `None`, meaning the `simpo` algorithm defaults to `2.`, `grpo` defaults to `0.04`, `gkd` defaults to `0.5`, and other algorithms default to `0.1`. For more details, refer to the [documentation](./RLHF.md).
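
A minimal sketch of the ref_adapters flow described above (the model and dataset names are placeholders for illustration and not part of this commit; `sft_ckpt` stands for your own LoRA SFT checkpoint directory):

```shell
# Sketch: follow LoRA SFT with DPO, reusing the SFT LoRA both as the trainable
# adapter and as the frozen reference adapter (requires ms-swift>=3.8).
# "sft_ckpt", the model and the dataset below are examples only.
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
    --rlhf_type dpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --adapters sft_ckpt \
    --ref_adapters sft_ckpt \
    --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji#2000 \
    --num_train_epochs 1 \
    --output_dir output
```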

docs/source_en/Instruction/Megatron-SWIFT-Training.md

Lines changed: 1 addition & 0 deletions
@@ -475,6 +475,7 @@ LoRA Training:

  **DPO Parameters**
  - ref_load: The path to load the reference model. Defaults to `None`, which means it will be set to `load`.
+ - ref_adapter_load: The path to load the ref_adapter weights; default is `None`. If you want to use LoRA weights generated from SFT for DPO, please use "ms-swift>=3.8" and set `--adapter_load sft_ckpt --ref_adapter_load sft_ckpt --finetune true` during training. For resuming training from a checkpoint in this scenario, set `--adapter_load rlhf_ckpt --ref_adapter_load sft_ckpt --finetune false`.
  - beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model; a higher beta value indicates less deviation from the reference model. For the IPO loss function (`loss_type="ipo"`), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
  - rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e., the SFT loss) in the loss function, where `loss = dpo_loss + rpo_alpha * sft_loss`. The paper recommends setting it to `1.`. The default value is `None`, meaning the SFT loss is not included by default.
    - Note: In "ms-swift<3.8", the default value was `1.`. Starting from "ms-swift>=3.8", the default has been changed to `None`.
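
A minimal sketch of how ref_adapter_load might be wired into a Megatron-SWIFT DPO run (the `megatron rlhf` entry point, the converted-weights path, dataset and parallelism settings are assumptions for illustration; only the adapter flags come from the docs above):

```shell
# Sketch: Megatron LoRA DPO on top of an SFT LoRA checkpoint (ms-swift>=3.8).
# "Qwen2.5-7B-mcore" and "sft_ckpt" are placeholders for your converted
# Megatron weights and your SFT LoRA output; parallelism sizes are examples.
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron rlhf \
    --rlhf_type dpo \
    --load Qwen2.5-7B-mcore \
    --train_type lora \
    --adapter_load sft_ckpt \
    --ref_adapter_load sft_ckpt \
    --finetune true \
    --dataset hjh0119/shareAI-Llama3-DPO-zh-en-emoji#2000 \
    --tensor_model_parallel_size 2 \
    --micro_batch_size 1 \
    --global_batch_size 8
```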

examples/infer/demo_mllm.py

Lines changed: 3 additions & 2 deletions
@@ -104,9 +104,10 @@ def get_data(mm_type: Literal['text', 'image', 'video', 'audio']):
  infer_backend = 'pt'

  if infer_backend == 'pt':
-     model = 'Qwen/Qwen2-Audio-7B-Instruct'
+     # test env: transformers==4.55.2
+     model = 'Qwen/Qwen2.5-Omni-7B'
      mm_type = 'audio'
-     engine = PtEngine(model, max_batch_size=64)
+     engine = PtEngine(model, max_batch_size=64, attn_impl='flash_attention_2')
  elif infer_backend == 'vllm':
      # test env: vllm==0.8.5.post1, transformers==4.51.3
      # The meaning of environment variables can be found at:

examples/infer/vllm/mllm_ddp.sh

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
  NPROC_PER_NODE=2 \
  CUDA_VISIBLE_DEVICES=0,1 \
  swift infer \
-     --model Qwen/Qwen2-Audio-7B-Instruct \
+     --model Qwen/Qwen2.5-Omni-7B \
      --infer_backend vllm \
      --val_dataset speech_asr/speech_asr_aishell1_trainsets:validation#1000 \
      --vllm_gpu_memory_utilization 0.9 \

examples/train/megatron/lora/new_special_tokens.sh

Lines changed: 3 additions & 0 deletions
@@ -1,4 +1,7 @@
  # 2 * 60GiB, 2.7s/it
+ # Note: The conversion script has no differences.
+ # It will read the new_special_tokens parameter from args.json.
+
  PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
  NPROC_PER_NODE=2 \
  CUDA_VISIBLE_DEVICES=0,1 \
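
For reference, a hedged sketch of the conversion step the note refers to; the `swift export` flags shown (notably `--mcore_adapters` and `--to_hf`) are assumptions based on the Megatron-SWIFT documentation rather than part of this commit, and `new_special_tokens` is expected to be read from the checkpoint's args.json:

```shell
# Sketch: convert the trained Megatron LoRA checkpoint back to HF format.
# The checkpoint path is a placeholder; verify the exact flags against the
# Megatron-SWIFT docs for your ms-swift version.
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --mcore_adapters megatron_output/vx-xxx/checkpoint-xxx \
    --to_hf true \
    --torch_dtype bfloat16 \
    --output_dir megatron_output/vx-xxx/checkpoint-xxx-hf
```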

examples/train/moe/llama4.sh

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,5 @@
- # Manually select `target_modules` to avoid 'all-linear' selecting 'router'
+ # If you don't want to train the router, set:
+ # `--target_regex '^(language_model).*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$'`
  NPROC_PER_NODE=4 \
  USE_HF=1 \
  CUDA_VISIBLE_DEVICES=0,1,2,3 \

@@ -14,7 +15,6 @@ swift sft \
      --learning_rate 1e-4 \
      --lora_rank 8 \
      --lora_alpha 32 \
-     --target_regex '^(language_model).*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$' \
      --router_aux_loss_coef 1e-3 \
      --freeze_vit true \
      --gradient_accumulation_steps 4 \
