About Qwen2.5 VL finetuning #5280

@xiang-xiang-zhu

Description

After fine-tuning Qwen2.5-VL on my own dataset and running inference with the fine-tuned weights, most outputs were normal, but some were abnormal.
The fine-tuning data format I used was based on https://github.com/modelscope/ms-swift/blob/main/examples/notebook/qwen2_5-vl-grounding/zh.ipynb :

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Find the <ref-object> in the image"}, {"role": "assistant", "content": "<bbox>"}], "images": ["/xxx/x.jpg"], "objects": {"ref": ["sheep"], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}}

Here, the <ref-object> and <bbox> placeholders in the messages are filled in by ms-swift from the objects field, and I am using absolute coordinates.
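For reference, a minimal sketch of how one such JSONL line can be produced (the helper name is mine; the schema follows the notebook above, with absolute pixel coordinates):

import json

def make_grounding_sample(image_path, ref_labels, bboxes):
    # One ms-swift grounding sample; <ref-object> and <bbox> are
    # placeholders that ms-swift fills in from the objects field.
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Find the <ref-object> in the image"},
            {"role": "assistant", "content": "<bbox>"},
        ],
        "images": [image_path],
        "objects": {"ref": ref_labels, "bbox": bboxes},  # absolute [x1, y1, x2, y2]
    }

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    sample = make_grounding_sample(
        "/xxx/x.jpg",
        ["sheep"],
        [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]],
    )
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")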
Examples of abnormal outputs from the fine-tuned model include:

  1. "<|box_start|>(764,25),(1659,1)"
  2. "<|box_start|><|box_start|>"
  3. " <|box_start|>"
  4. "<|box_start|>(987,369),(10825,370),(1141,760)<|box_end|>"
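To make these failure modes concrete, here is a small sketch of a validity check for generations; the expected grammar (<|box_start|>(x1,y1),(x2,y2)<|box_end|> with exactly two in-range points) is an assumption on my part:

import re

# Assumed well-formed output: <|box_start|>(x1,y1),(x2,y2)<|box_end|>
BOX_RE = re.compile(r"^<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>$")

def is_valid_box(text, max_x, max_y):
    m = BOX_RE.match(text.strip())
    if m is None:
        return False  # unclosed, duplicated, or otherwise malformed tokens
    x1, y1, x2, y2 = map(int, m.groups())
    return 0 <= x1 <= x2 <= max_x and 0 <= y1 <= y2 <= max_y

outputs = [
    "<|box_start|>(764,25),(1659,1)",                            # no <|box_end|>
    "<|box_start|><|box_start|>",                                # repeated start token
    " <|box_start|>",                                            # bare start token
    "<|box_start|>(987,369),(10825,370),(1141,760)<|box_end|>",  # three points, 10825 out of range
]
for out in outputs:
    print(is_valid_box(out, max_x=1280, max_y=960))  # all False (image size is illustrative)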

The fine-tuning script I used is:

MAX_PIXELS=1003520 \
swift sft \
    --model_type qwen2_5_vl \
    --model ../../qwen_25_vl_7B_awq \
    --dataset train_data.jsonl \
    --val_dataset val_data.jsonl \
    --attn_impl flash_attn \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --eval_steps 300 \
    --save_steps 100 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4
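
One note on MAX_PIXELS=1003520: the Qwen2.5-VL processor downscales any image whose area exceeds this budget, so absolute coordinates in training targets refer to the resized image, not the original file. A simplified sketch of that scaling (the real processor additionally rounds the resized sides to multiples of the patch size; that rounding is omitted here):

import math

def rescale_bbox(bbox, width, height, max_pixels=1003520):
    # Scale an absolute [x1, y1, x2, y2] box the same way a
    # max-pixels resize scales the image itself.
    if width * height <= max_pixels:
        return bbox
    scale = math.sqrt(max_pixels / (width * height))
    return [round(c * scale, 1) for c in bbox]

# A 2000x1500 image (3.0 MP) is shrunk by a factor of ~0.578,
# so a box at [360.9, 480.8, 495.0, 532.8] moves accordingly:
print(rescale_bbox([360.9, 480.8, 495.0, 532.8], 2000, 1500))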
