
Dataset already downloaded from Hugging Face is downloaded again during fine-tuning #4510

@zhaoyangwei123

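I have already downloaded the dataset from Hugging Face, but fine-tuning always tries to download it again. Below is my command, which already specifies the dataset path: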
NPROC_PER_NODE=8 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model cache/hub/Qwen2.5_VL \
    --model_type qwen2_5_vl \
    --train_type full \
    --use_hf True \
    --dataset cache/hub/datasets--lmms-lab--LLaVA-Video-178K \
    --torch_dtype bfloat16 \
    --attn_impl flash_attn \
    --freeze_vit true \
    --freeze_llm true \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps -1 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --deepspeed zero2

The dataset still cannot be used without downloading it again. There are two cases:

  1. Passing the local directory path to --dataset fails with an error:
    [rank2]: datasets.data_files.EmptyDatasetError: The directory at /group/30105/weizhaoyang/cache/hub/datasets--lmms-lab--LLaVA-Video-178K doesn't contain any data files
    [rank0]:[W606 17:26:52.212998651 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
  2. Passing the dataset ID to --dataset triggers a fresh download.
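
A possible reason for the EmptyDatasetError in case 1: a Hugging Face hub cache folder such as datasets--lmms-lab--LLaVA-Video-178K does not hold data files at its top level. The files sit under snapshots/<revision>/, next to blobs/ and refs/. The sketch below only lists those snapshot directories with the standard library. The helper name find_snapshot_dirs is hypothetical, and whether pointing --dataset at a snapshot directory is enough for ms-swift has not been verified here.

```python
from pathlib import Path


def find_snapshot_dirs(repo_cache_dir):
    """List snapshot directories inside one hub cache repo folder.

    Assumes the usual hub cache layout:
    <repo_cache_dir>/snapshots/<revision>/...
    Returns an empty list if there is no snapshots/ subfolder.
    """
    snapshots = Path(repo_cache_dir) / "snapshots"
    if not snapshots.is_dir():
        return []
    # Each subdirectory is named after a revision (commit hash).
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Path taken from the command above; adjust to the real cache location.
    repo_dir = "cache/hub/datasets--lmms-lab--LLaVA-Video-178K"
    for snap in find_snapshot_dirs(repo_dir):
        print(snap)
```

If this prints a directory, that directory (rather than its datasets--... parent) is the one that actually contains the dataset files.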
