
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

IEEE Transactions on Neural Networks and Learning Systems (TNNLS 2025)

Affiliations: Shenzhen Campus of Sun Yat-Sen University; Shanghai Jiao Tong University; The University of Hong Kong; Hunan Artificial Intelligence and Robotics Institute Co., Ltd.; Huawei Noah's Ark Lab; Peng Cheng Laboratory

Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve generalization. However, simulator environments still offer limited diversity, and web-collected data often requires extensive labor to remove noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates unseen observation-instruction pairs by rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the differences between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation noise during training. Experiments in both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method.

[Motivation figure]

🆕 Updates

  • [10/2025] 🎉🎉🎉 Our paper has been accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS 2025).
  • [04/2025] Our visual features and instructions for VLN training have been released.
  • [03/2025] arXiv paper and code released.

Contents

  • Installation
  • Get VLN-RAM Data
  • VLN Training
  • Citation
  • Acknowledgement

Installation

The environment installation of VLN-RAM follows that of VLN-DUET.

  1. Follow the instructions here to install the Matterport3D simulator.
  2. Install the requirements for VLN training:
cd VLN-RAM
conda create --name vlnram python=3.8.5
conda activate vlnram
pip install -r requirements.txt
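
As a quick sanity check, the minimal sketch below verifies the environment before moving on; it assumes the Matterport3D simulator was built with its Python bindings (MatterSim) on your PYTHONPATH and that PyTorch is among the pip requirements.

# sanity_check.py: confirm the simulator bindings and PyTorch import in the vlnram env
import torch
import MatterSim  # provided by the Matterport3D simulator build (assumed on PYTHONPATH)

sim = MatterSim.Simulator()  # construct a simulator object without loading any scans
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MatterSim simulator created:", sim is not None)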

Get VLN-RAM Data

1. Generate Caption Data

(1) Follow this to get the Matterport3D images.

(2) Follow this to install the Tag2Text model and caption the panoramas with Tag2Text.

(3) Fill in the missing paths in this code, then run:

cd data_gen
python generate_caption_data.py
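
For reference, the captioning step roughly follows the loop sketched below. This is a minimal sketch, not the repository's exact script: caption_view is a hypothetical wrapper around the Tag2Text inference call, and all paths are placeholders.

# Rough sketch of the caption-generation loop over discretized panorama views.
import json
from pathlib import Path
from PIL import Image

MP3D_VIEW_DIR = Path("/path/to/matterport3d/views")  # placeholder path
OUTPUT_JSON = Path("caption_data.json")              # placeholder output

def caption_view(image: Image.Image) -> dict:
    """Hypothetical wrapper: run Tag2Text on one view and return its tags and caption."""
    raise NotImplementedError("plug in the Tag2Text inference call from its repository")

captions = {}
for view_path in sorted(MP3D_VIEW_DIR.glob("*/*.jpg")):  # one image per discretized view
    image = Image.open(view_path).convert("RGB")
    captions[str(view_path.relative_to(MP3D_VIEW_DIR))] = caption_view(image)

OUTPUT_JSON.write_text(json.dumps(captions, indent=2))
print(f"Wrote captions for {len(captions)} views to {OUTPUT_JSON}")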

2. Generate Panorama

(1) Follow this to get the text2pano model.

(2) Follow this to get the discretization algorithm.

(3) Fill in the missing paths in this code, then run:

cd data_gen
python generate_panorama.py
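
Once a panorama is generated, discretizing it into perspective views follows the pattern below. This is a minimal sketch using the Equirec2Perspec utility credited in the acknowledgements, with placeholder paths and assumed view parameters (36 views from 12 headings and 3 elevations); the repository's actual settings may differ.

# Minimal sketch of panorama discretization with Equirec2Perspec.
import cv2
import Equirec2Perspec as E2P

PANO_PATH = "generated_panorama.jpg"  # placeholder: output of the text2pano model
FOV, WIDTH, HEIGHT = 90, 640, 480     # assumed field of view and view resolution

equ = E2P.Equirectangular(PANO_PATH)
for elev_idx, phi in enumerate((-30, 0, 30)):   # three elevation angles in degrees
    for head_idx in range(12):                  # twelve headings, 30-degree steps
        theta = head_idx * 30
        view = equ.GetPerspective(FOV, theta, phi, HEIGHT, WIDTH)
        cv2.imwrite(f"view_{elev_idx * 12 + head_idx:02d}.jpg", view)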

3. Generate Instructions

(1) Get your own OpenAI API key.

(2) Fill in the missing paths in this code, then run:

cd data_gen
python instr_data.py
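
The instruction-rewriting call has roughly the shape sketched below. The prompt wording, model name, and data handling here are illustrative assumptions; instr_data.py defines the actual prompts, paths, and model.

# Sketch of observation-contrast instruction rewriting via the OpenAI chat API.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")  # placeholder: use your own key

def rewrite_instruction(original_instr: str, original_desc: str, new_desc: str) -> str:
    """Ask the LLM to rewrite the instruction by contrasting old and new observations."""
    prompt = (
        "Original navigation instruction:\n" + original_instr + "\n\n"
        "Original scene description:\n" + original_desc + "\n\n"
        "New (rewritten) scene description:\n" + new_desc + "\n\n"
        "Rewrite the instruction so that it matches the new scene, keeping the "
        "overall route structure but updating the mentioned objects and rooms."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; replace with the one you use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()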

VLN Training

  1. Follow this to install our baseline method VLN-DUET.

  2. Extract CLIP ViT-B/16 or CLIP ViT-L/14 features following this, or use our pre-extracted visual features from Google Drive (a rough sketch of the extraction step is given after the scripts below).

  3. Pretrain and then fine-tune using the scripts below.

cd VLN-DUET
cd pretrain_src
bash run_r2r.sh
bash run_reverie.sh

cd map_nav_src
bash scripts/run_r2r.sh
bash scripts/run_reverie.sh
bash scripts/run_r4r.sh
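
If you extract visual features yourself, the per-view step looks roughly like the sketch below. It assumes the openai/CLIP package; the actual extraction script, preprocessing, and storage format (e.g., HDF5/LMDB) may differ, and the pre-extracted features above can be used instead.

# Sketch of per-view CLIP feature extraction with the openai/CLIP package.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # or "ViT-L/14"

@torch.no_grad()
def extract_view_feature(image_path: str) -> torch.Tensor:
    """Return one CLIP image embedding for a single discretized view."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    return model.encode_image(image).squeeze(0).cpu()

feat = extract_view_feature("view_00.jpg")  # placeholder path from the panorama step
print(feat.shape)  # 512-dim for ViT-B/16, 768-dim for ViT-L/14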

Citation

If you find this work useful, please consider citing:

@article{wei2025unseen,
  title={Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation},
  author={Wei, Ziming and Lin, Bingqian and Nie, Yunshuang and Chen, Jiaqi and Ma, Shikui and Xu, Hang and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2503.18065},
  year={2025}
}

Acknowledgement

Some of the code is built upon VLN-DUET, Equirec2Perspec, Tag2Text and MultiDiffusion. We thank the authors for their great work!
