Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve generalization. However, simulator environments still offer limited diversity, and web-collected data often requires extensive labor to remove noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates unseen observation-instruction pairs by rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten, object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the differences between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation-cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation-data noise during training. Experiments on both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method.
- [10/2025] 🎉🎉🎉 Our paper has been accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS 2025).
- [04/2025] Released our visual features and instructions for VLN training.
- [03/2025] arXiv paper and code released.
The environment installation of VLN-RAM follows that of VLN-DUET.
- Follow instructions here to install Matterport3D simulators.
- Installation requirements for VLN training:
cd VLN-RAM
conda create --name vlnram python=3.8.5
conda activate vlnram
pip install -r requirements.txt
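After installation, a minimal sanity check that the Matterport3D simulator bindings built correctly (a sketch assuming the MatterSim Python bindings are on your PYTHONPATH) is:

# Quick check that the MatterSim bindings are importable from the vlnram environment.
import MatterSim

sim = MatterSim.Simulator()  # construct a simulator instance; no episode is started here
print("MatterSim bindings loaded:", sim)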
(1) Follow this to get the Matterport3D images.
(2) Follow this to install the Tag2Text model and caption the panoramas with Tag2Text.
(3) You should fill in the missing paths in this code.
cd data_gen
python generate_caption_data.py
(1) Follow this to get the text2pano model.
(2) Follow this to get the discretization algorithm.
(3) You should fill in the missing paths in this code.
cd data_gen
python generate_panorama.py
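For reference, below is a minimal text-to-panorama sketch, assuming the text2pano model is MultiDiffusion (acknowledged at the end of this README) accessed through the diffusers StableDiffusionPanoramaPipeline; the backbone name, prompt, and output size are placeholders, and the generated equirectangular image would then be discretized into single views (e.g., with Equirec2Perspec) before use:

import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

# Load a Stable Diffusion backbone with the MultiDiffusion panorama pipeline (placeholder checkpoint).
model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# Rewritten, object-enriched scene description produced in the previous step (placeholder text).
prompt = "a bright living room with a blue sofa, a wooden bookshelf, and a piano"
panorama = pipe(prompt, height=512, width=2048).images[0]
panorama.save("generated_panorama.jpg")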
(1) Get your own OpenAI API key.
(2) You should fill in the missing paths in this code.
cd data_gen
python instr_data.py
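For reference, here is a minimal sketch of observation-contrast instruction rewriting with the OpenAI Python SDK; the model name, prompt wording, and caption strings are placeholders, and instr_data.py may read the key differently from the OPENAI_API_KEY environment variable assumed here:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Placeholder captions: the original observation and its rewritten, object-enriched counterpart.
original_caption = "a hallway with a wooden door and a small table"
rewritten_caption = "a hallway with a glass door, a potted plant, and a coat rack"
instruction = "Walk past the small table and stop at the wooden door."

prompt = (
    "The original scene was: " + original_caption + "\n"
    "The new scene is: " + rewritten_caption + "\n"
    "Original instruction: " + instruction + "\n"
    "Rewrite the instruction so that it matches the new scene."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)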
- Follow this to install our baseline method VLN-DUET.
- Extract the CLIP ViT-B/16 or CLIP ViT-L/14 features following this, or utilize our pre-extracted visual features from Google Drive (a minimal extraction sketch is given after the training scripts below).
- Pretrain and then finetune based on the scripts below.
cd VLN-DUET

# Stage 1: pretraining
cd pretrain_src
bash run_r2r.sh
bash run_reverie.sh

# Stage 2: fine-tuning
cd ../map_nav_src
bash scripts/run_r2r.sh
bash scripts/run_reverie.sh
bash scripts/run_r4r.sh
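For reference, a minimal single-image feature-extraction sketch using the official openai/CLIP package (an assumption; any compatible CLIP implementation works) is given below; the image path is a placeholder, and the repository's extraction script may instead batch over all panorama views and save features to disk:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # or "ViT-L/14"

# Encode one view image (placeholder path); real extraction loops over all views of each panorama.
image = preprocess(Image.open("example_view.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    feat = model.encode_image(image)           # [1, 512] for ViT-B/16
feat = feat / feat.norm(dim=-1, keepdim=True)  # L2-normalize, as is common for CLIP features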
If you find this work useful, please consider citing:
@article{wei2025unseen,
title={Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation},
author={Wei, Ziming and Lin, Bingqian and Nie, Yunshuang and Chen, Jiaqi and Ma, Shikui and Xu, Hang and Liang, Xiaodan},
journal={arXiv preprint arXiv:2503.18065},
year={2025}
}

Some of the code is built upon VLN-DUET, Equirec2Perspec, Tag2Text, and MultiDiffusion. Thanks to them for their great work!
