Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve generalization. However, simulator environments still offer limited diversity, and web-collected data often requires extensive labor to remove noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates unseen observation-instruction pairs by rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten, object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the differences between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation-cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation-data noise during training. Experiments on both discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method.
- [10/2025] 🎉🎉🎉 Our paper has been accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS 2025).
- [04/2025] Released our visual features and instructions for VLN training.
- [03/2025] arXiv paper and code released.
The environment installation of VLN-RAM follows that of VLN-DUET.
- Follow instructions here to install Matterport3D simulators.
- Installation requirements for VLN training:
cd VLN-RAM
conda create --name vlnram python=3.8.5
conda activate vlnram
pip install -r requirements.txt
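After installation, a minimal sanity check that the Matterport3D simulator bindings built correctly (a sketch assuming the MatterSim Python bindings are on your PYTHONPATH) is:

# Quick check that the MatterSim bindings are importable from the vlnram environment.
import MatterSim

sim = MatterSim.Simulator()  # construct a simulator instance; no episode is started here
print("MatterSim bindings loaded:", sim)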
(1) Follow this to get the Matterport3D images.
(2) Follow this to install the Tag2Text model and caption the panoramas with Tag2Text.
(3) You should fill in the missing paths in this code.
cd data_gen
python generate_caption_data.py
(1) Follow this to get the text2pano model.
(2) Follow this to get the discretization algorithm.
(3) You should fill in the missing paths in this code.
cd data_gen
python generate_panorama.py
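For reference, below is a minimal text-to-panorama sketch, assuming the text2pano model is MultiDiffusion (acknowledged at the end of this README) accessed through the diffusers StableDiffusionPanoramaPipeline; the backbone name, prompt, and output size are placeholders, and the generated equirectangular image would then be discretized into single views (e.g., with Equirec2Perspec) before use:

import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler

# Load a Stable Diffusion backbone with the MultiDiffusion panorama pipeline (placeholder checkpoint).
model_id = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(
    model_id, scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# Rewritten, object-enriched scene description produced in the previous step (placeholder text).
prompt = "a bright living room with a blue sofa, a wooden bookshelf, and a piano"
panorama = pipe(prompt, height=512, width=2048).images[0]
panorama.save("generated_panorama.jpg")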
(1) Get your own OpenAI API key.
(2) You should fill in the missing paths in this code.
cd data_gen
python instr_data.py
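For reference, here is a minimal sketch of observation-contrast instruction rewriting with the OpenAI Python SDK; the model name, prompt wording, and caption strings are placeholders, and instr_data.py may read the key differently from the OPENAI_API_KEY environment variable assumed here:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Placeholder captions: the original observation and its rewritten, object-enriched counterpart.
original_caption = "a hallway with a wooden door and a small table"
rewritten_caption = "a hallway with a glass door, a potted plant, and a coat rack"
instruction = "Walk past the small table and stop at the wooden door."

prompt = (
    "The original scene was: " + original_caption + "\n"
    "The new scene is: " + rewritten_caption + "\n"
    "Original instruction: " + instruction + "\n"
    "Rewrite the instruction so that it matches the new scene."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)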
- Follow this to install our baseline method VLN-DUET.
- Extract the CLIP ViT-B/16 or CLIP ViT-L/14 features following this, or utilize our pre-extracted visual features from Google Drive (a minimal extraction sketch is given after the training scripts below).
- Pretrain and then finetune based on the scripts below.
cd VLN-DUET

# Stage 1: pretraining
cd pretrain_src
bash run_r2r.sh
bash run_reverie.sh

# Stage 2: fine-tuning
cd ../map_nav_src
bash scripts/run_r2r.sh
bash scripts/run_reverie.sh
bash scripts/run_r4r.sh
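For reference, a minimal single-image feature-extraction sketch using the official openai/CLIP package (an assumption; any compatible CLIP implementation works) is given below; the image path is a placeholder, and the repository's extraction script may instead batch over all panorama views and save features to disk:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # or "ViT-L/14"

# Encode one view image (placeholder path); real extraction loops over all views of each panorama.
image = preprocess(Image.open("example_view.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    feat = model.encode_image(image)           # [1, 512] for ViT-B/16
feat = feat / feat.norm(dim=-1, keepdim=True)  # L2-normalize, as is common for CLIP features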
If you find this work useful, please consider citing:
@article{wei2025unseen,
title={Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation},
author={Wei, Ziming and Lin, Bingqian and Nie, Yunshuang and Chen, Jiaqi and Ma, Shikui and Xu, Hang and Liang, Xiaodan},
journal={arXiv preprint arXiv:2503.18065},
year={2025}
}

Some of the code is built upon VLN-DUET, Equirec2Perspec, Tag2Text, and MultiDiffusion. Thanks to them for their great work!
