Navigation with Large Language Models: Discussion and References

18 Apr 2024

This is paper is available on arxiv under CC 4.0 DEED license.

Authors:

(1) Dhruv Shah, UC Berkeley and he contributed equally;

(2) Michael Equi, UC Berkeley and he contributed equally;

(3) Blazej Osinski, University of Warsaw;

(4) Fei Xia, Google DeepMind;

(5) Brian Ichter, Google DeepMind;

(6) Sergey Levine, UC Berkeley and Google DeepMind.

Table of Links

7 Discussion

We presented LFG, a method for utilizing language models for semantic guesswork to help navigate to goals in new and unfamiliar environments. The central idea in our work is that, while language models can bring to bear rich semantic understanding, their ungrounded inferences about how to perform navigational tasks are better used as suggestions and heuristics rather than plans. We formulate a way to derive a heuristic score from language models that we can then incorporate into a planning algorithm, and use this heuristic planner to reach goals in new environments more effectively. This way of using language models benefits from their inferences when they are correct, and reverts to a more conventional unguided search when they are not.

Limitations and future work: While our experiments provide a validation of our key hypothesis, they have a number of limitations. First, we only test in indoor environments in both sim and real yet the role of semantics in navigation likely differs drastically across domains – e.g., navigating a forest might implicate semantics very differently than navigating an apartment building. Exploring the applicability of semantics derived from language models in other settings would be another promising and exciting direction for future work. Second, we acknowledge that multiple requests to cloud-hosted LLMs with CoT is slow and requires an internet connection, severely limiting the extent of real-world deployment of the proposed method. We hope that ongoing advancements in quantizing LLMs for edge deployment and fast inference will address this limitation.

Acknowledgments

This research was partly supported by AFOSR FA9550-22-1-0273 and DARPA ANSR. The authors would like to thank Bangguo Yu, Vishnu Sashank Dorbala, Mukul Khanna, Theophile Gervet, and Chris Paxton, for their help in reproducing baselines. The authors would also like to thank Ajay Sridhar, for supporting real-world experiments, and Devendra Singh Chaplot, Jie Tan, Peng Xu, and Tingnan Zhang, for useful discussions in various stages of the project.

References

[1] G. N. DeSouza and A. C. Kak. Vision for mobile robot navigation: A survey. IEEE transactions on pattern analysis and machine intelligence, 24(2):237–267, 2002. 2

[2] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sunderhauf, I. Reid, S. Gould, and ¨ A. van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2018. 2

[3] N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.

[4] N. Hirose, F. Xia, R. Mart´ın-Mart´ın, A. Sadeghian, and S. Savarese. Deep visual MPC-policy learning for navigation. IEEE Robotics and Automation Letters, 2019.

[5] D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine. GNM: A General Navigation Model to Drive Any Robot. In arXiV, 2022.

[6] D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine. ViNT: A Foundation Model for Visual Navigation. In 7th Annual Conference on Robot Learning (CoRL), 2023. 8

[7] S. K. Ramakrishnan, D. S. Chaplot, Z. Al-Halah, J. Malik, and K. Grauman. Poni: Potential functions for objectgoal navigation with interaction-free learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18890–18900, 2022. 2

[8] X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022. 2

[9] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021. 2

[10] A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra. Improving visionand language navigation with image-text pairs from the web. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, pages 259–274. Springer, 2020. 2

[11] A. Majumdar, G. Aggarwal, B. Devnani, J. Hoffman, and D. Batra. Zson: Zero-shot objectgoal navigation using multimodal goal embeddings. arXiv preprint arXiv:2206.12403, 2022.

[12] B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler. Open vocabulary queryable scene representations for real world planning. arXiv preprint arXiv:2209.09874, 2022.

[13] C. Huang, O. Mees, A. Zeng, and W. Burgard. Visual language maps for robot navigation. arXiv preprint arXiv:2210.05714, 2022.

[14] D. Shah, B. Osinski, B. Ichter, and S. Levine. LM-nav: Robotic navigation with large pretrained models of language, vision, and action. In Annual Conference on Robot Learning (CoRL), 2022. 2, 3

[15] M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), 2021. 3

[16] N. M. M. Shafiullah, C. Paxton, L. Pinto, S. Chintala, and A. Szlam. Clip-fields: Weakly supervised semantic fields for robotic memory, 2023.

[17] K. Jatavallabhula, A. Kuwajerwala, Q. Gu, M. Omama, T. Chen, S. Li, G. Iyer, S. Saryazdi, N. Keetha, A. Tewari, J. Tenenbaum, C. de Melo, M. Krishna, L. Paull, F. Shkurti, and A. Torralba. Conceptfusion: Open-set multimodal 3d mapping. arXiv, 2023. 3

[18] V. S. Dorbala, J. F. J. Mullen, and D. Manocha. Can an Embodied Agent Find Your ”Catshaped Mug”? LLM-Based Zero-Shot Object Navigation, 2023. 3, 7, 8, 9

[19] O. Mees, J. Borja-Diaz, and W. Burgard. Grounding language with visual affordances over unstructured data, 2023.

[20] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg. Progprompt: Generating situated robot task plans using large language models, 2022.

[21] Y. Xie, C. Yu, T. Zhu, J. Bai, Z. Gong, and H. Soh. Translating natural language to planning goals with large-language models. arXiv preprint arXiv:2302.05128, 2023.

[22] Y. Ding, X. Zhang, C. Paxton, and S. Zhang. Task and motion planning with large language models for object rearrangement, 2023.

[23] K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg. Text2motion: From natural language instructions to feasible plans, 2023. 3, 4

[24] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning (ICML), 2022. 3, 4

[25] B. Ichter, A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y. Lu, C. Parada, K. Rao, P. Sermanet, A. T. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Luu, K.-H. Lee, Y. Kuang, S. Jesmonth, K. Jeffrey, R. J. Ruano, J. Hsu, K. Gopalakrishnan, B. David, A. Zeng, and C. K. Fu. Do as i can, not as i say: Grounding language in robotic affordances. In Annual Conference on Robot Learning (CoRL), 2022. 3, 4

[26] K. Valmeekam, A. Olmo, S. Sreedharan, and S. Kambhampati. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change), 2023. 3

[27] W. Huang, F. Xia, D. Shah, D. Driess, A. Zeng, Y. Lu, P. Florence, I. Mordatch, S. Levine, K. Hausman, and B. Ichter. Grounded decoding: Guiding text generation with grounded models for robot control, 2023. 3

[28] Y. Jiang, A. Gupta, Z. Zhang, G. Wang, Y. Dou, Y. Chen, L. Fei-Fei, A. Anandkumar, Y. Zhu, and L. Fan. Vima: General robot manipulation with multimodal prompts. arXiv preprint, 2022. 3

[29] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary, S. Som, X. Song, and F. Wei. Language is not all you need: Aligning perception with language models, 2023.

[30] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model. In arXiv preprint, 2023. 3

[31] B. Yamauchi. A frontier-based approach for autonomous exploration. In IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), 1997. 3

[32] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. In Neural Information Processing Systems (NeurIPS), 2022. 4

[33] T. Gervet, S. Chintala, D. Batra, J. Malik, and D. S. Chaplot. Navigating to objects in the real world. Science Robotics, 2023. 5, 7, 8, 9

[34] D. Shah and S. Levine. Viking: Vision-based kilometer-scale navigation with geographic hints. In Robotics: Science and Systems (RSS), 2022. 5, 8, 14

[35] X. Zhou, R. Girdhar, A. Joulin, P. Krahenb ¨ uhl, and I. Misra. Detecting twenty-thousand classes¨ using image-level supervision. In 17th European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20077-9 21. 6, 8

[36] K. Yadav, S. K. Ramakrishnan, J. Turner, A. Gokaslan, O. Maksymets, R. Jain, R. Ramrakhya, A. X. Chang, A. Clegg, M. Savva, E. Undersander, D. S. Chaplot, and D. Batra. Habitat challenge 2022. https://aihabitat.org/challenge/2022/, 2022. 7

[37] E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra. DDPPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames. In International Conference on Learning Representations (ICLR), 2020. 7, 9

[38] D. S. Chaplot, H. Jiang, S. Gupta, and A. Gupta. Semantic curiosity for active visual learning. In ECCV, 2020. 7, 8, 9, 14

[39] B. Yu, H. Kasaei, and M. Cao. L3mvn: Leveraging large language models for visual target navigation, 2023. 7, 9

[40] D. Shah, B. Eysenbach, N. Rhinehart, and S. Levine. Rapid exploration for open-world navigation with latent goal models. In 5th Annual Conference on Robot Learning, 2021. 8

[41] A. Sridhar, D. Shah, C. Glossop, and S. Levine. NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration. In arXiv, 2023. 8, 14

[42] K. Yadav, A. Majumdar, R. Ramrakhya, N. Yokoyama, A. Baevski, Z. Kira, O. Maksymets, and D. Batra. Ovrl-v2: A simple state-of-art baseline for imagenav and objectnav, 2023. 9