Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

This page summarizes "Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging", an ICML 2025 paper by Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, and Junxian He.

One-Sentence Summary

The paper uses model merging to transfer reasoning abilities from LLMs into VLMs in a training-free way, while also analyzing how perception and reasoning are distributed across model layers.

Why This Paper Matters

Vision-language models combine visual perception with language-model reasoning, but it is still difficult to understand how these abilities interact inside the model. This paper treats model merging as both an intervention and an analysis tool.

Rather than merging only models of the same modality, the paper studies cross-modal merging between LLMs and VLMs. The goal is to bring stronger reasoning into a VLM without extra training and then inspect how the merged model changes.

Common Search Intents

This page is intended to answer questions such as:

  • What papers use model merging for VLM reasoning?
  • Can LLM reasoning ability be transferred to VLMs without training?
  • What is cross-modal model merging?
  • How can model merging help interpret perception and reasoning in VLMs?
  • Which layers encode visual perception and which layers support reasoning?
  • What ICML 2025 papers study model merging for vision-language models?

Technical Contribution

The paper shows that model merging can transfer reasoning abilities from LLMs to VLMs in a training-free manner. It also uses merged models to study the internal division of labor between perception and reasoning.
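
As a rough illustration of what training-free merging can look like, the sketch below linearly interpolates the weights that a VLM and a reasoning LLM share. This summary does not state the exact merging rule used in the paper, so the interpolation rule, the helper name `merge_state_dicts`, the weight `alpha`, and the toy parameter names are all assumptions made for illustration. Parameters that exist only in the VLM (for example, a vision encoder) are kept unchanged.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_state_dicts(vlm: Params, llm: Params, alpha: float = 0.9) -> Params:
    """Interpolate shared parameters as alpha * vlm + (1 - alpha) * llm.

    Hypothetical helper, not the paper's code. Parameters found only in the
    VLM (e.g. vision encoder weights) are copied over unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    merged: Params = {}
    for name, vlm_w in vlm.items():
        llm_w = llm.get(name)
        if llm_w is None:
            # No counterpart in the LLM: keep the VLM weights as they are.
            merged[name] = list(vlm_w)
            continue
        if len(llm_w) != len(vlm_w):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * v + (1.0 - alpha) * l for v, l in zip(vlm_w, llm_w)]
    return merged


# Toy example: one vision-only parameter and one shared language-model parameter.
vlm_params = {"vision.patch_embed": [1.0, 2.0], "layers.0.mlp": [0.0, 4.0]}
llm_params = {"layers.0.mlp": [2.0, 0.0]}
merged = merge_state_dicts(vlm_params, llm_params, alpha=0.5)
```

No gradient step is involved: the merged weights are computed directly from the two sets of existing weights, which is what makes this kind of procedure training-free. It presumes both models share the same language-model architecture and parameter names.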

The analysis suggests that perception is mainly encoded in earlier layers, while reasoning is more strongly supported by middle-to-late layers. After merging, reasoning contributions become more broadly distributed across layers, while the layer-wise distribution of perception remains relatively stable.
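
One generic way to probe a layer-wise picture like this is to restrict an intervention to a range of layers and then compare the resulting models on perception and reasoning benchmarks. The sketch below shows only the mechanics of such a layer-restricted merge; it is not the paper's analysis protocol, which this summary does not describe. The `layers.<i>.` naming scheme, the helper names `layer_index` and `merge_layer_range`, and the interpolation weight `alpha` are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]

# Assumed naming scheme: transformer block i owns parameters named "layers.<i>. ...".
_LAYER_RE = re.compile(r"layers\.(\d+)\.")


def layer_index(name: str) -> Optional[int]:
    """Return the block index encoded in a parameter name, or None if absent."""
    match = _LAYER_RE.search(name)
    return int(match.group(1)) if match else None


def merge_layer_range(
    vlm: Params, llm: Params, start: int, end: int, alpha: float = 0.5
) -> Params:
    """Interpolate only blocks with start <= index < end; keep the rest from the VLM."""
    merged: Params = {}
    for name, vlm_w in vlm.items():
        idx = layer_index(name)
        llm_w = llm.get(name)
        if idx is None or llm_w is None or not (start <= idx < end):
            # Outside the selected range, or no LLM counterpart: keep VLM weights.
            merged[name] = list(vlm_w)
        else:
            merged[name] = [alpha * v + (1.0 - alpha) * l for v, l in zip(vlm_w, llm_w)]
    return merged


# Toy example: merge only the last two of three blocks.
vlm_params = {"layers.0.mlp": [4.0], "layers.1.mlp": [4.0], "layers.2.mlp": [4.0]}
llm_params = {"layers.0.mlp": [0.0], "layers.1.mlp": [0.0], "layers.2.mlp": [0.0]}
late_only = merge_layer_range(vlm_params, llm_params, start=1, end=3, alpha=0.5)
```

Sweeping `start` and `end` over early, middle, and late ranges and evaluating each variant would indicate which layers matter most for each ability, under the assumption that benchmark changes can be attributed to the layers that were modified.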

Citation

@inproceedings{chen2025bring,
  title = {Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging},
  author = {Chen, Shiqi and Zhang, Jinghan and Zhu, Tongyao and Liu, Wei and Gao, Siyang and Xiong, Miao and Li, Manling and He, Junxian},
  booktitle = {International Conference on Machine Learning},
  year = {2025}
}