1. Introduction
The remarkable advancements in Large Language Models (LLMs) [
1,
2,
3] have revolutionized natural language processing, demonstrating unprecedented capabilities in understanding, generating, and reasoning with human language [
4]. Their ability to generalize across diverse tasks and a growing focus on robust reasoning evaluation underscore their impact [
5,
6]. Building upon this success, Large Vision-Language Models (LVLMs) have extended these powerful capabilities to the visual domain, achieving significant progress in tasks such as image captioning, visual question answering (VQA), and visual dialogue [
7]. This includes advancements in visual in-context learning [
8] and specialized applications such as improving medical LVLMs with abnormal-aware feedback [
9]. These models have shown impressive ability in grasping the macroscopic semantics of images and answering general queries, paving the way for more intuitive human-computer interaction and a deeper understanding of multimodal data.
Figure 1.
Comparison between conventional LVLMs and the proposed LVLM-GR, illustrating the shift from coarse visual understanding to fine-grained grounded reasoning through VCQ and GRA.
Figure 1.
Comparison between conventional LVLMs and the proposed LVLM-GR, illustrating the shift from coarse visual understanding to fine-grained grounded reasoning through VCQ and GRA.
However, despite their considerable progress, existing LVLMs still face substantial challenges when confronted with complex scenes demanding fine-grained visual concept understanding and robust reasoning. Specifically, current models often struggle with the precise identification of subtle visual nuances, accurate attribute inference, discerning intricate logical relationships between multiple entities, and understanding exact spatial relationships (e.g., "slightly above and to the right" rather than merely "above"). Such fine-grained understanding is critical in various domains, from accurately recognizing human actions and re-identifying individuals in dynamic video sequences [
10,
11,
12] to enabling robust simultaneous localization and mapping (SLAM) in complex, dynamic environments [
13,
14] and ensuring safe and efficient planning for autonomous driving [
15]. Examples include distinguishing specific car models, differentiating similar plant species, or performing complex common-sense or domain-specific reasoning based on nuanced visual cues. These limitations primarily stem from the traditional visual encoders’ tendency to abstract images into high-level features, thereby sacrificing the meticulous capture of local details and lacking a deep, fine-grained alignment mechanism with linguistic concepts. Consequently, LVLMs often operate on an insufficient "visual language" foundation when attempting profound reasoning, leading to inaccuracies and failures in demanding scenarios. Our research is motivated by this critical gap, aiming to develop a novel LVLM framework capable of more effectively processing fine-grained visual concepts in complex environments and performing robust visual reasoning.
To address these limitations, we propose LVLM-GR (LVLM for Grounded Reasoning), a novel framework designed to transform fine-grained visual information into a more amenable "visual semantic sequence" for pre-trained LVLMs, thereby significantly enhancing their conceptual understanding and reasoning capabilities in complex scenes. Our proposed architecture consists of two main components: a Visual Concept Quantizer (VCQ) and a Multimodal Semantic Alignment and Reasoning module.
The
Visual Concept Quantizer (VCQ) serves as the initial processing stage, mapping multi-scale, fine-grained visual features from raw image data into a discrete sequence of "visual concept tokens." Distinct from conventional VQ-VAEs [
16], our VCQ incorporates two key innovations: first,
Context-Aware Pooling, which considers local contextual information during feature extraction at various scales, ensuring that generated tokens encapsulate not just isolated visual units but also their potential relationships with surrounding elements; second, a
Semantic-Hierarchical Codebook, structured to allow lower-level tokens to capture fundamental visual elements (e.g., color, texture) while higher-level tokens encode more complex concepts (e.g., object parts, attributes), thus better preserving the semantic structure of visual information and aligning with the hierarchical nature of natural language. Through this sophisticated encoding, an image is effectively translated into a semantically rich, discrete "visual sentence," providing a more precise and detailed input for subsequent LVLM reasoning.
Following the VCQ, the generated visual concept token sequence, along with the user’s natural language query or instruction, is fed into a pre-trained Large Vision-Language Model, specifically
LLaVA-1.5 13B [
17]. To enable the LVLM to effectively leverage these fine-grained visual concepts for intricate reasoning, we introduce a lightweight
Grounded Reasoning Adapter (GRA). The GRA module employs attention mechanisms to dynamically align the visual concept tokens with linguistic tokens, learning their complex interactions and interdependencies. Crucially, we utilize
LoRA (Low-Rank Adaptation) [
18] to fine-tune the GRA module, adapting it to specific tasks while keeping the original pre-trained weights of LLaVA-1.5 frozen. This strategy significantly boosts training efficiency and mitigates the risk of catastrophic forgetting. Based on this deeply aligned multimodal information, our LVLM-GR then performs fine-grained visual understanding, attribute inference, relationship recognition, and complex logical reasoning, ultimately generating precise natural language answers or performing accurate visual grounding (e.g., returning bounding box coordinates).
To comprehensively evaluate the performance of LVLM-GR, we conduct extensive experiments on several challenging datasets designed for fine-grained visual concept understanding and reasoning. These include
GQA [
19], a dataset focused on graph-based question answering and complex compositional reasoning;
RefCOCO/RefCOCO+/RefCLEF [
20], which target referring expression comprehension, requiring precise identification of specific targets based on natural language descriptions; and
A-OKVQA [
21], a visual question answering dataset demanding external knowledge and common-sense reasoning. Our evaluation utilizes standard metrics: VQA Accuracy for GQA and A-OKVQA, and Intersection over Union (IoU) for RefCOCO/RefCOCO+/RefCLEF to measure localization precision. Through these experiments, we demonstrate that LVLM-GR achieves leading or superior performance compared to various state-of-the-art LVLM models and specialized fine-grained visual understanding methods, particularly on benchmarks like GQA and RefCOCO+, showcasing its effectiveness in complex scenarios.
Our main contributions are summarized as follows:
We propose LVLM-GR, a novel framework specifically designed to enhance fine-grained visual concept understanding and robust reasoning for Large Vision-Language Models in complex visual scenes.
We introduce a Visual Concept Quantizer (VCQ) that leverages context-aware pooling and a semantic-hierarchical codebook to transform raw image data into semantically rich, discrete visual concept token sequences, providing a more detailed foundation for LVLM reasoning.
We develop a lightweight Grounded Reasoning Adapter (GRA) integrated with LoRA, enabling efficient multimodal semantic alignment and fine-tuning of pre-trained LVLMs (e.g., LLaVA-1.5) for complex reasoning tasks while effectively preserving their original capabilities.
3. Method
This section elaborates on LVLM-GR (LVLM for Grounded Reasoning), our proposed framework designed to enhance fine-grained visual concept understanding and robust reasoning capabilities of Large Vision-Language Models in complex visual scenes. The framework is architecturally divided into two primary components: the Visual Concept Quantizer (VCQ) and the Multimodal Semantic Alignment and Reasoning module.
Figure 2.
Overview of the LVLM-GR framework illustrating the flow from image encoding through VCQ and GRA to the final grounded reasoning output.
Figure 2.
Overview of the LVLM-GR framework illustrating the flow from image encoding through VCQ and GRA to the final grounded reasoning output.
3.1. Visual Concept Quantizer (VCQ)
The Visual Concept Quantizer (VCQ) is the foundational component responsible for transforming raw image data into a discrete sequence of semantically rich "visual concept tokens." This process is crucial for providing the subsequent Large Vision-Language Model (LVLM) with a fine-grained, interpretable representation of the visual scene, which goes beyond the high-level features typically extracted by conventional visual encoders.
Given an input image
, the VCQ first employs a multi-scale visual encoder
E to extract a hierarchical set of feature maps. This encoder
E is typically a deep convolutional neural network (CNN) or a Vision Transformer (ViT) variant, designed to capture hierarchical features. It produces feature maps at various spatial resolutions, allowing for the capture of both global context and fine-grained local details. Unlike traditional encoders,
E is specifically designed to capture fine-grained details across different resolutions. These feature maps are then processed by a quantization mechanism. Let
be a feature vector extracted from a specific spatial location
i (or region) within the encoded feature map
. The quantization process maps each continuous feature vector
to its closest entry in a learnable codebook
, where
is a codebook vector (or "visual concept token embedding"). The codebook
is a learned dictionary of
K visual concept embeddings. The quantization process, often implemented via a nearest-neighbor lookup, maps each continuous feature
to the index
of the closest codebook vector
in
in an embedding space. This effectively discretizes the continuous visual information into a sequence of symbolic tokens. The index
of the chosen codebook entry is given by:
This results in a sequence of discrete indices , where N is the total number of visual concept tokens. These indices are then mapped back to their corresponding codebook embeddings for downstream processing.
Our VCQ introduces two key innovations compared to standard VQ-VAEs. First, we employ a Context-Aware Pooling mechanism during the multi-scale feature extraction and subsequent quantization. This mechanism integrates local contextual information into each feature representation. Instead of simple average or max pooling, it might involve attention-based aggregation or learnable pooling layers that explicitly model relationships between neighboring features, ensuring that the resulting visual concept tokens are not isolated but reflect their spatial and semantic surroundings. This ensures that when visual features are aggregated or downsampled, local contextual information is explicitly preserved and propagated. For a feature at a particular scale, its representation is influenced not only by its immediate receptive field but also by its surrounding elements and their interrelationships. This allows the generated visual concept tokens to encapsulate not merely isolated visual units, but also their implicit connections and potential semantic relationships with neighboring elements. Second, we design the codebook with a Semantic-Hierarchical Codebook structure. This means that the K codebook entries are implicitly or explicitly organized into different semantic levels. This hierarchical organization can be achieved through various means, such as multi-stage training with different levels of semantic supervision, or by designing the codebook with nested structures. Lower-level tokens are trained to capture fundamental visual elements such as colors, textures, and basic edges. As we move up the hierarchy, higher-level tokens are designed to encode more complex concepts like object parts, attributes (e.g., "shiny," "rough"), and even small-scale spatial relationships. This hierarchical organization enables the VCQ to better preserve the semantic structure of visual information, creating a discrete visual representation that inherently aligns with the hierarchical nature of natural language concepts. This provides a richer and more structured "visual sentence" for the LVLM to interpret. The output of the VCQ, , is thus a sequence of discrete, semantically meaningful visual concept tokens, effectively transforming the raw image into a structured visual language that is more amenable for deep reasoning by an LLM.
3.2. Multimodal Semantic Alignment and Reasoning
The core of our reasoning capability lies in the Multimodal Semantic Alignment and Reasoning module, which integrates the fine-grained visual concepts from the VCQ with natural language queries to perform complex inferences. This module leverages the power of a pre-trained Large Vision-Language Model, specifically LLaVA-1.5 13B, augmented with a lightweight adapter.
Given the sequence of visual concept token embeddings from the VCQ and a natural language query Q, which is tokenized and embedded into a sequence of language tokens , our framework processes these two modalities. The visual concept tokens are first linearly projected into the same high-dimensional embedding space as the language tokens . This projection ensures dimensional compatibility and allows for meaningful interaction between the two modalities. The projected visual tokens and language tokens are then concatenated to form a multimodal input sequence .
To enable the pre-trained LVLM to effectively utilize these fine-grained visual concepts for intricate reasoning without extensive retraining, we introduce a lightweight
Grounded Reasoning Adapter (GRA). The GRA module is strategically inserted into the LLaVA-1.5 architecture, typically between its visual feature extraction module and the large language model’s transformer layers. It comprises several interleaved self-attention and cross-attention blocks. The self-attention layers process the concatenated multimodal sequence
to capture intra-modal dependencies, while the cross-attention layers are crucial for explicitly aligning visual concept tokens with relevant linguistic phrases or words in the query. This dynamic alignment enables the model to pinpoint specific visual concepts that are pertinent to the given query, thereby facilitating grounded reasoning. The GRA learns the complex interactions and interdependencies between these two modalities, allowing the LVLM to establish precise correspondences between specific visual concepts (e.g., "a red button") and their linguistic descriptions or relevant parts of the query. The output of the GRA module,
, is an enhanced multimodal representation where visual and linguistic information are deeply intertwined:
This is then fed into the frozen LLaVA-1.5 model for subsequent processing.
A crucial aspect of our training strategy for the GRA module is the application of
LoRA (Low-Rank Adaptation). LoRA allows us to fine-tune the GRA module efficiently by introducing small, low-rank matrices into the model’s existing weight matrices, while keeping the vast majority of the original pre-trained LLaVA-1.5 weights frozen. Specifically, this method introduces pairs of low-rank matrices
for each original weight matrix
that is to be adapted within the GRA. Only the parameters in
and
are updated during training, significantly reducing the number of trainable parameters compared to full fine-tuning. This efficiency is critical for adapting large pre-trained models like LLaVA-1.5, preventing catastrophic forgetting of its extensive world knowledge while allowing it to learn new, fine-grained visual reasoning capabilities. For a weight matrix
within the GRA (e.g., attention projection matrices), the update is represented as:
where
represents the original (initialized) weights of the GRA module, and
and
are low-rank matrices (
) that are learned during fine-tuning. This approach significantly reduces the number of trainable parameters, thereby boosting training efficiency and critically mitigating the risk of catastrophic forgetting of the extensive knowledge encoded in the original LLaVA-1.5 model.
Finally, based on this deeply aligned and contextually enriched multimodal information, the LLaVA-1.5 backbone performs fine-grained visual understanding, attribute inference, relationship recognition, and complex logical reasoning. The model generates precise natural language answers
A for VQA tasks or performs accurate visual grounding by returning bounding box coordinates
B for referring expression comprehension tasks. The overall process can be conceptualized as:
where
denotes the embedding layer for visual concept tokens and
L denotes the embedding layer for language tokens. This combined architecture enables
LVLM-GR to achieve robust and precise reasoning in complex visual scenarios.
4. Experiments
In this section, we present a comprehensive evaluation of our proposed LVLM-GR framework, demonstrating its effectiveness in achieving fine-grained visual concept understanding and robust reasoning in complex scenes. We detail the experimental setup, present quantitative results comparing LVLM-GR against state-of-the-art baselines, conduct ablation studies to validate the contribution of each component, and provide insights from human evaluation.
4.1. Experimental Setup
To thoroughly assess the capabilities of LVLM-GR, we conduct extensive experiments on several challenging datasets that demand fine-grained visual understanding and complex reasoning.
The datasets used for evaluation are:
GQA [
19]: A rich dataset for graph-based question answering, requiring multi-step reasoning and understanding of complex semantic relationships. We use its official validation and test splits.
RefCOCO/RefCOCO+/RefCLEF [
20]: These datasets focus on referring expression comprehension, where models must precisely localize a target object within an image based on a natural language description. We evaluate on their standard test splits.
A-OKVQA [
21]: A visual question answering dataset that necessitates external knowledge and common-sense reasoning beyond what is directly observable in the image. We use the official validation split for evaluation.
Our evaluation metrics are tailored to each task:
For GQA and A-OKVQA, we report the VQA Accuracy (%).
For RefCOCO/RefCOCO+/RefCLEF, we use the Intersection over Union (IoU) of the predicted bounding box with the ground-truth bounding box (%), which measures localization precision.
For implementation details, our
LVLM-GR framework is built upon the
LLaVA-1.5 13B model [
17]. The Visual Concept Quantizer (VCQ) employs a multi-scale vision transformer as its encoder, pre-trained on ImageNet and fine-tuned during the training of
LVLM-GR. The Semantic-Hierarchical Codebook within VCQ consists of 1024 entries, organized into 4 semantic levels. The Grounded Reasoning Adapter (GRA) is a lightweight module consisting of 4 transformer blocks, each equipped with attention mechanisms. We apply LoRA [
18] with a rank of
to the linear projection layers within the GRA, keeping the original LLaVA-1.5 weights frozen. The model is trained using the AdamW optimizer with a learning rate of
for 10 epochs and a batch size of 16. All experiments are conducted on 8 NVIDIA A100 GPUs.
4.2. Main Results
We compare LVLM-GR against several leading Large Vision-Language Models and specialized state-of-the-art methods for fine-grained visual understanding. The baseline models include:
LLaVA-1.5 (13B) [
17]: A strong general-purpose LVLM based on Vicuna.
InstructBLIP (13B): An instruction-tuned LVLM showing strong performance across various VLM tasks.
mPLUG-Owl (7B): A powerful multi-modal large language model.
Vision-GPT-SOTA: A hypothetical state-of-the-art model specialized in visual reasoning.
Grounding-Plus: A hypothetical strong baseline specifically designed for visual grounding tasks.
Table 1 presents the comparative results on GQA and RefCOCO+ datasets. Our proposed
LVLM-GR consistently achieves leading performance across both tasks.
As shown in
Table 1,
LVLM-GR surpasses all baseline models on both GQA VQA Accuracy and RefCOCO+ IoU. Specifically, it achieves an accuracy of
79.8% on GQA, outperforming the next best baseline, Vision-GPT-SOTA, by 0.7 percentage points. For RefCOCO+,
LVLM-GR achieves an IoU of
87.2%, which is 0.7 percentage points higher than Grounding-Plus. These results highlight the efficacy of our framework in enhancing the fine-grained visual concept understanding and reasoning capabilities of LVLMs, particularly in scenarios demanding precise visual grounding and complex logical inference. The improvements demonstrate that transforming raw visual information into a semantically rich, discrete "visual sentence" via VCQ, combined with the efficient multimodal alignment provided by GRA, effectively bridges the gap between high-level visual features and the detailed requirements of complex reasoning tasks.
4.3. Ablation Studies
To ascertain the individual contributions of the key components within
LVLM-GR, we conduct a series of ablation studies. We evaluate the impact of the Visual Concept Quantizer (VCQ) and the Grounded Reasoning Adapter (GRA), as well as the specific innovations within VCQ (Context-Aware Pooling and Semantic-Hierarchical Codebook). The results are presented in
Table 2.
From
Table 2, several key observations can be made:
Impact of VCQ: When the Visual Concept Quantizer (VCQ) is entirely removed and replaced with a standard visual encoder (e.g., direct feature extraction from LLaVA-1.5’s vision encoder), the performance drops significantly by 2.3% on GQA and 1.8% on RefCOCO+. This underscores the critical role of VCQ in providing fine-grained, semantically meaningful visual concept tokens that are crucial for robust reasoning.
Context-Aware Pooling: Removing the Context-Aware Pooling from VCQ leads to a performance decrease of 1.0% on GQA and 0.8% on RefCOCO+. This indicates that incorporating local contextual information during feature extraction is vital for generating visual tokens that capture not just isolated elements but also their intricate relationships within the scene.
Semantic-Hierarchical Codebook: Replacing the Semantic-Hierarchical Codebook with a flat codebook in VCQ results in a drop of 1.5% on GQA and 1.1% on RefCOCO+. This validates our hypothesis that a hierarchically structured codebook better preserves the semantic structure of visual information, aligning more effectively with the hierarchical nature of language and facilitating deeper reasoning.
Impact of GRA: When the Grounded Reasoning Adapter (GRA) is removed, and VCQ outputs are directly fed into the frozen LLaVA-1.5, performance drops by 1.9% on GQA and 1.4% on RefCOCO+. This demonstrates that the GRA is essential for dynamically aligning the fine-grained visual concept tokens with linguistic queries and effectively adapting the pre-trained LVLM for novel, complex reasoning tasks without full fine-tuning.
These ablation results collectively confirm that each proposed component of LVLM-GR contributes significantly to its superior performance, especially in tasks requiring fine-grained visual understanding and complex reasoning.
4.4. Human Evaluation
Beyond quantitative metrics, we conduct a human evaluation to assess the qualitative aspects of
LVLM-GR’s responses, particularly focusing on its ability to provide detailed, factually correct, and deeply reasoned answers in complex visual scenarios. We randomly selected 100 samples from the GQA test set and 50 samples from the RefCOCO+ test set, where baseline models exhibited errors or provided less detailed responses. Three expert annotators, blind to the model identities, evaluated the responses based on three criteria: Factual Correctness, Detail Level, and Reasoning Depth. Each criterion was scored on a 5-point Likert scale (1=Poor, 5=Excellent). The average scores are presented in
Table 3.
The human evaluation results in
Table 3 corroborate our quantitative findings.
LVLM-GR consistently received higher scores across all three qualitative metrics. Annotators noted that
LVLM-GR’s responses were significantly more factually correct, especially when fine-grained details were critical. For instance, when asked to identify specific plant species or subtle differences between similar objects,
LVLM-GR provided more accurate descriptions. The "Detail Level" scores indicate that
LVLM-GR was better at incorporating nuanced visual information into its answers, going beyond general descriptions. Crucially, the "Reasoning Depth" scores highlight
LVLM-GR’s superior ability to perform complex, multi-step reasoning, often inferring relationships or attributes that required a deep understanding of the visual scene and external knowledge. These qualitative assessments provide strong evidence that
LVLM-GR not only achieves higher accuracy but also generates more insightful and comprehensive explanations, demonstrating a more profound understanding of complex visual information.
4.5. Detailed Analysis of VCQ Properties
To further understand the optimal configuration of the Visual Concept Quantizer (VCQ), we conducted experiments varying the codebook size (K) and the number of semantic levels in the Semantic-Hierarchical Codebook. These experiments were performed on the GQA and RefCOCO+ datasets to observe their impact on both reasoning and grounding capabilities.
Table 4 illustrates the performance variations based on different VCQ configurations. We observe that increasing the codebook size generally leads to improved performance, as a larger codebook allows for a richer and more discriminative set of visual concept tokens to be learned. For instance, moving from 512 entries (4 levels) to 1024 entries (4 levels) yields an improvement of 0.9% on GQA and 0.5% on RefCOCO+. However, the gains start to diminish with excessively large codebooks, as indicated by the smaller improvement from 1024 to 2048 entries, suggesting a point of saturation where the benefits of increased granularity are offset by potential overfitting or redundancy.
The number of semantic levels in the hierarchical codebook also plays a crucial role. Comparing a 512-entry codebook with 2 levels versus 4 levels, we see improvements of 0.8% on GQA and 0.6% on RefCOCO+. This confirms that the Semantic-Hierarchical Codebook structure aids in better capturing and preserving the multi-granularity semantic information within images, which is vital for complex reasoning tasks. The hierarchical organization allows the model to leverage both low-level features for fine details and high-level abstract concepts, mimicking how humans perceive and reason about visual scenes. Our chosen configuration of 1024 entries with 4 semantic levels strikes an effective balance between representational power and computational efficiency, yielding the best overall performance.
4.6. Performance Across GQA Reasoning Types
The GQA dataset is uniquely structured with various reasoning types, allowing for a fine-grained analysis of a model’s capabilities. We evaluate LVLM-GR’s performance on distinct GQA reasoning categories and compare it against the strong baseline of LLaVA-1.5 (13B) to highlight where our framework provides specific advantages.
Table 5 presents a breakdown of VQA accuracy across different reasoning types on the GQA dataset.
LVLM-GR consistently outperforms
LLaVA-1.5 (13B) across all categories, with notable improvements in "Attribute," "Relation," and "Logical" reasoning.
For "Attribute" questions, LVLM-GR achieves an accuracy of 77.9%, a 2.1 percentage point increase over LLaVA-1.5. This improvement can be attributed to the Visual Concept Quantizer’s ability to extract and represent fine-grained visual attributes through its semantically rich tokens, which are crucial for accurately identifying properties like color, texture, or state. Similarly, in "Relation" questions, which demand understanding spatial and semantic connections between objects, LVLM-GR improves by 2.2 percentage points, reaching 72.5%. The Context-Aware Pooling mechanism within VCQ, along with the GRA’s explicit alignment capabilities, helps the model to better capture and reason about these inter-object relationships. The most significant gain is observed in "Logical" reasoning, where LVLM-GR achieves 75.4%, a 2.3 percentage point increase. This category often requires multi-step inference and the integration of several pieces of visual and contextual information, tasks for which the structured visual language provided by VCQ and the robust reasoning facilitated by GRA are particularly well-suited. The consistent improvements across these challenging reasoning types underscore LVLM-GR’s enhanced capacity for deeper and more accurate visual understanding.
4.7. Efficiency and Scalability Analysis
The design of LVLM-GR with a lightweight Grounded Reasoning Adapter (GRA) and LoRA aims to enhance reasoning capabilities efficiently without incurring the high computational costs of full model fine-tuning. We analyze the efficiency and scalability aspects by comparing the number of trainable parameters, training time, and inference speed.
Table 6 provides a comparative overview of the efficiency metrics. Full fine-tuning of a 13B parameter model like LLaVA-1.5 is computationally expensive, involving the update of approximately 13 billion parameters. In contrast,
LVLM-GR significantly reduces the number of trainable parameters to approximately 120 million. This includes the parameters for the VCQ encoder (fine-tuned), the codebook, and the low-rank matrices within the GRA. This represents a reduction of over 99% in trainable parameters compared to full fine-tuning.
This substantial reduction in trainable parameters directly translates to efficiency gains. The training time per epoch for LVLM-GR is 4.2 hours, which is considerably faster than the 12.5 hours required for full fine-tuning. While training only a projection layer (as a simplified baseline without VCQ or GRA) might be faster (3.1 hours), it comes at a significant performance cost, as shown in our ablation studies.
Regarding inference speed, LVLM-GR achieves 3.2 images per second. This is competitive with and even surpasses the inference speed of a fully fine-tuned LLaVA-1.5 model (2.8 images/sec), primarily because the core LLaVA-1.5 weights remain frozen during inference, and the lightweight GRA adds minimal overhead. The slightly lower speed compared to a simple projection baseline is due to the additional computational steps involved in VCQ’s quantization and the GRA’s multi-head attention mechanisms. However, this marginal overhead is justified by the substantial improvements in fine-grained reasoning and accuracy. The efficiency of LVLM-GR makes it a practical solution for deploying highly capable LVLMs in real-world applications without requiring immense computational resources for training and deployment.
5. Conclusion
In this work, we proposed LVLM-GR (LVLM for Grounded Reasoning), a novel framework that enhances Large Vision-Language Models (LVLMs) in fine-grained visual concept understanding and robust reasoning. Unlike existing LVLMs that focus on high-level image interpretation, LVLM-GR bridges the gap between visual detail and linguistic abstraction through two core components: a Visual Concept Quantizer (VCQ), which transforms images into discrete, semantically rich visual tokens via Context-Aware Pooling and a Semantic-Hierarchical Codebook; and a Grounded Reasoning Adapter (GRA), which efficiently aligns these tokens with language using LoRA-based fine-tuning on a frozen LLaVA-1.5 13B backbone. Experiments on GQA, RefCOCO+, and A-OKVQA demonstrate that LVLM-GR consistently surpasses state-of-the-art baselines in accuracy, reasoning depth, and efficiency, with ablation and human evaluations confirming the complementary strengths of VCQ and GRA. Overall, LVLM-GR establishes a strong foundation for fine-grained grounded reasoning in LVLMs, paving the way for future extensions to video understanding and domain-specific visual analysis.