Submitted:
26 July 2025
Posted:
29 July 2025
Abstract
Keywords:
1. Introduction
2. Related Works
3. Model Architecture Justification
3.1. Choice of LLaMA over Other LLMs
3.2. Modular Vision Processing over Integrated VLMs
4. Main Scheme and Working Principles of the Model
5. 3D Image Captioning
5.1. Object Detection Matrix
5.2. Image Captioning
5.3. Prompt for LLaMA

5.4. Mathematical Model Explanations
DETR (DEtection TRansformer):
- $\hat{p}_i$: class probability for object $i$
- $\hat{b}_i$: bounding box coordinates
- N: fixed number of object queries
MiDaS (Monocular Depth Estimation):
Moondream (Vision-Language Model):
- C: Generated caption
- Encoder: Extracts visual features
- Decoder: Generates text tokens based on visual tokens
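The three components listed above can be summarized in one common notation. This is a notational sketch under stated assumptions, not necessarily the equations used in the original: the symbols $I$ (input image), $\hat{y}$ (DETR prediction set), $D$ (depth map), and $f_{\theta}$ (depth network) are introduced here for illustration.

```latex
\begin{align}
\hat{y} &= \{(\hat{p}_i, \hat{b}_i)\}_{i=1}^{N}
  && \text{DETR: a set of $N$ class/box predictions} \\
D &= f_{\theta}(I)
  && \text{MiDaS: relative depth map of the image} \\
C &= \mathrm{Decoder}\big(\mathrm{Encoder}(I)\big)
  && \text{Moondream: caption generated from visual features}
\end{align}
```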
6. Spatial Model Training
6.1. Data Preparation
- Image Processing: We selected 2000 images from the COCO 2017 training set [23], which depict various 3D scenes. For each image, we employed the DETR model [4] for object detection and the Moondream model [5] for generating detailed descriptions. DETR provided object counts and their 2D spatial locations within a 2×3 grid, while MiDaS estimated relative depth values to derive the z-coordinates, enabling 3D spatial reasoning. Together, these models generated tuples of coordinates for detected objects. Moondream produced a natural language description of each image. These components form the foundation of the prompts used in subsequent steps.
- Prompt Construction: For each image, we constructed a prompt by combining the Moondream-generated image description with the DETR-derived object detection matrix (augmented with MiDaS depth data). The prompt was designed to elicit a detailed description of the depicted place, integrating both the visual narrative and the 3D spatial arrangement of objects. The prompt format is structured as follows.

  Prompt Format:

  Image Description: [Description]

  This grid-based object detection matrix represents detected objects in different regions of the image. [Matrix]

  Describe this place in detail.

  System Prompt:

  "You are a visual understanding and interpretation assistant. You will receive an input consisting of a natural language description of an image along with a grid-based object detection matrix, which contains object names, counts, and their spatial positions. Your task is to give information and answer questions about places."
- Response Generation: Using the constructed prompts, we generated responses from two language models:
  - DeepSeek-V3-0324: This model produced high-quality, contextually rich descriptions, which were labeled as the chosen responses for DPO training.
  - LLaMA-3.2-1B-Instruct: The version of this model before fine-tuning generated baseline descriptions, which were labeled as the rejected responses.

  Both models were configured with a system prompt to act as visual understanding assistants, using the parameters max_tokens=1024, temperature=0.7, and top_p=0.95 to keep response generation consistent.
- Dataset Structuring for DPO: For each of the 2000 processed images, a dataset entry was created comprising:
  - The constructed prompt.
  - The response from DeepSeek-V3-0324 (chosen).
  - The response from LLaMA-3.2-1B-Instruct (rejected).

  This triplet structure is critical for DPO: it lets the model learn from pairwise preferences and improve its description quality by aligning with the chosen responses.
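The data-preparation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Detection` container, the rule of assigning an object to the grid cell containing its box centre, and the plain-text serialization of the matrix are all hypothetical. Only the 2×3 grid, the prompt wording, the sampling parameters, and the prompt/chosen/rejected triplet come from the text.

```python
from dataclasses import dataclass

GRID_ROWS, GRID_COLS = 2, 3  # the 2x3 grid described in the text

# Sampling parameters reported in the text for both models.
GENERATION_PARAMS = {"max_tokens": 1024, "temperature": 0.7, "top_p": 0.95}


@dataclass
class Detection:
    """Hypothetical container for one detected object."""
    label: str    # class name from the detector
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels
    depth: float  # relative depth for the object (assumed to be a scalar)


def grid_cell(box, width, height):
    """Map a box centre to a (row, col) cell of the grid (assumed rule)."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so that a centre on the right or bottom edge stays in range.
    col = min(int(cx / width * GRID_COLS), GRID_COLS - 1)
    row = min(int(cy / height * GRID_ROWS), GRID_ROWS - 1)
    return row, col


def detection_matrix(detections, width, height):
    """Count objects per cell: matrix[row][col] maps label -> count."""
    matrix = [[{} for _ in range(GRID_COLS)] for _ in range(GRID_ROWS)]
    for det in detections:
        row, col = grid_cell(det.box, width, height)
        cell = matrix[row][col]
        cell[det.label] = cell.get(det.label, 0) + 1
    return matrix


def coordinate_tuples(detections, width, height):
    """Return (label, row, col, depth) tuples, with depth as the z value."""
    return [(d.label, *grid_cell(d.box, width, height), d.depth)
            for d in detections]


def build_prompt(description, matrix):
    """Assemble the user prompt in the format quoted in the text."""
    return (
        f"Image Description: {description}\n"
        "This grid-based object detection matrix represents detected "
        "objects in different regions of the image.\n"
        f"{matrix}\n"
        "Describe this place in detail."
    )


def dpo_entry(prompt, chosen, rejected):
    """One preference record: prompt plus chosen and rejected responses."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

In this sketch, one record per image would be produced by calling `dpo_entry` with the prompt, the DeepSeek-V3-0324 response as `chosen`, and the LLaMA-3.2-1B-Instruct response as `rejected`.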
6.2. Fine-Tuning
6.2.1. Parameter Changes and Training Details

| Metric | Initial | Final |
|---|---|---|
| Loss | 0.693 | 0.204 |
| Chosen Reward | 0.0 | 0.441 |
| Rejected Reward | 0.0 | -4.848 |
| Margin | 0.0 | 5.289 |
| Gradient Norm | 3.54 | 0.076 |
| Learning Rate | 0 |
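Some rows of this table can be checked against the DPO objective. The sketch below assumes that the logged rewards are the β-scaled log-probability ratios from the DPO formulation of Rafailov et al., so that the per-example loss is −log σ(chosen reward − rejected reward). The function name `dpo_loss` is illustrative and is not taken from the source.

```python
import math


def dpo_loss(margin):
    """Per-example DPO loss, -log(sigmoid(margin)), computed stably."""
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# At the start of training both rewards are 0, so the margin is 0
# and the loss is log(2), which rounds to the reported 0.693.
initial_loss = dpo_loss(0.0)

# The reported final margin is the chosen reward minus the rejected reward.
final_margin = 0.441 - (-4.848)  # 5.289 up to floating-point rounding
```

Under the same assumption, the loss evaluated at the final mean margin is far below the reported 0.204. Because the loss is convex in the margin, a loss averaged over examples can never be smaller than the loss at the average margin, so the two figures are not contradictory; the source does not state how the logged values were aggregated.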
7. TTS Model Training
7.1. Justification for Choosing the Orpheus Model
7.1.1. Transformer-Based Architecture
7.1.2. Multimodal Processing Capabilities
7.1.3. Support for Multi-Speaker Synthesis
7.2. Training Techniques
7.3. Training Stages
7.3.1. Hyperparameters
7.4. Overview and Performance Analysis
8. Discussion
8.1. Limitations
8.2. Advantages
9. Conclusion
- AquaLabs/EchoLLaMA-1B
- AquaLabs/Orpheus-3B-0.1-ft-Elise
- AquaLabs/Spatial-DPO-Dataset
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Generation Examples
| Input | Generated Prompt | Response |
|---|---|---|
| (input image) | Image Description: The image features a calico cat sitting on the hood of a black car. The cat is comfortably perched on the car’s roof, with its head resting on the hood. The car is parked in front of a house, and there is a potted plant nearby. The scene captures a peaceful moment between the cat and its surroundings. This grid-based object detection matrix represents detected objects in different regions of the image. | Based on the provided grid-based object detection matrix, here’s a summary of the image: A calico cat is sitting comfortably on the hood of a black car, with its head resting on the hood. The cat is positioned in a serene and peaceful manner, as if enjoying the moment. The car is parked in front of a house, with a potted plant nearby. This scene captures a lovely domestic moment between the cat and its surroundings. |
| (input image) | Image Description: The image features a computer monitor with a silver Apple logo on the front, sitting on a wooden desk. Next to the monitor, there is a cell phone with a blue screen, placed on the desk. The cell phone appears to be turned on, as it is illuminated. The desk is also adorned with several books scattered around, with some placed near the monitor and others on the desk surface. The books vary in size and are positioned at different angles, creating a cozy and inviting atmosphere. This grid-based object detection matrix represents detected objects in different regions of the image. | Based on the provided image description and object detection matrix, here’s a summary of the scene: A computer monitor with a silver Apple logo on the front sits on a wooden desk. To the left of the monitor, a cell phone with a blue screen is placed, illuminated and turned on. The desk is adorned with several books, scattered around and placed near the monitor and desk surface. The books vary in size and are positioned at different angles, creating a cozy and inviting atmosphere. The overall scene appears to be a comfortable and modern workspace, with the Apple logo and cell phone serving as a focal point. |
Appendix B. Detailed Training Techniques
Appendix B.1. LoRA (Low-Rank Adaptation)
Appendix B.2. DPO (Direct Preference Optimization)
References
- Grattafiori, A.; et al. The Llama 3 Herd of Models, 2024, [arXiv:cs.AI/2407.21783].
- Gemma Team. Gemma: Open Models Based on Gemini Research and Technology, 2024, [arXiv:cs.CL/2403.08295].
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B, 2023, [arXiv:cs.CL/2310.06825].
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers, 2020, [arXiv:cs.CV/2005.12872].
- Korrapati, V. moondream2 (Revision 92d3d73), 2024. https://doi.org/10.57967/hf/3219.
- CanopyLabs. Orpheus 3B 0.1-ft. https://huggingface.co/canopylabs/orpheus-3b-0.1-ft, 2024. Available on Hugging Face.
- DeepSeek-AI. DeepSeek-V3 Technical Report, 2025, [arXiv:cs.CL/2412.19437].
- Wang, Y.; Guizilini, V.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries, 2021, [arXiv:cs.CV/2110.06922].
- Peng, S.; Genova, K.; Jiang, C.M.; Tagliasacchi, A.; Pollefeys, M.; Funkhouser, T. OpenScene: 3D Scene Understanding with Open Vocabularies, 2023, [arXiv:cs.CV/2211.15654].
- Luo, T.; Rockwell, C.; Lee, H.; Johnson, J. Scalable 3D Captioning with Pretrained Models, 2023, [arXiv:cs.CV/2306.07279].
- Xue, Z.; Li, R.; Li, M. Recent Progress in Conversational AI, 2022, [arXiv:cs.CL/2204.09719].
- Tu, T.; Palepu, A.; Schaekermann, M.; Saab, K.; Freyberg, J.; Tanno, R.; Wang, A.; Li, B.; Amin, M.; Tomasev, N.; et al. Towards Conversational Diagnostic AI, 2024, [arXiv:cs.AI/2401.05654].
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models, 2023, [arXiv:cs.CL/2302.13971].
- Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2024, [arXiv:cs.LG/2305.18290].
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 2017, 114, 3521–3526. https://doi.org/10.1073/pnas.1611835114.
- OpenAI.; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report, 2024, [arXiv:cs.CL/2303.08774].
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning, 2023, [arXiv:cs.CV/2304.08485].
- Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, 2020, [arXiv:cs.CV/1907.01341].
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection, 2021, [arXiv:cs.CV/2010.04159].
- Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for Dense Prediction. CoRR 2021, abs/2103.13413, [2103.13413].
- MrDragonFox. Elise Dataset. https://huggingface.co/datasets/MrDragonFox/Elise, 2025.
- Zhang, A. Speech Recognition (Version 3.11) [Software]. https://github.com/Uberi/speech_recognition#readme, 2017. Available from GitHub.
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context, 2015, [arXiv:cs.CV/1405.0312].
- Hugging Face. Hugging Face: The AI community building the future. The platform where the machine learning community collaborates on models, datasets, and applications, 2024.
- .
- Lian, W. axolotl, 2024.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need, 2023, [arXiv:cs.CL/1706.03762].
- Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R.; et al. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, 2018, [arXiv:cs.CL/1712.05884].
- van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio, 2016, [arXiv:cs.SD/1609.03499].
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models, 2021, [arXiv:cs.CL/2106.09685].
- Goodfellow, I.J.; Mirza, M.; Xiao, D.; Courville, A.; Bengio, Y. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks, 2015, [arXiv:stat.ML/1312.6211].
- Siuzdak, H.; Grötschla, F.; Lanzendörfer, L.A. SNAC: Multi-Scale Neural Audio Codec, 2024, [arXiv:cs.SD/2410.14411].
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing, 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6.





| Parameter | Value |
|---|---|
| Base Model | LLaMA-3.2-1B-Instruct |
| LoRA Rank | 8 |
| β (DPO) | 0.1 |
| Learning Rate | (cosine decay) |
| Batch Size | 16 (with 2×8 accumulation) |
| Sequence Length | 8192 |

| Learning Rate | Max Steps | Warmup Steps | Grad. Accumulation Steps | Per-Device Batch Size | Optimizer |
|---|---|---|---|---|---|
| | 360 | 5 | 4 | 1 | AdamW_8bit |

| LoRA r | LoRA α | LoRA Dropout |
|---|---|---|
| 64 | 64 | 0 |

| Model | Parameter Size (B) | Training Time |
|---|---|---|
| Orpheus-3B-0.1-ft | 3 | 47 minutes |
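The hyperparameter tables above imply a few derived quantities that are easy to compute. The sketch below is illustrative only and rests on stated assumptions: the second LoRA column is taken to be the scaling factor alpha, "2×8" is read as a per-device batch of 2 with 8 accumulation steps, training is assumed to run on a single device, and the 1024×1024 layer size in the usage note is hypothetical rather than a dimension of either model.

```python
def lora_scaling(alpha, rank):
    """Scaling applied to the low-rank update in LoRA: alpha / r."""
    return alpha / rank


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def examples_seen(max_steps, per_device_batch, grad_accum_steps, num_devices=1):
    """Total examples processed over a run of max_steps optimizer steps."""
    return max_steps * effective_batch_size(
        per_device_batch, grad_accum_steps, num_devices
    )
```

With r = 64 and alpha = 64, `lora_scaling(64, 64)` gives a scaling of 1.0. A per-device batch of 1 with 4 accumulation steps gives an effective batch of 4, so 360 steps cover 1440 examples on one device. `effective_batch_size(2, 8)` reproduces the batch size of 16 in the first table, and a hypothetical 1024×1024 projection with rank 64 would add 131072 trainable parameters.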
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).



