Submitted:
23 June 2026
Posted:
23 June 2026
Abstract
Keywords:
1. Introduction
- A Systematic Review of Streaming Tasks: We group existing tasks, by the scenario they address, into real-time perception and event/action understanding, real-time description and narrative generation, agent-based interaction and task planning, and video dialogue.
- A Holistic Taxonomy of Streaming Video Understanding: We establish a formal classification that explicitly distinguishes reactive and proactive streaming paradigms, highlighting their unique technical challenges and application scenarios, which frames the subsequent technical discussion.
- A Panoramic Summary of Benchmarks: We categorize existing benchmarks and datasets into a clear taxonomy covering multi-turn dialogue & QA, real-time captioning & narration, and proactive response & timing evaluation, providing guidance for researchers.
- Challenges and Future Directions: We identify key challenges, including temporal modeling, efficiency, and response triggering, and outline promising directions toward more adaptive, efficient, and predictive streaming systems.
2. Preliminaries
2.1. Streaming Video Understanding
2.1.1. Proactive Response Paradigm
2.1.2. Reactive Response Paradigm
2.2. Streaming Video Understanding Tasks
2.2.1. Proactive Streaming Video Understanding Tasks
- Streaming Real-Time Narration: Requires the model to act as a live commentator, autonomously generating continuous and non-redundant descriptions as new semantic events unfold, emphasizing narrative timing and fluency.
- Continuous State Monitoring: Implements a “one-query, multiple-updates” paradigm. Given a standing task (e.g., “count the people”), the model must proactively update its response whenever the state of the target variable changes in the stream.
- Event Conditioned Responding: Focuses on evidence-driven output. The model must monitor the stream for specific conditions or hidden evidence and only initiate a response (such as an alert or a deferred answer) at the exact moment the trigger appears.
- Proactive Multi-turn Interaction: Simulates a human-like assistant in a duplex conversation. The model must autonomously manage interaction flow, including judging user input validity, interrupting redundant content, and proactively initiating new turns.
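
As an illustration of the "one-query, multiple-updates" paradigm described under Continuous State Monitoring, the following minimal sketch emits a new answer only when the monitored value changes. The function name `monitor_state` and the per-frame people counts are hypothetical; in a real system the values would come from a perception model rather than a hard-coded list.

```python
from typing import Iterable, Iterator, Optional, Tuple


def monitor_state(states: Iterable[Tuple[float, int]]) -> Iterator[Tuple[float, int]]:
    """Yield (timestamp, value) only when the monitored value changes.

    `states` is a hypothetical stream of per-frame estimates, e.g. the
    number of people reported at each timestamp for one standing query.
    """
    last: Optional[int] = None  # no answer has been emitted yet
    for timestamp, value in states:
        if value != last:
            last = value
            yield timestamp, value


# Hypothetical per-frame people counts for the standing query "count the people".
frames = [(0.0, 2), (0.5, 2), (1.0, 3), (1.5, 3), (2.0, 2)]
updates = list(monitor_state(frames))
print(updates)  # [(0.0, 2), (1.0, 3), (2.0, 2)]
```

Five frames produce three responses: the model stays silent while the state is unchanged, which is the behavior that separates this task from per-frame question answering.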

2.2.2. Reactive Streaming Video Understanding Tasks
- Real-Time Visual Perception: Evaluates the model’s ability to capture the “present” state. It asks about ongoing actions or immediate attributes at the current timestamp, testing the efficiency of real-time visual feature extraction.
- History Backward Tracing: Requires the model to retrieve specific details or event sequences from the “past”. It assesses the robustness of long-term episodic memory and the ability to locate historical clues within the accumulated video stream.
- Future Action Prediction: Focuses on the “near-future” by requiring the model to anticipate the next steps or potential outcomes based on current trends and causal reasoning, moving beyond simple recognition to anticipatory inference.
- Contextual Multi-Turn QA: Evaluates the consistency of the “dialogue history”. The model must resolve coreferences (e.g., pronouns like “it” or “he”) and maintain logical coherence across sequential questions that depend on previous interactions.
3. Proactive Streaming Models
3.1. Token-Driven Triggering via EOS and Action Token
3.1.1. Binary Triggering and Action Extensions
3.1.2. Training Challenges and Solutions
3.1.3. Discussion
3.2. Dedicated Classification Heads and Detectors
3.2.1. Direct Trigger Prediction and Event Modeling
3.2.2. Predictive Decision Making
3.2.3. Discussion
3.3. Uncertainty and Perplexity Validation
3.3.1. PPL-Based Verification Triggering
3.3.2. Discussion
3.4. Visual Change and Event-Based Trigger
3.4.1. Visual-Change-Based Triggering
3.4.2. Discussion
4. Reactive Streaming Models
4.1. Information Input Interception
4.1.1. Token-Level Sparsification and Re-Use
4.1.2. Attention Mechanism Optimization
4.1.3. Discussion
4.2. Working Memory Maintenance
4.2.1. Query-Agnostic KV Eviction
4.2.2. Training-Inference Alignment and Multi-Timescale Buffers
4.2.3. Discussion
4.3. Long-Term Memory Compression
4.3.1. Explicit Hierarchical Structures
4.3.2. Semantic Summarization and Implicit Memory
4.3.3. Discussion
4.4. On-Demand History Recall
4.4.1. KV-Based Memory Construction and Indexing
4.4.2. Adaptive Retrieval Strategies
4.4.3. Discussion
5. Benchmarks and Datasets
5.1. Multi-Turn Dialogue & QA
| # | Dataset | Date | Venue | Focus Area | Scale | Link |
|---|---|---|---|---|---|---|
| 1 | ProReady-QA[81] | 2026.03 | CVPR 2026 | Proactive Response & Timing Evaluation | 5.0K QAs | Link |
| 2 | Live Gaming Benchmark[82] | 2026.03 | ICML 2026 | Proactive Response & Timing Evaluation | 3.0K videos | Link |
| 3 | RIVER Bench[139] | 2026.03 | arXiv | Multi-Turn Dialogue & QA | 4.3K QAs | Link |
| 4 | StreamEQA[141] | 2025.12 | arXiv | Multi-Turn Dialogue & QA | 21.0K QAs | N/A |
| 5 | StreamGaze[143] | 2025.12 | arXiv | Proactive Response & Timing Evaluation | 8.5K QAs | Link |
| 6 | OmniStar-RNG[48] | 2025.11 | NeurIPS 2025 | Real-time Captioning & Narration | 20.1K videos | Link |
| 7 | StreamingCoT[142] | 2025.10 | ACM MM 2025 | Multi-Turn Dialogue & QA | 5.7K videos | Link |
| 8 | ESTP-Bench[70] | 2025.10 | NeurIPS 2025 | Proactive Response & Timing Evaluation | 2.3K QAs | Link |
| 9 | ODV-Bench[117] | 2025.09 | NeurIPS 2025 | Multi-Turn Dialogue & QA | 32.0K QAs | Link |
| 10 | OST-Bench[140] | 2025.07 | NeurIPS 2025 | Multi-Turn Dialogue & QA | 10.0K QAs | Link |
| 11 | ProactiveVideoQA[144] | 2025.07 | arXiv | Proactive Response & Timing Evaluation | 3.5K QAs | Link |
| 12 | PROASSIST[76] | 2025.06 | EMNLP 2025 | Proactive Response & Timing Evaluation | 30.1K QAs | Link |
| 13 | RTV-Bench[138] | 2025.05 | NeurIPS 2025 | Multi-Turn Dialogue & QA | 4.6K QAs | Link |
| 14 | Live-WhisperX-526K[75] | 2025.04 | CVPR 2025 | Real-time Captioning & Narration | 526.0K videos | Link |
| 15 | Live-CC-5M[75] | 2025.04 | CVPR 2025 | Real-time Captioning & Narration | 5.0M videos | Link |
| 16 | OmniMMI[145] | 2025.03 | CVPR 2025 | Proactive Response & Timing Evaluation | 2.3K QAs | Link |
| 17 | VAPDA-127K[77] | 2025.03 | arXiv | Proactive Response & Timing Evaluation | 2.4K videos | N/A |
| 18 | YT-Conversation[80] | 2025.02 | NAACL 2025 | Multi-Turn Dialogue & QA | 414 videos | Link |
| 19 | SVBench[127] | 2025.02 | ICLR 2025 | Multi-Turn Dialogue & QA | 50.0K QAs | Link |
| 20 | OVBench[50] | 2025.01 | CVPR 2025 | Multi-Turn Dialogue & QA | 4.9K QAs | Link |
| 21 | OVO-Bench[56] | 2025.01 | CVPR 2025 | Proactive Response & Timing Evaluation | 3.1K QAs | Link |
| 22 | StreamBench[128] | 2025.01 | ICLR 2025 | Multi-Turn Dialogue & QA | 1.8K QAs | Link |
| 23 | StreamingBench[55] | 2024.11 | arXiv | Multi-Turn Dialogue & QA | 4.5K QAs | Link |
| 24 | MMDuetIT[47] | 2024.11 | EMNLP 2025 | Proactive Response & Timing Evaluation | 109.0K videos | Link |
| 25 | TemporalBench[137] | 2024.10 | arXiv | Multi-Turn Dialogue & QA | 10.0K QAs | Link |
| 26 | QEVD-FIT-COACH[69] | 2024.07 | NeurIPS 2024 | Proactive Response & Timing Evaluation | 74 videos | Link |
5.2. Real-Time Captioning & Narration
5.3. Proactive Response & Timing Evaluation
5.4. Discussion
6. Challenges and Future Directions
6.1. Challenges
6.1.1. Proactive Paradigm for Streaming Video Understanding

6.1.2. Reactive Paradigm for Streaming Video Understanding
6.2. Future Directions
6.2.1. Towards a Unified Proactive-Reactive Meta-Architecture
- Always-on Subconscious Perception (Bottom Tier): This layer operates continuously at a high frame rate with minimal computational cost. It maintains a lightweight, sliding-window KV cache to encode streaming visual tokens without invoking the heavyweight LLM. Its primary goal is to maintain a compressed episodic memory of the unbounded stream.
- Dual-mode Triggering Mechanism (Middle Tier): This layer acts as the routing controller. It continuously evaluates two condition streams: an external user query stream and an internal visual change/uncertainty stream. The triggering function combines the two and fires when either a reactive query has arrived or a proactive threshold (e.g., anomaly detection or perplexity spike) has been crossed.
- On-demand Cognitive Reasoning (Top Tier): The computationally intensive multimodal LLM remains dormant until explicitly awakened by the trigger signal. Once triggered, the model retrieves contextually relevant tokens from the subconscious perception layer and generates a temporally grounded response.
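
The three tiers can be sketched as a single control loop. This is a toy illustration under stated assumptions, not an implementation from any surveyed system: the class `StreamingController`, its `window` and `threshold` parameters, the string frame summaries, and the scalar `change_score` are all hypothetical stand-ins, and `respond` merely formats the cached context where a real system would call the multimodal LLM.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class StreamingController:
    window: int = 4          # frames kept by the bottom tier (assumed value)
    threshold: float = 0.5   # proactive trigger threshold (assumed value)
    memory: Deque[Tuple[int, str]] = field(default_factory=deque)

    def perceive(self, t: int, frame_summary: str) -> None:
        """Bottom tier: cache the new frame and evict beyond the window."""
        self.memory.append((t, frame_summary))
        while len(self.memory) > self.window:
            self.memory.popleft()

    def should_trigger(self, query: Optional[str], change_score: float) -> bool:
        """Middle tier: fire on a user query OR a crossed threshold."""
        return query is not None or change_score > self.threshold

    def respond(self, t: int, query: Optional[str]) -> str:
        """Top tier: placeholder for the LLM call, reading the cached context."""
        context = ", ".join(summary for _, summary in self.memory)
        mode = "reactive" if query is not None else "proactive"
        return f"[t={t}] {mode}: {context}"

    def step(self, t: int, frame_summary: str,
             query: Optional[str] = None,
             change_score: float = 0.0) -> Optional[str]:
        """Run one stream step; return a response only when triggered."""
        self.perceive(t, frame_summary)
        if self.should_trigger(query, change_score):
            return self.respond(t, query)
        return None


ctrl = StreamingController(window=2, threshold=0.5)
outputs = [
    ctrl.step(0, "empty room"),
    ctrl.step(1, "person enters", change_score=0.9),
    ctrl.step(2, "person sits", query="What is happening?"),
    ctrl.step(3, "person sits"),
]
for out in outputs:
    print(out)
```

In this run the first and last steps return `None`, the second fires proactively because the change score exceeds the threshold, and the third fires reactively because a query arrived. The expensive `respond` path is reached only on those two steps, while `perceive` runs on every frame.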

6.2.2. Efficient Memory Architectures and Hierarchical Context
6.2.3. Hardware-Software Co-Design for Edge Deployment
6.2.4. Streaming Video as the Visual Cortex of Embodied AI
7. Conclusions
References
- Zellers, R.; Lu, J.; Lu, X.; Yu, Y.; Zhao, Y.; Salehi, M.; Kusupati, A.; Hessel, J.; Farhadi, A.; Choi, Y. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022; pp. 16375–16387. [Google Scholar]
- Grauman, K.; Westbury, A.; Byrne, E.; Chavis, Z.; Furnari, A.; Girdhar, R.; Hamburger, J.; Jiang, H.; Liu, M.; Liu, X.; et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022; pp. 18995–19012. [Google Scholar]
- Tang, Y.; Bi, J.; Xu, S.; Song, L.; Liang, S.; Wang, T.; Zhang, D.; An, J.; Lin, J.; Zhu, R.; et al. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2025. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International conference on machine learning. PMLR, 2023; pp. 19730–19742. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar] [CrossRef]
- Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: a family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar]
- Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar]
- Bai, S.; Cai, Y.; Chen, R.; Chen, K.; Chen, X.; Cheng, Z.; Deng, L.; Ding, W.; Gao, C.; Ge, C.; et al. Qwen3-vl technical report. arXiv 2025, arXiv:2511.21631. [Google Scholar]
- Wang, W.; Gao, Z.; Gu, L.; Pu, H.; Cui, L.; Wei, X.; Liu, Z.; Jing, L.; Ye, S.; Shao, J.; et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv 2025, arXiv:2508.18265. [Google Scholar]
- Lee, J.; Chang, J.; Lee, D.; Choi, J. CA2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 2803–2819. [Google Scholar] [CrossRef] [PubMed]
- Ye, Q.; Yu, Z.; Shao, R.; Cui, Y.; Kang, X.; Liu, X.; Torr, P.; Cao, X. CAT+: Investigating and Enhancing Audio-Visual Understanding in Large Language Models. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 8674–8690. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.H.; Lu, S.; Zeng, A.; Zhang, H.; Wang, B.; Zhang, R.; Zhang, L. MotionLLM: Understanding Human Behaviors from Human Motions and Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 1–15. [Google Scholar] [CrossRef] [PubMed]
- Song, E.; Chai, W.; Ye, T.; Hwang, J.N.; Li, X.; Wang, G. MovieChat+: Question-Aware Sparse Memory for Long Video Question Answering. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 374–389. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Gao, M.; He, X.; Tang, S.; Zheng, W.S.; Xiao, J.; Wang, M.; Chua, T.S.; Zhuang, Y. Momentor++: Advancing Video Large Language Models With Fine-Grained Long Video Reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 6208–6224. [Google Scholar] [CrossRef] [PubMed]
- Zhang, K.; Yang, Z.; Han, M.; Zhuge, Y.; Hao, H.; Li, C.; Li, Z.; Chang, X. SELongVLM: Empowering Long Video Language Models with Self-Corrective Clip Selection. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 1–16. [Google Scholar] [CrossRef] [PubMed]
- Tian, S.; Wang, R.; Guo, H.; Wu, P.; Dong, Y.; Wang, X.; Yang, J.; Zhang, H.; Zhu, H.; Liu, Z. Ego-R1: Agentic Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 1–16. [Google Scholar] [CrossRef] [PubMed]
- Peirone, S.A.; Pistilli, F.; Alliegro, A.; Tommasi, T.; Averta, G. Hier-EgoPack: Hierarchical Egocentric Video Understanding With Diverse Task Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 1917–1931. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.; Wu, J.; Li, W.; Li, B.; Ma, Z.; Liu, Z.; Li, C. Llava-video: Video instruction tuning with synthetic data. arXiv 2024, arXiv:2410.02713. [Google Scholar]
- Lin, B.; Ye, Y.; Zhu, B.; Cui, J.; Ning, M.; Jin, P.; Yuan, L. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 conference on empirical methods in natural language processing, 2024; pp. 5971–5984. [Google Scholar]
- Maaz, M.; Rasheed, H.; Khan, S.; Khan, F. Video-chatgpt: Towards detailed video understanding via large vision and language models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics 2024, Volume 1, 12585–12602. [Google Scholar] [CrossRef]
- Li, K.; He, Y.; Wang, Y.; Li, Y.; Wang, W.; Luo, P.; Wang, Y.; Wang, L.; Qiao, Y. Videochat: Chat-centric video understanding. Sci. China Inf. Sci. 2025, 68, 200102. [Google Scholar] [CrossRef]
- Wang, Y.; Li, X.; Yan, Z.; He, Y.; Yu, J.; Zeng, X.; Wang, C.; Ma, C.; Huang, H.; Gao, J.; et al. Internvideo2.5: Empowering video mllms with long and rich context modeling. arXiv 2025, arXiv:2501.12386. [Google Scholar]
- Shao, H.; Hu, Y.; Wang, L.; Song, G.; Waslander, S.L.; Liu, Y.; Li, H. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024; pp. 15120–15130. [Google Scholar]
- Yang, J.; Liu, S.; Guo, H.; Dong, Y.; Zhang, X.; Zhang, S.; Wang, P.; Zhou, Z.; Xie, B.; Wang, Z.; et al. Egolife: Towards egocentric life assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 28885–28900. [Google Scholar]
- Kim, M.J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P.; et al. Openvla: An open-source vision-language-action model, 2024. 1, 4. Available online: https://arxiv. [PubMed]
- Chen, J.; Lv, Z.; Wu, S.; Lin, K.Q.; Song, C.; Gao, D.; Liu, J.W.; Gao, Z.; Mao, D.; Shou, M.Z. Videollm-online: Online video large language model for streaming video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 18407–18418. [Google Scholar]
- Fu, S.; Yang, Q.; Li, Y.M.; Peng, Y.X.; Lin, K.Y.; Wei, X.; Hu, J.F.; Xie, X.; Zheng, W.S. ViSpeak: Visual Instruction Feedback in Streaming Videos. arXiv 2025, arXiv:2503.12769. [Google Scholar]
- Yang, A.; Nagrani, A.; Seo, P.H.; Miech, A.; Pont-Tuset, J.; Laptev, I.; Sivic, J.; Schmid, C. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023; pp. 10714–10726. [Google Scholar]
- Ataallah, K.; Shen, X.; Abdelrahman, E.; Sleiman, E.; Zhu, D.; Ding, J.; Elhoseiny, M. Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens. arXiv 2024, arXiv:2404.03413. [Google Scholar]
- Maaz, M.; Rasheed, H.; Khan, S.; Khan, F.S. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv 2023, arXiv:2306.05424. [Google Scholar]
- Li, K.; He, Y.; Wang, Y.; Li, Y.; Wang, W.; Luo, P.; Wang, Y.; Wang, L.; Qiao, Y. Videochat: Chat-centric video understanding. arXiv 2023, arXiv:2305.06355. [Google Scholar]
- Wang, Y.; Li, K.; Li, Y.; He, Y.; Huang, B.; Zhao, Z.; Zhang, H.; Xu, J.; Liu, Y.; Wang, Z.; et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv 2022, arXiv:2212.03191. [Google Scholar]
- Cheng, Z.; Leng, S.; Zhang, H.; Xin, Y.; Li, X.; Chen, G.; Zhu, Y.; Zhang, W.; Luo, Z.; Zhao, D.; et al. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. arXiv 2024, arXiv:2406.07476. [Google Scholar]
- Liu, Z.; Dong, Y.; Liu, Z.; Hu, W.; Lu, J.; Rao, Y. Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution. arXiv 2024, arXiv:2409.12961. [Google Scholar]
- Ren, S.; Yao, L.; Li, S.; Sun, X.; Hou, L. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 14313–14323. [Google Scholar]
- Zhang, H.; Wang, Y.; Tang, Y.; Liu, Y.; Feng, J.; Dai, J.; Jin, X. Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams. arXiv 2024, arXiv:2406.08085. [Google Scholar]
- Song, E.; Chai, W.; Ye, T.; Hwang, J.N.; Li, X.; Wang, G. Moviechat+: Question-aware sparse memory for long video question answering. arXiv 2024, arXiv:2404.17176. [Google Scholar]
- He, B.; Li, H.; Jang, Y.K.; Jia, M.; Cao, X.; Shah, A.; Shrivastava, A.; Lim, S.N. Ma-lmm: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 13504–13514. [Google Scholar]
- Wang, X.; Song, D.; Chen, S.; Zhang, C.; Wang, B. LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture. arXiv 2024, arXiv:cs. [Google Scholar]
- Xue, F.; Chen, Y.; Li, D.; Hu, Q.; Zhu, L.; Li, X.; Fang, Y.; Tang, H.; Yang, S.; Liu, Z.; et al. LongVILA: Scaling Long-Context Visual Language Models for Long Videos. arXiv 2024, arXiv:cs. [Google Scholar]
- Zhang, P.; Zhang, K.; Li, B.; Zeng, G.; Yang, J.; Zhang, Y.; Wang, Z.; Tan, H.; Li, C.; Liu, Z. Long context transfer from language to vision. arXiv 2024, arXiv:2406.16852. [Google Scholar]
- Li, F.; Zhang, R.; Zhang, H.; Zhang, Y.; Li, B.; Li, W.; Ma, Z.; Li, C. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv 2024, arXiv:2407.07895. [Google Scholar]
- Li, Q.; Chen, Z.; Wang, W.; Wang, W.; Ye, S.; Jin, Z.; Chen, G.; He, Y.; Gao, Z.; Cui, E.; et al. Omnicorpus: A unified multimodal corpus of 10 billion-level images interleaved with text. arXiv 2024, arXiv:2406.08418. [Google Scholar]
- Qian, R.; Dong, X.; Zhang, P.; Zang, Y.; Ding, S.; Lin, D.; Wang, J. Streaming long video understanding with large language models. Adv. Neural Inf. Process. Syst. 2024, 37, 119336–119360. [Google Scholar] [CrossRef]
- Zhou, X.; Arnab, A.; Buch, S.; Yan, S.; Myers, A.; Xiong, X.; Nagrani, A.; Schmid, C. Streaming dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 18243–18252. [Google Scholar]
- Gao, J.; Lian, Y.; Zhou, Z.; Fu, Y.; Wang, B. LiveChat: A large-scale personalized dialogue dataset automatically constructed from live streaming. arXiv 2023, arXiv:2306.08401. [Google Scholar]
- Wang, Y.; Meng, X.; Wang, Y.; Liang, J.; Wei, J.; Zhang, H.; Zhao, D. Videollm knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format. arXiv 2024, arXiv:2411.17991. [Google Scholar]
- Yang, Z.; Zhang, K.; Hu, Y.; Wang, B.; Qian, S.; Wen, B.; Yang, F.; Gao, T.; Dong, W.; Xu, C. Livestar: Live streaming assistant for real-world online video understanding. Adv. Neural Inf. Process. Syst. 2026, 38, 31266–31304. [Google Scholar]
- Di, S.; Yu, Z.; Zhang, G.; Li, H.; Zhong, T.; Cheng, H.; Li, B.; He, W.; Shu, F.; Jiang, H. Streaming video question-answering with in-context video kv-cache retrieval. arXiv 2025, arXiv:2503.00540. [Google Scholar]
- Huang, Z.; Li, X.; Li, J.; Wang, J.; Zeng, X.; Liang, C.; Wu, T.; Chen, X.; Li, L.; Wang, L. Online Video Understanding: OVBench and VideoChat-Online. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 3328–3338. [Google Scholar]
- Zhang, H.; Wang, Y.; Tang, Y.; Liu, Y.; Feng, J.; Jin, X. Flash-VStream: Efficient Real-Time Understanding for Long Video Streams. arXiv 2025, arXiv:2506.23825. [Google Scholar]
- Hong, L.; Liu, Z.; Chen, W.; Tan, C.; Feng, Y.; Zhou, X.; Guo, P.; Li, J.; Chen, Z.; Gao, S.; et al. LVOS: A Benchmark for Large-Scale Long-Term Video Object Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 946–961. [Google Scholar] [CrossRef] [PubMed]
- Yang, S.; Yu, W.; Yang, W.; Liu, X.; Tan, H.; Lan, L.; Xiao, N. WildVideo: Benchmarking LMMs for Understanding Video-Language Interaction. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 9330–9344. [Google Scholar] [CrossRef] [PubMed]
- Wu, J.; Liu, W.; Liu, Y.; Liu, M.; Nie, L.; Lin, Z.; Chen, C.W. A Survey on Video Temporal Grounding With Multimodal Large Language Model. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 1521–1541. [Google Scholar] [CrossRef] [PubMed]
- Lin, J.; Fang, Z.; Chen, C.; Wan, Z.; Luo, F.; Li, P.; Liu, Y.; Sun, M. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. arXiv 2024, arXiv:2411.03628. [Google Scholar]
- Li, Y.; Niu, J.; Miao, Z.; Ge, C.; Zhou, Y.; He, Q.; Dong, X.; Duan, H.; Ding, S.; Qian, R.; et al. Ovo-bench: How far is your video-llms from real-world online video understanding? Available online: https://arxiv.
- Yang, A.; Miech, A.; Sivic, J.; Laptev, I.; Schmid, C. Learning to Answer Visual Questions From Web Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3202–3218. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.; Wang, X.; Xiao, J.; Ji, W.; Chua, T.S. Transformer-Empowered Invariant Grounding for Video Question Answering. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 9510–9522. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Wei, P.; Han, W.; Zhu, S.C.; Fan, L. IntentQA: Intent Question Answering in Videos by Cognitive Context Reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 1–18. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Liao, Z.; Xiao, F.; Li, T.; Zhang, Q.; Zhao, H.; Niu, L.; Chen, G.; Zhang, L.; Jiang, C. Parse, Align and Aggregate: Graph-Driven Compositional Reasoning for Video Question Answering. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 5586–5603. [Google Scholar] [CrossRef] [PubMed]
- Chen, T.; Liu, H.; Wang, Y.; Chen, Y.; He, T.; Gan, C.; He, H.; Lin, W. MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 2628–2645. [Google Scholar] [CrossRef] [PubMed]
- Li, L.L.; Fang, J.; Xiao, J.; Yu, H.; Lv, C.; Xue, J.; Li, Z.; Chua, T.S. ADVersa: Abductive Driving Accident Video Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 6980–6998. [Google Scholar] [CrossRef] [PubMed]
- Dao, N.N.; Tran, A.T.; Tu, N.H.; Thanh, T.T.; Bao, V.N.Q.; Cho, S. A contemporary survey on live video streaming from a computation-driven perspective. ACM Comput. Surv. 2022, 54, 1–38. [Google Scholar] [CrossRef]
- Laghari, A.A.; Shahid, S.; Yadav, R.; Karim, S.; Khan, A.; Li, H.; Shoulin, Y. The state of art and review on video streaming. J. High Speed Netw. 2023, 29, 211–236. [Google Scholar] [CrossRef]
- Nguyen, T.; Bin, Y.; Xiao, J.; Qu, L.; Li, Y.; Wu, J.Z.; Nguyen, C.D.; Ng, S.K.; Tuan, L.A. Video-language understanding: A survey from model architecture, model training, and data perspectives. Proc. Find. Assoc. Comput. Linguist. ACL 2024, 2024, 3636–3657. [Google Scholar] [CrossRef]
- Wu, S.; Chen, J.; Lin, K.Q.; Wang, Q.; Gao, Y.; Xu, Q.; Xu, T.; Hu, Y.; Chen, E.; Shou, M.Z. Videollm-mod: Efficient video-language streaming with mixture-of-depths vision computation. Adv. Neural Inf. Process. Syst. 2024, 37, 109922–109947. [Google Scholar] [CrossRef]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
- Li, W.; Hu, B.; Shao, R.; Shen, L.; Nie, L. Lion-fs: Fast & slow video-language thinker as online video assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 3240–3251. [Google Scholar]
- Panchal, S.; Bhattacharyya, A.; Berger, G.; Mercier, A.; Böhm, C.; Dietrichkeit, F.; Pourreza, R.; Li, X.; Madan, P.; Lee, M.; et al. What to say and when to say it: Live fitness coaching as a testbed for situated interaction. Adv. Neural Inf. Process. Syst. 2024, 37, 75853–75882. [Google Scholar] [CrossRef]
- Zhang, Y.; Shi, C.; Wang, Y.; Yang, S. Eyes wide open: Ego proactive video-llm for streaming video. arXiv 2025, arXiv:2510.14560. [Google Scholar]
- Xia, J.; Chen, P.; Zhang, M.; Sun, X.; Zhou, K. Streaming Video Instruction Tuning. arXiv 2025, arXiv:2512.21334. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2017; pp. 2980–2988. [Google Scholar]
- Liu, Z.; Guo, L.; Li, H.; Zhen, R.; He, X.; Ji, R.; Ren, X.; Zhang, Y.; Lu, H.; Liu, J. Thinking in Streaming Video. arXiv 2026, arXiv:2603.12938. [Google Scholar]
- Chen, J.; Chen, Z.; Du, C.; He, M.; He, W.; Li, H.; Li, Q.; Liu, Z.; Ma, H.; Pan, X.; et al. StreamingClaw Technical Report. arXiv 2026, arXiv:2603.22120. [Google Scholar]
- Chen, J.; Zeng, Z.; Lin, Y.; Li, W.; Ma, Z.; Shou, M.Z. Livecc: Learning video llm with streaming speech transcription at scale. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 29083–29095. [Google Scholar]
- Zhang, Y.; Dong, X.L.; Lin, Z.; Madotto, A.; Kumar, A.; Damavandi, B.; Chai, J.; Moon, S. Proactive Assistant Dialogue Generation from Streaming Egocentric Videos. arXiv 2025, arXiv:2506.05904. [Google Scholar]
- Yang, Z.; Gao, C.; Liu, J.; Wu, P.; Pang, G.; Shou, M.Z. AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis. arXiv 2025, arXiv:2503.21904. [Google Scholar]
- Lin, J.; Tong, J.; Wu, H.; Zhang, J.; Liu, J.; Jin, X.; Shen, X. Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models. arXiv 2026, arXiv:2601.06843. [Google Scholar]
- Qian, J.; Du, H.; Nan, G.; Huang, S.; Yu, J.; Wang, H.; Chen, J.; Cai, M.; Yang, M.; Li, J.; et al. Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding. [CrossRef] [PubMed]
- Kim, J.; Kim, M.S.; Chung, J.; Cho, J.; Kim, J.; Kim, S.; Sim, G.; Yu, Y. Egospeak: learning when to speak for egocentric conversational agents in the wild. Proc. Find. Assoc. Comput. Linguist. NAACL 2025, 2025, 2990–3005. [Google Scholar] [CrossRef]
- Azad, S.; Vineet, V.; Rawat, Y.S. Streamready: Learning what to answer and when in long streaming videos. arXiv 2026, arXiv:2603.08620. [Google Scholar]
- Yan, W.; Dai, Y.; Ran, Q.; Li, H.; Lin, W.; Liao, H.; Xie, X.; Jin, T.; Lian, J. Proact-VL: A Proactive VideoLLM for Real-Time AI Companions. arXiv 2026, arXiv:2603.03447. [Google Scholar]
- Tian, X.; Li, W.; Xu, B.; Dong, H.; Wang, Y.; Shen, H. ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding. arXiv 2026, arXiv:2601.10323. [Google Scholar]
- Wang, H.; Feng, B.; Lai, Z.; Xu, M.; Li, S.; Ge, W.; Dehghan, A.; Cao, M.; Huang, P. StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant. arXiv 2025, arXiv:2505.05467. [Google Scholar]
- Qian, R.; Ding, S.; Dong, X.; Zhang, P.; Zang, Y.; Cao, Y.; Lin, D.; Wang, J. Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 24045–24055. [Google Scholar]
- Ding, X.; Wu, H.; Yang, Y.; Jiang, S.; Zhang, Q.; Bai, D.; Chen, Z.; Cao, T. Streammind: Unlocking full frame rate streaming video dialogue through event-gated cognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 13448–13459. [Google Scholar]
- Kim, J.; Lee, H.; Rehg, J.M.; Kim, M.; Ro, Y.M. STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding. arXiv 2026, arXiv:2603.27593. [Google Scholar]
- Kang, H.; Park, Y.; Yoo, Y.; Choi, Y.; Kim, S.J. Open-ended hierarchical streaming video understanding with vision language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 20715–20725. [Google Scholar]
- Mun, J.; Yang, L.; Ren, Z.; Xu, N.; Han, B. Streamlined dense video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019; pp. 6588–6597. [Google Scholar]
- Yang, H.; Tang, F.; Zhao, L.; An, X.; Hu, M.; Li, H.; Zhuang, X.; Lu, Y.; Zhang, X.; Swikir, A.; et al. Streamagent: Towards anticipatory agents for streaming video understanding. arXiv 2025, arXiv:2508.01875. [Google Scholar]
- Zheng, Y.; Ding, X.; Yang, Y.; Jiang, S.; Wu, H.; Zhang, Q.; Wang, W.; Cao, T.; Liu, Y. Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding. arXiv 2026, arXiv:2603.19054. [Google Scholar]
- Simons, D.J.; Rensink, R.A. Change blindness: Past, present, and future. Trends Cogn. Sci. 2005, 9, 16–20. [Google Scholar] [CrossRef] [PubMed]
- Yao, L.; Li, Y.; Wei, Y.; Li, L.; Ren, S.; Liu, Y.; Ouyang, K.; Wang, L.; Li, S.; Li, S.; et al. Timechat-online: 80% visual tokens are naturally redundant in streaming videos. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025; pp. 10807–10816. [Google Scholar]
- Zhang, K.; Yang, Z.; Wang, B.; Qian, S.; Xu, C. Querystream: Advancing streaming video understanding with query-aware pruning and proactive response. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026. [Google Scholar]
- Cai, W.; Zhang, H.; Huang, Y.; Sun, S.; Deng, J.; Xu, S.; Song, J.; Zhang, Z. Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing. arXiv 2026, arXiv:2603.22466. [Google Scholar]
- Guan, Y.; Yin, L.; Liang, D.; Ju, J.; Luo, Z.; Luan, J.; Liu, Y.; Bai, X. Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously. arXiv 2026, arXiv:2603.12262. [Google Scholar]
- Zhang, J.; Tong, J.; Lin, J.; Wu, H.; Sun, Y.; Ma, Y.; Shen, X. Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models. arXiv 2026, arXiv:2603.02872. [Google Scholar]
- Xie, Y.; He, B.; Wang, J.; Zheng, X.; Ye, Z.; Wu, Z. FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding. arXiv 2026, arXiv:2603.02096. [Google Scholar]
- Wang, L.; Jin, Z.; Hao, Y.; Chen, Y.; Liu, K.; Ao, Y.; Zhao, J. Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models. arXiv 2026, arXiv:2603.11896. [Google Scholar]
- Yan, Y.; Xu, J.; Di, S.; Wu, H.; Xie, W. Omnistream: Mastering perception, reconstruction and action in continuous streams. arXiv 2026, arXiv:2603.12265. [Google Scholar]
- Shi, B.; Fu, S.; Lian, L.; Ye, H.; Eigen, D.; Reite, A.; Li, B.; Kautz, J.; Han, S.; Chan, D.M.; et al. Attend before attention: Efficient and scalable video understanding via autoregressive gazing. arXiv 2026, arXiv:2603.12254. [Google Scholar]
- Wen, S.; Wang, Z.; Zhang, X.; Huang, L.; Wu, W. Eventmemagent: Hierarchical event-centric memory for online video understanding with adaptive tool use. arXiv 2026, arXiv:2602.15329. [Google Scholar]
- Zhang, Y.; Shi, C.; Yang, S. WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs. arXiv 2026, arXiv:2602.22142. [Google Scholar]
- Zhang, H.; Yang, S.; Fu, J.; Ng, S.K.; Qiu, X. HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding. arXiv 2026, arXiv:2601.14724. [Google Scholar]
- Wang, Y.; Liu, X.; Gui, X.; Lin, X.; Yang, B.; Liao, C.; Chen, T.; Zhang, L. Accelerating Streaming Video Large Language Models via Hierarchical Token Compression. arXiv 2025, arXiv:2512.00891. [Google Scholar]
- Zheng, N.; Huang, J.; Guo, Q.; Zhao, F. VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs. arXiv 2025, arXiv:2512.22226. [Google Scholar]
- Jin, X.; et al. StreamingAssistant: Efficient Visual Token Pruning. arXiv 2025, arXiv:2512.12560. [Google Scholar]
- Kim, D.; Yang, S.; Shin, W.; Kim, J.Y. V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval. arXiv 2025, arXiv:2512.12284. [Google Scholar]
- Ye, S.; Ouyang, B.; Qian, T.; Zeng, L.; Yuan, M.; Chu, X.; Hong, W.; Chen, X. Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding. arXiv 2025, arXiv:2512.07344. [Google Scholar]
- Wang, Y.; Liu, S.; Wang, D.; Xu, N.; Wan, G.; Zhang, H.; Zhao, D. MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning. arXiv 2025, arXiv:2512.06810. [Google Scholar]
- Patel, S.; Patel, D. CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding. arXiv 2025, arXiv:2511.13644. [Google Scholar]
- Chen, Y.; Bai, X.; Wang, Z.; Bai, C.; Dai, Y.; Lu, M.; Zhang, S. Streamkv: Streaming video question-answering with segment-based kv cache retrieval and compression. arXiv 2025, arXiv:2511.07278. [Google Scholar]
- Xu, R.; et al. StreamingVLM: Real-Time Understanding for Infinite Video Streams. arXiv 2025, arXiv:2510.09608. [Google Scholar]
- Chen, X.; Tao, K.; Shao, K.; Wang, H. Streamingtom: Streaming token compression for efficient video understanding. arXiv 2025, arXiv:2510.18269. [Google Scholar]
- Dorovatas, V.; Seifi, S.; Gupta, G.; Aljundi, R. Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs. arXiv 2025, arXiv:2510.17364. [Google Scholar]
- Sun, G.; Li, Y.; Wu, X.; Yang, Y.; Li, W.; Ma, Z.; Zhang, C. video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory. arXiv 2025, arXiv:2510.11129. [Google Scholar]
- Zeng, X.; Qiu, K.; Zhang, Q.; Li, X.; Wang, J.; Li, J.; Yan, Z.; Tian, K.; Tian, M.; Zhao, X.; et al. Streamforest: Efficient online video understanding with persistent event memory. arXiv 2025, arXiv:2509.24871. [Google Scholar]
- Yang, Y.; et al. StreamMem: Query-Agnostic KV Cache Memory. arXiv 2025, arXiv:2502. [Google Scholar]
- Zeng, R.; Mao, J.; Lai, M.; Phan, M.H.; Dong, Y.; Wang, W.; Chen, Q.; Hu, X. OVG-HQ: Online Video Grounding with Hybrid-modal Queries. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 21085–21096. [Google Scholar]
- Wei, M.; et al. StreamVLN: Streaming Vision-and-Language Navigation. arXiv 2025, arXiv:2507.05240. [Google Scholar]
- Kim, M.; et al. InfiniPot-V: Memory-Constrained KV Cache Compression. arXiv 2025, arXiv:2506.15745. [Google Scholar]
- Zhao, Z.; Wang, K.; Li, S.; Qian, R.; Lin, W.; Liu, H. CogStream: Context-guided Streaming Video Question Answering. arXiv 2025, arXiv:2506.10516. [Google Scholar]
- Ning, Z.; Liu, G.; Jin, Q.; Ding, W.; Guo, M.; Zhao, J. LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval. arXiv 2025, arXiv:2505.15269. [Google Scholar]
- Yan, Y.; Xu, J.; Di, S.; Liu, Y.; Shi, Y.; Chen, Q.; Li, Z.; Huang, Y.; Xie, W. Learning Streaming Video Representation via Multitask Training. arXiv 2025, arXiv:2504.20041. [Google Scholar]
- Chatterjee, D.; Remelli, E.; Song, Y.; Tekin, B.; Mittal, A.; Bhatnagar, B.; Camgöz, N.C.; Hampali, S.; Sauser, E.; Ma, S.; et al. Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding. arXiv 2025, arXiv:2504.13915. [Google Scholar]
- Li, R.; Tan, Y.; Shi, Y.; Shao, J. VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers. arXiv 2025, arXiv:2503.09387. [Google Scholar]
- Yang, Z.; Hu, Y.; Du, Z.; Xue, D.; Qian, S.; Wu, J.; Yang, F.; Dong, W.; Xu, C. SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Xiong, H.; Yang, Z.; Yu, J.; Zhuge, Y.; Zhang, L.; Zhu, J.; Lu, H. Streaming video understanding and multi-round interaction with memory-enhanced knowledge. arXiv 2025, arXiv:2501.13468. [Google Scholar]
- Fu, C.; Lin, H.; Wang, X.; Zhang, Y.F.; Shen, Y.; Liu, X.; Cao, H.; Long, Z.; Gao, H.; Li, K.; et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv 2025, arXiv:2501.01957. [Google Scholar]
- Liu, J.; Yu, Z.; Lan, S.; Wang, S.; Fang, R.; Kautz, J.; Li, H.; Alvarez, J.M. Streamchat: Chatting with streaming video. arXiv 2024, arXiv:2412.08646. [Google Scholar]
- Eyzaguirre, C.; Tang, E.; Buch, S.; Gaidon, A.; Wu, J.; Niebles, J.C. Streaming detection of queried event start. Adv. Neural Inf. Process. Syst. 2024, 37, 100698–100733. [Google Scholar] [CrossRef]
- Wang, Y.; Song, Y.; Xie, C.; Liu, Y.; Zheng, Z. Videollamb: Long streaming video understanding with recurrent memory bridges. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025; pp. 24170–24181. [Google Scholar]
- Yang, D.; Zhan, C.; Wang, Z.; Wang, B.; Ge, T.; Zheng, B.; Jin, Q. Synchronized video storytelling: Generating video narrations with structured storyline. arXiv 2024, arXiv:2405.14040. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Yang, Z.; Xue, D.; Qian, S.; Dong, W.; Xu, C. Ldre: Llm-based divergent reasoning and ensemble for zero-shot composed image retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024; pp. 80–90. [Google Scholar]
- Yang, Z.; Qian, S.; Xue, D.; Wu, J.; Yang, F.; Dong, W.; Xu, C. Semantic editing increment benefits zero-shot composed image retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024; pp. 1245–1254. [Google Scholar]
- Cai, M.; Tan, R.; Zhang, J.; Zou, B.; Zhang, K.; Yao, F.; Zhu, F.; Gu, J.; Zhong, Y.; Shang, Y.; et al. TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models. arXiv 2024, arXiv:2410.10818. [Google Scholar]
- Xun, S.; Tao, S.; Li, J.; Shi, Y.; Lin, Z.; Zhu, Z.; Yan, Y.; Li, H.; Zhang, L.; Wang, S.; et al. Rtv-bench: Benchmarking mllm continuous perception, understanding and reasoning through real-time video. arXiv 2025, arXiv:2505.02064. [Google Scholar]
- Shi, Y.; Zhao, Q.; Jiang, T.; Zeng, X.; Wang, Y.; Wang, L. RIVER: A Real-Time Interaction Benchmark for Video LLMs. arXiv 2026, arXiv:2603.03985. [Google Scholar]
- Lin, J.; Zhu, C.; Xu, R.; Mao, X.; Liu, X.; Wang, T.; Pang, J. Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding. arXiv 2025, arXiv:2507.07984. [Google Scholar]
- Wang, Y.; Li, Z.; Qian, T.; Zheng, H.; Wang, Z.; Fu, Y.; Wang, X. StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios. arXiv 2025, arXiv:2512.04451. [Google Scholar]
- Hu, Y.; Yang, Z.; Wang, S.; Qian, S.; Wen, B.; Yang, F.; Gao, T.; Xu, C. StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA. In Proceedings of the 33rd ACM International Conference on Multimedia, 2025; pp. 13464–13470. [Google Scholar]
- Lee, D.; Mukherjee, S.; Kveton, B.; Rossi, R.A.; Lai, V.D.; Yoon, S.; Bui, T.; Dernoncourt, F.; Bansal, M. StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos. arXiv 2025, arXiv:2512.01707. [Google Scholar]
- Wang, Y.; Meng, X.; Wang, Y.; Zhang, H.; Zhao, D. Proactivevideoqa: A comprehensive benchmark evaluating proactive interactions in video large language models. arXiv 2025, arXiv:2507.09313. [Google Scholar]
- Wang, Y.; Wang, Y.; Chen, B.; Wu, T.; Zhao, D.; Zheng, Z. Omnimmi: A comprehensive multi-modal interaction benchmark in streaming video contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025; pp. 18925–18935. [Google Scholar]
- Yang, Z.; Du, Z.; Qian, S.; Xu, C. Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets. arXiv 2026, arXiv:2606.07032. [Google Scholar]
- Yang, Z.; Zhang, K.; Qian, S.; Dong, W.; Xu, C. Don’t Pause: Streaming Video-Language Synchrony for Online Video Understanding. arXiv 2026, arXiv:2606.06991. [Google Scholar]
- Yang, Z.; Zhang, K.; Wang, B.; Qian, S.; Xu, C. LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams. arXiv 2026, arXiv:2606.17798. [Google Scholar]





| Survey | Year | Primary Focus | Streaming Setting | Online Interaction | Triggering Taxonomy | Streaming Benchmarks |
|---|---|---|---|---|---|---|
| Laghari et al. [64] | 2023 | Video streaming technologies (compression/protocols) | ▵ | ✗ | ✗ | ✗ |
| Dao et al. [63] | 2022 | Computation-driven live video streaming (systems/delivery) | ▵ | ✗ | ✗ | ✗ |
| Tang et al. [3] | 2025 | LLM-based video understanding (general survey) | ▵ | ▵ | ✗ | ✗ |
| Nguyen et al. [65] | 2024 | Video-language understanding (architecture/training/data) | ✗ | ✗ | ✗ | ▵ |
| Wu et al. [54] | 2026 | MLLM-based video temporal grounding | ✗ | ✗ | ✗ | ✗ |
| This survey | 2026 | Streaming video understanding with Video-LLMs | ✓ | ✓ | ✓ | ✓ |

| # | Model | Date | Venue | Category | Backbone | Scale | Training | GitHub |
|---|---|---|---|---|---|---|---|---|
| 1 | VST [96] | 2026.03 | ECCV | KV Cache Management | Qwen2.5-VL | 7B | SFT+RL | Link |
| 2 | TaYS [97] | 2026.03 | CVPR | KV Cache Management | Qwen2.5-VL | 7B | SFT+RL | Link |
| 3 | FluxMem [98] | 2026.03 | CVPR | Memory Summarization | Qwen2.5-VL | 7B | Training-free | Link |
| 4 | TWW [99] | 2026.03 | ECCV | Memory Summarization | Qwen3-VL | 8B | SFT | Link |
| 5 | OmniStream [100] | 2026.03 | arXiv | Computational Efficiency | DINOv3 | 7B | SFT | Link |
| 6 | AutoGaze [101] | 2026.03 | CVPR | Computational Efficiency | NVILA-8B-Video | 8B | SFT+RL | Link |
| 7 | STRIDE [87] | 2026.03 | arXiv | Classification Heads | Qwen3-VL | 2B | SFT | Link |
| 8 | ColorTrigger [95] | 2026.03 | CVPR | Visual Change | InternVL3.5 | 8B | Training-free | Link |
| 9 | StreamingClaw [74] | 2026.03 | arXiv | Token-Driven Triggering | / | / | SFT | Link |
| 10 | Em-Garde [91] | 2026.03 | arXiv | Classification Heads | Qwen2.5-VL | 7B | SFT+RL | Link |
| 11 | ThinkStream [73] | 2026.03 | ECCV | Token-Driven Triggering | Qwen2.5-VL | 3B | SFT+RL | Link |
| 12 | StreamReady [81] | 2026.03 | CVPR | Classification Heads | Qwen2-VL | 7B | SFT | N/A |
| 13 | Proact-VL [82] | 2026.03 | ICML | Classification Heads | Qwen2-VL | 7B | SFT | Link |
| 14 | EventMemAgent [102] | 2026.02 | arXiv | Memory Summarization | Qwen3-VL | 8B | SFT+RL | Link |
| 15 | WeaveTime [103] | 2026.02 | CVPR | Retrieval Augmented | LLaVA-OV | 7B | SFT | Link |
| 16 | ROMA [83] | 2026.01 | arXiv | Classification Heads | Qwen2.5-Omni | / | SFT | Link |
| 17 | QueryStream [94] | 2026.01 | ICLR | Visual Change | Qwen2.5-VL | 7B | Training-free | Link |
| 18 | HERMES [104] | 2026.01 | ACL | Memory Summarization | Qwen2.5-VL | 7B | Training-free | Link |
| 19 | STC [105] | 2025.12 | CVPR | Computational Efficiency | Qwen2-VL | 7B | Training-free | Link |
| 20 | VideoScaffold [106] | 2025.12 | arXiv | Memory Summarization | Vicuna | 7B | SFT | Link |
| 21 | Streamo [71] | 2025.12 | arXiv | Token-Driven Triggering | Qwen2.5-VL | 7B | SFT | Link |
| 22 | StreamingAssistant [107] | 2025.12 | arXiv | KV Cache Management | Qwen2.5-VL | 7B | Training-free | N/A |
| 23 | V-Rex [108] | 2025.12 | HPCA | Retrieval Augmented | Llama-3 | 8B | SFT | N/A |
| 24 | Venus [109] | 2025.12 | INFOCOM | Retrieval Augmented | Qwen2-VL | 7B | Training-free | N/A |
| 25 | ToM [79] | 2025.12 | arXiv | Classification Heads | Qwen2.5-VL | 3B | SFT+RL | N/A |
| 26 | MMDuet2 [110] | 2025.12 | ICLR | Token-Driven Triggering | Qwen2.5-VL | 3B | SFT+RL | Link |
| 27 | LiveStar [48] | 2025.11 | NeurIPS | Perplexity Validation | InternLM2.5 | 8B | SFT | Link |
| 28 | CacheFlow [111] | 2025.11 | arXiv | Retrieval Augmented | LLaVA-OV | 7B | Training-free | N/A |
| 29 | StreamKV [112] | 2025.11 | AAAI | Retrieval Augmented | LLaVA-OV | 7B | Training-free | Link |
| 30 | VideoLLM-EyeWO [70] | 2025.10 | NeurIPS | Token-Driven Triggering | Llama-3 | 8B | SFT | Link |
| 31 | StreamingVLM [113] | 2025.10 | arXiv | KV Cache Management | Qwen2.5-VL | 7B | SFT | Link |
| 32 | StreamingTOM [114] | 2025.10 | arXiv | Retrieval Augmented | LLaVA-OV | 7B | Training-free | Link |
| 33 | rLiVS [115] | 2025.10 | NeurIPS | Retrieval Augmented | LLaVA-OV | 7B | Training-free | N/A |
| 34 | video-SALMONN S [116] | 2025.10 | arXiv | Memory Summarization | Qwen3-VL | 8B | SFT+TTT | N/A |
| 35 | StreamForest [117] | 2025.09 | NeurIPS | Memory Summarization | Qwen2 | 7B | SFT | Link |
| 36 | OpenHOUSE [88] | 2025.09 | ICCV | Classification Heads | InternVL2 | 8B | SFT | N/A |
| 37 | StreamMem [118] | 2025.08 | arXiv | KV Cache Management | Qwen2.5-VL | 3B | Training-free | Link |
| 38 | StreamAgent [90] | 2025.08 | arXiv | Classification Heads | Qwen2.5-VL | 7B | SFT | N/A |
| 39 | OVG-HQ-Unify [119] | 2025.08 | ICCV | Memory Summarization | / | / | TTT | Link |
| 40 | StreamVLN [120] | 2025.07 | arXiv | KV Cache Management | Qwen2 | 7B | SFT | Link |
| 41 | InfiniPot-V [121] | 2025.06 | NeurIPS | KV Cache Management | Qwen2.5-VL | 7B | Training-free | Link |
| 42 | CogReasoner [122] | 2025.06 | arXiv | Retrieval Augmented | Qwen2.5 | 7B | SFT | Link |
| 43 | ProAssist [76] | 2025.06 | EMNLP | Token-Driven Triggering | Llama-3.1 | 8B | SFT | Link |
| 44 | Flash-VStream [51] | 2025.06 | ICCV | Memory Summarization | Qwen2-VL | 7B | SFT | Link |
| 45 | StreamBridge [84] | 2025.05 | NeurIPS | Classification Heads | Qwen2-VL | 7B | SFT | Link |
| 46 | LiveVLM [123] | 2025.05 | arXiv | Retrieval Augmented | LLaVA-OV | 7B | Training-free | N/A |
| 47 | TimeChat-Online [93] | 2025.04 | ACM MM | Visual Change | Qwen2.5-VL | 7B | SFT | Link |
| 48 | Streamformer [124] | 2025.04 | ICCV | Computational Efficiency | LLaVA-Next | 7B | SFT | Link |
| 49 | LiveCC [75] | 2025.04 | CVPR | Token-Driven Triggering | Qwen2-VL | 7B | SFT | Link |
| 50 | ProVideLLM [125] | 2025.04 | ICCV | Memory Summarization | Llama-3.1 | 8B | SFT | Link |
| 51 | ViSpeak [27] | 2025.03 | ICCV | Classification Heads | Qwen2.5 | 7B | SFT | Link |
| 52 | AssistPDA [77] | 2025.03 | arXiv | Token-Driven Triggering | Qwen2-VL | 2B | SFT | N/A |
| 53 | VideoScan [126] | 2025.03 | arXiv | Memory Summarization | LLaVA-Video | 7B | SFT | Link |
| 54 | LION-FS [68] | 2025.03 | CVPR | Token-Driven Triggering | Llama-3 | 8B | SFT | Link |
| 55 | StreamMind [86] | 2025.03 | ICCV | Classification Heads | VideoLLaMA2 | 8B | SFT | Link |
| 56 | ReKV [49] | 2025.03 | ICLR | Retrieval Augmented | LLaVA-OV | 7B | Training-free | Link |
| 57 | EgoSpeak [80] | 2025.02 | NAACL | Classification Heads | LSTR | 3B | SFT | Link |
| 58 | StreamingChat [127] | 2025.02 | ICLR | KV Cache Management | InternVL2 | 8B | SFT | Link |
| 59 | StreamChat [128] | 2025.01 | ICLR | Memory Summarization | LongVA | 7B | Training-free | Link |
| 60 | Dispider [85] | 2025.01 | CVPR | Classification Heads | Qwen2 | 7B | SFT | Link |
| 61 | VITA-1.5 [129] | 2025.01 | NeurIPS | Computational Efficiency | Qwen2 | 7B | SFT | Link |
| 62 | VideoChat-Online [50] | 2025.01 | CVPR | Memory Summarization | InternVL2 | 4B | SFT | Link |
| 63 | StreamChat [130] | 2024.12 | CVPR | Computational Efficiency | Qwen2.5 | 7B | SFT | Link |
| 64 | SDQES [131] | 2024.12 | NeurIPS | Computational Efficiency | EgoVideo | 7B | SFT | N/A |
| 65 | MMDuet [47] | 2024.11 | EMNLP | Classification Heads | LLaVA-OV | 7B | SFT | Link |
| 66 | VideoLLaMB [132] | 2024.09 | ICCV | Memory Summarization | Vicuna | 7B | SFT | Link |
| 67 | VideoLLM-MoD [66] | 2024.08 | NeurIPS | Token-Driven Triggering | Llama-3 | 8B | SFT | N/A |
| 68 | STREAM-VLM [69] | 2024.07 | NeurIPS | Token-Driven Triggering | LLaMA-2 | 7B | SFT | Link |
| 69 | VideoLLM-online [26] | 2024.06 | CVPR | Token-Driven Triggering | Llama-3 | 8B | SFT | Link |
| 70 | VideoStreaming [44] | 2024.05 | NeurIPS | Memory Summarization | Vicuna | 7B | SFT | N/A |
| 71 | VideoNarrator [133] | 2024.05 | ACL | Memory Summarization | Baichuan | 7B | SFT | Link |
| 72 | StreamingDVC [45] | 2024.04 | CVPR | Memory Summarization | T5-Base decoder | / | SFT | Link |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).