Keyframe Selection and Multimodal Fusion for Product Recognition in E-Commerce Live Streaming

Yichuan Zheng; Jin Shi; Wei Shen

doi:10.20944/preprints202605.0313.v1

Submitted: 05 May 2026

Posted: 06 May 2026


Abstract

Product recognition in e-commerce live streaming is hindered by rapid viewpoint changes, occlusions, motion blur, and inconsistencies between visual and spoken information. Existing approaches typically focus on individual components such as detection, OCR, or speech recognition, which limits their effectiveness in end-to-end scenarios. To address this problem, we propose an integrated framework that combines task-oriented keyframe selection with multimodal semantic fusion. The framework first uses D-FINE to localize product regions and then selects informative frames through one of two alternative strategies. Strategy A combines detection confidence with Laplacian-based sharpness, while Strategy B combines detection confidence with a learned image-quality score estimated by an EfficientNetV2-based model. OCR, visual recognition, and ASR are then applied to the selected data, and a Qwen-Plus large language model integrates the multimodal evidence into structured product outputs. Experiments on an in-house dataset show clear gains over a last-frame baseline. Strategy A increases Perfect Match Rate from 58.00% to 80.00% and Product Name Recognition Accuracy from 78.00% to 98.00%. Strategy B achieves 77.00% and 98.00%, respectively. Ablation studies further show that the full multimodal framework consistently outperforms unimodal and dual-modality variants. In addition, Top-K analysis indicates that single-frame inference provides a good balance between performance and efficiency. Overall, the proposed framework offers an effective and practical solution for product recognition in complex live-streaming scenarios.
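As an illustration of the kind of scoring Strategy A describes, the following minimal sketch ranks candidate product crops by a weighted sum of detection confidence and Laplacian-variance sharpness. The abstract does not specify how the two terms are combined, so the weight `alpha`, the min-max normalization, and the function names `laplacian_variance` and `rank_frames` are all assumptions made for this example, not the authors' implementation. Detector output is assumed to be available already as grayscale crops with one confidence value each.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian response; higher means sharper."""
    g = gray.astype(np.float64)
    # Discrete Laplacian evaluated on interior pixels only (no padding).
    lap = (
        g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
        - 4.0 * g[1:-1, 1:-1]
    )
    return float(lap.var())


def rank_frames(crops, confidences, alpha=0.5, k=1):
    """Return the indices of the top-k crops and all combined scores.

    alpha is a hypothetical weight between detection confidence and
    sharpness; the source does not state the actual weighting.
    """
    sharp = np.array([laplacian_variance(c) for c in crops])
    conf = np.asarray(confidences, dtype=np.float64)
    # Min-max normalise sharpness to [0, 1] so it is comparable to confidence.
    span = sharp.max() - sharp.min()
    sharp_norm = (sharp - sharp.min()) / span if span > 0 else np.zeros_like(sharp)
    scores = alpha * conf + (1.0 - alpha) * sharp_norm
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order], scores


# Usage: a high-contrast checkerboard crop versus a flat (blurred-looking) crop.
checker = (np.indices((8, 8)).sum(axis=0) % 2) * 255
flat = np.full((8, 8), 128)
top, scores = rank_frames([flat, checker], confidences=[0.9, 0.9], k=1)
```

With equal detection confidence, the sharper crop ranks first. Setting `k=1` corresponds to the single-frame setting that the Top-K analysis favors.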

Keywords: e-commerce live streaming; product recognition; keyframe selection; multimodal fusion; large language models; object detection; image quality assessment; information extraction

Subject: Computer Science and Mathematics - Computer Vision and Graphics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
