Submitted: 05 May 2026
Posted: 06 May 2026
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Overview of Related Work
2.2. Review of Visual Object Detection Models
2.3. Video-Level Product Recognition and Multimodal Understanding
2.4. Keyframe Selection and Multi-Frame Strategies
2.5. Image Quality Assessment (IQA)
2.6. Large Language Models and Multimodal Fusion
3. Method
3.1. Problem Definition
3.2. Overall Architecture
- Frame extraction & scoring: Frames are sampled from the input video at a fixed frame rate, and a quality score is computed for each frame (e.g., a combination of detection confidence and Laplacian sharpness). According to the scoring strategy, a single frame or multiple candidate frames are selected for downstream processing.
- OCR pipeline: Text recognition is performed on the candidate frames to obtain raw textual content and internal OCR confidence scores. The raw text, confidence scores, and positional information are provided as input to an LLM, which is instructed to return normalized and structured outputs, such as brand name, product name, product category, and specifications.
- ASR pipeline: The system performs speech recognition on the entire audio stream of the video to obtain a complete transcription of spoken content. This transcription does not include temporal or frame-level alignment information; instead of localizing specific video segments, it serves as semantic evidence that assists in understanding the products presented in the video. The LLM then parses and standardizes the raw ASR transcription to produce structured information, including potentially mentioned product names, brands, specifications, and their semantic confidence scores. At this stage, the LLM effectively conducts a semantic-level preliminary judgment based on spoken descriptions, forming textual evidence units parallel to the visual modality.
- Visual recognition with Qwen-VL: For each product region detected in the candidate frames, Qwen-VL is invoked to perform image-based recognition and description. The model input consists of cropped image regions and task-specific prompts, and the output includes product category, brand, specifications, and confidence scores, forming standardized structured evidence units.
- Evidence aggregation and fusion LLM: Structured outputs from the three pipelines are aggregated as contextual input and provided to the fusion LLM with explicit fusion instructions. The fusion LLM produces the final decision, including the product name, brand, and product category.
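The "structured evidence units" produced by the three pipelines can be pictured as small records that are serialized into the context of the fusion LLM. The following is a minimal sketch under assumptions: the `EvidenceUnit` schema, its field names, the `build_fusion_context` helper, and the sample values are all hypothetical and are not taken from the paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class EvidenceUnit:
    """One structured evidence unit from a single pipeline (hypothetical schema)."""
    modality: str                  # "ocr", "asr", or "vision" (labels are assumptions)
    brand: Optional[str]
    product_name: Optional[str]
    category: Optional[str]
    specification: Optional[str]
    confidence: float              # pipeline-reported confidence, assumed to lie in [0, 1]


def build_fusion_context(units: List[EvidenceUnit]) -> str:
    """Serialize all evidence units as JSON text to place in the fusion prompt."""
    payload = {"evidence": [asdict(u) for u in units]}
    return json.dumps(payload, ensure_ascii=False, indent=2)


# Made-up example values, for illustration only.
units = [
    EvidenceUnit("ocr", "BrandX", "Green Tea 500 ml", "beverage", "500 ml", 0.91),
    EvidenceUnit("asr", "BrandX", "green tea", None, None, 0.62),
    EvidenceUnit("vision", None, "bottled tea", "beverage", None, 0.74),
]
context = build_fusion_context(units)
```

Keeping missing fields as explicit `null` values, rather than dropping them, lets the fusion step see which modality was silent on which field.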
3.3. Frame Scoring Strategy
3.3.1. Baseline: Fixed-Position Strategy
3.3.2. Strategy A: Traditional Visual Feature–Based Scoring
3.3.3. Strategy B: Deep Learning–Based Scoring
3.4. Frame Selection and Implementation Details
3.4.1. Frame Selection Pipeline
- A detector is applied to each frame to obtain product bounding boxes and their confidence scores.
- For each detected bounding box, either a sharpness score is computed (Strategy A) or the cropped region is fed into the regression model (Strategy B).
- A frame-level composite score is calculated for each frame.
- The frame with the highest composite score is selected as the final recognition frame.
- The selected frame is then passed to the downstream OCR and multimodal fusion modules.
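The selection steps above can be sketched as follows for the sharpness-based variant (Strategy A). This is an illustration under stated assumptions, not the authors' implementation: it assumes one detected box per frame, uses the variance of a 4-neighbour Laplacian as the sharpness score, min-max normalizes sharpness within a video, and reuses the 0.7 / 0.3 weights reported in the Experiment 2 setup. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

W_DET, W_QUALITY = 0.7, 0.3  # weights reported in the Experiment 2 setup

Gray = Sequence[Sequence[float]]  # grayscale crop as rows of pixel intensities


def laplacian_variance(gray: Gray) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels (sharpness proxy)."""
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def min_max(values: List[float]) -> List[float]:
    """Rescale to [0, 1]; all zeros when every value is identical (assumed convention)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_frame(frames: List[Tuple[float, Gray]]) -> int:
    """frames holds (detection confidence, grayscale crop); returns the best frame index."""
    quality = min_max([laplacian_variance(crop) for _, crop in frames])
    scores = [W_DET * det + W_QUALITY * q for (det, _), q in zip(frames, quality)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: a flat (blurred-looking) crop versus a high-contrast checkerboard crop.
flat = [[10.0] * 4 for _ in range(4)]
checker = [[255.0 * ((x + y) % 2) for x in range(4)] for y in range(4)]
best = select_frame([(0.90, flat), (0.85, checker)])
```

In this toy case the sharper crop wins despite its slightly lower detection confidence, which is the behaviour a composite score is meant to allow.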
3.4.2. Top-K Multi-Frame Fusion Strategy
3.5. Multimodal Fusion and Final Matching
3.5.1. Multimodal Information Extraction
3.5.2. LLM-Based Fusion Mechanism
1. Consistency First: When multiple modalities provide identical or highly consistent information, such results are prioritized to enhance the reliability of the final output.
2. Reliability Ranking: Modalities are fused according to their confidence-based reliability, with the priority order defined as OCR (packaging text) > image recognition > ASR (speech), ensuring that highly reliable evidence dominates the final decision.
3. Conflict Resolution: In cases where outputs from different modalities conflict, the result with higher confidence and stronger consistency with OCR evidence is preferred.
4. Fault-Tolerant Completion: When information from certain modalities is missing, the model generates a complete judgment based on the available modalities, thereby maintaining the stability and robustness of the system output.
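In the paper these four principles are given to an LLM as fusion instructions. To make the priority logic concrete, the sketch below is a deterministic, rule-based stand-in for a single field. It is an assumption-laden simplification: "consistent" is reduced to exact match after lowercasing, the tie-break order (agreement count, then modality priority, then confidence) is one possible reading of the rules, and `fuse_field` and the modality labels are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Priority stated in the text: OCR > image recognition > ASR.
PRIORITY = {"ocr": 0, "vision": 1, "asr": 2}

Candidate = Tuple[str, Optional[str], float]  # (modality, value, confidence)


def fuse_field(candidates: List[Candidate]) -> Optional[str]:
    """Pick one normalized (lowercased) value for a field, or None if nothing is available."""
    # Fault-tolerant completion: ignore modalities that returned nothing.
    present = [(m, v.strip().lower(), c) for m, v, c in candidates if v and v.strip()]
    if not present:
        return None

    # Group modalities by the value they support.
    votes: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for modality, value, conf in present:
        votes[value].append((modality, conf))

    def rank(value: str) -> Tuple[int, int, float]:
        supporters = votes[value]
        return (
            -len(supporters),                         # consistency first
            min(PRIORITY[m] for m, _ in supporters),  # reliability ranking
            -max(c for _, c in supporters),           # higher confidence breaks remaining ties
        )

    return min(votes, key=rank)


agree = fuse_field([("ocr", "BrandX", 0.90), ("asr", "brandx", 0.60), ("vision", "BrandY", 0.95)])
conflict = fuse_field([("ocr", "A", 0.50), ("vision", "B", 0.90)])
missing = fuse_field([("ocr", None, 0.0), ("asr", "BrandZ", 0.50)])
```

An LLM can additionally treat near-synonyms and partial matches as consistent, which this exact-match stand-in cannot do.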
3.5.3. Temperature Parameter and Inference Stability
3.6. Data
3.6.1. Data Sources and Overall Statistics
3.6.2. Annotation Protocol and Workflow
3.6.3. Data Augmentation and Preprocessing
3.6.4. Pseudo-Labeling Pipeline for EfficientNet Training
3.6.5. Subset Sampling for Comparative Experiments
3.7. Models and Training
3.7.1. D-FINE Detector
3.7.2. EfficientNetV2
4. Experiments
4.1. Model Training Results and Analysis
4.1.1. D-FINE Training Results and Performance Analysis
4.1.2. EfficientNetV2 Training Results and Performance Analysis
4.2. Experimental Metrics and Statistical Methods
4.2.1. Experiment 1: OCR Information Quantity Metrics
4.2.2. Experiment 2: Frame Quality Evaluation Metrics
4.2.3. Experiment 3: Recognition Accuracy Metrics and Threshold Setting
1. Field-Level Accuracy: Semantic similarity between recognition results and ground-truth annotations is calculated for each field, with the following definitions:
   - Brand Recognition Accuracy: proportion of samples whose brand similarity reaches the brand threshold;
   - Product Name Recognition Accuracy: proportion of samples whose product-name similarity reaches the product-name threshold;
   - Category Recognition Accuracy: proportion of samples whose category similarity reaches the category threshold.
2. Video-Level Accuracy: Based on the matching results of the three fields, two video-level evaluation metrics are defined:
   - Perfect Match Rate: proportion of videos whose brand, product-name, and category similarities all reach their respective thresholds simultaneously;
   - Semantic Similarity: average semantic similarity of the three fields, namely brand, product name, and category.
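The metric definitions above can be written down compactly once per-field similarity scores are available. The sketch below is illustrative only: the threshold values are placeholders picked from the ranges swept in the threshold table and are not claimed to be the authors' final setting, the `>=` comparison is an assumption, and the function names and sample scores are hypothetical.

```python
from typing import Dict, List

FIELDS = ("brand", "product_name", "category")

# Placeholder thresholds (assumption): values within the swept ranges, not the paper's setting.
THRESHOLDS = {"brand": 0.70, "product_name": 0.50, "category": 0.50}

Sims = List[Dict[str, float]]  # one dict of per-field similarities per video


def field_accuracy(sims: Sims, field: str, threshold: float) -> float:
    """Share of samples whose similarity for one field reaches the threshold."""
    return sum(s[field] >= threshold for s in sims) / len(sims)


def perfect_match_rate(sims: Sims, thresholds: Dict[str, float]) -> float:
    """Share of videos where all three fields reach their thresholds at once."""
    hits = sum(all(s[f] >= thresholds[f] for f in FIELDS) for s in sims)
    return hits / len(sims)


def mean_semantic_similarity(sims: Sims) -> float:
    """Average over videos of the mean similarity across the three fields."""
    return sum(sum(s[f] for f in FIELDS) / len(FIELDS) for s in sims) / len(sims)


# Made-up similarity scores for two videos.
sims = [
    {"brand": 0.90, "product_name": 0.80, "category": 0.70},
    {"brand": 0.60, "product_name": 0.55, "category": 0.90},
]
brand_acc = field_accuracy(sims, "brand", THRESHOLDS["brand"])
pmr = perfect_match_rate(sims, THRESHOLDS)
avg_sim = mean_semantic_similarity(sims)
```

Because the Perfect Match Rate requires all three fields to pass, it can never exceed the lowest of the three field-level accuracies.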
4.3. Experiment 1: Comparison Between Single-Frame and Top-K Multi-Frame Fusion
4.3.1. Experimental Setup
1. Single-Frame Strategy: Directly select the single frame with the highest composite quality score as the OCR input;
2. Top-3 Fusion Strategy: Select the 3 frames with the highest quality scores and fuse their OCR outputs;
3. Top-5 Fusion Strategy: Select the 5 frames with the highest quality scores and fuse their OCR outputs.
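The three settings differ only in how many top-ranked frames feed the OCR stage. A minimal sketch follows, assuming (this is not stated in the surviving text) that fusion takes the union of recognized text lines and keeps the highest confidence seen for each line. `top_k_indices`, `fuse_ocr`, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

OcrLine = Tuple[str, float]  # (recognized text, confidence)


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring frames, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def fuse_ocr(frames_ocr: List[List[OcrLine]]) -> List[OcrLine]:
    """Union of text lines across frames, keeping the highest confidence per line."""
    best: Dict[str, float] = {}
    for lines in frames_ocr:
        for text, conf in lines:
            key = text.strip()
            if key and conf > best.get(key, -1.0):
                best[key] = conf
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Made-up frame scores and per-frame OCR results.
scores = [0.2, 0.9, 0.5, 0.7]
ocr_per_frame = [
    [("noise", 0.30)],
    [("BrandX", 0.80), ("500 ml", 0.60)],
    [("500 ml", 0.90)],
    [("BrandX", 0.95)],
]
selected = top_k_indices(scores, 3)
fused = fuse_ocr([ocr_per_frame[i] for i in selected])
```

With `k = 1` the same code reduces to the single-frame strategy, so the only cost of larger K is running OCR on more frames.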
4.3.2. Experimental Results
4.3.3. Results Analysis
4.4. Experiment 2: Comparison of Frame Quality Between Strategy A and Strategy B
4.4.1. Experimental Setup
1. Input Data Preparation: Randomly sample 50 videos from 442 product-related videos as the test set to ensure diversity in scenes and content.
2. Frame Selection Stage: Apply Strategy A and Strategy B separately to score and rank frames for each video, selecting the highest-scoring single frame as the representative frame for each strategy. The weight parameters of both strategies are kept consistent, with detection confidence weighted at 0.7 and quality score weighted at 0.3, to ensure fairness in the comparison.
3. OCR Recognition Stage: Use the same OCR model (PaddleOCR) to perform text recognition on the selected representative frames, extracting recognized text and corresponding confidence scores. This experiment evaluates only the frame performance in the OCR stage, without incorporating ASR or image recognition modules. This is because the core objective of the frame scoring strategy is to select clear frames with sufficient information, and its effectiveness should primarily be reflected through the recognition performance of visible text.
4. Results Statistics and Analysis: Perform statistical analysis on the OCR recognition results of both strategies across all test videos, including metrics such as average confidence, recognized text quantity, proportion of high-confidence text, and frame index differences.
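The per-frame statistics and the per-video comparison between the two strategies can be sketched as below. The 0.7 high-confidence cut-off follows the metric definition given for this experiment; the use of average OCR confidence as the improvement criterion and the tolerance `tol` for calling two results equivalent are assumptions, as are the function names and sample values.

```python
from statistics import mean
from typing import Dict, List

HIGH_CONF = 0.7  # "confidence no less than 0.7" in the metric definitions


def frame_stats(confidences: List[float]) -> Dict[str, float]:
    """Summarize OCR confidences of the text entries found in one selected frame."""
    if not confidences:
        return {"avg_conf": 0.0, "high_conf_ratio": 0.0, "n_text": 0}
    high = sum(c >= HIGH_CONF for c in confidences)
    return {
        "avg_conf": mean(confidences),
        "high_conf_ratio": high / len(confidences),
        "n_text": len(confidences),
    }


def compare(avg_a: List[float], avg_b: List[float], tol: float = 1e-3) -> Dict[str, int]:
    """Count videos where Strategy B improves, degrades, or ties relative to Strategy A."""
    counts = {"improvement": 0, "degradation": 0, "equivalent": 0}
    for a, b in zip(avg_a, avg_b):
        if b - a > tol:
            counts["improvement"] += 1
        elif a - b > tol:
            counts["degradation"] += 1
        else:
            counts["equivalent"] += 1
    return counts


# Made-up numbers for three videos.
stats = frame_stats([0.9, 0.7, 0.5])
outcome = compare([0.80, 0.70, 0.90], [0.85, 0.70, 0.60])
```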
4.4.2. Experimental Results
- Automatic Evaluation Results
  - (1) OCR Confidence Performance
  - (2) Text Recognition Quality
  - (3) Improvement and Degradation Performance
  - (4) Time Cost
- Human Evaluation Results
4.4.3. Results Analysis
- (1) Improved OCR Confidence and Stability
- (2) Avoidance of Extremely Low-Quality Frames
- (3) Human Evaluation Validation
4.5. Experiment 3: End-to-End Recognition Accuracy Comparison
4.5.1. Experimental Setup
1. Baseline: Directly use the last frame of the video for recognition;
2. Strategy A: Perform end-to-end recognition using frame selection Strategy A, as evaluated in Experiment 2;
3. Strategy B: Perform end-to-end recognition using frame selection Strategy B, as evaluated in Experiment 2.
4.5.2. Experimental Results
4.5.3. Results Analysis
- (1) Overall Performance Analysis
- (2) Analysis of Why Strategy B Underperforms Strategy A in the Full Pipeline
4.6. Experiment 4: Ablation Study on Multimodal Evidence Fusion
4.6.1. Experimental Setup
- OCR only: Uses only textual information from keyframes.
- OCR+ASR: Adds speech transcripts to supplement visual text.
- OCR+Qwen-VL: Combines text with visual-semantic features from the VLM.
- OCR+ASR+Qwen-VL: The proposed full multimodal framework.
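One simple way to realize these four settings is to run the same fusion step while masking out the evidence of disabled modalities. The sketch below assumes evidence is held as dictionaries with a `modality` key; that representation, the modality labels, and the `mask_evidence` helper are hypothetical.

```python
from typing import Dict, FrozenSet, List

# The four ablation settings, expressed as sets of enabled modalities.
ABLATIONS: Dict[str, FrozenSet[str]] = {
    "OCR only": frozenset({"ocr"}),
    "OCR+ASR": frozenset({"ocr", "asr"}),
    "OCR+Qwen-VL": frozenset({"ocr", "vision"}),
    "OCR+ASR+Qwen-VL": frozenset({"ocr", "asr", "vision"}),
}


def mask_evidence(evidence: List[dict], enabled: FrozenSet[str]) -> List[dict]:
    """Keep only evidence units whose modality is enabled for this ablation run."""
    return [e for e in evidence if e["modality"] in enabled]


# Made-up evidence for one video.
evidence = [
    {"modality": "ocr", "brand": "BrandX"},
    {"modality": "asr", "brand": "BrandX"},
    {"modality": "vision", "brand": None},
]
per_setting = {name: mask_evidence(evidence, enabled) for name, enabled in ABLATIONS.items()}
```

Masking at the evidence level keeps the fusion prompt and decoding settings identical across runs, so differences in the results can be attributed to the modalities themselves.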
4.6.2. Experimental Results
4.6.3. Results Analysis
5. Conclusions
- The Top-K multi-frame fusion strategy improves information completeness: the average number of recognized OCR characters rises from 17 (single frame) to 25 (Top-5), and the average number of high-confidence characters from 15 to 20. Its computational cost, however, increases substantially with larger K, with average processing time growing from 32.51 s to 155.52 s per video. Considering both efficiency and recognition accuracy, the single-frame strategy offers higher engineering feasibility in resource-constrained and practical deployment scenarios.
- Strategy B outperforms Strategy A in OCR confidence and frame selection stability, more effectively avoiding the selection of extremely low-quality frames. However, in the final end-to-end evaluation, Strategy A still achieves better overall recognition performance, with a higher Perfect Match Rate (80.00% vs. 77.00%) and Semantic Similarity (82.38% vs. 80.40%) than Strategy B. This result reflects a trade-off between “global image quality” and “task-relevant information completeness.” Based on these findings, this paper recommends Strategy A as the default single-frame strategy for practical deployment, while Strategy B or Top-K multi-frame fusion can be used as supplements in scenarios with looser efficiency constraints or higher requirements for recognition confidence.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Category | Parameter | Value |
|---|---|---|
| Model | Backbone | HGNetv2-B2 (pretrained) |
| | Hidden dimension | 256 |
| | Decoder layers | 4 |
| | Number of queries | 300 |
| | Number of classes | 2 (product, background) |
| Training | Epochs | 132 |
| | Batch size | 16 |
| | Learning rate | (backbone: ) |
| | Optimizer | AdamW (, weight decay ) |
| | LR schedule | MultiStepLR (milestone=500, ) + linear warmup |
| | Warmup steps | 500 |
| | AMP | Enabled |
| | EMA | Enabled (decay=0.9999) |
| Data | Input size | |
| | Data augmentation | PhotometricDistort, RandomZoomOut, RandomIoUCrop, Flip |
| | Stop epoch | 120 |
| Loss | Loss weights | vfl: 1.0, bbox: 5.0, giou: 2.0, fgl: 0.15, ddf: 1.5 |
| Evaluation | Metrics | COCO mAP (IoU 0.5–0.95), AP50, AP75 |

| Category | Parameter | Value |
|---|---|---|
| Model | Backbone | EfficientNetV2-M (pretrained on ImageNet-1K) |
| | Input resolution | |
| | Parameters | M |
| | Dropout rate | 0.3 |
| Training | Batch size | 16 |
| | Epochs | 20 |
| | Learning rate | |
| | Optimizer | Adam |
| | Weight decay | |
| Loss | Function | MSE (regression) |
| LR schedule | Scheduler | ReduceLROnPlateau (factor=0.5, patience=3, min lr ) |
| Data | Split | 80% / 20% |
| Augmentation | Methods | Flip (50%), brightness/contrast (), Gaussian noise (20%) |
| Training strategy | Early stopping | 5 epochs |

| Metric | Value | Description |
|---|---|---|
| mAP (IoU 0.5–0.95) | 27.76% | COCO standard mean Average Precision over IoU thresholds from 0.5 to 0.95 |
| AP50 | 40.36% | Average Precision at IoU = 0.5 |
| AP75 | 28.25% | Average Precision at IoU = 0.75 |
| AR (max 100 detections) | 44.14% | Average Recall with up to 100 detections per image |
| Final training loss | 20.71 | Loss value at the end of training (epoch 131) |

| Metric | Value | Description |
|---|---|---|
| Validation MSE | 0.00220 | Mean Squared Error |
| Validation MAE | 0.03667 | Mean Absolute Error |
| Validation RMSE | 0.04687 | Root Mean Squared Error |
| Pearson correlation | 0.9323 | Linear correlation between predicted scores and ground-truth scores |
| Best epoch | 17 | Epoch with the lowest validation loss |

| Metric | Description |
|---|---|
| Avg. number of characters | Average number of characters recognized per video |
| Avg. confidence | Mean confidence score of all recognized characters |
| High-confidence characters | Average number of characters per video whose confidence reaches the high-confidence threshold |
| Processing time | Average processing time per video from input to OCR completion |

| Metric | Definition | Role |
|---|---|---|
| OCR Average Confidence | Average confidence of all recognized characters in the selected frames | Reflects the clarity and readability of text regions |
| Detection Confidence | Average confidence of detection boxes from D-FINE | Measures the detectability of targets in the image and overall quality |
| Proportion of High-Confidence Text | Proportion of text with confidence no less than 0.7 | Evaluates the proportion of high-quality recognition results |
| OCR Text Quantity | Total number of recognized text entries | Reflects the completeness of information extraction |
| Processing Time | Total time from video input to output of selected frames | Reflects the overall runtime efficiency of the algorithm |

| Variable | Threshold | Perfect Match Rate (%) | Corresponding Field Accuracy (%) |
|---|---|---|---|
| Brand | 0.60 | 81.00 | 81.00 |
| Brand | 0.65 | 80.00 | 80.00 |
| Brand | 0.70 | 80.00 | 80.00 |
| Brand | 0.75 | 78.00 | 78.00 |
| Brand | 0.80 | 74.00 | 74.00 |
| Product Name | 0.40 | 80.00 | 100.00 |
| Product Name | 0.45 | 80.00 | 99.00 |
| Product Name | 0.50 | 80.00 | 98.00 |
| Product Name | 0.55 | 79.00 | 97.00 |
| Product Name | 0.60 | 75.00 | 90.00 |
| Product Name | 0.65 | 72.00 | 84.00 |
| Product Name | 0.70 | 67.00 | 76.00 |
| Category | 0.40 | 80.00 | 99.00 |
| Category | 0.45 | 80.00 | 99.00 |
| Category | 0.50 | 80.00 | 99.00 |
| Category | 0.55 | 74.00 | 88.00 |
| Category | 0.60 | 66.00 | 77.00 |

| Metric | Single Frame (K = 1) | Top-3 Frames | Top-5 Frames |
|---|---|---|---|
| Average Number of Characters | 17 | 20 | 25 |
| Average Confidence | 0.7593 | 0.8897 | 0.8850 |
| Average Number of High-Confidence Characters | 15 | 18 | 20 |
| Average Processing Time (s) | 32.51 | 93.11 | 155.52 |

| | Improvement | Degradation | Equivalent |
|---|---|---|---|
| Number of Samples | 24 (48.0%) | 11 (22.0%) | 15 (30.0%) |

| | Strategy A | Strategy B | Score Improvement (B relative to A) |
|---|---|---|---|
| Average Quality Score | 3.43 | 3.73 | +0.30 (8.75%) |

| Metric | Baseline | Strategy A | Strategy B |
|---|---|---|---|
| Perfect Match Rate | 58.00% | 80.00% | 77.00% |
| Product Name Recognition Accuracy | 78.00% | 98.00% | 98.00% |
| Semantic Similarity | 68.75% | 82.38% | 80.40% |

| Modality | Perfect Match Rate | Product Name Recognition Accuracy | Semantic Similarity |
|---|---|---|---|
| OCR only | 70.00% | 92.00% | 78.41% |
| OCR+ASR | 75.00% | 97.00% | 80.72% |
| OCR+Qwen-VL | 76.00% | 96.00% | 81.09% |
| Full Framework | 80.00% | 98.00% | 82.38% |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).