Streaming Transformer Networks: Unified Hearing-to-Speech Recognition and Intelligent Text Generation Systems

This work introduces an architecture that processes real-time audio streams to produce both synthesized speech and contextually intelligent text, overcoming traditional limitations of multimodal AI systems. Conventional speech recognition models often operate offline, requiring the full audio sequence before producing results, which hinders interactive applications. The proposed transformer-based framework unifies hearing-to-speech translation (directly converting input audio into natural-sounding speech) with advanced text generation, enabling seamless dual-mode responses in conversational agents. By adapting transformers for streaming via causal attention and triggered attention mechanisms, the system achieves low-latency inference while maintaining high fidelity in prosody preservation and semantic coherence. Key innovations include shared encoder layers for efficiency, hybrid decoding paths for modality-specific outputs, and joint optimization across diverse objectives such as word-error-rate minimization and perceptual quality enhancement. Evaluations on standard benchmarks demonstrate latency under 200 ms and error rates rivalling non-streaming baselines, paving the way for deployment in voice assistants, live captioning, and real-time dialogue systems. This unified approach not only reduces model complexity but also advances end-to-end learning for dynamic audio-to-multimodal generation tasks.
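The causal attention mentioned above is the mechanism that makes streaming possible: each frame may attend only to itself and earlier frames, so outputs for a prefix of the audio never change when later frames arrive. The following is a minimal sketch of this property in NumPy (the function name and toy dimensions are illustrative, not from the paper):

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask: position t may
    attend only to positions <= t, which is what enables streaming."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                         # (T, T) similarities
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                                # hide future frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))                               # toy frame features

# Prefix invariance: outputs for the first 3 frames are identical whether
# or not the remaining frames have been observed yet.
out_full = causal_attention(x, x, x)
out_prefix = causal_attention(x[:3], x[:3], x[:3])
print(np.allclose(out_full[:3], out_prefix))  # True
```

Because of this invariance, a streaming decoder can commit to outputs frame by frame instead of waiting for the full utterance, which is what keeps latency low.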