This paper introduces a transformer-driven pipeline that integrates acoustic hearing, automatic speech transcription, and writing synthesis in a single end-to-end framework. Starting from raw audio captured via microphones, the pipeline converts signals into spectrogram representations and applies stacked transformer encoders with multi-head self-attention to extract contextualized phonetic and prosodic features. These features feed a sequence-to-sequence transcription module in which cross-attention aligns auditory patterns with linguistic tokens, yielding robust speech-to-text conversion in noisy environments and across diverse accents. Beyond transcription, a generative decoder synthesizes structured written outputs, such as summaries, reports, or formatted notes, refining transcripts through autoregressive language modelling while preserving the semantic content and stylistic nuances of the original speech. Experiments on the LibriSpeech and Common Voice benchmarks show word error rates up to 25% lower than RNN baselines and improved fluency on synthesis metrics such as BLEU. Because the architecture is parallelizable, the pipeline supports real-time operation, making it well suited to assistive technologies, live captioning, and automated documentation. This work highlights the versatility of transformers in bridging auditory perception and textual production, pointing toward scalable multimodal AI systems.
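As a rough illustration of the front end described above (spectrogram extraction followed by multi-head self-attention over frames), the following NumPy sketch shows the mechanics. It is not the paper's implementation: the frame length, hop size, model dimension, and random weight matrices standing in for learned parameters are all illustrative assumptions.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    # Frame the waveform, apply a Hann window, take the magnitude FFT
    # per frame (toy parameters; a real front end would use log-mel bins).
    frames = np.stack([signal[i:i + frame_len] * np.hanning(frame_len)
                       for i in range(0, len(signal) - frame_len + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=-1))        # (T, frame_len//2 + 1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, n_heads, rng):
    # x: (T, d_model); random projections stand in for learned weights.
    T, d = x.shape
    dh = d // n_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Split into heads: (n_heads, T, dh).
    split = lambda m: m.reshape(T, n_heads, dh).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)    # (n_heads, T, T)
    ctx = softmax(scores) @ v                          # (n_heads, T, dh)
    # Concatenate heads and apply the output projection.
    return ctx.transpose(1, 0, 2).reshape(T, d) @ Wo   # (T, d_model)

rng = np.random.default_rng(0)
wave = rng.standard_normal(4000)                       # 0.25 s at 16 kHz (toy)
spec = spectrogram(wave)                               # (30, 129)
d_model = 128
proj = rng.standard_normal((spec.shape[1], d_model)) / np.sqrt(spec.shape[1])
feats = multi_head_self_attention(spec @ proj, n_heads=8, rng=rng)
```

Each output row in `feats` is a contextualized frame representation: every spectrogram frame attends to every other, which is what lets the encoder capture prosodic context across the whole utterance rather than a fixed local window, and what makes the computation parallel over time steps.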