This paper introduces a transformer-driven pipeline that integrates acoustic hearing, automatic speech transcription, and writing synthesis in a single end-to-end framework. Starting from raw audio captured via microphones, the pipeline converts signals into spectrogram representations and applies stacked transformer encoders with multi-head self-attention to extract contextualized phonetic and prosodic features. These features feed a sequence-to-sequence transcription module in which cross-attention aligns auditory patterns with linguistic tokens, yielding robust speech-to-text conversion in noisy environments and across diverse accents. Beyond transcription, a generative decoder synthesizes structured written outputs, such as summaries, reports, or formatted notes, refining transcripts through autoregressive language modelling while preserving the semantic content and stylistic nuances of the original speech. Experiments on the LibriSpeech and Common Voice benchmarks show word error rates up to 25% lower than RNN baselines and improved fluency on synthesis metrics such as BLEU. Because the architecture is parallelizable, the pipeline supports real-time operation, making it well suited to assistive technologies, live captioning, and automated documentation. This work highlights the versatility of transformers in bridging auditory perception and textual production, pointing toward scalable multimodal AI systems.
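As a rough illustration of the front end described above (spectrogram extraction followed by multi-head self-attention over frames), the following NumPy sketch shows the mechanics. It is not the paper's implementation: the frame length, hop size, model dimension, and random weight matrices standing in for learned parameters are all illustrative assumptions.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    # Frame the waveform, apply a Hann window, take the magnitude FFT
    # per frame (toy parameters; a real front end would use log-mel bins).
    frames = np.stack([signal[i:i + frame_len] * np.hanning(frame_len)
                       for i in range(0, len(signal) - frame_len + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=-1))        # (T, frame_len//2 + 1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, n_heads, rng):
    # x: (T, d_model); random projections stand in for learned weights.
    T, d = x.shape
    dh = d // n_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Split into heads: (n_heads, T, dh).
    split = lambda m: m.reshape(T, n_heads, dh).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)    # (n_heads, T, T)
    ctx = softmax(scores) @ v                          # (n_heads, T, dh)
    # Concatenate heads and apply the output projection.
    return ctx.transpose(1, 0, 2).reshape(T, d) @ Wo   # (T, d_model)

rng = np.random.default_rng(0)
wave = rng.standard_normal(4000)                       # 0.25 s at 16 kHz (toy)
spec = spectrogram(wave)                               # (30, 129)
d_model = 128
proj = rng.standard_normal((spec.shape[1], d_model)) / np.sqrt(spec.shape[1])
feats = multi_head_self_attention(spec @ proj, n_heads=8, rng=rng)
```

Each output row in `feats` is a contextualized frame representation: every spectrogram frame attends to every other, which is what lets the encoder capture prosodic context across the whole utterance rather than a fixed local window, and what makes the computation parallel over time steps.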