Submitted:
26 June 2026
Posted:
30 June 2026
Abstract
Keywords:
1. Introduction
Research Contributions
- A four-stage, phase-coherent audio processing pipeline specifically designed for Quranic audio preprocessing.
- A domain-specific spectral shaping gain curve, derived from the frequency energy profile of clean Quranic reference audio, that preserves vowel formants in the 1000-2000 Hz band and fricative energy above 4000 Hz.
- The first application of Griffin-Lim phase reconstruction to Quranic voice enhancement, showing that phase coherence is the main factor influencing perceptual naturalness after magnitude-domain processing.
- Three Quran-specific evaluation measures, Formant Clarity (FC), Spectral Contrast (SC), and Energy Ratio (ER), that reflect phoneme-level signal characteristics pertinent to Qira’at classification.
- Results from a controlled synthetic experiment and 48 real room recordings that demonstrate a consistent improvement over three well-established baseline techniques.

Figure 1. Block diagram of the proposed four-stage processing pipeline. The input is a single-channel reverberant Quranic recording; the output is an enhanced recording with preserved phonetic characteristics.

2. Related Work
2.1. Quranic Recitation Classification
2.2. Classical Speech Dereverberation
2.3. Deep Learning and Diffusion-Based Dereverberation
2.4. Griffin-Lim Phase Reconstruction
2.5. Research Gap
| Ref. | Method | Application | Limitation | Metric |
|---|---|---|---|---|
| [1] | MFCC + SVM | Reciter ID | Clean audio only; no reverb robustness | Acc. |
| [2] | CNN-BiGRU + CTC | Quranic ASR | No reverb preprocessing | WER |
| [13] | Whisper-based ASR | Quranic recognition | Assumes clean audio input | WER |
| [4] | DNN Acoustic | Recitation assist. | Degrades under room acoustics | WER |
| [7] | Spectral Sub. | Generic denoising | Musical noise; damages fricatives | SNR |
| [9] | WPE Dereverb. | Generic dereverb | Destroys FC; fails on real rooms | PESQ/STOI |
| [18] | SGMSE+ Diffusion | Speech enhancement | Large corpus; domain mismatch | PESQ |
| [19] | StoRM (Hybrid) | Speech enhancement | High compute; not domain-specific | PESQ/STOI |
| Prop. | 4-Stage pipeline | Qira’at classif. | - | FC, SC, ER |
3. Materials and Methods

3.1. Stage 1: Domain-Adapted Spectral Shaping
3.2. Stage 2: Adaptive Noise Floor Subtraction

3.3. Stage 3: Soft Voice Activity Detection with Power-Law Decay

3.4. Stage 4: Griffin-Lim Phase Reconstruction

3.5. Algorithm Parameters
4. Results
4.1. Dataset
4.2. Baseline Methods
- Spectral Subtraction [7]: noise estimated from the first 20 STFT frames; ; window 512; hop 128.
- Wiener Filter [8]: MMSE-optimal gain computed from an initial silence segment; same STFT parameters.
- WPE Dereverberation [9]: nara_wpe package; taps; ; 5 iterations; window 512.
- Reverberant Only: unprocessed signal serving as the degradation baseline.
- MetricGAN+ [25]: a GAN-based deep learning speech enhancement network pretrained on the VoiceBank-DEMAND dataset. No fine-tuning was performed, because domain-specific paired data for Quranic recitations will not be available in real-world applications.
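
The Spectral Subtraction baseline in the list above can be illustrated with a minimal sketch that operates on an STFT magnitude matrix. Only the choice of the first 20 frames as the noise reference comes from the text; the function name `spectral_subtraction`, the subtraction factor `alpha`, and the `floor` value are hypothetical and chosen for illustration, so this should not be read as the exact baseline configuration.

```python
import numpy as np


def spectral_subtraction(mag, n_noise_frames=20, alpha=1.0, floor=0.0):
    """Subtract a noise magnitude estimate from an STFT magnitude matrix.

    mag has shape (n_bins, n_frames). alpha and floor are illustrative
    defaults (assumptions), not values reported in the paper.
    """
    mag = np.asarray(mag, dtype=float)
    # Noise spectrum: per-bin mean magnitude over the leading frames.
    noise = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
    cleaned = mag - alpha * noise
    # Never let a bin drop below floor * original magnitude.
    return np.maximum(cleaned, floor * mag)


# Toy input: 257 bins (window 512), constant noise plus one "speech" burst.
mag = np.full((257, 100), 0.5)
mag[40:60, 50:70] += 2.0
out = spectral_subtraction(mag)
```

Clamping at zero is what tends to leave isolated residual peaks, which is the "musical noise" limitation listed for this method in the related-work table.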
4.3. Evaluation Metrics
4.3.1. Standard Speech Quality Metrics
4.3.2. Quranic-Specific Metrics
- Energy Ratio (ER, dB) is the ratio between the average energy of active frames and the average energy of silent frames, where the two groups are separated by the 30th-percentile threshold. A high ER value indicates clear word boundaries, which also makes automatic segmentation easier.
- Spectral Contrast (SC) is the average spectral contrast over six frequency sub-bands, computed with librosa. Higher values correlate with phoneme separability and indicate improved discrimination between harmonic and noise components.
- Formant Clarity (FC) measures the temporal stability of the first two vowel formants (F1, F2), which are extracted per frame using Linear Predictive Coding (LPC) of order 12 [28]. The LPC coefficients for each frame are obtained with the Levinson-Durbin recursion, and F1 and F2 are taken as the two lowest-frequency roots of the LPC polynomial within the range 50-8000 Hz. FC is calculated from the mean and standard deviation of each formant frequency across frames. The larger the FC value, the more stable and clear the formant frequencies, which is necessary for Qira’at discrimination.
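
The ER definition and the formant-extraction step behind FC can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the 30th-percentile threshold is assumed to be applied to per-frame energy, the frame length and hop are taken from the parameter table, and no analysis window is applied inside `lpc_formants`. The sketch stops at the per-frame formant estimates and does not compute the FC score itself.

```python
import numpy as np


def energy_ratio_db(x, frame_len=1024, hop=256, percentile=30.0, eps=1e-12):
    """Active-to-silent frame energy ratio in dB (percentile on frame energy: assumption)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    energy = np.sum(x[idx] ** 2, axis=1)
    thr = np.percentile(energy, percentile)
    # If every frame has the same energy, `active` is empty and the result is NaN.
    active, silent = energy[energy > thr], energy[energy <= thr]
    return 10.0 * np.log10((active.mean() + eps) / (silent.mean() + eps))


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation lags r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    if err <= 0.0:
        return a  # silent frame: no prediction possible
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_formants(frame, sr=16000, order=12, fmin=50.0, fmax=8000.0):
    """Return the two lowest LPC root frequencies (Hz) inside (fmin, fmax)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    roots = np.roots(levinson_durbin(r, order))
    roots = roots[np.imag(roots) > 0]  # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2.0 * np.pi))
    return freqs[(freqs > fmin) & (freqs < fmax)][:2]


sr = 16000
t = np.arange(2 * sr) / sr
# One second of a very quiet tone followed by one second of a loud tone.
x = np.where(t < 1.0, 0.01, 1.0) * np.sin(2 * np.pi * 200.0 * t)
print("ER (dB):", energy_ratio_db(x))

# Impulse response of a synthetic two-resonance all-pole filter (700 and 1200 Hz).
rho = 0.97
a_true = np.array([1.0])
for f in (700.0, 1200.0):
    theta = 2.0 * np.pi * f / sr
    a_true = np.convolve(a_true, [1.0, -2.0 * rho * np.cos(theta), rho ** 2])
h = np.zeros(1024)
for n in range(len(h)):
    acc = 1.0 if n == 0 else 0.0
    for j in range(1, min(n, 4) + 1):
        acc -= a_true[j] * h[n - j]
    h[n] = acc
print("Formants (Hz):", lpc_formants(h, sr=sr, order=4))
```

The synthetic check uses order 4 because the test filter has exactly four poles. With order 12 on real speech, the extra roots can fall at low frequencies, so selecting "the two lowest" roots may pick up spurious poles unless bandwidth or magnitude constraints are added.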
4.4. Controlled Experiment: Synthetic Reverberation

4.5. Real Room Evaluation: 48 Recordings

4.6. Feature Discriminability Experiment
4.7. Qira’at Classification Validation
4.8. Statistical Significance

5. Discussion
5.1. Quranic-Specific Metrics
5.2. Standard Metric Performance

5.3. Iterative Development
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| ER | Energy Ratio |
| FC | Formant Clarity |
| FFT | Fast Fourier Transform |
| MFCC | Mel-Frequency Cepstral Coefficient |
| PESQ | Perceptual Evaluation of Speech Quality |
| SC | Spectral Contrast |
| SI-SDR | Scale-Invariant Signal-to-Distortion Ratio |
| SNR | Signal-to-Noise Ratio |
| STFT | Short-Time Fourier Transform |
| STOI | Short-Time Objective Intelligibility |
| VAD | Voice Activity Detection |
| WPE | Weighted Prediction Error |
References
- Alkhateeb, J.H. A Machine Learning Approach for Recognizing the Holy Quran Reciter. Int. J. Adv. Comput. Sci. Appl. 2020, 11.
- Harere, A.A.; Jallad, K.A. Quran Recitation Recognition using End-to-End Deep Learning. arXiv 2023, arXiv:eess.
- Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366.
- Ghori, A.F.; Waheed, A.; Waqas, M.; Mehmood, A.; Ali, S.A. Acoustic modelling using deep learning for Quran recitation assistance. Int. J. Speech Technol. 2022, 26, 113–121.
- Shakeel, M.A.; Khattak, H.A.; Khurshid, N. Deep acoustic modelling for Quranic Recitation – current solutions and future directions. PSI TIR J. 2024, 20, 61–73.
- Kuttruff, H. Room Acoustics, 5th ed.; Spon Press: London, UK, 2009.
- Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
- Nakatani, T.; Yoshioka, T.; Kinoshita, K.; Miyoshi, M.; Juang, B.H. Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 1717–1731.
- Griffin, D.W.; Lim, J.S. Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 236–243.
- Al-Kharusi, M.H.; Hayat, K.; Ruqeishi, K.B.A.; Lone, H.R. A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation. arXiv 2025, arXiv:cs.
- Al Ajmi, S.A.; Hayat, K.; Al Obaidi, A.M.; Kumar, N.; Najim AL-Din, M.S.; Magnier, B. Faked speech detection with zero prior knowledge. Discov. Appl. Sci. 2024, 6, 288.
- Khan, R.U.; Qamar, A.M.; Hadwan, M. Quranic Reciter Recognition: A Machine Learning Approach. Adv. Sci. Technol. Eng. Syst. J. 2019, 4, 173–176.
- Alshboul, M.; Al Muaitah, A.R.; Al-Issa, S.; Al-Ayyoub, M. Enhanced Neural Speech Recognition of Quranic Recitations via a Large Audio Model. Appl. Sci. 2025, 15.
- Al-Ayyoub, M.; Damer, N.A.; Hmeidi, I. Using deep learning for automatically determining correct application of basic quranic recitation rules. Int. Arab J. Inf. Technol. 2018, 15, 620–625.
- Kinoshita, K.; Delcroix, M.; Yoshioka, T.; Nakatani, T.; Habets, E.; Haeb-Umbach, R.; Leutnant, V.; Sehr, A.; Kellermann, W.; Maas, R.; et al. The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA); Institute of Electrical and Electronics Engineers Inc., 2013; pp. 1–4.
- Drude, L.; Heitkaemper, J.; Böddeker, C.; Haeb-Umbach, R. SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition. CoRR 2019, abs/1910.13934.
- Wang, H.; Pandey, A.; Wang, D. A systematic study of DNN based speech enhancement in reverberant and reverberant-noisy environments. Comput. Speech Lang. 2025, 89, 101677.
- Richter, J.; Welker, S.; Lemercier, J.M.; Lay, B.; Gerkmann, T. Speech Enhancement and Dereverberation With Diffusion-Based Generative Models. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2351–2364.
- Lemercier, J.M.; Richter, J.; Welker, S.; Gerkmann, T. StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2724–2737.
- Wang, Z.; Wichern, G.; Roux, J.L. Convolutive Prediction for Monaural Speech Dereverberation and Noisy-Reverberant Speaker Separation. CoRR 2021, abs/2108.07376.
- Rosenbaum, T.; Winebrand, E.; Cohen, O.; Cohen, I. Deep-Learning Framework for Efficient Real-Time Speech Enhancement and Dereverberation. Sensors 2025, 25.
- Perraudin, N.; Balázs, P.; Søndergaard, P.L. A fast Griffin-Lim algorithm. In Proceedings of the 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013; pp. 1–4.
- Scheibler, R.; Bezzam, E.; Dokmanic, I. Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2018; pp. 351–355.
- Allen, J.; Berkley, D. Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950.
- Fu, S.; Yu, C.; Hsieh, T.; Plantinga, P.; Ravanelli, M.; Lu, X.; Tsao, Y. MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement. CoRR 2021, abs/2104.03538.
- Rix, A.W.; Hollier, M.; Hekstra, A.P.; Beerends, J.G. Perceptual Evaluation of Speech Quality (PESQ), the new ITU standard for end-to-end speech quality assessment. Part I – Time alignment. 2001.
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136.
- Makhoul, J. Linear prediction: A tutorial review. Proc. IEEE 1975, 63, 561–580.
- Williamson, D.S.; Wang, D. Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1492–1501.
| Freq. Band (Hz) | Gain | Energy | Acoustic Rationale |
|---|---|---|---|
| <100 | 0.05 | <0.1% | Sub-bass noise; no phonetic content |
| 100-300 | 0.10 | ∼3% | Primary reverberation energy source |
| 300-500 | 0.18 | ∼8% | Secondary reverberation tail |
| 500-1000 | 0.35 | ∼37% | Voice body; moderate preservation |
| 1000-2000 | 0.75 | ∼51% | Vowel formants; high preservation (Madd/Harakaat) |
| 2000-4000 | 0.15 | ∼1% | Reverberation tail attenuation |
| 4000-16000 | 0.35 | <1% | Fricative consonants; preserved |
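
A minimal sketch of how the band values in this table could be turned into a per-bin gain vector for Stage 1 is given below. It reads the second column as the linear gain per band and assumes a piecewise-constant curve with no smoothing between bands; both readings are assumptions, and the names `BANDS`, `shaping_gains`, and `apply_shaping` are hypothetical. The FFT size and sample rate follow the algorithm parameter table.

```python
import numpy as np

# (lower edge Hz, upper edge Hz, gain), copied from the table rows.
BANDS = [
    (0, 100, 0.05),
    (100, 300, 0.10),
    (300, 500, 0.18),
    (500, 1000, 0.35),
    (1000, 2000, 0.75),
    (2000, 4000, 0.15),
    (4000, 16000, 0.35),
]


def shaping_gains(n_fft=1024, sr=16000):
    """Return (bin frequencies, piecewise-constant gain per rFFT bin)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    gains = np.zeros_like(freqs)
    for lo, hi, g in BANDS:
        gains[(freqs >= lo) & (freqs < hi)] = g
    return freqs, gains


def apply_shaping(mag, gains):
    """Scale an STFT magnitude matrix of shape (n_bins, n_frames) bin by bin."""
    return np.asarray(mag, dtype=float) * gains[:, None]


freqs, gains = shaping_gains()
shaped = apply_shaping(np.ones((len(freqs), 3)), gains)
```

At a 16,000 Hz sample rate the highest bin sits at 8,000 Hz, so only the lower part of the 4000-16000 Hz band is actually populated in this sketch.
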
| Parameter | Value | Purpose |
|---|---|---|
| FFT size | 1,024 | High freq. resolution (15.6 Hz/bin) |
| Hop length | 256 | 16 ms temporal resolution |
| Window function | Hann | Minimise spectral leakage |
| Sample rate | 16,000 Hz | Standard speech processing rate |
| Noise percentile | 15th | Quietest frames as noise reference |
| Noise subtraction factor | 2.0 | Balanced noise reduction |
| Spectral floor | 0.001 | Prevents frequency bin suppression |
| Gate minimum | 0.10 | Preserves whispered consonants |
| VAD threshold | 30th pct. | Speech/silence boundary |
| Gap fill duration | 220 ms | Prevents mid-word fragmentation |
| Power decay exponent (p) | 2.0 | Natural word-boundary fade |
| Decay duration (D) | 45 frames | 720 ms consonant resonance |
| Smoothing window | 11 frames | Gradual gate transitions |
| Griffin-Lim iterations | 10 | Phase convergence |
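
To show how several of these parameters might interact, the sketch below implements one plausible reading of Stage 2 (noise floor subtraction) and the decay part of Stage 3 (soft gate). Since the stage descriptions are not reproduced here, the exact formulas are assumptions: the noise spectrum is taken as the mean magnitude of frames at or below the 15th energy percentile, the floor is applied relative to the input magnitude, and the gate fades as `(1 - k/D)^p` over `D` frames after the last active frame. The names `subtract_noise_floor`, `soft_gate`, `alpha`, `beta`, and `g_min` are hypothetical; gap filling and Griffin-Lim reconstruction are not shown.

```python
import numpy as np


def subtract_noise_floor(mag, pct=15.0, alpha=2.0, beta=0.001):
    """Subtract alpha times a noise spectrum estimated from the quietest frames."""
    mag = np.asarray(mag, dtype=float)
    energy = np.sum(mag ** 2, axis=0)
    quiet = energy <= np.percentile(energy, pct)  # never empty
    noise = mag[:, quiet].mean(axis=1, keepdims=True)
    # Assumed floor: no bin falls below beta times its input magnitude.
    return np.maximum(mag - alpha * noise, beta * mag)


def soft_gate(active, decay_frames=45, p=2.0, g_min=0.10, smooth=11):
    """Per-frame gain: 1 while active, power-law fade to g_min afterwards."""
    active = np.asarray(active, dtype=bool)
    gate = np.empty(active.shape, dtype=float)
    since = decay_frames  # frames elapsed since the last active frame
    for t, is_active in enumerate(active):
        since = 0 if is_active else min(since + 1, decay_frames)
        fade = (1.0 - since / decay_frames) ** p
        gate[t] = g_min + (1.0 - g_min) * fade
    # Moving-average smoothing; edge padding keeps the ends at their own level.
    padded = np.pad(gate, smooth // 2, mode="edge")
    return np.convolve(padded, np.ones(smooth) / smooth, mode="valid")


# Toy magnitude matrix: constant noise plus one burst of "speech".
mag = np.full((513, 150), 0.5)
mag[40:60, 20:50] += 2.0
cleaned = subtract_noise_floor(mag)

# Frames 20..49 are active; the gate then fades over 45 frames (720 ms at 16 ms/frame).
active = np.zeros(150, dtype=bool)
active[20:50] = True
gate = soft_gate(active)
gated = cleaned * gate[None, :]
```

With a hop of 256 samples at 16,000 Hz each frame advances 16 ms, which is how the 45-frame decay corresponds to the 720 ms listed in the table.
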
| Metric | Reverb. | Spec. Sub. | WPE | Proposed |
|---|---|---|---|---|
| SNR (dB) | −3.239 | −3.234 | −3.239 | −1.874★ |
| PESQ (ITU-T) | +2.187 | +2.191 | +1.116 | +2.495★ |
| STOI (IEEE) | +0.466 | +0.467★ | +0.174 | +0.466 |
| Energy Ratio (dB) | +14.906 | +15.054 | +38.193★ | +28.959 |
| Total wins / 4 | 0 | 0 | 1 | 3 |
| Metric | Orig. | S.Sub. | Wien. | WPE | MetricGAN+ | Proposed |
|---|---|---|---|---|---|---|
| SNR (dB) | −1.968 | −1.962 | −1.968 | −0.400★ | −1.968 | −0.839 |
| PESQ | +1.199 | +1.130 | +1.187 | +1.070 | +1.199 | +1.251★ |
| STOI | +0.117 | +0.114 | +0.115 | +0.087 | +0.117 | +0.107 |
| ER (dB) | +12.336 | +12.422 | +12.338 | +19.092 | +12.336 | +19.579★ |
| SI-SDR (dB) | −51.545 | −51.026★ | −51.477 | −53.284 | −51.545 | −51.908 |
| SC | +29.748 | +35.078 | +34.047 | +16.784 | +29.748 | +40.109★ |
| FC | +473.416 | +507.341 | +508.441 | +148.293 | +473.416 | +822.942★ |
| Total wins | 0 | 1 | 0 | 1 | 0 | 4 |
| Method | KNN Accuracy (%) | Intra-class Compactness |
|---|---|---|
| Original | 75.0 | 8.010 |
| Spectral Sub. | 75.0 | 7.965 |
| Wiener Filter | 72.9 | 8.019 |
| WPE Dereverb. | 70.8 | 7.366★ |
| Proposed | 77.1 ★ | 7.576 |
| Condition | Accuracy |
|---|---|
| Clean | 96.1% |
| Reverberant | 94.1% |
| After Proposed Pipeline | 96.1% |
| Metric | Proposed (Mean±SD) | Baseline (Mean±SD) | p-value | Sig. |
|---|---|---|---|---|
| ER (dB) | | | <0.001 | *** |
| SC | | | <0.001 | *** |
| FC | | | <0.001 | *** |
| PESQ | | | <0.05 | * |
| SNR (dB) | | | <0.001 | *** |
| Ver. | Key Change | Quality | Problem | Action |
|---|---|---|---|---|
| V1 | Basic STFT + hard gate | Poor (6.0) | Metallic voice | Add Griffin-Lim |
| V2 | +Griffin-Lim (10 iter.) | Good (8.5) | Sharp word ends | Extend decay |
| V3 | Decay: 20→45 frames | V.Good (9.5) | Consonant clips | Raise gate minimum |
| Final | Gate minimum: 0.05→0.10 | Exc. (9.5+) | None | - |
| Parameter | Initial | Final | Effect |
|---|---|---|---|
| HF gain (>4 kHz) | 0.02 | 0.35 | Restored fricative clarity |
| Noise subtraction factor | 3.0 | 2.0 | Eliminated metallic artefacts |
| Gate type | Hard | Soft | Removed click artefacts |
| Gap fill duration | 120 ms | 220 ms | Prevented mid-word cuts |
| Power exponent | 4.0 | 2.0 | Natural decay shape |
| FFT size | 512 | 1024 | Improved frequency resolution |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).