Submitted:
24 April 2025
Posted:
25 April 2025
Abstract
Keywords:
1. Introduction
1.1. Analysing Tonal Structures
1.2. Data Availability
1.3. Research Questions
Is the output of pitch extraction models accurate enough for music-historical research, specifically for studying tonal structures in the 16th and 17th centuries?
- What are suitable state-of-the-art audio pitch extraction models?
- Which pitch extraction model results in pitch (class) profiles most similar to those extracted from symbolic encodings?
- What is the effect of the year of recording, the number of voices, and the ensemble composition on the quality of pitch (class) profiles?
- How can we study tonal structures using pitch content extracted from audio?
2. Background
2.1. Modes
2.2. Music Corpus Studies
2.3. Multiple Pitch Estimation
2.4. Models in This Study
3. Methods
3.1. Create Dataset
3.2. Extract Pitch
3.3. Postprocessing Pitch Extractions
3.3.1. Concert Pitch
3.3.2. Pitch Binning
3.3.3. Loudness Thresholding
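The three postprocessing steps above can be sketched as follows. The frame format, semitone bin size, and threshold value here are illustrative assumptions rather than the paper's exact parameters; a recording tuned away from A = 440 Hz (the concert-pitch step) would be handled by adjusting `a4_hz`.

```python
import math

def freq_to_midi_bin(freq_hz, a4_hz=440.0):
    """Map a frequency to the nearest semitone (MIDI) bin; A4 = MIDI 69.
    Passing a different a4_hz compensates for non-standard concert pitch."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))

def threshold_frames(frames, loudness_threshold=0.1):
    """Keep only (frequency, loudness) estimates above a loudness threshold,
    then bin the surviving frequencies into semitones.
    `frames` is a list of (freq_hz, loudness) pairs, one per analysis frame."""
    binned = []
    for freq, loudness in frames:
        if loudness >= loudness_threshold:
            binned.append(freq_to_midi_bin(freq))
    return binned

# Middle C (~261.63 Hz) maps to MIDI bin 60; the quiet 440 Hz estimate is dropped.
print(threshold_frames([(261.63, 0.8), (440.0, 0.05), (392.0, 0.3)]))  # -> [60, 67]
```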
3.4. Create Features
3.4.1. Finals
3.4.2. Pitch Profiles and Pitch Class Profiles
3.4.3. Distance Between Profiles
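A pitch class profile can be sketched as a normalized 12-bin histogram over semitone bins. The Manhattan (L1) distance shown here is one plausible choice for comparing profiles; the paper's exact distance measure may differ.

```python
def pitch_class_profile(midi_bins, weights=None):
    """Normalized 12-bin histogram of pitch classes (C=0 ... B=11).
    `weights` can carry per-frame loudness; defaults to 1 per frame."""
    if weights is None:
        weights = [1.0] * len(midi_bins)
    profile = [0.0] * 12
    for midi, w in zip(midi_bins, weights):
        profile[midi % 12] += w
    total = sum(profile)
    return [v / total for v in profile] if total else profile

def manhattan_distance(p, q):
    """L1 distance between two profiles (0 = identical)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Two renderings of the same piece should yield nearby profiles.
audio_pcp = pitch_class_profile([62, 62, 69, 65, 62])  # D D A F D
score_pcp = pitch_class_profile([62, 69, 65, 62])      # D A F D
print(round(manhattan_distance(audio_pcp, score_pcp), 3))  # -> 0.2
```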
3.5. Evaluation
3.6. Exploring the Effect of Performance and Recording on Pitch Extractions of the Best Extraction Model
- Recent recordings yield more accurate pitch (class) profiles.
- A lower number of voices yields more accurate pitch (class) profiles.
- An ensemble composition matching the one the model was trained on yields the most accurate pitch (class) profiles.
4. Results
4.1. The CANTO-JRP Dataset
- Metadata: composer, composition title, number of voices, instrumentation category, recording decade, performer(s), final pitch, and the extent to which the audio is similar to the encoding. Figure 6 summarizes characteristics of the dataset and shows that it mainly consists of recent, a cappella recordings for four voices.
- Recordings: for 611 of the 902 works on the JRP website, usable recordings have been found on Spotify; these are collected in a playlist.7 For convenient reference, the order of the playlist has been maintained in the metadata.
- Pitch estimations: Multif0, Basicpitch, and Multipitch (both 195f and 214c) have been applied to each of these recordings. The extractions are made available in the dataset.
- Encodings: the encodings of the 611 works in the dataset have been copied from the JRP website.
4.2. Finals
4.3. Distance Between Extractions and Encoding
| | Mp 195f | Mp 214c | Basicpitch | CQT | HPCP |
|---|---|---|---|---|---|
| Multif0 | 5.2E-23 | 3.0E-28 | 2.3E-54 | 6.3E-273 | 5.5E-149 |
| Mp 195f | | 0.25 | 1.6E-08 | 1.5E-142 | 1.9E-58 |
| Mp 214c | | | 6.7E-06 | 3.5E-130 | 1.1E-50 |
| Basicpitch | | | | 5.2E-87 | 1.2E-25 |
| CQT | | | | | 1.4E-20 |
| | Mp 195f | Mp 214c | Basicpitch |
|---|---|---|---|
| Multif0 | 1.1E-18 | 7.6E-39 | 1.3E-99 |
| Mp 195f | | 2.5E-05 | 4.0E-35 |
| Mp 214c | | | 3.6E-16 |
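The pairwise p-values above come from significance tests over per-work profile distances. The specific test is not shown in this extract; as an illustration, a paired two-sided sign test over the per-work distance differences between two models can be computed with the standard library alone:

```python
from math import comb

def sign_test_p(differences):
    """Two-sided paired sign test: under H0, the per-work distance
    difference between two extraction models is equally likely to be
    positive or negative. Zero differences (ties) are dropped."""
    pos = sum(1 for d in differences if d > 0)
    neg = sum(1 for d in differences if d < 0)
    n, k = pos + neg, min(pos, neg)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Model A beats model B on 9 of 10 works -> small p-value.
diffs = [0.02, 0.03, 0.01, 0.05, 0.02, 0.04, 0.01, 0.03, 0.02, -0.01]
print(round(sign_test_p(diffs), 4))  # -> 0.0215
```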
4.4. Clustering Profiles with Various Extraction Models


4.5. The Effect of Performance and Recording on Pitch Class Profiles
5. Case Studies in Polyphonic Modality
6. Discussion and Perspectives for Music History Research
6.1. Limitations of This Study
6.2. Observations on Multiple Pitch Estimation Models
7. Conclusions and Future Work
Author Contributions
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Judd, C.C. Tonal structures in early music; Routledge, 1998.
- Wiering, F. Are we ready for a big data history of music? https://webspace.science.uu.nl/~wieri103/presentations/Athens2023v6.pdf, 2023. Presented at the International Conference on Computational and Cognitive Musicology (ICCCM), Athens, Greece. Accessed: 24 Apr. 2025.
- Urquhart, P. Sound and sense in Franco-Flemish music of the Renaissance: Sharps, flats, and the problem of ‘musica ficta’; Vol. 7, Peeters Publishers, 2021.
- Thomas, J. Motet Database Catalogue Online. https://www.uflib.ufl.edu/motet/ Accessed: 27 Feb. 2025.
- Albrecht, J.D.; Huron, D. A statistical approach to tracing the historical development of major and minor pitch distributions, 1400-1750. Music Perception: An Interdisciplinary Journal 2012, 31, 223–243. [Google Scholar] [CrossRef]
- Lieck, R.; Moss, F.C.; Rohrmeier, M. The Tonal Diffusion Model. Transactions of the International Society for Music Information Retrieval (TISMIR) 2020, 3, 153. [Google Scholar] [CrossRef]
- Harasim, D.; Moss, F.C.; Ramirez, M.; Rohrmeier, M. Exploring the foundations of tonality: statistical cognitive modeling of modes in the history of Western classical music. Humanities and Social Sciences Communications 2021, 8, 1–11. [Google Scholar] [CrossRef]
- Cornelissen, B.; Zuidema, W.H.; Burgoyne, J.A.; et al. Mode classification and natural units in plainchant. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR); 2020; pp. 869–875. [Google Scholar]
- Cuesta, H.; McFee, B.; Gómez, E. Multiple f0 estimation in vocal ensembles using convolutional neural networks. In Proceedings of the International Society for Music Information Retrieval (ISMIR), Montréal, Canada; 2020. [Google Scholar]
- Muziekweb.nl. Muziekweb. https://www.muziekweb.nl, 2025. Accessed: 27 Feb. 2025.
- Glareanus, H. Dodecachordon; Heinrich Petri: Basel, 1547. Originally published in Latin; reprints available in various editions.
- Rose, S.; Tuppen, S.; Drosopoulou, L. Writing a Big Data history of music. Early Music 2015, 43, 649–660. [Google Scholar] [CrossRef]
- Park, D.; Bae, A.; Schich, M.; Park, J. Topology and evolution of the network of western classical music composers. EPJ Data Science 2015, 4, 1–15. [Google Scholar] [CrossRef]
- Broze, Y.; Huron, D. Is higher music faster? Pitch–speed relationships in Western compositions. Music Perception: An Interdisciplinary Journal 2013, 31, 19–31. [Google Scholar] [CrossRef]
- Yust, J. Stylistic information in pitch-class distributions. Journal of New Music Research 2019, 48, 217–231. [Google Scholar] [CrossRef]
- Upham, F.; Cumming, J. Auditory streaming complexity and Renaissance mass cycles. Empirical Musicology Review 2020, 15, 202–222. [Google Scholar] [CrossRef]
- Moss, F.C. Transitions of tonality: a model-based corpus study. PhD thesis, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, 2019. [Google Scholar]
- Moss, F.C.; Lieck, R.; Rohrmeier, M. Computational modeling of interval distributions in tonal space reveals paradigmatic stylistic changes in Western music history. Humanities and Social Sciences Communications 2024, 11, 1–11. [Google Scholar] [CrossRef]
- Weiß, C.; Mauch, M.; Dixon, S.; Müller, M. Investigating style evolution of Western classical music: A computational approach. Musicae Scientiae 2019, 23, 486–507. [Google Scholar] [CrossRef]
- Geelen, B.; Burn, D.; De Moor, B. A clustering analysis of Renaissance polyphony using state-space models. Journal of the Alamire Foundation 2021, 13, 127–146. [Google Scholar] [CrossRef]
- Arthur, C. Vicentino versus Palestrina: A computational investigation of voice leading across changing vocal densities. Journal of New Music Research 2021, 50, 74–101. [Google Scholar] [CrossRef]
- Moss, F.C.; Neuwirth, M.; Rohrmeier, M. TP3C (Version 1.0.1), 2020. Dataset. [CrossRef]
- Rodin, J.; Sapp, C. The Josquin Research Project. https://josquin.stanford.edu/ Accessed: 27 Feb. 2025.
- Benetos, E.; Dixon, S.; Duan, Z.; Ewert, S. Automatic music transcription: An overview. IEEE Signal Processing Magazine 2019, 36, 20–30. [Google Scholar] [CrossRef]
- Bhattarai, B.; Lee, J. A comprehensive review on music transcription. Applied Sciences 2023, 13, 11882. [Google Scholar] [CrossRef]
- Hawthorne, C.; Stasyuk, A.; Roberts, A.; Simon, I.; Huang, C.Z.A.; Dieleman, S.; Elsen, E.; Engel, J.; Eck, D. Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv 2018, arXiv:1810.12247. [Google Scholar]
- Wang, J.C.; Lu, W.T.; Chen, J. Mel-RoFormer for vocal separation and vocal melody transcription. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), San Francisco, United States; 2024. [Google Scholar]
- Gardner, J.P.; Simon, I.; Manilow, E.; Hawthorne, C.; Engel, J. MT3: Multi-task multitrack music transcription. In Proceedings of the International Conference on Learning Representations (ICLR); 2022. [Google Scholar]
- Bittner, R.M.; McFee, B.; Salamon, J.; Li, P.; Bello, J.P. Deep salience representations for f0 estimation in polyphonic music. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China; 2017; pp. 63–70. [Google Scholar]
- Weiß, C.; Müller, M. From music scores to audio recordings: Deep pitch-class representations for measuring tonal structures. ACM Journal on Computing and Cultural Heritage 2024. [Google Scholar] [CrossRef]
- Yu, H.; Duan, Z. Note-level transcription of choral music. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), San Francisco, United States; 2024. [Google Scholar]
- Bittner, R.M.; Bosch, J.J.; Rubinstein, D.; Meseguer-Brocal, G.; Ewert, S. A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Singapore; 2022. [Google Scholar]
- Gómez, E. Tonal description of music audio signals. PhD thesis, Universitat Pompeu Fabra, Department of Information and Communication Technologies, Barcelona, Spain, 2006.
- Schörkhuber, C.; Klapuri, A. Constant-Q transform toolbox for music processing. In Proceedings of the Sound and Music Computing Conference (SMC), Barcelona, Spain; 2010; pp. 3–64. [Google Scholar]
- Brown, J.C. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America 1991, 89, 425–434. [Google Scholar] [CrossRef]
- Lloyd, S.P. Least squares quantization in PCM. IEEE Transactions on Information Theory 1982, 28. [Google Scholar] [CrossRef]
- Rousseeuw, P.J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 1987, 20, 53–65. [Google Scholar] [CrossRef]
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research 2008, 9, 2579–2605. [Google Scholar]
- Chacón, J.E.; Rastrojo, A.I. Minimum adjusted Rand index for two clusterings of a given size. Advances in Data Analysis and Classification 2023, 17, 125–133. [Google Scholar] [CrossRef]
- Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research 2010, 11, 2837–2854. [Google Scholar]
- Powers, H.S. The modality of “Vestiva i colli”. In Studies of Renaissance and Baroque Music in Honor of Arthur Mendel; Marshall, R.L., Ed.; Bärenreiter: Kassel, London, 1974; pp. 31–46. [Google Scholar]
- Powers, H.S. Modal representation in polyphonic offertories. Early Music History 1982, 2, 43–86. [Google Scholar] [CrossRef]
- Meier, B. Rhetorical aspects of the Renaissance modes. Journal of the Royal Musical Association 1990, 115, 182–190. [Google Scholar] [CrossRef]
- Wiering, F. The language of the modes: Studies in the history of polyphonic modality; Routledge: New York and London, 2001. [Google Scholar]
- Mangani, M.; Sabaino, D. Tonal types and modal attributions in late Renaissance polyphony: new observations. Acta musicologica 2008, 80, 231–250. [Google Scholar]
- Hu, Z. On the theory and practice of chromaticism in Renaissance music. Bachelor’s thesis, Amherst College, 2013.
| 1 | The word symbolic in symbolic encoding refers to the use of a finite alphabet to encode music notation. |
| 2 | For example, the white keys of the piano form a diatonic scale. |
| 3 | The authors of MT3 specifically mention that this model has not been trained on singing. We have run a preliminary test on 150 works in our dataset, evaluating the quality of the pitch (class) profiles. Instead of finals extracted from the MT3 output, we used the ground truth finals, thereby boosting the performance of MT3. Even with this advantage, the results are only marginally better than Basicpitch and worse than Multif0 and Multipitch. Therefore, we decided not to further pursue the evaluation of MT3 in our study. |
| 4 | Middle C is MIDI tone 60. |
| 5 | |
| 6 | For example, the sound of a lute decays faster than a voice does. |
| 7 | |
| 8 | |
| 9 |
| Mode | Final | Range | Reciting Tone |
|---|---|---|---|
| 1 | D | high (authentic) | A |
| 2 | D | low (plagal) | F |
| 3 | E | high (authentic) | C |
| 4 | E | low (plagal) | A |
| 5 | F | high (authentic) | C |
| 6 | F | low (plagal) | A |
| 7 | G | high (authentic) | D |
| 8 | G | low (plagal) | C |
| 9 | A | high (authentic) | - |
| 10 | A | low (plagal) | - |
| 11 | C | high (authentic) | - |
| 12 | C | low (plagal) | - |
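The table above, following Glarean's twelve-mode system, determines the mode from the final and the range; it can be encoded directly as a lookup (the function name below is an illustrative choice):

```python
# Glarean's twelve modes: (final, authentic range?) -> mode number.
MODES = {
    ("D", True): 1, ("D", False): 2,
    ("E", True): 3, ("E", False): 4,
    ("F", True): 5, ("F", False): 6,
    ("G", True): 7, ("G", False): 8,
    ("A", True): 9, ("A", False): 10,
    ("C", True): 11, ("C", False): 12,
}

def mode_number(final, authentic):
    """Look up the mode for a given final and range; None if outside the system."""
    return MODES.get((final, authentic))

print(mode_number("D", True))   # -> 1 (authentic mode on D)
print(mode_number("G", False))  # -> 8 (plagal mode on G)
```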
| Study | Data type | Items | Dataset |
|---|---|---|---|
| Rose et al. (2015) [12] | Metadata | 2,000,000 | British Library of printed music, Hughes’s catalogue of manuscript music in the British Museum, RISM A/II |
| Broze & Huron (2013) [14] | Audio | 880,906 | Naxos track samples and some smaller subsets |
| Park et al. (2015) [13] | Metadata | 63,679 | ArkivMusic and All Music Guide |
| Harasim et al. (2021) [7] | Encoding | 13,402 | Classical Archives, Lost Voices, ELVIS, CRIM |
| Yust (2019) [15] | Encodings | 4,544 | YCAC |
| Upham & Cumming (2020) [16] | Encodings | 2,016 | JRP, RenCOmp7 |
| Moss (2019) [17] | Encodings | 2,012 | ABC, CCARH, CDPL, DCML, Koželuh |
| Moss et al. (2024) [18] | Encodings | 2,012 | TP3C |
| Weiß et al. (2019) [19] | Audio | 2,000 | Cross-Era Dataset |
| Geelen et al. (2021) [20] | Encodings | 1,248 | JRP |
| Arthur (2021) [21] | Encodings | 707 | Palestrina Masses |
| Multiple pitch estimation model | Multif0 [9] | Multipitch [30] | Basicpitch [32] |
|---|---|---|---|
| Model input | HCQT + phase differentials | HCQT | HCQT |
| HCQT harmonics | 1, 2, 3, 4, 5 | 0.5, 1, 2, 3, 4, 5 | 0.5, 1, 2, 3, 4, 5, 6, 7 |
| Input bin size in cents | 20 | 33 | 33 |
| Output bin size in cents | 20 | 100 | 100 |
| Tracks in training | 69 | 744 | 4127 |
| Instrumentation in training | a cappella | opera, chamber music, symphonic, a cappella | vocal, guitar, piano, synthesizers, orchestra |
| Genre | classical | classical | classical and pop |
| Polyphonic/monophonic | polyphonic | polyphonic | monophonic and polyphonic |
| Annotation | f0 annotation per voice | mixed: aligned scores, multitrack, midi-guided performance | unspecified |
| Architecture | Late/Deep CNN | Deep Residual CNN | CNN |
| Loudness in output | yes | no | no |
| Extraction model | correct pitch of finals | correct pitch class of finals |
|---|---|---|
| Multif0 | 95.4% | 99.0% |
| Mp 195f | 90.3% | 94.6% |
| Mp 214c | 93.1% | 95.3% |
| Basicpitch | 72.8% | 84.0% |
| Extraction model | Pitch class profiles: mean | median | stdev | Pitch profiles: mean | median | stdev |
|---|---|---|---|---|---|---|
| Multif0 [9] | 0.0481 | 0.0400 | 0.0414 | 0.0741 | 0.0442 | 0.0770 |
| Mp 195f [30] | 0.0954 | 0.0506 | 0.0911 | 0.1028 | 0.0611 | 0.0838 |
| Mp 214c [30] | 0.0932 | 0.0609 | 0.0819 | 0.1047 | 0.0727 | 0.0766 |
| Basicpitch [32] | 0.1143 | 0.0705 | 0.1018 | 0.1464 | 0.1073 | 0.0948 |
| CQT [34] | | | | 0.2036 | 0.2007 | 0.0380 |
| HPCP [33] | 0.1495 | 0.1383 | 0.0502 | | | |
| Extraction model | Pitch class profiles: ARI | AMI | Pitch profiles: ARI | AMI |
|---|---|---|---|---|
| Multif0 [9] | .84 | .79 | .58 | .61 |
| Mp 195f [30] | .45 | .47 | .32 | .36 |
| Mp 214c [30] | .52 | .53 | .35 | .40 |
| Basicpitch [32] | .44 | .42 | .34 | .34 |
| CQT [34] | | | .34 | .42 |
| HPCP [33] | .50 | .56 | | |
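The ARI values above measure how well clusters found on audio-derived profiles agree with a reference partition. A self-contained sketch of the adjusted Rand index, assuming two flat label lists over the same set of works:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two clusterings of the same items:
    1 = identical partitions, ~0 = chance-level agreement."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))   # contingency table cells
    rows, cols = Counter(labels_a), Counter(labels_b)
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                  # degenerate partitions
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Mode labels from encodings vs. clusters found on audio profiles.
print(adjusted_rand_index([1, 1, 2, 2], [1, 1, 2, 2]))  # identical -> 1.0
```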
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).