Submitted:
13 July 2023
Posted:
13 July 2023
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Dataset
- The main notes are labeled with their position, pitch and duration in the natural scale. Labeling takes two steps. First, draw the bounding box: it should contain the complete note (head, stem and tail) together with the spatial information of the head. In other words, the bounding box should span the 0th to the 5th line of the staff and include the position of the head. Then, annotate the object: the label is the ‘duration_pitch’ code under the natural scale, as shown in Figure 1f,g.
- Label the category and position of the symbols that affect the pitch and duration of the main notes. In a score, the clef, key signature, dot and pitch-shifting notations (sharp, flat and natural) are the main control symbols that affect the pitch and duration of the main note; Table 1a–c lists the control symbols identified and understood in this paper. Each of these symbols is labeled with its category and a minimum bounding box enclosing the whole symbol, as shown in Figure 1a–c,e.
- Label the category and position of the rests. A rest marks a pause in the performance for a specified duration. Each rest is labeled with a minimum bounding box that encloses it entirely, together with its category and duration. The rests identified and understood in this paper are listed in Table 1d, and rests in the staff are labeled as shown in Figure 1d.
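As a rough illustration of how such annotations might be read back, the sketch below parses one line in the common YOLO text-label layout (`class x_center y_center width height`) and splits a note class name into its duration and pitch parts. The class names (`gclef`, `sharp`, `4_G4`), the `LabeledSymbol` type and the rule that only note classes contain an underscore are assumptions made for this example, not details taken from the paper's dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabeledSymbol:
    name: str                       # class name, e.g. "gclef" or "4_G4" (hypothetical)
    x: float                        # bounding-box center, horizontal
    y: float                        # bounding-box center, vertical
    w: float                        # bounding-box width
    h: float                        # bounding-box height
    duration: Optional[str] = None  # set only for 'duration_pitch' note labels
    pitch: Optional[str] = None


def parse_label_line(line: str, class_names: List[str]) -> LabeledSymbol:
    """Parse one 'class x y w h' line and split note names of the form duration_pitch."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    name = class_names[int(fields[0])]
    x, y, w, h = (float(v) for v in fields[1:])
    # Assumption: only main-note classes contain an underscore.
    duration, sep, pitch = name.partition("_")
    if sep and duration and pitch:
        return LabeledSymbol(name, x, y, w, h, duration, pitch)
    return LabeledSymbol(name, x, y, w, h)


names = ["gclef", "sharp", "4_G4"]  # hypothetical class list
note = parse_label_line("2 0.5 0.25 0.02 0.1", names)
print(note.duration, note.pitch)
```

Control symbols such as clefs keep `duration` and `pitch` empty, which mirrors the distinction the labeling scheme draws between main notes and the symbols that modify them.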
2.2. Low-level Semantic Understanding Stage
2.3. High-level Semantic Understanding Stage
- The clef determines the exact pitch position of the natural scale in the staff. It is written at the leftmost end of each staff and also serves as a flag marking the m-th line of the staff. It is also the first symbol the NERA considers when encoding pitch;
- The key signature, located after the clef, marks the raising or lowering of the pitch of the corresponding notes and is expressed as a value in the NERA. Clefs and key signatures are effective within one line of staff notation;
- Among the accidentals, the pitch-shifting notations change the pitch: they raise, lower or restore the pitch of the note to which they are applied. The dot extends the original duration of the note by half.
2.3.1. Data Preprocessing
- Removal of invalid symbols. The task of this paper is to encode the pitch and duration of staff notes as performed. The symbols that affect the pitch and duration of notes are the clefs, the key signatures, the accidentals and the natural scales; all other symbols are treated as invalid in this article. In the preprocessing stage, invalid symbols are removed and valid symbols are retained. The relationships among the clefs, key signatures, accidentals, natural scales, the valid symbol set and the dataset are given in Equation (1). There, P is the pitch space spanned by the natural scales, paired with the space spanned by the durations. The element 0 in the key signature set means that no key signature is present, which implies that the line is in C major. Each natural scale s carries two pieces of information, indicating its pitch and its duration respectively;
- Sorting of valid symbols. The YOLOv5 algorithm in the LSNS outputs the objects in no particular order. Each object carries its class, the Cartesian coordinates of the center point of its bounding box, and the width and height of the box. The clef is the first element of each row in the staff. Let D be the distance between the center points of two adjacent clefs. A symbol whose center point lies within a threshold, defined from D, of a clef's center point is assigned to that clef's line. Next, the symbols in the same row are sorted by X, the horizontal center coordinate, from small to large. In this way, all valid symbols are rearranged into the exact order in which they are read on the staff.
| Algorithm 1: Algorithm of the NERA Preprocessing Part |
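The sorting step of the preprocessing can be sketched as follows. This is a minimal illustration, not the paper's algorithm: detections are modeled as `(cls, x, y, w, h)` tuples, the clef class names are hypothetical, the vertical coordinate is assumed to grow from the top of the page downward, and the row-assignment threshold is assumed to be half the smallest clef-to-clef spacing.

```python
from typing import FrozenSet, List, Tuple

Symbol = Tuple[str, float, float, float, float]  # (cls, x, y, w, h)


def sort_symbols(
    symbols: List[Symbol],
    clef_classes: FrozenSet[str] = frozenset({"gclef", "fclef"}),  # hypothetical names
) -> List[List[Symbol]]:
    """Group unordered detections into staff lines, then order each line left to right."""
    # Each clef opens one staff line; order the lines by the clef's vertical center.
    clefs = sorted((s for s in symbols if s[0] in clef_classes), key=lambda s: s[2])
    if not clefs:
        return []
    # D: spacing between adjacent clef centers (smallest one, to be conservative).
    if len(clefs) > 1:
        d = min(b[2] - a[2] for a, b in zip(clefs, clefs[1:]))
    else:
        d = float("inf")
    half_band = d / 2  # assumed threshold; the source defines its own in terms of D
    lines: List[List[Symbol]] = [[] for _ in clefs]
    for s in symbols:
        nearest = min(range(len(clefs)), key=lambda i: abs(s[2] - clefs[i][2]))
        if abs(s[2] - clefs[nearest][2]) < half_band:
            lines[nearest].append(s)
    # Within a line, reading order is increasing x.
    return [sorted(line, key=lambda s: s[1]) for line in lines]


detections = [
    ("4_G4", 0.30, 0.21, 0.02, 0.10),
    ("gclef", 0.05, 0.20, 0.03, 0.12),
    ("gclef", 0.05, 0.60, 0.03, 0.12),
    ("sharp", 0.12, 0.58, 0.01, 0.04),
    ("8_C5", 0.20, 0.63, 0.02, 0.10),
    ("2_E4", 0.15, 0.24, 0.02, 0.10),
]
for line in sort_symbols(detections):
    print([s[0] for s in line])
```

Because the clef is the leftmost symbol of its row, it ends up first in each sorted line, which matches the reading order described above.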
2.3.2. Note Reconstructing
| Algorithm 2: Algorithm of the NERA Notation Reconstructing Part |
2.3.3. Note Encoding
- Pitch Encoding. According to the clef, the key signature and the MIDI encoding rules, the pitch code of each natural scale in the m-th line is converted, one by one, into a code that incorporates the effect of the clef and the key signature. We define a mapping for this strategy and use it to obtain the converted code. The encoding process is shown in Figure 2. The pitch encoding part then scans the note control vector and applies the MIDI encoding rules to obtain the pitch code of each note as played, as shown in Equation (4).
- Duration Encoding. Each duration control vector is scanned together with the corresponding note duration vector. With an individual performance style coefficient defined and the MIDI encoding rules applied, the duration encoding strategy is given in Equation (5). The coefficient varies from performer to performer; one particular value of it means that the performers' characteristics are not considered.
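The two encoding steps can be illustrated with a small sketch. It is written under several assumptions that are not from the paper: the clef has already been resolved so that each note arrives as a letter name plus octave, the key signature is a signed count (positive for sharps, negative for flats, 0 for C major), an explicit accidental overrides the key signature, middle C (C4) maps to MIDI note 60, durations are expressed in ticks at a hypothetical 480 ticks per beat, and the style coefficient is multiplicative with 1.0 meaning no performer adjustment.

```python
from typing import Optional

SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
SHARP_ORDER = "FCGDAEB"  # order in which sharps are added to a key signature
FLAT_ORDER = "BEADGCF"   # order in which flats are added to a key signature
ACCIDENTAL_SHIFT = {"sharp": 1, "flat": -1, "natural": 0}


def key_signature_shift(step: str, key: int) -> int:
    """Semitone shift a key signature applies to a letter name."""
    if key > 0 and step in SHARP_ORDER[:key]:
        return 1
    if key < 0 and step in FLAT_ORDER[:-key]:
        return -1
    return 0


def encode_pitch(step: str, octave: int, key: int = 0,
                 accidental: Optional[str] = None) -> int:
    """MIDI note number for a natural-scale note under a key signature and accidental."""
    base = 12 * (octave + 1) + SEMITONE[step]  # C4 -> 60 under this convention
    if accidental is None:
        return base + key_signature_shift(step, key)
    return base + ACCIDENTAL_SHIFT[accidental]  # accidental replaces the key signature


def encode_duration(beats: float, dotted: bool = False, style: float = 1.0,
                    ticks_per_beat: int = 480) -> int:
    """Duration in ticks; a dot adds half of the original duration."""
    if dotted:
        beats *= 1.5
    return round(beats * ticks_per_beat * style)


print(encode_pitch("F", 4, key=1))        # F in a one-sharp key signature
print(encode_duration(1.0, dotted=True))  # dotted one-beat note
```

Keeping the key signature as a signed count lets the value 0 stand for "no key signature", in line with the convention used in the preprocessing step.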
2.4. System Structure

3. Results
3.1. Data
3.2. Training
3.3. Evaluation Metrics
3.4. Experiment and Analysis
3.4.1. Experiment of LSNS
- In staff 1, the many cumbersome note beams and the high density of symbols result in relatively high error and omission rates, as shown in Figure 4a;
- Staff 2 has the highest density of symbols, and its recall is relatively low, as shown in Figure 4b;
- Staff 3 has lower complexity on every item, and its evaluation results are better;
- The errors and omissions in staff 4 are mostly concentrated in notes with longer stems, as shown in Figure 4c;
- Staff 5 has higher complexity on every item and very low image resolution, and its evaluation results are worse than those of the other staffs;
- Staff 6 has lower image resolution. As in staff 1, its beamed notes are cumbersome, as shown in Figure 4d;
- Staff 7 has lower image resolution, but its evaluation results are better because its other complexity attributes are lower;
- In staff 9, the misdetected notes are those located on the upper ledger lines of the staff, as shown in Figure 4e.
3.4.2. Experiment of HSNS
- When the input is ideal, the error rate and the omission rate of the output are the performance indexes of the NERA.
- When the input is practical, the error rate and the omission rate are the performance indexes of the whole system.
- Misidentification of the pitch or duration of a natural scale can lead to errors in the HSNS;
- Misidentification or omission of the accidentals (sharp, flat, natural, dot) acting on natural scales can lead to errors in the HSNS;
- Omission of a note affects the HSNS of that note or of the preceding and following notes. There are three cases. When the note is preceded and followed by separate notes, its omission does not affect the semantics of the preceding and following notes. When the note is preceded by a pitch-shifting notation (sharp, flat, natural) and followed by another note, its omission causes the pitch-shifting notation meant for it to be applied to the following note, resulting in a pitch error in the HSNS of the following note. When the note is preceded by a note and followed by a dot, its omission causes the dot meant for it to act on the preceding note, so the HSNS of the preceding note has an incorrect duration;
- Misidentification or omission of the key signature results in a pitch error in the HSNS of some notes in the line. There are three cases. When the key signature is missed, the pitch of every note within the scope of the key signature is incorrect in the HSNS. When the key signature is misidentified as one with the same mode of action, i.e., both raise (or both lower) the natural scale but over a different range of notes, the pitch of some of the notes is wrong in the HSNS. When the key signature is misidentified as one with a different mode of action, the pitch is incorrect when the note is semantically understood;
- When the clef is missed, all natural scales in the row are governed by the clef of the previous line. When the clef is misidentified, an error occurs in the HSNS of all natural scales in the row.
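The way a single omitted note shifts the meaning of its neighbors can be reproduced with a toy model. The sketch below is illustrative only: it assumes a simplified attachment rule (each pitch-shifting notation binds to the next note, each dot binds to the previous note), uses hypothetical symbol names, and ignores bar-wide accidental scope.

```python
from typing import Dict, List

PITCH_SHIFTS = {"sharp", "flat", "natural"}


def attach_modifiers(symbols: List[str]) -> List[Dict]:
    """Bind each pitch-shifting notation to the next note and each dot to the previous note."""
    notes: List[Dict] = []
    pending = None  # pitch-shifting notation waiting for its note
    for sym in symbols:
        if sym in PITCH_SHIFTS:
            pending = sym
        elif sym == "dot":
            if notes:
                notes[-1]["dotted"] = True
        else:
            notes.append({"note": sym, "accidental": pending, "dotted": False})
            pending = None
    return notes


# Case 2: the sharpened note is missed, so the sharp lands on the following note.
truth = ["4_C5", "sharp", "4_F4", "4_G4"]
detected = ["4_C5", "sharp", "4_G4"]
print(attach_modifiers(truth))
print(attach_modifiers(detected))

# Case 3: the dotted note is missed, so the dot lands on the preceding note.
print(attach_modifiers(["4_E4", "dot"]))
```

In the second call the note `4_G4` wrongly inherits the sharp, and in the third the note `4_E4` wrongly becomes dotted, which corresponds to the pitch and duration errors described in the omission cases above.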
4. Conclusions and Outlooks
4.1. Conclusions
4.2. Outlooks
- The staff notation handled in this paper is mainly related to the pitch and duration of musical melodies. Recognizing other symbols, such as dynamics, staccatos, trills and characters carrying information about the staff, is one of the tasks to be solved in the future;
- The accurate recognition of complex natural scales such as chords is a priority;
- The recognition of symbols in more complex staff images, e.g., those with larger intervals, denser symbols and more image noise, remains to be addressed;
- It is important to refine the scope of accidentals, so that they can be combined with bar lines, repetition lines, etc.;
- The semantic understanding of notes is based on the LSNS. Once the model can recognize more types of symbols, each note can be given richer expression information;
- In this paper, rests are recognized, but the information is not used in semantic understanding. In the future, this information and the semantic relationships of other symbols can be used to generate a complete code of the staff as performed.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| LSNS | Low-level semantic understanding stage |
| HSNS | High-level semantic understanding stage |
| NERA | Note encoding reconstruction algorithm |
Appendix A
- Staff 1: Canon and Gigue in D major (Johann Pachelbel)
- Staff 2: Oboe String Quartet in C Minor, Violino concertato (J.S. Bach, BWV 1060)
- Staff 3: Sechs ländlerische Tänze für 2 Violinen und Bass (WoO 15), Violino 1 (Ludwig van Beethoven)
- Staff 4: Violin Concerto RV 226, Violino principale (A. Vivaldi)
- Staff 5: String Duo no. 1 in G for violin and viola KV 423 (Wolfgang Amadeus Mozart)
- Staff 6: Partia à Cembalo solo (G. Ph. Telemann)
- Staff 7: Canon in D, Piano Solo (Johann Pachelbel)
- Staff 8: Für Elise in A Minor WoO 59 (Ludwig van Beethoven)
- Staff 9: Passacaglia (Handel-Halvorsen)
- Staff 10: Prélude n°1 Do Majeur (J.S. Bach)
| 1 | The staff images in the dataset are the open-licence staffs provided by the International Music Score Library Project (IMSLP). No copyright issues are involved. |









| Staff | Complexity variables | | | | | | Evaluation | |
|---|---|---|---|---|---|---|---|---|
| Name | Page | Type | Span | Density (symbols) | Density (external notes) | Resolution | Precision | Recall |
| Staff 1 | 2 | 16 | 19 | 484 | 78 | 1741 | 0.968 | 0.930 |
| Staff 2 | 5 | 17 | 19 | 679 | 146 | 2232 | 0.996 | 0.988 |
| Staff 3 | 3 | 13 | 19 | 319 | 95 | 1673 | 0.997 | 0.992 |
| Staff 4 | 12 | 20 | 20 | 478 | 80 | 1741 | 0.994 | 0.981 |
| Staff 5 | 7 | 19 | 24 | 530 | 145 | 200 | 0.980 | 0.958 |
| Staff 6 | 5 | 19 | 20 | 367 | 63 | 435 | 0.992 | 0.970 |
| Staff 7 | 5 | 15 | 19 | 350 | 62 | 854 | 0.996 | 0.993 |
| Staff 8 | 3 | 13 | 20 | 441 | 40 | 1536 | 0.990 | 0.969 |
| Staff 9 | 3 | 11 | 20 | 424 | 160 | 2389 | 0.986 | 0.966 |
| Staff 10 | 2 | 17 | 18 | 315 | 86 | 1780 | 0.987 | 0.976 |
| | Precision | Recall |
|---|---|---|
| clef | 1.000 | 0.993 |
| key signature | 0.992 | 0.990 |
| Staff | Error rate (ideal input) | Omission rate (ideal input) | Error rate (practical input) | Omission rate (practical input) |
|---|---|---|---|---|
| Staff 1 | 0.006 | 0.000 | 0.052 | 0.044 |
| Staff 2 | 0.011 | 0.000 | 0.016 | 0.010 |
| Staff 3 | 0.010 | 0.000 | 0.020 | 0.006 |
| Staff 4 | 0.019 | 0.000 | 0.027 | 0.020 |
| Staff 5 | 0.013 | 0.000 | 0.044 | 0.014 |
| Staff 6 | 0.005 | 0.000 | 0.020 | 0.008 |
| Staff 7 | 0.000 | 0.000 | 0.004 | 0.010 |
| Staff 8 | 0.020 | 0.000 | 0.055 | 0.053 |
| Staff 9 | 0.022 | 0.000 | 0.037 | 0.021 |
| Staff 10 | 0.000 | 0.000 | 0.036 | 0.019 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
