Submitted: 14 March 2024
Posted: 18 March 2024
Abstract
Keywords:
1. Introduction
2. State-of-the-Art
3. Datasets and Preprocessing Methodology
3.1. Bibliographic Data
3.1.1. Giant Corpus
{
  "doi": "10.2307/2177340",
  "articleType": 3,
  "citationStyle": 0,
  "citationStringAnnotated": "<author><family>Ritchie</family>, <given>E.</given> and <family>Powell</family>, <given>Elmer Ellsworth</given></author>(<issued><year>1907</year></issued>) <title>Spinoza and Religion.</title> <container-title>The Philosophical Review</container-title>, <volume>16</volume>(<issue>3</issue>), p. <page>339</page>. [online] Available from: <URL>http://dx.doi.org/10.2307/2177340</URL>"
}
- doi: The Digital Object Identifier (DOI), a unique identifier of the document it represents.
- articleType: The identifier representing the type of document (thesis, article, book, etc.).
- citationStyle: The identifier representing the citation style (APA, Harvard, IEEE, etc.).
- citationStringAnnotated: The annotated reference string.
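
To make the record layout concrete, the following minimal sketch loads the sample record shown above and pulls individual fields out of `citationStringAnnotated`. The helper names `plain_text` and `field` are hypothetical and introduced only for illustration; they are not part of the Giant corpus tooling.

```python
import json
import re

# Sample record from the Giant example, with the annotated string on one line.
record_json = r'''{
  "doi": "10.2307/2177340",
  "articleType": 3,
  "citationStyle": 0,
  "citationStringAnnotated": "<author><family>Ritchie</family>, <given>E.</given> and <family>Powell</family>, <given>Elmer Ellsworth</given></author>(<issued><year>1907</year></issued>) <title>Spinoza and Religion.</title> <container-title>The Philosophical Review</container-title>, <volume>16</volume>(<issue>3</issue>), p. <page>339</page>. [online] Available from: <URL>http://dx.doi.org/10.2307/2177340</URL>"
}'''

# Matches any opening or closing tag such as <author> or </container-title>.
TAG = re.compile(r"</?[A-Za-z-]+>")


def plain_text(annotated):
    """Return the reference string with all annotation tags removed."""
    return TAG.sub("", annotated)


def field(annotated, tag):
    """Return the tag-free text of every <tag>...</tag> span, in order."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    return [TAG.sub("", inner) for inner in pattern.findall(annotated)]


record = json.loads(record_json)
annotated = record["citationStringAnnotated"]

print(record["doi"])                 # 10.2307/2177340
print(field(annotated, "author"))    # ['Ritchie, E. and Powell, Elmer Ellsworth']
print(field(annotated, "family"))    # ['Ritchie', 'Powell']
print(field(annotated, "year"))      # ['1907']
print(plain_text(annotated)[:40])
```

The regular expressions assume that tags of the same name are never nested inside each other, which holds for the sample record but is not guaranteed by anything stated here.
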

3.1.2. CORA Corpus
3.2. Preprocessing
3.2.1. Preprocessing Giant Corpus
- Author
- Year
- Title
- Container-Title
- Volume
- Issue
- Page
- ISBN
- ISSN
- Publisher
- DOI
- URL
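
One plausible way to project an annotated string onto these twelve labels is sketched below. It is not the authors' preprocessing code: the tokenizer, the punctuation set, and the function name `to_bio` are assumptions, chosen so that the output agrees with the token-level example given in Section 4.4. Structural tags that are not among the twelve labels (such as `family`, `given`, and `issued`) are treated as transparent.

```python
import re

# The twelve retained labels; tag names are assumed to be their lowercase forms.
LABELS = {"author", "year", "title", "container-title", "volume", "issue",
          "page", "isbn", "issn", "publisher", "doi", "url"}

# Assumption: characters tagged as PUNC. The colon is left out because the
# worked example in Section 4.4 labels ":" as O.
PUNCT = set(".,;()[]")

# Split the string into tags, word tokens, and single non-space symbols.
PIECE = re.compile(r"</?[A-Za-z-]+>|\w+|[^\w\s]")


def to_bio(annotated):
    """Convert an annotated reference string into (token, BIO label) pairs."""
    rows = []
    stack = []  # each entry is [tag_name, has_emitted_B_token]
    for piece in PIECE.findall(annotated):
        if piece.startswith("</"):
            if stack:
                stack.pop()                      # closing tag
        elif piece.startswith("<") and piece.endswith(">") and len(piece) > 2:
            stack.append([piece[1:-1].lower(), False])  # opening tag
        elif piece in PUNCT:
            rows.append((piece, "B-PUNC"))       # punctuation, inside or outside a field
        else:
            # Innermost open tag that is one of the retained labels, if any.
            labelled = next((f for f in reversed(stack) if f[0] in LABELS), None)
            if labelled is None:
                rows.append((piece, "O"))
            else:
                prefix = "I" if labelled[1] else "B"
                labelled[1] = True
                rows.append((piece, f"{prefix}-{labelled[0].upper()}"))
    return rows


# Excerpt of the sample record from Section 3.1.1 (authors, year, title).
excerpt = ("<author><family>Ritchie</family>, <given>E.</given> and "
           "<family>Powell</family>, <given>Elmer Ellsworth</given></author>"
           "(<issued><year>1907</year></issued>) "
           "<title>Spinoza and Religion.</title>")

for token, label in to_bio(excerpt):
    print(token, label)
```

How URLs and DOIs should be tokenized is not stated in the surviving text, so the sketch should not be relied on for those fields.
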

3.2.2. Preprocessing CORA Corpus
4. Architectures of the Evaluated Models
- Its ability to efficiently integrate and manage different types of embeddings.
- Its extensible and modular architecture, which makes it easy to add model-specific layers such as Word Dropout and Locked Dropout.
- Its comprehensive documentation and practical examples.
- Word Dropout: This layer reduces overfitting by randomly "turning off" (i.e., setting to zero) some word vectors during training, which helps the model not to rely too much on specific words.
- Locked Dropout: Also known as variational dropout. Instead of dropping whole words, it samples a single dropout mask over the embedding dimensions for each sequence and applies that same mask at every token position. This improves the robustness of the model by preventing it from overfitting to specific patterns (token combinations) in the training data.
- Embedding2NN: A layer that transforms concatenated embeddings into a representation more appropriate for processing by subsequent layers. This transformation can include non-linear operations to capture more complex relationships in the data.
- Linear: A linear layer that acts as a classifier, mapping the processed representations to the target segmentation labels.
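
The difference between the two dropout variants is easiest to see in the shape of the mask. The sketch below uses plain Python lists rather than tensors and is only an illustration, not Flair's implementation. It follows the usual definition of locked (variational) dropout, in which one mask over the embedding dimensions is sampled per sequence and reused at every position; rescaling the surviving dimensions by 1/(1-p) is an assumption borrowed from standard inverted dropout.

```python
import random


def word_dropout(sentence, p, rng):
    """Zero whole word vectors: each token is dropped independently with probability p."""
    dim = len(sentence[0])
    return [[0.0] * dim if rng.random() < p else list(vec) for vec in sentence]


def locked_dropout(sentence, p, rng):
    """Sample one mask over embedding dimensions and reuse it for every token."""
    dim = len(sentence[0])
    # A dropped dimension gets factor 0; a kept one is rescaled by 1 / (1 - p).
    mask = [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in range(dim)]
    return [[value * m for value, m in zip(vec, mask)] for vec in sentence]


rng = random.Random(0)
sentence = [[1.0] * 8 for _ in range(6)]   # 6 tokens, 8-dimensional embeddings

for row in word_dropout(sentence, 0.05, rng):
    print(row)   # each row is either untouched or entirely zero

for row in locked_dropout(sentence, 0.5, rng):
    print(row)   # the same columns are zero in every row
```

Both functions apply only during training; at inference time the input is passed through unchanged.
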
4.1. CRF Model


- (embeddings): StackedEmbeddings: Refers to combining various types of word embeddings to form a rich and complex representation. It utilizes BytePairEmbeddings and CharacterEmbeddings:
  - (list_embedding_0): BytePairEmbeddings(model=0-bpe-multi-100000-50): BytePairEmbeddings are based on Byte Pair Encoding (BPE), capturing subword-level semantics, useful for handling out-of-vocabulary (OOV) words. The specific BPE model used is indicated by "model=0-bpe-multi-100000-50", detailing its parameters.
  - (list_embedding_1): CharacterEmbeddings: Uses character-level embeddings, crucial for understanding orthographic peculiarities and common errors in texts.
    - (char_embedding): Embedding(275, 25): Defines a character embedding with a vocabulary size of 275 and 25-dimensional vectors.
    - (char_rnn): LSTM(25, 25, bidirectional=True): A bidirectional LSTM network that processes character embeddings, with 25 units in both directions, capturing contexts before and after each character.
- (word_dropout): WordDropout(p=0.05): Applies dropout at the word level with a probability of 0.05, helping prevent overfitting by randomly "turning off" words during training.
- (locked_dropout): LockedDropout(p=0.5): Drops embedding dimensions with a probability of 0.5, using the same dropout mask at every token position of a sequence, which enhances the model’s robustness.
- (embedding2nn): Linear(in_features=650, out_features=650, bias=True): A linear layer that transforms the concatenated embedding representation, preparing it for processing by subsequent layers.
- (linear): Linear(in_features=650, out_features=29, bias=True): Another linear layer acting as a classifier, mapping processed representations to 29 target label categories.
- (loss_function): ViterbiLoss(): Employs the Viterbi loss function, suitable for sequential prediction tasks like reference segmentation.
- (crf): CRF(): Indicates the use of a Conditional Random Field for sequence label prediction, optimizing the coherence and accuracy of predicted labels.
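
To illustrate how a CRF layer enforces coherent label sequences, here is a minimal Viterbi decoder for a linear-chain model. It is a generic textbook sketch, not Flair's `CRF` or `ViterbiLoss` code, and the emission and transition scores are invented for the example; in the trained model both are learned.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n = len(labels)
    score = list(emissions[0])   # best score of any path ending in each label
    back = []                    # back-pointers, one list per position after the first
    for t in range(1, len(emissions)):
        prev, score, ptr = score, [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + transitions[i][j])
            ptr.append(best_i)
            score.append(prev[best_i] + transitions[best_i][j] + emissions[t][j])
        back.append(ptr)
    # Follow the back-pointers from the best final label.
    path = [max(range(n), key=lambda j: score[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return [labels[i] for i in path]


labels = ["B-AUTHOR", "I-AUTHOR", "O"]
# Invented scores: the middle token slightly prefers O when viewed in isolation.
emissions = [
    [2.0, 0.5, 0.1],
    [0.2, 1.0, 1.1],
    [0.1, 2.0, 0.3],
]
# Rows are the previous label, columns the next one; O -> I-AUTHOR is forbidden.
transitions = [
    [0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, -1e4, 0.0],
]

print(viterbi(emissions, transitions, labels))
# ['B-AUTHOR', 'I-AUTHOR', 'I-AUTHOR']
```

Picking the best label per token independently would yield B-AUTHOR, O, I-AUTHOR, which contains the invalid step from O to I-AUTHOR. Decoding over the whole sequence avoids it.
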
4.2. BiLSTM+CRF Model


- (rnn): LSTM(650, 256, batch_first=True, bidirectional=True): A bidirectional LSTM layer that processes sequences in both forward and backward directions. It takes 650 input features and has a hidden size of 256 per direction, so the concatenated output has 512 features per token. It captures contextual information from both earlier and later tokens within a sequence, enhancing the model’s ability to understand complex dependencies in bibliographic reference segmentation.
4.3. Transformer+CRF Model


- (positional_encoding): PositionalEncoding(dropout=0.1, inplace=False): This layer adds positional information to the input embeddings, allowing the model to capture the sequence of the text. The addition of positional encodings is crucial for attention-based models, such as transformers, as it enables them to distinguish the order of words in a sequence. The use of dropout with a probability of 0.1 helps to prevent overfitting by randomly "turning off" parts of the positional embeddings during training to enhance the model’s robustness.
- (transformer_encoder_layer): CustomTransformerEncoderLayer(...) and (transformer_encoder): TransformerEncoder(...): These two entries form the core of the Transformer model. The first defines the structure of a single encoder layer, including multi-head attention, residual connections, and layer normalization. The second stacks several of these layers to build the complete encoder, which lets the model process sequences with greater depth. The two entries therefore only appear redundant: one specifies the layer, the other its repetition. The custom layer also includes BatchNorm1d, which normalizes activations across mini-batches to stabilize and accelerate training. This is not typical of standard transformers, but it can improve convergence and performance in specific tasks such as bibliographic reference segmentation.
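
The layer listing does not say which positional scheme `PositionalEncoding` implements. Assuming it is the fixed sinusoidal encoding introduced with the original Transformer, the table added to the input embeddings can be computed as in the sketch below. The dropout of 0.1 mentioned above is omitted here, and the function name is hypothetical.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal table: even dimensions use sine, odd dimensions use cosine.

    PE[pos][2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(seq_len):
        row = []
        for k in range(d_model):
            two_i = k - (k % 2)   # paired sine/cosine dimensions share a frequency
            angle = pos / (10000 ** (two_i / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, d_model=6)
print(pe[0])   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Because every position receives a distinct vector, attention layers that would otherwise be order-agnostic can tell "16" before "(3)" apart from the reverse order.
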
4.4. Training
- Ritchie B-AUTHOR
- , B-PUNC
- E I-AUTHOR
- . B-PUNC
- and I-AUTHOR
- Powell I-AUTHOR
- , B-PUNC
- Elmer I-AUTHOR
- Ellsworth I-AUTHOR
- ( B-PUNC
- 1907 B-YEAR
- ) B-PUNC
- Spinoza B-TITLE
- and I-TITLE
- Religion I-TITLE
- . B-PUNC
- The B-CONTAINER-TITLE
- Philosophical I-CONTAINER-TITLE
- Review I-CONTAINER-TITLE
- , B-PUNC
- 16 B-VOLUME
- ( B-PUNC
- 3 B-ISSUE
- ) B-PUNC
- , B-PUNC
- p O
- . B-PUNC
- 339 B-PAGE
- . B-PUNC
- [ B-PUNC
- online O
- ] B-PUNC
- Available O
- from O
- : O
- 80% for training.
- 10% for hyperparameter tuning.
- 10% for performance evaluation.
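
An 80/10/10 partition of this kind can be produced as sketched below. Whether the authors shuffled the data, and with which seed, is not stated, so the shuffling and the `seed` parameter are assumptions.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of the data and split it into train, dev, and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded so the split is reproducible
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]   # used for hyperparameter tuning
    test = items[n_train + n_dev:]         # held out for performance evaluation
    return train, dev, test


train, dev, test = split_80_10_10(range(1000))
print(len(train), len(dev), len(test))   # 800 100 100
```

Integer arithmetic is used for the boundaries so that any remainder falls into the test portion rather than being lost to rounding.
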
4.5. Model Evaluation
5. Experiments and Results
5.1. Results on the CORA Corpus
5.2. Considerations about the CORA Corpus
<author> M. Ahlskog, J. Paloheimo, H. Stubb, P. Dyreklev, M. Fahlman, O. Inganas and M.R. Andersson, </author> <container-title> J Appl. Phys., </container-title> <volume> 76, </volume><pages>893, </pages> <year> (1994). </year>
5.3. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
| Parameter | Value |
|---|---|
| Learning Rate | 0.003 |
| Batch Size | 1024 |
| Maximum Epochs | 150 |
| Optimizer | AdamW |
| Patience | 2 |
| Class | Transformer+CRF | BiLSTM+CRF | CRF |
|---|---|---|---|
| PUNC | 1 | 1 | 1 |
| AUTHOR | 0.9799 | 0.9997 | 0.9771 |
| TITLE | 0.969 | 0.9997 | 0.9667 |
| CONTAINER-TITLE | 0.9512 | 0.9997 | 0.9484 |
| YEAR | 0.963 | 0.9992 | 0.9629 |
| PUBLISHER | 0.9837 | 0.9998 | 0.9834 |
| DOI | 0.9667 | 1 | 0.9642 |
| URL | 0.9987 | 1 | 0.9987 |
| PAGE | 0.9919 | 1 | 0.994 |
| VOLUME | 0.8049 | 0.9936 | 0.8064 |
| ISSUE | 0.8615 | 0.9872 | 0.8553 |
| ISBN | 0.9924 | 1 | 0.9922 |
| ISSN | 0.9714 | 1 | 0.9655 |
| Cora Entity | Entity in models |
|---|---|
| AUTHOR | AUTHOR |
| BOOKTITLE/JOURNAL | CONTAINER-TITLE |
| DATA | YEAR |
| PAGES | PAGE |
| PUBLISHER | PUBLISHER |
| VOLUME | VOLUME |
| TITLE | TITLE |
| TECH | <REMOVED> |
| INSTITUTE | <REMOVED> |
| EDITOR | <REMOVED> |
| NOTE | <REMOVED> |
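
The mapping in the table above can be applied with a simple lookup, as in the following sketch. Entity names are copied verbatim from the table. Splitting "BOOKTITLE/JOURNAL" into two keys and dropping the spans of removed entities entirely are interpretations, since the table does not say how removed entities are handled. The function name `remap` is hypothetical.

```python
# Cora entity -> label used by the models; None marks an entity listed as <REMOVED>.
CORA_TO_MODEL = {
    "AUTHOR": "AUTHOR",
    "BOOKTITLE": "CONTAINER-TITLE",   # assumed: the combined table row covers both
    "JOURNAL": "CONTAINER-TITLE",
    "DATA": "YEAR",                    # spelled as in the mapping table
    "PAGES": "PAGE",
    "PUBLISHER": "PUBLISHER",
    "VOLUME": "VOLUME",
    "TITLE": "TITLE",
    "TECH": None,
    "INSTITUTE": None,
    "EDITOR": None,
    "NOTE": None,
}


def remap(spans):
    """Rename Cora entities and drop removed or unknown ones.

    spans is a list of (cora_entity, text) pairs.
    """
    mapped = []
    for entity, text in spans:
        target = CORA_TO_MODEL.get(entity.upper())
        if target is not None:
            mapped.append((target, text))
    return mapped


example = [
    ("AUTHOR", "M. Ahlskog"),
    ("JOURNAL", "J Appl. Phys."),
    ("VOLUME", "76"),
    ("NOTE", "to appear"),
]
print(remap(example))
# [('AUTHOR', 'M. Ahlskog'), ('CONTAINER-TITLE', 'J Appl. Phys.'), ('VOLUME', '76')]
```
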
| Class | Transformer+CRF | BiLSTM+CRF | CRF |
|---|---|---|---|
| PUNC | 1 | 1 | 1 |
| AUTHOR | 0.9517 | 0.9876 | 0.9325 |
| TITLE | 0.5441 | 0.838 | 0.6096 |
| YEAR | 0.982 | 0.9885 | 0.9804 |
| CONTAINER-TITLE | 0.1392 | 0.619 | 0.1937 |
| PAGE | 0.8652 | 0.9483 | 0.8803 |
| VOLUME | 0.3402 | 0.8213 | 0.3425 |
| PUBLISHER | 0.2629 | 0.7931 | 0.2903 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).