Preprint
Article

This version is not peer-reviewed.

Automatic Gender Identification from Text

A peer-reviewed version of this preprint was published in:
Applied Sciences 2024, 14(24), 12041. https://doi.org/10.3390/app142412041

Submitted: 21 November 2024

Posted: 22 November 2024


Abstract

Gender identification of authors in literary texts is a compelling area of research within computational linguistics and natural language processing. Analyzing the gender of authors can uncover biases and socio-cultural dynamics of the past, deepening our understanding of historical texts. Inspired by the historical context where women often used male pseudonyms to navigate the literary world, this study seeks to determine an author's gender, relying on their written works using various classifiers, including language models. Our contributions include compiling a large-scale dataset of literary texts and conducting extensive experiments with different classification models. Our results show that the best-performing model, GPT2, achieved an impressive accuracy of 0.925.


1. Introduction

Gender identification of authors in literary texts is an active field of study in computational linguistics. Determining the gender of authors can reveal prejudices and underlying socio-cultural dynamics that were present in the literary world and enhance our understanding of literary works. Historical examples motivate this endeavor: women who encountered obstacles to publication and recognition often adopted male pseudonyms. Examples include George Sand, the pseudonym of Amantine Lucile Aurore Dupin, Baroness Dudevant, a 19th-century French writer and memoirist [1,2]; Vernon Lee, the pseudonym of the British writer Violet Paget [3,4]; and James Tiptree, Jr., the pseudonym of Alice Sheldon [5,6]. Further examples are the Brontë sisters, who adopted male pseudonyms, and Mary Ann Evans, who published as George Eliot. These cases highlight the systemic barriers that have historically hindered women’s participation and visibility in the literary field.
In this study, we aim to identify the gender of the author from the written texts, with a specific focus on literary works such as books and novels.
Our contributions are two-fold:
  • We present a carefully collected dataset comprising a diverse set of literary works that span genres, time periods, and cultural contexts, including historical novels (e.g., Romola by Mary Ann Evans), short novels (e.g., Absalom’s Hair by Bjornstjerne Bjornson), long novels (e.g., Moby-Dick by Herman Melville), science fiction novels (e.g., The Confessions of Artemas Quibble by Arthur Train), dime novels (e.g., The Dock Rats of New York by Harlan Page Halsey), mystery adventure novels (e.g., The Danger Trail by James Oliver Curwood), and children’s fiction novels (e.g., Heidi by Johanna Spyri).
  • We perform extensive comparative experiments with diverse classifiers and analyze their results. In our experiments, GPT2 and XLNet emerged as the top-performing models. GPT2 achieved the highest overall accuracy of 0.925 and excelled in precision, recall, and F1 scores for both female and male categories, indicating robust and consistent performance across metrics. XLNet followed closely with an accuracy of 0.907, also demonstrating strong precision and recall, with an average F1 score of 0.856. Logistic Regression (LR) models with various configurations showed moderate performance, with accuracies between 0.825 and 0.841. The XGB classifier (XGB), especially when fine-tuned, also performed well, achieving an accuracy of 0.853 and decent F1 scores, making it a solid choice among traditional machine learning methods.
We believe that the compiled dataset and the reported results will help advance the field of literary analysis by providing valuable resources and insights.

3. Experimental Study

The primary purpose of this study is to explore and evaluate various machine learning and deep learning models for the task of gender recognition from book texts. In this context, gender recognition involves identifying the gender of an author of the book. This task presents several challenges due to the nuanced and complex nature of language and the variability in writing styles.

3.1. Experimental Settings

The experiments were conducted using Google Colab with a runtime environment equipped with 32 GB of RAM and a T4 GPU. Python version 3 was utilized for all implementations.

3.1.1. Logistic Regression

The LR models were trained with the liblinear solver (a library for large linear classification) and a maximum of 1000 iterations. The models were evaluated using precision and recall scores. The dataset was split into training and testing sets with a random state of 42. Precision and recall scores were computed for classes ’m’ (male) and ’f’ (female). Each iteration of the model training used 20 books.
Furthermore, we trained the LR classifier with L1 and L2 penalty regularizations using the same evaluation protocol.
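For reference, the two penalties can be written as the standard regularized logistic regression objectives that liblinear minimizes. This is a sketch under assumptions not stated in the text: the labels are encoded as y_i in {-1, +1} (for example, ’f’ and ’m’), x_i is the feature vector of training sample i, and C is the cost parameter, whose value is not reported here.

```latex
% L2-regularized logistic regression
\min_{w} \; \frac{1}{2} w^{\top} w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^{\top} x_i}\right)

% L1-regularized logistic regression
\min_{w} \; \lVert w \rVert_{1} + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^{\top} x_i}\right)
```

The L1 penalty tends to drive many weights to exactly zero, which yields a sparser model over the vocabulary, while the L2 penalty shrinks all weights smoothly.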

3.1.2. Extreme Gradient Boosting Classifier

The Extreme Gradient Boosting (XGB) model was trained with default parameters for binary classification.
The models were evaluated using accuracy, precision, and recall scores. A confusion matrix and metrics for both classes ’m’ and ’f’ were computed. Each iteration of the model training used 20 books. Additionally, we experimented with an enhanced XGB configuration with specific parameters such as gamma, subsample, and colsample_bytree.

3.1.3. Support Vector Machine

The SVM models were trained with a linear kernel. An SVM with a linear kernel is used to classify data by finding the hyperplane that best separates different classes in a high-dimensional space. The linear kernel is particularly suited for problems where the classes are linearly separable, meaning they can be separated by a straight line or hyperplane in the feature space.
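The separating hyperplane described above can be stated as the standard soft-margin linear SVM problem. This is a textbook formulation given as a sketch; the text does not specify the exact implementation or the value of the cost parameter C, and the label encoding y_i in {-1, +1} is an assumption.

```latex
\min_{w,\, b,\, \xi} \; \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0

% Prediction for a new feature vector x
\hat{y} = \operatorname{sign}\left( w^{\top} x + b \right)
```

The slack variables \xi_i allow some training samples to violate the margin, which matters when the classes are not perfectly linearly separable.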

3.1.4. Bidirectional Encoder Representations from Transformers

The BERT model was fine-tuned using the BertForSequenceClassification model from Hugging Face’s transformers library, using the AdamW optimizer and a learning rate of 2e-5. The model evaluation included accuracy, precision, and recall scores. Training and testing sets were tokenized using BertTokenizer.

3.1.5. The Generative Pre-Trained Transformer 2

The GPT2 models were fine-tuned for sequence classification using GPT2ForSequenceClassification from Hugging Face’s transformers library. The AdamW optimizer was employed with a learning rate of 5e-5.

3.1.6. XLNet Model

The XLNet model was fine-tuned for sequence classification using XLNetForSequenceClassification from Hugging Face’s transformers library. The AdamW optimizer was used with a learning rate of 5e-5.

3.1.7. RoBERTa

The RoBERTa model was fine-tuned for sequence classification using RobertaForSequenceClassification from Hugging Face’s transformers library. The AdamW optimizer was utilized with a learning rate of 5e-5.

3.2. Data Preprocessing

We experimented with the seven architectures described at the beginning of Section 3. The training data was batched for efficient processing. All models were trained on 90% of the books’ text (after removing the header and the footer as explained in Section 2.6).
Figure 2 presents the experimental pipeline. We employed different preprocessing and text representation depending on the specific model being used. For LR, XGB, and SVM (left side of the flowchart), the process involved tokenization, converting the input texts into Term Frequency-Inverse Document Frequency (TF-IDF) vectors, and then feeding these vectors into the respective prediction models to train and make predictions.
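As an illustration of the TF-IDF representation mentioned above, the following minimal sketch computes TF-IDF vectors with the standard library only. The whitespace tokenizer, the smoothed IDF variant, the function name tfidf_vectors, and the toy documents are all assumptions made for this example; they are not the settings used in the experiments.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document)."""
    # Naive lowercase whitespace tokenization (an assumption for this sketch).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency is the relative frequency of the term in the document.
        vectors.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vectors


# Hypothetical toy corpus, standing in for book texts.
docs = ["the sea was calm", "the house was quiet", "the whale rose"]
vocab, vectors = tfidf_vectors(docs)
```

A term that occurs in every document (here "the") receives the lowest IDF, while a term unique to one document (here "sea") is weighted more heavily in that document's vector.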
For transformer-based models BERT, GPT2, XLNet, and RoBERTa (right side of the flowchart), the books were processed to create sentence vectors, which are rich and context-aware representations of the text. For obtaining the sentence vectors and performing the classification tasks, the Hugging Face transformer library was used for all models. The tokenized text is passed through each model: BERT uses the embedding of the [CLS] token, GPT2 uses the final hidden state, XLNet utilizes the embeddings of the final hidden state, and RoBERTa uses the initial special token’s embedding as the sentence vector. These sentence vectors were then fed into the respective transformer models, which subsequently classified the author’s gender based on the features extracted from the sentence vectors.
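The two pooling strategies named above (the vector of a leading special token versus the final hidden state) can be sketched on plain lists. This is only an illustration: the Hugging Face sequence classification classes perform this selection internally, and the function names, the toy hidden states, and the right-padding assumption are hypothetical.

```python
def first_token_vector(hidden_states):
    """Vector at the first position, e.g. a leading special token such as [CLS]."""
    return hidden_states[0]


def last_token_vector(hidden_states, attention_mask):
    """Vector at the last non-padding position (assumes right padding)."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]


# Toy sequence: 4 positions with 2-dimensional hidden states; the last one is padding.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

cls_style = first_token_vector(hidden_states)                    # first-token pooling
last_style = last_token_vector(hidden_states, attention_mask)    # last-token pooling
```

With an autoregressive model, only the last real token has attended to the whole input, which is why the padding mask matters for last-token pooling.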
In the experiments with LR, SVM, and XGB, we used CountVectorizer to prepare text data for further processing by creating a vocabulary of words. In the experiments with BERT, GPT2, XLNet, and RoBERTa, we used each model’s own tokenizer. Each model was trained on the extracted feature vectors (tokenizer output or CountVectorizer output), using the corresponding ground truth labels.
During the training of BERT, GPT2, XLNet, and RoBERTa, we used the AdamW [30,31] optimizer.

4. Results

This section presents the results of the experiments on the BookSCE dataset. In all the experiments, we followed the settings described in Section 3.1.
The results are reported in Table 2. We calculated the overall classification accuracy, the average Precision (P), Recall (R), and F1-score, as well as the Precision, Recall, and F1-score for each class separately (denoted with subscripts ’m’ and ’f’ for male and female, respectively). The Acc column of Table 2 contains the average classification accuracy for all models on the BookSCE dataset. As can be seen, GPT2 provided the best overall accuracy of 0.925, and XLNet followed with the second-best accuracy of 0.907.
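The relationship between the per-class scores and the averaged F1-score can be checked with a few lines of Python. The sketch below assumes that the F1-score is the harmonic mean of precision and recall and that F1 (avg) is the unweighted mean of the two class F1-scores; the input values are the GPT2 precision and recall figures from Table 2, and the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# GPT2 (precision, recall) per class, taken from Table 2.
gpt2 = {"f": (0.851, 0.784), "m": (0.944, 0.963)}

f1_f = f1_score(*gpt2["f"])
f1_m = f1_score(*gpt2["m"])
f1_avg = (f1_f + f1_m) / 2  # unweighted (macro) average over the two classes

print(round(f1_f, 3), round(f1_m, 3), round(f1_avg, 3))  # 0.816 0.953 0.885
```

Because the average is unweighted, the weaker female-class score pulls F1 (avg) well below the overall accuracy, which is dominated by the larger male class.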
The GPT2 model achieved an impressive accuracy of 0.925 and precision scores of 0.944 for males and 0.851 for females. This model’s success can be attributed to its generative and autoregressive training, which enhances its ability to interpret contextual information. The high performance of GPT2 in binary classification tasks highlights its strong contextual understanding and precise prediction capabilities.
The XLNet model also performed well, with an accuracy of 0.907 and precision scores of 0.931 for males and 0.806 for females. XLNet’s architecture, which includes autoregressive and permutation-based training, enables it to excel in capturing complex patterns and dependencies in the text. This contributes to its high performance, making it a strong contender in binary classification tasks.
At the bottom of the performance table, BERT and RoBERTa showed significantly lower accuracies of 0.739 and 0.711, respectively. The relatively poor performance of these models may be due to their different architectural designs and the way they handle textual data. The high versatility and variability in book texts can pose a challenge for these models, making it harder to capture consistent patterns, especially when dealing with the nuanced language found in literature.
Interestingly, the simpler LR model demonstrated competitive performance, achieving an overall accuracy of 0.837 and even outperforming some of the more complex models like BERT and RoBERTa. The LR model precision for males was 0.886, and for females it was 0.605. This phenomenon may be explained by the nature of book texts, which may exhibit more straightforward correlations between features and the target variable, making simpler models like LR surprisingly effective. Furthermore, adding L2 regularization marginally improved the accuracy of LR to 0.841, highlighting the impact of regularization techniques on model performance. In summary, the results indicate that sophisticated models like GPT2 and XLNet are particularly well-suited for the binary classification task on the BookSCE dataset, likely due to their advanced architectures and training methods.

5. Case Study

To assess the ability of our classifiers to identify female authors who wrote under male pen names, we conducted an experiment using 12 books from the BookSCE dataset. These books were written by women using male pseudonyms and are listed in Table 3. For this experiment, we used a different data split: the training set included all books from the BookSCE dataset except those listed in Table 3, while the test set comprised only the books listed in Table 3. We applied three classifiers: GPT2, XLNet, and LR. GPT2 achieved an accuracy of 0.833 and XLNet an accuracy of 0.750, whereas the LR model performed poorly on this task, achieving only 0.333 (Table 3). It is worth noting that "Middlemarch" is a complex and deeply layered novel, renowned for its exploration of social, political, and personal issues in 19th-century England. George Eliot’s narrative style in this book is intricate, with a focus on psychological depth and realism, which might mirror the writing styles of male authors of the time and may explain why all three classifiers failed to recognize its author as female. "Daniel Deronda" focuses heavily on a male character, which may make it harder for classifiers to correctly identify the author’s gender. Additionally, "Impressions of Theophrastus Such" contains numerous short stories, which complicates identifying it as the work of a female author.
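The reported case-study accuracies follow directly from the per-book outcomes in Table 3. The sketch below recomputes them; each string lists the twelve outcomes in table order, with V for a correct and X for an incorrect recognition, and the variable names are chosen for this example only.

```python
# Per-book outcomes from Table 3, in table order (V = correct, X = incorrect).
outcomes = {
    "GPT2": "XVVVVXVVVVVV",
    "XLNet": "XVVVXVVVVVVX",
    "LR": "XXXVXXVXXVXV",
}

# Accuracy is the share of the 12 books recognized as written by a female author.
accuracy = {model: marks.count("V") / len(marks) for model, marks in outcomes.items()}

for model, acc in accuracy.items():
    print(f"{model}: {acc:.3f}")
# GPT2: 0.833
# XLNet: 0.750
# LR: 0.333
```

With only 12 books in the test set, a single book changes the accuracy by about 0.083, so differences between classifiers in this case study should be read with caution.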

6. Discussion and Future Work

One key finding of our study is the ability to identify an author’s gender, which can be used to explore gender dynamics in literature. By examining how gender identity and expression impact literary themes, writing styles, and narrative tactics, scholars can gain insight into the experiences and viewpoints of both male and female authors. For instance, novels written by women may highlight issues like relationships, identity, and societal expectations, while those written by men may focus on power, masculinity, and societal hierarchy. This critical examination can reveal the intricate relationship between gender, creativity, and cultural creation in literature, as well as how gendered symbols, motifs, and tropes represent wider cultural views on gender and identity.
In future work, we plan to expand the dataset by adding more books written by female authors to improve model performance and fairness. Techniques like grid search [32], random search [33], and Bayesian optimization [34] will be used to optimize hyperparameters. Developing novel techniques for mitigating biases and improving model generalization is essential to ensure the fairness and reliability of gender prediction models. This might involve fairness-aware training objectives and adversarial training methods.
Additionally, incorporating advanced models such as GPT-4 [35], T5 [36], and Large Language Models (LLMs) [37] like Llama 2 [38] could enhance prediction accuracy and generalization. Fine-tuning these models on gender prediction tasks can take advantage of their rich linguistic representations and contextual understanding to improve performance.

7. Conclusions

The objective of this study is to classify the gender of the author based on the text of the author’s books. For this research, we compiled the BookSCE dataset, collected primarily from books in Project Gutenberg. The books were annotated by our research group with meta-data and author-related information, including the gender of the author. The BookSCE dataset contains 8222 unique books dating from the 16th century to the present day, with varying levels of complexity, including archaic language, different writing styles, and diverse genres. We believe that the compiled dataset will be valuable for the research community, providing a robust foundation to advance studies and drive further developments in this field. Additionally, the dataset will be valuable for various other tasks, e.g., automatic text dating, as it includes a range of annotations beyond gender. The dataset has been utilized in the CoLiE: Automatic Classification of Literary Epochs competition [39] to predict the literary epoch in which a text was written.
We conducted extensive experiments on binary author gender classification using several models: LR, XGBClassifier, SVM, BERT, GPT2, XLNet, and RoBERTa.
The best results were achieved by the GPT2 model, with an overall accuracy of 0.925 and an average F1 score of 0.885. XLNet followed with the second-best scores, including an overall accuracy of 0.907 and an average F1 score of 0.856. The LR model, known for its straightforward and interpretable nature, also provided good results, demonstrating its effectiveness in book author gender classification. This highlights its suitability for scenarios requiring transparency and simplicity. Additionally, it remains a viable option in resource-constrained environments.
In future research, we plan to focus on enhancing gender prediction models by incorporating advanced techniques and models such as GPT-4, T5, and other LLMs.

Author Contributions

Conceptualization, V.Y., M.L., and I.R.; methodology, M.L. and I.R.; software, V.Y.; validation, V.Y.; formal analysis, V.Y., M.L., and I.R.; investigation, V.Y., M.L., and I.R.; resources, V.Y., M.L., and I.R.; writing—original draft preparation, V.Y.; writing—review and editing, V.Y., M.L., and I.R.; supervision, M.L. and I.R.; project administration, M.L. and I.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset will be made available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sand, G. Indiana; Chicago Review Press, Incorporated, 2000.
  2. Sand, G. Valentine; Chicago Review Press, 2005.
  3. Lee, V. The Tower of the Mirrors and Other Essays on the Spirit of Places (1914); CreateSpace Independent Publishing Platform, 2017.
  4. Paget, V. A Phantom Lover, by Vernon Lee; IndyPublish.com, 1886.
  5. Tiptree, J. Her Smoke Rose Up Forever; S.F. MASTERWORKS, Orion, 2014.
  6. Tiptree, J. The Starry Rift; Orion, 2015.
  7. Boyd, R.L.; Ashokkumar, A.; Seraj, S.; Pennebaker, J.W. The Development and Psychometric Properties of LIWC-22; LIWC.net: Austin, TX, USA, 2022.
  8. Bamman, D.; Smith, N.A. Unsupervised Discovery of Biographical Structure from Text. Transactions of the Association for Computational Linguistics 2014, 2, 363–375.
  9. Bsir, B.; Zrigui, M. Bidirectional LSTM for Author Gender Identification. 10th International Conference, ICCCI 2018, 2018.
  10. Zhu, Z.; Ke, Z.; Cui, J.; Yu, H.; Liu, G. The construction of Chinese microblog gender-specific thesauruses and user gender classification. Applied Network Science 2018, 3, 1–17.
  11. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 2019, [arXiv:cs.CL/1810.04805].
  12. Cheng, N.; Chandramouli, R.; Subbalakshmi, K. Author gender identification from text. Digital Investigation 2011, 8, 78–88.
  13. Cheng, N.; Chen, X.; Chandramouli, R.; Subbalakshmi, K. Gender Identification from E-mails. 2009 IEEE Symposium on Computational Intelligence and Data Mining, 2009, pp. 154–158.
  14. Johannsen, A.; Hovy, D.; Søgaard, A. Cross-lingual syntactic variation over age and gender. Conference on Computational Natural Language Learning 2015, pp. 104–110.
  15. Ford, E.; Shepherd, S.; Jones, K.; Hassan, L. Toward an Ethical Framework for the Text Mining of Social Media for Health Research: A Systematic Review. Frontiers in Digital Health 2021, 2, 592237.
  16. Jaiswal, S.; Verma, A.K.; Mukherjee, A. Auditing Gender Analyzers on Text Data. Proceedings of the 2023 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining; Association for Computing Machinery: New York, NY, USA, 2024; pp. 108–115.
  17. Schwarzenberg, P.; Figueroa, A.R. Textual Pre-Trained Models for Gender Identification Across Community Question-Answering Members. IEEE Access 2023, 11, 3983–3995.
  18. Lee-Thorp, J.; Ainslie, J.; Eckstein, I.; Ontanon, S. FNet: Mixing Tokens with Fourier Transforms, 2022, [arXiv:cs.CL/2105.03824].
  19. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, [arXiv:cs.CL/2004.02984].
  20. PAN. Author Profiling. 2017. Available online: https://pan.webis.de/clef17/pan17-web/author-profiling.html.
  21. Pritom, R.R. Gender Recognition Dataset. 2021. Available online: https://www.kaggle.com/datasets/rashikrahmanpritom/gender-recognition-dataset.
  22. Bisong, E. Building Machine Learning and Deep Learning Models on Google Cloud Platform; Apress: Berkeley, CA, USA, 2019.
  23. Demir-Kavuk, O.; Akutsu, M.K.T.; Knapp, E.W. Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features. 2011 10th International Conference on Machine Learning and Applications 2011.
  24. Oakes, M.; Gaizauskas, R.; Fowkes, H. A Method Based on the Chi-Square Test for Document Classification. SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, September 9-13, 2001, New Orleans, Louisiana, USA 2001.
  25. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A Comparative Analysis of XGBoost. Universidad Autónoma de Madrid, November 2019.
  26. Cortes, C.; Vapnik, V. Support-Vector Networks. Machine Learning 1995, 20, 273–297.
  27. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; others. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9.
  28. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 2019, [arXiv:cs.CL/1906.08237].
  29. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692 2020, [arXiv:cs.CL/1907.11692].
  30. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization, 2017, [arXiv:cs.LG/1412.6980].
  31. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101 2019, [arXiv:cs.LG/1711.05101].
  32. Pontes, F.; Amorim, G.; Balestrassi, P.; Paiva, A.; Ferreira, J. Design of experiments and focused grid search for neural network parameter optimization. Neurocomputing 2016, 186, 22–34.
  33. Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 2012, 13, 281–305.
  34. Garrido-Merchán, E.C.; Gozalo-Brizuela, R.; González-Carvajal, S. Comparing BERT against Traditional Machine Learning Models in Text Classification. Journal of Computational and Cognitive Engineering 2023, pp. 1–7.
  35. OpenAI.; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; Avila, R.; Babuschkin, I.; Balaji, S.; Balcom, V.; Baltescu, P.; Bao, H.; Bavarian, M.; Belgum, J.; Bello, I.; Berdine, J.; Bernadett-Shapiro, G.; Berner, C.; Bogdonoff, L.; Boiko, O.; Boyd, M.; Brakman, A.L.; Brockman, G.; Brooks, T.; Brundage, M.; Button, K.; Cai, T.; Campbell, R.; Cann, A.; Carey, B.; Carlson, C.; Carmichael, R.; Chan, B.; Chang, C.; Chantzis, F.; Chen, D.; Chen, S.; Chen, R.; Chen, J.; Chen, M.; Chess, B.; Cho, C.; Chu, C.; Chung, H.W.; Cummings, D.; Currier, J.; Dai, Y.; Decareaux, C.; Degry, T.; Deutsch, N.; Deville, D.; Dhar, A.; Dohan, D.; Dowling, S.; Dunning, S.; Ecoffet, A.; Eleti, A.; Eloundou, T.; Farhi, D.; Fedus, L.; Felix, N.; Fishman, S.P.; Forte, J.; Fulford, I.; Gao, L.; Georges, E.; Gibson, C.; Goel, V.; Gogineni, T.; Goh, G.; Gontijo-Lopes, R.; Gordon, J.; Grafstein, M.; Gray, S.; Greene, R.; Gross, J.; Gu, S.S.; Guo, Y.; Hallacy, C.; Han, J.; Harris, J.; He, Y.; Heaton, M.; Heidecke, J.; Hesse, C.; Hickey, A.; Hickey, W.; Hoeschele, P.; Houghton, B.; Hsu, K.; Hu, S.; Hu, X.; Huizinga, J.; Jain, S.; Jain, S.; Jang, J.; Jiang, A.; Jiang, R.; Jin, H.; Jin, D.; Jomoto, S.; Jonn, B.; Jun, H.; Kaftan, T.; Łukasz Kaiser.; Kamali, A.; Kanitscheider, I.; Keskar, N.S.; Khan, T.; Kilpatrick, L.; Kim, J.W.; Kim, C.; Kim, Y.; Kirchner, J.H.; Kiros, J.; Knight, M.; Kokotajlo, D.; Łukasz Kondraciuk.; Kondrich, A.; Konstantinidis, A.; Kosic, K.; Krueger, G.; Kuo, V.; Lampe, M.; Lan, I.; Lee, T.; Leike, J.; Leung, J.; Levy, D.; Li, C.M.; Lim, R.; Lin, M.; Lin, S.; Litwin, M.; Lopez, T.; Lowe, R.; Lue, P.; Makanju, A.; Malfacini, K.; Manning, S.; Markov, T.; Markovski, Y.; Martin, B.; Mayer, K.; Mayne, A.; McGrew, B.; McKinney, S.M.; McLeavey, C.; McMillan, P.; McNeil, J.; Medina, D.; Mehta, A.; Menick, J.; Metz, L.; Mishchenko, A.; Mishkin, P.; Monaco, V.; Morikawa, E.; Mossing, D.; Mu, T.; Murati, M.; 
Murk, O.; Mély, D.; Nair, A.; Nakano, R.; Nayak, R.; Neelakantan, A.; Ngo, R.; Noh, H.; Ouyang, L.; O’Keefe, C.; Pachocki, J.; Paino, A.; Palermo, J.; Pantuliano, A.; Parascandolo, G.; Parish, J.; Parparita, E.; Passos, A.; Pavlov, M.; Peng, A.; Perelman, A.; de Avila Belbute Peres, F.; Petrov, M.; de Oliveira Pinto, H.P.; Michael.; Pokorny.; Pokrass, M.; Pong, V.H.; Powell, T.; Power, A.; Power, B.; Proehl, E.; Puri, R.; Radford, A.; Rae, J.; Ramesh, A.; Raymond, C.; Real, F.; Rimbach, K.; Ross, C.; Rotsted, B.; Roussez, H.; Ryder, N.; Saltarelli, M.; Sanders, T.; Santurkar, S.; Sastry, G.; Schmidt, H.; Schnurr, D.; Schulman, J.; Selsam, D.; Sheppard, K.; Sherbakov, T.; Shieh, J.; Shoker, S.; Shyam, P.; Sidor, S.; Sigler, E.; Simens, M.; Sitkin, J.; Slama, K.; Sohl, I.; Sokolowsky, B.; Song, Y.; Staudacher, N.; Such, F.P.; Summers, N.; Sutskever, I.; Tang, J.; Tezak, N.; Thompson, M.B.; Tillet, P.; Tootoonchian, A.; Tseng, E.; Tuggle, P.; Turley, N.; Tworek, J.; Uribe, J.F.C.; Vallone, A.; Vijayvergiya, A.; Voss, C.; Wainwright, C.; Wang, J.J.; Wang, A.; Wang, B.; Ward, J.; Wei, J.; Weinmann, C.; Welihinda, A.; Welinder, P.; Weng, J.; Weng, L.; Wiethoff, M.; Willner, D.; Winter, C.; Wolrich, S.; Wong, H.; Workman, L.; Wu, S.; Wu, J.; Wu, M.; Xiao, K.; Xu, T.; Yoo, S.; Yu, K.; Yuan, Q.; Zaremba, W.; Zellers, R.; Zhang, C.; Zhang, M.; Zhao, S.; Zheng, T.; Zhuang, J.; Zhuk, W.; Zoph, B. GPT-4 Technical Report, 2024, [arXiv:cs.CL/2303.08774]. [CrossRef]
  36. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2023, [arXiv:cs.LG/1910.10683].
  37. D’Souza, J. A Review of Transformer Models, 2023.
  38. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; Bikel, D.; Blecher, L.; Ferrer, C.C.; Chen, M.; Cucurull, G.; Esiobu, D.; Fernandes, J.; Fu, J.; Fu, W.; Fuller, B.; Gao, C.; Goswami, V.; Goyal, N.; Hartshorn, A.; Hosseini, S.; Hou, R.; Inan, H.; Kardas, M.; Kerkez, V.; Khabsa, M.; Kloumann, I.; Korenev, A.; Koura, P.S.; Lachaux, M.A.; Lavril, T.; Lee, J.; Liskovich, D.; Lu, Y.; Mao, Y.; Martinet, X.; Mihaylov, T.; Mishra, P.; Molybog, I.; Nie, Y.; Poulton, A.; Reizenstein, J.; Rungta, R.; Saladi, K.; Schelten, A.; Silva, R.; Smith, E.M.; Subramanian, R.; Tan, X.E.; Tang, B.; Taylor, R.; Williams, A.; Kuan, J.X.; Xu, P.; Yan, Z.; Zarov, I.; Zhang, Y.; Fan, A.; Kambadur, M.; Narang, S.; Rodriguez, A.; Stojnic, R.; Edunov, S.; Scialom, T. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023, [arXiv:cs.CL/2307.09288]. [CrossRef]
  39. Rabaev, I.; Litvak, M.; Younkin, V.; Campos, R.; Jorge, A.M.; Jatowt, A. The Competition on Automatic Classification of Literary Epochs, 2023.
Figure 1. Left: Example of book after removing 10% of the text at the beginning and the Gutenberg Header; Right: Example of book after removing Gutenberg footer.
Figure 2. The experimental pipeline. Left: tokenization and TF-IDF vectors for LR, XGB, and SVM; Right: sentence vectors for the transformer-based models BERT, GPT2, XLNet, and RoBERTa.
Table 1. The BookSCE split into training, validation, and testing sets.
              Male    Female   Total
Training      5108    1320     6428
Validation     715     183      898
Test           709     188      897
Total         6532    1691     8223
Table 2. Models’ performance on the BookSCE dataset; the f and m subscripts indicate that the corresponding value is computed for the female (f) or male (m) class.
Model                       Acc (avg)  Precision_f  Precision_m  Recall_f  Recall_m  F1_f   F1_m   F1 (avg)
LogisticRegression          0.837      0.605        0.886        0.528     0.914     0.564  0.900  0.732
LogisticRegression with L2  0.841      0.637        0.889        0.576     0.912     0.605  0.901  0.753
LogisticRegression with L1  0.825      0.601        0.870        0.485     0.915     0.537  0.892  0.714
XGB                         0.849      0.797        0.854        0.355     0.977     0.491  0.911  0.701
XGB + XGBoost               0.853      0.787        0.861        0.402     0.971     0.532  0.913  0.723
SVM                         0.809      0.548        0.878        0.538     0.882     0.543  0.880  0.711
BERT                        0.739      0.286        0.800        0.160     0.894     0.205  0.844  0.525
GPT2                        0.925      0.851        0.944        0.784     0.963     0.816  0.953  0.885
XLNet                       0.907      0.806        0.931        0.735     0.953     0.769  0.942  0.856
RoBERTa                     0.711      0.140        0.779        0.070     0.884     0.093  0.828  0.461
Table 3. Books by Female Authors Using Male Pen Names from the BookSCE Dataset. Note: V indicates "Correct" and X indicates "Not Correct" recognition.
Book Information                                                                                Female Name Recognition
Author          Year  Book Name                                                                 GPT2  XLNet  LR
Eliot, George   1832  Middlemarch                                                               X     X      X
Eliot, George   1859  Adam Bede                                                                 V     V      X
Eliot, George   1861  Silas Marner                                                              V     V      X
Eliot, George   1860  The Mill on the Floss                                                     V     V      V
Eliot, George   1876  Daniel Deronda                                                            V     X      X
Eliot, George   1879  Impressions of Theophrastus Such                                          X     V      X
Eliot, George   1857  Scenes of Clerical Life                                                   V     V      V
Eliot, George   1862  Romola                                                                    V     V      X
Eliot, George   1866  Felix Holt, the Radical                                                   V     V      X
Lee, Vernon     1886  A Phantom Lover                                                           V     V      V
Lee, Vernon     1903  Penelope Brandling: A Tale of the Welsh coast in the Eighteenth Century.  V     V      X
Brontë, Anne    1848  The Tenant of Wildfell Hall                                               V     X      V
Acc                                                                                             0.833 0.750  0.333
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.