Submitted: 15 May 2024
Posted: 20 May 2024
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Dataset
2.2. Author Attribution
2.3. Text Classification
3. Dataset
- Novels, to address the lack of long, informal data
- News, to address the lack of long, formal data
- Tweets, to address the lack of short data, both formal and informal
3.1. Novel Dataset
3.2. News Dataset
3.3. Twitter Dataset
4. Model
4.1. Embedding Layer
4.2. Bi-LSTM Layer
4.3. Residual Connection
4.4. Attention Mechanism
4.5. Softmax
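The headings above list the layers of the model, but the extracted text does not preserve their exact wiring. The following is a minimal PyTorch sketch of the described pipeline, with sizes taken from the best settings in the Section 5.3 tables (3 Bi-LSTM layers, 128 hidden units, 20 attention heads); the 300-dimensional embeddings, the reading of "attention heads" as multi-hop additive self-attention, and the placement of the residual connection are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    """Multi-hop additive self-attention (in the style of Lin et al., 2017).
    The paper's 'number of attention heads' is assumed to mean these hops."""

    def __init__(self, dim: int, hops: int = 20, da: int = 64):
        super().__init__()
        self.w1 = nn.Linear(dim, da, bias=False)
        self.w2 = nn.Linear(da, hops, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, T, dim)
        # One attention distribution over time steps per hop.
        a = torch.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)  # (B, T, hops)
        # Weighted sum per hop, averaged into a single sentence vector.
        return torch.einsum("bth,btd->bhd", a, h).mean(dim=1)      # (B, dim)

class AuthorshipClassifier(nn.Module):
    """Sketch of the Section 4 pipeline: embedding -> stacked Bi-LSTM
    -> residual connection -> attention -> softmax classifier."""

    def __init__(self, vocab_size: int, num_authors: int,
                 embed_dim: int = 300, hidden: int = 128,
                 lstm_layers: int = 3, hops: int = 20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, num_layers=lstm_layers,
                              bidirectional=True, batch_first=True)
        # Projects embeddings to the Bi-LSTM output width (2 * hidden) so
        # the residual addition is dimensionally valid; this exact wiring
        # is an assumption.
        self.proj = nn.Linear(embed_dim, 2 * hidden)
        self.attention = StructuredSelfAttention(2 * hidden, hops=hops)
        self.classifier = nn.Linear(2 * hidden, num_authors)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:  # (B, T)
        emb = self.embedding(token_ids)        # (B, T, embed_dim)
        out, _ = self.bilstm(emb)              # (B, T, 2 * hidden)
        out = out + self.proj(emb)             # residual connection
        pooled = self.attention(out)           # (B, 2 * hidden)
        return self.classifier(pooled)         # logits; softmax in the loss
```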
5. Experimental Setup
5.1. Data Pre-processing
- Remove non-ASCII and non-Persian characters.
- Remove punctuation, including ".", ",", "!", "?", etc.
- Remove URLs, since they rarely carry author-specific textual information.
- Remove numbers, which usually carry no specific textual information.
- Remove stop-words, i.e., the most common words of the language, which carry little semantic information. (A sketch of these steps follows the list.)
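A minimal Python sketch of the steps above. The regex character ranges and the illustrative stop-word set are assumptions; a real pipeline would substitute a full Persian stop-word resource.

```python
import re

# Illustrative stop-word set; a full Persian stop-word list would be
# substituted here in practice.
PERSIAN_STOPWORDS = {"و", "در", "به", "از", "که"}

def preprocess(text: str) -> str:
    """Apply the Section 5.1 steps; regex ranges are assumptions."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)           # remove URLs
    # Keep Persian-script letters, ASCII alphanumerics, and whitespace;
    # this drops punctuation and other non-ASCII, non-Persian characters.
    text = re.sub(r"[^\u0600-\u06FFA-Za-z0-9\s]", " ", text)
    # Remove Western, Persian, and Arabic-Indic digits.
    text = re.sub(r"[0-9\u06F0-\u06F9\u0660-\u0669]", " ", text)
    tokens = [t for t in text.split() if t not in PERSIAN_STOPWORDS]
    return " ".join(tokens)
```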
5.2. Embedding
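The heading and the citation of Mikolov et al. suggest word2vec-style word embeddings. A hedged sketch with gensim follows; the corpus variable and all hyperparameter values (vector_size, window, sg) are illustrative assumptions, not the authors' reported settings.

```python
from gensim.models import Word2Vec

# corpus: pre-processed, tokenized documents (see Section 5.1);
# shown here with two toy sentences for illustration only.
corpus = [["نمونه", "متن"], ["متن", "دیگر"]]

# Skip-gram word2vec; all hyperparameters are assumptions.
w2v = Word2Vec(sentences=corpus, vector_size=300, window=5,
               min_count=1, sg=1, workers=4)

vector = w2v.wv["متن"]  # 300-dimensional embedding for a token
```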
5.3. Hyperparameter Tuning
| Number of LSTM layers | Novels | News | Twitter |
|---|---|---|---|
| 1 | 86.29 | 80.37 | 65.84 |
| 2 | 88.03 | 82.31 | 70.62 |
| 3 | 90.22 | 85.10 | 71.39 |
| 4 | 88.71 | 83.39 | 69.30 |
| 5 | 88.63 | 83.18 | 68.65 |

| Number of hidden units | Novels | News | Twitter |
|---|---|---|---|
| 64 | 88.91 | 83.76 | 70.08 |
| 128 | 90.22 | 85.10 | 71.39 |
| 256 | 89.54 | 84.47 | 70.57 |

| Number of attention heads | Novels | News | Twitter |
|---|---|---|---|
| 10 | 89.79 | 85.02 | 71.03 |
| 20 | 90.22 | 85.10 | 71.39 |
| 30 | 89.06 | 84.19 | 70.90 |
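The tables read as a sweep around a single best configuration (3 layers, 128 hidden units, 20 attention heads); whether the authors ran a full grid or varied one factor at a time is not stated. A minimal sketch of an exhaustive sweep, with train_and_evaluate as a hypothetical helper:

```python
from itertools import product

def sweep(train_and_evaluate):
    """Exhaustive sweep over the Section 5.3 search space.
    train_and_evaluate is a hypothetical helper that trains the Section 4
    model with the given settings and returns validation accuracy."""
    best = None
    for layers, hidden, hops in product((1, 2, 3, 4, 5),
                                        (64, 128, 256),
                                        (10, 20, 30)):
        acc = train_and_evaluate(lstm_layers=layers, hidden_units=hidden,
                                 attention_heads=hops)
        if best is None or acc > best[0]:
            best = (acc, layers, hidden, hops)
    return best  # the tables' best setting: 3 layers, 128 units, 20 heads
```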
6. Results
7. Conclusion and Future Work
- Data preparation: given the many news sources and social-media posts associated with each individual, both the number of authors and the amount of data per author can be increased.
- Data processing: as noted earlier, typical pre-processing such as removing stop-words or punctuation reduces accuracy, so we use the Hazm normalizer instead. Further research is needed to determine the best way to normalize the data.
- Embedding: in future work, recent contextual embeddings, such as BERT or the encoder of mT5, can be used.
- Model architecture: this architecture can be applied to other tasks such as natural language inference (NLI), question answering, and machine translation.
References
- Stamatatos, E. On the robustness of authorship attribution based on character n-gram features. Journal of Law and Policy 2013, 21, 421–439. [Google Scholar]
- Potthast, M.; Braun, S.; Buz, T.; Duffhauss, F.; Friedrich, F.; Gülzow, J.; Köhler, J.; Loetzsch, W.; Müller, F.; Müller, M.; Paßmann, R.; Reinke, B.; Rettenmeier, L.; Rometsch, T.; Sommer, T.; Träger, M.; Wilhelm, S.; Stein, B.; Stamatatos, E.; Hagen, M. Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval. In Advances in Information Retrieval (ECIR 2016); Lecture Notes in Computer Science, Vol. 9626; 2016; pp. 393–407. [CrossRef]
- Sidorov, G.; Velasquez, F.; Stamatatos, E.; Gelbukh, A.; Chanona-Hernández, L. Syntactic N-grams as machine learning features for natural language processing. Expert Syst. Appl. 2014, 41, 853–860. [Google Scholar] [CrossRef]
- Gomez Adorno, H.; Sidorov, G.; Pinto, D.; Vilariño, D.; Gelbukh, A. Automatic Authorship Detection Using Textual Patterns Extracted from Integrated Syntactic Graphs. Sensors 2016, 16, 1374. [Google Scholar] [CrossRef] [PubMed]
- Mosteller, F.; Wallace, D.L. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Journal of the American Statistical Association 1963, 58, 275–309. [Google Scholar] [CrossRef]
- Phani, S.; Lahiri, S.; Biswas, A. Authorship Attribution in Bengali Language. In Proceedings of the 12th International Conference on Natural Language Processing; NLP Association of India: Trivandrum, India, 2015; pp. 100–105. [Google Scholar]
- Diederich, J.; Kindermann, J.; Leopold, E.; Paass, G. Authorship attribution with support vector machines. Applied Intelligence 2003, 19, 109–123. [Google Scholar] [CrossRef]
- Zheng, Q.; Tian, X.; Yang, M.; Wang, H. The email author identification system based on support vector machine (SVM) and analytic hierarchy process (AHP). IAENG International Journal of Computer Science 2019, 46, 178–191. [Google Scholar]
- Ouamour, S.; Sayoud, H. Authorship attribution of ancient texts written by ten Arabic travelers using an SMO-SVM classifier. 2012 International Conference on Communications and Information Technology (ICCIT). IEEE, 2012, pp. 44–47. [CrossRef]
- Martín-del Campo-Rodríguez, C.; Alvarez, D.A.P.; Sifuentes, C.E.M.; Sidorov, G.; Batyrshin, I.; Gelbukh, A. Authorship Attribution through Punctuation n-grams and Averaged Combination of SVM. 2019.
- Khonji, M.; Iraqi, Y.; Jones, A. An evaluation of authorship attribution using random forests. 2015 International Conference on Information and Communication Technology Research (ICTRC). IEEE, 2015, pp. 68–71. [CrossRef]
- Maitra, P.; Ghosh, S.; Das, D. Authorship Verification-An Approach based on Random Forest. arXiv 2016, arXiv:1607.08885. [Google Scholar]
- Grieve, J. Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing 2007, 22, 251–270. [Google Scholar] [CrossRef]
- Jin, M.; Jiang, M. Text clustering on authorship attribution based on the features of punctuations usage. 2012 IEEE 11th International Conference on Signal Processing. IEEE, 2012, Vol. 3, pp. 2175–2178. [CrossRef]
- Das, P.; Tasmim, R.; Ismail, S. An experimental study of stylometry in bangla literature. 2015 2nd International Conference on Electrical Information and Communication Technologies (EICT). IEEE, 2015, pp. 575–580. [CrossRef]
- Sari, Y.; Vlachos, A.; Stevenson, M. Continuous n-gram representations for authorship attribution. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 267–273.
- Jafariakinabad, F.; Tarnpradab, S.; Hua, K.A. Syntactic recurrent neural network for authorship attribution. arXiv 2019, arXiv:1902.09723. [Google Scholar]
- Jafariakinabad, F.; Hua, K.A. Style-aware neural model with application in authorship attribution. 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). IEEE, 2019, pp. 325–328. [CrossRef]
- Shaheen, Z.; Wohlgenannt, G.; Filtz, E. Large scale legal text classification using transformer models. arXiv 2020, arXiv:2010.12871. [Google Scholar]
- Yüksel, A.E.; Türkmen, Y.A.; Özgür, A.; Altınel, B. Turkish tweet classification with transformer encoder. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), 2019, pp. 1380–1387.
- González, J.Á.; Hurtado, L.F.; Pla, F. ELiRF-UPV at TASS 2019: Transformer Encoders for Twitter Sentiment Analysis in Spanish. IberLEF@ SEPLN, 2019, pp. 571–578.
- Alam, T.; Khan, A.; Alam, F. Bangla Text Classification using Transformers. arXiv 2020, arXiv:2011.04446. [Google Scholar]
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
- Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for relation classification. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 207–212.
- Jing, R. A self-attention based LSTM network for text classification. Journal of Physics: Conference Series. IOP Publishing, 2019, Vol. 1207, p. 012008. [CrossRef]
- Ding, Z.; Xia, R.; Yu, J.; Li, X.; Yang, J. Densely connected bidirectional LSTM with applications to sentence classification. CCF International Conference on Natural Language Processing and Chinese Computing. Springer, 2018, pp. 278–287. [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
- Zhu, X.; Yin, S.; Chen, Z. Attention based BiLSTM-MCNN for sentiment analysis. 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA). IEEE, 2020, pp. 170–174. [CrossRef]
- Liu, G.; Guo, J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 2019, 337, 325–338. [Google Scholar] [CrossRef]
- Conneau, A.; Schwenk, H.; Barrault, L.; Lecun, Y. Very deep convolutional networks for text classification. arXiv 2016, arXiv:1606.01781. [Google Scholar]
- Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent convolutional neural networks for text classification. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. [CrossRef]
- Basiri, M.E.; Nemati, S.; Abdar, M.; Cambria, E.; Acharya, U.R. ABCDM: An attention-based bidirectional CNN-RNN deep model for sentiment analysis. Future Generation Computer Systems 2021, 115, 279–294. [Google Scholar] [CrossRef]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Shrestha, P.; Sierra, S.; González, F.A.; Montes, M.; Rosso, P.; Solorio, T. Convolutional neural networks for authorship attribution of short texts. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pp. 669–674.
- Wang, J.H.; Liu, T.W.; Luo, X.; Wang, L. An LSTM approach to short text sentiment classification with word embeddings. Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018), 2018, pp. 214–223.


| Data Type | Number of Samples |
|---|---|
| Train | 19473 |
| Evaluation | 3125 |
| Test | 3437 |
| Data Type | Number of Samples | |
|---|---|---|
| Train | 66702 | 22242 |
| Evaluation | 10702 | 3569 |
| Test | 11772 | 3926 |
| Data Type | Number of Samples |
|---|---|
| Train | 231727 |
| Evaluation | 34288 |
| Test | 37717 |
| Data Type | Number of Samples |
|---|---|
| Train | 107866 |
| Evaluation | 19036 |
| Test | 22395 |
| ID | Model | Novel Accuracy | Novel F-score | News Accuracy | News F-score | Twitter Accuracy | Twitter F-score |
|---|---|---|---|---|---|---|---|
| 1 | CNN [35] | 85.92 | 85.92 | 63.34 | 62.15 | 60.70 | 60.50 |
| 2 | LSTM | 76.94 | 77.39 | 70.41 | 71.13 | 57.85 | 58.29 |
| 3 | M-CNN | 67.17 | 68.93 | 66.50 | 66.69 | 51.40 | 51.40 |
| 4 | VD-CNN [31] | 59.18 | 60.02 | 48.32 | 50.65 | 55.00 | 54.99 |
| 5 | Transformer | 78.14 | 78.20 | 50.79 | 51.15 | 58.00 | 58.59 |
| 6 | R-CNN | 89.14 | 89.14 | 79.77 | 79.76 | 67.22 | 67.82 |
| 7 | ABCDM [33] | 79.68 | 79.43 | 69.96 | 70.76 | 60.00 | 59.25 |
| 8 | DC-Bi-LSTM | 85.15 | 85.06 | 78.01 | 78.21 | 65.78 | 65.74 |
| 9 | AC-Bi-LSTM [30] | 66.95 | 66.73 | 60.92 | 62.08 | 54.15 | 54.42 |
| 10 | SATLSTM [25] | 86.65 | 86.62 | 77.57 | 77.74 | 65.59 | 65.27 |
| 11 | Ours | 90.22 | 90.22 | 85.10 | 85.04 | 71.39 | 71.18 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).