Submitted: 26 January 2023
Posted: 30 January 2023
Abstract
Keywords:
I. Introduction
PDF object types:
- Boolean values
- Strings: finite-length sequences of bytes
- Names: unique character sequences
- Arrays: one-dimensional arrays containing ordered PDF objects
- Null: the “null” object indicates that a referenced object does not exist in the PDF file
- Integers and real numbers
- Dictionaries: sets of key-value pairs that link each key name to its content
- Streams: byte sequences of unlimited length, which can be read incrementally
PDF file structure:
- Header: the first line of the PDF file, which gives the PDF version in the format “%PDF-a.b”, where a.b is the version number
- Body: contains the main content of the PDF document (i.e., the PDF objects)
- Cross-reference table: contains the location (byte offset) of each object in the body
- Trailer: contains the offset of the cross-reference table. A PDF viewer application starts reading the file from the trailer, proceeds to the cross-reference table, and then reads the objects in the body
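The reading order described above (header first, then trailer, then cross-reference table) can be illustrated with a minimal Python sketch. The function name `read_pdf_layout` and the tiny hand-built sample file are assumptions made for illustration only. The sketch handles the classic cross-reference table and ignores cross-reference streams and incremental updates.

```python
import re

def read_pdf_layout(data: bytes) -> dict:
    """Extract the header version and the cross-reference offset from raw PDF bytes."""
    # Header: the file starts with "%PDF-a.b".
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    version = header.group(1).decode("ascii") if header else None

    # Trailer: the last "startxref" keyword near the end of the file is
    # followed by the byte offset of the cross-reference table.
    offsets = re.findall(rb"startxref\s+(\d+)", data[-1024:])
    xref_offset = int(offsets[-1].decode("ascii")) if offsets else None

    # Check that the offset really points at the "xref" keyword.
    points_to_xref = (
        xref_offset is not None
        and data[xref_offset:xref_offset + 4] == b"xref"
    )
    return {
        "version": version,
        "xref_offset": xref_offset,
        "points_to_xref": points_to_xref,
    }

# Hypothetical minimal file: header, one object, xref table, trailer.
body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(len(body)).encode("ascii") + b"\n%%EOF\n"
)
layout = read_pdf_layout(sample)
print(layout)
```

Searching only the last 1024 bytes for `startxref` is a simplification chosen for this sketch; it mirrors the idea that a viewer begins at the end of the file.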
II. Literature Review
III. Methodology
- 1) Import the collected dataset into Python.
- 2) Split the dataset into feature and target arrays.
- 3) Preprocess and clean the dataset.
- 4) Split the dataset into training and testing datasets.
- 5) Apply Random Forest classification to the training and testing datasets.
- 6) Produce the prediction results.
- A. Dataset:
- General features
- PDF size
- title characters
- encryption
- metadata size
- number of pages
- header
- number of images
- text
- number of objects
- font objects
- number of embedded files
- average size of all the embedded media
- Structural features
- No. of keywords “stream”
- No. of keywords “endstream”
- Average stream size
- No. of Xref entries
- No. of name obfuscations
- Total number of filters used
- No. of objects with nested filters
- No. of object streams (ObjStm)
- No. of keywords “/JS”, No. of keywords “/JavaScript”
- No. of keywords “/URI”, No. of keywords “/Action”
- No. of keywords “/AA”, No. of keywords “/OpenAction”
- No. of keywords “/Launch”, No. of keywords “/SubmitForm”
- No. of keywords “/AcroForm”, No. of keywords “/XFA”
- No. of keywords “/JBIG2Decode”, No. of keywords “/Colors”
- No. of keywords “/RichMedia”, No. of keywords “trailer”
- No. of keywords “xref”, No. of keywords “startxref”
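Several of the structural features listed above are plain keyword counts, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `extract_structural_features`, the dictionary keys, and the sample bytes are hypothetical, and the sketch scans raw bytes only, so it does not look inside compressed streams and does not decode obfuscated names before counting them.

```python
import re

# Characters that end a PDF name: whitespace and the delimiters ()<>[]{}/%
NAME_END = rb"(?![^\s()<>\[\]{}/%])"

NAME_KEYWORDS = [
    b"/JS", b"/JavaScript", b"/URI", b"/Action", b"/AA", b"/OpenAction",
    b"/Launch", b"/SubmitForm", b"/AcroForm", b"/XFA", b"/JBIG2Decode",
    b"/Colors", b"/RichMedia",
]
PLAIN_KEYWORDS = [b"stream", b"endstream", b"trailer", b"xref", b"startxref"]

def extract_structural_features(data: bytes) -> dict:
    """Count selected keywords in raw PDF bytes (simplified sketch)."""
    features = {}
    for name in NAME_KEYWORDS:
        # Count "/JS" only when it is a complete name, not a prefix of a longer one.
        pattern = re.escape(name) + NAME_END
        features[name.decode("ascii")] = len(re.findall(pattern, data))
    for word in PLAIN_KEYWORDS:
        # Word boundaries keep "stream" from matching inside "endstream".
        pattern = rb"\b" + re.escape(word) + rb"\b"
        features[word.decode("ascii")] = len(re.findall(pattern, data))
    # Name obfuscation: names that hide characters behind "#xx" hex escapes.
    obfuscated = rb"/[^\s()<>\[\]{}/%]*#[0-9A-Fa-f]{2}[^\s()<>\[\]{}/%]*"
    features["obfuscated_names"] = len(re.findall(obfuscated, data))
    return features

# Hypothetical sample with one JavaScript action and one obfuscated filter name.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"3 0 obj\n<< /Length 5 /Filter /Fl#61teDecode >>\n"
    b"stream\nhello\nendstream\nendobj\n"
)
features = extract_structural_features(sample)
print(features)
```

A production extractor would parse the object structure instead of pattern-matching bytes; the regular expressions here are a deliberate shortcut to keep the idea visible.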
- B. Random Forest:
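As a rough illustration of the idea behind Random Forest (bootstrap sampling, random feature selection, and majority voting), the following self-contained Python sketch builds a forest of depth-1 trees. It is a simplified teaching example, not the implementation used in this work: real Random Forests grow deeper trees, and every function name, parameter, and data value here is an assumption.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def fit_stump(X, y, feature_ids):
    """Fit a depth-1 tree: the (feature, threshold) with the lowest weighted Gini."""
    majority = Counter(y).most_common(1)[0][0]
    best = None  # (score, feature, threshold, left_label, right_label)
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= thr]
            right = [lab for row, lab in zip(X, y) if row[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, thr,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no split possible: always predict the majority class
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]

def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # number of features each tree may use
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(d), k)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    votes = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data: two numeric features, 0 = benign, 1 = malicious.
X = [[i, i + 1] for i in range(10)] + [[100 + i, 200 + i] for i in range(10)]
y = [0] * 10 + [1] * 10
forest = fit_forest(X, y)
print(predict(forest, [3, 4]), predict(forest, [105, 205]))
```

Because each tree here has a single split, choosing a random feature subset per tree is equivalent to choosing it per split, which is what full Random Forests do at every node.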
- C. Evaluation metrics
- Accuracy:
- Recall:
- Precision:
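The metrics named above have standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The sketch below computes them in Python, together with the F1-score that appears in the results tables. The function names and the confusion counts in the example are hypothetical and are not taken from the experiments.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
    recall = tp / (tp + fn) if tp + fn else 0.0   # share of malicious files detected
    precision = tp / (tp + fp) if tp + fp else 0.0  # share of alarms that are correct
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1_score(precision, recall),
    }

# Hypothetical confusion counts, for illustration only.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(metrics)

# F1 implied by the precision and recall reported in the results tables.
reported_f1 = f1_score(0.9935, 0.9972)
print(round(reported_f1, 4))
```

Plugging the reported precision (99.35%) and recall (99.72%) into the F1 formula gives approximately 99.53%, which agrees with the reported F1-score of 99.5%.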
- D. Data preprocessing and cleaning:
- Data Cleaning:
- 1) Converted the last column (the class label) to 1 for ‘Malicious’ and 0 for ‘Benign’.
- 2) Removed some invalid records containing irrelevant data in the JavaScript and Object fields.
- 3) Removed records containing “NaN” for all the features (representing ‘Null’ values).
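The three cleaning steps above can be sketched with the Python standard library as follows. The column names (`JS`, `Obj`, `Class`), the function names, and the interpretation of “irrelevant data” as non-numeric values are assumptions made for this example; the sketch also drops rows whose label is neither ‘Malicious’ nor ‘Benign’, which the text does not mention.

```python
import math

LABELS = {"Malicious": 1, "Benign": 0}

def is_missing(value):
    """True for None, float NaN, empty strings, and the text 'nan'."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    return str(value).strip().lower() in ("", "nan")

def is_numeric(value):
    """True if the value can be parsed as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def clean_records(rows, label_key="Class", checked_fields=("JS", "Obj")):
    """Apply the three cleaning steps to a list of dict records."""
    cleaned = []
    for row in rows:
        features = {k: v for k, v in row.items() if k != label_key}
        # Step 3: drop records whose features are all NaN.
        if all(is_missing(v) for v in features.values()):
            continue
        # Step 2 (assumed meaning): drop records with non-numeric JS/Obj values.
        if any(not is_missing(row.get(f)) and not is_numeric(row.get(f))
               for f in checked_fields):
            continue
        # Assumption: records with an unknown label are dropped as well.
        if row.get(label_key) not in LABELS:
            continue
        # Step 1: encode the label as 1 (Malicious) or 0 (Benign).
        cleaned.append({**features, label_key: LABELS[row[label_key]]})
    return cleaned

# Hypothetical records, for illustration only.
rows = [
    {"JS": "2", "Obj": "10", "Class": "Malicious"},
    {"JS": "0", "Obj": "7", "Class": "Benign"},
    {"JS": "1(2)", "Obj": "5", "Class": "Malicious"},       # invalid JS value
    {"JS": "nan", "Obj": float("nan"), "Class": "Benign"},  # all features NaN
]
cleaned = clean_records(rows)
print(cleaned)
```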
- Data Splitting:
- a) Splitting the data into training and testing sets at a ratio of 80%:20%
- b) Using K-fold cross-validation with 10 folds
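Both splitting strategies above can be expressed over record indices. The sketch below is a minimal standard-library version; the function names, the fixed seed, and the sample size of 100 are assumptions chosen for illustration.

```python
import random

def train_test_split_indices(n, test_ratio=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out test_ratio of them for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_ratio)
    return idx[n_test:], idx[:n_test]  # (train, test)

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# Hypothetical dataset of 100 records.
train, test = train_test_split_indices(100)
print(len(train), len(test))

splits = list(k_fold_indices(100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 10 folds, every record is used for testing once and for training nine times, so the averaged score depends less on a single random split than the 80:20 holdout does.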
IV. Results
- a) training and testing data with an 80:20 ratio, and
- b) 10-fold cross-validation,
V. Conclusions





| Metric | Score |
|---|---|
| Accuracy | 99.5% |
| Precision | 99.35% |
| Recall | 99.72% |
| F1-Score | 99.5% |

| Metric | Score |
|---|---|
| Accuracy | 98.9% |
| Precision | 99.35% |
| Recall | 99.72% |
| F1-Score | 99.5% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).