Submitted: 02 July 2024
Posted: 03 July 2024
Abstract
Keywords:
1. Introduction
- Systematic Linux Syscall Categorization and Security Risk Ranking: We categorize every extracted call and assign it a risk rank based on its potential security risk. Statistical analysis in this study confirms the validity of these categories and ranks.
- Static Analysis Dataset: We reverse-engineered 2,331 ARM ELF binaries and used static analysis to extract syscalls and library syscall wrappers. We then correlated the extracted calls with the syscall categories and security risk ranks to create a comprehensive dataset that records each syscall, its category, and its security risk rank, along with statistical features computed for each binary.
- Malware Detection: To demonstrate the reliability of static analysis for malware detection based on system calls extracted from disassembled binaries, we evaluate the dataset with standard ML classification models: Logistic Regression, Random Forest, Support Vector Classification (SVC), and a Multi-Layer Perceptron (MLP) neural network.
2. Preliminaries
2.1. System and Library Calls
- Syscalls: In Linux, a system call (syscall) is the primary interface through which user-space programs request services and access functionalities from the operating system kernel. It acts as a bridge between the user space, where applications run, and the kernel space, where core system functions reside. System calls enable programs to perform privileged operations and access restricted system resources, such as hardware interaction, process management, file I/O operations, network communication, and memory management as shown in Figure 1. Some system calls are architecture-specific, and their implementation and availability can vary between different hardware platforms (e.g., x86, ARM, MIPS). These differences are due to the unique characteristics and requirements of each architecture, necessitating specific handling within the kernel to optimize performance and compatibility [8].
- Library syscall wrappers: When using programming languages like C++ or Java, developers often use pre-built functions from libraries such as the GNU C Library (glibc), which contains routines for file management, memory allocation, and computational tasks. When a program calls one of these library functions, it may require system-level functionality that resides in kernel space, such as hardware interaction or process management. Consequently, library functions often make system calls on the program's behalf and provide wrapper functions around syscalls, as shown in Figure 1.
- Virtual syscalls: The Linux kernel also provides optimized versions of certain syscalls, called virtual syscalls. Virtual syscalls, or vDSO (virtual Dynamic Shared Object) calls, are a set of performance-optimized routines that user-space applications can use to execute certain system calls more efficiently. They are mapped directly into the process address space, so some system call functionality can be executed without the overhead of a traditional syscall. vDSO calls are generally captured through dynamic analysis, as they involve runtime components and optimizations. This study does not capture virtual syscalls.
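The difference between a library wrapper and the underlying syscall can be illustrated with a short Python sketch. It is an illustration only and not part of the study's tooling: the helper name `getpid_via_raw_syscall` is hypothetical, and the syscall-number table is an assumed subset covering three common Linux architectures. It also shows why syscall numbers have to be handled per architecture.

```python
import ctypes
import os
import platform

# getpid syscall numbers differ per architecture (illustrative subset).
GETPID_NR = {"x86_64": 39, "aarch64": 172, "armv7l": 20}


def getpid_via_raw_syscall():
    """Issue getpid by number through libc's generic syscall() function.

    Returns None when the platform is not covered by GETPID_NR.
    """
    if platform.system() != "Linux":
        return None
    nr = GETPID_NR.get(platform.machine())
    if nr is None:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.syscall(ctypes.c_long(nr))


wrapper_pid = os.getpid()            # goes through the library wrapper
raw_pid = getpid_via_raw_syscall()   # same request, issued by syscall number
```

On a supported platform both values are the same process ID; a static analyzer sees the first form as a call to a library symbol and the second as a syscall number loaded before the trap instruction.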
2.2. Linux Malware Static Analysis
3. Literature Review
| Research | Statistical Analysis | Architecture | Features Used | Dynamic/Static | Models | Accuracy | Comparison |
|---|---|---|---|---|---|---|---|
| Asmitha and Vinod [10] | Non-parametric statistical methods such as the Kruskal-Wallis ranking test (KW) and Deviation From Poisson (DFP) | - | syscalls | Dynamic | J48, AdaBoost, Random Forest | 97.30% | Dynamic analysis* |
| Asmitha and Vinod [11] | - | - | syscalls | Dynamic | Naïve Bayes, J48, AdaBoost, RF, IBK-5 | 97% | Dynamic analysis* |
| Phu et al. [12] | n-gram, chi-squared | MIPS | syscalls | Dynamic | RF, NB, SVM | 97% | Dynamic analysis*, MIPS architecture-specific |
| Tahir and Qadir [13] | - | x86, MIPS, ARM | syscalls | Dynamic | SVM, LR, RF, MLP, Bagging | Bagging, RF (99%) | Dynamic analysis* |
| Shobana and Poonkuzhali [14] | - | - | syscalls | Dynamic | RNN | 98.7% | Dynamic analysis* |
| Abderrahmane et al. [15] | - | ARM | log files, syscalls | Dynamic | CNN | 93.3% | Dynamic analysis*, ARM architecture-specific |
| Our research | Chi-squared test, Wilcoxon test | Multi-architecture | syscalls, library syscall wrappers | Static | Logistic Regression, Random Forest, SVC, Neural network | 96.86% | Static analysis requiring no execution or environment setup. Multi-architecture support. Can be combined with other static features for higher accuracy. Statistical analysis validates the syscall ranking and category assignment, and the scores can be adjusted for various architectures. |
4. Methodology
4.1. IoT Malware Dataset
4.2. Reverse Engineering
4.3. Heuristics
| Syscall category | Description |
|---|---|
| FileSystem | Handles file management operations such as reading, writing, and permissions. |
| Process | Manages process lifecycle such as creation, execution, and termination. |
| Memory | Controls memory allocation, deallocation, modification, and access, all of which are critical for process management. |
| Network | Encompasses syscalls for network communication such as socket management, send, and recv. |
| System | General system calls for services, system configuration, and management. |
| Metadata | Involves retrieval and manipulation of file or system metadata. |
| Signal | Handles inter-process communication and signal-driven interruptions. |
| Security | Covers key management, encryption, and access controls. |
| NonblockingIO | Performs non-blocking input/output operations. |
| Time | Measures and manipulates time. |
4.4. Statistical Feature Engineering
- Call Count: The total number of system calls made by each binary. This feature reflects the general activity level of the binary, which can indicate suspicious behavior.
- Distinct Call Count: The number of unique syscalls made by each binary. A high variety of calls can be indicative of complex or unusual binary behavior.
- Category Frequency: For each of the ten syscall categories, we calculated the frequency of syscalls falling into each category per binary. This helps in understanding which types of operations are predominant in a binary, aiding in profiling typical and atypical behaviors.
- High-Risk Call Proportion: The proportion of syscalls that are ranked as ’High’ risk relative to the total number of syscalls. This feature specifically targets the detection of syscalls more commonly associated with malicious activities.
- Entropy of Calls: We calculated the entropy of syscall distribution within each category of a binary to measure the unpredictability and randomness of syscall usage, which can be higher in malware due to evasion techniques or diverse functionalities.
- Weighted Risk Score: By assigning weights to syscalls based on their assigned risk levels (High, Medium, Low), we computed an overall risk score for each binary. This score provides a quantitative measure of the potential threat posed by the binary based on the observed syscalls.
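The features listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the numeric weights in `RISK_WEIGHTS` are hypothetical (the text names the risk levels but not their values), the function name `binary_features` is invented for this example, and the category and rank labels in the toy input are illustrative only.

```python
import math
from collections import Counter

# Hypothetical weights: the text names the levels but not the numeric values.
RISK_WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}


def binary_features(calls):
    """Compute per-binary features from (name, category, rank) tuples."""
    total = len(calls)
    names = Counter(name for name, _, _ in calls)
    categories = Counter(cat for _, cat, _ in calls)
    ranks = Counter(rank for _, _, rank in calls)

    # Shannon entropy (in bits) of syscall usage inside each category.
    entropy_per_cat = {}
    for cat, cat_total in categories.items():
        counts = Counter(name for name, c, _ in calls if c == cat)
        entropy_per_cat[cat] = -sum(
            (n / cat_total) * math.log2(n / cat_total) for n in counts.values()
        )

    return {
        "call_count": total,
        "distinct_call_count": len(names),
        "category_frequency": dict(categories),
        "high_risk_proportion": ranks["High"] / total if total else 0.0,
        "entropy_per_category": entropy_per_cat,
        "weighted_risk_score": sum(RISK_WEIGHTS[r] * n for r, n in ranks.items()),
    }


# Toy input; the category and rank labels here are illustrative only.
sample = [
    ("socket", "Network", "High"),
    ("connect", "Network", "High"),
    ("open", "FileSystem", "Medium"),
    ("open", "FileSystem", "Medium"),
]
features = binary_features(sample)
```

In this toy input the Network category uses two different calls equally often, so its entropy is one bit, while the FileSystem category repeats a single call and has zero entropy.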
4.5. Data Organization
4.6. Handling of Outliers
4.7. Statistical Feature Analysis
- Data Collection: We calculate the frequency of each syscall rank for both malware and benign binaries.
- Normalization: The frequencies are normalized to proportions within each group.
- Chi-squared Test: We perform the chi-squared test to determine whether the observed differences in the distribution of syscall ranks are statistically significant: $\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ are the observed frequencies of risk rank $i$ and $E_i$ are the expected frequencies. We use the same methodology to assess the importance of syscall categories.
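As a worked illustration of the test described above, the following sketch computes the Pearson chi-squared statistic for a malware/benign by High/Medium/Low contingency table. The counts are hypothetical and not taken from the dataset, and the function name `chi_squared` is invented for this example. The sketch operates on raw counts, since the Pearson statistic is defined on counts; how the normalization step is combined with the test is not specified here.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and rank.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts of High / Medium / Low ranked calls (not from the paper).
observed = [
    [60, 30, 10],  # malware
    [20, 40, 40],  # benign
]
stat, dof = chi_squared(observed)

# With 2 degrees of freedom the chi-squared survival function is exp(-x / 2).
p_value = math.exp(-stat / 2) if dof == 2 else None
```

For tables with other degrees of freedom, a library routine such as `scipy.stats.chi2_contingency` returns the statistic together with a p-value.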
4.8. Machine Learning Classification Based on Static Extraction of Syscalls
- Logistic Regression: To evaluate a simple yet effective baseline for binary classification.
- Random Forest: The Random Forest classifier is robust to the outliers and anomalies typical of malware, and is well suited to complex interactions between features.
- LinearSVC: LinearSVC is often effective with high-dimensional data, and its linear decision function is straightforward to interpret.
- Multi-Layer Perceptron (MLP) Neural Network: MLP is effective in capturing complex patterns and interactions in the data through its layered architecture.
4.9. HyperParameters
4.9.1. Vectorization
- RandomForestClassifier: Through GridSearch, we identified the optimal number of trees as 200. The default values were retained for the maximum depth of the tree (None) and for the split criterion (Gini impurity).
- LinearSVC: For LinearSVC, we used the default squared hinge loss function (`squared_hinge`) with L2 regularization and a set tolerance for the stopping criterion.
- MLP Classifier: For the neural network model, we employed the rectified linear unit (ReLU) activation function and the 'adam' solver for weight optimization, with an initial learning rate of 0.001.
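The settings listed above map onto scikit-learn constructor arguments roughly as follows. This is a sketch rather than the authors' training script: the names `HYPERPARAMS` and `build_models` are hypothetical, the LinearSVC tolerance is left at the library default because its value is not given in the text, and Logistic Regression is omitted because no settings are stated for it.

```python
# Hyperparameters stated in the text, keyed by scikit-learn estimator name.
HYPERPARAMS = {
    "RandomForestClassifier": {
        "n_estimators": 200,   # selected through grid search
        "max_depth": None,     # library default retained
        "criterion": "gini",   # library default retained
    },
    # Tolerance is not given in the text, so the library default applies.
    "LinearSVC": {"loss": "squared_hinge", "penalty": "l2"},
    "MLPClassifier": {
        "activation": "relu",
        "solver": "adam",
        "learning_rate_init": 0.001,
    },
}


def build_models():
    """Instantiate the estimators; requires scikit-learn to be installed."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import LinearSVC

    classes = {
        "RandomForestClassifier": RandomForestClassifier,
        "LinearSVC": LinearSVC,
        "MLPClassifier": MLPClassifier,
    }
    return {name: classes[name](**params) for name, params in HYPERPARAMS.items()}
```

Calling `build_models()` returns a dictionary of unfitted estimators that can then be trained with `fit(X, y)` on the vectorized feature matrix.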
5. Results and Analysis
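- Accuracy: The proportion of true results (both true positives and true negatives) among the total number of cases examined: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where $TP$, $TN$, $FP$, and $FN$ represent the number of true positives, true negatives, false positives, and false negatives, respectively.
- Precision: The proportion of true positive results in all positive predictions: $\text{Precision} = \frac{TP}{TP + FP}$.
- F1 Score: The harmonic mean of precision and recall: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, where $\text{Recall} = \frac{TP}{TP + FN}$.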
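The three metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The function name `classification_metrics` and the label vectors are hypothetical and serve only to make the arithmetic concrete; label 1 is assumed to mean malware and 0 benign.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = malware)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive,
# and 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

With these labels every metric evaluates to 0.75, since the single false positive and single false negative are balanced against three correct predictions of each class.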
5.1. Discussion





6. Conclusion
7. Future Works
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Howarth, J. 80+ Amazing IoT Statistics (2024-2030) — explodingtopics.com. https://explodingtopics.com/blog/iot-stats. [Accessed 09-05-2024].
- Zscaler ThreatLabz Finds a 400% Increase in IoT and OT Malware Attacks Year-over-Year. https://www.zscaler.com/press/zscaler-threatlabz-finds-400-increase-iot-and-ot-malware-attacks-year-over-year-underscoring. [Accessed 09-05-2024].
- Ngo, Q.D.; Nguyen, H.T.; Le, V.H.; Nguyen, D.H. A survey of IoT malware and detection methods based on static features. ICT Express 2020, 6, 280–286.
- Antony, A.; Sarika, S. A review on IoT operating systems. Int. J. Comput. Appl. 2020, 176, 33–40.
- AV-ATLAS Malware Portal. https://portal.av-atlas.org/malware, 2023. [Online; accessed 10-May-2023].
- Pancake; others. The Official Radare2 Book: ESIL, 2024.
- Ramamoorthy, J.; Gupta, K.; Shashidhar, N.K.; Varol, C. Linux IoT Malware Variant Classification Using Binary Lifting and Opcode Entropy. Electronics 2024, 13, 2381.
- Kerrisk, M.; others. Linux Programmer’s Manual: syscalls, 2024. [Online; accessed 18-June-2024].
- Jones, M. IBM Developer — developer.ibm.com. https://developer.ibm.com/articles/l-linux-kernel/. [Accessed 09-05-2024].
- Asmitha, K.; Vinod, P. Linux malware detection using non-parametric statistical methods. 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2014, pp. 356–361.
- Asmitha, K.; Vinod, P. A machine learning approach for linux malware detection. 2014 international conference on issues and challenges in intelligent computing techniques (ICICT). IEEE, 2014, pp. 825–830.
- Phu, T.N.; Dang, K.H.; Quoc, D.N.; Dai, N.T.; Binh, N.N. A novel framework to classify malware in MIPS architecture-based IoT devices. Security and Communication Networks 2019, 2019, 1–13.
- Tahir, I.; Qadir, S. Machine Learning-based Detection of IoT Malware using System Call Data 2022.
- Shobana, M.; Poonkuzhali, S. A novel approach to detect IoT malware by system calls using Deep learning techniques. 2020 International Conference on Innovative Trends in Information Technology (ICITIIT). IEEE, 2020, pp. 1–5.
- Abderrahmane, A.; Adnane, G.; Yacine, C.; Khireddine, G. Android malware detection based on system calls analysis and CNN classification. 2019 IEEE wireless communications and networking conference workshop (WCNCW). IEEE, 2019, pp. 1–6.
- Olsen, S.H.; OConnor, T. Toward a Labeled Dataset of IoT Malware Features. 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2023, pp. 924–933.
- Refade. IoT_ARM: A Collection of IoT Malware Samples for ARM Architecture, 2024.
- Tallarida, R.J.; Murray, R.B. Chi-square test. Manual of pharmacologic calculations: with computer programs.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830.






| Class | Binary Count | Syscall Count |
|---|---|---|
| Malware | 1,117 | 163,288 |
| Benign | 1,214 | 45,284 |

| Dataset feature | Description |
|---|---|
| binary_id | Unique identifier for each binary |
| hash | Hash of the binary |
| Architecture | ARM, although our methodology supports multiple architectures through a binary lifting strategy |
| isMalware | Whether the binary is malicious or benign |
| prc | Processor word size (32-bit or 64-bit) |
| Endian | The endianness of the binary (LSB or MSB) |
| Stripped | Whether the binary is stripped of symbol information |
| call_api | Name of the syscall |
| call_desc | Description of the function of the syscall |
| call_type | syscall or lib (library syscall wrapper) |
| call_cat | Category of the syscall (see Table 3) |
| call_rank | Security risk level of the syscall based on its potential for malicious use (High, Medium, Low) |
| bin_all_call_cnt | Count of all the syscalls and lib calls in the binary |
| bin_dist_call_cnt | Count of the distinct syscalls and lib calls in the binary |
| bin_dist_cat_cnt | Count of each category of syscalls and lib calls |
| bin_each_api_cnt | Count of unique syscalls in the binary |
| bin_calls_per_cat | Number of calls per category in a binary |
| bin_dist_call_per_cat | Number of distinct calls in each category in a binary |
| call_rank_n | Number of syscalls in the binary with the current syscall rank |
| bin_calls_per_rank | Number of syscalls and lib calls in each rank per binary |
| std_dev_calls_per_cat | Standard deviation of the syscalls per category in a binary |
| mean_calls_per_cat | Mean of calls per category in a binary |
| average_rank | Average rank of the syscalls and lib calls in a binary |
| weighted_rank_score | Weight of the syscall (see Equation 1) |
| total_rank_score | Sum of all weighted rank scores across all syscalls in a binary (see Equation 2) |
| diversity_score | Unpredictability of the syscall ranks in the binary (see Equation 3) |
| final_score | Total rank score added to the diversity score (see Equation 4) |
| cat_concentration_index | Number of syscall categories in the binary (unused) |

| Model | Accuracy | Precision | F1 Score |
|---|---|---|---|
| Random Forest | 93.34% | 94.71% | 96.86% |
| Logistic Regression | 92.34% | 96.0% | 95.06% |
| SVC | 92.48% | 96.16% | 95.14% |
| MLP NN | 92.07% | 93.34% | 95.03% |

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).