Performance analysis of machine learning algorithms for Malware classification

Bushra, Raisa Hasan; Alam, Md Taukir; Saha, Aniruddho; Fahim, Nazmus Sakib; Binty, Nabila Mourium

View/Open

18301064, 18301277, 18201117, 18201166, 19101082_CSE.pdf (11.61Mb)

Date

2022-09-29

Publisher

Brac University

Abstract

Malware detection has remained an active research area as malware attacks grow in variety and complexity. Supervised and unsupervised machine learning algorithms have proven effective for detecting, identifying, and classifying malware in recent years. Common and widely concerning malware types include Trojan, Adware, Ransomware, and Zero-day. In this paper, we used ten ML algorithms, namely AdaBoost, Stochastic Gradient Descent (SGD), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), XGBoost, Logistic Regression (LR), Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM), to classify software-based Trojan, Ransomware, Adware, and Zero-day attacks. The research was conducted on a dataset of 12,863 malware samples covering the categories mentioned above, from which features were extracted and patterns learned. We also compare these ML methods and analyze how each one classifies these malware types, after testing each classifier on the selected dataset. RF achieved the highest accuracy, 86.97%, and Gaussian NB the lowest, 47.84%. MLP, XGBoost, KNN, DT, AdaBoost, SVM, LR, and SGD achieved 83.60%, 82.59%, 80.68%, 79.63%, 73.30%, 73.22%, 67.08%, and 64.40% accuracy, respectively. Beyond overall accuracy, our analysis covered the per-class accuracy, precision, F1-score, TPR, TNR, FPR, and FNR of each malware class for each ML classifier.
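The abstract reports TPR, TNR, FPR, and FNR per malware class. As a rough illustration of how such per-class rates are commonly derived from a multiclass confusion matrix (treating each class one-vs-rest), here is a minimal sketch. The function name `per_class_rates` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the thesis or its dataset.

```python
def per_class_rates(confusion, labels):
    """One-vs-rest TPR, TNR, FPR, FNR per class.

    confusion[i][j] = number of samples whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    rates = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]                         # correctly predicted as this class
        fn = sum(confusion[i]) - tp                  # this class, predicted as another
        fp = sum(row[i] for row in confusion) - tp   # another class, predicted as this
        tn = total - tp - fn - fp                    # everything else
        rates[label] = {
            "TPR": tp / (tp + fn) if tp + fn else 0.0,
            "FNR": fn / (tp + fn) if tp + fn else 0.0,
            "TNR": tn / (tn + fp) if tn + fp else 0.0,
            "FPR": fp / (tn + fp) if tn + fp else 0.0,
        }
    return rates


# Hypothetical counts for illustration only (NOT results from the thesis).
labels = ["Trojan", "Adware", "Ransomware", "Zero-day"]
confusion = [
    [8, 1, 1, 0],
    [2, 6, 1, 1],
    [0, 1, 9, 0],
    [1, 1, 0, 8],
]

rates = per_class_rates(confusion, labels)
for label in labels:
    print(label, {name: round(value, 3) for name, value in rates[label].items()})
```

For each class, TPR and FNR sum to 1, as do TNR and FPR, so reporting all four is redundant in information but convenient when comparing classifiers class by class.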

Keywords

Machine learning; Trojan; Adware; Ransomware; Classification; Malware; Zero-day; Naïve Bayes; Stochastic gradient descent; Random forest; Decision tree; AdaBoost; XGBoost; Logistic regression; Multi-layer perceptron; K-nearest neighbour; Support vector machine

LC Subject Headings

Regression analysis; Computer algorithms

Description

This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2022.

Cataloged from PDF version of thesis.

Includes bibliographical references (pages 32-36).

Department

Department of Computer Science and Engineering, Brac University

Type

Thesis

Collections

Thesis & Report, BSc (Computer Science and Engineering) [1586]