PDFGuardian: An innovative approach to interpretable PDF malware detection using XAI with the SHAP framework
Abstract
As the world moves further into the digital era, a large share of data is
exchanged through the widely used Portable Document Format (PDF). One of the
format's biggest obstacles remains the age-old problem of malware. Although
numerous anti-malware and anti-virus tools exist, many of them fail to detect
PDF malware. Emails carrying harmful attachments have recently been used in
targeted cyber attacks against businesses, and because most email servers do
not allow executable files to be attached to emails, attackers prefer
non-executable files such as PDFs. Across various sectors, machine learning
algorithms and neural networks have proven capable of detecting both known and
previously unseen malware. However, it can be difficult to understand how
these models reach their decisions. This lack of transparency is a problem,
because understanding how an AI system makes its decisions is essential to
ensuring that it acts ethically and responsibly. In some cases, machine and
deep learning models may make biased or discriminatory decisions or have
unintended consequences. This is where Explainable AI comes into play. To
address this issue, this paper proposes classifying PDF files as malicious or
clean using the machine learning algorithms Stochastic Gradient Descent (SGD)
and the XGBoost classifier, together with the deep learning algorithms
Single-Layer Perceptron and Artificial Neural Network (ANN), and examining
their interpretability with the SHAP framework from Explainable AI (XAI) to
obtain both a global and a local understanding of the models.
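
To illustrate the kind of analysis the abstract describes, the following is a
minimal sketch of explaining an XGBoost PDF-malware classifier with SHAP. It
is not the paper's actual pipeline: the feature names (e.g., js_count,
openaction_count) and the synthetic data are hypothetical placeholders
standing in for structural PDF features.

    # Minimal sketch: XGBoost + SHAP for global and local explanations.
    # The feature names and synthetic data are illustrative assumptions,
    # not the paper's dataset or feature set.
    import pandas as pd
    import shap
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Hypothetical structural PDF features (object/keyword counts).
    feature_names = ["obj_count", "stream_count", "js_count",
                     "openaction_count", "page_count", "encrypt_flag"]

    # Synthetic stand-in for a labeled PDF feature dataset.
    X, y = make_classification(n_samples=1000,
                               n_features=len(feature_names),
                               n_informative=4, random_state=0)
    X = pd.DataFrame(X, columns=feature_names)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=0)

    model = XGBClassifier(n_estimators=100, eval_metric="logloss")
    model.fit(X_train, y_train)

    # TreeExplainer computes exact SHAP values for tree ensembles.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # Global view: which features drive malicious/clean decisions overall.
    shap.summary_plot(shap_values, X_test)

    # Local view: why one individual PDF sample was classified as it was.
    shap.force_plot(explainer.expected_value, shap_values[0],
                    X_test.iloc[0], matplotlib=True)

The summary plot corresponds to the "global understanding" mentioned above
(feature importance across all samples), while the force plot corresponds to
the "local understanding" (the contribution of each feature to a single
prediction).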