Research Article

Android malware analysis in a nutshell

Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

Affiliations Security Engineering Lab, Computer Science Department, Prince Sultan University, Riyadh, KSA, Computer Science Department, King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan

Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Affiliation Security Engineering Lab, Computer Science Department, Prince Sultan University, Riyadh, KSA

Affiliations Security Engineering Lab, Computer Science Department, Prince Sultan University, Riyadh, KSA, Electronics and Electrical Communication Engineering Department, Faculty of Electronic Engineering, Menoufia University, Menouf, Egypt

  • Iman Almomani, 
  • Mohanned Ahmed, 
  • Walid El-Shafai

  • Published: July 5, 2022
  • https://doi.org/10.1371/journal.pone.0270647

This paper offers a comprehensive analysis model for Android malware. The model presents the essential factors affecting the results of vision-based Android malware analysis. Current Android malware analysis solutions might consider one or some of these factors while building their malware predictive systems. In contrast, this paper comprehensively highlights these factors and their impacts through a deep empirical study. The study comprises 22 CNN (Convolutional Neural Network) algorithms: 21 well-known algorithms and one proposed algorithm. Additionally, several types of files are considered before converting them to images, and two benchmark Android malware datasets are utilized. Finally, comprehensive evaluation metrics are measured to assess the produced predictive models from the security and complexity perspectives. The outcome is a guide that helps researchers and developers plan and build efficient malware analysis systems that meet their requirements and resources. The results reveal that some factors might significantly impact the performance of the malware analysis solution. For example, from a security perspective, the accuracy, F1-score, precision, and recall are improved by 131.29%, 236.44%, 192%, and 131.29%, respectively, when changing one factor and fixing all other factors under study. Similar results are observed in the case of complexity assessment, including testing time, CPU usage, storage size, and pre-processing speed, which demonstrates the importance of the proposed Android malware analysis model.

Citation: Almomani I, Ahmed M, El-Shafai W (2022) Android malware analysis in a nutshell. PLoS ONE 17(7): e0270647. https://doi.org/10.1371/journal.pone.0270647

Editor: Sathishkumar V E, Hanyang University, KOREA, REPUBLIC OF

Received: April 30, 2022; Accepted: June 14, 2022; Published: July 5, 2022

Copyright: © 2022 Almomani et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The datasets that support the findings of this study are available online. These datasets were derived from the following resources available in the public domains: 1. https://www.sec.tu-bs.de/~danarp/drebin/download.html 2. https://www.impactcybertrust.org/dataset_view?idDataset=1275 .

Funding: This research was carried out without any financial support; however, the publication fee is sponsored by the Prince Sultan University, Saudi Arabia.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Malicious software (malware) is any software built for unauthorized purposes and mala fide aims. Its harmful behavior degrades the performance of the operating system and its running services. Currently, Android malware is one of the most critical threats, as it can encrypt data on Android devices or disrupt their operation [ 1 ]. This is because Android malware applications (APKs) can steal or encrypt sensitive data, show undesirable advertising, disrupt normal functions, or control the users’ devices without their knowledge [ 2 ].

Android malware APKs fall into many categories, such as worms, botnets, rootkits, ransomware, and Trojans [ 3 ]. These Android malware attacks can exploit metamorphic and polymorphic procedures to evade traditional malware recognition and detection algorithms. Moreover, Android malware developers tend to modify small sections of existing source code to create new malware variants that can evade malware detection techniques [ 4 ]. Consequently, identifying Android malware attacks that belong to the same malware family becomes tremendously challenging [ 5 ]. Therefore, efficient Android malware detection algorithms based on artificial intelligence (AI) tools need to be developed and implemented to identify and recognize Android malware threats [ 6 , 7 ].

Android malware detection and identification algorithms are categorized into four main groups: static-based, dynamic-based, vision-based, and hybrid-based detection algorithms [ 8 – 12 ]. In static-based identification algorithms, the Android malware APKs are analyzed without executing them. These algorithms depend on extracting important features from the suspected source code to identify and recognize the Android malware families. However, their main disadvantages are that they are not robust to code obfuscation and that they need more computation steps during feature extraction [ 13 , 14 ]. In dynamic-based identification algorithms, the traces and features of the suspected source code are examined and analyzed during execution. The critical disadvantage of these algorithms is that they are more time-consuming and require additional storage resources [ 15 ].

On the other hand, in hybrid-based identification algorithms, two or more identification categories are employed simultaneously to detect Android malware attacks efficiently. However, this category needs more sequential steps, high computational complexity, human intervention, and manual effort [ 16 ]. In vision-based malware identification algorithms, the Android malware APKs or their extracted features are converted to 2D digital images before the classification and detection process. First, the main features of the Android malware APKs are obtained by unzipping or decompiling them [ 17 , 18 ]. Then, the resulting 1D binary vectors of the extracted features (i.e., Android manifest, SMALI, and Classes.dex) are transformed into 2D arrays (grayscale images). In the last step, the resulting 2D grayscale images are forwarded to a malware classifier, such as a Convolutional Neural Network (CNN)-based classifier, to detect and classify the category and family of the analyzed Android malware APKs.

Recently, Deep Learning (DL) and optimization algorithms have been utilized to mitigate Android malware threats [ 19 – 22 ]. DL networks such as CNN algorithms are the most common AI-based recognition and identification techniques used to detect malware attacks from malware visual images [ 23 – 25 ]. Furthermore, CNNs can efficiently distinguish various objects and aspects in the input images using learned weights and biases that are tuned by optimization algorithms. Therefore, CNN algorithms are a strong choice for image classification challenges and applications, such as classifying malware images [ 26 – 29 ]. Consequently, efficient CNN algorithms can be used to automatically obtain rich and valuable features from Android malware visual images. These features are then used to classify and identify the different families of malicious Android APKs.

Therefore, in our proposed work, without executing the Android APKs, we first converted their binary data into 2D images. After that, we employed a CNN-based Android malware detection algorithm to classify different Android malware families from these 2D images. In addition, we tested and analyzed 21 different pre-trained CNN algorithms to check their performance in identifying and recognizing the Android malware classes from their visual images. DL-based CNN algorithms differ from traditional Machine Learning (ML) algorithms, which accomplish feature representation with specific parameter configurations or particular assumptions. Therefore, compared to conventional ML algorithms, DL-based CNN algorithms can effectively discover complex patterns and obtain valuable features from multi-dimensional inputs such as visual images.

In Android malware analysis and detection systems, many parameters and factors need to be considered that control the identification and recognition performance of the utilized malware classifiers. These parameters include (1) the analyzed Android dataset (balanced or imbalanced), (2) the utilized evaluation metrics (i.e., security or complexity metrics), (3) the type of malware analysis (static, dynamic, hybrid, or vision), and (4) the type of APK components selected to be analyzed in the detection process (i.e., Full APK file, android manifest file, SMALI file, or Classes.dex file).

This research is motivated by the importance of Android malware analysis and detection solutions, given the increased risk of such attacks. In addition, there are tremendous existing efforts utilizing vision-based algorithms to analyze and detect Android malware with high accuracy. The most critical issue in the previous related works is that they studied only some parameters in their malware detection systems. However, to achieve high detection accuracy and efficient malware analysis, many factors that directly or indirectly affect the malware classification process must be investigated.

As most current malware detection systems consider one or some factors while building their malware predictive systems, this motivates us to offer a comprehensive analysis model for Android malware. The model presents the essential factors affecting the results of vision-based Android malware analysis. Consequently, we comprehensively highlighted these factors and tested their impacts through a deep empirical study. The goal is to support researchers and developers by providing a clear guide on planning and building efficient malware analysis systems that meet their requirements and available resources.

The significant contributions of our work are detailed as follows:

  • Summarizing and comparing the most recent vision-based Android malware detection systems and the main factors studied by them.
  • Proposing a nutshell vision-based model for efficiently detecting malware apps. This model considers a comprehensive set of factors that might impact the efficiency of malware analysis and detection solutions from the security and complexity perspectives. These factors include the nature of the malware datasets, APK conversion process & format, CNN algorithms used, and the evaluation metrics applied.
  • Constructing a deep empirical study to implement all these factors and related parameters and analyze their impacts by running more than 450 experiments within the same environment.
  • Investigating the malware detection performance of 22 different CNN algorithms as part of the empirical study on the two most common imbalanced Android malware datasets (DREBIN and AMD). One of these CNN algorithms is developed from scratch for this research.
  • Avoiding the need for static or dynamic analysis for classifying Android malware attacks by converting Android threats to visual images for an easy, low-complexity classification process using CNN algorithms. Thus, we achieved low computational complexity and, at the same time, obtained high detection accuracy.
  • Studying the impact of different visual formats of Android malware APKs on the security and complexity performance of malware detection algorithms.
  • Analyzing highly imbalanced Android malware datasets containing unbalanced malware classes to achieve proper detection performance.
  • Reporting and analyzing the experiments’ results both when the malware APKs were directly converted to images and when rich features extracted from the Android APKs were converted to visual images.
  • Performing a deep comparative analysis for the security and complexity metrics performance of all tested scenarios composed in the proposed comprehensive vision-based model.

The structure of this work is as follows. The Related work section summarizes and compares the recent related studies. The Proposed comprehensive Android malware analysis model section presents the proposed model. The Model evaluation and results analysis section presents the model evaluation and discusses the results. Finally, the Conclusions section concludes the paper and offers some future directions.

Related work

This section summarizes and compares the previous work related to image-based malware detection algorithms and systems. Table 1 summarizes and compares these works in terms of the type of image conversion (and whether the process involves unzipping, de-compilation, or both), the dataset(s) used, the CNN algorithms utilized, and the performance evaluation measures considered, such as model/preprocessing complexity and security measures.

https://doi.org/10.1371/journal.pone.0270647.t001

Various algorithms introduced in the literature use unzipping, de-compilation, or both in the image conversion process. Regarding unzipping-related approaches, the authors of [ 30 ] introduced a byte-level malware classification method that uses a Markov technique for classes.dex-to-image conversion and then a deep CNN for classification. Moreover, the authors of [ 31 ] proposed a system that classifies malware by converting non-intuitive features into images, extracting features with a CNN, and feeding these features to classical ML algorithms such as KNN to detect the malware family. The authors of [ 32 ] introduced a color visualization method for the classes.dex and AndroidManifest.xml files of Android malware apps and classified the resulting images using CNN-ResNet models. In [ 33 ], classical machine learning algorithms such as Random Forest, K-Nearest Neighbors, Decision Tree, Bagging, AdaBoost, and Gradient Boost were used for classification after constructing feature vectors from grayscale images obtained by converting APK contents such as classes.dex. The authors of [ 34 ] proposed an approach to enhance blockchain user security by applying an RGB image visualization technique to three types of files in Android apps: classes.dex, AndroidManifest.xml, and Certificate. They then trained different classification models and applied a decision mechanism to distinguish malware from benign apps. On the other hand, regarding de-compilation techniques, the authors of [ 35 ] introduced a method called AdMat, which treats Android apps as images by forming an adjacency matrix for each app and feeding it to a CNN model to classify the app as malware or benign. Additionally, the authors of [ 36 ] combined opcodes, API packages, and API functions to construct RGB images and then used a CNN for classification. The authors of [ 37 ] mapped permissions to severity levels [ 38 ] to create images that are fed to a CNN model for malware classification. Other methods, such as [ 39 ], used network interactions as features that are converted to images and input to a CNN.

Different datasets were used in the previous papers to test the models and systems. The main ones were DREBIN and AMD. Some works used DREBIN alone, such as [ 31 , 36 ], and some used only AMD, such as [ 39 ]. However, most of them used a combination of both [ 32 , 33 , 35 , 37 ].

To evaluate the performance of the resulting predictive models, several metrics were used in the literature. Common metrics were accuracy, precision, recall, and F1-score [ 30 , 31 , 34 – 36 ]. Other metrics, such as error rate, specificity, sensitivity, MSE, and FPR, were also used [ 31 , 32 , 36 , 37 ].

Even though different works have been introduced for malware detection analysis, none of them studied the approach comprehensively in terms of the image conversion methods, datasets, CNN models, and evaluation metrics used. This can be clearly observed in the comparison between the related work and our proposed analysis model, as shown in Table 1 . For example, in terms of the CNN algorithms employed, most of the related works examined a few models such as VGG16, ResNet, and customized CNN algorithms, as in [ 30 – 32 , 34 – 37 , 39 ]. Moreover, they did not take into consideration all the different file formats of Android malware samples. For instance, the authors in [ 30 – 34 ] focused on unzipping preprocessing without considering the impact of decompiling preprocessing for the Android malware APKs. On the other hand, the authors in [ 35 – 37 ] considered only decompiling preprocessing. Additionally, few assessment metrics were used for the performance evaluation and the complexity and security analysis of the examined CNN algorithms. For example, some of the related studies used training time, preprocessing time, test time, APK file size, and RAM usage as complexity parameters, as in [ 30 – 32 , 34 – 37 , 39 ]. However, none of these related works introduced a comprehensive analysis of all of these complexity metrics. Moreover, in terms of security measures, the related works used various metrics such as accuracy, precision, recall, and F1-score, as in [ 30 – 32 , 34 – 37 , 39 ]. Many other assessment metrics should also be considered and analyzed. For instance, the authors in [ 31 ] considered other metrics such as error rate and MSE, while the authors in [ 36 ] evaluated their suggested CNN algorithms using TPR and FPR. However, most related studies did not present deep and comprehensive security analyses, such as the estimation of the NPV, PPV, and FOR parameters, which can provide more insights.

Therefore, in this paper, we introduce a comprehensive model that profoundly investigates the critical factors that might impact the performance of Android malware analysis systems in terms of efficiency, complexity, and security. Our proposed work covers different APK file formats and different scenarios of de-compilation and unzipping preprocessing to extract more features from the Android APKs, such as AM, DEX, de-compiled AM, and SMALI. Additionally, the proposed Android malware analysis model tests the performance of 22 different CNN algorithms in terms of comprehensive security and complexity metrics to deeply analyze their detection and computational efficiencies.

Proposed comprehensive Android malware analysis model

This section presents a nutshell model of building vision-based prediction models for Android malware detection systems. As shown in Fig 1 , primary factors should be considered as they will affect the Android malware analysis and detection processes. These factors include:

  • Type of conversion : This defines how the Android malware APK is analyzed. One option is to keep it as is (compressed) and then convert it to an image. Another option is to decompile the APK file first using a tool such as Apktool ( https://ibotpeaches.github.io/Apktool/ ). This tool decompiles the APK to generate smali files and the Android Manifest (AM) file. These files are then stacked and converted to images. Additionally, the analysis system could consider only unzipping the APK file and then converting the resulting AM file and “classes.dex” (CD) files to images.
  • Dataset Nature : The created or chosen Android malware dataset could severely impact the analysis model and the resulting predictive models. This includes the type of malware apps considered and their primary behavior, the number of families (classes), and whether the dataset is balanced or not.
  • CNN Algorithms : The type of CNN algorithm used to build the predictive model is vital to the performance of the malware detection systems. Therefore, this study has examined most of the well-known CNN algorithms (all currently implemented by Keras ( https://keras.io/ )) to provide a deep insight into the CNN algorithms’ impact on detecting Android malware applications.
  • Evaluation : The way the Android malware analysis and predictive models are evaluated is critical for balancing the system performance in terms of security and complexity. Therefore, the evaluation metrics must be carefully selected based on the system’s needs and available resources.
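The unzipping option in the first factor can be sketched with the Python standard library, since an APK is a ZIP archive. This is a minimal illustration, not the authors' pipeline: the function names `extract_unzipped_components` and `build_demo_apk` are hypothetical, and the demo archive is a stand-in rather than a real app. Note that the manifest inside an unzipped APK is stored in a binary XML encoding, whereas Apktool produces a decoded text manifest, which is why the AM and decompiled AM inputs differ byte for byte.

```python
import io
import zipfile


def extract_unzipped_components(apk_bytes: bytes) -> dict:
    """Return the raw bytes of the manifest and every classes*.dex entry."""
    components = {}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        for name in archive.namelist():
            is_manifest = name == "AndroidManifest.xml"
            is_dex = name.startswith("classes") and name.endswith(".dex")
            if is_manifest or is_dex:
                components[name] = archive.read(name)
    return components


def build_demo_apk() -> bytes:
    """Build a tiny stand-in archive (NOT a real app) for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"\x03\x00\x08\x00demo-manifest")
        archive.writestr("classes.dex", b"dex\n035\x00demo-bytecode")
        archive.writestr("res/layout/main.xml", b"ignored by the extractor")
    return buffer.getvalue()


if __name__ == "__main__":
    parts = extract_unzipped_components(build_demo_apk())
    for name, data in sorted(parts.items()):
        print(name, len(data), "bytes")
```

The extracted byte strings are what a byte-to-image step would consume; the decompilation route needs an external tool and is not shown here.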

https://doi.org/10.1371/journal.pone.0270647.g001

The main flow of our proposed model is illustrated in Fig 2 . The first phase in the proposed nutshell model is the selection of benchmark Android application (APK) datasets that are heavily utilized in vision-based malware analysis systems. Therefore, both DREBIN [ 40 ] and AMD [ 41 ] have been selected. Two different datasets were chosen to show how changing only the nature of the analyzed dataset affects the performance of the overall detection process. After that, the model processes these APKs using different conversions: (1) the APK is kept compressed as is, (2) the APK is decompiled using Apktool to produce the decompiled Android manifest (DAM) file and Smali files, and (3) the APK is unzipped to generate the Android manifest (AM) file and Dex files. Then, the image conversion phase starts by converting all features resulting from the above files into images. These visual malware images are obtained by converting the binaries of the extracted features to 8-bit vectors, which are then converted to 2D grayscale images. More details on the byte-to-image conversion process can be found in [ 23 , 26 ].
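The byte-to-image step can be illustrated with a short, hedged sketch. The text defers the details to [ 23 , 26 ], so the fixed row width of 256 pixels and the zero padding of the last row are assumptions made here for illustration, and the function names are hypothetical. Each byte becomes one 8-bit pixel, and the result is serialized as a binary PGM file, a simple grayscale format that needs no third-party library.

```python
import math


def bytes_to_grayscale_rows(data: bytes, width: int = 256) -> list:
    """Reshape bytes into rows of `width` pixels, zero-padding the last row."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = max(1, math.ceil(len(data) / width))
    padded = data + bytes(height * width - len(data))  # pad with zero bytes
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def to_pgm(rows: list) -> bytes:
    """Serialize pixel rows as a binary PGM (P5) grayscale image."""
    height, width = len(rows), len(rows[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + b"".join(bytes(row) for row in rows)


if __name__ == "__main__":
    sample = bytes(range(10))  # stand-in for an APK, manifest, or DEX file
    image_rows = bytes_to_grayscale_rows(sample, width=4)
    print(len(image_rows), "rows of", len(image_rows[0]), "pixels")
```

In practice the images are usually resized to the fixed input size a CNN expects; that resizing step is omitted here.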

https://doi.org/10.1371/journal.pone.0270647.g002

The final phase is applying 22 CNN models for training and testing the predictive models and then evaluating their performance using a comprehensive set of assessment metrics related to complexity, such as time, CPU, and storage utilization, for both the pre-processing and model execution phases. Additionally, 16 security-related metrics are measured.

21 pre-trained CNN algorithms (VGG16, ResNet50, VGG19, DenseNet121, DenseNet169, DenseNet201, EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3, EfficientNetB4, EfficientNetB5, EfficientNetB6, EfficientNetB7, InceptionResNetV2, InceptionV3, MobileNet, MobileNetV2, MobileNetV3Large, MobileNetV3Small, and Xception) [ 42 – 44 ] are examined. These pre-trained CNN algorithms are available in Python through the Keras and TensorFlow libraries [ 45 – 47 ]. Additionally, another CNN algorithm is developed from scratch in this research. This algorithm has different layers, as shown in Fig 3 .

https://doi.org/10.1371/journal.pone.0270647.g003

It consists of several sequential stages. The first stage processes the input visual malware images through the input layer and the Batch-Normalization (BN) layer, which normalizes the images by re-scaling and re-centering them. The BN layer also stabilizes the CNN network. Then, in the second stage, the most effective features are extracted and accumulated through several 2D convolutional layers (Conv2D) that use “same” padding and a stride of one. The weights of each Conv2D layer are initialized with an orthogonal matrix.

The numbers of filters in the Conv2D layers are 8, 16, 32, 64, 64, and 256, respectively. The Conv2D layers are interspersed with pooling layers called MaxPooling, which select the most significant pixel value in a four-pixel window. The MaxPooling layers therefore reduce the computational burden of the proposed CNN network. After that, the GlobalAveragePooling2D layer is introduced to gather the most common features during the training process.

In the last stage, which is the decision-making, classification, and detection stage, the spatial data is first converted to one-dimensional data by the Flatten layer. Next, three sequential fully connected (Dense) layers are utilized. Each of the first two Dense layers consists of 1024 nodes (neurons), whilst the last Dense layer consists of a number of nodes equal to the number of classes (eight malware classes in our proposed work). In addition, we used a Dropout layer in the proposed CNN algorithm to prevent overfitting. Furthermore, the Rectified Linear Unit (ReLU) is utilized as the activation function in the Conv2D layers and the first two Dense layers, whereas the SoftMax function is used in the last Dense layer to make the classification decision. Table 2 presents the specifications of all employed layers in the proposed CNN algorithm.
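To make the layer sequence easier to follow, the sketch below traces output shapes and parameter counts through a stack built from the numbers given in the text (six Conv2D layers with 8 to 256 filters, two Dense layers of 1024 nodes, and an eight-class output). Several details are assumptions because they are not stated in this text: a 3 × 3 kernel, a 2 × 2 pooling window placed after every convolution, and a 64 × 64 single-channel input in the usage example. The Dropout layer is left out because it changes neither shapes nor parameter counts. Table 2 remains the authoritative specification.

```python
FILTERS = [8, 16, 32, 64, 64, 256]  # from the text
DENSE_UNITS = [1024, 1024, 8]       # from the text (eight malware classes)
KERNEL = 3                          # assumption: kernel size is not stated
POOL = 2                            # assumption: 2 x 2 window (four pixels)


def trace_network(height: int, width: int, channels: int) -> list:
    """Return (layer name, output shape, parameter count) for each layer."""
    # Batch normalization keeps the shape; it stores 4 values per channel
    # (scale, offset, moving mean, moving variance).
    trace = [("BatchNormalization", (height, width, channels), 4 * channels)]
    for i, filters in enumerate(FILTERS, start=1):
        # 'same' padding with stride 1 keeps height and width unchanged.
        params = KERNEL * KERNEL * channels * filters + filters
        channels = filters
        trace.append((f"Conv2D_{i}", (height, width, channels), params))
        # Assumption: one max-pooling layer follows every convolution.
        if height < POOL or width < POOL:
            raise ValueError("input too small for the assumed pooling stack")
        height, width = height // POOL, width // POOL
        trace.append((f"MaxPooling_{i}", (height, width, channels), 0))
    # Global average pooling keeps one value per feature map.
    trace.append(("GlobalAveragePooling2D", (channels,), 0))
    trace.append(("Flatten", (channels,), 0))
    units_in = channels
    for i, units in enumerate(DENSE_UNITS, start=1):
        trace.append((f"Dense_{i}", (units,), units_in * units + units))
        units_in = units
    return trace


if __name__ == "__main__":
    for name, shape, params in trace_network(64, 64, 1):
        print(f"{name:24s} {str(shape):16s} {params}")
```

Under these assumptions most of the parameters sit in the Dense layers rather than in the convolutions, which is typical for small CNNs that end in wide fully connected layers.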

https://doi.org/10.1371/journal.pone.0270647.t002

Model evaluation and results analysis

This section describes and discusses the security and complexity analysis of the proposed comprehensive model. The 22 employed CNN algorithms are analyzed and tested in detail in terms of different evaluation metrics. The simulation specifications of all examined CNN algorithms in the proposed comprehensive vision-based Android malware detection model are summarized in Table 3 .

https://doi.org/10.1371/journal.pone.0270647.t003

Two imbalanced Android datasets (DREBIN [ 40 ] and AMD [ 41 ]) are examined in the simulation analysis. Each of these datasets contains eight Android malware classes. The class names and the numbers of Android malware APKs in the examined DREBIN and AMD datasets are presented in Table 4 .

https://doi.org/10.1371/journal.pone.0270647.t004

Assessment metrics
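
The metric definitions are not reproduced in this text. For reference, the standard confusion-matrix forms of the security metrics named in this paper are listed below, where TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives of a class. Whether the authors used exactly these forms, and which multi-class averaging they applied, is an assumption.

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision (PPV)} &= \frac{TP}{TP + FP} \\
\text{Recall (TPR)} &= \frac{TP}{TP + FN} \\
\text{F1-score} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{FPR} &= \frac{FP}{FP + TN} \\
\text{NPV} &= \frac{TN}{TN + FN} \\
\text{FOR} &= \frac{FN}{FN + TN} = 1 - \text{NPV}
\end{align}
```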


Security analysis

To assess the security of the proposed comprehensive model, we carried out extensive simulation experiments based on different vision-based scenarios. The 22 examined CNN algorithms, including the proposed one, are tested on five vision-based formats: (1) direct conversion of an APK file to a visual image, (2) conversion of the Android Manifest (AM) file extracted by the unzipping process to a visual image, (3) conversion of the AM file extracted by the decompilation process (DAM) to a visual image, (4) conversion of the Classes.dex (CD) file extracted by the unzipping process to a visual image, and (5) conversion of the SMALI file extracted by the decompilation process to a visual image. All the above-mentioned security-related metrics are calculated. For simplicity in presenting and comparing the results, the accuracy, precision, recall, and F1-Score metrics are highlighted for each tested CNN algorithm in the five studied vision-based scenarios on the two Android malware datasets (DREBIN & AMD), as shown in Tables 5 and 6 .

https://doi.org/10.1371/journal.pone.0270647.t005

https://doi.org/10.1371/journal.pone.0270647.t006

Tables 5 and 6 present the security performance of all predictive models generated from the DREBIN and AMD datasets. The results reveal that the proposed CNN algorithm achieves superior detection efficacy for the assessed security parameters compared to the other conventional CNN algorithms. Furthermore, for both examined Android malware datasets, the DAM vision-based format yields the best security performance for the proposed CNN algorithm and almost all tested CNN algorithms compared to the other examined vision-based formats.

Moreover, Tables 5 and 6 show that achieving high detection efficacy depends on the proper selection of the CNN algorithm and the appropriate choice of vision-based format. For example, in some tested cases, the DAM vision-based format is not the best scenario for the examined CNN algorithm. Therefore, developers can select the appropriate CNN model and vision-based format based on the security target of their Android malware analysis system.

In total, 22 CNN models were implemented and applied to the two datasets. To simplify the presentation of the simulation results, we introduce only the confusion matrices and the accuracy & loss curves of the best-performing CNN model for the two investigated Android malware datasets. Fig 4 presents the confusion matrices of the proposed CNN algorithm for the AMD and DREBIN Android malware datasets using the best image format (DAM). The security performance in terms of accuracy, recall, precision, and F1-Score can be estimated from these confusion matrices. They show that the proposed CNN algorithm gives low false detection and misclassification rates for the eight examined malware classes in both datasets. Fig 5 presents the accuracy & loss curves of the proposed CNN algorithm for the AMD and DREBIN Android malware datasets using the best image format (DAM). The results confirm that the proposed CNN algorithm provides the highest detection accuracy and the lowest detection loss compared to the other examined CNN algorithms, as also shown in Tables 5 and 6 .
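As a hedged illustration of how these metrics can be estimated from a multi-class confusion matrix, the sketch below computes accuracy together with support-weighted precision, recall, and F1. The weighting scheme is an assumption: the text does not state which averaging the authors used. With support weighting, recall equals accuracy by construction, which would be consistent with the identical accuracy and recall improvements reported in this paper, but that reading is an inference. The function name and the example matrix are hypothetical.

```python
def weighted_metrics(confusion: list) -> dict:
    """Compute accuracy and support-weighted precision, recall, and F1.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        tp = confusion[c][c]
        support = sum(confusion[c])                   # true samples of class c
        predicted = sum(row[c] for row in confusion)  # samples predicted as c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                      # class share of the data
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical 3-class matrix for illustration; NOT data from the paper.
    demo = [[8, 2, 0],
            [1, 5, 4],
            [0, 0, 10]]
    print(weighted_metrics(demo))
```

On imbalanced datasets such as DREBIN and AMD, macro averaging (an unweighted mean over classes) would weigh rare families more heavily and can give noticeably different values.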

https://doi.org/10.1371/journal.pone.0270647.g004

https://doi.org/10.1371/journal.pone.0270647.g005

Table 7 shows the highest increase in performance achieved among the different predictive models in terms of accuracy, F1-Score, precision, and recall. The comparison was conducted to show how various factors can affect the performance of the resulting predictive models when (a) only the type of conversion is changed while the dataset and the applied CNN algorithm are kept the same, (b) the conversion type and dataset are kept the same while the applied CNN algorithm is changed, and (c) the type of conversion and applied CNN algorithm are kept the same while the dataset itself is changed. For example, the accuracy improvement reached 52.80% when the CD type was used instead of the whole APK with the InceptionResNetV2 algorithm and the AMD dataset. On the other hand, the accuracy improved by 107% when the DAM type was used with our proposed algorithm (scratch) in comparison to the InceptionResNetV2 algorithm. The most significant F1-score, precision, and recall improvements reached 95.16%, 71.44%, and 52.8%, respectively, when different conversion types were considered while applying the same CNN algorithm. Additionally, within the same conversion type, applying different CNN algorithms introduced a 139.91% F1-score improvement in the case of the DAM type, a 109.88% precision improvement in the case of the SMALI type, and a 107.04% recall improvement in the case of the DAM type.
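Because several of the reported improvements exceed 100%, they are presumably relative changes with respect to the weaker configuration rather than differences in percentage points (an accuracy cannot rise by more than 100 points). The helper below sketches that reading; the formula is an assumption about how Table 7 was computed, and the function name is hypothetical.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage change of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical accuracies for illustration; NOT values from the paper.
    print(round(relative_improvement(0.40, 0.90), 2))
```

For instance, a hypothetical model whose accuracy rises from 0.40 to 0.90 improves by 125% in relative terms, although the absolute gain is 50 percentage points. Large relative figures therefore often indicate a weak baseline configuration as much as a strong final one.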

https://doi.org/10.1371/journal.pone.0270647.t007

Similar behaviors were observed when the DREBIN dataset was used. When applying the same CNN algorithm but considering different conversion types, the accuracy, F1-Score, precision, and recall improved by 77.55%, 149.71%, 126.16%, and 77.55%, respectively. However, the highest increase in performance reached 131.29%, 236.44%, 192.00%, and 131.29% in terms of accuracy, F1-Score, precision, and recall, respectively, when the APK format was used and different CNN algorithms were applied.

The performance was also affected when the dataset itself was changed. For example, the amount of improvement in the accuracy, F1-Score, precision, and recall was higher when the DREBIN dataset was used, whether by changing the conversion type or the applied CNN algorithm, as also shown in Table 7 .

Based on the above comparisons and discussion, we can emphasize the impact of different factors on the performance of android malware analysis systems. Developers of malware detection systems need to address these factors carefully to build predictive models that meet their needs.

The following section presents another way of assessing malware analysis systems, namely in terms of complexity. Developers can then balance both the security and complexity measures when building their systems.

Complexity analysis

In addition to the security evaluation of the proposed comprehensive android malware analysis and predictive model, we measured the complexity of the models’ execution and pre-processing phases. The models’ execution cost was calculated based on the computational test time and CPU usage. The complexity of all examined CNN algorithms, including our proposed algorithm, was measured for the two android malware datasets, as shown in Tables 8 and 9 . The experiments’ outcomes reveal that (a) in the case of the DREBIN dataset, the test time was higher when the APK as a whole was converted to an image, especially for our proposed CNN algorithm, VGG16, ResNet50, DenseNet121, DenseNet169, and EfficientNetB0; (b) the test time varied among the models even when the same conversion type was used; (c) CPU usage in the case of APK was lower than or close to that of the other types in almost all CNN algorithms, and, in general, the CPU usage values were close in all algorithms for all conversion types; (d) overall, the test time was higher for DREBIN than for AMD in all applied CNN algorithms; (e) the test time varied less for the AMD dataset than for DREBIN, with the highest value observed in EfficientNetB7; (f) CPU usage values were close for all conversion types and applied CNN algorithms in the case of the AMD dataset; and (g) our proposed CNN algorithm achieved lower test time and CPU usage than the transfer learning CNN algorithms on the two tested datasets.
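A minimal sketch of how test time and CPU usage could be recorded for one inference call, using only the Python standard library. The measurement procedure is an assumption, since this section does not say how the reported values were obtained, and `measure` and `fake_predict` are hypothetical names.

```python
import time


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, cpu_percent)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    # process_time sums CPU time over all threads of this process,
    # so the ratio can exceed 100% for multi-threaded workloads.
    cpu_percent = 100.0 * cpu_seconds / wall_seconds if wall_seconds > 0 else 0.0
    return result, wall_seconds, cpu_percent


def fake_predict(batch):
    # Hypothetical stand-in for a trained CNN's inference call.
    return [sum(pixels) % 2 for pixels in batch]


predictions, seconds, cpu = measure(fake_predict, [[1, 2, 3], [4, 5, 6]])
print(predictions, seconds, cpu)
```

In practice the call would be repeated over the whole test set and averaged, because a single short run is dominated by timer noise.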

https://doi.org/10.1371/journal.pone.0270647.t008

https://doi.org/10.1371/journal.pone.0270647.t009

As discussed in the proposed work section, our CNN algorithm was developed from scratch and is not pre-trained. Thus, as clarified in Table 2 (last row), it uses a small number of trainable/non-trainable parameters compared to the pre-trained CNN algorithms and therefore has a lower execution time.

Furthermore, the complexity of the pre-processing phases was measured in terms of (a) the speed of the decompiling and unzipping processes for the two tested android malware datasets, and (b) the size of the images obtained for all conversion types considered in this research. Fig 6 shows histograms of the decompiling and unzipping speeds for the DREBIN and AMD datasets.

https://doi.org/10.1371/journal.pone.0270647.g006

The histograms show the number of samples (y-axis) against the time taken to unzip or decompile them (x-axis). The general observation is that unzipping is faster than decompiling for the two examined android malware datasets. This can be seen by counting the number of samples processed within a given time. For example, in the case of the DREBIN dataset, more than 4000 apps took less than 0.005 seconds to be unzipped. In contrast, most apps (around 3000) took 2 to 3 seconds to be decompiled. AMD dataset apps took less time to unzip and decompile. For example, around 10000 apps took less than 0.01 seconds to be unzipped. In comparison, about 7000 apps took less than 3 seconds to be decompiled.
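The unzipping measurement can be sketched with the standard library, because an APK is a ZIP archive. This is an illustration under assumptions and not the authors' tooling: `time_unzip` and `histogram` are hypothetical helpers, and decompilation, which needs an external tool, is not shown.

```python
import tempfile
import time
import zipfile


def time_unzip(apk_path):
    """Return the seconds needed to extract one APK into a temporary folder."""
    with tempfile.TemporaryDirectory() as out_dir:
        start = time.perf_counter()
        with zipfile.ZipFile(apk_path) as archive:
            archive.extractall(out_dir)
        return time.perf_counter() - start


def histogram(durations, bin_width):
    """Count samples per time bin; bin i covers [i * bin_width, (i + 1) * bin_width)."""
    counts = {}
    for seconds in durations:
        index = int(seconds // bin_width)
        counts[index] = counts.get(index, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical durations in seconds, binned at 0.005 s.
print(histogram([0.001, 0.004, 0.006, 0.012], 0.005))
```

Applying `time_unzip` to every sample of a dataset and passing the durations to `histogram` yields the sample counts per time bin that a figure such as Fig 6 plots.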

Similar outcomes are observed in the case of using the AMD dataset. However, overall, the unzipping and decompilation processes were faster in DREBIN than in the AMD dataset; this is due to the nature of the android apps included in this dataset.

Moreover, Figs 7 – 9 show the histogram distributions of the file sizes of the images resulting from the different types of conversions for the two datasets. Fig 7 compares the file sizes of the produced APK images between the DREBIN and AMD datasets. File sizes were much larger when the AMD dataset was used.

https://doi.org/10.1371/journal.pone.0270647.g007

https://doi.org/10.1371/journal.pone.0270647.g008

https://doi.org/10.1371/journal.pone.0270647.g009

The file size comparisons between the DREBIN and AMD datasets for AM/DAM images and CD/SMALI images are shown in Figs 8 and 9 , respectively. The results show that, for both datasets, the APK images have the largest file size among the images resulting from the different file types, whereas AM/DAM images are the smallest. Moreover, for all file types and produced images, AMD images were larger than DREBIN images. Again, this is due to the nature of the android apps included in this dataset.
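To see why image size tracks the size of the converted file, consider a common byte-to-pixel layout in which every input byte becomes one grayscale pixel. The conversion actually used is defined elsewhere in the paper and is not reproduced in this section, so the layout, the default width of 256, and the function name below are all assumptions.

```python
def bytes_to_grayscale_rows(data: bytes, width: int = 256):
    """Lay out raw bytes as rows of 8-bit pixels, zero-padding the last row.

    Assumption: one byte maps to one pixel value in the range 0..255.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(data)) % width  # bytes needed to fill the final row
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


rows = bytes_to_grayscale_rows(b"\x01\x02\x03", width=2)
print(len(rows), rows)
```

With such a layout the pixel count grows linearly with the input length, which is consistent with whole-APK images being the largest and manifest-only images the smallest.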

Conclusions and future work

Android is the leading operating system worldwide, with around 70% market share. Consequently, it attracts attackers who produce threatening malware apps that serve their malicious intentions. On the other hand, security professionals are highly motivated to build efficient and smart android malware analysis and detection systems. These systems could be built based on vision-based approaches where the android apps or some of their components are converted to images. In this context, CNN algorithms are one of the best choices to generate vision-based predictive solutions.

The main shortcoming of the current related works is the focus on some factors when developing their malware analysis solutions, limiting the selection of best factors and practices that meet the target performance within the available resources.

Therefore, this study aims to provide a nutshell model for analyzing android malware apps that facilitates achieving high performance while respecting the system’s constraints. Furthermore, this research studied intensely the main factors that might significantly influence the performance of detecting android malware from security and complexity perspectives.

This study started with a deep comparison of recent related works in vision-based android malware analysis to identify the primary factors they considered and how they assessed them. We then built a comprehensive malware analysis model that captures the essential aspects, processes, and practices needed to build malware detection systems efficiently. This model gives developers a thorough view of what to choose and why, based on their systems’ needs and resources.

The primary factors that are included in our proposed model are: the type of conversions that decide on which features will be converted to images and how, dataset nature that depends on the kind of android malware apps included in the dataset, CNN algorithms that will be used to build the malware predictive solution, and most importantly the evaluation process that comprehensively assesses the performance of the malware analysis system in terms of complexity and security.

A deep empirical study has been conducted to evaluate the proposed model. The results reveal that the chosen factors and processes can significantly impact the performance of the analysis model, whether in terms of the security metrics such as accuracy, F1-score, precision, recall, or the complexity metrics such as test time, CPU usage, storage size, and pre-processing speed.

As a result, the proposed model will effectively direct the developers of malware analysis systems on which factors to adopt based on their requirements and the chosen factors’ impacts. Therefore, the researchers and developers can benefit from our model to trade off these factors to ensure building malware analysis systems that meet their goals.

For future work, other comprehensive models could be proposed for android malware analysis systems that are not vision-based. Additionally, we could introduce nutshell analysis models for different types of malware targeting other kinds of operating systems. Furthermore, we intend to study the effect of using variable byte sizes and different image sizes for the visual features of android malware applications. Moreover, different misclassification and obfuscation classification scenarios could be analyzed in depth.

S1 and S2 Tables illustrate the security performance of different CNN algorithms utilizing DREBIN and AMD datasets, respectively. As mentioned before, these metrics were not included in the analysis section for simplicity in presenting the results and highlighting the main evaluation metrics in regards to the detection performance.

Supporting information

S1 Table. Security performance of models on DREBIN dataset based on other metrics.

https://doi.org/10.1371/journal.pone.0270647.s001

S2 Table. Security performance of models on AMD dataset based on other metrics.

https://doi.org/10.1371/journal.pone.0270647.s002

Acknowledgments

The authors would like to acknowledge the support of the Security Engineering Lab (SEL) at Prince Sultan University. Moreover, this research was done during the author Iman Almomani’s sabbatical year 2021/2022 from the University of Jordan, Amman–Jordan.


Android Malware Detection

13 papers with code • 0 benchmarks • 1 datasets

Most implemented papers

Continuous Learning for Android Malware Detection

We propose a new hierarchical contrastive learning scheme, and a new sample selection technique to continuously train the Android malware classifier.

AndrODet: An Adaptive Android Obfuscation Detector

omirzaei/androdet • Future Generation Computer Systems 2019

This is typically applied to protect intellectual property in benign apps, or to hinder the process of extracting actionable information in the case of malware.

Why an Android App is Classified as Malware? Towards Malware Classification Interpretation

wubozhi/Xmal • 24 Apr 2020

In this paper, to fill this gap, we propose a novel and interpretable ML-based approach (named XMal) to classify malware with high accuracy and explain the classification result meanwhile.

Deep Learning for Android Malware Defenses: a Systematic Literature Review

In this paper, we conducted a systematic literature review to search and analyze how deep learning approaches have been applied in the context of malware defenses in the Android environment.

Heterogeneous Temporal Graph Transformer: An Intelligent System for Evolving Android Malware Detection

To capture malware evolution, we further consider the temporal dependence and introduce a heterogeneous temporal graph to jointly model malware propagation and evolution by considering heterogeneous spatial dependencies with temporal dimensions.

DexRay: A Simple, yet Effective Deep Learning Approach to Android Malware Detection based on Image Representation of Bytecode

This work-in-progress paper contributes to the domain of Deep Learning based Malware detection by providing a sound, simple, yet effective approach (with available artefacts) that can be the basis to scope the many profound questions that will need to be investigated to fully develop this domain.

Can We Leverage Predictive Uncertainty to Detect Dataset Shift and Adversarial Examples in Android Malware Detection?

Our main findings are: (i) predictive uncertainty indeed helps achieve reliable malware detection in the presence of dataset shift, but cannot cope with adversarial evasion attacks; (ii) approximate Bayesian methods are promising to calibrate and generalize malware detectors to deal with dataset shift, but cannot cope with adversarial evasion attacks; (iii) adversarial evasion attacks can render calibration methods useless, and it is an open problem to quantify the uncertainty associated with the predicted labels of adversarial examples (i. e., it is not effective to use predictive uncertainty to detect adversarial examples).

MaMaDroid2.0 -- The Holes of Control Flow Graphs

The changes in the ratio between benign and malicious samples have a clear effect on each one of the models, resulting in a decrease of more than 40% in their detection rate.

Towards a Fair Comparison and Realistic Evaluation Framework of Android Malware Detectors based on Static Analysis and Machine Learning

serralba/androidmaldet_comparative • 25 May 2022

As in other cybersecurity areas, machine learning (ML) techniques have emerged as a promising solution to detect Android malware.

Efficient Query-Based Attack against ML-Based Android Malware Detection under Zero Knowledge Setting

gnipping/advdroidzero-access-instructions • 5 Sep 2023

The widespread adoption of the Android operating system has made malicious Android applications an appealing target for attackers.

An Analysis of Machine Learning-Based Android Malware Detection Approaches

R. Srinivasan 1 , S Karpagam 2 , M. Kavitha 1 and R. Kavitha 1

Published under licence by IOP Publishing Ltd
Journal of Physics: Conference Series, Volume 2325, International Conference on Electronic Circuits and Signalling Technologies, 02/06/2022 - 03/06/2022, Online
Citation: R. Srinivasan et al 2022 J. Phys.: Conf. Ser. 2325 012058
DOI: 10.1088/1742-6596/2325/1/012058

Author affiliations

1 Professor, Department of Computer Science and Engineering, Vel Tech University, Avadi, Chennai - 600062, Tamil Nadu, India.

2 Associate Professor, Department of Mathematics, Vel Tech Multi Tech Dr Rangarajan Dr Sakunthala Engineering College, VelTech Rangarajan Dr Sagunthala R&D Institute of Science and Technology, Chennai, Tamil Nādu, India.

Despite the fact that Android apps are rapidly expanding throughout the mobile ecosystem, Android malware continues to emerge. Malware operations are on the rise, particularly on Android phones, which make up 72.2 percent of all smartphone sales. Credential theft, eavesdropping, and malicious advertising are just some of the ways hackers attack cell phones. Many researchers have looked into Android malware detection from various perspectives and presented hypotheses and methodologies. Machine learning (ML)-based techniques have proven effective in identifying these attacks because they can build a classifier from a set of training cases, eliminating the need for explicit signature definition in malware detection.

This paper provides a detailed examination of machine-learning-based Android malware detection approaches. According to present research, machine learning and genetic algorithms are a powerful and promising solution for identifying Android malware. In this quick study of Android apps, we go through the Android system architecture, security mechanisms, and malware categorization.

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence . Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Android Malware Detection Based on Informative Syscall Subsequences

  • Open access
  • Published: 09 March 2023

Android malware category detection using a novel feature vector-based machine learning model

  • Hashida Haidros Rahima Manzil   ORCID: orcid.org/0000-0002-2865-2794 1 &
  • S. Manohar Naik   ORCID: orcid.org/0000-0002-1059-8945 1  

Cybersecurity volume 6, Article number: 6 (2023)

Malware attacks on the Android platform are rapidly increasing due to the high consumer adoption of Android smartphones. Advanced technologies have motivated cyber-criminals to actively create and disseminate a wide range of malware on Android smartphones. The researchers have conducted numerous studies on the detection of Android malware, but the majority of the works are based on the detection of generic Android malware. The detection based on malware categories will provide more insights about the malicious patterns of the malware. Therefore, this paper presents a detection solution for different Android malware categories, including adware, banking, SMS malware, and riskware. In this paper, a novel Huffman encoding-based feature vector generation technique is proposed. The experiments have proved that this novel approach significantly improves the efficiency of the detection model. This method makes use of system call frequencies as features to extract malware’s dynamic behavior patterns. The proposed model was evaluated using machine learning and deep learning methods. The results show that the proposed model with the Random Forest classifier outperforms some existing methodologies with a detection accuracy of 98.70%.

Introduction

The Android operating system has dominated the smartphone industry since 2011 (Statista 2011). It has approximately 2.5 billion active users from 190 countries, according to Android Statistics (Business of Apps: Android Statistics 2022). In this digital era, Android smartphones play an essential role in fulfilling multiple user needs, so their impact on various aspects of society is immense. The Android operating system has led the global market since 2014 and held over 87% of the market share in 2022 (Business of Apps: Android Statistics 2022). Android has become a prime target for cybercriminals because of its widespread use and open-source nature. Cybercriminals and anti-malware developers are constantly at odds in the realm of malware detection, and with evolving technologies both sides have quickly modified their strategies. Malicious actors typically strive to profit in an unethical or even unlawful way. Mobile malware could steal sensitive and confidential user data, misuse the user's device to send SMS to premium text services, or install adware that causes users to view malicious websites or download other malware. Researchers have conducted several studies to develop countermeasures against Android operating system and application security issues. Most of these studies address the detection of generic Android malware, and studies based on malware categories are relatively few, even though recognizing the malware category helps identify malicious patterns effectively. Android malware detection can be categorized into signature-based and behavior-based detection. The signature-based approach detects malicious behavior only after malware attacks have already occurred (a posteriori event), because it depends on well-defined patterns generated from malware that has already been observed (Portokalidis et al. 2006; Wressnegger et al. 2017; Oyama et al. 2012). Signature-based solutions also rely on cryptographic algorithms or similarity measurement techniques (Tchakounté et al. 2021). This traditional method successfully detects well-known dangerous patterns because these patterns are already stored in the database, but it cannot detect zero-day attacks.

On the other hand, the behavior-based approaches are further classified into static, dynamic, and hybrid methods. Researchers frequently use static analysis to extract features without running applications on a real device or emulator. This approach is particularly appealing since it requires less computing time and overhead during implementation. Static analysis cannot detect malware that hides or obfuscates its abusive behavior during execution. Dynamic analysis is required to address this issue because it monitors the run-time behavior of the applications. System calls are the most frequently retrieved features by dynamic analysis techniques. System call sequences will adequately reveal the malignant behavior patterns of different malware categories. As a result, this paper presents a dynamic analysis-based method that uses system call frequencies as features. This method successfully identified several categories of Android malware, including riskware, banking trojan, SMS malware, and adware.
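As an illustration of using system call frequencies as features, the sketch below turns a recorded call trace into a fixed-length count vector. The trace, the three-name vocabulary, and the function name are hypothetical; the authors' dataset already ships precomputed frequencies, so this shows only the idea.

```python
from collections import Counter


def syscall_frequency_vector(trace, vocabulary):
    """Count how often each vocabulary system call occurs in a trace.

    Calls that are absent from the trace get a count of 0, so every app
    maps to a vector of the same length and column order.
    """
    counts = Counter(trace)
    return [counts.get(name, 0) for name in vocabulary]


# Hypothetical vocabulary and trace for illustration.
vocabulary = ["read", "write", "open"]
trace = ["read", "open", "read"]
print(syscall_frequency_vector(trace, vocabulary))
```

Fixing the vocabulary order up front matters because a classifier expects the same feature to sit in the same column for every sample.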

The highlights of the proposed methodology are described below:

  • The system call frequencies are utilized to build a detection solution for Android malware categories.
  • A dynamic analysis-based model is proposed for the detection of Android malware categories.
  • A novel feature vector generation method based on Huffman Encoding is incorporated with the detection model.
  • The proposed model was evaluated using machine learning and deep learning techniques and compared with previous studies.

Sect. "Related works" summarizes the related studies on Android malware detection. Sect. "Proposed methodology" thoroughly explains the proposed methodology for detecting Android malware categories. The experimental results are discussed in Sect. "Experiments and Results". Sect. "Conclusion" concludes the paper and discusses potential directions for further research.

Related works

The detection systems can recognize more malware by identifying their related families, prioritizing the risky families, and capturing their impact on users (Alswaina and Elleithy 2020 ). Generally, Android malware detection approaches are classified into two categories: signature-based and behavior-based techniques. The behavior-based detection approaches are further divided into static, dynamic, and hybrid analysis-based techniques.

The signature-based techniques primarily rely on the known signature patterns of the malware. For example, a set of semi-supervised algorithms for the automatic generation of different Android malware family signatures were developed by Atzeni et al. ( 2018 ). However, this approach fails to detect unknown malware attacks. Therefore, researchers generally prefer something other than this traditional technique like behavior-based strategies, which includes static, dynamic, and hybrid methods.

The articles (Alswaina and Khaled 2018; Arindaam Roy. et al. 2020; Zhou et al. 2020; Zhu et al. 2021; Elayan and Mustafa 2021; Imtiaz 2021; Almahmoud and Dalia Alzu’bi et al. 2021; Pei et al. 2020; Kim et al. 2021; Bai et al. 2021) propose static analysis-based techniques for detecting Android malware. Among them, the studies (Alswaina and Khaled 2018; Imtiaz 2021; Pei et al. 2020; Kim et al. 2021; Bai et al. 2021) discuss the family classification of Android malware with a static analysis approach. Alswaina et al. (Alswaina and Khaled 2018) developed a reverse engineering framework to extract permissions and used machine learning approaches to categorize malware families. Ibrahim et al. (2021) proposed DeepAMD for detecting Android malware and its family in static and dynamic layers. Similarly, for malware detection and family attribution, Pei et al. (2020) developed a unique deep-learning system called AMalNet. Kim et al. (2021) propose leveraging built-in custom permissions and machine learning to categorize Android malware families. Bai et al. (2021) used static features such as permissions, API calls, activity, services, broadcast receivers, and content providers to classify Android malware families. This study (Bai et al. 2021) applies a range of machine learning and neural network techniques to manual features from literature reviews and documentary features from Android developers.

Dynamic analysis has been applied to detect Android malware because static analysis cannot detect obfuscated malware or malicious dynamic content loading. Several research articles, such as Mahindru and Sangal (2021a, 2021b), Martín et al. (2018), Abderrahmane et al. (2019), and D'Angelo et al. (2021), explore dynamic analysis-based Android malware detection. Martín et al. (2018) introduce CANDYMAN, a malware classification tool that combines Markov chains with deep learning techniques to categorize Android malware families. Abderrahmane et al. (2019) use system calls as features in convolutional neural networks to detect fraudulent Android applications; this approach depends on pair-level system call dependencies. With the assistance of the CuckooDroid sandbox (2020), a mobile security framework for static and dynamic feature extraction, D'Angelo et al. (2021) created another dynamic analysis-based Android malware classification system.

Hybrid detection systems combine static and dynamic features. For example, Ding et al. (2021) present a hybrid analysis-based technique for identifying Android malware and categorizing malware families. This approach extracts static features (like permissions and intents) and dynamic features (like network traffic data) to classify malware families. Similarly, Taheri et al. (2019) presented a hybrid, two-layer Android malware analyzer that classifies malware categories based on static and dynamic analysis. Dhalaria and Gandotra (2021) described another hybrid approach for both malware detection and family classification, in which the authors employed the information gain feature selection algorithm. Many malware detection studies have recently tried to use machine learning to advance the detection of unidentified Android malware (Meijin et al. 2022). El Fiky et al. (2021) developed machine learning-based approaches for identifying Android malware categories. Zhang et al. (2019) propose a combination of n-gram analysis and online classifiers for Android malware detection and family attribution. Shao et al. (2021) introduced a novel detection technique based on sampling strategies; the authors created two distinct sampling algorithms based on various malware families to address the sample imbalance in the dataset. The original authors of the CICMalDroid dataset, Samaneh Mahdavifar et al. (2022), employed a semi-supervised learning method combined with a pseudo-labelling technique. Pseudo-labelling, introduced by Lee (Lee 2013), uses clustering to label unknown data points, and it is sensitive to the initial predictions. Although this approach reduces label dependency, it may lead to incorrect prediction results if only limited data points are available: with only a few labelled points, proper clustering cannot be performed, and the resulting pseudo-labels may lead the classifier to an incorrect decision boundary. The authors claim that their proposed method achieves an accuracy of 95.19% with only 100 labelled training samples. However, these results could be unstable with so few labelled samples, since pseudo-labelling is highly influenced by the initial predictions. Moreover, the semi-supervised technique is based heavily on a self-training approach, whose main drawback is that wrong predictions made with high confidence propagate the prediction error into model learning.
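The self-training behaviour criticised above can be illustrated with a toy loop. It is a hypothetical sketch and not the method of Mahdavifar et al. or Lee: the classifier is a one-dimensional nearest-centroid rule, confidence is the gap between the two centroid distances, and every name and value is invented for illustration.

```python
def self_train(labelled, unlabelled, margin=1.0, max_rounds=10):
    """Grow a labelled set of (x, label) pairs by pseudo-labelling confident points.

    labelled: (x, label) pairs with label 0 or 1; unlabelled: x values.
    Returns the enlarged labelled list and the points left unlabelled.
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        zeros = [x for x, y in labelled if y == 0]
        ones = [x for x, y in labelled if y == 1]
        if not zeros or not ones:
            raise ValueError("need at least one labelled sample per class")
        c0 = sum(zeros) / len(zeros)  # centroid of class 0
        c1 = sum(ones) / len(ones)    # centroid of class 1
        confident, remaining = [], []
        for x in pool:
            d0, d1 = abs(x - c0), abs(x - c1)
            if abs(d0 - d1) >= margin:
                # Pseudo-label: trust the current model's own prediction.
                confident.append((x, 0 if d0 < d1 else 1))
            else:
                remaining.append(x)
        if not confident:
            break
        labelled.extend(confident)
        pool = remaining
    return labelled, pool


grown, left = self_train([(0.0, 0), (10.0, 1)], [1.0, 9.0, 5.0])
print(grown, left)
```

Each accepted pseudo-label moves a centroid, so a confidently wrong label made in an early round shifts the decision boundary for all later rounds. With very few initial labels the starting centroids are unreliable, which is the instability described above.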

Proposed methodology

This section outlines the proposed methodology for classifying Android malware categories. The proposed method follows a dynamic analysis-based approach that uses system call frequencies as features. Figure 1 shows the model of the proposed detection solution. Each phase is described in detail in the following sections: "Data acquisition", "Data pre-processing", and "Feature selection".

Figure 1. Proposed system model

Data acquisition

This section summarizes the dataset used for the proposed method. The CICMalDroid dataset (CICMalDroid 2020; Mahdavifar et al. 2022; Canadian Institute for Cybersecurity 2020) is used, which comprises Android samples broken down into five categories: Adware, Banking malware, SMS malware, Riskware, and Benign. Each malware category contains a variety of families. For example, the adware category contains families such as Judy, Ewind, Copycat, and GhostClicker. The dataset was gathered from different sources, including the VirusTotal service (2022), the Contagio Mini Dump blog (2022), and MalDozer (Karbab et al. 2018). The proposed methodology uses the CSV (Comma Separated Value) file containing 139 extracted system call frequencies from 11,598 APK (Android Application Package) files across the five categories. Table 1 shows the number of APK samples in each category used in the study.

Data pre-processing

Data pre-processing transforms raw data into a form that a machine learning model can use. It is the first and one of the most crucial stages in developing a machine learning model, and it is used to improve the model's accuracy and efficiency. The basic steps include importing libraries and datasets, finding missing data, removing NULL values or unnecessary data, encoding categorical data, data scaling, augmentation, and feature vector generation. Since the dataset contains no missing or NULL values, the pre-processing tasks applied here are a new Huffman encoding-based feature vector generation technique and data scaling.

Feature vector generation

Huffman encoding-based feature vector generation is used in this phase. According to the literature survey, this is the first use of Huffman encoding-based feature vector generation for detecting Android malware. System calls help to monitor the dynamic behavior of apps, so malicious patterns can be identified by determining how frequently a given application uses each system call. System call frequencies are therefore used as features in the study. The proposed method uses the frequencies of 139 different system calls, which were acquired in the data acquisition phase. Huffman encoding is then applied as an optimization technique that maps system call frequencies to optimized size values in O(n log n) time. The new feature vector is created from these Huffman-encoded values, which further boosts the performance of the detection framework because of the encoding's speed and effectiveness. Huffman's optimality, or minimum-redundancy code property, makes it an efficient technique (Moffat 2019).

Huffman encoding

David A. Huffman invented the Huffman encoding compression technique (Huffman coding 2022). This technique is based on the frequency of occurrence of a data item. According to this encoding scheme, a unique code is obtained for each system call. Given $m$ application samples, $A = \{a_1, a_2, a_3, \ldots, a_m\}$, the set $S = \{s_1, s_2, s_3, \ldots, s_n\}$ represents the $n$ features (system call frequencies) used by the $m$ APK samples (here, $m = 11{,}598$ and $n = 139$), as shown in Table 2.

The feature set $F_{a,s}$ then represents all system call frequencies of all APK samples (Fig. 2).

Figure 2. Feature set of proposed system

For each row of system call frequencies, the corresponding Huffman tree is built, and each system call is mapped to a unique code by assigning 0 to the left child and 1 to the right child at every node. Reading these labels from the root down to a leaf yields a sequence of 0s and 1s for each leaf node. Table 3 lists the mapped Huffman codes of some of the system calls.

As the final step, the optimum size required for each system call frequency value is obtained by multiplying the length of the corresponding Huffman code by the value of the system call frequency (Eq. 1). These values are used as the new feature vector for the final detection model. The pseudo-code of Huffman encoding is shown in Fig. 3.
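The encoding step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `huffman_code_lengths` and `huffman_feature_vector` are hypothetical, one Huffman tree is built per sample row, ties between equal weights are broken by node order, and zero frequencies are simply kept as zero-weight leaves.

```python
import heapq


def huffman_code_lengths(freqs):
    """Return the Huffman code length of each entry in one row of frequencies."""
    n = len(freqs)
    if n == 0:
        return []
    if n == 1:
        return [1]  # a single symbol still needs one bit
    # Heap items are (weight, node id); ids 0..n-1 are the leaves.
    heap = [(f, i) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    parent = [None] * (2 * n - 1)
    next_id = n
    while len(heap) > 1:
        w1, a = heapq.heappop(heap)  # two least-weight nodes become siblings
        w2, b = heapq.heappop(heap)
        parent[a] = parent[b] = next_id
        heapq.heappush(heap, (w1 + w2, next_id))
        next_id += 1
    # The root has id 2n-2. A parent always has a larger id than its children,
    # so walking ids downwards fills in depths top-down.
    depth = [0] * (2 * n - 1)
    for node in range(2 * n - 3, -1, -1):
        depth[node] = depth[parent[node]] + 1
    return depth[:n]  # leaf depth equals code length


def huffman_feature_vector(freqs):
    """Multiply each frequency by the length of its Huffman code."""
    return [length * f for length, f in zip(huffman_code_lengths(freqs), freqs)]


row = [5, 9, 12, 13, 16, 45]           # toy frequencies, not from the dataset
print(huffman_code_lengths(row))       # [4, 4, 3, 3, 3, 1]
print(huffman_feature_vector(row))     # [20, 36, 36, 39, 48, 45]
```

The heap keeps the construction at O(n log n) per row, which matches the complexity quoted in the text; the actual codes (the 0/1 strings) are not needed here because only their lengths enter the feature value.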

Figure 3. Pseudo-code of Huffman encoding

a. Minimum-redundancy code / Optimality property

Minimum redundancy, or optimality, means that the average number of coding digits per feature is minimized. This property makes Huffman encoding an efficient technique: any application of Huffman's algorithm always creates a minimum-redundancy code (Huffman 1952). It therefore helps to generate an optimal feature vector for the final detection model, which can improve the model's detection accuracy. The argument runs as follows. A minimum-redundancy code exists in which the two least-value features are siblings that share a common parent in the corresponding binary code tree. These two features are joined into a combined node whose weight is their sum. A minimum-redundancy code is then created for this feature set, which is smaller by one. Expanding the combined node back into its two components yields a minimum-redundancy code for the original set of features (Moffat 2019).
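The merge step in this argument can be checked numerically. The following sketch, with hypothetical helper names and toy weights, relies on a standard fact about Huffman trees: the total weighted code length equals the sum of the weights of all combined nodes. It compares that total against a fixed-length code that spends the same number of bits on every feature.

```python
import heapq


def huffman_total_cost(weights):
    """Sum of weight * code length over all items, via the merge weights."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # The two least-weight items become siblings under a combined node.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # every merge adds one bit to all leaves beneath it
        heapq.heappush(heap, merged)
    return total


def fixed_length_cost(weights):
    """Cost when every item gets the same number of bits."""
    bits = max(1, (len(weights) - 1).bit_length())  # ceil(log2(n)) without floats
    return bits * sum(weights)


weights = [5, 9, 12, 13, 16, 45]       # toy values, not from the dataset
print(huffman_total_cost(weights))     # 224
print(fixed_length_cost(weights))      # 300
```

For these toy weights the Huffman code needs 224 weighted bits against 300 for a 3-bit fixed-length code, which is the kind of saving the optimality property guarantees can never be negative.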

Data scaling

Feature scaling, or data scaling, is one of the most critical phases in data pre-processing before building a machine learning model. It brings the features onto a comparable range so that features with large magnitudes do not dominate those with small magnitudes. Scaling can change a machine learning model's performance noticeably, from poor to better. In this work, standard and min-max scaling techniques are employed.
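As a small illustration of the two scaling techniques named above, the sketch below scales a single feature column using only the Python standard library. The function names are hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions made for this example; the paper does not state these details.

```python
from statistics import fmean, pstdev


def standard_scale(column):
    """Shift to zero mean and divide by the (population) standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mu) / sigma for x in column]


def min_max_scale(column):
    """Map the smallest value to 0 and the largest value to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - lo) / (hi - lo) for x in column]


print(min_max_scale([0, 5, 10]))   # [0.0, 0.5, 1.0]
print(standard_scale([2, 4, 6]))   # roughly [-1.2247, 0.0, 1.2247]
```

Standard scaling keeps the shape of the distribution and is less affected by a fixed output range, while min-max scaling guarantees values in [0, 1] but is sensitive to a single extreme value.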

Feature selection

Feature selection reduces the number of input variables by keeping only those features identified as most useful for predicting the target variable. The proposed model uses the Chi-square technique to select appropriate features, which reduces the feature space for the final machine learning model.
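The following sketch shows one common way to compute a chi-square score per feature for non-negative, count-like features and then keep the highest-scoring ones. It is an illustration only: the function names, the toy data, and the choice of `k` are hypothetical, and the paper does not state how many features were retained or at which point in the pipeline the scores were computed. The sketch assumes non-negative inputs.

```python
from collections import defaultdict


def chi2_scores(X, y):
    """Chi-square statistic of each non-negative feature against the class labels."""
    n_samples = len(X)
    n_features = len(X[0])
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for rows in rows_by_class.values():
        class_prob = len(rows) / n_samples
        for j in range(n_features):
            observed = sum(r[j] for r in rows)         # mass of feature j in this class
            expected = class_prob * feature_total[j]   # mass if independent of the class
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores


def select_k_best(X, y, k):
    """Return the column indices of the k features with the highest scores."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: feature 0 differs strongly between classes, feature 1 does not.
X = [[10, 1], [12, 1], [0, 1], [1, 1]]
y = ["malware", "malware", "benign", "benign"]
print(chi2_scores(X, y))
print(select_k_best(X, y, 1))   # [0]
```

A feature whose total is spread across classes in proportion to the class sizes scores zero and is the first candidate for removal.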

Classification

In this phase, Android malware category classification is evaluated using machine learning and deep learning techniques. The machine learning classifiers take the feature vector obtained in the previous step as input. Classifiers such as Random Forest, Decision Tree, Logistic Regression, Support Vector Machine, and AdaBoost are employed to detect Android malware categories. Experiments were also carried out with convolutional neural networks and multi-layer perceptrons.

Experiments and results

This section discusses the experiments and results. The proposed system is built with a novel method for creating feature vectors based on Huffman coding. A total of 139 features (system call frequencies) from 11,598 data samples provided by CICMalDroid 2020 (Mahdavifar 2020; Mahdavifar et al. 2022; Canadian Institute for Cybersecurity 2020) were used in the system. The proposed feature vector generation technique significantly improves the overall effectiveness of the detection model. Several data scaling techniques exist, and selecting the appropriate one is a key issue for machine learning; previous studies (Ambarwari et al. 2020; Shahriyari 2019) confirm that the choice of scaling method affects various ML algorithms. The proposed solution uses standard scaling because it yields better performance with the machine learning models. The experimental results are shown in Tables 4 and 5, and the corresponding result graphs are depicted in Figs. 4 and 5.

Figure 4. Results without proposed feature vector generation

Figure 5. Results with Huffman encoding-based feature vector generation

Table 4 shows the results without the proposed feature vector generation. The random forest model performs best, with a detection accuracy of 0.931. The decision tree, support vector machine, k-nearest neighbor, multi-layer perceptron, and CNN models all achieve detection accuracies above 80%.

The experimental results with the proposed Huffman encoding-based approach are presented in Table 5. The proposed approach yields a higher accuracy of 98.70% and also improves the performance of the other classifiers, which indicates that the proposed feature vector generation process is effective. Huffman encoding supplies an optimised size value for each feature in O(n log n) time, which increases the detection model's efficacy.

The proposed feature vector generation technique is compared with logarithmic transformation-based feature vector generation (Table 6). According to the results, the log transformation achieves its highest accuracy, 93.06%, with the Random Forest model. Log transformation is known to reduce the skewness of data by compressing the range of large numbers and extending the range of small numbers. However, it may lead to high memory consumption and increased time complexity, and it may lower the models' accuracy. In the Huffman encoding method, by contrast, the minimum-redundancy code property and the higher encoding speed enable it to produce an optimal feature vector, which can improve the performance of the final detection model. The experiments also show that Huffman encoding-based feature vector generation gives better results than logarithmic transformation-based feature vector generation. Figure 6 shows the corresponding result graph of the logarithmic transformation-based approach.
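For readers unfamiliar with the baseline, the sketch below shows what a logarithmic transformation does to a row of frequencies. The paper does not state the base or offset it used, so the choice of `math.log1p` (the natural logarithm of 1 + x, which keeps zero counts at zero) and the function name `log_transform` are assumptions for illustration.

```python
import math


def log_transform(freqs):
    """Apply log(1 + x) to each frequency; zero counts stay at zero."""
    return [math.log1p(f) for f in freqs]


row = [0, 9, 99999]                    # toy frequencies, not from the dataset
print(log_transform(row))              # roughly [0.0, 2.3026, 11.5129]
```

A spread of five orders of magnitude in the raw counts shrinks to a range of about 0 to 11.5, which is the range compression the text refers to.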

Figure 6. Results with logarithmic transformation

Figure 7 shows that the proposed method performs better than both the method without any feature vector generation and the baseline technique, i.e., the logarithmic transformation technique.

Figure 7. Result comparison of proposed method

As shown in Table 7 , the effectiveness of the proposed system is compared to that of a few existing methodologies. The results show that the proposed system outperforms the alternatives.

This paper presents an Android malware category detection system based on a novel Huffman encoding-based feature vector generation scheme. The proposed model includes the phases of data acquisition, data pre-processing, feature selection, and classification. The system call frequencies of 11,598 Android application samples were used as features because this dynamic feature helps to recognize the dynamic behavior patterns of each malware category. The Huffman encoding technique is employed in the data pre-processing phase to provide the optimum size of the system call frequencies used by the applications. Several machine learning-based experiments were conducted to evaluate the effectiveness of the proposed system. Based on the findings, the proposed method using the Random Forest model outperforms the other models, with an accuracy of 98.70%. The results were also compared with the performance of logarithmic transformation-based feature vector generation, and the proposed approach showed better results. Additionally, the model's effectiveness was compared with a few earlier methods, and this work was found to yield better outcomes. The solution relies on a single dynamic feature, system calls; future research should therefore integrate static features such as permissions, API calls, and intents into the detection solution.

Availability of data and materials

The datasets analysed during the current study are available from the Canadian Institute for Cybersecurity, CICMalDroid 2020 (Canadian Institute for Cybersecurity 2020), https://www.unb.ca/cic/datasets/maldroid-2020.html .

Abderrahmane A, Adnane G, Yacine C, Khireddine G (2019) Android malware detection based on system calls analysis and CNN classification. In: 2019 IEEE wireless communications and networking conference workshop (WCNCW) (pp 1–6). IEEE

Almahmoud M, Alzubi D, Yaseen Q (2021) ReDroidDet: android malware detection based on recurrent neural network. Procedia Comput Sci 184:841–846. https://doi.org/10.1016/j.procs.2021.03.105


Alswaina F, Elleithy K (2018) Android malware permission-based multi-class classification using extremely randomized trees. IEEE Access. https://doi.org/10.1109/ACCESS.2018.2883975

Alswaina F, Elleithy K (2020) Android malware family classification and analysis: current status and future directions. Electronics 9(6):942

Ambarwari A, Adrian QJ, Herdiyeni Y (2020) Analysis of the effect of data scaling on the performance of the machine learning algorithm for plant identification. J Resti Rekayasa Sist Dan Teknol Inf 4:117–122


Atzeni A, Diaz F, Marcelli A, Sánchez A, Squillero G, Tonda A (2018) Countering android malware: a scalable semi-supervised approach for family-signature generation. IEEE Access. https://doi.org/10.1109/ACCESS.2018.2874502

Bai Y, Xing Z, Ma D, Li X, Feng Z (2021) Comparative analysis of feature representations and machine learning methods in android family classification. Comput Netw 184:107639

Business of Apps: Android Statistics (2022). Android Statistics (2022) - Business of Apps Accessed on 20 July 2022

Canadian Institute for Cybersecurity, CICMalDroid 2020, https://www.unb.ca/cic/datasets/maldroid-2020.html , Accessed on 30 Mar 2022

Contagio Mobile http://contagiominidump.blogspot.com/ , Accessed on 30 Mar 2022

CuckooDroid (2020). Cuckoodroid book. Retrieved 2020, from https://cuckoo-droid.readthedocs.io/en/latest/

D’Angelo G, Palmieri F, Robustelli A, Castiglione A (2021) Effective classification of android malware families through dynamic features and neural networks. Connect Sci 33(3):786–801. https://doi.org/10.1080/09540091.2021.1889977

Dhalaria M, Gandotra E (2021) A hybrid approach for android malware detection and family classification. Int J Interact Multimed Artif Intel. https://doi.org/10.9781/ijimai.2020.09.001

Ding C, Luktarhan N, Lu B, Zhang W (2021) A hybrid analysis based approach to android malware family classification. Entropy 23:1009. https://doi.org/10.3390/e23081009

Elayan ON, Mustafa AM (2021) Android malware detection using deep learning. Procedia Comput Sci 184:847–852. https://doi.org/10.1016/j.procs.2021.03.106

Fiky AHE, Shenawy AE, Madkour MA (2021) Android malware category and family detection and identification using machine learning. arXiv preprint https://arxiv.org/abs/2107.01927

Huffman DA (1952) A method for the construction of minimum-redundancy codes. Proc Inst Radio Eng 40(9):1098–1101


Huffman coding, https://en.wikipedia.org/wiki/Huffman_coding , Accessed on 30 Mar 2022

Imtiaz SI, Rehman SU, Javed AR, Jalil Z, Liu X, Alnumay WS (2021) DeepAMD: detection and identification of android malware using high-efficient deep artificial neural network. Future Gener Comput Syst 115:844–856. https://doi.org/10.1016/j.future.2020.10.008

Roy A, Jas DS, Jaggi G, Sharma K (2020) Android malware detection based on vulnerable feature aggregation. In: International Conference on Smart Sustainable Intelligent Computing and Applications under ICITETM2020

Karbab E, Debbabi M, Derhab A, Mouheb D (2018) MalDozer: automatic framework for android malware detection using deep learning. Digit Investig 24:S48–S59. https://doi.org/10.1016/j.diin.2018.01.007

Kim M, Kim D, Hwang C, Cho S, Han S, Park M (2021) Machine-learning-based android malware family classification using built-in and custom permissions. Appl Sci 11:10244. https://doi.org/10.3390/app112110244

Lee DH (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML. 3(2)

Mahdavifar S, Alhadidi D, Ghorbani AA (2022) Effective and efficient hybrid android malware classification using pseudo-label stacked auto-encoder. J Netw Syst Manage 30(1):1–34

Mahindru A, Sangal AL (2021a) MLDroid—framework for Android malware detection using machine learning techniques. Neural Comput Appl 33:5183–5240. https://doi.org/10.1007/s00521-020-05309-4

Mahindru A, Sangal AL (2021b) SemiDroid: a behavioral malware detector based on unsupervised machine learning techniques using feature selection approaches. Int J Mach Learn Cyber 12:1369–1411. https://doi.org/10.1007/s13042-020-01238-9

Mahdavifar S, Kadir AFA, Fatemi R, Alhadidi D, Ghorbani AA (2020) Dynamic android malware category classification using semi-supervised deep learning, In: The 18th IEEE international conference on dependable, autonomic, and secure computing (DASC), 17–24

Martín A, Rodríguez-Fernández V, Camacho D (2018) CANDYMAN: classifying android malware families by modelling dynamic traces with Markov chains. Eng Appl Artif Intell 74:121–133. https://doi.org/10.1016/j.engappai.2018.06.006

Meijin L, Zhiyang F, Junfeng W, Luyu C, Qi Z, Tao Y, Yinwei W, Jiaxuan G (2022) A systematic overview of android malware detection. Appl Artif Intel 36(1):2007327. https://doi.org/10.1080/08839514.2021.2007327

Moffat A (2019) Huffman coding. ACM Comput Surv (CSUR) 52(4):1–35

Nicheporuk A, Savenko O, Nicheporuk A, Nicheporuk Y (2020) An android malware detection method based on CNN mixed-data model CEUR Workshop Proceedings Kharkiv, Ukraine. 2732:198–213

Oyama Y, Giang TTD, Chubachi Y, Shinagawa T, Kato K (2012) Detecting malware signatures in a thin hypervisor, In: Proceedings of the 27th Annual ACM symposium on applied computing, SAC 12, ACM, New York, NY, USA, pp 1807–1814,  https://doi.org/10.1145/2245276.2232070

Pei X, Long Y, Tian S (2020) AMalNet: a deep learning framework based on graph convolutional networks for malware detection. Comput Secur 93:101792. https://doi.org/10.1016/j.cose.2020.101792

Portokalidis G, Slowinska A, Bos H (2006) Argos: an emulator for fingerprinting zero-day attacks for advertised honeypots with automatic signature generation, In: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, EuroSys ’06, ACM, New York, NY, USA, pp 15–27, https://doi.org/10.1145/1217935.1217938

Shahriyari L (2019) Effect of normalization methods on the performance of supervised learning algorithms applied to HTSeq-FPKM- UQ data sets: 7SK RNA expression as a predictor of survival in patients with colon adenocarcinoma. Briefings Bioinform 20:985–994

Shao K, Xiong Q, Cai Z (2021) FB2Droid: a novel malware family-based bagging algorithm for android malware detection. Secur Commun Netw

Statista: Share of Android OS of global smartphone shipments from 1st quarter 2011 to 2nd quarter 2018* (2022) Android global phone market share 2018 | Statista Accessed on 21 July 2022

Taheri L, Kadir AFA, Lashkari AH (2019) Extensible android malware detection and family classification using network-flows and API-calls. In: 2019 International carnahan conference on security technology (ICCST) (pp 1–8). IEEE

Tchakounté F, Ngassi RCN, Kamla VC et al (2021) LimonDroid: a system coupling three signature-based schemes for profiling Android malware. Iran J Comput Sci 4:95–114. https://doi.org/10.1007/s42044-020-00068-w

Virus Total (2022) https://www.virustotal.com/gui/home/upload , Accessed on 30 Mar 2022

Wressnegger C, Freeman K, Yamaguchi F, Rieck K (2017) Automatically inferring malware signatures for anti-virus assisted attacks. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, ACM, New York, NY, USA, pp 587–598,  https://doi.org/10.1145/3052973.3053002

Zhang L, Thing VL, Cheng Y (2019) A scalable and extensible framework for android malware detection and family attribution. Comput Secur 80:120–133

Zhou H, Yang X, Pan H, Guo W (2020) An android malware detection approach based on SIMGRU. IEEE Access 8:148404–148410. https://doi.org/10.1109/ACCESS.2020.3007571

Zhu H, Li Y, Li R, Li J, You Z, Song H (2021) SEDMDroid: an enhanced stacking ensemble framework for android malware detection. IEEE Trans Netw Sci Eng 8(2):984–994. https://doi.org/10.1109/TNSE.2020.2996379


Acknowledgements

There is no third person or organisation to acknowledge.

The authors declare that the research did not use any funding sources. There are no funding sources to disclose.

Author information

Authors and affiliations.

Department of Computer Science, Central University of Kerala, Kasaragod, 671316, Kerala, India

Hashida Haidros Rahima Manzil & S. Manohar Naik


Contributions

HHRM: Research, Data analysis, Documentation, Reporting, Implementations, Problem formulation, Coding, Testing. Dr. MNS: Supervision, Management, Validation. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hashida Haidros Rahima Manzil .

Ethics declarations

Competing interests.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Manzil, H.H.R., Manohar Naik, S. Android malware category detection using a novel feature vector-based machine learning model. Cybersecurity 6 , 6 (2023). https://doi.org/10.1186/s42400-023-00139-y


Received : 24 August 2022

Accepted : 11 January 2023

Published : 09 March 2023

DOI : https://doi.org/10.1186/s42400-023-00139-y


  • Android malware
  • Dynamic analysis
  • Malware category
  • Huffman coding


MalDMTP: A Multi-tier Pooling Method for Malware Detection based on Graph Classification

  • Published: 26 April 2024


  • Liang Kou 1 ,
  • Cheng Qiu 1 ,
  • Meiyu Wang 2 ,
  • Hua Liu 3 ,
  • Yan Du 3 &
  • Jilin Zhang 1  


With the development and adoption of cloud platforms in various fields, malware attacks have become a serious threat to the Internet cloud ecosystem. However, the pooling process of existing graph classification techniques for malware variant detection uses only a single, serial strategy, so localized malicious behaviors of malware may be overlooked. In this paper, we propose MalDMTP, a malware detection framework based on multilevel graph classification learning, which implements the graph pooling process for malware classification in parallel and performs graph instance-based discrimination. In particular, MalDMTP first constructs an API call graph based on results obtained from dynamic execution of malware. It then combines multiple graph neural network learning strategies through multi-level pooling to learn the global importance of nodes in the pooled graph and to extract node representations from multiple perspectives for heterogeneous graphs. After that, the node representations are aggregated into graph representations by the graph-level pooling function GMT, which is based on a multi-head attention mechanism, and passed through a classifier to obtain malware prediction labels. Experimental results show that the proposed MalDMTP achieves 96.53% accuracy on the Alibaba cloud malware dataset, an improvement of 1.9% to 7.6% over previous single-graph pooling methods on the graph classification task of malware detection.


Data availability.

No datasets were generated or analysed during the current study.

AV-ATLAS (2022) Malware. The AV-TEST Institute. https://www.av-test.org/en/statistics/malware . Accessed 1 June 2023

SONICWALL (2023) 2023 SonicWall Cyber Threat Report. https://www.sonicwall.com/resources/white-papers/2023-sonicwall-cyber-threat-report . Accessed 20 Dec 2023

Egele M, Scholte T, Kirda E, Krügel C (2012) A survey on automated dynamic malware-analysis techniques and tools. ACM Comput Surv 44(6):1–42. https://doi.org/10.1145/2089125.2089126

Raff E, Zak R, Cox R, Sylvester J, Yacci P, Ward R, Tracy A, McLean M, Nicholas CK (2018) An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques 14:1–20. https://doi.org/10.1007/s11416-016-0283-1

Article   Google Scholar  

Bernardi Mario C, Marta D, Damiano M, Fabio M, Francesco (2019) Dynamic malware detection and phylogeny analysis using process mining. Int J Inf Secur 18:257–284. https://doi.org/10.1007/s10207-018-0415-3

Huang W, Stokes JW (2016) MtNet: A Multi-Task Neural Network for Dynamic Malware Classification. In: Caballero J, Zurutuza U, Rodríguez R (eds.) Detection of Intrusions and Malware, and Vulnerability Assessment. San Sebastián, Spain, pp 399-418

Zhang H, Lu G, Zhan M, Zhang B (2022) Semi-Supervised Classification of Graph Convolutional Networks with Laplacian Rank Constraints. Neural Process Lett 54:2645–2656. https://doi.org/10.1007/s11063-020-10404-7

Liu Z, Zhou J (2020) Graph Attention Networks. In: Introduction to Graph Neural Networks. Synth Lect Artif Intell Mach Learn pp 39-41

Hu Z, Dong Y, Wang K, Chang K, Sun Y (2020) GPT-GNN: Generative Pre-Training of Graph Neural Networks. Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. Association for Computing Machinery, New York, NY, USA, pp 1857–1867. https://doi.org/10.1145/3394486.3403237

Wang YG, Li M, Ma Z, Montúfar G, Zhuang X, Fan Y (2019) Haar Graph Pooling. In Proceedings of the 37th international conference on machine learning (ICML’20), 923:9952–9962. https://doi.org/10.5555/3524938.3525861

Peng H, Li J, Song Y, Yang R, Ranjan R, Yu PS, He L (2021) Streaming Social Event Detection and Evolution Discovery in Heterogeneous Information Networks. ACM Transactions on Knowledge Discovery from Data (TKDD) 15:1–33. https://doi.org/10.1145/3447585

Peng H, Li J, Gong Q, Wang S, He L, Li B, Wang L, Yu PS (2019) Hierarchical Taxonomy-Aware and Attentional Graph Capsule RCNNs for Large-Scale Multi-Label Text Classification. IEEE Trans Knowl Data Eng 33:2505–2519. https://doi.org/10.1109/TKDE.2019.2959991

Bruna J, Zaremba W, Szlam A, LeCun Y (2013) Spectral Networks and Locally Connected Networks on Graphs. CoRR, abs/1312.6203

Kipf T, Welling M (2017) Semi-Supervised Classification with Graph Convolutional Networks. Int Conf Learn Representations pp 1–14

Hamilton WL, Ying Z, Leskovec J (2017) Inductive Representation Learning on Large Graphs. Neural Inform Process Syst pp 1025–1035. https://doi.org/10.5555/3294771.3294869

Xu K, Li C, Tian Y, Sonobe T, Kawarabayashi K, Jegelka S (2018) Representation Learning on Graphs with Jumping Knowledge Networks. Int Conf Mach Learn pp 5453–5462

Abu-El-Haija S, Kapoor A, Perozzi B, Lee J (2018) N-GCN: Multi-scale Graph Convolution for Semi-supervised Node Classification. Conf Uncertain Artif Intell pp 841–851

Cai L, Ji S (2020) A Multi-Scale Approach for Graph Link Prediction. AAAI Conference on Artificial Intelligence 34:3308–3315. https://doi.org/10.1609/aaai.v34i04.5731

Xiao Y, Li R, Lu X, Liu Y (2021) Link prediction based on feature representation and fusion. Inf Sci 548:1–17

Article   MathSciNet   Google Scholar  

You J, Ying R, Leskovec J (2019) Position-aware Graph Neural Networks. Int Conf Mach Learn pp 7134–7143

Nguyen TD, Phung D (2019) Unsupervised universal self-attention network for graph classification. arXiv:1909.11855

Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Neural Inform Process Syst 29

Vinyals O, Bengio S, Kudlur M (2015) Order Matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391

Zhang M, Cui Z, Neumann M, Chen Y (2018) An end-to-end deep learning architecture for graph classification. In: Proceedings of the AAAI conference on artificial intelligence, vol 32(1)

Gao H, Ji S (2019) Graph u-nets. In international conference on machine learning, pp 2083–2092

Lee J, Lee I, Kang J (2019) Self-attention graph pooling. In: International conference on machine learning pp 3734–3743

Zhang Z, Bu J, Ester M, Zhang J, Li Z, Yao C, Huifen D, Yu Z, Wang C (2021) Hierarchical Multi-View Graph Pooling With Structure Learning. IEEE Trans Knowl Data Eng 35:545–559

Google Scholar  

Diehl F (2019) Edge contraction pooling for graph neural networks. arXiv preprint arXiv:1905.10990

Ying Z, You J, Morris C, Ren X, Hamilton W, Leskovec J (2018) Hierarchical graph representation learning with differentiable pooling. Adv Neural Inform Processing Syst 31

Yuan H, Ji S (2020) Structpool: Structured graph pooling via conditional random fields. In: Proceedings of the 8th international conference on learning representations

Bianchi FM, Grattarola D, Alippi C (2020) Spectral clustering with graph neural networks for graph pooling. In: International conference on machine learning pp 874–883

Ranjan E, Sanyal S, Talukdar P (2020) Asap: Adaptive structure aware pooling for learning hierarchical graph representations. In Proceedings of the AAAI conference on artificial intelligence 34(04):5470–5477

Baek J, Kang M, Hwang SJ (2021) Accurate learning of graph representations with graph multiset pooling

John TS, Thomas T, Emmanuel S (2020) Graph convolutional networks for android malware detection with system call graphs. In: 2020 Third ISEA conference on security and privacy pp 162–170

Cai M, Jiang Y, Gao C, Li H, Yuan W (2021) Learning features from enhanced function call graphs for Android malware detection. Neurocomputing 423:301–307

Gao H, Cheng S, Zhang W (2021) GDroid: Android malware detection and classification with graph convolutional network. Comput & Secur 106:102264

Deldar F, Abadi M, Ebrahimifard M (2022) Android Malware Detection Using Supervised Deep Graph Representation Learning. In: 2022 12th International conference on computer and knowledge engineering pp 348–354

Wu H, Luktarhan N, Tian G, Song Y (2023) An Android Malware Detection Approach to Enhance Node Feature Differences in a Function Call Graph Based on GCNs. Sensors 23(10):4729

Ying C, Cai T, Luo S, Zheng S, Ke G, He D, She Y, Liu TY (2021) Do transformers really perform badly for graph representation? Adv Neural Inf Process Syst 34:28877–28888

Xu K, Hu W, Leskovec J, Jegelka S (2019) How powerful are graph neural networks? In: 7th International conference on learning representations

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inform Process Syst 30

Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450

Lin Y, Zhao H, Ma X, Tu Y, Wang M (2020) Adversarial attacks in modulation recognition with convolutional neural networks. IEEE Trans Reliab 70(1):389–401

Tu Y, Lin Y, Hou C, Mao S (2020) Complex-valued networks for automatic modulation classification. IEEE Trans Veh Technol 69(9):10085–10089

Liu C, Li B, Zhao J, Zhen Z, Liu X, Zhang Q (2022) FewM-HGCL: Few-shot malware variants detection via heterogeneous graph contrastive learning. IEEE Trans Dependable Secure Comput

Liu C, Fu X, Wang Y, Guo L, Liu Y, Lin Y, Zhao H, Gui G (2023) Overcoming data limitations: a few-shot specific emitter identification method using self-supervised learning and adversarial augmentation. IEEE Trans Inf Forensics Secur 19:500–513

Yao Z, Fu X, Guo L, Wang Y, Lin Y, Shi S, Gui G (2023) Few-shot specific emitter identification using asymmetric masked auto-encoder. IEEE Commun Lett 27(10):2657–2661

Chen Z, Xiang J, Lu Y, Xuan Q, Wang Z, Chen G, Yang X (2023) RGP: Neural Network Pruning Through Regular Graph With Edges Swapping. IEEE Trans Neural Netw Learn Syst

Xuan Q, Zhou J, Qiu K, Chen Z, Xu D, Zheng S, Yang X (2022) AvgNet: Adaptive visibility graph neural network and its application in modulation classification. IEEE Trans Netw Sci Eng 9(3):1516–1526

Zheng Z, Shi X, He L, Jin H, Wei S, Dai H, Peng X (2020) Feluca: A two-stage graph coloring algorithm with color-centric paradigm on gpu. IEEE Trans Parallel Distrib Syst 32(1):160–173

Zheng Z, Zhao C, Xie P, DuM B (2023) Galliot: Path Merging Based Betweenness Centrality Algorithm on GPU. In: Proceedings of the IEEE International Conference on Computer Communications (INFOCOM '23). New York, USA, pp 17–20

Huang Q, He H, Singh A, Lim SN, Benson AR (2020) Combining label propagation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993

Xu Y, Wang J, Guang M, Yan C, Jiang C (2023) Multistructure Graph Classification Method With Attention-Based Pooling. IEEE Trans Comput Soc Syst 10:602–613

Acknowledgements

This paper has been supported by the Key Technology Research and Development Program of the Zhejiang Province under Grant 2022C01125, and the General Research Program of the Department of Education under Grant Y202044517.

Author information

Authors and affiliations.

College of Cyberspace, Hangzhou Dianzi University, Hang Zhou, China

Liang Kou, Cheng Qiu & Jilin Zhang

College of Communication Engineering, Hangzhou Dianzi University, Hang Zhou, China

DBAPPSecurity Co., Ltd, Hang Zhou, China

Hua Liu & Yan Du

Contributions

All authors contributed to the writing, editing and proofing of the manuscript. All authors read and approved the final manuscript. The specific contributions of each author are as follows: Liang Kou: Conceptualization, Writing for Review & Editing. Cheng Qiu: Conceptualization, Writing for Original Draft. Meiyu Wang: Formal analysis, Data curation. Hua Liu: Validation. Yan Du: Visualization. Jilin Zhang: Supervision, Project administration.

Corresponding author

Correspondence to Meiyu Wang.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Kou, L., Qiu, C., Wang, M. et al. MalDMTP: A Multi-tier Pooling Method for Malware Detection based on Graph Classification. Mobile Netw Appl (2024). https://doi.org/10.1007/s11036-024-02318-8

Accepted: 30 March 2024

Published: 26 April 2024

DOI: https://doi.org/10.1007/s11036-024-02318-8

  • Malware detection
  • Graph classification
  • Graph pooling
  • Attention mechanism

Title: Machine Learning for Windows Malware Detection and Classification: Methods, Challenges and Ongoing Research

Abstract: In this chapter, readers will explore how machine learning has been applied to build malware detection systems designed for the Windows operating system. This chapter starts by introducing the main components of a Machine Learning pipeline, highlighting the challenges of collecting and maintaining up-to-date datasets. Following this introduction, various state-of-the-art malware detectors are presented, encompassing both feature-based and deep learning-based detectors. Subsequent sections introduce the primary challenges encountered by machine learning-based malware detectors, including concept drift and adversarial attacks. Lastly, this chapter concludes by providing a brief overview of the ongoing research on adversarial defenses.

COMMENTS

  1. A Systematic Overview of Android Malware Detection

    The correlative research on Android malware collected in this paper can provide valuable reference and broaden the research direction for future researchers. ... However, after a comprehensive research of Android malware detection, there are still some challenges in future research, for example, the vulnerability of Android detectors to ...

  2. Android Malware Detection Using Deep Learning

    Android malware detection approaches can be divided into static, dynamic, and hybrid analysis [9]. The Static analysis extracts the features from Android application without running them into a device or Android emulator. ... The difference between this paper and the rest of the research mentioned above lies in the fact that the data used are ...

  3. Android Malware Detection Using Machine Learning

    In this paper, we propose a new system using machine learning classifiers to detect Android malware, following a mechanism to classify each APK application as a malicious or a legitimate application. The system employs a feature set of 27 features from a newly released dataset (CICMalDroid2020) containing 18,998 instances of APKs to achieve the ...

  4. An in-depth review of machine learning based Android malware detection

    In Android malware detection, supervised learning is typically used to train classifier models that can determine whether an unknown application is benign or malware. In some cases, classification is also used to classify malware applications according to their malware families. ... Only one of the research papers reviewed, Yu et al. (2013 ...

  5. [2103.05292] Deep Learning for Android Malware Defenses: a Systematic

    In this paper, we conducted a systematic literature review to search and analyze how deep learning approaches have been applied in the context of malware defenses in the Android environment. As a result, a total of 132 studies covering the period 2014-2021 were identified. Our investigation reveals that, while the majority of these sources ...

  6. (PDF) Android Mobile Malware Detection Using Machine Learning: A

    eliminating the need for an explicit definition of the signatures when developing malware detectors. This paper provides a systematic review of ML-based Android malware detection techniques. It ...

  7. Android malware analysis and detection: A systematic review

    This paper presents a quick understanding and a holistic view of malware detection and analysis. The current investigation conducted a systematic literature review (SLR) to recognize the salient shifts in malware detection by examining a range of scholarly journals and conference papers.

  8. A Review of Android Malware Detection Approaches Based on Machine

    Android applications are developing rapidly across the mobile ecosystem, but Android malware is also emerging in an endless stream. Many researchers have studied the problem of Android malware detection and have put forward theories and methods from different perspectives. Existing research suggests that machine learning is an effective and promising way to detect Android malware ...

  9. Electronics

    This paper provides a systematic review of ML-based Android malware detection techniques. It critically evaluates 106 carefully selected articles and highlights their strengths and weaknesses as well as potential improvements. ... When searching for the papers we considered the research papers written in the English language. Because of this ...

  10. An Android Malware Detection Approach Based on Static ...

    Therefore, this paper will contribute to Android malware detection. Moreover, this paper defines and further illustrates the four classes of android malware within the addressed dataset and paves the way to achieve high detection rates by proposing a static analysis-based malware detection method with this recent dataset’s aid.

  11. Android malware analysis in a nutshell

    This paper offers a comprehensive analysis model for android malware. The model presents the essential factors affecting the analysis results of android malware that are vision-based. Current android malware analysis and solutions might consider one or some of these factors while building their malware predictive systems. However, this paper comprehensively highlights these factors and their ...

  12. Android Malware Detection

    AndrODet: An Adaptive Android Obfuscation Detector. omirzaei/androdet • Future Generation Computer Systems 2019. This is typically applied to protect intellectual property in benign apps, or to hinder the process of extracting actionable information in the case of malware.

  13. [PDF] Android Mobile Malware Detection Using Machine Learning: A

    This paper provides a systematic review of ML-based Android malware detection techniques and critically evaluates 106 carefully selected articles and highlights their strengths and weaknesses as well as potential improvements. With the increasing use of mobile devices, malware attacks are rising, especially on Android phones, which account for 72.2% of the total market share. Hackers try to ...

  14. (PDF) A Systematic Overview of Android Malware Detection

    With the introduction of the taxonomy of machine learning methods, the commonly used models in Android malware detection are distinguished as traditional machine learning, currently advanced ...

  15. An Analysis of Machine Learning-Based Android Malware Detection

    This paper provided a detailed examination of machine-learning-based Android malware detection approaches. According to present research, machine learning and genetic algorithms are a powerful and promising solution for identifying Android malware. In this quick study of Android apps, we go through the Android system architecture ...

  16. On building machine learning pipelines for Android malware detection: a

    As the smartphone market leader, Android has been a prominent target for malware attacks. The number of malicious applications (apps) identified for it has increased continually over the past decade, creating an immense challenge for all parties involved. For market holders and researchers, in particular, the large number of samples has made manual malware detection unfeasible, leading to an ...

  17. A Context-Aware Android Malware Detection Approach Using Machine Learning

    The Android platform has become the most popular smartphone operating system, which makes it a target for malicious mobile apps. This paper proposes a machine learning-based approach for Android malware detection based on application features. Unlike much prior research that focused exclusively on API calls and permissions features to improve detection efficiency and accuracy, this paper ...

  18. (PDF) A Comprehensive Study of Malware Detection in Android Operating

    To begin, several of the currently available Android malware detection approaches are carefully examined and classified based on their detection methodologies. This study examines a wide range of ...

  19. AppPoet: Large Language Model based Android malware detection via multi

    AppPoet: Large Language Model based Android malware detection via multi-view prompt engineering, by Wenxiang Zhao and 2 other authors. Abstract: Due to the vast array of Android applications, their multifarious functions and intricate behavioral semantics, attackers can adopt various tactics to conceal ...

  20. Android Malware Detection Based on Informative Syscall Subsequences

    The Android operating system commands a dominant market share of over 70% in the smartphone industry. However, this widespread usage has resulted in a concerning increase in malware applications. While existing static malware detection mechanisms are vulnerable to code obfuscation attacks, manipulating the runtime system call (syscall) sequence remains a significant challenge for attackers ...

  21. Android malware category detection using a novel feature vector-based

    Malware attacks on the Android platform are rapidly increasing due to the high consumer adoption of Android smartphones. Advanced technologies have motivated cyber-criminals to actively create and disseminate a wide range of malware on Android smartphones. The researchers have conducted numerous studies on the detection of Android malware, but the majority of the works are based on the ...

  22. Android Malware Category and Family Detection and Identification using

    The purpose of this paper is to shed light on prominent Android malware categories as well as related families within each malware category. Furthermore, it familiarizes the ... work on android malware detection and classification. The authors of [8] built three distinct types of datasets based on Machine Learning using a ...

  23. (PDF) Android malware detection: state of the art

    This paper presents a permission-based Android malware detection system, APK Auditor that uses static analysis to characterize and classify Android applications as benign or malicious.

  24. Android Malware Detection: A Literature Review

    The rest of the paper is organised as follows: Sect. 2 introduces Android malware detection, Sect. 3 reviews malware detection approaches and details the most commonly used analysis methods. Section 4 discusses findings and research directions in malware detection, and finally, Sect. 5 concludes the paper.

  25. An effective attention and residual network for malware detection

    The experimental results show that AMERNet is an effective malware detection method. Besides, the detection capability of our proposed framework outperforms relevant frontier research works. In the future, we will focus on combining more static features in the AndroidManifest.xml file to achieve more efficient Android malware detection.

  26. MalDMTP: A Multi-tier Pooling Method for Malware Detection ...

    With the development and adoption of cloud platforms in various fields, malware attacks have become a serious threat to the Internet cloud ecosystem. However, the pooling process of existing graph classification techniques for malware variant detection uses only a serial and single strategy, resulting in localized malicious behaviors of malware that may be overlooked. In this paper, we propose ...

  27. Android malware analysis in a nutshell

    Abstract. This paper offers a comprehensive analysis model for android malware. The model presents the essential factors affecting the analysis results of android malware that are vision-based. Current android malware analysis and solutions might consider one or some of these factors while building their malware predictive systems.

  28. [2404.18541] Machine Learning for Windows Malware Detection and

    Machine Learning for Windows Malware Detection and Classification: Methods, Challenges and Ongoing Research, by Daniel Gibert. Abstract: In this chapter, readers will explore how machine learning has been applied to build malware detection systems designed for the Windows operating system.