Android-manifest extraction and labeling method for malware compilation and dataset creation
(Djarot Hindarto) ISSN: 2088-8708

ABSTRACT


INTRODUCTION
The growth of the smartphone market over the last two decades has made Android one of the most pervasive operating systems for smartphones, accounting for more than 80% of the global market. This popularity comes at a cost, as Android has become one of the primary targets of cyber crime. Cyber attacks are prevalent on most internet-connected systems, and various attack models are used. Online applications such as web applications that are not properly protected are prone to SQL injection, distributed denial of service, defacement, and many other potential dangers. Over the last 15 years smartphone users have benefited from ever faster mobile connections, and now nearly all smartphones are always connected to the internet. This is especially true of the dominant Android smartphones. It means that networked Android smartphones are exposed to an even wider range of cyber attacks or can be hijacked by fake Android application package kit (APK) files. Even worse, the majority of Android smartphone users tend to be unaware of the potentially catastrophic danger once their devices are infected. Combined with social engineering attacks, users' mobile bank accounts, emails, phone books and social media apps fall quickly into the hands of cyber predators ready to exploit the victims. Recent national news of looming attacks being handled by the national cyber security forces confirms these critical attacks [1]-[3], indicating the continuing need for well-planned efforts to perform penetration testing [4] and malware compilation. Cryptography can be applied to further protect data [5] in case of cyber attacks, and scam or fraudulent links for phishing can possibly be anticipated by federated learning [6].
August 2010 was the first time that malware was detected in the Android operating system. It did not take long thereafter for the number of malicious Android applications to reach more than a thousand APKs. In the third quarter of 2018, according to Google data, the total number of Android malware samples reached 3.2 million, a jump of 40% year on year [7], [8]. Data from several sources indicate that more than 1 billion Android devices are at risk due to malware. In addition, two out of five active mobile phones have security risks [9], and we believe that most Android users are not aware of these potential attacks, as the attacks are stealthy. This risk is exacerbated when users do not update the operating system and APKs to the latest version with the most current security update.
Users today often download applications from anywhere, which often causes problems. Installing an application and granting whatever it asks for, without making a conscious selection or knowing the purpose of the application, is what causes security problems on the smartphone, such as the infiltration of malicious scripts. Such a script, when triggered, can perform an action that violates security, for example initiating a stealthy transfer of private data from the smartphone to the attackers. Ultimately, control of the smartphone falls into the hands of attackers.
Application development is currently very fast, owing to frameworks that make it easy to create Android-based applications. Several tools and web-based platforms provide ways to quickly create Android APK files, such as MIT App Inventor [10], [11], Flutter, Appery.io, and many more. Others take ready-made applications, reverse engineer them, and add several functions to create new applications. This can be done in a short time. The speed of making Android applications is exploited by irresponsible parties for negative purposes, for example by adding scripts that enable storage access so that the party inserting the script can explore the user's smartphone storage. Applications that have been infiltrated by malicious scripts or programs are referred to as Android malicious software (malware). Survey data on Android malware reveal that the ease and speed of making Android applications, coupled with reverse engineering of APKs, are of great interest for security research. According to data for February 2021 released by the AV-TEST Institute, over 350,000 malware samples are registered every day, a sharp increase over the last 5 years. Attacks and threats on the internet have become a major topic, often discussed in campus and other forums. One of the currently trending threats is malware on Android-based smartphones. Research on malware uses classification algorithms to detect whether a file is benign or malware. Many anti-virus and anti-malware products sometimes cannot detect new malware variants, for several reasons. There are two types of analysis, namely static analysis and dynamic analysis, and both are used to classify Android malware. With the advent of artificial intelligence, malware research can take advantage of this technology, one application being malware classification using machine learning (ML). Making ML-based malware classification accurate is an ongoing effort.
Malware on smartphones today can be very annoying and disturbing for users [12]. The disturbance prevents the smartphone from running normally. This problem is often addressed by installing anti-malware or antivirus software for smartphones. But another problem arises, namely the use of anti-malware from untrusted sources, where the anti-malware itself has been infiltrated with malware. In addition, anti-malware does not always work as intended, because new variants keep appearing and many anti-malware products do not detect them. So anti-malware remains a problem. Anti-malware companies continue to innovate, yet they still cannot detect the latest malware variants that keep appearing.
Previous research has produced many works on malware analysis using static, dynamic, and hybrid methods, the latter combining static and dynamic analysis. One of them is forensic analysis of mobile devices using scoring of application permissions (FAMOUS) [13], which proposes a predictive approach to forensics for detecting suspicious Android APKs. The study aims to distinguish malware from benign Android APKs by weighting a prediction-based feature set using ML. Various experiments were carried out on the features of Android APK properties, with an accuracy rate of 99%. FAMOUS extracts Android APK files using only the permissions feature, classifying and analyzing each Android feature from the AndroidManifest.xml file.
The next work is a longitudinal performance analysis (LPA) of ML-based Android malware detectors [14]. The aim of the study was to examine the performance degradation over time of various ML classifiers trained with static features extracted from a collection of applications and date-labeled malware. It is a static analysis with quantitative methods, consisting of collecting a malware dataset, extracting application features, and performing ML classification for performance evaluation. The investigation is repeated by training on different time periods and on samples from the latest datasets. The method chosen is static, with datasets for 2013, 2014 and 2015-2016. The ML algorithms used are support vector machine (SVM), J48 decision tree (DT), naïve Bayes (NB), simple logistic (SL), and random forest (RF).
Another work is decompiled-APK-based malicious code classification [15]. The purpose of this study is to adapt an APK source code decompilation technique based on natural language processing to the classification of malware source code. Using static analysis with quantitative methods, it proceeds as follows using the Rocky framework: decompiling APK files into source code, preprocessing the source code, generalizing N-tokens, feature representation, and classification. The baseline algorithms are permission-based, API-call-based, and neural network (NN) based. The Android malware dataset (AMD) from Argus Lab contains 24,553 sample APKs, grouped into 135 types and 71 malware families, sampled from 2010 to 2017. The evaluation metrics are the confusion matrix, true positive rate (TPR), F1, accuracy, receiver operating characteristic (ROC), and area under the ROC curve (AUC) [16], [17]. Classification is done with NB, RF, and logistic regression, with 10 validation tests. The study reached 97% accuracy.
Next is APK Auditor, a permission-based Android malware detection system [18]. Using the permissions feature, the system performs a malware assessment with three main components: an Android client, a signature database, and a central server communicating with both. It presents a permission-based Android malware detection system [19]-[22] that uses static analysis to classify benign and malware Android applications. The experiment used 8,762 APKs, consisting of 1,853 benign APKs and 6,909 malware APKs. The resulting model accuracy is 88%.
As can be summarized from the studies reviewed above, there is not yet a standardized way to address the key problem of malware identification. The research questions focus on the current state of the art in data engineering for dataset creation and in the labeling approach that supports ML algorithms for Android malware studies. The ability of ML methods to detect, analyze and predict new variants of currently widespread malware looks promising. Dataset creation and detection algorithms are therefore the two key enablers for solving the problem, and both are logical to adopt as research questions (RQ), as follows: i) How does feature extraction of APK files produce the best malware dataset? (RQ1) and ii) How does feature selection from the dataset help optimize the resulting detection model? (RQ2). The scope of this research is as follows. The method used in analyzing malware is static. Public data are collected for the dataset, which serves as material for building the models to be tested. Data preparation, the datasets analyzed, which attributes are used, and which attributes are most influential in malware are considered very important for modeling and for the performance of ML algorithms. Dataset creation is a data engineering process that relies largely on the feature extraction and feature selection methods required before data-consuming ML algorithms can start.

THE PROPOSED METHOD (EXTRACTING APK ANDROID FILES)
Figure 1 gives the big picture of the system that extracts Android APK files into a malware dataset; the features contained in each APK file are then analyzed. The dataset is the result of reverse engineering using the Jadx module [23], [24], a reverse engineering tool. Reverse engineering is the process of converting APK files into source code form. This source code is then analyzed further to determine whether the file is malware or benign.

Download APK file
The proposed research framework is as follows. The dataset is collected in the form of application package kit (APK) files, comprising malware APKs and benign APKs. The dataset sources are the University of New Brunswick [25], VirusShare [26], and VirusTotal [27]. After the malware and benign APK files are collected, the data extraction process is carried out. APK files are decompiled to obtain the permission features and intent features, which are processed into a dataset. This process is called feature extraction. Feature selection is then performed to reduce the number of features, or dimensions, of the malware dataset. The permission and intent features are modeled by dividing the data into training data and test data.
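The last two steps of the framework, feature selection and the training/test split, can be illustrated with a small sketch. The paper does not state which feature selection method is used, so dropping constant columns is only an assumed, very simple stand-in; the function names, the 80/20 ratio, and the seed are likewise hypothetical.

```python
import random


def drop_constant_columns(rows, feature_names):
    """Remove features whose value is identical in every row (assumed selection rule)."""
    kept = [i for i in range(len(feature_names))
            if len({row[i] for row in rows}) > 1]
    reduced = [[row[i] for i in kept] for row in rows]
    return reduced, [feature_names[i] for i in kept]


def split_train_test(rows, labels, test_ratio=0.2, seed=0):
    """Shuffle row indices reproducibly and split them into training and test parts."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    n_test = int(round(len(rows) * test_ratio))
    test_idx, train_idx = order[:n_test], order[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test
```

A feature that is 0 (or 1) for every APK carries no information for a classifier, which is why it is the first candidate for removal in this illustration.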
The APK files used for training come from the Google Play store and the Canadian Institute for Cybersecurity at the University of New Brunswick (UNB), and consist of five classes: benign, ransomware, riskware, banking, and short message service (SMS) APKs. The downloaded data are stored in folders according to their class. To confirm whether each file is malware or not, it is checked through the virustotal.com website [27], which can detect the type of malware because it is supported by security companies such as Avast, Norton and others. For this experiment, around 14,170 APK files totaling 60 GB were downloaded from the sources above. Android APK files are placed in a folder according to their type; for example, benign files are placed in the benign folder, while banking files are placed in the banking folder. Figure 2 shows the overall process in pseudocode. The collected Android APK files are later extracted based on the type or family of malware. From the results of reverse engineering, the file that is processed is AndroidManifest.xml. The feature sets processed are the permission feature and the intent feature, both of which form the basis of the Android malware classification. It is hoped that the extraction process will produce the best malware dataset. After the malware and benign APKs are downloaded, the next step is to check them before carrying out the extraction process. The purpose of the check is to confirm that each malware file is indeed malware and not a benign APK, and likewise that each benign APK is indeed benign and not malware. The check also allows each file to be categorized so that its class is determined correctly for classification. The result is a dataset in which malware APKs and benign APKs are not mixed.
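The folder layout described above can be sketched in Python as follows. This is a minimal illustration and not a transcription of the pseudocode in Figure 2: the function name `store_apk`, the lowercase folder names, and the choice to copy rather than move files are assumptions made for the example.

```python
import shutil
from pathlib import Path

# The five classes named in the text; the lowercase folder names are an assumption.
CLASSES = ("benign", "ransomware", "riskware", "banking", "sms")


def store_apk(apk_path, label, dataset_root):
    """Copy one APK into the folder of its class and return the new path."""
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label}")
    target_dir = Path(dataset_root) / label
    target_dir.mkdir(parents=True, exist_ok=True)  # create the class folder on first use
    target = target_dir / Path(apk_path).name
    shutil.copy2(apk_path, target)
    return target
```

Keeping one folder per class means the folder name can later serve directly as the label of every APK extracted from it.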

Checking APK file
Virustotal.com [27] is a reference for checking files that contain malware. Not only APK files but any file can be checked on the website. In everyday usage, a harmful file is commonly called a virus, although a virus is only one kind of malicious software, or malware. There are many families of malware [28], [29], including viruses, trojans, ransomware, banking malware, SMS malware, and others. Some infect computers and others infect smartphones. Malware is made for different goals, depending on its maker. Some malware does not damage the operating system but steals data [30] or spies on smartphone users [31]. Some smartphone users do not realize that they are being spied on by malware that sends their data to the malware maker's servers [32]. This is very detrimental to smartphone users. Therefore, a downloaded Android APK file should not be installed directly but must first be checked on virustotal.com or another APK file checking website.
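Before a file is checked, it is convenient to identify it by a cryptographic hash, since VirusTotal reports can be searched by file hash. The sketch below only computes the SHA-256 digest of an APK with the Python standard library; it does not contact any service, and the helper name `sha256_of_file` is hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks matters here because the collected APKs total about 60 GB and individual files can be large.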

Extraction APK file
This stage reads the APK files in each folder and then reverse engineers them to read the AndroidManifest.xml file. For the permission feature, the reverse engineering process extracts the APK file and saves it to the unpacked-permissions folder. The AndroidManifest.xml file is then read, and an XML parser reads the values of "uses-permission", which are saved into the UpdatePermList.txt file. For the intent feature, the reverse engineering process extracts the APK file and saves it to the unpacked-intent folder. The AndroidManifest.xml file is then read, and an XML parser reads the intent feature values from its contents. The entries "application/activity/intent-filter/action", "application/receiver/intent-filter/action", and "application/activity/intent-filter/category" are assigned a value of 1 if present and 0 if not found. The results are saved into the UpdateIntentList.txt file. The APK file is the executable package format of the Android operating system. To inspect the APK file structure, the Jadx module or the Jadx graphical user interface (GUI) can be used.
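The manifest parsing step can be sketched with the standard `xml.etree.ElementTree` module. This is an illustration under assumptions, not the code of Figure 3: it assumes the manifest has already been decoded to plain XML by the reverse engineering step, and the names `parse_manifest`, `INTENT_PATHS`, and the sample manifest are hypothetical. The three element paths are the ones listed in the text.

```python
import xml.etree.ElementTree as ET

# Attributes such as android:name live in the Android XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Element paths named in the text, relative to the <manifest> root.
INTENT_PATHS = (
    "application/activity/intent-filter/action",
    "application/receiver/intent-filter/action",
    "application/activity/intent-filter/category",
)

SAMPLE = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.demo">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.READ_SMS"/>
  <application>
    <activity android:name=".Main">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
    <receiver android:name=".Boot">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>"""


def parse_manifest(xml_text):
    """Return (permissions, intents) found in a decoded AndroidManifest.xml string."""
    root = ET.fromstring(xml_text)
    permissions = sorted({
        el.get(ANDROID_NS + "name")
        for el in root.findall("uses-permission")
        if el.get(ANDROID_NS + "name")
    })
    intents = sorted({
        el.get(ANDROID_NS + "name")
        for path in INTENT_PATHS
        for el in root.findall(path)
        if el.get(ANDROID_NS + "name")
    })
    return permissions, intents
```

Calling `parse_manifest(SAMPLE)` yields the two SMS permissions and the three intent names declared in the sample; in the workflow described above, such lists would then be written to UpdatePermList.txt and UpdateIntentList.txt.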
Permission feature [33]: the APK file [34] is reverse engineered to produce the AndroidManifest.xml file. The XML parser then checks for 'uses-permission' entries; if a permission is present it is given a value of 1, and if it is absent it is given a value of 0. Intent feature: the APK file is reverse engineered to produce the AndroidManifest.xml file. The parser then checks "activity/intent-filter/action", "receiver/intent-filter/action", and "activity/intent-filter/category". If a checked entry is present it is given a value of 1, and if it is absent it is given a value of 0. The intent feature is one of the most basic features in Android, used to process data from other components [35]. Reverse engineering of Android APK files [36] uses the Jadx module. This Jadx (APK-tools) module can extract Android APK files by creating folders and extracting them into source code, resources and assets. The resources folder contains the AndroidManifest.xml file. Figure 3 shows the pseudocode for performing the APK feature extraction.
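The 1/0 labeling rule can be illustrated as follows. The sketch assumes that the permission or intent names of each APK have already been collected into lists; the function names, the use of a sorted vocabulary, and the `CLASS` column are hypothetical, while the `NAME` column follows Table 1.

```python
def encode_features(found, vocabulary):
    """Return 1 for each vocabulary entry present in `found`, otherwise 0."""
    present = set(found)
    return [1 if name in present else 0 for name in vocabulary]


def build_rows(apk_features, label_by_apk):
    """Build a header and one binary row per APK from {apk_name: [feature, ...]}."""
    vocabulary = sorted({f for feats in apk_features.values() for f in feats})
    header = ["NAME"] + vocabulary + ["CLASS"]
    rows = [
        [name] + encode_features(feats, vocabulary) + [label_by_apk[name]]
        for name, feats in sorted(apk_features.items())
    ]
    return header, rows
```

Building the vocabulary from all APKs first guarantees that every row has the same columns in the same order, which is what allows the rows to be stored in a single CSV file.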

RESULTS AND DISCUSSION
The Android APK extraction experiment was conducted on a 2020 Macintosh notebook with 8 GB of RAM and a 256 GB hard drive, using the Python programming language with the NumPy and pandas libraries and the xml module. NumPy and pandas are the main library packages for numerical computing and data science. Extracting the Android APK files took 7 days of non-stop processing. The process covered 14,170 Android APKs, both malware and benign, and 1,179 features. It generates two datasets, namely datapermission.csv and dataintent.csv. The two dataset files are merged, resulting in datamalware.csv.
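The merge of the two feature files can be illustrated with a small sketch. It assumes that both files share a NAME column identifying the APK (as in Table 1) and, to stay self-contained, uses only the standard `csv` module on in-memory text; the function name and the sample column names are hypothetical, and the original experiment used pandas rather than this code.

```python
import csv
import io


def merge_on_name(perm_csv, intent_csv):
    """Inner-join two CSV texts on the NAME column and return a list of merged rows."""
    perm_rows = list(csv.DictReader(io.StringIO(perm_csv)))
    intent_by_name = {
        row["NAME"]: row for row in csv.DictReader(io.StringIO(intent_csv))
    }
    merged = []
    for row in perm_rows:
        other = intent_by_name.get(row["NAME"])
        if other is None:
            continue  # keep only APKs present in both files
        combined = dict(row)
        combined.update({k: v for k, v in other.items() if k != "NAME"})
        merged.append(combined)
    return merged
```

An inner join is chosen here so that every merged row has both permission and intent columns; APKs missing from either file would need to be handled separately.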

Extract APK
Table 1 is the result of the extraction of the Android APK files. The process, described above, involves reverse engineering using the JADX module [37] and parsing each Android APK file. In Table 1, the permission features that appear most often are SEND_SMS (Developer), ACCESS_COARSE_LOCATION (Developer), SYSTEM_ALERT_WINDOW, READ_PHONE_STATE, RECEIVE_SMS, RECEIVE_BOOT_COMPLETED, GET_TASKS, READ_SMS, ACCESS_WIFI_STATE, and WRITE_EXTERNAL_STORAGE. The following is a description of the features.
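How often each permission appears can be counted directly from the binary dataset. The sketch below is illustrative only: it assumes each APK is represented as a mapping from feature name to 0 or 1, and the function name `permission_frequency` is hypothetical.

```python
from collections import Counter


def permission_frequency(rows):
    """Count, per feature, how many APK rows have the value 1, most frequent first."""
    counts = Counter()
    for features in rows:
        counts.update(name for name, value in features.items() if value == 1)
    return counts.most_common()
```

Features that never take the value 1 do not appear in the result at all, which also makes this a quick way to spot unused columns.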

Dataset
The result of this research is a dataset that can contribute to distinguishing malware from non-malware, so that other researchers can directly use the dataset extracted from various original malware samples on the internet. The extraction process was carried out according to the algorithm described in Figure 3. Table 1 shows a sample of the resulting malware dataset. The final result is obtained from the Android file (.apk) by reverse engineering it, extracting the Android manifest file, and selecting the permission keywords and intent keywords in the manifest. The result of this process is a dataset file for Android malware classification or detection. Table 1 explains the features in the resulting dataset. The leftmost column, NAME, is the name of the Android APK file. The android_permission_ACCESS_All column represents a request to access all functions on a smartphone device. If an Android APK application executes certain instructions and requests access to all functions on a smartphone device, this indicates an irregularity. A column or feature with a value of 1 indicates that the application requests that permission to access a smartphone device function; a value of 0 indicates that it does not.

Discussion
APK files are the package files used to distribute and install apps on Android devices. To extract the source code from an APK file, a tool called Jadx can be used. Jadx is a reverse engineering tool that decompiles an APK file into Java source code. It converts DEX bytecode to Java source code and also provides a graphical interface for viewing the code. To use Jadx, download the tool from its official project page and open the APK file with the Jadx GUI. Once the file is open, the package hierarchy can be navigated to view the source code of each package, class, and method. The decompiled source code can also be saved to disk.
Using the Jadx module to create a dataset involves decompiling multiple APK files and then organizing the resulting source code into a structured dataset. This dataset can then be used for various purposes such as code analysis and malware detection. One way to create such a dataset is to first gather the set of APK files to be decompiled. These files can be obtained from sources such as the Google Play Store or GitHub. Jadx is then used to decompile each file and extract the source code. Next, the extracted source code is organized into a structured dataset, for example by creating a new directory for each APK file and placing the decompiled source code in the corresponding directory. Alternatively, the information can be organized into a spreadsheet (CSV file format), with each row representing an APK file and its corresponding data.
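Decompiling many APK files into one directory per file can be automated with the Jadx command-line tool. The sketch below assumes that the `jadx` executable is installed and available on the PATH, and it uses the `-d` option to set the output directory; the helper names and folder names are hypothetical, and error handling is kept deliberately simple.

```python
import shutil
import subprocess
from pathlib import Path


def jadx_command(apk_path, out_root):
    """Build the jadx command that decompiles one APK into its own output folder."""
    apk_path = Path(apk_path)
    out_dir = Path(out_root) / apk_path.stem  # one directory per APK file
    return ["jadx", "-d", str(out_dir), str(apk_path)]


def decompile_folder(apk_dir, out_root):
    """Run jadx on every .apk file in a folder; requires jadx on the PATH."""
    if shutil.which("jadx") is None:
        raise RuntimeError("jadx not found on PATH")
    for apk in sorted(Path(apk_dir).glob("*.apk")):
        # check=False: jadx may return a non-zero code when some classes fail to
        # decompile, while the manifest can still have been written.
        subprocess.run(jadx_command(apk, out_root), check=False)
```

For example, `decompile_folder("apks/banking", "unpacked/banking")` would leave one folder per banking APK, each containing the decoded resources with the AndroidManifest.xml file.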
The methodological approach of this research is reverse engineering and extraction of the AndroidManifest file, using the JADX module to perform the reverse engineering. The Android file (for example, file-android.apk) is reversed into source code files collected in a folder. This collection of source code files contains the AndroidManifest file, from which the permission keywords and intent keywords are selected. If a permission keyword is present, it is written into the dataset with a value of 1; if it is absent, the corresponding permission feature in the dataset is 0. The algorithm is explained in the APK file extraction section, complete with the pseudocode Feature_Extraction.
The system design is as follows. So far, the difficulty in detecting malware APKs with a static method has been obtaining the malware dataset. The first difficulty is obtaining Android malware in the form of APK files. Original malware files can be downloaded from the University of New Brunswick [25], VirusShare [26], and VirusTotal [27]. After the Android malware is obtained, the second step is to reverse engineer it using the JADX module. The third step is to read the Android manifest file from the reverse engineering results and select the permission and intent keywords. The fourth step saves the results of the third step into the dataset.csv file. By carrying out these four steps, a dataset is obtained. The design is illustrated in Figure 1, steps in extracting APK Android files.
The performance of the dataset is evaluated in a simulation using grid search cross validation with 5 folds and the multi-layer perceptron (MLP) classifier. The GridSearchCV code is given in Figure 4, and its results are shown in Figure 5. Figure 5 shows the simulation results on the malware.csv dataset using an artificial neural network (ANN) MLP. The dataset resulting from the reverse engineering process and the extraction of the AndroidManifest file yields a model with better accuracy, reaching 100%. The dataset contains 14,170 APK samples. Simulations are also carried out using the DT, SVM and KNN algorithms, so that the results of these ML algorithms can be compared with those of NNs such as the ANN.
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    # Assumed: mlpc is an MLPClassifier instance (its definition is not shown in the source).
    mlpc = MLPClassifier()
    # Hyperparameter grid searched with 5-fold cross validation.
    mlpc_params = {"alpha": [0.1, 0.01, 0.0001],
                   "hidden_layer_sizes": [(10, 10, 10), (100, 100, 100), (100, 100)],
                   "solver": ["lbfgs", "adam", "sgd"],
                   "activation": ["relu", "logistic"]}
    mlpc_cv_model = GridSearchCV(mlpc, mlpc_params, cv=5, n_jobs=-1, verbose=2)
    mlpc_cv_model.fit(X_train, y_train)
    # Best model found by the search, fitted again on the training data as in the source.
    mlpc_tuned = mlpc_cv_model.best_estimator_
    mlpc_tuned.fit(X_train, y_train)

For the value of K-fold F1_weighted as follows: from sklearn.
A comparison of simulations using several algorithms, namely DT, SVM, KNN+PCA, and ANN+GridSearch+MLP, shows that the DT algorithm produces a very good precision, recall, and F1-score of 100% when using a relatively small dataset of 600 samples. With the 7,000-sample and 14,170-sample datasets, precision, recall and F1-score decrease. The same happens with SVM and KNN+PCA. DT, SVM and KNN are classical ML algorithms. In this simulation, the performance of these ML algorithms decreases as the dataset size increases. The ANN behaves differently: as the dataset grows, precision, recall and F1-score increase, with the average reaching 100%. The contribution of this study is an alternative dataset that can be used for further research. Future work is to use the dataset extracted from the APK files with ML and deep learning (DL) algorithms. The reason for using an ANN with GridSearchCV and MLP is that the dataset is large. ML methods such as SVM and DT do not reach the maximum precision, recall and F1-score. The dataset is simulated using an ANN with GridSearchCV and MLP, resulting in 100% performance.
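The weighted F1 score averages the per-class F1 scores, weighting each class by its number of true samples. The following pure-Python sketch shows that computation for illustration only; it is not the scikit-learn code used in the experiment, and the function name `weighted_f1` is hypothetical.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of true samples."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        support = tp + fn                      # number of true samples of this class
        score += (support / total) * f1
    return score
```

With an imbalanced dataset such as five classes of very different sizes, this weighting keeps a small class from dominating or vanishing in the reported score; in a 5-fold setting the function would be applied to the predictions of each fold and the five values averaged.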

CONCLUSION
The authors purposefully focus on the process of building quality malware datasets, which is seen as the most demanding part of the approach and implementation, rather than on machine learning itself, because implementing machine learning requires a further effort that is only feasible after a reliable dataset is fully built. The overall steps in creating the malware dataset have been described systematically, starting with collection and reverse engineering, followed by extraction of the Android manifest from the APK file set, and ending with the labeling method for all the extracted APK files. The core contribution of this paper is how to generate datasets systematically from any APK file. The findings of this study are useful for researchers working in the various fields of ML. The constructed dataset can be used directly for various purposes, especially supervised classification and malware identification.
ISSN: 2088-8708, Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 6568-6577

Figure 1. Steps in extracting APK Android files

Figure 2. Pseudocode for creating folders and storing APK malware

Figure 3. Pseudocode for performing the APK feature extraction

Table 1. The result of the dataset creation process