Core Concepts
High detection accuracies can be achieved using static analysis alone, with API calls and opcodes being the most productive features.
Abstract
The paper investigates how feature and model choices affect ML models trained for Android malware detection. It re-evaluates prior work on a large dataset and identifies the most effective features and models. The study shows that high detection accuracies can be achieved using static analysis alone, with API calls and opcodes being the most productive features. Random forests are found to be generally the most effective model. Ensembling separately trained models yields performance comparable to the best models while relying on less brittle features.
INTRODUCTION
Android is a common target for malware due to its popularity.
Machine learning models can effectively discriminate malware from benign applications.
Previous studies often report high accuracies using small, outdated datasets.
METHODOLOGY
The authors collected a balanced, up-to-date dataset of Android applications.
Static and dynamic analysis tools were used to extract features.
Evaluation metrics included the confusion matrix, accuracy, precision, F1-score, true positive rate (TPR), and true negative rate (TNR).
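The metrics listed above all follow from the four cells of the confusion matrix. The sketch below shows one way to compute them, assuming label 1 means malware and label 0 means benign; the names ConfusionMatrix, confusion_matrix, and compute_metrics are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: int  # malware correctly flagged as malware
    fp: int  # benign apps wrongly flagged as malware
    tn: int  # benign apps correctly classified as benign
    fn: int  # malware wrongly classified as benign


def confusion_matrix(y_true, y_pred):
    """Count the four outcomes. Assumption: 1 = malware, 0 = benign."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionMatrix(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    # Return 0.0 instead of raising when a class is absent.
    return numerator / denominator if denominator else 0.0


def compute_metrics(cm):
    """Derive the summary metrics from a confusion matrix."""
    precision = _ratio(cm.tp, cm.tp + cm.fp)
    tpr = _ratio(cm.tp, cm.tp + cm.fn)  # true positive rate (recall)
    tnr = _ratio(cm.tn, cm.tn + cm.fp)  # true negative rate
    accuracy = _ratio(cm.tp + cm.tn, cm.tp + cm.fp + cm.tn + cm.fn)
    f1 = _ratio(2 * precision * tpr, precision + tpr)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
        "tpr": tpr,
        "tnr": tnr,
    }


# Toy labels, made up for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
results = compute_metrics(cm)
```

On a balanced dataset such as the one described here, accuracy is a reasonable headline figure, but TPR and TNR still show whether errors are concentrated in missed malware or in false alarms.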
STATIC ANALYSIS
Permissions and API calls are essential for building Android malware detection models.
Reimplementing past studies shows that API calls are more effective features than permissions.
Feature selection algorithms play a crucial role in reducing the number of permissions for better performance.
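To make the idea of reducing the permission set concrete, the following sketch ranks permissions by information gain and keeps the top k. Information gain is used here only as an example scoring function; this summary does not say which feature selection algorithms the paper evaluated. The function names and the four toy apps are assumptions made for illustration.

```python
import math


def entropy(labels):
    """Shannon entropy, in bits, of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(column, labels):
    """Drop in label entropy after splitting on a binary feature."""
    present = [y for x, y in zip(column, labels) if x == 1]
    absent = [y for x, y in zip(column, labels) if x == 0]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_top_k(apps, labels, k):
    """Score every permission and return the k highest-scoring ones."""
    permissions = sorted({perm for app in apps for perm in app})
    scores = {}
    for perm in permissions:
        column = [1 if perm in app else 0 for app in apps]
        scores[perm] = information_gain(column, labels)
    ranked = sorted(permissions, key=lambda perm: scores[perm], reverse=True)
    return ranked[:k], scores


# Toy data, made up for illustration: each app is the set of permissions
# it requests; label 1 = malware, 0 = benign.
apps = [
    {"INTERNET", "SEND_SMS"},
    {"INTERNET", "SEND_SMS", "READ_CONTACTS"},
    {"INTERNET", "CAMERA"},
    {"INTERNET"},
]
labels = [1, 1, 0, 0]
selected, scores = select_top_k(apps, labels, k=1)
```

In this toy set, INTERNET is requested by every app and so carries no information about the label, while SEND_SMS separates the two classes perfectly. Real datasets are far less clean, which is why the choice of selection algorithm matters.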
REPRESENTATIONS OF API CALLS
Different ways of representing API calls, such as API usage, frequency, and sequences, were explored.
The API frequency dataset showed promising results with deep neural network models.
Model-based feature selection did not lead to improved classification performance.
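The three representations named above differ in how much information they retain about each call. A minimal sketch, assuming each app is available as an ordered list of API call names (the extraction step, the helper names, and the example traces are all hypothetical and not taken from the paper):

```python
def build_vocabulary(traces):
    """Assign every distinct API call a fixed column index."""
    distinct = sorted({api for trace in traces for api in trace})
    return {api: index for index, api in enumerate(distinct)}


def usage_vector(trace, vocab):
    """API usage: 1 if the app calls the API at least once, else 0."""
    vector = [0] * len(vocab)
    for api in trace:
        vector[vocab[api]] = 1
    return vector


def frequency_vector(trace, vocab):
    """API frequency: how many times the app calls each API."""
    vector = [0] * len(vocab)
    for api in trace:
        vector[vocab[api]] += 1
    return vector


def sequence_encoding(trace, vocab, max_len):
    """API sequence: call order kept as integer ids.

    Ids start at 1 so that 0 can serve as padding; traces longer than
    max_len are truncated.
    """
    ids = [vocab[api] + 1 for api in trace][:max_len]
    return ids + [0] * (max_len - len(ids))


# Illustrative traces only; the API names are examples.
traces = [
    ["getDeviceId", "sendTextMessage", "getDeviceId", "openConnection"],
    ["openConnection"],
]
vocab = build_vocabulary(traces)
usage = [usage_vector(t, vocab) for t in traces]
frequency = [frequency_vector(t, vocab) for t in traces]
sequences = [sequence_encoding(t, vocab, max_len=6) for t in traces]
```

Usage and frequency vectors have a fixed width equal to the vocabulary size and suit models such as random forests, whereas the sequence encoding preserves order and is typically fed to sequence-aware neural networks.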
Quotes
"High detection accuracies can be achieved using features extracted through static analysis alone."
"API calls and opcodes are the most productive static features."
"Random forests are generally the most effective model."