Data Science Project – MS Malware Classification

1.- Introduction

Short for “malicious software,” malware refers to software programs designed to damage or do other unwanted actions on a computer system. (

Malware has quickly become one of the major security threats of today, facing the Internet and electronic devices in general.

Traditional methods, i.e. static analysis, based on software signatures through hash functions applied to the binary files and stored in a knowledge database, do not suffice to detect new malware; only malwares already introduced and registered can be detected.

Over the years, malware has been evolving, becoming more intricate by mutating into complex forms and using new ways to target computers and other electronic devices.

Since the modern malware coding uses sophisticated techniques such as polymorphism and obfuscation allowing the malware programs change their footprint or signature as they propagate (on each infection and copy), the dynamic analysis approach becomes essential in the malware detection.


The Kaggle Challenge

In February 2015, Microsoft announced the following challenge to the Kaggle Data Science community:

Microsoft has provided a total of 500 GB data of known malware files representing a mix of 9 families in 2 datasets: train and test; 10868 malwares in train and 10783 in test set.

Each sample is a binary file with the extension “.bytes” and their disassembled file with the extension “.asm”, in the assembly language (text).

The challenge is to train our model in a way to effectively classify the test dataset file into one of 9 categories (malware families) and minimize the logloss function:

image00where N is the number of files in the test set, M is the number of labels, log is the natural logarithm, yij is 1 if observation i is in class j and 0 otherwise, and pij is the predicted probability that observation i belongs to class j.

Logloss vs accuracy

In other words, this function is the mean of summatory of the natural logarithm probabilities (changing sign) assigned to TRUE label.

This function, by containing a natural logarithm, penalizes errors nonlinearly/disproportionately.

For example, the minimum value of the 10e-15 probability (the probabilities of each sample are normalized between 10e-15 and (1-10e-15)), penalizes with -ln(10e-15) = 32.5387764. If we add 0.001 epsilon, -ln(0.001) = 6.997.

When we have the predicted probability matrix of the model we can aggregate the epsilon value to minimize logloss. With the same accuracy we can optimize the function (picture)


In this example we can observe:

Logloss without adding epsilon = .10531896908357884

The epsilon that minimizes the function is 0.000121, and the obtained value is .082759119781732157

The strategy we have followed is, given the probability matrix, to seek the epsilon that minimizes the logloss. We added the resulting epsilon to the probability matrix before uploading it on Kaggle.

2.- Approach

In order to understand the problem domain, we conducted a general research into the malware behaviour as well as into the manner in which the malwares spread:


The 9 categories include a wide range of malware types from Virus, Trojan horse, Backdoor, Botnet, Spyware, Adware to anything else (with obfuscating techniques).

However, the results of this research have shown that these 9 categories are not mutually exclusive, meaning that a particular malware may reveal the characteristics of multiple categories at the same time.

The heuristic approach is discarded, let’s delve into the data.



To further familiarise with the subject and explore our knowledge, we have studied several papers about different approaches to the problem.

With this awareness we conclude that first of all we should define our knowledge base in the feature extraction process for each sample.

In our pipeline model, we take the following steps: collect, clean, tokenizing, and EDA (Exploratory Data Analysis).

At first, we inspect our train dataset’s structure to see how many samples there are for each category, and hence how balanced our train dataset is.


Naive Feature Extraction – 2 features : FileSize Parameter (.asm vs. .bytes sample size, without outliers)

image05 image02

With these 2 features (asmFileSize and bytesFileSize) per sample, we could already train our system. According to the nature of the data in our possession: multiclass, unbalanced train dataset, almost 10.000 samples in the train dataset etc. we ought to select the adequate machine learning algorithm:

First results with Naive Feature Extraction

  • AdaBoostFIRST_Adaboost
  • Random Forest


  • Gradient Boosting

FIRSTGradient_Boosting_FS_Train_100_estimators_001_learning_rateGradient boosting: Good results with only 2 features

Features extractions: Let’s have another look into the data and their structure (.asm samples) in order to discover new features:


b.   Cleaning

c.   Process the features

  1.    File size of assembler files: 1 feature
  2.    File size of binary files: 1 feature
  3.    Section code size of assembler files: XX features
  4.    1-grams of ADN families of instructions: 11 features
  5.    Bigrams of ADN families of instructions: 121 features
  6.    Bag of words of binary files: 121 featurees

d.   EDA to spot trends and outliers

e.   Machine Learning Classification

3   The Model: Cross-validation (80% – 20%) of  AdaBoost, Random Forest, Gradient Boosting for hybrid feature datasetsCaptura_PPT 

Best Kaggle Challenge Results

  • AdaBoostimage07

max of test accuracy: 0.852805

max of train accuracy: 0.866079

  • Random ForestBEST_RF

max of train accuracy: 1.0

max of test accuracy: 0.9874272

  • Gradient Boosting


max of train accuracy: 1.0

max of test accuracy: 0.9862005

4  Kaggle Challenge Results


Our team “DataHeroes” reached the 149th position out of total 377 teams.


Winning score 0,002833228
AVG score from total 0,655699940
AVG score excluding the scores corresponding to the Equal probability benchmark 2.197224577 (the result of submitting 1/9 for every prediction) and all worse than that scores 0,238102515
DataHeroes score 0,047806294


Total entries 7669
Teams   377
AVG entry from total     20
AVG entry per team from the higher ranked teams (1-148 positions)     37
DataHeroes entries     24


DataHeroes score is to a great extent closer to the winning score (+0.044973066) than to the average (-0.607893646 than total AVG and -0.190296221 than ‘clean’ AVG).

If the number of entries was taken into the account — DataHeroes submitted 65% of the higher-ranked-teams average number of entries (24 vs. 37 respectively) — DataHeroes would likely gain a higher position in the final team classification.


c.   Winner

Approach based on intensive Feature Engineering (4-gram byte, Single byte frequency, Instruction count, Function name, Derived Assembly feature and ‘3 golden features’: Opcode count: 2-gram, 3-gram, 4-gram, Header count, Asm pixel intensity), Gradient Boosting (Xgboost), Ensembling, Semi-supervised learning and Calibration.

Winner score reached 0,002833228 with 268 entries.

DataHeroes score reached 0,047806294 with 24 entries.

5   Conclusions and lessons learned

a. The benefits of Kaggle to learn about data science challenge:

  1.    The question to answer is clear
  2.    The raw data set is fixed
  3.    Focus is in pipeline of visualization – clean – feature extraction – model – evaluate model
  4.    Forum discussions were helpful, we learned from the others’ discoveries and experiences

b. We have obtain our first pipeline machine learning framework

6   Font Code: