Big Data - Microsoft Malware Prediction

The goal of this project is to predict the probability that a Windows machine will be infected by various families of malware, based on properties of that machine. We used PySpark’s MLlib to train classification models such as logistic regression and random forest on machine configuration data, then evaluated them on a held-out test set.

Data: Kaggle (Microsoft Malware Prediction), 800K rows. We took a subset of around 40K rows and split it into train and test sets with a 70:30 ratio.
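
The 70:30 split can be illustrated with a minimal, standard-library sketch. This is not the project's code: the function name `train_test_split`, the seed, and the integer stand-in rows are assumptions made for illustration. In PySpark the comparable call is `DataFrame.randomSplit([0.7, 0.3], seed=...)`, which samples each row independently and therefore yields only approximately 70:30 sizes, whereas the sketch below cuts at an exact index.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows with a fixed seed and cut it at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the ~40K-row subset: one integer id per row.
rows = list(range(40_000))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the train and test sets identical across runs, which matters when comparing several models on the same split.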

Team: Pawanjeet Kaur, Shubhangi Srivastava

Results: Without any cross-validation, the best accuracy we achieved was 66.7%.

Future scope: Introduce threads to run training on the entire dataset provided by Microsoft, and use cross-validation.

Please feel free to drop a message if you are unable to access the link to the project code on Colab.

Big Data Project Report

Git Link

Malware has always been a major concern in the digital market because it can compromise systems that are otherwise considered secure. Once a system is infected, criminals can exploit it in numerous ways. Malware detection is hard, and it gets harder as new machines join the environment, machines go online and offline, and machines receive patches or new operating systems. With more than one billion enterprise and consumer customers, Microsoft takes this problem very seriously and is deeply invested in improving the security of its machines. As one initiative toward a solution, Microsoft challenged the worldwide data science community to develop techniques that, based on various features of a machine, accurately predict whether it will soon be hit by malware.

Challenges:
We chose this dataset because it is large and fits big data requirements well. Handling such a large dataset on a low-powered machine was itself a challenge. Because it is real-world data collected by Microsoft, it contains a large number of missing values, which makes it hard to explore the variables and measure their impact on the target variable. The dataset also mixes variable types, so we encoded the categorical features. Building a first draft ML model for this dataset was challenging yet exciting.
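
The two data issues described here, missing values and categorical features, can be sketched in plain Python. Everything in the block is an assumption for illustration: the column names (`platform`, `pua_mode`), the toy rows, and the 50% missing-value threshold are hypothetical and not taken from the project. The frequency-ordered indexing is similar in spirit to the default ordering of PySpark's `StringIndexer`.

```python
from collections import Counter


def missing_fraction(rows, column):
    """Share of rows where the column is None or an empty string."""
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    return missing / len(rows)


def drop_sparse_columns(rows, columns, max_missing=0.5):
    """Keep only columns whose missing share is at most max_missing."""
    return [c for c in columns if missing_fraction(rows, c) <= max_missing]


def index_categories(rows, column):
    """Map each distinct value to an integer, most frequent value first."""
    counts = Counter(r[column] for r in rows if r.get(column) not in (None, ""))
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


# Hypothetical toy rows; real column names and values will differ.
rows = [
    {"platform": "windows10", "pua_mode": None},
    {"platform": "windows10", "pua_mode": None},
    {"platform": "windows8", "pua_mode": "on"},
    {"platform": "", "pua_mode": None},
]
kept = drop_sparse_columns(rows, ["platform", "pua_mode"])
mapping = index_categories(rows, "platform")
print(kept, mapping)
```

Dropping very sparse columns before encoding avoids spending encoder categories on features that carry almost no signal; the right threshold depends on the data and would need to be tuned.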

Project Steps:
1. Data Exploration and Pre-Processing
2. Manual Feature Cleaning
3. Feature Selection using Random Forest
4. Model Development - Logistic Regression, Decision Tree, Random Forest
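
Step 3 can be sketched as ranking features by importance score and keeping the top few. The scores, feature names, and the choice of `k` below are hypothetical placeholders, not values from the project. In PySpark, a fitted random forest classification model exposes such scores through its `featureImportances` attribute; here they are simply given as a dictionary.

```python
def select_top_features(importances, k):
    """Return the k feature names with the highest importance scores."""
    # Sort by descending score; break ties by name so the result is deterministic.
    ranked = sorted(importances.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores for illustration only.
importances = {
    "av_products_installed": 0.31,
    "smart_screen": 0.27,
    "os_build": 0.18,
    "processor": 0.14,
    "census_ram": 0.10,
}
selected = select_top_features(importances, k=3)
print(selected)
```

A fixed `k` is only one option; a cumulative-importance cutoff is another, and which works better here is not something this sketch settles.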

Detailed Results:
Benchmark accuracy: 70%
Logistic Regression: 66.7%
Decision Tree: 51.8%
Random Forest: 65.8%
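
For reference, accuracy as reported above is the fraction of test rows whose predicted label matches the true label. A minimal sketch follows, using made-up labels and predictions rather than the project's outputs. In PySpark, `MulticlassClassificationEvaluator(metricName="accuracy")` computes the same quantity on a predictions DataFrame.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical toy data: 1 = infected, 0 = not infected.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(labels, predictions)
print(score)
```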

Complete step-by-step details are given in the attached report. Please feel free to contact me to learn more about the project.