Data Science Projects

LIST OF ACADEMIC AND PERSONAL PROJECTS IN THE FIELD OF DATA

Alaska Airline Analysis with Tableau Visualization

The airline analysis consisted of the following steps:

Data Curation and Data Cleaning:
The data processing and wrangling steps are listed below:
Collected and concatenated the monthly BTS (Bureau of Transportation Statistics) datasets for the year 2018.
Conditioned the BTS datasets for use in the project.
Explored Alaska Airlines data for meaningful patterns and trends.

Analysis:
Generated performance statistics (arrival/departure delay, on-time, cancellation) and created intuitive dashboards to provide better visualization of airline traffic performance.
Analyzed the data across multiple dimensions (time, region, airlines), making performance comparable across those dimensions.
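
As a rough illustration of the performance statistics mentioned above, here is a minimal Python sketch. The record fields (carrier, month, arr_delay, cancelled) and the sample values are hypothetical and are not the actual BTS column names or data. The 15-minute on-time cutoff is an assumption about how the dashboards defined "on time".

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flight records; field names and values are illustrative only.
flights = [
    {"carrier": "AS", "month": 1, "arr_delay": -5.0, "cancelled": False},
    {"carrier": "AS", "month": 1, "arr_delay": 22.0, "cancelled": False},
    {"carrier": "AS", "month": 2, "arr_delay": None, "cancelled": True},
    {"carrier": "AS", "month": 2, "arr_delay": 10.0, "cancelled": False},
]

# Assumed cutoff: an arrival less than 15 minutes late counts as on time.
ON_TIME_THRESHOLD_MIN = 15


def performance_stats(records):
    """Return mean arrival delay, on-time rate and cancellation rate."""
    flown = [r for r in records if not r["cancelled"]]
    delays = [r["arr_delay"] for r in flown]
    on_time = [d for d in delays if d < ON_TIME_THRESHOLD_MIN]
    return {
        "mean_arr_delay": mean(delays) if delays else None,
        "on_time_rate": len(on_time) / len(flown) if flown else None,
        "cancellation_rate": (len(records) - len(flown)) / len(records) if records else None,
    }


def stats_by(records, key):
    """Group records by one dimension (e.g. month) and compute stats per group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {value: performance_stats(group) for value, group in groups.items()}


overall = performance_stats(flights)
monthly = stats_by(flights, "month")
```

Grouping by a different key (for example "carrier") gives the same statistics along another dimension, which is the kind of comparison the dashboards were built for.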

Please visit my Tableau Public profile to get insight into the analysis.

Big Data - Microsoft Malware Prediction

The goal of this project is to predict the probability that a Windows machine gets infected by various families of malware, based on different properties of that machine. We trained several machine learning models using PySpark’s MLlib and evaluated their predictions on our test dataset.
We used PySpark to train classification models such as logistic regression and random forest, and to predict the chance of a machine with a given configuration being infected by malware.

Data: Kaggle (Microsoft Malware Prediction), 800K rows. We took a subset of around 40K rows and split it into training and test sets in a 70:30 ratio.
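
A 70:30 split of roughly 40K rows can be sketched in plain Python as below. This is only an illustration of the idea, not the project code: the function name, the seed, and the use of integers as stand-in rows are all assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the ~40K-row subset; real rows would be feature records.
rows = list(range(40_000))
train, test = train_test_split(rows)
```

In PySpark itself, a comparable split is usually done with DataFrame.randomSplit([0.7, 0.3], seed=...), which produces approximately (not exactly) 70:30 partitions.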

Team: Pawanjeet Kaur, Shubhangi Srivastava

Results: Without any cross-validation, the best accuracy we achieved was 66.7 percent.

Future scope: Introduce multithreading to train on the entire dataset provided by Microsoft, and use cross-validation.

Please feel free to send me a message if you are unable to access the link to the project code on Colab.

Data Mining - VMware Data Analysis

VMware Case Study:
• Worked on the VMware dataset to forecast customer behavior by analyzing customers' responses to different actions.
• Explored the dataset and performed data cleaning, feature selection (~600 predictors), and class balancing using SMOTE, upsampling, etc.
• Developed supervised learning models; the best-performing model achieved a sensitivity of 80%.
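
Sensitivity (also called recall or true positive rate) is the share of actual positives that the model catches: TP / (TP + FN). The sketch below shows the calculation on made-up labels; the function name and data are illustrative and do not come from the VMware dataset.

```python
def sensitivity(y_true, y_pred, positive=1):
    """Return TP / (TP + FN) for the given positive class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("y_true contains no positive examples")
    return tp / (tp + fn)


# Made-up labels: 5 actual positives, 4 of which are predicted correctly.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

score = sensitivity(y_true, y_pred)  # 4 / 5 = 0.8
```

On an imbalanced dataset this metric is often more informative than plain accuracy, which is one reason class balancing and sensitivity tend to be reported together.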

Deep Learning - Text Summarization

The goal of the project is to summarize a document as accurately as possible. This NLP task has no single metric that describes the accuracy of the model, so metrics such as BLEU and ROUGE were used as indicators.
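
To give a feel for what a BLEU-style score measures, here is a simplified sentence-level sketch: clipped n-gram precision up to bigrams, combined by geometric mean and multiplied by a brevity penalty. It is an illustration only, with a single reference and no smoothing, and is not the scoring code used in the project. The example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU with one reference and no smoothing."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)


reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
score = simple_bleu(candidate, reference)
```

Here 5 of 6 unigrams and 3 of 5 bigrams match, so the score is sqrt(5/6 * 3/5), roughly 0.71. Library implementations typically use up to 4-grams and add smoothing, so their numbers are not directly comparable to this sketch.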

Team: Pawanjeet Kaur, Shivam Duseja
Dataset: WikiHowAll.csv
Models: LSTM, LSTM with attention, GRU, GRU with attention, Stacked LSTM
Average time for one model to run on Colab Pro with cross-validation: 4-5 hours

Best BLEU score: 0.365

https://towardsdatascience.com/text-summarization-using-deep-neural-networks-e7ee7521d804#a87e-1aa55d239c9e

Machine Learning - MNIST digit prediction

MNIST Neural Network Digit Recognizer: Built a logistic classifier on the MNIST digit image dataset, then built a neural network to classify the images.

Libraries: PyTorch, Keras
Dataset: 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9
Results:
Logistic: Misclassified samples: 947, Accuracy: 0.84
Neural Network: model with 2 hidden layers of size 100 and 25 (Baseline: 96.53% (0.37%))
Network architecture, layer sizes [784 -> 512 -> 256 -> 128 -> 64 -> 10]: accuracy 0.9743 (97%)
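
For a sense of the size of the deeper network, the sketch below counts trainable parameters for the layer sizes listed above. It assumes plain fully connected layers with a bias per output unit, which is an assumption about the architecture rather than something taken from the project notebook.

```python
# Layer sizes from the results above: 784 inputs (28 x 28 pixels) down to 10 digit classes.
layer_sizes = [784, 512, 256, 128, 64, 10]


def count_parameters(sizes):
    """Weights plus biases for a stack of fully connected layers (assumed)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


total_params = count_parameters(layer_sizes)
print(total_params)  # 575050
```

Most of the parameters (about 400K) sit in the first layer, because it connects all 784 pixels to 512 units.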

Click Read More for full details of the project; the code can be found at the GitHub link.

If you are unable to access the GitHub notebook, here is a link to my Colab (GitHub has had this bug for quite a while):
https://drive.google.com/file/d/1Q13AdZDRN3XJl4_T3ssiuiJbYkEtc6l5/view?usp=sharing

Social Media Analytics - Air Transport Network Analysis

Team: Pawanjeet Kaur, Shivam Duseja, Taniya Rajani

Tools used: R, Gephi

Dataset: OpenFlights

This project was done under the guidance of Dr. Ali Tafti. Its purpose was to analyse trends in the air network connecting different sources and destinations, based on the routes data from OpenFlights. Based on the analysis, we reported the hubs in the network, the highest-connecting airports, which points are highly connected, and how airports cluster based on different parameters.
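
One simple way to surface hubs is to rank airports by degree, the number of distinct airports each one connects to. The project itself used R and Gephi; the Python sketch below only illustrates the idea, and the route pairs in it are made up rather than taken from OpenFlights.

```python
from collections import defaultdict

# Made-up (source, destination) pairs standing in for the routes data.
routes = [
    ("SEA", "ANC"),
    ("SEA", "LAX"),
    ("SEA", "SFO"),
    ("LAX", "SFO"),
    ("ANC", "FAI"),
    ("LAX", "SEA"),  # reverse direction of an existing route
]


def airport_degrees(route_pairs):
    """Count distinct neighbours per airport, treating routes as undirected."""
    neighbours = defaultdict(set)
    for src, dst in route_pairs:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    return {airport: len(adjacent) for airport, adjacent in neighbours.items()}


def top_hubs(route_pairs, k=3):
    """Return the k airports with the highest degree (ties broken by code)."""
    degrees = airport_degrees(route_pairs)
    return sorted(degrees.items(), key=lambda item: (-item[1], item[0]))[:k]


hubs = top_hubs(routes)
```

Degree is only the most basic centrality measure; measures such as betweenness, and community detection for clustering, need a graph library or a tool like Gephi.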

Please read the full description by clicking Read More.

Text Mining - Twitter Sentiment Analysis

Sentiment analysis is the automated process of analyzing text data and classifying it as positive or negative. Using sentiment analysis tools to analyze opinions in Twitter data can help companies understand how people are talking about their brand.

Dataset Details:
Training data: 90K rows and 6 columns
Test data: 10K rows

Models: Logistic regression implemented from scratch (no built-in libraries were used for the gradient or the logistic function).
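
A from-scratch logistic regression boils down to a sigmoid function and a gradient descent loop. The sketch below shows one minimal way to write it in plain Python. It is not the project code: the two features (counts of positive and negative words), the toy data, the learning rate and the epoch count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Label a row 1 (positive) when its predicted probability reaches the threshold."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
        for xi in X
    ]


# Toy features per tweet: [count of positive words, count of negative words].
X = [[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
predictions = predict(X, w, b)
```

On this toy data the learned weight for positive words comes out positive and the one for negative words negative, which is the behaviour one would expect from a sentiment classifier.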

Results (with 10-fold cross-validation): test accuracy 70%, precision 70%, recall 89%