
Stroke_Prediction

Follow @StrokePred on Twitter

You can reach the application demo here.

Stroke Prediction

Jefn Alshammari & Abdulaziz Almass

Abstract

This project aims to predict stroke by analyzing a dataset found on Kaggle with different machine learning (ML) models, to help medical staff recognize people at risk of stroke. Several models were trained on the dataset, with the best reaching 96% accuracy.

Design

This project is one of the T5 Data Science BootCamp requirements. Data provided by Kaggle has been used in this project. The attribute "Stroke" is the label, or target, to be predicted. The target is binary, taking the value 1 (predicted with stroke) or 0 (predicted without stroke). This classification task is tackled with various machine learning models, and the models are compared on their performance to find the one that best fits the selected dataset. The following models have been trained and tested: Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Support-Vector Machines (SVM), Random Forest, and XGBoost.
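The comparison described above can be sketched with scikit-learn's common estimator interface. This is a minimal illustration, not the project's actual code: synthetic data stands in for the Kaggle dataset, and XGBoost is omitted to keep the sketch self-contained.

```python
# Hypothetical sketch: fit several classifiers on a binary target and
# compare their test accuracy. Synthetic data stands in for the dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with a binary label, roughly the shape of the real dataset.
X, y = make_classification(n_samples=1000, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Every estimator shares fit/score, so the comparison is a simple loop.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.2f}")
```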

Data

The dataset is available in .csv format. It consists of 5110 observations/data points with 12 attributes or features. Exploratory data analysis showed that the age feature plays an important role in stroke prediction, which most of the deployed models confirmed afterwards. The importance of the other features could not be established definitively because the dataset is imbalanced; the imbalance was treated with the Synthetic Minority Oversampling Technique (SMOTE). The remaining key attribute is the label itself: whether the person is predicted to have a stroke or not.

Models

Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, Support-Vector Machines (SVM), Random Forest, and XGBoost were trained to predict stroke. Random Forest achieved the highest accuracy.

Models Evaluation and Selection

The following metrics summarize the results of all ML models used in this project (per-class values are shown as stroke = 0 / stroke = 1):

| Model | Macro Avg | Accuracy | Precision (0 / 1) | Recall (0 / 1) | F1 Score (0 / 1) |
|---|---|---|---|---|---|
| Logistic Regression (Imbalanced) | 0.50 | 0.92 | 0.93 / 0.00 | 1.00 / 0.00 | 0.96 / 0.00 |
| Logistic Regression (Not Scaled) | 0.91 | 0.90 | 0.89 / 0.92 | 0.93 / 0.89 | 0.91 / 0.90 |
| Logistic Regression (Scaled) | 0.92 | 0.91 | 0.88 / 0.97 | 0.97 / 0.87 | 0.92 / 0.92 |
| Logistic Regression (Tuned & Scaled) | 0.92 | 0.91 | 0.89 / 0.94 | 0.94 / 0.89 | 0.92 / 0.91 |
| KNN (Scaled) | 0.94 | 0.94 | 0.96 / 0.93 | 0.93 / 0.96 | 0.94 / 0.94 |
| Decision Tree (Scaled) | 0.92 | 0.92 | 0.94 / 0.91 | 0.90 / 0.95 | 0.92 / 0.93 |
| SVM (Not Scaled) | 0.92 | 0.92 | 0.89 / 0.96 | 0.97 / 0.88 | 0.92 / 0.92 |
| SVM (Scaled) | 0.92 | 0.92 | 0.88 / 0.97 | 0.97 / 0.87 | 0.92 / 0.92 |
| Random Forest (Scaled) | 0.97 | 0.96 | 0.96 / 0.97 | 0.97 / 0.96 | 0.97 / 0.97 |
| XGBoost (Scaled) | 0.93 | 0.93 | 0.91 / 0.95 | 0.95 / 0.91 | 0.93 / 0.93 |
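Per-class precision/recall/F1 and the macro average of the kind tabulated above come straight out of scikit-learn's classification report. A minimal illustration on toy labels and predictions (the arrays are made up, not the project's actual outputs):

```python
# Illustration of where such metrics come from: classification_report
# gives per-class precision/recall/F1 plus macro and weighted averages.
from sklearn.metrics import classification_report

# Toy labels/predictions, NOT the project's real outputs.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

print(classification_report(y_true, y_pred))
report = classification_report(y_true, y_pred, output_dict=True)
```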

Tools

  • Pandas library for data frames
  • Numpy for mathematical operations
  • Matplotlib and Seaborn for plots
  • Plotly for interactive plots
  • SKlearn for modeling
  • Imblearn for SMOTE oversampling
  • XGBoost for gradient-boosted tree modeling
  • One-Hot-Encoding for categorical features

Communication

The presentation slides are provided here, and further details are given in the project README. For any enquiries, you can contact us via email or Twitter (Follow @StrokePred).
