This project uses machine learning algorithms to predict the likelihood of a patient having diabetes based on certain diagnostic criteria. The dataset used in this project contains information about various diagnostic measurements for patients, such as glucose, blood pressure, insulin level, etc.
The aim of this project is to build and compare the performance of different machine learning models - K-Nearest Neighbors, Logistic Regression, Decision Tree, Random Forest, Support Vector Machine - to accurately predict the presence of diabetes in patients.
The dataset used in this project is the Pima Indians Diabetes Dataset, which is available on Kaggle. It consists of 768 samples with 8 features plus the target variable, which indicates the presence of diabetes. It includes the following health criteria:
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)
- Number of Instances: 768
- Number of Attributes: 8 plus class
- Missing Attribute Values: Yes
- Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")
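
The "Missing Attribute Values: Yes" entry above deserves a short illustration. In commonly distributed copies of this dataset, missing measurements are not left blank but recorded as 0 in columns where zero is not a plausible reading. The sketch below assumes that encoding and replaces such zeros with the column median. The column selection, the median strategy, and the helper name `impute_zero_placeholders` are illustrative assumptions, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Columns where 0 is treated as "not measured" rather than a real reading.
# This choice is an assumption made for illustration.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def impute_zero_placeholders(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with placeholder zeros replaced by the column median."""
    cleaned = df.copy()
    for col in ZERO_AS_MISSING:
        # Convert to float so NaN can be stored, then mark zeros as missing.
        values = cleaned[col].astype(float).replace(0, np.nan)
        # The median is computed over the non-missing values only.
        cleaned[col] = values.fillna(values.median())
    return cleaned
```

Columns such as Pregnancies are left untouched, since 0 is a valid value there.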
python 3.8.3
pandas
numpy
scikit-learn (imported as sklearn)
matplotlib
seaborn
pickle (included in the Python standard library)
Clone the repository and change into the project directory using the following commands, then install the dependencies listed above:
git clone https://github.com/Priyanshu88/Diabetes-Prediction.git
cd Diabetes-Prediction
The Jupyter notebook Diabetes Prediction.ipynb contains the code for loading and preprocessing the dataset, as well as implementing and evaluating the KNN, Logistic Regression, Random Forest, Support Vector Machine and Decision Tree models. To run the notebook, simply open it in Jupyter and run each cell in order.
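
For readers who want a feel for the workflow before opening the notebook, here is a minimal sketch of fitting the five model families on a single train/test split and collecting their test accuracy. The split ratio, the random seed, the feature scaling, and the helper name `compare_models` are assumptions made for illustration; the notebook's actual preprocessing and settings may differ, so the numbers it produces will not necessarily match the reported results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, test_size=0.2, random_state=42):
    """Fit five classifiers on one stratified split and return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    models = {
        # Distance- and margin-based models get standardized inputs.
        "K-Nearest Neighbour": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "Random Forest": RandomForestClassifier(random_state=random_state),
        "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
    }
    # score() on a classifier returns mean accuracy on the given test data.
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }
```

With the dataset loaded into a pandas DataFrame `df`, a call could look like `compare_models(df.drop(columns="Outcome"), df["Outcome"])`.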
Model | Accuracy |
---|---|
K-Nearest Neighbour | 79.22% |
Logistic Regression | 81.82% |
Random Forest | 79.22% |
Support Vector Machine | 83.12% |
Decision Tree | 81.82% |
Hyperparameter tuning (GridSearchCV) on Logistic Regression | 83.12% |
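
The last row of the table refers to a grid search over Logistic Regression hyperparameters. A minimal sketch of that idea follows; the parameter grid, the 5-fold cross-validation, the scaling step, and the helper name `tune_logistic_regression` are assumptions chosen for illustration, not the grid used in the notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_logistic_regression(X_train, y_train, cv=5):
    """Grid-search the regularization strength C and return the fitted search object."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Hypothetical grid: smaller C means stronger regularization.
    param_grid = {
        "clf__C": [0.01, 0.1, 1, 10, 100],
        "clf__class_weight": [None, "balanced"],
    }
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    return search
```

After fitting, `search.best_params_` holds the selected settings and `search.best_estimator_` can be scored on the held-out test set. Scaling inside the pipeline keeps each cross-validation fold from seeing statistics computed on its own validation data.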
In this project, we compared the accuracy of five machine learning models, and the effect of hyperparameter tuning, for predicting diabetes from various health criteria. Support Vector Machine was the most accurate untuned model, with an accuracy of 83.12%. Logistic Regression and Decision Tree also performed well at 81.82% each, and hyperparameter tuning raised the accuracy of Logistic Regression by about 1.3 percentage points, from 81.82% to 83.12%, matching Support Vector Machine. The project could be improved further by testing additional models and by including additional health criteria in the dataset.
Check out the deployment repository here.
This project is licensed under the MIT License - see the LICENSE file for details.
Your Name - @twitter_handle - 2040020@sliet.ac.in
Project Link: https://github.com/Priyanshu88/Diabetes-Prediction.git