Enhancing Heart Failure Predictions: A Data-Driven Approach with Machine Learning
In this project, I investigated how effectively several machine learning models predict patient deaths from heart failure. Precise predictions of heart failure outcomes help the healthcare sector make well-informed decisions about patient care and survival. I used scientific computing, data analysis, data visualization, and machine learning techniques to complete this task.
The project demonstrated my proficiency in using various machine learning techniques, data analysis, and visualization tools to tackle a real-world problem in the healthcare domain. It showcases my ability to identify and address challenges, such as class imbalance, and use feature selection methods to improve model performance.
To view the code, please follow this link.
Note: click on the images below to view them more clearly.
I. IMPLEMENTATION
1.1. Exploratory Data Analysis:
- Perform exploratory data analysis on the dataset to identify any missing or incorrect data.
- Examine and investigate data to uncover patterns, trends, and insights.
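The EDA steps above can be sketched as follows. This is a minimal illustration on a hypothetical mini-sample using column names from the public heart failure clinical records dataset; the actual project loads the full CSV instead.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample standing in for the heart failure dataset.
df = pd.DataFrame({
    "age": [65, 50, 72, 45, np.nan, 80],
    "ejection_fraction": [20, 38, 25, 60, 35, 30],
    "serum_creatinine": [1.9, 1.1, 1.3, 0.9, 1.0, 2.1],
    "smoking": [1, 0, 1, 0, 0, 1],
    "DEATH_EVENT": [1, 0, 1, 0, 0, 1],
})

# 1) Check for missing or incorrect data.
missing = df.isna().sum()
print(missing)

# 2) Summarize a feature per outcome group to uncover patterns.
print(df.groupby("DEATH_EVENT")["ejection_fraction"].describe())
```

Grouping summary statistics by `DEATH_EVENT` is a quick way to spot features whose distributions differ between the two outcomes before plotting them.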
1.2. Classification using Multiple Models:
Use six machine learning algorithms: Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Random Forest, K-Nearest Neighbors (K-NN), and Multi-layer Perceptron Neural Network (MLP) to fit the dataset. The performance of these models is evaluated using metrics like accuracy, precision, recall, and F1-score.
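The six-model comparison can be sketched as below. Synthetic data from `make_classification` stands in for the real dataset, and the class weights, split sizes, and hyperparameters here are illustrative assumptions, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in with an imbalance similar in spirit to the real data.
X, y = make_classification(n_samples=300, n_features=12,
                           weights=[0.68, 0.32], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scale features: SVM, K-NN, and MLP are sensitive to feature scale.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Multi-Layer Perceptron": MLPClassifier(max_iter=2000, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, pred),
        "Precision": precision_score(y_test, pred),
        "Recall": recall_score(y_test, pred),
        "F1-Score": f1_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

Collecting all four metrics per model in one loop makes it easy to tabulate and compare them later.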
1.3. Addressing Class Imbalance:
Identify and address class imbalance issues in the dataset using resampling techniques like Random Under-Sampling (RUS) and Synthetic Minority Over-sampling Technique (SMOTE). The balanced dataset is then used to build and evaluate the classification models again.
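As a minimal sketch of how RUS balances a dataset, the snippet below implements random under-sampling directly with NumPy on toy data; the project itself would typically use the `RandomUnderSampler` and `SMOTE` classes from the imbalanced-learn library, and SMOTE (which synthesizes new minority points by interpolating between neighbors) is not reimplemented here.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_under_sample(X, y, rng):
    """Balance a binary dataset by randomly dropping majority-class
    rows until both classes match the minority-class count (RUS)."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        keep.append(rng.choice(idx, size=n_min, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Imbalanced toy data: 80 negatives, 20 positives.
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

X_bal, y_bal = random_under_sample(X, y, rng)
print(np.bincount(y_bal))  # both classes now have 20 instances
```

Because rows are discarded at random, repeated runs with different seeds keep different subsets of the majority class, which is why RUS results can vary between runs.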
1.4. Feature Selection:
I used the Mann-Whitney U test, Chi-Square test, and the feature importance graph from the Random Forest Classifier to identify the most significant features for making predictions.
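The three feature-selection tools can be sketched as follows on synthetic data. The feature names and the class shift are illustrative assumptions; in the project each test is applied to the dataset's real continuous and categorical columns.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Hypothetical features: one continuous column shifted by class
# (informative), one pure-noise column, and two random binary columns.
informative = rng.normal(loc=y * 1.5, scale=1.0)
noise = rng.normal(size=n)
binary = rng.integers(0, 2, size=(n, 2))
X = np.column_stack([informative, noise, binary])

# 1) Mann-Whitney U test: continuous feature vs. binary target.
u_stat, p_cont = mannwhitneyu(informative[y == 1], informative[y == 0])
print(f"Mann-Whitney p-value (informative feature): {p_cont:.4g}")

# 2) Chi-square test for the categorical (binary) features.
chi_stats, p_cat = chi2(binary, y)
print("Chi-square p-values:", np.round(p_cat, 3))

# 3) Random Forest feature importances over all columns.
rf = RandomForestClassifier(random_state=0).fit(X, y)
print("RF importances:", np.round(rf.feature_importances_, 3))
```

Features with small test p-values or large forest importances are the candidates to keep; the three methods often agree, and combining them guards against the blind spots of any single test.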
1.5. Model Performance Evaluation:
Compare the performance of the models across the different stages of the project (original data, balanced data, selected features) and provide justifications for the observed differences.
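Comparing stages amounts to tabulating metric rows tagged by technique and picking the best row per model. A small sketch, using a few illustrative rows taken from the result tables in this report:

```python
import pandas as pd

# A handful of (model, technique, accuracy, F1) rows from the report.
rows = [
    ("Logistic Regression", "Original", 0.80, 0.70),
    ("Logistic Regression", "RUS",      0.87, 0.89),
    ("Random Forest",       "Original", 0.77, 0.63),
    ("Random Forest",       "RUS",      0.90, 0.91),
]
df = pd.DataFrame(rows, columns=["Model", "Technique", "Accuracy", "F1-Score"])

# Best technique per model, judged by F1-Score.
best = df.loc[df.groupby("Model")["F1-Score"].idxmax()]
print(best)
```

F1-score is used as the tiebreaker here because it balances precision and recall, which matters more than raw accuracy on an imbalanced dataset.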
II. FINDINGS
2.1. Exploratory Data Analysis:
2.1.1. Age Distribution:
- The age histogram shows a bimodal distribution, with a peak around 60-70 years old and a smaller peak around 20-30 years old. Patients with a death event tend to be concentrated in the older age range.
2.1.2. Smoking:
- The smoking feature shows a clear distinction between patients who experienced a DEATH_EVENT and those who did not. Patients who smoked appear to have a higher risk of a death event.
2.1.3. Sex:
- The sex feature indicates that there is a slightly higher proportion of death event cases among male patients compared to female patients.
2.1.4. Platelet Count:
- The platelet count distribution shows a bimodal pattern, with one peak around 200,000 and another around 300,000. Patients with a lower platelet count appear to have a higher risk of a death event.
2.1.5. Creatinine Phosphokinase:
- The creatinine phosphokinase feature exhibits a wide range of values, with some clustering of higher values associated with death event cases.
2.1.6. Diabetes and High Blood Pressure:
- These features show a clear separation between death event and non-death event cases, indicating that patients with diabetes and high blood pressure are more likely to experience a death event.
2.1.7. Other Features:
- The visualizations for anemia, ejection fraction, serum creatinine, and serum sodium also provide insights into the relationships between these features and the target variable.
2.2. Classification using Multiple Models:
| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |
The results show that Logistic Regression performed best among the models evaluated, with the highest accuracy, precision, recall, and F1-score. The remaining models showed varying performance, with Naive Bayes and K-Nearest Neighbors scoring lowest across the metrics.
It is noticeable that Random Forest and Multi-Layer Perceptron both rely on randomness during training: the forest draws bootstrap samples and random feature subsets, and the network starts from random weight initialization. Their performance can therefore change slightly depending on the random seed used or the specific instances sampled.
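This run-to-run variation can be demonstrated directly: fixing the `random_state` makes two Random Forest fits identical, while a different seed may shift the score. A minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two fits with the same seed produce identical models, hence identical scores...
a = RandomForestClassifier(random_state=7).fit(X_tr, y_tr)
b = RandomForestClassifier(random_state=7).fit(X_tr, y_tr)
same = accuracy_score(y_te, a.predict(X_te)) == accuracy_score(y_te, b.predict(X_te))

# ...while a different seed draws different bootstrap samples and may score differently.
c = RandomForestClassifier(random_state=8).fit(X_tr, y_tr)
print(same, accuracy_score(y_te, c.predict(X_te)))
```

Pinning seeds is what makes reported tables like the ones above reproducible.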
2.3. Addressing Class Imbalance:
| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| K-Nearest Neighbors | RUS | 0.79 | 0.84 | 0.76 | 0.80 |
| K-Nearest Neighbors | SMOTE | 0.79 | 0.79 | 0.80 | 0.80 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Logistic Regression | RUS | 0.87 | 0.83 | 0.95 | 0.89 |
| Logistic Regression | SMOTE | 0.87 | 0.89 | 0.83 | 0.86 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |
| Multi-Layer Perceptron | RUS | 0.90 | 0.87 | 0.95 | 0.91 |
| Multi-Layer Perceptron | SMOTE | 0.84 | 0.83 | 0.85 | 0.84 |
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Naive Bayes | RUS | 0.69 | 0.74 | 0.67 | 0.70 |
| Naive Bayes | SMOTE | 0.78 | 0.85 | 0.68 | 0.76 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| Random Forest | RUS | 0.90 | 0.87 | 0.95 | 0.91 |
| Random Forest | SMOTE | 0.89 | 0.85 | 0.95 | 0.90 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Support Vector Machine | RUS | 0.85 | 0.83 | 0.90 | 0.86 |
| Support Vector Machine | SMOTE | 0.84 | 0.83 | 0.85 | 0.84 |
Applying resampling techniques notably improved the performance of most models. Random Forest and Multi-Layer Perceptron with the RUS technique outperformed the others, each achieving an accuracy of 0.90, precision of 0.87, recall of 0.95, and F1-score of 0.91.
However, these figures may change between runs: RUS shrinks the majority class by randomly eliminating some of its instances, and SMOTE generates new synthetic instances of the minority class, so both introduce randomness into the training data.
Overall, the findings indicate that these resampling techniques can greatly improve the performance of almost all models.
2.4. Feature Selection:
| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| K-Nearest Neighbors | Chi - Mann | 0.73 | 0.74 | 0.56 | 0.64 |
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| K-Nearest Neighbors | Random Forest | 0.70 | 0.82 | 0.36 | 0.50 |
| Logistic Regression | Chi - Mann | 0.78 | 0.93 | 0.52 | 0.67 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Logistic Regression | Random Forest | 0.80 | 0.93 | 0.56 | 0.70 |
| Multi-Layer Perceptron | Chi - Mann | 0.77 | 0.82 | 0.56 | 0.67 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |
| Multi-Layer Perceptron | Random Forest | 0.75 | 0.86 | 0.48 | 0.62 |
| Naive Bayes | Chi - Mann | 0.72 | 0.90 | 0.36 | 0.51 |
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Naive Bayes | Random Forest | 0.70 | 0.82 | 0.36 | 0.50 |
| Random Forest | Chi - Mann | 0.70 | 0.71 | 0.48 | 0.57 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| Random Forest | Random Forest | 0.70 | 0.71 | 0.48 | 0.57 |
| Support Vector Machine | Chi - Mann | 0.75 | 0.78 | 0.56 | 0.65 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Support Vector Machine | Random Forest | 0.77 | 0.82 | 0.56 | 0.67 |
2.5. Model Performance Evaluation:
In summary, after evaluating the six machine learning models, Logistic Regression demonstrated the strongest overall performance, closely followed by the Support Vector Machine. Random Forest and Multi-Layer Perceptron were somewhat more variable, as their results can be sensitive to random initialization and to the specific hyperparameters used. To optimize performance, techniques such as balancing the dataset and selecting relevant features are recommended. Based on these results, Logistic Regression is suggested for future predictions, combined with resampling techniques to further improve performance.