Enhancing Heart Failure Predictions: A Data-Driven Approach with Machine Learning



In this project, I investigated various programming and analytical methods for evaluating how accurately machine learning models can predict heart failure outcomes in patients. The healthcare sector relies on precise predictions of heart failure cases to make well-informed decisions about patient survival. I used scientific computing, data analysis, data visualization, and machine learning techniques to complete this task.


The project demonstrated my proficiency in using various machine learning techniques, data analysis, and visualization tools to tackle a real-world problem in the healthcare domain. It showcases my ability to identify and address challenges, such as class imbalance, and use feature selection methods to improve model performance.


To view the code, please follow this link.




I. IMPLEMENTATION

1.1. Exploratory Data Analysis: 

- Perform exploratory data analysis on the dataset to identify any missing or incorrect data.

- Examine and investigate the data to uncover patterns, trends, and insights (a minimal sketch of this step follows below).
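As a concrete illustration, here is a minimal sketch of this step in Python, assuming the data comes from the standard heart failure clinical records CSV (the file name is illustrative) with a DEATH_EVENT target column:

```python
import pandas as pd

# Load the dataset (file name is an assumption; adjust to the actual path).
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Structure and summary statistics: column types, non-null counts, ranges.
df.info()
print(df.describe())

# Check every column for missing values.
print(df.isnull().sum())

# Inspect the balance of the target variable.
print(df["DEATH_EVENT"].value_counts(normalize=True))
```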


1.2. Classification using Multiple Models:

Fit the dataset with six machine learning algorithms: Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Random Forest, K-Nearest Neighbors (K-NN), and Multi-layer Perceptron Neural Network (MLP). The performance of these models is evaluated using metrics such as accuracy, precision, recall, and F1-score.
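A sketch of this stage, assuming an 80/20 stratified train/test split and mostly default hyperparameters (the project's exact settings may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
X, y = df.drop(columns=["DEATH_EVENT"]), df["DEATH_EVENT"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Multi-Layer Perceptron": MLPClassifier(max_iter=1000, random_state=42),
}

for name, clf in models.items():
    # Scale features inside a pipeline so the test split never leaks into the fit.
    pipe = make_pipeline(StandardScaler(), clf).fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, y_pred):.2f} "
          f"prec={precision_score(y_test, y_pred):.2f} "
          f"rec={recall_score(y_test, y_pred):.2f} "
          f"f1={f1_score(y_test, y_pred):.2f}")
```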


1.3. Addressing Class Imbalance: 

Identify and address class imbalance issues in the dataset using resampling techniques like Random Under-Sampling (RUS) and Synthetic Minority Over-sampling Technique (SMOTE). The balanced dataset is then used to build and evaluate the classification models again. 
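A minimal sketch of both techniques, assuming the imbalanced-learn package (a common implementation choice on my part, not confirmed by the project) and the train/test variables from the classification sketch above. Resampling is applied to the training split only, so the test set stays untouched:

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# RUS: randomly drop majority-class instances until the classes are balanced.
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# SMOTE: synthesize new minority-class instances by interpolating neighbors.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)

print(y_train.value_counts().to_dict())  # original class counts
print(y_rus.value_counts().to_dict())    # after under-sampling
print(y_smote.value_counts().to_dict())  # after over-sampling
```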

 


1.4. Feature Selection:

Use the Mann-Whitney U test, the Chi-Square test, and the feature-importance plot from the Random Forest Classifier to identify the most significant features for making predictions.
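A sketch of the three techniques, assuming the dataset's standard column names and the training variables from the earlier sketches; the exact feature lists and thresholds used in the project may differ:

```python
import pandas as pd
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

# Mann-Whitney U test: does a continuous feature differ between outcome groups?
for col in ["age", "ejection_fraction", "serum_creatinine",
            "serum_sodium", "platelets"]:
    _, p = mannwhitneyu(X_train.loc[y_train == 0, col],
                        X_train.loc[y_train == 1, col])
    print(f"{col}: p = {p:.4f}")

# Chi-Square test: score non-negative features against the target, keep the top 5.
selector = SelectKBest(chi2, k=5).fit(X_train, y_train)
print("Chi-Square picks:", list(X_train.columns[selector.get_support()]))

# Random Forest feature importances, sorted for the importance plot.
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```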



1.5. Model Performance Evaluation: 

Compare the performance of the models across the different stages of the project (original data, balanced data, selected features) and provide justifications for the observed differences.
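Collecting every (model, technique) result into one tidy frame makes this comparison a one-liner; the numbers below are taken from the findings tables in Section II:

```python
import pandas as pd

# A few rows from the findings tables, pivoted to compare accuracy per stage.
rows = [
    {"Model": "Logistic Regression", "Technique": "Original", "Accuracy": 0.80},
    {"Model": "Logistic Regression", "Technique": "RUS",      "Accuracy": 0.87},
    {"Model": "Logistic Regression", "Technique": "SMOTE",    "Accuracy": 0.87},
    {"Model": "Random Forest",       "Technique": "Original", "Accuracy": 0.77},
    {"Model": "Random Forest",       "Technique": "RUS",      "Accuracy": 0.90},
    {"Model": "Random Forest",       "Technique": "SMOTE",    "Accuracy": 0.89},
]
print(pd.DataFrame(rows).pivot(index="Model", columns="Technique",
                               values="Accuracy"))
```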



II. FINDINGS


2.1. Exploratory Data Analysis: 




2.1.1. Age Distribution: 

- The age histogram shows a bimodal distribution, with a peak around 60-70 years old and a smaller peak around 20-30 years old. Patients with a death event tend to be concentrated in the older age range. 

2.1.2. Smoking: 

- The smoking feature shows a clear distinction between patients who experienced a DEATH_EVENT and those who did not. Patients who smoked appear to have a higher risk of a death event. 

2.1.3. Sex: 

- The sex feature indicates that there is a slightly higher proportion of death event cases among male patients compared to female patients. 

2.1.4. Platelet Count: 

- The platelet count distribution shows a bimodal pattern, with one peak around 200,000 and another around 300,000. Patients with a lower platelet count appear to have a higher risk of a death event. 

2.1.5. Creatinine Phosphokinase: 

- The creatinine phosphokinase feature exhibits a wide range of values, with some clustering of higher values associated with death event cases. 

2.1.6. Diabetes and High Blood Pressure: 

- These features show a clear separation between death event and non-death event cases, indicating that patients with diabetes and high blood pressure are more likely to experience a death event.

2.1.7. Other Features: 

- The visualizations for anemia, ejection fraction, serum creatinine, and serum sodium also provide insights into the relationships between these features and the target variable.
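These per-feature visualizations can be reproduced with a short loop; a sketch assuming seaborn, with df as loaded in the EDA step (the project's actual plot styling may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# df: the DataFrame loaded in the EDA sketch (Section 1.1).
# One histogram per feature, colored by outcome, to eyeball group differences.
features = ["age", "platelets", "creatinine_phosphokinase", "ejection_fraction"]
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.ravel(), features):
    sns.histplot(data=df, x=col, hue="DEATH_EVENT", kde=True, ax=ax)
plt.tight_layout()
plt.show()
```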


2.2. Classification using Multiple Models:

| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |

The results showed that Logistic Regression performed the best among the models evaluated, with the highest accuracy, precision, recall, and F1-Score. The other models showed varying levels of performance, with Naive Bayes and K-Nearest Neighbors scoring lowest on the metrics evaluated.

It is noticeable that Random Forest and Multi-Layer Perceptron are both based on techniques that involve randomness during training (bootstrap sampling in the former, weight initialization in the latter). Therefore, their performance can change slightly depending on the random seed used or the specific instances in the dataset that are sampled.
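One way to quantify that variability is to refit with several seeds and report the spread; a sketch, reusing the train/test split from Section 1.2:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Refit with different seeds to see how much the score moves run to run.
scores = [
    f1_score(y_test,
             RandomForestClassifier(random_state=seed)
             .fit(X_train, y_train).predict(X_test))
    for seed in range(5)
]
print(f"F1 over 5 seeds: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}")
```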



2.3. Addressing Class Imbalance: 

| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| K-Nearest Neighbors | RUS | 0.79 | 0.84 | 0.76 | 0.80 |
| K-Nearest Neighbors | SMOTE | 0.79 | 0.79 | 0.80 | 0.80 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Logistic Regression | RUS | 0.87 | 0.83 | 0.95 | 0.89 |
| Logistic Regression | SMOTE | 0.87 | 0.89 | 0.83 | 0.86 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |
| Multi-Layer Perceptron | RUS | 0.90 | 0.87 | 0.95 | 0.91 |
| Multi-Layer Perceptron | SMOTE | 0.84 | 0.83 | 0.85 | 0.84 |
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Naive Bayes | RUS | 0.69 | 0.74 | 0.67 | 0.70 |
| Naive Bayes | SMOTE | 0.78 | 0.85 | 0.68 | 0.76 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| Random Forest | RUS | 0.90 | 0.87 | 0.95 | 0.91 |
| Random Forest | SMOTE | 0.89 | 0.85 | 0.95 | 0.90 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Support Vector Machine | RUS | 0.85 | 0.83 | 0.90 | 0.86 |
| Support Vector Machine | SMOTE | 0.84 | 0.83 | 0.85 | 0.84 |


The application of resampling techniques notably improved the performance of most models. The Random Forest and Multi-Layer Perceptron models with the RUS technique outperformed the others, achieving an accuracy, precision, recall, and F1-Score of 0.90, 0.87, 0.95, and 0.91 respectively. 

However, all of the above findings may change across different runs, because RUS reduces the majority class by randomly eliminating some of its instances, while SMOTE generates new synthetic instances of the minority class; both involve randomness.

From the findings, it appears that applying these resampling techniques can greatly improve the performance of almost all models.



2.4. Feature Selection:

| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| K-Nearest Neighbors | Chi-Mann | 0.73 | 0.74 | 0.56 | 0.64 |
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| K-Nearest Neighbors | Random Forest | 0.70 | 0.82 | 0.36 | 0.50 |
| Logistic Regression | Chi-Mann | 0.78 | 0.93 | 0.52 | 0.67 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Logistic Regression | Random Forest | 0.80 | 0.93 | 0.56 | 0.70 |
| Multi-Layer Perceptron | Chi-Mann | 0.77 | 0.82 | 0.56 | 0.67 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |
| Multi-Layer Perceptron | Random Forest | 0.75 | 0.86 | 0.48 | 0.62 |
| Naive Bayes | Chi-Mann | 0.72 | 0.90 | 0.36 | 0.51 |
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Naive Bayes | Random Forest | 0.70 | 0.82 | 0.36 | 0.50 |
| Random Forest | Chi-Mann | 0.70 | 0.71 | 0.48 | 0.57 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| Random Forest | Random Forest | 0.70 | 0.71 | 0.48 | 0.57 |
| Support Vector Machine | Chi-Mann | 0.75 | 0.78 | 0.56 | 0.65 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Support Vector Machine | Random Forest | 0.77 | 0.82 | 0.56 | 0.67 |


Based on the results, selecting important features with the Mann-Whitney U test and Chi-Square test slightly improved the accuracy of the Naive Bayes, K-Nearest Neighbors, and Multi-Layer Perceptron models, while the feature-importance plot from the Random Forest Classifier yielded a moderate improvement in the accuracy of the Support Vector Machine, K-Nearest Neighbors, and Multi-Layer Perceptron models. Overall, the feature selection techniques did not have a considerable impact on the models' performance in this project. It is possible that the effectiveness of these techniques depends on the size and characteristics of the dataset used.


2.5. Model Performance Evaluation: 

In summary, after evaluating the performance of six different machine learning models, the Logistic Regression model demonstrated the highest performance, closely followed by the Support Vector Machine. The performance of the Random Forest and Multi-Layer Perceptron models was more variable and may be sensitive to the random seed and the specific parameters used. To optimize model performance, it is recommended to use techniques such as balancing the dataset and selecting relevant features. Based on the results of this project, it is suggested to use the Logistic Regression model for future predictions and to consider applying resampling techniques to further improve its performance.



