Enhancing Heart Failure Predictions: A Data-Driven Approach with Machine Learning



In this project, I investigated various programming and analytical methods for evaluating how accurately machine learning models can predict heart failure outcomes in patients. The healthcare sector relies on precise predictions of heart failure cases to make well-informed decisions about patient survival. I used scientific computing, data analysis, data visualization, and machine learning techniques to complete this task.


The project demonstrated my proficiency in using various machine learning techniques, data analysis, and visualization tools to tackle a real-world problem in the healthcare domain. It showcases my ability to identify and address challenges, such as class imbalance, and use feature selection methods to improve model performance.


To view the code, please follow this link.




I. IMPLEMENTATION

1.1. Exploratory Data Analysis: 

- Perform exploratory data analysis on the dataset to identify any missing or incorrect data.

- Examine and investigate the data to uncover patterns, trends, and insights (a minimal sketch of this step follows below).
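As a concrete illustration, here is a minimal sketch of this step in Python, assuming the data comes from the standard heart failure clinical records CSV (the file name is illustrative) with a DEATH_EVENT target column:

```python
import pandas as pd

# Load the dataset (file name is an assumption; adjust to the actual path).
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Structure and summary statistics: column types, non-null counts, ranges.
df.info()
print(df.describe())

# Check every column for missing values.
print(df.isnull().sum())

# Inspect the balance of the target variable.
print(df["DEATH_EVENT"].value_counts(normalize=True))
```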


1.2. Classification using Multiple Models:

Fit the dataset with six machine learning algorithms: Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Random Forest, K-Nearest Neighbors (K-NN), and Multi-layer Perceptron Neural Network (MLP). The performance of these models is evaluated using metrics such as accuracy, precision, recall, and F1-score.
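A sketch of this stage, assuming an 80/20 stratified train/test split and mostly default hyperparameters (the project's exact settings may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
X, y = df.drop(columns=["DEATH_EVENT"]), df["DEATH_EVENT"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Multi-Layer Perceptron": MLPClassifier(max_iter=1000, random_state=42),
}

for name, clf in models.items():
    # Scale features inside a pipeline so the test split never leaks into the fit.
    pipe = make_pipeline(StandardScaler(), clf).fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, y_pred):.2f} "
          f"prec={precision_score(y_test, y_pred):.2f} "
          f"rec={recall_score(y_test, y_pred):.2f} "
          f"f1={f1_score(y_test, y_pred):.2f}")
```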


1.3. Addressing Class Imbalance: 

Identify and address class imbalance issues in the dataset using resampling techniques like Random Under-Sampling (RUS) and Synthetic Minority Over-sampling Technique (SMOTE). The balanced dataset is then used to build and evaluate the classification models again. 
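A minimal sketch of both techniques, assuming the imbalanced-learn package (a common implementation choice on my part, not confirmed by the project) and the train/test variables from the classification sketch above. Resampling is applied to the training split only, so the test set stays untouched:

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# RUS: randomly drop majority-class instances until the classes are balanced.
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# SMOTE: synthesize new minority-class instances by interpolating neighbors.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)

print(y_train.value_counts().to_dict())  # original class counts
print(y_rus.value_counts().to_dict())    # after under-sampling
print(y_smote.value_counts().to_dict())  # after over-sampling
```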

 


1.4. Feature Selection:

Use the Mann-Whitney U test, the Chi-Square test, and the feature-importance plot from the Random Forest Classifier to identify the most significant features for making predictions.
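A sketch of the three techniques, assuming the dataset's standard column names and the training variables from the earlier sketches; the exact feature lists and thresholds used in the project may differ:

```python
import pandas as pd
from scipy.stats import mannwhitneyu
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

# Mann-Whitney U test: does a continuous feature differ between outcome groups?
for col in ["age", "ejection_fraction", "serum_creatinine",
            "serum_sodium", "platelets"]:
    _, p = mannwhitneyu(X_train.loc[y_train == 0, col],
                        X_train.loc[y_train == 1, col])
    print(f"{col}: p = {p:.4f}")

# Chi-Square test: score non-negative features against the target, keep the top 5.
selector = SelectKBest(chi2, k=5).fit(X_train, y_train)
print("Chi-Square picks:", list(X_train.columns[selector.get_support()]))

# Random Forest feature importances, sorted for the importance plot.
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```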



1.5. Model Performance Evaluation: 

Compare the performance of the models across the different stages of the project (original data, balanced data, selected features) and provide justifications for the observed differences.
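Collecting every (model, technique) result into one tidy frame makes this comparison a one-liner; the numbers below are taken from the findings tables in Section II:

```python
import pandas as pd

# A few rows from the findings tables, pivoted to compare accuracy per stage.
rows = [
    {"Model": "Logistic Regression", "Technique": "Original", "Accuracy": 0.80},
    {"Model": "Logistic Regression", "Technique": "RUS",      "Accuracy": 0.87},
    {"Model": "Logistic Regression", "Technique": "SMOTE",    "Accuracy": 0.87},
    {"Model": "Random Forest",       "Technique": "Original", "Accuracy": 0.77},
    {"Model": "Random Forest",       "Technique": "RUS",      "Accuracy": 0.90},
    {"Model": "Random Forest",       "Technique": "SMOTE",    "Accuracy": 0.89},
]
print(pd.DataFrame(rows).pivot(index="Model", columns="Technique",
                               values="Accuracy"))
```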



II. FINDINGS


2.1. Exploratory Data Analysis: 




2.1.1. Age Distribution: 

- The age histogram shows a bimodal distribution, with a peak around 60-70 years old and a smaller peak around 20-30 years old. Patients with a death event tend to be concentrated in the older age range. 

2.1.2. Smoking: 

- The smoking feature shows a clear distinction between patients who experienced a DEATH_EVENT and those who did not. Patients who smoked appear to have a higher risk of a death event. 

2.1.3. Sex: 

- The sex feature indicates that there is a slightly higher proportion of death event cases among male patients compared to female patients. 

2.1.4. Platelet Count: 

- The platelet count distribution shows a bimodal pattern, with one peak around 200,000 and another around 300,000. Patients with a lower platelet count appear to have a higher risk of a death event. 

2.1.5. Creatinine Phosphokinase: 

- The creatinine phosphokinase feature exhibits a wide range of values, with some clustering of higher values associated with death event cases. 

2.1.6. Diabetes and High Blood Pressure: 

- These features show a clear separation between death event and non-death event cases, indicating that patients with diabetes and high blood pressure are more likely to experience a death event.

2.1.7. Other Features: 

- The visualizations for anemia, ejection fraction, serum creatinine, and serum sodium also provide insights into the relationships between these features and the target variable.
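These per-feature visualizations can be reproduced with a short loop; a sketch assuming seaborn, with df as loaded in the EDA step (the project's actual plot styling may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# df: the DataFrame loaded in the EDA sketch (Section 1.1).
# One histogram per feature, colored by outcome, to eyeball group differences.
features = ["age", "platelets", "creatinine_phosphokinase", "ejection_fraction"]
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.ravel(), features):
    sns.histplot(data=df, x=col, hue="DEATH_EVENT", kde=True, ax=ax)
plt.tight_layout()
plt.show()
```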


2.2. Classification using Multiple Models:

| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |

The results showed that Logistic Regression performed the best among the models evaluated, with the highest accuracy, precision, recall, and F1-Score. The other models showed varying levels of performance, with Naive Bayes and K-Nearest Neighbors scoring lowest on the metrics evaluated.

It is noticeable that Random Forest and Multi-Layer Perceptron are both based on techniques that involve randomness during training (bootstrap sampling in the former, weight initialization in the latter). Therefore, their performance can change slightly depending on the random seed used or the specific instances in the dataset that are sampled.
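One way to quantify that variability is to refit with several seeds and report the spread; a sketch, reusing the train/test split from Section 1.2:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Refit with different seeds to see how much the score moves run to run.
scores = [
    f1_score(y_test,
             RandomForestClassifier(random_state=seed)
             .fit(X_train, y_train).predict(X_test))
    for seed in range(5)
]
print(f"F1 over 5 seeds: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}")
```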



2.3. Addressing Class Imbalance: 

| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| K-Nearest Neighbors | RUS | 0.79 | 0.84 | 0.76 | 0.80 |
| K-Nearest Neighbors | SMOTE | 0.79 | 0.79 | 0.80 | 0.80 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Logistic Regression | RUS | 0.87 | 0.83 | 0.95 | 0.89 |
| Logistic Regression | SMOTE | 0.87 | 0.89 | 0.83 | 0.86 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |
| Multi-Layer Perceptron | RUS | 0.90 | 0.87 | 0.95 | 0.91 |
| Multi-Layer Perceptron | SMOTE | 0.84 | 0.83 | 0.85 | 0.84 |
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Naive Bayes | RUS | 0.69 | 0.74 | 0.67 | 0.70 |
| Naive Bayes | SMOTE | 0.78 | 0.85 | 0.68 | 0.76 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| Random Forest | RUS | 0.90 | 0.87 | 0.95 | 0.91 |
| Random Forest | SMOTE | 0.89 | 0.85 | 0.95 | 0.90 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Support Vector Machine | RUS | 0.85 | 0.83 | 0.90 | 0.86 |
| Support Vector Machine | SMOTE | 0.84 | 0.83 | 0.85 | 0.84 |


The application of resampling techniques notably improved the performance of most models. The Random Forest and Multi-Layer Perceptron models with the RUS technique outperformed the others, achieving an accuracy, precision, recall, and F1-Score of 0.90, 0.87, 0.95, and 0.91 respectively. 

However, all of the above findings may change across different runs, because RUS reduces the majority class by randomly eliminating some of its instances, while SMOTE generates new synthetic instances of the minority class; both involve randomness.

From the findings, it appears that applying these resampling techniques can greatly improve the performance of almost all models.



2.4. Feature Selection:

| Model | Technique | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| K-Nearest Neighbors | Chi-Mann | 0.73 | 0.74 | 0.56 | 0.64 |
| K-Nearest Neighbors | Original | 0.68 | 0.88 | 0.28 | 0.42 |
| K-Nearest Neighbors | Random Forest | 0.70 | 0.82 | 0.36 | 0.50 |
| Logistic Regression | Chi-Mann | 0.78 | 0.93 | 0.52 | 0.67 |
| Logistic Regression | Original | 0.80 | 0.93 | 0.56 | 0.70 |
| Logistic Regression | Random Forest | 0.80 | 0.93 | 0.56 | 0.70 |
| Multi-Layer Perceptron | Chi-Mann | 0.77 | 0.82 | 0.56 | 0.67 |
| Multi-Layer Perceptron | Original | 0.73 | 0.76 | 0.52 | 0.62 |
| Multi-Layer Perceptron | Random Forest | 0.75 | 0.86 | 0.48 | 0.62 |
| Naive Bayes | Chi-Mann | 0.72 | 0.90 | 0.36 | 0.51 |
| Naive Bayes | Original | 0.70 | 0.82 | 0.36 | 0.50 |
| Naive Bayes | Random Forest | 0.70 | 0.82 | 0.36 | 0.50 |
| Random Forest | Chi-Mann | 0.70 | 0.71 | 0.48 | 0.57 |
| Random Forest | Original | 0.77 | 0.92 | 0.48 | 0.63 |
| Random Forest | Random Forest | 0.70 | 0.71 | 0.48 | 0.57 |
| Support Vector Machine | Chi-Mann | 0.75 | 0.78 | 0.56 | 0.65 |
| Support Vector Machine | Original | 0.75 | 0.86 | 0.48 | 0.62 |
| Support Vector Machine | Random Forest | 0.77 | 0.82 | 0.56 | 0.67 |


Based on the results, selecting important features with the Mann-Whitney U test and Chi-Square test slightly improved the accuracy of the Naive Bayes, K-Nearest Neighbors, and Multi-Layer Perceptron models, while the feature-importance plot from the Random Forest Classifier yielded a moderate improvement in the accuracy of the Support Vector Machine, K-Nearest Neighbors, and Multi-Layer Perceptron models. Overall, the feature selection techniques did not have a considerable impact on the models' performance in this project. It is possible that the effectiveness of these techniques depends on the size and characteristics of the dataset used.


2.5. Model Performance Evaluation: 

In summary, after evaluating the performance of six different machine learning models, the Logistic Regression model demonstrated the highest performance, closely followed by the Support Vector Machine. The performance of the Random Forest and Multi-Layer Perceptron models was more variable and may be sensitive to the random seed and the specific parameters used. To optimize model performance, it is recommended to use techniques such as balancing the dataset and selecting relevant features. Based on the results of this project, it is suggested to use the Logistic Regression model for future predictions and to consider applying resampling techniques to further improve its performance.



