e-ISSN 2231-8526
ISSN 0128-7680



An Improved Ensemble Machine Learning Approach for Diabetes Diagnosis

Mohanad Mohammed Rashid, Omar Mahmood Yaseen, Rana Riyadh Saeed and Maher Talal Alasaady

Pertanika Journal of Science & Technology, Volume 32, Issue 3, April 2024


Keywords: Diabetes diagnosis, ensemble learning, machine learning, PIDD, soft voting

Published on: 24 April 2024

Diabetes is among the most harmful diseases worldwide and is characterized by elevated blood glucose levels caused by either insulin deficiency or reduced insulin efficacy. Early diagnosis enables patients to begin treatment promptly, thereby minimizing or eliminating the risk of severe complications. Although years of research in computational diagnosis have shown that machine learning offers a robust methodology for predicting diabetes, existing models leave considerable room for improvement in accuracy. This paper proposes an improved ensemble machine learning approach that uses multiple classifiers for diabetes diagnosis based on the Pima Indians Diabetes Dataset (PIDD). The proposed ensemble voting classifier combines five machine learning algorithms: Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbor (KNN), Random Forest (RF), and XGBoost. We first obtained the accuracy of each individual model and then used the ensemble method to improve on it. Pre-processing consists of standardization, imputation, and removal of data anomalies with the Local Outlier Factor (LOF). The model was evaluated using sensitivity, specificity, and accuracy. With a reported accuracy of 81%, the proposed approach shows promise compared to prior classification techniques.
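To make the voting and evaluation steps concrete, the following is a minimal sketch, assuming that soft voting (as listed in the keywords) means averaging each classifier's predicted probability for the positive class and thresholding the average at 0.5. The helper names `soft_vote` and `evaluate`, the threshold, and all probability values are hypothetical and chosen for illustration only; they are not taken from the paper and do not reproduce its results.

```python
# Illustrative sketch only: soft voting over five classifiers' probabilities,
# followed by sensitivity, specificity, and accuracy. All numbers are made up.

def soft_vote(model_probs, threshold=0.5):
    """Average positive-class probabilities across models, then threshold.

    model_probs maps a model name to a list with one probability per sample.
    """
    columns = list(model_probs.values())
    n_models = len(columns)
    n_samples = len(columns[0])
    averaged = [sum(col[i] for col in columns) / n_models for i in range(n_samples)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


def evaluate(y_true, y_pred):
    """Compute the three criteria from the confusion-matrix counts.

    Assumes y_true contains at least one positive and one negative sample.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical positive-class probabilities for four patients.
model_probs = {
    "DT":      [1.0, 0.0, 1.0, 0.0],
    "LR":      [0.7, 0.4, 0.4, 0.2],
    "KNN":     [0.6, 0.4, 0.2, 0.4],
    "RF":      [0.8, 0.3, 0.4, 0.1],
    "XGBoost": [0.9, 0.2, 0.3, 0.6],
}
y_true = [1, 0, 1, 0]  # hypothetical ground-truth labels

averaged, labels = soft_vote(model_probs)
metrics = evaluate(y_true, labels)
print(labels)
print(metrics)
```

In this toy example the third patient is misclassified even though the Decision Tree is fully confident, because the other four models pull the averaged probability below the threshold. Averaging probabilities in this way lets confident and unconfident models contribute proportionally, whereas hard voting counts only each model's final label.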
