e-ISSN 2231-8542
ISSN 1511-3701
Hezlin Aryani Abd Rahman, Yap Bee Wah and Ong Seng Huat
Pertanika Journal of Tropical Agricultural Science, Volume 29, Issue 1, January 2021
DOI: https://doi.org/10.47836/pjst.29.1.10
Keywords: Categorical covariate, imbalanced data, logistic regression, parameter estimates, predictive analytics, simulation
Published on: 22 January 2021
Logistic regression is often used to classify a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data occur when the number of cases in one category of the binary dependent variable is much smaller than in the other category. Such imbalance leads to biased parameter estimates and affects the classification performance of the logistic regression model. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio (IR), on the parameter estimate of a binary logistic regression with a categorical covariate. Datasets were simulated with controlled IR values from 1% to 50% and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the mean square error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as the sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000 to 2000 and 2500 to 3500, the estimates were biased for IR below 30%, 10%, 5% and 2%, respectively. The results also showed that the parameter estimates were biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
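The simulation design summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code, and it rests on several assumptions that do not come from the article: a single binary covariate, a true slope `beta1 = 1.0`, a covariate prevalence `p_x = 0.5`, and the function names `intercept_for_ir` and `simulate_mse`. For one binary covariate, the logistic regression maximum likelihood estimate of the slope equals the log odds ratio of the 2x2 table, so no model-fitting library is needed. Replications with an empty cell, where that estimate does not exist, are skipped and counted here; that handling is also an assumption.

```python
import math
import random


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def intercept_for_ir(ir, beta1, p_x, lo=-20.0, hi=20.0):
    """Bisect for the intercept beta0 that makes the marginal P(Y=1) equal ir."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        p_y = p_x * sigmoid(mid + beta1) + (1.0 - p_x) * sigmoid(mid)
        if p_y < ir:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate_mse(n, ir, beta1=1.0, p_x=0.5, reps=200, seed=0):
    """Return (MSE of the slope estimate, number of usable replications).

    beta1, p_x, reps and seed are illustrative assumptions, not values
    taken from the article.
    """
    rng = random.Random(seed)
    beta0 = intercept_for_ir(ir, beta1, p_x)
    p1, p0 = sigmoid(beta0 + beta1), sigmoid(beta0)
    sq_errors = []
    for _ in range(reps):
        counts = [[0, 0], [0, 0]]  # counts[x][y]
        for _ in range(n):
            x = 1 if rng.random() < p_x else 0
            y = 1 if rng.random() < (p1 if x else p0) else 0
            counts[x][y] += 1
        (n00, n01), (n10, n11) = counts
        if min(n00, n01, n10, n11) == 0:
            continue  # the slope MLE does not exist (assumption: skip it)
        # For a single binary covariate, the MLE of the slope is the log odds ratio.
        estimate = math.log((n11 * n00) / (n10 * n01))
        sq_errors.append((estimate - beta1) ** 2)
    mse = sum(sq_errors) / len(sq_errors) if sq_errors else float("nan")
    return mse, len(sq_errors)


if __name__ == "__main__":
    for n in (100, 500, 2000):
        for ir in (0.01, 0.05, 0.30):
            mse, used = simulate_mse(n, ir, reps=100)
            print(f"n={n:5d}  IR={ir:4.0%}  MSE={mse:.4f}  usable reps={used}")
```

Running the script prints one line per combination of sample size and IR, which allows a rough comparison of how the MSE of the slope estimate changes across the two factors. The exact figures depend on the assumed slope, prevalence and seed, so they should not be read as a reproduction of the article's results.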