Comparison of Single and MICE Imputation Methods for Missing Values: A Simulation Study

Nurul Azifah Mohd Pauzi, Yap Bee Wah, Sayang Mohd Deni, Siti Khatijah Nor Abdul Rahim and Suhartono

Pertanika Journal of Science & Technology, Volume 29, Issue 2, April 2021

DOI: https://doi.org/10.47836/pjst.29.2.15

Keywords: MICE, missing data, multiple imputation, simulation, single imputation

Published on: 30 April 2021

High-quality data are essential for valid research findings in every field. Missing data are common and arise for a variety of reasons, such as incomplete responses, equipment malfunction and data entry errors. Both single and multiple imputation methods have been developed to fill in missing values. This study compared single imputation using the mean with multiple imputation using Multivariate Imputation by Chained Equations (MICE) in a simulation study. Missing values were generated under the missing completely at random (MCAR) mechanism at ten missing rates (proportions of missing data), from 5% to 50%, for different sample sizes. Mean Square Error (MSE) was used to evaluate the performance of the imputation methods. The choice of imputation method depends on the data type: mean imputation is commonly used for continuous variables, while MICE can handle both continuous and categorical variables. The simulation results indicate that group mean imputation (GMI) outperformed overall mean imputation (OMI) and MICE, giving the lowest MSE for all sample sizes and missing rates. The MSE of OMI, GMI and MICE increased as the missing rate increased. MICE performed worst (i.e. had the highest MSE) when the missing rate exceeded 15%. Overall, GMI was superior to OMI and MICE for all missing rates and sample sizes under the MCAR mechanism. An application to a real dataset confirmed the simulation findings. These findings can guide researchers and practitioners in choosing a suitable imputation method when a dataset contains missing values.
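The comparison of OMI and GMI under MCAR can be illustrated with a minimal Python sketch. It is not the authors' simulation code: the function names, the two-group Gaussian data, the seed, the 20% missing rate and the choice to compute MSE only over the cells that were made missing are all assumptions for illustration. MICE is left out because it requires a dedicated chained-equations implementation.

```python
import random
import statistics


def make_mcar(values, rate, rng):
    """Copy `values`, setting round(rate * n) uniformly chosen cells to None (MCAR)."""
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def overall_mean_impute(values):
    """OMI: replace every missing cell with the mean of all observed cells."""
    mean = statistics.fmean([v for v in values if v is not None])
    return [mean if v is None else v for v in values]


def group_mean_impute(values, groups):
    """GMI: replace a missing cell with the observed mean of its own group."""
    overall = statistics.fmean([v for v in values if v is not None])
    observed = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed.setdefault(g, []).append(v)
    means = {g: statistics.fmean(vs) for g, vs in observed.items()}
    # Fall back to the overall mean if a group has no observed cells at all.
    return [means.get(g, overall) if v is None else v
            for v, g in zip(values, groups)]


def mse(true_values, masked, imputed):
    """Mean squared error over the cells that were made missing (assumed definition)."""
    errors = [(t - i) ** 2
              for t, m, i in zip(true_values, masked, imputed) if m is None]
    return statistics.fmean(errors) if errors else 0.0


# Hypothetical data: two groups with different means, 100 observations each.
rng = random.Random(2021)
groups = [0] * 100 + [1] * 100
truth = [rng.gauss(50.0 if g == 0 else 70.0, 5.0) for g in groups]

masked = make_mcar(truth, rate=0.20, rng=rng)
mse_omi = mse(truth, masked, overall_mean_impute(masked))
mse_gmi = mse(truth, masked, group_mean_impute(masked, groups))
print(f"MSE (OMI): {mse_omi:.2f}")
print(f"MSE (GMI): {mse_gmi:.2f}")
```

In this toy setting the group means differ, so GMI recovers values closer to the truth than OMI. That matches the direction of the reported result, but the magnitudes say nothing about the paper's actual figures.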

ISSN 0128-7680

e-ISSN 2231-8526

Article ID: JST-2230-2020
