Pertanika Journal

Article ID: JST-2438-2021

A Comprehensive Review of Automated Essay Scoring (AES) Research and Development

Chun Then Lim, Chih How Bong, Wee Sian Wong and Nung Kion Lee

Pertanika Journal of Science & Technology, Volume 29, Issue 3, July 2021

DOI: https://doi.org/10.47836/pjst.29.3.27

Keywords: Attributes, automatic essay scoring, evaluation metrics, framework, human raters, recommendation

Published on: 31 July 2021

Abstract

Automated Essay Scoring (AES) is a service or software that predicts essay grades using a pre-trained computational model. It has attracted considerable research interest in educational institutions because it speeds up grading and reduces the effort of human raters while producing scores close to human decisions. Despite this strong appeal, its implementation varies widely according to researchers’ preferences. This critical review examines AES development milestones, focusing on the methodologies and attributes used to derive essay scores. To generalize existing AES systems according to their constructs, we fit them into three frameworks: content similarity, machine learning, and hybrid. In addition, we present and compare common evaluation metrics for measuring the efficiency of AES and propose Quadratic Weighted Kappa (QWK) as the standard evaluation metric, since it corrects for agreement occurring purely by chance when estimating the degree of agreement between two raters. In conclusion, the paper proposes the hybrid framework as the potential standard for upcoming AES systems, as it is capable of aggregating both style and content to predict essay grades. Thus, the main objective of this study is to discuss critical issues in the current development of AES, which yield our recommendations for future AES development.
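
To make the chance-correction property of QWK concrete, the following is a minimal sketch of the commonly used definition of the metric, not code taken from the paper. The function name and the `min_rating` and `max_rating` parameters are illustrative assumptions, and the scores are assumed to be integers on a fixed scale. Disagreements are penalized by the squared distance between the two scores, and the observed penalty is divided by the penalty expected if the two raters scored independently.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic Weighted Kappa between two equal-length lists of integer scores.

    Illustrative sketch: the name and signature are assumptions, not an API
    described in the article.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_rating - min_rating + 1  # number of score categories
    if n < 2:
        raise ValueError("the score scale needs at least two categories")
    total = len(rater_a)

    # Observed joint counts and each rater's marginal histogram.
    observed = [[0] * n for _ in range(n)]
    hist_a = [0] * n
    hist_b = [0] * n
    for a, b in zip(rater_a, rater_b):
        i, j = a - min_rating, b - min_rating
        observed[i][j] += 1
        hist_a[i] += 1
        hist_b[j] += 1

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0.0:
        # Both raters gave every essay the same single score. Kappa is
        # strictly undefined here; this sketch reports full agreement.
        return 1.0
    return 1.0 - numerator / denominator


human = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human, [1, 2, 3, 4], 1, 4))  # identical scores
print(quadratic_weighted_kappa(human, [4, 3, 2, 1], 1, 4))  # reversed scores
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.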
