Home / Regular Issue / JST Vol. 29 (3) Jul. 2021 / JST-2438-2021


A Comprehensive Review of Automated Essay Scoring (AES) Research and Development

Chun Then Lim, Chih How Bong, Wee Sian Wong and Nung Kion Lee

Pertanika Journal of Science & Technology, Volume 29, Issue 3, July 2021

DOI: https://doi.org/10.47836/pjst.29.3.27

Keywords: Attributes, automatic essay scoring, evaluation metrics, framework, human raters, recommendation

Published on: 31 July 2021

Automated Essay Scoring (AES) is a service or software that can predictively grade essay based on a pre-trained computational model. It has gained a lot of research interest in educational institutions as it expedites the process and reduces the effort of human raters in grading the essays as close to humans’ decisions. Despite the strong appeal, its implementation varies widely according to researchers’ preferences. This critical review examines various AES development milestones specifically on different methodologies and attributes used in deriving essay scores. To generalize existing AES systems according to their constructs, we attempted to fit all of them into three frameworks which are content similarity, machine learning and hybrid. In addition, we presented and compared various common evaluation metrics in measuring the efficiency of AES and proposed Quadratic Weighted Kappa (QWK) as standard evaluation metric since it corrects the agreement purely by chance when estimate the degree of agreement between two raters. In conclusion, the paper proposes hybrid framework standard as the potential upcoming AES framework as it capable to aggregate both style and content to predict essay grades Thus, the main objective of this study is to discuss various critical issues pertaining to the current development of AES which yielded our recommendations on the future AES development.

  • Alghamdi, M., Alkanhal, M., Al-Badrashiny, M., Al-Qabbany, A., Areshey, A., & Alharbi, A. (2014). A hybrid automatic scoring system for Arabic essays. AI Communications, 27(2), 103-111. https://doi.org/10.3233/aic-130586

  • Alikaniotis, D., Yannakoudakis, H., & Rei, M. (2016). Automatic text scoring using neural networks. In K. Erk & N. A. Smith (Eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 715-725). Association for Computational Linguistics. https://doi.org/10.18653/v1/p16-1068

  • Al-Jouie, M., & Azmi, A. (2017). Automated evaluation of school children essays in Arabic. Procedia Computer Science, 117, 19-22. https://doi.org/10.1016/j.procs.2017.10.089

  • Amalia, A., Gunawan, D., Fithri, Y., & Aulia, I. (2019). Automated Bahasa Indonesia essay evaluation with latent semantic analysis. Journal of Physics: Conference Series, 1235, Article 012100. https://doi.org/10.1088/1742-6596/1235/1/012100

  • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. Journal of Technology, Learning, and Assessment, 4(3), 1-29.

  • Awaida, S. A., Shargabi, B. A., & Rousan, T. A. (2019). Automated Arabic essays grading system based on F-score and Arabic wordnet. Jordanian Journal of Computers and Information Technology (JJCIT), 5(3), 170-180. https://doi.org/10.5455/jjcit.71-1559909066

  • Chen, H., & He, B. (2013). Automated essay scoring by maximizing human-machine agreement. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1741-1752). Association for Computational Linguistics.

  • Chen, M., & Li, X. (2018). Relevance-based automated essay scoring via hierarchical recurrent model. In 2018 International Conference on Asian Language Processing (IALP) (pp. 378-383). IEEE Conference Publication. https://doi.org/10.1109/ialp.2018.8629256

  • Chen, Z., & Zhou, Y. (2019). Research on automatic essay scoring of composition based on CNN and OR. In 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD) (pp. 13-18). IEEE Conference Publication. https://doi.org/10.1109/icaibd.2019.8837007

  • Cheon, M., Seo, H. W., Kim, J. H., Noh, E. H., Sung, K. H., & Lim, E. (2015). An automated scoring tool for Korean supply-type items based on semi-supervised learning. In Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications (pp. 59-63). Association for Computational Linguistics and Asian Federation of Natural Language Processing. https://doi.org/10.18653/v1/w15-4409

  • Contreras, J. O., Hilles, S., & Abubakar, Z. B. (2018). Automated essay scoring with ontology based on text mining and nltk tools. In 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE) (pp. 1-6). IEEE Conference Publication. https://doi.org/10.1109/icscee.2018.8538399

  • Darus, S., Stapa, S. H., & Hussin, S. (2003). Experimenting a computer-based essay marking system at Universiti Kebangsaan Malaysia. Jurnal Teknologi, 39(E), 1-18. https://doi.org/10.11113/jt.v39.472

  • Davis, B. (2014). Essay grading computer mistakes gibberish for genius. Retrieved August 28, 2020, from http://www.realclear.com/tech/2014/04/29/essay_grading_computer_mistakes_gibberish_for_genius_6784.html

  • Dikli, S. (2006). An overview of automated scoring of essays. Journal of Technology, Learning, and Assessment (JTLA), 5(1), 1-35.

  • Dong, F., & Zhang, Y. (2016). Automatic features for essay scoring–an empirical study. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 1072-1077). Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1115

  • Elliot, S. M. (2003). IntelliMetric: From here to validity. In M. D., Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 71-86). Routledge

  • Fazal, A., Dillon, T., & Chang, E. (2011). Noise reduction in essay datasets for automated essay grading. In OTM Confederated International Conferences On the Move to Meaningful Internet Systems (pp. 484-493). Springer. https://doi.org/10.1007/978-3-642-25126-9_60

  • Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The intelligent essay assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2), 939-944.

  • Ghosh, S., & Fatima, S. S. (2008). Design of an automated essay grading (AEG) system in Indian context. In TENCON 2008-2008 IEEE Region 10 Conference (pp. 1-6). IEEE Conference Publication. https://doi.org/10.1109/tencon.2008.4766677

  • Greene, P. (2018). Automated essay scoring remains an empty dream. Retrieved September 8, 2020, from Forbes: https://www.forbes.com/sites/petergreene/2018/07/02/automated-essay-scoring-remains-an-empty-dream/#4474e4f74b91

  • Hussein, M. A., Hassan, H., & Nassef, M. (2019). Automated language essay scoring systems: A literature review. PeerJ Computer Science, 5, Article e208 https://doi.org/10.7287/peerj.preprints.27715v1

  • Imaki, J., & Ishihara, S. (2013). Experimenting with a Japanese automated essay scoring system in the L2 Japanese environment. Papers in Language Testing and Assessment, 2(2), 28-47.

  • Ishioka, T., & Kameda, M. (2006). Automated Japanese essay scoring system based on articles written by experts. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (pp. 233-240). Association for Computational Linguistics. https://doi.org/10.3115/1220175.1220205

  • Islam, M. M., & Hoque, A. L. (2013). Automated Bangla essay scoring system: ABESS. In 2013 International Conference on Informatics, Electronics and Vision (ICIEV) (pp. 1-5). IEEE Conference Publication. https://doi.org/10.1109/iciev.2013.6572694

  • Jin, C., & He, B. (2015). Utilizing latent semantic word representations for automated essay scoring. In 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom) (pp. 1101-1108). IEEE Conference Publication. https://doi.org/10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.202

  • Jin, C., He, B., Hui, K., & Sun, L. (2018). TDNN: a two-stage deep neural network for prompt-independent automated essay scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1088-1097). Association for Computational Linguistics. https://doi.org/10.18653/v1/p18-1100

  • Kakkonen, T., Myller, N., Timonen, J., & Sutinen, E. (2005). Automatic essay grading with probabilistic latent semantic analysis. In Proceedings of the second workshop on Building Educational Applications Using NLP (pp. 29-36). Association for Computational Linguistics.

  • Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259-284.

  • Landauer, T. K., Laham, D., & Foltz, P. W. (2000). The intelligent essay assessor. IEEE Intelligent Systems, 15, 27-31.

  • Liang, G., On, B. W., Jeong, D., Kim, H. C., & Choi, G. S. (2018). Automated essay scoring: A siamese bidirectional LSTM neural network architecture. Symmetry, 10(12), Article 682. https://doi.org/10.3390/sym10120682

  • Loraksa, C., & Peachavanish, R. (2007). Automatic Thai-language essay scoring using neural network and latent semantic analysis. In First Asia International Conference on Modelling & Simulation (AMS’07) (pp. 400-402). IEEE Conference Publication. https://doi.org/10.1109/ams.2007.19

  • Malaysian Examinations Council. (2014). Malaysian university English test (MUET): Regulations, test specifications, test format and sample questions. Retrieved March 15, 2021, from https://www.mpm.edu.my/images/dokumen/calon-peperiksaan/muet/regulation/Regulations_Test_Specifications_Test_Format_and_Sample_Questions.pdf

  • Measurement Incorporated. (2020). Automated essay scoring. Retrieved June 4, 2020, from https://www.measurementinc.com/products-services/automated-essay-scoring

  • Nguyen, H., & Dery, L. (2016). Neural networks for automated essay grading. CS224d Stanford Reports, 1-11.

  • Omar, N., & Mezher, R. (2016). A hybrid method of syntactic feature and latent semantic analysis for automatic Arabic essay scoring. Journal of Applied Sciences, 16(5), 209-215. https://doi.org/10.3923/jas.2016.209.215

  • Ong, D. A., Razon, A. R., Guevara, R. C., & Prospero C. Naval, J. (2011, November 24-25). Empirical comparison of concept indexing and latent semantic indexing on the content analysis of Filipino essays. In Proceedings of the 8th National Natural Language Processing Research Symposium (pp. 40-45). De La Salle University, Manila.

  • Page, E. B. (1966). The imminence of... grading essays by computer. Phi Delta Kappan, 47(5), 238-243.

  • Page, E. (2003). Project essay grade: PEG. In M. Shermis, & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 43-54). Lawrence Erlbaum Associates Publishers.

  • Pai, K. C., Lu, Y., & Kuo, B. C. (2017). Developing Chinese automated essay scoring model to assess college students’ essay quality. In Proceedings of the 10th International Conference on Educational Data Mining (pp. 430-432).

  • Pearson. (2010). Intelligent essay assessor (IEA)™ fact sheet. Pearson Education. Retrieved June 4, 2020, from https://images.pearsonassessments.com/images/assets/kt/download/IEA-FactSheet-20100401.pdf

  • Peng, X., Ke, D., Chen, Z., & Xu, B. (2010). Automated Chinese essay scoring using vector space models. In 2010 4th International Universal Communication Symposium (pp. 149-153). IEEE Conference Publication. https://doi.org/10.1109/IUCS.2010.5666229

  • Phandi, P., Chai, K. M. A., & Ng, H. T. (2015). Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 431-439). Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1049

  • Ramalingam, V. V., Pandian, A., Chetry, P., & Nigam, H. (2018). Automated essay grading using machine learning algorithm. In Journal of Physics: Conference Series (Vol. 1000, No. 1, p. 012030). IOP Publishing. https://doi.org/10.1088/1742-6596/1000/1/012030

  • Ramineni, C., & Williamson, D. (2018). Understanding mean score differences between the e‐rater® automated scoring engine and humans for demographically based groups in the GRE® general test. ETS Research Report Series, 2018(1), 1-31. https://doi.org/10.1002/ets2.12192

  • Ratna, A. A. P., Budiardjo, B., & Hartanto, D. (2007). SIMPLE: System automatic essay assessment for Indonesian language subject examination. Makara Journal of Technology, 11(1), 5-11.

  • Ratna, A. A. P., Purnamasari, P. D., & Adhi, B. A. (2015). SIMPLE-O, the Essay grading system for Indonesian Language using LSA method with multi-level keywords. In The Asian Conference on Society, Education & Technology 2015 (pp. 155-164). The International Academic Forum.

  • Ratna, A. A. P., Arbani, A. A., Ibrahim, I., Ekadiyanto, F. A., Bangun, K. J., & Purnamasari, P. D. (2018). Automatic essay grading system based on latent semantic analysis with learning vector quantization and word similarity enhancement. In Proceedings of the 2018 International Conference on Artificial Intelligence and Virtual Reality (pp. 120-126). Association for Computing Machinery. https://doi.org/10.1145/3293663.3293684

  • Ratna, A. A. P., Kaltsum, A., Santiar, L., Khairunissa, H., Ibrahim, I., & Purnamasari, P. D. (2019a). Term frequency-inverse document frequency answer categorization with support vector machine on automatic short essay grading system with latent semantic analysis for japanese language. In 2019 International Conference on Electrical Engineering and Computer Science (ICECOS) (pp. 293-298). IEEE Conference Publication. https://doi.org/10.1109/ICECOS47637.2019.8984530

  • Ratna, A. A. P., Khairunissa, H., Kaltsum, A., Ibrahim, I., & Purnamasari, P. D. (2019b). Automatic essay grading for Bahasa Indonesia with support vector machine and latent semantic analysis. In 2019 International Conference on Electrical Engineering and Computer Science (ICECOS) (pp. 363-367). IEEE Conference Publication. https://doi.org/10.1109/ICECOS47637.2019.8984528

  • Ratna, A. A. P., Santiar, L., Ibrahim, I., Purnamasari, P. D., Luhurkinanti, D. L., & Larasati, A. (2019c). Latent semantic analysis and winnowing algorithm based automatic Japanese short essay answer grading system comparative performance. In 2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST) (pp. 1-7). IEEE Conference Publication. https://doi.org/10.1109/ICAwST.2019.8923226

  • Rudner, L., & Gagne, P. (2001). An overview of three approaches to scoring written essays by computer. Practical Assessment, Research & Evaluation, 7, Article 26.

  • Sendra, M., Sutrisno, R., Harianata, J., Suhartono, D., & Asmani, A. B. (2016). Enhanced latent semantic analysis by considering mistyped words in automated essay scoring. In 2016 International Conference on Informatics and Computing (ICIC) (pp. 304-308). IEEE Conference Publication. https://doi.org/10.1109/IAC.2016.7905734

  • Shehab, A., Faroun, M., & Rashad, M. (2018). An automatic Arabic essay grading system based on text similarity Algorithms. International Journal of Advanced Computer Science and Applications, 9(3), 263-268. https://doi.org/10.14569/IJACSA.2018.090337

  • Shermis, M. D., & Burstein, J. (2003). Automated essay scoring: A cross-disciplinary perspective. Lawrence Erlbaum Associates Publishers.

  • Shermis, M. D., Burstein, J., Higgins, D., & Zechner, K. (2010). Automated essay scoring: Writing assessment and instruction. International Encyclopedia of Education, 4(1), 20-26. https://doi.org/10.1016/B978-0-08-044894-7.00233-5

  • Sim, J., & Wright, C. C. (2005). The Kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Physical Therapy, 85(3), 257-268.

  • Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 1882-1891). Association for Computational Linguistics. https://doi.org/10.18653/v1/d16-1193

  • Tanha, J., Abdi, Y., Samadi, N., Razzaghi, N., & Asadpour, M. (2020). Boosting methods for multi-class imbalanced data classification: an experimental review. Journal of Big Data, 7(1), 1-47. https://doi.org/10.1186/s40537-020-00349-y

  • Vantage Learning. (2005). How IntelliMetricTM works. Retrieved June 4, 2020, from http://www.vantagelearning.com/docs/intellimetric/IM_How_IntelliMetric_Works.pdf

  • Vantage Learning. (2020). IntelliMetric®: Frequently asked questions. Retrieved June 4, 2020, from http://www.vantagelearning.com/products/intellimetric/faqs/#LongUsed

  • Wong, W. S., & Bong, C. H. (2019). A study for the development of automated essay scoring (AES) in Malaysian English test environment. International Journal of Innovative Computing, 9(1), 69-78. https://doi.org/10.11113/ijic.v9n1.220

  • Xu, Y., Ke, D., & Su, K. (2017). Contextualized latent semantic indexing: A new approach to automated Chinese essay scoring. Journal of Intelligent Systems, 26(2), 263-285. https://doi.org/10.1515/jisys-2015-0048

  • Zupanc, K., & Bosnic, Z. (2015). Advances in the field of automated essay evaluation. Informatica, 4(39), 383-396.

ISSN 0128-7680

e-ISSN 2231-8526

Article ID


Download Full Article PDF

Share this article

Recent Articles