A Comparative Study of RNA-Seq Aligners Reveals Novoalign’s Default Setting as an Optimal Setting for the Alignment of HeLa RNA-Seq Reads

Kristine Sandra Pey Adum and Hasni Arsad

Pertanika Journal of Tropical Agricultural Science, Volume 30, Issue 4, October 2022

DOI: https://doi.org/10.47836/pjst.30.4.24

Keywords: Alignment, HISAT2, Novoalign, RNA-Seq, Subread, TopHat

Published on: 28 September 2022

The introduction of RNA-sequencing (RNA-Seq) technology into biological research has encouraged bioinformatics developers to build various analysis pipelines. The choice of pipeline depends mostly on the research goals and the organism of interest, because no single pipeline is optimal for all cases. As the first step in most pipelines, alignment is crucial because it affects all downstream analysis. Each alignment tool has its own default settings and adjustable parameters for maximising its output. This poses a considerable challenge for researchers, who must determine which alignment tool, with which settings, will analyse their samples accurately and efficiently. Therefore, in this study, duplicates of real HeLa RNA-Seq data were used to evaluate the effects of data quality on four commonly used RNA-Seq alignment tools: HISAT2, Novoalign, TopHat and Subread. These data were also used to determine the optimal settings of each aligner for our sample. Each tool's performance was measured in terms of precision, recall, F-measure, false discovery rate, error tolerance, parameter stability, runtime and memory requirements. Our results showed significant differences between the settings of each alignment tool tested. Subread and TopHat performed best with optimised parameter settings. In contrast, HISAT2 and Novoalign performed most reliably with their default settings. Although HISAT2 was the fastest alignment tool, the highest accuracy was achieved using Novoalign with the default setting.
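The accuracy measures named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of precision, recall, F-measure and false discovery rate. The abstract does not state how reads were classified as correctly or incorrectly aligned, so the `AlignmentCounts` class, the `alignment_metrics` function and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentCounts:
    """Hypothetical read counts for one aligner run (illustrative only)."""
    true_positives: int   # reads aligned to the correct location
    false_positives: int  # reads aligned to a wrong location
    false_negatives: int  # reads that should have aligned but did not


def alignment_metrics(counts: AlignmentCounts) -> dict:
    """Compute standard accuracy metrics from read counts.

    These are the usual textbook formulas; the study's exact
    definitions are an assumption here.
    """
    tp = counts.true_positives
    fp = counts.false_positives
    fn = counts.false_negatives

    aligned = tp + fp    # everything the aligner reported
    expected = tp + fn   # everything it should have reported

    # Guard against division by zero for empty inputs.
    precision = tp / aligned if aligned else 0.0
    recall = tp / expected if expected else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    false_discovery_rate = fp / aligned if aligned else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "false_discovery_rate": false_discovery_rate,
    }


if __name__ == "__main__":
    # Made-up counts, not results from the paper.
    example = AlignmentCounts(true_positives=90, false_positives=10, false_negatives=30)
    for name, value in alignment_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions the false discovery rate is the complement of precision, so an aligner that reports fewer wrong alignments improves on both measures at once, while recall captures how many of the expected alignments were recovered.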


ISSN 1511-3701

e-ISSN 2231-8542

Article ID: JST-3272-2021
