A comparison of machine learning algorithms for the prediction of Hepatitis C NS3 protease cleavage sites

Open access


Hepatitis is a global disease on the rise and now causes more deaths each year than the human immunodeficiency virus. As a result, there is an increasing need for antivirals. Effective antivirals have previously been found in the form of substrate-mimetic protease inhibitors. Machine learning has been used to predict the cleavage patterns of viral proteases and so inform future drug design. This study applied and compared several machine learning algorithms on hepatitis C virus NS3 serine protease cleavage data. The results show that differences in sequence-extraction methods can outweigh differences in algorithm choice. Models built from pseudo-coded datasets all achieved high accuracy and outperformed models built from orthogonal-coded datasets. However, no single pseudo-coded model performed significantly better than any other. Evaluation of the performance measures also shows that the correct choice of model scoring system is essential for unbiased model assessment.
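To make the orthogonal coding mentioned above concrete, the following is a minimal sketch of how a peptide window can be turned into an orthogonal (one-hot) feature vector, where each residue becomes a 20-element binary vector with a single 1. It is an illustration under stated assumptions, not the study's actual pipeline: the alphabet ordering, the function name `orthogonal_encode`, and the example window are hypothetical choices made here.

```python
from typing import List

# Assumption: the 20 standard amino acids in alphabetical one-letter order.
# The ordering used in the study is not stated in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def orthogonal_encode(peptide: str) -> List[int]:
    """Return the concatenated one-hot vectors for every residue in `peptide`.

    A peptide of length n yields a vector of length 20 * n that contains
    exactly n ones.
    """
    vector: List[int] = []
    for residue in peptide.upper():
        if residue not in INDEX:
            raise ValueError(f"Unknown amino acid code: {residue!r}")
        bits = [0] * len(AMINO_ACIDS)
        bits[INDEX[residue]] = 1  # single 1 marks this residue's identity
        vector.extend(bits)
    return vector


if __name__ == "__main__":
    # Hypothetical 10-residue window, chosen only for illustration.
    window = "ACDEFGHIKL"
    encoded = orthogonal_encode(window)
    print(len(encoded), sum(encoded))
```

Because every pair of distinct residues is equally far apart in this representation, orthogonal coding carries no information about physicochemical similarity between amino acids, which is one plausible reason an alternative coding could behave differently.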


