Hepatitis C virus (HCV) is a member of the Flaviviridae family, alongside yellow fever virus and West Nile virus (1). Viral hepatitis is a global disease that caused 1.34 million deaths in 2015, more than the number of deaths caused by HIV. It is estimated that 1.75 million people newly acquire HCV infection each year (2).
Chronic HCV infection is the main reason for liver transplantation worldwide. Infection can lead to severe liver disease and primary liver cancer (2). Since 2014, several direct-acting antivirals (DAAs) have been approved that target specific HCV proteins or RNA elements (3). Prior to this, treatment consisted of general-use antivirals, such as ribavirin and pegylated interferon-α. These treatments were often lengthy and caused many adverse side effects (4). Detailed information about HCV replication components enabled the development of DAAs.
The HCV RNA genome encodes a long polyprotein precursor which is processed proteolytically. The release of non-structural (NS) proteins is vital for the maturation of the virus. Cleavage of NS proteins is catalysed by the virally encoded NS3 serine protease (NS3P) (1). Because of the protease’s importance in the life cycle of the virus, it has become an attractive antiviral target. Inhibition of the protease is effective and can lead to the production of non-infectious viral particles (5). Therefore, the design of NS3P inhibitors has received much attention and several of these DAAs have now been discovered.
Potent inhibitors have been found with peptide-bond mimetic (6, 7, 8) or substrate mimetic properties (9, 10, 11, 12). As a result, a number of inhibitor-NS3P complex crystal structures have been obtained (13, 14, 15, 16). These provide insightful structural information increasing the understanding of the protease’s cleavage mechanisms, aiding the development of therapeutics. This information is of importance when analysing peptide-bond mimetics as they may interact differently with the protease than standard substrate mimetics. Although crystal structures provide high quality spatial information, they are generally low-throughput and high cost (17).
The design of effective substrate mimetic NS3P inhibitors can be aided by the prediction of HCV cleavage, as cleavable substrates can form the template for inhibitor molecules. The prediction and characterisation of viral protease cleavage sites have been addressed by several in silico studies on different viruses. The main tools used in these studies incorporated machine learning algorithms to analyse viral datasets. Supervised learning is a class of machine learning algorithms that build predictive models from datasets with known classifications. These models can then be used to classify new, unknown datasets (18).
There have been many successful studies in which machine learning algorithms have been used to identify the substrate specificity of the human immunodeficiency virus (HIV-1) protease. A wide range of supervised algorithms exists and several of them have been applied to predict the substrate specificity of proteases. The most recent studies tackling the HIV-1 protease cleavage problem commonly use four types of classifiers: artificial neural networks (ANNs), support vector machines (SVMs), decision trees and linear models. Within these studies ANNs have outperformed many other models, with most studies able to obtain an accuracy of ~92% (19, 20, 21). Although the predictive accuracies of ANNs are high, they have come under criticism for their longer run times and limited interpretability when compared to other models (22). A number of studies have compared a handful of classifiers against one another to see which performs best using HIV-1 data (19, 20, 22). In turn, this information has helped the development of HIV-1 protease inhibitors. Currently, almost half of all anti-HIV compounds are protease inhibitors (23).
As mentioned, a large number of machine learning algorithms are available in the bioinformatic toolbox to predict cleavage sites of viral proteases. Choosing the correct method is essential for accurate predictions. Previously, this information has been useful for the design of antivirals. This study aims to apply and compare several machine learning algorithms on an HCV NS3P dataset and to see whether differences in sequence-data transformation and model selection improve prediction accuracies based on three performance metrics.
All peptides containing non-standard amino acids were removed from the dataset obtained by Narayanan et al. (21), and the resulting modified dataset was used for this study. The dataset contained a collection of decapeptides and their cleavage status, either cleaved or non-cleaved, denoted by 1 or 0 respectively. Out of the 891 peptides collected, 145 are classified as cleaved and 746 as non-cleaved. The amino acids of each peptide were arranged following standard Schechter and Berger nomenclature: P6-P5-P4-P3-P2-P1-P’1-P’2-P’3-P’4, where cleavage occurs at the scissile bond between P1 and P’1 (24).
Sequence-based feature extraction
Two sequence-based feature extraction methods were implemented to convert each peptide into a numerical feature vector which accurately stores the composition of amino acids. The selected methods were orthogonal (ortho) coding and pseudo coding. Ortho coding created a vector which represents each amino acid by a 20-bit long binary sequence. Pseudo coding created a vector by calculating the frequency of each amino acid at each position. Ortho and pseudo coding were applied to the modified dataset to produce two new datasets. Both the ortho and pseudo coded datasets were used in the study for all machine learning algorithms.
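The two coding schemes can be illustrated with a minimal Python sketch. It is not the code used in the study (which was carried out in R); the function names and the three example peptides are hypothetical, and the pseudo coding shown here assumes that each residue is replaced by its relative frequency at that position across the supplied peptide set.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ortho_encode(peptide):
    """Orthogonal coding: one 20-bit indicator block per residue."""
    vector = []
    for residue in peptide:
        block = [0] * len(AMINO_ACIDS)
        block[AMINO_ACIDS.index(residue)] = 1
        vector.extend(block)
    return vector


def position_frequencies(peptides):
    """Relative frequency of each residue at each position of the peptide set."""
    tables = []
    for pos in range(len(peptides[0])):
        counts = Counter(p[pos] for p in peptides)
        tables.append({aa: n / len(peptides) for aa, n in counts.items()})
    return tables


def pseudo_encode(peptide, tables):
    """Pseudo coding (assumed form): one observed frequency per position."""
    return [tables[pos].get(residue, 0.0) for pos, residue in enumerate(peptide)]


# Hypothetical decapeptides, for illustration only (not taken from the dataset).
peptides = ["DEAEACSAAA", "DAAVACAGAA", "GAAAAASAAA"]
tables = position_frequencies(peptides)
ortho_vector = ortho_encode(peptides[0])            # 10 positions x 20 bits
pseudo_vector = pseudo_encode(peptides[0], tables)  # 10 frequencies
print(len(ortho_vector), len(pseudo_vector))
```

Under these assumptions a decapeptide becomes a 200-element binary vector under ortho coding and a 10-element frequency vector under pseudo coding, which is the difference in dimensionality between the two datasets.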
Machine learning algorithms
Several machine learning algorithms were applied to predict HCV protease specificity, including three ANNs (25), random forest (RF) (26), a generalised linear model (GLM) (27), linear discriminant analysis (LDA) (28) and an SVM (29). ANNs are non-parametric models that can detect non-linear interactions between independent and dependent variables. ANNs pass variables through a set of interconnected nodes, arranged in hidden layers, with specific weights to determine their output variable (classification) (30). Three ANN model packages were used in this study: “darch”, “h2o” and “elmNN”. The first two packages can produce models with multiple hidden layers, whereas the latter uses a fixed single hidden layer. There are now a number of open-source multilayer ANN packages to choose from. The two used in this study were chosen for their ease of use and the high performance seen in other studies (31). RF was used as it is not influenced by linearity; it assesses the outcome of a set of decision trees to classify data (32). The RF model was created with the “randomForest” package. GLM, from the “stats” package, is a logistic regression model that transforms data into independent linear variables (33). LDA attempts to project raw data from a high-dimensional space to a univariate space; it is modelled on the principles of Fisher’s discriminant analysis. An LDA model was produced with the “MASS” package. The last model, SVM, uses a kernel function to map data into a high-dimensional space and finds the optimal hyperplane to classify data (34). The package “e1071” was used to create the SVM model.
As shown, each model used classifies data variables using different mathematical properties. This range of algorithms has been used extensively in biological research and provides rationale for the side-by-side comparison of all seven machine learning models.
Machine learning packages were installed from the CRAN repository and run in RStudio. Default parameters were used for all models except darch, h2o, elmNN and RF. The ANN models required an optimised number of nodes and layers. The number of epochs was kept constant (100). The optimised numbers of nodes and layers were determined using a general rule of thumb, in which the number of hidden nodes is no greater than double the number of input nodes (35); owing to limited computer processing power, the number of hidden layers was restricted to two. Both darch and h2o performed best using two layers. The number of nodes in each layer is summarised in Table 1. The optimised number of hidden nodes was 19 for ortho-elmNN and 17 for pseudo-elmNN.
Table 1. ANN optimised nodes
|Model|Optimal Hidden Layer Nodes: Layer 1|Optimal Hidden Layer Nodes: Layer 2|
The number of decision trees used by RF was optimised over the range 1 to 500. The optimal numbers of trees for ortho-RF and pseudo-RF were found to be 107 and 99 respectively.
Prior to modelling, the data were split by 5-fold cross-validation to produce training and testing datasets. The percentage of cleaved peptides in each fold was standardised at 16%, representative of the whole dataset (145 of 891 peptides). Cross-validation of this style overcomes the bias of training the model predominantly on either negative or positive data (36). Correct assessment of model performance is critical for determining an algorithm’s predictive power. Therefore, this study proposes the use of three different evaluators: receiver operating characteristic (ROC) curves (37), precision-recall (PR) curves (38) and the Matthews correlation coefficient (MCC) (39). These evaluative measures are based on confusion matrix results, which give true-positive (TP), false-positive (FP), true-negative (TN) and false-negative (FN) values. ROC curves use the FP rate as their x-axis and the TP rate as their y-axis, whereas PR curves use recall (x-axis) and precision (y-axis). Values for these curves and MCC were calculated as below:
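In terms of the confusion-matrix counts, the standard definitions of these quantities (assumed here to be the ones applied) are:

```latex
\begin{align}
\text{TP rate} = \text{recall} &= \frac{TP}{TP + FN} \\
\text{FP rate} &= \frac{FP}{FP + TN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
               {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```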
The area under the curve (AUC) was used as a descriptive value for ROC and PR curves. MCC values were used for the fine-tuning of parameters in the ANN and RF models. Many studies using protease data assess the quality of their models based on ROC-AUC values and accuracy. ROC is a useful tool for determining the robustness of a model by varying the discrimination threshold for prediction values. This provides more information than accuracy alone. However, most protease datasets are imbalanced. It is common to find a larger number of negative, non-cleaved peptides than positive, cleaved peptides. The weakness of ROC stems from this imbalance: the FP rate is calculated relative to the large number of negative peptides, so the many easily predicted TNs keep it low even when a model produces a substantial number of FPs. As a result, ROC-AUC values can be overly optimistic.
PR curves tackle this imbalance by focusing on the correctly classified positive values; they do not directly consider the TNs, which are not of primary importance to this study or to previous studies. For this reason, PR curves are more informative when datasets have few positive instances but many negative instances. PR curves work in a similar fashion to ROC curves in that they vary the discrimination threshold.
Because the datasets are imbalanced, it is possible to build a model mainly on negative instances. As a result, these models can predict TNs at a greater rate than TPs, which in turn can produce high accuracy scores. MCC takes into account the relative sizes of all four cells of the confusion matrix, which accuracy alone does not. As a result, the MCC score is only high when the classifier is able to correctly predict both positive and negative elements at a high level (40). Owing to its unbiased nature, it is a common metric used by a US FDA initiative for predictive model consensus (41). For these reasons the MCC values were used for optimisation and further significance testing. The Shapiro-Wilk test, kurtosis test, median and mean were used to check that the data obtained across the five folds of cross-validation were normally distributed before using parametric t-tests and ANOVA (42, 43, 44, 45). To determine whether the judgement of model performance differs between evaluators, Spearman’s rank correlation was applied to the order of performance given by ROC-AUC, PR-AUC and MCC values (46).
Analysis of the performance metrics obtained in the experiments shows that applying a pseudo- or ortho-coded dataset to a classifier greatly affects the model’s performance. Fig. 1 shows the performance of pseudo- and ortho-models. The pseudo-coded dataset produced models with higher accuracies than their ortho-coded counterparts. The performance of ortho-coded models also varied more between algorithms than that of pseudo-coded models, as can be seen clearly in Fig. 1. In contrast, a higher variance in model performance was observed across the five folds of cross-validation in pseudo-models compared to their orthogonal counterparts, as seen in Fig. 2.
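The contrast between the two curves on imbalanced data can be illustrated with a minimal Python sketch. The labels and scores are invented for illustration, the function names are hypothetical, and PR-AUC is estimated here by average precision (a step-wise sum), which is one common convention and not necessarily the one used in the study.

```python
def curve_points(labels, scores):
    """Sweep the decision threshold from high to low; return ROC and PR points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    roc = [(0.0, 0.0)]  # (FP rate, TP rate)
    pr = []             # (recall, precision)
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all instances that share the same score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        roc.append((fp / neg, tp / pos))
        pr.append((tp / pos, tp / (tp + fp)))
    return roc, pr


def roc_auc(roc):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(roc, roc[1:]))


def average_precision(pr):
    """Step-wise PR-AUC estimate: precision weighted by the gain in recall."""
    area, previous_recall = 0.0, 0.0
    for recall, precision in pr:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Toy imbalanced example: 2 positives (cleaved) among 10 instances.
labels = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
roc, pr = curve_points(labels, scores)
print(roc_auc(roc), average_precision(pr))
```

In this toy case two false positives are ranked above the true positives, yet the ROC-AUC remains 0.8125 because six of the eight negatives are still ranked below both positives, whereas the average precision drops to 0.5.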
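The difference between accuracy and MCC on imbalanced data can be shown with a short Python sketch that uses only the class counts reported for the dataset (145 cleaved, 746 non-cleaved). The classifier that labels every peptide as non-cleaved is hypothetical, and returning 0 when the MCC denominator is zero is a common convention assumed here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): MCC is 0 when a row or column of the matrix is empty.
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


y_true = [1] * 145 + [0] * 746   # class balance reported for the dataset
all_negative = [0] * len(y_true) # hypothetical classifier: never predicts cleavage

counts = confusion_counts(y_true, all_negative)
print(round(accuracy(*counts), 3), mcc(*counts))
```

Such a classifier reaches an accuracy of about 0.837 without identifying a single cleaved peptide, while its MCC is 0.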
Several machine learning algorithms were applied to the two datasets; AUC and MCC scores were used to quantify model performance, which is summarised in Fig. 2. Experimental results show that the ortho-coded RF model (average results: ROC-AUC 0.924, PR-AUC 0.819 and MCC 0.842) outperformed all other ortho-coded models. This was validated by MCC ANOVA analysis (p-value = 7.70 × 10⁻¹³) and by an MCC t-test against the second-best performing ortho-model, SVM (average results: ROC-AUC 0.868, PR-AUC 0.662 and MCC 0.640; p-value = 0.003). Ortho-RF also obtained higher prediction capabilities than its pseudo-coded counterpart, pseudo-RF (average results: ROC-AUC 0.914, PR-AUC 0.809 and MCC 0.828). This was the only ortho-model to obtain higher scores than its pseudo counterpart.
The lowest scoring ortho-models were LDA (average results: ROC-AUC 0.638, PR-AUC 0.235 and MCC 0.294) and GLM (average results: ROC-AUC 0.635, PR-AUC 0.233 and MCC 0.300).
All pseudo-coded models predicted peptide classification with a high degree of accuracy. The highest performance was found in SVM (average results: ROC-AUC 0.972, PR-AUC 0.900 and MCC 0.860) and elmNN (average results: ROC-AUC 0.960, PR-AUC 0.883 and MCC 0.852) models. In contrast to the ortho-coded models, RF performed the worst using a pseudo-coded dataset (average results: ROC-AUC 0.914, PR-AUC 0.809 and MCC 0.828).
Due to the high performance of all models there was no significant difference across the predictions by pseudo-models, validated by MCC ANOVA testing (p-value = 0.981).
Three evaluation measures were applied to all models: ROC-AUC, PR-AUC and MCC. The ranking of the models under these metrics was assessed to see which evaluative measures are consistent with each other. Consistency between metrics means that, regardless of which measure is used, model performance is ranked similarly to the other measures. ROC-AUC ranked model performance analogously to PR-AUC, as validated by Spearman’s rank correlation (ortho-model Rho = 0.964, pseudo-model Rho = 1). Although the scores are not directly comparable, as they measure different predictive qualities, ROC-AUC values were in general higher than PR-AUC and MCC values, as seen in Fig. 2. This was exemplified by the ortho-darch model, which obtained a ROC-AUC value of 0.820 but scored dramatically lower PR-AUC (0.454) and MCC (0.494) values.
The aims of this study were to determine which machine learning algorithms can predict HCV NS3P substrate cleavage sites with high accuracy, using two sequence-based feature extraction methods. Alongside this, the study investigated model evaluation to determine whether the choice of prediction metric affects how accurately model performance is represented.
Results from this study have shown that sequence-data transformation is a limiting factor for high-level model performance. Experimental data show that pseudo-coding enables machine learning models to classify data at a higher accuracy than orthogonal encoding. The large difference in model performance between the two extraction techniques is attributed to the dependency between the training and testing data, and to the dimensionality reduction found in the pattern-based pseudo-coding technique. When pseudo-coded data are split into training and testing sets, the amino acids are still encoded as an observation frequency in the whole dataset, which makes the split datasets dependent on each other. Reducing dimensions within a dataset is extremely useful for machine learning algorithms as it enables similar variables to be replaced by a single instance; in turn this can lead to improved model performance, as long as no important features are lost (47). Therefore, pseudo-coded models greatly outperformed their orthogonal counterparts. However, such large differences between coding techniques have not been seen in other comparative studies on viral datasets (20). These results show that the choice of feature extraction method is imperative for enhanced predictive power.
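The ranking comparison can be sketched in Python using only the five ortho-model averages quoted in the text (RF, SVM, darch, LDA and GLM). Because h2o and elmNN are omitted, the sketch is not expected to reproduce the Rho values reported for all seven models; the function names are hypothetical and the formula assumes no tied values.

```python
def ranks(values):
    """Rank 1 is the highest value; assumes there are no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rank correlation for untied data: 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Average ortho-model scores quoted in the text (subset of five models).
models = ["RF", "SVM", "darch", "LDA", "GLM"]
roc_scores = [0.924, 0.868, 0.820, 0.638, 0.635]
pr_scores = [0.819, 0.662, 0.454, 0.235, 0.233]
mcc_scores = [0.842, 0.640, 0.494, 0.294, 0.300]

rho_roc_pr = spearman_rho(roc_scores, pr_scores)
rho_roc_mcc = spearman_rho(roc_scores, mcc_scores)
print(rho_roc_pr, rho_roc_mcc)
```

For this subset ROC-AUC and PR-AUC give identical rankings, while MCC swaps the order of LDA and GLM, which lowers the correlation with ROC-AUC to 0.9.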
Pseudo-coded models also showed a greater variance in performance than ortho-models, as seen in Fig. 2. Although pseudo-coding reduces dimensionality, which in turn should help to reduce model variance, there was still disparity between model performances across testing sets in cross-validation. As mentioned, the dependency between training and testing sets in pseudo-coding enhances pseudo-model performance. However, this dependency may also be the reason for the higher variance in pseudo-models. Individual testing folds have a higher or lower rate of dependency on their constitutive training folds. As a result, a decreased relation between the training and testing data will reduce the model’s performance. When each testing fold is used in cross-validation, some folds may have a lower dependency. This causes the prediction to be less accurate, creating variance.
With a large repertoire of machine learning algorithms available to biologists, it is important to use the optimal one for the classification task. This importance was shown in this study with the ortho-datasets, where classifier performance varied greatly, demonstrating the significance of correct model choice. Of these, the RF algorithm outperformed all other models under ortho-coding. In contrast, previous studies have shown that decision trees and RF perform with lower accuracies than other algorithms (19, 48). Results from this study show that RF should not be disregarded as a potential candidate for similar pattern recognition tasks.
Overall, pseudo-models showed similar predictive capabilities with moderately high variation across the five folds of cross-validation, making it difficult to compare the pseudo-models. However, Fig. 2 shows that some models have comparable prediction power to others. The two ANN models darch and elmNN performed similarly to each other under both ortho- and pseudo-coding. The third ANN, h2o, showed greater performance than these two when using ortho-code and was indistinguishable from them when using pseudo-code, owing to the high variance between folds. This shows that the choice of specific ANN algorithm can also affect the results of a machine learning task.
GLM and LDA also displayed similar performance to each other in Fig. 2. These models showed the greatest difference between pseudo- and ortho-coding techniques. Ortho-GLM and ortho-LDA were the worst performing orthogonally encoded models, whereas their pseudo counterparts performed as well as the other models. This provides evidence in favour of linear models for machine learning tasks, but only if the data have been pre-processed to a high standard.
The importance of model selection is called into question by pseudo-model performance. No pseudo-model significantly outperformed the other pseudo-models, and all obtained high scores across the three performance metrics used. As a result, the efficiency of the models comes into question, which reinvigorates the idea put forward by Rögnvaldsson and You that if all algorithms work at a high accuracy, the simplest algorithm with the fastest run time should be used (22). With this in mind, the use of ANN models is unnecessary given their slower run times, their need for parameter optimisation and their lack of a significant performance advantage.
When measuring model performance, a variety of metrics can be considered. This study proposed the use of three measures to give a fuller picture of a model’s prediction capabilities. Fig. 2 shows the application of the three performance metrics to each of the models. The three metrics showed little disparity in how they ranked the models. This supports the relationship between ROC and PR, even though the curves and AUC values can differ (49). Although ROC-AUC, PR-AUC and MCC are not directly comparable measurements, ROC-AUC scores were consistently higher than the other metrics (Fig. 2). This means that evaluating model performance on ROC-AUC scores alone can mislead researchers who are less familiar with the workings of ROC curves. The low performance of ortho-models was expressed more clearly in PR-AUC and MCC scores, which in some cases were less than half the corresponding ROC-AUC score. The overly optimistic ROC-AUC values seen in Fig. 2 do not account for the imbalance of the datasets. Model evaluation from ROC-AUC alone can be misleading and uninformative when dealing with imbalanced datasets, as it rewards the model’s ability to predict TNs. As mentioned previously, studies of viral peptide cleavage need to focus more on the identification of cleavable peptides, the TPs. It is this information that is of biological importance for the development of new peptide inhibitors. Scoring systems such as PR-AUC and MCC, when compared with ROC-AUC scores, help to assess whether a model is favouring the prediction of non-cleaved peptides. This investigation shows that using a variety of scoring systems such as PR, MCC and ROC alongside each other can help correctly assess a model’s predictive biases.
To understand the biological significance of this study it is important to analyse the substrates predicted by the models. To this end, a set of reference peptides was chosen for comparison against data obtained from in vitro experiments on HCV NS3P substrate specificity. True-positive predictions from the top performing ortho- and pseudo-models (ortho-RF and pseudo-SVM) were taken, with a discrimination threshold of 0.5, to produce a set of nine reference peptides. The results from both models can be seen in Table 2. The amino acid composition of the reference peptides has been visualised as a WebLogo in Fig. 3 (50). Experimental data have shown that the most important amino acids in substrate peptides are found at positions P1 and P’1, either side of the scissile bond. Enzymatic assays and consensus substrate sequence alignments have previously revealed that the following amino acids are found at each position: Asp or Glu at P6, Cys or Thr at P1 and Ser or Ala at P’1. Of these three key positions, substrate mutations at P1 resulted in a significant decrease in substrate cleavage (4). The reference peptides produced in this investigation support the experimental evidence. Fig. 3 shows that Asp at P6, Cys at P1 and Ser or Ala at P’1 are commonly found in substrate peptides, corroborating the in vitro results. Furthermore, the computational models suggest that Glu or Asp at P5 and Glu or Val at P3 could also be important for substrate cleavage (see Fig. 3). Other studies have shown that acidic residues at P5 and P6 enable the substrate to form electrostatic interactions with the NS3 protease, with the potential to enhance binding (4). These in silico results support this. With this knowledge, it is possible that peptides with similar physicochemical properties to the average substrate model in Fig. 3 could form the basis of inhibitor molecules.
Table 2. Set of reference peptides
In silico studies that can predict the substrate cleavage sites of HCV NS3P can speed up the anti-hepatitis drug development pipeline and reduce experimental costs. The need for new anti-hepatitis drugs is increasingly important as the rate of hepatitis has been rising (2). This study has shown that several machine learning algorithms can be applied to determine substrate cleavage by HCV NS3P. The method of feature extraction was found to matter far more than the choice of algorithm, which indicates that more emphasis should be placed on pre-modelling techniques than on the models themselves. It has also been shown that the use of ROC-AUC scoring as the main indicator of model performance can hide model biases towards the correct prediction of non-cleaved peptides. This information can aid future studies on viral proteases by highlighting the importance of data transformations, model selection and model assessment. In future work, physicochemical and structural features should be combined with sequence information, as these combinatorial feature approaches have been seen to enhance model accuracy in the prediction of HIV protease cleavage sites (20). Improved data coding techniques should also be applied, as the results from this study show that feature selection and extraction, rather than model selection, are the limiting factors. It is also hoped that multiple scoring measures will be applied to provide transparency about models’ predictive capabilities.
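The comparison with the experimentally derived consensus can be sketched in Python as a simple position check. The consensus residues are those stated above (Asp or Glu at P6, Cys or Thr at P1, Ser or Ala at P’1); the function names and both example peptides are hypothetical and are not taken from Table 2.

```python
# Schechter and Berger positions of a decapeptide, N-terminal to C-terminal.
POSITIONS = ["P6", "P5", "P4", "P3", "P2", "P1", "P'1", "P'2", "P'3", "P'4"]

# Consensus residues reported from in vitro work, as quoted in the text.
CONSENSUS = {"P6": {"D", "E"}, "P1": {"C", "T"}, "P'1": {"S", "A"}}


def label_positions(peptide):
    """Map each residue of a decapeptide to its position label."""
    if len(peptide) != len(POSITIONS):
        raise ValueError("expected a decapeptide")
    return dict(zip(POSITIONS, peptide))


def consensus_matches(peptide):
    """Report, for each key position, whether the residue fits the consensus."""
    residues = label_positions(peptide)
    return {pos: residues[pos] in allowed for pos, allowed in CONSENSUS.items()}


# Hypothetical peptides, for illustration only.
matching = consensus_matches("DEAEACSAAA")
non_matching = consensus_matches("GAAAAKGAAA")
print(matching)
print(non_matching)
```

A check of this kind only tests agreement at the three key positions; it does not capture the additional preferences at P5 and P3 suggested by the models.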
I would like to thank Dr Ron Yang for being my mentor throughout the project and the University of Exeter.
The research data supporting this publication are openly available from GitHub at: https://github.com/harrychown/hcvp/
Tong L. Viral Proteases. Chem Rev. 2002;102(12):4609–26.
WHO. Global hepatitis report. 2017.
Zopf S, Kremer AE, Neurath MF, Siebler J. Advances in hepatitis C therapy: What is the current state - what comes next? World J Hepatol. 2016 Jan;8(3):139–47.
Lin C. HCV NS3-4A Serine Protease. In: Hepatitis C Viruses: Genomes and Molecular Biology. 1st ed. Norfolk: Horizon Bioscience; 2006. p. 163–206.
Chambers TJ, Weir RC, Grakoui A, McCourt DW, Bazan JF, Fletterick RJ, et al. Evidence that the N-terminal domain of nonstructural protein NS3 from yellow fever virus is a serine protease responsible for site-specific cleavages in the viral polyprotein. Proc Natl Acad Sci U S A. 1990 Nov;87(22):8898–902.
Colarusso S, Gerlach B, Koch U, Muraglia E, Conte I, Stansfield I, et al. Evolution, synthesis and SAR of tripeptide α-ketoacid Inhibitors of the hepatitis C virus NS3/NS4A serine protease. Bioorg Med Chem Lett. 2002;12(4):705–8.
Sheng XC, Pyun H-J, Chaudhary K, Wang J, Doerffler E, Fleury M, et al. Discovery of novel phosphonate derivatives as hepatitis C virus NS3 protease inhibitors. Bioorg Med Chem Lett. 2009;19(13):3453–7.
Venkatraman S, Wu W, Prongay A, Girijavallabhan V, George Njoroge F. Potent inhibitors of HCV-NS3 protease derived from boronic acids. Bioorg Med Chem Lett. 2009;19(1):180–3.
Lamarre D, Anderson PC, Bailey M, Beaulieu P, Bolger G, Bonneau P, et al. An NS3 protease inhibitor with antiviral effects in humans infected with hepatitis C virus. Nature. 2003 Oct 26;426:186.
Kwo PY, Lawitz EJ, McCone J, Schiff ER, Vierling JM, Pound D, et al. Efficacy of boceprevir, an NS3 protease inhibitor, in combination with peginterferon alfa-2b and ribavirin in treatment-naive patients with genotype 1 hepatitis C infection (SPRINT-1): an open-label, randomised, multicentre phase 2 trial. Lancet. 2010;376(9742):705–16.
Sing WT, Lee CL, Yeo SL, Lim SP, Sim MM. Arylalkylidene rhodanine with bulky and hydrophobic functional group as selective HCV NS3 protease inhibitor. Bioorg Med Chem Lett. 2001;11(2):91–4.
Venkatraman S, Bogen SL, Arasappan A, Bennett F, Chen K, Jao E, et al. Discovery of (1R,5S)-N-[3-Amino-1-(cyclobutylmethyl)-2,3-dioxopropyl]- 3-[2(S)-[[[(1,1-dimethylethyl)amino]carbonyl]amino]-3,3-dimethyl-1-oxobutyl]- 6,6-dimethyl-3-azabicyclo[3.1.0] hexan-2(S)-carboxamide (SCH 503034), a Selective, Potent, Orally Bioavailable Hepatitis C Virus NS3 Protease Inhibitor: A Potential Therapeutic Agent for the Treatment of Hepatitis C Infection. J Med Chem. 2006;49(20):6074–86.
Li X, Zhang Y-K, Liu Y, Ding CZ, Li Q, Zhou Y, et al. Synthesis and evaluation of novel α-amino cyclic boronates as inhibitors of HCV NS3 protease. Bioorg Med Chem Lett. 2010;20(12):3550–6.
Prongay AJ, Guo Z, Yao N, Pichardo J, Fischmann T, Strickland C, et al. Discovery of the HCV NS3/4A Protease Inhibitor (1R,5S)-N-[3-Amino-1-(cyclobutylmethyl)-2,3-dioxopropyl]-3- [2(S)-[[[(1,1-dimethylethyl)amino]carbonyl]amino]-3,3-dimethyl-1-oxobutyl]- 6,6-dimethyl-3-azabicyclo[3.1.0]hexan-2(S)-carboxamide (Sch 503034) II. Key Steps in Structure-Based Optimization. J Med Chem. 2007 May 1;50(10):2310–8.
Chen KX, Njoroge FG, Prongay A, Pichardo J, Madison V, Girijavallabhan V. Synthesis and biological activity of macrocyclic inhibitors of hepatitis C virus (HCV) NS3 protease. Bioorg Med Chem Lett. 2005;15(20):4475–8.
Venkatraman S, Njoroge FG, Wu W, Girijavallabhan V, Prongay AJ, Butkiewicz N, et al. Novel inhibitors of hepatitis C NS3–NS4A serine protease derived from 2-aza-bicyclo[2.2.1]heptane-3-carboxylic acid. Bioorg Med Chem Lett. 2006;16(6):1628–32.
Bai X, McMullan G, Scheres SHW. How cryo-EM is revolutionizing structural biology. Trends Biochem Sci. 2015;40(1):49–57.
Bishop CM. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag; 2006.
Lu X, Wang L, Jiang Z. The Application of Deep Learning in the Prediction of HIV-1 Protease Cleavage Site. In: 2018 5th International Conference on Systems and Informatics (ICSAI). 2018. p. 1299–304.
Singh O, Su EC-Y. Prediction of HIV-1 protease cleavage site using a combination of sequence, structural, and physicochemical features. BMC Bioinformatics. 2016 Dec;17(17):478.
Narayanan A, Wu X, Yang ZR. Mining viral protease data to extract cleavage knowledge. Bioinformatics. 2002;18:5–13.
Rögnvaldsson T, You L. Why neural networks should not be used for HIV-1 protease cleavage site prediction. Bioinformatics. 2004;20(11):1702–9.
Lv Z, Chu Y, Wang Y. HIV protease inhibitors: a review of molecular selectivity and toxicity. HIV AIDS (Auckl). 2015;7:95–104.
Schechter I, Berger A. On the size of active sites in proteases. I. Papain. Biochem Biophys Res Commun. 1967;27:157–62.
Ripley B. Pattern Recognition and Neural Networks. 1st ed. Cambridge: Cambridge University Press; 1996.
Breiman L. Random Forests. Mach Learn. 2001;45:5–32.
Dobson AJ. An Introduction to Generalized Linear Models. 2nd ed. London: Chapman and Hall; 2002.
Mika S, Rätsch G, Weston J, Schölkopf B, Müller K. Fisher Discriminant Analysis with Kernels. In: Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society. 1999. p. 41–8.
Cortes C, Vapnik V. Support-Vector Networks. Mach Learn. 1995;20:273–97.
Kotsiantis SB, Zaharakis I, Pintelas P. Supervised machine learning: A review of classification techniques. Emerg Artif Intell Appl Comput Eng. 2007;160:3–24.
Kutkina O, Feuerriegel S. Deep Learning in R. University of Freiburg; 2016.
Goel E, Abhilasha E. Random Forest : A Review. Int J Adv Res Comput Sci Softw Eng. 2017;7(1):251–7.
Dey D, Ghosh S, Mallick B. Generalized Linear Models. 1st ed. Boca Raton: CRC Press; 2000.
Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G. Support Vector Machines and Kernels for Computational Biology. PLoS Comput Biol. 2008;4(10).
Panchal F, Panchal M. Optimizing Number of Hidden Nodes for Artificial Neural Network using Competitive Learning Approach. Int J Comput Sci Mob Comput. 2015;4(5):358–64.
McLachlan GJ, Do K-A, Ambroise C. Analyzing Microarray Gene Expression Data. Hoboken, NJ: Wiley-Interscience; 2004. p. 213–214.
Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978;8(4):283–98.
Raghavan V, Bollmann P, Jung GS. A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans Inf Syst. 1989;7(3):205–29.
Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta - Protein Struct. 1975;405(2):442–51.
Chicco D. Ten quick tips for machine learning in computational biology. BioData Min. 2017;10:1–17.
Boughorbel S, Jarray F, El-Anbari M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS One. 2017;12(6):1–17.
Royston JP. Algorithm AS 181: The W Test for Normality. J R Stat Soc Ser C (Applied Stat). 1982;31(2):176–80.
Joanes DN, Gill CA. Comparing Measures of Sample Skewness and Kurtosis. J R Stat Soc Ser D (The Stat). 1998;47(1):183–9.
Kim TK. T test as a parametric statistic. Korean J Anesthesiol. 2015 Dec;68(6):540–6.
Kim H-Y. Analysis of variance (ANOVA) comparing means of more than two groups. Restor Dent Endod. 2014 Feb;39(1):74–7.
Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904;15(1):72–101.
Chakrabarti K, Keogh E, Mehrotra S, Pazzani M. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. ACM Trans Database Syst. 2002;27(2):188–228.
Li B, Cai Y, Feng K, Zhao G. Prediction of Protein Cleavage Site with Feature Selection by Random Forest. PLoS One. 2012;7(9):1–9.
Davis J, Goadrich M. The Relationship Between Precision-Recall and ROC Curves. In: Proceedings of the 23rd International Conference on Machine Learning. New York, NY, USA: ACM; 2006. p. 233–40.
Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004 Jun;14(6):1188–90.