How to Use Machine Learning in Bioinformatics Research Part 2

Original article was published on Artificial Intelligence on Medium

How to Use Machine Learning in Bioinformatics Research Part 2

Machine Learning in Bioinformatics

Bioinformatics is a combination of many fields. I was doing my undergraduate research on BioInformatics and I had a great passion for machine learning and deep learning. My passion forced me to search about Integrating machine learning and deep learning techniques to solve problems in BioInformatics. So I thought to share my knowledge to support other researchers to help them avoid difficulties I faced during the early stage of my research work.

In my first article How to Use Machine Learning in Bioinformatics Research Part 1, I discussed what is Bioinformatics and several biological domains where machine learning techniques are applied for knowledge extraction from data. I also discussed why machine learning is used and the optimization criterion to follow. I promised to discuss Supervised classification, Clustering, Probabilistic Graphical Models and Optimization in my next series of blogs. So in this article, I would like to discuss Supervised classification in Bioinformatics research.

Supervised Classification

Supervised Classification

In a classification problem, a set of elements are divided into classes. Given an element (or instance) of the set, a class is assigned according to two things.

  1. some of the element’s features
  2. a set of classification rules (not known)

In many real-life situations, this set of rules is not known, and the only information available is a set of labeled examples (i.e. a set of instances associated with a class). Supervised classification methods are algorithms that induce the classification rules from the data.

As an example, we will see a way to use splice site prediction as a supervised classification problem. The instances to be classified will be DNA sequences of a given size (consider an example, 22 base pairs, 10 upstream and 10 downstream the 2bp splice site). Here, the attributes of a given instance will be the nucleotide at each position in the sequence. In the example, we will assume that we are looking for donor sites, so the possible values for the class will be a true donor site or false donor site. As we are approaching the problem as supervised classification, we need a set of labeled examples, i.e. a set of sequences of true and false donor sites along with their label. At this point, we can use this training set to build up a classifier. Once classifier has been trained, we can use it to label new sequences, using the nucleotide present at each position as an input to the classifier and getting the label (true or false donor site) as an output.

In two group supervised classification, there is a feature vector X whose components are called predictor variables and a label or class variable C. Hence, the task is to induce classifiers from training data, which consists of a set of N independent observations drawn from the joint probability distribution p(x,c). The classification model will assign labels to new instances according to the value of its predictor variables.

Assession and Comparison of Classification Algorithms

Error rate and ROC curve

When a 0/1 loss is used, all errors are equally bad, and our error calculations are based on the confusion matrix.

Confusion Matrix

In this case, we can define the error rate as (|FN|+|FP|)/N, where N=|TP|+|FP|+|TN|+|FN| is the total number of instances in the validation set.

To fine-tune a classifier, another approach is to draw the receiver operating characteristics (ROCs) curve, which shows true positive rate versus false-positive rate, namely, 1- specificity=|FP|/(|FP|+|TN|) versus sensitivity=|TP|/(|TP|+|FN|), and has a form similar to figure shown below.

ROC Curve

In each classification algorithm, there is a parameter, for example, a threshold of decision, which we can play with to change the number of true positives versus false positives. When increasing the number of true positives increases the number of false alarms; decreasing the number of false alarms also decreases the number of hits. Depending on how good/costly these are for the particular application we have, we decide on a point on this curve. The area under the receiver operating characteristic curve is used as a performance measure for machine learning algorithms. This is called the AUC score. A model is said to be good if this value range between 0.5 and 1. If the AUC score is 1 it is said that is model is a perfect classification model.

Estimating the Classification Error

An important issue related to a designed classifier is how to estimate its error rate when using this model for classifying unseen (new) instances.

The following techniques can be used to estimate the error rate.

  1. resubstitution estimator (simplest & the fastest way to estimate error)
  2. k-fold cross-validation
  3. leave-one-out cross-validation
  4. bootstrap methodology
  5. bolstered error estimation

Comparing Classification Algorithms

Given two learning algorithms and a training set, an interesting question is to know whether the differences in the estimation of the expected error rates provided by both algorithms are statistically significant.

Feature Subset Selection

Feature Subset Selection

One question that is common to all supervised classification paradigms is whether all the n descriptive features are useful when learning the classification rule. To answer this we define feature subset selection as given a set of candidate features, select the best subset under some learning algorithm.

There are Four basic issues that determine the nature of the search process.

  1. Search space starting point => determines the direction of the search
  2. Search organization => determines the strategy of the search in a space of size Pow(2,n), where n is the number of features in the problem

The search strategies can be optimal or heuristic. Two classic optimal search algorithms that exhaustively evaluate all possible subsets are depth-first and breadth-first. Otherwise, the branch and bound search guarantee the detection of the optimal subset for monotonic evaluation functions without the systematic examination of all subsets. When monotonicity cannot be satisfied, depending on the number of features and the evaluation function used, an exhaustive search can be impractical. In this situation, the heuristic search is interesting because it can find near-optimal solutions, if not optimal. Among heuristic methods, there are deterministic and stochastic algorithms. On one hand, classic deterministic heuristic FSS algorithms are sequential forward and backward selection, floating selection methods or best-first search. They are deterministic in the sense that all runs always obtain the same solution and, due to their hill-climbing nature, they tend to get trapped on local peaks caused by interdependencies among features. Stochastic heuristic FSS algorithms use randomness to escape from local maxima, which implies that one should not expect the same solution from different turns of run. Genetic algorithms and estimation of distribution algorithms have been applied to the FSS problem.

3. Evaluation function => measures the effectiveness of a particular subset of features after the search algorithm has chosen it for examination

4. Search-halting criterion => an intuitive approach is the non-improvement of the evaluation function value of alternative subsets. Another classic criterion is to fix a number of possible solutions to be visited along with the search.

Classification Paradigms

Each supervised classification method has an associated decision surface that determines the type of problems the classifier is able to solve.

(1) Bayesian Classifiers => 1. Naive Bayes 2. Seminaive Bayes 3. Tree Augmented Naive Bayes 4. k dependence Bayesian classifier (KDB)

(2) Logistic Regression

(3) Discriminant Analysis => 1. Fisher discriminant analysis 2. Linear discriminant analysis

(4) Classification trees

(5) Nearest Neighbour => 1. K-nearest neighbor rule

(6) Neural Networks => 1. Feedforward neural network 2. perception 3. multilayer perception

(7) Support Vector Machines

(8) Combining Classifiers => 1. Majority Vote 2. Stacked generalization 3. Ada Boost Algorithm

Supervised Classification in BioInformatics

Supervised Classification in BioInformatics

These are a few applications of supervised classification in biological domains.

(1) Genomics

  1. gene finding problem
  2. searching for protein-coding regions in human DNA (classification trees)
  3. splice site prediction problem (a new Bayesian classifier)
  4. gene finding problem (feature subset selection)
  5. splice site prediction problem (optimization procedures are applied to the FSS)
  6. gene prediction (combining different sources of evidence)
  7. search for RNA genes ( use of classification paradigms)
  8. computational identification of functional RNA genes (SVM & Neural Networks)
  9. genome-wide identification of genes likely to be involved in genetic diseases (classification trees taking different conservation scores and gene length as predicting variables)
  10. discover intelligible knowledge rules from data generated by means of a DNA analysis technique (genotyping) called spacer oligonucleotide typing (C4.5 algorithm)
  11. reconstruction of amino-acid sequences by means of spectral features has been addressed using dynamic programming
  12. RNA secondary structure prediction (dynamic programming)
  13. identify RNA structural elements (Evolutionary algorithms)
  14. RNA tertiary structure determination (tabu search)

(2) Proteomics

  1. prediction of the secondary structure of proteins (nearest neighbor)
  2. prediction of the secondary structure of proteins (a consensus method based on a classification tree)
  3. predict the surface residues of a protein that participate in protein-protein interactions (two-stage method consisting of a support vector machine and a Bayesian classifier)
  4. predicting the protein subcellular location automatically from its sequence (fuzzy k-nearest neighbor algorithm)

(3) Microarrays

  1. pattern recognition in microarray data
  2. simultaneously select the optimal classifier and the optimal subset of genes for cancer diagnosis based on expression data (a Bayesian generalization of the support vector machine)
  3. gene selection ( k-nearest neighbor is used in conjunction with a genetic algorithm in a wrapper approach)
  4. classification of cancerous gene expression profiles (ensemble learning (bagged and boosted decision trees) performs better than single classification trees)
  5. cancer gene expression studies (compare nearest-neighbor classifiers, linear discriminant analysis, classification trees, bagging and boosting)
  6. classification problem with 14 tumor classes (comparison between three binary classifiers (k-nearest neighbors, weighted voting and support vector machines))
  7. Eleven datasets for cancer diagnosis (comparison between several major classification algorithms (support vector machines, k-nearest neighbor, neural networks, and different ensemble classifiers))
  8. To evaluate the performance of 21 classification methods in 7 datasets in microarray datasets.

(4) System Biology

  1. when modeling signal–response cascades, and the methodology is applied to the prediction of cell migration speed using phosphorylation levels of signaling proteins (classification trees)
  2. prediction of a gene regulatory response (boosting with classification trees as the base classifier)

(5) Text Mining

  1. a novel method for protein/gene identification in text, based on an ensemble of a support vector machine and two hidden Markov models
  2. prediction of the subcellular location of proteins (support vector machines)

(6) Other Applications

  1. use of mass spectrometry data
  2. detection of early-stage ovarian cancer patients using mass spectrometry data as biomarkers (linear discriminant analysis, quadratic discriminant analysis, k-nearest-neighbor classifier, bagging and boosting classification trees, support vector machines, and random forests)
  3. classification of two metabolic disorders in newborns, using data obtained from mass spectrometry technology (compare classification algorithms (discriminant analysis, logistic regression, classification trees, k-nearest-neighbor classifiers, neural networks, and support vector machines))

Final Thought

Final Thought

I hope to discuss Clustering, Probabilistic Graphical Models and Optimization in my next series of blogs. Hope you will stay in touch with me.

Special credit goes to my University lecturer Dr. C. T. WANNIGE

Thank you!!!… To be continued…