# Protein Identification With Support Vector Machines

Many combinations of structures are available for proteins involved in the central processes of life; however, it can be a challenge to find the information we need, since the Protein Data Bank archives so many different structures, often with multiple structures for a given molecule, partial structures, or structures that have been modified from their native form. This paper presents an SVM-based algorithm to identify protein structures and compares it with a Naive Bayes-based algorithm on a set of protein data from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB). Performance was measured for identification scenarios. Classification accuracy for the SVM with the one-vs-rest approach is ~94%, versus ~90% for Naive Bayes.

**Keywords:** support vector machine, sequences, pattern recognition, bag of words

**1. INTRODUCTION**

Proteins play countless roles throughout the biological world, from catalyzing chemical reactions to building the structures of all living things. Despite this wide range of functions, all proteins are made of the same twenty amino acids, combined in different ways.

Support Vector Machines (SVMs) are a class of supervised learning algorithms first introduced by Vapnik [4]. Given a set of labelled training vectors (positive and negative input examples), SVMs learn a linear decision boundary to discriminate between the two classes. The result is a linear classification rule that can be used to classify new test examples. SVMs have exhibited excellent generalization performance (accuracy on test sets) in practice and have strong theoretical motivation in statistical learning theory [4].

We demonstrate our SVM-based algorithm on protein sequence identification applications. The algorithm is presented with the sequence of an unknown protein and reports its best estimate of the protein's identity from a database of known proteins.

The simplest way to extend SVMs to multiclass problems is the so-called one-vs-rest approach (Vapnik, 1995). For a K-class problem, K linear SVMs are trained independently, with the data from the other classes forming the negative cases [5]. Another approach is one-vs-one multiclass classification (Knerr et al., 1990). All possible two-class classifiers are built from a training set of n classes, each classifier being trained on only two of the n classes, giving a total of n(n - 1)/2 classifiers. Applying each classifier to a test vector gives one vote to the winning class, and the vector is assigned the label of the class with the most votes. A recent analysis of different multiclass strategies is provided by Hsu and Lin (2002), but we leave those for future work.
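The classifier counts above can be checked directly. This is a minimal sketch using scikit-learn (the paper's later mention of SVC and LinearSVC suggests that toolkit, though this synthetic dataset is only an illustration): for K = 4 classes, one-vs-rest trains K binary SVMs, while one-vs-one trains K(K - 1)/2.

```python
# One-vs-rest vs one-vs-one on a synthetic 4-class problem.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # K binary classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # K*(K-1)/2 binary classifiers

print(len(ovr.estimators_))  # 4
print(len(ovo.estimators_))  # 6
```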

To provide a benchmark for comparison, we experimented with both one-vs-rest and one-vs-one SVMs under different parameter settings, and in addition compared our approaches with a Naive Bayes-based algorithm. We present experimental results on different sets of test data. Five-fold cross-validation was also performed on the training data to ensure model fitness.

**2. BASELINE APPROACH**

A protein sequence is represented as a string constructed by determining the amino acid sequence of all or part of a protein or peptide. We extract each character from the protein sequence, count its number of occurrences, and store the counts in a vector. This vector becomes the input to our SVM model, and based on these inputs the classifier outputs one of eight classes of known proteins.
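The baseline feature extraction can be sketched in a few lines. The 20-letter amino acid alphabet below is the standard one; the paper's exact alphabet ordering is an assumption.

```python
# Count each amino-acid symbol in a sequence and store the counts
# in a fixed-length vector, one slot per alphabet symbol.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sequence_to_vector(seq):
    """Map a protein sequence string to a vector of per-symbol counts."""
    counts = Counter(seq)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]

print(sequence_to_vector("MKVLAA"))
# [2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```

Each vector has the same length regardless of sequence length, which is what lets variable-length sequences feed a fixed-dimension SVM.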

Suppose our training set S consists of labelled input vectors (x_i, y_i), i = 1, ..., m, where x_i ∈ R^n and y_i ∈ {±1}. We can specify a linear classification rule f by a pair (w, b), where w ∈ R^n and b ∈ R, via

f(x) = (w, x) + b

where a point x is classified as positive (negative) if f(x) > 0 (f(x) < 0). Geometrically, the decision boundary is the hyperplane

{x ∈ Rn : (w, x) + b = 0}

where w is a normal vector to the hyperplane and b is the bias. If we further require that |w| = 1, then the geometric margin of the classifier with respect to S is

m_S(f) = min_{i = 1, ..., m} y_i ((w, x_i) + b)

In the case where the training data are linearly separable and a classifier f correctly classifies the training set, m_S(f) is simply the distance from the decision hyperplane to the nearest training point(s).
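A worked numerical example of the geometric margin: with |w| = 1, the margin with respect to S is the smallest value of y_i((w, x_i) + b) over the training set, i.e. the distance from the hyperplane to the nearest correctly classified point. The tiny 2-D dataset below is illustrative, not from the paper.

```python
# Geometric margin of a linear classifier with a unit-norm weight vector.
import math

w = (1 / math.sqrt(2), 1 / math.sqrt(2))   # unit normal, |w| = 1
b = -1.0

# Three linearly separable labelled points (x, y).
S = [((2.0, 2.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1)]

def f(x):
    """Linear classification rule f(x) = <w, x> + b."""
    return w[0] * x[0] + w[1] * x[1] + b

# Smallest signed distance from the hyperplane over the training set.
margin = min(y * f(x) for x, y in S)
print(margin)  # 1.0 (the point (0, 0) is nearest to the hyperplane)
```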

**2.1. SVM Kernels**

A key feature of any SVM optimization problem is that it is equivalent to solving a dual quadratic programming problem. For example, in the linearly separable case, the maximal margin classifier is found by solving for the optimal "weights" α_i, i = 1, ..., m, in the dual problem:

maximize Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i, x_j)

on the region: α_i ≥ 0 for all i, with Σ_i α_i y_i = 0

The parameters (w, b) of the classifier are then determined by the optimal α_i (and the training data). The dual optimization problems for the various soft margin SVMs are similar. The dual problem not only makes SVMs amenable to various efficient optimization algorithms but also, since it depends on the data only through the inner products (x_i, x_j), allows for the introduction of kernel techniques.

To introduce a kernel, we now suppose that our training data are simply labelled examples (x_i, y_i), where the x_i belong to an input space X, which could be a vector space or a space of discrete structures such as sequences of characters from an alphabet, or trees. Given any feature map Φ from X into a (possibly high-dimensional) vector space called the feature space, Φ : X → R^n, we obtain a kernel K on X defined by K(x, y) = (Φ(x), Φ(y)). By replacing (x_i, x_j) with K(x_i, x_j) in the dual problem, we can use SVMs in feature space.

Moreover, if we can directly compute the kernel values K(x , y) without explicitly calculating the feature vectors, we gain tremendous computational advantage for high-dimensional feature spaces.
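The computational point can be made concrete with a small example (illustrative, not from the paper): a degree-2 polynomial kernel K(x, y) = (x, y)^2 equals the inner product of explicit degree-2 feature maps, but never materializes the feature vectors.

```python
# Kernel value computed directly vs via the explicit feature map.
import math

def kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = <x, y>^2 on R^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map for the same kernel: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, 4.0)
direct = kernel(x, y)                                    # no feature vectors built
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))    # <Phi(x), Phi(y)>
print(direct, explicit)  # 121.0 121.0
```

For feature spaces of thousands or millions of dimensions, the direct computation is the only practical option, which is exactly the advantage described above.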

**3. PROPOSED APPROACH**

The baseline approach with one-vs-one classification is limited by training data size: it is computationally expensive and does not scale well beyond roughly 10,000 records. A more effective approach is needed to scale training with less computation. Furthermore, training an optimal SVM model requires a number of decisions: data pre-processing and feature engineering, the kernel function to use, and the SVM parameters.

**3.1. SVM parameters tuning**

Our approach compares the SVM one-vs-one and one-vs-rest strategies, the latter of which has been shown to be effective and to scale better with larger training sets. Below are the parameters used in our SVM models.
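The paper's exact parameter table is not reproduced here; the grid below is an illustrative sketch of this kind of tuning in scikit-learn (implied by the paper's SVC/LinearSVC naming), where SVC uses one-vs-one internally and LinearSVC is one-vs-rest by default.

```python
# Illustrative parameter tuning for SVC, plus a one-vs-rest LinearSVC fit.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=120, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Grid-search over regularization strength C and kernel choice.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=3)
search.fit(X, y)
print(search.best_params_)

# LinearSVC trains one linear SVM per class (one-vs-rest).
linear = LinearSVC(C=1.0).fit(X, y)
print(linear.coef_.shape)  # (n_classes, n_features)
```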

**3.2. Features engineering**

Protein sequence data are composed of series of characters representing different amino acids. For our model to learn from them, we need to convert these strings into a numerical representation.

The bag-of-words model is one of the most popular representation methods for object categorization. The key idea is to quantize each extracted feature into one of a set of discrete "words". In our case, we extract each character from the string and count the number of occurrences of each symbol.

**4. EXPERIMENTAL RESULTS**

**4.1. Cross validation (Training data)**

Here is the 5-fold cross-validation result obtained on the training dataset to ensure model fitness before proceeding to the test dataset.
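The cross-validation step can be sketched as follows; the paper's dataset is not reproduced, so a synthetic multiclass set stands in for the eight protein classes.

```python
# 5-fold cross-validation of a linear SVM on a synthetic 8-class dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=8, random_state=0)

# One accuracy score per fold; the mean summarizes model fitness.
scores = cross_val_score(LinearSVC(), X, y, cv=5)
print(scores, scores.mean())
```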

**4.2. Model accuracy (Test data)**

The results table below presents the two SVM approaches and Naive Bayes.

In terms of model training time, Naive Bayes is the fastest, followed by LinearSVC, with SVC the slowest.

**5. CONCLUSIONS**

Typical machine learning systems operate on input data after it has been transformed into feature vectors d_1, ..., d_n ∈ D living in an m-dimensional space. There are many cases, however, where the input data cannot readily be described by explicit feature vectors: protein sequences, for example.

For such datasets, constructing a feature extraction module can be as complex and expensive as solving the entire problem. The feature extraction process not only requires extensive domain knowledge but can also lose important information, and the extracted features play a key role in the effectiveness of a system. Raw protein sequence data comprise a sequence of symbols that cannot be fed directly to most algorithms, which expect numerical feature vectors of fixed size rather than raw text of variable length. Tokenizing, counting, and normalizing are the common ways to extract numerical features from text content. We call the general process of turning a collection of texts into numerical feature vectors vectorization; this specific strategy (tokenizing, counting, and normalizing) is called the bag-of-words or bag-of-n-grams representation. A protein sequence can then be described by symbol occurrences while completely ignoring the relative positions of the symbols in the sequence.

Our experiments have shown the potential of using the bag of words representation together with SVM and we could achieve very good classification results without the need for complex manual parameter tuning.

**REFERENCES**

[1] P. Jonathan Phillips, *Support Vector Machines Applied to Face Recognition*, National Institute of Standards and Technology Bldg 225/ Rm A216, Gaithersburg/ MD 20899.

[2] Mahesh Pal, *Multiclass Approaches for Support Vector Machine Based Land Cover Classification*, Department of Civil engineering, National Institute of Technology, Kurukshetra, 136119, Haryana (India).

[3] Christina Leslie, Eleazar Eskin, William Stafford Noble, *The Spectrum Kernel: A String Kernel For SVM Protein Classification*, Department of Computer Science, Columbia University, New York, NY 10027

[4] VN Vapnik. *Statistical Learning Theory*. Springer, 1998.

[5] Yichuan Tang, *Deep Learning using Linear Support Vector Machines*. Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.

[6] Yin Zhang, Rong Jin, Zhi-Hua Zhou, *Understanding Bag-of-Words Model: A Statistical Framework*. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China, and Department of Computer Science & Engineering, Michigan State University, East Lansing, MI 48824.

[7] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Chris Watkins, *Text Classification using String Kernels*. Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK.

[8] Christina S. Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston and William Stafford Noble, *Mismatch string kernels for discriminative protein classification*. Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, Mail Code 0401, New York, NY 10027, USA; Max-Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany; and Department of Genome Sciences, University of Washington, 1705 NE Pacific Street, Seattle, WA 98195, USA.

[9] Betty Yee Man Cheng, Jaime G. Carbonell, and Judith Klein-Seetharaman, *Protein Classification Based on Text Document Classification Techniques*. Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania, and Department of Pharmacology, University of Pittsburgh School of Medicine, Biomedical Science Tower E1058, Pittsburgh, Pennsylvania.

[10] Needleman SB, Wunsch CD. *A general method applicable to the search for similarities in the amino acid sequence of two proteins*. J Mol Biol 1970;48(3):443–453.

[11] Deshpande M, Karypis G. *Evaluation of techniques for classifying biological sequences*. In 6th Pacific-Asia Conference on Knowledge Discover (PAKDD). 2002. p 417–431.

[12] Hsu, Chih-Wei and Lin, Chih-Jen. *A comparison of methods for multiclass support vector machines*. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

[13] E. J. Bredensteiner and K. P. Bennett, *“Multicategory classification by support vector machines”* Comput. Optimiz. Applicat., pp. 53–79, 1999.

[14] C. Cortes and V. Vapnik, *“Support-vector network”* , *Machine Learning*, vol. 20, pp. 273–297, 1995.

[15] E. Mayoraz and E. Alpaydin, *“Support vector machines for multi-class classification”* in IWANN, vol. 2, 1999, pp. 833–842.

[16] Lipo Wang, *Support Vector Machines: Theory and Applications*. Springer, Berlin, 2005.

[17] Thorsten Joachims, *Text Categorization with Support Vector Machines: Learning with Many Relevant Features*. Universität Dortmund, Informatik LS8, Baroper Str. 301, 44221 Dortmund, Germany.

[18] Yuh-Jye Lee, Yi-Ren Yeh, and Hsing-Kuo Pao, *An Introduction to Support Vector Machines*. Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan 10607.

[19] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, Sayan Mukherjee, *Choosing Multiple Parameters for Support Vector Machines*. LIP6, Paris, France, AT&T Research Labs, 200 Laurel Ave, Middletown, NJ 07748, USA, Ecole Polytechnique, France, MIT, Cambridge, MA 02139, USA

[20] Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen, Yixue Li, and Hualiang Jiang, *Predicting protein–protein interactions based only on sequences information. *Center for Drug Discovery and Design, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, and Graduate School of Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China; and Bioinformation Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yue Yang Road, Shanghai 200031, China