19th International CODATA Conference
Category: Knowledge Discovery

Example-based Classification of Protein Subcellular Locations Using Penta-gram Features

Ho-Eun Park, Mi-Nyeong Hwang, Hyeon S. Son and Jinsuk Kim (jinsuk@kisti.re.kr)
Center for Computational Biology & Bioinformatics, Korea Institute of Science and Technology Information, Republic of Korea


Motivation: The function of a protein is closely correlated with its subcellular location(s). Determining the subcellular location of a protein from its sequence is therefore an important problem. We have developed a new prediction system for protein subcellular location(s), called ProSLP.

Methods: ProSLP is based on an n-gram feature extraction method and the k-nearest neighbor (kNN) classification algorithm. It assigns a protein sequence to one or more subcellular compartments based on the locations of the top k sequences that show the highest weights against the input sequence. The weight is a similarity measure determined by comparing the n-gram features of two sequences. Currently, ProSLP extracts penta-grams as features of protein sequences, computes scores for the potential localization site(s) using the kNN algorithm, and finally presents the locations and their associated scores.
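
The procedure described above can be illustrated with a minimal Python sketch. It is not the ProSLP implementation: the abstract does not specify how the weight is computed or how scores are combined, so the sketch assumes cosine similarity over penta-gram counts as the weight and sums the weights of the top k neighbors for each location. The function names (ngrams, cosine, predict) and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def ngrams(seq, n=5):
    """Count the overlapping n-grams (penta-grams when n=5) of a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def cosine(a, b):
    """Cosine similarity of two n-gram count vectors (an assumed weight)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (sqrt(sum(c * c for c in a.values()))
            * sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def predict(query, training, k=3, n=5):
    """Rank locations by the summed weights of the top-k nearest sequences.

    `training` is a list of (sequence, [locations]) pairs.
    """
    q = ngrams(query, n)
    weighted = [(cosine(q, ngrams(seq, n)), locs) for seq, locs in training]
    top = sorted(weighted, key=lambda item: item[0], reverse=True)[:k]
    scores = defaultdict(float)
    for weight, locs in top:
        if weight > 0:  # neighbors sharing no n-gram contribute nothing
            for loc in locs:
                scores[loc] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy training set (made-up sequences, not SWISS-PROT entries).
training = [
    ("MKTAYIAKQRQISFVK", ["cytoplasm"]),
    ("MKTAYIAKQRGGGGGG", ["nucleus", "cytoplasm"]),
    ("WWWWWWWWWW", ["membrane"]),
]
ranking = predict("MKTAYIAKQRQISFVK", training, k=2)
for location, score in ranking:
    print(location, round(score, 3))
```

In this sketch a sequence annotated with several locations contributes its weight to each of them, which is one simple way to allow multi-location predictions; other weighting or normalization schemes would fit the description equally well.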

Results: We constructed a large-scale data set of protein sequences with known subcellular locations from the SWISS-PROT database. This data set contains 51,885 entries, each with one or more known subcellular locations. ProSLP showed a high prediction precision of about 93% on this data set; compared with another method, it also showed a comparable prediction improvement on a test collection used in a previous work.

Availability: ProSLP is available on the World Wide Web at http://proslp.kisti.re.kr.