19th International CODATA Conference
Category: Knowledge Discovery

N-gram Indexing for Protein Sequence Databases

Jinsuk Kim, Mi-Nyeong Hwang, Sul-Ah Ahn, Center for Computational Biology & Bioinformatics, Korea Institute of S&T Information, Republic of Korea
Dr. Hyeon S. Son (hss@kisti.re.kr), Bioinformatics Department, School of Public Health, Seoul National University, Republic of Korea


Motivation: Although protein and DNA sequence databases are growing exponentially, exhaustive sequence search systems are still commonly used in biological research. Meanwhile, advances in information technology have produced many information retrieval algorithms that search for strings in large-scale text databases, and these have proved successful. We propose that these algorithms can also be applied to biological data.

Results: Four n-gram indexing methods (tri-gram, tetra-gram, penta-gram, and hexa-gram) were applied to extract indices from protein sequences in the PIR-NREF database, and their retrieval effectiveness and speed were measured. The penta-gram method gave the best results: its retrieval effectiveness matched that of BLASTP, and its retrieval speed was about 38 times faster than that of the BLASTP program.
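
As an illustration of the general idea of n-gram indexing, the following minimal sketch builds an inverted index of overlapping penta-grams over a few toy sequences and ranks them by the number of distinct n-grams they share with a query. It is not the implementation evaluated here: the function names (ngrams, build_index, search), the toy sequences, and the shared-n-gram count used as a score are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def ngrams(sequence, n):
    """Return all overlapping n-grams of a sequence, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_index(sequences, n):
    """Build an inverted index: n-gram -> set of sequence IDs containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for gram in ngrams(seq, n):
            index[gram].add(seq_id)
    return index


def search(index, query, n):
    """Rank sequence IDs by how many distinct query n-grams they contain.

    The shared-n-gram count is a hypothetical scoring choice for this sketch.
    """
    hits = Counter()
    for gram in set(ngrams(query, n)):
        for seq_id in index.get(gram, ()):
            hits[seq_id] += 1
    return hits.most_common()


# Toy sequences (made up for illustration, not taken from PIR-NREF).
toy_db = {
    "seqA": "MKTAYIAKQRQISFVK",
    "seqB": "MKTAYIAKQRGGGGGG",
    "seqC": "WWWWWWWWWW",
}

index = build_index(toy_db, n=5)           # n=5 corresponds to penta-grams
results = search(index, "TAYIAKQRQIS", n=5)
print(results)
```

Because a query only touches the index entries for its own n-grams, sequences that share no n-gram with it (seqC above) are never examined, which is the intuition behind the speed advantage over an exhaustive scan.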

Availability: Our protein sequence search service is accessible at http://proses.kisti.re.kr.