19th International CODATA Conference
Category: Knowledge Discovery
N-gram Indexing for Protein Sequence Databases
Jinsuk Kim, Mi-Nyeong Hwang, Sul-Ah Ahn, Center for Computational Biology & Bioinformatics,
Korea Institute of S&T Information, Republic of Korea
Dr. Hyeon S. Son (hss@kisti.re.kr), Bioinformatics Department, School of Public Health, Seoul National University,
Republic of Korea
Motivation: Although protein and DNA sequence databases are growing
exponentially in size, exhaustive sequence search systems are still commonly
used in biological research. Meanwhile, advances in information technology have
produced many information retrieval algorithms for searching strings in
large-scale text databases, and these algorithms have proven successful.
We propose that they can also be applied to biological data.
Results: Four n-gram indexing methods (tri-gram, tetra-gram, penta-gram,
and hexa-gram) were applied to extract indices from protein sequences of the
Availability: Our protein sequence search service is accessible at http://proses.kisti.re.kr.
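Illustration: The n-gram indexing named in the Results can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names (extract_ngrams, build_index, search), the in-memory inverted index, and the toy sequences are all hypothetical. Setting n to 3, 4, 5, or 6 corresponds to tri-, tetra-, penta-, and hexa-gram indexing.

```python
from collections import defaultdict


def extract_ngrams(sequence, n):
    """Return every overlapping n-gram of `sequence` with its start offset."""
    return [(sequence[i:i + n], i) for i in range(len(sequence) - n + 1)]


def build_index(sequences, n):
    """Build an inverted index: n-gram -> set of sequence IDs containing it.

    `sequences` is a dict of {sequence_id: residue_string} (assumed layout).
    """
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for gram, _offset in extract_ngrams(seq, n):
            index[gram].add(seq_id)
    return index


def search(index, query, n):
    """Return IDs of sequences that contain every n-gram of `query`.

    These are candidates only: sharing all n-grams does not guarantee the
    query occurs contiguously, so a real system would verify each hit.
    """
    grams = [gram for gram, _offset in extract_ngrams(query, n)]
    if not grams:  # query shorter than n: nothing to look up
        return set()
    hits = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        hits &= index.get(gram, set())
    return hits


# Toy data, made up for illustration only.
sequences = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTLLILAVV",
    "P3": "AYIAKQRQIS",
}
index = build_index(sequences, 3)
candidates = search(index, "AYIAK", 3)
```

Because the lookup touches only the posting sets for the query's n-grams, it avoids scanning every sequence, which is the contrast with exhaustive search drawn in the Motivation. Larger n yields more distinct index terms with shorter posting sets.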