20th International CODATA Conference (Times New Roman, 11pt)

20th International CODATA Conference (Times New Roman, 11pt)
Session: Computational informatics: integrating data science with materials modeling

The Storage And Substructure Search Of 2-Dimensional

Chemical Structure

Xin Li^{1, 2}, Wan-Yi Sun^{1, 2} and Xian-Feng He¹

¹Multi-Phase Reaction Laboratory, Institute of Process Engineering

Chinese Academy of Sciences, P.O. Box 353, Beijing 100080, P. R. China

²Graduate University of Chinese Academy of Sciences

P.O. Box 4588, Beijing 100049, P. R. China

The management of large chemical structure databases has become one of the most important applications of computational tools in cheminformatics. For a chemical database with large amount of structural information, the most interesting problem may be the substructure search and effective storage of these chemical structures. In the present work, an improved VF2 algorithm is developed to search the 2-dimensional (2D) molecular structure from a database by means of a pre-screening strategy, and a new algorithm is adopted to compress the storage of molecular structures and to create the Primary Key string in the database.

VF2 algorithm, the improved VF algorithm appeared in recent years, is the most efficient 2D graphs matching algorithm for large graphs problems at present. VF2 algorithm is able to solve the graph/sub-graph isomorphism problems on Attributed Relational Graphs and mainly used for matching directed graphs with unlimited edges number in Graphics field. However, all of the organic chemical structures are undirected graphs, and most of which has maximal edge (bond) number no more than 4. The efficiency of VF2 algorithm for 2D molecular structure search is improved by following pre-screening strategy.

(1) Atom type matching: the number of atom type in target graph (TG) is always no less than that in quest graph (QG);

(2) Bond type match: the number of bond type in TG is always no less than that in QG;

(3) Heteroatom priority: heteroatom has the highest priority.

The 2D molecular structure is one kind of topology structure, which expresses the relationship between atoms and the bonds connected to these atoms. The most extensively used format is the pure text-based MOL format presented by MDL Inc. However, for most molecular structures, MOL format is too redundant for chemical database storage. A compression strategy is proposed in the present work for reducing the storage size of molecular structures. The key points of the compression strategy can be expressed as:

(1) The structure definition is rearranged for decrease storage sizes;

(2) The index table is built for eliminating redundant atom and bond type information;

(3) A method is presented to store sparse matrix.

A unique coding method based on the Connectivity Matrix Polynomial Multiply (CMPM) algorithm is developed in the present work for creating the Primary Key string in database, where Connectivity Matrix is derived from molecular topological structures.

The compression and search strategies are tested with a database with about 60,000 2D molecular structures. For molecular structure compression, the storage size of compressed structures can be reduced to an average value of 12% as MOL files. The 2D molecular structure search can present correct results within a reasonable searching time, except some of ambiguity compared with the ISIS-Base for aromatic rings.

Key words: 2D molecular structure, substructure search, VF2 algorithm