20th International CODATA Conference (Times
New Roman, 11pt)
Session: Computational informatics: integrating data science with materials
modeling
The Storage And
Substructure Search Of 2-Dimensional
Chemical Structure
Xin Li1, 2, Wan-Yi Sun1, 2 and Xian-Feng He1
1Multi-Phase Reaction
Laboratory, Institute of Process Engineering
The management of large chemical
structure databases has become one of the most important applications of
computational tools in cheminformatics. For a
chemical database with large amount of structural information, the most
interesting problem may be the substructure search and effective storage of these chemical structures. In the present work,
an improved VF2 algorithm is developed to search the 2-dimensional (2D)
molecular structure from a database by means of a pre-screening strategy, and a
new algorithm is adopted to compress the storage of molecular structures and to
create the Primary Key string in the database.
VF2 algorithm, the improved VF
algorithm appeared in recent years, is the most efficient 2D graphs matching
algorithm for large graphs problems at present. VF2 algorithm is able to solve
the graph/sub-graph isomorphism problems on Attributed Relational Graphs and
mainly used for matching directed graphs with unlimited edges number in
Graphics field. However, all of the organic chemical structures are undirected
graphs, and most of which has maximal edge (bond) number no more than 4. The
efficiency of VF2 algorithm for 2D molecular structure search is improved by
following pre-screening strategy.
(1) Atom type matching: the number of atom type in target
graph (TG) is always no less than that in quest graph (QG);
(2) Bond type match: the number of bond type in TG is
always no less than that in QG;
(3) Heteroatom priority: heteroatom has the highest
priority.
The 2D molecular structure is one
kind of topology structure, which expresses the relationship between atoms and
the bonds connected to these atoms. The most extensively used format is the
pure text-based MOL format presented by MDL Inc. However, for most molecular
structures, MOL format is too redundant for chemical database storage. A compression strategy is proposed in the present work
for reducing the storage size of molecular structures. The key points of the
compression strategy can be expressed as:
(1) The structure definition is rearranged for decrease
storage sizes;
(2) The index table is built for eliminating redundant
atom and bond type information;
(3) A method is presented to store sparse matrix.
A unique coding method based on
the Connectivity Matrix Polynomial Multiply (CMPM) algorithm is developed in
the present work for creating the Primary Key string in database, where
Connectivity Matrix is derived from molecular topological structures.
The compression and search
strategies are tested with a database with about 60,000 2D molecular
structures. For molecular structure compression, the storage size of compressed
structures can be reduced to an average value of 12% as MOL files. The 2D
molecular structure search can present correct results within a reasonable
searching time, except some of ambiguity compared with the ISIS-Base for
aromatic rings.
Key words: 2D
molecular structure, substructure search, VF2 algorithm