Molecular Structure Extraction From Documents Using Deep Learning
Joshua Staker†, Kyle Marshall†, Robert Abel‡, Carolyn M. McQuaw†
†Schrödinger, Inc., 101 SW Main Street, Portland, Oregon 97204, United States
‡Schrödinger, Inc., 120 West 45th Street, New York, New York 10036, United States
Abstract: Chemical structure extraction from documents remains a hard problem due to both false positive identification of structures during segmentation and errors in the predicted structures. Current approaches rely on handcrafted rules and subroutines that perform reasonably well generally, but still routinely encounter situations where recognition rates are not yet satisfactory and systematic improvement is challenging. Complications impacting performance of current approaches include the diversity in visual styles used by various software to render structures, the frequent use of ad hoc annotations, and other challenges related to image quality, including resolution and noise. We here present end-to-end deep learning solutions for both segmenting molecular structures from documents and for predicting chemical structures from the segmented images. This deep learning-based approach does not require any handcrafted features, is learned directly from data, and is robust against variations in image quality and style. Using the deep learning approach described herein we show that it is possible to perform well on both segmentation and prediction of low-resolution images containing moderately sized molecules found in journal articles and patents.
1. Introduction

For drug discovery projects to be successful, it is often critical that newly available data are quickly processed and assimilated through high quality curation. Furthermore, an important initial step in developing a new therapeutic includes the collection, analysis, and utilization of previously published data. This is particularly true for small-molecule drug discovery where collections of experimentally tested molecules are used in virtual screening programs, quantitative structure activity/property relationship (QSAR/QSPR) analyses, or validation of physics-based modeling approaches. Due to the difficulty and expense of generating large quantities of
experimental data, many drug discovery projects are forced to rely on a relatively small pool of in-house experimental data, which in turn may result in data volume as a limiting factor in improving in-house QSAR/QSPR models. One promising solution to the widespread lack of appropriate training
set data in drug discovery is the amount of data
currently being published.1 Medline reports more than 2000 new life science papers published per day2, and this estimate does not include other literature indexes or patents that further add to the volume of newly published data. Given this rate at which experimental
data is entering the public literature, it is increasingly important to address issues related to data
extraction and curation, and to automate these processes to the greatest extent possible. One such area of data curation in life sciences that continues to be difficult and time consuming is the extraction of chemical structures from publicly available sources such as journal articles and patent filings. Most publications containing data related to small molecules in drug
discovery do not provide the molecular structures in a
computer readable format (e.g., SMILES, connection table, etc.). Authors often use computer programs to draw structures and then incorporate those structures into publications via raster images. Publishing documents with only images of structures
necessitates the manual redrawing of the structures in chemical sketching software as a means of converting the structures into computer readable formats for use in downstream computation and analysis. Redrawing chemical structures can be time consuming and
often requires domain knowledge to adequately interpret ambiguities, resolve variations in style, and decide how annotations should be included or ignored. Solutions for automatic structure recognition have been described previously.3-9 These methods utilize sophisticated
rules that perform well in many situations, but can experience degradation
in output quality under commonly encountered
conditions, especially when input resolution is low or image quality is
poor. One of the challenges to improving current extraction rates is that rule-based systems are necessarily
highly interdependent and complex, making further improvements difficult. Furthermore, rule-based approaches can be demanding to build and maintain because they require significant domain expertise and necessitate contributors to anticipate and codify rules for all potential scenarios the system might encounter. Developing
hand-coded rules is particularly difficult in
chemical structure extraction where a wide variety of styles and annotations
are used, and input quality is not always
consistent. The goal of this paper is twofold: 1) demonstrate it is possible to develop an extraction method to go from input
document to SMILES without requiring the implementation of hand-coded
rules or features; and 2) further demonstrate it is
possible to improve prediction accuracy on low quality images using such a system. Deep learning and other data-driven technologies are becoming increasingly widespread in life sciences, particularly in drug discovery and development.10-13 In this paper we leverage recent advances in image processing, sequence generation, and computing over latent representations of chemical structures to predict SMILES for molecular structure images. The method reported here takes
an image or PDF and performs segmentation using a
convolutional neural network. SMILES are then generated using a convolutional neural network in combination with a recurrent neural network in an encoder-decoder architecture (meaning, the network computes SMILES directly in an end-to-end fashion from raw images using a fully differentiable neural network). We report results based on preliminary findings using our deep learning-based method and provide suggestions for potential improvements in future iterations. In this work, we focus on performing both the
segmentation and structure prediction steps at low resolutions that, to our knowledge, have not before been
attempted successfully in the literature. Using a downsampled version of a published dataset, we show that our deep learning
method performs well under low quality conditions, and may
operate on raw image data without hand-coded rules or features.

2. Related Work

Automatic chemical structure extraction is not a new idea. Park et al.3, McDaniel et al.5, Sadawi et al.6, Valko and Johnson7, and Filippov and Nicklaus8 each utilize various combinations of image processing techniques, optical character recognition, and hand-coded rules to identify lines and characters in a page, then assemble
these components into molecular connection tables.
Similarly, Frasconi et al. utilize low level image processing techniques to identify molecular components but rely on Markov logic to assemble the components into complete structures.9 Park et
al. demonstrated the benefits of ensembling several
recognition systems together in a single framework for improved recognition rates.4 Currently available methods rely on low-level image processing techniques (edge detectors, vectorization, etc.) in combination with subcomponent recognition (character
and bond detection) and high-level rules that arrange recognized components into their corresponding structures. There
are continuing challenges, however, that limit the
usefulness of currently available methods. As discussed by Valko and Johnson, there are many situations in the literature where designing
specific rules to handle inputs becomes quite challenging.7 Some of these include wavy bonds, lines that
overlap (e.g., bridges), and ambiguous atom labels. Apart from complex, ambiguous, or uncommon representations, there are other challenges
that currently impact performance, including low resolution or noisy images. Currently available solutions require relatively high-resolution input, e.g., 150+ dpi5,8,14,
and tolerate only small amounts of noise. Furthermore, rule-based systems can be difficult to improve due to the complexity and interconnectedness of
the various recognition components.
Changing a heuristic in one area of the algorithm can impact and require adjustment in another area, making it difficult to improve components to fit new data while simultaneously
maintaining or improving generalizability of the overall system. Apart from accuracy of structure prediction, filtering of false positives during the structure extraction process also remains problematic. Current solutions rely on users to manually filter out
results, choosing which predicted structures should be
ignored if tables, figures, etc. are predicted to be molecules. In order to both improve extraction and prediction of structures in a wide variety of source materials, particularly with noisy or
low quality input images, it is important to find alternatives to rule-based systems. The deep learning-based
method outlined herein provides i) improved accuracy for poor quality
images and ii) a built-in mechanism for improvement
through the addition of new training sets.
3. Deep Learning Method

The deep learning model architectures described in this work are depicted in Figure 1. The chemical structure segmentation model identifies and extracts structure images from input documents, and the
structure prediction model generates a computer-readable SMILES string for each extracted chemical structure image.
Figure 1. In subpanel A we depict the segmentation model architecture and in subpanel B we depict the structure
prediction model architecture. For brevity, similar layers are condensed
with a multiplier indicating how many layers are
chained together, e.g., “2X”. All convolution (conv) layers are followed by a parametric ReLU activation function. “Pred Char”
represents the character predicted at a particular step and
“Prev Char” stands for the predicted character at the previous step. Computation flow through the diagrams is left to right (and top to bottom in B).
3.1. Segmentation

When presented with a document containing chemical structures, the initial step in the extraction pipeline is identifying structures by segmenting the document, separating the molecules from the rest of the input. Successful segmentation is important to i) provide the cleanest possible input for accurate sequence prediction, and ii) exclude objects that are not structures but contain similar features, e.g., charts, graphs, logos, and annotations. Segmentation herein utilizes a deep convolutional neural network to predict which pixels in the input image are likely to be part of a chemical structure. In designing the segmentation model we followed the “U-Net” design strategy outlined in Ronneberger et al.15 which is especially well suited for full-resolution detection (the neural network output at the top of the network) and enables fine-grained segmentation of structures in the experiments reported here. The U-Net supports full-resolution segmentation by convolving (with pooling) the input to obtain a latent representation, then upsampling the latent representation using deconvolution with skip-connections until the output is at a resolution that
matches that of the input. The outputs generated at the top of the
network
are
then
scaled
using
a
softmax
activation
(to
normalize the outputs between 0-1.0) and provide a predicted probability for each pixel, identifying the likelihood pixels belong to a structure. The
predicted
resolution
as
pixels the
form
masks
generated
original
input
images
sufficiently fine-grained
extraction.
at
and
The masks
the
same
allow
for
obtained from
the segmentation model are binarized to remove low confidence pixels, and contiguous areas of pixels that are not large enough to contain a structure are removed. To remove areas too small to be structures, we count the number of pixels in a contiguous area and deem the area a non-structure if the number of pixels is
below
entities pixels)
a (a
in
threshold single, the
(8050
pixels
contiguous
refined
masks
at
group are
of
300
dpi).
positively
assumed
to
Individual predicted
contain
single
structures and are used to crop structures from the original inputs,
resulting
in
a
collection
of
individual
structure
images. During testing we observed qualitatively better masks when generating several masks at different resolutions and averaging the
masks
together
into
a
final
mask
used
to
crop
out
ACS Paragon Plus Environment
10
Page 11 of 50 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
structures. In the experiments reported herein, averaged masks were obtained by scaling inputs to each resolution within the range 30 to 60 dpi in increments of 3 dpi, then generating masks for each image using the segmentation model. The resulting masks were
scaled to the same resolution (60 dpi) and averaged together. Finally, the averaged masks were scaled to the original input resolution (usually 300 dpi) and then used to crop out individual structures. Figure 2 shows an example
journal article page along with its predicted mask. The inputs used in all experiments were first preprocessed to be grayscale and downsampled to approximately 60 dpi resolution.31 We found 60 dpi input resolution to be sufficient for segmentation while providing a significant speed improvement versus higher resolutions. We also tested the removal of long, straight
horizontal and vertical lines in input images using the Hough transform.16 Line removal improved mask quality in many cases, especially in tables where structures were very close to grid lines, and was included in the final model.
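The mask post-processing just described (thresholding, removal of connected regions too small to contain a structure, and multi-resolution averaging) can be sketched as follows. This is a minimal illustration under our own naming, not the authors' code; it assumes a `model` callable that maps a grayscale page array to a per-pixel probability mask, and the threshold values are the ones quoted in the text.

```python
import numpy as np
from PIL import Image
from scipy import ndimage

def averaged_mask(page: Image.Image, model, dpis=range(30, 63, 3), base_dpi=300):
    """Predict masks at several low resolutions and average them."""
    common = (int(page.width * 60 / base_dpi), int(page.height * 60 / base_dpi))
    masks = []
    for dpi in dpis:
        scale = dpi / base_dpi
        size = (int(page.width * scale), int(page.height * scale))
        mask = model(np.asarray(page.resize(size, Image.BILINEAR), dtype=np.float32))
        # Rescale each predicted mask to a common 60 dpi grid before averaging.
        masks.append(np.asarray(Image.fromarray(mask).resize(common, Image.BILINEAR)))
    avg = np.mean(masks, axis=0).astype(np.float32)
    # Scale the averaged mask back to the original (e.g., 300 dpi) resolution.
    return np.asarray(Image.fromarray(avg).resize((page.width, page.height), Image.BILINEAR))

def crop_structures(page: Image.Image, mask, threshold=0.5, min_pixels=8050):
    """Binarize, drop small connected components, and crop one box per region."""
    labels, _ = ndimage.label(mask > threshold)   # connected-component labeling
    crops = []
    for i, region in enumerate(ndimage.find_objects(labels), start=1):
        if region is None:
            continue
        if int(np.sum(labels[region] == i)) >= min_pixels:  # reject tiny regions
            ys, xs = region
            crops.append(page.crop((xs.start, ys.start, xs.stop, ys.stop)))
    return crops
```

Line removal via the Hough transform, where used, would be applied to the input page before prediction; only the connected-component filtering and averaging above are essential to the pipeline.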
Figure 2: An example showing the input PDF (left) and output of the segmentation model (right) when processing a journal article page from Salunke et al.17 All of the text and other extraneous items are completely removed with the exception of a few faint lines that are amenable to automated post processing.

3.2. Structure Prediction

Images of individual structures obtained using the segmentation model are automatically transcribed into the corresponding SMILES sequences representing the contained structures using
another deep neural network. The purpose of this network is to take
an image of a single structure and, in an end-to-end fashion, predict the corresponding SMILES string of the structure contained in the image. The network comprises an
encoder-decoder strategy where structure images are first
encoded into a fixed-length latent space (state vector) using a convolutional neural network and then decoded into a sequence of characters using a recurrent neural network. The convolutional network consists of alternating layers of 5x5 convolutions, 2x2 max-pooling, and a parameterized ReLU activation function18, with the overall network conceptually similar to the design outlined in
Krizhevsky et al.19 but without the final classification layer. To help mitigate issues in which small but important
features are lost during encoding, our network architecture does not utilize any pooling method in the first few layers of the network. The state vector obtained from the convolutional encoder is passed into a decoder to generate a SMILES sequence. The decoder consists of an input projection layer, three layers of GridLSTM cells20, an
attention mechanism (implemented similarly to the
soft attention method described in Xu et al.21 and Bahdanau et al.22, and the global attention method in Luong et al.23), and an output projection layer. The state vector from the encoder is used
to initialize the GridLSTM cell states and the SMILES
sequence is then generated a character at a time, similar to the decoding
method described in Sutskever et al.24 (wherein sentences were generated a word at a time while translating
English to French). Decoding is started by projecting a special
start token into the GridLSTM (initialized by the encoder and conditioned
on an initial context vector as computed by the
attention mechanism), processing this input in the cell, and predicting
the first character of the output sequence. Subsequent characters are produced similarly, with each
prediction conditioned on the previous cell state, the current attention,
and the previous output projected back into the
network. The logits vector for each character produced by the network is of length N, where N is the number of available characters (65 characters in this case). A softmax activation is applied to the logits to compute a probability distribution over characters, and the highest scoring character is selected for a particular step in the sequence. Sequences are generated until a special end-of-sequence token is predicted, at which point the completed SMILES string is returned. During testing we predict images at several different (low) resolutions and return the sequence
of highest confidence, which is determined by multiplying together the softmax output of each predicted character in the sequence. Additional architecture details for both the segmentation and structure prediction models are included in the supplemental information.

The addition of an attention mechanism in the decoder helps solve several challenges. Most importantly, attention enables the decoder to access information produced earlier in the encoder and minimizes the loss of important details that may otherwise be overly compressed when encoding the state vector. Additionally, attention enables the decoder to reference information closer to the raw input during the prediction of each character, which is important considering the significance of individual pixels in low resolution structure images. See Figure 3 for an example of the computed attention and how the output corresponds to various characters recognized during the decoding process. Apart from using attention for improved performance, the attention output is useful for repositioning a predicted structure into an orientation that better matches the original input image. This is done by converting the SMILES into a connection table using the open source Indigo toolkit25, and repositioning each atom in 2D space according to the coordinates of each character's computed attention. The repositioned
structure then more closely matches the original positioning and orientation in the input image, enabling users to more easily identify and correct mistakes when comparing the output with the original source.
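The greedy character-by-character decoding and the multiplied-softmax confidence score described above can be summarized in a short sketch. The `encoder` and `decoder_step` callables and the token constants below are hypothetical stand-ins for the trained networks, not the authors' API:

```python
import numpy as np

START, EOS, MAX_LEN = 0, 1, 100   # special tokens and the 100-character limit

def predict_smiles(image, encoder, decoder_step, id_to_char):
    state = encoder(image)                        # fixed-length latent state vector
    token, chars, confidence = START, [], 1.0
    for _ in range(MAX_LEN):
        probs, state = decoder_step(token, state)  # softmax over the 65 characters
        token = int(np.argmax(probs))              # highest-scoring character
        if token == EOS:
            break
        confidence *= float(probs[token])          # multiply per-character softmax outputs
        chars.append(id_to_char[token])
    return "".join(chars), confidence

def best_over_resolutions(images, encoder, decoder_step, id_to_char):
    # Predict at several low resolutions and keep the highest-confidence sequence.
    results = [predict_smiles(im, encoder, decoder_step, id_to_char) for im in images]
    return max(results, key=lambda r: r[1])
```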
Figure 3. Heatmaps depicting the computed attention for several sequence generation steps are shown. Below the heatmaps, the full predicted SMILES sequence for the molecule is given with the
characters corresponding to the images highlighted. The highlighted characters read left to right in the SMILES
correspond to the images shown in the order left to right, top to bottom. The complete encoder-decoder framework is fully differentiable and
is
trained
backpropagation,
end-to-end
enabling
SMILES
using to
a be
suitable fully
form
generated
of
using
only raw images as input. During decoding, SMILES are generated a character at a time, from left to right. Additionally, no
external dictionary is used for chemical abbreviations (superatoms); rather, these are learned as part of the model, thus images may contain superatoms and the SMILES are still generated a character at a time. The model operates on raw
images and directly generates chemically valid SMILES with no explicit subcomponent recognition required.

4. Datasets

4.1. Segmentation Dataset

To our knowledge, no dataset addressing molecular structure segmentation has been published. To provide sufficient data for training
a neural network while minimizing the manual effort required to curate such a dataset, we developed a pipeline for automatically generating segmentation data. To generate the dataset, in summary, the following steps were performed: i)
removed structures from journal and patent pages, ii) overlaid structures onto the pages, iii) produced a ground truth mask identifying the overlaid structures, and iv) randomly cropped images
from the pages containing structures and the
corresponding mask. In detail, OSRA8 was utilized to identify bounding boxes of
candidate molecules within the pages of a
large number of documents, both published journal articles and patents. The regions expected to contain molecules were whited out, thus leaving pages without molecules. OSRA is not always correct in finding structures and occasionally non-structures
(e.g., charts) were removed suggesting that cleaner input may further improve model performance. Next, molecule images from the Indigo OS X and USPTO datasets (described in more detail later) were randomly overlaid onto the pages while ensuring that no structures overlapped with any non-white pixels. Structure images were occasionally perturbed using affine transformations, changes
in background shade, and/or lines added around the
structure (to simulate table grid lines). We also generated the true mask for each overlaid page; these masks were zero-valued except where pixels were part of a molecule (pixels assigned a value of 1). During training, samples of 128x128 pixels were randomly cropped from the overlaid pages and masks, and arranged into mini-batches for feeding into the network; example image-mask pairs are shown in Figure 4.
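The core of the overlay step — placing structure images only over white pixels and emitting the matching binary mask, then cropping random 128x128 training windows — can be sketched as below. Names and the retry-based placement strategy are illustrative assumptions, not the authors' implementation:

```python
import random
import numpy as np

def overlay_example(page, structures, crop=128, tries=50):
    """page, structures: 2D uint8 arrays with white (255) backgrounds."""
    page = page.copy()
    mask = np.zeros(page.shape, dtype=np.uint8)
    for struct in structures:
        h, w = struct.shape
        for _ in range(tries):
            y = random.randrange(0, page.shape[0] - h)
            x = random.randrange(0, page.shape[1] - w)
            region = page[y:y + h, x:x + w]
            if (region == 255).all():                   # only place over white pixels
                region[struct < 255] = struct[struct < 255]
                mask[y:y + h, x:x + w][struct < 255] = 1  # ground-truth mask
                break
    # Random 128x128 crop for one training sample.
    cy = random.randrange(0, page.shape[0] - crop)
    cx = random.randrange(0, page.shape[1] - crop)
    return page[cy:cy + crop, cx:cx + crop], mask[cy:cy + crop, cx:cx + crop]
```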
Figure 4. Two examples from the programmatically generated
segmentation dataset. Inputs to the network are shown on the left with the corresponding masks used in training shown on the right. In the mask, white indicates pixels which are part of a chemical structure.
4.2. Molecular Image Dataset

An important goal of our work is to improve recognition of low resolution or poor quality images. Utilizing training data that is too high quality or too clean could negatively impact the generalizability
of the final model. Arguably, the network
should be capable of becoming invariant to quality when trained explicitly on both high and low quality examples, at the expense of more computation and likely more data. However, we opted to handle image quality consistently by considering all images as low quality images regardless of original input resolution. This was done by scaling all inputs down considerably so that all images are low resolution when fed to the model. To illustrate the impact of scaling images of molecular structures, consider two structures in Figure 5 that are chemically identical but presented with different levels of quality. The top-left image is of fairly high quality apart from some perforation throughout the image. In the top-right image the perforation is much more pronounced with some bonds no longer continuous (small breaks due to the excessive noise). When these images are downsampled significantly using bilinear interpolation, the
images appear similar and it is hard to differentiate which of the two began as a lower quality image, apart from one being darker than the other.
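The downsampling used throughout is plain bilinear interpolation; a minimal sketch (scale factor and helper name are our own choices for illustration):

```python
from PIL import Image

def downsample(path, scale=0.15):
    """Load a structure image, convert to grayscale, and shrink it bilinearly."""
    img = Image.open(path).convert("L")
    size = (max(1, int(img.width * scale)), max(1, int(img.height * scale)))
    return img.resize(size, Image.BILINEAR)
```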
Figure 5. Noisy images of the same structure from Bonanomi et al.27 The top two images are the originals with their downsampled versions
shown below. Image quality appears similar when
downsampled considerably.

Chemical structures vary significantly in size; some being small fragments with just a few atoms, or very large structures including natural products or peptide sequences. This necessitated using an image size that was not too computationally intensive to train, but large enough to fully fit reasonably sized structures, i.e., drug-like small
molecules, into the image. Although structures themselves can be large the individual atoms and their relative connectivity may be contained within a small number of pixels regardless of the size
of the overall structure. Thus, scaling an image too
aggressively will result in important information being lost. To train a neural network that can work with both low and high quality images, we utilized an image size of 256x256 and scaled
images to fit within this size constraint (bond lengths in
approximately the 3-12 pixels range). Training a neural network over
higher resolution images is an interesting research
direction and may improve results, but was here left for future work. In
designing datasets to train and test our method we
considered both the chemistry space and the image styles that were of interest to maximize general application of the final model at the end of training. To ensure sufficient variety in molecular images, both in terms of the breadth of chemical space and in the various styles portrayed, we curated three separate datasets,
each sampled uniformly during training. The first
dataset consisted of a 57 million molecule subset of molecules available in the PubChem28 database, with the subset selected using several filters, described later. The molecules obtained from Pubchem were initially in InChI format and were converted into SMILES using Indigo, with images then rendered from the SMILES
using Indigo and its various style options available
(bond thicknesses, character sizes, etc.). Reasonable parameter ranges were chosen for each style option and style parameters were randomly selected from the ranges for each image during rendering
(see Table S1 in the supplemental information for the
option ranges used). This dataset is referred to throughout the text as the Indigo dataset.
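Rendering an image from a SMILES string with Indigo, with a couple of style options randomized per image, looks roughly as follows. The two options and their ranges shown here are examples only; the actual option set and ranges are those in Table S1:

```python
import random
from indigo import Indigo
from indigo.renderer import IndigoRenderer

indigo = Indigo()
renderer = IndigoRenderer(indigo)
indigo.setOption("render-output-format", "png")

def render(smiles, path):
    mol = indigo.loadMolecule(smiles)
    # Randomize style options per image (illustrative choices and ranges).
    indigo.setOption("render-bond-length", random.uniform(20, 40))
    indigo.setOption("render-relative-thickness", random.uniform(0.8, 1.5))
    renderer.renderToFile(mol, path)

render("CC(=O)Oc1ccccc1C(=O)O", "example.png")
```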
The second dataset (referred to in our text as the Indigo OS X dataset) was prepared similarly to the first dataset and consisted of Pubchem compounds rendered using Indigo. We observed that Indigo rendering output can vary significantly
significantly
between operating systems, so to ensure that we captured these stylistic differences, we randomly sampled with replacement 10 million
of
the
prepared
SMILES
from
the
first
dataset
and
rendered additional images for these SMILES using Indigo but in OS X (versus Linux in the first dataset). Although sampling with replacement may allow for some SMILES to be sampled more than once, this is not an issue because each instance of the SMILES was rendered using a new set of randomly sampled style options, as
described
earlier.
The
inclusion
of
these
images
helped
supplement training with additional image styling. The
third dataset (USPTO dataset) consisted of 1.7 million
image/molecule pairs curated from data made publicly available by the United States Patent and Trademark Office (USPTO).26 Many of these images contained extraneous text and labels and were preprocessed to remove non-structure elements before training. The original USPTO data provided molecules in MOL format and these were converted into SMILES using Indigo. For some files in the USPTO dataset, we observed that Indigo did not correctly retain stereochemistry when converting the MOL files into SMILES
which resulted in some SMILES having incorrect stereochemistry; results may improve with cleaner data. Although our goal when curating data was to capture sufficient breadth of chemical space in our datasets for the learned model to be useful, it is important to note that chemical space is extremely
vast. In order to make learning a model in our
experiments feasible and to ensure data were in a consistent format across datasets we chose to focus on drug-like molecules and imposed the following restrictions when filtering compounds and when preparing all three training datasets:
- Structures with a SMILES length of 21-100 characters (a range that covers most drug-like small molecules) were included, with all others removed.
- Attachment placeholders of the format R1, R2, R3, etc., were included, but other forms were not.
- Enumeration fragments were not predicted or attached to the parent structure.
- Salts were removed and each image was assumed to contain only one structure.
- All molecules were converted to SMILES using Indigo.
- All SMILES were kekulized and canonicalized using Indigo.
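A sketch of these filters using the Indigo API follows; the salt-stripping heuristic (keep the largest connected component) and the decision to apply the length test to the final canonical SMILES are our assumptions about details the text leaves open:

```python
from indigo import Indigo

indigo = Indigo()

def prepare(record):
    """Return a filtered, kekulized, canonical SMILES, or None to reject."""
    mol = indigo.loadMolecule(record)
    if mol.countComponents() > 1:   # salt removal: keep largest component (assumption)
        mol = max((mol.component(i).clone() for i in range(mol.countComponents())),
                  key=lambda c: c.countAtoms())
    mol.dearomatize()               # kekulize
    smiles = mol.canonicalSmiles()  # canonicalize
    if not 21 <= len(smiles) <= 100:  # drug-like length window
        return None
    return smiles
```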
Attachment placeholders of the form R1, R2, R3, etc. are not natively supported in SMILES strings but often show up in images. While SMILES can include an asterisk in place of atoms to indicate that the location is a wildcard, we desired to support more granular prediction by retaining the numbers that are displayed in images. To do this we utilized Indigo's ability to render custom text for wildcard atoms and randomly replaced atoms with "Ri", where i is incremented each time an R-group is added; terminal atoms were replaced with 5% probability and non-terminal atoms were replaced only 1% of the time. The asterisks in the resulting SMILES strings were replaced with the R-groups in brackets. For example, the SMILES string for ethane, "CC", might become [R1]CC[R2]. Although predicted SMILES that contain this added R-group notation may not be truly valid SMILES, the added groups are easily detectable and reversible. The R-groups can be replaced with asterisks to recover the predicted molecule in valid SMILES format while the attachment annotation at each R-group location is retained for reference. Since the USPTO data already provided rendered images, we only randomly included R-groups in the Indigo and Indigo OS X datasets, wherein SMILES containing the added R-groups could then be rendered using Indigo.
Indigo. The prepared datasets were split into training and validation portions,
utilized
for
training
the
model
and
periodically
ACS Paragon Plus Environment
24
Page 25 of 50 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
testing the performance during training, respectively. For the Indigo dataset (containing the 57 million molecules), 90% of the dataset
was
utilized
for
training
and
10%
was
reserved
for
validation. Similarly, the USPTO dataset was split into training and validation portions; 75% of the set was used to train with 25% reserved for validation. We reserved a larger portion of the USPTO dataset for validation because the total dataset size is significantly smaller than the Indigo dataset and we desired to ensure
a
sufficiently
large
set
for
adequate
validation.
To
simplify our training and validation metrics, we opted to only include the Indigo OS X dataset during training and did not reserve a portion of this dataset for validation and do not report any metrics for this dataset.
Apart from the training and validation sets just described, two additional
test
sets
were
utilized
to
evaluate
method
performance. The first was the dataset published within Valko and Johnson7 (hereafter called “Valko dataset”) which consists of 454
images
of
molecules
cropped
from
literature.
The
Valko
dataset is interesting because it contains complicated molecules with challenging features such as bridges, stereochemistry, and a
variety
proprietary published
of
superatoms.
collection articles
The
of
and
5
760
second
dataset
image-SMILES
patents.
The
consisted pairs
molecules
of
a
from
47
in
the
ACS Paragon Plus Environment
25
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
proprietary contained
dataset small
were
drug-like
amounts
of
and
Page 26 of 50
some
extraneous
of
the
images
artifacts,
e.g.,
surrounding text, compound labels, lines from enclosing table, etc. and was used to evaluate overall method effectiveness in examples extracted for use in real drug discovery projects. With the focus of this work being on low quality/resolution images, rather than predicting high resolution images, we tested our method on downsampled versions of the Valko and proprietary datasets. During the segmentation phase each Valko dataset image was downsampled to 5-10% of its original size, and during the sequence prediction phase, images were downsampled to 10-22.5%, with similar scaling performed on the proprietary dataset. These scale
ranges were chosen so that the resolution used during
prediction approximately matched the (low) resolutions of the images used during training. A list of 65 characters containing all the unique characters in the Indigo dataset was assembled. These characters served as the list of available characters that can be selected at each SMILES decoding step. Four of these characters were special tokens not used in valid SMILES notation and were used to indicate the beginning or end of a sequence, replace unknown characters, or used to pad sequences that were shorter than a maximum length. During training and testing 100 characters were generated for
each input and any characters generated after the end-of-
sequence token were ignored.

5. Training

The segmentation model had 380,000 parameters and was trained on
batches of 64 images (128x128 pixels in size). In our
experiments, training converged after 650,000 steps and took 4 days to complete on a single GPU. The sequencing model had 46.3 million parameters
and was trained on batches of 128 images
(256x256 pixels in size). During training, images were randomly affine
transformed,
Augmenting
the
transformations
brightness
dataset ensured
scaled,
while
that
the
and/or
training model
would
binarized.
using not
random
become
too
reliant on styles either generated by Indigo or seen in the patent images. In our experiments, training converged after 1 million training steps (26 days to train on 8 GPUs). Both models were constructed using TensorFlow29 and were trained using the Adam optimizer30 on NVIDIA Pascal GPUs. 6. Results and Discussion During training, metrics were tracked for performance on both the
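The three augmentations named above can be sketched in a few lines. This is a simplified NumPy/Pillow illustration (the authors' pipeline is in TensorFlow; the application probabilities and parameter ranges below are assumptions):

```python
import random
import numpy as np
from PIL import Image

def augment(img: Image.Image) -> Image.Image:
    """Randomly apply an affine warp, brightness scaling, and/or binarization."""
    if random.random() < 0.5:                     # small random affine shear
        a = random.uniform(-0.1, 0.1)
        img = img.transform(img.size, Image.AFFINE, (1, a, 0, a, 1, 0),
                            resample=Image.BILINEAR, fillcolor=255)
    arr = np.asarray(img, dtype=np.float32)
    if random.random() < 0.5:                     # brightness scaling
        arr = np.clip(arr * random.uniform(0.7, 1.3), 0, 255)
    if random.random() < 0.5:                     # binarization
        arr = np.where(arr > 128, 255, 0)
    return Image.fromarray(arr.astype(np.uint8))
```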
Indigo
and
USPTO
validation
datasets
where
only
SMILES
strings that were exactly equivalent to the ground truth SMILES were
considered
correct.
We
observed
no
apparent
overfitting
over the Indigo dataset during training but did experience some overfitting over USPTO data (Figure 6). Due to the large size of
ACS Paragon Plus Environment
27
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 28 of 50
the Indigo dataset (52 million examples used during training) and the many rendering styles available in Indigo it is not surprising
that
the
model
did
not
experience
any
apparent
overfitting on Indigo data. Conversely, the USPTO set is much smaller (1.27 million examples used during training) with each example
sampled
much
more
frequently
(approximately
40
times
more often), increasing the risk of overfitting.
Figure 6. Training and validation curves for both the Indigo and USPTO datasets are here depicted. We tested the trained models on the Valko and proprietary test sets
using
the
full
segmentation
and
sequence
generation
pipeline described above with accuracies reported in Table 1. No SMILES were provided with the Valko dataset and only images were made available. Unlike the validation sets, the test sets are quite
small
which
made
it
possible
to
conduct
evaluation
visually rather than comparing SMILES strings. Each prediction
ACS Paragon Plus Environment
28
Page 29 of 50 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
was compared visually with the input images and for a result to have
contributed
to
accuracy
it
must
have
been
chemically
identical to the ground truth, including stereochemistry, and any error resulted in a structure being deemed incorrect. Also, because the segmentation step was applied, structure images that were incomplete after segmentation or if any false positives were also extracted, the structure was deemed incorrect. For the proprietary
dataset,
a
combination
of
automated
methods
and
visual inspection was utilized to ensure correctness, with the same requirement of exact chemical equivalence applied. Despite the
method
requiring
low
resolution
inputs,
accuracy
was
generally high across the datasets. Additionally, accuracy for the validation sets and the proprietary set were all similar (77-83%) indicating that the training sets used in developing the method reasonably approximate data useful in actual drug discovery projects as represented by the proprietary test set. Table 1. Overall performance against validation and test sets. Dataset
Accuracy
Indigo Validation
82%
USPTO Validation
77%
Valko Dataset
41%
Proprietary Dataset
83%
ACS Paragon Plus Environment
29
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 30 of 50
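The exact-equivalence criterion used in these evaluations can be approximated programmatically by comparing canonical SMILES; a minimal sketch of such an automated check using the Indigo API (an approximation of the visual comparison described above, not the authors' evaluation code):

```python
from indigo import Indigo

indigo = Indigo()

def canonical(smiles):
    mol = indigo.loadMolecule(smiles)
    mol.dearomatize()               # kekulize before canonicalizing
    return mol.canonicalSmiles()

def equivalent(predicted, truth):
    try:
        return canonical(predicted) == canonical(truth)
    except Exception:               # an unparsable prediction counts as incorrect
        return False
```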
On the Valko dataset the method achieved an accuracy of 41% over the full, downsampled dataset, which is significantly lower than the accuracies observed in the other datasets. The decrease in performance was likely due to the higher rate of certain challenging features seen less frequently in the other datasets, including
superatoms. Superatoms were the single largest contributor to prediction errors in the Valko dataset, with 21% of samples containing one or more incorrectly predicted superatoms. In our training sets, superatoms were only included in the USPTO dataset and were not generated as part of the Indigo dataset, resulting in a low rate of inclusion during training (6.6% of total images seen contained some superatom, with most superatoms included at a rate of