
Molecular Structure Extraction From Documents Using Deep Learning
Joshua Staker, Kyle Marshall, Robert Abel, and Carolyn Mae McQuaw
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00669 • Publication Date (Web): 13 Feb 2019
Downloaded from http://pubs.acs.org on February 14, 2019




Journal of Chemical Information and Modeling

Molecular Structure Extraction From Documents Using Deep Learning

Joshua Staker†, Kyle Marshall†, Robert Abel‡, Carolyn M. McQuaw†
†Schrödinger, Inc., 101 SW Main Street, Portland, Oregon 97204, United States
‡Schrödinger, Inc., 120 West 45th Street, New York, New York 10036, United States

Chemical structure extraction from documents remains a hard problem due to both false positive identification of structures during segmentation and errors in the predicted structures. Current approaches rely on handcrafted rules and subroutines that perform reasonably well generally, but still routinely encounter situations where recognition rates are not yet satisfactory and systematic improvement is challenging. Complications impacting the performance of current approaches include the diversity in visual styles used by various software to render structures, the frequent use of ad hoc annotations, and other challenges related to image quality, including resolution and noise. We present end-to-end deep learning solutions for both segmenting molecular structures from documents and for predicting chemical structures from the segmented images. This deep learning-based approach does not require any handcrafted features, is learned directly from data, and is robust against variations in image quality and style. Using the deep learning approach described herein, we show that it is possible to perform well on both segmentation and prediction of low-resolution images containing moderately sized molecules found in journal articles and patents.

1. Introduction

For drug discovery projects to be successful, it is often critical that newly available data are quickly processed and assimilated through high quality curation. Furthermore, an important initial step in developing a new therapeutic includes the collection, analysis, and utilization of previously published data. This is particularly true for small-molecule drug discovery, where collections of experimentally tested molecules are used in virtual screening programs, quantitative structure activity/property relationship (QSAR/QSPR) analyses, or validation of physics-based modeling approaches. Due to the


difficulty and expense of generating large quantities of experimental data, many drug discovery projects are forced to rely on a relatively small pool of in-house experimental data, which in turn may result in data volume as a limiting factor in improving in-house QSAR/QSPR models. One promising solution to the widespread lack of appropriate training set data in drug discovery is the amount of data currently being published.1 Medline reports more than 2000 new life science papers published per day2, and this estimate does not include other literature indexes or patents that further add to the volume of newly published data. Given this rate at which experimental

data is entering the public literature, it is increasingly important to address issues related to data extraction and curation, and to automate these processes to the greatest extent possible. One such area of data curation in life sciences that continues to be difficult and time consuming is the extraction of chemical structures from publicly available sources such as journal articles and patent filings. Most publications containing data related to small molecules in drug

discovery do not provide the molecular structures in a computer readable format (e.g., SMILES, connection table, etc.). Authors often use computer programs to draw structures and then incorporate those structures into publications via raster images. Publishing documents with only images of structures


necessitates the manual redrawing of the structures in chemical sketching software as a means of converting the structures into computer readable formats for use in downstream computation and analysis. Redrawing chemical structures can be time consuming and often requires domain knowledge to adequately interpret ambiguities, resolve variations in style, and decide how annotations should be included or ignored. Solutions for automatic structure recognition have been described previously.3-9 These methods utilize sophisticated

rules that perform well in many situations, but can experience degradation

in output quality under commonly encountered conditions, especially when input resolution is low or image quality is poor. One of the challenges to improving extraction rates is that current rule-based systems are necessarily

highly interdependent and complex, making further improvements difficult. Furthermore, rule-based approaches can be demanding to build and maintain because they require significant domain expertise and necessitate contributors to anticipate and codify rules for all potential scenarios the system might encounter. Developing

hand-coded rules is particularly difficult in chemical structure extraction where a wide variety of styles and annotations are used, and input quality is not always

consistent. The goal of this paper is twofold: 1) demonstrate it is possible to develop an extraction method to go from input


document to SMILES without requiring the implementation of hand-coded rules or features; and 2) further demonstrate it is possible to improve prediction accuracy on low quality images using such a system. Deep learning and other data-driven technologies are becoming increasingly widespread in life sciences, particularly in drug discovery and development.10-13 In this paper we leverage recent advances in image processing, sequence generation, and computing over latent representations of chemical structures to predict SMILES for molecular structure images. The method reported here takes

an image or PDF and performs segmentation using a convolutional neural network. SMILES are then generated using a convolutional neural network in combination with a recurrent neural network in an end-to-end (encoder-decoder) architecture (meaning, the network computes SMILES directly from raw images in a fully differentiable fashion). We report results based on preliminary findings using our deep learning-based method and provide suggestions for potential improvements in future iterations. In this work, we focus on performing both the

segmentation and structure prediction steps at low resolutions that, to our knowledge, have not before been attempted successfully in the literature. Using a downsampled version of a published dataset, we show that our deep learning


method performs well under low quality conditions, and may operate on raw image data without hand-coded rules or features.

2. Related Work

Automatic chemical structure extraction is not a new idea. Park et al.3, McDaniel et al.5, Sadawi et al.6, Valko and Johnson7, and Filippov and Nicklaus8 each utilize various combinations of image processing techniques, optical character recognition, and hand-coded rules to identify lines and characters in a page, then assemble

these components into molecular connection tables.

Similarly, Frasconi et al. utilize low level image processing techniques to identify molecular components but rely on Markov logic to assemble the components into complete structures.9 Park et

al. demonstrated the benefits of ensembling several recognition systems together in a single framework for improved recognition rates.4 Currently available methods rely on low-level image processing techniques (edge detectors, vectorization, etc.) in combination with subcomponent recognition (character and bond detection) and high-level rules that arrange recognized components into their corresponding structures. There

are continuing challenges, however, that limit the

usefulness of currently available methods. As discussed by Valko and Johnson, there are many situations in the literature where designing

specific rules to handle inputs becomes quite challenging.7 Some of these include wavy bonds, lines that


overlap (e.g., bridges), and ambiguous atom labels. Apart from complex, ambiguous, or uncommon representations, there are other challenges

that currently impact performance, including low resolution or noisy images. Currently available solutions require relatively high-resolution input, e.g., 150+ dpi5,8,14, and tolerate only small amounts of noise. Furthermore, rule-based systems can be difficult to improve due to the complexity and interconnectedness of

the various recognition components.

Changing a heuristic in one area of the algorithm can impact and require adjustment in another area, making it difficult to simultaneously improve components to fit new data while

maintaining or improving generalizability of the overall system. Apart from accuracy of structure prediction, filtering of false positives during the structure extraction process also remains problematic. Current solutions rely on users to manually filter out

results, choosing which predicted structures should be

ignored if tables, figures, etc. are predicted to be molecules. In order to both improve extraction and prediction of structures in a wide variety of source materials, particularly with noisy or

low quality input images, it is important to find alternatives to rule-based systems. The deep learning-based

method outlined herein provides i) improved accuracy for poor quality

images and ii) a built-in mechanism for improvement

through the addition of new training sets.


3. Deep Learning Method

The deep learning model architectures described in this work are depicted in Figure 1. The segmentation model identifies and extracts chemical structure images from input documents, and the structure prediction model generates a computer-readable SMILES string for each extracted chemical structure image.

Figure 1. In subpanel A we depict the segmentation model architecture and in subpanel B we depict the structure prediction model architecture. For brevity, similar layers are condensed with a multiplier indicating how many layers are chained together, e.g., “2X”. All convolution (conv) layers are followed by a parametric ReLU activation function. “Pred Char”


represents the character predicted at a particular step and

“Prev Char” stands for the predicted character at the previous step. Computation flow through the diagrams is left to right (and top to bottom in B).
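The parametric ReLU that follows every convolution layer in both models can be sketched in a few lines. Note this is an illustrative numpy sketch: in the trained networks the slope `alpha` is a learned parameter, whereas here it is fixed.

```python
import numpy as np

def prelu(x, alpha=0.25):
    """Parametric ReLU: identity for positive inputs, slope `alpha` for
    negative inputs. In the paper's models alpha is learned during
    training; it is fixed here for illustration."""
    return np.where(x >= 0, x, alpha * x)
```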

3.1. Segmentation

When presented with a document containing chemical structures, the initial step in the extraction pipeline is to identify which regions are structures by segmenting the document, then separating out the molecules from the rest of the input. Successful

segmentation is important to i) provide the cleanest possible input for accurate sequence prediction, and ii) exclude objects that are not structures but contain similar features, e.g., charts, graphs, logos, and annotations. Segmentation herein utilizes a deep convolutional neural network to predict which pixels in the input image are likely to be part of a chemical structure. In designing the segmentation model we followed the “U-Net” design strategy outlined in Ronneberger et al.15, which is especially well suited for full-resolution detection (the neural network output at the top of the network) and enables fine-grained

segmentation of structures in the experiments reported here. The U-Net supports full-resolution segmentation by convolving (with pooling)

the input to obtain a latent representation, then


upsampling the latent representation using deconvolution with skip-connections until the output is at a resolution that

matches that of the input. The outputs generated at the top of the

network are then scaled using a softmax activation (to

normalize the outputs between 0-1.0) and provide a predicted probability for each pixel, identifying the likelihood pixels belong to a structure. The

predicted pixels form masks generated at the same resolution as the original input images and allow for sufficiently fine-grained extraction. The masks obtained from

the segmentation model are binarized to remove low confidence pixels, and contiguous areas of pixels that are not large enough to contain a structure are removed. To remove areas too small to be structures, we count the number of pixels in a contiguous area and deem the area a non-structure if the number of pixels is

below a threshold (8050 pixels at 300 dpi). Individual entities (a single, contiguous group of positively predicted pixels) in the refined masks are assumed to contain single

structures and are used to crop structures from the original inputs, resulting in a collection of individual structure images. During testing we observed qualitatively better masks when generating several masks at different resolutions and averaging the

masks together into a final mask used to crop out structures. In the experiments reported herein, averaged masks were obtained by scaling inputs to each resolution within the range 30 to 60 dpi in increments of 3 dpi, then generating masks for each image using the segmentation model. The resulting masks were

scaled to the same resolution (60 dpi) and averaged together. Finally, the averaged masks were scaled to the

original input resolution (usually 300 dpi) and then used to crop

out individual structures. Figure 2 shows an example

journal article page along with its predicted mask. The inputs used in all experiments were first preprocessed to be grayscale and downsampled to approximately 60 dpi31 resolution. We found 60 dpi input resolution to be sufficient for segmentation while providing resolutions.

significant speed improvement versus higher resolutions. We also tested the removal of long, straight

horizontal and vertical lines in input images using the Hough transform.16 Line removal improved mask quality in many cases, especially in tables where structures were very close to grid lines, and was included in the final model.
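The mask refinement described above — binarizing the predicted probabilities and dropping contiguous pixel groups too small to contain a structure — can be sketched as follows. The function name, the 0.5 threshold, and the 4-connected flood-fill labeling are illustrative assumptions; the paper specifies only the 8050-pixel area threshold at 300 dpi.

```python
import numpy as np
from collections import deque

def refine_mask(prob_mask, threshold=0.5, min_pixels=8050):
    """Binarize a predicted probability mask, then drop contiguous pixel
    groups smaller than `min_pixels` (the paper's smallest plausible
    structure is 8050 pixels at 300 dpi)."""
    binary = prob_mask >= threshold
    refined = np.zeros_like(binary)
    visited = np.zeros_like(binary)
    h, w = binary.shape
    for sy in range(h):
        for sx in range(w):
            if not binary[sy, sx] or visited[sy, sx]:
                continue
            # flood-fill one 4-connected component
            component = [(sy, sx)]
            queue = deque(component)
            visited[sy, sx] = True
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not visited[ny, nx]:
                        visited[ny, nx] = True
                        component.append((ny, nx))
                        queue.append((ny, nx))
            if len(component) >= min_pixels:  # keep only plausible structures
                for y, x in component:
                    refined[y, x] = True
    return refined
```

In practice a library routine such as connected-component labeling from an image-processing package would replace the hand-rolled flood fill.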


Figure 2. An example showing the input PDF (left) and output of the segmentation model (right) when processing a journal article page from Salunke et al.17 All of the text and other extraneous items are completely removed with the exception of a few faint lines that are amenable to automated post processing.

3.2. Structure Prediction

Images of individual structures obtained using the segmentation model

are automatically transcribed into the corresponding SMILES sequences representing the contained structures using

another deep neural network. The purpose of this network is to take

an image of a single structure and, in an end-to-end fashion, predict the corresponding SMILES string of the structure contained in the image. The network comprises an


encoder-decoder strategy where structure images are first

encoded into a fixed-length latent space (state vector) using a convolutional neural network and then decoded into a sequence of characters using a recurrent neural network. The convolutional network consists of alternating layers of 5x5 convolutions, 2x2 max-pooling, and a parameterized ReLU activation function18, with the overall network conceptually similar to the design outlined in

Krizhevsky et al.19 but without the final classification layer. To help mitigate issues in which small but important

features are lost during encoding, our network architecture does not utilize any pooling method in the first few layers of the network. The state vector obtained from the convolutional encoder is passed into a decoder to generate a SMILES sequence. The decoder consists of an input projection layer, three layers of GridLSTM cells20, an

attention mechanism (implemented similarly to the

soft attention method described in Xu et al.21 and Bahdanau et al.22, and the global attention method in Luong et al.23), and an output projection layer. The state vector from the encoder is used to initialize the GridLSTM cell states, and the SMILES sequence is then generated a character at a time, similar to the decoding method described in Sutskever et al.24 (wherein sentences were generated a word at a time while translating English to French). Decoding is started by projecting a special start token into the GridLSTM (initialized by the encoder and conditioned on an initial context vector as computed by the attention mechanism), processing this input in the cell, and predicting the first character of the output sequence. Subsequent characters are produced similarly, with each prediction conditioned on the previous cell state, the current attention, and the previous output projected back into the network. The logits vector for each character produced by the network is of length N, where N is the number of available characters (65 characters in this case). A softmax activation is applied to the logits to compute a probability distribution over characters, and the highest scoring character is selected for a particular step in the sequence. Sequences are generated until a special end-of-sequence token is predicted, at which point the completed SMILES string is returned. During testing we predict images at several different (low) resolutions and return the sequence of highest confidence, which is determined by multiplying together the softmax output of each predicted character in the sequence. Additional architecture details for both the segmentation and structure prediction models are

included in the supplemental information. The addition of an attention mechanism in the decoder helps solve several challenges. Most importantly, attention enables the decoder to access information produced earlier in the encoder and minimizes the loss of important details that may otherwise be overly compressed when encoding the state vector. Additionally, attention enables the decoder to reference information closer to the raw input during the prediction of each character, which is important considering the significance of individual pixels in low resolution structure images. See Figure 3 for an example of the computed attention and how the output corresponds to various characters recognized during the decoding process. Apart from using attention for improved performance, the attention output is useful for repositioning a predicted structure into an orientation that better matches the original input image. This is done by converting the SMILES into a connection table using the open source Indigo toolkit25, and repositioning each atom in 2D space according to the coordinates of each character's computed attention. The repositioned structure then more closely matches the original positioning and orientation in the input image, enabling users to more easily identify and correct mistakes when comparing the output with the original source.
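The greedy character-at-a-time decoding and the product-of-softmax confidence score described above can be sketched as follows. The `step` callable is an assumed stand-in for one pass through the GridLSTM decoder with attention, and `toy_step` is an invented deterministic decoder used purely to exercise the loop; neither is code from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(step, start_id, end_id, max_len=100):
    """Greedy decoding: feed the previous character back in, pick the
    argmax character each step, and stop at the end-of-sequence token.
    The confidence is the product of the per-character softmax outputs;
    at test time the paper keeps the highest-confidence sequence across
    several input resolutions."""
    seq, probs = [], []
    prev, state = start_id, None
    for _ in range(max_len):
        logits, state = step(prev, state)
        p = softmax(logits)
        prev = int(np.argmax(p))
        if prev == end_id:
            break
        seq.append(prev)
        probs.append(float(p[prev]))
    conf = float(np.prod(probs)) if probs else 1.0
    return seq, conf

def toy_step(prev_id, state):
    """Toy decoder over a 4-token vocabulary that deterministically
    emits tokens 2, 3 and then the end token 0."""
    i = 0 if state is None else state
    logits = np.full(4, -10.0)
    logits[[2, 3, 0][i]] = 10.0
    return logits, i + 1
```

Picking the candidate with the largest `conf` across resolutions then reduces to a single `max(candidates, key=...)` call.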


Figure 3. Heatmaps depicting the computed attention for several sequence generation steps are shown. Below the heatmaps, the full predicted SMILES sequence for the molecule is given with the

corresponding characters highlighted. The highlighted characters in the SMILES correspond to the images shown in the order left to right, top to bottom.

The complete encoder-decoder framework is fully differentiable and is trained end-to-end using a suitable form of backpropagation, enabling SMILES to be fully generated using only raw images as input. During decoding, SMILES are generated a character at a time, from left to right. Additionally, no


external dictionary is used for chemical abbreviations (superatoms); rather, these are learned as part of the model, thus images may contain superatoms and the SMILES are still generated a character at a time. The model operates on raw images and directly generates chemically valid SMILES with no explicit subcomponent recognition required.

4. Datasets

4.1. Segmentation Dataset

To our knowledge, no dataset addressing molecular structure segmentation has been published. To provide sufficient data for training

a neural network while minimizing the manual effort required to curate such a dataset, we developed a pipeline for automatically generating segmentation data. To generate the dataset, in summary, the following steps were performed: i)

removed structures from journal and patent pages, ii) overlaid structures onto the pages, iii) produced a ground truth mask identifying the overlaid structures, and iv) randomly cropped images

from the pages containing structures and the corresponding mask. In detail, OSRA8 was utilized to identify bounding boxes of candidate molecules within the pages of a

large number of documents, both published journal articles and patents. The regions expected to contain molecules were whited-out, thus leaving pages without molecules. OSRA is not always correct in finding structures and occasionally non-structures


(e.g., charts) were removed, suggesting that cleaner input may further improve model performance. Next, molecule images from the Indigo OS X and USPTO datasets (described in more detail later) were randomly overlaid onto the pages while ensuring that no structures overlapped with any non-white pixels. Structure images were occasionally perturbed using affine transformations, changes

in background shade, and/or lines added around the

structure (to simulate table grid lines). We also generated the true mask for each overlaid page; these masks were zero-valued except where pixels were part of a molecule (pixels assigned a value of 1). During training, samples of 128x128 pixels were randomly cropped from the overlaid pages and masks, and arranged into mini-batches for feeding into the network; example image-mask pairs are shown in Figure 4.
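The overlay and ground-truth-mask steps of the pipeline can be sketched as below. `overlay_structure` and its non-white ink test are hypothetical details invented for illustration, not code from the paper.

```python
import numpy as np

def overlay_structure(page, structure, top, left):
    """Paste a grayscale structure image (1.0 = white background) onto a
    whited-out page and record the ground-truth mask: 1 wherever the
    structure has ink, 0 elsewhere."""
    mask = np.zeros(page.shape, dtype=np.uint8)
    h, w = structure.shape
    ink = structure < 1.0  # non-white pixels carry ink (assumed test)
    page[top:top + h, left:left + w][ink] = structure[ink]
    mask[top:top + h, left:left + w][ink] = 1
    return page, mask
```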

Figure 4. Two examples from the programmatically generated

segmentation dataset. Inputs to the network are shown on the left with the corresponding masks used in training shown on the right. In the mask, white indicates pixels which are part of a chemical structure.
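The random 128x128 cropping of aligned page/mask pairs into mini-batches (as illustrated in Figure 4) might look like the following sketch; the helper name and batch handling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crops(page, mask, size=128, n=8):
    """Sample n aligned page/mask crops of size x size pixels, stacked
    into a mini-batch for the segmentation network."""
    h, w = page.shape
    images, targets = [], []
    for _ in range(n):
        y = int(rng.integers(0, h - size + 1))
        x = int(rng.integers(0, w - size + 1))
        images.append(page[y:y + size, x:x + size])
        targets.append(mask[y:y + size, x:x + size])
    return np.stack(images), np.stack(targets)
```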


4.2. Molecular Image Dataset

An important goal of our work is to improve recognition of low resolution or poor quality images. Utilizing training data that is too high quality or too clean could negatively impact the

of

the

final

model.

Arguably,

the

network

should be capable of becoming invariant to quality when trained explicitly on both high and low quality examples, at the expense of more computation and likely more data. However, we opted to handle image quality consistently by considering all images as low quality images regardless of original input resolution. This was done by scaling all inputs down considerably so that all images are low resolution when fed to the model. To illustrate the impact of scaling images of molecular structures, consider two structures in Figure 5 that are chemically identical but presented with different levels of quality. The top-left image is of fairly high quality apart from some perforation throughout the image. In the top-right image the perforation is much more pronounced with some bonds no longer continuous (small breaks due to the excessive noise). When these images are downsampled significantly using bilinear interpolation, the

images appear similar, and it is hard to differentiate which of the two began as a lower quality image, apart from one being darker than the other.
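The effect shown in Figure 5 can be reproduced with a simple block-average downsampler — a dependency-free stand-in for the bilinear interpolation the paper uses. The toy "bond" image and the two noise levels are invented for illustration.

```python
import numpy as np

def downsample(img, factor):
    """Downsample by block averaging (a stand-in for bilinear
    interpolation): each factor x factor block becomes one pixel."""
    h, w = img.shape
    h, w = h // factor * factor, w // factor * factor
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# Two noisy "scans" of the same toy structure: a single horizontal bond
rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[32, 8:56] = 1.0
light_noise = np.clip(clean + 0.05 * rng.standard_normal(clean.shape), 0, 1)
heavy_noise = np.clip(clean + 0.30 * rng.standard_normal(clean.shape), 0, 1)

# Aggressive downsampling averages much of the noise away, so the two
# versions become far harder to tell apart, as in Figure 5
small_a = downsample(light_noise, 8)
small_b = downsample(heavy_noise, 8)
```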


Figure 5. Noisy images of the same structure from Bonanomi et al.27 The top two images are the originals with their downsampled versions

shown below. Image quality appears similar when

downsampled considerably. Chemical

structures vary significantly in size; some being

small fragments with just a few atoms, or very large structures including

natural

products

or

peptide

sequences.

This

necessitated using an image size that was not too large to be too

computationally

fully

fit

intensive

reasonably

sized

to

train,

structures,

but i.e.,

large

enough

drug-like

to

small

molecules, into the image. Although structures themselves can be large the individual atoms and their relative connectivity may be contained within a small number of pixels regardless of the size

of

the

overall

structure.

Thus,

scaling

an

image

too

aggressively will result in important information being lost. To train a neural network that can work with both low and high quality images, we utilized an image size of 256x256 and scaled


images to fit within this size constraint (bond lengths fall in approximately the 3-12 pixel range). Training a neural network on higher resolution images is an interesting research direction and may improve results, but it was left here for future work.

In designing datasets to train and test our method, we considered both the chemistry space and the image styles that were of interest, so as to maximize the general applicability of the final model. To ensure sufficient variety in molecular images, both in the breadth of chemical space and in the various styles portrayed, we curated three separate datasets, each sampled uniformly during training. The first dataset consisted of a 57 million molecule subset of the molecules available in the PubChem28 database, with the subset selected using several filters described later. The molecules obtained from PubChem were initially in InChI format and were converted into SMILES using Indigo, with images then rendered from the SMILES using Indigo and its various available style options (bond thicknesses, character sizes, etc.). Reasonable parameter ranges were chosen for each style option, and style parameters were randomly selected from these ranges for each image during rendering (see Table S1 in the Supporting Information for the option ranges used). This dataset is referred to throughout the text as the Indigo dataset.
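The per-image randomization of render settings can be sketched as follows. The option names and numeric ranges below are placeholders of our own invention (the actual options and ranges are listed in Table S1), and the helper name is an assumption:

```python
import random

# Placeholder style-option ranges; the real option names and ranges are in Table S1.
STYLE_RANGES = {
    "bond-line-width": (1.0, 3.0),       # assumed option name and range
    "relative-thickness": (0.8, 1.6),    # assumed option name and range
    "label-font-size": (10, 22),         # assumed option name and range
}

def sample_style(rng: random.Random) -> dict:
    """Draw one random style configuration, redrawn for every rendered image."""
    style = {}
    for name, (lo, hi) in STYLE_RANGES.items():
        # Integer ranges get a discrete draw, float ranges a continuous one.
        pick = rng.randint if isinstance(lo, int) else rng.uniform
        style[name] = pick(lo, hi)
    return style

rng = random.Random(42)
style = sample_style(rng)
assert all(lo <= style[k] <= hi for k, (lo, hi) in STYLE_RANGES.items())
```

Redrawing the full configuration for every image (rather than once per molecule) is what lets the same SMILES appear many times in training without producing duplicate pixels.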


The second dataset (referred to in our text as the Indigo OS X dataset) was prepared similarly to the first dataset and consisted of PubChem compounds rendered using Indigo. We observed that Indigo rendering output can vary significantly between operating systems, so to ensure that we captured these stylistic differences, we randomly sampled with replacement 10 million of the prepared SMILES from the first dataset and rendered additional images for these SMILES using Indigo, but on OS X (versus Linux for the first dataset). Although sampling with replacement may allow some SMILES to be sampled more than once, this is not an issue because each instance of a SMILES was rendered using a new set of randomly sampled style options, as described earlier. The inclusion of these images helped supplement training with additional image styling.

The third dataset (USPTO dataset) consisted of 1.7 million image/molecule pairs curated from data made publicly available by the United States Patent and Trademark Office (USPTO).26 Many of these images contained extraneous text and labels and were preprocessed to remove non-structure elements before training. The original USPTO data provided molecules in MOL format, and these were converted into SMILES using Indigo. For some files in the USPTO dataset, we observed that Indigo did not correctly retain stereochemistry when converting the MOL files into SMILES,


which resulted in some SMILES having incorrect stereochemistry; results may improve with cleaner data.

Although our goal when curating data was to capture sufficient breadth of chemical space in our datasets for the learned model to be useful, it is important to note that chemical space is extremely vast. In order to make learning a model in our experiments feasible, and to ensure data were in a consistent format across datasets, we chose to focus on drug-like molecules and imposed the following restrictions when filtering compounds and preparing all three training datasets:

• Structures with a SMILES length of 21-100 characters (a range that covers most drug-like small molecules) were included; all others were removed.
• Attachment placeholders of the format R1, R2, R3, etc., were included, but other forms were not.
• Enumeration fragments were not predicted or attached to the parent structure.
• Salts were removed, and each image was assumed to contain only one structure.
• All molecules were converted to SMILES using Indigo.
• All SMILES were kekulized and canonicalized using Indigo.
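As a concrete illustration of the first restriction, the length filter can be expressed in one line. The helper name and the example SMILES are ours, not from the paper; only the 21-100 character window comes from the text:

```python
# Hypothetical helper implementing the 21-100 character length filter from the text.
def keep_structure(smiles: str) -> bool:
    return 21 <= len(smiles) <= 100

# Example inputs: a tiny fragment, a drug-like SMILES (26 chars), and an oversized chain.
candidates = ["CC", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "C" * 150]
kept = [s for s in candidates if keep_structure(s)]
print(kept)  # ['CC(C)Cc1ccc(cc1)C(C)C(=O)O']
```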


Attachment placeholders of the form R1, R2, R3, etc., are not natively supported in SMILES strings but often show up in images. While SMILES can include an asterisk in place of an atom to indicate that the location is a wildcard, we desired to support more granular prediction by retaining the numbers that are displayed in images. To do this we utilized Indigo’s ability to render custom text for wildcard atoms and randomly replaced atoms with “Ri”, where i is incremented each time an R-group is added; terminal atoms were replaced with 5% probability and non-terminal atoms were replaced only 1% of the time. The asterisks in the resulting SMILES strings were replaced with the R-groups in brackets. For example, the SMILES string for ethane, “CC”, might become [R1]CC[R2]. Although predicted SMILES containing this added R-group notation may not be truly valid SMILES, the added groups are easily detectable and reversible. The R-groups can be replaced with asterisks to recover the predicted molecule in valid SMILES format, while the attachment annotation at each R-group location is retained for reference. Since the USPTO data already provided rendered images, we only randomly included R-groups in the Indigo and Indigo OS X datasets, wherein SMILES containing the added R-groups could then be rendered using Indigo.

The prepared datasets were split into training and validation portions, utilized for training the model and for periodically testing performance during training, respectively. For the Indigo dataset (containing the 57 million molecules), 90% of the dataset was utilized for training and 10% was reserved for validation. Similarly, the USPTO dataset was split into training and validation portions; 75% of the set was used to train, with 25% reserved for validation. We reserved a larger portion of the USPTO dataset for validation because the total dataset is significantly smaller than the Indigo dataset and we desired to ensure a sufficiently large set for adequate validation. To simplify our training and validation metrics, we included the Indigo OS X dataset only during training; no portion of this dataset was reserved for validation, and we do not report any metrics for it.
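The R-group round trip described earlier (e.g., “[R1]CC[R2]” back to the wildcard form “*CC*”) can be sketched with a regular expression; the helper name is an assumption:

```python
import re

# Matches bracketed R-group labels such as [R1], [R2], ... and captures the index.
RGROUP = re.compile(r"\[R(\d+)\]")

def strip_rgroups(smiles: str):
    """Replace R-group labels with wildcard atoms, keeping the labels for reference."""
    labels = RGROUP.findall(smiles)   # e.g. ['1', '2']
    valid = RGROUP.sub("*", smiles)   # e.g. '*CC*' — valid wildcard SMILES
    return valid, labels

print(strip_rgroups("[R1]CC[R2]"))  # ('*CC*', ['1', '2'])
```

Because the labels are a fixed, detectable pattern, the reverse mapping is lossless: the wildcard SMILES is valid, and the captured indices preserve the attachment annotation at each location.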

Apart from the training and validation sets just described, two additional test sets were utilized to evaluate method performance. The first was the dataset published within Valko and Johnson7 (hereafter called the “Valko dataset”), which consists of 454 images of molecules cropped from the literature. The Valko dataset is interesting because it contains complicated molecules with challenging features such as bridges, stereochemistry, and a variety of superatoms. The second dataset consisted of a proprietary collection of 5,760 image-SMILES pairs from 47 published articles and patents. The molecules in the proprietary dataset were drug-like, and some of the images contained small amounts of extraneous artifacts, e.g., surrounding text, compound labels, lines from an enclosing table, etc.; this dataset was used to evaluate overall method effectiveness on examples extracted for use in real drug discovery projects.

With the focus of this work being on low quality/resolution images rather than on high resolution images, we tested our method on downsampled versions of the Valko and proprietary datasets. During the segmentation phase each Valko dataset image was downsampled to 5-10% of its original size, and during the sequence prediction phase images were downsampled to 10-22.5%, with similar scaling performed on the proprietary dataset. These scale ranges were chosen so that the resolution used during prediction approximately matched the (low) resolutions of the images used during training.

A list of 65 characters containing all the unique characters in the Indigo dataset was assembled. These characters served as the vocabulary from which a character is selected at each SMILES decoding step. Four of these characters were special tokens not used in valid SMILES notation; they indicate the beginning or end of a sequence, replace unknown characters, or pad sequences shorter than the maximum length. During training and testing, 100 characters were generated for


each input, and any characters generated after the end-of-sequence token were ignored.

5. Training
The segmentation model had 380,000 parameters and was trained on

batches of 64 images (128x128 pixels in size). In our experiments, training converged after 650,000 steps and took 4 days to complete on a single GPU. The sequencing model had 46.3 million parameters and was trained on batches of 128 images (256x256 pixels in size). During training, images were randomly affine transformed, brightness scaled, and/or binarized. Augmenting the dataset with random transformations during training ensured that the model would not become too reliant on styles either generated by Indigo or seen in the patent images. In our experiments, training converged after 1 million training steps (26 days to train on 8 GPUs). Both models were constructed using TensorFlow29 and were trained using the Adam optimizer30 on NVIDIA Pascal GPUs.

6. Results and Discussion
During training, metrics were tracked for performance on both the

Indigo and USPTO validation datasets, where only SMILES strings that were exactly equivalent to the ground truth SMILES were considered correct. We observed no apparent overfitting on the Indigo dataset during training but did experience some overfitting on the USPTO data (Figure 6). Due to the large size of the Indigo dataset (52 million examples used during training) and the many rendering styles available in Indigo, it is not surprising that the model did not experience any apparent overfitting on Indigo data. Conversely, the USPTO set is much smaller (1.27 million examples used during training), with each example sampled much more frequently (approximately 40 times more often), increasing the risk of overfitting.
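The random augmentations described in the Training section could be sketched as below. This is a simplified illustration covering only brightness scaling and binarization (the affine component is omitted), and the probabilities, scaling range, and threshold are our assumptions, not values from the paper:

```python
import random
import numpy as np

def augment(img: np.ndarray, rng: random.Random) -> np.ndarray:
    """Randomly brightness-scale and/or binarize a grayscale image in [0, 1].

    Simplified sketch of the paper's augmentations; the 50% application
    probabilities, 0.7-1.3 brightness range, and 0.5 threshold are assumed.
    """
    out = img.astype(np.float64)
    if rng.random() < 0.5:  # random brightness scaling, clipped back to [0, 1]
        out = np.clip(out * rng.uniform(0.7, 1.3), 0.0, 1.0)
    if rng.random() < 0.5:  # random binarization around a fixed threshold
        out = (out > 0.5).astype(np.float64)
    return out

rng = random.Random(0)
img = np.linspace(0.0, 1.0, 16).reshape(4, 4)
aug = augment(img, rng)
assert aug.shape == img.shape and aug.min() >= 0.0 and aug.max() <= 1.0
```

Applying a fresh random draw per example means the network rarely sees the same rendering twice, which is the mechanism behind the style invariance discussed above.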

Figure 6. Training and validation curves for both the Indigo and USPTO datasets.

We tested the trained models on the Valko and proprietary test sets using the full segmentation and sequence generation pipeline described above, with accuracies reported in Table 1. No SMILES were provided with the Valko dataset; only images were made available. Unlike the validation sets, the test sets are quite small, which made it possible to conduct evaluation visually rather than by comparing SMILES strings. Each prediction


was compared visually with the input image, and for a result to count toward accuracy it must have been chemically identical to the ground truth, including stereochemistry; any error resulted in the structure being deemed incorrect. Also, because the segmentation step was applied, if a structure image was incomplete after segmentation, or if any false positives were also extracted, the structure was deemed incorrect. For the proprietary dataset, a combination of automated methods and visual inspection was utilized to ensure correctness, with the same requirement of exact chemical equivalence applied. Despite the method requiring low resolution inputs, accuracy was generally high across the datasets. Additionally, accuracies for the validation sets and the proprietary set were all similar (77-83%), indicating that the training sets used in developing the method reasonably approximate data useful in actual drug discovery projects, as represented by the proprietary test set.

Table 1. Overall performance against validation and test sets.

Dataset               Accuracy
Indigo Validation     82%
USPTO Validation      77%
Valko Dataset         41%
Proprietary Dataset   83%


On the Valko dataset the method achieved an accuracy of 41% over the full, downsampled dataset, which is significantly lower than the accuracies observed on the other datasets. The decrease in performance was likely due to the higher rate of certain challenging features seen less frequently in the other datasets, including superatoms. Superatoms were the single largest contributor to prediction errors in the Valko dataset, with 21% of samples containing one or more incorrectly predicted superatoms. In our training sets, superatoms were included only in the USPTO dataset and were not generated as part of the Indigo dataset, resulting in a low rate of inclusion during training (6.6% of total images seen contained some superatom, with most superatoms included at a rate of