HELM Software for Biopolymers - Journal of Chemical Information and

May 4, 2017 - The HELM project, part of the Pistoia Alliance nonprofit organization, has been tasked to develop and promote HELM as a global exchange ...
Application Note pubs.acs.org/jcim

HELM Software for Biopolymers

Jeff Milton,*,† Tianhong Zhang,‡ Claire Bellamy,∥ Eric Swayze,† Christopher Hart,† Markus Weisser,§ Sabrina Hecht,§ and Sergio Rotstein‡

† Ionis Pharmaceuticals, Inc, 2855 Gazelle Court, Carlsbad, California 92010, United States
‡ Pfizer Inc., One Burtt Road, Andover, Massachusetts 01810, United States
§ Quattro Research, Fraunhoferstraße 18a 82152 Planegg−Martinsried, Germany
∥ Pistoia Alliance, 401 Edgewater Place, Wakefield, Massachusetts 01880-6201, United States

ABSTRACT: Hierarchical Editing Language for Macromolecules (HELM version 2.0) is a molecular line notation similar to SMILES but designed specifically for communicating and managing biopolymer structures. The HELM project, part of the Pistoia Alliance nonprofit organization, has been tasked to develop and promote HELM as a global exchange format and recently released version 2.0 of the specification. Here we describe the specifics of the HELM v2.0 notation along with the large ecosystem of software that supports HELM-based structure management. We highlight recently released open-source software and a database for HELM monomers, and a new, simpler approach to deploying a large, complicated molecular management system.



INTRODUCTION

The computational management of molecular therapeutics has historically centered on small molecules and their molecular structure as described by atomic coordinates. Today the world is much different. While small molecules represent the vast majority of FDA approved substances, more and more therapeutic substances are entering the clinic with large, complex, and often undefined chemical makeup.1,2 This growing trend has greatly complicated the computational management of chemical information.

Today database queries for complex biopolymers rely almost exclusively on storage of two primary data structures: (1) atomic coordinates and (2) (nucleotide/peptide) sequence. Scifinder,22 one of the most popular search tools for chemical substances and reactions, permits users to supply both. For most compounds this is sufficient, but for biopolymers this can be a low-resolution view of the intrinsic intellectual property. Many biopolymers are modifications of natural compounds and therefore maintain a “sequence” equal to that of their natural parent but may not share its exact atomic composition. For example, 5-methylcytosine has the same sequence letter as the DNA base cytosine but does not share the same molecular structure. Searching for sequences with motifs of this methylated form in public databases like Scifinder is currently not possible, and proprietary databases require significant programming rigor to develop and surface a user-friendly search tool.

Some important structures recently published have highlighted the shortcomings of current digital representations. Kadcyla (ado-trastuzumab emtansine),3 a breast cancer drug developed by Genentech and approved in 2013,4 is a conjugate of a monoclonal antibody (Trastuzumab) and a small molecule (Emtansine) connected by a nonreducible thioether linkage (MCC). At first glance this seems like a well-defined molecule, since each of the three components is well characterized at the atom level. The connection of the conjugate to the antibody via the MCC linker is in fact not well understood, and complicated molecular assays along with statistics were required to accurately describe the structure composition.5 This uncertainty in the chemical conjugation site translates to a structural ambiguity in which specific covalent bonds are replaced by a ratio of probable interactions. These structural ambiguities are lost with conventional digital representations.

Received: August 1, 2016 Published: May 4, 2017
© 2017 American Chemical Society

DOI: 10.1021/acs.jcim.6b00442 J. Chem. Inf. Model. 2017, 57, 1233−1239

Application Note

Journal of Chemical Information and Modeling



MATERIALS AND METHODS

HELM6 is an increasingly established molecular notation that was designed to capture higher order macromolecular information in a compact and human-readable manner while maintaining a highly structured, computationally accessible format. Molecular structures are stored in libraries as monomers, allowing polymer-level features to be defined in a simple and modular manner. In its simplest form, HELM resembles the FASTA format,7 but unlike FASTA, where each letter is assumed to be the natural monomer (e.g., “A” is adenine, “C” is cytosine, etc.) in a natural polymer sequence (RNA, DNA, or peptide), HELM permits users to define monomers with entirely new chemical structures while maintaining the link to biological sequence information via a “natural analog” attribute. A polymer in HELM can represent an equivalent sequence without having the “natural” chemical structure. This is especially important for antisense and RNAi technology, where oligonucleotides are often highly chemically modified while maintaining a constant biological sequence.

HELM version 2.0 contains new features that greatly enhance the computational utility and human readability of notations for large, complex macromolecules. One of the most significant is the ambiguity notation, with which users can declare unknown or ambiguous composition. This is extended further to include notations for polymer mixtures and free-form annotations. For example, a series of conjugated lysine residues on an antibody structure may be annotated using the following HELM notation:

monomer database. This same structure with a methylated cytosine will then have the same sequence as the unmethylated: RNA1{d(A)p.d([m5C])p.d(T)p.d(G)p}$$$$
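The natural-analog lookup described here can be sketched in a few lines of Python. This is a minimal illustration, not the HELM toolkit's implementation: the `NATURAL_ANALOG` table and the `natural_sequence` function are hypothetical stand-ins for a real monomer database, and only the m5C-to-C mapping is taken from the text.

```python
import re
from typing import Dict

# Hypothetical excerpt of a monomer database: base symbol -> natural analog.
# Only the m5C -> C mapping is stated in the text; natural bases map to themselves.
NATURAL_ANALOG: Dict[str, str] = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "m5C": "C",
}

# A base is the parenthesised monomer of each nucleotide unit, e.g. "(A)" or "([m5C])".
_BASE = re.compile(r"\((\[[^\]]+\]|[^()\[\]]+)\)")


def natural_sequence(helm: str) -> str:
    """Build the sequence of the first chain from natural analogs, not raw symbols."""
    chain_body = helm[helm.index("{") + 1 : helm.index("}")]
    letters = []
    for token in _BASE.findall(chain_body):
        symbol = token.strip("[]")  # "[m5C]" -> "m5C"
        letters.append(NATURAL_ANALOG[symbol])
    return "".join(letters)


print(natural_sequence("RNA1{d(A)p.d([m5C])p.d(T)p.d(G)p}$$$$"))  # ACTG
```

Reading the base symbols directly would give A[m5C]TG; going through the lookup table is what keeps the methylated and unmethylated strands searchable under the same sequence.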

The [m5C]-containing string yields the sequence ACTG, not A[m5C]TG. This is because the methylated cytosine monomer [m5C] is associated with a natural analog attribute of “C” in the monomer database. The equivalent sequence of a polymer is very important when dealing with synthetic peptides and nucleotides, as chemical modifications at bases are often designed to preserve Watson−Crick base pairing; the bioinformatics is preserved while the chemistry is modified.

Connections. The connection between two monomers is determined by the monomer type and “R group” designation of each connecting monomer. In a typical HELM string the connection between monomers is specified by the “.” character. In the case of insulin, the peptide sequence is “M.A.L.W.M.R. ... etc.”, and the connections between the atoms in this case are standard peptide bonds where R groups on the respective monomers determine the “C” and “N” termini. As the polymer extends to the right in Figure 1, the C-terminus “R2” group is linked to the “R1” of the next amino acid while the N-

PEPTIDE1{A.N.D.C.Q.K'5'"linked MDP".Q.N.A}

A peptide region containing five lysine amino acids “K” is annotated with “linked MDP” to indicate that each lysine in the repeat is linked to an MDP molecule.

HELM Structure. The HELM specification has four levels: (1) At the lowest level, molecular structure is specified. (2) Next, molecules are encapsulated in monomer units where additional metadata such as the sequence letter is defined. (3) Simple polymers are covalently linked monomers of the same polymer type, and (4) complex polymers define interactions of simple polymer chains. The hierarchy was designed to capture both the chemical structure and the biological sequence. The syntax is broken into four main sections separated by the “$” character:

Figure 1.

terminus is capped with a hydrogen. This “capping” is a product of the stored monomer. A monomer that is part of a polymer and does not have a connection to another monomer is capped with a “default capping group”. This is typically an OH or H. Intermonomer connections are denoted by the “.” character, while polymer-to-polymer connections are defined explicitly in the “Connections” and the “ListOfPolymerGroups” sections of HELM strings. This highly structured notation for polymer-to-polymer interactions allows for better computational accessibility of complex structures. This is best highlighted in Figure 2, where siRNA oligonucleotides are bound together by hydrogen bonds to form a double-stranded RNA structure.

SimplePolymers$Connections$PolymerGroups$FreeText$
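As a rough sketch of how this four-section layout can be consumed programmatically, the Python below splits a HELM string on “$”. The names `HelmSections` and `split_helm` are hypothetical and not part of any HELM toolkit, and the sketch assumes that no “$” occurs inside a section, which holds for the example strings in this paper.

```python
from typing import NamedTuple


class HelmSections(NamedTuple):
    simple_polymers: str
    connections: str
    polymer_groups: str
    free_text: str
    version: str  # text after the fourth "$", e.g. "V2.0" or ""


def split_helm(helm: str) -> HelmSections:
    """Split a HELM string on "$" into its four sections plus a trailing version tag.

    Assumption: "$" never appears inside a section.
    """
    parts = helm.split("$")
    if len(parts) != 5:
        raise ValueError(f"expected 4 '$' separators, found {len(parts) - 1}")
    return HelmSections(*parts)


sections = split_helm("RNA1{d(A)p.d(C)p.d(T)p.d(G)p}$$$$")
print(sections.simple_polymers)  # RNA1{d(A)p.d(C)p.d(T)p.d(G)p}
```

Sections that a structure does not use are simply empty strings, which is why a single unconnected chain ends in “$$$$”.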

SimplePolymers lists all chains. Connections describes interpolymer interactions such as hydrogen bonding or covalent bonds between chains. PolymerGroups allows mixtures and ratios of chains to be defined. For example, in the string “G1(PEPTIDE1+PEPTIDE2+PEPTIDE3+PEPTIDE4)|G2(CHEM1+G1:3.4)”, the G1 group defines the relationship of four chains in an antibody and the G2 group defines the relationship of a small molecule (CHEM1) and the antibody (G1). The relative abundance of the small molecule to the antibody is captured in this section.

Biological Sequence. HELM bridges chemistry and biology by maintaining an explicit link between any chemical entity and its biological sequence symbol. Each monomer contains a “natural analog” attribute that can be used to construct the sequence of a polymer. For example, the simple HELM string for one strand of DNA of sequence “ACTG” is RNA1{d(A)p.d(C)p.d(T)p.d(G)p}$$$$. While it seems quite simple to parse the sequence characters by finding all “base” monomers on the nucleotide units, this is not the correct method. The correct approach is to get the natural analog of each monomer from the

Figure 2.

The HELM string for this structure is

RNA1{r(A)p.r(G)p.r(C)p.r(U)p...}|RNA2{r(U)p.r(G)p.r(G)p.r(G)p...}$RNA1,RNA2,20:pair-8:pair|RNA1,RNA2,17:pair-11:pair|RNA1,RNA2,8:pair-20:pair|RNA1,RNA2,14:pair-14:pair|RNA1,RNA2,11:pair-17:pair$$$


HELM2 and Structure Ambiguity. The first version of the HELM specification considered polymers with full atomistic representation. While this addresses most use-cases, many structures, like some antibody−drug conjugates, are not well-characterized at atomic resolution. These compounds may have ambiguous or unknown structural components. HELM2 extends HELM1 by adding syntax to capture unknown or ambiguous elements. For example, the HELM string for Cyclosporin is

The hydrogen bonding is described as pairing between two monomers at specific monomer position indices. Complex queries for siRNAs then become relatively simple. For example, a search for siRNAs that contain nucleotide bulges can be performed by finding nucleosides where pairing does not occur but does occur on both sides. While conventional atomistic representations like SMILES contain the necessary information for motif searching, it is not easy to access.

For example, cyclization is often used to increase the bioavailability of oligonucleotide and peptide therapeutics. A database will offer search capabilities for peptide sequence and may even support conjugate-based queries, but searching for cyclization will require additional database annotations on existing structures. This adds complexity to computational environments. HELM syntax supports macromolecular annotations such as cyclization and greatly lowers the computational burden of building software tools. The following cyclic peptide has a HELM string with a connection descriptor that defines the specific head-to-tail connection: monomer 1 is connected to monomer 13. A sequence search for a cyclic peptide in a database becomes trivial. Figure 3 highlights this connection descriptor.
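To illustrate why a cyclization query becomes simple, the sketch below scans the Connections section for records whose source and target are the same polymer. The function name `intrachain_connections` is hypothetical, and the test string is the Cyclosporin HELM string given elsewhere in this paper, written with ASCII hyphens. A self-connection only shows that the chain bonds to itself; confirming a strictly head-to-tail ring would additionally require comparing the connected positions with the chain length.

```python
from typing import List


def intrachain_connections(helm: str) -> List[str]:
    """Return connection records whose source and target are the same polymer."""
    sections = helm.split("$")
    if len(sections) < 2 or not sections[1]:
        return []  # no Connections section, so nothing can be cyclic
    hits = []
    for record in sections[1].split("|"):
        # Each record reads "source,target,position:group-position:group".
        source_id, target_id, _bond = record.strip().split(",", 2)
        if source_id == target_id:
            hits.append(record.strip())
    return hits


cyclosporin = (
    "PEPTIDE1{[Abu].[Sar].[meL].V.[meL].A.[dA].[meL].[meL].[meV].[Bmt]}"
    "$PEPTIDE1,PEPTIDE1,11:R2-1:R1$$$"
)
print(intrachain_connections(cyclosporin))  # ['PEPTIDE1,PEPTIDE1,11:R2-1:R1']
```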

PEPTIDE1{[Abu].[Sar].[meL].V.[meL].A.[dA].[meL].[meL].[meV].[Bmt]}$PEPTIDE1,PEPTIDE1,11:R2-1:R1$$$

Binding this structure to a nanoparticle becomes

PEPTIDE1{[Abu].[Sar].[meL].V.[meL].A.[dA].[meL].[meL].[meV].[Bmt]}|BLOB1{nanoparticle}"poloxamer 407"$PEPTIDE1,PEPTIDE1,11:R2-1:R1|PEPTIDE1,BLOB1,X:?-?:?$$$

The addition of the BLOB1 chain of unknown structure is followed by an annotation. The linkage between the peptide and the nanoparticle is also nonspecific as illustrated by the syntax: “PEPTIDE1,BLOB1,?:?-?:?”. Often the feature ambiguity can be characterized by a ratio between two elements. In the following structure the peptide has a known ratio of monomers in the variable domain of the heavy chain of an antibody: PEPTIDE4{A.C.G.(A:70+G:30).F.Y... etc.
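A ratio group such as “(A:70+G:30)” is straightforward to turn into structured data. The following sketch assumes the group is written exactly as in the example above (alternatives joined by “+”, each as symbol:weight); `parse_ratio_group` is a hypothetical helper, not a function of the HELM toolkit.

```python
import re
from typing import Dict

# Matches one parenthesised ambiguity group, e.g. "(A:70+G:30)".
_GROUP = re.compile(r"\(([^()]+)\)")


def parse_ratio_group(group: str) -> Dict[str, float]:
    """Turn "(A:70+G:30)" into {"A": 70.0, "G": 30.0}."""
    match = _GROUP.fullmatch(group.strip())
    if match is None:
        raise ValueError(f"not a parenthesised group: {group!r}")
    ratios: Dict[str, float] = {}
    for alternative in match.group(1).split("+"):
        symbol, _, weight = alternative.partition(":")
        ratios[symbol] = float(weight)  # raises ValueError if the weight is missing
    return ratios


print(parse_ratio_group("(A:70+G:30)"))  # {'A': 70.0, 'G': 30.0}
```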

Figure 3.

HELM2 encompasses a broad spectrum of complexity in characterizing biopolymers and, where possible, builds highly structured representations. In this way, HELM enables downstream computational tools to parse and process compounds without the need for complicated database schemas or algorithms.

Competing Technologies. The simplified molecular-input line-entry system (SMILES)10,11 is a molecular line notation designed to capture structure in a compact and human-readable way. HELM and SMILES are similar in syntax but have very different purposes. While both notations aim to capture molecular structure, SMILES captures structure at the atomic level and HELM addresses human readability at a polymeric level. HELM is a “container” syntax where monomer ids in the HELM notation are aliases to atomic-level definitions. HELM derives the complete atomic-level polymeric structure by accessing and assembling atom-level monomeric structures, which are often represented as SMILES. For example, Figure 5 shows that the PDB structure 1ana12 contains a modified residue on the 5′ end of a tetramer. The SMILES string for this structure is

Another example is Biphalin8,9 which contains two peptide chains linked together tail-to-tail by chlorinated phenylalanine residues. Using HELM, the molecular features are both human readable and computationally accessible, and since each monomer symbol is a pointer to a unique molecular structure, the entire molecule can be rendered. Figure 4 is a graphical

Figure 4.

rendering of the HELM string that clearly shows the tail-to-tail structure and the symmetry of the modified residues.

PEPTIDE1{Y.[dA].G.F}|PEPTIDE2{Y.[dA].G.F}|CHEM1{[R1]NN[R2]}$PEPTIDE1,CHEM1,4:R2-1:R1|PEPTIDE2,CHEM1,4:R2-1:R1$$$V2.0

Using SMILES the molecular structure is accessible, but higher-level features, such as the chain symmetry, monomer stereochemistry, hydrazine linker, and sequence, are obfuscated.

Clc1ccc(cc1)C[C@H](C(=O)NNC(=O)[C@H](NC(=O)CNC(=O)[C@H](NC(=O)[C@@H](N)Cc2ccc(O)cc2)C)Cc3ccc(Cl)cc3)NC(=O)CNC(=O)[C@H](NC(=O)[C@H](N)Cc4ccc(O)cc4)C

Figure 5.


Nc1nc(=O)n(cc1I)[C@H]1C[C@H](O)[C@@H](COP(=O)(O)O)O1

While this describes the structure in detail, the relevant base modification is not apparent. For SAR-type experimental design this representation is not useful. HELM, on the other hand, clearly defines the unit as a DNA nucleotide with a modified base. The HELM string is

d([C38])p

The “d” represents the deoxyribose monomer, the “[C38]” identifier is a modified base, and the “p” is a phosphodiester backbone. Each of these three monomer symbols is in fact a syntax “pointer” to detailed structural information contained in a monomer database. In the context of the entire tetramer the structure is

Figure 6.

Another PDB (1ana) structure with a modification at the 5′ base of the sequence C−C−G−G has the following line notation:

DNA(5′-D(*(C38)P*CP*GP*G)-3′) Chains: A, B

RNA1{d([C38])p.d(C)p.d(G)p.d(G)p}$$$$

This does illustrate the 5′ C38 modification, and the “C38” identifier does reference a specific molecular entity, but several important descriptive elements are missing. The equivalent sequence of the tetramer is not clear, as there is no explicit reference to a natural analog for the C38 monomer. In this representation (PDB: 4L0A), the monomers are the nucleosides and their phosphate linkages. This means that multiple nucleosides must be defined for any one sugar modification, as is the case in the following LNA modification:16

This clearly highlights the 5′ modified base in the context of the entire molecular structure. Furthermore, it is easy to see that the DNA backbone consists of unmodified deoxyribose sugars and phosphodiester linkers.

One of the biggest differences between SMILES and HELM is that SMILES relies on the Periodic Table of Elements as its reference database whereas HELM requires a user-defined monomer database. There is no globally defined lookup table for HELM identifiers. A public database has been established at monomer.org but is not yet broadly adopted as the reference library.

The IUPAC International Chemical Identifier (InChI)13 is another specification that has considerable overlap with HELM. It was developed by IUPAC and NIST in 2005 and was intended to capture multiple facets of chemical substances including bonds, tautomers, isotopes, electronic charge, and stereochemistry. InChI strings can be more descriptive than SMILES and have great utility; however, they are not widely used in software tools. Like SMILES, InChI strings can be canonicalized and may therefore serve as a key for structure uniqueness. It was the original intent of the InChI authors to provide a globally accessible chemical identifier for searching large database indexes like google.com.14 InChI strings contain atomic-level details and, like SMILES, are not ideal for large polymeric representations. For example, the InChI string for 2′-DEOXYGUANOSINE-5′-MONOPHOSPHATE is

DNA/RNA(5′‐R(*(TLN)P*(LCG)P*(LCG)P*(LCG)P *(TLN))‐3′)

The string represents a locked nucleic acid 5′-TGGGT-3′ where the nucleosides TLN and LCG represent the modifications. The bases are part of the modified monomer, and therefore new monomers must be defined for any additional nucleosides required (i.e., ALN, GLN, etc.).

The macromolecular Crystallographic Information File (mmCIF) and PDB file17−19 are the most comprehensive file formats for capturing and storing biological macromolecular information. These were designed to capture all aspects of the 3D molecular structure of biological macromolecules. They therefore serve a very different purpose and can be viewed as the formats containing the “raw data” behind HELM.

HELM2 Technology. While there are many possible implementation strategies, here we focus on a simple use-case that leverages modern cloud-based tools and open-source software to implement a complete end-to-end solution for building, storing, and managing large databases of HELM-based structures. With this strategy, an organization will be able to register and manage millions of unique polymer/oligomer compounds with little or no startup cost.

Docker software20 has significantly changed the landscape of information technology by simplifying the deployment and long-term management of software in both local and cloud-based environments. This is evident from the ongoing adoption and use of Docker by all major cloud vendors including Google, Azure, and Amazon. Docker is an ideal way to deploy HELM-based technology since “Docker Containers” encapsulate and simplify the unavoidable complexity of molecular storage and management systems. A software tool written in C++ for a specific Linux OS and version can coexist with other tools written for other platforms simply by building the respective Docker Containers. It is this ability to package heterogeneous technology that greatly

InChI=1S/C10H14N5O6PS/c11-10-13-8-7(9(17)14-10)12-3-15(8)6-1-4(16)5(21-6)2-20-22(18,19)23/h3-6,16H,1-2H2,(H2,18,19,23)(H3,11,13,14,17)/t4-,5+,6+/m0/s1

This string is useful as a small molecule descriptor, but it suffers the same limitations as SMILES for managing large polymers. The PDB and EBI have chosen to characterize macromolecular structures in their own proprietary line notation. For example, in Figure 6 the monomer is one unit of a larger DNA structure (PDB: 5J3I).15 The heterocycle identifier for this monomer, “GS”, is highlighted in the context of the full DNA sequence as (SSG).

Chain A: DNA(5′-D(*CP*GP*(SSG)P*CP*CP*GP*CP*CP*GP*A)-3′)


simplifies the broad use of open-source tools like RDKit21 and the HELM2Notation toolkit. The Docker-HELM implementation contains two containers: one for Monomers and one for Polymers. Each container is an instance of the Ubuntu Linux operating system with all the necessary open-source software for managing and storing molecular data structures. The MonomerLib Docker contains (1) a Mongo database/JSON document store for storing structures, (2) a RESTful web service for registering, searching, and making monomer structures unique, and (3) a web server with a molecular viewer for user-editing and storing structures. The PolymerLib Docker has the same features but a different implementation, since HELM strings are stored in the database instead of actual atomic coordinates. Figure 7 shows the three

database is the original format while the JSON approach is newer and encouraged for future development. Recently the Pistoia Alliance and Ionis Pharmaceuticals released a public monomer database at monomer.org. This JSON formatted database of monomers was released to serve as a starting point for new HELM users. The current iteration contains approximately 500 monomer structures that were curated by medicinal chemists. Figure 9 is a complete monomer definition in JSON format for 2′-O-methoxyethyl ribose (2′-MOE).

Figure 9. Figure 7.

The HELM specification describes monomers as having three main “polymerType” attributes: (1) RNA for all nucleic acids, (2) PEPTIDE, and (3) CHEM for any arbitrary chemical moiety. For example, an antibody−drug conjugate will have several PEPTIDE chains along with a CHEM (small molecule) drug conjugate. This attribute allows HELM structures to have heterogeneous composition. For example, a monomer with a symbol “A” in a polymer of type RNA is not the same as a monomer “A” in a PEPTIDE polymer. In the following example the nucleic acid dimer “TA” is connected to a peptide chain (ALGK) at the 5′ end. The two monomers labeled “A” are distinct from one another because of the “polymerType” scope.

software components running on Linux: (1) a database for HELM structures, (2) a RESTful API for programmatically registering, making unique, and searching structures, and (3) the HELM editor. Docker technology provides the ideal method to deploy and maintain a complex environment like a HELM-based database. The separation between the monomer database and the polymer database can provide a useful logical barrier between users who manage and register monomers and those who work with polymers.

Monomer Management. Monomers are stored with atomic-level resolution. The Open-Source Cheminformatics toolkit in the MonomerLib Docker is RDKit (http://www.rdkit.org/). This toolkit generates canonical SMILES that are then used to determine unique structures. RDKit is also used for other utilities like structure searching. In Figure 8 the monomer labeled “r” is a pointer to a monomer database entry that contains molecular structure, default capping

RNA1{d(T)p.d(A)p}|PEPTIDE1{A.L.G.K}$RNA1,PEPTIDE1,1:R1-1:R1$$$
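One way to picture the “polymerType” scope is a lookup keyed by the pair (polymerType, symbol) rather than by symbol alone. The sketch below is purely illustrative: the `MONOMERS` table is a hypothetical two-entry stand-in for a monomer database, and `polymer_type` and `resolve` are names invented for this example.

```python
from typing import Dict, Tuple

# Hypothetical, minimal stand-in for a monomer database: the key is the pair
# (polymerType, symbol), so the same symbol can exist once per polymer type.
MONOMERS: Dict[Tuple[str, str], str] = {
    ("RNA", "A"): "adenine (nucleobase)",
    ("PEPTIDE", "A"): "alanine (amino acid)",
}


def polymer_type(polymer_id: str) -> str:
    """Strip the trailing chain number: "PEPTIDE1" -> "PEPTIDE"."""
    return polymer_id.rstrip("0123456789")


def resolve(polymer_id: str, symbol: str) -> str:
    """Look up a monomer symbol within the scope of its polymer's type."""
    return MONOMERS[(polymer_type(polymer_id), symbol)]


print(resolve("RNA1", "A"))      # adenine (nucleobase)
print(resolve("PEPTIDE1", "A"))  # alanine (amino acid)
```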

The “monomerType” field is used to describe the type of monomer within a polymer chain. This is mostly used for nucleic acids, where a monomer can be a “backbone” or “branch”. In the previous example the bases enclosed by parentheses in the RNA1 chain are branch monomers while the DNA sugar and phosphodiester linker are backbone monomers. This is especially important when parsing a biological sequence from a nucleic acid chain. For bases that have monomer symbols that match the sequence letters (i.e., A, C, T, G, U), the sequence of the oligomer is relatively simple to extract. However, when computationally parsing the sequence from a modified oligomer one must iterate over the RNA chain, find all branch monomers in the sequence, and build a new string with the corresponding natural analog of each branch monomer.

Biopolymers. Once the monomer database is operational, HELM-based biopolymers can be managed using a series of tools recently released by the Pistoia Alliance, including the HELM2NotationToolkit, the HELM Web Services, and the HELM editor. The HELM2 Software Framework was released in April, 2016, to the open source community without restriction. The goal of the first release is to provide a platform for industry-wide adoption and integration of the HELM standard. The software comes with

Figure 8.

groups and the natural analog identifier. As long as the HELM string is used within the context of a monomer database, the full polymer can be constructed with atom-level resolution. The structure and stability of the database are essential to the successful deployment of a HELM-based polymer library within an organization. Currently there are two formats for the monomer database: (1) an XML specification and (2) a JSON structure. The XML


three main Java software packages: HELM2NotationToolkit, HELMNotationParser, and the ChemistryToolkit. HELM2NotationToolkit contains the core data model and all necessary functionality for reading, writing, and managing HELM structures, including a chemical engine for calculating simple properties like molecular formula, extinction coefficient, and molecular weight. The HELM2NotationParser provides low-level parsing methods and includes a method for converting HELM1 to HELM2. One of the most important operations HELM users require is the ability to iterate over all polymers and monomers in a particular HELM notation string. This is illustrated in the following example:

Figure 10.

version was released by the Pistoia Alliance in early 2017 and includes a JavaScript molecular viewer for structure download and display. It is the primary tool for building oligonucleotide and small peptide structures. The Antibody Editor (HAbE) is an Open-Source Java client developed and maintained by the Roche Innovation Center. HAbE was designed to manage the specific complexity of large antibody structures and permits users to decompose, edit, and register complex antibody and antibody-conjugate structures. Further information and documentation on all HELM-based tools can be found at http://openhelm.org. The HELM2 source code is hosted on GitHub: http://github.com/PistoiaHELM/.

Third-Party Software and Databases. There are many third-party vendors and communities implementing HELM tools in various capacities. ChemAxon offers HELM support for their Biologics registration service and has been heavily involved in HELM development. Biovia/Dassault has integrated HELM into their ScienceCloud.com services, and recently PerkinElmer announced plans to build a HELM editor into their widely used ChemDraw application.

In December 2016 Ionis Pharmaceuticals released a curated set of monomers via the HELM monomer database: http://Ionis.monomer.org. This database contains over 500 monomer structures for building novel oligonucleotides and peptides as well as several known small molecule conjugates. This database serves as a starting point for developing an internal HELM2-based database.

Using this code one can parse the “id” and “type” as “PEPTIDE1” and “PEPTIDE” respectively from the following HELM string:

PEPTIDE1{M.A.L.W.M.R.L.L.P.L.L.A.L.L.A.L.W.G.P.D.P.A.A.A.F.V.N.Q.H.L.C.G.S.H.L.V.E.A.L.Y.L.V.C.G.E.R.G.F.F.Y.T.P.K.T.R.R.E.A.E.D.L.Q.G.S.L.Q.P.L.A.L.E.G.S.L.Q.K.R.G.I.V.E.Q.C.C.T.S.I.C.S.L.Y.Q.L.E.N.Y.C.N}$$$$V2.0
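The toolkit itself is written in Java; as a language-neutral illustration of the same iteration, the Python sketch below extracts the id, type, and monomer list of each simple polymer with a regular expression. It is not the HELM2NotationToolkit API: `iter_polymers` and `peptide_monomers` are hypothetical helpers, and the example uses only the first six residues of the insulin string above to keep the output short.

```python
import re
from typing import Iterator, List, Tuple

# One simple polymer: type letters, chain number, then the body in braces.
_POLYMER = re.compile(r"([A-Z]+)(\d+)\{([^}]*)\}")


def iter_polymers(helm: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (id, type, body) for each simple polymer in the first HELM section."""
    simple_polymers = helm.split("$", 1)[0]
    for match in _POLYMER.finditer(simple_polymers):
        polymer_type, number, body = match.groups()
        yield polymer_type + number, polymer_type, body


def peptide_monomers(body: str) -> List[str]:
    """Split a peptide body on "." (assumes no "." inside a bracketed monomer id)."""
    return body.split(".")


# Truncated, illustrative version of the insulin string.
for polymer_id, polymer_type, body in iter_polymers("PEPTIDE1{M.A.L.W.M.R}$$$$V2.0"):
    print(polymer_id, polymer_type, len(peptide_monomers(body)))  # PEPTIDE1 PEPTIDE 6
```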

The Connections section defines the covalent bonds between any two simple polymers. The method “HELM2Notation.getListOfConnections()” returns a list of ConnectionNotation objects with references to the source and target HELM entities, i.e., polymers or group objects. For example, to parse an oligonucleotide bound to a small molecule conjugate one needs to use the connection portion of the HELM string. This might look like the following: “RNA1,CHEM1,21:R2‐1:R1”



The code to parse this looks like
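As a hedged stand-in for that parsing step, the Python sketch below breaks one connection record into its source, target, positions, and R groups. It does not reproduce the Java `ConnectionNotation` class; the `Connection` tuple and `parse_connection` function are hypothetical names. Positions are kept as strings because HELM2 connection records may carry non-numeric placeholders such as “?”.

```python
import re
from typing import NamedTuple


class Connection(NamedTuple):
    source_id: str
    target_id: str
    source_position: str  # kept as text: may be a number or a placeholder like "?"
    source_group: str
    target_position: str
    target_group: str


_CONNECTION = re.compile(
    r"(?P<src>[^,]+),(?P<tgt>[^,]+),"
    r"(?P<spos>[^:]+):(?P<sgrp>[^-]+)-(?P<tpos>[^:]+):(?P<tgrp>.+)"
)


def parse_connection(text: str) -> Connection:
    """Parse a record such as "RNA1,CHEM1,21:R2-1:R1"."""
    # Typeset strings often carry typographic hyphens; normalise them first.
    cleaned = text.strip().replace("\u2010", "-").replace("\u2212", "-")
    match = _CONNECTION.fullmatch(cleaned)
    if match is None:
        raise ValueError(f"not a connection record: {text!r}")
    return Connection(*match.group("src", "tgt", "spos", "sgrp", "tpos", "tgrp"))


conn = parse_connection("RNA1,CHEM1,21:R2-1:R1")
print(conn.source_id, conn.source_position, conn.target_id)  # RNA1 21 CHEM1
```

Filtering parsed records on a target id that starts with “CHEM” is then enough to collect the conjugated oligonucleotides in a set of HELM strings.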

CONCLUSION

The release of the second version of HELM (HELM2) greatly expands the utility of the molecular notation for storing, managing, and visualizing complex biological macromolecules. Along with this release, the Pistoia Alliance published several Open-Source software tools to help organizations adopt and integrate the HELM standard. Finally, Ionis Pharmaceuticals released a curated monomer database at monomer.org to further support the HELM community by serving as a “starter” monomer library. These collaborative efforts provide a complete end-to-end solution for institutional HELM2 adoption for biopolymer management.

Using the HELM2NotationToolkit one can build a searchable index of conjugated oligonucleotides with very little code and complexity. In addition to notation management and HELM string parsing, the HELM2 software library provides a Chemistry toolkit. This software API provides a general framework for calculating chemical properties of biopolymers including molecular weight, formula, and extinction coefficient. The design of the ChemistryToolkit is polymorphic in the sense that different chemistry implementations can be configured at runtime instead of compile time. For example, Figure 10 shows how two different vendors might represent the same monomer. The toolkit may be configured to point to either implementation without a compilation step, thus greatly simplifying the integration of third-party solutions.

HELM Client Tools. There are two major development efforts for end-user construction and visualization of HELM structures: the HELM Editor and the HELM Antibody Editor. The HELM Editor is a general biopolymer editor that connects to a proprietary monomer database and allows users to construct, visualize, and download HELM-based biopolymers. The first



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Jeff Milton: 0000-0001-8160-1390

Notes

The authors declare no competing financial interest.



REFERENCES

(1) Goble, C.; Wolstencroft, K. Dictionary of Bioinformatics and Computational Biology; Wiley, 2004.


(2) Mullard, A. FDA drug approvals. Nat. Rev. Drug Discovery 2015, 14, 77−81.

(3) Niculescu-Duvaz, I. Trastuzumab emtansine, an antibody-drug conjugate for the treatment of HER2+ metastatic breast cancer. Curr. Opin. Mol. Ther. 2010, 12 (3), 350−60.

(4) New data from Phase III EMILIA study showed Roche’s trastuzumab emtansine (T-DM1) significantly improved survival of people with HER2-positive metastatic breast cancer. http://www.roche.com/media/store/releases/med-cor-2012-08-27.htm (accessed 17th November 2016).

(5) Marcoux, J.; Champion, T.; Colas, O.; Wagner-Rousset, E.; Corvaïa, N.; Van Dorsselaer, A.; Beck, A.; Cianférani, S. Native mass spectrometry and ion mobility characterization of trastuzumab emtansine, a lysine-linked antibody drug conjugate. Protein Sci. 2015, 24, 1210−1223.

(6) Zhang, T.; Li, H.; Xi, H.; Stanton, R. V.; Rotstein, S. H. HELM: A Hierarchical Notation Language for Complex Biomolecule Structure Representation. J. Chem. Inf. Model. 2012, 52, 2796−2806.

(7) Durand, P.; Canard, L.; Mornon, J. P. Visual BLAST and Visual FASTA: graphic workbenches for interactive analysis of full BLAST and FASTA outputs under Microsoft Windows 95/NT. Bioinformatics 1997, 13, 407−413.

(8) Feliciani, F. Structure-activity relationships of biphalin analogs and their biological evaluation on opioid receptors. Mini Rev. Med. Chem. 2013, 13 (1), 11−13.

(9) Wu, Q. Y. Characterization and identification of antiestrogenic products of phenylalanine chlorination. Water Res. 2016, 44 (12), 3625.

(10) Daylight Theory: SMILES. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html (accessed 14th October 2016).

(11) Daylight > SMARTS Tutorial. http://www.daylight.com/dayhtml_tutorials/languages/smarts/index.html (accessed 14th October 2016).

(12) RCSB PDB−Search Results. http://www.rcsb.org/pdb/results/results.do?tabtoshow=Current&qrid=A5694E9A (accessed 14th January 2017).

(13) Heller, S. R.; McNaught, A.; Pletnev, I.; Stein, S.; Tchekhovskoi, D. InChI, the IUPAC International Chemical Identifier. J. Cheminf. 2015, 7, 23.

(14) Southan, C. InChI in the wild: an assessment of InChIKey searching in Google. J. Cheminf. 2013, 5 (1), 10.

(15) Lan, W.; Hu, Z.; Shen, J.; Wang, C.; Jiang, F.; Liu, H.; Long, D.; Liu, M.; Cao, C. Structural investigation into physiological DNA phosphorothioate modification. Sci. Rep. 2016, 6, 25737.

(16) Russo Krauss, I.; et al. A regular thymine tetrad and a peculiar supramolecular assembly in the first crystal structure of an all-LNA G-quadruplex. Acta Crystallogr., Sect. D: Biol. Crystallogr. 2014, 70, 362−370.

(17) Bourne, P. E.; Bernstein, H. J.; Bernstein, F. C. Translating PDB entries into mmCIF. Acta Crystallogr., Sect. A: Found. Crystallogr. 1996, 52, C575−C575.

(18) Martz, E. In Dictionary of Bioinformatics and Computational Biology; Wiley, 2004.

(19) Stan Tsai, C. An Introduction to Computational Biochemistry; John Wiley & Sons, 2003.

(20) Goasguen, S. Docker Cookbook; O’Reilly Media, Inc., 2015.

(21) (a) Tosco, P.; Stiefl, N.; Landrum, G. The integration of Open3DTOOLS into the RDKit and KNIME. J. Cheminf. 2014, 6, P8. (b) RDKit: Cheminformatics and Machine Learning Software; 2013; http://www.rdkit.org.

(22) Scifinder. www.cas.org/SCIFINDER.
