Application Note Cite This: J. Chem. Inf. Model. 2017, 57, 2407-2412

pubs.acs.org/jcim

Quick Way to Port Existing C/C++ Chemoinformatics Toolkits to the Web Using Emscripten

Chen Jiang*,† and Xi Jin‡

†Department of Organic Chemistry and ‡Department of Foreign Languages, China Pharmaceutical University, Nanjing 210009, Jiangsu, China

ABSTRACT: Emscripten is an open source compiler that compiles C and C++ code into JavaScript. Using this compiler, several typical C/C++ chemoinformatics toolkits and libraries were quickly ported to the web. The compiled JavaScript files have sizes similar to those of the native programs, and a series of benchmarks shows that the performance of the compiled JavaScript code is close to that of the native code and better than that of handwritten JavaScript code. Therefore, we believe that Emscripten is a feasible and practical tool for reusing existing C/C++ code on the web, and that many other chemoinformatics or molecular calculation software tools can be ported with Emscripten just as easily.

1. INTRODUCTION

Over the past decades, the web has become one of the most important platforms for chemical computing. JavaScript, originally designed as an interpreted programming language, is the only language widely supported by current web browsers, and together with other web standards such as HTML (HyperText Markup Language)1 and CSS (Cascading Style Sheets),2 it forms the basis of web applications. Today, many JavaScript-based molecular viewers/editors3−5 and a few chemoinformatics toolkits, such as ChemDoodle Web Components6 and our Kekule.js,7 have been developed, and their value has been proven on the web. However, these newly developed tools are still far from mature. Meanwhile, a huge number of powerful programs and libraries targeting native platforms have been written over the history of computational chemistry, covering many aspects of chemoinformatics and molecular modeling. These programs are often developed in traditional languages such as C and C++ and are compiled into native machine code for performance. Native code cannot run in web browsers, but abandoning all of that legacy and reimplementing it in JavaScript would be a Herculean duplication of effort.

Fortunately, for existing C/C++ libraries, Emscripten,8 first introduced by Mozilla,9 is an ideal tool for quickly porting them to the web. Emscripten is an open-source LLVM10-based compiler. Instead of producing native machine code like other commonly used compilers, Emscripten converts LLVM bitcode to asm.js, a highly optimizable subset of JavaScript. Theoretically, Emscripten supports any language that can be converted into LLVM bitcode, but currently its tool chain is mainly focused on C and C++. Usually, existing portable C/C++ code needs only a few modifications before being ported with Emscripten. The compiler also provides various options for connecting normal JavaScript with compiled asm.js code, e.g., exposing functions and classes from C/C++ to JavaScript (and vice versa).

In recent years, Dr. O’Boyle, one of the key developers of the well-known chemoinformatics toolbox OpenBabel,11 has done some experimental work in compiling several C++ toolkits with Emscripten,12 including RDKit,13 Helium,14 and OpenBabel. All of the experimental programs focused on one simple task: generating a 2D molecular figure from SMILES15 input. The task was accomplished with C++ code calling the APIs of those toolkits. The code was then compiled to JavaScript and adapted to the web browser, where it ran like a normal native program. Although the experiments were quite simple, they demonstrated the possibility of porting chemoinformatics software through Emscripten. Writing all the functional code in C/C++ and then compiling it to a JavaScript program that concentrates on one concrete task is a widely used way of utilizing Emscripten.

During this time, we also compiled OpenBabel in a different mode. Instead of compiling it to a simple executable program, we ported it as a universal library, exposing API functions and classes to JavaScript. In this manner, our compilation covered more features of the toolkit and was easier for web developers to use. Other JavaScript code could access the library and share the main functions of OpenBabel. This approach also enabled the integration of OpenBabel with existing JavaScript chemoinformatics toolkits (e.g., our Kekule.js).7 A similar method has also been used by Guillaume Godin and his collaborators in porting RDKit as a JavaScript library.16

Afterward, several other libraries and programs were ported by us in the same way. The functionality of these libraries varies widely.

Received: July 19, 2017
Published: September 7, 2017

DOI: 10.1021/acs.jcim.7b00434 J. Chem. Inf. Model. 2017, 57, 2407−2412

Application Note

Journal of Chemical Information and Modeling

In our current work, most of the C API functions of Indigo have been ported, and the corresponding JavaScript compilation is almost as powerful as the native one. In addition, the Indigo library relies heavily on C++ exceptions, so the extra command line setting DISABLE_EXCEPTION_CATCHING=0 must be used during compilation to enable them in the compiled JavaScript. Unfortunately, this setting has some negative effects on performance, which are discussed in Section 5.3.

2.3. OpenBabel. OpenBabel11 is a well-known open source chemoinformatics toolkit, famous for its ability to convert between dozens of different chemical formats. Compiling OpenBabel into JavaScript and integrating it with existing web-based chemoinformatics applications would therefore be extremely helpful. OpenBabel is built upon hundreds of C++ classes. Compared to plain C code, much more work must be done to export C++ classes to JavaScript. Emscripten provides two options to accomplish this: Embind and the WebIDL binder. The two methods do not differ significantly in performance, but the former is more complex and more powerful. In our current work, Embind was used. Extra binding code and some helper classes were created to expose many key OpenBabel classes such as OBAtom, OBBond, and OBMol, as well as some façade classes such as ObFormatWrapper and ObConversionWrapper. For example, the class ObConversionWrapper, a key façade class written by us to conduct data format conversions, was exposed together with all its methods.

The functionality of the ported libraries ranges from molecular data I/O (input/output) to molecular dynamics calculations, but the porting of these projects shares some common steps. First, since the graphics and interaction mechanisms of web browsers are totally different from those of the desktop, the original C/C++ code related to the UI (user interface) should be stripped off. Second, the proper settings or additional C++ source files should be provided to export certain C functions or C++ classes to JavaScript for later use. Third, the whole project is compiled with Emscripten to produce a JavaScript library file. Last, the essential JavaScript glue code should be written to link the exported functions or classes of the compiled library to the web browser UI.

For these ported projects, compiled file sizes, execution results, and performance were also compared between the native and Emscripten compilations. All of these comparisons gave encouraging results, and we believe that reusing existing chemoinformatics C/C++ code on the web platform in this way is not only possible but also feasible and practical.

It should be noted that, aside from utilizing these compiled projects directly, users may easily build their own JavaScript compilations of these libraries to suit their own needs (e.g., with different function exports, class bindings, and so on). The building instructions, demonstrations, and all essential source files are provided in the Supporting Information of this manuscript and published on the web at GitHub17 and our homepage.18 Moreover, other C/C++ based libraries can be ported in a similar way.

2. PORTED PROJECTS

2.1. InChI. InChI (the IUPAC International Chemical Identifier)19 is a widely used textual identifier for chemical substances. The InChI Trust,20 its current maintainer, has published an official open source library to generate InChI. The library is written completely in plain C and provides a set of application programming interface (API) functions. It is not difficult to compile the InChI library with Emscripten, and a special compiler command line parameter, EXPORTED_FUNCTIONS, can be used to expose the C functions to normal JavaScript. However, since C-style structures and pointers are used widely in the InChI API, it is quite complex to use these functions directly in normal JavaScript. Therefore, an additional façade function, molToInchiJson, which converts MDL mol format21 molecule data to a JSON (JavaScript Object Notation)22 string of InChI information, was created by us to wrap the most common usage of the InChI library. Since this wrapper function uses only simple strings for input and output, it can be called from JavaScript with ease.
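The original listing for this call is not preserved in this text. As a minimal sketch only, assuming the compiled library exposes molToInchiJson through Emscripten's standard Module.cwrap helper with one string argument and a string return value, a call might look like the following. The stub Module object and the JSON field names (inchi, inputLength) are hypothetical and exist only so that the snippet runs on its own.

```javascript
// Stand-in for the Emscripten-generated Module object. In a real page the
// compiled library script defines Module; this stub only mimics cwrap()
// so that the sketch is runnable by itself.
const Module = {
  cwrap(name, returnType, argTypes) {
    if (name !== 'molToInchiJson') {
      throw new Error('unknown export: ' + name);
    }
    // Hypothetical behaviour: return a JSON string describing the result.
    return (molData) => JSON.stringify({
      inchi: 'InChI=1S/...',        // placeholder, not a real identifier
      inputLength: molData.length   // hypothetical field, for illustration
    });
  }
};

// cwrap(name, returnType, argTypes) builds a JavaScript function that
// marshals strings to and from the compiled code.
const molToInchiJson = Module.cwrap('molToInchiJson', 'string', ['string']);

// The façade takes MDL mol text and returns a JSON string, so the caller
// only has to parse the result.
function getInchiInfo(molData) {
  return JSON.parse(molToInchiJson(molData));
}

const info = getInchiInfo('...MDL mol file text...');
console.log(info.inchi);
```

Because only strings cross the boundary, the caller never has to allocate C structures or manage pointers on the compiled heap, which is the point of wrapping the API in a façade function.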

After binding, these classes can be called from JavaScript in a style similar to that used in C++. For example, MDL mol format data can be converted to the PDB24 format in this way.
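The original conversion listing is not preserved here, so the following is only a sketch of how an Embind-exported class is typically used from JavaScript. The class name ObConversionWrapper comes from the text, but the method names setInFormat, setOutFormat, and convert are hypothetical, and a stub class stands in for the compiled library so that the snippet runs on its own. The one Embind convention relied on is that bound C++ objects are created with new and must be released explicitly with delete().

```javascript
// Stub standing in for the Embind-exported class. The method names
// (setInFormat, setOutFormat, convert) are hypothetical; only the class
// name ObConversionWrapper comes from the text.
class ObConversionWrapperStub {
  constructor() {
    this.inFormat = null;
    this.outFormat = null;
    this.deleted = false;
  }
  setInFormat(id) { this.inFormat = id; }
  setOutFormat(id) { this.outFormat = id; }
  convert(data) {
    // A real build would run OpenBabel here; the stub only tags the input.
    return '[' + this.inFormat + ' -> ' + this.outFormat + '] ' +
      data.length + ' chars';
  }
  // Embind objects live on the compiled heap and are released with delete().
  delete() { this.deleted = true; }
}
const Module = { ObConversionWrapper: ObConversionWrapperStub };

function molToPdb(molData) {
  const conv = new Module.ObConversionWrapper();
  try {
    conv.setInFormat('mol');
    conv.setOutFormat('pdb');
    return conv.convert(molData);
  } finally {
    conv.delete(); // always free the C++ object, even if conversion throws
  }
}

console.log(molToPdb('...MDL mol file text...'));
```

Wrapping the call in try/finally is a design choice rather than a requirement: JavaScript garbage collection does not reclaim objects allocated on the compiled heap, so forgetting delete() leaks memory.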

Aside from data conversion, OpenBabel also supports molecular mechanics calculations through the class OBForceField. That C++ class was also exported and can be used directly in JavaScript.

2.4. OpenMD. In the past, the running speed of JavaScript was no match for that of native programs, so it was not good practice to perform time-consuming calculations such as molecular modeling or simulations in the web client.

2.2. Indigo. Indigo23 is a powerful open source universal organic chemistry toolkit. The toolkit itself is written in C++. However, it also ships with a plain C API, which can be exported to JavaScript in a way similar to that described for InChI.


Table 1. File Size Comparison between Native and Compiled JavaScript Files

project      native file name and size (MB)    JavaScript file name and size (MB)
InChI        inchi-1, 1.2                      inchi.js, 1.2
Indigo       libindigo.so, 9.2                 indigo.js, 10.3
OpenBabel    libopenbabel.so, 33.4             openbabel.js, 7.3
OpenMD       openmd, 3.3                       openMD.js, 9.6

Figure 1. Average consumed time ratios between native and compiled JavaScript code in different web browsers (the smaller the faster, native as 100%).

In the functional tests, because numeric precision differs between JavaScript and C/C++, results from long calculations involving floating-point numbers (e.g., OpenMD simulations) may differ slightly but remain within the allowable range of error. All these test sources are provided in the Supporting Information of this manuscript.

After years of evolution, the efficiency of JavaScript has been greatly improved by modern engines. OpenMD (formerly called OOPSE),25 an open source C++ molecular dynamics engine, has therefore been experimentally ported by us. Unlike many libraries or toolkits, OpenMD is primarily designed as a standalone simulation program and does not provide a clear API. Therefore, several façade classes were created to wrap the calculation process. Moreover, the original OpenMD program relies heavily on the file system. For example, it reads source data from an OMD file and writes the resulting data to report, status, and dump files. Unfortunately, JavaScript cannot access local files directly in the web environment, so the virtual file system of Emscripten is used in the JavaScript compilation, allowing the input and output data to be passed as normal strings.

A molecular dynamics simulation often requires a great deal of time, ranging from minutes to hours or even days. The OpenMD JavaScript compilation should therefore never be run in the main thread, where it would block the web browser UI. A web worker,26 a simple means for web content to run scripts in background threads, is the ideal place to run OpenMD calculations, with data transferred between the web worker and the main thread by messages.
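The message-passing arrangement described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: runSimulation is a hypothetical stand-in for the compiled OpenMD façade, the message shapes (type, omd, fraction, dump) and the worker file name are invented for the example, and the handler is written as a plain function so that it can be exercised without a browser.

```javascript
// Worker-side logic, written as a plain function so it can be exercised
// outside a browser. runSimulation is a hypothetical stand-in for the
// compiled OpenMD façade; it receives OMD text and reports progress.
function runSimulation(omdText, onProgress) {
  const steps = 3; // a real run would take far longer
  for (let i = 1; i <= steps; i++) {
    onProgress(i / steps);
  }
  return 'dump for ' + omdText.length + ' chars of input';
}

// Handles one message from the main thread and replies through post().
function handleMessage(msg, post) {
  if (msg.type !== 'run') {
    post({ type: 'error', reason: 'unknown message type' });
    return;
  }
  const dump = runSimulation(msg.omd, (fraction) =>
    post({ type: 'progress', fraction: fraction }));
  post({ type: 'done', dump: dump });
}

// Inside a real Web Worker, wire the handler to the message channel.
if (typeof WorkerGlobalScope !== 'undefined' &&
    self instanceof WorkerGlobalScope) {
  self.onmessage = (event) =>
    handleMessage(event.data, (m) => self.postMessage(m));
}

// Main-thread side in a browser (the file name is an assumption):
//   const worker = new Worker('openmd-worker.js');
//   worker.onmessage = (e) => console.log(e.data);
//   worker.postMessage({ type: 'run', omd: omdText });

// Local dry run without a worker, collecting the replies in an array.
const replies = [];
handleMessage({ type: 'run', omd: 'sample input' }, (m) => replies.push(m));
console.log(replies.map((m) => m.type).join(','));
```

Keeping the handler independent of the worker globals means the same logic can be tested synchronously, while in the browser only strings and plain objects cross the thread boundary.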

3. FUNCTIONAL TESTS

To ensure the correctness of the compiled JavaScript code, a series of functional tests were built for the ported projects. Each test follows a similar process: load data from the original test data set or examples of the native library, perform operations on the test data (e.g., generate InChI strings for the InChI library, convert molecule data formats for OpenBabel and Indigo, do 2D layout of molecules for Indigo, and run molecular dynamics simulations for OpenMD), and finally compare the results with those of the native programs to check that they are identical. In most test cases, the output results of the JavaScript and the native code are exactly the same.

4. COMPILED FILE SIZE COMPARISON

File sizes were compared between the ported JavaScript libraries and the corresponding native programs or shared object files on Ubuntu Linux. The results are listed in Table 1. In general, the file sizes of the JavaScript and the corresponding native compilations are similar. For example, both inchi.js and the native inchi-1 program focus on the same job, converting a molecular structure to an InChI string, and both are about 1.2 MB in size. In indigo.js, most of the C API functions of Indigo are exported during compilation, as mentioned in Section 2.2, so its size is also very close to that of the native libindigo.so. For OpenBabel, since only some of the functions and classes are exposed in our JavaScript library, openbabel.js is much smaller than the native libopenbabel.so. For OpenMD, the JavaScript compilation appears much larger than the native openmd program. It should be noted, however, that the native program requires some external force field files (about 2.12 MB in total). In our JavaScript compilation, all of those external data files are embedded and converted to text format (a limitation of JavaScript source code), so the size of openMD.js is greatly inflated. Another option in Emscripten is to split these external data from the main JavaScript code and organize them into one separate data file. With this option, the size of openMD.js is reduced to 3.15 MB, which is even a little smaller than the native program, and the data file is 2.12 MB, exactly the same as the total size of the original force field files. However, such a separation may add complexity to the deployment of web applications.

Because web applications must be transmitted over the network, file size is usually a greater concern for them than for desktop programs. The size of a compiled JavaScript file can be further reduced by removing superfluous functions and classes exported from C/C++. For example, if only the functions of the Indigo library related to I/O and 2D layout are exported, the final JavaScript file size is about 7 MB, 30% smaller than the one with all API functions exported.
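The size figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in Table 1 and in the text; the variable and function names are illustrative, and the 7 MB figure is the approximate value given above.

```javascript
// Sizes in MB, taken from Table 1 and the surrounding text.
const sizes = {
  indigoFull: 10.3,     // indigo.js with all API functions exported
  indigoReduced: 7,     // approximate size with only I/O and 2D layout
  openmdEmbedded: 9.6,  // openMD.js with force field files embedded
  openmdCode: 3.15,     // openMD.js after splitting out the data
  openmdData: 2.12      // separate data file
};

// Relative reduction from `before` to `after`, as a percentage.
function percentSmaller(before, after) {
  return ((before - after) / before) * 100;
}

const indigoSaving = percentSmaller(sizes.indigoFull, sizes.indigoReduced);
const openmdSplitTotal = sizes.openmdCode + sizes.openmdData;

console.log(indigoSaving.toFixed(0) + '% smaller');
console.log(openmdSplitTotal.toFixed(2) + ' MB in total after splitting');
```

With these inputs the reduced Indigo build comes out roughly a third smaller, consistent with the "30% smaller" quoted above, and the split OpenMD build totals about 5.27 MB against 9.6 MB for the embedded one.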

5. PERFORMANCE COMPARISON

A series of benchmark programs were built by us to compare the execution speed of the compiled JavaScript and the native code. All tests were executed in a virtual machine configured with a four-core CPU and 4 GB of memory, running on a PC with an Intel Xeon E3-1230 V2 CPU and 8 GB of DDR3-1600 memory. Native programs were compiled from C/C++ by GCC (the GNU Compiler Collection)27 and run on Ubuntu Linux 16.04, while the JavaScript code was run in three different web browsers: Mozilla Firefox 53.0.2 and Google Chrome 58.0.3029 on Ubuntu, and Microsoft Edge 25.10586 on Windows 10. In the benchmarks, the typical functions of the four libraries were performed on a series of test cases. The average ratios of consumed time between the JavaScript and native compilations are shown in Figure 1. More details are discussed in the following parts of this section. The detailed result data of these benchmarks, together with all source code and test cases, are provided in the Supporting Information of this manuscript.

5.1. InChI. The speeds of converting a set of MDL mol format molecular data into InChI were compared between the native program and the JavaScript code. To reduce errors, each conversion was executed 1000 times, and the total time required was recorded. The detailed results are shown in Figure S1 and Table S1 in the Supporting Information. These tests show that the speed of the compiled JavaScript varies between web browsers. In Firefox, which has the best optimization for asm.js, the compiled code was about 2−4 times slower than the native code in all test cases. In the other browsers, it was 5−10 times slower. Nevertheless, the comparative results are quite encouraging considering that C has a reputation for speed while JavaScript originated as an interpreted language and runs with the extra burdens imposed by the web browser environment.

5.2. OpenBabel. The I/O aspect of OpenBabel was tested first, on a set of MDL mol format molecular files. The speeds of reading/writing the mol data and converting it to the PDB format were compared. Each of these tests was also executed 1000 times, and the consumed time was recorded. The results are shown in Table S2 and Figure S2. It can be seen that the speeds of the compiled JavaScript were closer to those of the native C++ code. In Firefox, JavaScript ran about 2 times slower than the native code in MDL mol data I/O.

An extra comparison was made between the Emscripten-compiled code and handwritten JavaScript, and the results are included in Table S2. Our Kekule.js toolkit was used in this case to run the same task, namely outputting MDL mol data from the same set of molecules. Although the implementations of OpenBabel and Kekule.js are totally different, so the comparison is not exact, it is still quite obvious from the tests that the compiled JavaScript code was faster than the handwritten code.

The molecular mechanics calculation speeds of OpenBabel were also determined. In our tests, the MMFF94 force field was used to generate 3D structures from a set of 2D molecules, and the results are shown in Table S3 and Figure S3. Quite surprisingly, in some cases JavaScript in Firefox and Edge was even faster than the native code. There are two possible reasons. First, compared to string manipulation jobs (e.g., the molecule data I/O in the previous tests), all browsers seem to be more efficient at numeric calculation. Second, many log messages are printed during the calculations. On the native side, the log messages are printed synchronously, in the same thread as the calculation. In most web browsers, by contrast, messages in the console window seem to be displayed asynchronously in a separate thread, so the logging cost does not affect the calculation speed.

5.3. Indigo. Many functions of Indigo, including molecule data I/O, 2D molecular layout, and tautomer discovery, were compared. Each of these tests was also executed 1000 times, and the total consumed time was recorded. The results are shown in Table S4 and Figure S4. Since exceptions were enabled during compilation, as mentioned in Section 2.2, the speed of the JavaScript was affected to some degree and was about 5−15 times slower than the native code in all the browsers. However, since the execution time of a single operation is still no more than several milliseconds, the JavaScript compilation is usable in most circumstances.

5.4. OpenMD. Some of the OpenMD example files were used in the comparison, and the execution results are shown in Table S5 and Figure S5. As with the molecular mechanics calculations of OpenBabel, in these long-running numerical calculations the efficiency of the compiled JavaScript was very close to that of the native code; in Firefox in particular, it was only about 1.6 times slower.

It should also be noted that native OpenMD can run in parallel mode, performing calculations on multiple CPU cores to shorten the total computing time. For JavaScript in web browsers, however, low-level multithreading is currently far from usable. In the above comparison, native OpenMD was therefore forced to run in single-core mode. Consequently, most native molecular modeling and calculation packages are currently still superior to JavaScript, since they usually support a multithreaded mode. The situation may improve in the future, however, as a work-in-progress research project and a prototype specification are trying to overcome this shortcoming of JavaScript.28
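The timing procedure used in these benchmarks (repeating an operation 1000 times, recording the total time, and expressing it relative to the native result) can be sketched as follows. This is not the authors' benchmark code, which is in the Supporting Information; the helper names are illustrative, the workload is a trivial placeholder, and the timings passed to ratioToNative are made up for the example.

```javascript
// Returns a millisecond timestamp; falls back to Date.now() where the
// high-resolution timer is unavailable.
function now() {
  return (typeof performance !== 'undefined' && performance.now)
    ? performance.now()
    : Date.now();
}

// Runs task() `repeats` times and returns the total elapsed milliseconds.
function timeRepeated(task, repeats = 1000) {
  const start = now();
  for (let i = 0; i < repeats; i++) {
    task();
  }
  return now() - start;
}

// Ratio of JavaScript time to native time, as a percentage (native = 100%).
function ratioToNative(jsMs, nativeMs) {
  return (jsMs / nativeMs) * 100;
}

// Hypothetical workload standing in for one conversion call.
let counter = 0;
const elapsed = timeRepeated(() => { counter += 1; });
console.log('runs:', counter, 'total ms:', elapsed.toFixed(3));

// Example with made-up timings: 300 ms in the browser vs 100 ms natively.
console.log(ratioToNative(300, 100) + '%');
```

Summing over many repetitions matters because a single conversion takes only a few milliseconds at most, which is too close to the timer resolution to compare reliably.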

6. CONCLUSION

The file size comparison and performance benchmarks detailed in this study indicate that Emscripten-compiled JavaScript code is quite practical for use in the web environment. It may be somewhat slower than the native code, but in many cases that consume very little time (e.g., molecule data format conversion, tautomer matching, and so on), execution speed is not the dominant concern. In addition, the compiled code runs much faster than handwritten JavaScript, so it is well suited for adaptation to current web chemoinformatics toolkits, either to implement extra functions or to perform time-consuming calculations. For instance, the InChI, OpenBabel, and Indigo JavaScript libraries have now been provided as plugins to the Kekule.js toolkit. The former two add support for extra chemistry data formats, and the latter enables automatic 2D molecule layout for SMILES input (as shown in the left part of Figure 2). With OpenBabel, some molecular mechanics calculations (e.g., generating 3D structures from 2D ones) are also introduced into the toolkit (as shown in the right part of Figure 2). These compiled libraries can greatly expand the use of Kekule.js.

Figure 2. Integration of OpenBabel and Indigo with Kekule.js.

Aside from the four libraries currently built by us, other C/C++ based libraries can be ported to the web with Emscripten with ease. We expect more such porting projects in the near future. With the huge legacy of native libraries and with the help of Emscripten, a web-based chemoinformatics ecosystem can be rapidly constructed.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00434.

Detailed resulting data of all benchmarks and steps to compile the ported four projects (PDF)
Additional source files for compiling those projects (ZIP)
Compiled JavaScript code of those ported projects (ZIP)
Demos of ported libraries (ZIP)
Source files and data of benchmarks (ZIP)
Functional tests for ported libraries (ZIP)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Chen Jiang: 0000-0001-6653-1590

Notes

The authors declare no competing financial interest.

REFERENCES

(1) What is HTML. http://www.w3.org/html/ (accessed Aug 31, 2017).
(2) Cascading Style Sheets Home Page. http://www.w3.org/Style/CSS/ (accessed Aug 31, 2017).
(3) Bienfait, B.; Ertl, P. JSME: a Free Molecule Editor in JavaScript. J. Cheminf. 2013, 5, 24.
(4) Ketcher Home Page. http://lifescience.opensource.epam.com/ketcher/index.html (accessed Aug 31, 2017).
(5) JSMol Home Page. http://sourceforge.net/projects/jsmol/ (accessed Aug 31, 2017).
(6) Burger, M. C. ChemDoodle Web Components: HTML5 Toolkit for Chemical Graphics, Interfaces and Informatics. J. Cheminf. 2015, 7, 35.
(7) Jiang, C.; Jin, X.; Dong, Y.; Chen, M. Kekule.js: An Open Source JavaScript Chemoinformatics Toolkit. J. Chem. Inf. Model. 2016, 56, 1132−1138.
(8) Emscripten Home Page. http://kripken.github.io/emscripten-site/index.html (accessed Aug 31, 2017).
(9) Mozilla Home Page. http://www.mozilla.org (accessed Aug 31, 2017).
(10) LLVM Home Page. http://llvm.org/ (accessed Aug 31, 2017).
(11) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An Open Chemical Toolbox. J. Cheminf. 2011, 3, 33.
(12) Cheminformatics.js: Preamble. http://baoilleach.blogspot.ch/2015/02/cheminformaticsjs-preamble.html (accessed Aug 31, 2017).
(13) RDKit: Open-Source Cheminformatics Software. http://www.rdkit.org (accessed Aug 31, 2017).
(14) Helium GitHub Page. https://github.com/timvdm/Helium (accessed Aug 31, 2017).
(15) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Model. 1988, 28, 31−36.
(16) RDkitjs Github Page. https://github.com/cheminfo/RDKitjs (accessed Aug 31, 2017).
(17) Cheminfo-to-web Project on Github. https://github.com/partridgejiang/Cheminfo-to-web/ (accessed Aug 31, 2017).
(18) Cheminfo-to-web Project Home Page. https://partridgejiang.github.io/cheminfo-to-web/ (accessed Aug 31, 2017).
(19) Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; Pletnev, I. InChI - the Worldwide Chemical Structure Identifier Standard. J. Cheminf. 2013, 5, 7.
(20) InChI Trust Home Page. http://www.inchi-trust.org/ (accessed Aug 31, 2017).
(21) CTfile Formats. http://download.accelrys.com/freeware/ctfile-formats/ctfile-formats.zip (accessed Aug 31, 2017).
(22) Introducing JSON. http://www.json.org/ (accessed Aug 31, 2017).
(23) Indigo Home Page. http://lifescience.opensource.epam.com/indigo/index.html (accessed Aug 31, 2017).
(24) Atomic Coordinate Entry Format Version 3.3. http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html (accessed Aug 31, 2017).
(25) Meineke, M. A.; Vardeman, C. F.; Lin, T.; Fennell, C. J.; Gezelter, J. D. OOPSE: An object-oriented parallel simulation engine for molecular dynamics. J. Comput. Chem. 2005, 26, 252−271.
(26) Web Worker Standard. https://html.spec.whatwg.org/multipage/workers.html (accessed Aug 31, 2017).
(27) GCC, the GNU Compiler Collection. http://gcc.gnu.org/ (accessed Aug 31, 2017).
(28) Shared memory and atomics for ECMAscript. https://github.com/tc39/ecmascript_sharedmem (accessed Aug 31, 2017).
