Mass++: A Visualization and Analysis Tool for ... - ACS Publications

Jun 26, 2014 - Koichi Tanaka Laboratory of Advanced Science and Technology, Shimadzu Corporation, Kyoto 604-8511, Japan. ‡. Eisai Product Creation ...

Technical Note

Mass++: A visualization and analysis tool for mass spectrometry
Satoshi Tanaka, Yuichiro Fujita, Howell E. Parry, Akiyasu C. Yoshizawa, Kentaro Morimoto, Masaki Murase, Yoshihiro Yamada, Jingwen Yao, Shin-ichi Utsunomiya, Shigeki Kajihara, Mitsuru Fukuda, Masayuki Ikawa, Tsuyoshi Tabata, Kentaro Takahashi, Ken Aoshima, Yoshito Nihei, Takaaki Nishioka, Yoshiya Oda, and Koichi Tanaka
J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 26 Jun 2014
Downloaded from http://pubs.acs.org on June 26, 2014


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Mass++: A visualization and analysis tool for mass spectrometry

AUTHOR NAMES
Satoshi Tanaka,*,† Yuichiro Fujita,† Howell E. Parry,† Akiyasu C. Yoshizawa,† Kentaro Morimoto,† Masaki Murase,† Yoshihiro Yamada,† Jingwen Yao,† Shin-ichi Utsunomiya,† Shigeki Kajihara,† Mitsuru Fukuda,‡,§ Masayuki Ikawa,‡,§ Tsuyoshi Tabata,‡ Kentaro Takahashi,‡ Ken Aoshima,‡ Yoshito Nihei,║ Takaaki Nishioka,║ Yoshiya Oda,‡ and Koichi Tanaka†

AUTHOR ADDRESSES
† Koichi Tanaka Laboratory of Advanced Science and Technology, Shimadzu Corporation, Kyoto 604-8511, Japan
‡ Eisai Product Creation Systems, Eisai Co., Ltd., Tsukuba, Ibaraki 300-2635, Japan
§ iBioTech Co., Tsukuba, Ibaraki 300-0031, Japan
║ Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Nara 630-0192, Japan


KEYWORDS Mass++, MassBank, Identification, Quantitation, Platform, Plug-in, Mass Spectrometry

Abstract

We have developed Mass++, a plug-in style visualization and analysis tool for mass spectrometry. Its plug-in style enables users to customize it and develop original functions. Mass++ has several kinds of plug-ins, including rich viewers and analysis methods for proteomics and metabolomics. Plug-ins that support vendors’ raw data are currently available; hence Mass++ can read several data formats. Mass++ is both a desktop tool and a software development platform. Original functions can be developed without editing the Mass++ source code. Here we present this tool’s capability to rapidly analyze MS data and to develop functions, using examples of label-free quantitation and of implementing plug-ins or scripts. Mass++ is freely available at http://www.first-ms3d.jp/english/.

Introduction

Today, mass spectrometry (MS) plays an important role as an analysis technology in life sciences such as proteomics and metabolomics. After data acquisition by a mass spectrometer, various software products are used for visualizing, processing, and analyzing the raw data. Commercial software supplied with the MS instrument is usually used for these purposes. However, such software products usually cannot be controlled by third-party software. Therefore,


even if researchers have original ideas such as new algorithms and automation systems, they are not easy to implement.

Many free tools for the visualization, processing, and analysis of mass spectrometry data are now being distributed (e.g., Trans-Proteomic Pipeline (TPP),1,2 OpenMS,3,4 MaxQuant,5,6 MZmine,7,8 XCMS,9,10 ProteoWizard,11,12 Skyline,13,14 mzAPI,15 pyteomics,16 and Coreflow17). These tools provide powerful functions; however, some of them do not allow additional code to be integrated to implement researchers’ ideas, and others allow it but require users to write a large amount of code. (See the supplementary data for data analysis tools and development environments.)

Here we report on a new software application, Mass++ (“Mass plus plus”), for viewing and manipulating any type of mass spectrometry data. It is capable of performing a wide variety of manual or automatic tasks such as peak detection, smoothing, and automatic data submission to identification search engines such as Mascot and X! Tandem. Mass++ can read various mass spectrometer file formats, so users can analyze several kinds of data files in the same way. Mass++ can convert sample data from these formats to common formats such as AIA (netCDF), mzXML, and mzML.

Mass++ is plug-in-based software designed to satisfy diverse needs, and users can develop new automatic routines, viewers, and algorithms as plug-ins. Thus each user can customize Mass++ according to their objectives. Mass++ plug-ins are written in C++, VB.NET, or C#, and the necessary functions can be added using these languages without reference to the Mass++ source code. Additionally, Mass++ has a script console that enables creation of simple programs using a script language.

We have focused on developing comprehensive identification and quantitation analysis. Mass++ can directly post peak lists and parameters to various search engines and register the results into an internal database, which users can confirm


afterwards. Quantitation results are also saved in the internal database. Users can determine the differences between samples using the peak matrix, distribution plot, and overlapping view functions.

Mass++ contributes to metabolomics by improving the chemical identification of small molecules, which is the principal bottleneck in metabolomics studies. For chemical identification after automatic peak detection in LC-MS raw data, users can submit large batches of peak data from Mass++ to MassBank, a public repository of reference mass spectra. They can also contribute to the database by using the provided function to generate MassBank-formatted records from the peak data semi-automatically.

Software Structure of Mass++

Mass++ is implemented in a plug-in style to simplify customizing and adding functions. (See the supplementary data for the plug-in structure.) A plug-in is a software component that can be added to another software application. For example, web browsers such as Internet Explorer or Firefox cannot display .pdf files immediately after installation; however, they can do so once Adobe Reader is installed. In this case, Adobe Reader works as a plug-in for these web browsers. Another example is an add-in (or add-on) for Excel, i.e., a small program that adds a specific feature to Excel; such an add-in is one kind of plug-in. Mass++ users can likewise customize the software according to their objectives, adding functions or removing unnecessary ones to increase performance or simplify the software.


Implementation

Supported Data Formats. Mass++ supports various instrument data formats and common formats: LCMSsolution, GCMSsolution (Shimadzu), Xcalibur (Thermo Scientific), Analyst, AnalystQS (AB Sciex), MassLynx (Waters), MassHunter (Agilent), LaunchPad (Kratos), mzXML,18 mzML,19 and netCDF.20 (See the supplementary data for supported data formats.) All functions for reading file formats are also implemented as Mass++ plug-ins. Hence, Mass++ can be made to read any new file format by developing a plug-in for that format. This structure simplifies supporting newly released mass spectrometer file formats. Mass++ can export sample data as mzXML, mzML, and Mass Spectrum Binary (MSB) files. mzXML and mzML are already standard formats for mass spectrometry and are supported by many tools. However, data access in these formats can be inefficient, for example when generating a chromatogram.

MSB file format. MSB is an original, lossless data file format for Mass++. It was designed to improve file-reading performance while retaining both spectrum and chromatogram information. Being a binary format contributes to reading performance; however, its most distinctive feature is its data structure. In general, the most time-consuming step in reading mass spectrometry data is reconstructing the chromatogram. In the MSB format, chromatogram data are stored as data objects: each chromatogram is divided into multiple fragments, and statistical values pre-calculated for each fragment, such as the maximum intensity and the total intensity, are stored within the MSB file. Thus, especially when reading chromatogram data across intervals that correspond to the data fragments, chromatogram values need not be recalculated because the pre-calculated values can be read instead, and file-reading performance is thereby improved


without any loss of content. (See the supplementary data for the details of the MSB file format.) The MSB format greatly improves performance especially when a large LC/MS data file or multiple LC/MS data files are analyzed. (See the supplementary data for the performance test results.)

Basic Functions. Mass++ is general-purpose software capable of performing a wide variety of tasks. The program contains several peak-detection algorithms (Table 1) and can display the peaks on an m/z scale or a time scale. One of the goals in designing Mass++ was to create a program for viewing mass spectra, so Mass++ has several functions for this purpose, such as file input/output functions, a waveform viewer, a 3D viewer, zoom-in functionality, peak detection, smoothing, and retention time (RT) alignment (Fig. 1). (See the supplementary data for the full list of functions.)

Identification and Quantitation. For proteome data analysis, users can invoke identification and quantitation functions. Conventionally, the standard protocol for identifying proteins from MS data begins with extracting peaks, saving them to a text file, and then posting this file to a search engine; it is thus quite time-consuming. In contrast, Mass++ can directly post peak lists and parameters to certain search engines such as Mascot21 and X! Tandem;22 search results are stored in the Mass++ internal database and can be displayed via the viewing functionality of Mass++. In addition, Mass++ provides quantitation data for the peaks and manages quantitation results using a “peak matrix,” in which each row represents a peak and each column represents a sample. Users can create a peak matrix step-by-step using a wizard. The quantitation results are also stored in the internal database and linked to the corresponding identification results. Peaks related to target substances can therefore be found easily in the original mass spectrometric data.
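The idea behind the MSB chromatogram layout described above, storing pre-calculated statistics per fragment so that range queries can skip the raw points, can be illustrated with a minimal sketch. This is not the MSB file format itself: the `Fragment` class, the fixed fragment size, and the function names are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    start: int            # index of the first data point in this fragment
    values: List[float]   # raw intensities of the fragment
    total: float          # pre-calculated sum of intensities
    maximum: float        # pre-calculated maximum intensity


def build_fragments(intensities: List[float], size: int) -> List[Fragment]:
    """Split a chromatogram into fixed-size fragments with stored statistics."""
    fragments = []
    for start in range(0, len(intensities), size):
        chunk = intensities[start:start + size]
        fragments.append(Fragment(start, chunk, sum(chunk), max(chunk)))
    return fragments


def range_total(fragments: List[Fragment], lo: int, hi: int) -> float:
    """Total intensity over the half-open index range [lo, hi)."""
    total = 0.0
    for frag in fragments:
        f_lo, f_hi = frag.start, frag.start + len(frag.values)
        if f_hi <= lo or f_lo >= hi:
            continue  # fragment lies outside the requested range
        if lo <= f_lo and f_hi <= hi:
            total += frag.total  # fully covered: reuse the stored value
        else:
            # partially covered: fall back to the raw points
            total += sum(frag.values[max(lo, f_lo) - f_lo:min(hi, f_hi) - f_lo])
    return total


fragments = build_fragments([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], size=4)
covered_sum = range_total(fragments, 2, 9)
```

Only the fragments at the edges of the requested range touch raw points; every fully covered fragment contributes its stored total, which is where the saving comes from when the range is large.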


Quantitation processing using Mass++ creates a peak matrix containing the differential analysis results of all peaks in a table format. As an example, we prepared 11 samples of a protein mixture containing five proteins from yeast, as described below. All of them contained 200 fmol of yeast extracts. Three were spiked with 10 fmol of BSA tryptic peptides, five were spiked with 50 fmol of BSA tryptic peptides, and the remaining three were not spiked and contained only yeast extracts. The samples were analyzed by LC/MS on an LCMS-IT-TOF system (Shimadzu). The peak matrix is created via the following steps in Mass++.
(1) Registering samples in groups.
(2) Normalization.
(3) Retention Time (RT) Alignment.
(4) Peak-position determination.
(5) Peak-value calculation.
(6) Other analysis.
In Step (1), sample data are classified into groups, usually according to sample properties such as a control group and a treatment group. In this example, groups are classified according to the amount of BSA (none, 10 fmol, and 50 fmol). After the sample data are read, the columns of the peak matrix are created. In Step (2), samples are normalized in order to correct the intensity gaps that often appear between different samples. Mass++ includes normalization methods using internal standards, totalized peak intensities, etc. (See the supplementary data for


the full list of methods.) In Step (3), RT alignment is performed using a dynamic programming algorithm in order to correct the RT gaps that appear between different samples. In Step (4), the peak positions are determined. Mass++ includes various methods for this, such as detecting peaks (label-free), importing from a file (targeted peaks), and MRM. (See the supplementary data for the full list of methods.) In this example, the peak positions are determined by detecting peaks from the 11 samples. After the peak positions are determined, the peak matrix rows are generated. In Step (5), the peak intensities, or the areas under the peak waveforms, are calculated from the spectra or chromatograms. After this process, all elements in the peak matrix are fixed. If needed, users can perform various analyses on the peak matrix, such as statistical analysis and identification, which are used for annotating each substance peak. In this example, an analysis of variance (ANOVA) and identification are performed.

Figure 2 presents the results of the quantitation. P-value and substance columns are appended after ANOVA and identification are performed. The p-values of peaks annotated as BSA are small (p-value < 0.0001), indicating clear differences between the groups. Users can visually check peaks using a distribution plot and an overlapping view by clicking on peak rows in the peak matrix. The distribution plot displays the distribution of peak intensities or areas, and it can show a boxplot. The overlapping view presents the spectrum or chromatogram waveform, so the peak shape can be confirmed. In this case, we can confirm that peak intensities and areas increase with increasing BSA.

MassBank. MassBank,23 http://www.massbank.jp/, is a public repository of mass spectra of small molecules that currently contains 40,064 mass spectra contributed by twenty-seven laboratories (as of January 2014). It is one of the most referenced databases for chemically identifying small molecules detected by GC- and LC-MS analysis of biological samples. Mass++


has been optimized to visualize and analyze metabolome as well as proteome data by collaborating with the MassBank project. MassBank is both a simple database and a platform containing a search engine; rich search functions are available via a web browser (Table 2). Previously, users had to extract peak data in text format and paste it into the web browser because raw data is not accepted as a query for the MassBank spectral search. MassBank also provides a Simple Object Access Protocol (SOAP) Application Programming Interface (API) that enables applications to query it without a web browser. Mass++ can perform a search in MassBank through its SOAP API; hence users can search MassBank data by simply selecting a raw data spectrum as a query.

Mass++ is very helpful for constructing a user’s private mass spectral library. MassBank system installers for Windows and Linux are available as open source software, so anyone can construct their own private MassBank library in a laboratory. All spectral data with metadata in a MassBank record should be prepared by following the “MassBank Record Format.” Creating MassBank records has been very time-consuming because researchers had to manually extract sample information and peaks from several kinds of raw data. Furthermore, the extraction method differs between MS instruments because the supplied software differs for each instrument. Now Mass++ can semi-automatically export MassBank records generated from various data formats using a wizard. These records can then be easily registered in MassBank. The public MassBank server also distributes public MassBank records, so anyone can create original databases from public spectrum data and private spectra acquired in a laboratory. Furthermore, Mass++ users can search databases using peak information or their sample data (Fig. 3).
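To give a sense of what exporting peaks in a MassBank-style record involves, the sketch below formats only the peak block of a record from a list of (m/z, intensity) pairs. It is an illustration, not the Mass++ export wizard: the function name is hypothetical, and the field names (`PK$NUM_PEAK`, `PK$PEAK`) and the scaling of relative intensity to a maximum of 999 reflect our reading of the MassBank Record Format and should be checked against the current specification. A complete record also needs metadata fields (accession, title, compound and instrument information) that are omitted here.

```python
def massbank_peak_block(peaks):
    """Format (m/z, intensity) pairs as the peak block of a MassBank-style record.

    Field names and the 999 scaling are assumptions taken from the MassBank
    Record Format as we understand it; verify against the specification.
    """
    if not peaks:
        raise ValueError("at least one peak is required")
    base = max(intensity for _, intensity in peaks)  # base-peak intensity
    lines = [
        "PK$NUM_PEAK: %d" % len(peaks),
        "PK$PEAK: m/z int. rel.int.",
    ]
    for mz, intensity in sorted(peaks):
        relative = int(round(999 * intensity / base))
        lines.append("  %.4f %.1f %d" % (mz, intensity, relative))
    lines.append("//")  # record terminator
    return "\n".join(lines)


block = massbank_peak_block([(150.5, 200.0), (100.0, 50.0)])
print(block)
```

Generating this block is the mechanical part of record creation; the metadata still has to come from the raw data file or from the user, which is why the export is described as semi-automatic.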


Development of user functions

Plug-in Development. The Mass++ Standard Development Kit (SDK), an additional package, enables users with programming skills to create Mass++ functions using C++ or .NET technologies, without needing the Mass++ source code. The SDK contains the files, library files, and documents needed for developing Mass++ plug-ins. Each Mass++ plug-in function is called by an event, which is specified by an attribute named “call type.” For example, when a spectrum is displayed, a DRAW_SPEC event occurs, and any plug-in can respond by displaying additional information on the graph.

Some libraries for accessing MS raw data are publicly available, enabling us to implement new functionality ideas. However, the following steps are usually required to process or analyze data: (i) open a raw data file, (ii) find or select the target object, (iii) process/analyze the data, (iv) output the result, and (v) close the raw data file. In Mass++ plug-in development, the program can access objects that are already displayed; it is thus sufficient for a developer to write a program that consists of just the processing/analyzing and output parts.

All files for a given plug-in are installed in a folder containing the plug-in definition file (plugin.xml) and the dynamic link library (DLL) file, namely the program itself. The parameter settings file, written in XML, and other resources such as icons and help files are also installed in the same folder. When the Mass++ SDK is installed, the Mass++ plug-in development wizard for Microsoft Visual Studio is installed automatically. Mass++ plug-ins can be easily developed using this wizard. Like the main software, the Mass++ SDK is freely distributed.

Mass++ plug-ins can be developed using Microsoft Visual Studio 2010, which is an integrated development environment (IDE). This is a commercial software product, but the Express


Edition24 of Visual Studio is freeware; anyone can download and install it. Mass++ requires some program libraries such as boost,25 xerces-c,26 and wxWidgets,27 but these are freely available on the Internet. Therefore, an environment for Mass++ plug-in development can be built for free.

Figure 4 presents an example of plug-in development. This example demonstrates how to create a function for drawing lines at product ion positions on the spectrum waveform view. The plug-in information and user interface, including the call type (see the supplementary data for examples of call types), function names, resource files, and menu structures, are defined in the plug-in definition file (plugin.xml). The call type is the trigger for calling a function, such as executing a menu item, drawing a waveform, opening a spectrum or a chromatogram, clicking a mouse button, pressing a key, detecting peaks, and so on. Libraries in the SDK provide various kinds of functions for accessing and analyzing MS data, the graphical user interface (GUI), database access, and network connections; these are described in the API document contained in the Mass++ SDK. In this case, the “drawProductPos” function is called in response to the “DRAW_SPEC_FG” event, which is fired while drawing additional information in the foreground of a spectrum waveform. Next, the code for drawing lines at product ion positions is written in drawProductPos. Using the Visual Studio debug function, we can confirm that green lines are drawn at product ion positions on the spectrum waveform.

Some algorithms, such as peak detection, normalization, peak filtering, and RT alignment, are implemented as Mass++ plug-ins. Plug-in development makes it possible to add new algorithms such as these to Mass++. The plug-in structure allows users to implement a new algorithm/methodology without reading the entire source code and understanding the encoded logic of Mass++ itself; this is the original and fundamental philosophy of Mass++ development.
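The call-type mechanism described above, in which a host fires a named event and every function registered for that call type responds, can be sketched generically as follows. This is not the Mass++ SDK API: the `PluginHost` class, its methods, and the dictionary used as a spectrum are hypothetical. Only the event name DRAW_SPEC_FG and the role of the drawProductPos function are taken from the text.

```python
from collections import defaultdict


class PluginHost:
    """Hypothetical host that maps call types to registered plug-in functions."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, call_type, func):
        # In Mass++ this association is declared in plugin.xml; here it is
        # done in code purely for illustration.
        self._handlers[call_type].append(func)

    def fire(self, call_type, **context):
        # Call every function registered for this call type, in order.
        return [func(**context) for func in self._handlers.get(call_type, [])]


def draw_product_pos(spectrum, canvas):
    """Plug-in function: mark each product ion position with a green line."""
    for mz in spectrum["product_ions"]:
        canvas.append(("line", mz, "green"))
    return len(spectrum["product_ions"])


host = PluginHost()
host.register("DRAW_SPEC_FG", draw_product_pos)

canvas = []  # stands in for the drawing surface of the spectrum view
results = host.fire(
    "DRAW_SPEC_FG",
    spectrum={"product_ions": [175.1, 262.2]},
    canvas=canvas,
)
```

The plug-in function never opens a file or locates the spectrum itself; the host passes in the object that is already displayed, which matches the point made earlier that a plug-in only needs the processing and output parts.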


Scripting. Many computer scientists/engineers can use C/C++ or the .NET Framework languages, but most chemists and biologists are unfamiliar with these programming languages. Additionally, developing plug-ins is often too time-consuming for implementing a simple test program or a temporary program. In such cases, script languages are a better choice, since some researchers have experience with scripting languages such as Perl, Python, or Ruby. Users can implement simple functionality via Mass++’s script console. At present, Mass++ supports IronPython,28 a Python language implementation for the .NET Framework environment. In the script console, the .NET Framework classes of the Mass++ SDK, in the kome.clr namespace, can be used. The specification of these classes can be checked in the Mass++ SDK documentation produced by Doxygen29 and via tutorial documents and sample programs (Fig. 5).

Discussion

Simply stated, Mass++ has four predominant features. The first feature is a basic data management system for mass spectra. Currently, the basic feature set of Mass++ contains various functions for visualizing annotated mass spectra and for reading/writing multiple data-file formats. Additionally, it has peak-detection functions using newly developed algorithms. These functions make Mass++ a powerful and universal viewer for mass spectra. The second feature is its wide variety of analysis/analysis-assistance functions for omics data. Although it does not currently provide its own original search engine for proteome research, Mass++ obviates time-consuming manual processes by automatically posting the peak lists to existing search engines. Furthermore, Mass++ can read AXIMA MALDI data, which is useful for glycan analysis. The plug-ins for both of these functions are contained in Mass++; besides Mass++, no other freeware with functions focusing on MALDI data is available. The third feature is a notable one, namely the MassBank-collaboration functions. Mass++ offers both an alternative to the time-consuming


process of manually formatting the data for MassBank records and a simpler way to post extracted peak lists to the MassBank website for spectral search, in the same manner as the proteome search. These features enable Mass++ to be utilized as analysis/analysis-assistance software for both proteomics and metabolomics. Mass++ is thus valuable front-end software for the integrated analysis of proteome data and metabolome data. The fourth and last key Mass++ feature is its flexible plug-in structure. All functions are implemented as plug-ins, making it easy to add or remove functions. Moreover, Mass++ provides a plug-in development environment, so users can develop their own functions with programming languages or easy-to-use script languages.

Mass++ was originally developed to resolve the problems described in the Introduction via the following features. Regarding the problem that commercial software cannot be controlled by third-party software, Mass++ has a plug-in structure, so anyone familiar with programming can develop a new algorithm, workflow, etc. using the freely available plug-in development system. Users can also employ a scripting language; it is thus relatively easy to control Mass++ or to add functions. The difficulty of adding user code to freely distributed development environments is essentially the same problem as that of commercial software, and Mass++ deals with it as described above.

Many tools are publicly distributed as open-source software, and developers can add their original functions to them. However, developers must not only implement new functions but typically also read large amounts of source code and write code in multiple places to handle events such as mouse clicks, drawing, closing a sample, and so on, as is typical of development that involves editing open-source software. This makes development and maintenance difficult. In Mass++ plug-in development, developers do not have to understand


Mass++’s source code itself and can manage their own source code as they like, because responses to events can be defined with functions using call types. In addition, developers can use functions contained in existing plug-ins. This new development system results in lower development costs compared with creating a new tool from scratch or editing the source code of open-source software.

The plug-in structure realizes many of Mass++’s distinguishing features. Firstly, it enables users to add functions to or remove functions from their own copy of Mass++; hence it is possible to build an original, optimized version of Mass++ for each user according to their needs. Secondly, if users do not find the functions they want, they can implement them themselves with relative ease by utilizing the existing rich plug-in functions. Moreover, there is no technical restriction on implementing “wrapper” plug-ins for other programs; thus, it is possible to use other tools from Mass++ via plug-ins. For instance, users can employ the proteome search engines Mascot and X! Tandem, and the proteome analysis programs PeptideProphet,30 ProteinProphet,31 and iProphet32 in TPP via Mass++; plug-ins for using these software products are distributed with the Mass++ main software. Hence, when developing plug-ins, users can utilize Mass++ as a kind of “glue” program for required functions and/or tools. Note that the software licenses for plug-ins are mutually independent; according to the Mass++ user’s license, users can release their self-implemented plug-ins under any license except for the copyleft type, which is incompatible with the Mass++ license. Open-source plug-ins and commercial plug-ins can thus be used together in a single instance of Mass++, specifically, in the same plug-in execution environment.

Recently, parallel computing and specialized forms of it such as cloud computing have become hot topics, and many researchers are investigating opportunities to apply them to omics studies. Mass++ cannot run under a parallel computing environment at this time. However, Mass++ has a


command line input/output mode and, using this mode, can work as an analysis manager under a suitable job management system. We think this suggests the future direction of our development.

In conclusion, Mass++ is not just a simple visualization/analysis program, but rather a general software platform oriented towards mass spectrometry. We believe Mass++ helps not only researchers such as biologists, chemists, and bioinformaticians, but also all developers in the mass spectrometry field, and we hope that many researchers will develop their own ideas as plug-ins. Mass++ is everyone’s software, developed by everyone.

Information

Mass++ runs on 32-bit and 64-bit Windows and can be downloaded for free from this website: http://www.first-ms3d.jp/english/
A Mass++ community is operated as a Google Group: http://groups.google.com/group/massplusplus/

FIGURES


Fig. 1. Mass++ has various functions for displaying, manipulating, and analyzing MS raw data, all implemented as plug-ins.


Fig. 2. Quantitation results are shown in a table called “Peak Matrix.” Each row represents a peak. The first several columns present the peak RT position, m/z position, substance, and p-value. The remaining columns present the peak intensities or areas for each sample. The overlapping view, which displays computationally aligned chromatograms, a group plot, and a box plot for a specified peak, is opened by double-clicking a peak row after the peak matrix has been created. In this example, users can confirm the areas or intensities of peaks that are annotated as BSA or are within different groups.


Fig. 3. Mass++ provides two functions for MassBank: searching and building a MassBank database. Mass++ can export MassBank record files to register into an in-house MassBank database. Mass++ can also search for similar spectra in a MassBank database (in-house or the public one).


Fig. 4. A Mass++ plug-in is packaged as a folder containing a plug-in definition file (XML), a program file (DLL), a parameter definition file (XML), and other resources such as icons, help files, and documents. Developers can implement and test their original plug-ins using Microsoft Visual Studio. The plug-in development wizard for Visual Studio is also distributed for free on the Internet.
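
The main text describes plug-ins as sets of functions that respond to events through call types, and notes that users can add plug-ins to, or remove them from, their own copy of Mass++. The Python sketch below illustrates that idea in miniature. It is not the Mass++ SDK: the `PluginRegistry` class, the call-type name `DETECT_PEAKS`, and the plug-in names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], object]


class PluginRegistry:
    """Toy host application (hypothetical; not the Mass++ API).

    Each plug-in registers functions under a call-type name, and the host
    invokes every function registered for that call type on demand.
    """

    def __init__(self) -> None:
        # call type -> list of (plug-in name, function)
        self._handlers: Dict[str, List[Tuple[str, Handler]]] = defaultdict(list)

    def register(self, plugin: str, call_type: str, func: Handler) -> None:
        """Attach a plug-in function to a call type."""
        self._handlers[call_type].append((plugin, func))

    def remove_plugin(self, plugin: str) -> None:
        """Drop every function that belongs to the named plug-in."""
        for call_type in self._handlers:
            self._handlers[call_type] = [
                (name, func)
                for name, func in self._handlers[call_type]
                if name != plugin
            ]

    def call(self, call_type: str, params: dict) -> list:
        """Invoke all functions registered for a call type, in order."""
        # .get() avoids creating empty entries for unknown call types.
        return [func(params) for _name, func in self._handlers.get(call_type, [])]


registry = PluginRegistry()
registry.register("simple_picker", "DETECT_PEAKS", lambda p: max(p["intensities"]))
registry.register("counter", "DETECT_PEAKS", lambda p: len(p["intensities"]))

spectrum = {"intensities": [1.0, 4.0, 2.0]}
print(registry.call("DETECT_PEAKS", spectrum))  # both plug-ins respond

registry.remove_plugin("counter")
print(registry.call("DETECT_PEAKS", spectrum))  # only simple_picker responds
```

Because the host looks functions up only by call type, a plug-in can be added or removed without touching the host code, which is the property the text attributes to the plug-in structure.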


Fig. 5. Example of a script program written in IronPython and employing several classes of the Mass++ SDK. This program calculates and displays the total intensity of the active spectrum.
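
The script shown in Fig. 5 is not reproduced in this text. As a rough illustration of the calculation the caption describes, the following plain-Python sketch sums the intensities of a spectrum held as (m/z, intensity) pairs. It does not use the Mass++ SDK; the function name `total_intensity` and the sample data are assumptions made for this example.

```python
from typing import Iterable, Tuple


def total_intensity(points: Iterable[Tuple[float, float]]) -> float:
    """Return the sum of intensities over (m/z, intensity) data points."""
    return sum(intensity for _mz, intensity in points)


# Hypothetical stand-in for the "active spectrum" of the caption.
active_spectrum = [(100.0, 12.5), (150.2, 30.0), (201.7, 7.5)]
print("Total intensity:", total_intensity(active_spectrum))
```

In an actual Mass++ script, the data points would come from the SDK's spectrum classes rather than from a literal list.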

TABLES

Table 1. Peak-detection algorithms.

Algorithm        Description
MWD              Peak-detection algorithm suitable for identification.
GION             Peak-detection algorithm suitable for quantitation.
AB3D             Peak-detection algorithm for detecting peaks from 2D data; suitable for label-free quantitation.
Local Maximum    Very simple peak-detection algorithm. It just picks local maximum points.

Peak-detection algorithms are implemented for different purposes. For more details, refer to the supplementary data.
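
Of the algorithms in Table 1, only “Local Maximum” is described fully enough here to sketch. The Python example below picks points that are strictly higher than both neighbors. It is a minimal sketch of the general technique, not the Mass++ implementation; the function name and the optional `min_intensity` threshold are assumptions added for illustration.

```python
from typing import List, Sequence, Tuple


def local_maximum_peaks(
    mz: Sequence[float],
    intensity: Sequence[float],
    min_intensity: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (m/z, intensity) for every strict local maximum.

    End points are never reported, and flat-topped plateaus are skipped
    because no plateau point is strictly higher than both neighbors.
    """
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")
    peaks = []
    for i in range(1, len(intensity) - 1):
        higher_than_left = intensity[i] > intensity[i - 1]
        higher_than_right = intensity[i] > intensity[i + 1]
        if higher_than_left and higher_than_right and intensity[i] >= min_intensity:
            peaks.append((mz[i], intensity[i]))
    return peaks


mz_values = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0]
intensities = [1.0, 4.0, 2.0, 7.0, 3.0, 3.0, 6.0, 2.0]
print(local_maximum_peaks(mz_values, intensities))
print(local_maximum_peaks(mz_values, intensities, min_intensity=5.0))
```

The simplicity is also the weakness: without smoothing or a threshold, every noise spike counts as a peak, which is presumably why the table lists other algorithms for identification and quantitation.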


Table 2. Database search services for MassBank, supported in Mass++.

Service                   Detail
Spectrum Search           Search similar spectra on a peak-by-peak basis.
Peak Search               Search spectra by m/z values.
Peak-Difference Search    Search spectra by m/z differences.
Batch Search              Search similar spectra in a batch process.

Mass++ calls certain MassBank search functions. “Batch Search” is implemented as a selectable search engine in the identification function in Mass++.
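
To make the difference between “Peak Search” and “Peak-Difference Search” in Table 2 concrete, the following Python sketch runs both kinds of query against a small in-memory library. It does not call MassBank or Mass++; the function names, the record names, the peak lists, and the `tol` tolerance parameter are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def peak_search(library: Dict[str, List[float]], target_mz: float,
                tol: float = 0.3) -> List[str]:
    """Return records that contain a peak within tol of target_mz."""
    return [
        name for name, peaks in library.items()
        if any(abs(mz - target_mz) <= tol for mz in peaks)
    ]


def peak_difference_search(library: Dict[str, List[float]], target_diff: float,
                           tol: float = 0.3) -> List[str]:
    """Return records in which some pair of peaks differs by target_diff."""
    return [
        name for name, peaks in library.items()
        if any(abs(abs(a - b) - target_diff) <= tol
               for a, b in combinations(peaks, 2))
    ]


# Hypothetical records: name -> list of peak m/z values.
library = {
    "record_A": [85.0, 103.0, 121.0],
    "record_B": [91.0, 119.0, 147.0],
}

print(peak_search(library, 103.1))             # matches on an absolute m/z value
print(peak_difference_search(library, 28.0))   # matches on a spacing between peaks
```

A search by m/z values looks for specific ions, whereas a search by m/z differences looks for a characteristic spacing between peaks, such as a neutral loss, regardless of where the peaks themselves sit.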

AUTHOR INFORMATION

Corresponding Author
* Satoshi Tanaka, Shimadzu Corporation, 1 Nishinokyo Kuwabara-cho, Nakagyo-ku, Kyoto 604-8511, Japan
Phone Number: +81-75-823-2897
Fax Number: +81-75-823-2900
E-mail: [email protected]

Present Addresses
Jingwen Yao, Shimadzu Research Laboratory Ltd., Trafford Wharf Road, Manchester M17 1GP, UK


Howell E. Parry, Thermo Fisher Scientific, Altrincham, Cheshire WA14 5TP, UK
Tsuyoshi Tabata, Department of Molecular and Cellular Bioanalysis, Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan
Ken Aoshima, Biostatistics Clinical Science, Japan Biostatistics, CCLO, Eisai Product Creation Systems, Bunkyo-ku, Tokyo 112-8088, Japan
Yoshito Nihei, Yamagata 999-7782, Japan

Author Contributions
S.T., H.E.P., M.F., and M.I. designed and implemented the software. Y.F., A.C.Y., K.M., M.M., Y.Y., J.Y., S.U., T.T., and K.A. designed and/or developed functions/algorithms. Y.N. and T.N. developed MassBank. S.U., S.K., and K. Takahashi managed the software development. Y.O. and K. Tanaka supervised the project. S.T., A.C.Y., S.U., and S.K. wrote the manuscript. All authors commented on and revised the manuscript.

ACKNOWLEDGMENTS

This work was originally funded by the Japan Science and Technology Agency (CREST). This work is currently funded by the Japan Society for the Promotion of Science (JSPS) through the “Funding Program for World-Leading Innovative R&D on Science and Technology (FIRST Program),” initiated by the Council for Science and Technology Policy (CSTP).

ABBREVIATIONS


MS, mass spectrometry; TPP, Trans-Proteomic Pipeline; LC-MS, liquid chromatography-mass spectrometry; 3D, three-dimensional; RT, retention time; MRM, multiple reaction monitoring; ANOVA, analysis of variance; GC, gas chromatography; SOAP, simple object access protocol; API, application programming interface; SDK, software development kit; DLL, dynamic link library; IDE, integrated development environment; XML, extensible markup language

Supporting Information Available: This material is available free of charge via the Internet at http://pubs.acs.org.

1. The Plug-in Structure
2. Mass++ Menus
3. Supported Data Formats
4. Peak Detection Algorithms
5. Normalization Methods
6. Peak Position Determination Methods
7. Examples of Call Types in Mass++
8. MSB File Format
9. Results of Performance Tests for MSB File Format
10. Example of Mass++ Plug-in Development using C#


11. Table of MS Data Analysis Tools
12. Table of Development Environments

REFERENCES

(1) Keller, A.; Eng, J.; Zhang, N.; Li, X. J.; Aebersold, R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005, 1, 0017.
(2) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Farrah, T.; Lam, H.; Tasman, N.; Sun, Z.; Nilsson, E.; Pratt, B.; Prazen, B.; Eng, J. K.; Martin, D. B.; Nesvizhskii, A. I.; Aebersold, R. A guided tour of the Trans-Proteomic Pipeline. Proteomics 2010, 10, 1150-9.
(3) Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, R.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. OpenMS – an open-source software framework for mass spectrometry. BMC Bioinf. 2008, 9, 163.
(4) Weisser, H.; Nahnsen, S.; Grossmann, J.; Nilse, L.; Quandt, A.; Brauer, H.; Sturm, M.; Kenar, E.; Kohlbacher, O.; Aebersold, R.; Malmström, L. An automated pipeline for high-throughput label-free quantitative proteomics. J. Proteome Res. 2013, 12, 1628-44.
(5) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26, 1367-72.
(6) Cox, J.; Matic, I.; Hilger, M.; Nagaraj, N.; Selbach, M.; Olsen, J. V.; Mann, M. A practical guide to the MaxQuant computational platform for SILAC-based quantitative proteomics. Nat. Protoc. 2009, 4, 698-705.
(7) Pluskal, T.; Castillo, S.; Villar-Briones, A.; Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinf. 2010, 11, 395.
(8) Pluskal, T.; Uehara, T.; Yanagida, M. Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal. Chem. 2012, 84, 4396-403.
(9) Benton, H. P.; Wong, D. M.; Trauger, S. A.; Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 2008, 80, 6382-6389.
(10) Tautenhahn, R.; Patti, G. J.; Rinehart, D.; Siuzdak, G. XCMS Online: a web-based platform to process untargeted metabolomic data. Anal. Chem. 2012, 84, 5035-9.
(11) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24, 2534-2536.
(12) Chambers, M. C.; MacLean, B.; Burke, R.; Amode, D.; Ruderman, D. L.; Neumann, S.; Gatto, L.; Fischer, B.; Pratt, B.; Egertson, J.; Hoff, K.; Kessner, D.; Tasman, N.; Shulman, N.; Frewen, B.; Baker, T. A.; Brusniak, M. Y.; Paulse, C.; Creasy, D.; Flashner, L.; Kani, K.; Moulding, C.; Seymour, S. L.; Nuwaysir, L. M.; Lefebvre, B.; Kuhlmann, F.; Roark, J.; Rainer, P.; Detlev, S.; Hemenway, T.; Huhmer, A.; Langridge, J.; Connolly, B.; Chadick, T.; Holly, K.; Eckels, J.; Deutsch, E. W.; Moritz, R. L.; Katz, J. E.; Agus, D. B.; MacCoss, M.; Tabb, D. L.; Mallick, P. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 2012, 30, 918-20.
(13) MacLean, B.; Tomazela, D. M.; Shulman, N.; Chambers, M.; Finney, G. L.; Frewen, B.; Kern, R.; Tabb, D. L.; Liebler, D. C.; MacCoss, M. J. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 2010, 26, 966-8.
(14) Schilling, B.; Rardin, M. J.; MacLean, B. X.; Zawadzka, A. M.; Frewen, B. E.; Cusack, M. P.; Sorensen, D. J.; Bereman, M. S.; Jing, E.; Wu, C. C.; Verdin, E.; Kahn, C. R.; MacCoss, M. J.; Gibson, B. W. Platform-independent and label-free quantitation of proteomic data using MS1 extracted ion chromatograms in Skyline: application to protein acetylation and phosphorylation. Mol. Cell. Proteomics 2012, 11, 202-14.
(15) Askenazi, M.; Parikh, J. R.; Marto, J. A. mzAPI: a new strategy for efficiently sharing mass spectrometry data. Nat. Methods 2009, 6, 240-1.
(16) Goloborodko, A. A.; Levitsky, L. I.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics--a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J. Am. Soc. Mass Spectrom. 2013, 24, 301-4.
(17) Pasculescu, A.; Schoof, E. M.; Creixell, P.; Zheng, Y.; Olhovsky, M.; Tian, R.; So, J.; Vanderlaan, R. D.; Pawson, T.; Linding, R.; Colwill, K. CoreFlow: a computational platform for integration, analysis and modeling of complex biological data. J. Proteomics 2014, 100, 167-73.
(18) Pedrioli, P. G.; Eng, J. K.; Hubley, R.; Vogelzang, M.; Deutsch, E. W.; Raught, B.; Pratt, B.; Nilsson, E.; Angeletti, R. H.; Apweiler, R.; Cheung, K.; Costello, C. E.; Hermjakob, H.; Huang, S.; Julian, R. K.; Kapp, E.; McComb, M. E.; Oliver, S. G.; Omenn, G.; Paton, N. W.; Simpson, R.; Smith, R.; Taylor, C. F.; Zhu, W.; Aebersold, R. A common open representation of mass spectrometry data and its application to proteomics research. Nat. Biotechnol. 2004, 22, 1459-66.
(19) Martens, L.; Chambers, M.; Sturm, M.; Kessner, D.; Levander, F.; Shofstahl, J.; Tang, W. H.; Römpp, A.; Neumann, S.; Pizarro, A. D.; Montecchi-Palazzi, L.; Tasman, N.; Coleman, M.; Reisinger, F.; Souda, P.; Hermjakob, H.; Binz, P. A.; Deutsch, E. W. mzML--a community standard for mass spectrometry data. Mol. Cell. Proteomics 2011, 10, R110.000133.
(20) ASTM International. Standard specification for analytical data interchange protocol for chromatographic data. ASTM E1947-98(2009).
(21) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551-67.
(22) Fenyö, D.; Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 2003, 75, 768-74.
(23) Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; Oda, Y.; Kakazu, Y.; Kusano, M.; Tohge, T.; Matsuda, F.; Sawada, Y.; Hirai, M. Y.; Nakanishi, H.; Ikeda, K.; Akimoto, N.; Maoka, T.; Takahashi, H.; Ara, T.; Sakurai, N.; Suzuki, H.; Shibata, D.; Neumann, S.; Iida, T.; Tanaka, K.; Funatsu, K.; Matsuura, F.; Soga, T.; Taguchi, R.; Saito, K.; Nishioka, T. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 2010, 45, 703-714.
(24) Microsoft. Visual Studio – Home. http://www.visualstudio.com/ (accessed Jan 29, 2014).
(25) Boost C++ Libraries. http://www.boost.org/ (accessed Jan 29, 2014).
(26) Xerces-C++ XML Parser. http://xerces.apache.org/xerces-c/ (accessed Jan 29, 2014).
(27) wxWidgets. http://www.wxwidgets.org/ (accessed Jan 29, 2014).
(28) IronPython – Home. http://ironpython.codeplex.com/ (accessed Jan 29, 2014).
(29) Doxygen: Main Page. http://www.stack.nl/~dimitri/doxygen/ (accessed Jan 29, 2014).
(30) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383-5392.
(31) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75, 4646-4658.
(32) Shteynberg, D.; Deutsch, E. W.; Lam, H.; Eng, J. K.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, R. L.; Aebersold, R.; Nesvizhskii, A. I. iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol. Cell. Proteomics 2011, 10, M111.007690.


TOC Graphic
