Integrated Analysis of Transcript, Protein and Metabolite Data To Study Lignin Biosynthesis in Hybrid Aspen

Max Bylesjö,†,‡,# Robert Nilsson,§,# Vaibhav Srivastava,§ Andreas Grönlund,| Annika I. Johansson,§ Stefan Jansson,| Jan Karlsson,| Thomas Moritz,§ Gunnar Wingsle,§ and Johan Trygg*,†,‡

Department of Chemistry, Umeå University, SE-901 87 Umeå, Sweden, Computational Life Science Cluster (CLIC), KBC, Umeå University, SE-901 87 Umeå, Sweden, Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, SE-901 83 Umeå, Sweden, and Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, SE-901 87 Umeå, Sweden

Received April 18, 2008

Tree biotechnology will soon reach a mature state where it will influence the overall supply of fiber, energy and wood products. We are now ready to make the transition from identifying candidate genes, controlling important biological processes, to discovering the detailed molecular function of these genes on a broader, more holistic, systems biology level. In this paper, a strategy is outlined for informative data generation and integrated modeling of systematic changes in transcript, protein and metabolite profiles measured from hybrid aspen samples. The aim is to study characteristics of common changes in relation to genotype-specific perturbations affecting the lignin biosynthesis and growth. We show that a considerable part of the systematic effects in the system can be tracked across all platforms and that the approach has a high potential value in functional characterization of candidate genes. Keywords: Combined profiling • O2PLS • Chemometrics • Populus • Lignin biosynthesis

* To whom correspondence should be addressed. Phone: +46 (0)90 786 6917. Fax: +46 (0)90 786 7655. E-mail: [email protected].
† Department of Chemistry, Umeå University.
‡ Computational Life Science Cluster (CLIC), KBC, Umeå University.
# These authors had equal contribution.
§ Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences.
| Department of Plant Physiology, Umeå University.
10.1021/pr800298s CCC: $40.75 © 2009 American Chemical Society. Journal of Proteome Research 2009, 8, 199–210. Published on Web 12/03/2008.

Introduction

Functional genomics studies in the postgenomics era have largely been focused on profiling techniques for parallel monitoring of, for example, transcript, protein and metabolic profiles.1-10 This approach has become possible mainly due to the increasing availability of instrumentation required for high-throughput characterization of biological samples. This involves, for example, the microarray technology for transcript profiling11 or chromatography coupled with mass spectrometry for peptide or metabolite profiling.12 The purpose, in this context, is to study organisms as integrated systems of genetic, protein, metabolic, pathway and cellular events in order to achieve a higher level of understanding of the interplay between molecular and cellular components.

Data sets from a combined profiling experiment are typically of extremely high dimensionality, containing noise and multicollinearities, while the sample replication level is the restricting factor. Extracting valuable and general information from such a system is a nontrivial task which requires careful experimental planning in order to avoid confounding of known factors (such as time and treatment) or a large influence of uncontrollable factors (such as run-order dependencies). Failure to acknowledge any of these factors may cause unwanted systematic effects, which are introduced during data collection, to create false connections between variables (transcripts, proteins, metabolites), ultimately affecting biological conclusions.

Depending on context, experimental planning can have several meanings. First, it could involve systematic variation of factors of interest in order to estimate the relative influence of factors and synergistic effects between factors in order to maximize some objective function (e.g., yield or stability).13,14 The advantages of such an approach for optimization of an extraction protocol have been nicely illustrated by Gullberg et al.15 Second, it could involve reducing the effects of uncontrollable factors by means of randomization to avoid confounding between unwanted effects and effects of interest (e.g., sample treatments). A typical example is run-order randomization to reduce the influence of time dependencies in Gas Chromatography coupled with Mass Spectrometry (GC/MS) measurements on sample properties.16 Third, it could involve selection of a subset from a redundant sample set, where the selected subset minimizes the redundancy while maximizing the information gain.17,18 Such situations typically occur when one can easily characterize a large number of samples using a crude technological platform but can only afford to further characterize a smaller subset of interesting samples using a second, more advanced technology. We will generally refer to experimental planning involving any of the three categories above as design of experiments (DoE).13 All of these strategies will be utilized at some point in different phases of the presented study. The benefits of utilizing DoE for experimental planning are documented in numerous studies, including the field of systems biology.19

Figure 1. Overview of the O2PLS model components. The variation in both data sets (matrices) is separated into predictive (joint) variation, data set specific variation and nonsystematic (residual) variation, respectively.

Figure 2. Overview of the study design. (A) An overview of the available sample categories is shown, containing three genotypes (G5, G3 and WT) as well as three internodes (A-C) on a 3 × 3 grid. All 9 sample categories have been measured for transcript, protein and metabolite quantities. The smaller empty circles form a legend for the larger filled circles. (B) Image of a poplar tree, with internode B (internode 20) marked with a red band.

Table 1. The Utilized Steps for Identification of Joint and Platform-Specific Variation

step    description
1.      Identify the joint covariance structures from the transcript and protein data using O2PLS.
2.      Use the joint covariance structures from the joint transcript-protein variation in Step 1 and the metabolite data, to identify the joint covariance between all data sets using O2PLS.
2.1.    Analyze the joint covariance structures for each data set separately.
3.      Remove the joint covariance structures and extract specific systematic variation from each data set.
3.1.    Analyze the residual matrices (without joint covariance structures) separately using PCA.

Given proper data collection, the task of data integration for functional genomics studies is still daunting. Numerous alternative strategies exist in the literature for integration of data from parallel sources; for an overview, see ref 20. In this paper, the O2PLS method21,22 will be utilized as the primary tool for data integration. O2PLS is a bidirectional multivariate regression method that allows separate modeling of covariance between two data sets (e.g., from separate profiling platforms) from systematic sources of variation that are specific to each data set. The O2PLS method handles noisy, multicollinear data with many more variables than observations (samples), which is typical for biochemical and biological applications. The O2PLS model is illustrated schematically in Figure 1. Several studies exist where the utility of the O2PLS method for data integration purposes is demonstrated.23-26 For life science applications, Rantalainen et al. described the

usefulness of the method for joint analysis of protein and metabolite profiles and Bylesjö et al. subsequently showed the strengths of the method for integrated analysis of transcript and metabolite profiles. The latter study provides a comparison between the properties of O2PLS and related methods for analyzing data from omics technologies.26

Table 2. The O2PLS Model Components

entry    dimensionality    description
Tp       (N × Ap)          Predictive scores for the X matrix. Composed of Ap score vectors that describe the relation between observations (samples).
W^T      (Ap × K)          Predictive weights for the X matrix. Composed of covariance vectors corresponding to the scores Tp, used to describe relations among variables (transcripts, proteins, metabolites).
To       (N × AYo)         Y-orthogonal scores for the X matrix. Composed of AYo mutually orthogonal score vectors that describe the relation between observations (samples).
PYo^T    (AYo × K)         Y-orthogonal loadings for the X matrix. Composed of loading vectors corresponding to the scores To, used to describe relations among variables (transcripts, proteins, metabolites).
E        (N × K)           Residual matrix of X containing nonsystematic variation for the prediction of Y.
Up       (N × Ap)          Predictive scores for the Y matrix. Composed of Ap score vectors that describe the relation between observations (samples).
C^T      (Ap × M)          Predictive weights for the Y matrix. Composed of covariance vectors corresponding to the scores Up, used to describe relations among variables (transcripts, proteins, metabolites).
Uo       (N × AXo)         X-orthogonal scores for the Y matrix. Composed of AXo mutually orthogonal score vectors that describe the relation between observations (samples).
PXo^T    (AXo × M)         X-orthogonal loadings for the Y matrix. Composed of loading vectors corresponding to the scores Uo, used to describe relations among variables (transcripts, proteins, metabolites).
F        (N × M)           Residual matrix of Y containing nonsystematic variation for the prediction of X.

In the present study, we describe a strategy for planning and integrated analysis of data from transcript, protein and metabolite profiling technologies. The study involves profiling of a steady-state system of three different genotypes of hybrid aspen (Populus tremula × Populus tremuloides). The first genotype is the wild-type (WT), which will be used as a reference sample. The second genotype, which will be denoted G5 throughout, contains several antisense constructs of the gene PttMYB21a, affecting plant growth. The closest ortholog to PttMYB21a in Arabidopsis thaliana is AtMYB52.27 The G5 genotype displays a distinct phenotype with slower growth compared to WT samples. The third genotype, G3, contains only one antisense construct of the gene PttMYB21a, displaying a similar but less distinct phenotype compared to the G5 samples. The transgenic lines used in this study, named G3 and G5, correspond to lines “21III” and “21V” described in detail by Karpinska et al.27 Xylem tissue from all genotypes has been collected from three internode positions of the plants (denoted internodes A-C), corresponding to an approximate growth gradient. The genotype effect together with the internode effect corresponds to a full factorial design at multiple levels;13,14 see Figure 2 for an overview. This comprehensive sampling allows the study of both the internode and genotype effects separately but also the investigation of any potential synergism between these factors.

Design of experiments has been used at all applicable steps, for example, to maximize information about known factors (internodes and genotypes) or to minimize the effect of unwanted factors (e.g., run-order dependencies), to ensure that data quality is as high as possible. We show how joint covariance structures can be extracted using the O2PLS method while acknowledging the presence of platform-specific systematic variation as well as nonsystematic variation. Differences and similarities across the profiling technologies, in terms of relations to the genotype categories and internode gradient as well as specific variation and joint covariation, will be elaborated. The main aim in this study is to demonstrate a methodology for integration and discuss general implications of combined profiling analyses in the context of tree biotechnology.
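As an illustration of the full factorial layout and the run-order randomization mentioned above, the following minimal Python sketch enumerates the 3 × 3 sample categories and shuffles the resulting sample list. It is not the authors' code: the function names, the replicate numbering and the `seed` argument are hypothetical, and the replicate count of ten is taken from the Materials and Methods.

```python
import random
from itertools import product

GENOTYPES = ("WT", "G3", "G5")
INTERNODES = ("A", "B", "C")


def full_factorial(genotypes=GENOTYPES, internodes=INTERNODES, n_replicates=10):
    """Return one (genotype, internode, replicate) tuple per sample.

    Every genotype is crossed with every internode (full factorial),
    and each combination is repeated for each biological replicate.
    """
    return [
        (g, i, r)
        for g, i in product(genotypes, internodes)
        for r in range(1, n_replicates + 1)
    ]


def randomized_run_order(design, seed=None):
    """Return a shuffled copy of the design (hypothetical helper).

    Shuffling the measurement order spreads any time-dependent drift
    evenly over genotypes and internodes instead of confounding it
    with one sample category.
    """
    rng = random.Random(seed)
    order = list(design)
    rng.shuffle(order)
    return order


design = full_factorial()
run_order = randomized_run_order(design, seed=1)
print(len({(g, i) for g, i, _ in design}), "categories,", len(design), "samples")
```

Fixing the seed is a convenience for reproducing a given run order; any other source of randomness would serve the same purpose.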

Materials and Methods

1. Sample Collection. Ten biological replicates of each genotype, 90 samples in total, were grown in a greenhouse under fixed conditions essentially as described by Karpinska et al.27 Cuttings of mutants and WT were transferred to soil and allowed to grow for 12 weeks prior to harvesting. Sampling was preceded by labeling internode 20 with a red band to enable fast processing and avoid errors. Internode positions were determined by counting from the first, 1 cm long leaf. Samples from internode category A were cut 5 cm from the top and 10 cm down the stem, samples from internode category B at internode 20 (10 cm down) and samples from internode category C at internode 30 (10 cm down). After peeling off the bark, tissue was sampled by scraping off xylem with a sharp blade. All samples were ground in a mixer-mill (MM 301, Retsch GmbH, Germany) for 1 min and the resulting powder was used for RNA, protein and metabolite extraction or kept at -80 °C until further use.

2. Combined Profiling Techniques. All samples were characterized in parallel for transcript (cDNA microarrays), protein/peptide (UPLC/MS) and metabolite (GC/MS) levels; see the subsections for details.


Figure 3. The O2PLS-based integration framework. The different steps used to identify joint and platform-specific variation for three data sets by means of O2PLS are shown. (A) In Step 1, the joint covariance structures from the transcript and protein data sets are identified. In Step 2, the joint covariance structures from the joint transcript-protein variation in Step 1 and the metabolite data are utilized to identify the joint covariance between all data sets. (B) In Step 3, the joint covariance structures are removed from each data set and specific systematic variation is extracted.

2.1. Transcript Profiling. cDNA microarrays were used for transcript profiling. The utilized POP2.3 microarray layout consists of 27 648 single spotted cDNA clones from a previous assembly of more than 100 000 expressed sequence tags (ESTs) from the Populus genus.28 All sequence information is available in the PopulusDB (http://www.populus.db.umu.se/) online sequence database. A full array layout is available for download from the UPSC-BASE (http://www.upscbase.db.umu.se/) online microarray database.29 All microarray slides were printed using a QArray arrayer (Genetix, Hampshire, U.K.). The preparation, labeling and hybridization of cDNA clones and mRNA samples were carried out according to the protocol described by Smith et al.30 with a few modifications. Total RNA was extracted from 50 mg of xylem tissue using the Aurum total RNA mini kit (BioRad) according to the manufacturer’s instructions. Approximately 1.2 µg of total RNA was used to selectively amplify mRNA using the MessageAmp II aRNA Amplification Kit (Ambion, Cat. AM1751). A total of 7 µg of amplified RNA (a-RNA) was reverse transcribed into aminoallyl-labeled cDNA with 3 µg of Random Primer 9 Nonamer. The arrays were scanned on a ScanArray 4000 (Perkin-Elmer, Wellesley, MA) at 10 µm resolution to obtain raw image files for the Cy5 and Cy3 dye channels. Gridding of all images was performed in GenePix Pro 5.1 (Molecular Devices, CA) and segmentation and quantification were performed using MASQOT-GUI.31,32 Quantification was based on median foreground intensity values and the data were subsequently normalized using the OPLS microarray normalization method.33 Microarray elements with an average expression level below the intensity levels of a set of unspotted elements (containing no probe material) were removed from the microarray data set. This reduced the number of microarray elements from 27 648 to 14 738. All original image files as well as raw data are available online for download at the UPSC-BASE29 microarray database from experiment UMA-0073.

2.2. Protein/Peptide Profiling. 2.2.1. Protein Extraction. Highly water-soluble proteins were extracted from 20 mg of frozen tissue powder. Extraction was performed mainly according to the method described by Giavalisco et al.34 One tablet of protease inhibitor cocktail (Complete Mini; Roche, Indianapolis, IN) was added per 10 mL of extraction buffer (100


Figure 4. The internode effect. (A) The internode gradient is seen along the second joint score vector. (B) The internode effect is shown for the transcripts. Highlighted photosynthesis and translation elongation factors are elevated at the primary growth region (internode A). (C) The internode effect is shown for the proteins. Highlighted translation elongation factors are increased at the primary growth region (internode A). (D) The internode effect is shown for the metabolites. Highlighted amino acids are elevated at the primary growth region (internode A). The metabolites denoted as ‘unidentified’ are metabolites of potential interest that have no well-defined library match.

mM KCl, 20% glycerol, 50 mM Tris, pH 8.0). Tissue powder was dissolved in 100 µL of buffer and left to rotate for 10 min (LABINCO rotator LD-79) at 4 °C. The homogenate was centrifuged for 30 min at 226 000g at 4 °C (Ultracentrifuge Beckman Optima MAX, rotor MLA-130). The top 80 µL of supernatant was collected in 0.5 mL PCR tubes.

2.2.2. Digestion of Proteins and Recovery of Peptides. Extracted proteins were reduced by adding 5 µL of DTT solution to a final concentration of 20 mM and incubated at 95 °C for 10 min using a thermocycler. Tubes were transferred to ice, 10 µL of iodoacetamide solution was added to a final concentration of 80 mM, and the samples were stored for 20 min at room temperature in the dark for alkylation. A modified procedure35 using the InSolution Tryptic Digestion Kit, #89895 (Pierce, Rockford, IL) was used. Efficiency of reduction/alkylation was evaluated on reference samples (not shown). An aliquot of 20 µL was digested on filter plates (MultiScreen Filter Plate with Ultracel-10 Membrane Millipore MAUF01010). Samples diluted to 200 µL were applied to a prewetted membrane and centrifuged for 60 min at 2000g and 25 °C (Centrifuge Heraeus Multifuge 3 S-R, rotor 75006444). The samples were washed twice with 200 µL of 0.2 M ammonium bicarbonate before adding 50 µL of trypsin (5 ng/µL) and digesting overnight. Peptides were eluted onto a collection plate (Well Storage Plate VWR, AB-1058) by three repeated centrifugations using 40 µL of 0.2 M ammonium bicarbonate. Samples were evaporated to dryness (5 h, 33 °C, concentrator LABCONCO CentriVap), dissolved in 10 µL of 0.1% formic acid and stored at -20 °C until use.

2.2.3. UPLC-MS Peptide Profiling. The tryptic digests were thawed and diluted 1:50 in 0.1% formic acid and peptides were

Figure 5. The genotype effect. (A) The genotype effect is shown as a combination of the first (G5 versus G3 and WT) and third joint score vector (G3 versus G5 and WT). (B) The genotype effect is shown for the transcripts. Numerous highlighted transcripts that could be linked to the lignin biosynthesis have decreased levels for the G5 mutant, containing multiple antisense constructs. (C) The genotype effect is shown for the proteins. Several highlighted proteins related to the lignin biosynthesis can be seen at the protein level, but display an opposite direction of change compared to the transcripts. (D) The genotype effect is shown for the metabolites. Many highlighted compounds are affected, in particular for the G5 genotype, including, for example, quinic acid which can be linked to the lignin biosynthesis. The metabolites denoted as ‘unidentified’ are metabolites of potential interest that have no well-defined library match. Legend: ADF = Actin-depolymerizing factor, CCR1 = Cinnamoyl-CoA Reductase 1, CCoAOMT = Caffeoyl-CoA O-methyltransferase, COMT = Caffeic acid 3-O-methyltransferase, CAD = Cinnamyl-alcohol dehydrogenase.

separated by reversed-phase ultraperformance liquid chromatography using a nanoACQUITY UPLC system (Waters, Milford, MA) prior to MS analysis. Each sample, 7 µL (corresponding to ∼2 ‰ of initial fresh weight of tissue), was loaded onto a C18 trap column (Symmetry 180 µm × 20 mm 5 µm; Waters, Milford, MA) and washed with 5% acetonitrile, 0.1% formic acid at 15 µL/min for 1 min. The samples were eluted from the trap column and separated on a C18 analytical column (75 µm × 100 mm 1.7 µm; Waters, Milford, MA) at 600 nL/min using 0.1% formic acid as solvent A and 0.1% formic acid in acetonitrile as solvent B, in a gradient. The following gradients were used: linear from 0 to 40% B in 25 min, linear from 40 to 80% B in 1 min, isocratic at 80% B for 1 min, linear from 80 to 5% B in 1 min and isocratic at 5% B for 7 min. The eluting analyte was sprayed into the MS (Q-Tof Ultima; Waters, Milford, MA) with the capillary voltage set to 1.9 kV and cone voltage to 40 V. MS spectra were collected in the 400-1000 m/z range (0.5 s scan time, 0.1 s inter delay). Instrument and offset calibration was performed as described earlier.36 The run order of the samples was randomized in order to minimize the influence of systematic time drift.

2.2.4. Data Preprocessing. Data pretreatment, including baseline subtraction, spectra alignment and peak identification, was performed using the MassLynx/MarkerLynx suite (Waters, Milford, MA) version 4.1. The preprocessing resulted in a total


Figure 6. Network-based analysis of transcript data. (A) Largest cluster of transcripts in the network that are elevated in the primary developmental region (internode A). (B) Largest clusters in the network for the transcripts that have decreased quantities primarily for the G5 genotype. Associated GO groups for the clusters can be found in the Supporting Information.

Figure 7. Variation explained by the integrative model. The explained variation is shown for the three different profiling platforms, illustrating how the trends are similar for all profiling techniques. Approximately one-third of the variation is either joint (connected across transcripts, proteins and metabolites), specific (specific to the analytical platform) or residual (nonsystematic). The transcript data set shows a higher proportion of joint variation compared to the residual, which is potentially due to the initial filtering of lowintensity (noise-like) signals.

of 3132 putative peptide markers. The parameter settings used to preprocess the data were optimized in a separate study (Bylesjö et al., in preparation).

2.2.5. Protein Identification by Peptide Fragmentation. Two sample mixtures were created by pooling all WT samples and all G5 mutant samples. The samples were analyzed three times at different mass ranges using an injection volume of 10 µL (diluted 1:20 in 0.1% formic acid). Peptide fragmentation data were generated by automated Data Dependent Acquisition (DDA) and submitted for database search, using settings previously described.36 The Populus protein database (45 555 entries, assembly release v1.1) used in the search was created from predicted and translated gene models from the Populus trichocarpa genomic sequence37 available at the DOE Joint Genome Institute (JGI). The match to peptide FALESFWDGK belonging to methionine synthase was chosen as an internal reference to generate a relative retention time index, since it was clearly visible in all samples and eluting at the end of the chromatogram. Markers were subsequently matched to these identified peptides based on the minimal difference in retention time and mass, constrained to matches within ±40 mDa.

2.3. Metabolite Profiling. The samples were extracted by chloroform/MeOH/H2O and their metabolite profiles were analyzed by GC/TOFMS essentially according to ref 15. The run order of the samples was randomized in order to minimize the influence of systematic time drift. All nonprocessed MS-files from the metabolic analysis were exported from the ChromaTOF software in NetCDF format to MATLAB (Mathworks, Natick, MA) version 7.0, in which all data pretreatment procedures, such as baseline correction, chromatogram alignment, data compression and Hierarchical Multivariate Curve Resolution (H-MCR), were performed using in-house produced scripts according to Jonsson et al.38 All manual peak integrations were performed using in-house scripts, yielding a total of 281 putative metabolites.

Figure 8. Transcript-specific variation. (A) The strongest systematic transcript-specific variation is shown. The transcripts that are most strongly affecting this variation are related to housekeeping-like events. (B) Graph of a selected subset of gene-ontology (biological process) groups that are affected in the transcript-specific variation. Dark red rectangles denote significantly affected groups (p < 0.01), whereas white rectangles are used to describe the context. Legend: PAP = Plastid-lipid Associated Protein.

3. Selection of a Nonredundant Subset. Because of cost, time and labor limitations, only four out of the 10 available biological replicates could be run for the transcript and protein profiling platforms. This corresponds to a reduction of the 90 available samples to 36 samples being characterized for all technical platforms. Instead of picking or pooling these 36 samples randomly, we utilized a selection strategy to maximize the spread of the selected samples in accordance with existing approaches for finding nonredundant subsets.17,18 All 90 samples were initially run for the metabolite profiling platform as described in previous subsections. The resulting metabolite profiles were subsequently explored using Principal Component Analysis (PCA)39,40 in order to estimate the similarities and differences of the biological replicates based on the maximum variance projection from PCA. The four most diverse biological samples for each genotype and internode were selected as candidates for further profiling using the transcript and protein profiling platforms; see Supporting Information for details regarding the selection. The underlying principle of this selection strategy was to ensure that the innate variability existing between all biological samples was not underestimated.

4. Data Pretreatment. The transcript data set was log2-transformed and mean-centered per microarray element. The protein data set was log10-transformed, mean-centered and scaled to unit variance for each variable (extracted chromatographic peak). Unit variance scaling implies dividing each variable by its standard deviation. The metabolite data set was log10-transformed, mean-centered and scaled to unit variance for each variable (extracted chromatographic peak).
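The log-transformation, mean-centering and unit-variance scaling just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `pretreat` and its arguments are hypothetical, the sample standard deviation (n − 1 denominator) is an assumption, and the optional `ref_rows` argument simply lets the centering and scaling statistics be computed from a chosen subset of samples.

```python
import math
from statistics import mean, stdev


def pretreat(rows, ref_rows=None, base=10, scale=True):
    """Log-transform, mean-center and optionally unit-variance scale.

    rows     : list of samples, each a list of positive intensities
    ref_rows : indices of the samples used to compute the mean and
               standard deviation (None = use all samples)
    base     : 2 or 10, the logarithm base
    scale    : if True, divide each variable by its standard deviation
    """
    log = math.log2 if base == 2 else math.log10
    logged = [[log(v) for v in row] for row in rows]
    ref = range(len(logged)) if ref_rows is None else ref_rows

    columns = []
    for col in zip(*logged):  # iterate over variables
        ref_vals = [col[i] for i in ref]
        mu = mean(ref_vals)
        sd = stdev(ref_vals) if scale else 1.0
        if sd == 0:  # constant variable: leave it centered only
            sd = 1.0
        columns.append([(v - mu) / sd for v in col])
    return [list(sample) for sample in zip(*columns)]


# Transcript-like call: log2 and centering only.
centered = pretreat([[8.0, 2.0], [32.0, 4.0], [16.0, 1.0]], base=2, scale=False)
# Protein/metabolite-like call: log10, centering and unit variance.
scaled = pretreat([[10.0, 1.0], [1000.0, 100.0], [100.0, 10.0]])
```

Passing row indices through `ref_rows` centers and scales all samples relative to that subset, so the subset itself ends up with zero mean and, if scaled, unit variance.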
The mean values and standard deviations used for preprocessing were based only on the measurements of the WT samples; hence, this sample category was used as an internal reference across the different profiling platforms. All data sets were subsequently scaled to an equal sum of squares prior to any further analyses, to prevent magnitude differences between profiling platforms from influencing the results.

5. Data Integration. 5.1. The O2PLS Method. O2PLS is a bidirectional multivariate regression method that identifies joint covariation between two data sets as well as systematic variation that is unique to each data set (Figure 1). Both the jointly covarying and unique sources of variation are composed of smaller entities which are referred to as latent variables that describe independent effects in the data. The two modeled data sets are traditionally denoted by the matrices X (N × K) and Y (N × M), where N denotes the number of observations (samples), K the number of variables in X (e.g., microarray elements, chromatographic peaks, etc.) and M the number of variables in Y. On the basis of these matrices, the O2PLS model of X can be formulated algebraically as outlined in eq 1 and the model of Y as in eq 2. The notation in eqs 1 and 2 is explained in Table 2. The relation between observations (samples) is described by the score matrices T (both predictive and Y-orthogonal). T is generally composed of a set of vectors t1, t2,..., tA, one for each of the extracted latent variables. The corresponding relation to the underlying variables (transcripts, proteins, metabolites) is described by the weight and loading matrices W, C, and P.

$$X = T_p W^T + T_o P_{Yo}^T + E \qquad (1)$$

$$Y = U_p C^T + U_o P_{Xo}^T + F \qquad (2)$$

The important points from eqs 1 and 2 can be summarized as follows.

• A linear relationship exists between Tp and Up, describing the multivariate association between X and Y.
• Variation in the X matrix that is systematically covarying with the Y matrix is described by TpW^T. Systematic variation that exists in X but is linearly independent of (orthogonal to) Y, denoted Y-orthogonal variation, is described by ToPYo^T. The remaining variation ends up in the X-residual, denoted E.
• Variation in the Y matrix that is systematically covarying with the X matrix is described by UpC^T. Systematic variation that exists in Y but is linearly independent of (orthogonal to) X, denoted X-orthogonal variation, is described by UoPXo^T. The remaining variation ends up in the Y-residual, denoted F.
• The inner dimensionality (number of latent variables) of the various matrices in eqs 1 and 2 is determined partly by the dimensionality of X and Y, but also by Ap (number of predictive components), AYo (number of Y-orthogonal components) and AXo (number of X-orthogonal components). Ap, AYo and AXo are data-specific and typically estimated by resampling methods such as cross-validation.41,42

More detailed explanations of the underlying mathematics are available in the works of Trygg and Wold.21,22

5.2. Implementation. All O2PLS models were calculated using in-house produced code for R (http://www.r-project.org/). PCA models were calculated using SIMCA-P+ 11.0 (Umetrics AB, Umeå, Sweden).

5.3. Graph-Based Visualization. To visualize the interrelatedness of the affected genes, we employed a multilayer network algorithm by forming the union of multiple minimal spanning trees (MST), each constructed from random resamplings of a large data set of 1024 microarrays.48 Thicker lines indicate that the connection is seen in a large portion of the data and vice versa. Given this network, it is possible to study how genes of particular interest appear in the network based on a “guilt by association” principle. The O2PLS correlation loadings have been utilized to identify and highlight clusters in the graph and the relative size of the nodes corresponds to the loading values (within each network).
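The idea of forming a union of minimal spanning trees over random resamplings can be illustrated with the following pure-Python sketch. It is not the algorithm of ref 48: the distance measure (1 − |Pearson correlation|), the resampling scheme (random subsets of arrays without replacement) and all function and parameter names are assumptions made for the example. The edge counts it returns play the role of the line thickness described above.

```python
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def minimum_spanning_tree(profiles):
    """Kruskal's algorithm on the distance 1 - |r|; returns a set of (i, j) edges."""
    n = len(profiles)
    edges = sorted(
        (1.0 - abs(pearson(profiles[i], profiles[j])), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = set()
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two separate components
            parent[ri] = rj
            tree.add((i, j))
    return tree


def mst_union(profiles, n_resamples=20, fraction=0.5, seed=0):
    """Count how often each edge occurs in MSTs built from random array subsets."""
    rng = random.Random(seed)
    n_arrays = len(profiles[0])
    k = min(n_arrays, max(3, int(fraction * n_arrays)))
    counts = Counter()
    for _ in range(n_resamples):
        cols = rng.sample(range(n_arrays), k)
        subset = [[p[c] for c in cols] for p in profiles]
        counts.update(minimum_spanning_tree(subset))
    return counts
```

An edge that appears in most resampled trees corresponds to a connection supported by a large portion of the data, whereas an edge seen only once or twice is weakly supported.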

Results

Extracting Joint and Specific Variation for Multiple Data Sets. The O2PLS method was originally developed for integrated analysis of two data sets. In the presented study, three data sets have been characterized in parallel for transcript, protein and metabolite profiles. Multiple O2PLS models have been utilized for this purpose to identify joint covariance from the transcript data through the protein data to the metabolite data. The utilized algorithm is briefly described in Table 1 and in Figure 3. Monte Carlo Cross-Validation (MCCV)42 was utilized to estimate model parameters (see Supporting Information for details). Note that the methodology is unsupervised in the sense that no knowledge about the sample labels is used in the modeling.

Connecting Transcript, Protein and Metabolite Levels. The joint covariation for all profiling platforms was calculated using multiple O2PLS models in accordance with Table 1. The joint covariation captures two main effects that are common for all profiling techniques. The first joint effect is an internode gradient describing the common developmental progress of the samples, independent of the genotypes (Figure 4). The second joint effect is a distinct separation of the G5 and G3 genotypes, respectively, independent of internode (Figure 5). We will refer to the internode gradient as the internode effect and the genotype discrimination as the genotype effect throughout. These effects essentially describe the experimental setup in Figure 2, which confirms the suitability of the integrative methodology since no information regarding the samples has been employed in the modeling. The effect of using a different order of integration for the O2PLS models was evaluated to test the consistency of the methodology. As a similarity measure, the correlation between joint effects (score vectors) was used.
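A similarity measure of this kind can be sketched as follows. The sketch assumes that the joint score vectors obtained from two integration orders have already been matched component by component, which is an assumption of the example rather than something stated in the text; the function names are hypothetical. The absolute value is taken because the sign of a score vector is arbitrary.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def mean_abs_correlation(scores_a, scores_b):
    """Average |r| between position-matched score vectors of two models.

    scores_a, scores_b : lists of score vectors, one per joint component,
                         each holding one value per sample.
    """
    return mean(abs(pearson(a, b)) for a, b in zip(scores_a, scores_b))


# Two toy "models" with two joint components over four samples each.
order_1 = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0]]
order_2 = [[4.0, 3.0, 2.0, 1.0], [1.0, 1.0, -1.0, -1.0]]
similarity = mean_abs_correlation(order_1, order_2)
```

In the toy data the first pair of vectors is perfectly anticorrelated and the second pair is uncorrelated, so the measure lies halfway between zero and one.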
The average absolute correlation measures were ∼0.90 and the same effects (internode and genotype) were detected; hence, the integration order was not considered to have a critical effect on the conclusions; see Supporting Information for additional information.

Because of the transparency of the modeling approach, all of these effects can be directly related to the variables of interest (transcripts, proteins and metabolites) in order to put the results in a biological context. We highlight certain pathways of interest in the following section, although these do not comprehensively describe all the affected transcripts, proteins and metabolites. The integrated analysis both confirms a set of known links between transcripts, proteins and metabolites and reveals putative connections.

The internode (developmental) effect exhibits increased levels of transcripts related to protein translation and photosynthesis in the primary growth region (internode A); see Figure 4B. The majority of identified proteins are related to protein translation elongation and glucose metabolism (Figure 4C). The increased metabolism in the primary growth region in turn requires amino acids, which are elevated at the metabolite level (Figure 4D). The metabolite myo-inositol (and the possible nitrogen donor aspartate) is less abundant in the growth regions. Myo-inositol in particular is known to play a role in plant growth due to its involvement in, for example, cell wall generation, auxin production and biosynthesis of oligosaccharides.43

The PttMYB21a gene is known to primarily affect lignin biosynthesis and plant growth characteristics,27 although the underlying mechanisms are less well-known. The fact that the normal growth gradient and the mutant separation are independent of one another (Figures 4 and 5) suggests that both the G5 and G3 mutants share the essential growth characteristics of normal plants with a few modifications, causing slower growth.
Several transcripts coding for factors that are essential for cell growth, including tubulin, actin-depolymerizing factor (ADF) and translation elongation factors, all have decreased transcription levels due to the introduced antisense constructs in G5 (Figure 5B). The behavior of the protein translation elongation factors is particularly interesting as it is affected by both the internode gradient and the genotype effect (seen in Figures 4B and 5B). Numerous factors involved in the lignin biosynthesis are also heavily affected. For instance, Caffeoyl-CoA O-methyltransferase (CCoAOMT), Caffeic acid 3-O-methyltransferase (COMT), Cinnamoyl-CoA Reductase 1 (CCR1), Cinnamyl-alcohol dehydrogenase (CAD) and chorismate synthase are central or peripheral enzymes in the lignin biosynthesis, and their reduced levels in G5 can be seen at the transcript level. The behavior of the G5 mutant conforms surprisingly well with the characteristics of tension wood (TW) as shown in previous studies.44 These changes at the transcript level can, in turn, be linked to metabolite fluctuations of, for example, quinic acid, which is also related to the lignin biosynthesis.

To visualize the inter-relatedness of the affected transcripts, we use a multilayer network constructed from an extensive set of Poplar microarrays.48 Given this network, it is possible to study how genes of particular interest appear in the network based on a “guilt by association” principle. We use the O2PLS correlation loadings to highlight and identify clusters from the predefined MST. The largest identified clusters are illustrated in Figure 6 for the correlation loading vector describing the internode effect. The corresponding Gene Ontology (GO) functional categorization (http://www.geneontology.org/) of each identifier is given in Tables S2-S3 in the Supporting Information. These clusters conform well to the previous conclusions regarding the underlying biological events of the internode and genotype effects, respectively.

The effects on tubulin, COMT and CCoAOMT have reliable matches at the protein level but display an opposite direction of change. Where transcript levels are decreased, the protein levels are elevated correspondingly (see Discussion). This negative correlation between transcript and protein data (Figure 5B-C) might be an effect of altered cellular and tissue structure in the mutants/wild-type affecting the efficiency of the protein extraction. During secondary cell wall formation, extensive rearrangement of the cytoskeleton occurs45 and new interactions between proteins and the cell wall are established.46 Material from the mutants, which display a delayed maturation (i.e., a less developed cell wall), would then give higher yields compared to WT, and a similar effect would apply along the internode gradient. However, to characterize the behavior of the mutants to a full extent, further investigations and follow-up experiments will be required.

Omics-Specific Sources of Variation. In terms of explained (co)variation, the three profiling techniques agree in that approximately one-third of the variation is platform-specific (Figure 7). This signifies systematic events that have no correspondence across omics data sets, which are interesting and important to study from a biological perspective. As the joint variation captures all known effects related to the experimental design (Figure 2), we do not expect to see any such trends in the specific sources of variation. Investigation of the systematic omics-specific sources of variation reveals that this is indeed the case. Instead, these effects are to a major extent linked to housekeeping-like events, which cannot be directly traced to molecular events for the other omics technologies. This could be due to the fact that the effects are indeed unrelated, or it could be a consequence of the fact that the technologies do not allow completely global profiling. This topic is elaborated further in the Discussion. An example of a platform-specific effect is shown in Figure 8, which describes the strongest systematic trend that is specific to the transcript data. Transcripts associated with housekeeping-like tasks dominate this systematic variation, for example, transcripts related to ubiquitin or nucleosome assembly.
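The proportions of joint and platform-specific variation discussed in this section can be illustrated with a small calculation. The sketch below assumes that a data matrix has already been split into a joint part, a platform-specific part and a residual of the same shape (as an O2PLS-type model would provide), and expresses each part as a fraction of the total sum of squares. The function names and the toy matrices are hypothetical, and the fractions only add up to 1 when the parts are mutually orthogonal.

```python
def sum_of_squares(matrix):
    """Sum of squared elements of a matrix given as a list of rows."""
    return sum(v * v for row in matrix for v in row)


def variation_fractions(joint, specific, residual):
    """Fraction of the total sum of squares carried by each model part.

    All three arguments are samples-by-variables matrices of equal shape,
    and the (centered) data matrix is assumed to be their element-wise sum.
    """
    total = [
        [j + s + r for j, s, r in zip(j_row, s_row, r_row)]
        for j_row, s_row, r_row in zip(joint, specific, residual)
    ]
    ss_total = sum_of_squares(total)
    return {
        "joint": sum_of_squares(joint) / ss_total,
        "specific": sum_of_squares(specific) / ss_total,
        "residual": sum_of_squares(residual) / ss_total,
    }


# Toy example with orthogonal parts: two samples, two variables.
joint = [[1.0, 0.0], [1.0, 0.0]]
specific = [[0.0, 1.0], [0.0, -1.0]]
residual = [[0.0, 0.0], [0.0, 0.0]]
fractions = variation_fractions(joint, specific, residual)
```

Computing these fractions separately for each data matrix in the analysis gives a per-platform summary in the spirit of Figure 7.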

Discussion

A strategy is described for data generation and integrated analysis of transcript, protein and metabolite profiles measured in parallel from hybrid aspen samples (P. tremula × P. tremuloides). From the results, it is concluded that a considerable part (40% on average) of the existing variation in the steady-state system can be linked from changes in transcript levels through protein quantities to metabolite levels. The higher proportion of joint variation seen for the transcript data set is most likely linked to the initial filtering of low-intensity signals, in which signals from positions that lack probe material were used to estimate properties of the noise. This filtering was not conducted for the protein and metabolite data sets due to the lack of a good reference point to estimate the noise levels.

Much of the variation in the data is also specific to the corresponding profiling technique, which seems to be linked to housekeeping-like events. One possible explanation for this behavior can be traced to the instrumental techniques used to characterize the data. Although the microarray technology allows an almost global monitoring of the transcriptome, the UPLC/MS and GC/MS techniques used can currently only identify a partial range of all available peptides and metabolites, respectively. This is a limitation when connecting different omics data sets using a global profiling approach, which can only be solved by further technological advances. Despite these restrictions, the utility of MS-based techniques for global profiling is evident, as described in numerous studies, and these techniques are likely to remain equally prevalent in the future.

The joint covariation describes two independent effects that are common to all profiling platforms: the internode (developmental) gradient and the separation of the G5 and G3 genotypes from the WT. The developmental gradient exhibits changes in the levels of photosynthesis-related transcripts and amino acid levels. These are fundamental processes of growing plants that can be traced across all profiling techniques and confirm that the detected associations between transcripts, proteins and metabolites are biologically sound. The unique effects introduced in the G5 mutant, due to down-regulation of the PttMYB21a gene, are known to affect plant growth and lignin biosynthesis.27 The fact that the normal growth gradient and the separation of the G5 and G3 genotypes are independent of one another suggests that the mutants share the essential growth characteristics of a normal plant with a few alterations, introducing slower growth. Central factors in both cell growth (tubulin and protein translation elongation factors) and lignin biosynthesis (COMT2 and CCoAOMT2) are heavily affected in both the transcript and protein data sets for the stronger G5 mutant. A striking result is that, when transcript levels are decreased, the protein levels are increased to a corresponding level in the G5 genotype. The correlation between protein and mRNA has been studied in several experiments.
It is evident that various mechanisms of post-translational regulation exist and will be reflected in a weak correlation between the protein and transcript levels of the corresponding gene (see, e.g., refs 1, 47). The changes of COMT2 and CCoAOMT1/2 can additionally be linked to metabolite fluctuations of, for example, quinic acid, which is also related to the lignin biosynthesis.

The platform-specific systematic variation signifies events that have no correspondence across the omics data sets. Monitoring at multiple levels makes it possible to separate systematic variation that is platform-specific from variation that can be linked across the different platforms. Should one only measure, for example, transcript levels, effects that are transferred to changes in protein quantities would be mixed with effects that are transcript-specific, without the possibility to separate these different events. If the majority of systematic variation is generally platform-specific, the use of a combined profiling approach is of increasing importance in order to be able to correctly interpret the underlying biological effects.

Results from O2PLS models allow easy access to model components that are useful for interpretation of the result, which has been convincingly demonstrated in previous life science studies.25,26 Because of this model transparency, all of the identified (internode and genotype) effects can be directly related to the variables of interest (transcripts, proteins and metabolites) in order to put the results in a biological context. We have only discussed general implications of combined profiling analyses in the present study but aim to elaborate on further details regarding the underlying biology in a future study.

Acknowledgment. The authors are grateful to Bo Segerman for creating the necessary software tools for managing the Populus protein database. This work was supported by grants from the Swedish Foundation for Strategic Research, the Swedish Research Council, FORMAS, MKS Umetrics and the KEMPE Foundation.

Supporting Information Available: Selection of diverse samples for transcriptomic and proteomic profiling, model details and cross-validation results, Monte Carlo Cross-Validation (MCCV) to estimate the generalization error, estimation of run-order effects for the metabolomics data set, transcript-protein correlations, the effect of changing the order of the O2PLS integrative models, main gene ontology categorization for the genes identified as up-regulated in internode A and down-regulated in G5 in the multilayer network. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) Gygi, S. P.; Rochon, Y.; Franza, B. R.; Aebersold, R. Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 1999, 19 (3), 1720–1730.
(2) Kleno, T. G.; Kiehr, B.; Baunsgaard, D.; Sidelmann, U. G. Combination of ‘omics’ data to investigate the mechanism(s) of hydrazine-induced hepatotoxicity in rats and to identify potential biomarkers. Biomarkers 2004, 9 (2), 116–138.
(3) Hirai, M. Y.; Yano, M.; Goodenowe, D. B.; Kanaya, S.; Kimura, T.; Awazuhara, M.; Arita, M.; Fujiwara, T.; Saito, K. Integration of transcriptomics and metabolomics for understanding of global responses to nutritional stresses in Arabidopsis thaliana. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (27), 10205–10210.
(4) Tohge, T.; Nishiyama, Y.; Hirai, M. Y.; Yano, M.; Nakajima, J.; Awazuhara, M.; Inoue, E.; Takahashi, H.; Goodenowe, D. B.; Kitayama, M.; Noji, M.; Yamazaki, M.; Saito, K. Functional genomics by integrated analysis of metabolome and transcriptome of Arabidopsis plants over-expressing an MYB transcription factor. Plant J. 2005, 42 (2), 218–235.
(5) Kolbe, A.; Oliver, S. N.; Fernie, A. R.; Stitt, M.; van Dongen, J. T.; Geigenberger, P. Combined transcript and metabolite profiling of Arabidopsis leaves reveals fundamental effects of the thiol-disulfide status on plant metabolism. Plant Physiol. 2006, 141 (2), 412–422.
(6) Hirai, M. Y.; Klein, M.; Fujikawa, Y.; Yano, M.; Goodenowe, D. B.; Yamazaki, Y.; Kanaya, S.; Nakamura, Y.; Kitayama, M.; Suzuki, H.; Sakurai, N.; Shibata, D.; Tokuhisa, J.; Reichelt, M.; Gershenzon, J.; Papenbrock, J.; Saito, K. Elucidation of gene-to-gene and metabolite-to-gene networks in Arabidopsis by integration of metabolomics and transcriptomics. J. Biol. Chem. 2005, 280 (27), 25590–25595.
(7) Carrari, F.; Baxter, C.; Usadel, B.; Urbanczyk-Wochniak, E.; Zanor, M. I.; Nunes-Nesi, A.; Nikiforova, V.; Centero, D.; Ratzka, A.; Pauly, M.; Sweetlove, L. J.; Fernie, A. R. Integrated analysis of metabolite and transcript levels reveals the metabolic shifts that underlie tomato fruit development and highlight regulatory aspects of metabolic network behavior. Plant Physiol. 2006, 142 (4), 1380–1396.
(8) Clish, C. B.; Davidov, E.; Oresic, M.; Plasterer, T. N.; Lavine, G.; Londo, T.; Meys, M.; Snell, P.; Stochaj, W.; Adourian, A.; Zhang, X.; Morel, N.; Neumann, E.; Verheij, E.; Vogels, J. T.; Havekes, L. M.; Afeyan, N.; Regnier, F.; van der Greef, J.; Naylor, S. Integrative biological analysis of the APOE*3-leiden transgenic mouse. OMICS 2004, 8 (1), 3–13.
(9) Oresic, M.; Clish, C. B.; Davidov, E. J.; Verheij, E.; Vogels, J.; Havekes, L. M.; Neumann, E.; Adourian, A.; Naylor, S.; van der Greef, J.; Plasterer, T. Phenotype characterisation using integrated gene transcript, protein and metabolite profiling. Appl. Bioinf. 2004, 3 (4), 205–217.
(10) Rischer, H.; Oresic, M.; Seppanen-Laakso, T.; Katajamaa, M.; Lammertyn, F.; Ardiles-Diaz, W.; Van Montagu, M. C.; Inze, D.; Oksman-Caldentey, K. M.; Goossens, A. Gene-to-metabolite networks for terpenoid indole alkaloid biosynthesis in Catharanthus roseus cells. Proc. Natl. Acad. Sci. U.S.A. 2006, 103 (14), 5614–5619.
(11) Schena, M.; Shalon, D.; Davis, R. W.; Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270 (5235), 467–70.
(12) de Hoffmann, E.; Stroobant, V. Mass Spectrometry: Principles and Applications, 2nd ed.; John Wiley & Sons: Chichester, U.K., 2001.
(13) Box, G. E. P.; Hunter, W. G.; Hunter, J. S. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building; Wiley: New York, 1978.

(14) Lundstedt, T.; Seifert, E.; Abramo, L.; Thelin, B.; Nyström, A.; Pettersen, J.; Bergman, R. Experimental design and optimization. Chemom. Intell. Lab. Syst. 1998, 42 (1-2), 3–40.
(15) Gullberg, J.; Jonsson, P.; Nordström, A.; Sjöström, M.; Moritz, T. Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Anal. Biochem. 2004, 331 (2), 283–295.
(16) Jiye, A.; Trygg, J.; Gullberg, J.; Johansson, A.; Jonsson, P.; Antti, H.; Marklund, S.; Moritz, T. Extraction and GC/MS analysis of the human blood plasma metabolome. Anal. Chem. 2005, 77 (24), 8086–8094.
(17) deAguiar, P.; Bourguignon, B.; Khots, M.; Massart, D.; PhanThanLuu, R. D-optimal designs. Chemom. Intell. Lab. Syst. 1995, 30 (2), 199–210.
(18) Marengo, E.; Todeschini, R. A new algorithm for optimal, distance-based experimental-design. Chemom. Intell. Lab. Syst. 1992, 16 (1), 37–44.
(19) Pir, P.; Kirdar, B.; Hayes, A.; Onsan, Z. Y.; Ulgen, K. O.; Oliver, S. G. Integrative investigation of metabolic and transcriptomic data. BMC Bioinf. 2006, 7, 203.
(20) Joyce, A. R.; Palsson, B. O. The model organism as a system: integrating ’omics’ data sets. Nat. Rev. Mol. Cell Biol. 2006, 7 (3), 198–210.
(21) Trygg, J. O2-PLS for qualitative and quantitative analysis in multivariate calibration. J. Chemom. 2002, 16, 283–293.
(22) Trygg, J.; Wold, S. O2-PLS, a two-block (X-Y) latent variable regression (LVR) method with an integral OSC filter. J. Chemom. 2003, 17, 53–64.
(23) Gabrielsson, J.; Jonsson, H.; Airiau, C.; Schmidt, B.; Escott, R.; Trygg, J. The OPLS methodology for analysis of multi-block batch process data. J. Chemom. 2006, 20 (8-10), 362–369.
(24) Gabrielsson, J.; Jonsson, H.; Airiau, C.; Schmidt, B.; Escott, R.; Trygg, J. OPLS methodology for analysis of pre-processing effects on spectroscopic data. Chemom. Intell. Lab. Syst. 2006, 84 (1-2), 153–158.
(25) Rantalainen, M.; Cloarec, O.; Beckonert, O.; Wilson, I. D.; Jackson, D.; Tonge, R.; Rowlinson, R.; Rayner, S.; Nickson, J.; Wilkinson, R. W.; Mills, J. D.; Trygg, J.; Nicholson, J. K.; Holmes, E. Statistically integrated metabonomic-proteomic studies on a human prostate cancer xenograft model in mice. J. Proteome Res. 2006, 5 (10), 2642–2655.
(26) Bylesjö, M.; Eriksson, D.; Kusano, M.; Moritz, T.; Trygg, J. Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data. Plant J. 2007, 52, 1181–1191.
(27) Karpinska, B.; Karlsson, M.; Srivastava, M.; Stenberg, A.; Schrader, J.; Sterky, F.; Bhalerao, R.; Wingsle, G. MYB transcription factors are differentially expressed and regulated during secondary vascular tissue development in hybrid aspen. Plant Mol. Biol. 2004, 56 (2), 255–270.
(28) Sterky, F.; Bhalerao, R.; Unneberg, P.; Segerman, B.; Nilsson, P.; Brunner, A.; Charbonnel-Campaa, L.; Lindvall, J.; Tandre, K.; Strauss, S.; Sundberg, B.; Gustafsson, P.; Uhlen, M.; Bhalerao, R.; Nilsson, O.; Sandberg, G.; Karlsson, J.; Lundeberg, J.; Jansson, S. A Populus EST resource for plant functional genomics. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (38), 13951–13956.
(29) Sjödin, A.; Bylesjö, M.; Skogström, O.; Eriksson, D.; Nilsson, P.; Rydén, P.; Jansson, S.; Karlsson, J. UPSC-BASE - Populus transcriptomics online. Plant J. 2006, 48 (5), 806–817.
(30) Smith, C.; Rodriguez-Buey, M.; Karlsson, J.; Campbell, M. The response of the poplar transcriptome to wounding and subsequent infection by a viral pathogen. New Phytol. 2004, 164 (1), 123–136.
(31) Bylesjö, M.; Eriksson, D.; Sjödin, A.; Sjöström, M.; Jansson, S.; Antti, H.; Trygg, J. MASQOT: a method for cDNA microarray spot quality control. BMC Bioinf. 2005, 6, 250.
(32) Bylesjö, M.; Sjödin, A.; Eriksson, D.; Antti, H.; Moritz, T.; Jansson, S.; Trygg, J. MASQOT-GUI: spot quality assessment for the two-channel microarray platform. Bioinformatics 2006, 22 (20), 2554–2555.
(33) Bylesjö, M.; Eriksson, D.; Sjödin, A.; Jansson, S.; Moritz, T.; Trygg, J. Orthogonal projections to latent structures as a strategy for microarray data normalization. BMC Bioinf. 2007, 8, 207.
(34) Giavalisco, P.; Nordhoff, E.; Lehrach, H.; Gobom, J.; Klose, J. Extraction of proteins from plant tissues for two-dimensional electrophoresis analysis. Electrophoresis 2003, 24 (1-2), 207–216.
(35) Kinter, M.; Sherman, N. E. Protein Sequencing and Identification Using Tandem Mass Spectrometry; John Wiley and Sons: New York, 2000.

(36) Bäckström, S.; Elfving, N.; Nilsson, R.; Wingsle, G.; Björklund, S. Purification of a plant mediator from Arabidopsis thaliana identifies PFT1 as the Med25 subunit. Mol. Cell 2007, 26 (5), 717–729.
(37) Tuskan, G. A.; Difazio, S.; Jansson, S.; Bohlmann, J.; Grigoriev, I.; Hellsten, U.; Putnam, N.; Ralph, S.; Rombauts, S.; Salamov, A.; Schein, J.; Sterck, L.; Aerts, A.; Bhalerao, R. R.; Bhalerao, R. P.; Blaudez, D.; Boerjan, W.; Brun, A.; Brunner, A.; Busov, V.; Campbell, M.; Carlson, J.; Chalot, M.; Chapman, J.; Chen, G. L.; Cooper, D.; Coutinho, P. M.; Couturier, J.; Covert, S.; Cronk, Q.; Cunningham, R.; Davis, J.; Degroeve, S.; Dejardin, A.; Depamphilis, C.; Detter, J.; Dirks, B.; Dubchak, I.; Duplessis, S.; Ehlting, J.; Ellis, B.; Gendler, K.; Goodstein, D.; Gribskov, M.; Grimwood, J.; Groover, A.; Gunter, L.; Hamberger, B.; Heinze, B.; Helariutta, Y.; Henrissat, B.; Holligan, D.; Holt, R.; Huang, W.; Islam-Faridi, N.; Jones, S.; Jones-Rhoades, M.; Jorgensen, R.; Joshi, C.; Kangasjarvi, J.; Karlsson, J.; Kelleher, C.; Kirkpatrick, R.; Kirst, M.; Kohler, A.; Kalluri, U.; Larimer, F.; Leebens-Mack, J.; Leple, J. C.; Locascio, P.; Lou, Y.; Lucas, S.; Martin, F.; Montanini, B.; Napoli, C.; Nelson, D. R.; Nelson, C.; Nieminen, K.; Nilsson, O.; Pereda, V.; Peter, G.; Philippe, R.; Pilate, G.; Poliakov, A.; Razumovskaya, J.; Richardson, P.; Rinaldi, C.; Ritland, K.; Rouze, P.; Ryaboy, D.; Schmutz, J.; Schrader, J.; Segerman, B.; Shin, H.; Siddiqui, A.; Sterky, F.; Terry, A.; Tsai, C. J.; Uberbacher, E.; Unneberg, P.; Vahala, J.; Wall, K.; Wessler, S.; Yang, G.; Yin, T.; Douglas, C.; Marra, M.; Sandberg, G.; Van de Peer, Y.; Rokhsar, D. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 2006, 313 (5793), 1596–1604.
(38) Jonsson, P.; Johansson, A. I.; Gullberg, J.; Trygg, J.; A, J.; Grung, B.; Marklund, S.; Sjöström, M.; Antti, H.; Moritz, T. High-throughput data analysis for detecting and identifying differences between samples in GC/MS-based metabolomic analyses. Anal. Chem. 2005, 77 (17), 5635–5642.

(39) Jolliffe, I. T. Principal Component Analysis, 2nd ed.; Springer: New York, 2002; p 502.
(40) Wold, S.; Esbensen, K.; Geladi, P. Principal Component Analysis. Chemom. Intell. Lab. Syst. 1987, 2 (1-3), 37–52.
(41) Wold, S. Cross validatory estimation of the number of components in factor and principal components models. Technometrics 1978, 20, 397–406.
(42) Shao, J. Linear-model selection by cross-validation. J. Am. Stat. Assoc. 1993, 88 (422), 486–494.
(43) Loewus, F.; Murthy, P. Myo-inositol metabolism in plants. Plant Sci. 2000, 150 (1), 1–19.
(44) Andersson-Gunnerås, S.; Mellerowicz, E. J.; Love, J.; Segerman, B.; Ohmiya, Y.; Coutinho, P. M.; Nilsson, P.; Henrissat, B.; Moritz, T.; Sundberg, B. Biosynthesis of cellulose-enriched tension wood in Populus: global analysis of transcripts and metabolites identifies biochemical and developmental regulators in secondary wall biosynthesis. Plant J. 2006, 45 (2), 144–165.
(45) Wasteneys, G. O. Progress in understanding the role of microtubules in plant cells. Curr. Opin. Plant Biol. 2004, 7 (6), 651–660.
(46) He, Z. H.; Fujiki, M.; Kohorn, B. D. A cell wall-associated, receptor-like protein kinase. J. Biol. Chem. 1996, 271 (33), 19789–19793.
(47) Foss, E. J.; Radulovic, D.; Shaffer, S. A.; Ruderfer, D. M.; Bedalov, A.; Goodlett, D. R.; Kruglyak, L. Genetic basis of proteome variation in yeast. Nat. Genet. 2007, 39 (11), 1369–1375.
(48) Grönlund, A.; Bhalerao, R. P.; Karlsson, J. Modular gene expression in Poplar: a multilayer network approach. New Phytologist, [Online early access]. DOI: 10.1111/j.1469-8137.2008.02668.x. Published online: Nov 5, 2008.
