Chromosome Gene Orientation Inversion Networks (GOINs) of


Article

Chromosome Gene Orientation Inversion Networks (GOINs) of Plasmodium proteome
Viviana F. Quevedo-Tumailli, Bernabe Ortega-Tenezaca, and Humbert Gonzalez-Diaz
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00861 • Publication Date (Web): 16 Jan 2018
Downloaded from http://pubs.acs.org on January 18, 2018



Journal of Proteome Research

Chromosome Gene Orientations Inversion Networks (GOINs) of Plasmodium proteome

Viviana F. Quevedo-Tumailli a, b, Bernabé Ortega-Tenezaca a, b, c, and Humbert González-Díaz d, e, *

a RNASA-IMEDIR, Computer Science Faculty, University of A Coruña, 15071, A Coruña, Spain.
b Universidad Estatal Amazónica UEA, Puyo, Pastaza, Ecuador
c Universidad Regional Autónoma de los Andes UNIANDES-Puyo, Ecuador
d Dept. of Organic Chemistry II, University of the Basque Country UPV/EHU, 48940, Leioa, Biscay, Spain
e IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Biscay, Spain

* Corresponding author: H.G.D., Email: [email protected], Phone: 94 601 3547

Abstract. The spatial distribution of genes in chromosomes seems not to be random. For instance, only 10% of genes are transcribed from bidirectional promoters in humans, and many more are organised in larger clusters. This raises intriguing questions that different authors have asked before. We would like to add a few more questions in this context, related to gene orientation inversions. Does gene orientation (inversion) follow a random pattern? Is it somehow relevant to biological activity? In this paper, we define a new kind of network, coined the Gene Orientation Inversion Network (GOIN). A GOIN complex network encodes short- and long-range patterns of inversion of the orientation of pairs of genes in the chromosome. We selected Plasmodium falciparum as a case study due to the high relevance of this parasite to public health (it is the causal agent of malaria). We constructed here, for the first time, all the GOINs for the genome of this parasite. These networks have an average of 383 nodes (genes in one chromosome) and 1314 links (pairs of genes with inverse orientation). We calculated node centralities and other parameters of these networks. These numerical parameters were used to study different properties of gene inversion patterns, e.g., distribution, local communities, similarity to Erdős-Rényi random networks, and randomness. We found clues that seem to indicate that gene orientation inversion does not follow a random pattern. We noted that some gene communities in the GOINs tend to group genes encoding RIFIN-related proteins in the proteome of the parasite. RIFIN-like proteins are a second family of clonally variant proteins expressed on the surface of red cells infected with Plasmodium falciparum. Consequently, we used these centralities as inputs of Machine Learning (ML) models to predict the RIFIN-like activity of 5365 proteins in the proteome of Plasmodium sp. The best linear ML model found discriminates RIFIN-like from other proteins with Sensitivity and Specificity of 70-80% in training and external validation series. All these results may point to a possible biological relevance of gene orientation inversion that is not directly dependent on genetic sequence information. This work opens the gate to the use of GOINs as a tool for the study of the structure of chromosomes and the study of protein function in proteome research.

Keywords: Malaria; Plasmodium sp. proteome; Chromosome microstructure; Gene orientation; Complex Networks; Machine Learning


Introduction

The spatial distribution of genes in chromosomes seems not to be random 1. For instance, only 10% of genes are transcribed from bidirectional promoters in humans, and many more are organised in larger clusters. Interestingly, neighbouring genes are frequently coexpressed but rarely functionally related. Kustatscher et al. 2 found that coexpression of bidirectional gene pairs, and of nearby genes in general, is buffered at the protein level. According to these authors, grouping human genes together along the genome sequence is biologically relevant to reduce expression noise. In any case, as Kustatscher et al. noted, all this raises intriguing questions: why are functionally unrelated genes clustered in the genome, and how can the cell tolerate their coexpression 2? We would like to add here more questions, this time related to gene orientation inversions. Does gene orientation (inversion) follow a random pattern? Is it somehow relevant to biological activity?

On the other hand, Network Analysis (NA) has been expanded to different levels of organization 3. In fact, NA may help us to study the long-range distribution of many different structural patterns, connectivity features, and regulatory motifs in diverse complex systems. At a molecular-biological level we can use NA to study drug-target interactions 4, the structure of these targets (protein contact maps), the interactions among proteins (PINs) 5, 6, metabolic pathways, the brain cortex, or ecosystems. At a larger scale we can use NA to study large social networks like the Internet, financial networks 7-9, or the US Supreme Court 10, 11. Inferring the structure-and/or-property relationships in complex networks from observable data is significant in many areas of science 12. We can use Machine Learning (ML) tools in NA to perform network inference 13. For instance, Ghanat et al. 14 used ML models for reconstructing a cancer network.

One way to perform this task is by calculating a type of parameter called topological indices (TIs) (for full networks) and/or vertex centralities (for nodes). These indices are useful to numerically characterize patterns of connectivity between nodes or actors in a network (represented as a mathematical graph). In particular, the node or vertex centrality values are a structural attribute, strictly dependent on the node's connections (the node's location in the network). The concept of centrality was introduced by Bavelas in 1948 15. In this sense, to define a centrality we must find a parameter that quantifies the contribution of one node to the network based on its location in it. Next, these indices can be used as input variables for ML algorithms like General Discriminant Analysis (GDA), Artificial Neural Networks (ANN), etc. In such a way, we can fit quantitative models able to predict the properties of networks that depend on their structure. The combination of NA and ML can be useful to study the interrelationship between structure and properties of many types of networks including, e.g., proteomes, brain cortex, epidemiological, and social networks 16-26.

In this work, we define for the first time a new type of complex network to study features in the microstructure of chromosomes. We coined this new class of networks Gene Orientation Inversion Networks (GOINs). This involves analyzing at the same time the spatial distribution of genes in the chromosome and the appearance of inversions in the orientation of the genes along the chromosome. In so doing, we selected as a case study the parasite Plasmodium falciparum (P. falciparum). Malaria is a major public health problem, considered a neglected tropical disease, in many parts of the world, especially in Africa, aggravated by drug development and resistance problems 27. This organism has 14 chromosomes and 5365 proteins in the proteome, encoded by an identical number of protein-encoding genes 28. Many researchers have carried out proteomics studies of different families of proteins in this parasite 29-31. However, many of these proteins are hypothetical proteins whose function remains unknown. First, we constructed the adjacency matrix (Ak) and the respective GOIN for each one of the 14 chromosomes. Next, we calculated the node centralities (Ct) for all the genes in each network. We then performed a comparative study of the distribution and topological properties of these networks against random network models. Last, we used node centralities as inputs of ML algorithms to predict specific examples of biological function without relying upon genetic information. Figure 1 illustrates the general workflow used in this paper to develop the new model.

Figure 1. Workflow of the GOIN Study of Plasmodium Proteome dataset

Materials and Methods

Mapviewer dataset of P. falciparum genome and proteome. We downloaded all the information about genes (genome) and proteins in the P. falciparum proteome from the Mapviewer database (https://www.ncbi.nlm.nih.gov/projects/mapview/) 32-34. The P. falciparum genome reported in Mapviewer is organized in 14 chromosomes, each containing a certain number of genes. For each gene, the database registers the coordinates within the chromosome (start and stop position), which are used to obtain the sequences. Additionally, the database reports the symbol, positive (+) or negative (-), for the Orientation (Oik) of the ith gene (Geneik) along the kth chromosome: Oik = 1 (positive) or Oik = -1 (negative). We also obtained the Position (Pik) of each gene in the chromosome and a description of the biological function of each protein in the proteome of the parasite. In the Supplementary Information file SI_01.pdf, Table S1, we detail the characteristics of the 5365 genes and proteins of the P. falciparum proteome.

Gene Orientation Inversion Networks (GOINs). We constructed two types of GOIN graphs. The first type is the GOIN graph, with information about gene orientation only. The second type is the S-GOIN graph, which includes information about both gene orientation and spatial distribution. To build them, we followed these steps. First, we listed the values of Oik and Pik of each Geneik in one Excel file. After that, we labeled the nodes with numbers ni (i = 1, 2, ... nmax) that represent the genes of chromosome k. These labels follow the order established by Mapviewer; that is, the number 1 represents Gene1k, whose position is Pik = 1. Finally, we used two different approaches in GOINs and S-GOINs to determine the existence (Lij = 1) or not (Lij = 0) of links between nodes in the network. For the GOIN graphs, we interconnected the nodes according to the pattern of gene orientation inversion of their neighbors, using the following function:

Lij = if(and(Oik * Ojk = -1; abs(Pik - Pjk) < cutoff); 1; 0)    (1)

Where Oik and Ojk are the orientations, and Pik and Pjk the positions, of Geneik and Genejk, 0 = 1 for all the nodes in the network. It means that there are no isolated nodes. Evaluating the function, Lij = 1 if Oik ≠ Ojk (orientation inversion) and |Pik – Pjk| < cutoff, and Lij = 0 otherwise. We can organize the values of Lij into n × n square adjacency matrices (Ak). These matrices can be visualized as graph representations of the GOIN. In the file SI_02.txt, we release the information in .net format for the 14 GOINs, one for each chromosome. Figure 3 shows the process used in this work to build the matrices.
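The link rule of Equation 1 can be sketched in a few lines of Python. This is only a minimal illustration, not the spreadsheet procedure used in the paper: the function name `goin_links`, the toy orientation list, and the cutoff value of 3 are assumptions chosen for the example, and positions are taken to be the ordinal gene indices 1..n along the chromosome.

```python
def goin_links(orientations, cutoff):
    """Return a GOIN adjacency matrix for one chromosome (Eq. 1).

    orientations: list of +1/-1 values ordered by gene position
                  (positions are assumed to be the indices 1..n).
    cutoff:       window size; genes i and j can be linked only if
                  |P_i - P_j| < cutoff.
    """
    n = len(orientations)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            inverted = orientations[i] * orientations[j] == -1
            close = abs(i - j) < cutoff  # index gap equals |P_i - P_j|
            if inverted and close:
                adj[i][j] = adj[j][i] = 1  # undirected link
    return adj


# Hypothetical toy chromosome with six genes (not data from the paper).
toy_orientations = [1, 1, -1, -1, 1, -1]
toy_adj = goin_links(toy_orientations, cutoff=3)
toy_link_count = sum(map(sum, toy_adj)) // 2
```

Because the matrix is filled symmetrically, each link is counted once by halving the total of all entries. Genes with the same orientation are never linked, whatever their distance.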

Figure 2. Illustration of gene inversion patterns with different cutoff values

Figure 3. Construction of adjacency matrices from the gene data of each chromosome

For the second type of graph, we started with the GOIN and added the spatial adjacency between the genes along the chromosome. We called this second network the Spatial GOIN (S-GOIN). To determine the existence of links (Lij = 1) or not (Lij = 0), we use the following nested function:

Lij = if(or(abs(Pik - Pjk) = 1; and(Oik * Ojk = -1; abs(Pik - Pjk) < cutoff)); 1; 0)    (2)
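Equation 2 can be sketched in Python as follows. This is an illustrative example under stated assumptions: `sgoin_edges` and `to_pajek` are hypothetical helper names, positions are assumed to be the ordinal indices 1..n, the toy orientations and cutoff are invented, and the writer emits only the basic `*Vertices` and `*Edges` sections of the Pajek .net format rather than reproducing the supplementary files.

```python
def sgoin_edges(orientations, cutoff):
    """List the S-GOIN links (i, j), with 1-based labels, following Eq. 2."""
    n = len(orientations)
    edges = []
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            adjacent = j - i == 1  # |P_i - P_j| = 1
            inverted = orientations[i - 1] * orientations[j - 1] == -1
            if adjacent or (inverted and j - i < cutoff):
                edges.append((i, j))
    return edges


def to_pajek(n, edges):
    """Serialize an undirected graph as minimal Pajek .net text."""
    lines = [f"*Vertices {n}"]
    lines += [f'{i} "Gene{i}"' for i in range(1, n + 1)]
    lines.append("*Edges")
    lines += [f"{i} {j}" for i, j in edges]
    return "\n".join(lines)


# Hypothetical toy chromosome with six genes (not data from the paper).
toy_orientations = [1, 1, -1, -1, 1, -1]
toy_edges = sgoin_edges(toy_orientations, cutoff=3)
toy_net = to_pajek(len(toy_orientations), toy_edges)
```

In this sketch every pair of consecutive genes is linked regardless of orientation, so the toy graph contains the full chain 1-2-3-4-5-6 plus the inversion links inside the cutoff window.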

The condition |Pik – Pjk| = 1 guarantees that there is a link Lij = 1 between pairs of genes Geneik and Genejk with spatial adjacency in chromosome k. In the file SI_03.txt, we release the files in .net format for the 14 S-GOINs, one for each chromosome. As a result, we generated 28 matrices, 14 for GOINs and another 14 for S-GOINs. In addition, for comparative purposes, we constructed 14 random (R) network models, called R-GOINs, one for each of the S-GOINs. These 14 R-GOINs were constructed with the same number of nodes using Erdős-Rényi random network models with the software CentiBin version 1.4.3 35. We calculated the edge probability of each GOIN, S-GOIN, and R-GOIN using the following function:

$$p_k = \frac{L_k}{L_{\max}} = \frac{L_k}{\binom{n_k}{2}} = \frac{2 \times L_k}{n_k^2 - n_k} \qquad (3)$$

Where pk, also written p(Lk), is the probability of formation of a link Lij between two nodes, Lk is the number of links in the network, and nk is the number of nodes in the network of the kth chromosome. In the file SI_04.txt, we release the information in .net format for the 14 R-GOINs. The values of Lk and p(Lk) of the S-GOINs were used to build the R-GOINs in the software CentiBin according to the Erdős-Rényi random network model 35. The file SI_05.xlsx shows the values of the matrices for all GOINs.

Complex Networks Centralities. In the particular case of gene networks, node centralities can play a very important role. The centrality of a node, in graph theory and complex networks, refers to numerical parameters that measure the relative importance of the node within the network. The centrality value of a node is useful, for example, for detecting relevant neighbors. In Table 1, we summarize some of the node centralities calculated by CentiBiN, the software used in this work. CentiBiN supports node centralities for undirected networks. It computes five different types of centralities (Ct), ranging from local measures, which only consider the direct neighborhood of a network element, to global measures 35. The file SI_05.xlsx shows the values of the centralities and the calculated averages.

Table 1. Definition of the most relevant parameters used in this work to describe complex networks

| Centrality Name | Symbol | Formula a | Information |
|---|---|---|---|
| Average Distance | Cdist(Genei, Chrk) | $= \mathrm{avg}(\mathrm{dist}(gene_i, gene_j))$ | Average topological distance between genes in the chromosome and/or topological distance among genes with inverted orientation. |
| Degree | Cdeg(Genei, Chrk) | $= \deg(gene_i)$ | Number of neighbor genes with inverse orientation in the chromosome neighborhood (cutoff window). |
| Closeness | Cclo(Genei, Chrk) | $= \left( \sum_{gene_j \in V} \mathrm{dist}(gene_i, gene_j) \right)^{-1}$ | Proximity to neighbor genes with inverse orientation in the chromosome neighborhood (cutoff window). |

| Centrality Average b | Symbol | Formula a | Information |
|---|---|---|---|
| Distance Average | Cdist(Chrk) | $= \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{avg}(\mathrm{dist}(gene_i, gene_j))$ | Expected value of distance (average proximity) for one gene in chromosome k. |
| Closeness Average | Cclo(Chrk) | $= \frac{1}{n_k} \sum_{i=1}^{n_k} \left( \sum_{gene_j \in V} \mathrm{dist}(gene_i, gene_j) \right)^{-1} = \frac{1}{n_k} \sum_{i=1}^{n_k} C_{clo}(Gene_i, Chr_k)$ | Expected value of closeness (average proximity) for one gene in chromosome k. |

| Moving Average b | Symbol | Formula a | Information |
|---|---|---|---|
| MA of Closeness | ∆Cclo(Genei, Chrk) | $= C_{clo}(Gene_i, Chr_k) - C_{clo}(Chr_k)$ | Deviation of the closeness of genei with respect to the expected value for all genes in the same chromosome k. |
| MA of Degree | ∆Cdeg(Genei, Chrk) | $= C_{deg}(Gene_i, Chr_k) - C_{deg}(Org)$ | Deviation of the degree of genei with respect to the expected value for all genes in the organism. |

a All symbols used in these formulas are very common in the complex networks literature and are not explained in detail here. However, G = (V, E) is an undirected, (strongly) connected graph with n = |V| vertices or nodes; dist(genei, genej) denotes the length of a shortest path between the nodes of genei and genej. The matrix A is the adjacency matrix of the graph G. For more details, please see the references cited. Software: CB = CentiBin.

b Selected examples of Moving Average (MA) operators used.
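The centralities in Table 1 can be illustrated with a short breadth-first-search sketch in Python. It is not CentiBiN's implementation: the helper names and the four-node path graph are assumptions made for the example, the graph is assumed to be connected, and the average distance is taken here over the other n - 1 nodes, which may differ from the convention used by the software.

```python
from collections import deque


def bfs_distances(adj_list, source):
    """Shortest-path lengths (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def centralities(adj_list):
    """Degree, average distance and closeness of each node (connected graph)."""
    result = {}
    for node in adj_list:
        total = sum(bfs_distances(adj_list, node).values())
        result[node] = {
            "degree": len(adj_list[node]),
            "avg_distance": total / (len(adj_list) - 1),
            "closeness": 1.0 / total,  # inverse of the summed distances
        }
    return result


def deviations(values):
    """Deviation of each node value from the network mean (Delta C operator)."""
    mean = sum(values.values()) / len(values)
    return {node: value - mean for node, value in values.items()}


# Hypothetical 4-node path graph: 1 - 2 - 3 - 4
toy_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
toy_cent = centralities(toy_graph)
toy_delta_clo = deviations({n: c["closeness"] for n, c in toy_cent.items()})
```

In the toy graph the two inner nodes have positive closeness deviations and the two end nodes negative ones, which is the kind of contrast the moving-average operators are meant to capture.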


Biological function data for RIFIN-related proteins. In the previous dataset, we also collected the biological function of the proteins encoded by the genes in each chromosome of P. falciparum. This organism is the causative agent of the deadly malaria disease 36. We focused on the genes encoding one class of proteins called RIFINs. RIFINs are proteins present in this organism that belong to the largest known family of variable infected-erythrocyte surface-expressed proteins and are also naturally immunogenic. RIFIN proteins have been used to analyze the antibody responses of individuals living in an area of intense malaria transmission 37. There are 149 genes in the genome of the parasite coding for RIFIN proteins. The information downloaded from Mapviewer was processed to assign a biological activity class to each protein: Rik = 1 for RIFIN-related proteins or Rik = 0 otherwise. We created a list of gene names Gene.nameik for all genes in each Chrk. Then, using the following function nested in a spreadsheet, we extracted the values of Rik:

Rik = If(Find("RIFIN"; Gene.nameik); 1; 0)    (4)

For instance, the second gene is recorded from the start position 39205 to the stop position 40430. This gene has a negative orientation (Oik = -1), its symbol is MAL1P4.02, and its description is RIFIN. As the second gene belongs to the class of RIFIN proteins, Rik = 1. The last gene is registered from the start position 609110 to the stop position 616613, has a negative orientation (Oik = -1), its symbol is PFA0765c, and its description is erythrocyte membrane protein 1 (PfEMP1). The last gene belongs to the class of non-RIFIN proteins (Rik = 0). The file SI_05.xlsx shows the values for each Rik.
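The labeling rule of Equation 4 amounts to a case-sensitive substring test, which can be sketched in Python. The function name `rifin_class` and the dictionary layout are assumptions made for this example; the two records simply restate the example genes described in the text.

```python
def rifin_class(description):
    """Return 1 if the text contains 'RIFIN' (case-sensitive), else 0."""
    return 1 if "RIFIN" in description else 0


# The two example genes described in the text.
genes = [
    {"symbol": "MAL1P4.02", "start": 39205, "stop": 40430,
     "orientation": -1, "description": "RIFIN"},
    {"symbol": "PFA0765c", "start": 609110, "stop": 616613,
     "orientation": -1,
     "description": "erythrocyte membrane protein 1 (PfEMP1)"},
]
labels = {g["symbol"]: rifin_class(g["description"]) for g in genes}
```

Here `labels` maps MAL1P4.02 to 1 and PFA0765c to 0, matching the Rik values given for these two genes.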

Machine Learning models. We used the parameters of the GOINs as inputs of ML models to test the ability of these networks to encode chromosome microstructural information relevant to biological activity. First, we calculated the node centralities Ct(Genei, Chrk) for each node (Genei) of each graph (Chrk), initially for the GOIN model and later for the S-GOIN and R-GOIN models. After that, we used the values of the node centralities as inputs to carry out a General Discriminant Analysis (GDA). The objective of the GDA was to seek a new model able to discriminate genes expressing RIFIN-like proteins from genes encoding other proteins in the proteome of P. falciparum. The variable to be predicted is the membership (Rik = 1) or not (Rik = 0) of the protein in the RIFIN class. The output of the GDA equation is not Rik per se, but S(Rik), a real-valued score of Rik. We can write the GDA equation with the parameters mentioned above in the following form:

$$S(R_{ik}) = a_0 + \sum_{t=1}^{5} b_t \cdot [C_t(Gene_i, Chr_k) - C_t(Chr_k)] + \sum_{t=1}^{5} c_t \cdot [C_t(Gene_i, Chr_k) - C_t(Orient_i, Chr_k)] + \sum_{t=1}^{5} d_t \cdot [C_t(Gene_i, Chr_k) - C_t(Org)] \qquad (5)$$

or, in compact form:

$$S(R_{ik}) = a_0 + \sum_{t=1}^{5} b_t \cdot \Delta C_t(Gene_i, Chr_k) + \sum_{t=1}^{5} c_t \cdot \Delta C_t(Gene_i, Orient_i, Chr_k) + \sum_{t=1}^{5} d_t \cdot \Delta C_t(Gene_i, Chr_k, Org) \qquad (6)$$

We used the GDA algorithm implemented in the software STATISTICA version 12.0 to fit a linear model 38. For this study, we used Rik as the output variable and the different moving average operators as inputs; see details in Table 1. The goodness-of-fit parameters of the GDA model used here are: n = number of cases, Chi-square, p-value, Specificity (Sp), and Sensitivity (Sn), for both the training and validation classification matrices. Last, we carried out a Receiver Operating Characteristic (ROC) curve analysis to calculate the Area Under the ROC curve (AUROC) for the models 38. The file SI_06.xlsx shows the results of the linear ML model.
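The classification statistics named above can be sketched in plain Python. This is not the STATISTICA procedure: the labels and scores below are hypothetical, the decision threshold of 0 on the score S(Rik) is an assumption, and the AUROC is computed with the pairwise-ranking definition (the probability that a positive case scores above a negative one, with ties counted as one half).

```python
def sensitivity_specificity(labels, scores, threshold=0.0):
    """Sn and Sp when cases with score > threshold are predicted as class 1."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s > threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical class labels (1 = RIFIN-like) and model scores.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [2.1, 0.4, -0.3, 0.9, -0.5, -1.2, -2.0]
sn, sp = sensitivity_specificity(toy_labels, toy_scores)
area = auroc(toy_labels, toy_scores)
```

The pairwise form of the AUROC is quadratic in the number of cases, which is acceptable for a few thousand proteins but would be replaced by a rank-based computation for larger sets.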

Results and Discussion

GOINs of P. falciparum chromosomes. We constructed two types of graphs, the GOINs and S-GOINs, for all the chromosomes of P. falciparum using the software CentiBin. In Figure 4 we illustrate the GOIN and S-GOIN for chromosome I in the interface of the CentiBiN software. Next, we carried out a triple comparative analysis of the observed GOIN and S-GOIN graphs against Erdős-Rényi R-GOIN models. The main aim of this study is to compare the general topological characteristics of the GOINs with random models in order to study the degree of randomness of gene orientation patterns in the P. falciparum genome. In so doing, we built 14 R-GOINs as similar as possible to the respective S-GOIN graphs. We also calculated several topological parameters of these networks. In Table 2, we summarize the results of this study for some selected chromosomes. This descriptive study focused on the average values of the Distance and Closeness centralities among genes. We focused on these parameters because they measure the distance between genes with gene orientation inversion, Oik ≠ Ojk. After a simple visual inspection, we can conclude that the values of the topological indices studied are very different in S-GOIN vs. R-GOIN graphs. For instance, the average topological distance varies in the range [15.71-52.84] for the S-GOINs of the 14 chromosomes.


Figure 4. Illustration of the GOIN and S-GOIN for chromosome I in the interface of the CentiBiN software


Conversely, the same parameter varies in the range [2.90-4.35] for the R-GOINs of the same chromosomes. Meanwhile, the average values of closeness vary in the range [0.00004-0.00038] (S-GOIN) vs. [0.00043-0.00191] (R-GOIN); see selected examples in Table 2. In file SI_01.pdf, we show detailed results for each one of the 14 chromosomes. In fact, Figure 5 illustrates the notable differences in the values of average closeness for the S-GOINs vs. the respective random networks in the 14 chromosomes. These results seem to indicate that the distribution of gene orientation in the S-GOINs does not follow a random pattern according to the Erdős-Rényi model 35.
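The comparison with a random model can be sketched in Python using only the standard library. This is an illustration rather than the CentiBin procedure: the helper names and the random seed are assumptions, and the random graph is drawn with the G(n, M) variant of the Erdős-Rényi model (a fixed number of links chosen uniformly), which may not be the variant the software uses. The node and link counts are those reported for the chromosome I S-GOIN in Table 2.

```python
import random
from collections import deque


def edge_probability(n, links):
    """Eq. 3: fraction of the n*(n-1)/2 possible links that are present."""
    return 2 * links / (n * n - n)


def random_graph(n, links, seed=0):
    """Erdos-Renyi G(n, M) graph: `links` distinct node pairs drawn uniformly."""
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    adj = {i: set() for i in range(n)}
    for i, j in rng.sample(pairs, links):
        adj[i].add(j)
        adj[j].add(i)
    return adj


def average_distance(adj):
    """Mean shortest-path length over all connected ordered node pairs."""
    total = count = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1  # reachable nodes other than the source
    return total / count


# n and L of the chromosome I S-GOIN, as reported in Table 2.
n_nodes, n_links = 157, 378
p = edge_probability(n_nodes, n_links)
r_goin = random_graph(n_nodes, n_links)
r_avg_dist = average_distance(r_goin)
```

With these inputs `p` rounds to 0.031, the value listed in Table 2. The average distance of the random graph depends on the seed, so it should be compared with the S-GOIN value over several draws rather than read from a single run.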

Figure 5. Average closeness of S-GOINs vs. R-GOINs of 14 chromosomes

The histograms depicted in Table 2 show that the node degree distributions for different chromosomes resemble a normal distribution for both S-GOIN and R-GOIN networks (see full results for the 14 chromosomes separately in SI_01.pdf). However, we carried out a Kolmogorov-Smirnov (KS) test of normality for the S-GOIN of the entire karyotype of P. falciparum (all chromosomes). As the value of D = 0.1629 has an associated Lilliefors p < 0.01 (lower than 0.05), we should reject the hypothesis that the S-GOIN of P. falciparum follows a normal distribution. In addition, we performed the Shapiro-Wilk (SW) test of normality for each one of the 14 GOINs + 14 S-GOINs = 28 networks studied separately (see full results in SI_01.pdf). All SW statistic values are in the range W = 0.89-0.98 with p < 0.01 (lower than 0.05) for the 28 networks. Then, we should reject the hypothesis that gene inversion in the GOINs and S-GOINs of P. falciparum follows a normal distribution.


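Table 2. GOIN/S-GOIN vs. R-GOIN models (selected examples)

Chromosome I (S-GOIN graph)

| Param. a | Networks b: S-GOIN | GOIN | R-GOIN |
|---|---|---|---|
| n | 157 | 157 | 157 |
| L | 378 | 289 | 379 |
| p(Lk) | 0.031 | 0.024 | 0.031 |
| Average distance | 17.28 | 17.50 | 3.39 |
| Average closeness (×10³) | 0.38 | 0.38 | 1.91 |
| | 0.933 | 0.936 | 0.9587 |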

(