Insights from ENCODE on Missing Proteins: Why β-Defensin


Article

Insights from ENCODE on missing proteins: why β-defensin expression is scarcely detected

Yang Fan, Yue Zhang, Shaohang Xu, Nannan Kong, Yang Zhou, Zhe Ren, Yamei Deng, Liang Lin, Yan Ren, Quanhui Wang, Jin Zi, Bo Wen, and Siqi Liu

J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 10 Aug 2015

Downloaded from http://pubs.acs.org on August 10, 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Insights from ENCODE on missing proteins: why β-defensin expression is scarcely detected

Yang Fan1,2,3,#, Yue Zhang1,2,3,#, Shaohang Xu2, Nannan Kong1,2,3, Yang Zhou1,2,3, Zhe Ren2, Yamei Deng1,2,3, Liang Lin2, Yan Ren2, Quanhui Wang1,2,3, Jin Zi2, Bo Wen2,*, Siqi Liu1,2,3,*

1 CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 101318, China

2 BGI-Shenzhen, Shenzhen 518083, China

3 Graduate University of the Chinese Academy of Sciences, Beijing 100049, China

# These authors contributed equally to this work.

*To whom correspondence should be addressed:

Siqi Liu, Beijing Institute of Genomics, CAS, 1 BeiChen West Road, Beijing 100101, China.

Tel and Fax: 86-10-80485460; E-mail: [email protected]

Bo Wen, BGI-Shenzhen, 11 Build, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China

Tel and Fax: 86-0755-25273620; E-mail: [email protected]

ACS Paragon Plus Environment

Abbreviations

DEFB, β-defensin gene; CHPP, Chromosome-Centric Human Proteome Project; ENCODE, Encyclopedia of DNA Elements; DHS, DNase I Hypersensitive Site; TF, Transcription Factor; TSS, Transcription Start Site; FPKM, Fragments Per Kilobase of exon per Million fragments mapped; GRAVY, grand average of hydropathy; ChIP-seq, Chromatin Immunoprecipitation followed by sequencing.

Abstract

β-defensins (DEFBs) have a variety of functions. The majority of these proteins were not identified in a recent proteome survey. Neither protein detection nor the analysis of RNA-seq transcriptomic data for three liver cancer cell lines identified any expression products. Extensive investigation into DEFB transcripts in over 70 cell lines offered similar results. This naturally raises the question: why are DEFB genes scarcely expressed? After examining DEFB gene annotation and the physicochemical properties of the protein products, we postulated that regulatory elements could play a key role in the resultant poor transcription of DEFB genes. Four regions containing DEFB genes and six adjacent regions on chromosomes 6, 8 and 20 were carefully investigated using The Encyclopedia of DNA Elements (ENCODE) information, such as that of DNase I hypersensitive sites (DHSs), transcription factors (TFs) and histone modifications. The results revealed that the intensities of these ENCODE features in the DEFB regions were globally weaker than in the adjacent regions. Notably, DEFB-related regions on chromosomes 6 and 8 containing several non-DEFB genes had lower ENCODE feature intensities, indicating that the absence of DEFB mRNAs might not depend on the gene family but may be reliant upon gene location and chromatin structure.

Key words

CHPP, missing proteins, β-defensin, ENCODE, DHS, histone modification, TF, proteome

Introduction

β-defensin (DEFB) proteins were discovered in humans as antimicrobial peptides in the innate immune system [1]. In response to infection and inflammation, these proteins effectively eliminate bacteria, fungi and enveloped viruses [2]. The structure of DEFBs has been intensively studied. According to the UniProt database, these proteins comprise 61-111 amino acid residues with more than 6 conserved cysteine residues [2] and have a strong charge status that likely plays a key role in their antimicrobial activities [3]. During the arms race between pathogens and the human immune system, DEFBs evolved into a gene family containing multiple paralogs. DEFB family members are assumed to have various activities and functions [4] in the immune response; however, the corresponding behaviors of individual DEFB members have been poorly clarified. A fundamental issue in this field is how to detect DEFB proteins and how to evaluate their presence in cells and tissues.

A total of 34 DEFB genes are annotated in the human genome based on the newest human reference genome, hg19. With the exception of the two DEFB genes located on Chromosome (Chr) 4 and Chr 11, most DEFB genes are distributed on 3 chromosomes in a clustered manner, with 4 on Chr 6, 14 on Chr 8 and 14 on Chr 20 (Figure 1A). Large-scale analysis by RNA-seq or liquid chromatography-tandem mass spectrometry (LC-MS/MS) creates an overall view of gene expression status in cells or tissues. Surprisingly, DEFB mRNA and protein expression levels are very low in several cells compared with expected levels. Although 14 DEFB genes are located on chromosome 8 [5], there are no expressed proteins that correlate with these DEFB genes in three cancer cell lines. Additionally, RNA-seq data for both total mRNAs and ribosome-bound mRNAs only detected the transcriptional form of the DEFB1 gene in these cells. Wang et al. conducted proteomics analysis using LC-MS/MS on gastric, colon and liver tissues from the digestive system [6] and identified no DEFB proteins encoded by chromosome 20. Furthermore, the same group extended the trans-omics study to three liver cancer cell lines to look for the gene expression status of DEFB on chromosome 20 [7]. Similar to their results in human tissues, no DEFB gene products from chromosome 20 were detected by either RNA-seq or LC-MS/MS. The DEFB genes on chromosomes 8 and 20 are not the only ones that are suppressed; the expression of the DEFB genes on chromosome 6 was also barely detected by RNA-seq.

Previous studies indicated that considerable expression of DEFBs was detected in some human tissues, such as skin, oral mucosa and trachea, in response to microbial invaders from the outer environment [8]. Meanwhile, two groups revealed that DEFB expression in lung disease was highly dependent upon inflammation and microbial infection [9-10]. However, the abundance of DEFB109 was found to decrease in human eye tissue under microbial infection, suggesting that its function is more complicated than antimicrobial activity alone [11]. Nonetheless, the global absence of DEFB expression in normally conditioned cells and tissues naturally raises the question of what causal factors lead to this phenomenon. Gene expression is regulated by many elements at various levels. Cell-specificity and chromatin structure are the main factors that impact gene expression at the transcriptional level, while the physicochemical properties of proteins and tissue-location may be the driving forces at the translational level.

The Encyclopedia of DNA Elements (ENCODE) project [12] has generated a wealth of experimental information by mapping diverse chromatin properties in over one hundred human cell lines, including the locations of DNase I hypersensitive sites (DHSs), the associations of transcription factors (TFs) and the presence of histone and CpG modifications. ENCODE offers a solid database to deeply understand the relationship between chromatin features and gene expression. For instance, the DHS data reveal that chromatin loses its condensed structure in certain regions, leading to exposed and accessible genomic DNA that is functionally related to transcriptional activity [13]. TFs, which represent the largest family of proteins and account for approximately 10% of genes, are the critical regulatory elements for gene expression [14-16]. Histone modifications are believed to be involved in both transcriptional activation and repression; the various modified states can change chromatin structure and function by recruiting other enzyme complexes [17-21]. These elements can work together to form a regulatory network for controlling gene expression. The effective utilization of the ENCODE data therefore enables a comprehensive understanding of which factors are critical for the expression of corresponding genes. To date, there have been no systematic investigations examining DEFB gene chromosomal locations and the corresponding chromatin structures. Because the RNA-seq data show solid evidence that DEFB genes are poorly expressed at the transcriptional level, we questioned whether DEFB chromatin characteristics play a role in the repression of these genes in digestive system tissues.

In this communication, we examined mRNA signals encoded by individual DEFB genes in three liver cancer cell lines before extending the survey to examine DEFB mRNAs from over 70 cell lines. Using transcriptomic data, we formed a hypothesis and utilized ENCODE data to study the correlation between the transcriptional expression and chromatin structure of DEFB genes. For the first time, we have attempted to investigate missing proteins from a new angle, focusing on whether unmeasurable protein expression levels are derived from the scarce transcription of corresponding genes.

Methods

1. Data sources

Protein sequences of whole genomes were retrieved from UniProt (http://www.uniprot.org/uniprot/). The transcriptome was determined using next-generation sequencing techniques. The transcriptome data for the three liver cancer cell lines, Hep3B, MHCC97H and HCCLM3, were retrieved from the Gene Expression Omnibus database (accession number: GSE49994). The mRNA data for the HepG2 cell line were collected from the ENCODE dataset at https://www.encodeproject.org/experiments/ENCSR468ION/. The transcriptome data for the 79 cell lines and tissues surveyed by Uhlén’s group [20] were downloaded from the original paper’s supplementary files on the Science website at http://www.sciencemag.org/content/suppl/2015/01/21/. The ENCODE features for the HepG2 cell line were retrieved from the ENCODE databases. The DHS data in narrowPeak format, the histone modification data in broadPeak format and the uniformly processed TF binding data in narrowPeak format were downloaded from https://www.encodeproject.org/experiments/, http://genome.ucsc.edu/cgi-bin/hgFileUi/ and http://genome.ucsc.edu/cgi-bin/hgTrackUi/, respectively. The strict processing of ChIP-seq data described in a guideline from the ENCODE consortium was applied to achieve high-quality data interpretation [21].

2. RNA-seq data processing

The mRNA reads for HepG2, Hep3B, MHCC97H and HCCLM3 were mapped to the Ensembl human genome (release GRCh37.75) and mature mRNA with TopHat (version 2.0.8), using the default parameters. A maximum of two mismatches was allowed in the alignment process. The transcriptome reconstruction and expression quantification were sequentially performed with Cufflinks (version 2.0.2), using the following parameters: (1) -g genes.gtf, (2) -b genome.fa and (3) -u. The protein-coding gene annotation model was extracted from the Ensembl gene annotations in GTF format (release GRCh37.75). The quantitative transcriptome was evaluated and normalized using FPKM (Fragments Per Kilobase of exon per Million fragments mapped).
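As an illustration of the FPKM normalization named above, the following minimal Python sketch applies the textbook definition to a single gene. It is not the Cufflinks estimator, which additionally models isoforms, fragment bias (-b) and multi-mapped reads (-u); the function name and the example numbers are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of exon per million mapped fragments."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and library size must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical gene: 50 fragments on 2,000 bp of exon in a library of 20 million fragments.
example_fpkm = fpkm(fragments=50, exon_length_bp=2000, total_mapped_fragments=20_000_000)
print(example_fpkm)
```

Because the value is scaled by both transcript length and library size, it can be compared across genes and across samples, which is what the later heatmaps of log2 (FPKM) rely on.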

3. Assignment of the comparable scores to the ENCODE features

a) DHS

The DHS score is represented by the accumulated signal values from the DNase-seq reads within a potential regulatory region. A 2-Kb region upstream of the transcription start site (TSS) of each transcript was defined as the potential regulatory region for DHS, and the positions of the DHS peaks were intersected with the positions of the potential regulatory regions. The average DHS score of the corresponding transcripts was assigned as the DHS score for each gene.
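The DHS scoring rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: peaks are simplified to (chromosome, start, end, signal) tuples with half-open coordinates, a peak contributes its full signal value whenever it overlaps the window, and all names and numbers are hypothetical.

```python
from statistics import mean

UPSTREAM_BP = 2000  # 2-Kb window upstream of the TSS, as stated in the text


def upstream_window(tss, strand, size=UPSTREAM_BP):
    """Half-open interval [start, end) covering `size` bp upstream of a TSS."""
    if strand == "+":
        return max(0, tss - size), tss
    # On the minus strand, upstream lies at higher genomic coordinates.
    return tss + 1, tss + 1 + size


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def transcript_dhs_score(chrom, tss, strand, peaks):
    """Accumulate the signal of every peak intersecting the upstream window."""
    start, end = upstream_window(tss, strand)
    return sum(signal for p_chrom, p_start, p_end, signal in peaks
               if p_chrom == chrom and overlaps(p_start, p_end, start, end))


def gene_dhs_score(transcripts, peaks):
    """Average the per-transcript scores to obtain the gene-level score."""
    return mean(transcript_dhs_score(c, tss, strand, peaks)
                for c, tss, strand in transcripts)


# Hypothetical peaks and a hypothetical gene with two transcripts.
peaks = [("chr8", 900, 1100, 10.0), ("chr8", 1900, 2050, 5.0), ("chr8", 5000, 5200, 7.0)]
transcripts = [("chr8", 2000, "+"), ("chr8", 6000, "+")]
example_dhs = gene_dhs_score(transcripts, peaks)
print(example_dhs)
```

Averaging over transcripts keeps genes with many annotated isoforms from accumulating a larger score simply because they have more TSS windows.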

b) Histone modification

The peak signals for the histone modifications derived from the sequencing reads were retrieved from an ENCODE dataset. The histone modification score per gene was defined as the accumulated peak signals that mapped to two regions related to the gene: the TSS region, including 2 Kb upstream of the TSS, and the gene body, both of which were considered to be potential regulatory regions with various histone modifications. For a better estimation of each modification, the scores per gene were calculated in two steps: first separately accumulating the values from the TSS or gene body region and then integrating both values.
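The two-step histone score can be sketched as below. The text does not specify how the two values are integrated, so this sketch assumes a simple sum; it also assumes half-open coordinates and that a peak contributes its full signal on any overlap. All names and numbers are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def region_signal(chrom, start, end, peaks):
    """Accumulated signal of peaks overlapping [start, end) on `chrom`."""
    return sum(signal for p_chrom, p_start, p_end, signal in peaks
               if p_chrom == chrom and overlaps(p_start, p_end, start, end))


def histone_score(chrom, gene_start, gene_end, strand, peaks, upstream=2000):
    """Return (TSS-region score, gene-body score, integrated score) for one mark."""
    if strand == "+":
        tss_region = (max(0, gene_start - upstream), gene_start)
    else:
        tss_region = (gene_end, gene_end + upstream)
    # Step 1: accumulate the two regions separately.
    tss_score = region_signal(chrom, tss_region[0], tss_region[1], peaks)
    body_score = region_signal(chrom, gene_start, gene_end, peaks)
    # Step 2: integrate both values (assumed here to be a plain sum).
    return tss_score, body_score, tss_score + body_score


# Hypothetical broadPeak-like entries for one histone mark.
mark_peaks = [("chr20", 8500, 9500, 4.0), ("chr20", 12000, 13000, 6.0),
              ("chr20", 20000, 21000, 9.0)]
example_histone = histone_score("chr20", 10000, 15000, "+", mark_peaks)
print(example_histone)
```

Note that under this assumption a peak spanning the TSS would be counted in both regions; a real implementation would need to decide whether that double counting is intended.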

c) TF binding

According to the sequencing data acquired from ChIP-seq, the TF binding score for each TF on a gene was assigned as the sum of the peak signals in the potential regulatory region, which extended from 1 Kb upstream of the gene start to the end of the gene. Because the TF peaks were uniformly processed, no additional normalization was required for the TF binding scores of the 59 different TFs.

To characterize the ENCODE features in selected chromatin regions, the scores for each feature in a certain region were first cumulated and then averaged over the number of genes within the region to generate the regional score.
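The TF binding score and the regional score described above can be sketched together. This is an illustrative Python sketch with hypothetical names and data, assuming half-open coordinates and full-signal counting on any overlap; it is not taken from the authors' code.

```python
def tf_binding_score(chrom, gene_start, gene_end, strand, peaks, upstream=1000):
    """Sum one TF's peak signals from 1 Kb upstream of the gene start to the gene end."""
    if strand == "+":
        start, end = max(0, gene_start - upstream), gene_end
    else:
        start, end = gene_start, gene_end + upstream
    return sum(signal for p_chrom, p_start, p_end, signal in peaks
               if p_chrom == chrom and p_start < end and start < p_end)


def regional_score(gene_scores):
    """Cumulate the per-gene scores in a region, then average over the gene count."""
    if not gene_scores:
        raise ValueError("a region must contain at least one gene")
    return sum(gene_scores) / len(gene_scores)


# Hypothetical peaks for a single TF and one hypothetical plus-strand gene.
tf_peaks = [("chr6", 49500, 49700, 2.5), ("chr6", 51000, 51100, 1.5),
            ("chr6", 48000, 48500, 8.0)]
example_tf = tf_binding_score("chr6", 50000, 52000, "+", tf_peaks)

# Hypothetical per-gene scores for three genes in one region.
example_region = regional_score([0.0, 3.0, 9.0])
print(example_tf, example_region)
```

Dividing by the number of genes makes regions of different gene density comparable, which matters when a gene-poor cluster is compared with its gene-rich neighbors.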

4. Data analysis

To present the chromosomal locations of the DEFB genes, an ideogram was generated using the online tool Idiographica, which was provided by the Computational Biology Research Center (CBRC) (http://www.ncrna.org/idiographica/) using the hg19 reference genome. To characterize the gene expression profiles in the selected cell lines and the TF binding landscape for the individual genes or selected regions, a heatmap was generated using the “pheatmap” package in R [22] by plotting the log2 (FPKM) of the DEFB genes versus the cell lines and tissues or the log2 (TF binding scores) of each TF versus the region of interest. To illustrate the physicochemical properties of the theoretical and tryptic peptides derived from the DEFB proteins and the whole genome, in-house Perl programs were used according to the trypsin digestion principle or the grand average of hydropathy (GRAVY) [23]. The distribution curves for these properties were plotted with the “density” function in R. To reveal the correlation between gene expression and the simplified DEFB-related reference regions, the log2 (FPKM) of each gene located in the selected regions was plotted against its chromosomal position using R. To summarize the histone modification statuses, the scores for each histone modification were cumulated for every DEFB-related region and its neighboring reference regions, and then 4 region groups were formed, each one consisting of the DEFB-related region and its reference region(s). The region fractions in each group were calculated by dividing the score in a region by the sum of the scores in the group representing a specific modification. A stacked bar plot was generated with the region fractions using the ggplot2 package in R.
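The region-fraction calculation behind the stacked bar plot can be sketched in a few lines of Python. The region labels and scores below are hypothetical placeholders, and the handling of an all-zero group is an assumption, since the text does not cover that case.

```python
def region_fractions(group_scores):
    """Divide each region's score by the group total for one histone modification."""
    total = sum(group_scores.values())
    if total == 0:
        # Assumption: a mark with no signal anywhere in the group yields zero fractions.
        return {region: 0.0 for region in group_scores}
    return {region: score / total for region, score in group_scores.items()}


# Hypothetical cumulated scores for one modification in one region group:
# a DEFB-related region flanked by two reference regions.
group = {"cluster": 10.0, "reference_5p": 45.0, "reference_3p": 45.0}
example_fractions = region_fractions(group)
print(example_fractions)
```

The fractions within a group sum to 1, so each stacked bar shows how one modification's signal is shared between the DEFB-related region and its references, independent of the absolute signal level of that mark.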

Results and discussion

1. The expression status of DEFB genes and the physical-chemical characteristics of DEFB proteins

In the chromosome proteomics study carried out by the Chinese Human Chromosome Proteome Consortium (CCPC), the three -omics datasets, transcriptome, translatome and proteome, were generated from three hepatocellular carcinoma (HCC) cell lines, Hep3B, MHCC97H and HCCLM3 [7]; all of the sequencing data are available from the Gene Expression Omnibus database (accession number: GSE49994). In the CCPC dataset, the total number of protein entries encoded by the human genome is 19448. In contrast, the number of missing protein entries is 2936, as defined by Lane et al. [24]. As most DEFB proteins are missing proteins, we specifically looked through the dataset to find the expression status of the DEFB genes. There were no protein identification signals in the CCPC, and there were only a few mRNA signals for DEFB genes detected in the three hepatoma cell lines. According to the newest version of the human genome database, hg19, with the exception of Chr 4 and Chr 11 that only have one DEFB gene on each (not shown), most DEFB genes are distributed on 3 chromosomes in a clustered manner, with 4 on Chr 6, 14 on Chr 8 and 14 on Chr 20 (Figure 1A). In the CCPC study, only one DEFB mRNA was detected for each of the 3 chromosomes: DEFB112 from Chr 6, DEFB1 from Chr 8 and DEFB126 from Chr 20. DEFB112 and DEFB126 mRNA were found at a very low abundance and DEFB1 mRNA was shown to be expressed at a relatively high abundance in the three cell lines. This indicates the universal absence of clustered DEFB gene expression in the hepatoma cell lines. Next, we checked DEFB gene expression in HepG2, a hepatoma cell line that is widely used to study liver cancer and whose chromatin structure was deeply mined in ENCODE. The expression patterns for the DEFB genes were similar to our observations in the CCPC data (Supplementary Figure 2), except that the abundance of the DEFB1 mRNA was relatively low. To minimize false positive results, we allowed only 2 mismatches in the analysis of RNA-seq data. However, this setting might be so strict that some DEFB expression products with nucleotide variants were missed. On the basis of the observations from Taudien et al. and Groth et al., DEFBs on Chr 8 could possess more than 10 single nucleotide variations on average [25, 26]. Therefore, the absence of detectable DEFB mRNAs needs to be further evaluated using a more specific detection approach.

At least one of the following four causal factors is likely to lead to missing proteins: (1) tissue- or cell-type-dependent gene expression, (2) gene misannotation, (3) high specificity of peptide features such that the relevant MS signals are difficult to detect, or (4) the negative regulation of gene expression by specific chromatin features. Therefore, we inquired as to which causal factor led to the extremely low expression status of DEFB genes.

We first sought to find whether the universal absence of clustered DEFB gene expression in different chromosomes was unique to the hepatoma cell lines. The DEFB mRNAs in the published database were examined to address this question. Based on the RNA-seq database published by Uhlén’s group, the transcriptomes of 76 different cell lines and tissue types were surveyed [20]. The results are presented as a heatmap with the normalized expression of each DEFB gene versus the 76 cell lines and tissue types (Figure 1B). According to the RNA-seq data, DEFB gene expression was very low in most cell lines and tissues, with abundance values less than 5. This transcriptome dataset did not include tissues or cell lines affected by microbial infection, suggesting that low DEFB mRNA expression is a common characteristic of normal major tissues and cell lines. Additionally, as shown in Figure 1B, the gene DEFB1, which displays high expression levels, was found in many cell lines, implying that this gene may be located at a special site on a specific chromosome or perform certain functions under stimulation.

Next, we questioned whether the low rate of identification of DEFB mRNAs resulted from the incorrect annotation of these genes. We reviewed the annotation status of 34 DEFB genes in the UniProt database (http://www.uniprot.org/). Whether a gene encoding a specific protein is well annotated depends on three lines of evidence: antibody assays, mRNA arrays or RNA-seq, and gene confirmation across different references. All of the annotation information for the DEFB genes is summarized in Supplementary Table 1. Based on the experimental evidence, the existence of most DEFBs is supported by RNA-seq data, except for DEFB108 and DEFB109. Sixty percent of DEFB transcripts were identified by mRNA array and only 18% of DEFB translated products were found by antibody assays. Furthermore, the presence of DEFB was examined across multiple gene expression databases, including EMBL, CCDS, UniGene and neXtProt. DEFB108 and DEFB109 were excluded from most databases, though the presence of the other 32 DEFBs was supported by all of these databases. As the majority of DEFBs were detected in their transcriptional forms and the detection of DEFB transcription was generally found in several references, the annotation of the DEFB genes was acceptable.

Proteomics analysis is performed based on the identification of tryptic peptides by MS. Whether a peptide is measurable is partially determined by its length, one of the primary limitations of mass spectrometry. Swaney et al. evaluated the correlation between peptide length and detectability using MS with yeast proteins [27] and found that a 7 to 35 amino acid residue peptide can be detected by MS. Additionally, the parameter for ‘scan range’ on most mass spectrometers is set as an m/z limit of 350, indicating that measurable peptides should be longer than 6 amino acids. As depicted in Figure 1C, the fractions of tryptic peptides at certain lengths were plotted against the peptide lengths; the tryptic peptides were theoretically generated from the DEFB family and the total proteins encoded by the human genome. The similar distribution curves revealed that approximately 60% of the tryptic peptides were quite short for both DEFB and whole genome proteins; however, the tryptic peptides with suitable lengths detectable by MS still made up a high percentage, with 36% for DEFB and 44% for whole genome proteins. Specifically, the tryptic peptides generated from each DEFB protein are summarized in Supplementary Figure 1, which suggests that the DEFB proteins contained at least 2 suitable peptides for MS detection and that over 63% of DEFB proteins possessed at least 5 such peptides. Hydrophobicity in a peptide is an important factor for peptide separation with reverse-phase HPLC and peptide ionization in electrospray. The hydrophobicity distributions described by the GRAVY score for the tryptic peptides from DEFB and the whole human genome proteins are illustrated in Figure 1D. This suggests that the distribution of total proteins is relatively symmetric, while the distribution of DEFB was slightly shifted to the right, with a GRAVY score of approximately 0.3. Moreover, the fractional sums of tryptic peptides over a 0.5 GRAVY score for DEFB and the whole genome proteome were 18% and 11%, respectively, indicating that DEFB proteins could produce more peptides with relatively higher hydrophobicities. Taken together, we concluded that DEFB peptides did not show any strong biases versus other human proteins in the proteomics analysis with LC-MS/MS.
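The peptide-level analysis above can be illustrated with a short Python sketch of in-silico tryptic digestion followed by a GRAVY calculation. It is not the authors' Perl code: the cleavage rule (after K or R, but not before P, with no missed cleavages) is the commonly used trypsin rule and is assumed here, the hydropathy values are the Kyte-Doolittle scale, and the example sequence is hypothetical. The 7 to 35 residue window is the detectable range quoted in the text.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def tryptic_peptides(sequence):
    """Cleave C-terminal to K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide without K/R
    return peptides


def gravy(peptide):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[residue] for residue in peptide) / len(peptide)


def ms_detectable(peptides, min_len=7, max_len=35):
    """Keep peptides whose length falls in the 7 to 35 residue window cited in the text."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Hypothetical sequence, chosen only to exercise the cleavage rule.
example_peptides = tryptic_peptides("MKAAGRPLLKVC")
example_detectable = ms_detectable(example_peptides)
print(example_peptides, example_detectable, gravy("MK"))
```

Applying such a digestion to every protein and tabulating peptide length and GRAVY would give the kind of distributions summarized in Figures 1C and 1D.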

By examining the biological and physicochemical properties of DEFB, it is difficult to draw specific conclusions regarding what specific factor is decisive in leading to the low transcriptional and translational levels of the DEFB genes. Thus, we next examined the chromatin structure to determine whether it affected the transcriptional status of the DEFB genes.

2. Regionalization of DEFBs and their adjacent genes on corresponding chromosomes

Gene expression is controlled by many classes of cis-regulatory elements, including enhancers, promoters, insulators, silencers and locus control regions. The ENCODE data are expected to annotate these elements and to reveal novel relationships between chromatin accessibility, transcription [12], DNA methylation and regulatory factor occupancy patterns [18]. As discussed above, the DEFB genes were correctly annotated at the genomic level but their transcriptional signals were rarely detected. This prompted us to inquire as to whether the specific chromatin structures at DEFB genes are involved in the regulation of DEFB gene expression. Although several regulatory elements have been well studied in ENCODE, we focused on three main elements, DHS, histone methylation patterns and transcription factor binding, and scrutinized the relationships among the chromatin at the DEFB genes and these elements.

The DHS data identify chromatin regions that are sensitive to cleavage by the enzyme DNase I; these accessible chromatin zones are functionally related to transcriptional activity [28]. DHSs are somewhat selective according to cell type. In ENCODE, 125 different human cell types have been examined by DHS analysis and approximately 30% of DHSs were specific to each cell type. In our dataset, the three liver cancer cell lines were used for a trans-omics survey; however, none of them were studied by ENCODE. How do we utilize the ENCODE information to understand the transcriptional status of the DEFB genes in these cell lines? We selected all of the liver cell lines available in ENCODE and downloaded their transcriptome data. By normalizing the transcriptional abundance in the liver cells stored in ENCODE and in the three cell lines that we studied, we were able to globally evaluate the transcriptional patterns in our cell lines. We found that the HepG2 cell line shared similar transcriptional patterns with our three liver cancer cell lines, and reasoned that the ENCODE information on HepG2 was appropriate for exploring the role of DHS on DEFB gene expression.

Approximately 91% of DEFB genes are located on three chromosomes: Chr 6 (5 genes), Chr 8 (7 genes) and Chr 20 (14 genes). Importantly, 90% of the DEFB genes on these three chromosomes form clustered structures. We combined the genomic locations of the DEFB genes with the ENCODE transcriptome data for HepG2 cells to survey the relationship between the transcriptional level of the DEFB genes and the corresponding chromatin regions. On Chr 6, the DEFB gene cluster consisting of DEFB133, DEFB114, DEFB113, DEFB110 and DEFB112 did not have significant RNA-seq signals, and its neighboring regions, which extended from 0.45 Mb upstream of the 5’ end to 2.1 Mb downstream of the 3’ end of the cluster, were poorly transcribed as well. On Chr 8, the DEFB gene cluster consisting of DEFB107B, DEFB105A, DEFB106A, DEFB104B, DEFB103A, DEFB4A, DEFB109P1B and DEFB107A had very low mRNA expression levels. The neighboring regions of this DEFB cluster, which extended from 1 Mb upstream of the 5’ end to 0.23 Mb downstream of the 3’ end of the cluster, also displayed poor transcription. In contrast to the DEFB clusters found on Chr 6 and Chr 8, the DEFB genes on Chr 20 make up two distant clusters, which lie close to the telomere and centromere regions of Chr 20. Transcription was similarly scarce in the two DEFB gene clusters, whereas the flanking regions contained genes with much higher transcription levels. The chromosome locations of the DEFB genes and their expression statuses are summarized in Figure 2. To characterize the ENCODE features, such as DHS, histone modification and TF binding, for the DEFB cluster regions and to compare them with those of other chromatin regions, we delineated the DEFB-related regions and their neighboring reference regions. On Chr 6 and Chr 8, the DEFB region was centered on the DEFB cluster and expanded to the boundaries of low transcription; the resulting regions spanned approximately 2.7 Mb on Chr 6 and 1.5 Mb on Chr 8 and were termed DEFB-c6 and DEFB-c8, respectively. The reference regions were selected to contain genes with relatively higher transcription levels. On Chr 6, the reference regions comprised the 2.7 Mb flanking the 3’ and 5’ ends of DEFB-c6 and were named DEFB-r6-3’ and DEFB-r6-5’, respectively; on Chr 8, the reference regions comprised the 1.5 Mb extending from the 3’ and 5’ ends of DEFB-c8 and were named DEFB-r8-3’ and DEFB-r8-5’, respectively. On Chr 20, the two DEFB clusters span 216 Kb and 174 Kb and are termed DEFB-ca20 and DEFB-cb20, respectively. Because the regions close to the 3’ ends of DEFB-ca20 and DEFB-cb20 contain genes with higher transcription levels, whereas the telomere or centromere of Chr 20 lies close to the 5’ ends of these clusters, the reference regions for DEFB-ca20 and DEFB-cb20 are the regions extending 216 Kb and 174 Kb beyond their 3’ ends, respectively, and are termed DEFB-ra20 and DEFB-rb20. All of the regions used in the subsequent ENCODE analyses are labeled in Figure 2.
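
The region definitions above amount to taking a core DEFB-related interval and cutting reference windows of the same length on one or both sides. The following is a minimal sketch of that idea, not the authors' procedure: the `Region` type, the function name, the half-open coordinate convention and all coordinates are hypothetical, and "left"/"right" is used in place of 5’/3’ to avoid assuming a strand.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    name: str
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def flanking_references(
    core: Region,
    chrom_length: int,
    sides: Tuple[str, ...] = ("left", "right"),
) -> List[Region]:
    """Return reference windows with the same length as `core`.

    Windows are clipped to the chromosome; a window that would be empty
    (for example, when the core touches a chromosome end) is dropped.
    """
    width = core.end - core.start
    candidates = {
        "left": Region(core.name + "-left", core.chrom,
                       max(0, core.start - width), core.start),
        "right": Region(core.name + "-right", core.chrom,
                        core.end, min(chrom_length, core.end + width)),
    }
    return [candidates[s] for s in sides if candidates[s].end > candidates[s].start]


# Hypothetical coordinates, for illustration only.
core = Region("DEFB-core", "chrN", 10_000, 12_700)
for ref in flanking_references(core, chrom_length=100_000):
    print(ref)
```

Passing `sides=("right",)` gives the one-sided layout described for the Chr 20 clusters, where only the region beyond one end is used as a reference.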

To compare the ENCODE features across the regions selected above, we adopted a stepwise analysis. First, the genes in the corresponding regions with or without detectable transcription signals were counted. For the DEFB regions, 22 genes including 5 DEFB genes were found within the DEFB-c6 region, 35 genes including 7 DEFB genes within the DEFB-c8 region, exclusively 8 DEFB genes within the DEFB-ca20 region and 6 DEFB genes within the DEFB-cb20 region. For the reference regions, 14, 34, 7, 11, 9 and 6 genes were found in the DEFB-r6-5’, DEFB-r6-3’, DEFB-r8-5’, DEFB-r8-3’, DEFB-ra20 and DEFB-rb20 regions, respectively (Figure 2). Next, the scores of the ENCODE features were weighted for each gene in these 10 defined regions according to the weighting treatments described in the “Methods” section. Finally, the scores of each ENCODE feature for all of the genes in each region were summed, and the differences in the ENCODE features between the DEFB regions and their reference regions were statistically evaluated using the paired Wilcoxon test.
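
The final step of this procedure, a paired comparison of summed scores, can be sketched as follows. This is an illustrative implementation of the exact two-sided Wilcoxon signed-rank test for small samples, written with the standard library only; how the pairs were actually formed and weighted is described in the Methods and is not reproduced here, so the pairing and the numbers below are assumptions.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test (small n only).

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    observed = min(w_plus, total - w_plus)

    # Enumerate all 2**n sign assignments for the exact null distribution.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return w_plus, hits / 2 ** n


# Hypothetical summed scores: one reference value paired with one DEFB value.
reference = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
defb = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(wilcoxon_signed_rank(reference, defb))
```

With six pairs that all differ in the same direction, the exact two-sided p-value is 2/64 = 0.03125, which is the smallest value attainable at that sample size.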

3. Analysis of the ENCODE features in the DEFB and reference regions

1) DHS

The DHS distribution along the DEFB-related and reference regions was analyzed with Genome Browser (29). As depicted in Supplementary Figure 3, which shows the typical DHS distribution of the DEFB-ca20, DEFB-cb20, DEFB-ra20 and DEFB-rb20 regions, DHS signals were rarely observed in DEFB-ca20 and DEFB-cb20, which contained only a few DHS sites with weak signal intensities. In contrast, DHS sites were widely identified in DEFB-ra20 and DEFB-rb20, with more sites and higher intensities. A similar phenomenon was observed on Chr 6 and Chr 8. Importantly, the genes neighboring the DEFBs on Chr 6 or Chr 8 also appeared to have low DHS signals. This implies that the coexistence of poor transcription and decreased DHS signal is not specific to the DEFB gene family, because the DEFB-c6 and DEFB-c8 regions contain many non-family genes. The overall impact of DHS on the ten selected regions was further evaluated by accumulating the DHS scores per gene in each region. Figure 3A shows that the sums of the DHS signals in all of the DEFB neighbor regions, regardless of chromosome, were clearly higher than those in the corresponding DEFB-related regions. On a log10 axis for DHS intensity, the score difference between the DEFB-related and reference regions on Chr 6 or Chr 20a was hundreds-fold or more, while that between the DEFB-related and reference regions on Chr 8 or Chr 20b was almost infinite. Taken together, the DHS distribution and the accumulated scores indicate that the poor transcriptional expression of the DEFB genes closely correlates with low DHS intensities.
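
The accumulation of per-gene DHS scores and the comparison on a log10 scale can be illustrated with a short sketch. The function names, gene names and scores are hypothetical, and the treatment of a zero total as an infinite fold difference is an assumption made here to mirror the "almost infinite" differences mentioned above.

```python
import math


def region_sums(gene_scores, gene_region):
    """Sum per-gene scores into per-region totals."""
    totals = {}
    for gene, score in gene_scores.items():
        region = gene_region[gene]
        totals[region] = totals.get(region, 0.0) + score
    return totals


def log10_fold(reference_total, defb_total):
    """Return log10(reference / DEFB); infinite when the DEFB total is zero."""
    if reference_total <= 0:
        raise ValueError("reference total must be positive")
    if defb_total == 0:
        return math.inf
    return math.log10(reference_total / defb_total)


# Hypothetical per-gene DHS scores and region assignments.
scores = {"geneA": 5.0, "geneB": 0.0, "geneC": 300.0, "geneD": 200.0}
regions = {"geneA": "DEFB-related", "geneB": "DEFB-related",
           "geneC": "reference", "geneD": "reference"}
totals = region_sums(scores, regions)
print(totals, log10_fold(totals["reference"], totals["DEFB-related"]))
```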

2) Histone modifications

For the HepG2 cell line, eleven histone modifications were listed in ENCODE, including eight different methylations, two different acetylations and the histone variant H2A.Z (H2az). By examining all 11 modification signal tracks in the ten regions in Genome Browser, we found that the signals of most of the modifications (8 of 11) showed a specific pattern between the DEFB-related and reference regions (Supplementary Figure 4). Importantly, these modifications changed synchronously with the fluctuation in gene expression within these regions. Meanwhile, three histone modifications, H2az, H3k27me3 and H3k9me3, did not display a consistent pattern in their signals in these regions; however, the signals of these three in the DEFB-related regions were indeed different from those in the reference regions. It has been reported that histone modifications are mainly enriched in two genomic regions, the TSS and the gene body (30). To estimate the overall histone modification, we developed an assembling algorithm that integrates the histone modification scores from both the region upstream of the TSS and the entire gene body. With this approach, we further surveyed the modification scores of each gene in these regions based on the ENCODE datasets. As depicted in Figure 3B, the signals of the 8 modifications on the right side were significantly enriched in the reference regions; specifically, the enrichment ratio per such modification was 92% in these regions. The large difference in the enrichment ratios shown in Figure 3B suggests that these histone modifications could positively regulate gene transcription on all of the chromosomes of interest, because the genes in the reference regions exhibited higher transcription levels than those in the DEFB-related regions. In contrast, among the other three histone modifications, H3k27me3 and H3k9me3 have been reported to be regulatory elements that repress transcriptional expression (31). Moreover, a strong bias toward the reference regions was observed for H3k9ac and H3k27ac, implying activated function of histone deacetylases (HDACs) in the DEFB-related regions, which is in accordance with a previous study on the HDAC activity of DEFBs (32). In Figure 3B, the enrichment ratios of the other three modifications in the DEFB-related regions were clearly higher than those of the modifications on the right side, especially in the Chr 20a and Chr 6 groups, whose proportions had increased by more than 50% compared with the average proportion of the eight modifications on the right side. This distinguishing histone modification pattern between the DEFB-related regions and their reference region(s) suggests that chromatin structure and the scarce expression of genes in the DEFB-related regions are highly correlated.
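
To make the notion of an enrichment ratio concrete, the sketch below combines a TSS-upstream signal and a gene-body signal into one per-gene score and then computes the share of the total signal that falls in the reference region. The plain sum used to combine the two windows and the definition of the ratio as reference / (reference + DEFB-related) are assumptions for illustration; the actual assembling algorithm and weighting are described in the Methods.

```python
def gene_histone_score(tss_upstream_signal, gene_body_signal):
    """Combine the two windows into one per-gene score (plain sum, an assumption)."""
    return tss_upstream_signal + gene_body_signal


def enrichment_fraction(defb_scores, reference_scores):
    """Share of one modification's total signal that lies in the reference region."""
    defb_total = sum(defb_scores)
    ref_total = sum(reference_scores)
    total = defb_total + ref_total
    if total == 0:
        return None  # no signal in either region; the fraction is undefined
    return ref_total / total


# Hypothetical per-gene scores for a single modification.
defb = [gene_histone_score(0.5, 0.5), gene_histone_score(1.0, 0.0)]
reference = [gene_histone_score(6.0, 5.5), gene_histone_score(10.0, 1.5)]
print(enrichment_fraction(defb, reference))
```

Under this definition, a value near 1 means the modification is concentrated in the reference region, and a value near 0.5 means the two regions carry similar signal.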

3) Correlation between gene expression, DHS and histone modification in the TSS

As reported by He et al., the DHS status is crucial for the binding of cistrome elements involved in histone modification (33). We examined the relationships among each histone modification score, the DHS score and the FPKM values in the TSS regions after normalization to their total scores. As depicted in Supplementary Figure 5, several factors appeared to be correlated. DHS and FPKM had a correlation index of 0.73, indicating that DHS-enriched regions favor gene transcription. The majority of the active regulatory histone modifications had average correlation indexes of more than approximately 0.5 with DHS or FPKM, meaning that the correlation is acceptable, especially given that only 120 genes were included in the statistical estimation. Surprisingly, some of the active regulatory histone modifications, such as H3k4me2, H3k4me3, H3k9ac, H3k27ac and H3k79me2, correlated tightly with one another, with indexes greater than 0.8 on average, implying that they coexist in the TSS regions of the genes of interest and that their functions in regulating gene transcription are similar. Several types of histone acetylation, including H3k9ac and H3k27ac, have been defined as marks of active transcription, as reviewed by Selvi and Kundu (34). Andresen et al. reported that, in contrast to the suppression of gene expression induced by activated histone deacetylases (HDACs), the abundance of some HDACs, such as HDAC1-3, HDAC5 and HDAC8, was positively correlated with the expression status of DEFB1 in bronchial epithelial cell biopsies and bronchoalveolar lavage (BAL) fluids (35). We checked the large-scale transcriptome data of DEFBs and HDACs from the 70 cell lines and tissues and found a poor correlation between the HDACs and DEFBs (r² = 0.1357 for DEFBs and r² = 0.0853 for DEFB1), implying that the observation of Andresen et al. is not common across cell lines and tissues. As mentioned above, we directly examined the acetylation of H3k9 and H3k27 in the ENCODE dataset and found that their absence in the DEFB-related regions correlated with the poor transcription of DEFBs (Figure 3B). We thus assume that HDACs in these cells and tissues play a common role in regulating the transcription of DEFB genes.
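
The correlation indexes and r² values quoted above can be illustrated with a small sketch. It assumes a Pearson correlation, which the text does not state explicitly, and uses made-up values; `normalize_to_total` mirrors the "normalization to their total scores" step in the simplest possible way.

```python
import math


def normalize_to_total(values):
    """Scale values so that they sum to 1 (assumed form of the normalization)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [v / total for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-gene DHS scores and FPKM values.
dhs = normalize_to_total([1.0, 4.0, 2.0, 8.0, 5.0])
fpkm = normalize_to_total([0.5, 3.0, 2.5, 6.0, 3.0])
r = pearson_r(dhs, fpkm)
print(r, r ** 2)
```

Because the Pearson coefficient is unchanged by rescaling either variable, normalizing each feature to its total affects the plotted values but not the coefficient itself.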

4) TF binding

We collected data on the binding positions and normalized intensities of 59 TFs in the HepG2 cell line, which were derived from ChIP-seq data in ENCODE. To estimate the overall TF effect on the regulation of transcription in specific regions, the TF score of a gene was defined as the TF binding intensity within its potential regulatory region. The TF scores for all of the 122 genes included in the 10 selected regions are summarized in Supplementary Figure 6. The genes in the DEFB-related regions generally possessed lower TF occupancy than those in their reference regions, with 68% and 16% of the genes free from TF binding, respectively. Interestingly, three genes in the DEFB-c6 region, DEFB107B, IL17A and PKHD1, had relatively high TF scores. Nine genes with no TF scores were found in 5 reference regions on the 3 chromosomes, suggesting that the TF score is not the sole factor determining gene transcription. To further illustrate the association of TF binding with gene expression in these regions, we evaluated the distribution of the average TF scores along them, which represent the TF binding intensity per gene. As presented in Figure 3C, in terms of TF classes, the fewest TFs were detected in the DEFB-ca20 and DEFB-cb20 regions, while the TF distribution in the other regions was relatively comparable. Moreover, in terms of TF binding intensity, the average TF scores in the DEFB-related regions were clearly lower than those in the reference regions, regardless of the chromosome. These TF assessments imply that the weak transcription of the DEFB-related regions on the 3 chromosomes likely resulted from fewer or weaker TF binding events in the corresponding regulatory regions.
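
A per-gene TF score of the kind described above, the summed signal of ChIP-seq peaks falling in a gene's potential regulatory region, can be sketched as follows. The peak tuple layout, the half-open intervals, the overlap rule and every coordinate are assumptions for illustration; how the potential regulatory region is delimited is defined in the Methods, not here.

```python
def tf_score(peaks, chrom, reg_start, reg_end):
    """Sum the signal of peaks overlapping [reg_start, reg_end) on `chrom`.

    Each peak is a (chrom, start, end, signal) tuple with a half-open interval.
    """
    return sum(
        signal
        for c, start, end, signal in peaks
        if c == chrom and start < reg_end and end > reg_start
    )


def fraction_unbound(scores):
    """Fraction of genes whose TF score is exactly zero."""
    if not scores:
        raise ValueError("no genes given")
    return sum(1 for s in scores if s == 0) / len(scores)


# Hypothetical peaks and regulatory windows.
peaks = [("chr6", 100, 200, 5.0), ("chr6", 900, 950, 2.0), ("chr8", 100, 200, 7.0)]
gene_windows = {"geneA": ("chr6", 0, 500), "geneB": ("chr6", 2000, 2500)}
scores = {g: tf_score(peaks, *w) for g, w in gene_windows.items()}
print(scores, fraction_unbound(list(scores.values())))
```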

Based on the information shown in Supplementary Figure 6, there are three important aspects to note. First, a comparison of TF binding status across the regions located on the different chromosomes showed that TF binding was markedly enriched on Chr 6. This observation agrees with a previous study showing that TF binding displays a biased distribution along chromosomes (36). Second, of the 59 TFs, only 5 (USF1, RXRA2, CEBPB, MAFK and MAFF) displayed binding intensities that were positively correlated with the transcriptional abundance of individual genes. In contrast, over 90% of the TFs exhibited little correlation between TF binding and transcriptional levels in the HepG2 cell line (Supplementary Figure 7). Finally, of all of the TFs listed in the ENCODE database, approximately half were related to the HepG2 cell line, and only 24% were suggested to be consistent transcriptional activators by Factorbook (http://factorbook.org/mediawiki/index.php/). In the human genome, over 1300 TFs have DNA binding affinity (37); however, only 300 have been experimentally examined to date according to the Human Transcriptional Regulation Interactions database (HTRIdb) (38). This implies that the current ENCODE data for the HepG2 cell line are still too limited to reveal the overall correlation between TFs and DEFB transcriptional status. Moreover, Taudien et al. found that copy number variations (CNVs) are enriched in the DEFB-related regions, and CNVs have been assumed to be a regulatory factor that alters DEFB expression, with an influence interwoven with that of TF binding (39). This complication may therefore also confound the interpretation of how TF binding data relate to DEFB expression.

Conclusions

The survey of transcriptomes and proteomes in three liver cancer cell lines, acquired from RNA-seq and LC-MS/MS data, presents clear evidence that DEFB gene expression products are scarcely detected. Extensive surveys of transcriptomes in over 70 cell lines further support this observation, indicating that the infrequent expression of DEFB genes is not cell-type dependent. We therefore hypothesized that the weak transcription of DEFB genes is influenced by transcriptional regulatory elements. We reasoned that the information on these elements in the ENCODE database is sufficient to test this hypothesis. Three main ENCODE features were used to systematically scrutinize the DEFB genes and their adjacent regions on three chromosomes. Compared with the adjacent regions, all of the DEFB regions, regardless of which chromosome they were located on, exhibited significantly weaker signals in the three ENCODE features. Although DEFB proteins are missing proteins, our data suggest that their absence directly results from the scarce transcription of their genes; the chromatin structure around the DEFB genes exerts a repressive effect on gene expression as well. The DEFB-related regions on Chr 6 and 8 contain non-DEFB genes that also lack transcriptional evidence. This suggests that the weak transcription of DEFB genes is not a special feature of the gene family but depends on their location and chromatin structure.

This material is available free of charge via http://pubs.acs.org/. It contains Supplementary Figures S1-S7 and Table S1.

ACKNOWLEDGMENTS

This study was supported by the International Science & Technology Cooperation Program of China (2014DFB30020), the Chinese National Basic Research Programs (2014CBA02002, 2014CBA02005), and the National High-Tech Research and Development Program of China (2012AA020202).

Conflict of Interests Statement:

The authors declare no competing financial interest.

References

(1) White, S. H.; Wimley, W. C.; Selsted, M. E. Structure, function, and membrane integration of defensins. Curr. Opin. Struct. Biol. 1995, 4, 521–527.

(2) Schröder, J.-M.; Harder, J. Human β-defensin-2. Int. J. Biochem. Cell Biol. 1999, 31, 645–651.

(3) Taylor, K.; Barran, P. E.; Dorin, J. R. Review: Structure-activity relationships in β-defensin peptides. Biopolymers 2008, 90, 1–7.

(4) Semple, C. A. M.; Taylor, K.; Eastwood, H.; Barran, P. E.; Dorin, J. R. β-defensin evolution: selection complexity and clues for residues of functional importance. Biochem. Soc. Trans. 2006, 34, 257–262.

(5) Liu, Y.; Ying, W.; Ren, Z.; Gu, W.; Zhang, Y.; Yan, G.; Yang, P.; Liu, Y.; Yin, X.; Chang, C.; et al. Chromosome-8-coded proteome of Chinese Chromosome Proteome Data Set (CCPD) 2.0 with partial immunohistochemical verifications. J. Proteome Res. 2014, 13, 126–136.

(6) Wang, Q.; Wen, B.; Yan, G.; Wei, J.; Xie, L.; Xu, S.; Jiang, D.; Wang, T.; Lin, L.; Zi, J.; et al. Qualitative and quantitative expression status of the human chromosome 20 genes in cancer tissues and the representative cell lines. J. Proteome Res. 2013, 12, 151–161.

(7) Wang, Q.; Wen, B.; Wang, T.; Xu, Z.; Yin, X.; Xu, S.; Ren, Z.; Hou, G.; Zhou, R.; Zhao, H.; et al. Omics evidence: Single nucleotide variants transmissions on chromosome 20 in liver cancer cell lines. J. Proteome Res. 2014, 13, 200–211.

(8) Bals, R.; Wang, X.; Wu, Z.; Freeman, T.; Bafna, V.; Zasloff, M.; Wilson, J. M. Human beta-defensin 2 is a salt-sensitive peptide antibiotic expressed in human lung. J. Clin. Invest. 1998, 102, 874–880.

(9) Singh, P. K.; Jia, H. P.; Wiles, K.; Hesselberth, J.; Liu, L.; Conway, B. A.; Greenberg, E. P.; Valore, E. V.; Welsh, M. J.; Ganz, T.; et al. Production of beta-defensins by human airway epithelia. Proc. Natl. Acad. Sci. U. S. A. 1998, 95, 14961–14966.

(10) McNamara, N. A.; Van, R.; Tuchin, O. S.; Fleiszig, S. M. Ocular surface epithelia express mRNA for human beta defensin-2. Exp. Eye Res. 1999, 69, 483–490.

(11) Abedin, A.; Mohammed, I.; Hopkinson, A.; Dua, H. S. A novel antimicrobial peptide on the ocular surface shows decreased expression in inflammation and infection. Invest. Ophthalmol. Vis. Sci. 2008, 49, 28–33.

(12) ENCODE consortium. The ENCODE (Encyclopedia of DNA Elements) Project. Science 2004, 306, 636–640.

(13) Djebali, S.; Davis, C. A.; Merkel, A.; Dobin, A.; Lassmann, T.; Mortazavi, A.; Tanzer, A.; Lagarde, J.; Lin, W.; Schlesinger, F.; et al. Landscape of transcription in human cells. Nature 2012, 489, 101–108.

(14) Spivakov, M.; Akhtar, J.; Kheradpour, P.; Beal, K.; Girardot, C.; Koscielny, G.; Herrero, J.; Kellis, M.; Furlong, E. E. M.; Birney, E. Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 2012, 13, R49.

(15) Whitfield, T. W.; Wang, J.; Collins, P. J.; Partridge, E. C.; Aldred, S. F.; Trinklein, N. D.; Myers, R. M.; Weng, Z. Functional analysis of transcription factor binding sites in human promoters. Genome Biol. 2012, 13, R50.

(16) Djebali, S.; Davis, C. A.; Merkel, A.; Dobin, A.; Lassmann, T.; Mortazavi, A.; Tanzer, A.; Lagarde, J.; Lin, W.; Schlesinger, F.; et al. Landscape of transcription in human cells. Nature 2012, 489, 101–108.

(17) Bernstein, B. E.; Birney, E.; Dunham, I.; Green, E. D.; Gunter, C.; Snyder, M. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489, 57–74.

(18) Dong, X.; Greven, M. C.; Kundaje, A.; Djebali, S.; Brown, J. B.; Cheng, C.; Gingeras, T. R.; Gerstein, M.; Guigó, R.; Birney, E.; et al. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012, 13, R53.

(19) Consortium, I. H. G. Initial sequencing and analysis of the human genome. Nature 2002, 420, 520–562.

(20) Uhlén, M.; Fagerberg, L.; Hallström, B. M.; Lindskog, C.; Oksvold, P.; Mardinoglu, A.; Sivertsson, Å.; Kampf, C.; Sjöstedt, E.; Asplund, A.; et al. Tissue-based map of the human proteome. Science 2015, 347, 394–402.

(21) Landt, S. G.; Marinov, G. K.; Kundaje, A.; Kheradpour, P.; Pauli, F.; Batzoglou, S.; Bernstein, B. E.; Bickel, P.; Brown, J. B.; Cayting, P.; et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22, 1813–1831.

(22) R Core Team. R: A language and environment for statistical computing; R Foundation for Statistical Computing: Vienna, Austria, 2014.

(23) Kyte, J.; Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 1982, 157, 105–132.

(24) Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S. Metrics for the human proteome project 2013-2014 and strategies for finding missing proteins. J. Proteome Res. 2014, 13, 15–20.

(25) Taudien, S.; Szafranski, K.; Felder, M.; Groth, M.; Huse, K.; Raffaelli, F.; Petzold, A.; Zhang, X.; Rosenstiel, P.; Hampe, J.; et al. Comprehensive assessment of sequence variation within the copy number variable defensin cluster on 8p23 by target enriched in-depth 454 sequencing. BMC Genomics 2011, 12, 243.

(26) Groth, M.; Wiegand, C.; Szafranski, K.; Huse, K.; Kramer, M.; Rosenstiel, P.; Schreiber, S.; Norgauer, J.; Platzer, M. Both copy number and sequence variations affect expression of human DEFB4. Genes Immun. 2010, 11, 458–466.

(27) Swaney, D. L.; Wenger, C. D.; Coon, J. J. Value of using multiple proteases for large-scale mass spectrometry-based proteomics. J. Proteome Res. 2010, 9, 1323–1329.

(28) Thurman, R. E.; Rynes, E.; Humbert, R.; Vierstra, J.; Maurano, M. T.; Haugen, E.; Sheffield, N. C.; Stergachis, A. B.; Wang, H.; Vernot, B.; et al. The accessible chromatin landscape of the human genome. Nature 2012, 489, 75–82.

(29) Kent, W. J.; Sugnet, C. W.; Furey, T. S.; Roskin, K. M.; Pringle, T. H.; Zahler, A. M.; Haussler, D. The human genome browser at UCSC. Genome Res. 2002, 12, 996–1006.

(30) Karlić, R.; Chung, H.-R.; Lasserre, J.; Vlahovicek, K.; Vingron, M. Histone modification levels are predictive for gene expression. Proc. Natl. Acad. Sci. U. S. A. 2010, 107, 2926–2931.

(31) Bannister, A. J.; Kouzarides, T. Regulation of chromatin by histone modifications. Cell Res. 2011, 21, 381–395.

(32) He, H. H.; Meyer, C. A.; Chen, M. W.; Jordan, V. C.; Brown, M.; Liu, X. S. Differential DNase I hypersensitivity reveals factor-dependent chromatin dynamics. Genome Res. 2012, 22, 1015–1025.

(33) Yip, K. Y.; Cheng, C.; Bhardwaj, N.; Brown, J. B.; Leng, J.; Kundaje, A.; Rozowsky, J.; Birney, E.; Bickel, P.; Snyder, M.; et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 2012, 13, R48.

(34) Selvi, B. R.; Kundu, T. K. Reversible acetylation of chromatin: Implication in regulation of gene expression, disease and therapeutics. Biotechnol. J. 2009, 4, 375–390.

(35) Andresen, E.; Günther, G.; Bullwinkel, J.; Lange, C.; Heine, H. Increased expression of beta-defensin 1 (DEFB1) in chronic obstructive pulmonary disease. PLoS One 2011, 6.

(36) Wang, J.; Zhuang, J.; Iyer, S.; Lin, X.; Whitfield, T. W.; Greven, M. C.; Pierce, B. G.; Dong, X.; Kundaje, A.; Cheng, Y.; et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012, 22, 1798–1812.

(37) Vaquerizas, J. M.; Kummerfeld, S. K.; Teichmann, S. A.; Luscombe, N. M. A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet. 2009, 10, 252–263.

(38) Bovolenta, L.; Acencio, M. L.; Lemke, N. HTRIdb: an open-access database for experimentally verified human transcriptional regulation interactions. BMC Genomics 2012, 13, 405.

(39) Taudien, S.; Groth, M.; Huse, K.; Petzold, A.; Szafranski, K.; Hampe, J.; Rosenstiel, P.; Schreiber, S.; Platzer, M. Haplotyping and copy number estimation of the highly polymorphic human beta-defensin locus on 8p23 by 454 amplicon sequencing. BMC Genomics 2010, 11, 252.

Figure Legends

Figure 1. Characterization of DEFB genes, including genomic location, expression status and physicochemical properties of their protein products. (a) The distribution of DEFB genes along human chromosomes 6, 8 and 20. The bands on the chromosomes represent the mRNA density expressed by the corresponding genes. (b) The expression status of the DEFB genes in 76 different cell lines and tissues. The mRNA abundance is represented as the log2 (FPKM) of a DEFB gene. (c) Comparison of tryptic peptide lengths between the proteins encoded by DEFB genes and those encoded by all human protein-coding genes. (d) Comparison of tryptic peptide hydrophobicity between the proteins encoded by DEFB genes and those encoded by all human protein-coding genes.

Figure 2. The chromosome location and transcriptional level of genes located within the DEFB-related and reference regions. (a) Regions located on Chr 6. (b) Regions located on Chr 8. (c), (d) Regions located on Chr 20. The chromosome regions that mainly included DEFB genes were divided into ten regions, as described in the text. The x-axis represents the genomic positions and the y-axis shows the mRNA abundances. The dashed lines indicate the region partitions.

Figure 3. Characterization of the ENCODE features in the DEFB-related and reference regions. (a) The accumulated log10 DHS scores in the ten regions on the three chromosomes described in the text. (b) The distribution of the individual histone modifications within each DEFB-related region and its neighboring reference region(s). Each bar represents the fraction of the related regions for one modification type. (c) Heatmap of the TF binding scores for the DEFB-related and reference regions. The gradient bar represents the log2 scores of TF binding.

Supplementary Information:

Supplementary Table 1. The annotation status of the DEFB genes in cross-reference databases.

Supplementary Figure 1. Number of total theoretical peptides and fit theoretical peptides (those with lengths between 7 and 35) of 34 DEFB proteins. X axis: DEFBs. Y axis: number of theoretical peptides, with black and green dots representing total peptides and peptides with lengths between 7 and 35.

Supplementary Figure 2. Heatmap generated from the RNA-Seq and RNC-Seq data of three C-HPP cell lines and the HepG2 RNA-Seq data (with biological replicates) from ENCODE.

Supplementary Figure 3. The tracks of DHS signals in the DEFB regions and their neighboring reference region(s) in Genome Browser. The DEFB-related regions are colored light blue, while the reference regions are indicated by red rectangles. (a) The DHS signals, gene positions and corresponding transcription signals of DEFB-ca20 and its neighbor region; the features of each region are depicted by the tracks of their peak signals from the ENCODE Genome Browser. (b) The DHS signals, gene positions and corresponding transcription signals of DEFB-cb20 and DEFB-rb20. (c) The DHS signals, gene positions and corresponding transcription signals on chr8. (d) The DHS signals, gene positions and corresponding transcription signals on chr6.

Supplementary Figure 4. The tracks of all 11 histone modification signals along the regions of interest on chr6, 8 and 20. (a) The eleven tracks representing the histone modification status of the corresponding regions are plotted together with the gene positions and the transcription signals for DEFB-ca20 and DEFB-ra20 in the same manner as in Figure S2. (b) The histone modification signals along DEFB-cb20 and DEFB-rb20. (c) The histone modification signals along the regions on chr8. (d) The histone modification signals along the regions on chr6.

Supplementary Figure 5. The correlation matrix between each pair of DHS, FPKM, and all the histone modification scores in the TSS regions for all the genes in all 10 regions. The lower panel plots each pair of features as a scatter plot with a LOWESS smooth line, while the upper panel and the diagonal panel present the correlation coefficients and the histograms, respectively.
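As an illustration of the pairwise correlation matrix described in Supplementary Figure 5, the following minimal sketch computes a Pearson correlation for every pair of per-gene feature vectors. The caption does not state which correlation statistic was used, so Pearson is an assumption here; the feature names and all values are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / sqrt(sxx * syy)


def correlation_matrix(features):
    """Return {name_a: {name_b: r}} for every pair of features."""
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names}
            for a in names}


# Hypothetical per-gene scores in TSS regions (illustrative values only).
features = {
    "DHS": [0.2, 1.5, 3.1, 0.0, 2.4],
    "FPKM": [0.1, 2.0, 4.2, 0.0, 3.0],
    "H3K27me3": [3.0, 1.0, 0.2, 3.5, 0.5],
}
matrix = correlation_matrix(features)
```

The matrix is symmetric with ones on the diagonal, so a figure of this kind can use the lower triangle for scatter plots and the upper triangle for the coefficients without losing information.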

Supplementary Figure 6. Binding status of 59 TFs of 122 cells in the selected regions. Signal values of TF peaks mapped to the potential regulatory regions of each gene were summed as the TF score. Log2(TF scores) were plotted in the same manner as Figure 3.
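The TF score described in Supplementary Figure 6 can be sketched as a sum of peak signal values over the peaks that overlap a gene's regulatory region, followed by a log2 transform. The sketch below is only an illustration: the peak tuple layout, the half-open coordinate convention, the region boundaries, and all numbers are assumptions, and the caption does not say how a score of zero was handled, so zero is left untransformed.

```python
from math import log2


def tf_score(peaks, region):
    """Sum the signal of all peaks overlapping a regulatory region.

    peaks  : iterable of (chrom, start, end, signal) tuples (assumed layout)
    region : (chrom, start, end), half-open coordinates (assumed convention)
    """
    chrom, start, end = region
    return sum(signal
               for (p_chrom, p_start, p_end, signal) in peaks
               if p_chrom == chrom and p_start < end and p_end > start)


# Hypothetical TF peaks and a hypothetical regulatory region for one gene.
peaks = [
    ("chr8", 100, 200, 4.0),    # overlaps the region
    ("chr8", 150, 400, 12.0),   # overlaps the region
    ("chr8", 900, 1000, 7.0),   # same chromosome, outside the region
    ("chr6", 120, 180, 5.0),    # different chromosome
]
region = ("chr8", 50, 500)

score = tf_score(peaks, region)
# log2 is undefined at zero, so genes without any overlapping peak get None.
log_score = log2(score) if score > 0 else None
```

Repeating this for every TF and every gene gives a TF-by-gene table of log2 scores that can be drawn as a heatmap.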


Supplementary Figure 7. Correlation of the binding of 59 TFs and the expression of 122 genes included in the selected regions. The binding patterns of the 59 TFs and the expression of the 122 genes were cross-correlated to estimate the correlation between each pair of them. The matrix of correlation scores was plotted with the “first principal component order” clustering method.
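One common reading of a "first principal component order" is to sort the variables by their loading on the leading eigenvector of the correlation matrix. The caption does not name the software or the exact procedure, so the sketch below is an assumption throughout: it approximates the leading eigenvector by power iteration, starting from a vector of ones, and the small matrix is hypothetical.

```python
from math import sqrt


def leading_eigenvector(matrix, iterations=500):
    """Approximate the dominant eigenvector of a symmetric matrix
    by power iteration, starting from a vector of ones."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("start vector was mapped to zero")
        v = [x / norm for x in w]
    return v


def fpc_order(matrix):
    """Indices of the variables sorted by their leading-eigenvector loading."""
    loadings = leading_eigenvector(matrix)
    return sorted(range(len(loadings)), key=lambda i: loadings[i])


# Hypothetical 3 x 3 correlation matrix: variables 0 and 1 are strongly
# correlated with each other, variable 2 only weakly with both.
corr = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
order = fpc_order(corr)
```

Reordering the rows and columns of the matrix by `order` places variables with similar loadings next to each other. The sign of an eigenvector is arbitrary, so a different solver may return the reverse order, which yields an equivalent plot.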


Figure 1. (Graphic not reproduced; recoverable labels only.) Panels a and b: DEFB genes grouped by chromosome (6, 8, and 20) and strand (+/−), shown against a panel of tissues and cell lines, with the legend "No evidence" and "mRNA evidence". Panel c: fractions versus peptide length for DEFB peptides and whole-genome peptides. Panel d: density versus GRAVY score for DEFB peptides and whole-genome peptides, with individual DEFB genes labeled.


Figure 2. (Graphic not reproduced; recoverable labels only.) Panels a to d: relative expression, log2(FPKM), versus chromatin position (bp) for all genes and for DEFBs in the regions DEFB-r6-5', DEFB-c6, DEFB-r6-3', DEFB-r8-5', DEFB-c8, DEFB-r8-3', DEFB-ca20, DEFB-ra20, DEFB-cb20, and DEFB-rb20.


Figure 3. (Graphic not reproduced; recoverable labels only.) Panel a: DHS log10(scores) for the DEFB-related regions and their 5' and 3' neighbor regions on Chr20a, Chr20b, Chr6, and Chr8. Panel b: percentage of histone modifications for the DEFB-related regions and their 5' and 3' neighbor regions on Chr20.a, Chr20.b, Chr6, and Chr8. Panel c: expression and TF rows (for example ARID3A, BHLHE40, CEBPD, CTCF, POLR2A, and YY1) across the regions DEFB.c6, DEFB.r6.5, DEFB.r6.3, DEFB.c8, DEFB.r8.5, DEFB.r8.3, DEFB.ca20, DEFB.ra20, DEFB.cb20, and DEFB.rb20.

For TOC only