Insights from ENCODE on missing proteins: why β-defensin expression is scarcely detected. Yang Fan, Yue Zhang, Shaohang Xu, Nannan Kong, Yang Zhou, Zhe Ren, Yamei Deng, Liang Lin, Yan Ren, Quanhui Wang, Jin Zi, Bo Wen, and Siqi Liu. J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 10 Aug 2015
Journal of Proteome Research
Insights from ENCODE on missing proteins: why β-defensin expression is scarcely detected
Yang Fan1,2,3,#, Yue Zhang1,2,3,#, Shaohang Xu2, Nannan Kong1,2,3, Yang Zhou1,2,3, Zhe Ren2, Yamei Deng1,2,3, Liang Lin2, Yan Ren2, Quanhui Wang1,2,3, Jin Zi2, Bo Wen2,*, Siqi Liu1,2,3,*
1 CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 101318, China
2 BGI-Shenzhen, Shenzhen 518083, China
3 Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
# These authors contributed equally to this work.
*To whom correspondence should be addressed:
Siqi Liu, Beijing Institute of Genomics, CAS, 1 BeiChen West Road, Beijing 100101, China.
Tel and Fax: 86-10-80485460; E-mail: [email protected]
Bo Wen, BGI-Shenzhen, 11 Build, Beishan Industrial Zone, Yantian District, Shenzhen 518083, China
Tel and Fax: 86-0755-25273620; E-mail: [email protected]
Abbreviations
DEFB, β-defensin gene; CHPP, Chromosome-Centric Human Proteome Project; ENCODE, Encyclopedia of DNA Elements; DHS, DNase I Hypersensitive Site; TF, Transcription Factor; TSS, Transcription Start Site; FPKM, Fragments Per Kilobase of exon per Million fragments mapped; GRAVY, grand average of hydropathy; ChIP-seq, Chromatin Immunoprecipitation followed by sequencing.
Abstract
β-defensins (DEFBs) have a variety of functions. The majority of these proteins were not identified in a recent proteome survey. Neither protein detection nor the analysis of RNA-seq transcriptomic data for three liver cancer cell lines identified any expression products. Extensive investigation into DEFB transcripts in over 70 cell lines offered similar results. This naturally raises the question: why are DEFB genes scarcely expressed? After examining DEFB gene annotation and the physicochemical properties of the DEFB protein products, we postulated that regulatory elements could play a key role in the poor transcription of DEFB genes. Four regions containing DEFB genes and six adjacent regions on chromosomes 6, 8 and 20 were carefully investigated using Encyclopedia of DNA Elements (ENCODE) information, such as DNase I hypersensitive sites (DHSs), transcription factors (TFs) and histone modifications. The results revealed that the intensities of these ENCODE features were globally weaker in the DEFB regions than in the adjacent regions. Notably, the DEFB-related regions on chromosomes 6 and 8, which contain several non-DEFB genes, had lower ENCODE feature intensities, indicating that the absence of DEFB mRNAs might not depend on the gene family but may instead depend on gene location and chromatin structure.
Key words
CHPP, missing proteins, β-defensin, ENCODE, DHS, histone modification, TF, proteome
Introduction
β-defensin (DEFB) proteins were discovered in humans as antimicrobial peptides in the innate immune system 1. In response to infection and inflammation, these proteins effectively eliminate bacteria, fungi and enveloped viruses 2. The structure of DEFBs has been intensively studied. According to the UniProt database, these proteins comprise 61-111 amino acid residues with more than 6 conserved cysteine residues 2 and have a strong charge status that likely plays a key role in their antimicrobial activities 3. During the arms race between pathogens and the human immune system, DEFBs evolved into a gene family containing multiple paralogs. DEFB family members are assumed to have various activities and functions 4 in the immune response; however, the corresponding behaviors of individual DEFB members remain poorly understood. A fundamental issue in this field is how to detect DEFB proteins and how to evaluate their presence in cells and tissues.
A total of 34 DEFB genes are annotated in the human genome based on the human reference genome assembly hg19. With the exception of the two DEFB genes located on Chromosome (Chr) 4 and Chr 11, most DEFB genes are distributed on 3 chromosomes in a clustered manner, with 4 on Chr 6, 14 on Chr 8 and 14 on Chr 20 (Figure 1A). Large-scale analysis by RNA-seq or liquid chromatography-tandem mass spectrometry (LC-MS/MS) provides an overall view of gene expression status in cells or tissues. Surprisingly, DEFB mRNA and protein expression levels are very low in several cells compared with expected levels. Although 14 DEFB genes are located on chromosome 8 5, no expressed proteins corresponding to these DEFB genes were found in three cancer cell lines. Additionally, RNA-seq data for both total mRNAs and ribosome-bound mRNAs detected transcripts of only the DEFB1 gene in these cells. Wang et al. conducted proteomics analysis using LC-MS/MS on gastric, colon and liver tissues from the digestive system 6 and identified no DEFB proteins encoded by chromosome 20. Furthermore, the same group extended the trans-omics study to three liver cancer cell lines to examine the expression status of the DEFB genes on chromosome 20 7. Similar to their results in human tissues, no DEFB gene products from chromosome 20 were detected by either RNA-seq or LC-MS/MS. The DEFB genes on chromosomes 8 and 20 are not the only ones that are suppressed; the expression of the DEFB genes on chromosome 6 was also barely detected by RNA-seq.
Previous studies indicated that considerable expression of DEFBs was detected in some human tissues, such as skin, oral mucosa and trachea, in response to microbial invaders from the outer environment 8. Meanwhile, two groups revealed that DEFB expression in lung disease was highly dependent upon inflammation and microbial infection 9-10. However, the abundance of DEFB109 was found to decrease in human eye tissue under microbial infection, suggesting that its function is more complicated than antimicrobial activity alone 11. Nonetheless, the global absence of DEFB expression in normally conditioned cells and tissues naturally raises the question of what causal factors lead to this phenomenon. Gene expression is regulated by many elements at various levels. Cell specificity and chromatin structure are the main factors that impact gene expression at the transcriptional level, while the physicochemical properties of proteins and tissue location may be the driving forces at the translational level.
The Encyclopedia of DNA Elements (ENCODE) project 12 has generated a wealth of experimental information by mapping diverse chromatin properties in over one hundred human cell lines, including the locations of DNase I hypersensitive sites (DHSs), the associations of transcription factors (TFs) and the presence of histone and CpG modifications. ENCODE offers a solid database for understanding the relationship between chromatin features and gene expression in depth. For instance, the DHS data reveal that chromatin loses its condensed structure in certain regions, leading to exposed and accessible genomic DNA that is functionally related to transcriptional activity 13. TFs, which represent the largest family of proteins and account for approximately 10% of genes, are critical regulatory elements for gene expression 14-16. Histone modifications are believed to be involved in both transcriptional activation and repression; the various modified states can change chromatin structure and function by recruiting other enzyme complexes 17-21. These elements can work together to form a regulatory network for controlling gene expression. The effective utilization of the ENCODE data therefore enables a comprehensive understanding of which factors are critical for the expression of the corresponding genes. To date, there have been no systematic investigations examining DEFB gene chromosomal locations and the corresponding chromatin structures. Because the RNA-seq data provide solid evidence that DEFB genes are poorly expressed at the transcriptional level, we asked whether DEFB chromatin characteristics play a role in the repression of these genes in digestive system tissues.
In this communication, we examined mRNA signals encoded by individual DEFB genes in three liver cancer cell lines before extending the survey to examine DEFB mRNAs from over 70 cell lines. Using transcriptomic data, we formed a hypothesis and utilized ENCODE data to study the correlation between the transcriptional expression and chromatin structure of DEFB genes. For the first time, we have attempted to investigate missing proteins from a new angle, focusing on whether unmeasurable protein expression levels are derived from the scarce transcription of corresponding genes.
Methods
1. Data sources
Protein sequences of the whole genome were retrieved from UniProt (http://www.uniprot.org/uniprot/). The transcriptomes were determined using next-generation sequencing techniques. The transcriptome data for the three liver cancer cell lines, Hep3B, MHCC97H and HCCLM3, were retrieved from the Gene Expression Omnibus database (accession number: GSE49994). The mRNA data for the HepG2 cell line were collected from the ENCODE dataset at https://www.encodeproject.org/experiments/ENCSR468ION/. The transcriptome data for the 79 cell lines and tissues surveyed by Uhlén’s group 20 were downloaded from the original paper’s supplementary files on the Science website at http://www.sciencemag.org/content/suppl/2015/01/21/. The ENCODE features for the HepG2 cell line were retrieved from the ENCODE databases. The DHS data in narrowPeak format, the histone modification data in broadPeak format and the uniformly processed TF binding data in narrowPeak format were downloaded from https://www.encodeproject.org/experiments/, http://genome.ucsc.edu/cgi-bin/hgFileUi/ and http://genome.ucsc.edu/cgi-bin/hgTrackUi/, respectively. The ChIP-seq data were processed strictly according to the guidelines of the ENCODE consortium to ensure high-quality data interpretation 21.
2. RNA-seq data processing
The mRNA reads for HepG2, Hep3B, MHCC97H and HCCLM3 were mapped to the Ensembl human genome (release GRCh37.75) and mature mRNA with TopHat (version 2.0.8), using the default parameters. A maximum of two mismatches was allowed in the alignment process. The transcriptome reconstruction and expression quantification were sequentially performed with Cufflinks (version 2.0.2), using the following parameters: (1) -g genes.gtf, (2) -b genome.fa and (3) -u. The protein-coding gene annotation model was extracted from the Ensembl gene annotations in GTF format (release GRCh37.75). The quantitative transcriptome was evaluated and normalized using FPKM (Fragments Per Kilobase of exon per Million fragments mapped).
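As an illustration of the FPKM normalization named above, the following minimal sketch computes the textbook FPKM value from a raw fragment count. It is not the Cufflinks estimator, which fits a statistical model with bias and multi-read correction; the function name and arguments are hypothetical and chosen only for this example.

```python
def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of exon per million mapped fragments.

    Simplified definition for illustration only; Cufflinks applies
    additional model-based corrections that are not reproduced here.
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and library size must be positive")
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Hypothetical numbers: 500 fragments on a 2-kb transcript in a library
# of 20 million mapped fragments.
example_value = fpkm(500, 2_000, 20_000_000)
```

Because the count is divided by both transcript length and library size, values are comparable between genes of different lengths and between samples sequenced to different depths.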
3. Assignment of the comparable scores to the ENCODE features
a) DHS
The DHS score is represented by the accumulated signal values from the DNase-seq reads within a potential regulatory region. A 2-Kb region upstream of the transcription start site (TSS) of each transcript was defined as the potential regulatory region for DHS, and the positions of the DHS peaks were intersected with the positions of the potential regulatory regions. The average DHS score of the corresponding transcripts was assigned as the DHS score for each gene.
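The DHS scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions: the window is taken strand-aware, any overlap between a peak and the window counts in full, and all records lie on one chromosome. The function names and tuple layouts are hypothetical and are not taken from the original pipeline.

```python
from collections import defaultdict

UPSTREAM_BP = 2_000  # window size stated in the text


def upstream_window(tss, strand, size=UPSTREAM_BP):
    """Half-open [start, end) window upstream of the TSS (strand-aware; assumption)."""
    if strand == "+":
        return max(0, tss - size), tss
    return tss + 1, tss + 1 + size


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def dhs_gene_scores(transcripts, peaks):
    """Average per-transcript DHS score for each gene.

    transcripts: iterable of (gene, tss, strand)
    peaks: iterable of (start, end, signal_value), e.g. from a narrowPeak file
    """
    per_gene = defaultdict(list)
    for gene, tss, strand in transcripts:
        w_start, w_end = upstream_window(tss, strand)
        # Accumulate the signal of every peak that intersects the window.
        score = sum(signal for p_start, p_end, signal in peaks
                    if overlaps(w_start, w_end, p_start, p_end))
        per_gene[gene].append(score)
    # The gene score is the mean over its transcripts.
    return {gene: sum(s) / len(s) for gene, s in per_gene.items()}
```

A gene with two transcripts whose windows capture signals of 5.0 and 3.0 would receive a DHS score of 4.0 under this scheme.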
b) Histone modification
The peak signals for the histone modifications derived from the sequencing reads were retrieved from an ENCODE dataset. The histone modification score per gene was defined as the accumulated peak signals that mapped to two regions related to the gene: the TSS region, including 2 Kb upstream of the TSS, and the gene body. Both were considered to be potential regulatory regions carrying various histone modifications. For a better estimation of each modification, the scores per gene were calculated in two steps: first separately accumulating the values from the TSS region and the gene body region, and then integrating both values.
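The two-step histone modification score can be sketched as below. The text does not state how the two values are integrated, so simple addition is assumed here; a broad peak spanning both regions is therefore counted in each. The function names and the coordinate convention (half-open, gene_start < gene_end) are likewise assumptions made for illustration.

```python
def sum_overlapping(peaks, start, end):
    """Sum the signal of peaks that overlap the half-open interval [start, end)."""
    return sum(signal for p_start, p_end, signal in peaks
               if p_start < end and start < p_end)


def histone_score(gene_start, gene_end, strand, peaks, upstream=2_000):
    """Two-step score for one histone mark on one gene.

    Step 1: accumulate peak signals separately for the TSS region
            (2 Kb upstream of the TSS) and for the gene body.
    Step 2: integrate both values (assumed here to be a plain sum).
    """
    if strand == "+":
        tss_region = (max(0, gene_start - upstream), gene_start)
    else:
        tss_region = (gene_end, gene_end + upstream)
    tss_score = sum_overlapping(peaks, *tss_region)
    body_score = sum_overlapping(peaks, gene_start, gene_end)
    return {"tss": tss_score, "body": body_score, "total": tss_score + body_score}
```

Keeping the two partial scores in the returned dictionary makes it possible to inspect promoter-proximal and gene-body signal separately before they are combined.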
c) TF binding
According to the sequencing data acquired from ChIP-seq, the TF binding score for each TF on a gene was assigned as the sum of the peak signals in the potential regulatory region, which extended from 1 Kb upstream of the gene start to the end of the gene. Because the TF peaks were uniformly processed, no additional normalization was required for the TF binding scores of the 59 different TFs.
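A minimal sketch of the per-gene TF binding score follows, assuming a strand-aware region and counting any overlapping peak in full. The function name, the tuple layout and the TF names used in the example are hypothetical.

```python
from collections import defaultdict


def tf_binding_scores(gene_start, gene_end, strand, tf_peaks, upstream=1_000):
    """Sum ChIP-seq peak signals per TF over [1 Kb upstream of gene start, gene end).

    tf_peaks: iterable of (tf_name, start, end, signal_value)
    Returns a dict mapping each TF with at least one overlapping peak to its score.
    """
    if strand == "+":
        region_start, region_end = max(0, gene_start - upstream), gene_end
    else:
        region_start, region_end = gene_start, gene_end + upstream
    scores = defaultdict(float)
    for tf, p_start, p_end, signal in tf_peaks:
        if p_start < region_end and region_start < p_end:
            scores[tf] += signal
    return dict(scores)


# Made-up peaks for illustration only.
example_peaks = [("CTCF", 4200, 4400, 10.0), ("CTCF", 7000, 7200, 5.0),
                 ("MYC", 3000, 3500, 7.0)]
example_scores = tf_binding_scores(5000, 8000, "+", example_peaks)
```

In the example, both CTCF peaks fall inside the region and are summed, while the MYC peak lies further upstream than 1 Kb and contributes nothing.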
To characterize the ENCODE features in selected chromatin regions, the scores for each feature in a given region were first summed and then divided by the number of genes within the region to generate the regional score.
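The regional score amounts to a per-gene average, which may be sketched as follows. Genes without any recorded signal are assumed to contribute zero; that choice, like the function name, is an assumption of this example rather than a detail given in the text.

```python
def regional_score(gene_scores, genes_in_region):
    """Sum the per-gene scores of a feature in a region and divide by the gene count.

    gene_scores: dict mapping gene name to its score for one ENCODE feature
    genes_in_region: list of gene names located in the region
    """
    if not genes_in_region:
        raise ValueError("region contains no genes")
    total = sum(gene_scores.get(gene, 0.0) for gene in genes_in_region)
    return total / len(genes_in_region)
```

Dividing by the gene count keeps regions with different numbers of genes comparable, which matters here because the DEFB regions and reference regions differ in gene density.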
4. Data analysis
To present the chromosomal locations of the DEFB genes, an ideogram was generated with the hg19 reference genome using the online tool Idiographica, provided by the Computational Biology Research Center (CBRC) (http://www.ncrna.org/idiographica/). To characterize the gene expression profiles in the selected cell lines and the TF binding landscape for the individual genes or selected regions, heatmaps were generated using the “pheatmap” package in R 22 by plotting the log2 (FPKM) of the DEFB genes versus the cell lines and tissues or the log2 (TF binding scores) of each TF versus the region of interest. To illustrate the physicochemical properties of the theoretical tryptic peptides derived from the DEFB proteins and the whole genome, in-house Perl programs were used according to the trypsin digestion principle and the grand average of hydropathy (GRAVY) 23. The distribution curves for these properties were plotted with the “density” function in R. To reveal the correlation between gene expression and the simplified DEFB-related reference regions, the log2 (FPKM) of each gene located in the selected regions was plotted against its chromosomal position using R. To summarize the histone modification statuses, the scores for each histone modification were summed for every DEFB-related region and its neighboring reference regions, and 4 region groups were then formed, each consisting of a DEFB-related region and its reference region(s). The region fraction in each group was calculated by dividing the score of a region by the sum of the scores in the group for a specific modification. A stacked bar plot was generated from the region fractions using the ggplot2 package in R.
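The region fraction calculation can be sketched as below for one histone modification and one region group. The scores in the example are made-up placeholders, not values from this study, and the function name is hypothetical.

```python
def region_fractions(group_scores):
    """Divide each region's score by the summed score of its group.

    group_scores: dict mapping region name to the summed score of one
    histone modification in that region.
    """
    total = sum(group_scores.values())
    if total == 0:
        # No signal anywhere in the group: report zero fractions.
        return {region: 0.0 for region in group_scores}
    return {region: score / total for region, score in group_scores.items()}


# Placeholder scores for one group on Chr 6 (illustrative only).
example_group = {"DEFB-c6": 10.0, "DEFB-r6-5'": 30.0, "DEFB-r6-3'": 60.0}
example_fractions = region_fractions(example_group)
```

The fractions in a group sum to one, so each stacked bar compares how a modification is distributed between a DEFB region and its reference region(s) regardless of the absolute signal level.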
Results and discussion
1. The expression status of DEFB genes and the physicochemical characteristics of DEFB proteins
In the chromosome proteomics study carried out by the Chinese Human Chromosome Proteome Consortium (CCPC), three -omics datasets (transcriptome, translatome and proteome) were generated from three hepatocellular carcinoma (HCC) cell lines, Hep3B, MHCC97H and HCCLM3 7; all of the sequencing data are available from the Gene Expression Omnibus database (accession number: GSE49994). In the CCPC dataset, the total number of protein entries encoded by the human genome is 19448. In contrast, the number of missing protein entries is 2936, as defined by Lane et al. 24. As most DEFB proteins are missing proteins, we specifically looked through the dataset to find the expression status of the DEFB genes. There were no protein identification signals in the CCPC dataset, and only a few mRNA signals for DEFB genes were detected in the three hepatoma cell lines. According to the hg19 version of the human genome, with the exception of Chr 4 and Chr 11, which each carry only one DEFB gene (not shown), most DEFB genes are distributed on 3 chromosomes in a clustered manner, with 4 on Chr 6, 14 on Chr 8 and 14 on Chr 20 (Figure 1A). In the CCPC study, only one DEFB mRNA was detected for each of the 3 chromosomes: DEFB112 from Chr 6, DEFB1 from Chr 8 and DEFB126 from Chr 20. DEFB112 and DEFB126 mRNA were found at a very low abundance, and DEFB1 mRNA was expressed at a relatively high abundance in the three cell lines. This indicates that expression of the clustered DEFB genes is almost universally absent in the hepatoma cell lines.
Next, we checked DEFB gene expression in HepG2, a hepatoma cell line that is widely used to study liver cancer and whose chromatin structure has been deeply mined in ENCODE. The expression patterns for the DEFB genes were similar to our observations in the CCPC data (Supplementary Figure 2), except that the abundance of the DEFB1 mRNA was relatively low. To minimize false positive results, we allowed only 2 mismatches in the analysis of the RNA-seq data. However, this setting might be so strict that some DEFB expression products carrying nucleotide variants were missed. On the basis of the observations of Taudien et al. and Groth et al., DEFBs on Chr 8 could possess more than 10 single nucleotide variations on average 25, 26. Therefore, the absence of detectable DEFB mRNAs needs to be further evaluated using a more specific detection approach.
At least one of the following four causal factors is likely to lead to missing proteins: (1) tissue- or cell-type-dependent gene expression, (2) gene misannotation, (3) peptide features so specific that the relevant MS signals are difficult to detect, or (4) the negative regulation of gene expression by specific chromatin features. Therefore, we asked which causal factor led to the extremely low expression status of DEFB genes.
We first sought to determine whether the universal absence of clustered DEFB gene expression on different chromosomes was unique to the hepatoma cell lines. The DEFB mRNAs in a published database were examined to address this question. Based on the RNA-seq database published by Uhlén’s group, the transcriptomes of 76 different cell lines and tissue types were surveyed 20. The results are presented as a heatmap with the normalized expression of each DEFB gene versus the 76 cell lines and tissue types (Figure 1B). According to the RNA-seq data, DEFB gene expression was very low in most cell lines and tissues, with abundance values less than 5. This transcriptome dataset did not include tissues or cell lines affected by microbial infection, suggesting that low DEFB mRNA expression is a common characteristic of normal major tissues and cell lines. Additionally, as shown in Figure 1B, DEFB1 displayed high expression levels in many cell lines, implying that this gene may be located at a special site on a specific chromosome or may perform certain functions under stimulation.
Next, we questioned whether the low rate of identification of DEFB mRNAs resulted from the incorrect annotation of these genes. We reviewed the annotation status of the 34 DEFB genes in the UniProt database (http://www.uniprot.org/). Whether a gene encoding a specific protein is well annotated depends on three lines of evidence: antibody assays, mRNA arrays or RNA-seq, and gene confirmation across different reference databases. All of the annotation information for the DEFB genes is summarized in Supplementary Table 1. Based on the experimental evidence, the existence of most DEFBs is supported by RNA-seq data, except for DEFB108 and DEFB109. Sixty percent of DEFB transcripts were identified by mRNA array and only 18% of DEFB translated products were found by antibody assays. Furthermore, the presence of DEFB was examined across multiple gene expression databases, including EMBL, CCDS, UniGene and neXtProt. DEFB108 and DEFB109 were excluded from most databases, though the presence of the other 32 DEFBs was supported by all of these databases. As the majority of DEFBs were detected in their transcriptional forms and DEFB transcription was generally reported in several references, the annotation of the DEFB genes was considered acceptable.
Proteomics analysis is based on the identification of tryptic peptides by MS. Whether a peptide is measurable is partially determined by its length, one of the primary limitations of mass spectrometry. Swaney et al. evaluated the correlation between peptide length and detectability by MS using yeast proteins 27 and found that peptides of 7 to 35 amino acid residues can be detected by MS. Additionally, the ‘scan range’ parameter on most mass spectrometers is set with an m/z limit of 350, indicating that measurable peptides should be longer than 6 amino acids. As depicted in Figure 1C, the fractions of tryptic peptides of a given length were plotted against the peptide lengths; the tryptic peptides were theoretically generated from the DEFB family and from the total proteins encoded by the human genome. The similar distribution curves revealed that approximately 60% of the tryptic peptides were quite short for both DEFB and whole genome proteins; however, the tryptic peptides with lengths suitable for MS detection still accounted for a high percentage, 36% for DEFB and 44% for whole genome proteins. Specifically, the tryptic peptides generated from each DEFB protein are summarized in Supplementary Figure 1, which suggests that the DEFB proteins contained at least 2 peptides suitable for MS detection and that over 63% of DEFB proteins possessed at least 5 such peptides. Peptide hydrophobicity is an important factor for peptide separation by reverse-phase HPLC and for peptide ionization in electrospray. The hydrophobicity distributions, described by the GRAVY score, for the tryptic peptides from DEFB and whole human genome proteins are illustrated in Figure 1D. The distribution for total proteins is relatively symmetric, while the distribution for DEFB is slightly shifted to the right, with a GRAVY score of approximately 0.3. Moreover, the fractional sums of tryptic peptides with a GRAVY score over 0.5 for DEFB and the whole genome proteome were 18% and 11%, respectively, indicating that DEFB proteins could produce more peptides with relatively higher hydrophobicities. Taken together, we concluded that DEFB peptides did not show any strong bias relative to other human proteins in proteomics analysis with LC-MS/MS.
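The peptide-level properties discussed above can be reproduced in outline with the following sketch, which is not the in-house Perl code. It assumes the common trypsin rule (cleavage after K or R unless the next residue is P, with no missed cleavages), uses the Kyte-Doolittle hydropathy scale for GRAVY, and takes the 7 to 35 residue window from the text; all function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def tryptic_peptides(protein):
    """In silico digestion: cleave after K/R unless followed by P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def gravy(peptide):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


def detectable_fraction(peptides, min_len=7, max_len=35):
    """Fraction of peptides whose length lies in the MS-detectable window."""
    if not peptides:
        return 0.0
    return sum(min_len <= len(p) <= max_len for p in peptides) / len(peptides)
```

Applying `tryptic_peptides` to every sequence in a protein set and pooling the lengths and GRAVY scores of the resulting peptides would give the kind of distributions compared in Figures 1C and 1D.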
From the biological and physicochemical properties of DEFB alone, it is difficult to conclude which specific factor is decisive in causing the low transcriptional and translational levels of the DEFB genes. Thus, we next examined the chromatin structure to determine whether it affected the transcriptional status of the DEFB genes.
2. Regionalization of DEFBs and their adjacent genes on corresponding chromosomes
Gene expression is controlled by many classes of cis-regulatory elements, including enhancers, promoters, insulators, silencers and locus control regions. The ENCODE data are expected to annotate these elements and to reveal novel relationships between chromatin accessibility, transcription 12, DNA methylation and regulatory factor occupancy patterns 18. As discussed above, the DEFB genes were correctly annotated at the genomic level but their transcriptional signals were rarely detected. This prompted us to ask whether the specific chromatin structures at DEFB genes are involved in the regulation of DEFB gene expression. Although several regulatory elements have been well studied in ENCODE, we focused on three main elements, DHS, histone methylation patterns and transcription factor binding, and scrutinized the relationships between the chromatin at the DEFB genes and these elements.
DHS data identify chromatin regions that are sensitive to cleavage by the enzyme DNase I; these accessible chromatin zones are functionally related to transcriptional activity 28. DHSs are somewhat cell-type selective. In ENCODE, 125 different human cell types have been examined by DHS analysis, and approximately 30% of DHSs were specific to each cell type. In our dataset, the three liver cancer cell lines were used for a trans-omics survey; however, none of them were studied by ENCODE. How, then, can the ENCODE information be used to understand the transcriptional status of the DEFB genes in these cell lines? We selected all of the liver cell lines available in ENCODE and downloaded their transcriptome data. By normalizing the transcriptional abundance in the liver cells stored in ENCODE and in the three cell lines that we studied, we were able to globally evaluate the transcriptional patterns in our cell lines. We found that the HepG2 cell line shared similar transcriptional patterns with our three liver cancer cell lines, and reasoned that the ENCODE information on HepG2 was appropriate for exploring the role of DHS in DEFB gene expression.
Approximately 91% of DEFB genes are located on three chromosomes, Chr 6 (5 46 48
47
genes), Chr 8 (7 genes) and Chr 20 (14 genes). Importantly, 90% of the DEFB genes 50
49
on these three chromosomes form clustered structures. We combined the genomic 52
51
location of the DEFB genes and the transcriptome data from ENCODE for the HepG2 53 5
54
cells to survey the relationship between the transcriptional level of the DEFB genes and 57
56
the corresponding chromatin regions. On Chr 6, the DEFB gene cluster consisting of 58 60
59
DEFB133, DEFB114, DEFB113, DEFB110 and DEFB112 did not have significant 15 / 33
ACS Paragon Plus Environment
Journal of Proteome Research
1 3
2
RNA-seq signals, and their neighbor regions, which extended from 0.45 Mb upstream 5
4
of the 5’ end to 2.1 Mb downstream of the 3’ end of the cluster, had poor transcription 6 8
7
as well. On Chr 8, the DEFB gene cluster consisting of DEFB107B, DEFB105A, 10
9
DEFB106A, DEFB104B, DEFB103A, DEFB4A, DEFB109P1B and DEFB107A had 1 12
very low mRNA expression levels. The neighboring regions of this DEFB cluster, 13 14
which extended from 1 Mb upstream of the 5’ end to 0.23 Mb downstream of the 3’ 17
16
15
end of the cluster, also displayed poor transcription. In contrast to the DEFB clusters 18 19
found on Chr 6 and Chr 8, DEFB genes make up two distant clusters on Chr 20 and are 20 2
21
close to the Chr 20 telomere and centromere regions. Similar scarce transcription was 24
23
observed in the two DEFB gene clusters, whereas the flanking regions contained genes 25 27
26
with much higher transcription levels. The information regarding the DEFB gene 29
28
chromosome locations and their expression statuses are summarized in Figure 2. To 31
30
characterize the ENCODE features, such as DHS, histone modification and TF binding, 32 34
3
for the DEFB cluster regions and compare their features with those of other chromatin 36
35
regions, we specifically localized the DEFB genes and their neighboring regions as 37 38
reference regions. On Chr 6 and Chr 8, the DEFB region was centered at the DEFB 39 41
40
cluster and was expanded to include the boundaries of lower transcription, which 43
42
included approximately 2.7 Mb on Chr 6 and 1.5 Mb on Chr 8 and were termed DEFB4 45
c6 and DEFB-c8, respectively. The reference regions selected had genes with relatively 46 48
47
higher transcription levels. On Chr 6, the reference regions contained 2.7 Mb from the 50
49
3’ and 5’ ends of the DEFB-c6 and were named DEFB-r6-3’ and DEFB-r6-5’, 51 52
respectively, while on Chr 8, the regions contained 1.5 Mb expanded from the both 3’ 53 5
54
and 5’ ends of DEFB-c8 were named DEFB-r8-3’ and DEFB-r8-5’, respectively. On 57
56
Chr 20, the two DEFB clusters contain 216 Kb and 174 Kb and are termed DEFB-ca20 58 60
59
and DEFB-cb20, respectively. As the regions close to the 3’ ends of DEFB-ca20 and 16 / 33
ACS Paragon Plus Environment
Page 16 of 37
Page 17 of 37
Journal of Proteome Research
1 3
2
DEFB-cb20 contain genes with higher transcription levels and the telomeres or 5
4
centromeres of Chr 2 are close to the 5’ ends of these DEFB clusters, the reference 6 8
7
regions for DEFB-ca20 and DEFB-cb20 are the regions that extend to the right of the 10
9
DEFB-ca20 and DEFB-cb20 3’ ends for 216 Kb and 174 Kb, respectively, and are 1 12
termed DEFB-ra20 and DEFB-rb20, respectively. All of the regions used in the further 13 15
14
ENCODE analyses are labeled in Figure 2. 16 18
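The region layout described above (a core DEFB region flanked by reference windows of the same length) can be sketched in a few lines of Python. This is only an illustration: the `Region` class, the `flanking_references` helper, the naming scheme and every coordinate below are hypothetical, and the sketch assumes that the 5’ side corresponds to the lower-coordinate side of the core region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    @property
    def length(self) -> int:
        return self.end - self.start


def flanking_references(core, label, sides=("5'", "3'")):
    """Return reference windows with the same length as `core`.

    Assumption: the 5' side is the lower-coordinate side of the core region.
    """
    refs = []
    for side in sides:
        if side == "5'":
            start, end = core.start - core.length, core.start
        elif side == "3'":
            start, end = core.end, core.end + core.length
        else:
            raise ValueError(f"unknown side: {side}")
        if start < 0:
            raise ValueError("reference window runs off the chromosome start")
        refs.append(Region(f"{label}-{side}", core.chrom, start, end))
    return refs


# Hypothetical coordinates, chosen only to give a 2.7 Mb core region.
defb_c6 = Region("DEFB-c6", "chr6", 49_000_000, 51_700_000)
refs_c6 = flanking_references(defb_c6, "DEFB-r6")

# Hypothetical coordinates for a 216 kb core with a 3' reference only,
# mirroring the Chr 20 case where the 5' side abuts a telomere or centromere.
defb_ca20 = Region("DEFB-ca20", "chr20", 100_000, 316_000)
refs_ca20 = flanking_references(defb_ca20, "DEFB-ra20", sides=("3'",))
```

Restricting `sides` to the 3’ side reproduces the one-sided reference windows used for the two Chr 20 clusters.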
To compare the ENCODE features across the regions selected above, we adopted a stepwise analysis. First, the genes in the corresponding regions with or without detectable transcription signals were counted. For the DEFB regions, 22 genes including 5 DEFB genes were found within the DEFB-c6 region, 35 genes including 7 DEFB genes within the DEFB-c8 region, 8 DEFB genes exclusively within the DEFB-ca20 region and 6 DEFB genes within the DEFB-cb20 region. For the reference regions, 14, 34, 7, 11, 9 and 6 genes were found in the DEFB-r6-5’, DEFB-r6-3’, DEFB-r8-5’, DEFB-r8-3’, DEFB-ra20 and DEFB-rb20 regions, respectively (Figure 2). Next, the scores of certain ENCODE features were weighted for each gene in these 10 defined regions according to the weighting treatments described in the “Methods” section. Finally, the scores of certain ENCODE features for all of the genes in each region were summed, and the differences in the ENCODE features between the DEFB-related regions and their reference regions were statistically evaluated using the paired Wilcoxon test.
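The last two steps (summing per-gene scores within each region, then comparing paired totals) can be illustrated as follows. This is a minimal sketch rather than the analysis code used in the study: the helper names are hypothetical, the numbers are invented, how the totals are paired is an assumption, and the test is written as an exact two-sided Wilcoxon signed-rank test that enumerates all sign assignments, which is only practical for small numbers of pairs.

```python
from itertools import product


def region_totals(gene_scores):
    """Sum per-gene feature scores within each region."""
    return {region: sum(scores) for region, scores in gene_scores.items()}


def wilcoxon_signed_rank(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks. The enumeration is
    exponential in the number of pairs, so keep n small (roughly <= 15).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_deviation = abs(w_plus - expected)

    # Under the null hypothesis every rank is positive or negative
    # with probability 1/2, so all 2**n sign patterns are equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_deviation - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Invented per-gene scores for two regions.
totals = region_totals({"DEFB-c6": [0.0, 1.5], "DEFB-r6-3'": [10.0, 2.5]})

# Invented paired totals (DEFB-related vs. reference), six pairs.
defb_totals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
reference_totals = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
w_plus, p_value = wilcoxon_signed_rank(defb_totals, reference_totals)
```

With six pairs that all differ in the same direction, the smallest attainable two-sided p-value is 2/64, which shows how few paired regions limit the resolution of such a test.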
3. Analysis of the ENCODE features in the DEFB and reference regions

1) DHS

The DHS distribution along the DEFB-related and reference regions was analyzed with Genome Browser 29. As depicted in Supplementary Figure 3, which includes the typical DHS distribution of the DEFB-ca20, DEFB-cb20, DEFB-ra20 and DEFB-rb20 regions, DHS signals were rarely observed in DEFB-ca20 and DEFB-cb20, which contain only a few DHS sites with weak signal intensities. In contrast, DHS sites were widely identified in DEFB-ra20 and DEFB-rb20, with more sites and higher signal intensities. A similar pattern was observed on Chr 6 and Chr 8. Importantly, the genes neighboring the DEFBs within the DEFB-related regions on Chr 6 or Chr 8 also appeared to have low DHS signals. This implies that the co-existence of poor transcription and decreased DHS signal is not a characteristic of the DEFB gene family, because there are many non-family genes in the DEFB-c6 and DEFB-c8 regions. The overall impact of DHS on the ten selected regions was further evaluated by accumulating the DHS scores per gene in each region. Figure 3A shows that the sums of the DHS signals in all of the DEFB neighbor regions, regardless of chromosome, were clearly higher than those in the corresponding DEFB-related regions. With a log10 axis for DHS intensity, the score difference between the DEFB-related and reference regions on Chr 6 or Chr 20a exceeded a hundred-fold, while that between the DEFB-related and reference regions on Chr 8 or Chr 20b was almost infinite. Together, the DHS distribution and the accumulated scores indicate that the poor transcriptional expression of the DEFB genes closely correlates with low DHS intensities.
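The accumulation and fold comparison described above can be sketched as follows. The function names, the invented scores and the pseudocount used for the log10 transform are assumptions made for illustration; how the published figure handled regions with a zero total is not stated here.

```python
import math


def accumulate(scores_by_region):
    """Sum per-gene DHS scores within each region."""
    return {region: sum(scores) for region, scores in scores_by_region.items()}


def fold_difference(reference_total, defb_total):
    """Ratio of reference to DEFB-related totals.

    A zero DEFB-related total gives an infinite ratio, which is one way to
    read an "almost infinite" difference on a log axis.
    """
    if defb_total == 0:
        return math.inf if reference_total > 0 else math.nan
    return reference_total / defb_total


def log10_for_plot(total, pseudocount=1.0):
    """log10 transform with a pseudocount (an assumption) so zero stays finite."""
    return math.log10(total + pseudocount)


# Invented per-gene DHS scores.
totals = accumulate({
    "DEFB-ca20": [0.0, 2.0, 0.0],
    "DEFB-ra20": [120.0, 300.0, 80.0],
    "DEFB-cb20": [0.0, 0.0],
    "DEFB-rb20": [40.0, 10.0],
})
fold_a = fold_difference(totals["DEFB-ra20"], totals["DEFB-ca20"])
fold_b = fold_difference(totals["DEFB-rb20"], totals["DEFB-cb20"])
```

In this toy input, `fold_a` is a finite ratio while `fold_b` is infinite because the DEFB-related total is zero.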
2) Histone modifications

For the HepG2 cell line, eleven histone modifications are listed in ENCODE, including eight different methylations, two different acetylations and the histone variant H2A.Z (H2az). By examining all 11 types of modification signal tracks in the ten regions in Genome Browser, we found that the signals of most of the modifications (8 of 11) showed a distinct pattern between the DEFB-related and reference regions (Supplementary Figure 4). Importantly, the modifications changed in step with the fluctuation in gene expression within these regions. Meanwhile, three histone modifications, H2az, H3k27me3 and H3k9me3, did not display a consistent pattern in their signals in these regions; however, the signals of these three marks in the DEFB-related regions were indeed different from those in the reference regions. It has been reported that histone modifications are mainly enriched in two genomic regions, the TSS and the gene body 30. To estimate the overall histone modification, we developed an assembling algorithm that integrates the histone modification scores from both the TSS upstream region and the fully covered gene body region. With this approach, we further surveyed the modification scores on each gene in these regions based on the ENCODE datasets. As depicted in Figure 3B, the signals of the 8 modifications on the right side were significantly enriched in the reference regions; specifically, the enrichment ratio per such modification was 92% in these regions. The large disparity in the enrichment ratios shown in Figure 3B suggests that these histone modifications could positively regulate gene transcription in all of the chromosomes of interest, because the genes in the reference regions exhibited higher transcription levels than those in the DEFB-related regions. Of the other three histone modifications, H3k27me3 and H3k9me3 have been reported to be regulatory elements that repress transcriptional expression 31. Moreover, a strong bias toward the reference regions was observed for H3k9ac and H3k27ac, implying activated function of histone deacetylases (HDACs) in the DEFB-related regions, which is in accordance with a previous study on the HDAC activity of DEFBs 32. In Figure 3B, the enrichment ratios in the DEFB-related regions were apparently increased relative to those of the modifications on the right side, especially in Chr 20.a and the Chr 6 group, whose proportions had increased by more than 50 compared with the average proportion of the eight modifications on the right side. This distinguishing histone modification pattern between the DEFB-related regions and their reference region(s) suggests that chromatin structure and the scarce expression of genes in the DEFB-related regions are highly correlated.
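To make the two quantities in this subsection concrete, the sketch below combines a TSS-upstream score and a gene-body score into one per-gene value and then computes the share of a modification's signal that falls in the reference region. The helper names, the equal default weights and all numbers are hypothetical; the actual weighting is described in the “Methods” section and is not reproduced here.

```python
def gene_modification_score(tss_signal, body_signal,
                            tss_weight=1.0, body_weight=1.0):
    """Combine TSS-upstream and gene-body signal into one per-gene score.

    The weights are placeholders, not the weighting used in the study.
    """
    return tss_weight * tss_signal + body_weight * body_signal


def reference_fraction(defb_total, reference_total):
    """Fraction of the combined signal that lies in the reference region."""
    combined = defb_total + reference_total
    if combined == 0:
        return None  # no signal in either region, so the ratio is undefined
    return reference_total / combined


# Invented (TSS, gene body) signals for one modification.
defb_genes = [(1.0, 1.0), (2.0, 4.0)]
reference_genes = [(30.0, 20.0), (25.0, 17.0)]

defb_total = sum(gene_modification_score(t, b) for t, b in defb_genes)
reference_total = sum(gene_modification_score(t, b) for t, b in reference_genes)
share = reference_fraction(defb_total, reference_total)
```

With these invented values the reference region holds 92 of 100 signal units, so `share` is 0.92; a repressive mark would be expected to give a smaller share.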
3) Correlation between gene expression, DHS and histone modification in the TSS

As reported by He et al., the DHS status is crucial for the cistrome element binding involved in histone modification 33. We checked the relationship between each histone modification score, the DHS score and the FPKM values in the TSS regions after normalization to their total scores. As depicted in Supplementary Figure 5, several factors appeared to be correlated. DHS and FPKM have a correlation index of 0.73, indicating that a DHS-enriched region favors gene transcription. The majority of the active regulatory histone modifications had average correlation indexes above 0.5 with DHS or FPKM, meaning that the correlation is acceptable, especially with only 120 genes in the statistical estimation. Surprisingly, some of the active regulatory histone modifications, such as H3k4me2, H3k4me3, H3k9ac, H3k27ac and H3k79me2, tightly correlated with one another with indexes greater than 0.8 on average, implying that they coexist in the TSS regions of the genes of interest and that their functions in regulating gene transcription are similar. For histone acetylation, several types, including H3k9ac and H3k27ac, have been defined as marks of active transcription, as reviewed by Selvi et al. 34. Andresen et al. reported that, in contrast to the suppression of gene expression induced by activated histone deacetylases (HDACs), the abundance of some HDACs, such as HDAC1-3, HDAC5 and HDAC8, was positively correlated with the expression status of DEFB1 in bronchial epithelial cell biopsies and bronchoalveolar lavage (BAL) fluids 35. We checked the large-scale transcriptome data of DEFBs as well as HDACs from the 70 cell lines and tissues, and found a poor correlation between the HDACs and DEFBs (r² = 0.1357 for DEFBs and r² = 0.0853 for DEFB1), implying that the observation of Andresen et al. is not a common case across cell lines and tissues. As mentioned above, we directly examined the acetylation of H3k9 as well as H3k27 in the ENCODE dataset and found that their absence in the DEFB-related regions correlated with the poor transcription of DEFBs (Figure 3B). We thus assume that HDACs in these cells and tissues play a common role in the regulation of DEFB gene transcription.
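The pairwise comparison described above can be sketched as follows. The text does not say which correlation measure was used, so Pearson's r is an assumption here, as are the helper names and the invented values. Dividing each feature by its total rescales the values by a positive constant and therefore leaves Pearson's r unchanged; r² is simply the square of r.

```python
import math


def normalize_to_total(values):
    """Divide each value by the sum of all values."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize: total is zero")
    return [v / total for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Invented per-gene TSS scores.
dhs = [0.0, 1.0, 2.0, 5.0, 9.0]
fpkm = [0.1, 0.8, 2.5, 4.0, 10.0]

r_raw = pearson_r(dhs, fpkm)
r_normalized = pearson_r(normalize_to_total(dhs), normalize_to_total(fpkm))
r_squared = r_raw ** 2
```

Applying `pearson_r` to every pair of features would fill a correlation matrix of the kind summarized in Supplementary Figure 5.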
4) TF binding

We collected data on the binding position and normalized intensity of 59 TFs in the HepG2 cell line, which were derived from ChIP-seq data in ENCODE. To estimate the overall TF effect on the regulation of transcription in specific regions, the TF score of a gene was defined as the TF binding intensity within its potential regulatory region. The TF scores for all of the 122 genes included in the 10 selected regions are summarized in Supplementary Figure 6. The genes in the DEFB-related regions generally possessed lower TF occupancy than those in their reference regions, with 68% and 16% of the genes free from TF binding, respectively. Interestingly, three genes in the DEFB-c6 region, DEFB107B, IL17A and PKHD1, had relatively high TF scores. Nine genes with no TF scores were found in 5 reference regions on the 3 chromosomes, suggesting that the TF score is not the sole factor determining gene transcription. To further illustrate the association of TF binding with gene expression in these regions, we evaluated the distribution of the average TF scores along them, which represents the TF binding intensity per gene. As presented in Figure 3C, for the TF classes, the fewest TFs were detected in the DEFB-ca20 and DEFB-cb20 regions, while the TF distribution in the other regions was relatively comparable. Moreover, for the TF binding intensity, the average TF scores in the DEFB-related regions were clearly lower than those of the reference regions, regardless of the chromosome. The TF assessments on the regions discussed above imply that the weak transcription of the DEFB-related regions on the 3 chromosomes likely resulted from fewer or weaker TF binding events in the corresponding regulatory regions.
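The TF score defined above (peak signal summed over a gene's potential regulatory region) and the share of genes without any TF binding can be sketched as follows. The extent of the regulatory region is not given in this section, so the window sizes below are placeholders; the peak tuples, helper names and numbers are likewise hypothetical.

```python
def regulatory_window(tss, upstream=5000, downstream=1000, strand="+"):
    """Half-open window around a TSS; the sizes are placeholder assumptions."""
    if strand == "+":
        return max(0, tss - upstream), tss + downstream
    return max(0, tss - downstream), tss + upstream


def tf_score(peaks, chrom, window_start, window_end):
    """Sum the signal values of peaks overlapping [window_start, window_end).

    Each peak is a tuple (chrom, start, end, signal) with half-open coordinates.
    """
    return sum(
        signal
        for peak_chrom, start, end, signal in peaks
        if peak_chrom == chrom and start < window_end and end > window_start
    )


def fraction_unbound(scores):
    """Fraction of genes whose TF score is zero."""
    if not scores:
        raise ValueError("no genes supplied")
    return sum(1 for s in scores if s == 0) / len(scores)


# Invented ChIP-seq peaks.
peaks = [
    ("chr6", 900, 1100, 12.5),
    ("chr6", 20000, 20200, 3.0),
    ("chr8", 950, 1000, 7.0),
]
window = regulatory_window(2000, upstream=1500, downstream=500)
score = tf_score(peaks, "chr6", *window)
unbound = fraction_unbound([0, 0, score, 0])
```

Here only the first peak overlaps the window on chr6, so `score` is 12.5, and three of the four toy genes have a zero score.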
Based on the information shown in Supplementary Figure 6, there were three important aspects to note. First, a comparison of TF binding status across the regions located on the different chromosomes showed that TF binding was markedly enriched on Chr 6. This observation agrees with a previous study showing that TF binding displays a biased distribution along chromosomes 36. Secondly, of the 59 TFs, only 5 TFs, USF1, RXRA2, CEBPB, MAFK and MAFF, displayed binding intensities that were positively correlated with the transcriptional abundance of individual genes. In contrast, over 90% of the TFs exhibited little correlation between TF binding and transcriptional levels in the HepG2 cell line (Supplementary Figure 7). Finally, of all of the TFs listed in the ENCODE database, approximately half were related to the HepG2 cell line, with only 24% suggested to be consistent transcriptional activators by Factorbook (http://factorbook.org/mediawiki/index.php/). In the human genome, over 1300 TFs have DNA binding affinity 37; however, only 300 have been experimentally examined to date according to the Human Transcriptional Regulation Interactions database (HTRIdb) 38. This implies that the current ENCODE data for the HepG2 cell line are still too limited to reveal the overall correlation between TFs and DEFB transcriptional status. Moreover, Taudien et al. found that copy number variations (CNVs) are enriched in the DEFB-related regions, and CNVs are assumed to be a regulatory factor that alters DEFB expression, interweaving their influence with that of TF binding 39. This complication may therefore also confound the interpretation of TF binding data in relation to DEFB expression.
Conclusions

The survey of transcriptomes and proteomes in three liver cancer cell lines, acquired from RNA-seq and LC-MS/MS data, presents clear evidence that DEFB gene expression products are scarcely detected. Extensive surveys of transcriptomes in over 70 cell lines further support this observation, indicating that the infrequent expression of DEFB genes is not cell-type dependent. We therefore hypothesized that the weak transcription of DEFB genes was influenced by transcriptional regulatory elements. We reasoned that the information on these elements in the ENCODE database would be sufficient to test this hypothesis. Three main features of ENCODE were utilized to systematically scrutinize the intensity of DEFB gene expression and that of its adjacent regions on three chromosomes. Compared with the adjacent regions, all of the DEFB regions, regardless of which chromosome they were located on, exhibited significantly weaker signals in the three ENCODE features. Although DEFB is a missing protein, our data suggest that the absence of DEFB protein directly results from the scarce transcription of its genes; the chromatin structure around the DEFB genes exerts a repressive effect on the gene expression as well. The DEFB-related regions on Chr 6 and 8 contain non-DEFB genes that also lack transcriptional evidence. This suggests that the weak transcription of DEFB genes is not a special feature of the gene family but is dependent upon their location and chromatin structure.

Supplementary Figures S1-S7 and Table S1 are available free of charge via http://pubs.acs.org/.
ACKNOWLEDGMENTS

This study was supported by the International Science & Technology Cooperation Program of China (2014DFB30020), Chinese National Basic Research Programs (2014CBA02002, 2014CBA02005), and the National High-Tech Research and Development Program of China (2012AA020202).

Conflict of Interests Statement:

The authors declare no competing financial interest.
References

(1) White, S. H.; Wimley, W. C.; Selsted, M. E. Structure, function, and membrane integration of defensins. Curr. Opin. Struct. Biol. 1995, 4, 521–527.

(2) Schröder, J.-M.; Harder, J. Human β-defensin-2. Int. J. Biochem. Cell Biol. 1999, 31, 645–651.

(3) Taylor, K.; Barran, P. E.; Dorin, J. R. Review: Structure-activity relationships in β-defensin peptides. Biopolymers - Peptide Science Section 2008, 90, 1–7.

(4) Semple, C. A. M.; Taylor, K.; Eastwood, H.; Barran, P. E.; Dorin, J. R. β-defensin evolution: selection complexity and clues for residues of functional importance. Biochem. Soc. Trans. 2006, 34, 257–262.

(5) Liu, Y.; Ying, W.; Ren, Z.; Gu, W.; Zhang, Y.; Yan, G.; Yang, P.; Liu, Y.; Yin, X.; Chang, C.; et al. Chromosome-8-coded proteome of Chinese Chromosome Proteome Data Set (CCPD) 2.0 with partial immunohistochemical verifications. J. Proteome Res. 2014, 13, 126–136.

(6) Wang, Q.; Wen, B.; Yan, G.; Wei, J.; Xie, L.; Xu, S.; Jiang, D.; Wang, T.; Lin, L.; Zi, J.; et al. Qualitative and quantitative expression status of the human chromosome 20 genes in cancer tissues and the representative cell lines. J. Proteome Res. 2013, 12, 151–161.

(7) Wang, Q.; Wen, B.; Wang, T.; Xu, Z.; Yin, X.; Xu, S.; Ren, Z.; Hou, G.; Zhou, R.; Zhao, H.; et al. Omics evidence: Single nucleotide variants transmissions on chromosome 20 in liver cancer cell lines. J. Proteome Res. 2014, 13, 200–211.

(8) Bals, R.; Wang, X.; Wu, Z.; Freeman, T.; Bafna, V.; Zasloff, M.; Wilson, J. M. Human beta-defensin 2 is a salt-sensitive peptide antibiotic expressed in human lung. J. Clin. Invest. 1998, 102, 874–880.

(9) Singh, P. K.; Jia, H. P.; Wiles, K.; Hesselberth, J.; Liu, L.; Conway, B. A.; Greenberg, E. P.; Valore, E. V.; Welsh, M. J.; Ganz, T.; et al. Production of beta-defensins by human airway epithelia. Proc. Natl. Acad. Sci. U. S. A. 1998, 95, 14961–14966.

(10) McNamara, N. A.; Van, R.; Tuchin, O. S.; Fleiszig, S. M. Ocular surface epithelia express mRNA for human beta defensin-2. Exp. Eye Res. 1999, 69, 483–490.

(11) Abedin, A.; Mohammed, I.; Hopkinson, A.; Dua, H. S. A novel antimicrobial peptide on the ocular surface shows decreased expression in inflammation and infection. Invest. Ophthalmol. Vis. Sci. 2008, 49, 28–33.

(12) ENCODE consortium. The ENCODE (Encyclopedia of DNA Elements) Project. Nature 2004, 306, 636–640.

(13) Djebali, S.; Davis, C. A.; Merkel, A.; Dobin, A.; Lassmann, T.; Mortazavi, A.; Tanzer, A.; Lagarde, J.; Lin, W.; Schlesinger, F.; et al. Landscape of transcription in human cells. Nature 2012, 489, 101–108.

(14) Spivakov, M.; Akhtar, J.; Kheradpour, P.; Beal, K.; Girardot, C.; Koscielny, G.; Herrero, J.; Kellis, M.; Furlong, E. E. M.; Birney, E. Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 2012, 13, R49.

(15) Whitfield, T. W.; Wang, J.; Collins, P. J.; Partridge, E. C.; Aldred, S. F.; Trinklein, N. D.; Myers, R. M.; Weng, Z. Functional analysis of transcription factor binding sites in human promoters. Genome Biol. 2012, 13, R50.

(16) Djebali, S.; Davis, C. A.; Merkel, A.; Dobin, A.; Lassmann, T.; Mortazavi, A.; Tanzer, A.; Lagarde, J.; Lin, W.; Schlesinger, F.; et al. Landscape of transcription in human cells. Nature 2012, 489, 101–108.

(17) Bernstein, B. E.; Birney, E.; Dunham, I.; Green, E. D.; Gunter, C.; Snyder, M. An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489, 57–74.

(18) Dong, X.; Greven, M. C.; Kundaje, A.; Djebali, S.; Brown, J. B.; Cheng, C.; Gingeras, T. R.; Gerstein, M.; Guigó, R.; Birney, E.; et al. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012, 13, R53.

(19) Consortium, I. H. G. Initial sequencing and analysis of the human genome. Nature 2002, 420, 520–562.

(20) Uhlén, M.; Fagerberg, L.; Hallström, B. M.; Lindskog, C.; Oksvold, P.; Mardinoglu, A.; Sivertsson, Å.; Kampf, C.; Sjöstedt, E.; Asplund, A.; et al. Tissue-based map of the human proteome. Science 2015, 347, 394–402.

(21) Landt, S. G.; Marinov, G. K.; Kundaje, A.; Kheradpour, P.; Pauli, F.; Batzoglou, S.; Bernstein, B. E.; Bickel, P.; Brown, J. B.; Cayting, P.; et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22, 1813–1831.

(22) R Core Team. R: A language and environment for statistical computing. R A Lang. Environ. Stat. Comput. 2014, 0.

(23) Kyte, J.; Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 1982, 157, 105–132.

(24) Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S. Metrics for the human proteome project 2013-2014 and strategies for finding missing proteins. J. Proteome Res. 2014, 13, 15–20.

(25) Taudien, S.; Szafranski, K.; Felder, M.; Groth, M.; Huse, K.; Raffaelli, F.; Petzold, A.; Zhang, X.; Rosenstiel, P.; Hampe, J.; et al. Comprehensive assessment of sequence variation within the copy number variable defensin cluster on 8p23 by target enriched in-depth 454 sequencing. BMC Genomics 2011, 12, 243.

(26) Groth, M.; Wiegand, C.; Szafranski, K.; Huse, K.; Kramer, M.; Rosenstiel, P.; Schreiber, S.; Norgauer, J.; Platzer, M. Both copy number and sequence variations affect expression of human DEFB4. Genes Immun. 2010, 11, 458–466.

(27) Swaney, D. L.; Wenger, C. D.; Coon, J. J. Value of using multiple proteases for large-scale mass spectrometry-based proteomics. J. Proteome Res. 2010, 9, 1323–1329.

(28) Thurman, R. E.; Rynes, E.; Humbert, R.; Vierstra, J.; Maurano, M. T.; Haugen, E.; Sheffield, N. C.; Stergachis, A. B.; Wang, H.; Vernot, B.; et al. The accessible chromatin landscape of the human genome. Nature 2012, 489, 75–82.

(29) James Kent, W.; Sugnet, C. W.; Furey, T. S.; Roskin, K. M.; Pringle, T. H.; Zahler, A. M.; Haussler, D. The human genome browser at UCSC. Genome Res. 2002, 12, 996–1006.

(30) Karlić, R.; Chung, H.-R.; Lasserre, J.; Vlahovicek, K.; Vingron, M. Histone modification levels are predictive for gene expression. Proc. Natl. Acad. Sci. U. S. A. 2010, 107, 2926–2931.

(31) Bannister, A. J.; Kouzarides, T. Regulation of chromatin by histone modifications. Cell Res. 2011, 21, 381–395.

(32) He, H. H.; Meyer, C. A.; Chen, M. W.; Jordan, V. C.; Brown, M.; Liu, X. S. Differential DNase I hypersensitivity reveals factor-dependent chromatin dynamics. Genome Res. 2012, 22, 1015–1025.

(33) Yip, K. Y.; Cheng, C.; Bhardwaj, N.; Brown, J. B.; Leng, J.; Kundaje, A.; Rozowsky, J.; Birney, E.; Bickel, P.; Snyder, M.; et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 2012, 13, R48.

(34) Selvi, B. R.; Kundu, T. K. Reversible acetylation of chromatin: Implication in regulation of gene expression, disease and therapeutics. Biotechnol. J. 2009, 4, 375–390.

(35) Andresen, E.; Günther, G.; Bullwinkel, J.; Lange, C.; Heine, H. Increased expression of beta-defensin 1 (DEFB1) in chronic obstructive pulmonary disease. PLoS One 2011, 6.

(36) Wang, J.; Zhuang, J.; Iyer, S.; Lin, X.; Whitfield, T. W.; Greven, M. C.; Pierce, B. G.; Dong, X.; Kundaje, A.; Cheng, Y.; et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012, 22, 1798–1812.

(37) Vaquerizas, J. M.; Kummerfeld, S. K.; Teichmann, S. A.; Luscombe, N. M. A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet. 2009, 10, 252–263.

(38) Bovolenta, L.; Acencio, M. L.; Lemke, N. HTRIdb: an open-access database for experimentally verified human transcriptional regulation interactions. BMC Genomics 2012, 13, 405.

(39) Taudien, S.; Groth, M.; Huse, K.; Petzold, A.; Szafranski, K.; Hampe, J.; Rosenstiel, P.; Schreiber, S.; Platzer, M. Haplotyping and copy number estimation of the highly polymorphic human beta-defensin locus on 8p23 by 454 amplicon sequencing. BMC Genomics 2010, 11, 252.

Figure Legends

Figure 1. Characterization of DEFB genes, including genomic location, expression status and physicochemical properties of their protein products. (a) The distribution of DEFB genes along human chromosomes 6, 8 and 20. The bands on the chromosomes stand for the mRNA density expressed by the corresponding genes. (b) The expression status of the DEFB genes in 76 different cell lines and tissues. The mRNA abundance is represented as the log2 (FPKM) of a DEFB gene. (c) Comparison of the tryptic peptide lengths between the proteins encoded by DEFB genes and those encoded by all human protein-coding genes. (d) Comparison of the tryptic peptide hydrophobicity between the proteins encoded by DEFB genes and those encoded by all human protein-coding genes.

Figure 2. The chromosome location and transcriptional level of genes located within the DEFB-related and reference regions. (a) Regions located on Chr 6. (b) Regions located on Chr 8. (c), (d) Regions located on Chr 20. The chromosome regions that mainly included DEFB genes were divided into ten regions, as described in the text. The x-axis represents the genomic positions and the y-axis shows the mRNA abundances. The dashed lines indicate the region partitions.

Figure 3. Characterization of the ENCODE features on the DEFB-related and the reference regions. (a) The accumulated log10 DHS scores in the ten regions on the three chromosomes described in the text. (b) The distribution of the individual histone modifications within each DEFB-related region and its neighboring reference region(s). Each bar represents the fraction of the related regions for one modification type. (c) Heatmap of the TF binding scores for the DEFB-related and reference regions. The gradient bar represents the log2 scores of TF binding.

Supplementary Information:

Supplementary Table 1. The annotation status of the DEFB genes in cross-reference databases.

Supplementary Figure 1. Number of total theoretical peptides and fit theoretical peptides (those with length between 7 and 35) of 34 DEFB proteins. X axis: DEFBs; Y axis: number of theoretical peptides, with black and green dots representing total peptides and peptides with length between 7 and 35.

Supplementary Figure 2. Heatmap of the RNA-Seq data and the RNC-Seq data of three C-HPP cell lines and the HepG2 RNA-Seq data (with biological replicates) from ENCODE.

Supplementary Figure 3. The tracks of DHS signal in DEFB regions and their neighboring reference region(s) in Genome Browser. The DEFB-related regions are colored light blue, while the reference regions are indicated by the red rectangle. (a) The DHS signals, gene positions and the corresponding transcription signal of DEFB-ca20 and its neighbor region; the features of each region are depicted by the tracks of their peak signals from the ENCODE Genome Browser. (b) The DHS signals, gene positions and the corresponding transcription signal of DEFB-cb20 and DEFB-rb20. (c) The DHS signals, gene positions and the corresponding transcription signal on chr8. (d) The DHS signals, gene positions and the corresponding transcription signal on chr6.

Supplementary Figure 4. The tracks of all 11 histone modification signals along the regions of interest on chr6, 8 and 20. (a) The eleven tracks representing the histone modification status of the corresponding regions are plotted together with the gene positions and the transcription signals for DEFB-ca20 and DEFB-ra20 in the same manner as Figure S2. (b) The histone modification signals along DEFB-cb20 and DEFB-rb20. (c) The histone modification signals along the regions on chr8. (d) The histone modification signals along the regions on chr6.

Supplementary Figure 5. The correlation matrix between each pair of DHS, FPKM and all the histone modification scores in TSS regions for all the genes in all 10 regions. The lower panel plots each pair of features as a scatter plot with the LOWESS smooth line, while the upper panel and diagonal panel present the correlation factors and the histograms, respectively.

Supplementary Figure 6. Binding status of 59 TFs for the 122 genes in the selected regions. Signal values of TF peaks mapped to potential regulatory regions of each gene were summed as the TF score. Log2 (TF scores) were plotted in the same manner as Figure 3.

Supplementary Figure 7. Correlation of the binding of 59 TFs and the expression of the 122 genes included in the selected regions. The binding patterns of the 59 TFs and the expression of the 122 genes were cross-correlated to estimate the correlation between each pair of them. The matrix of correlation scores was plotted with the “first principal component order” clustering method.

Figure 1. [Figure image not reproduced. Panel (a): DEFB genes on Chr 6, 8 and 20, marked as having no evidence or mRNA evidence. Panel (b): heatmap of DEFB gene expression across cell lines and tissues. Panel (c): fraction versus peptide length for DEFB peptides and whole-genome peptides. Panel (d): density versus GRAVY score for DEFB peptides and whole-genome peptides.]
Figure 2 (image not reproduced). Panels a to d: Relative expression (Log2(FPKM)) vs. Chromatin position (bp); series: All genes, DEFBs. Regions marked: DEFB-r6-5', DEFB-c6, DEFB-r6-3', DEFB-r8-5', DEFB-c8, DEFB-r8-3', DEFB-ca20, DEFB-ra20, DEFB-cb20, DEFB-rb20.
Figure 3 (image not reproduced). Panel a: DHS log10(scores) for Chr6, Chr8, Chr20a and Chr20b across the regions DEFB-r6-5', DEFB-c6, DEFB-r6-3', DEFB-r8-5', DEFB-c8, DEFB-r8-3', DEFB-ra20, DEFB-ca20, DEFB-cb20 and DEFB-rb20; legend: 5' neighbor region, DEFB-related region, 3' neighbor region. Panel b: Percentage vs. Histone modifications (H2az, H3k27ac, H3k27me3, H3k36me3, H3k4me1, H3k4me2, H3k4me3, H3k79me2, H3k9ac, H3k9me3, H4k20me1) for Chr6, Chr8, Chr20.a and Chr20.b; legend: 5' neighbor region, DEFB-related region, 3' neighbor region. Panel c: Expression and TF rows across the same regions; TFs: ARID3A, BHLHE40, CEBPD, CHD2, CTCF, ELF1, EP300, FOSL2, HNF4A, HNF4G, JUN, JUND, MAFF, MAFK, MAX, MBD4, MXI1, MYBL2, NFIC, NR2C2, POLR2A, RAD21, RCOR1, REST, RFX5, RXRA, SIN3AK20, SMC3, SP1, SP2, SRF, TAF1, TBP, TCF12, TCF7L2, TEAD4, USF1, USF2, YY1, ZBTB33, ATF3, ESRRA, EZH2, GABPA, GRp20, HSF1, IRF3, MAZ, NRF1, PPARGC1A, SREBP1, ZBTB7A, ZNF274, MYC, BRCA1, CEBPB, FOXA1, FOXA2, HDAC2.
For TOC only. 76x47mm (300 x 300 DPI)