A Novel Computational Approach To Predict Transcription Factor DNA Binding Preference

Yudong Cai,*,† JianFeng He,‡ XinLei Li,§ Lin Lu,‡ XinYi Yang,† KaiYan Feng,‖ WenCong Lu,⊥ and XiangYin Kong§

† CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
‡ Department of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200040, People's Republic of China
§ Institute of Health Sciences, Shanghai Jiao Tong University School of Medicine and Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200025, China
‖ Division of Imaging Science & Biomedical Engineering, Room G424 Stopford Building, The University of Manchester, M13 9PT, United Kingdom
⊥ Laboratory of Chemical Data Mining, Department of Chemistry, College of Sciences, Shanghai University, Shanghai 200444, China

Received September 5, 2008

Transcription, in which the information in DNA sequences is read out into RNA sequences, is one of the most important processes in the cell. Accurate prediction of the DNA binding preference of transcription factors is valuable for understanding the transcription regulatory mechanism1 and elucidating the regulation network.2-4 Here we predict the DNA binding preference of transcription factors based on protein amino acid composition and physicochemical properties, a 0/1 encoding system for nucleotides, the minimum Redundancy Maximum Relevance feature selection method,5 and the Nearest Neighbor Algorithm. The overall prediction accuracy in the jackknife cross-validation test is 91.1%, indicating that this approach is a useful tool for exploring the relation between a transcription factor and its binding sites. Moreover, we find that the secondary structure and polarizability of the transcription factor contribute most to the prediction. In particular, a 7-nt motif with an AT-rich region of the DNA binding sites discovered by our method is consistent with a statistical analysis of the TRANSFAC database.6

Keywords: Transcription factor • Transcription factor DNA binding preference • mRMR • Nearest neighbor algorithm • 0/1 System • Jackknife cross-validation test

Introduction

Transcription is the first step of reading out information from a DNA sequence into an RNA sequence; it is followed by translation, in which a protein is synthesized according to the RNA sequence. Transcription is a major regulatory step in gene expression because at this stage the cell determines which proteins are to be synthesized and at what rate they are produced.7 Transcription factors are involved in this stage by binding to specific DNA sequences and aiding the functioning of RNA polymerase. Generally, one transcription factor can bind to multiple DNA binding sites. Predicting the DNA binding preference of transcription factors by discovering transcription factor-transcription factor binding site pairs (TF-TFBS pairs) is useful for elucidating the regulatory mechanism and illuminating the regulatory network.2-4 Despite the importance of discovering TF-TFBS pairs, the traditional biochemical methods for finding those pairs are expensive and time-consuming. Thus,

* To whom correspondence should be addressed. E-mail: [email protected]. † Shanghai Institutes for Biological Sciences. ‡ Shanghai Jiao Tong University. § Shanghai Jiao Tong University School of Medicine and Shanghai Institutes for Biological Sciences. ‖ The University of Manchester. ⊥ Shanghai University.
DOI: 10.1021/pr800717y

© 2009 American Chemical Society

it is attractive for researchers to predict TF-TFBS pairs using computational methods, especially in the postgenomic era. In this paper, a novel computational method was developed to predict TF-TFBS pairs. Each TF-TFBS pair was converted into a 232-dimensional vector with the help of protein amino acid composition, physicochemical properties,8 and a 0/1 encoding system for nucleotides.9 The feature selection procedure contained three steps: the minimum Redundancy Maximum Relevance feature selection method (mRMR),5 Incremental Feature Selection (IFS), and Feature Forward Selection (FFS). The mRMR method calculated the relevance and redundancy of the features, and IFS and FFS extracted the features used to construct the predictor. The Nearest Neighbor Algorithm (NNA)10 was employed to construct the predictor, which was tested by jackknife cross-validation.11 After the feature selection procedure, 113 features were extracted to build the final predictor, which achieved a total accuracy of 91.1%. Among the selected features, some were more important than others for prediction because they contributed more to the increment of the total accuracy during the feature selection procedure. These features included the secondary structure and polarizability of the transcription factor, and some positions of the DNA binding sites. It is expected that analysis of the selected features will give some hints toward understanding the regulatory mechanism of transcription factors.

Journal of Proteome Research 2009, 8, 999-1003. Published on Web 12/19/2008.


Table 1. Numbers of Features of Every Physicochemical Property^a

property                           C     T     D     total
Amino acid composition             20    N/A   N/A   20
Hydrophobicity                     3     3     15    21
Predicted secondary structure      3     3     15    21
Predicted solvent accessibility    1     1     5     7
Normalized van der Waals volume    3     3     15    21
Polarity                           3     3     15    21
Polarizability                     3     3     15    21

^a N/A represents that there is no such feature.

Figure 1. IFS and FFS curves.

Materials and Methods

Data Preparation. The data set consisted of a positive data set and a negative data set. Every entry in the data set contained two parts: a TF and a TFBS. The positive data set contained TF-TFBS pairs that actually interact with each other. Originally, 5341 pairs (including 646 TFs and 3643 TFBSs) were obtained from TRANSFAC v7.0 (Public 2005, http://www.gene-regulation.com/pub/databases.html). TF-TFBS pairs without sufficient sequence information were excluded from the positive data set: TF protein sequences with lengths outside the range of 50 to 5000 residues, or containing the characters X, B, J, O, U, or Z, were removed. As a result, we obtained 3541 pairs (including 599 TFs and 2402 TFBSs) in the positive data set (for details, see the Supporting Information I). On the other hand, we constructed an artificial negative data set by randomly choosing one TF and one TFBS from the positive data set to form a TF-TFBS pair, with the requirement that the pair does not appear in the positive data set. The ratio between the number of pairs in the positive data set and that in the negative data set was set to 1:9; that is, the negative data set contained 31 869 noninteracting pairs (for details, see the Supporting Information I). The sequences of both TFs and TFBSs can be found in Supporting Information II.

Representation of TFs and TFBSs. To build a computational predictor of TF-TFBS pairs, one has to represent the TFs and TFBSs in numerical form. There are many encoding methods for proteins, such as the amino acid composition method,12,13 the GO (gene ontology) encoding system,14 the pseudo amino acid composition method,15 and so on. Previously, Qian et al. employed the GO encoding system to encode TFs.14 In this study, we utilized the amino acid composition and physicochemical properties to represent the TFs. The advantage of this method over Qian's is that it can be applied to newly sequenced TFs that have not yet been annotated, which Qian's method cannot handle. This representation method was first used by Yu et al. for the classification of nucleic acid binding proteins,8 and we describe it briefly here. Generally, protein function is largely determined by the primary sequence, so we can utilize the primary sequence to encode a protein. The method calculates the amino acid composition and physicochemical properties, such as hydrophobicity, predicted secondary structure,16 predicted solvent accessibility,16 normalized van der Waals volume, polarity, and polarizability, from the primary sequence of the protein. This representation contains not only composition (C) information, but also transition (T) and distribution (D) information. For example, suppose that every unit (a residue or a fragment) of a protein is classified into three classes, A, B, and C, according to one physicochemical property; the protein sequence can then be transformed into a sequence of A, B, and C. The composition of A is the percentage of A units among all units; the transition between A and B is the percentage of A/B transitions among all three types of transitions; the distributions of A are the percentages of the sequence length within which the first, 25%, 50%, 75%, and 100% of the A units are located. As a result, one protein can be encoded as a 132-dimensional vector (Table 1).

TFBSs were encoded with the 0/1 system. We assumed that TFBSs are DNA sequences no longer than 25 base pairs (bp). If a TFBS was shorter than 25 bp, it was elongated to 25 bp by adding 'N' suffixes. Every nucleic acid in the sequence was encoded as four binary digits: A (0001), C (0010), G (0100), T (1000), and N (0000).
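The two encodings described above can be sketched as follows. This is an illustrative sketch, not the authors' code: `ctd_features` assumes the residues have already been mapped to a hypothetical 3-class alphabet for one physicochemical property, and `encode_tfbs` follows the stated rule A=0001, C=0010, G=0100, T=1000, N=0000 with padding to 25 bp.

```python
def ctd_features(class_seq, classes=("A", "B", "C")):
    """Composition (C), transition (T), distribution (D) descriptors."""
    n = len(class_seq)
    comp = [class_seq.count(c) / n for c in classes]          # 3 C values
    pairs = list(zip(class_seq, class_seq[1:]))
    # 3 T values: fraction of each unordered class pair among adjacent pairs
    trans = [sum(p in {(a, b), (b, a)} for p in pairs) / max(len(pairs), 1)
             for i, a in enumerate(classes) for b in classes[i + 1:]]
    dist = []                                                 # 5 D values per class
    for c in classes:
        idx = [k for k, u in enumerate(class_seq) if u == c]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            # position (as a fraction of sequence length) of the first,
            # 25%, 50%, 75%, and last unit of class c
            dist.append((idx[int(round(frac * (len(idx) - 1)))] + 1) / n
                        if idx else 0.0)
    return comp + trans + dist                                # 3 + 3 + 15 = 21

NT_CODE = {"A": "0001", "C": "0010", "G": "0100", "T": "1000", "N": "0000"}

def encode_tfbs(seq, width=25):
    """0/1 encoding of a binding site, right-padded with N to 25 bp."""
    padded = seq.upper().ljust(width, "N")
    return [int(bit) for nt in padded for bit in NT_CODE[nt]]  # 100 digits
```

With this class count, each property contributes 3 + 3 + 15 = 21 values, matching Table 1.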
Accordingly, a 25-bp DNA sequence was represented by a sequence of 100 binary digits; that is, a TFBS was formulated as a 100-dimensional vector.

Finally, we hybridized the TF vector and the TFBS vector into a new vector, the TD vector, according to Qian et al.14 Because TFs and TFBSs contribute differently to TF-TFBS pair prediction, we assigned different weights to them. If the TF vector is TF = (t1, t2, ..., t132)^T and the TFBS vector is TFBS = (d1, d2, ..., d100)^T, then the TD vector is TD = (t1, t2, ..., t132, k·d1, k·d2, ..., k·d100)^T, where k is the weight for TFBSs. Here, we set k to 0.5, as in Qian's method. As a result, a TF-TFBS pair was represented as a 232-dimensional (132 + 100 = 232) vector. Each element of the vector is a feature of the pair, to be evaluated in the next step.

Minimum Redundancy Maximum Relevance (mRMR) Feature Selection Method. mRMR was first proposed by Peng for processing microarray data.5 Because some features in the vector are redundant and some are more relevant to the prediction than others, mRMR was used here to pre-evaluate the features and select the most relevant and least redundant ones according to the minimal-redundancy-maximal-relevance principle. The mRMR method is based on information theory; both redundancy and relevance are measured by mutual information, defined as

    I(x, y) = ∫∫ p(x, y) log [p(x, y) / (p(x) p(y))] dx dy    (1)

where x and y are two random variables, p(x, y) is the joint probability density, and p(x) and p(y) are the marginal probability densities.
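For discrete features, eq 1 reduces to a sum over observed value pairs, which can be estimated from paired samples with a plug-in estimator. The sketch below is illustrative (base-2 logarithm, so the result is in bits); it is not the authors' implementation.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) (discrete form of eq 1) from paired samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)           # marginal counts
    pxy = Counter(zip(xs, ys))                  # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with counts c, px[x], py[y]
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi
```

Perfectly correlated binary variables give 1 bit; independent ones give 0.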

The mRMR method selects the most relevant and least redundant feature (according to eq 4) from the nonselected feature pool ΩC, puts the feature into the selected feature pool ΩS, and removes it from ΩC at each round. The relevance D of a feature f with the target variable c is computed by eq 2:

    D = I(f, c)    (2)

The redundancy R of a feature f in ΩC relative to the features in ΩS is computed by eq 3:

    R = (1/m) Σ_{fi∈ΩS} I(f, fi)    (3)

Equations 2 and 3 are integrated into eq 4 to maximize the relevance and minimize the redundancy:

    max_{fj∈ΩC} [ I(fj, c) - (1/m) Σ_{fi∈ΩS} I(fj, fi) ]    (j = 1, 2, ..., n)    (4)

At the beginning, ΩC is the original, complete, and unordered feature pool, whereas ΩS is an empty set. At the end, we obtain a feature pool S containing all features in an ordered way; the earlier a feature is selected from ΩC and put into ΩS, the more relevant and less redundant it is:

    S = {f0, f1, ..., fh, ..., fN-1}    (5)

where h denotes the round at which the feature is selected.

Nearest Neighbor Algorithm (NNA). NNA is used to calculate the similarity between two variables.10 Here it was used to construct the predicting model in the feature selection step, based on the similarity between the testing sample and the training samples. The similarity between two vectors is defined as

    D(px, py) = 1 - (px · py) / (||px|| · ||py||)    (6)

where px · py is the inner product of px and py, and ||p|| is the modulus of vector p. The smaller D(px, py) is, the more similar vectors px and py are. A testing sample pt is classified into the category of the training sample ps that it is most similar to among all training samples:

    D(ps, pt) = min(D(p1, pt), D(p2, pt), ..., D(pn, pt), ..., D(pN, pt))    (n ≠ t)    (7)

Jackknife Cross-Validation Method. The independent data set test, the subsampling test, and jackknife cross-validation are three methods often used for cross-validation in statistical prediction. Among the three, jackknife cross-validation is believed to be the most effective and objective.11 In jackknife cross-validation, every sample in the data set is in turn left out and tested by the predictor trained on the remaining samples. During the validation, every sample thus belongs to both the training set and the testing set, so the correct rate of validation reflects the robustness of the algorithm and of the features used to construct the predictor. The accuracy for the positive data set is the percentage of accurately predicted TF-TFBS pairs in the positive data set; the accuracy for the negative data set is the percentage of accurately predicted non-TF-TFBS pairs in the negative data set; the overall accuracy is the percentage of accurately predicted pairs in the whole data set:

    accuracy_positive = (correctly predicted positive TF-TFBS pairs) / (positive TF-TFBS pairs)
    accuracy_negative = (correctly predicted negative TF-TFBS pairs) / (negative TF-TFBS pairs)
    total accuracy = (correctly predicted TF-TFBS pairs) / (all TF-TFBS pairs)    (8)

Feature Selection. After the mRMR step, we know the goodness of the features, but not how many or which features we should choose. The IFS and FFS steps were utilized to determine the optimal number of features and the optimal features.

Incremental Feature Selection (IFS). We construct N feature sets from the ordered feature set S (eq 5), where the i-th feature set is

    Si = {f0, f1, ..., fi}    (0 ≤ i ≤ N - 1)    (9)

Every feature set Si is an ordered set. For every i (0 ≤ i ≤ N - 1), we use NNA to construct a predictor with the features in Si and obtain a prediction accuracy evaluated by jackknife cross-validation. As a result, we get a curve named the IFS curve, with the accuracy as its y-axis and the index i of Si as its x-axis.

Feature Forward Selection (FFS). The IFS step produces a curve with an inflection. Suppose that the inflection lies at the point with k as its x-coordinate. We assume that the optimal feature set contains the first k features in S (eq 5), plus some features selected from the region starting at k. First, we construct a feature set F = Sk = {f0, f1, ..., fk} and a set S′ = {fk+1, ..., fN-1}. Then, for each of the m features in S′, we construct a candidate set containing that feature together with all features from F, and build m predictors, one for each candidate set. We calculate the correct prediction rates of these predictors, choose the one with the highest total correct rate, put the corresponding feature into F, and remove it from S′. We then proceed to the next round. The procedure terminates when the total correct rate begins to decrease as features are added into F from S′. At the end, we obtain an optimal and ordered feature set F.
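The NNA predictor and its jackknife evaluation (eqs 6-8) can be sketched as below. This is a minimal illustrative implementation assuming in-memory lists of feature vectors, not the authors' code; it reports only the total accuracy of eq 8.

```python
from math import sqrt

def nna_distance(px, py):
    """Similarity measure of eq 6: one minus the cosine of the angle."""
    dot = sum(a * b for a, b in zip(px, py))
    return 1.0 - dot / (sqrt(sum(a * a for a in px)) *
                        sqrt(sum(b * b for b in py)))

def jackknife_accuracy(samples, labels):
    """Leave-one-out NNA (eq 7): each sample is predicted by its nearest
    neighbor among all remaining samples; returns total accuracy (eq 8)."""
    correct = 0
    for t, pt in enumerate(samples):
        s = min((i for i in range(len(samples)) if i != t),
                key=lambda i: nna_distance(samples[i], pt))
        correct += labels[s] == labels[t]
    return correct / len(samples)
```

In IFS, this evaluation would be repeated once per feature-set size; in FFS, once per candidate feature per round.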

Results and Discussion

mRMR Results. The mRMR program was downloaded from Peng's Web site (http://research.janelia.org/peng/proj/mRMR/) and run with default parameters. The program outputs its result in a file containing two parts, the mRMR list and the MaxRel list. The mRMR list records feature indices as described in eq 4; the MaxRel list contains the relevance of every feature with the target variable as described in eq 2. Only the mRMR list was used in the following feature selection procedures. The mRMR output is presented in Supporting Information III.

IFS and FFS Results. Figure 1 shows the results of IFS and FFS. The results show that the inflection of the IFS curve lies around feature index 24, so we set k to 24 in the FFS procedure. The highest correct rate of IFS is 90.4% with 139 features, while the highest correct rate of FFS is 91.1% with 113 features, a little higher than that of the IFS procedure. The detailed rates can be found in Supporting Information IV.

Discussion of Features. The features used to construct the predictor contribute differently: some contribute more to the increment of the total accuracy. To explore the contributions of the features, we introduced a procedure to calculate them, as described below.

Table 2. Features Ranked by Their Contributions to the Increment of the Total Accuracy

order  feature index on FFS curve  features classification            feature information  accuracy   accuracy increment
1      6                           TFBS                               19                   0.424767   0.135414
2      7                           TFBS                               35                   0.537249   0.112482
3      3                           TFBS                               65                   0.279328   0.104208
4      1                           T vector of polarizability         NH/HN                0.1        0.1
5      8                           TFBS                               4                    0.628551   0.091302
6      2                           D vector of solvent accessibility  Last b               0.17512    0.07512
7      9                           TFBS                               25                   0.697458   0.068907
8      12                          TFBS                               12                   0.770037   0.05425
9      13                          TFBS                               8                    0.802288   0.032251
10     14                          TFBS                               38                   0.822875   0.020587
11     10                          TFBS                               90                   0.715052   0.017594
12     18                          TFBS                               13                   0.842135   0.012595
13     19                          TFBS                               21                   0.853318   0.011183
14     15                          TFBS                               45                   0.833917   0.011042
15     4                           D vector of secondary structure    First E              0.288788   0.00946
16     21                          TFBS                               52                   0.863005   0.007258
17     25                          TFBS                               26                   0.87303    0.005535
18     32                          TFBS                               18                   0.891076   0.004801
19     26                          TFBS                               29                   0.877775   0.004745
20     22                          TFBS                               43                   0.866252   0.003247
21     24                          TFBS                               49                   0.867495   0.003106
22     29                          TFBS                               33                   0.884552   0.00305
23     20                          TFBS                               87                   0.855747   0.002429
24     34                          TFBS                               16                   0.892262   0.001977
25     47                          TFBS                               28                   0.900424   0.001921
26     28                          TFBS                               44                   0.881502   0.001892
27     27                          TFBS                               84                   0.87961    0.001835
28     30                          TFBS                               82                   0.886049   0.001497
29     46                          TFBS                               39                   0.898503   0.001412
30     52                          TFBS                               1                    0.904688   0.001158
31     60                          TFBS                               77                   0.906637   0.001045
32     49                          TFBS                               41                   0.902033   0.001045
33     38                          D vector of polarizability         50% N                0.895058   0.000932
34     86                          TFBS                               24                   0.909969   0.000932
35     71                          TFBS                               14                   0.9085     0.000932
36     51                          TFBS                               48                   0.90353    0.000762
37     36                          TFBS                               62                   0.893505   0.000735
38     50                          TFBS                               79                   0.902768   0.000735
39     11                          T vector of polarizability         PH/HP                0.715787   0.000735
40     76                          TFBS                               63                   0.909122   0.000735
41     111                         Amino acid composition             C                    0.911296   0.000706
42     43                          TFBS                               93                   0.89647    0.000678
43     68                          TFBS                               50                   0.906891   0.00065
44     57                          TFBS                               20                   0.905253   0.00065
45     44                          TFBS                               36                   0.897119   0.000649
46     97                          TFBS                               61                   0.910477   0.000649
47     37                          TFBS                               37                   0.894126   0.000621
48     102                         TFBS                               40                   0.910816   0.000621
49     63                          TFBS                               7                    0.906975   0.000593
50     5                           D vector of van der Waals volume   75% P                0.289353   0.000565
51     56                          TFBS                               34                   0.904603   0.000565
52     101                         TFBS                               67                   0.910195   0.000565
53     48                          TFBS                               32                   0.900988   0.000564
54     35                          TFBS                               22                   0.89277    0.000508
55     42                          TFBS                               30                   0.895792   0.00048
56     73                          C vector of secondary structure    E                    0.908755   0.000424
57     87                          TFBS                               51                   0.910364   0.000395
58     69                          TFBS                               66                   0.90723    0.000339
59     70                          Amino acid composition             Y                    0.907568   0.000338
60     107                         TFBS                               60                   0.910731   0.000338
61     59                          TFBS                               10                   0.905592   0.000255
62     31                          T vector of secondary structure    CE/EC                0.886275   0.000226
63     40                          TFBS                               95                   0.895312   0.000198
64     84                          T vector of van der Waals volume   NH/HN                0.909291   0.000198
65     77                          TFBS                               98                   0.909263   0.000141
66     82                          TFBS                               96                   0.909009   0.000141
67     91                          D vector of van der Waals volume   First H              0.909969   0.000113
68     112                         TFBS                               75                   0.911381   8.5 × 10-5
69     83                          TFBS                               92                   0.909093   8.4 × 10-5
70     58                          TFBS                               31                   0.905337   8.4 × 10-5
71     39                          D vector of secondary structure    Last C               0.895114   5.6 × 10-5
72     95                          TFBS                               85                   0.909997   5.6 × 10-5
73     103                         TFBS                               91                   0.910844   2.8 × 10-5
For the FFS curve, the total accuracy of our model increases with the addition of new features. To evaluate the contribution of an individual feature, we calculated the increment of the total accuracy when that feature was added, and then sorted all features in descending order of their contributions. Table 2 shows the ranked features. The table contains six columns: the first is the ranking order according to the feature's contribution to the increment of accuracy; the second is the feature index ranked by the FFS procedure; the third and fourth are the feature classification and the specific information of the feature; the fifth is the total accuracy; and the last is the increment of accuracy resulting from adding the corresponding feature.

Every feature contributes differently to the increment of the accuracy. The 73 features shown in Table 2 all have a positive contribution, and TFBS features make up the majority (60/73 = 82.2%). The remaining 13 features describe the TF, covering polarizability, solvent accessibility, secondary structure, amino acid composition, and van der Waals volume.

TFBS. Not all TFBS features were analyzed; we focused only on TFBS features whose contribution to the total accuracy was larger than 0.005. As a result, we obtained the 14 most significant TFBS features and examined their information in detail. For example, the most significant TFBS feature is the 151st feature of the total vector, with a contribution of 0.135; it is also the 19th element of the TFBS vector (151 - 132 = 19). According to our encoding rules, cytosine (C) is encoded as 0010, so the 19th element (the third binary digit of the fifth 4-digit group, since 19 - 4 × 4 = 3) represents cytosine at the fifth position of the TFBS. Finally, the 14 most significant TFBS features are transformed into the sequence shown in Figure 2.

Figure 2. The 14 most significant TFBS features.

Interestingly, this sequence contains a 7-nt successive motif. We consider two reasons that might lead to this result.
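The index arithmetic in the worked example above (feature 151 → 19th TFBS digit → cytosine at position 5) can be captured in a small helper. This is an illustrative sketch: within each 4-digit group, digit 1 through digit 4 are set only for T, G, C, and A, respectively (T = 1000, G = 0100, C = 0010, A = 0001).

```python
DIGIT_TO_BASE = "TGCA"  # digit k of a 4-bit group is set only for this base

def decode_tfbs_feature(total_index, tf_dim=132):
    """Map a 1-based index in the 232-dim TD vector back to a TFBS
    position (1..25) and the nucleotide whose bit that digit encodes."""
    i = total_index - tf_dim          # 1-based index among the 100 TFBS digits
    position = (i - 1) // 4 + 1       # which nucleotide of the 25-bp site
    digit = (i - 1) % 4               # which of the four binary digits
    return position, DIGIT_TO_BASE[digit]
```

For instance, `decode_tfbs_feature(151)` recovers position 5 and base C, as in the text.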
First, a statistical analysis of the TRANSFAC database found that most TFBSs have a ~4-nt conserved core region;6 the 7-nt motif in our results might reflect this intrinsic core region. Second, traditional TFBS prediction methods seldom capture the dependence between positions within the binding site, whereas several computational methods that incorporate interposition dependence yield improvements in TFBS prediction.17-19 Furthermore, this sequence is AT-rich. AT-rich elements are known to be common binding sites for regulatory factors; for instance, the TATA-box is a common TFBS found in the promoter region of most genes. Moreover, an A-T base pair has two hydrogen bonds, compared to three in a G-C base pair, so AT-rich sequences might be favorable for unwinding of the TFBS by cofactors of the TF complexes.

Polarizability. Polarizability is another significant feature class among the 73 features. Many proteins coordinate a metal ion; in zinc-coordinating TFs, for example, the zinc ion interacts with several conserved residues in the hydrophobic core of the zinc finger. Molecular dynamics simulations of Zn2+ binding to Cys and/or His residues reveal that charge transfer and polarization effects are key factors in maintaining the conformation of zinc-coordinating proteins; neglecting polarizability leads to an abnormal, nontetrahedral Cys2His2 zinc binding conformation.20 Polarization effects also participate in protein-DNA and protein-protein interactions.21,22 Therefore, the significance of polarizability features in our results might be due to this underlying biological meaning.

Secondary Structure. In our 73-feature set, 4 features belong to the secondary structure class. Apparently, secondary structure is an important feature of the TF binding domain.23

Conclusion

Predicting the DNA binding preference of transcription factors is a demanding task in bioinformatics. To approach this goal, we employed a computational method that combines the amino acid composition and physicochemical features of the transcription factor with a 0/1 encoding system for the nucleotides of the binding site. Our method achieves an accuracy of 91.1%. Moreover, by ranking the features used for prediction, we assessed the contributions of the different features to predicting the pairs. This may give some hints about the mechanism of transcription.

Acknowledgment. This work is supported by the basic research grant of the Chinese Academy of Sciences (KSCX2-YW-R-112).

Supporting Information Available: Tables listing the detailed information of the data set, the TF and TFBS sequences, the output list of the mRMR method, and the success rates obtained using IFS and FFS. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) Kadonaga, J. T. Cell 2004, 116 (2), 247-257.
(2) Elowitz, M. B.; Leibler, S. Nature 2000, 403 (6767), 335-338.
(3) Luscombe, N. M.; Babu, M. M.; Yu, H.; Snyder, M.; Teichmann, S. A.; Gerstein, M. Nature 2004, 431 (7006), 308-312.
(4) Lee, T. I.; Rinaldi, N. J.; Robert, F.; Odom, D. T.; Bar-Joseph, Z.; Gerber, G. K.; Hannett, N. M.; Harbison, C. T.; Thompson, C. M.; Simon, I.; Zeitlinger, J.; Jennings, E. G.; Murray, H. L.; Benjamin Gordon, D.; Ren, B.; Wyrick, J. J.; Tagne, J. B.; Volkert, T. L.; Fraenkel, E.; Gifford, D. K.; Young, R. A. Science 2002, 298 (5594), 799-804.
(5) Peng, H.; Long, F.; Ding, C. IEEE Trans. Pattern Anal. Machine Intell. 2005, 27 (8), 1226-1238.
(6) Fogel, G. B.; Weekes, D. G.; Varga, G.; Dow, E. R.; Craven, A. M.; Harlow, H. B.; Su, E. W.; Onyia, J. E.; Su, C. Biosystems 2005, 81, 137-154.
(7) Alberts, B.; Johnson, A.; Lewis, J.; Raff, M.; Roberts, K.; Walter, P. How cells read the genome: from DNA to protein. In Molecular Biology of the Cell, 4th ed.; Garland Science: New York, 2002; Chapter 6, pp 299-374.
(8) Yu, X.; Cao, J.; Cai, Y.; Shi, T.; Li, Y. J. Theor. Biol. 2006, 240 (2), 175-184.
(9) Jia, P.; Shi, T.; Cai, Y.; Li, Y. BMC Bioinf. 2006, 7.
(10) Cai, Y. D.; Chou, K. C. J. Theor. Biol. 2006, 238 (2), 395-400.
(11) Chou, K. C.; Zhang, C. T. Crit. Rev. Biochem. Mol. Biol. 1995, 30 (4), 275-349.
(12) Nakashima, H.; Nishikawa, K.; Ooi, T. J. Biochem. 1986, 99 (1), 153-162.
(13) Chou, K. C. Proteins: Struct., Funct., Genet. 1995, 21 (4), 319-344.
(14) Qian, Z.; Cai, Y. D.; Li, Y. Biochem. Biophys. Res. Commun. 2006, 348 (3), 1034-1037.
(15) Chou, K. C. Proteins: Struct., Funct., Genet. 2001, 43 (3), 246-255.
(16) Pollastri, G.; Przybylski, D.; Rost, B.; Baldi, P. Proteins: Struct., Funct., Genet. 2002, 47 (2), 228-235.
(17) Georgi, B.; Schliep, A. Bioinformatics 2006, 22, e166-e173.
(18) Osada, R.; Zaslavsky, E.; Singh, M. Bioinformatics 2004, 20, 3516-3525.
(19) Hannenhalli, S. Bioinformatics 2008, 24, 1325-1331.
(20) Sakharov, D. V.; Lim, C. J. Am. Chem. Soc. 2005, 127, 4921-4929.
(21) Nadassy, K.; Wodak, S. J.; Janin, J. Biochemistry 1999, 38, 1999-2017.
(22) Jones, S.; van Heyningen, P.; Berman, H. M.; Thornton, J. M. J. Mol. Biol. 1999, 287, 877-896.
(23) Latchman, D. S. Families of DNA binding transcription factors. In Eukaryotic Transcription Factors; Elsevier/Academic Press: Amsterdam; Boston, 2007; Chapter 4, pp 96-154.
