J. Phys. Chem. 1993, 97 (12), 3000-3006

Multichannel Fourier Analysis of Patterns in Protein Sequences†

A. D. McLachlan

Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England

Received: October 28, 1992

The multichannel Fourier transform is a powerful new method for detecting weak periodic patterns in the sequences of protein molecules, against a background of noise. The result of the analysis is a combined power spectrum, analogous to a chi-square measure, which automatically extracts the most significant nonrandom features from the transforms of the original amino acids at any given frequency. The channels can be combined with either phased or unphased weights. Their degree of phase coherence is measured in terms of a coherence matrix and expressed through two polarization parameters. The multichannel method also provides new measures of the evolutionary self-similarity and other amino acid pairing correlations in terms of power spectra or spatial autocorrelation functions. Zinc finger proteins and the helical rod region of myosin are used to illustrate practical applications.

Introduction

The particular value of the Fourier transform in molecular sequence analysis is that it can reveal patterns which are not readily visible to the human eye, unlike simple sequence repeats, which are often seen without difficulty. Thus, Fourier methods are not just a mathematical luxury but give us unusually powerful ways of searching for unsuspected patterns in large molecules. The types of periodic patterns that have been found in proteins are many and varied. Other kinds of patterns that exist in DNA can be analyzed by extensions of the same methods, but this paper will be restricted to proteins.

Patterns in Proteins. Periodic repeated structural forms in three-dimensional space are often reflected in the protein amino acid sequence. The periods are not necessarily integers, and there need be no exact sequence repeats. Often, a certain type of amino acid tends to occur in a series of positions, rather than a specific residue. This distinguishes purely structural patterns from repetitive patterns that arise directly by gene duplication.1

Proteins show two main classes of periodicity:

(a) Long-range repeated structures in fibrous proteins, which are related to their self-assembly and packing in the cell. Examples are the periods of the 234-residue D-spacing in collagen,2-4 the 19.7-residue period of the acidic groups in tropomyosin,5-7 and periods of 28, 28/3, and 196 residues in myosin and paramyosin, related to the assembly of muscle fibrils.6-10

(b) Features related to the pitch of the α-helix (there are only weak features associated with the two sides of β sheets). Some examples are cited below. One of the most striking is the specific pattern of large hydrophobic residues at alternate spacings of 3 and 4 residues in helical coiled coils, such as tropomyosin or fibrinogen11 and the leucine zipper,12,13 which gives rise to periods of 7/2 and 7/3 residues. The tendency of polar and nonpolar residues to alternate on the two sides of a surface helix in globular proteins14 or on opposite faces of amphipathic helices in lipoproteins15,16 leads to periods close to 3.6 residues. Amphipathic helices in other proteins may be recognized through periodic features such as their hydrophobic moments.17-20 These moments are particularly marked, not only in the strongly amphipathic membrane proteins melittin and annexin,21,22 but also in transmembrane protein rods.23-25 Amino acid variability, as shown by substitutions within a protein family, is often biased to one side of a helix.26-28 For example, transmembrane helices may have a variable nonpolar surface toward the lipid, with an invariant polar core facing into the protein domain.29

Fourier Methods in Pattern Detection. Strong sequence periodicity is detected quite easily by eye, but weak patterns need more refined methods and are missed even by careful inspection. McLachlan and Stewart6 first showed how to pick out significant periodicity in a sequence from a background of random noise, and later developments4 increased the sensitivity of their method. Eisenberg's hydrophobic moment17 has been a particularly useful concept for the structural interpretation of helical patterns. Recently Cornette et al.30 analyzed amphipathic periodicity in a systematic way and showed how to define an objective hydrophobicity scale30,31 by extracting the best weighting factors from the power spectrum.

Current pattern detection methods usually require a choice between a search for the periodicity of a predetermined feature, such as hydrophobicity, variability, or charge, or a lengthy search on individual amino acid distributions. Our new method of multichannel transforms is designed to analyze a whole molecular spectrum automatically and minimize the effects of random noise. This increases the chance of detecting important regularities and greatly reduces the work required for interpreting the results. Multichannel transforms were first introduced by Liquori, but the new feature of our approach is the method for combining the channels together.

We shall work with Fourier transforms along the length of protein or DNA sequences, rather than in real three-dimensional space. Here distances are measured in terms of residues (amino acids or bases) and the wavenumber (in waves per residue) is commonly referred to as the "frequency", even though it has no time component.

The main method used here was originally developed in 1984 to analyze regular patterns in myosin and other proteins, but recent practical improvements now make it suitable for wider applications. The present paper outlines the principles only, leaving mathematical proofs and further refinements to be detailed elsewhere.

† Dedication: This paper is dedicated to Harden McConnell, who showed me the power of new and penetrating theoretical approaches to the analysis of structure in large molecules, and who taught me his own unique methods during a year in his laboratory at the California Institute of Technology in 1959-60.

0022-3654/93/2097-3000$04.00/0 © 1993 American Chemical Society

Multichannel Frequency Analysis of Sequences

The essence of the new method is to start from the separated Fourier transforms of the 20 different amino acids. Each amino acid represents one channel of spectral information. We then combine the transforms together, by assigning weights to them, in a way which separates the statistically significant features from the random noise. The combined power spectrum which emerges from this process is automatically scaled in terms of its significance.

We first need the Fourier transform of the distribution of each amino acid, individually, along the sequence. Thus, take a sequence of length $N$ made up of $\nu$ types of amino acid, in which the number of positions containing type $\alpha$ is $n_\alpha$ and the fractional composition is $x_\alpha = n_\alpha/N$. We set up $\nu$ multiple density distributions over the positions $r$ for each of the types $\alpha$. These are vectors of the form

$$p_{r\alpha} = 1, 0 \tag{1}$$

in which each 1 labels a position where amino acid $\alpha$ is present and the other positions are filled with zeroes. The example below illustrates the vectors for a short sequence made up of the four amino acids L, E, K, A.

position no. r:   1  2  3  4  5  6  7  8
sequence:         L  E  K  A  A  L  L  E

L                (1  0  0  0  0  1  1  0)
E                (0  1  0  0  0  0  0  1)
K                (0  0  1  0  0  0  0  0)
A                (0  0  0  1  1  0  0  0)

totals           (1  1  1  1  1  1  1  1)
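As a minimal sketch of the indicator vectors tabulated above, the following Python (NumPy) fragment builds one 0/1 channel per amino acid for the example sequence, read here as LEKAALLE. The function name `indicator_channels` and the array layout (one row per amino acid type, one column per position) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def indicator_channels(sequence, alphabet):
    """Return p with p[a, r] = 1 where residue alphabet[a] occupies position r, else 0."""
    p = np.zeros((len(alphabet), len(sequence)), dtype=int)
    for r, residue in enumerate(sequence):
        p[alphabet.index(residue), r] = 1
    return p

sequence = "LEKAALLE"   # example sequence, as read from the table above
alphabet = "LEKA"       # channel order: one row per amino acid type

p = indicator_channels(sequence, alphabet)
row_sums = p.sum(axis=1)                 # n_alpha: positions occupied by each type
column_sums = p.sum(axis=0)              # one and only one residue per position
composition = row_sums / len(sequence)   # x_alpha = n_alpha / N
```

The row sums give the counts of each amino acid and the column sums are all equal to one, which is the constraint that makes the channels linearly dependent.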

The sum of each row of the array is equal to $n_\alpha$. The rows are not independent, because there is one and only one amino acid in occupation at each position, so that the sum rule for each column is

$$\sum_\alpha p_{r\alpha} = 1 \tag{2}$$

The Fourier transforms of the vectors at wavenumber $m$ are defined as

$$T_{m\alpha} = N^{-1/2} \sum_{r=1}^{N} p_{r\alpha} \exp(2\pi i m r/N) \tag{3}$$

where $m = 0, 1, ..., (N-1)$. For nonzero frequencies, the sum rule (2) leads to an identity

$$\sum_\alpha T_{m\alpha} = 0, \quad \text{for } m \neq 0 \tag{4}$$

Weighted Property Profiles

To calculate the transform of some simple property of single amino acids, such as the side-chain volume, or the charge, we give each amino acid a weight $f_\alpha$ corresponding to the value of the property, and then the contribution to this property from position $r$ is

$$y_r = \sum_\alpha f_\alpha p_{r\alpha} \tag{5}$$

The corresponding Fourier transform of $y_r$ is a weighted sum of the $T_{m\alpha}$

$$Z_m = \sum_\alpha f_\alpha T_{m\alpha} \tag{6}$$

with an intensity $I_m = |Z_m|^2$. For the present we assume that the transform contains exactly $N$ harmonics. In practice, this choice may give too coarse a sampling of the periods, and it is preferable4 to embed the actual sequence of length $N$ into the left-hand end of a very long, almost continuous transform array of length $L$, padded out with a uniform background equal to $\langle y \rangle$, the mean of $y$ over the actual sequence. This device also removes spurious low-frequency terms caused by the step at the edge of the padded region. Different parts of the sequence and the transform may then be selected by using moving windows and filters.

For long randomly ordered sequences with a fixed composition6 the intensities (with $m \neq 0$) have an approximately exponential probability distribution

$$P(I_m) = \frac{1}{\sigma_Z^2} \exp(-I_m/\sigma_Z^2), \qquad \sigma_Z^2 = [N/(N-1)](\Delta y)^2 \tag{7}$$

where $\sigma_Z^2$ is the mean intensity and $(\Delta y)^2$ is the mean square variation of $y$ along the sequence about its actual mean $\langle y \rangle$. An important special case is where $y_r = 1$ or 0 only and thus represents the distribution of one amino acid type $\alpha$, which occurs at $n_\alpha$ positions, forming a fraction $x_\alpha = n_\alpha/N$ of the whole. Then $(\Delta y)^2 = x_\alpha(1 - x_\alpha)$. This useful result provides a significance test for many simple periodic features of a single type, but it does not allow for any correlations between different amino acid types.

Probability Distributions in Random Sequences

The statistical significance of the observed Fourier coefficients depends on their average properties for a set of random sequences of the same composition. The amino acids are shuffled in all possible orderings, keeping the compositions $n_\alpha$ and $x_\alpha$ and the length $N$ fixed. Each Fourier coefficient with $m \neq 0$ has a mean of zero. The mean square correlations of the complex coefficients are described by their covariance matrix $R$, whose elements are

$$R_{\alpha\beta} = \langle T_\alpha T_\beta^* \rangle = [N/(N-1)]\, x_\alpha (\delta_{\alpha\beta} - x_\beta) \tag{8}$$

Here $\delta_{\alpha\beta}$ is the Kronecker delta symbol. Thus, the matrix can be used to calculate the mean square Fourier component of any weighted property profile along the sequence:

$$\sigma_Z^2 = \langle |Z|^2 \rangle = \sum_{\alpha\beta} f_\alpha R_{\alpha\beta} f_\beta^* \tag{9}$$

It is important to note that the matrix $R$ is only positive semidefinite, since it is singular, with one zero eigenvector, the uniform vector (1, 1, ..., 1). In a long random sequence the probability distribution of the Fourier amplitudes for the different amino acids at some common frequency $m$ is approximately normal. The joint probability distribution for the $\nu$ complex variables $T_{m\alpha}$ forms a $(\nu - 1)$-dimensional cloud in the complex amplitude space, because of the sum rule (4). For simplicity we omit the subscript $m$ and then the full distribution takes the form

$$P(T_1, T_2, ..., T_\nu) = A e^{-\Omega_C}, \qquad \Omega_C = [(N-1)/N] \sum_\alpha |T_\alpha|^2 / x_\alpha \tag{10}$$

where $A$ is a constant. One immediate deduction from the general distribution is the expected two-dimensional normal probability distribution for the complex amplitude of the transform $Z$ of a property profile (we now omit the frequency $m$)

$$P(Z) = A' e^{-\Omega_Z}, \qquad \Omega_Z = |Z|^2/\sigma_Z^2 \tag{11}$$

Here, the definition of $|Z|^2$, as before, is the observed value of

$$|Z|^2 = \Big| \sum_\alpha f_\alpha T_\alpha \Big|^2 \tag{12}$$

The expected standard deviation, derived from the covariance matrix $R$, is

$$\sigma_Z^2 = \sum_{\alpha\beta} f_\alpha R_{\alpha\beta} f_\beta^* \tag{13}$$
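The relations in this section can be checked numerically on a short sequence. The sketch below is illustrative only: it assumes a transform normalized by $N^{-1/2}$ with positions counted from zero, and a shuffled-sequence covariance of the form $[N/(N-1)]\,x_\alpha(\delta_{\alpha\beta} - x_\beta)$, which is the form consistent with the mean intensity $[N/(N-1)](\Delta y)^2$ quoted in the text. The function names and the charge-like weights are hypothetical.

```python
from itertools import permutations
import numpy as np

def channel_transforms(p):
    """T[m, a] = N**-0.5 * sum_r p[a, r] * exp(2*pi*i*m*r/N), for m = 0..N-1."""
    N = p.shape[1]
    r = np.arange(N)                                  # positions counted from 0 (assumption)
    phases = np.exp(2j * np.pi * np.outer(r, r) / N)  # phases[m, r]
    return phases @ p.T / np.sqrt(N)

def covariance_matrix(x, N):
    """R[a, b] = N/(N-1) * x[a] * (delta_ab - x[b]) for shuffled sequences (assumed form)."""
    return N / (N - 1) * (np.diag(x) - np.outer(x, x))

sequence, alphabet = "LEKAALLE", "LEKA"
N = len(sequence)
p = np.array([[int(res == aa) for res in sequence] for aa in alphabet])
x = p.sum(axis=1) / N                     # fractional composition x_alpha

T = channel_transforms(p)
sum_rule = T[1:].sum(axis=1)              # sum over channels; vanishes for every m != 0

f = np.array([0.0, -1.0, 1.0, 0.0])       # hypothetical charge-like weights for L, E, K, A
y = f @ p                                 # property profile y_r
R = covariance_matrix(x, N)
sigma2_from_R = f @ R @ f                 # mean intensity from the covariance matrix
sigma2_from_y = N / (N - 1) * y.var()     # mean intensity from the variance of y

# Exhaustive average of the intensity at m = 1 over all orderings of the profile.
orderings = np.array(list(permutations(y)))
Z1 = orderings @ np.exp(2j * np.pi * np.arange(N) / N) / np.sqrt(N)
mean_intensity = np.mean(np.abs(Z1) ** 2)
```

Under these assumptions the three estimates of the mean intensity agree, and multiplying `R` by the uniform vector gives zero, which is the singularity noted in the text.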

Combining the Channels

Suppose that we have a natural protein sequence, for which the Fourier amplitudes $T_{m\alpha}$ of all the amino acids are known at a particular chosen frequency $m$. Rather than take each amino acid separately, and look for unusually significant strong periodicities, we wish to find the most significant terms by an unbiased procedure. Our method is to consider weighted profiles of the form (6) with adjustable weights $f_\alpha$ which are initially unknown and then choose those weights which lead to the lowest probability for the observations, since these represent the most significant features of the transform.

The channels of a given frequency can be combined with two kinds of weighting. Phased or complex weights, with both amplitude and phase, allow two waves to be coupled when the positions of their crests and troughs are unrelated. This is the most usual application. Unphased weights, which have positive or negative real values, reflect an assumption that there is a periodic feature along the length of the sequence which either attracts or repels each kind of amino acid, with a common phase. For example, the feature could be associated with a simple physical property like hydrophobicity or electric charge. Our analysis of the most significant unphased weights requires mathematics that is very similar to Cornette's way30 of deducing the best hydrophobicity scales, although the underlying scientific idea of statistical significance is very different.

The relative suitability of phased and unphased weights to describe a given frequency peak depends on the degree of phase coherence between the different channels. This is measured quantitatively in terms of a real two-by-two coherence matrix, or correlation matrix. This matrix yields two useful quantities: the angle of polarization, $\theta$, defines the best compromise phase for the real weighting, and the degree of polarization, $P$, measures the bunching of the phase-amplitude vectors along the axis of polarization.

The best complex weights seek the maximum of $\Omega_Z$ for given data, but with variable weights. That is, the maximum of

$$\Omega_C = \frac{\big| \sum_\alpha f_\alpha T_\alpha \big|^2}{\sum_{\alpha\beta} f_\alpha R_{\alpha\beta} f_\beta^*} \tag{14}$$

The unique solution (apart from a scale factor and a trivial additive constant) is obtained from the eigenvalues of the inverse $R$ matrix and is

$$f_\alpha = \frac{T_\alpha^*}{x_\alpha} = \frac{|T_\alpha|}{x_\alpha} e^{-i\phi_\alpha} \tag{15}$$

where the Fourier component is written in the form $|T_\alpha| \exp(i\phi_\alpha)$. The maximum value of $\Omega_C$ is immediately found as

$$\Omega_C = [(N-1)/N] \sum_\alpha |T_\alpha|^2 / x_\alpha \tag{16}$$

which is the same as in eq 10. Thus, the maximum corresponds to the negative logarithm of the joint probability in (10). The best complex weights act to counterbalance the phases of the observed Fourier coefficients and build a combined profile $Z$, in which all the phase-amplitude vectors are lined up in parallel (Figure 1, a and b).

Figure 1. Weighting of Fourier phase-amplitude vectors in a combined power spectrum: (a) unweighted vectors; (b) best complex weights align all the vectors in phase along the real axis; (c) best real weights obtained by projecting the vectors onto the axis of polarization.

The best real weights are deduced by picking out the real and imaginary parts from the expression for $\Omega_C$. They maximize the more complicated expression

$$\Omega_R = \frac{\big( \sum_\alpha f_\alpha T_{\alpha p} \big)^2 + \big( \sum_\alpha f_\alpha T_{\alpha q} \big)^2}{\sum_{\alpha\beta} f_\alpha R_{\alpha\beta} f_\beta} \tag{17}$$

To find the solution, we need to consider the real and imaginary parts of the Fourier coefficients separately and write $T_\alpha = T_{\alpha p} + iT_{\alpha q}$. The products of the real and imaginary parts are now combined together into a single coherence matrix, which contains all the information needed to find the best weights.

The Phase Coherence Matrix

The coherence matrix is a 2 × 2 matrix of the type

$$J = \begin{pmatrix} J_{pp} & J_{pq} \\ J_{qp} & J_{qq} \end{pmatrix} \tag{18}$$

The elements are defined as

$$J_{uv} = [(N-1)/N] \sum_\alpha T_{\alpha u} T_{\alpha v} / x_\alpha \tag{19}$$

Here $u$ and $v$ take the values $p$ or $q$. Then the maximum value of $\Omega_R$ is equal to the largest eigenvalue of $J$. This matrix is analogous to Stokes's matrix for describing the properties of polarized light or the quantum-mechanical density matrix for a spin of 1/2. The coherence matrix is positive semidefinite or definite, real and symmetric, so it has real eigenvectors $\xi$ and eigenvalues $\lambda$ which obey the equations

$$\begin{pmatrix} J_{pp} - \lambda & J_{pq} \\ J_{qp} & J_{qq} - \lambda \end{pmatrix} \begin{pmatrix} \xi_p \\ \xi_q \end{pmatrix} = 0 \tag{20}$$

We convert the matrix into the standard form

$$J = \tfrac{1}{2} J_0 \begin{pmatrix} 1 + P\cos 2\theta & P\sin 2\theta \\ P\sin 2\theta & 1 - P\cos 2\theta \end{pmatrix} \tag{21}$$

in which $J_0$ is a scale factor, equal to $\Omega_C$ in (16); $P$, the degree of polarization, is a positive number between 0 and 1; and the angle $\theta$ defines the principal axis of polarization in the complex plane, which is the direction along which the majority of the

[Figure 3 plot: "ZNF43 Zinc Finger Runs"; horizontal axis: Frequency (28), ticks at 0, 7, 14.]

Figure 3. Combined power spectrum $\Omega_C$ of a run of 22 zinc fingers with a regular spacing of 28 amino acids, taken from the human ZNF43 gene family (Lovering and Trowsdale). Strong peaks occur at all orders of 28.

chi-square one with mean and variance both equal to $\nu - 1$:

$$P(\Omega) = \frac{\Omega^{\nu-2} \exp(-\Omega)}{(\nu - 2)!}$$
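To make the channel combination concrete, the following sketch evaluates the combined power, the coherence matrix, and a tail probability from the density above for one frequency of a short test sequence. It is a hedged illustration, not the paper's program: the $N^{-1/2}$ normalization, the covariance $[N/(N-1)]\,x_\alpha(\delta_{\alpha\beta} - x_\beta)$, the weights $T_\alpha^*/x_\alpha$, and the reading of the density as the distribution of $\Omega_C$ in random sequences are all assumptions, and every name in the code is hypothetical.

```python
import math
import numpy as np

sequence, alphabet, m = "LEKAALLE", "LEKA", 1
N, nu = len(sequence), len(alphabet)
p = np.array([[int(res == aa) for res in sequence] for aa in alphabet])
x = p.sum(axis=1) / N
T = p @ np.exp(2j * np.pi * m * np.arange(N) / N) / np.sqrt(N)  # T_alpha at frequency m
R = N / (N - 1) * (np.diag(x) - np.outer(x, x))                 # assumed covariance

def omega(f):
    """Significance |sum_a f_a T_a|^2 / (f R f*) of a profile with weights f."""
    return abs(f @ T) ** 2 / (f @ R @ f.conj()).real

# Phased (complex) weights and the combined power Omega_C.
f_complex = T.conj() / x
omega_c = (N - 1) / N * np.sum(np.abs(T) ** 2 / x)

# Coherence matrix built from the real (p) and imaginary (q) parts.
parts = np.vstack([T.real, T.imag])
J = (N - 1) / N * (parts / x) @ parts.T
eigenvalues = np.linalg.eigvalsh(J)                   # ascending order
P = (eigenvalues[1] - eigenvalues[0]) / np.trace(J)   # degree of polarization
theta = 0.5 * np.arctan2(2 * J[0, 1], J[0, 0] - J[1, 1])  # axis of polarization

# Unphased (real) weights: project each vector onto the polarization axis.
f_real = (np.cos(theta) * T.real + np.sin(theta) * T.imag) / x

def tail_probability(omega_value, nu):
    """P(Omega >= omega_value) for the density Omega**(nu-2) * exp(-Omega) / (nu-2)!."""
    series = sum(omega_value ** j / math.factorial(j) for j in range(nu - 1))
    return math.exp(-omega_value) * series

p_value = tail_probability(omega_c, nu)
```

In this sketch the trace of `J` equals `omega_c`, the complex weights reproduce `omega_c` through `omega(f_complex)`, and the projected real weights reach the larger eigenvalue of `J`, so the gap between the two eigenvalues shows how much significance is lost by forcing a common phase.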

Figure 2. Fourier phase-amplitude vectors for different amino acids in the multichannel spectrum at a single frequency: (a) unpolarized waves with random phases; degree of polarization (P