Focus: SNP mining - Analytical Chemistry (ACS Publications)

Focus

SNP MINING

The rush is on. But how much gold will we find and how much pyrite?

Elizabeth Zubritsky

Researchers have not yet plumbed the depths of the Human Genome Project, but they are already looking for better prospects in the form of genetic variations called single nucleotide polymorphisms (SNPs, pronounced "snips"). Although SNPs are not the only type of genetic variation, many researchers expect them to be the most helpful for identifying disease genes. Others think SNPs will catapult us into the era of personalized medicine, when pharmacogenetics will enable physicians to prescribe drugs based on detailed knowledge of our genotypes.

But even as enthusiasm for SNP analysis grows, studies suggest that there is a lot of pyrite mixed with the gold. And nobody knows how long the commercial sector will have to dig to reach pay dirt. "A lot of people would say that we don't really know, ultimately, how valuable SNPs will be," says Chris Becker of GeneTrace Systems, one of several companies with a technology that can analyze SNPs. Researchers are just beginning to conduct SNP studies. "We're going to find out more about their value as more of these studies come out," Becker says.

But for now, the rush is to identify and compile SNPs. Several prominent companies, including Celera Genomics, Genset SA, and Incyte Pharmaceuticals, are creating proprietary SNP databases with the intention of charging pharmaceutical and other companies for access to the collections. In an effort to preserve free access to SNPs, the National Institutes of Health (NIH) has started its own database, dbSNP. This project, run by the National Center for Biotechnology Information, is one of several public databases worldwide, and NIH-funded projects are expected to deposit about 56,000 SNPs in dbSNP in the next 3 years. A much larger effort, announced in mid-April by the SNP Consortium, an association of 10 pharmaceutical companies, several academic laboratories, and the Wellcome Trust philanthropy, is projected to find 300,000 SNPs in 2 years. The consortium has pledged to keep its results public by putting the information in dbSNP.

Analytical Chemistry News & Features, October 1, 1999 683 A

Such databases will provide researchers with an unprecedented set of genetic markers, the equivalent of landmarks in the human genome. Geneticists need these markers to keep track of "recombination segments": blocks of 3000-30,000 base pairs in which SNPs tend to be associated with one another. These blocks are mixed and matched by the process of recombination, as shown in Figure 1. "It's like having a name for every street in a city," explains Fred Ledley of the pharmacogenetics company Variagenics. "In theory, if you had [a] marker within [each] of these blocks, you would have a tag for [every] piece of the human genome."

Other genetic markers are used, including microsatellites, which are segments where 2-4-bp "motifs" are repeated many times. But analyzing microsatellites can be difficult because the repeated sequences can cause misalignments during amplification or hybridization. Perhaps more importantly, SNPs are much more prevalent. When chromosomes from two people are compared, SNPs probably occur every 300-500 or 500-1000 bases, depending on whom you ask. This density should provide a detailed map of the genome and allow researchers to locate regions of interest more precisely, the same way that having short streets instead of long ones helps pinpoint a particular house.

This is especially important when looking for the roots of "complex" diseases such as diabetes and heart disease, which seem to involve several genes. Researchers want to know not only which genes contribute but, ultimately, which polymorphisms. "You want to be able to home in on things [very] specifically because there could be many candidates," explains Rosalind Harding of Oxford University (U.K.). Genetic studies of these diseases are quite difficult because the current markers are relatively infrequent, but many researchers say having a good SNP map would make such experiments more feasible.

Most SNPs are identified through "brute force" DNA sequencing. In fact, many are being discovered as by-products of the Human Genome Project. Though not especially glamorous, sequencing has enough throughput and accuracy to make it a good mining method. But finding SNPs is just the beginning. They have to be validated (confirmed as true variations, not sequencing errors) and scored (the range of variations identified). And their true worth won't be known until researchers have conducted clinical studies to extract the rare SNPs that have a role in disease or in a person's response to pharmaceuticals. "Although having a SNP database is a valuable resource per se," says Denis Grant of Orchid Biocomputer, "it is very important to realize that a large number of SNPs will have no functional consequences. [So] it will always be important to correlate SNPs with some sort of a phenotype [i.e., a trait or disease]."

Linkage versus association

Figure 1. Recombination produces new genotypes. In this representation, a single crossover occurs between the "a" and "b" loci (regions containing genes). Thus, instead of the initial distribution, in which variants (a,b) and wild types (+,+) are separated, some mixing occurs, yielding (+,b) and (a,+).

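The single crossover described in the Figure 1 caption can be mimicked in a few lines of Python. This is only an illustrative sketch: the function name `single_crossover` and the representation of a chromosome as a tuple of alleles are assumptions made here, not anything taken from the article.

```python
def single_crossover(chrom1, chrom2, point):
    """Swap the parts of two chromosomes that lie after position `point`.

    Each chromosome is a tuple of alleles, one entry per locus
    (a hypothetical representation chosen for this sketch).
    """
    if len(chrom1) != len(chrom2):
        raise ValueError("chromosomes must carry the same loci")
    if not 0 < point < len(chrom1):
        raise ValueError("crossover point must fall between two loci")
    recombinant1 = chrom1[:point] + chrom2[point:]
    recombinant2 = chrom2[:point] + chrom1[point:]
    return recombinant1, recombinant2


# Starting chromosomes as in Figure 1: variants (a,b) and wild types (+,+).
variant = ("a", "b")
wild_type = ("+", "+")

# One crossover between the "a" and "b" loci.
print(single_crossover(variant, wild_type, 1))
```

With the crossover placed between the two loci, the function returns the mixed genotypes (a,+) and (+,b) named in the caption.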

To date, researchers have relied on linkage analyses to determine which genes are associated with diseases, says Ledley. In a simplified version of this type of study, two traits, for example, big feet and curly hair, would be traced in several families. Geneticists could estimate the distance between the underlying genes by determining how often family members had both big feet and curly hair (assuming that the genes occurred on the same chromosome). If the traits occurred together most of the time, then the curly hair and big feet variants would be "linked"; that is, they would tend to remain together during recombination. Thus, if the location of the gene for hair curliness were known and if foot size were linked to it, then finding the foot size gene would be a matter of sequencing a relatively small region around the hair curliness gene.

Because linkage analyses are conducted in families or other homogeneous populations, such as Mennonites or Icelanders, there are relatively few recombination events (that is, less mixing and matching of DNA blocks), and the relationships between the genes remain clear. "Linkage is very powerful," says Ledley. "That's what genomics has been all about until today." However, finding families with the particular traits or diseases requires a tremendous amount of work. Another problem is that some researchers think linkage analysis is unlikely to identify genes that contribute only a small amount to a trait.

The alternative is an association study, in which people who have the trait (or, more often, the disease) and a control group are selected from the general population. Researchers then look for marker alleles: variants that consistently show up in people who have the disease. This approach is more difficult because researchers have to contend with recombination events that have accumulated over, perhaps, tens of thousands of years. In a sense, recombination has had time to shuffle the DNA into smaller blocks, and that requires more markers. To cover the entire genome, anywhere from 100,000 to 1,000,000 markers might be needed for an association study, compared with 2000-3000 markers for a linkage analysis.

Although association studies have been less common, they have been successful, according to Aravinda Chakravarti of Case Western University. "There have clearly been many successes [when this approach is used] for disorders with a simple inheritance," he says. "We use this principle over and over again." Preliminary evidence suggests the approach also can be used in more complex diseases. Perhaps the most cited example is a study by researchers at Glaxo Wellcome involving the APOE gene, which is correlated with Alzheimer's disease (1). The researchers concentrated on a 4-Mb region around the gene and identified 51 SNPs. Ten of those SNPs had frequencies greater than expected, suggesting that they could be used as disease markers. The researchers compared these results with other marker methods and found that SNPs provided a more efficient way of identifying the gene.

Chakravarti has been an advocate of genome-wide, or "whole-genome", association studies. So have Celera and Genset. But, because the SNP map is not complete, nobody has ever been in a position to fully test the whole-genome method. "It's a theoretically compelling concept," says Roy Whitfield of Incyte, who instead favors the so-called candidate-gene approach. "But it's going to take a lot of engineering and a lot of work to make [the whole-genome approach] happen."

Lisa Brooks of the National Human Genome Research Institute agrees. "At this point, it would be too hard to select 50 people with a disease and 50 people without a disease and look through the entire genome to find the gene responsible," she says. And 100 people wouldn't be close to enough. Achieving the statistical power needed for such an analysis could require anywhere from 1000 to 30,000 people, so the amount of data becomes overwhelming very quickly. Although many researchers look to bioinformatics to sort through the data that will be generated, the mathematical tools aren't able to analyze that much data yet, according to Brooks.

The candidate-gene approach, on the other hand, reduces the complexity of the analysis by targeting only coding regions that are suspected of being involved in diseases or traits. The next step might be saturating a gene (determining every variation that occurs in a population) and asking which of those are likely to affect protein or RNA function. The goal is to "weight" the statistics in favor of finding the variations that are likely to matter.
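To make the association-study idea concrete, here is a minimal Python sketch of the comparison it rests on: counting a marker allele in people with the disease and in controls, then asking whether the difference is larger than chance would explain. The allele counts are invented for illustration, the function name `allele_association` is hypothetical, and the Pearson chi-square test on a 2x2 table is one common choice assumed here, not a method the article specifies.

```python
import math


def allele_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls; columns are the marker allele ("alt")
    and the other allele ("ref"). Returns (chi-square statistic, p-value).
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("each row and column needs at least one observation")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 50 cases and 50 controls, two alleles per person.
chi2, p = allele_association(case_alt=40, case_ref=60, ctrl_alt=25, ctrl_ref=75)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

A whole-genome study would repeat a test like this for each of the 100,000 or more markers, which multiplies the opportunities for chance findings and is one reason the required sample sizes grow so quickly.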

Not by SNPs alone

Despite the potential of SNP analyses, many researchers agree that association studies require other kinds of information. That was Harding's conclusion when she re-examined data on the β-globin gene, which contains a mutation that causes sickle cell anemia but also protects against malaria. Harding and her colleagues collected sequences from more than 500 people and found 20-30 SNPs in a region ~3000 bp long. Although the gene and its mutations are well characterized, Harding approached the study as if she didn't have that prior knowledge. "I came back to the question, 'If nobody had ever known that sickle was a protective variant and if we just had this sea of polymorphisms, what would we have to do to spot sickle as [...]'"

As it turned out, having a lot of SNPs present, some of them quite common, didn't guarantee that any would have a strong association with the sickle variant, says Harding. The problem was that the sickle mutation is recent, she explains. To be good markers for disease, variations need to be "younger" than the targeted mutation. But because most of the SNPs present were older than the sickle cell mutation, their relationships to it were "confused". "It's highly unlikely you would have found anything this [...]," she says.

To get around this problem, Harding's group "resolved" the SNPs; that is, they physically assigned the SNPs and determined which of the two chromosomes (one inherited from the mother and one from the father) each variant was on. This told the researchers which haplotypes (combinations of SNPs) really existed and which were merely mathematical possibilities. "If every recombination event produced a haplotype that was passed on to future generations, you'd see every combination of SNPs," Harding explains. In that sense, haplotyping is equivalent to performing a study in a family or other select group of people. Ledley describes it as "getting back toward the power of linkage." Harding says it is a crucial step in association studies using random individuals; otherwise, the potential for false positives is high.

Andy Brown of Sequenom agrees that haplotypes are important. "The information content is higher if you combine some of the markers," he says. But first, geneticists need high-throughput methods for haplotyping. If physically resolving the haplotypes is not possible, researchers will need to collect genetic information from each participant's parents to determine how the markers have segregated, says Harding. Brooks agrees that such family histories will be important for determining which SNPs contribute to disease, as will medical histories. She says that some databases can accept haplotypes and family histories.
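The distinction Harding draws between haplotypes that really exist and those that are merely mathematical possibilities can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: alleles are coded 0 (common) and 1 (variant), an unresolved genotype is written as the number of variant alleles at each SNP (0, 1, or 2), and the function name `compatible_haplotype_pairs` is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List every unordered pair of haplotypes consistent with an unphased genotype.

    `genotype` holds, per SNP, the number of variant alleles (0, 1, or 2).
    Only heterozygous SNPs (value 1) are ambiguous, so every way of
    assigning their variant allele to one chromosome or the other is tried.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for phase in product((0, 1), repeat=len(het_sites)):
        hap1 = [g // 2 for g in genotype]  # homozygous sites: 0 -> 0, 2 -> 1
        hap2 = list(hap1)
        for site, allele in zip(het_sites, phase):
            hap1[site] = allele
            hap2[site] = 1 - allele
        # Sort the two haplotypes so (A, B) and (B, A) count as one pair.
        pairs.add(tuple(sorted((tuple(hap1), tuple(hap2)))))
    return sorted(pairs)


# A person heterozygous at the first two of three SNPs.
for pair in compatible_haplotype_pairs((1, 1, 0)):
    print(pair)
```

For a person heterozygous at two SNPs, the sketch lists two candidate haplotype pairs. The genotype alone cannot say which pair is real; physically resolving the chromosomes, or consulting the parents' genotypes, is what settles the question.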
However, researchers have not decided yet how to incorporate medical and phenotype information. Harding is not alone in questioning how productive association studies will be. Joseph Terwilliger of Columbia University says, "I don't think I am the only statistical


geneticist who will tell you that association analysis is not going to be very successful." SNPs are a satisfactory tool, Terwilliger says, but the power of a study depends on its design, not on having a higher density of markers. He argues that researchers tend to overlook two problems when designing association studies: that different SNPs may be needed for each population studied and that a single SNP may have different effects in various populations. "Powerful statistical methods will not salvage poorly designed experiments," he says. In his opinion, studying affected families has a better chance of paying off because it increases the likelihood of identifying a genetic (as opposed to environmental) cause for the disease and of finding the same variation in many individuals.

Recent results from Chakravarti's group and Eric Lander's group at the Whitehead Institute also indicate that extracting useful SNPs can be more difficult than researchers previously thought (2, 3). Both groups found that many variations either were in noncoding regions of DNA or were in coding regions but did not alter protein sequence. Chakravarti notes that SNPs in noncoding regions might affect protein regulation and thus might still be important, but he acknowledges that the results were surprising. "[It was] surprising relative to many of the previous studies where we could only [...] few of the variants," he adds. "That clearly [...] very incorrect notion of the number and the type and the frequencies of variants."

Cost of analysis

Despite the projected benefits of conducting SNP analyses, nobody knows if they will succeed commercially. Association studies have the same two fundamental economic and technical challenges. The first is performing 100,000 or more tests on each patient. The second is analyzing DNA from enough individuals to make the data meaningful. "The state-of-the-art and all the attention is really on ... enabling the throughput that people need and decreasing the cost," says Ledley.

The cost per SNP is now 10¢-$1, Ledley estimates, but nobody knows how low it needs to be before these analyses will become practical. He suggests that 0.10 might make associations an effective research tool, but the cost would have to come down further for clinical use. Some industry people even say that the cost will never be low enough if it is "cost per SNP". It must become "cost per patient" no matter how many SNPs are analyzed.

But Whitfield argues that it's too early to draw such conclusions. He also contends that it's misleading to lump all SNP analyses together, arguing that the economic considerations and value propositions (what companies offer clients) are different for the candidate-gene and whole-genome approaches. The only thing companies know for certain is that it will be critical to keep evaluating the utility and cost of SNPs, he says. "We're going to see an explosion of data in the field in the next couple of years," says Whitfield. "We should take a look at the data when it comes out, instead of prognosticating now." Chakravarti agrees, saying, "Right now, there are a lot of arguments [about SNPs], which are almost all theoretical. ... We are just beginning to do and understand these studies."

Nevertheless, Chakravarti says he still thinks large association studies will be valuable. For example, they may reveal patterns of variations that can't be seen in narrower data sets. "[Anyway,] the question isn't whether one likes or does not like genomewide association studies," says Chakravarti. "We will, in all probability, need a variety of methods [to find disease genes]."

NIH's genetic variation program supports research on discovering and scoring SNPs and other types of sequence variation; developing high-resolution maps of genetic variation; and developing methods for the large-scale analysis of genetic variation. For more details, see http://www.nhgri.nih.gov/About_NHGRI/Der/variat.htm or contact Lisa Brooks at [email protected].

References
(1) Lai, E.; Riley, J.; Purvis, I.; Roses, A. Genomics 1998, 54, 31-38.
(2) Halushka, M. K.; Fan, J.-B.; Bentley, K.; Hsie, L.; Shen, N.; Weder, A.; Cooper, R.; Lipshutz, R.; Chakravarti, A. Nat. Genet. 1999, 22, 239-47.
(3) Cargill, M.; Altshuler, D.; Ireland, J.; Sklar, P.; Ardlie, K.; Patil, N.; Lane, C. R.; Lim, E. P.; Kalyanaraman, N.; Nemesh, J.; Ziaugra, L.; Friedland, L.; Rolfe, A.; Warrington, J.; Lipshutz, R.; Daley, G. Q.; Lander, E. S. Nat. Genet. 1999, 22, 231-38.