Anthrax - ACS Publications - American Chemical Society

Microbial Forensics: DNA Fingerprinting of Bacillus anthracis (Anthrax)


Paul Keim, Talima Pearson, and Richard Okinaka • Northern Arizona University

In the aftermath of the 9/11 attacks, the U.S. Postal Service was the target of an anthrax bioterrorism attack. The event’s impact was relatively small in mortality and morbidity but enormous in its economic and security consequences. As U.S. health and law enforcement officials pursued the perpetrator, knowing the exact strain type of Bacillus anthracis was invaluable for narrowing the potential sources and for defining the crime scene itself. Precise strain genotyping tied seemingly disparate infections in Connecticut, New York, Florida, and Washington, D.C., to a single crime while eliminating cases due to natural causes.

Forensic sciences have increasingly become dependent on DNA evidence for the identification and elimination of criminal suspects. Likewise, microbial forensics has built upon this great success with some modifications to make this science relevant to the unique biology of microbes and the specific pathogens used in biocrimes.

Microbial forensics is based upon DNA analysis of biological weapons agents, such as B. anthracis. (Image credit: CDC/Marshall Fox, Arthur E. Kaye, Laura Rose)

Analytical Chemistry, July 1, 2008. © 2008 American Chemical Society

Overview of human DNA forensics

Molecular genetics has dramatically altered the field of human forensics analysis by providing one of the most powerful and definitive tools for the legal system. The process of sequencing DNA and the huge technical advances stimulated by the Human Genome Project and the discovery of PCR forever changed forensics (1–4). In only a few decades, DNA analysis has become the gold standard of forensic investigation. Before these advances, the identification and detection of variable human genetic markers required complex and tedious genetic cloning and DNA probing techniques. Nowadays, a human DNA signature from a latent, nearly invisible sample (sometimes as small as a single cell) can be analyzed and compared with great ease with large genetic databases. It is against this backdrop that the Combined DNA Index System (CODIS), the current Federal Bureau of Investigation (FBI)-funded computer system that is used to solve crimes by comparative DNA profile analysis, was created. The anthrax letters of 2001 brought attention to similar DNA signature needs for microbial and viral pathogens. However, these needs are not a recent problem—methods to subtype different strains (subsets of a microbial species that have identifiable differences) during epidemics or outbreaks were available already. In an infamous 1990 case, the transmission of HIV and AIDS from a dentist to his patients was eventually corroborated by genomic sequence analysis of the virus samples obtained from the dentist and the patients (5). Fortunately, the molecular changes in DNA (or RNA) that are suitable for signature development are similar among humans, microbes, and most other organisms. However, significant differences exist in developing and interpreting signatures in humans versus microorganisms (box on the next page). Before we delve into applications, it is useful to understand why and how DNA fingerprinting became a powerful tool. 
DNA molecules are accurately replicated or copied and then distributed to their progeny. Genetic variation in these genomes can be introduced (at very low rates and over relatively long time spans) by several mechanisms, including errors in the replication process and genetic exchange between two related cells. These errors cause observable differences between two otherwise identical genomes. In humans and other sexual organisms, this variation is scrambled or recombined in a Mendelian fashion to create novel genetic combinations that increase an individual’s uniqueness (see the Genetic terms box). Because both conserved genetic material and errors are passed from generation to generation, DNA changes can be used to determine who is related to whom on the basis of Mendelian rules. The corollary is equally important: DNA changes can also determine when two individuals (or samples) are not closely related, resulting in the exclusion of an individual.

As more and more of the landscapes within the human and other genomes were deciphered, it became evident that genetic variation levels are not uniform throughout. Although replication and distribution of DNA molecules from parent to progeny are very accurate processes, several other factors account for where and how often genetic change can occur within a genome. Despite a simple composition of only four different nucleotides, the DNA and genomes of humans and other organisms have emerged as a complex and variable blueprint of life. Certain pieces are well characterized, such as genes that encode a plethora of proteins that, in turn, serve as building blocks for various tissues. But many other genes have not yet been characterized because they may not be essential or because of the complexity of the genome. Even more intriguing are parts of the genome that appear to be spacers or gaps, either between genes (intergenic sequences) or within genes (introns). Plants and mammals often have 10× more of these sequences than they have gene-coding DNA. Bacteria contain few introns and much less spacer DNA, so they can pack more genes into small amounts of DNA.

Nonetheless, the landscape within any genome is highly variable and consists of a gradient from genes that are essential to regions that are extremely flexible and may not be necessary at all. This variation is a gold mine for molecular evolution and forensic science because these regions have evolved at different rates. At one end of the variability spectrum are the ribosomal RNA (rRNA) genes that revolutionized our view of basic taxonomic relationships among living organisms (6, 7). These large, universal, and extremely stable (very slowly evolving) sequences were used to discover an entirely new domain of life (Archaea) and to reconstruct the tree of life from microbes to man (7).
At the opposite end of the spectrum are regions in the genome that are not absolutely essential for survival, where mutations can occur without dire consequences to the organism. Without selective evolutionary pressure to resist genomic change, these regions can accumulate genetic changes at significantly greater rates. Sometimes the sequence structure itself promotes rapid change, notably runs of small tandemly repeated nucleotide units (e.g., AT AT AT, AAT AAT AAT, GATA GATA GATA). Here, the frequency of changes can be millions of times greater than would be found for the rRNA sequences, where there is selection against change. In bacteria and humans, mutation frequencies at different genomic sites can range from 10⁻¹⁰ to nearly 10⁻³ per generation in a single species (8–10).

Before DNA sequencing and PCR became common, these more frequent changes in the size of nearly identical DNA fragments could be detected by restriction enzymes that cut DNA molecules at specific sites, thus generating fragments whose lengths could be differentiated by gel electrophoresis. Different lengths indicate different sequence composition. These restriction fragment length polymorphisms (RFLPs) were one of the earliest and most successful means to measure differences in inherited genetic markers for diseases and to establish familial relationships in humans. In 1985, Alec Jeffreys and his colleagues coined the term “DNA fingerprinting” after they developed several new probes that could detect variable simple tandem repeat (STR) sites that could be used to solve problems in human identification (11, 12). STRs in combination with PCR became known as DNA fingerprinting by the late 1990s.

Microbial diversity

Microbes exhibit extreme variations relative to eukaryotes for multiple reasons, but the most apparent is the great evolutionary time separating many bacterial species (41, 42). The last common ancestor for bacteria may have existed ~3 billion years (yr) ago (43, 44). Archaeal organisms last shared a common ancestor ~3 billion yr ago. The last common ancestor for eukaryotes lived ~500 million yr ago (44). In addition, the bacterial and archaeal ecological niches and the selective pressures that drove diversification were extreme: a few billion years of geological transformation and extremes of temperature and pressure. These deeper adaptations do capitalize upon recombination (sex) across lineages. Some younger bacterial populations (e.g., H. pylori) also carry out extensive recombination on a day-to-day basis.

The explanation for deep phylogenetic lineage diversity does not apply to the diversity in many modern bacterial pathogens, because they have evolved from older nonpathogenic species only within the past 5000–10,000 yr. Their recent emergence (as evidenced by a lack of genetic diversity) pegs them as opportunists, taking advantage of the increasing human population and global travel. B. anthracis is an example of a recently emerged pathogen whose spread has been associated with human migrations and commerce.

How can we see mutational changes in B. anthracis populations without combinatorial power? First, a bacterial generation can be as short as 20–30 minutes, whereas a human generation is typically 20–30 yr. E. coli may undergo 300 generations/yr, whereas B. anthracis may have 25–50 generations/yr (36, 45). Second, the relatively miniature microbial genomes can now be sequenced in their entirety with relative ease, and even a SNP can be discovered and used to specifically identify a strain. Rare hypervariable loci also exist in bacterial genomes and can be discovered by comprehensive genome analysis.

Sperm and egg cells produced by meiosis reduce a diploid set of 23 pairs of chromosomes to a haploid set containing only one of each pair of chromosomes (see the Genetic terms box). The diploid state is restored during fertilization. Each of the 23 haploid chromosomes in the egg or the sperm is acquired by random sorting from either parent; this means that the number of possible combinations of parental chromosomes is 2²³, or >8 million. In addition, during the process of meiosis, the matched pair of parental chromosomes come together and can readily exchange genetic material by a process called crossing over or homologous recombination before or as they are separated into the haploid states. This recombination process causes tens or hundreds of different genotypes to be identified in a single chromosome, depending on the number of alleles occupying a single locus. But the diversity caused by recombination is minuscule in comparison with the combinatorial power generated by the independent assortment of chromosomes discussed earlier. The only exceptions are identical twins, who share a single genotype.
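The independent-assortment count quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are ours, crossing over is ignored, and the second function simply assumes that the two gametes joining at fertilization assort independently.

```python
# Counting chromosome combinations produced by independent assortment alone.
# Crossing over is deliberately ignored; function names are illustrative.

HAPLOID_CHROMOSOMES = 23  # human haploid chromosome number


def gamete_combinations(n_chromosomes: int) -> int:
    """Each chromosome in a gamete comes from one of two parental homologs."""
    return 2 ** n_chromosomes


def zygote_combinations(n_chromosomes: int) -> int:
    """Two independently assorted gametes (egg and sperm) are combined."""
    return gamete_combinations(n_chromosomes) ** 2


if __name__ == "__main__":
    print(gamete_combinations(HAPLOID_CHROMOSOMES))  # 8388608, i.e. >8 million
    print(zygote_combinations(HAPLOID_CHROMOSOMES))  # 70368744177664
```

The first number is the 2²³ figure in the text; the second suggests why assortment alone, before any recombination, already makes unrelated individuals so unlikely to share a chromosome set.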

CODIS

In November 1997, 13 STR markers became the standard for CODIS, and they remain the basis for the large national DNA criminal database (www.fbi.gov/hq/lab/html/codis1.htm; 13). Kits specifically designed to analyze the 13 CODIS markers are commercially available (Applied Biosystems and Promega) and work on different platforms (14). Numerous websites (http://cstl.nist.gov/div831strbase/fbicore.htm) describe the STR markers, and various software programs (discussed later) provide calculations for estimating the frequency of a given profile within different populations.

A useful website from the University of Arizona’s biology department uses the 13 CODIS markers specifically generated from the DNA of a crime lab analyst to estimate the frequency of his combined STR profile in the general human population (www.biology.arizona.edu/human_bio/activities/blackett2/str_codis.html). Using simple probability estimates based on particular allele frequencies, researchers determined that his profile is expected at a frequency of about 1 in 7.7 × 10¹⁵ (1 in 7.7 quadrillion), which represents the probability of finding a random match for the analyst. This illustrates the statistical power of the DNA fingerprinting technique to exclude or include an individual as a suspect.

Although the CODIS 13 STR marker set has not changed significantly since it was introduced, new DNA forensic applications continue to be developed by the FBI. The analysis of samples containing trace or degraded DNA, such as teeth, bones, hair roots, and fingerprints, is a challenge that is sometimes met by the use of cytoplasmic mitochondrial DNA (mtDNA), which is present in higher copy numbers than nuclear DNA targets are. mtDNA targets are now used to recover and type DNA from samples that might be difficult to analyze with the standard CODIS markers.
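Returning briefly to the profile-frequency estimate quoted above for the CODIS markers: the sketch below shows one common way such "simple probability estimates based on allele frequencies" are computed. It assumes Hardy-Weinberg proportions at each locus and independence among loci (the product rule); the article does not spell out the exact method, and the allele frequencies used here are invented purely for illustration.

```python
# Sketch of a multi-locus profile frequency under the product rule.
# Assumptions (not from the article): Hardy-Weinberg proportions per locus,
# independence among loci, and made-up allele frequencies.
from math import prod
from typing import Sequence, Tuple


def genotype_frequency(alleles: Tuple[float, ...]) -> float:
    """One frequency = homozygote (p*p); two frequencies = heterozygote (2*p*q)."""
    if len(alleles) == 1:
        return alleles[0] ** 2
    p, q = alleles
    return 2.0 * p * q


def profile_frequency(loci: Sequence[Tuple[float, ...]]) -> float:
    """Multiply per-locus genotype frequencies across all typed loci."""
    return prod(genotype_frequency(locus) for locus in loci)


# Hypothetical three-locus profile: heterozygote, homozygote, heterozygote.
EXAMPLE_PROFILE = [(0.10, 0.20), (0.05,), (0.30, 0.10)]

if __name__ == "__main__":
    freq = profile_frequency(EXAMPLE_PROFILE)
    print(f"expected profile frequency: {freq:.2e}")
    print(f"roughly 1 in {1 / freq:,.0f}")
```

With only three hypothetical loci the profile is already expected in a few per million individuals; extending the same multiplication to 13 loci is what drives real estimates down to the extremely small values cited in the text.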
mtDNAs are maternally inherited and do not undergo recombination; as a consequence, their genetic variation is similar to that of clonally derived bacterial species. In humans, a specific 1100 bp region of mtDNA has been sequenced extensively by the FBI to develop a quality forensic database (15). A similar situation exists for the male-specific Y chromosome, which boys inherit from their fathers. Most of the Y chromosome is not affected by the meiotic recombination process, and it is therefore similar to mtDNA in its discriminatory power. The allelic variation in the Y chromosome is not nearly as high as that of the autosomal (nonsex) chromosomes, and combinatorial power is lacking, too. Nonetheless, the FBI has developed Y-chromosome STR markers because these are important markers in sexual assault crimes perpetrated by men (16–18). Again, these markers are inherited in a manner similar to clonal bacterial propagation, which can dominate the evolutionary path of many pathogenic species.

Forensic science relies on more than the availability of highly tested and highly resolving human genetic markers. The term forensic is synonymous with the legal system, and as a consequence, the quality assurance (QA) aspect of the DNA fingerprinting process goes beyond peer assessment by the scientific community. DNA fingerprinting must survive the scrutiny of the legal community in a process in which the ability to debate may be more important than the scientific data itself. In 1994, Congress passed the DNA Identification Act and created the DNA Advisory Board (DAB) to develop QA standards. This led to two documents that are the foundation for quality control in the CODIS program: QA Standards for Forensic DNA Testing Laboratories and QA Standards for Convicted Offender DNA Databasing Laboratories. In addition, the FBI has formed various scientific working groups (SWGs) consisting of scientific, legal, and other experts to improve forensic disciplines and to develop consistencies among federal, state, and local law enforcement groups. One SWG advises the DAB on new developments in human DNA; another deals with similar issues in microbial systems (19).

Among the many issues addressed in these QA documents are the important technical aspects of sample contamination and chain of custody. PCR is an extremely sensitive method that can often detect samples that may contain only a few molecules of a specific DNA. As a result, samples can easily be contaminated even with what may normally be considered proper care. Negative controls become as important as positive controls. The CODIS standards outline in great detail the measures that all qualified laboratories must take to prevent contamination and to recognize situations in which contamination is a problem. Detailed record keeping and chain of custody are likewise extremely important considerations that are addressed in these QA documents.

FIGURE 1. A 16S rRNA tree that illustrates the complexity of the bacterial domain and the human and animal kingdoms. Notice the relatively close genetic relationship between humans and corn (Zea) and the phylogenetic separation between E. coli and Bacillus. (Adapted with permission from Ref. 40.)
[Labels recoverable from the figure: domains Bacteria, Archaea, and Eucarya; taxa E. coli, Clostridium, Bacillus, Humans, Zea, and Paramecium.]

Microbial typing systems

Although bacterial pathogens have genomes that are as much as 5000× smaller than the human genome, they are a significant DNA typing challenge because they are extremely diverse compared with the variation seen within the human genome. The spectrum of this diversity is reflected in the sequence of the regions containing the highly conserved 16S rRNA genes. In a universal phylogenetic tree, all of the animal kingdom would occupy a relatively short branch within one of the three major domains of life (see the Phylogenetic nomenclature box). However, the bacterial domain would consist of tens of thousands of species residing along six different branches in the bacterial kingdom (6, 7, 20; Figure 1). Although sequence data from an rRNA gene deliver an accurate portrayal of the phylogenetic tree of life, they have negligible resolution at the level of a single species or subspecies.

Fortunately, the microbial genome contains other more rapidly evolving regions that can be used to type individual species. For example, the U.S. Centers for Disease Control and Prevention (CDC) uses a whole-genome/whole-pathogen method that is based on the older RFLP method and uses rare-cutting restriction enzymes that cut DNA molecules at specific but infrequent sites; the large DNA fragments are analyzed by pulsed-field gel electrophoresis. The bar-code profiles generated by these analyses can provide relatively rapid typing for the large CDC network called PulseNet, which tracks outbreaks of foodborne diseases (www.cdc.gov/ecoli/2006/December/cdcresponse.htm). This typing method is primarily used to include or exclude new isolates during outbreaks of pathogens such as E. coli, Salmonella, Campylobacter, and Shigella. However, it does not discriminate well among isolates from less diverse species, such as B. anthracis, and it requires live pathogen isolates for DNA preparation.
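The logic behind restriction-based typing (RFLP and its pulsed-field descendant) can be illustrated with a small in silico digest. The sketch below is a simplification under stated assumptions: the sequences are toy strings rather than real genomes, the molecule is treated as linear even though most bacterial chromosomes are circular, and NotI (recognition site GCGGCCGC, cutting after the second base) is used only as a familiar example of an 8-bp rare cutter, not as the enzyme used by any particular laboratory.

```python
# Toy in silico restriction digest: a single base change that destroys a
# recognition site changes the fragment-length "bar code" of an isolate.
# Sequences are invented; the molecule is treated as linear for simplicity.
from typing import List

NOTI_SITE = "GCGGCCGC"  # NotI recognition sequence
NOTI_CUT_OFFSET = 2     # NotI cuts between GC and GGCCGC


def fragment_lengths(sequence: str, site: str, cut_offset: int) -> List[int]:
    """Cut at every occurrence of `site`; return fragment lengths, longest first."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    lengths = [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
    return sorted(lengths, reverse=True)


# Two hypothetical isolates differing by one base inside the second site.
ISOLATE_A = "AAAA" + "GCGGCCGC" + "TTTTTT" + "GCGGCCGC" + "AA"
ISOLATE_B = "AAAA" + "GCGGCCGC" + "TTTTTT" + "GCGGACGC" + "AA"

if __name__ == "__main__":
    print(fragment_lengths(ISOLATE_A, NOTI_SITE, NOTI_CUT_OFFSET))  # [14, 8, 6]
    print(fragment_lengths(ISOLATE_B, NOTI_SITE, NOTI_CUT_OFFSET))  # [22, 6]
```

On a gel, the two isolates would show different band patterns even though they differ at a single nucleotide. The same example also hints at the limitation noted in the text: two isolates that differ only outside the recognition sites produce identical patterns.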
Another global RFLP-like technique, amplified fragment length polymorphism (AFLP) analysis, was the first method used to show significant resolution among a large number of B. anthracis isolates (21). This method allows for small-scale genomic analysis by restriction-enzyme cutting followed by selective PCR amplification. At the strain level, AFLP has been replaced by sophisticated, genomewide variable number tandem repeat (VNTR) analysis; however, AFLP is still valuable for studying new species because it does not require significant sequence information. It has been used to identify the closest known relatives of several pathogens, including B. anthracis and Clostridium botulinum (22–24). Although not necessarily consequential in forensics, methods that can identify near neighbors are important in the development of robust assays that are specific for the detection of a pathogen yet exclude a close relative (25).

Genetic terms

Allele: a sequence variant at a specific site or locus. In haploid organisms (e.g., bacteria), a single cell will contain only a single allele, but a population may contain many different alleles at the same locus. Traditionally, alleles were detected by alternate biological functions, and the term was reserved for alternate genes at a locus. With the advent of molecular analysis, the term is also applicable to nucleotide variation in which biological differences are negligible. Hence, STR loci can have multiple alleles even though no known biological effect exists.

Chromosome: a single, large, continuous piece of DNA. Humans have 23 pairs of chromosomes, whereas most bacteria have only a single chromosome containing their entire genetic blueprint. Bacteria may contain smaller autonomously replicating units of genetic material called plasmids, which may confer specialized biological functions on the host cell.

Genotype: the specific genetic makeup of a person, a bacterium, or a cell as defined by a set of specific sequence variants (alleles). A genotype can be single-locus, multiple-locus, or even a complete genomic sequence in some cases. The FBI CODIS 13 marker system uses 13 loci, each with a number of different sequence variants. Because humans carry two copies of each chromosome, when the 13 loci are typed for an individual, 26 separate alleles will be identified. The combined data set will describe the individual’s genotype.

Locus/loci: a specific site on a chromosome, for example, a specific gene. The genetic concept is parallel to basic geometry because DNA and genomes are linear, with many locations (loci) arranged sequentially.

Mendelian inheritance: the set of basic principles that govern the transmission of traits from parents to offspring. These rules were first reported by Gregor Mendel in the 19th century and “rediscovered” early in the 20th century. Mendelian inheritance primarily reflects the patterns associated with the transmission of single-gene traits; non-Mendelian inheritance is associated with complex, multigene traits, such as Alzheimer’s disease. Mendel’s second law describes independent assortment, which is the random or independent distribution of individual genes from parents to progeny that contributes to the wide genetic diversity of humans.

Recombination: a new combination of alleles or genes that differ in progeny from those seen in the parents. The meiotic example in Mendelian genetics is fairly predictable and follows statistical rules. In bacteria, the frequency is less defined and varies greatly among different taxa. Recombination has a more specific definition at the molecular level, where two pieces of DNA can be broken and then joined into a recombinant molecule.

Housekeeping genes are critical for survival and are thus found in all bacteria. The ubiquitous nature of these loci, their higher rate of evolution compared with rRNA genes, and the increasing availability of whole-genome sequences for many pathogens created a niche for multiple-locus sequence typing (MLST; 26). The original idea behind MLST was to identify 7–10 genomically distributed conserved regions that could be sequenced from a collection of isolates from a single pathogen and then analyzed to determine the genetic relationships among isolates. From the outset, these kinds of data sets were useful in determining population structure and in ascertaining whether pathogens were evolving primarily by clonal growth (e.g., B. cereus, B. anthracis) or by a combination of clonal propagation and genetic recombination (e.g., Helicobacter pylori, Burkholderia pseudomallei; 26–28). The success of this approach is reflected in the availability of a large public website housing MLST sites for 23 bacterial species and pathogens that is currently maintained by Imperial College London and the Wellcome Trust (www.mlst.net). This site is particularly attractive because it is curated and partially maintained by individual laboratories with expertise for each of these species. Although significant inroads have been made into the development of typing schemes that are highly efficient at resolving certain clinical isolates, MLST approaches cannot always provide resolution for the most recently evolved pathogens. Examples include Yersinia pestis (plague), B. anthracis, Francisella tularensis (tularemia), and Coxiella burnetii (Q fever).

The most rapidly evolving markers in bacteria, as in humans, are VNTRs. Although the numbers of such sites in bacterial genomes vary considerably, VNTR markers have been used in many systems to provide the highest level of resolution for strain identification for several bacteria, including B. anthracis, Y. pestis, Brucella sp., and E. coli O157 (10, 29–31). And as in the FBI CODIS STR marker system for humans, when the highest degree of resolution was required, a multiple-locus VNTR analysis (MLVA) system was adapted for B. anthracis (30).

Finally, the ultimate signature for any single organism would be obtained from a complete or whole-genome sequence that was compared with the sequence of a sibling or a more distant relative of the same species. The results of these analyses for B. anthracis are part of the overall theme and development of our hierarchical approach to the forensic analysis of this species. When the first comparative analysis of the B. anthracis genomes was completed (32), the sequencing of more than one isolate from a single species was very rare. Today, the National Center for Biotechnology Information public database lists a plethora of multiple genome sequences for many microbial species (www.ncbi.nlm.nih.gov/sutils/genom.table.cgi). This is the holy grail of microbial forensics, and it can be attained soon.
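To make the multiple-locus VNTR comparison described above more concrete, the following sketch compares MLVA profiles as tables of repeat counts per locus. Everything specific in it is hypothetical: the locus names, the repeat counts, and the simple "number of differing loci" distance are chosen for illustration and are not the published B. anthracis MLVA scheme or its scoring rules.

```python
# Sketch: comparing MLVA genotypes as repeat counts per VNTR locus.
# Locus names and repeat counts are hypothetical; the distance is a simple
# count of differing loci, one of several possible ways to score profiles.
from typing import Dict


def mlva_distance(profile_a: Dict[str, int], profile_b: Dict[str, int]) -> int:
    """Number of loci typed in both profiles whose repeat counts differ."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] != profile_b[locus])


def same_genotype(profile_a: Dict[str, int], profile_b: Dict[str, int]) -> bool:
    """True only if both profiles cover the same loci with identical counts."""
    return profile_a == profile_b


EVIDENCE = {"VNTR_1": 4, "VNTR_2": 10, "VNTR_3": 7, "VNTR_4": 2}
REFERENCE_X = {"VNTR_1": 4, "VNTR_2": 10, "VNTR_3": 7, "VNTR_4": 2}
REFERENCE_Y = {"VNTR_1": 4, "VNTR_2": 12, "VNTR_3": 6, "VNTR_4": 2}

if __name__ == "__main__":
    for name, ref in (("X", REFERENCE_X), ("Y", REFERENCE_Y)):
        print(name, same_genotype(EVIDENCE, ref), mlva_distance(EVIDENCE, ref))
```

A match at every locus includes a reference strain as a candidate source, while differences at one or more loci are the "near match" situations that call for the mutation-rate reasoning the article turns to later.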

Phylogenetic nomenclature

Phylogeny is a hypothesis or a model of the evolutionary history and relatedness of a set of organisms; the model is developed by using characteristics in logical or statistical analyses. The maximum parsimony method is a powerful and commonly used logical approach that constructs phylogenies by minimizing the number of mutations or steps needed to explain the distribution of characters vis-à-vis the organisms being studied. The original phylogenetic characters or characteristics were morphological (e.g., wing appendages) or biochemical (e.g., metabolic functions, such as nitrate reduction), but now nucleotide or protein sequences are most commonly used.

Homology is a common evolutionary origin of characters. A morphological example would be bird wings, bat wings, and human arms, which share a common evolutionary origin. In contrast, an insect wing, though superficially similar, has evolved independently and does not share a common evolutionary origin with a bird wing; hence, it is not homologous. This is one of the most commonly misused terms in molecular genetics; it is frequently used to describe similarity (e.g., nucleotide sequences) instead of a common evolutionary origin.

Homoplasy is the appearance of sameness resulting from independent evolution as a result of reversal, convergence, parallelism, or recombination. Homoplasy results in disagreement (incongruency) among character predictions of phylogenetic hypotheses.

Synapomorphic characters evolved away from their ancestral status and are shared by two or more of the organisms being studied. These are the most powerful characters for constructing phylogenetic hypotheses.

Autapomorphic characters evolved away from their ancestral status but are found in only a single organism being studied. These provide analytical identification power for that lone organism but do not otherwise help determine how individual organisms are related to one another.

Canonical characters are selected from a redundant set of characteristics to represent key branches or features of a phylogeny to make categorization of an unknown organism highly efficient.
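The distinction between synapomorphic and autapomorphic characters defined above can be expressed as a small classification routine. The sketch below uses the three-locus data matrix given in Figure 2 of this article; the function name and the simplifying assumptions (one ancestral state per locus taken from the near neighbors, at most one derived state, no homoplasy) are ours.

```python
# Sketch: classifying SNP characters as synapomorphic or autapomorphic.
# Allele states follow the Figure 2 matrix; the ancestral state at each
# locus is the state carried by the near-neighbor bacilli (outgroup).
from typing import Dict

B_ANTHRACIS: Dict[str, Dict[str, str]] = {
    "Ames (A group)": {"locus1": "T", "locus2": "T", "locus3": "A"},
    "A group":        {"locus1": "T", "locus2": "T", "locus3": "G"},
    "B group":        {"locus1": "T", "locus2": "C", "locus3": "G"},
    "C group":        {"locus1": "T", "locus2": "C", "locus3": "G"},
}
ANCESTRAL = {"locus1": "A", "locus2": "C", "locus3": "G"}


def classify_character(locus: str) -> str:
    """Label a locus by how many studied taxa carry a non-ancestral allele."""
    derived = [
        taxon for taxon, states in B_ANTHRACIS.items()
        if states[locus] != ANCESTRAL[locus]
    ]
    if not derived:
        return "ancestral in all taxa"
    if len(derived) == 1:
        return "autapomorphic"
    return "synapomorphic"


if __name__ == "__main__":
    for locus in ANCESTRAL:
        print(locus, classify_character(locus))
```

Run on the Figure 2 matrix, loci 1 and 2 come out synapomorphic (shared derived T alleles) and locus 3 autapomorphic (the A allele found only in Ames), which agrees with the figure caption.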

Although whole-genome sequencing is still somewhat expensive, new technologies already show enormous potential to substantially reduce the cost (e.g., 454 and Illumina). The current cost for a draft genome of a microbial pathogen can be as low as $500, and efforts are under way to reduce it to ≤$100 (33). Even by today’s standards, from a cost perspective, it is reasonable to sequence the complete genome of any pathogenic isolate involved in a nefarious incident or from a natural outbreak to obtain a complete DNA signature for forensic or historical comparisons. In human forensics, large population studies were required by the courts to establish the significance of the data before DNA analysis was widely accepted. This approach is also clearly important for each microbial agent used in a biocrime and needs to be incorporated into standard procedures.

B. anthracis—a phylogenetic approach to signature development

Bacterial and viral genomes change and diversify under some of the same intrinsic evolutionary rules as the human genome, although specific parameters can vary greatly from species to species. Mutations in DNA or RNA sequences are thought to be primarily due to polymerase errors, and hypervariable sequences also exist because of DNA sequence structures such as STRs or VNTRs. What is dramatically different among microbes and between microbes and humans is the frequency of recombination between individuals. Human sexual reproduction results in novel genetic combinations in every generation; this does not occur in most microbial species. Clonal or 4796

A n a ly t i c a l C h e m i s t r y / J u ly 1 , 2 0 0 8

Locus 1 A to T

asexual reproduction is common but not always a strict rule. Some bacteria will mix their genomes on a limited basis by mating conjugations, phage-mediated transductions, or simple transformations with naked DNA. When this happens, novel combinations of previous genetic variations (e.g., alleles) will occur and increase diversity within the population. This novelty makes unique forensic identification somewhat easier and more similar to human STR analysis of autosomal chromosomes. Unfortunately, the ratio between recombination and clonality in a microbial population is highly variable, and definitive rules are elusive. The simple calculations for probabilistic relatedness that are possible in a Mendelian scenario are not obvious in bacteria. Although recombination affects diversity in some microbes,

Locus 2 C to T

Locus 3 G to A

B. anthracis Ames (A group) B. anthracis A group B. anthracis B group B. anthracis C group B. cereus B. cereus B. thuringiensis

Locus 1 Locus 2 Locus 3 T T A T T G T C G T C G A C G A C G A C G

FIGURE 2. Phylogenetic diagram of B. anthracis rooted with near neighbors, and the data matrix of three canSNPs. The allelic variants at locus 1 represent a canSNP that is species-specific to B. anthracis—that is, all samples exhibit a T at this locus, whereas other bacilli have the ancestral character state of A. Many other SNP loci could potentially provide the same information, but locus 1 was arbitrarily chosen to represent this phylogenetic position and, hence, is a canonical locus. Locus 2 is also used as a group-specific canSNP because it distinguishes all A group members from B and C group members by virtue of the SNP character state of T. Similarly, locus 3 has a strain-specific character state in which the A allele is found only in the Ames strain. Derived alleles at loci 1 (T) and 2 (T) would present synapomorphic characters, whereas the A allele at locus 3 is an autapomorphic character because is not shared by more than one taxa.

the same is definitely not true for B. anthracis. This is a highly, probably exclusively, clonal species with no evidence of recombination or horizontal genetic transfer since its derivation from a B. cereus ancestor (32). This clonal pattern is also true for many other recently emerged pathogens that jump into new niches that give them great reproductive fitness. The pathogens’ rapid success often isolates them from other similar bacteria and changes their evolutionary pattern into one that is highly clonal. B. anthracis is now ecologically distinct from its relatives living in the soil and has little opportunity for genetic exchange. Mutation rates in neutral genomic regions become the driving force for diversity and for microbial forensic identity determinations.

Population structure of pathogens or humans dictates what the best analytical approaches should be for forensic analysis. Forensic DNA analysis is, after all, really population genetic analysis with very specific questions. Are two samples identical or are they different? What confidence limits can be associated with matching and nonmatching genotypes? Sexual, recombining populations present very different analytical problems than do clonal populations. Hence, the combinatorial approaches for human STR typing have limited applicability to microbial forensics and anthrax investigations in particular. Combinatorial approaches can be thought of as statistical analyses in which average similarities and differences are meaningful. Clonally derived populations are highly structured, and simple statistical averages do little to accurately represent population differences. Rather, these populations are highly amenable to phylogenetic analyses. Some of the most powerful phylogenetic approaches are actually logical algorithms that arrange mutational events in a hierarchical fashion. The arrangement that is simplest is declared the most likely to occur or the one with maximum parsimony.

Although mathematical confidence limits are intrinsic to statistical analyses, they are somewhat tangential to logical phylogenetic approaches. However, several methods have been suggested. Lenski and Keim suggested that random matching across polymorphic sites was highly improbable (34). Indeed, when highly mutable loci are used, perfect matches may be elusive even from closely related isolates. “Near matches” are always problematic, especially when scientists and investigators try to explain relatedness to a jury! Vogler et al. measured mutation rates to develop a theoretical hypothesis to test this approach (9, 10, 31). This involves alternative hypotheses developed for the specific details of a particular epidemiological or forensic investigation. For example, an investigator might ask, “Did the infecting pathogen come from the salad bar or from the laboratory vial?” Genetic differences could be arranged in alternative phylogenetic hypotheses that have their own probabilities of likeliness. The ratio of the probabilities shows how much more likely one scenario is than the other. A more statistical approach based on population analysis would use genotype frequency to validate (or not) a match. If two samples match (or are different), their genotype frequency in a population can be translated into the probability that the samples would be selected at random. This approach would seem to be more powerful for nonmatching confidence limits and provides another approach for validating microbial forensic analyses.

FIGURE 3. PHRANA-based phylogenetic tree with all B. anthracis genotypes. Gray branches are defined by SNPs; black branches are defined by MLVA. Note the different scales for the different markers. [Tree labels: C, B, A, Ames. Scale bars: SNP changes (0 to 50) and VNTR changes (0 to 5).]

Comparisons of whole-genome sequences from multiple isolates of a single species can increase our knowledge of the evolutionary pathway that separates the sequenced genomes. Although many types of genetic differences may accumulate among isolates as a result of mutation, some happen relatively quickly such that two unrelated isolates may contain the same mutation by chance alone. To reduce the confusion caused by such independent mutations that are not the result of inheritance, we search for marker types that have a low probability of change and thus can be assumed to have mutated only once during the evolutionary history of the organism. Single nucleotide polymorphisms (SNPs) have very low mutation rates in the B. anthracis genome and thus provide a very stable evolutionary signal.

Unfortunately, this important characteristic means that SNPs are very rare, requiring whole-genome comparisons for their discovery. A comparison of five whole-genome B. anthracis sequences predicted ~3500 SNPs, 990 of which were used for genotyping 26 diverse strains of anthrax (32). The resulting phylogenetic analysis yielded only a single example of the same mutation occurring more than once, that is, 1 in 25,000 measurements. Therefore, when a mutation arose in a lineage, the new state was passed along to all the descendants of that


[Figure 4 tree labels: Remaining B. anthracis isolates; 958 SNPs; 3 Chinese isolates; 3 Chinese isolates; 1 Chinese isolate; A0394 (Texas, goat), A1117 (Texas), A1115 (Texas); Ames. Numbers on branches: 2, 15, 8, 5.]

FIGURE 4. An Ames strain phylogenetic tree. This Ames-specific branch was created by 32 SNPs that were specific to this strain in a comparison among the genomic sequences of seven diverse B. anthracis isolates (38); 11 different strains fell onto this branch, creating five different nodes or branch points. (The branch lengths are not yet known and are shown in green.) Note that five SNPs are specific for the Ames strain alone and that the closest known relatives were isolated in Texas (origin of the Ames strain). The next closest relatives were all isolated in China.

isolate. Descendants from other lineages did not contain the novel state but rather retained the ancestral state. These results suggest that once B. anthracis was established as a species, its genotype was inherited in a completely clonal manner, with no evidence that DNA has been exchanged between individuals or acquired from the environment. This contrasts with what has been observed in many other bacterial species. The clonality of this species, coupled with the extreme evolutionary stability of SNPs, means that any single synapomorphic SNP can be used to define an entire lineage (35, 36). In cases where many SNPs occur on a long phylogenetic branch, a single representative SNP can be developed into a definitive genotyping assay instead of using all of the SNPs on that branch (Figure 2). This eliminates the redundant phylogenetic information provided by the additional SNPs. On the other hand, the shortest branches will be defined by only a single SNP, and hence, that particular SNP must be incorporated into the genotyping scheme. Even though statistical analyses may suggest that short branches (with few SNPs) are unreliable, this is not the case with recently emerged clonal organisms such as B. anthracis, for which the evolutionary stability of single characters is robust.
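Two pieces of the reasoning above lend themselves to a short sketch. The first is arithmetic: 990 SNPs scored in 26 strains gives 25,740 allele calls, which is presumably the source of the "1 in 25,000" figure for a single repeated mutation. The second is the choice of one representative SNP per branch. The function names and the rule of taking the alphabetically first SNP are assumptions made for illustration; the text says only that the representative is chosen arbitrarily.

```python
from typing import Dict, List


def total_measurements(n_snps: int, n_strains: int) -> int:
    """Number of allele calls when every SNP is scored in every strain."""
    return n_snps * n_strains


def homoplasy_rate(n_homoplasies: int, n_snps: int, n_strains: int) -> float:
    """Fraction of allele calls that reflect a repeated (non-inherited) mutation."""
    return n_homoplasies / total_measurements(n_snps, n_strains)


def pick_canonical(branch_snps: Dict[str, List[str]]) -> Dict[str, str]:
    """Choose one representative SNP per branch.

    The choice is arbitrary by design; here it is the alphabetically first
    SNP name (an assumption). A branch with a single SNP keeps that SNP.
    """
    return {branch: sorted(snps)[0]
            for branch, snps in branch_snps.items() if snps}


print(total_measurements(990, 26))           # 990 SNPs x 26 strains
print(f"{homoplasy_rate(1, 990, 26):.2e}")   # one repeated mutation observed

# Hypothetical branch and SNP names, for illustration only.
print(pick_canonical({"long_branch": ["snp_17", "snp_03", "snp_42"],
                      "short_branch": ["snp_99"]}))
```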


SNPs that are chosen to represent phylogenetic branches and developed into high-capacity genotyping assays are termed canonical SNPs (canSNPs) because they are definitive markers for given branches and subpopulations (35, 36). For example, a canSNP assay has been developed that is definitive for the entire B. anthracis species (37). This is based on a mutation in the plcR gene that is present in all known B. anthracis isolates but not in any close relatives, such as B. thuringiensis and B. cereus. Similarly, particular strains, such as the Ames strain (used in the 2001 anthrax letter attacks in the U.S.), can be identified by a single autapomorphic canSNP or by a set of them (38). Although published examples of canSNP assays define the entire B. anthracis species and specific strains, canSNPs can also be used at intermediate phylogenetic levels to define major and minor lineages within the species. Despite the increasing availability of whole-genome sequences, many B. anthracis lineages are not sequenced, resulting in undefined phylogenetic topologies scattered throughout the tree. The bifurcation point at which such lineages diverge from a better characterized branch can be accurately defined by canSNPs; however, the evolutionary patterns within such a lineage need to be defined with other genetic markers, such as VNTRs. The fast VNTR evolution rates enable phylogenetic resolution even among closely related isolates. Rapidly evolving loci can generate confusing information, but this problem is mitigated by first categorizing isolates with known canSNPs to restrict DNA matches to only the most closely related isolates. This application of progressive hierarchical resolving assays with nucleic acids (PHRANA) maximizes the strengths of each marker type while simultaneously minimizing their weaknesses (35; Figure 3). The progressive and hierarchical approach is intuitive and consistent with traditional bacterial identification.
Microbiologists traditionally followed a similar diagnostic pathway, in which general tests were used to identify gross classifications, which led to increasingly specific assays and eventually to a specific designation.
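The two-tier idea behind PHRANA can be sketched as follows: isolates are first restricted to the query's canSNP group, and only then compared at fast-evolving VNTR loci. This is a minimal sketch under stated assumptions, not the published assay pipeline. The `Isolate` record, the group labels, and the use of a simple count of differing loci as the VNTR distance are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Isolate:
    name: str
    can_snp_group: str        # slow, stable marker tier (canSNPs)
    vntr: Tuple[int, ...]     # repeat counts at a fixed, ordered set of VNTR loci


def vntr_distance(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Count the VNTR loci at which two profiles differ."""
    if len(a) != len(b):
        raise ValueError("VNTR profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def phrana_match(query: Isolate, references: List[Isolate]) -> List[Tuple[str, int]]:
    """Tier 1: keep references in the query's canSNP group.
    Tier 2: rank the survivors by VNTR distance (closest first)."""
    same_group = [r for r in references if r.can_snp_group == query.can_snp_group]
    ranked = [(r.name, vntr_distance(query.vntr, r.vntr)) for r in same_group]
    return sorted(ranked, key=lambda pair: (pair[1], pair[0]))


# Hypothetical isolates. ref_2 has an identical VNTR profile but sits in a
# different canSNP group, so it is excluded before VNTRs are compared.
query = Isolate("unknown", "group_X", (10, 4, 7))
refs = [
    Isolate("ref_1", "group_X", (10, 4, 8)),
    Isolate("ref_2", "group_Y", (10, 4, 7)),
    Isolate("ref_3", "group_X", (10, 4, 7)),
]
print(phrana_match(query, refs))
```

Filtering on the stable marker first is what keeps a chance VNTR coincidence in an unrelated lineage from being reported as a match.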

The future of genotyping B. anthracis and other pathogens

Despite the development of a sophisticated and tiered approach to signature development in B. anthracis, not every isolate can be resolved as an individual. For example, 476 different genotypes have been obtained from an analysis of 1067 isolates in a worldwide collection (35). Recent studies with individual whole-genome sequence comparisons at the strain level point to the future of genotyping for forensic purposes (38). The whole-genome sequence of the notorious Ames strain allowed the definition of an Ames branch that was 32 SNPs long, out of 990 (32, 36). Because this genome is so conserved, these 32 SNP alleles are, from a purely forensic or typing perspective, found only in isolates that are close relatives of the Ames strain. But how close are these strains to the Ames strain, and are any isolates identical to the Ames strain? Van Ert et al. answered these questions by determining which of these 32 SNPs might be specific to the Ames strain and which might be shared with other isolates (38). An analysis of all 32 SNPs in 11 non-Ames B. anthracis isolates that had been previously placed in the Ames canSNP cluster led to the discovery that 5 of the 32 are unique to the Ames strain (38; Figure 4). Three B. anthracis isolates from Texas share a sixth SNP allele state with the Ames strain, and these would be the closest known relatives. Only 10 years ago, it would have been nearly impossible to discover five randomly scattered SNP differences between two isolates that were otherwise identical at >5 million nucleotides. Today, rapid PCR-based assays can specifically target these Ames-specific SNPs in any unknown genome to either include or exclude the sample in a few hours (38).

A recent analysis of tissue samples from the 1979 accidental anthrax incident in Russia suggested that a dominant genotype from these samples belongs to a single canSNP group (A.Br.008/009) that is spread primarily across Europe and parts of western China (36, 39). A toxin gene from this sample (the protective antigen or pagA gene) was reconstructed and sequenced by using overlapping PCR fragments. The sequencing revealed a rare SNP in this gene that was found only in the Russian sample and in 3 of 156 isolates belonging to the A.Br.008/009 canSNP group (39). This SNP indicates the position of a new and rare branch; therefore, it can be used to include or exclude any isolate from a small cluster that involves an isolate from the Russian incident.

These studies demonstrate a whole-genome sequencing approach to the discovery and application of highly specific, forensic-quality SNP signatures for any B. anthracis isolate. The hierarchical approach can place any isolate into phylogenetically correct subgroups and can provide significant resolution among the isolates that belong to 1 of the 12 original canSNP subgroups. But the ultimate DNA signature may be in the comparative analysis of the entire genome sequence of a target isolate with its closest known relative.
A new genome can be sequenced in draft by current technologies for as little as $500, the sequence reads can be aligned to existing reference genomes (e.g., Ames or a close relative), and high-quality SNPs can be discovered and used to define a branch that is specific to this target genome and its closest relatives. Screening of the new isolate against its relatives determines the unique and specific SNPs that can be developed into rapid PCR assays for forensic purposes.
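The screening step described above, in which branch SNPs are checked against close relatives to find those unique to the target, reduces to a simple set-style comparison. The sketch below assumes SNP calls are already available as locus-to-base dictionaries; the function name, the locus names, and the rule of skipping loci with missing calls are assumptions for illustration, not the authors' procedure.

```python
from typing import Dict, List


def strain_specific_snps(target: Dict[str, str],
                         relatives: Dict[str, Dict[str, str]]) -> List[str]:
    """Return loci where the target's allele is carried by no screened relative.

    A locus is skipped if any relative lacks a call there, because
    specificity cannot be claimed from missing data (a conservative assumption).
    """
    specific = []
    for locus, allele in target.items():
        calls = [rel.get(locus) for rel in relatives.values()]
        if any(call is None for call in calls):
            continue
        if all(call != allele for call in calls):
            specific.append(locus)
    return sorted(specific)


# Hypothetical calls at three branch SNPs in a target and two close relatives.
target_calls = {"snp_1": "T", "snp_2": "A", "snp_3": "G"}
relative_calls = {
    "relative_a": {"snp_1": "T", "snp_2": "G", "snp_3": "C"},
    "relative_b": {"snp_1": "C", "snp_2": "G", "snp_3": "C"},
}
print(strain_specific_snps(target_calls, relative_calls))
```

Loci returned by such a screen would be candidates for rapid PCR assays. Their specificity holds only relative to the isolates actually screened, which is why the text speaks of the closest known relatives.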

Paul Keim is a professor in the department of biological sciences at Northern Arizona University and the director of pathogen genomics at TGen. Keim’s research focuses on the genetic and genomic analysis of dangerous bacterial pathogens as well as their phylogeny and forensic analysis. Talima Pearson is a Ph.D. student and Richard Okinaka is a research faculty member at Northern Arizona University. Pearson’s studies on B. anthracis have led to a phylogenetic hypothesis and have identified biases associated with using whole-genome sequences to find character differences between strains. Okinaka’s work focuses on the development of DNA signatures in microbial pathogens with an emphasis on sequencing and analysis of B. anthracis and its plasmids. Address correspondence about this article to Keim at [email protected].

References

(1) Mullis, K. B.; et al. Cold Spring Harbor Symposium on Quantitative Biology LI: Molecular Biology of Homo Sapiens, Cold Spring Harbor, NY, 1986; pp 263–274.
(2) Saiki, R. K.; et al. Science 1985, 230, 1350–1354.
(3) Science, 2003, 300 (5617), 197–376.
(4) Nature, 2003, 452 (7190), 913–1032.
(5) Myers, G. Ann. Intern. Med. 1994, 121, 889–890.
(6) Woese, C. R.; Fox, G. E. Proc. Natl. Acad. Sci. U.S.A. 1977, 74, 5088–5090.
(7) Woese, C. R.; Kandler, O.; Wheelis, M. L. Proc. Natl. Acad. Sci. U.S.A. 1990, 87, 4576–4579.
(8) Vogler, A.; Cardoso, A.; Barraclough, T. Syst. Biol. 2005, 54, 4–20.
(9) Vogler, A. J.; et al. Antimicrob. Agents Chemother. 2002, 46, 511–513.
(10) Vogler, A. J.; et al. J. Bacteriol. 2006, 188, 4253–4263.
(11) Jeffreys, A.; Wilson, V.; Thein, S. Nature 1985, 314, 67–73.
(12) Jeffreys, A.; Wilson, V.; Thein, S. Nature 1985, 316, 76–79.
(13) Butler, J. M. J. Forensic Sci. 2006, 51, 253–265.
(14) Holt, C. L.; et al. J. Forensic Sci. 2002, 47, 66–96.
(15) Budowle, B.; et al. Forensic Sci. Int. 1999, 103, 23–35.
(16) Butler, J. M.; et al. Forensic Sci. Int. 2006, 156, 250–260.
(17) Butler, J. M.; Schoske, R. J. Forensic Sci. 2005, 50, 975–977.
(18) Butler, J. M.; Shen, Y.; McCord, B. R. J. Forensic Sci. 2003, 48, 1054–1064.
(19) Budowle, B.; et al. Science 2003, 301, 1852–1853.
(20) Pace, N. R. Science 1997, 276 (5313), 734–740.
(21) Keim, P.; et al. J. Bacteriol. 1997, 179, 818–824.
(22) Hill, K. K.; et al. J. Bacteriol. 2007, 189, 818–832.
(23) Hill, K. K.; et al. Appl. Environ. Microbiol. 2004, 70, 1068–1080.
(24) Ticknor, L. O.; et al. Appl. Environ. Microbiol. 2001, 67, 4863–4873.
(25) Barns, S. M.; et al. Appl. Environ. Microbiol. 2005, 71, 5494–5500.
(26) Maiden, M. C. J.; et al. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 3140–3143.
(27) Feil, E. J.; et al. Res. Microbiol. 2000, 151, 465–469.
(28) Feil, E. J.; et al. Proc. Natl. Acad. Sci. U.S.A. 2001, 98, 182–187.
(29) Whatmore, A. M.; et al. J. Clin. Microbiol. 2006, 44, 1982–1993.
(30) Keim, P.; et al. J. Bacteriol. 2000, 182, 2928–2936.
(31) Vogler, A. J.; et al. Mutat. Res. 2007, 616, 145–158.
(32) Pearson, T.; et al. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 13536–13541.
(33) Weinstock, G. Technol. Rev. 2007, May/June; www.technologyreview.com/biotech/18660/.
(34) Lenski, R. E.; Keim, P. In Microbial Forensics; Breeze, R. G., et al., Eds.; Elsevier Academic Press: Burlington, MA, 2005; pp 355–370.
(35) Keim, P.; et al. Infect. Genet. Evol. 2004, 4, 205–213.
(36) Van Ert, M. N.; et al. PLoS ONE 2007, 2, e461.
(37) Easterday, W. R.; et al. J. Clin. Microbiol. 2005, 43, 1995–1997.
(38) Van Ert, M. N.; et al. J. Clin. Microbiol. 2007, 45, 47–53.
(39) Okinaka, R.; et al. Emerg. Infect. Dis. 2008, 14 (4); www.cdc.gov/EID/content/14/4/653.htm.
(40) Barns, S. M.; et al. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 9188–9193.
(41) Woese, C. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 6854–6859.
(42) Dawkins, R. The Ancestor’s Tale: A Pilgrimage to the Dawn of Evolution; Houghton Mifflin: Boston, 2004.
(43) Doolittle, W. F. Sci. Am. 2000, 282, 90–95.
(44) Doolittle, R. F.; et al. Science 1996, 271, 470–477.
(45) Whittam, T. S. In Escherichia coli and Salmonella; Curtiss, R., et al., Eds.; ASM Press: Washington, DC, 1996; Vol. 2, pp 2708–2720.