454 sequencing (Wikipedia)


454 Life Sciences, a Roche company, is a biotechnology company based in Branford, Connecticut specializing in high-throughput DNA sequencing using a novel massively parallel sequencing-by-synthesis approach. 454 has experienced rapid growth since its release of the GS20 sequencing machine in 2005, the first next-generation DNA sequencer on the market, and its subsequent acquisition by Roche Diagnostics. Because of the platform's high accuracy, low cost, and long reads, many researchers have migrated away from traditional Sanger capillary sequencing instruments and toward the 454 Sequencing platform for a variety of genome projects [citation needed].

In 2008, 454 Sequencing launched the GS FLX Titanium series reagents for use on the Genome Sequencer FLX instrument, with the ability to sequence 400-600 million base pairs per run with 400-500 base pair read lengths. Continued improvements to the GS FLX System include simplification of the sample preparation workflow, including emPCR automation, and even longer reads. The company plans to launch kits enabling sequencing read lengths of up to 1,000 bp in 2010.

In late 2009, 454 Life Sciences introduced the GS Junior System, a bench top version of the Genome Sequencer FLX System (454 Life Sciences, "454 Life Sciences Unveils New Bench Top Sequencer, Significant Improvements to the Genome Sequencer FLX System Including 1,000 bp Reads for 2010", press release, 2009-11-19, http://454.com/about-454/news/index.asp?display=detail&id=137, retrieved 2009-11-19). The GS Junior System is about the size of a typical laser printer and has throughput scaled to fit the needs of small to medium sized laboratories. The platform will launch with long-read GS Junior Titanium chemistry, which generates average read lengths of 400 bases. The system will ship with a desktop computer that can perform all GS Junior run processing and data analysis.


History and Major Achievements

454 was founded by Jonathan Rothberg. The underlying technology is based on pyrosequencing; Rothberg conceived it while on paternity leave, when he wanted a way to sequence the genome of his newborn son, who had been placed in newborn intensive care. For their invention, Dr. Rothberg and 454 Life Sciences were awarded the Wall Street Journal's Gold Medal for Innovation in 2005.

In late March, 2007, Roche Diagnostics announced an agreement to purchase 454 Life Sciences for US$154.9 million. It will remain a separate business unit.

In November 2006, Dr. Rothberg, Michael Egholm, and colleagues at 454 published a cover article with Svante Pääbo in Nature describing the first million base pairs of the Neanderthal genome, and initiated the Neanderthal Genome Project to complete the sequence of the Neanderthal genome by 2009. In May 2007, Project "Jim", a project initiated by Dr. Rothberg and 454 Life Sciences to determine the first sequence of an individual, was completed[1]. The result of the project, the complete genome sequence of James Dewey Watson, was handed to Dr. Watson at a ceremony at Baylor College of Medicine[2].

Technology

454 Sequencing is a large-scale parallel pyrosequencing system capable of sequencing roughly 400-600 megabases of DNA per 10-hour run on the Genome Sequencer FLX with GS FLX Titanium series reagents. The technology is known for its unbiased sample preparation and long, highly accurate sequence reads (400-500 base pairs in length), including paired reads.[citation needed] Software analysis tools, including an assembler, mapper and amplicon variant analyzer, are included with the system.

The system relies on fixing nebulized and adapter-ligated DNA fragments to small DNA-capture beads in a water-in-oil emulsion. The DNA fixed to these beads is then amplified by PCR. Each DNA-bound bead is placed into a ~29 μm well on a PicoTiterPlate, a fiber optic chip. A mix of enzymes such as DNA polymerase, ATP sulfurylase, and luciferase is also packed into the well. The PicoTiterPlate is then placed into the GS FLX System for sequencing.

DNA library preparation and emPCR

Genomic DNA is fractionated into smaller fragments (300-800 base pairs) that are subsequently polished (made blunt at each end). Short adaptors are then ligated onto the ends of the fragments. These adaptors provide priming sequences for both amplification and sequencing of the sample-library fragments. One adaptor (Adaptor B) contains a 5'-biotin tag for immobilization of the DNA library onto streptavidin-coated beads. After nick repair, the non-biotinylated strand is released and used as a single-stranded template DNA (sstDNA) library. The sstDNA library is assessed for its quality and the optimal amount (DNA copies per bead) needed for emPCR is determined by titration.[citation needed]

The sstDNA library is immobilized onto beads. The beads containing a library fragment carry a single sstDNA molecule. The bead-bound library is emulsified with the amplification reagents in a water-in-oil mixture. Each bead is captured within its own microreactor where PCR amplification occurs. This results in bead-immobilized, clonally amplified DNA fragments.

Sequencing

sstDNA library beads are added to the DNA Bead Incubation Mix (containing DNA polymerase) and are layered with Enzyme Beads (containing sulfurylase and luciferase) onto a PicoTiterPlate device. The device is centrifuged to deposit the beads into the wells. The layer of Enzyme Beads ensures that the DNA beads remain positioned in the wells during the sequencing reaction. The bead-deposition process maximizes the number of wells that contain a single amplified library bead (avoiding more than one sstDNA library bead per well).

The loaded PicoTiterPlate device is placed into the Genome Sequencer FLX Instrument. The fluidics sub-system delivers sequencing reagents (containing buffers and nucleotides) across the wells of the plate. The four DNA nucleotides are added sequentially in a fixed order across the PicoTiterPlate device during a sequencing run. During the nucleotide flow, millions of copies of DNA bound to each of the beads are sequenced in parallel. When a nucleotide complementary to the template strand is added into a well, the polymerase extends the existing DNA strand by adding nucleotide(s). Addition of one (or more) nucleotide(s) generates a light signal that is recorded by the CCD camera in the instrument. This technique is based on sequencing-by-synthesis and is called pyrosequencing.[citation needed] The signal strength is proportional to the number of nucleotides incorporated; for example, a homopolymer stretch incorporated in a single nucleotide flow generates a greater signal than a single nucleotide. However, the signal strength for homopolymer stretches is linear only up to eight consecutive nucleotides, after which the signal falls off rapidly.[3] Data are stored in standard flowgram format (SFF) files for downstream analysis.
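The relationship between signal strength and homopolymer length can be illustrated with a minimal sketch. It assumes a flow order of T, A, C, G and simply rounds each signal to the nearest whole number of incorporated bases; the function name and the intensity values are hypothetical, and real base callers also correct for noise and signal decay, which this sketch ignores.

```python
def call_bases(flow_values, flow_order="TACG"):
    """Turn per-flow signal intensities into a base sequence.

    Each signal is rounded to the nearest integer, which is taken as the
    number of bases incorporated during that flow (0 means no incorporation).
    The flow order is an assumption made for this illustration.
    """
    bases = []
    for i, signal in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        count = int(signal + 0.5)  # round half up; signals are non-negative
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical intensities for eight flows (two cycles of T, A, C, G).
example_flows = [1.02, 0.05, 2.10, 0.97, 0.03, 3.05, 0.00, 1.10]
sequence = call_bases(example_flows)
print(sequence)
```

A signal near 2 in the C flow is read as "CC" and a signal near 3 in the second A flow as "AAA". Because the call depends on rounding a light intensity, long homopolymers, where the signal is no longer linear, are where miscalls would be expected.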

Applications

454 Sequencing can sequence any double-stranded DNA and enables a variety of applications including de novo whole genome sequencing, re-sequencing of whole genomes and target DNA regions, metagenomics and RNA analysis.

Full genome sequencing (de novo sequencing and resequencing)

Full genome sequencing (FGS), also referred to as whole genome sequencing (WGS), consists of projects dealing with the sequencing of the entire genome of an organism, for example, humans, dogs, mice, viruses or bacteria. 454 Sequencing technology is ideal for whole genome assembly due to its ultra high throughput and long single reads (500 base pairs). The ability to combine shotgun reads with paired end reads facilitates high genome coverage with minimal gaps. As a result, the 454 platform has effectively sequenced and assembled a number of complex genomes. In June 2006, 454 Life Sciences launched a project with the Max Planck Institute for Evolutionary Anthropology to sequence the genome of the Neanderthal, the extinct closest relative of humans. This has implications for the understanding of human evolution and development. At 3 billion base pairs, a complete sequence of the Neanderthal genome is expected to take two years to finish[4][5]. In September 2008 the complete Neanderthal mitochondrial genome was sequenced, establishing the divergence between humans and Neanderthals at 660,000 +/- 140,000 years[6].

In May 2007, researchers at the Baylor College of Medicine used the 454 Sequencing system to sequence and assemble the complete human diploid genome of Dr. James Watson. The feat demonstrated that generating high-quality sequence from humans, quickly and affordably, is now feasible.

Amplicon Sequencing

Amplicon (ultra deep) sequencing is a new field which is largely being enabled through 454 Sequencing technology. Unlike Sanger sequencing, 454 sequencing allows mutations to be detected at extremely low levels. Researchers are able to PCR amplify specific, targeted regions of DNA. This method is particularly useful for identifying low frequency somatic mutations in cancer samples or discovering rare variants in HIV infected individuals. The Genome Sequencer FLX system offers dedicated analysis software, the GS Amplicon Variant Analyzer, which automatically computes the alignment of reads from amplicon-based samples against a reference sequence.

Transcriptome sequencing

Transcriptome sequencing encompasses experiments including small RNA profiling and discovery, mRNA transcript expression analysis (full-length mRNA, expressed sequence tags (ESTs) and ditags, and allele-specific expression) and the sequencing and analysis of full-length mRNA transcripts. The transcriptome data derived from the Genome Sequencer FLX is ideally suited to detailed transcriptome investigation in the areas of novel gene discovery, gene space identification in novel genomes, assembly of full-length genes, single nucleotide polymorphism (SNP), insertion-deletion and splice-variant discovery.

Metagenomics

Metagenomics is the study of the genomic content in a complex sample. The two primary goals of this approach are to characterize the organisms present in a sample and identify what roles each organism has within a specific environment. Metagenomics samples are found nearly everywhere, including several microenvironments within the human body, soil samples, extreme environments such as deep mines and the various layers within the ocean. The Genome Sequencer FLX System enables a comprehensive view into the diversity of an environmental habitat. The system's long reads ensure the enormous specificity needed to compare sequenced reads against DNA or protein databases. Researchers often use the platform for counting environmental gene tags to analyze the relative abundance of microbial species under varying environmental conditions.

Advantages and disadvantages

454 Sequencing with GS FLX Titanium series reagents sequences 400-600 million bases per 10-hour run, allowing large amounts of DNA to be sequenced at low cost compared to Sanger chain-termination methods. With Q20 read lengths of 400 bases (99% accuracy at the 400th base and higher for preceding bases) and significantly higher throughput, de novo assembly with 454 Sequencing is at least equivalent to Sanger assembly, while being dramatically faster and an order of magnitude less expensive. G-C rich content is not as much of a problem, and the lack of reliance on cloning means that unclonable segments are not skipped. Also, it is capable of detecting mutations in an amplicon pool at a high sensitivity level, which may have implications in clinical research, especially cancer and HIV.[7][8]

A limitation of 454 sequencing remains the resolution of homopolymer DNA segments, i.e. regions of template which contain multiple consecutive copies of a single base (A, C, G or T). Since pyrosequencing relies on the magnitude of light emitted to determine the number of repetitive bases, erroneous base calls can be a problem with homopolymers. Another disadvantage of 454 sequencing is that while it is cheaper and faster per base, each run is quite expensive, and it is therefore unsuited for sequencing targeted fragments from small numbers of DNA samples, such as for phylogenetic analysis. For some sequencing applications the high cost of an individual 454 sequencing run can be offset by subdividing sequencing plates into multiple regions and using sample specific molecular identifier (MID) tags of 10 bp to multiplex many individual samples in each sequencing run.
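As a rough illustration of how MID-tagged reads from a multiplexed run might be sorted back into samples, the sketch below assigns each read to a sample by its first 10 bases. The tag sequences, sample names and function name are hypothetical placeholders, not actual kit sequences, and only exact tag matches are accepted; a production tool would likely tolerate sequencing errors in the tag.

```python
# Hypothetical 10 bp MID tags, invented for illustration only.
HYPOTHETICAL_MID_TAGS = {
    "ACGTACGTAC": "sample_1",
    "TGCATGCATG": "sample_2",
}
MID_LENGTH = 10  # tag length in base pairs, as described in the text


def demultiplex(reads, mid_tags=HYPOTHETICAL_MID_TAGS, mid_length=MID_LENGTH):
    """Group reads by the MID tag at their start and strip the tag.

    Reads whose first `mid_length` bases match no known tag are kept,
    unmodified, under the key "unassigned".
    """
    bins = {}
    for read in reads:
        sample = mid_tags.get(read[:mid_length])
        if sample is None:
            bins.setdefault("unassigned", []).append(read)
        else:
            bins.setdefault(sample, []).append(read[mid_length:])
    return bins


reads = [
    "ACGTACGTACGGATTACA",
    "TGCATGCATGCCTTAGGA",
    "ACGTACGTACTTTGACCA",
    "GGGGGGGGGGAAAA",
]
bins = demultiplex(reads)
for sample, sample_reads in sorted(bins.items()):
    print(sample, sample_reads)
```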

Patents awarded

Archon X Prize (Wikipedia)


The Archon X Prize for Genomics, the second X Prize to be offered by the X Prize Foundation, based in Santa Monica, California, was announced on October 4, 2006. The Archon X Prize in genomics is a joint effort of the X Prize Foundation and the J. Craig Venter Science Foundation.

The $10 million (US) prize is to be awarded to "the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome." The $10 million was donated by Canadian geologist and philanthropist Stewart Blusson, who co-discovered the Ekati Diamond Mine. "Archon" is the name of Blusson's company and refers to the type of lithosphere beneath northern Canada.

In comparison, the Human Genome Project was completed in 2003 at an overall cost of some $3 billion (US) by the joint effort of several teams, one of which was that of Dr. J. Craig Venter, who led the first private team to successfully sequence a complete human genome. In preceding decades, combined governmental and private funding efforts spent hundreds of millions of dollars to develop the instrumentation required. It took the Venter team hundreds of millions of dollars (US) and nine months to achieve their historic accomplishment.

The J. Craig Venter Science Foundation offered the $500,000 (US) Innovation in Genomics Science and Technology Prize in September 2003 aimed at stimulating development of less expensive and faster sequencing technology. To attract even more resources to this goal, Dr. Venter joined forces with the X Prize Foundation, wrapping his competition and prize purse into the new Archon X Prize for Genomics.


The Competition Guidelines

The purpose of the Archon X Prize competition is to develop radically new technology that will dramatically reduce the time and cost of sequencing genomes, and accelerate a new era of predictive and personalized medicine. The X Prize Foundation aims to enable the development of low-cost diagnostic sequencing of human genomes.

If more than one team attempts the competition at the same time, and more than one team fulfills all the criteria, then teams will be ranked according to the time of completion. No more than three teams will be ranked and will share the purse in the following manner: $7.5 million to the winner and $2.5 million to the second place team if two teams are successful, or $7 million, $2 million and $1 million if three teams are successful.
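The purse-sharing rule above can be written out as a small lookup, shown here only as an illustrative sketch. The function name is hypothetical, and the single-winner case assumes the whole $10 million purse goes to that team, which follows from the prize description but is not spelled out in the sharing rule itself.

```python
def split_purse(successful_teams):
    """Return prize shares in US dollars, ordered by finishing rank.

    `successful_teams` is the number of teams that met all criteria in the
    same attempt. No more than three teams are ranked.
    """
    if successful_teams < 0:
        raise ValueError("number of teams cannot be negative")
    shares_by_count = {
        0: [],
        1: [10_000_000],  # assumption: a sole winner takes the full purse
        2: [7_500_000, 2_500_000],
        3: [7_000_000, 2_000_000, 1_000_000],
    }
    ranked = min(successful_teams, 3)
    return shares_by_count[ranked]


print(split_purse(2))
print(split_purse(3))
```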

Actual competition events will take place twice a year with all eligible teams given the opportunity to make an attempt, starting at precisely the same time as the other teams. The final deadline for winning the prize is prior to 12:01 AM Pacific Standard Time on October 4, 2013.

Personal genomics (Wikipedia)


Personal genomics is a branch of genomics where individual genomes are genotyped and analyzed using bioinformatics tools. It is also related to traditional population genetics. The genotyping stage can have many different experimental approaches including single nucleotide polymorphism (SNP) chips (typically 0.02% of the genome), or partial or full genome sequencing. Once the genotypes are known, there are many bioinformatics analysis tools that can compare individual genomes and find disease association of the genes and loci. The most important aspect of personal genomics is that it may eventually lead to personalized medicine, where patients can take genotype specific drugs for medical treatments.

Personal genomics is not a single individual's vision or invention. For decades, many researchers anticipated that this branch of biology would arrive once the cost of genotyping fell far enough. Due to the advent of cheap and fast sequencers, full genome personal genomics is becoming a reality. There have nonetheless been active early proponents of personal genomics projects, such as George Church at Harvard Medical School.

Genomics used to mean academic research on consensus genomes which have been assembled from many different individuals of a particular species. Personal genomics changes this into customized bioinformatic discovery on individuals.


Use of personal genomics in predictive medicine

Predictive medicine is the use of the information produced by personal genomics techniques when deciding what medical treatments are appropriate for a particular individual.

An example of the use of predictive medicine is pharmacogenomics, in which genetic information can be used to select the most appropriate drug to prescribe to a patient. The drug should be chosen to maximize the probability of obtaining the desired result in the patient and minimize the probability that the patient will experience side effects. It is hoped that genetic information will allow physicians to tailor therapy to a given patient, in order to increase drug efficacy and minimize side effects. There are only a few examples in which this information is currently useful in clinical practice, but it is anticipated that tailored therapy will emerge rapidly as researchers validate the clinical utility of different pharmacogenomic markers.

Another area in which there is great interest is disease risk prediction based on genetic markers. Researchers in this area have generated a great deal of information through the use of genome-wide association studies. While there is hope that risk information will be useful in providing predictive medicine, most common medical conditions are multifactorial and the actual risk to the individual depends on both genetic and environmental components, both of which are not completely understood at present. Therefore, the clinical utility of personal genomic information is currently limited. It is hoped that with further research, an accurate risk profile might enable individuals to take steps to prevent diseases for which they are at increased risk based on genetics.

Cost of sequencing an individual's genome

There is currently great interest in personal genomics. This is being fuelled by the rapid drop in the cost of sequencing a human genome. This drop in cost is due to the continual development of new, faster, cheaper DNA sequencing technologies such as "next generation DNA sequencing" that may provide access to full genome sequencing so that the entire genetic code of an individual can be deduced all at once.

The National Human Genome Research Institute, part of the U.S. National Institutes of Health, has set a target to be able to sequence a human-sized genome for US$100,000 by 2009 and US$1,000 by 2014[1]. There is a widespread belief that within 10 years the cost of sequencing a human genome will fall to $1,000.

There are 6 billion base pairs in the diploid human genome. Statistical analysis reveals that a coverage of approximately ten times is required to get coverage of both alleles in 90% of the human genome from 25 base-pair reads with shotgun sequencing[2]. This means that a total of 60 billion base pairs must be sequenced. An ABI SOLiD, Illumina or Helicos[3] sequencing machine can sequence 2 to 10 billion base pairs in each $8,000 to $18,000 run. The purchase cost, personnel costs and data processing costs must also be taken into account. Sequencing a human genome therefore costs approximately $300,000 in 2008.
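The arithmetic behind this estimate can be sketched as follows. The figures are taken from the paragraph above; pairing the highest yield with the lowest run price (and the reverse) to bracket the run costs is an assumption made for illustration, and the variable names are hypothetical.

```python
GENOME_BP = 6_000_000_000                      # diploid human genome, from the text
COVERAGE = 10                                  # approximate fold coverage, from the text
BP_PER_RUN = (2_000_000_000, 10_000_000_000)   # low and high yield per run
COST_PER_RUN = (8_000, 18_000)                 # low and high price per run, US$


def runs_needed(total_bp, bp_per_run):
    """Whole number of runs required to reach total_bp (ceiling division)."""
    return -(-total_bp // bp_per_run)


total_bp = GENOME_BP * COVERAGE
fewest_runs = runs_needed(total_bp, BP_PER_RUN[1])
most_runs = runs_needed(total_bp, BP_PER_RUN[0])

# Assumption: best case pairs high yield with low price, worst case the reverse.
lowest_run_cost = fewest_runs * COST_PER_RUN[0]
highest_run_cost = most_runs * COST_PER_RUN[1]

print(f"bases to sequence: {total_bp:,}")
print(f"runs needed: {fewest_runs} to {most_runs}")
print(f"run costs alone: ${lowest_run_cost:,} to ${highest_run_cost:,}")
```

The run costs alone span a wide range; the figure of approximately $300,000 quoted above falls inside it once purchase, personnel and data processing costs are considered.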

In 2009, Complete Genomics of Mountain View announced that it would provide full genome sequencing for $5,000, from June 2009.[4] This will only be available to institutions, not individuals.[5]

This cost is still too high for governments to introduce programs into health services to sequence the genomes of all individuals in a country. However, it may be viable when it falls below $1,000, and the cost of sequencing a human genome is dropping rapidly. For example, approximately 1 million babies are born in Canada each year. To sequence all of their genomes would cost approximately $1 billion per year, or just 1% of Canada's total healthcare budget. Given the ethical concerns about presymptomatic genetic testing of minors,[6][7][8][9] it is likely that personal genomics will first be applied to adults who can provide consent to undergo such testing.

In June 2009, Illumina announced that they were launching their own Personal Full Genome Sequencing Service at a depth of 30X for $48,000 per genome.[10] This is still too expensive for true commercialization but the price will most likely decrease substantially over the next few years as they realize economies of scale and given the competition with other companies such as Complete Genomics.[11][12]

Whenever you get asked about a recent genome publication or the latest sequencing technology, the conversation invariably turns to cost. It turns out, cost is a tricky thing. When people talk of the "cost" of the Human Genome Project, they typically quote the cost for the entire project. A cost that includes sequencing instruments (several revisions), personnel, overhead, consumables, informatics, and IT. They contrast this rather large cost to the much lower cost of the $10,000 or $1,000 genome. However, in reality that "$10,000 genome" costs more than $10,000 (same goes for the $1,000 genome). You see, when people talk about the $10,000 genome, they are only accounting for the cost of consumables: flow cells and reagents. Perhaps this focus on consumables has its roots in the days of the Human Genome Project when reagent (BigDye®) costs dominated sequencing costs. Perhaps the focus is driven by marketers at the sequencing instrument companies who want to draw attention away from the six-figure sequencing instrument costs. Perhaps this focus is driven by the $10,000 recurring cost number specified by the Archon X PRIZE for Genomics, which receives much more attention than the $1 million direct cost cap. Regardless of the reason for the focus on consumables (likely some combination of all of the above), the reality is that consumable costs have fallen much more rapidly than any other cost associated with genome sequencing and can no longer be the only number quoted when stating the cost of a genome; at least if you want that number to actually mean anything.

So, what other costs should be considered? Well, the types of costs and actual values will depend greatly on your situation. Will you be doing the sequencing or will you be contracting at a core facility or sequencing-as-a-service company? Will you be doing the analysis or relying on a third party? How will you be validating your results? How many people will be working on the project at what percent of their efforts? Will you buy everyone a Pet Rock when the project reaches 1 exabase of sequence?

Here I'll run through a standard cost calculation for a typical academic sequencing and analysis center to sequence and analyze a human genome. The names and costs have been changed to protect the innocent (this means I chose nice, round numbers that are the right order of magnitude). Why not use real numbers? Read the previous paragraph (I'll wait ...): your cost factors and numbers will not be the same as anyone else's. So you're going to have to do the calculation for yourself, not just lift the numbers from this post.

First we can consider the consumables (e.g., flow cells and reagents) costs. Let's say those are $10,000. Then there is the instrument depreciation. Let's say the instrument costs $600,000, has an expected life of three years, and can do 40 runs per year. Assuming a straight-line depreciation, the instrument depreciation per run is $5,000 (= $600,000 / (3 × 40)). If the instrument supports two flow cells, you would divide the number in half to get $2,500. Now, the DNA doesn't just hop on the sequencer by itself. DNA has to be acquired, consents signed and approved by institutional review boards (IRBs), and sequencing libraries have to be made. Let's say sample acquisition costs $100,000 for 50 samples; that's $2,000 per sample. Shepherding the project and consents through the IRB takes one full-time employee (FTE) at 10% effort one month. We'll say the cost of one FTE (salary, benefits, etc.) is $60,000 per year. So getting the project through IRB approval costs $500. If the project is able to use all 50 samples, that's only $10 per sample! If the consumables and personnel time to make a sequencing library is $200, then the total production cost for sequencing our human genome is $14,710. Wait, I forgot the IT and LIMS support! In this scenario we'll say that each instrument needs one IT FTE and one LIMS FTE, each at 25% effort ($750). And you need disk space for the data ($1,000, you can cut that in half if you throw away everything but the sequence, qualities, and alignments) and compute time ($100) to run alignments and QC. Add to that 50% overhead charges that your institution takes to cover administration, utilities, lab space, etc. (a company would need to determine each of these costs and add them in rather than this overhead multiplier) and your $10,000 genome costs you nearly $25,000. And you haven't even called a variant yet.
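If you want to redo this calculation with your own numbers, a short script helps. The sketch below just reproduces the arithmetic from the paragraph above using the same round, illustrative figures; the variable names are mine, and the IT/LIMS, disk and compute amounts are taken as given rather than derived.

```python
# All figures are the illustrative round numbers from the text, in US$.
consumables = 10_000

instrument_price = 600_000
instrument_life_years = 3
runs_per_year = 40
flow_cells_per_run = 2
instrument_per_run = instrument_price / (instrument_life_years * runs_per_year)
instrument_per_genome = instrument_per_run / flow_cells_per_run

samples = 50
sample_acquisition = 100_000 / samples

fte_annual_cost = 60_000
irb_total = fte_annual_cost / 12 / 10   # one FTE, one month, 10% effort
irb_per_sample = irb_total / samples

library_prep = 200

production = (consumables + instrument_per_genome + sample_acquisition
              + irb_per_sample + library_prep)

it_and_lims = 750      # taken as given in the text
disk = 1_000
compute = 100
overhead_rate = 0.50

subtotal = production + it_and_lims + disk + compute
data_generation_total = subtotal * (1 + overhead_rate)

print(f"production cost: ${production:,.0f}")
print(f"with IT, LIMS, disk, compute: ${subtotal:,.0f}")
print(f"with 50% overhead: ${data_generation_total:,.0f}")
```

Swap in your own instrument price, run count, salaries and overhead rate and the shape of the answer will stay the same: consumables are only one line among many.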

Speaking of variants, let's assume you want to call SNPs, indels, and structural variations. The first thing you will have to do is align your reads. Let's say you are efficient and simply use the alignments from the production QC step. Above we assumed $100 for these alignments, but what goes into that number? First you have to determine an average alignment time per genome. Let's say 90 Gb of sequence (30× coverage of a human genome) in 2×100 base read pairs takes 1,000 core×hr to align to the human reference genome. If you did this on Amazon EC2 ($0.17/core×hr), it would cost you $170 (plus data transfer and storage costs). If you have your own cluster, you need to amortize the cost of your cluster (compute nodes, racks, networking equipment and cabling, PDUs, etc.) per core×hr, add in the cost of your administrators per core×hr, and utilities or overhead per core×hr to get your cost. When you do that calculation, let's say you get $0.10 per core×hr, so the alignment costs you $100 (but you already paid it above). Merging the BAM files from each lane's worth of data and marking duplicates takes 50 hours, costing $5. Calling SNPs and indels (including reassembly) takes 100 hours, costing $10. Detecting structural variation using aberrant read pairs takes 200 hours, costing $20. Annotating all the variants across an entire genome takes 100 hours, costing $10. The disk space for all of this costs you $1,000 (again, you'll need to calculate a cost per GB factoring storage, racks, switches, servers, personnel, etc. to get this number). Finally, somebody needs to run (or automate) this analysis pipeline. Figure that one analyst and one developer each at 10% effort can accomplish this over the course of two weeks; $480. Add all this up and your analysis with overhead runs you about $2300, or about 10% of the cost of generating the data. Of course, human resequencing for variant detection is not the only application of sequencing data. 
Other types of analysis, e.g., de novo assembly and metagenomic analysis, can have significantly higher costs per base. For example, in metagenomic analysis you may want to classify reads that do not align to known sequences by aligning them in protein space against a database like NCBI nr. If you generate 10 Gb of sequence per sample and 25% of the read pairs do not align to anything else, you will need to align 12.5 million reads. If you use the most common tool for this sort of alignment, NCBI BLAST+ blastx, it would take over 5,500 core×hr, costing about $550 by itself.
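The analysis numbers can be checked the same way. This sketch uses the illustrative core×hr counts and the $0.10 per core×hr rate from the text; the variable names are mine, and the alignment step is left out because it was already paid for in the production calculation.

```python
RATE_PER_CORE_HOUR = 0.10   # illustrative in-house cluster rate, US$
OVERHEAD_RATE = 0.50

# Core-hours per step, from the text (alignment excluded: already counted).
core_hours = {
    "merge BAMs and mark duplicates": 50,
    "call SNPs and indels": 100,
    "detect structural variation": 200,
    "annotate variants": 100,
}

compute = sum(core_hours.values()) * RATE_PER_CORE_HOUR
disk = 1_000
personnel = 480             # analyst + developer, 10% effort, two weeks

analysis_subtotal = compute + disk + personnel
analysis_total = analysis_subtotal * (1 + OVERHEAD_RATE)

# The metagenomic blastx example: about 5,500 core-hours at the same rate.
blastx_cost = 5_500 * RATE_PER_CORE_HOUR

print(f"compute: ${compute:,.2f}")
print(f"analysis before overhead: ${analysis_subtotal:,.2f}")
print(f"analysis with overhead: ${analysis_total:,.2f}")
print(f"blastx classification: ${blastx_cost:,.2f}")
```

The total with overhead comes to roughly $2,300, and disk space and people, not compute time, account for most of it.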

Now that you have your sequence data and list of variants, you are going to need to validate them. There are a lot of different ways to validate variants, e.g., PCR, pool, and sequence or Sequenom, so I am not going to go through a detailed cost calculation. It suffices to say that, depending on the number of variants you want to validate, the cost can rise into the thousands of dollars. Whatever platform you choose, you will need to go through a thorough cost calculation (like the one done above for the original sequencing and analysis). For the sake of this post, which is already too long, we'll say the validation cost is $2,000.

Finally, somebody has to be running this show. Let's say project management personnel costs $20,000, or $400 per sample. Put this all together and your $10,000 genome costs about $30,000. In other words, the often quoted consumables number only accounts for about 50% of the total cost (Note: overhead applies to consumables also, so while $10,000 looks like 1/3 of $30,000, it is actually half). Again, none of the numbers I use above are real (but they are in the ball park) and all sequencing and analysis facilities are going to have different contributors to their costs resulting in varying contributions from consumables. However, regardless of the cost contribution of consumables at present, the cost of consumables is projected to fall below $5,000 by the end of this year, and it won't stop there. As such, it is already meaningless to only quote consumable costs when stating the price of sequencing a genome. By the end of the year, it will be ridiculous.
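Here is the final tally as a sketch, using the rounded totals worked out in the paragraphs above. One assumption to flag: I add validation and project management without the 50% overhead multiplier, since the text does not say either way; applying it would push the total slightly above $30,000 rather than slightly below. The variable names are mine.

```python
OVERHEAD_RATE = 0.50

data_generation = 24_840    # production, IT/LIMS, disk, compute, with overhead
analysis = 2_287.5          # variant analysis, with overhead
validation = 2_000          # assumption: no overhead applied
project_management = 20_000 / 50   # $400 per sample; assumption: no overhead

total = data_generation + analysis + validation + project_management

# Consumables carry overhead too, so compare the loaded figure to the total.
consumables_loaded = 10_000 * (1 + OVERHEAD_RATE)
consumables_share = consumables_loaded / total

print(f"total cost per genome: ${total:,.2f}")
print(f"consumables with overhead: ${consumables_loaded:,.0f}")
print(f"consumables share of total: {consumables_share:.0%}")
```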

Update: Clarified Archon X Prize cost accounting.

Full-genome sequencing (Wikipedia)


Full genome sequencing (FGS), also known as whole genome sequencing, complete genome sequencing, or entire genome sequencing, is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and for plants the chloroplast as well. Almost any biological sample--even a very small amount of DNA or ancient DNA--can provide the genetic material necessary for full genome sequencing. Such samples may include saliva, epithelial cells, bone marrow, hair (as long as the hair contains a hair follicle), seeds, plant leaves, or anything else that has DNA-containing cells. Because the sequence data that is produced can be quite large (for example, there are approximately six billion base pairs in each human diploid genome), genomic data is stored electronically and requires a large amount of computing power and storage capacity. Full genome sequencing would have been nearly impossible before the advent of the microprocessor, computers, and the Information Age.

Full genome sequencing should thus not be confused with DNA profiling. The latter only determines the likelihood that genetic material came from a particular individual or group and does not contain additional information on genetic relationships, origin or susceptibility to specific diseases.[1] It is also distinct from SNP genotyping, which covers less than 0.1% of the genome. Almost all truly complete genomes are of microbes; the term "full genome" is thus sometimes used loosely to mean "greater than 95%". The remainder of this article focuses on nearly complete human genomes.

In general, knowing the complete DNA sequence of an individual's genome does not, on its own, provide useful clinical information, but this may change over time as a large number of scientific studies continue to be published detailing clear associations between specific genetic variants and disease.[2][3]

The first nearly complete human genomes sequenced were those of J. Craig Venter (Caucasian at 7.5-fold average coverage),[4][5][6] James Watson (Caucasian male at 7.4-fold),[7][8][9] a Han Chinese (YH at 36-fold),[10] a Yoruban from Nigeria (at 30-fold),[11] a female leukemia patient (at 33 and 14-fold coverage for tumor and normal tissues)[12] and Seong-Jin Kim (Korean at 29-fold).[13] Other full genomes have been sequenced but not published, and as of June 2009, commercialization of full genome sequencing is in an early stage and growing rapidly.

New techniques

An ABI PRISM 3100 Genetic Analyzer. Sequencers automate the process of sequencing the genome.

One possible way to achieve the cost-effective, high-throughput sequencing necessary for full genome sequencing is nanopore technology, a patented technology held by Harvard University and Oxford Nanopore Technologies and licensed to biotechnology companies.[14] To facilitate their full genome sequencing initiatives, Illumina licensed nanopore sequencing technology from Oxford Nanopore Technologies and Sequenom licensed the technology from Harvard University.[15][16] Another possible way to accomplish cost-effective high-throughput sequencing is by utilizing fluorophore technology. Pacific Biosciences is currently using this approach in their SMRT (single molecule real time) DNA sequencing technology.[17] Complete Genomics is developing DNA Nanoball (DNB) technology, in which the nanoballs are arranged on self-assembling arrays.[18] Pyrosequencing is a method of DNA sequencing based on the sequencing by synthesis principle.[19] The technique was developed by Pål Nyrén and his student Mostafa Ronaghi at the Royal Institute of Technology in Stockholm in 1996,[20][21][22] and is currently being used by 454 Life Sciences in their effort to deliver an affordable, fast and highly accurate full genome sequencing platform.[23]

Older techniques

Sequencing of the entire human genome was first accomplished in 2000 partly through the use of shotgun sequencing technology. While full genome shotgun sequencing for small (4000-7000 base pair) genomes was already in use in 1979,[24] broader application benefited from pairwise end sequencing, known colloquially as double-barrel shotgun sequencing. As sequencing projects began to take on longer and more complicated genomes, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.
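The constraint described above (two end reads facing each other, roughly one fragment length apart) can be expressed as a simple consistency check. The following is a minimal sketch in Python, not the method used by any particular assembler; the names ReadPlacement and mates_consistent, the coordinates and the tolerance are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadPlacement:
    """Hypothetical placement of one end read on a draft assembly."""
    start: int     # leftmost coordinate of the read
    length: int    # read length in bases
    forward: bool  # True if the read lies on the forward strand


def mates_consistent(a, b, fragment_length, tolerance):
    """Return True if two end reads could come from the same fragment."""
    # The two ends of a fragment are read from opposite strands.
    if a.forward == b.forward:
        return False
    # The forward-strand read must lie upstream of its mate.
    left, right = (a, b) if a.forward else (b, a)
    if left.start > right.start:
        return False
    # Distance from the start of the left read to the end of the right read
    # should be close to the expected fragment length.
    span = right.start + right.length - left.start
    return abs(span - fragment_length) <= tolerance


fwd = ReadPlacement(start=1000, length=500, forward=True)
rev = ReadPlacement(start=2600, length=500, forward=False)
print(mates_consistent(fwd, rev, fragment_length=2000, tolerance=200))  # True
```

A placement that fails this check suggests that one of the two reads has been positioned incorrectly, which is the kind of information that makes paired data useful during reconstruction.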

The first published description of the use of paired ends was in 1990 as part of the sequencing of the human HPRT locus,[25] although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991.[26] In 1995 Roach et al. introduced the innovation of using fragments of varying sizes,[27] and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets. The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the entire genome of the bacterium Haemophilus influenzae in 1995,[28] and then by Celera Genomics to sequence the entire fruit fly genome in 2000,[29] and subsequently the entire human genome. Applied Biosystems, now called Life Technologies, manufactured the shotgun sequencers utilized by both Celera Genomics and The Human Genome Project.

While shotgun sequencing was one of the first approaches utilized to successfully sequence the full genome of a human, it is too expensive and has too long a turnaround time to be utilized for commercial purposes. Because of this, shotgun sequencing technology, even though it is still relatively 'new', is being displaced by technologies like pyrosequencing, SMRT sequencing, and nanopore technology.[30]

Race to commercialization

In October 2006, the X Prize Foundation, working in collaboration with the J. Craig Venter Science Foundation, established the Archon X Prize for Genomics,[31] intending to award US$10 million to "the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 per genome."[32] However, higher accuracy rates (or confirmatory methods) are desirable for some clinical applications. An error rate of 1 in 100,000 bases, out of a total of six billion bases in the human diploid genome, would mean about 60,000 errors per genome, which is a significant number of false positives and negatives. For the latter, it is not known where the errors occur. The error rate required for widespread clinical use, such as in predictive medicine,[33] is currently set by over 1,400 clinical single-gene sequencing tests[34] (for example, errors in the BRCA1 gene for breast cancer risk analysis). As of May 2010, the Archon X Prize for Genomics remains unclaimed.
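The figure of about 60,000 errors per genome follows from dividing the genome size by the number of bases per error. A minimal sketch in Python, with a function name and constant that are illustrative only:

```python
def expected_errors(genome_bases, bases_per_error):
    """Expected number of erroneous bases at one error per `bases_per_error`."""
    return genome_bases // bases_per_error


# Six billion bases in the human diploid genome, as stated in the text.
DIPLOID_HUMAN_BASES = 6_000_000_000

# Archon X Prize threshold: no more than one error in every 100,000 bases.
print(expected_errors(DIPLOID_HUMAN_BASES, 100_000))  # 60000
```

The same function can be used to see how the expected error count scales with a stricter or looser accuracy threshold.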

In 2007, Applied Biosystems started selling a new type of sequencer called the SOLiD System.[35] Current SOLiD chemistries enable users to sequence 60 gigabases per run.[36]

In 2008 and 2009, both public and private companies have emerged that are now in a competitive race to be the first mover to provide a full genome sequencing platform that is commercially robust for both research and clinical use,[37] including Illumina,[38] Sequenom,[39] 454 Life Sciences,[40] Pacific Biosciences,[41] Complete Genomics,[42] Intelligent Bio-Systems,[43] Genome Corp.,[44] ION Torrent Systems,[45] and Helicos Biosciences[46]. These companies are heavily financed and backed by venture capitalists, hedge funds, investment banks and, in the case of Illumina, Sequenom and 454, heavy re-investment of revenue into research and development, mergers and acquisitions, and licensing initiatives.[47][48][49]

In the race to commercialize full genome sequencing, companies have made claims about being able to offer a service at a specific time for a specific price that have turned out not to be true. Intelligent Bio-Systems stated in November 2007 that by the end of 2008 they would release a platform capable of providing a $5,000 full genome sequence, but, as of May 2010, no such platform has been released.[50]

Pacific Biosciences stated that they would start selling their full genome sequencers in early 2010. While they didn't disclose the cost to sequence a single genome, they did state they may not release their second-generation machine capable of a $1,000 genome until 2013.[51] Complete Genomics, however, stated that they'll be able to provide a $5,000 full genome sequencing service by the summer of 2009.[52] The accuracy, precision, and reproducibility of both Pacific Biosciences' and Complete Genomics' technology, however, are still unknown.

Knome currently provides genome sequencing services, but the cost is about $99,500 per genome (down from $350,000 per genome initially),[53] and both the turn-around time and the accuracy are unknown. The number of customers was limited to 20 for the first year, and the service, which focuses on wealthy customers, is still considered early-stage commercialization of full genome sequencing.[54]

As of January 2009, there are no indications that any of these companies have been hindered by the global recession, and the race appears to be proceeding at full speed.[55]

At the end of February 2009, Complete Genomics released a full sequence of a human genome that was sequenced using their service. The data indicates that the accuracy of Complete Genomics' full genome sequencing service is just under 99.99%, meaning that there is an error in one out of every ten thousand base pairs. This means that their full sequence of the human genome will contain approximately 80,000-100,000 false positive errors in each genome. However, this accuracy rate was based on a Complete Genomics sequence that was completed utilizing a 90x depth of coverage (each base in the genome was sequenced 90 times) while their commercialized sequence is reported to be only 40x, so the accuracy may be substantially lower unless they can find some way to improve it before their first service release planned for the summer of 2009. This accuracy rate may be acceptable for research purposes, but clinical use would require confirmation by other methods of any reportable alleles.[56][57] In March 2009, it was announced that Complete Genomics had signed a deal with the Broad Institute to sequence cancer patients' genomes and would be sequencing five full genomes to start.[58] In April 2009, Complete Genomics announced that it plans to sequence 1,000 full genomes between June 2009 and the end of the year and that they plan to be able to sequence one million full genomes per year by 2013.[59] Complete Genomics plans to officially launch in June 2009, although it is unknown if their lab will have received CLIA certification by that time.
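Depth of coverage, as in the 90x and 40x figures above, is commonly estimated as the total number of sequenced bases divided by the length of the genome. The sketch below, in Python, illustrates that relationship under the simplifying assumption that all reads have the same length; the function names and the numbers in the example are hypothetical and are not Complete Genomics' figures.

```python
import math


def mean_depth(num_reads, read_length, genome_bases):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_bases


def reads_needed(depth, read_length, genome_bases):
    """Smallest number of reads that reaches the requested average depth."""
    return math.ceil(depth * genome_bases / read_length)


# Hypothetical toy genome of one million bases and 100-base reads.
print(mean_depth(400_000, 100, 1_000_000))  # 40.0
print(reads_needed(90, 100, 1_000_000))     # 900000
```

Because this is an average, individual bases may be covered more or fewer times than the stated depth, which is one reason a lower depth can reduce accuracy.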

In June 2009, Illumina announced that they were launching their own Personal Full Genome Sequencing Service at a depth of 30X for $48,000 per genome.[60] This is still expensive for widespread consumer use, but the price may decrease substantially over the next few years as they realize economies of scale and given the competition with other companies such as Complete Genomics.[61][62] Jay Flatley, Illumina's President & CEO, stated that "during the next five years, perhaps markedly sooner," the price point for full genome sequencing will fall from $48,000 to under $1,000.[63] Illumina has already signed agreements to supply full genome sequencing services to multiple direct-to-consumer personal genomics companies.

In August 2009, the founder of Helicos Biosciences, Dr. Stephen Quake, stated that using the company's Heliscope Single Molecule Sequencer he sequenced his own full genome for less than $50,000. He stated that he expects the cost to decrease to the $1,000 range within the next two to three years.[64]

In August 2009, Pacific Biosciences secured an additional $68 million in new financing, bringing their total capitalization to $188 million.[65] Pacific Biosciences said they are going to use this additional investment to prepare for the upcoming launch of their full genome sequencing service in 2010.[66] Complete Genomics followed by securing another $45 million in a fourth round of venture funding during the same month.[67] Complete Genomics has also made the claim that it will sequence 10,000 full genomes by the end of 2010.[68]

GE Global Research is also now in the race to commercialize full genome sequencing as they are currently working on creating a service that will deliver a full genome for $1,000 or less.[69]

In September 2009, the President of Halcyon Molecular announced that they will be able to provide full genome sequencing in under 10 minutes for less than $100 per genome.[70] This is, to date, the most ambitious promise of any full genome sequencing company.

In October 2009, IBM announced that they were also in the heated race to provide full genome sequencing for under $1,000, with their ultimate goal being able to provide their service for $100 per genome.[71] IBM's full genome sequencing technology, which uses nanopores, is known as the "DNA Transistor."[72]

In November 2009, Complete Genomics announced that they are now able to sequence a full genome for $1,700.[73] If true, this would mean the cost of full genome sequencing has fallen steeply within just a single year, from around $100,000 to $50,000 and now to $1,700. However, Complete Genomics has previously released statements that it was unable to follow through on. For example, the company stated it would officially launch and release its service during the "summer of 2009," provide a "$5,000" full genome sequencing service by the "summer of 2009," and that it would "sequence 1,000 genomes between June 2009 and the end of 2009" - all of which, as of November 2009, have not yet occurred.[52][57][59]

In March 2010, Pacific Biosciences said they had raised more than $256 million in venture capital and that they would be shipping their first 10 full genome sequencing machines by the end of 2010. The company reported that the market initially will be researchers and academic institutions and then will rapidly turn into clinical applications that will be applicable to every single person in the world. Pacific Biosciences also stated that their second generation machine, which is scheduled for release in 2015, will be capable of providing a full genome sequence for a person in just 15 minutes for less than $100. Therefore, within five years we may see full genome sequencing revolutionize medicine by providing clinicians with a full genome for each of their patients. However, the medical community has shown some push-back to this, stating that even if they are supplied with a full genome sequence of a patient, they wouldn't know how to analyze or make use of that data.[74]

In June 2010, Illumina lowered the cost of its individual sequencing service to $19,500 from $48,000. The company is offering a discounted price of $9,500 for people with serious medical conditions who could potentially benefit from having their genomes decoded.

Helicos Biosciences, Pacific Biosciences, Complete Genomics, Illumina, Sequenom, ION Torrent Systems, Halcyon Molecular, IBM, and GE Global appear to all be going head to head in the race to commercialize full genome sequencing.

Disruptive technology

Full genome sequencing provides information on a genome that is orders of magnitude larger than that provided by the current leader in sequencing technology, DNA arrays. For humans, DNA arrays currently provide genotypic information on up to one million genetic variants,[75][76][77] while full genome sequencing will provide information on all six billion bases in the human genome, or 3,000 times more data. Because of this, full genome sequencing is considered disruptive to the DNA array markets, as the accuracy of both ranges from 99.98% to 99.999% (in non-repetitive DNA regions) and its cost of $5000 per 6 billion base pairs is competitive (for some applications) with DNA arrays ($500 per 1 million basepairs).[40] Agilent, another established DNA array manufacturer, is working on targeted (selective region) genome sequencing technologies.[78] It is thought that Affymetrix, the pioneer of array technology in the 1990s, has fallen behind due to significant corporate and stock turbulence and is currently not working on any known full genome sequencing approach.[79][80][81] It is unknown what will happen to the DNA array market once full genome sequencing becomes commercially widespread, especially as companies and laboratories providing this disruptive technology start to realize economies of scale. It is postulated, however, that this new technology may significantly diminish the total market size for arrays and any other sequencing technology once it becomes commonplace for individuals and newborns to have their full genomes sequenced.[82]
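The price comparison above can be restated as a cost per base pair. The following minimal sketch in Python uses only the two price points quoted in the text; the function and variable names are illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def cost_per_base(price_usd, bases):
    """Price in US dollars per base pair, as an exact fraction."""
    return Fraction(price_usd, bases)


# Price points quoted in the text.
sequencing = cost_per_base(5000, 6_000_000_000)  # full genome sequencing
array = cost_per_base(500, 1_000_000)            # DNA array

print(float(sequencing))   # about 8.3e-07 dollars per base pair
print(float(array))        # 0.0005 dollars per base pair
print(array / sequencing)  # 600
```

On these figures an array costs less in total, but each base pair it reports costs several hundred times more than one obtained by full genome sequencing.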

Sequencing versus Analysis

Full genome sequencing provides raw data on all six billion letters in an individual's DNA. However, it does not provide an analysis of what that data means or how that data can be utilized in various clinical applications, such as in medicine to help prevent disease. As of now, the companies that are working on providing full genome sequencing do not provide clinical analytical services for the interpretation of the raw genetic data. Therefore, in order for this data to be useful, researchers or companies first need to find a way to analyze it on a clinical level and make it useful to physicians and patients.[74]

Societal impact

Inexpensive, time-efficient full genome sequencing will be a major accomplishment not only for the field of Genomics, but for the entire human civilization because, for the first time, individuals will be able to have their entire genome sequenced. Utilizing this information, it is speculated that health care professionals, such as physicians and genetic counselors, will eventually be able to use genomic information to predict what diseases a person may get in the future and attempt to either minimize the impact of that disease or avoid it altogether through the implementation of personalized, preventive medicine. Full genome sequencing will allow health care professionals to analyze the entire human genome of an individual and therefore detect all disease-related genetic variants, regardless of the genetic variant's prevalence or frequency. This will enable the rapidly emerging medical fields of Predictive Medicine and Personalized Medicine and will mark a significant leap forward for the clinical genetic revolution. Full genome sequencing is clearly of great importance for research into the basis of genetic disease. However, it should be recognized that despite advancements in genome sequencing technology, incomplete understanding of the significance of individual variants or combinations of variants will limit the widespread usefulness of full genome sequencing in medicine until its clinical utility can be demonstrated.

Illumina's CEO, Jay Flatley, stated in February 2009 that "A complete DNA read-out for every newborn will be technically feasible and affordable in less than five years, promising a revolution in healthcare" and that "by 2019 it will have become routine to map infants' genes when they are born."[83] This potential use of genome sequencing is highly controversial, as it runs counter to ethical norms for predictive genetic testing of asymptomatic minors that are well established in the fields of medical genetics and genetic counseling.[84][85][86][87] The traditional guidelines for genetic testing have been developed over the course of several decades since it first became possible to test for genetic markers associated with disease, prior to the advent of cost-effective, comprehensive genetic screening. It is established that norms, such as in the sciences and the field of genetics, are subject to change and evolve over time.[88][89] It is unknown whether traditional norms practiced in medical genetics today will be altered by new technological advancements such as full genome sequencing.

Today, parents have the legal authority to obtain testing of any kind for their children. Currently available newborn screening for childhood diseases allows detection of rare disorders that can be prevented or better treated by early detection and intervention. Specific genetic tests are also available to determine an etiology when a child's symptoms appear to have a genetic basis. Full genome sequencing, however, reveals a large amount of information (such as carrier status for autosomal recessive disorders, genetic risk factors for complex adult-onset diseases, and other predictive medical and non-medical information) that is currently not completely understood, not clinically useful during childhood, and may not necessarily be wanted by the individual upon reaching adulthood. Despite the theoretical (and currently unproven) benefits of predicting disease risk in childhood, genetic testing also introduces potential harms (such as discovery of non-paternity, genetic discrimination, and psychological impacts). The established ethical guidelines for predictive genetic testing of asymptomatic minors thus have more to do with protecting this vulnerable population and preserving the individual's privacy and autonomy to know or not to know their genetic information than with the technology that makes this possible. While parents may have legal authority to obtain such testing, the mainstream opinion of professional medical genetics societies is that presymptomatic testing should be offered to minors only when they are competent to understand the relevance of genetic screening, so as to allow them to participate in the decision about whether or not it is appropriate for them.