Base pairs and beyond: A guide to your genome and genetic testing

Imagine your genome as a vast landscape scattered with millions of diamonds waiting to be mined. These metaphorical diamonds are the genetic data that can help you be healthier, smarter, and even more connected. But the diamonds in the DNA are unknown to many people, who never thought about tapping into their own genome even when the findings would be of most value to them. You, on the other hand, may be interested but have little knowledge of the risks, rewards, and practicalities.

The “prospecting” of your genome through a genetic test is an investment in self-discovery that requires a little research. Not all genetic tests are the same, so they are not all equally good investments. Like real diamonds, the insights from your DNA need the right technology to extract them. Genetic tests use different technologies to survey your genome and to process the tiny percentage of sites that yield valuable information. Your results can vary in quality and quantity. In fact, some genetic tests are extremely limited or even worthless. Behind those are direct-to-consumer (DTC) companies that capitalize on naive consumers by offering no true bridge to understanding, just a bridge that they have managed to sell you.

The opposite of the risk of getting no real knowledge is the risk of getting too much. Not everyone wants to learn their genetic predisposition for a disease, especially if it is incurable. Depending on whether you believe ignorance is bliss, the findings excavated from your genome might not resemble diamonds after all. Thus, before venturing into your genome, you might want to prepare by learning about your investment as well as your investment profile.

Most people could benefit in some way from knowing their genetic data, but as with real investments, how much you should expect to benefit also depends on your age and circumstances. The more time you have to make use of the information, the more value you get. It is with this reasoning that researchers have proposed we explore the benefits of genomic sequencing starting at birth. As an adult, you would derive greater benefits in certain contexts. If you are likely to become sick (as indicated by family history), a genetic screen can help you prevent or prepare for the disease. If you are already sick, genetic data allows a more effective personalized medicine approach. Both of these health contexts are preferentially covered by clinical genetic testing under professional oversight.

Compared to the more limited scope of clinical genetic testing, direct-to-consumer genetic testing extends to areas like ancestry, lifestyle, fertility and more. While much can be gained by being an explorer, non-medical DTC genetic tests operate in unregulated territories where there are greater risks for those who are easily misled.

Genomic basics

Nearly everyone knows what DNA is, but far fewer people are “genomically literate”.1 2 Genomic literacy is like having good reading comprehension as opposed to simply knowing your ABCs (or in this case, ATCGs), and it means having a functional understanding of the genome. That is not to say that we have a complete understanding of how the genome works. There is a lot to analyze: the human genome is approximately 3 billion base pairs long. Printed out, your genome would cover 262,000 pages or a total of 175 books, and even some of the most pioneering researchers comprehend only four of them.3 Besides the information encoded in the DNA sequence, your genome has a 3D structure that can add another level of meaning. DNA is packaged into chromosomes whose arrangements can change dynamically to regulate gene expression,4 and we are still trying to understand how this works.

Since you are a diploid organism, your somatic cells (cells that are not sperm or egg) contain 6 billion base pairs packaged into 46 chromosomes. Having two sets of chromosomes confers certain advantages. Aside from allowing sexual reproduction, it’s like having a built-in backup for the approximately 20,000 genes that you inherit. If there is a bad copy from your dad’s side, you can rely on the good one from your mom’s. However, there are at least several hundred “haploinsufficient” genes where you can’t get away with having only one working copy.5 Haploinsufficient genes require the double dose for you to be alive and normally functioning (perhaps like that double shot of espresso in the morning).

It may be more surprising that you carry around 6 billion base pairs when just 1% of them actually count as genes.6 Only this tiny percentage of the genome contains exons, the meaningful DNA segments that code for parts of proteins. One gene can give rise to more than one protein through different combinations of exons. Exons collectively make up the exome, the most important sector of the genome.

Reading the genome

The technologies used for reading your genome range enormously in resolution. The most complete and laborious option is whole genome sequencing of your 6 billion base pairs. However, similar to how you wouldn’t want to read hundreds of books, especially if you couldn’t understand most of them, most genetic tests are limited to specific parts of the genome. They might cover the exome, a handful of genes, or only single-base differences through a much spottier method called genotyping. But some tests obtain very low resolution data and provide low-quality interpretations while claiming that their results are “comprehensive”, much like writing a book report after reading only a few words. The price of such services is extremely disproportionate to their value, but you wouldn’t know it unless you understood the technologies.

Whole genome sequencing

If you want it all, then whole genome sequencing is for you. But it isn’t fast or cheap. Even with current “next-generation sequencing” (NGS) technology, reading the entire genome can take months and cost at least a thousand dollars.

What exactly is next-generation sequencing? It is probably best characterized by the technology developed by Illumina (Illumina is to next-generation sequencing what Intel is to processors) that allows you to read billions of DNA molecules not visible under a regular microscope.

NGS: The chemistry

The genome isn’t actually read in one long continuous scan; fragmenting DNA into random short lengths that can be processed in parallel is immensely faster.

The sequence of each length can be revealed by building the complementary one, which is why Illumina NGS is also known as sequencing by synthesis. Because there are chemistry-based limitations to how long a DNA strand you can grow, the maximum “read length” is actually quite short, about 150 base pairs.

Each fragment is cloned into a cluster before the sequencing starts, like a garden patch strewn with identical seeds. Then the DNA is grown base by base, and because the information about each newly added base comes through a chorus of identical copies, the signal remains clear despite the small number of mismatch errors that occur naturally. Each type of base carries a chemical tag that emits light of a specific color, which makes the bases distinguishable. Just as you can see the light of stars from vast distances, the sequencer machine can recognize the bases by their fluorescence.
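
The “chorus” idea can be illustrated with a toy sketch. This is not how a real sequencer's software works; it only assumes that every copy in a cluster reports one base per cycle and that the most common report wins. The function names and the example sequences are made up for illustration, and strand direction is ignored for simplicity.

```python
from collections import Counter

# Watson-Crick pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(template: str) -> str:
    """Build the strand that pairs with the template, one base at a time."""
    return "".join(COMPLEMENT[base] for base in template)


def consensus_call(cluster_reads: list[str]) -> str:
    """Call each position by majority vote across the copies in one cluster."""
    calls = []
    for column in zip(*cluster_reads):
        base, _count = Counter(column).most_common(1)[0]
        calls.append(base)
    return "".join(calls)


# Hypothetical 8-base fragment and the strand synthesized against it.
template = "ATGCGTAC"
synthesized = complement_strand(template)

# Ten copies in the cluster; two of them each picked up one mismatch.
cluster = [synthesized] * 8 + ["TACGAATG", "TTCGCATG"]
consensus = consensus_call(cluster)
```

Because eight of the ten copies agree at every position, the two stray errors are outvoted and the consensus matches the error-free strand.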

NGS: The assembly

The billions of fragments must be pieced together to form the full genome sequence, a demanding computational process.

Since the reconstruction requires overlapping fragments, random fragments are generated from more than one copy of the genome to avoid gaps and to provide higher statistical confidence. The sequences are then aligned to their positions in the reference genome. Comparison of the new sequence against the reference genome reveals variations such as single-nucleotide polymorphisms (SNPs), large insertions or deletions, and more complex changes. Multiple fragments contain the same base; the average number of times that a base has been read is called coverage. What you get with whole genome sequencing is quantity, but you also need sufficient quality, which means a standard coverage of 30x; higher quality “deep sequencing” means coverage of at least 60x.[^Illumina Sequencing Coverage] [^Broad Institute Whole Genome Sequencing]
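
As a rough worked example, average coverage can be estimated as the number of reads times the read length, divided by the genome size. The sketch below assumes the round figures used in this guide (a 3 billion base pair reference and 150 base pair reads) and ignores real-world losses such as duplicate or unmappable reads. The function names are illustrative only.

```python
import math

GENOME_SIZE_BP = 3_000_000_000  # approximate reference genome size quoted in this guide
READ_LENGTH_BP = 150            # approximate maximum read length quoted in this guide


def mean_coverage(num_reads: int,
                  read_length: int = READ_LENGTH_BP,
                  genome_size: int = GENOME_SIZE_BP) -> float:
    """Average number of times each base is read."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float,
                 read_length: int = READ_LENGTH_BP,
                 genome_size: int = GENOME_SIZE_BP) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


reads_30x = reads_needed(30)  # standard coverage
reads_60x = reads_needed(60)  # "deep sequencing"
```

Under these simplified assumptions, doubling the coverage from 30x to 60x doubles the number of reads that have to be generated and assembled.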

While whole genome sequencing costs a significant amount of money for the average person, looking at value is better than looking at price. Indeed, if all its benefits could be realized, whole genome sequencing would be an outstanding value. But we can’t use all the data in reality. The genome is only as valuable as the actionable insights that can be mined from it, and we have not yet made half of those discoveries. Advocates of whole genome sequencing argue that it has to be done only once, and the benefits will surely accrue as we learn more through research. Ultimately, the best value would be achieved when the price of WGS is even cheaper and our knowledge base has expanded to greater utility. The price will likely fall further given that the cost has plummeted from $14 million in 2006.7

Nevertheless, whole genome sequencing might be more philosophically appealing to you if you are an early adopter. As with other technologies, early adopters may not benefit the most, but they help improve the product for others.8 The additional value of being an early adopter, i.e. sequencing your whole genome now, is being able to contribute your data to research that would generate new discoveries and grow the knowledge base for all. This is the idea behind Veritas Genetics and the Personal Genome Project.

Whole exome sequencing

The exome has been the most intensely studied part of the genome. Variations in exons can have a direct effect on protein function. Given the estimate that 85% of all disease-causing variations are found in the exome,9 doing whole genome sequencing to capture the remaining 15% would appear to be a classic example of diminishing returns.

So would whole exome sequencing be the best value? Despite using the same next-generation technology, the market price of the exome isn’t necessarily proportional to its tiny percentage of the genome. The company Genos offers a kind of premium deep exome sequencing at $499, half the price of whole genome sequencing. Thus whole genome sequencing is more economical in the long run, as you can expect to get a substantial bulk discount.

Targeted gene sequencing

Sequencing of one specific gene or a handful of genes is usually done to identify a suspected disease mutation, screen for cancer, or screen for other inherited diseases. The most well-known example is probably BRCA1 and BRCA2 gene sequencing for breast cancer. Many companies develop their own test panels, which are curated sets of genes that are supposed to be highly informative for a particular health condition or trait. The selected gene(s) can be read with targeted next-generation sequencing in an “a la carte” approach to the genome.

For the majority of common diseases, such as type 2 diabetes and most types of cancer, one or a handful of genes will not give a good estimate of your risk. Complex traits or health conditions can involve hundreds or thousands of genes and depend heavily on non-genetic factors.

Sanger sequencing

Sanger sequencing is the gold-standard but “old-school” way of sequencing DNA, invented in the 1970s. Using Sanger sequencing for whole-genome projects would be too costly, but the method is commonly used for smaller projects. Some companies use Sanger sequencing when necessary to validate next-generation sequencing results. Highly repetitive regions of DNA (e.g. STRs) are problematic for NGS and would need to be sequenced the Sanger way.

Genotyping

Single-Nucleotide Polymorphisms (SNPs)

If you were to compare your genome with another person’s, you would find single-letter differences scattered about once every 300 bases on average.10 These single nucleotide polymorphisms, or SNPs, account for about 90% of the variation in your genome.11 While most SNPs are meaningless differences, some can be directly or indirectly linked to a disease or trait. A SNP in a protein-coding region can result in an amino acid substitution that affects the final protein structure and function. A SNP outside of a protein-coding region but in a regulatory region can alter protein activity. A SNP that has neither kind of causal relationship can still be associated with a condition simply because it is close to a causal element.
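
To make the idea of a single-letter difference concrete, here is a minimal sketch that compares two already-aligned sequences of equal length and lists the positions where they differ. The two short sequences are invented for illustration; real variant calling works on aligned sequencing reads and is far more involved.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference base, sample base) for every mismatch.

    Positions are 1-based. Both sequences must already be aligned
    and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base)
        in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Two made-up 15-base sequences that differ at two positions.
reference = "ATGGCCATTGTAATG"
sample = "ATGGCTATTGTAGTG"
snps = find_snps(reference, sample)
```

Here the comparison reports a C-to-T difference at position 6 and an A-to-G difference at position 13.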

Genotyping gathers information on SNPs throughout the genome. Currently, 23andMe and most similar direct-to-consumer companies use genotyping. The 23andMe custom platform looks at more than 900,000 SNPs, which is still a mere 0.03% of the genome. But the subset of SNPs chosen by 23andMe consists of ones that have more biological significance according to previous genome-wide association studies (GWAS).
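
The 0.03% figure can be checked with simple arithmetic, assuming the genome size of roughly 3 billion base pairs quoted earlier in this guide and treating each SNP as one base position.

```python
SNPS_ON_ARRAY = 900_000         # SNPs examined by the array, as quoted above
GENOME_SIZE_BP = 3_000_000_000  # approximate genome size used in this guide

# Share of base positions that the array actually looks at, in percent.
percent_of_genome = SNPS_ON_ARRAY / GENOME_SIZE_BP * 100
```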

Genotyping uses a different technology than next-generation sequencing.

SNPs are read by a microarray, also referred to as a SNP array or SNP chip. The companies Affymetrix and Illumina provide the most popular SNP arrays. Although there are important differences between their technologies, both rely on complementary base pairing, using single-stranded DNA sequences 25 to 50 base pairs long to capture the complementary SNP-containing sequences from the sample DNA. (A friendly detailed guide for the Affymetrix GeneChip can be found here.)

Short Tandem Repeats (STRs)

Short Tandem Repeats (STRs), also called microsatellites, are a different form of variation than SNPs. Repetitive sequences of 2 to 6 base pairs form the STRs that appear at hundreds of thousands of places in the genome. Though STRs have a less sweeping presence in the genome than SNPs, they contribute substantially to variation and have been relevant in many research areas. In forensics, a subset of STRs is used for efficient DNA fingerprinting that leverages the high variability at each STR. In medical genetics, a well-known example of a disease-associated STR is the repeated “CAG” triplet in Huntington’s Disease.12 The third prominent application of STRs is genetic genealogy, which is where direct-to-consumer genetic tests come in.
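
An STR is measured by how many times its motif repeats back to back. The sketch below counts the longest uninterrupted run of a motif such as “CAG” in a made-up sequence; it is only a toy illustration, not a clinical or forensic method, and the function name is hypothetical.

```python
import re


def longest_repeat_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of motif in sequence."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(match.group(0)) // len(motif) for match in pattern.finditer(sequence)]
    return max(runs, default=0)


# Invented sequence: five consecutive CAG repeats, then an interrupted extra copy.
toy_sequence = "GCC" + "CAG" * 5 + "CAACAG" + "CCG"
cag_repeats = longest_repeat_run(toy_sequence, "CAG")
```

Two people can carry different repeat counts at the same STR location, which is the kind of variability that fingerprinting and genealogy tests rely on.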

Most DTC genetic ancestry services use SNP genotyping nowadays, but a few companies surprisingly still offer tests of about 20 to 100 STR markers at the same or higher price. Using 20 STRs for ancestry analysis is like using floppy disks for storage when a cheap 2GB flash drive is available. By 2003, researchers had realized that SNPs have more convenient properties that make them easier to analyze and would therefore make superior markers for ancestry research.13

Methylation sequencing

Like the medieval practice of putting chains on books, putting methyl groups on DNA usually makes the DNA less accessible and is a mechanism of altering gene expression. Both cytosine (C) and adenine (A) bases can be methylated, but cytosine methylation is more prominent and better studied in the human genome. Patterns of genome methylation are important to health and disease, so methylation sequencing can provide additional insight into the state of your genome.

There are few DTC genetic tests that look at methylation, but it is likely to become more common in the future.

Interpreting the genome

A typical genome encodes between 149 and 182 proteins that are truncated, incomplete versions.14 While that may sound alarming, most people are still relatively healthy because, as a diploid organism, you would normally need both copies to be incapacitated before you see a loss-of-function effect. Even then, truncated proteins can have less serious consequences than expected. Not all proteins are involved in critical processes. Moreover, our biological systems have many mechanisms, including built-in redundancy, to compensate for proteins that fall short both structurally and functionally. But you can see why interpretation of genetic variants can be less than straightforward. Most of genetics is complicated, like human relationships. The simplest stories are Mendelian disorders where one mutation guarantees a disease, yet even these can have exceptions, such as the rare individuals who escape devastating disease despite having the causal genetic mutations.

Research continually updates our knowledge of genetic variants. Findings have different levels of confidence, but this is not made clear by most direct-to-consumer services. Confidence can be generally higher in one area than another, depending on the underlying genetics and how much research has validated the data. For instance, one area that can be generally reported with greater confidence is pharmacogenomics, which describes how genetics influences response to various drugs (e.g. being a slow or fast metabolizer). Many global research groups exist to come up with high-confidence consensus interpretations for pharmacogenomics.15 Drug response genetics is also more easily predictable because it typically depends on fewer factors compared to other areas of genetics, where you are likely to encounter findings of low confidence without realizing it.

DTC services have their own processes for interpretation. How can you assess whether a DTC genetic service is delivering real value from your genome? The first indicator is how much genomic data was available. Most DTC services use SNP data, but it can be 50 SNPs versus nearly a million SNPs. Some services capture such a paltry portion of your genome that it would be hard to report credible results. Aside from that, the back-end database, methods and tools used for interpretation can vary. Most of the time you won’t know how things were done unless the company goes to the effort of publishing whitepapers on its methods. While it can be difficult to assess how credibly a service is doing the interpretation, overall the good ones demonstrate better transparency and science while others rely on hype and ignorant consumers.


Meta-genomics

(to be continued in the series)

A map of your mitochondrial DNA

Meet your microbiome