The human genome sequence, first published in 2001, has some important information missing. The latest version of it, called GRCh38, has a monstrous 3.1 gigabases of information, but that's still not enough. A letter published in Nature Genetics this week finds that the reference genome is missing a colossal 10 percent of the genetic information found in the genomes of hundreds of people with African ancestry, some of which also appears in other human populations.
Get the reference
The "human genome" is in fact assembled from the genomes of just a handful of people, with the majority of GRCh38 coming from just one person. It's not a snapshot of what's in human DNA so much as a kind of template and roadmap, giving a sense of what's in there and allowing comparisons between individuals and the "reference genome."
We've known this is a limitation and have been making constant additions to the reference genome, which has improved its ability to represent the huge range of variation that's present in modern humans. But because its source is so limited, write the authors of this week's letter, so is its usefulness: "In recent years, a growing number of researchers have emphasized the importance of capturing and representing sequencing data from diverse populations."
The current situation, they write, makes it tricky to analyze people whose ancestry is very different from that of the reference genome. Although there are some methods that allow researchers to look at limited amounts of genetic diversity alongside the reference, a more comprehensive solution that's been gaining traction has been to build population-specific references. That work is already underway for certain groups, including Chinese and Ashkenazi populations.
The genome of all humans
There is no "pan-genome"—no "collection of sequences representing all of the DNA in [a] population," write lead author Rachel Sherman and her colleagues. It's been done for bacteria, but not for humans. So they set out to create a pan-genome for Africa, using DNA from 910 people of African descent. The group includes people from the Caribbean and the US, who retain some of Africa's genetic diversity, even though they have their own distinct genetic history.
They compared the DNA from these hundreds of people to the reference genome, looking for long sections that didn't match. The basic unit of DNA is the base pair, one of the rungs on the twisted ladder that makes up the double helix. Sherman and her colleagues looked for sequences more than 1,000 base pairs long that didn't match the reference and found a lot of them: nearly 300 million base pairs, which is about 10 percent of the size of the entire reference genome.
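The arithmetic behind that 10 percent figure is easy to check. The snippet below is only an illustrative sketch, not the authors' pipeline: it assumes the non-matching sequences have already been found and are available as a list of lengths, and the function names and toy values are made up for the example. The only numbers taken from the article are the 1,000-base-pair cutoff, the roughly 300 million novel base pairs, and the 3.1-gigabase reference size, all of which are rounded.

```python
# Figures quoted in the article; both are rounded.
REFERENCE_SIZE_BP = 3_100_000_000   # GRCh38, "3.1 gigabases"
MIN_LENGTH_BP = 1_000               # only sequences longer than this were counted


def total_novel_bp(contig_lengths, min_length=MIN_LENGTH_BP):
    """Sum the lengths of non-matching sequences longer than min_length."""
    return sum(length for length in contig_lengths if length > min_length)


def fraction_of_reference(novel_bp, reference_bp=REFERENCE_SIZE_BP):
    """Express an amount of novel sequence as a fraction of the reference."""
    return novel_bp / reference_bp


# Hypothetical toy input: lengths (in base pairs) of sequences that did not match.
toy_lengths = [250, 1_000, 1_500, 12_000, 800, 4_200]
toy_total = total_novel_bp(toy_lengths)  # 1500 + 12000 + 4200 = 17700

# The article's headline figure: "nearly 300 million" novel base pairs.
headline_share = fraction_of_reference(300_000_000)
print(f"{headline_share:.1%}")  # 9.7%
```

With the rounded inputs, the share comes out just under 10 percent, consistent with the "about 10 percent" described above.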
That's not to say this information is unique to African people: about 40 percent of this data matched either the Korean or Chinese genomes. This suggests that it's important genetic material that's present across a huge range of humans, but still not captured by the reference genome assembled from just a small number of people. There's a lot going on with humans that isn't reflected by the human reference genome.
Medical consequences and cautions
Any research efforts that lean on the reference genome to study human variation will be missing out on this huge amount of data—and this is what "nearly all studies do at present," write Sherman and colleagues. "A single reference genome is not adequate for population-based studies of human genetics," they add, suggesting that a way forward is to create reference genomes for different human groups. Over time, this will lead to a pan-genome "capturing all of the DNA present in humans."
This has important consequences for medicine. "If you are a scientist looking for genome variations linked to a condition that is more prevalent in a certain population, you'd want to compare the genomes to a reference genome more representative of that population," says Rachel Sherman.
But having this information for Africans now doesn't tell us much that a scientist researching a given condition would be able to use. The study didn't explore what any of the DNA missing from the reference genome actually does, so it can't say anything about whether that DNA might play a role in health conditions or any other variation.
While population-specific genomes might be a useful way forward for studying human variation, they could hit a different set of difficulties when they leave the lab and run into the real world. As demonstrated by the fact that a lot of this DNA also shows up in Koreans, population groupings among humans don't follow neat lines, especially at the DNA level. They have fuzzy boundaries; individuals can have genetic traits from multiple populations; and how someone looks isn't a reliable guide to their DNA.
Complicating matters further, there are multiple populations within Africa that may have distinct genetic histories that we're just now scratching the surface of. Having a pan-African genome won't necessarily tell us a lot about what's distinctive for any individual African.
Although genetics researchers understand all of this, any use of population-specific reference genomes in fields such as medicine could come with a new swathe of problems if this messiness isn't communicated or understood well.