Eukaryotic genomes are mostly non-coding sequence
In contrast to prokaryotic genomes, which have a relatively high gene density with little interspersed DNA, eukaryotic genomes have a much lower gene density. For example, depending on the strain, an E. coli genome may have about 5,000 genes spread over 5,000,000 base pairs, or about one gene per 1,000 base pairs. The human genome, with about 20,000 genes in about 3 billion base pairs, has only about one gene per 150,000 base pairs! Less than 2% of the human genome is protein-coding sequence.
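The gene density comparison is simple division, and a short Python sketch makes it easy to repeat for other genomes. The function name and the dictionary below are illustrative choices, and the genome sizes and gene counts are the rounded figures quoted above, not precise measurements.

```python
def base_pairs_per_gene(genome_size_bp: int, gene_count: int) -> float:
    """Return the average number of base pairs per gene."""
    return genome_size_bp / gene_count


# Rounded figures from the text: (genome size in base pairs, number of genes)
genomes = {
    "E. coli": (5_000_000, 5_000),
    "human": (3_000_000_000, 20_000),
}

for name, (size_bp, genes) in genomes.items():
    spacing = base_pairs_per_gene(size_bp, genes)
    print(f"{name}: about one gene per {spacing:,.0f} base pairs")
```

With these rounded inputs, the human genome comes out roughly 150 times less gene-dense than the E. coli genome.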
So what’s in that remaining 98%?
Some of the 98% is regulatory sequence, used in the control of cellular processes like replication, transcription, and translation. But about 50% of the human genome is repetitive DNA, much of which has no known function. Other organisms can have even more: for example, about 80% of the maize (corn) genome is repetitive DNA. Repetitive elements can be clustered together in the genome as tandem repeats, or they can be interspersed throughout the genome.
About 13% of the human genome is made up of short interspersed nuclear elements, or SINEs. SINEs are about 100-400 base pairs long. There are about 1.8 million SINEs in the human genome! This includes over a million copies of the most common SINE, the 300 base pair Alu sequence [4]. An additional 20% of the human genome is made up of LINEs, or long interspersed nuclear elements. The most common human LINE is LINE1, which is about 6,000 base pairs long when full-length and is repeated about 500,000 times in the genome, although most of those copies are truncated.
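As a rough consistency check on these figures, the sketch below converts genome percentages and copy numbers into an implied average element length. It assumes a 3 billion base pair genome, as quoted earlier, and the function names are illustrative. It also assumes the 6,000 base pair figure describes a full-length LINE1 rather than a typical copy.

```python
GENOME_SIZE_BP = 3_000_000_000  # rounded human genome size used in the text


def implied_mean_length(genome_fraction: float, copy_count: int,
                        genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Average element length implied by a genome fraction and a copy count."""
    return genome_fraction * genome_size_bp / copy_count


def fraction_if_full_length(copy_count: int, element_length_bp: int,
                            genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Genome fraction the copies would occupy if every copy were full-length."""
    return copy_count * element_length_bp / genome_size_bp


# SINEs: 13% of the genome spread over about 1.8 million copies
sine_mean = implied_mean_length(0.13, 1_800_000)
print(f"Implied mean SINE length: about {sine_mean:.0f} base pairs")

# LINE1: 500,000 copies, each treated (unrealistically) as 6,000 base pairs
line1_fraction = fraction_if_full_length(500_000, 6_000)
print(f"LINE1 share if every copy were full-length: {line1_fraction:.0%}")
```

The implied mean SINE length falls inside the 100-400 base pair range given above. For LINE1, treating every copy as full-length would account for the entire genome, which cannot be reconciled with LINEs making up about 20% of it, so the average LINE1 copy must be considerably shorter than 6,000 base pairs.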
Both SINEs and LINEs are ancient remnants of transposons. Transposons are elements that can move, or “jump”, to new locations in the genome. Although this sounds like science fiction, these “jumping genes” likely played a big role in evolution. Most SINEs and LINEs remaining in the genomes of humans and other organisms are no longer mobile, having accumulated enough mutations over evolutionary time that they no longer retain the sequence information necessary for movement. But some still do!
SINEs and LINEs, while not encoding protein themselves, may nevertheless affect the regulation of nearby genes. Although the repetitive DNA in eukaryotic genomes was once thought of as “junk” DNA, there is increasing evidence that these sequences can, in fact, affect gene expression, cell function, and organismal phenotypes.