Science Daily
William Noble, professor of genome sciences and computer science, in the data center at the William H. Foege Building. Noble, an expert on machine learning, and his team designed artificial intelligence programs to analyze ENCODE data. These computer programs can learn from experience, recognize patterns, and organize information into categories understandable to scientists. The center houses systems for a wide variety of genetic research. The computer center has the capacity to store and analyze a tremendous amount of data, the equivalent of a 670-page autobiography of each person on earth, uncompressed. The computing resources analyze over 4 petabytes of genomic data a year. (Credit: Clare McLean, Courtesy of University of Washington)
The Human Genome Project produced an almost complete order of the 3 billion pairs of chemical letters in the DNA that embodies the human genetic code, but revealed little about the way this blueprint works. Now, after a multi-year concerted effort by more than 440 researchers in 32 labs around the world, a more dynamic picture gives the first holistic view of how the human genome actually does its job.
During the new study, researchers linked more than 80 percent of the human genome sequence to a specific biological function and mapped more than 4 million regulatory regions where proteins specifically interact with the DNA. These findings represent a significant advance in understanding the precise and complex controls over the expression of genetic information within a cell. The findings bring into much sharper focus the continually active genome in which proteins routinely turn genes on and off using sites that are sometimes at great distances from the genes themselves. They also identify where chemical modifications of DNA influence gene expression and where various functional forms of RNA, a form of nucleic acid related to DNA, help regulate the whole system.
"During the early debates about the Human Genome Project, researchers had predicted that only a few percent of the human genome sequence encoded proteins, the workhorses of the cell, and that the rest was junk. We now know that this conclusion was wrong," said Eric D. Green, M.D., Ph.D., director of the National Human Genome Research Institute (NHGRI), a part of the National Institutes of Health. "ENCODE has revealed that most of the human genome is involved in the complex molecular choreography required for converting genetic information into living cells and organisms."
NHGRI organized the research project producing these results; it is called the Encyclopedia of DNA Elements or ENCODE. Launched in 2003, ENCODE's goal of identifying all of the genome's functional elements seemed just as daunting as sequencing that first human genome. ENCODE was launched as a pilot project to develop the methods and strategies needed to produce results and did so by focusing on only 1 percent of the human genome. By 2007, NHGRI concluded that the technology had sufficiently evolved for a full-scale project, in which the institute invested approximately $123 million over five years. In addition, NHGRI devoted about $40 million to the ENCODE pilot project, plus approximately $125 million to ENCODE-related technology development and model organism research since 2003.
The scale of the effort has been remarkable. Hundreds of researchers across the United States, United Kingdom, Spain, Singapore and Japan performed more than 1,600 sets of experiments on 147 types of tissue with technologies standardized across the consortium. The experiments relied on innovative uses of next-generation DNA sequencing technologies, which had only become available around five years ago, due in large part to advances enabled by NHGRI's DNA sequencing technology development program. In total, ENCODE generated more than 15 trillion bytes of raw data and consumed the equivalent of more than 300 years of computer time to analyze.
"We've come a long way," said Ewan Birney, Ph.D., of the European Bioinformatics Institute, in the United Kingdom, and lead analysis coordinator for the ENCODE project. "By carefully piecing together a simply staggering variety of data, we've shown that the human genome is simply alive with switches, turning our genes on and off and controlling when and where proteins are produced. ENCODE has taken our knowledge of the genome to the next level, and all of that knowledge is being shared openly."
As soon as the resulting data sets were verified for accuracy, and prior to publication, the ENCODE Consortium placed them in several databases that can be freely accessed by anyone on the Internet. These data sets can be accessed through the ENCODE project portal (www.encodeproject.org) as well as at the University of California, Santa Cruz genome browser, http://genome.ucsc.edu/ENCODE/, the National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html, and the European Bioinformatics Institute, http://useast.ensembl.org/Homo_sapiens/encode.html?redirect=mirror;source=www.ensembl.org.
"The ENCODE catalog is like Google Maps for the human genome," said Elise Feingold, Ph.D., an NHGRI program director who helped start the ENCODE Project. "Simply by selecting the magnification in Google Maps, you can see countries, states, cities, streets, even individual intersections, and by selecting different features, you can get directions, see street names and photos, and get information about traffic and even weather. The ENCODE maps allow researchers to inspect the chromosomes, genes, functional elements and individual nucleotides in the human genome in much the same way."
The coordinated publication set includes one main integrative paper and five related papers in the journal Nature; 18 papers in Genome Research; and six papers in Genome Biology. The ENCODE data are so complex that the three journals have developed a pioneering way to present the information in an integrated form that they call threads.
"Because ENCODE has generated so much data, we, together with the ENCODE Consortium, have introduced a new way to enable researchers to navigate through the data," said Magdalena Skipper, Ph.D., senior editor at Nature, which produced the freely available publishing platform on the Internet.
Since the same topics were addressed in different ways in different papers, the new website, www.nature.com/encode, will allow anyone to follow a topic through all of the papers in the ENCODE publication set by clicking on the relevant thread at the Nature ENCODE explorer page. For example, thread number one compiles figures, tables, and text relevant to genetic variation and disease from several papers and displays them all on one page. ENCODE scientists believe this will illuminate many biological themes emerging from the analyses.
In addition to the threaded papers, six review articles are being published in the Journal of Biological Chemistry, along with two related papers in Science and one in Cell.
The ENCODE data are rapidly becoming a fundamental resource for researchers to help understand human biology and disease. More than 100 papers using ENCODE data have been published by investigators who were not part of the ENCODE Project, but who have used the data in disease research. For example, many regions of the human genome that do not contain protein-coding genes have been associated with disease. Instead, the disease-linked genetic changes appear to occur in vast tracts of sequence between genes where ENCODE has identified many regulatory sites. Further study will be needed to understand how specific variants in these genomic areas contribute to disease.
"We were surprised that disease-linked genetic variants are not in protein-coding regions," said Mike Pazin, Ph.D., an NHGRI program director working on ENCODE. "We expect to find that many genetic changes causing a disorder are within regulatory regions, or switches, that affect how much protein is produced or when the protein is produced, rather than affecting the structure of the protein itself. The medical condition will occur because the gene is aberrantly turned on or turned off or abnormal amounts of the protein are made. Far from being junk DNA, this regulatory DNA clearly makes important contributions to human health and disease."
Identifying regulatory regions will also help researchers explain why different types of cells have different properties. For example, why do muscle cells generate force while liver cells break down food? Scientists know that muscle cells turn on some genes that only work in muscle, but it has not been previously possible to examine the regulatory elements that control that process. ENCODE has laid a foundation for these kinds of studies by examining more than 140 of the hundreds of cell types found in the human body and identifying many of the cell type-specific control elements.
Despite the enormity of the dataset described in this historic collection of publications, it does not comprehensively describe all of the functional genomic elements in all of the different types of cells in the human body. NHGRI plans to invest in additional ENCODE-related research for at least another four years. During the next phase, ENCODE will increase the depth of the catalog with respect to the types of functional elements and cell types studied. It will also develop new tools for more sophisticated analyses of the data.