The National Institutes of Health (NIH) Genomic Data Sharing Policy expects that genomic research data from NIH-supported studies involving human specimens as well as non-human and model organisms will be submitted to an NIH-designated data repository. The list below provides examples of relevant databases.
Examples of NIH Data Repositories, NIH-Funded Databases, and NIH Database Collaborations
ArrayExpress: an NIH-funded database at the European Molecular Biology Laboratory-European Bioinformatics Institute that collects and disseminates microarray-based gene-expression data.
DNA Data Bank of Japan (DDBJ): a data bank organized by the National Institute of Genetics in Japan that collects sequence data. As a member of the International Nucleotide Sequence Database Collaboration, DDBJ exchanges data with GenBank at the NIH National Center for Biotechnology Information and the European Nucleotide Archive at the European Molecular Biology Laboratory-European Bioinformatics Institute.
Database of Genotypes and Phenotypes (dbGaP): an NIH database at the National Center for Biotechnology Information originally designed to archive and distribute coded genotype, phenotype, exposure, and pedigree data from genome-wide association studies. dbGaP now accepts additional types of data such as copy number variants and large-scale sequencing.
Database of Short Genetic Variations (dbSNP): an NIH database at the National Center for Biotechnology Information that includes single nucleotide variations, microsatellites, and small-scale insertions and deletions. dbSNP provides population-specific frequency and genotype data, experimental conditions, molecular context, and mapping information for both neutral variations and clinical mutations.
Database of Genomic Structural Variation (dbVar): an NIH database at the National Center for Biotechnology Information for large-scale structural genomic variations--such as insertions, deletions, translocations, and inversions--and associated phenotype information. dbVar accepts germline and somatic human structural variant data as well as data from a diverse array of organisms, including agriculturally important plants and livestock.
European Nucleotide Archive (ENA): a database at the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) that collects, maintains, and presents comprehensive sequencing information--including raw sequencing data, sequence assembly information, and functional annotation--as part of the permanent public scientific record. As a member of the International Nucleotide Sequence Database Collaboration, EMBL-EBI exchanges data with GenBank at the NIH National Center for Biotechnology Information and the DNA Data Bank of Japan.
FlyBase: an NIH-funded database for genetic and genomic information on the fruit fly Drosophila melanogaster and related fly species. It includes referenced sequence genomes, phenotypic and gene expression data, chromosome maps, and additional resources.
GenBank: an NIH genetic sequence database at the National Center for Biotechnology Information (NCBI) that provides an annotated collection of publicly available DNA sequences. As a member of the International Nucleotide Sequence Database Collaboration, NCBI exchanges GenBank data with the European Nucleotide Archive at the European Molecular Biology Laboratory-European Bioinformatics Institute and the DNA Data Bank of Japan.
Gene Expression Omnibus (GEO): an NIH data repository that archives and distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data.
Influenza Research Database (IRD): an NIH-funded database that provides genomic and proteomic data for influenza viruses as well as surveillance data and phenotypic characteristics of viruses isolated from extracts.
Mouse Genome Informatics (MGI): an NIH-funded international database for the laboratory mouse Mus musculus that provides data on gene characterization, allelic variants, gene expression, mouse tumor biology, strain-specific phenotypes and genotypes, and mammalian orthology.
Rat Genome Database (RGD): an NIH-funded database that serves as a repository of genetic and genomic data from the laboratory rat Rattus norvegicus and also provides curation of mapped positions for quantitative trait loci, known mutations, and other phenotypic data.
Sequence Read Archive (SRA): NIH's primary archive of high-throughput sequencing data at the National Center for Biotechnology Information (NCBI). SRA stores raw sequencing data as well as alignment information in the form of read placements on a reference sequence. As a member of the International Nucleotide Sequence Database Collaboration, NCBI exchanges SRA data with the European Nucleotide Archive at the European Molecular Biology Laboratory-European Bioinformatics Institute and the DNA Data Bank of Japan.
WormBase: an NIH-funded international consortium that provides accurate, current, accessible information concerning the genetics, genomics, and biology of Caenorhabditis elegans and related nematodes.
Xenbase: an NIH-funded database that serves as a biology and genomics resource for research on the African frog species Xenopus laevis and Xenopus tropicalis.
Zebrafish Information Network (ZFIN): an NIH-funded database that collects, curates, and disseminates genetic, genomic, phenotypic, and developmental data about the zebrafish Danio rerio. Data represented in ZFIN are derived from three primary sources: curation of zebrafish publications, individual research laboratories, and collaborations with bioinformatics organizations.
Data Repositories Established as NIH Trusted Partners
The National Institutes of Health (NIH) promotes data sharing as an essential element in translating research results into knowledge, products, and procedures that improve human health. To achieve this goal, NIH has created a central repository model for data storage and distribution through the Database of Genotypes and Phenotypes (dbGaP). However, because the increasing volume and complexity of the data call for innovative solutions for storing and presenting them, NIH is exploring new models for data management resources, including structured partnerships with external organizations or "trusted partners."
A "trusted partner" is defined as a public or private, national or international organization that is able to meet core NIH standards for establishing data quality and data management service protocols for NIH, based on the programmatic need of an NIH funding Institute or Center (IC). A trusted partnership can be established only through a contract mechanism between an NIH funding IC and the trusted partner organization. Contracts are awarded through an IC's standard acquisition and negotiation processes. NIH funding ICs that are interested in submitting an application to establish a trusted partnership should contact GDS staff at:
NIH Established Trusted Partners
Cancer Genomics Hub (CGHub): CGHub stores, catalogs, and facilitates research using cancer genome sequences, alignments, and mutation information from The Cancer Genome Atlas (TCGA) consortium and related projects.
NCI Genomic Data Commons: The mission of the National Cancer Institute (NCI) Genomic Data Commons (GDC) is to provide the research community with a unified repository of cancer genomics data and associated clinical information.
NCI Cancer Genomics Cloud Pilots
The National Cancer Institute (NCI) funded three Cancer Genomics Cloud Pilot contracts with the primary objective of fostering innovative solutions that co-locate data from The Cancer Genome Atlas (TCGA) with computational resources. Co-location would give authorized users access to the data and tools even if they lack the resources to download the entire TCGA dataset.
Broad Institute FireCloud: Broad Institute's FireCloud is a cancer genome analysis platform with co-located TCGA data as well as other public datasets including 1000 Genomes, Cancer Cell Line Encyclopedia (CCLE), and Genotype-Tissue Expression (GTEx). FireCloud will securely track and manage data, metadata, tools, job execution and results and will capture provenance for each run.
Institute for Systems Biology (ISB) Cancer Genomics Cloud: ISB's Cancer Genomics Cloud will host the TCGA data in Google Cloud Storage and BigQuery tables and will provide end-users with tools and services ranging from web-based interactive exploration to cloud-based instances of RStudio and IPython and the means to run Docker containers and pipelines on virtual machines hosted on Google's infrastructure. Cancer researchers will be able to analyze TCGA data in conjunction with their own private data or with other publicly available datasets.
Seven Bridges Genomics Cancer Genomics Cloud: The Cancer Genomics Cloud powered by Seven Bridges allows users to analyze their data alongside data from The Cancer Genome Atlas (TCGA) using custom tools and pipelines or community-contributed apps.
Bionimbus: Bionimbus is a collaboration between the Institute for Genomics and Systems Biology (IGSB) at the University of Chicago and the Open Science Data Cloud to develop open source technology for managing, analyzing, transporting, and sharing large NCI-funded cancer genomics datasets in a secure and compliant fashion.