Data Access

Unless otherwise indicated, data generated by C-CoMP laboratories is stored in repositories that assign digital object identifiers (DOIs) to ensure persistent access to our products. All repositories (listed in the table below) provide immediate access to the broader community. Metabolite data will be deposited in either MetaboLights or the Metabolomics Workbench. Field data from the Bermuda Atlantic Time-series Study (BATS) is stored on the BATS FTP site and in the NSF-supported Biological & Chemical Oceanography Data Management Office (BCO-DMO) repository. Processed protein data is stored at BCO-DMO and accessed through the Ocean Protein Portal. Raw proteomics mass spectral data is submitted to ProteomeXchange via MassIVE or PRIDE. Sequence data is stored at the NCBI Sequence Read Archive (SRA), iMicrobe, and/or iVirus. We use Zenodo to share versioned, intermediate data products.

Raw model data will initially be stored on local servers. For CESM results, we will adhere to the CESM data policy, which includes data release within one year of generation. In general, C-CoMP will meet the NSF guidelines on data release within two years of generation but will release any hardened data products earlier, if possible.

All software development efforts in C-CoMP will follow open-source software development practices and will be licensed under the GNU General Public License. The source code of stable releases, as well as the development branches, will be accessible to the community through GitHub repositories. Stable releases of our software will also be available as Conda packages through the Anaconda Package Repository and as Docker containers through the Docker Hub for platform-independent, easy-to-install or easy-to-run scenarios.

The figure below displays strategies for C-CoMP knowledge transfer within the ‘C-CoMP space’ and with the ‘Public space’. C-CoMP contributes to education outreach and knowledge transfer through research that includes integrated data analysis, software development and integration, and scientific discovery and modeling. Novel and existing data are the inputs that enable this research. Data management, including internal sharing and versioning of datasets within the shared C-CoMP Google Drive, facilitates efficient and rapid data sharing. C-CoMP knowledge transfer to the ‘Public space’ includes:

long-term public data (raw ‘omics data, metadata, intermediate research products, and final research products) with assigned digital object identifiers (DOIs)
stable software for analyzing, integrating, and visualizing data, shared through the Docker Hub or Conda
source code and technical discussions for open-source software, models, and educational materials, shared through GitHub
rapid, pre-peer-review publication of manuscripts that provide early data descriptors and scientific discoveries, via bioRxiv or other preprint servers
blog posts sharing science news, project management ideas, and trainee stories for community outreach and transparency

Strategies for C-CoMP knowledge transfer within the ‘C-CoMP space’ and with the ‘Public space’. Figure credit: C-CoMP Team.

Data type	Repository/website
Raw metabolite data	MetaboLights and/or Metabolomics Workbench
Field data	BATS, BCO-DMO
Processed protein data	BCO-DMO via Ocean Protein Portal
Raw proteomics data	ProteomeXchange
Genomic sequence data (16S rRNA, whole-genome sequencing, shotgun metagenomics)	NCBI SRA, iMicrobe, iVirus
Open-source software	GitHub, Anaconda, Docker Hub
CESM results	TBD