About the Author
Iva Veseli is a graduate student in the C-CoMP Data Integration Working Group. She is part of the Meren Lab at the University of Chicago, where she develops bioinformatics strategies for estimating microbial metabolism from ‘omics data.
The Ruegeria pomeroyi digital microbe: showcasing our strategy for consolidating and sharing data with a C-CoMP model organism
One of C-CoMP’s early goals has been to establish a means of sharing related ‘omics data in an easily accessible, interoperable, and reproducible manner. As a large organization spanning multiple institutions and labs that use and/or produce a variety of sample types, it is critical to ensure that passing data from one lab to another doesn’t get confusing. And as a center committed to open science, we are striving to follow the FAIR data principles when we make our data publicly accessible.
Our solution is to integrate datasets describing our model organisms into self-contained, self-describing, programmatically-queryable SQLite databases using anvi’o (an open-source software platform), and then to share these databases via Zenodo (an online data repository for researchers). This strategy offers several benefits, such as the ability to share multiple datasets by providing a single link, and the ability to easily analyze and interactively visualize these data. We refer to the databases as ‘digital microbes’ because they can be used to consolidate ‘omics data related to a single microbial organism: its genome sequence, gene annotations, mapping information from metagenomes or metatranscriptomes, associated metadata, and more.
In this blog post, we’ll share the details of our first digital microbe – Ruegeria pomeroyi, one of C-CoMP’s model organisms – and hopefully demonstrate how easy it is to work with data in this format.
What is Ruegeria pomeroyi?
R. pomeroyi DSS-3 is a model marine bacterium for C-CoMP research. It is an Alphaproteobacterium, part of the Roseobacter clade, and known to degrade dimethylsulfoniopropionate, or DMSP, which makes it an important player in sulfur cycling in the oceans (González et al, 2003).
The Moran Lab at the University of Georgia isolated R. pomeroyi DSS-3 from coastal seawater. They sequenced its genome (Moran et al, 2004) and have been collecting and curating data on this important microbe ever since, generating metagenomes, metatranscriptomes, proteomes, and a TnSeq mutant library. We combined several of these datasets to make the R. pomeroyi digital microbe (also sometimes referred to as a digital organism).
What data is part of this ‘digital microbe’ (and who generated it)?
So far (at time of writing), the following data has been incorporated into the R. pomeroyi digital microbe (current version: 2):
- the complete genome and megaplasmid sequence (of type strain DSS-3), generated by the Moran lab (Moran et al, 2004)
- curated gene calls and their functional annotations, provided by Zac Cooper, Christa Smith and William Schroer
- automatically-generated functional annotations from sources including Pfam, NCBI COGs, and KEGG KOfam/BRITE
- genes that have TnSeq mutants available in the Moran lab’s mutant library, generated by Lidimarie Trujillo Rodriguez
- mapping data from 133 (meta)transcriptomes to the genome/megaplasmid sequence. The (meta)transcriptome samples were generated by the Moran lab and span 6 publications (and ~2 papers pending publication): Durham et al, 2014; Landa et al, 2019; Ferrer-González et al, 2021; Nowinski and Moran, 2021; Olofsson et al, 2022; Uchimiya et al, 2022
In future versions, we plan to incorporate additional datasets and data types (e.g. proteomes).
How to access the R. pomeroyi databases
To get a copy of the databases (as well as a reproducible workflow describing how the databases were generated), you can download them from Zenodo using this link: https://zenodo.org/record/7439166.
There will be two databases in the datapack. The first is a contigs database, which contains the genome sequence and related information like gene calls, function annotations, and taxonomy. The second is a profile database, which contains read-mapping information from multiple metagenome and metatranscriptome samples, including coverage data and sequence/structural variants.
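If you are unsure which file is which, the databases can describe themselves. The snippet below is a minimal sketch, assuming that each anvi’o database stores its metadata as key/value pairs in a table named `self` (with keys such as `db_type` and `version`); the function name `read_self_table` is our own and not part of anvi’o.

```python
import sqlite3
from pathlib import Path


def read_self_table(db_path):
    """Return the key/value metadata of a database as a dictionary.

    Assumption: the database has a table called `self` with the
    columns `key` and `value`.
    """
    # Open the file read-only so a shared database is never modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute('SELECT key, value FROM "self"').fetchall()
    finally:
        conn.close()
    return dict(rows)


# Example usage (file names as used later in this post):
# meta = read_self_table("R_POM_DSS3-contigs.db")
# print(meta.get("db_type"), meta.get("version"))
```

If the assumption holds, the `db_type` entry should tell the contigs database and the profile database apart, and the `version` entry records the database version.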
How the databases were generated
Here is a shortened version of the reproducible workflow for generating the R. pomeroyi databases. A longer workflow with additional detail is included in the datapack when downloading the databases from Zenodo.
How to use the data
Now, we’ll give some practical advice about using the databases. The commands below utilize the software platform anvi’o – anyone unfamiliar with anvi’o can find installation instructions, tutorials, and program documentation on the website anvio.org. This section is not meant to be a comprehensive tutorial on using anvi’o; it will simply provide some examples to follow. Note that the databases were generated using anvi’o v7.1-dev, so this version of the software (or a later release) should be installed to use the databases.
Note that anvi’o is not necessary to access this data. These are SQLite databases, which are programmatically accessible via command line-based SQL queries or packages that support interfacing with SQL databases, like the Python sqlite3 module or an equivalent R library.
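As a quick illustration of access without anvi’o, here is a minimal sketch that uses only the Python standard library to list every table in a database along with its row count. It makes no assumptions about the anvi’o schema, since it reads the table names from SQLite’s built-in `sqlite_master` catalog; the function name `table_row_counts` is ours.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path):
    """Map each table name in an SQLite file to its number of rows."""
    # Read-only connection: we only want to look at the data.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Table names come from the catalog itself; quote them as identifiers.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()


# Example usage:
# for table, n_rows in table_row_counts("R_POM_DSS3-contigs.db").items():
#     print(table, n_rows)
```

Browsing the table names this way is a reasonable first step before writing more specific SQL queries.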
Visualizing the data interactively
Once the databases are downloaded (and the anvi’o conda environment is loaded in your terminal), you can open the databases in the interactive interface by running this command:
anvi-interactive -c R_POM_DSS3-contigs.db -p PROFILE-VER_01.db
The -c and -p parameters are not even necessary in most cases. If these files are the only two databases in your working directory, anvi’o will find and load them automatically when you run anvi-interactive.
A browser window should open. Anvi’o works best in Google Chrome, so if that is not your default browser and things are not working, you should open Chrome and copy-paste the URL into the Chrome window to see if that fixes things.
You will see a glowing red ‘Draw’ button, which you should click on:
At which point a figure like the following should be drawn in the center of the window:
This display shows the coverage of the 133 (meta)transcriptome samples on different parts of the genome and megaplasmid sequence of R. pomeroyi DSS-3. Each concentric circle in the plot (“layer” in anvi’o lingo) is a different (meta)transcriptome, and each spoke of the wheel (or “item”) is a small segment of the larger genome (or plasmid) sequence. The dendrogram in the middle organizes the smaller sequence segments, by default according to sequence composition (aka tetranucleotide frequency) and differential coverage patterns across the samples. The outermost red layers indicate which sequence segments contain ribosomal RNA genes, and the barplots show per-sample read-mapping statistics like the number of reads that mapped to the genome/megaplasmid.
The display is interactive and customizable, and we encourage new users to play around and learn how to change the display to suit their needs (here is one tutorial on the subject). For example, if you don’t like the circular organization, you can change it to a regular matrix by changing the ‘Drawing Type’ to ‘Phylogram’ in the Settings panel on the left (and clicking the ‘Draw’ button afterwards to reload the display):
But we’ll keep the circular display in the examples below.
If you want to see the actual data values, you can open the ‘Mouse’ panel on the right side of the screen. Hover over any part of the display to see the corresponding value highlighted:
Those are the basics of using and understanding the interface.
Now we’ll go through a few examples of how to explore the data interactively.
Viewing just two samples
There is a lot of data on the screen. We can hide most of the samples and focus on just two. First, open the Settings panel and scroll down to click the checkbox under the ‘Edit attributes for multiple layers’ section. Doing so will select all the samples.
Then, unselect two samples in the ‘Layers’ section above by unchecking their boxes:
Now we will change the height of all selected samples to 0 so that they will be hidden from the display. In the ‘Edit attributes for multiple layers’ section, set the height box to 0 and click elsewhere so that all but the two unselected layers have a height of 0:
Then click the ‘Draw’ button. Now there are only two samples shown in the display, so it is easy to see the coverage values for each smaller sequence segment:
Searching for a gene
Now let’s try to find a gene of interest. Open the ‘Settings’ panel and click on the ‘Search’ tab, then open the ‘Search functions’ dropdown. Let’s search for the gene encoding the transporter protein for DMSP, OpuA (described in Moran et al, 2004):
There is one gene that matches. If you click on ‘Show search results down below’, you will see that the gene is located on the megaplasmid.
And if you click on ‘Highlight splits on the tree’, you will see that a small red bar pops up to indicate which sequence segment the gene is on:
In the next section, we will take a closer look at the gene and its surrounding sequence context.
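The same kind of search can also be run outside the interface with a plain SQL query. The following is a minimal sketch, assuming that the contigs database keeps its annotations in a table named `gene_functions` with the columns `gene_callers_id`, `source`, `accession`, and `function`; these names and the helper `search_functions` are assumptions for illustration, so check the table list of your copy of the database first.

```python
import sqlite3
from pathlib import Path


def search_functions(db_path, term):
    """Return annotation rows whose function text or accession contains `term`.

    Assumption: a `gene_functions` table with the columns
    gene_callers_id, source, accession, and function.
    """
    query = (
        'SELECT gene_callers_id, source, accession, "function" '
        "FROM gene_functions "
        'WHERE "function" LIKE ? OR accession LIKE ? '
        "ORDER BY gene_callers_id, source"
    )
    # SQLite's LIKE is case-insensitive for ASCII letters by default.
    pattern = f"%{term}%"
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return conn.execute(query, (pattern, pattern)).fetchall()
    finally:
        conn.close()


# Example usage:
# for gene_id, source, accession, function in search_functions(
#     "R_POM_DSS3-contigs.db", "OpuA"
# ):
#     print(gene_id, source, accession, function)
```

Each returned row carries the gene caller ID, which is the number you would use to look up the same gene in other tables.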
Inspecting the sequence
The smaller sequence segments on the main display are portions of the longer genome/megaplasmid sequence. In anvi’o lingo, these are called ‘splits’.
We want to look at the sequence around the OpuA gene that we found. Hover over the split that was highlighted previously, right-click, and click on ‘Inspect split’.
Another browser tab will open the inspection page, which looks like this:
Along the bottom you will see the megaplasmid sequence, where genes are indicated by arrows. The space above contains coverage plots from each sample.
You can click on the genes to see their annotations. You can also zoom in on the sequence by clicking and dragging your mouse in the lowermost panel. For example, here we zoom in on the OpuA gene:
And if we click on it, we will see its functional annotations:
Included in the functional annotations is the Gene_ID, which is the Moran lab’s curated annotation for each gene.
In the annotation panel there are also buttons that will give you the gene sequence (DNA or amino acid), and buttons that will automatically BLAST the gene sequence against the NCBI database.
If you look through the gene annotations, you might notice that some of them have information in the “TN_MUTANT_AVAILABLE” row, like this one:
This indicates that the Moran lab has a TnSeq mutant for the gene in their mutant library. If you want to have one of these mutants for your research, please reach out to the Moran Lab and give them the annotation information of your mutant of interest.
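To get a full list of genes with an available mutant rather than clicking through them one by one, a short query can help. This is a minimal sketch, assuming that each row of the annotation panel corresponds to a value in the `source` column of a `gene_functions` table in the contigs database (so that “TN_MUTANT_AVAILABLE” is one such source); the table layout and the function name `genes_with_source` are assumptions.

```python
import sqlite3
from pathlib import Path


def genes_with_source(db_path, source="TN_MUTANT_AVAILABLE"):
    """Return sorted gene caller IDs that have an annotation from `source`.

    Assumption: a `gene_functions` table with at least the columns
    gene_callers_id and source.
    """
    query = (
        "SELECT DISTINCT gene_callers_id FROM gene_functions "
        "WHERE source = ? ORDER BY gene_callers_id"
    )
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return [row[0] for row in conn.execute(query, (source,))]
    finally:
        conn.close()


# Example usage:
# mutant_genes = genes_with_source("R_POM_DSS3-contigs.db")
# print(len(mutant_genes), "genes with a mutant annotation")
```

Passing a different `source` value, such as the curated “Gene_ID” annotations mentioned above, would list the genes annotated by that source instead.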
Using anvi’o programs to analyze the data
Anvi’o has a lot of programs for interacting with these databases directly and doing common ‘omics analyses. You can browse the list of available programs on this page (note that this page is for the development version of anvi’o, and that the list of programs available in prior releases can be accessed by changing “main” in the URL to the anvi’o version number that you have, for versions >= 7).
The benefits of data-sharing via ‘digital microbes’
Hopefully, this post has demonstrated the advantages of the digital microbe framework for collaborative science. In case it’s still unclear, here is a summary of the benefits:
- multiple datasets (and data types) are summarized in just a couple of databases, which are straightforward to share
- the various data types are already integrated and ready for additional analyses and interactive visualization
- database versioning ensures everyone uses the same data and is critical for reproducibility
- sharing the databases on Zenodo facilitates both internal (C-CoMP) and public access of the data