Introduction
MGnifyR is a package designed to ease access to the EBI's MGnify resource, allowing searching and retrieval of multiple datasets for downstream analysis. While MGnify pipelines are undoubtedly useful, as currently implemented they produce results on a strictly per-sample basis. Although some whole-study results are available, comparisons across studies are difficult. The MGnifyR package is designed to facilitate cross-study analyses by handling all the per-sample data retrieval and merging details internally, leaving the user free to perform the analysis as they see fit.
The latest version of MGnifyR integrates seamlessly with the miaverse framework, giving access to optimized and standardized methods for downstream microbiome analysis. Users can additionally draw on a comprehensive tutorial book for guidance and support.
Installation
MGnifyR is available through Bioconductor, and can be installed with BiocManager using the following snippet.
BiocManager::install("MGnifyR")
Create a client
All functions in MGnifyR make use of a MgnifyClient object to keep track of the JSON API URL, disk cache location and user access tokens. Thus the first thing to do when starting any analysis is to instantiate this object. The following snippet creates this.
mg <- MgnifyClient()
mg
It's recommended that local caching is enabled with useCache = TRUE. Queries to the MGnify API can be quite slow, particularly when retrieving multipage results for many analyses (such as many InterPro results). Using a local disk cache can significantly speed up subsequent work, bypassing the need to re-query the API. Use of the cache should be entirely transparent, as the caching occurs at the raw data level. The cache can persist across MGnifyR sessions, and can even be used for multiple sessions simultaneously - provided that different sets of accessions are queried at once.
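As a minimal sketch, a client with caching enabled and a persistent cache location might be created as follows; the cacheDir argument name is an assumption, so check help(MgnifyClient) for the exact parameter.
# A sketch: enable the disk cache and give it a persistent location.
# The cacheDir argument name is an assumption; see help(MgnifyClient).
mg <- MgnifyClient(useCache = TRUE, cacheDir = "~/.MGnify_cache")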
Optionally, a username and password may be specified during client creation, causing MGnifyR to attempt retrieval of an authentication token from the API. Doing so gives access to non-public results, such as those currently under an author-imposed embargo period.
mg <- MgnifyClient(
username = "Webin-username", password = "your-password", useCache = TRUE)
Functions for fetching the data
Search data
MGnifyR gives users access to the complete range of search functionality implemented in the MGnify JSON API. A single function, doQuery(), is used to perform this searching, allowing Studies, Samples, Runs and Accessions to be interrogated from a common interface. As with all MGnifyR functions, the first argument client must be a valid MgnifyClient instance. The only remaining required parameter is qtype, specifying the type of data to be queried, and may be one of studies, samples, runs, analyses or assemblies. Other general parameters include max.hits.
Unlike most other MGnifyR high-level functions, caching is turned off by default for doQuery(). New data and analyses are being added to MGnify all the time, so enabling caching by default could lead to out-of-date search results in long-lived sessions. However, it's easy to switch back on, and doing so may be useful in many cases. Also, given the huge and ever-increasing number of datasets available in MGnify, a limit on the number of results returned may be set using max.hits. By default this is set to 200, which should be sufficient for most exploratory queries. It may be increased or decreased by directly specifying max.hits, and disabled completely (no limit) by setting max.hits = NULL.
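As a sketch, capping or uncapping a broad query explicitly might look like the following (uncapped broad queries can take a long time, so the unlimited call is shown commented out).
# A sketch: explicitly cap a broad query at 50 hits...
some_studies <- doQuery(mg, "studies", max.hits = 50)
# ...or remove the cap entirely (can be very slow for broad queries)
# all_studies <- doQuery(mg, "studies", max.hits = NULL)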
In most cases we will want to be more specific about the search, and will also use either an accession parameter or the many filter options available through the API, discussed below. Specifying an accession id, which in the case of samples, runs and assemblies may be a vector of ids, returns a data.frame of metadata with one row per matching accession.
If accession is NULL (the default), the remaining parameters define the filters applied by the API to the search result. Details of these parameters are given in help(doQuery). By way of example though, supposing we are interested in amplicon Illumina samples from the arctic, we might try the following query:
northpolar <- doQuery(
mg, "samples", latitude_gte=60.0, experiment_type="amplicon",
biome_name="Soil", instrument_platform = "Illumina", max.hits = 10)
head(northpolar)
Specifying an accession parameter will restrict results to just those matching that particular entry, be it a study, sample or run. For example, to retrieve information for study "MGYS00002891" (the snippet below is reconstructed to be consistent with the later use of study_samples):
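# Fetch metadata for a single study by accession; reconstructed example,
# consistent with the later searchAnalysis() call on study_samples$accession
study_samples <- doQuery(mg, "studies", accession = "MGYS00002891")
head(study_samples)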
Find relevant analyses accessions
Having obtained a particular set of search hits, it's now time to retrieve the associated results. General automated analysis is complicated by the MGnify database design, wherein, for example, samples may be shared between multiple studies, or studies analysed multiple times using different versions of the pipeline. Navigating these "many-to-one" relationships can be tricky, so MGnifyR resorts to using analyses accessions as its canonical identifier. Each analysis corresponds to a single run of a particular pipeline on a single sample in a single study. The downside of this approach is that queries returning studies, samples (or anything other than analyses) accessions need converting to the corresponding analyses.
MGnifyR therefore provides a helper function to handle this conversion - searchAnalysis(). Following on from our previous search, we have a list of study accessions, so to convert to the corresponding analyses we use:
analyses_accessions <- searchAnalysis(
mg, type="studies", accession = study_samples$accession)
head(analyses_accessions)
A useful side effect of the above call is that some attribute metadata for each sample has now been retrieved and stored in the local cache. Thus subsequent API calls for these samples (which will occur multiple times in later steps) will be significantly faster.
It's important to be aware that the results of a searchAnalysis() command will not necessarily be a one-to-one match with the input accessions. MGnify analysis runs are sometimes performed multiple times, perhaps using different versions of the pipeline. Thus further filtering of the result list may be required, but this is easily performed and is illustrated in the next section.
Fetch metadata
At this point we have a long list of analysis instances (with potential duplicates) corresponding to the samples previously found. We use the getMetadata() function to download and combine all associated sample, run and study metadata, which we then filter as required to include only the rows we want.
analyses_metadata <- getMetadata(mg, analyses_accessions)
head(analyses_metadata)
The resulting data.frame has columns with names prefixed with their source type. For example, "sample_xxx" columns correspond to metadata gleaned from querying an accession's sample entry. MGnify allows quite flexible specification of arbitrary metadata at submission time, in many cases leading to quite sparse data.frame results if accession queries are sourced from more than one study. For instance, if only one sample contains an entry for "sample_soil_PH", entries for other rows will be filled with NA.
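One quick way to gauge this sparsity is to count the missing entries per column; a minimal sketch using base R:
# Count NA entries per metadata column to see which are sparsely populated
na_counts <- colSums(is.na(analyses_metadata))
head(sort(na_counts, decreasing = TRUE))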
MGnifyR does not automatically clean these missing values - instead it allows the user to choose the correct action. The particular study we're looking at is from the marine biome; suppose we were interested in only those samples or analyses for which the sampling depth was known. The following snippet filters the full data.frame, selecting only entries which contain a valid sample_depth. It's worth noting the as.numeric call to ensure the column is converted to numeric type before it is checked. All sample data from MGnifyR is initially retrieved as type character, and it's up to the user to make sure ostensibly numeric entries are converted properly.
known_depths <- analyses_metadata[
!is.na(as.numeric(analyses_metadata$sample_depth)), ]
# How many are left?
dim(known_depths)
Fetch microbiome data
Having selected the analyses we wish to examine further, getResult() is used both to download associated OTU tables and taxonomy, and to join all results into a single TreeSummarizedExperiment (TreeSE) object. TreeSE is becoming a de facto standard for taxonomic abundance munging in R. TreeSE objects integrate abundance, taxonomic, phylogenetic, sample and sequence data into a single object, with powerful facilities for filtering, processing and plotting the results. Compared to a phyloseq object, TreeSE is more scalable and better suited to efficient data analysis.
The miaverse framework is developed around the TreeSE data container. It provides tools for analysis and visualization, and it includes a comprehensive tutorial book called OMA.
Amplicon sequencing
When the dataset includes amplicon sequencing data, i.e., the dataset does not include functional predictions, the getResult() method returns the dataset as a TreeSE by default. See the function documentation for other output types.
tse <- getResult(mg, accession = analyses_accessions, get.func = FALSE)
tse
The TreeSE object is uniquely positioned to support SummarizedExperiment-based microbiome data manipulation and visualization. Moreover, it enables access to miaverse tools. For example, we can estimate the diversity of samples.
# Calculate Shannon alpha diversity with mia
library(mia)
tse <- estimateDiversity(tse, index = "shannon")
# Visualize diversity by geographic location
library(scater)
plotColData(tse, "shannon", x = "sample_geo.loc.name")
# Plot kingdom-level relative abundances, dropping features with no
# Kingdom assignment
library(miaViz)
plotAbundance(
    tse[!is.na( rowData(tse)[["Kingdom"]] ), ],
    rank = "Kingdom",
    as.relative = TRUE
)
If needed, a TreeSE can be converted to phyloseq.
pseq <- makePhyloseqFromTreeSE(tse)
pseq
Metagenomics
Although the previous queries have been based on the results from doQuery(), from now on we will concentrate on combining and comparing results from specific studies. Since newly performed analyses are retrieved first in the doQuery() call, it's likely that by the time this vignette is read, the query results will be different. This is principally due to the rapid increase in MGnify submissions, leading to a potential lack of consistency between even closely spaced queries. As mentioned previously, it may be best to use useCache = FALSE in the MgnifyClient object for doQuery() calls, to ensure queries are actually returning the latest data.
For the remainder of this vignette, however, we'll be comparing three ostensibly different studies: a study of saltmarsh soils from York University, human faecal samples from a survey of healthy Sardinians, and a set of samples from hydrothermal vents on the Mid-Cayman Rise in the Caribbean Sea. To simplify things, only the first 20 samples from each study will be used. Furthermore, the intention is only to demonstrate the functionality of the MGnifyR package, rather than to produce scientifically rigorous results.
soil <- searchAnalysis(mg, "studies", "MGYS00001447")
human <- searchAnalysis(mg, "studies", "MGYS00001442")
marine <- searchAnalysis(mg, "studies", "MGYS00001282")
# Combine analyses
all_accessions <- c(soil, human, marine)
head(all_accessions)
The first step with this new accession list is, as previously, to retrieve the associated metadata using getMetadata(). As seen with the doQuery() results, the returned data.frame contains a large number of columns. Being autogenerated and flexible, the column names can be a little difficult to predict, but examining colnames(full_metadata) should make things clearer.
full_metadata <- getMetadata(mg, all_accessions)
colnames(full_metadata)
head(full_metadata)
From full_metadata we get an idea of the type of data we're dealing with, and can extract useful information such as sequencing platform, source biome, etc. The next code snippet tallies a few of these columns to give an idea of what's available. The boxplot also indicates that while within-study read counts are similar, we probably need to use some sort of normalization procedure when comparing across samples. We might also want to drop particularly low read coverage samples from further analysis.
# Load ggplot2
library(ggplot2)
# Distribution of sample source material:
table(full_metadata$`sample_environment-material`)
# What sequencing machine(s) were used?
table(full_metadata$`sample_instrument model`)
# Boxplot of raw read counts:
ggplot(
full_metadata, aes(x=study_accession, y=log(
as.numeric(`analysis_Submitted nucleotide sequences`)))) +
geom_boxplot(aes(group=study_accession)) +
theme_bw() +
ylab("log(submitted reads)")
Again, we can fetch the data by calling getResult(). Setting bulk.dl = TRUE has the potential to significantly speed up data retrieval. MGnify makes its functional results available in two separate ways: either on a per-analysis basis through the web API, or at the whole-study level as large tab-separated (TSV) files, with columns representing the results for each analysis. When bulk.dl is FALSE, MGnifyR queries the web API to get results, which (given that some functional analysis results may consist of thousands of entries) may take significant time. Setting bulk.dl to TRUE causes MGnifyR to determine the source study associated with a particular analysis and to instead download and parse its corresponding results file. Since this result file contains entries for all analyses associated with the study, by taking advantage of MGnifyR's local caching this single download provides results for many future analyses. In some cases this affords several orders of magnitude speedup over the API query case.
Unfortunately, column entries in the per-study results files do not always directly correspond to those from a particular analysis run, causing the retrieval to fail. The principal cause of this is believed to be the running of multiple analysis jobs on the same sample. Thus, for reliability, bulk.dl is FALSE by default. As a general recommendation though, you should try setting it to TRUE the first time getResult() is used on a set of accessions. If this fails, setting bulk.dl to FALSE will enable the more robust approach, allowing the analysis to continue. It might take a while though. Hopefully in the future the sample/analysis correspondence mismatches will be fixed and the default bulk.dl will be switched to TRUE.
mae <- getResult(mg, all_accessions, bulk.dl = TRUE)
mae
For metagenomic samples, the result is a MultiAssayExperiment (MAE) which links multiple TreeSE objects into one dataset. These TreeSE objects hold the taxonomic profiling data along with the functional data in separate objects. The objects are linked to each other by their sample names. You can access an individual object or experiment by specifying its index or name.
mae[[2]]
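The experiment names can be listed with names(), and an experiment extracted by name rather than index; the actual names depend on the retrieved dataset, so the name in the commented line is purely illustrative.
# List the experiments held in the MultiAssayExperiment
names(mae)
# An experiment can then be extracted by name (illustrative name only)
# tse_taxa <- mae[["microbiota"]]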
We can apply principal coordinate analysis (PCoA) to the microbial profiling data by utilizing miaverse tools.
# Apply relative transformation
mae[[1]] <- transformAssay(mae[[1]], method = "relabundance")
# Perform PCoA
mae[[1]] <- runMDS(
mae[[1]], assay.type = "relabundance",
FUN = vegan::vegdist, method = "bray")
# Plot
plotReducedDim(mae[[1]], "MDS", colour_by = "sample_environment.feature")
Fetch raw files
While getResult() can be utilized to retrieve microbial profiling data, getData() can be used more flexibly to retrieve any kind of data from the database. It returns the data in a simple data.frame or list format.
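For example, something like the following could be used to fetch a non-analysis resource; the "kegg-modules" type shown here is an assumption, so check help(getData) for the supported values.
# A sketch: retrieve an arbitrary resource type from the MGnify API.
# The type value "kegg-modules" is an assumption; see help(getData).
kegg <- getData(mg, type = "kegg-modules")
head(kegg)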
Fetch sequence files
Finally, we can use searchFile() and getFile() to retrieve other MGnify pipeline outputs such as merged sequence reads, assembled contigs, and details of the functional analyses. searchFile() is a simple wrapper function which, when supplied with a list of accessions, finds the URLs of the files we're after. In most cases we'll want to filter the returned list down to only the files of interest, which is easily done on the resulting data.frame object. In addition to the actual download location (the download_url column), extra columns include file type, contents and compression. It's recommended that the colnames of the data.frame be examined to get a grasp on the available metadata. To demonstrate the process, the code below retrieves a data.frame containing all available downloads for each accession we've been examining previously. It then filters this to retain only the annotated amino acid sequence files.
# Find list of available downloads
dl_urls <- searchFile(
mg, full_metadata$analysis_accession, type = "analyses")
# Filter table
target_urls <- dl_urls[
dl_urls$attributes.description.label == "Predicted CDS with annotation", ]
head(target_urls)
To list the types of available files, and guide the filtering, something like the following might be useful.
table(dl_urls$attributes.description.label)
Unlike other MGnifyR functions, searchFile() is not limited to analyses; by specifying the type argument, other result types may be found. For instance, while general genome functionality is not yet integrated into MGnifyR, we can retrieve associated files for a particular genome accession with the following:
genome_urls <- searchFile(mg, "MGYG000433953", type = "genomes")
genome_urls[ , c("id", "attributes.file.format.name", "download_url")]
Having found a set of target URLs, the final step is to use getFile() to actually retrieve the files. Unlike other functions, this only works with a single url location at once, so each entry in target_urls from above must be downloaded individually - easily done by either looping or applying over the list, as in the sketch below.
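A minimal sketch of such a loop, downloading every file in target_urls into the local cache and collecting the filepaths that getFile() returns:
# A minimal sketch: download each file in target_urls, collecting the
# local filepaths returned by getFile()
local_paths <- vapply(
    target_urls$download_url, function(url) getFile(mg, url),
    FUN.VALUE = character(1))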
If the files are intended to be used with external programs, it might be easiest to provide a file parameter to the function call, which specifies a local filename for writing the file. By default MGnifyR will use the local cache, which can make getting to the file afterwards more awkward. Regardless, the default behaviour of getFile() is to retrieve the file specified in the url parameter, save it to disk, and return the filepath it was saved to.
# Just select a single file from the target_urls list for demonstration.
# Default behavior - use the local cache.
cached_location1 <- getFile(mg, target_urls$download_url[[1]])
# Specifying a local file to write to (the filename here is illustrative)
cached_location2 <- getFile(
    mg, target_urls$download_url[[1]], file = "predicted_cds.fasta.gz")
cached_location <- c(cached_location1, cached_location2)
# Where are the files?
cached_location
A second download option is available, which allows built-in parsing of the file. If we know ahead of time what processing will be performed, it may be possible to integrate it into a function and pass this function to getFile() as the read_func argument. The function in question should take a single argument (the complete path name of the locally downloaded file), and the result of the call will be returned in place of the usual output file name.
Alternatively, the files could first be downloaded in the standard way, and then processed using this same function in a loop; therefore in many cases the read_func parameter is redundant. However, many of the outputs from MGnify can be quite large, meaning local storage of many files may become an issue. By providing a read_func parameter (and setting useCache = FALSE in the MgnifyClient object), analysis of a large number of datasets may be possible with minimal storage requirements.
To illustrate, suppose we were interested in retrieving all detected sequences matching a particular PFAM motif in a set of analyses. The simple function below uses the Biostrings package to read an amino acid FASTA file, searches for a matching PFAM tag in the sequence name, and then tallies up the unique sequences into a single data.frame row. In this case the PFAM motif identifies sequences coding for the amoC gene, found in both ammonia and methane oxidizing organisms, but any other filtering method could be used.
library(Biostrings)
# Simple function to count unique sequences matching the PFAM amoC/mmoC motif
getAmoCseqs <- function(fname){
    # Read the amino acid FASTA file and keep sequences tagged with PF04896
    sequences <- readAAStringSet(fname)
    tgtvec <- grepl("PF04896", names(sequences))
    # Tally the unique matching sequences into a single-row data.frame
    as.data.frame(as.list(table(as.character(sequences[tgtvec]))))
}
Having defined the function, it just remains to include it in the call to getFile().
# Just download a single accession for demonstration, specifying read_func
amoC_seq_counts <- getFile(
mg, target_urls$download_url[[1]], read_func = getAmoCseqs)
amoC_seq_counts