Genomic Data Retrieval with R
listGenomes() and biomart()
listGenomes() filter error (#107)
Some new generalization: check out the function biomartr:::supported_biotypes(db = "refseq"). This function will simplify a lot of downstream code. (#104)
Tests are now much quicker to run, because biomartr::is.genome.available() (which is used basically everywhere) now reads files with data.table instead of readr. (#104)
Fixed a bug in is.genome.available() where the skip_bacteria argument was not passed on internally to is.genome.available.refseq.genbank() (#105)
Over 5000 lines have been edited, most of them removed (#100), to generalize the package and make it safer for future development. This work is still ongoing.
biomartr.
UniProt database is now updated to the new API/FTP path system. Users can now retrieve proteomes using the functions getProteome(db = "uniprot", ...) and getProteomeSet(db = "uniprot", ...) (see #82)
getBioSet: Generic Bio data set extractor
getBio: A wrapper to all bio getters, selected with 'type' argument
getUniProtSTATS(): Retrieve UniProt Database Information File (STATS)
The package now supports caching of back end files, which used to be saved to the /tmp folder (i.e. lost on computer restart). This makes things easier for power users who want higher speed. For more info, see ?cachedir_set
getAssemblySummary()
getKingdomAssemblySummary() was called by all get*() functions. Due to an error in the assembly_summary.txt file for viruses, where the total gene count was stored as character and not as integer (as is the case for all other assembly_summary.txt files), an error occurred stating that dplyr::bind_rows() cannot join column $X35 due to differences in data types. This has now been resolved by parsing the correct data types with readr (#92)
http to https curl requests (Many thanks to @Roleren)
Overall, this new version fixes a big internet connection issue to NCBI and ENSEMBL. Users can now reinstall the new version from CRAN and will find that their initially failing downloads now run without having to change their code.
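The data type mismatch in column X35 described above can be reproduced on toy data. This is a minimal sketch, not the package's internal code: the two small tables and their values are made up, with X35 standing in for the gene count column, and it assumes the dplyr, tibble and readr packages are installed.

```r
library(dplyr)
library(readr)
library(tibble)

# Toy stand-ins for two assembly summary tables (values are illustrative).
# In one table X35 is an integer, in the other it was read as character.
bacteria <- tibble(organism = "A", X35 = 4000L)
viruses  <- tibble(organism = "B", X35 = "12")

# Combining the tables directly fails because the column types differ.
combined <- try(bind_rows(bacteria, viruses), silent = TRUE)
mismatch_failed <- inherits(combined, "try-error")

# Parsing the character column as integer first makes the tables compatible.
viruses_fixed  <- mutate(viruses, X35 = parse_integer(X35))
combined_fixed <- bind_rows(bacteria, viruses_fixed)
```

Parsing the column explicitly, rather than relying on type guessing per file, keeps all tables compatible regardless of what each individual file contains.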
New function check_annotation_biomartr() helps to check whether downloaded GFF or GTF files are corrupt. Find more details here
New function getCollectionSet() allows users to retrieve a Collection (Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats) of multiple species
Example:
# define scientific names of species for which
# collections shall be retrieved
organism_list <- c("Arabidopsis thaliana",
                   "Arabidopsis lyrata",
                   "Capsella rubella")
# download the collections of all listed species from refseq
# and store them in the folder 'set_collections'
biomartr::getCollectionSet(db = "refseq",
                           organism = organism_list,
                           path = "set_collections")
The getGFF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
Example: Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)
The getGFFSet() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file
The getGTF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GTF file
Adding a new message system to biomartr::organismBM(), biomartr::organismAttributes(), and biomartr::organismFilters() so that large API queries don't seem so unresponsive
getCollection() receives new arguments release, remove_annotation_outliers, and gunzip that will now be passed on to downstream retrieval functions
The getGTF(), getGenome() and getGenomeSet() functions receive a new argument assembly_type = "toplevel" to enable users to choose between toplevel and primary assembly when using the ensembl database. Setting assembly_type = "primary_assembly" will save a lot of space on hard drives for people using large ensembl genomes.
All get*() functions with a release argument now check if the ENSEMBL release is >45 (Many thanks to @Roleren #31 #61)
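The release check mentioned above could look roughly like the following. This is only a sketch under stated assumptions: check_ensembl_release() is a hypothetical helper written for illustration, not a function exported by biomartr, and treating NULL as "use the most recent release" is an assumption as well.

```r
# Hypothetical helper for illustration; not part of biomartr's API.
check_ensembl_release <- function(release = NULL) {
  # Assumption: NULL means "no specific release requested".
  if (is.null(release)) {
    return(invisible(TRUE))
  }
  is_single_number <- is.numeric(release) &&
    length(release) == 1 &&
    !is.na(release)
  # Reject anything that is not a single number greater than 45.
  if (!is_single_number || release <= 45) {
    stop("Please provide an ENSEMBL release number greater than 45.",
         call. = FALSE)
  }
  invisible(TRUE)
}

ok_recent <- check_ensembl_release(100)
ok_null   <- check_ensembl_release(NULL)
too_old   <- try(check_ensembl_release(45), silent = TRUE)
```

Validating the argument before any download starts means an invalid release fails immediately with a readable message instead of a failed FTP request.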
In all get*() functions, readr::write_tsv(path = ) was exchanged for readr::write_tsv(file = ), since the readr package version > 1.4.0 is deprecating the path argument.
tbl_df() was deprecated in dplyr 1.0.0. Please use tibble::as_tibble() instead. -> adjusted organismBM() accordingly
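Both replacements named above can be tried on a small table. The sketch below uses a made-up documentation table (the column names and values are illustrative only) and assumes the readr and tibble packages are installed.

```r
library(readr)
library(tibble)

# Illustrative table; the columns are made up for this example.
doc <- data.frame(organism = "Arabidopsis thaliana",
                  db = "refseq",
                  stringsAsFactors = FALSE)

# tibble::as_tibble() takes the place of the deprecated dplyr::tbl_df().
doc_tbl <- as_tibble(doc)

# Use the 'file' argument of write_tsv() instead of the deprecated 'path'.
out_file <- tempfile(fileext = ".tsv")
write_tsv(doc_tbl, file = out_file)

# Read the file back (both columns as character) to confirm the round trip.
doc_back <- read_tsv(out_file, col_types = "cc")
```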
custom_download(), getGENOMEREPORT(), and other download functions now set withr::local_options(timeout = max(30000000, getOption("timeout"))), which extends the default 60 sec timeout to at least 30000000 sec
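To illustrate how a locally scoped timeout behaves, here is a small sketch. The wrapper timeout_during_download() is hypothetical and performs no download; it only reports the timeout in effect inside the function. The example assumes the withr package is installed and passes the option as a list, which is equivalent to the named form quoted above.

```r
library(withr)

# Hypothetical wrapper; it only reports the timeout that is in effect.
timeout_during_download <- function() {
  # Raise the timeout for the duration of this function call only.
  local_options(list(timeout = max(30000000, getOption("timeout"))))
  getOption("timeout")
}

timeout_before <- getOption("timeout")
timeout_inside <- timeout_during_download()
timeout_after  <- getOption("timeout")
```

Because the option is restored when the function exits, the user's global timeout setting is left untouched after the call.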
Fixing a bug where the genome availability check in getCollection() was only performed in NCBI RefSeq and not in other databases, due to a constant used in is.genome.available() rather than a variable (Many thanks to Takahiro Yamada for catching the bug) #53
Fixing an issue that caused the read_cds() function to fail in data.table mode (Many thanks to Clement Kent) #57
Fixing an SSL bug that was found on Ubuntu 20.04 systems #66 (Many thanks to Håkon Tjeldnes)
Fixing a global variable issue that caused clean.retrieval() to fail when no documentation file was in a meta.retrieval() folder
The NCBI recently started adding NA values as FTP file paths in their species summary files for species without reference genomes. As a result, meta.retrieval() stopped working, because no FTP paths were found for some species. This issue has now been fixed by adding the filter rule !is.na(ftp_path) to all get*() functions (Many thanks to Ashok Kumar Sharma #34 and Dominik Merges #72 for making me aware of this issue)
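The effect of the filter rule !is.na(ftp_path) can be shown on a toy summary table. This is a minimal sketch assuming dplyr and tibble are installed; the species names and FTP paths are placeholders, and only the column name ftp_path comes from the text above.

```r
library(dplyr)
library(tibble)

# Toy summary table; names and paths are placeholders, not real NCBI entries.
summary_tbl <- tibble(
  organism_name = c("Species A", "Species B", "Species C"),
  ftp_path      = c("ftp://example.org/a", NA, "ftp://example.org/c")
)

# Keep only the entries that actually provide an FTP path.
with_path <- filter(summary_tbl, !is.na(ftp_path))
```

Entries without a path are dropped before any download is attempted, so a missing path for one species no longer stops the retrieval of the others.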
Fixing an issue in custom_download() where the method argument was causing issues when downloading from https directed ftp sites (Many thanks to @cmatKhan) #76
Fixing an issue when trying to combine multiple summary-stats files where NAs were present in the list item that was passed along for combination in meta.retrieval() #73 (Many thanks to Dominik Merges)
Fixing a bug in download.database.all() where failing to remove the listed file *-metadata.json corrupted the download process (Many thanks to Jaruwatana Lotharukpong)
Minor updates to comply with CRAN policy.
Please be aware that as of April 2019, ENSEMBLGENOMES was retired (see details here). Hence, all biomartr functions were updated and won't support data retrieval from ENSEMBLGENOMES servers anymore.
clean.retrieval() enables formatting and automatic unzipping of meta.retrieval output (find out more here: https://ropensci.github.io/biomartr/articles/MetaGenome_Retrieval.html#un-zipping-downloaded-files)
getGenomeSet() allows users to easily retrieve genomes of multiple specified species. In addition, the genome summary statistics for all retrieved species will be stored as well to provide users with insights regarding the genome assembly quality of each species. This file can be used as a Supplementary Information file in publications to facilitate reproducible research.
getProteomeSet() allows users to easily retrieve proteomes of multiple specified species
getCDSSet() allows users to easily retrieve coding sequences of multiple specified species
getGFFSet() allows users to easily retrieve GFF annotation files of multiple specified species
getRNASet() allows users to easily retrieve RNA sequences of multiple specified species
summary_genome() allows users to retrieve summary statistics for a genome assembly file to assess the influence of genome assembly qualities when performing comparative genomics tasks
summary_cds() allows users to retrieve summary statistics for a coding sequence (CDS) file. We noticed that many CDS files stored in NCBI or ENSEMBL databases contain sequences whose length is not divisible by 3 (division into codons). This makes it difficult to divide CDS into codons for e.g. codon alignments or translation into protein sequences. In addition, some CDS files contain a significant number of sequences that do not start with AUG (start codon). This function enables users to quantify how many of these sequences exist in a downloaded CDS file to process these files according to the analyses at hand.
The default of reference in meta.retrieval() changed from reference = TRUE to reference = FALSE. This way all genomes (reference AND non-reference) will be downloaded by default. This is what users seem to prefer.
getCollection() now also retrieves GTF files when db = 'ensembl'
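The two properties that summary_cds() is described as quantifying (sequence length not divisible by 3, and a missing start codon) can be illustrated in base R. This sketch is not the implementation of summary_cds(); the three sequences are made up, and it assumes the CDS are stored as DNA, so the start codon AUG appears as ATG.

```r
# Toy coding sequences (made up); assumed to be DNA, so AUG appears as ATG.
cds <- c(
  seq1 = "ATGGCCTAA",  # 9 bases, starts with ATG
  seq2 = "ATGGCCTA",   # 8 bases, length not divisible by 3
  seq3 = "GCCGCCTAA"   # 9 bases, does not start with ATG
)

# Flag sequences that cannot be split cleanly into codons.
not_multiple_of_3 <- nchar(cds) %% 3 != 0

# Flag sequences that lack the start codon at their first position.
no_start_codon <- !startsWith(toupper(cds), "ATG")

n_not_multiple_of_3 <- sum(not_multiple_of_3)
n_no_start_codon    <- sum(no_start_codon)
```

Counting the flagged sequences first lets users decide whether to drop them, trim them, or keep them, depending on the analysis at hand.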
getAssemblyStats() now also performs an md5 checksum test
getGTF(): users can now specify the NCBI Taxonomy ID or Accession ID in addition to the scientific name in argument 'organism' to retrieve genome assemblies
getGFF(): users can now specify the NCBI Taxonomy ID or Accession ID for ENSEMBL in addition to the scientific name in argument 'organism' to retrieve genome assemblies
getMarts() will now throw an error when BioMart servers cannot be reached (#36)
getGenome() now also stores the genome summary statistics (see ?summary_genome()) for the retrieved species in the documentation folder to provide users with insights regarding the genome assembly quality
reference is now set from reference = TRUE to reference = FALSE (= new default)
get*() functions received a new argument release which allows users to retrieve specific release versions of genomes, proteomes, etc. from ENSEMBL and ENSEMBLGENOMES
get*() functions received two new arguments clean_retrieval and gunzip which allow users to unzip the downloaded files directly in the get*() function call and rename the file for more convenient downstream analyses
Function changes:
the function meta.retrieval() will now pick up the download at the organism where it left off and will report which species have already been retrieved
all get*() functions and the meta.retrieval() function receive a new argument reference which allows users to retrieve non-reference or non-representative genome versions when downloading from NCBI RefSeq or NCBI Genbank
the argument order in meta.retrieval() changed from meta.retrieval(kingdom, group, db, ...) to meta.retrieval(db, kingdom, group, ...) to make the argument order more consistent with the get*() functions
the argument order in getGroups() changed from getGroups(kingdom, db) to getGroups(db, kingdom) to make the argument order more consistent with the get*() and meta.retrieval() functions
New Functions: