Biomartr Versions

Genomic Data Retrieval with R

v1.0.7

5 months ago

biomartr 1.0.7

New features

Generalization of Biomart database access #108

  • Generalized biomart database interface (now uses https and port 443)
  • added cache for biomart database overview
  • added more unit tests for listGenomes() and biomart()

Bug fixes

  • fixed listGenomes() filter error #107
  • Bacteria collection corner case bug fixed #109

v1.0.6

7 months ago

biomartr 1.0.6

New features

  • Further generalization, including the new internal function biomartr:::supported_biotypes(db = "refseq"), which will simplify a lot of downstream code. (#104)

  • Tests are now much quicker to run, because biomartr::is.genome.available() (which is used almost everywhere) now reads files with data.table instead of readr. (#104)

Bug fixes

  • Fixing bug in is.genome.available() where the skip_bacteria argument was not passed on internally to is.genome.available.refseq.genbank() (#105)

v1.0.5

7 months ago

Package generalization

Over 5000 lines have been edited, most of them removed (#100), to generalize the package and make it safer for future development. This work is still ongoing.

  • @Roleren is joining as package author and new core developer of biomartr.

New features

  • Ensembl Genomes is no longer treated as a different database from ensembl in biomaRt, since this split is artificial. It is advised to use only "ensembl" as db from now on, but "ensemblgenomes" will still work.
  • Annotation previously meant GFF only, but the annotation getter should cover both GFF and GTF, selected with a format specification. This is now fixed and generalized.
  • Added a new kingdom for ensembl: protists are now supported, with correct collection getters
  • The retrieval from the UniProt database is now updated to the new API/FTP path system. Now users can retrieve proteomes using the functions getProteome(db = "uniprot", ...) and getProteomeSet(db = "uniprot", ...) (see #82)
  • new function getBioSet: Generic Bio data set extractor
  • new function getBio: A wrapper to all bio getters, selected with 'type' argument
  • a new function getUniProtSTATS(): Retrieve UniProt Database Information File (STATS)

Power user cache

The package now supports caching of back-end files, which used to be saved to the /tmp folder (i.e. lost on computer restart). This makes things easier for power users who want higher speed. For more info, see ?cachedir_set

Bug fixes

  • Fixed many wrong URLs and non-working functions; more tests were added to make sure they work.
  • Fixed fungi collection accessor for ensembl

v1.0.4

11 months ago

Patch release to fix a major bug where retrieval stopped due to parsing issues in getAssemblySummary()

New Features

  • in getSummaryFile() all columns of the assembly_summary.txt are now specified with names and correct data types (#92)

Bug Fixes

  • all get*() functions failed when they called the low-level function getKingdomAssemblySummary(): in the assembly_summary.txt file for viruses the total gene count was stored as character rather than integer (as is the case for all other assembly_summary.txt files), so dplyr::bind_rows() could not join column $X35 due to differences in data types. This has now been resolved by parsing the correct data types with readr (#92)
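
The type mismatch described in this fix can be illustrated offline. The following is a minimal sketch in base R, not the package's readr-based fix: the two toy tables, their column names, and the "na" placeholder are made up for illustration. It shows how the same column can be guessed as integer in one file and character in another, and how declaring column types up front makes the tables combinable.

```r
# Two toy summary tables (hypothetical data). In the second table one
# gene count is the non-numeric string "na", so type guessing yields
# a character column instead of an integer column.
eukaryota_txt <- "organism\tgene_count\nSpecies A\t27000\nSpecies B\t31000"
viral_txt     <- "organism\tgene_count\nVirus A\t12\nVirus B\tna"

# Reading with guessed types: the two tables disagree on gene_count.
read_default <- function(txt) read.delim(text = txt, stringsAsFactors = FALSE)
guessed_types <- c(
  eukaryota = class(read_default(eukaryota_txt)$gene_count),
  viral     = class(read_default(viral_txt)$gene_count)
)

# Reading with declared types: gene_count is always integer and the
# placeholder "na" becomes a proper missing value.
read_typed <- function(txt) {
  read.delim(
    text = txt,
    colClasses = c("character", "integer"),
    na.strings = c("NA", "na"),
    stringsAsFactors = FALSE
  )
}

# With consistent types the tables can be stacked safely.
combined <- rbind(read_typed(eukaryota_txt), read_typed(viral_txt))
print(guessed_types)
print(combined)
```

Base rbind() would silently coerce mismatched columns, whereas dplyr::bind_rows() stops with an error, which is why the problem surfaced as a hard failure in the package.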

v1.0.3

1 year ago

Minor maintenance fixes to ensure smooth installation on R versions >4.0.0.

  • adding pull request #88 which fixes issues with http to https curl requests (Many thanks to @Roleren)

v1.0.2

2 years ago

biomartr 1.0.2

Overall, this new version fixes a major internet connection issue affecting NCBI and ENSEMBL. Users can reinstall the new version from CRAN and will find that their previously failing downloads now run without any changes to their code.

New Functions

  • New function check_annotation_biomartr() helps to check whether downloaded GFF or GTF files are corrupt. Find more details here

  • new function getCollectionSet() allows users to retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats of multiple species

Example:

# define scientific names of species for which
# collections shall be retrieved
organism_list <- c("Arabidopsis thaliana",
                   "Arabidopsis lyrata",
                   "Capsella rubella")
# download the collections of all listed species from refseq
# and store the corresponding files in 'set_collections'
getCollectionSet(db       = "refseq",
                 organism = organism_list,
                 path     = "set_collections")

New Features

  • the getGFF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file

Example:

Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)

  • the getGFFSet() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GFF file

  • the getGTF() function receives a new argument remove_annotation_outliers to enable users to remove corrupt lines from a GTF file

  • adding a new message system to biomartr::organismBM(), biomartr::organismAttributes(), and biomartr::organismFilters() so that large API queries don't seem so unresponsive

  • getCollection() receives new arguments release, remove_annotation_outliers, and gunzip that will now be passed on to downstream retrieval functions

  • the getGTF(), getGenome() and getGenomeSet() functions receive a new argument assembly_type = "toplevel" to enable users to choose between toplevel and primary assembly when using the ensembl database. Setting assembly_type = "primary_assembly" will save a lot of hard drive space for people using large ensembl genomes.

  • all get*() functions with release argument now check if the ENSEMBL release is >45 (Many thanks to @Roleren #31 #61)

  • in all get*() functions, readr::write_tsv(path = ) was replaced by readr::write_tsv(file = ), since readr versions > 1.4.0 are deprecating the path argument.

  • organismBM() now uses tibble::as_tibble() instead of tbl_df(), which was deprecated in dplyr 1.0.0

  • custom_download(), getGENOMEREPORT(), and other download functions now set withr::local_options(timeout = max(30000000, getOption("timeout"))), which extends the default 60 sec timeout to 30000000 sec
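
The timeout change in the last item follows a common pattern: raise an option only for the duration of one function call and restore it afterwards. The following is a minimal sketch of that pattern in base R using options() and on.exit() rather than withr::local_options(); the helper name with_long_timeout() is hypothetical and not part of biomartr.

```r
# Hypothetical helper: run `task` (a function) with a raised download
# timeout and restore the previous value afterwards, even if `task` fails.
with_long_timeout <- function(task, seconds = 30000000) {
  # Never lower a timeout the user has already set higher.
  old <- options(timeout = max(seconds, getOption("timeout")))
  on.exit(options(old), add = TRUE)
  task()
}

timeout_before <- getOption("timeout")
# Inside the call the raised timeout is visible ...
timeout_inside <- with_long_timeout(function() getOption("timeout"))
# ... and afterwards the original value is back.
timeout_after <- getOption("timeout")

print(c(before = timeout_before, inside = timeout_inside, after = timeout_after))
```

Using max() means a user who has already configured a longer timeout keeps it, which matches the expression quoted in the release note.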

Bug Fixes

  • Fixing bug where genome availability check in getCollection() was only performed in NCBI RefSeq and not in other databases due to a constant used in is.genome.available() rather than a variable (Many thanks to Takahiro Yamada for catching the bug) #53

  • fixing an issue that caused the read_cds() function to fail in data.table mode (Many thanks to Clement Kent) #57

  • fixing an SSL bug that was found on Ubuntu 20.04 systems #66 (Many thanks to Håkon Tjeldnes)

  • fixing global variable issue that caused clean.retrieval() to fail when no documentation file was in a meta.retrieval() folder

  • The NCBI recently started adding NA values as FTP file paths in their species summary files for species without reference genomes. As a result meta.retrieval() stopped working, because no FTP paths were found for some species. This issue has now been fixed by adding the filter rule !is.na(ftp_path) to all get*() functions (Many thanks to Ashok Kumar Sharma #34 and Dominik Merges #72 for making me aware of this issue)

  • Fixing an issue in custom_download() where the method argument was causing issues when downloading from https directed ftp sites (Many thanks to @cmatKhan) #76

  • Fixing issue when trying to combine multiple summary-stats files where NA's were present in the list item that was passed along for combination in meta.retrieval() #73 (Many thanks to Dominik Merges)

  • Fixing a bug in download.database.all() where the lack of removing listed file *-metadata.json caused corruption of the download process (Many thanks to Jaruwatana Lotharukpong)
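
The !is.na(ftp_path) filter rule mentioned in the fixes above can be pictured with a toy table. This is a minimal sketch in base R with made-up organism names and placeholder URLs; only the column name ftp_path and the filter expression come from the release note, everything else is an assumption.

```r
# Toy stand-in for a species summary table (hypothetical content).
assembly_summary <- data.frame(
  organism_name = c("Species A", "Species B", "Species C"),
  ftp_path = c("ftp://example.org/a", NA, "ftp://example.org/c"),
  stringsAsFactors = FALSE
)

# Keep only the rows that actually have an FTP path to download from.
downloadable <- assembly_summary[!is.na(assembly_summary$ftp_path), ]
print(downloadable)
```

Rows without a path are dropped before any download is attempted, so a single incomplete entry no longer stops a bulk retrieval.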

v0.9.1

4 years ago

Minor updates to comply with CRAN policy.

v0.9.0

5 years ago

Please be aware that as of April 2019, ENSEMBLGENOMES was retired (see details here). Hence, all biomartr functions were updated and won't support data retrieval from ENSEMBLGENOMES servers anymore.

New Functions

  • New function clean.retrieval() enables formatting and automatic unzipping of meta.retrieval output (find out more here: https://ropensci.github.io/biomartr/articles/MetaGenome_Retrieval.html#un-zipping-downloaded-files)
  • New function getGenomeSet() allows users to easily retrieve genomes of multiple specified species. In addition, the genome summary statistics for all retrieved species will be stored as well to provide users with insights regarding the genome assembly quality of each species. This file can be used as Supplementary Information file in publications to facilitate reproducible research.
  • New function getProteomeSet() allows users to easily retrieve proteomes of multiple specified species
  • New function getCDSSet() allows users to easily retrieve coding sequences of multiple specified species
  • New function getGFFSet() allows users to easily retrieve GFF annotation files of multiple specified species
  • New function getRNASet() allows users to easily retrieve RNA sequences of multiple specified species
  • New function summary_genome() allows users to retrieve summary statistics for a genome assembly file to assess the influence of genome assembly qualities when performing comparative genomics tasks
  • New function summary_cds() allows users to retrieve summary statistics for a coding sequence (CDS) file. We noticed that many CDS files stored in NCBI or ENSEMBL databases contain sequences whose length is not divisible by 3 (division into codons). This makes it difficult to divide CDS into codons for e.g. codon alignments or translation into protein sequences. In addition, some CDS files contain a significant number of sequences that do not start with AUG (start codon). This function enables users to quantify how many of these sequences exist in a downloaded CDS file, so that the files can be processed according to the analyses at hand.
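
The two checks described for summary_cds() can be illustrated without downloading anything. The following is a minimal sketch in base R on four made-up sequences; it is not the implementation of summary_cds(), and it tests for ATG because CDS files store DNA (AUG is the RNA form of the same start codon).

```r
# Four hypothetical coding sequences covering every combination of checks.
cds <- c(
  seq1 = "ATGGCCTAA",   # 9 nt, starts with ATG: passes both checks
  seq2 = "ATGGCCTA",    # 8 nt: length not divisible by 3
  seq3 = "GTGGCCTAA",   # 9 nt, but does not start with ATG
  seq4 = "CTGGCCTAAG"   # 10 nt and no ATG start: fails both checks
)

# Check 1: can the sequence be split cleanly into codons?
not_multiple_of_3 <- nchar(cds) %% 3 != 0
# Check 2: does the sequence begin with the start codon?
no_start_codon <- !startsWith(toupper(cds), "ATG")

cds_report <- c(
  n_sequences         = length(cds),
  n_not_multiple_of_3 = sum(not_multiple_of_3),
  n_no_start_codon    = sum(no_start_codon)
)
print(cds_report)
```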

New Features of Existing Functions

  • the default value of argument reference in meta.retrieval() changed from reference = TRUE to reference = FALSE. This way all genomes (reference AND non-reference) will be downloaded by default. This is what users seem to prefer.
  • getCollection() now also retrieves GTF files when db = 'ensembl'
  • getAssemblyStats() now also performs md5 checksum test
  • all md5 checksum tests now retrieve the new md5checkfile format from NCBI RefSeq and Genbank
  • getGTF(): users can now specify the NCBI Taxonomy ID or Accession ID in addition to the scientific name in argument 'organism' to retrieve genome assemblies
  • getGFF(): users can now specify the NCBI Taxonomy ID or Accession ID for ENSEMBL in addition to the scientific name in argument 'organism' to retrieve genome assemblies
  • getMarts() will now throw an error when BioMart servers cannot be reached (#36)
  • getGenome() now also stores the genome summary statistics (see ?summary_genome()) for the retrieved species in the documentation folder to provide users with insights regarding the genome assembly quality
  • In all get*() functions the default for argument reference is now set from reference = TRUE to reference = FALSE (= new default)
  • all get*() functions now receive a new argument release which allows users to retrieve specific release versions of genomes, proteomes, etc. from ENSEMBL and ENSEMBLGENOMES
  • all get*() functions receive two new arguments clean_retrieval and gunzip which allow users to unzip the downloaded files directly in the get*() function call and rename the file for more convenient downstream analyses
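
The md5 checksum tests mentioned in this list verify that a downloaded file is byte-identical to what the server published. The following is a minimal sketch of the idea using tools::md5sum() from base R on temporary files; it does not reproduce the NCBI md5checkfile format or the package's internal logic.

```r
# Stand-in for the file on the server and its published checksum.
original <- tempfile(fileext = ".txt")
writeLines("ACGTACGT", original)
expected_md5 <- unname(tools::md5sum(original))

# An intact download has the same checksum as the original.
intact_copy <- tempfile(fileext = ".txt")
file.copy(original, intact_copy)
intact_ok <- identical(unname(tools::md5sum(intact_copy)), expected_md5)

# A corrupted download (one base changed) does not.
corrupted_copy <- tempfile(fileext = ".txt")
writeLines("ACGTACGA", corrupted_copy)
corrupted_ok <- identical(unname(tools::md5sum(corrupted_copy)), expected_md5)

print(c(intact = intact_ok, corrupted = corrupted_ok))
```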

v0.8.0

5 years ago

v0.7.0

6 years ago

Function changes:

  • the function meta.retrieval() will now pick up the download at the organism where it left off and will report which species have already been retrieved

  • all get*() functions and the meta.retrieval() function receive a new argument reference which allows users to retrieve non-reference or non-representative genome versions when downloading from NCBI RefSeq or NCBI Genbank

  • the argument order in meta.retrieval() changed from meta.retrieval(kingdom, group, db, ...) to meta.retrieval(db,kingdom, group, ...) to make the argument order more consistent with the get*() functions

  • the argument order in getGroups() changed from getGroups(kingdom, db) to getGroups(db, kingdom) to make the argument order more consistent with the get*() and meta.retrieval() functions

New Functions:

  • new internal functions existingOrganisms() and existingOrganisms_ensembl() which check the organisms that have already been downloaded
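
The idea of resuming a bulk download by checking which organisms already exist can be sketched as a simple comparison between the requested species and the files on disk. This is an illustration in base R only: the file naming scheme and the matching rule below are assumptions, not the behaviour of existingOrganisms().

```r
# Species the user asked for.
requested <- c("Arabidopsis thaliana", "Arabidopsis lyrata", "Capsella rubella")

# Hypothetical listing of a download folder after an interrupted run.
existing_files <- c("Arabidopsis_thaliana_genomic_refseq.fna.gz")

# Assumed naming rule: file names start with the species name,
# with spaces replaced by underscores.
to_file_prefix <- function(x) gsub(" ", "_", x, fixed = TRUE)

already_done <- vapply(
  to_file_prefix(requested),
  function(prefix) any(startsWith(existing_files, prefix)),
  logical(1)
)

# Only these species still need to be downloaded.
remaining <- requested[!already_done]
print(remaining)
```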