Benchmarking common tasks on proteins in various languages and packages
Open source software packages to parse files in various formats from the Protein Data Bank (PDB) and manipulate protein structures exist in many languages, often as part of Bio* projects.
This repository aims to collate benchmarks for common tasks across various languages and packages. The collection of scripts may also be useful for getting an idea of how each package works.
Please feel free to contribute scripts from other packages, or submit improvements to the scripts already present. I'm looking for the fastest implementation for each package that makes use of its provided API.
Disclosure: I contributed the BioStructures.jl package to BioJulia and have made contributions to Biopython.
[1] Gajda MJ, hPDB - Haskell library for processing atomic biomolecular structures in protein data bank format, BMC Research Notes 2013, 6:483 - link
The PDB files can be downloaded to directory `data` by running `julia tools/download_data.jl` from this directory. If you have all the software installed, and compiled where applicable, you can run `sh tools/run_benchmarks.sh` from this directory to run the benchmarks and store the results in `benchmarks.csv`. The mean over a number of runs is taken for each benchmark to obtain the values below.
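As an illustration of the averaging described above, here is a minimal sketch in Python of timing a task over several runs and taking the mean elapsed time. The function name `mean_elapsed` and the run count are assumptions made for this example; the repository's own scripts may time things differently.

```python
import time
from statistics import mean


def mean_elapsed(task, runs=10):
    """Return the mean wall-clock time in seconds of calling task() `runs` times."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        times.append(time.perf_counter() - start)
    return mean(times)


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would parse a structure file here.
    elapsed = mean_elapsed(lambda: sum(range(100_000)), runs=5)
    print(f"mean elapsed: {elapsed * 1000:.3f} ms")
```

Passing the task as a callable keeps the timing loop independent of which package is being measured.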
Benchmarks were carried out on an Intel Xeon CPU E5-1620 v3 3.50GHz x 8 processor with 32 GB 2400 MHz DDR4 RAM. The operating system was CentOS v8.1. Time is the elapsed time.
Currently, 16 packages across 7 programming languages are included in the benchmarks:
Note that direct comparison between these times should be treated with caution, as each package does something slightly different. For example, things that increase parsing time include:
Each package supports these to varying degrees.
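To make this caveat concrete, the sketch below contrasts a flat parse of ATOM records with the extra work of building a chain/residue hierarchy and tracking alternate locations. It is a toy parser written for this illustration, not the code of any benchmarked package. The column positions follow the fixed-width PDB format, and the names `parse_atoms` and `build_hierarchy` are hypothetical.

```python
from collections import defaultdict


def parse_atoms(lines):
    """Flat parse: one dict per ATOM/HETATM record, using PDB fixed columns."""
    atoms = []
    for line in lines:
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),
                "altloc": line[16].strip(),  # empty string if no alt loc ID
                "resname": line[17:20].strip(),
                "chain": line[21],
                "resnum": int(line[22:26]),
                "coords": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms


def build_hierarchy(atoms):
    """Extra work: group atoms as chain -> residue -> atom name -> alt loc."""
    chains = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    for atom in atoms:
        residue = chains[atom["chain"]][(atom["resnum"], atom["resname"])]
        residue[atom["name"]][atom["altloc"]] = atom["coords"]
    return chains


if __name__ == "__main__":
    example = [
        "ATOM      1  N   THR A   1      17.047  14.099   3.625  1.00 13.79           N",
        "ATOM      2  CA ATHR A   1      16.967  12.784   4.338  0.50 10.80           C",
        "ATOM      3  CA BTHR A   1      16.900  12.800   4.400  0.50 10.80           C",
    ]
    atoms = parse_atoms(example)
    hierarchy = build_hierarchy(atoms)
    print(len(atoms), "atoms parsed")
    print("alt locs for CA:", sorted(hierarchy["A"][(1, "THR")]["CA"]))
```

A package that stops after the flat step does less work per record than one that also builds the hierarchy and resolves alternate locations, which is one reason raw parse times are not directly comparable.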
| | BioStructures | MIToS | Biopython | ProDy | MDAnalysis | biotite | atomium | Bio3D | Rpdb | BioJava | BioPerl | BioRuby | GEMMI | Victor | ESBTL | chemfiles-python | chemfiles-cxx |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Parse PDB 1CRN / ms | 0.75 | 0.63 | 7.3 | 3.1 | 4.2 | 4.4 | 7.0 | 10.0 | 9.5 | 8.1 | 43.0 | 21.0 | 0.24 | 7.6 | 2.4 | 4.5 | 0.67 |
Parse PDB 1HTQ / s | 2.6 | 2.8 | 16.0 | 2.1 | 1.5 | 4.8 | 20.0 | 2.9 | 14.0 | 1.3 | 49.0 | 13.0 | 0.36 | 11.0 | - | - | - |
Parse mmCIF 1CRN / ms | 2.0 | - | 16.0 | - | - | 4.8 | 13.0 | - | - | 40.0 | - | - | 0.97 | - | - | 3.8 | 0.99 |
Parse mmCIF 1HTQ / s | 8.0 | - | 45.0 | - | - | 9.0 | 36.0 | - | - | 17.0 | - | - | 1.5 | - | - | 2.0 | 2.0 |
Parse MMTF 1CRN / ms | 1.1 | - | 4.5 | - | - | 1.2 | 4.6 | - | - | 4.1 | - | - | - | - | - | 3.2 | 0.44 |
Parse MMTF 1HTQ / s | 3.6 | - | 16.0 | - | - | 0.16 | 43.0 | - | - | 0.74 | - | - | - | - | - | - | - |
Count / ms | 0.17 | 0.017 | 0.21 | 8.8 | 0.068 | - | - | 0.16 | 0.2 | - | 0.42 | 0.073 | 0.004 | - | - | 0.75 | 0.092 |
Distance / ms | 0.012 | 0.0044 | 0.25 | 50.0 | 0.62 | - | - | 19.0 | 1.3 | - | 0.53 | 0.32 | 0.001 | - | - | 0.55 | 0.19 |
Ramachandran / ms | 1.4 | - | 120.0 | 210.0 | 1200.0 | - | - | - | - | - | - | - | - | - | - | 7.4 | 2.1 |
Language | Julia | Julia | Python | Python | Python | Python | Python | R | R | Java | Perl | Ruby | C++/Python | C++ | C++ | Python | C++ |
License | MIT | MIT | Biopython | MIT | GPLv2 | BSD 3-Clause | MIT | GPLv2 | GPLv2/GPLv3 | LGPLv2.1 | GPL/Artistic | Ruby | MPLv2/LGPLv3 | GPLv3 | GPLv3 | BSD 3-Clause | BSD 3-Clause |
Hierarchical parsing | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
Supports disorder | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ |
Writes PDBs | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
Parses PDB header | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
Superimposition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
PCA | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
Benchmarks as a plot, sorted by increasing time to parse PDB 1CRN:
It is instructive to run parsers over the whole PDB to see where errors arise. This approach has led me to submit corrections for small mistakes (e.g. duplicate atoms, residue number errors) in a few PDB structures. As of July 2018, the PDB entries that raise errors with the Biopython (permissive mode) and BioJulia parsers are:
Running Biopython in non-permissive mode picks up more potential problems such as broken chains and mixed blank/non-blank alt loc IDs. For further discussion of errors in PDB files, see the Biopython documentation. The scripts to reproduce the whole PDB checking can be found in `checkwholepdb`. There is also a script to check recent PDB changes that can be run as a cron job.
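The idea behind the whole PDB check can be sketched as follows: try to parse every file in a directory and record the ones that raise an exception. This is a hedged illustration, not the scripts in the repository; `find_failures` and its `parse` argument are hypothetical names, and the `*.pdb` file pattern is an assumption about how the files are stored.

```python
from pathlib import Path


def find_failures(directory, parse, pattern="*.pdb"):
    """Return {filename: error message} for files where parse(path) raises."""
    failures = {}
    for path in sorted(Path(directory).glob(pattern)):
        try:
            parse(path)
        except Exception as exc:  # record any parser error and carry on
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return failures
```

With Biopython, for instance, `parse` could be a small wrapper around `PDBParser(PERMISSIVE=False).get_structure(...)`, so that stricter checks surface as exceptions.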
If you use these benchmarks, please cite the BioStructures.jl paper where they appear:
Greener JG, Selvaraj J and Ward BJ. BioStructures.jl: read, write and manipulate macromolecular structures in Julia, Bioinformatics 36(14):4206-4207 (2020) - link - PDF
If you want to contribute benchmarks for a package, please make a pull request with the script(s) in a directory like the other packages. I will then rerun the benchmarks and update the README. Thanks.