ScaffoldGraph is an open-source cheminformatics library, built using RDKit and NetworkX, for the generation and analysis of scaffold networks and scaffold trees.
Feature | SG | SNG | SH | STG |
---|---|---|---|---|
Computes Scaffold Networks | X | X | - | - |
Computes HierS Networks | X | - | - | - |
Computes Scaffold Trees | X | X | X | X |
Command Line Interface | X | X | - | X |
Graphical Interface | - * | - | X | - |
Accessible Library | X | - | - | - |
Results can be computed in parallel | X | X | - | - |
Benchmark for 150,000 molecules ** | 15m 25s | 27m 6s | - | - |
Limit on input molecules | N/A *** | 10,000,000 | 200,000 **** | 10,000,000 |
* While ScaffoldGraph has no explicit GUI, it contains functions for interactive scaffoldgraph visualization.
** Tests were performed on an Intel Core i7-6700 @ 3.4 GHz with 32 GB of RAM, without parallel processing. The code for STG could not be located and was therefore not benchmarked; SNG reports that both it and SH are faster in the benchmark test.
*** Limited by available memory.
**** The graphical interface has an upper limit of 2,000 scaffolds.
conda config --add channels conda-forge
conda install -c uclcheminformatics scaffoldgraph
# Basic installation.
pip install scaffoldgraph
# Install with ipycytoscape.
pip install scaffoldgraph[vis]
# Install with rdkit-pypi (Linux, MacOS).
pip install scaffoldgraph[rdkit]
# Install with all optional packages.
pip install scaffoldgraph[rdkit,vis]
Warning: rdkit cannot be installed with pip, so must be installed through other means
Update (17/06/21): rdkit can now be installed through the rdkit-pypi wheels for Linux and MacOS, and can be installed alongside ScaffoldGraph optionally (see above instructions).
Update (16/11/21): JupyterLab users may also need to follow additional installation instructions when using the ipycytoscape visualisation utility.
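Because rdkit and the visualisation extras can arrive through different channels (conda, pip wheels), it can help to confirm that everything is importable before running ScaffoldGraph. The following is a minimal sketch, not part of ScaffoldGraph itself: the helper `missing_modules` is hypothetical, and the import names `rdkit`, `networkx`, `scaffoldgraph` and `ipycytoscape` are assumed to match the packages named above.

```python
from importlib.util import find_spec

# Import names assumed for the dependencies mentioned in this section.
REQUIRED = ("rdkit", "networkx", "scaffoldgraph")
OPTIONAL = ("ipycytoscape",)


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_modules(names)
        status = "all found" if not missing else "missing " + ", ".join(missing)
        print(f"{label}: {status}")
```

The check only locates the modules; it does not import them, so it says nothing about whether the installed versions are compatible.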
The ScaffoldGraph CLI is largely analogous to that of SNG, consisting of a two-step process (Generate --> Aggregate).
ScaffoldGraph can be invoked from the command-line using the following command:
$ scaffoldgraph <command> <input-file> <options>
Where "command" is one of: tree, network, hiers, aggregate or select.
The first step of the process is to generate an intermediate scaffold graph. The generation commands are: network, hiers and tree
For example, if a user would like to generate a network from two files:
$ ls
file_1.sdf file_2.sdf
They would first use the commands:
$ scaffoldgraph network file_1.sdf file_1.tmp
$ scaffoldgraph network file_2.sdf file_2.tmp
Further options:
--max-rings, -m : ignore molecules with more than N rings (default: 10)
--flatten-isotopes, -i : remove specific isotopes
--keep-largest-fragment, -f : only process the largest disconnected fragment
--discharge-and-deradicalize, -d : remove charges and radicals from scaffolds
The second step of the process is aggregating the temporary files into a combined graph representation.
$ scaffoldgraph aggregate file_1.tmp file_2.tmp file.tsv
The final network is now available in 'file.tsv'. Output formats are explained below.
Further options:
--map-mols, -m <file> : generate a file mapping molecule IDs to scaffold IDs
--map-annotations <file> : generate a file mapping scaffold IDs to annotations
--sdf : write the output as an SDF file
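When many input files are involved, the two-step process above can be scripted. The sketch below only assembles the argument lists for the generate and aggregate steps, using the commands and the `--max-rings` flag listed above; the helper `build_pipeline` is hypothetical, and passing the ring limit as a value directly after the flag is an assumption.

```python
import shlex
from pathlib import Path


def build_pipeline(inputs, output, command="network", max_rings=None):
    """Return the argument lists for the generate and aggregate steps."""
    if command not in ("network", "hiers", "tree"):
        raise ValueError(f"not a generation command: {command}")
    steps, tmp_files = [], []
    for path in inputs:
        # One intermediate file per input, e.g. file_1.sdf -> file_1.tmp
        tmp = str(Path(path).with_suffix(".tmp"))
        args = ["scaffoldgraph", command, str(path), tmp]
        if max_rings is not None:
            # Assumption: the flag is followed by its integer value.
            args += ["--max-rings", str(max_rings)]
        steps.append(args)
        tmp_files.append(tmp)
    # Final step: combine every intermediate file into one graph file.
    steps.append(["scaffoldgraph", "aggregate", *tmp_files, output])
    return steps


if __name__ == "__main__":
    for step in build_pipeline(["file_1.sdf", "file_2.sdf"], "file.tsv"):
        print("$", shlex.join(step))
        # To execute for real: subprocess.run(step, check=True)
```

Keeping the commands as lists rather than strings avoids shell-quoting problems if a file name contains spaces.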
ScaffoldGraph allows a user to select a subset of a scaffold network or tree using a molecule-based query, i.e. selecting only scaffolds for molecules of interest.
This command can only be performed on an aggregated graph in TSV format (not on SDF output).
$ scaffoldgraph select <graph input-file> <input molecules> <output-file> <options>
Options:
<graph input-file> : A TSV graph constructed using the aggregate command
<input molecules> : Input query file (SDF, SMILES)
<output-file> : Write results to specified file
--sdf : Write the output as an SDF file
ScaffoldGraph's CLI utility supports input files in the SMILES and SDF formats. Other file formats can be converted using OpenBabel.
ScaffoldGraph expects a delimited file in which the first column contains a SMILES string and the second a molecule identifier. If an identifier is not specified, the program will use a hash of the molecule as the identifier.
Example SMILES file:
CCN1CCc2c(C1)sc(NC(=O)Nc3ccc(Cl)cc3)c2C#N CHEMBL4116520
CC(N1CC(C1)Oc2ccc(Cl)cc2)C3=Nc4c(cnn4C5CCOCC5)C(=O)N3 CHEMBL3990718
CN(C\C=C\c1ccc(cc1)C(F)(F)F)Cc2coc3ccccc23 CHEMBL4116665
N=C1N(C(=Nc2ccccc12)c3ccccc3)c4ccc5OCOc5c4 CHEMBL4116261
...
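The layout described above (SMILES first, identifier second, hash fallback when the identifier is absent) can be illustrated with a small reader. This is a sketch only, not ScaffoldGraph's own parser: the function `read_smiles_lines` is hypothetical, and hashing the SMILES text with SHA-256 is a stand-in, since the hash ScaffoldGraph uses is not specified here.

```python
import hashlib


def read_smiles_lines(lines, delimiter=None):
    """Yield (smiles, identifier) pairs from delimited SMILES records.

    delimiter=None splits on any run of whitespace.
    """
    for line in lines:
        fields = line.strip().split(delimiter)
        if not fields or not fields[0]:
            continue  # skip blank lines
        smiles = fields[0]
        if len(fields) > 1 and fields[1]:
            name = fields[1]
        else:
            # Stand-in fallback: hash the SMILES text itself.
            name = hashlib.sha256(smiles.encode("utf-8")).hexdigest()
        yield smiles, name


if __name__ == "__main__":
    example = [
        "CCN1CCc2c(C1)sc(NC(=O)Nc3ccc(Cl)cc3)c2C#N CHEMBL4116520",
        "N=C1N(C(=Nc2ccccc12)c3ccccc3)c4ccc5OCOc5c4",
    ]
    for smiles, name in read_smiles_lines(example):
        print(name, smiles)
```

A text hash differs between two SMILES strings that describe the same molecule, so supplying explicit identifiers remains the more predictable option.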
ScaffoldGraph expects an SDF file, where the molecule identifier is specified in the title line. If the title line is blank, then a hash of the molecule will be used as an identifier.
Note: selecting subsets of a graph will not be possible if a name is not supplied
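Since selection depends on named molecules, it may be worth checking an SDF file for blank title lines before generating a graph. The sketch below relies only on the SDF layout (the first line of each record is the title, and records end with a `$$$$` line); the helpers `sdf_titles` and `unnamed_records` are hypothetical and are not part of ScaffoldGraph.

```python
def sdf_titles(text):
    """Return the title line (first line) of each record in SDF text.

    Only records terminated by a '$$$$' line are considered.
    """
    titles = []
    lines = text.splitlines()
    start = 0  # index of the first line of the current record
    for i, line in enumerate(lines):
        if line.strip() == "$$$$":
            if i > start:
                titles.append(lines[start].strip())
            start = i + 1
    return titles


def unnamed_records(text):
    """Return the 0-based positions of records whose title line is blank."""
    return [i for i, title in enumerate(sdf_titles(text)) if not title]


if __name__ == "__main__":
    with open("my_sdf_file.sdf") as handle:
        blank = unnamed_records(handle.read())
    print(f"{len(blank)} record(s) without a title line")
```

For very large files a streaming variant would be preferable, as this version reads the whole file into memory.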
The generate commands (network, hiers, tree) produce an intermediate TSV file containing 4 columns:
The aggregate command produces a TSV file containing 4 columns
An SDF file can be produced by the aggregate and select commands. This SDF is formatted according to the SDF specification with added property fields:
ScaffoldGraph makes it simple to construct a graph using the library API. The resultant graphs follow the same API as a NetworkX DiGraph.
Some example notebooks can be found in the 'examples' directory.
import scaffoldgraph as sg
# construct a scaffold network from an SDF file
network = sg.ScaffoldNetwork.from_sdf('my_sdf_file.sdf')
# construct a scaffold tree from a SMILES file
tree = sg.ScaffoldTree.from_smiles('my_smiles_file.smi')
# construct a scaffold tree from a pandas dataframe
import pandas as pd
df = pd.read_csv('activity_data.csv')
tree = sg.ScaffoldTree.from_dataframe(
    df, smiles_column='Smiles', name_column='MolID',
    data_columns=['pIC50', 'MolWt'], progress=True,
)
Multi-processing
It is simple to construct a graph from multiple input sources in parallel, using the concurrent.futures module and the sg.utils.aggregate function.
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import scaffoldgraph as sg
import os

directory = './data'
# os.listdir returns bare file names, so join each one with the directory
sdf_files = [
    os.path.join(directory, f)
    for f in os.listdir(directory)
    if f.endswith('.sdf')
]

func = partial(sg.ScaffoldNetwork.from_sdf, ring_cutoff=10)

graphs = []
with ProcessPoolExecutor(max_workers=4) as executor:
    # executor.map yields the finished graphs (not Future objects), in input order
    for graph in executor.map(func, sdf_files):
        graphs.append(graph)

network = sg.utils.aggregate(graphs)
Creating custom scaffold prioritisation rules
If required, a user can define their own rules for prioritizing scaffolds during scaffold tree construction. Rules can be defined by subclassing one of four rule classes:
BaseScaffoldFilterRule, ScaffoldFilterRule, ScaffoldMinFilterRule or ScaffoldMaxFilterRule
When subclassing, a name property must be defined together with a condition, get_property or filter method, depending on the base class. Examples are shown below:
import scaffoldgraph as sg
from scaffoldgraph.prioritization import *

"""
Scaffold filter rule (must implement name and condition)
The filter will retain all scaffolds which return a True condition
"""

class CustomRule01(ScaffoldFilterRule):
    """Do not remove rings with >= 12 atoms if there are smaller rings to remove"""

    def condition(self, child, parent):
        removed_ring = child.rings[parent.removed_ring_idx]
        return removed_ring.size < 12

    @property
    def name(self):
        return 'custom rule 01'

"""
Scaffold min/max filter rule (must implement name and get_property)
The filter will retain all scaffolds with the min/max property value
"""

class CustomRule02(ScaffoldMinFilterRule):
    """Smaller rings are removed first"""

    def get_property(self, child, parent):
        return child.rings[parent.removed_ring_idx].size

    @property
    def name(self):
        return 'custom rule 02'

"""
Scaffold base filter rule (must implement name and filter)
The filter method must return a list of filtered parent scaffolds
This rule is used when a more complex rule is required; this example
defines a tiebreaker rule. Only one scaffold must be left at the end
of all filter rules in a rule set
"""

class CustomRule03(BaseScaffoldFilterRule):
    """Tie-breaker rule (alphabetical)"""

    def filter(self, child, parents):
        return [sorted(parents, key=lambda p: p.smiles)[0]]

    @property
    def name(self):
        return 'custom rule 03'
Custom rules can subsequently be added to a rule set and supplied to the scaffold tree constructor:
ruleset = ScaffoldRuleSet(name='custom rules')
ruleset.add_rule(CustomRule01())
ruleset.add_rule(CustomRule02())
ruleset.add_rule(CustomRule03())
graph = sg.ScaffoldTree.from_sdf('my_sdf_file.sdf', prioritization_rules=ruleset)
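To make the requirement that only one scaffold remains after all rules more concrete, the following sketch mimics a rule set with plain Python stand-ins. It does not use ScaffoldGraph's classes and is not the library's implementation: `Parent`, `keep_if`, `keep_min`, `tiebreak_alphabetical` and `prioritize` are hypothetical names, the stand-in rules see only the candidate parents rather than `(child, parent)` pairs, and keeping all candidates when a condition rejects every one is a design choice of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parent:
    """Hypothetical stand-in for a candidate parent scaffold."""
    smiles: str
    removed_ring_size: int


def keep_if(condition):
    """Filter-style rule: keep the parents that satisfy `condition`."""
    def rule(parents):
        kept = [p for p in parents if condition(p)]
        return kept or parents  # sketch choice: never discard every candidate
    return rule


def keep_min(get_property):
    """Min-style rule: keep the parents with the smallest property value."""
    def rule(parents):
        best = min(get_property(p) for p in parents)
        return [p for p in parents if get_property(p) == best]
    return rule


def tiebreak_alphabetical(parents):
    """Base-style rule: return a single parent, chosen by SMILES order."""
    return [min(parents, key=lambda p: p.smiles)]


def prioritize(parents, rules):
    """Apply the rules in order until a single parent remains."""
    candidates = list(parents)
    if not candidates:
        raise ValueError("no candidate parents supplied")
    for rule in rules:
        if len(candidates) == 1:
            break
        candidates = rule(candidates)
    if len(candidates) != 1:
        raise ValueError("rule set did not reduce the candidates to one parent")
    return candidates[0]
```

In this sketch, a rule set without a final tiebreaker raises an error whenever two candidates survive, which is why the example rule set above ends with CustomRule03.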
Contributions to ScaffoldGraph will most likely fall into the following categories:
Please send Pull Requests to: http://github.com/UCLCheminformatics/ScaffoldGraph
ScaffoldGraph's tests are located under tests/. Run all tests using:
$ python setup.py test
or run an individual test: pytest --no-cov tests/core
When contributing new features, please include appropriate test files.
ScaffoldGraph uses Travis CI for continuous integration.
If you use this software in your own work please cite our paper, and the respective papers of the methods used.
@article{10.1093/bioinformatics/btaa219,
author = {Scott, Oliver B and Chan, A W Edith},
title = "{ScaffoldGraph: an open-source library for the generation and analysis of molecular scaffold networks and scaffold trees}",
journal = {Bioinformatics},
year = {2020},
month = {03},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa219},
url = {https://doi.org/10.1093/bioinformatics/btaa219},
note = {btaa219},
eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaa219/32984904/btaa219.pdf},
}