A machine learning method for the discovery of the minimum marker gene combinations for cell type identification from single-cell RNA sequencing
[Release Note:] Pre-release of NS-Forest v4.0.
Follow the tutorial to get started.
Download 'NSForest_v4dot0_dev.py' and use it in place of the version referenced in the tutorial. Sample code is shown below.
adata_median = preprocessing_medians(adata, cluster_header)
adata_median.varm["medians_" + cluster_header].stack().plot.hist(bins=30, title = 'cluster medians')
adata_median_binary = preprocessing_binary(adata_median, cluster_header, "medians_" + cluster_header)
adata_median_binary.varm["binary_scores_" + cluster_header].stack().plot.hist(bins=30, title='binary scores')
## make a copy of prepared adata
adata_prep = adata_median_binary.copy()
NSForest(adata_prep, cluster_header=cluster_header, n_trees=1000, n_genes_eval=6,
         medians_header="medians_" + cluster_header,
         binary_scores_header="binary_scores_" + cluster_header,
         gene_selection="BinaryFirst_high", outputfilename="BinaryFirst_high")
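For intuition about what the first histogram in the sample code summarizes, the per-cluster median step can be sketched in plain Python. This is an illustrative sketch only, not the NS-Forest implementation: the helper name `cluster_medians` and the toy data are hypothetical, and the real `preprocessing_medians` works on an AnnData object rather than on dictionaries.

```python
from collections import defaultdict
from statistics import median

def cluster_medians(expression, labels):
    """Return {gene: {cluster: median expression}} for a toy dataset.

    expression: {gene: [value per cell]}; labels: [cluster label per cell].
    Hypothetical helper for illustration; not part of NS-Forest.
    """
    result = {}
    for gene, values in expression.items():
        by_cluster = defaultdict(list)
        for value, label in zip(values, labels):
            by_cluster[label].append(value)
        result[gene] = {cl: median(vals) for cl, vals in by_cluster.items()}
    return result

# Six toy cells in two clusters.
labels = ["A", "A", "A", "B", "B", "B"]
expression = {
    "gene1": [5.0, 6.0, 7.0, 0.0, 0.0, 1.0],  # high in A, near zero in B
    "gene2": [2.0, 2.0, 3.0, 2.0, 3.0, 3.0],  # similar in both clusters
}
medians = cluster_medians(expression, labels)
print(medians["gene1"])
print(medians["gene2"])
```

A gene such as `gene1`, whose median is high in one cluster and zero elsewhere, is the kind of candidate the later binary scoring step is meant to favor.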
Full Changelog: https://github.com/JCVenterInstitute/NSForest/compare/v3.9...v4.0_dev
[Release Note:] Major code optimizations based on algorithm v3.0. No algorithmic change to v3.0.
Parameter names changed from v3.0:
[old name] = [new name]
threads = n_jobs
howManyInformativeGenes2test = n_top_genes
InformativeGenes = n_binary_genes
clusterLabelcolumnHeader = cluster_header
rfTrees = n_trees
Median_Expression_Level = median_cutoff = 0 #set to 0
Genes_to_testing = n_genes_eval
dataDummy = df_dummies
column = cl
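If you have scripts written against the v3.0 names, a small translation table can help when updating calls. The sketch below is a hypothetical convenience helper, not part of NS-Forest. It assumes the first seven entries in the list are keyword arguments; the last two (`dataDummy`, `column`) look like internal variable names, so they are left out.

```python
# Old (v3.0) -> new names, copied from the list above.
RENAMED = {
    "threads": "n_jobs",
    "howManyInformativeGenes2test": "n_top_genes",
    "InformativeGenes": "n_binary_genes",
    "clusterLabelcolumnHeader": "cluster_header",
    "rfTrees": "n_trees",
    "Median_Expression_Level": "median_cutoff",
    "Genes_to_testing": "n_genes_eval",
}

def migrate_kwargs(old_kwargs):
    """Translate v3.0-style keyword names; unknown names pass through unchanged."""
    return {RENAMED.get(name, name): value for name, value in old_kwargs.items()}

old_call = {
    "clusterLabelcolumnHeader": "louvain",
    "rfTrees": 1000,
    "Median_Expression_Level": 0,
    "Genes_to_testing": 6,
}
new_call = migrate_kwargs(old_call)
print(new_call)
```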
NS-Forest can be installed using pip:
sudo pip install nsforest
If you are using a machine on which you lack administrative access, NS-Forest can be installed locally using pip:
pip install --user nsforest
NS-Forest can also be installed using conda:
conda install -c ttl074 nsforest
The package will be uploaded to the official conda channel soon.
Prerequisites:
Follow the tutorial to get started.
If you download 'NSForest_v3dot9_2.py' directly, use it in place of the version referenced in the tutorial.
If you download the pip or conda package, use the following in the tutorial.
import nsforest as ns
ns.NSForest()
Earlier versions are managed in Releases.
Version 2 and beyond:
Aevermann BD, Zhang Y, Novotny M, Keshk M, Bakken TE, Miller JA, Hodge RD, Lelieveldt B, Lein ES, Scheuermann RH. A machine learning method for the discovery of minimum marker gene combinations for cell-type identification from single-cell RNA sequencing. Genome Res. 2021 Jun 4:gr.275569.121. doi: 10.1101/gr.275569.121.
Version 1.3/1.0:
Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH. Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.
This project is licensed under the MIT License.
Full Changelog: https://github.com/JCVenterInstitute/NSForest/compare/v3.0...v3.9
[Release Note:] Major code optimizations based on algorithm v3.0. No change to the algorithm itself.
Download NSForest_v3dot9_1.py.
Follow the tutorial to get started.
This is version 3.9.1. Earlier releases are managed in Releases.
Version 2 and beyond:
Aevermann BD, Zhang Y, Novotny M, Keshk M, Bakken TE, Miller JA, Hodge RD, Lelieveldt B, Lein ES, Scheuermann RH. A machine learning method for the discovery of minimum marker gene combinations for cell-type identification from single-cell RNA sequencing. Genome Res. 2021 Jun 4:gr.275569.121. doi: 10.1101/gr.275569.121.
Version 1.3/1.0:
Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH. Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.
This project is licensed under the MIT License.
Full Changelog: https://github.com/JCVenterInstitute/NSForest/compare/v3.0...v3.1
[Release note:] The new version of NS-Forest has been redeveloped to operate directly on a scanpy object. The algorithm is essentially the same, and in testing it returns identical results to NS-Forest v2.0 when the same parameters are used.
Install Python 3.6 or above. Download the NSForest_v3.py file.
from NSForest_v3 import *
import itertools
adata_markers = NS_Forest(adata)  # runs NS_Forest on a scanpy object
# list of minimal markers from the dataframe, for display in scanpy plotting functions
Markers = list(itertools.chain.from_iterable(adata_markers['NSForest_Markers']))
# list of binary markers from the dataframe, for display in scanpy plotting functions
Binary_Markers = list(itertools.chain.from_iterable(adata_markers['Binary_Genes']))
NS_Forest(adata, clusterLabelcolumnHeader = "louvain", rfTrees = 1000, Median_Expression_Level = 0, Genes_to_testing = 6, betaValue = 0.5)
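The `itertools.chain.from_iterable` calls above flatten a column whose entries are per-cluster gene lists into one flat list. A minimal sketch with a stand-in dictionary shows the effect. The gene names and the `toy_results` structure are made up for illustration; the real return value is a dataframe.

```python
import itertools

# Stand-in for the 'NSForest_Markers' column: one list of genes per cluster.
toy_results = {
    "NSForest_Markers": [["GENE_A", "GENE_B"], ["GENE_C"], ["GENE_B", "GENE_D"]],
}

# Flatten the per-cluster lists into a single list, as in the snippet above.
markers = list(itertools.chain.from_iterable(toy_results["NSForest_Markers"]))

# Optional: drop repeated genes while keeping first-seen order, in case the
# same gene is a marker for more than one cluster.
unique_markers = list(dict.fromkeys(markers))

print(markers)
print(unique_markers)
```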
Necessary and Sufficient Forest is a method that takes cluster results from single cell/nuclei RNAseq experiments and generates lists of minimal markers needed to define each “cell type cluster”.
The method begins by re-encoding the cluster labels into binary classifications, and Random Forest models are generated comparing each cluster versus all. The top fifteen genes are then reranked using a score measuring how binary they are, e.g., a gene with expression in the target cluster but no expression in the other clusters would have a high binary score. Expression cutoffs for the top six genes ranked by binary score are then determined by generating individual decision trees and extracting the decision path information. Then all combinations of the top six most binary genes are evaluated using f-beta score as an objective function (the beta value default set at 0.5, which weights the f-measure score more toward precision as opposed to recall).
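The final step described above, scoring every combination of candidate genes with an f-beta objective, can be sketched as follows. This is a simplified illustration under stated assumptions, not the NS-Forest code: the function names and toy data are hypothetical, and it assumes a cell is called positive only when every gene in the combination is above its cutoff.

```python
from itertools import combinations

def fbeta(tp, fp, fn, beta=0.5):
    """F-beta from confusion counts; beta < 1 weights precision over recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_combination(calls, in_cluster, beta=0.5):
    """Score every non-empty gene combination and return (best score, genes).

    calls: {gene: [True if the cell is above that gene's cutoff]}.
    in_cluster: [True if the cell belongs to the target cluster].
    """
    best = (0.0, ())
    genes = sorted(calls)
    for size in range(1, len(genes) + 1):
        for combo in combinations(genes, size):
            predicted = [all(calls[g][i] for g in combo) for i in range(len(in_cluster))]
            tp = sum(p and t for p, t in zip(predicted, in_cluster))
            fp = sum(p and not t for p, t in zip(predicted, in_cluster))
            fn = sum(t and not p for p, t in zip(predicted, in_cluster))
            score = fbeta(tp, fp, fn, beta)
            if score > best[0]:
                best = (score, combo)
    return best

# Eight toy cells; the first four belong to the target cluster.
in_cluster = [True, True, True, True, False, False, False, False]
calls = {
    "g1": [True, True, True, True, True, True, False, False],
    "g2": [True, True, True, True, False, False, True, True],
    "g3": [True, True, False, False, False, False, False, False],
}
score, combo = best_combination(calls, in_cluster)
print(combo, score)
```

With six candidate genes there are 2^6 - 1 = 63 non-empty combinations, so an exhaustive search stays cheap. In the toy data neither `g1` nor `g2` is specific alone, but together they identify the target cluster exactly.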
See code for detailed comments.
This is version 3.0. Earlier releases are described in the publications below.
Version 2:
Aevermann BD, Zhang Y, Novotny M, Keshk M, Bakken TE, Miller JA, Hodge RD, Lelieveldt B, Lein ES, Scheuermann RH. A machine learning method for the discovery of minimum marker gene combinations for cell-type identification from single-cell RNA sequencing. Genome Res. 2021 Jun 4:gr.275569.121. doi: 10.1101/gr.275569.121. Epub ahead of print. PMID: 34088715.
Version 1.3/1.0:
Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH. Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.
This project is licensed under the MIT License - see https://opensource.org/licenses/MIT for details.
Full Changelog: https://github.com/JCVenterInstitute/NSForest/compare/v2.0...v3.0
Install Jupyter Notebook and Python 2.7.
Necessary and Sufficient Forest is a method that takes cluster results from single cell/nuclei RNAseq experiments and generates lists of minimal markers needed to define each “cell type cluster”.
The method begins by re-encoding the cluster labels into binary classifications, and Random Forest models are generated comparing each cluster versus all. The top fifteen genes are then reranked using a score measuring how binary they are, e.g., a gene with expression in the target cluster but no expression in the other clusters would have a high binary score. Expression cutoffs for the top six genes ranked by binary score are then determined by generating individual decision trees and extracting the decision path information. Then all permutations of the top six most binary genes are evaluated using f-beta score as an objective function (the beta value default set at 0.5, which weights the f-measure score more toward precision as opposed to recall).
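To make the reranking idea concrete, here is one possible formulation of a "how binary is this gene" score computed from per-cluster medians. The exact formula is an assumption made for illustration (each other cluster contributes 1 when it is silent and 0 when its median matches or exceeds the target's); check the code and the publication for the definition NS-Forest actually uses. The function name and toy values are hypothetical.

```python
def binary_score(medians, target):
    """Score in [0, 1]; 1 means the gene is expressed in the target cluster only.

    medians: {cluster: median expression of one gene}.
    Assumed formulation for illustration, not taken from the NS-Forest code.
    """
    target_median = medians[target]
    others = [m for cl, m in medians.items() if cl != target]
    if target_median <= 0 or not others:
        return 0.0  # no expression in the target cluster, or nothing to compare
    # A silent cluster contributes 1; a cluster at or above the target's
    # median contributes 0; anything in between contributes proportionally.
    return sum(max(0.0, 1.0 - m / target_median) for m in others) / len(others)

perfect = binary_score({"A": 6.0, "B": 0.0, "C": 0.0}, "A")
shared = binary_score({"A": 6.0, "B": 3.0, "C": 6.0}, "A")
print(perfect, shared)
```

Under this assumed formula the first gene scores 1.0 and the second 0.25, so the first would rank higher when the top Random Forest genes are reordered.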
See code for detailed comments.
This is version 2.0. The initial release was version 1.3. Version 1.0 was described in:
Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH. Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.
This project is licensed under the MIT License - see the LICENSE.md file for details
Install Jupyter Notebook and Python 2.7.
Necessary and Sufficient Forest is a method that takes cluster results from single cell/nuclei RNAseq experiments and generates lists of minimal markers needed to define each “cell type cluster”.
The method begins by re-encoding the cluster labels into binary classifications, and Random Forest models are generated comparing each cluster versus all. The top ten ranked features from the Random Forest are then tested using f-measure as an objective function. For example, during the first step all top ten features are independently evaluated for their discriminatory power at an expression value where 75% of the cells have greater than or equal expression. Given that 25% of the cells are lost de facto, the maximum f-measure for the first step is estimated to be around 0.87 (there will be cases where it is higher or lower, such as when expression is equal across all cells). After the best f-measure for a single gene is found, the remaining nine genes are tested in combination with that first gene, again using an expression value where 75% of the cells have expression. After the best pair of genes is found, the remaining eight genes are tested in third position, and so on until the analysis reaches combinations of six genes.
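The stepwise search described above can be sketched as a greedy forward selection. This is a simplified illustration, not the v1.3 code: the function names and toy data are hypothetical, and it assumes the 75% cutoff is taken over cells in the target cluster and that a cell is called positive only when it meets the cutoff for every selected gene.

```python
import math

def f_measure(predicted, truth):
    """F1 score (harmonic mean of precision and recall) for boolean lists."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    if tp == 0:
        return 0.0
    precision = tp / sum(predicted)
    recall = tp / sum(truth)
    return 2 * precision * recall / (precision + recall)

def cutoff_75(values_in_cluster):
    """Largest observed value that at least 75% of the target cells meet or exceed."""
    ordered = sorted(values_in_cluster, reverse=True)
    keep = math.ceil(0.75 * len(ordered))
    return ordered[keep - 1]

def greedy_markers(expression, truth, max_genes=6):
    """Add one gene per step, keeping the gene that gives the best f-measure."""
    cutoffs = {
        gene: cutoff_75([v for v, t in zip(values, truth) if t])
        for gene, values in expression.items()
    }
    chosen, steps = [], []
    remaining = list(expression)
    while remaining and len(chosen) < max_genes:
        scored = []
        for gene in remaining:
            trial = chosen + [gene]
            predicted = [
                all(expression[g][i] >= cutoffs[g] for g in trial)
                for i in range(len(truth))
            ]
            scored.append((f_measure(predicted, truth), gene))
        score, gene = max(scored, key=lambda item: item[0])
        chosen.append(gene)
        remaining.remove(gene)
        steps.append((tuple(chosen), score))
    return steps

# Eight toy cells; the first four belong to the target cluster.
truth = [True, True, True, True, False, False, False, False]
expression = {
    "gA": [9, 8, 7, 1, 8, 0, 0, 0],
    "gB": [5, 5, 5, 5, 0, 5, 5, 0],
}
steps = greedy_markers(expression, truth)
for genes, score in steps:
    print(genes, round(score, 3))
```

In this toy run the pair reaches perfect precision with 75% recall, giving an f-measure of 2(0.75)/(1.75), about 0.857, which is close to the ceiling estimated in the text.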
See code for detailed comments.
The initial release is version 1.3. Version 1.0 was described in:
Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH. Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum Mol Genet. 2018 May 1;27(R1):R40-R47. doi: 10.1093/hmg/ddy100.
This project is licensed under the MIT License - see the LICENSE.md file for details