A python pipeline to identify IS (Insertion Sequence) elements in genome and metagenome
.faa under directory results/proteome after you run ISEScan on your genome sequences. For how to do so, please check my comments from May 2022.
ISEScan is a Python pipeline to identify IS (Insertion Sequence) elements in genomes. It includes an option to report either complete IS elements or both complete and partial IS elements. Reporting both complete and partial IS elements is worth trying when ISEScan is used to identify IS elements in metagenome assemblies. ISEScan reports both complete and partial IS elements by default.
ISEScan was developed using Python3. It 1) scans a genome (or metagenome) in fasta format; 2) predicts and translates (using FragGeneScan) the genome into a proteome; 3) searches the pre-built pHMMs (profile Hidden Markov Models) of transposases (two files shipped with ISEScan: clusters.faa.hmm and clusters.single.faa) against the proteome and identifies the transposase genes in the genome; 4) extends each identified transposase gene into a complete IS (Insertion Sequence) element based on the common characteristics shared by the known IS elements reported in the literature and databases; 5) finally reports the identified IS elements in a few result files (e.g. a file containing a list of IS elements, a file containing sequences of IS elements in fasta format, an annotation file in GFF3 format).
Zhiqun Xie, Haixu Tang. ISEScan: automated identification of Insertion Sequence Elements in prokaryotic genomes. Bioinformatics, 2017, 33(21): 3340-3347.
Download: full text, SupplementaryMaterials.docx, SupplementaryMaterials.xlsx.
Zhiqun Xie: [email protected]
ISEScan was tested on Linux only and can be installed from Bioconda packages or from source code. Installing from Bioconda is recommended as it is the simplest way for inexperienced users.
I have no idea how ISEScan behaves on Mac as I only fully tested it on Linux. If you cannot install ISEScan on Mac from Bioconda, you can try installing ISEScan from source code. When installing from source code, I know there was an issue compiling FragGeneScan on Mac, but I once solved it. To solve the problem of running FragGeneScan on Mac, please modify two source files in the FragGeneScan source code: 1) open util_lib.c and comment out ‘#include <malloc.h>’ on line 3; 2) open hmm_lib.c, comment out ‘#include <malloc.h>’ on line 6 and replace values.h with limits.h on line 4. The modified FragGeneScan runs on Mac and Linux without problems according to my test results.
The steps below will install the ISEScan package via Bioconda to /apps/inst/miniconda3/. You can install ISEScan to another location by changing the default miniconda3 install path in step Install Miniconda3. Visit Bioconda recipe for ISEScan for more details (Thanks both pbasting and tseemann for making it available!).
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
Answer yes if you have no idea about the questions, e.g.:
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes
rm Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install isescan
isescan.py -h
cp /apps/inst/miniconda3/test/NC_012624.fna ./
isescan.py --seqfile NC_012624.fna --output results --nthread 2
Note: replace /apps/inst/miniconda3 in commands with your conda install path.
If the system reports isescan.py: command not found..., please add the ISEScan package to your PATH (replace /apps/inst/miniconda3 in the command below with your conda install path):
export PATH=/apps/inst/miniconda3/bin/:$PATH
Then, try ISEScan again:
isescan.py --seqfile NC_012624.fna --output results --nthread 2
Install ISEScan
Download the latest ISEScan from https://github.com/xiezhq/ISEScan/releases, e.g. Source code (tar.gz).
Uncompress the .zip (or .tar.gz) file.
unzip v1.7.2.2.zip
tar -zvxf v1.7.2.2.tar.gz
This will create an ISEScan folder, e.g. ISEScan-1.7.2.2. You need to go to the ISEScan folder to configure and run it.
cd ISEScan-1.7.2.2
Install dependencies before you run ISEScan
cd ssw201507
gcc -Wall -O3 -pipe -fPIC -shared -rdynamic -o libssw.so ssw.c ssw.h
cp libssw.so ../
export LD_LIBRARY_PATH=/home/xiezhq/projects/ISEScan-1.7.2.2:$LD_LIBRARY_PATH
In command export LD_LIBRARY_PATH=/home/xiezhq/projects/ISEScan-1.7.2.2:$LD_LIBRARY_PATH, please replace /home/xiezhq/projects/ISEScan-1.7.2.2 with the actual path of libssw.so on your computer!
Add the required packages to your $PATH before you run ISEScan
export PATH=$PATH:/apps/inst/FragGeneScan1.30:/apps/inst/hmmer-3.3/bin:/apps/inst/ncbi-blast-2.10.0+/bin
In the export command above, please replace /apps/inst/FragGeneScan1.30, /apps/inst/hmmer-3.3/bin and /apps/inst/ncbi-blast-2.10.0+/bin with the actual paths of FragGeneScan, HMMER and BLAST on your computer!
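Before running ISEScan from source, it may help to confirm that the external tools are actually reachable through PATH. The sketch below is a minimal check using only Python's standard library. The executable names in REQUIRED_TOOLS are assumptions about what FragGeneScan, HMMER and BLAST install, not names taken from this document, so adjust them to match your system.

```python
import shutil

# Assumed executable names (hypothetical; verify against your own installs).
REQUIRED_TOOLS = ["FragGeneScan", "hmmsearch", "phmmer", "blastn", "makeblastdb"]


def find_missing_tools(tools, path=None):
    """Return the tools that cannot be located on PATH (or on the given path string)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

If a tool is reported as missing, re-check the directories you appended in the export PATH command above.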
The latest version becomes available on Bioconda a few hours or days after it is released on https://github.com/xiezhq/ISEScan. You can run the command below to upgrade an existing ISEScan if it was installed from Bioconda.
conda update isescan
With a manual upgrade, you can get the latest version immediately from https://github.com/xiezhq/ISEScan. It is quite easy to upgrade an existing ISEScan to the latest version: copy all .py files from the latest version to the ISEScan install directory. Run which isescan.py to help find where it is on your system.
which isescan.py
/apps/inst/miniconda3/bin/isescan.py
tar -zxf v1.7.2.2.2.tar.gz
cd ISEScan-1.7.2.2.2/
cp *.py /apps/inst/miniconda3/bin/
python3 isescan.py --version
or
isescan.py --version
python3 isescan.py --seqfile /apps/inst/miniconda3/test/NC_012624.fna --output /home/xiezhq/results --nthread 2
Let's try an example, NC_012624.fna.
The command below scans NC_012624.fna (genome sequence of Sulfolobus_islandicus_Y_N_15_51, ~42 kb), and outputs all results in the results directory:
cp /apps/inst/miniconda3/test/NC_012624.fna ./
isescan.py --seqfile NC_012624.fna --output results --nthread 2
Note: run isescan.py -h or isescan.py --help to get help.
Wait for it to finish. It may take a while (~40 seconds) because ISEScan uses HMMER to scan the genome sequences, searching 621 profile HMMs against each protein sequence (predicted by FragGeneScan) in the genome. HMMER searches are usually more sensitive but slower than regular BLAST searches for remote homologs. The running time increases quickly for larger genomes, e.g. about 20 minutes for NC_000913.fna (genome sequence of Escherichia coli str. K-12 substr. MG1655, ~4.6 Mb) with two CPU cores on my virtual machine.
After ISEScan finishes running, you can find the output files in the results directory:
Details about NC_012624.fna.sum:
#, followed by the summarization of IS content for each sequence in NC_012624. The last line is the summarization of IS content for all sequences in NC_012624.
> in NC_012624.fna, usually the text between > and the first blank character in a header line
Details about NC_012624.fna.csv (NC_012624.fna.tsv, NC_012624.fna.raw):
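Because the .csv and .tsv result files are plain delimited text, they can be read with Python's standard csv module. The sketch below counts how many rows share each value of a chosen column. The result path and the column name shown in the usage comment are assumptions for illustration only; check the header line of your own result file for the actual column names.

```python
import csv
from collections import Counter


def count_by_column(path, column, delimiter=","):
    """Count how many rows share each value of `column` in a delimited result file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not found in {path}")
        return Counter(row[column] for row in reader)


# Hypothetical usage (path and column name are assumptions, not documented here):
#   counts = count_by_column("results/NC_012624.fna.csv", "family")
#   for value, n in counts.most_common():
#       print(value, n)
# For the tab-separated file, pass delimiter="\t".
```

Raising KeyError on an unknown column is a deliberate choice here, so that a wrong guess about the header fails loudly instead of returning empty counts.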
Sometimes, we want to run hundreds of genomes with one command line and then wait for all computing jobs to complete. Before doing so, we assume:
conda activate base
isescan.py --seqfile NC_012624.fna --output results
python3 /home/xiezhq/projects/ISEScan-1.7.2.2/isescan.py --seqfile genome1.fa --output results
where genome1.fa is your genome sequence file in fasta format. By default, ISEScan will use one CPU core, but you can change this with the command option --nthread NTHREAD, e.g.
isescan.py --seqfile genome1.fa --output results --nthread 2
Now, let's run 200 genomes in one line of command and then wait for all computing jobs to complete (probably several days or weeks, depending on how many hours are required for each of your 200 genomes on average). If your computer has 8 CPU cores, you can execute the command below:
nohup cat test.fna.list | xargs -n 1 -P 4 -I{} isescan.py --seqfile {} --output results --nthread 2 > log.txt &
In the command line, test.fna.list is a text file listing one genome sequence file per line, e.g.:
/N/dc2/scratch/zhiqxie/hmp/HMASM/SRS014235.scaffolds.fa
/N/dc2/scratch/zhiqxie/hmp/HMASM/SRS049959.scaffolds.fa
/N/dc2/scratch/zhiqxie/hmp/HMASM/SRS020233.scaffolds.fa
/N/dc2/scratch/zhiqxie/hmp/HMASM/SRS022609.scaffolds.fa
/N/dc2/scratch/zhiqxie/hmp/HMASM/SRS024132.scaffolds.fa
top -c -u xiezhq
(assuming your user name is xiezhq).
It might take several days or weeks for 200 genomes to complete, depending on how many CPU cores your computer has and how fast each core is. Please do not launch too many ISEScan jobs at once, because each ISEScan job consumes part of the RAM on your computer. You can always run a single genome first to estimate how many GB of RAM and how many hours one genome requires.
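The xargs command above runs 4 ISEScan jobs at a time with 2 threads each, which accounts for the 8 CPU cores (4 x 2 = 8). The same scheduling can be sketched in Python with the standard library, which also makes it easy to collect each job's exit code. The function names below are hypothetical helpers; only the isescan.py options (--seqfile, --output, --nthread) come from this document.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_genome_list(path):
    """Read one genome file path per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def build_command(seqfile, output="results", nthread=2):
    """Build the isescan.py command line for one genome file."""
    return ["isescan.py", "--seqfile", seqfile,
            "--output", output, "--nthread", str(nthread)]


def run_all(seqfiles, jobs=4, output="results", nthread=2):
    """Run up to `jobs` ISEScan processes at a time.

    Returns a list of (seqfile, exit code) pairs in input order.
    """
    def run_one(seqfile):
        completed = subprocess.run(build_command(seqfile, output, nthread))
        return seqfile, completed.returncode

    # Threads only wait on the child processes, so a thread pool is enough.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, seqfiles))


# Hypothetical usage, assuming isescan.py is on PATH:
#   results = run_all(read_genome_list("test.fna.list"), jobs=4, nthread=2)
#   failed = [seqfile for seqfile, code in results if code != 0]
```

Keep jobs x nthread at or below the number of CPU cores, and lower jobs if RAM becomes the limit.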
results/proteome directory and clusters.faa.hmm.NC_012624.fna.faa and clusters.single.faa.NC_012624.fna.faa in results/hmm directory, and then rerun it:
isescan.py --seqfile NC_012624.fna --output results
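To see which cached HMM search results exist before removing them, a small helper can list them. This is a minimal sketch that assumes the two file names follow the pattern shown above (clusters.faa.hmm.<seqfile name>.faa and clusters.single.faa.<seqfile name>.faa) and that they may sit in subdirectories of the hmm directory, which is an assumption. It only lists files and does not delete anything; the function name is hypothetical.

```python
from pathlib import Path


def cached_hmm_results(seqfile, output="results"):
    """List cached HMM result files for `seqfile` under <output>/hmm (searched recursively)."""
    name = Path(seqfile).name
    wanted = {
        f"clusters.faa.hmm.{name}.faa",
        f"clusters.single.faa.{name}.faa",
    }
    hmm_dir = Path(output) / "hmm"
    if not hmm_dir.is_dir():
        return []
    return sorted(p for p in hmm_dir.rglob("*") if p.is_file() and p.name in wanted)


# Hypothetical usage:
#   for path in cached_hmm_results("NC_012624.fna", output="results"):
#       print(path)
```

Review the printed paths before removing anything by hand.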
.csv (columns are separated by ,) and .tsv (columns are separated by tab) result files, which make it much easier for users to parse the results (Thanks oschwengers for his suggestion)
--seqfile and --output to remove the positional parameters seqfile, proteome and hmm (Thanks oschwengers for his suggestion)
dir4prediction (Thanks oschwengers for his suggestion)
--removeShortIS and --no-FragGeneScan, and remove removeShortIS and translateGenome from constants.py. (Thanks EricDeveaud for his suggestion and codes)
--nthread to isescan.py, and remove nthread and nproc from constants.py.
clusters.faa.hmm and clusters.single.faa, both files are now built upon the curated ACLAME dataset (ACLAME is a mobile genetic element database.)
constants.py to report either complete IS elements or both complete and partial IS elements