In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
JOSS submission updates
This is the v2.0 release of khmer and the first from our new lab at the University of California, Davis. It features Python 3 compatibility, streaming I/O from Unix pipes, mixed-pair sequence file format support, and a new parameter to simplify memory usage. We also have a software paper in press describing the project, and the citation reminders have been updated to reflect that.
Overall there are an additional 2,380 lines of Python code (mostly tests) and 283 fewer lines of C++ (despite adding features). This release is the product of over 1,000 commits to the codebase since v1.4.
Documentation is at https://khmer.readthedocs.org/en/v2.0/
All scripts now accept input from named pipes (like /dev/stdin, or those created using <( list ) process substitution) and unnamed pipes (like output piped in from another program with |). The STDIN stream can also be specified using a single dash: -. #1186 @mr-c #1042 #763 @SherineAwad #1085 @ctb
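The single-dash convention can be illustrated with a small Python sketch. The helper name open_input is hypothetical and is not part of khmer; it only shows the usual way a script maps - to standard input while treating every other argument as a path (which may itself be a named pipe).

```python
import sys


def open_input(filename):
    """Return a readable text stream for `filename`.

    Hypothetical helper: a single dash selects standard input,
    anything else is opened as an ordinary path (a regular file
    or a named pipe such as /dev/stdin).
    """
    if filename == "-":
        return sys.stdin
    return open(filename, "r")
```

A caller would iterate over the returned stream line by line, so the same code path serves files and pipes.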
There is now a -M/--max-memory-usage parameter that sets the number of tables (-N/--n_tables) and tablesize (-x/--max-tablesize) parameters automatically to match the desired memory usage. #1106 #621 #1126 #390 #1117 #1055 #1050 #1214 #1179 #1133 #1145 @ctb @qingpeng @bocajnotnef
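The notes do not spell out the arithmetic behind -M, so the following is only a sketch of how a memory budget could be turned into a table count and table size. The function name tables_for_memory, the default of four tables, and the assumption of 8 bits per counter (or 1 bit per presence entry) are illustrative choices, not khmer's documented behavior.

```python
def tables_for_memory(max_memory_bytes, n_tables=4, bits_per_entry=8):
    """Split a memory budget evenly across `n_tables` tables.

    Assumed model: each table entry costs `bits_per_entry` bits
    (8 for a one-byte counter, 1 for a presence bit).
    Returns (n_tables, entries_per_table).
    """
    if max_memory_bytes <= 0 or n_tables <= 0 or bits_per_entry <= 0:
        raise ValueError("memory, table count and entry width must be positive")
    total_entries = (max_memory_bytes * 8) // bits_per_entry
    return n_tables, total_entries // n_tables


# One gigabyte of one-byte counters spread over four tables.
count_tables = tables_for_memory(1_000_000_000)
# The same budget spent on one-bit presence entries.
presence_tables = tables_for_memory(1_000_000_000, bits_per_entry=1)
```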
normalize-by-median.py now supports mixed paired and unpaired (or "broken-paired") input. Behavior can be forced to either treat all reads as singletons or to require all reads be properly paired using --force_single or --paired, respectively. If --paired is set, --unpaired-reads can be used to include a file of unpaired reads. The unpaired reads will be examined after all of the other sequence files. normalize-by-median.py now has a --quiet option to reduce the amount of output. #1200 @bocajnotnef
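To make the "broken-paired" idea concrete, here is a minimal sketch of walking an interleaved stream in which some reads have lost their mate. It is not khmer's reader: the function names are hypothetical, and the pairing rule (a /1 then /2 suffix, or a Casava 1.8 style "name 1:..." then "name 2:..." description) is an assumption about common read naming.

```python
def split_name(name):
    """Return (base, mate), where mate is '1', '2', or None."""
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2], name[-1]
    parts = name.split(None, 1)
    if len(parts) == 2 and parts[1][:2] in ("1:", "2:"):
        return parts[0], parts[1][0]
    return name, None


def is_pair(name1, name2):
    """True when name1 is mate 1 and name2 is mate 2 of the same fragment."""
    base1, mate1 = split_name(name1)
    base2, mate2 = split_name(name2)
    return base1 == base2 and mate1 == "1" and mate2 == "2"


def broken_paired(names):
    """Yield (first, second) tuples; second is None for an orphaned read."""
    previous = None
    for name in names:
        if previous is None:
            previous = name
        elif is_pair(previous, name):
            yield previous, name
            previous = None
        else:
            yield previous, None
            previous = name
    if previous is not None:
        yield previous, None


grouped = list(broken_paired(["a/1", "a/2", "b/1", "c/1", "c/2"]))
```

Forcing singleton behavior corresponds to skipping the is_pair check entirely, while requiring pairs corresponds to raising an error whenever an orphan would be yielded.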
split-paired-reads.py --output-orphaned/-0 has been added to allow for orphaned reads and give them a file to be sorted into. #847 #1164 @ctb
All scripts that output any kind of columnar data now do so in CSV format, with headers. Previously this had to be enabled with --csv. (Affects abundance-dist-single.py, abundance-dist.py, count-median.py, and count-overlap.py.) normalize-by-median.py --report also now outputs in CSV format. #1011 #1180 @ctb
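Because the output now carries a header row, it can be consumed by column name instead of by position. The sketch below uses Python's csv module on made-up data; the column names (abundance, count, cumulative, cumulative_fraction) and the values are assumptions for illustration and should be checked against the actual output of the script in use.

```python
import csv
import io

# Made-up example data; the column names are an assumption.
EXAMPLE_OUTPUT = """abundance,count,cumulative,cumulative_fraction
1,120,120,0.6
2,60,180,0.9
3,20,200,1.0
"""


def read_distribution(handle):
    """Parse headered CSV rows into (abundance, count) integer pairs."""
    reader = csv.DictReader(handle)
    return [(int(row["abundance"]), int(row["count"])) for row in reader]


pairs = read_distribution(io.StringIO(EXAMPLE_OUTPUT))
total_kmers = sum(count for _, count in pairs)
```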
sample-reads-randomly.py now retains pairs in the output, by default. This can be overridden to match previous behavior with --force_single.
We support gzip and bzip2 input and output file compression everywhere that it makes sense. #505 #747 @bocajnotnef
unique-kmers.py estimates the k-mer cardinality of a dataset using the HyperLogLog probabilistic data structure. This allows very low memory consumption, which can be configured through an expected error rate. Even with low error rate (and higher memory consumption), it is still much more efficient than exact counting and alternative methods. It supports multicore processing (using OpenMP) and streaming, and so can be used in conjunction with other scripts (like normalize-by-median.py and filter-abund.py). This script is the work of @luizirber and the subject of a paper in draft. #390 #1239 #1252 #1053 #1072 #1145 #1176 #1207 #1204 #1245
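For readers unfamiliar with HyperLogLog, the following toy version shows the idea in pure Python. It is a sketch, not the implementation in unique-kmers.py: the class name, the use of SHA-1 as the hash, the register count of 2**10, and the random test sequence are all assumptions, and it ignores reverse complements and multithreading.

```python
import hashlib
import math
import random


class ToyHyperLogLog:
    """Cardinality estimator with 2**p registers (use p >= 7)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item (SHA-1 is a stand-in hash choice).
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = x >> (64 - self.p)              # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# Compare the estimate with an exact count on a random sequence.
rng = random.Random(42)
sequence = "".join(rng.choice("ACGT") for _ in range(20000))
k = 21
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sketch = ToyHyperLogLog(p=10)
for kmer in kmers:
    sketch.add(kmer)

exact = len(set(kmers))
approx = sketch.estimate()
```

The memory used is fixed by the number of registers regardless of how many k-mers are added, which is the trade-off the item above describes: more registers give a lower expected error at a higher memory cost.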
For clarity the Count-Min Sketch based data structure previously known as "counting_hash" or "counting_table" and variations of these is now known as countgraph. Likewise, the Bloom Filter based data structure previously known as "hashbits", "presence_table" and variations of these is now known as nodegraph. Many options relating to table have been changed to graph. #1112 #1209 @mr-c
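As a rough picture of what a Count-Min Sketch style countgraph does, here is a toy version. It is an illustration under stated assumptions, not khmer's code: the class name, the MD5-based hash, the four prime table sizes, and the cap of 255 per counter (one byte) are choices made for this sketch.

```python
import hashlib


class ToyCountMin:
    """Several counter tables of different sizes; a query returns the minimum."""

    MAX_COUNT = 255  # assumed one-byte counters

    def __init__(self, table_sizes=(1009, 1013, 1019, 1021)):
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # One hash value, reduced modulo each table size.
        h = int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")
        return [h % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            if table[slot] < self.MAX_COUNT:
                table[slot] += 1

    def get(self, kmer):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(table[slot]
                   for table, slot in zip(self.tables, self._slots(kmer)))


counts = ToyCountMin()
for _ in range(3):
    counts.add("ACGTACGTACGTACGTACGTA")
```

A Bloom filter style nodegraph follows the same layout with single bits instead of counters, so a query answers "possibly present" or "definitely absent" rather than returning a count.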
All binary khmer formats (presence tables, counting tables, tag sets, stop tags, and partition subsets) have changed. Files are now prepended with the string OXLI to indicate that they are from this project. #519 #1031 @mr-c #1159 @luizirber
Files of the above types made in previous versions of khmer are not compatible with v2.0; the reverse is also true.
In addition to the OXLI string, the Nodegraph and Countgraph file format now includes the number of occupied bins. See http://khmer.readthedocs.org/en/v2.0/dev/binary-file-formats for details. #1093 @ctb @mr-c #1101 #1103 @kdmurray91
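A quick way to tell a v2.0-era binary file from an older one is to look at its first four bytes. The sketch below checks only for the OXLI string mentioned above; the helper name is hypothetical, and the rest of the header layout is deliberately not modelled here (see the linked binary file format documentation).

```python
def looks_like_oxli_file(path):
    """Return True if the file starts with the four bytes b'OXLI'."""
    with open(path, "rb") as handle:
        return handle.read(4) == b"OXLI"
```

A file that fails this check is either from an older khmer release or not a khmer binary file at all; passing it says nothing about which of the binary types the file holds.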
Previously, load-graph.py appended a .pt extension to the specified output filename and partition-graph.py appended a .pt to the given input filename. Now, load-graph.py writes to the specified output filename and partition-graph.py does not append a .pt to the given input filename. #1189 #747 @bocajnotnef
The total number of unique k-mers will always be reported every time a new countgraph is made. The --report-total-kmers option has been removed from abundance-dist-single.py, filter-abund-single.py, and normalize-by-median.py to reflect this. Likewise with --write-fp-rate for load-into-counting.py and load-graph.py; the false positive rate will always be written to the .info files. #1097 #1180 @ctb
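The notes do not say how the false positive rate is computed. A common estimate for a structure built from several independent tables is the product of each table's occupancy, since a query for an absent k-mer is a false positive only when it lands on an occupied bin in every table. The function below is a sketch of that general estimate, not khmer's exact formula, and its name is hypothetical.

```python
def estimated_false_positive_rate(occupied_bins, table_sizes):
    """Multiply the occupancy fractions of all tables.

    occupied_bins[i] is the number of non-empty bins in table i,
    table_sizes[i] is the total number of bins in table i.
    """
    if len(occupied_bins) != len(table_sizes) or not table_sizes:
        raise ValueError("need one occupancy value per table")
    rate = 1.0
    for occupied, size in zip(occupied_bins, table_sizes):
        rate *= occupied / size
    return rate


# Two tables that are each 10% full.
example_rate = estimated_false_positive_rate([100, 100], [1000, 1000])
```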
To simplify the codebase --save-on-failure and its helper option --dump-frequency have been removed from normalize-by-median.py.
--out is now --output for both normalize-by-median.py and trim-low-abund.py. #1188 #1164 @ctb
The common option --min-tablesize was renamed to --max-tablesize to reflect this more desirable behavior.
In conjunction with the new split-paired-reads.py --output-orphaned option, the option --force-paired/-p has been eliminated.
As CSV format is now the default, the --csv option has been removed.
count-overlap.py has been removed.
When normalize-by-median.py decided to keep both parts of a pair of reads it was only adding the k-mers & counts from one to the countgraph. #1000 #1010 @drtamermansour @bocajnotnef
The partition map file format was not robust to truncation and would hang waiting for more data. #437 #1037 #1048 @ctb
extract-paired-reads.py and split-paired-reads.py no longer create default files when the user supplies filename(s). #1005 #1132 @kdmurray91
find-knots.py was missing a --force option and unit tests. #358 #1078 @ctb
The check for excessively high false-positive rate has also received a --force option. #1168 @bocajnotnef
A bug leading to an infinite loop with large gzipped countgraphs was found #1038 #1043 @kdmurray91
All scripts that create nodegraphs or countgraphs report the total number of unique k-mers. #491 #609 #429 @mr-c
Read pairs from SRA are fully supported. Reported by @macmanes in #1027, fixed by @kdmurray91 @SherineAwad in #1173 #1088
Added Hashtable::get_kmers(), get_kmer_hashes(), and get_kmer_counts() with corresponding CPython functions. #1047 #1049 @ctb
The DEFAULT_DESIRED_COVERAGE for normalize-by-median.py is now 20. #1073 #1081 @ctb
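To show what a desired coverage of 20 means in practice, here is a compact sketch of median-based digital normalization: a read is kept only while the median count of its k-mers is still below the cutoff. This is a simplification for illustration. It uses an exact Counter in place of a countgraph, ignores reverse complements and read pairing, and the function names and the tiny k of 4 are assumptions.

```python
from collections import Counter
from statistics import median


def kmers_of(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k=4, cutoff=20):
    """Keep a read only if its median k-mer count is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers_of(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads are counted
            kept.append(read)
    return kept


# Fifty identical reads: only the first 20 should survive.
kept_reads = normalize_by_median(["ACGTTGCAAT"] * 50)
```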
FIFOs are no longer seen as empty. #1147 #1163 @bocajnotnef
When the k-size is requested to be larger than 32 (which is unsupported) a helpful error message is reported. #1094 #1050 @ctb
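The notes do not explain where the limit of 32 comes from. One common reason for such a limit, assumed here, is that each base is packed into 2 bits so that a whole k-mer fits in a single 64-bit integer. The sketch below shows that packing together with an explicit error for larger k; the names and the A=0, C=1, G=2, T=3 encoding are illustrative.

```python
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
MAX_K = 32  # 32 bases * 2 bits per base = 64 bits


def pack_kmer(kmer):
    """Pack a DNA k-mer into an integer using 2 bits per base."""
    if len(kmer) > MAX_K:
        raise ValueError(
            "k-mer of length %d does not fit in 64 bits; use k <= %d"
            % (len(kmer), MAX_K))
    value = 0
    for base in kmer.upper():
        value = (value << 2) | ENCODING[base]
    return value
```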
We try to report more helpfully during errors, such as suggesting the --force option when output files already exist. #1162 #1170 @bocajnotnef
There is a paper related to trim-low-abund.py: "Crossing the streams: a framework for streaming analysis of short DNA sequencing reads" and it has been added to the CITATION file and program output. #1180 #1130 @ctb
We have dropped support for Python 2.6 #1009 #1180 @ctb
Our user documentation got a bit out of date and has been updated. #1156 #1247 @bocajnotnef @mr-c #1104 @kdmurray91 #1267 @ctb
Links to lists of publications that use khmer have been added. #1063 #1222 @mr-c
The help text from the scripts has also had a thorough cleanup for formatting. #1268 @mr-c
fastq-to-fasta.py's --n_keep option had incorrect help text. We now point out that all reads with Ns will be dropped by default unless this option is supplied. #657 #814 #1208 @ACharbonneau @bocajnotnef
We've updated the URL to the '88m-reads.fa.gz' file. #1242 #1269 @mr-c
@camillescott designed and implemented an optimization for normalize-by-median.py #862
abundance-dist.py can now be used without counts over 255 with --no-bigcount. #1067 #909 @drtamermansour @bocajnotnef Its input file requirement can no longer be overridden. #1201 #1202 @bocajnotnef
khmer v2.0 will be released as a package for the Debian GNU/Linux operating system. Big thanks to @kdmurray91 for his assistance. #1148 #1240 The C++ library, now named liboxli, will have its own package as well.
sandbox/multi-rename.py now wraps long FASTA sequences at 80 columns. #450 #1136 @SherineAwad
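Wrapping at a fixed column width is a small slicing exercise. The function below is a generic sketch of the technique, not the code in sandbox/multi-rename.py, and its name is hypothetical.

```python
def wrap_sequence(sequence, width=80):
    """Break a sequence string into lines of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return "\n".join(sequence[i:i + width]
                     for i in range(0, len(sequence), width))


wrapped = wrap_sequence("A" * 170)
```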
The khmer project is now a Python 3 codebase with backwards compatibility to Python 2.7. Huge credit to @luizirber #978 #922 #1045 #1066 #1089 #1157 #1191 #1108 There were many developer-impacting changes, including the renaming of khmer/_khmermodule.cc to khmer/_khmer.cc. #169 #904
@camillescott did an extensive refactor of the C++ graph traversal code which removed a considerable amount of redundant code and will be very useful for future work. #1231 #1080
We now use some and allow all C++11 features in the codebase. #598 #1122 @mr-c
normalize-by-median.py was extensively refactored. #1006 #1010 #1057 #1039 #1135 #1182 @bocajnotnef @ctb @camillescott
The CPython glue was refactored so that CountingHash and Hashbits inherit from Hashtable. #1044 @ctb
The tests no longer stop on the first failed test. #1124 #1134 @ctb and some noisy tests were silenced #1125 #1137 @bocajnotnef
The check_space() calls were cleaned up. #1167 #1166 #1170 #993
Developer docs have been expanded #737 #1184 @bocajnotnef #1083 #1282 @ctb @mr-c #1269
A lot of code was deleted: TRACE related code in #274 #1180 @ctb; hashtable_collect_high_abundance_kmers in #1142 #1044 @ctb; lib/ht-diff.cc, lib/test-HashTables.cc, lib/test-Parser.cc in #1144 @mr-c; bink.ipynb, lib/graphtest.cc, lib/primes.hh in #1289 @mr-c
@bocajnotnef deleted more unused code and added new tests elsewhere to increase testing coverage in #1236. @mr-c had his own go in #1279
cppcheck installation for OSX has been documented #777 #952 #945 @elmbeech
ccache and git-merge-changelog have been documented for Linux users #610 #1122 #614 @mr-c
The graphalign parameters can be saved/loaded from disk. In addition the align_forward method has been introduced. #755 #750 @mr-c @ctb
labelhash is now known as graphlabels #1032 #1209 @mr-c It is also now a 'friend' of Hashtable and one can make either a nodegraph or countgraph version. These graphlabels can now be saved & loaded from disk. #1021 @ctb
Spelling is hard; we've added instructions on how to run codespell to the developer docs. #890 #1203 @bocajnotnef
A redundant and contradictory named test has been removed. Reported by @jgluck in #662 fixed by @bocajnotnef in #1220 @SherineAwad contributed some additional tests #809 #615.
The new oxli command, while disabled in the v2.0 release, has been added to all the QA makefile targets as we continue to refactor the codebase. #1199 #1218 @bocajnotnef
The CPython code was audited to ensure that all possible C++ exceptions were caught and dealt with. The exception hierarchy was also simplified #1016 #1015 #1017 #1151 @kdmurray91 @mr-c
get_kadian_count has been removed. #1034 #1194 @ctb
We use argparse's metavars to aid with autogenerated documentation for the scripts. This has been documented in the dev docs. #620 #1222 @mr-c
Sometimes one makes a lot of commits while refining a feature or pull request. We've documented a field-tested way to turn a pile of commits into a single commit without the pain of git rebase. #1013 #660 #1222 @mr-c
We use Coverity to test for various issues with our C++ code. The Makefile target has been updated for changes on their side. #1007 #1222 @mr-c
There is a new update() function to merge two nodegraphs of the same size and ksize. #1051 @ctb
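The requirement that both nodegraphs share a size and ksize makes sense if merging is done position by position. One plausible reading, assumed here and not confirmed by the notes, is a bitwise OR of the underlying bit tables. The sketch below shows that operation on plain bytearrays; merge_tables is a hypothetical name.

```python
def merge_tables(target, other):
    """OR every byte of `other` into `target` in place.

    Both tables must have the same length, otherwise the same
    k-mer would map to different positions in each of them.
    """
    if len(target) != len(other):
        raise ValueError("tables must be the same size")
    for i, byte in enumerate(other):
        target[i] |= byte
    return target


first = bytearray([0b0101, 0b0000])
second = bytearray([0b0011, 0b1000])
merged = merge_tables(first, second)
```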
Despite the checklist, formatting errors still occur. We must be vigilant! #1075 @luizirber
There is a new filter_on_median function. #862 #1077 @camillescott
There are new scripts in sandbox/ which output k-mer counts: sandbox/{count-kmers.py,count-kmers-single.py}. #983 @ctb
A large effort to make the codebase 'pylint clean' has begun with #1175 @bocajnotnef Likewise the cpychecker tool was re-run on the CPython code and issues found there were addressed #1196 @mr-c
As repeatedly promised, we've updated our list of contributors to include everyone with a commit in git. #1023 @mr-c
thread_utils.is_pair() has been dropped in favor of utils.check_is_pair() #1284 @mr-c
The Doxygen produced documentation is improving. The location of included headers is now autodetected for Doxygen and cppcheck.
load-graph.py in multithreaded mode will find a slightly different number of unique k-mers. This is being investigated in #1248
@ctb, @bocajnotnef, @mr-c, @luizirber, @kdmurray91, @SherineAwad, @camillescott, *@ACharbonneau, *@elmbeech, @drtamermansour
* Indicates new contributors
@jgluck, @ACharbonneau, @macmanes
This is the v1.4.1 release of khmer. Due to the upcoming Python 3 compatibility in both khmer and Screed we need to modify the dependency between khmer and the Screed library to be only the existing version 0.8, and not some future version.
If you have khmer 1.4 installed then there is no benefit to upgrading; this point release is to keep pip install khmer working when we release the next version of Screed with Python 3 support. The next version of khmer, v2.0, will also have Python 3 support.
Documentation is at https://khmer.readthedocs.org/en/v1.4.1/ (no changes from v1.4)
All of these are pre-existing.
Some users have reported that normalize-by-median.py will utilize more memory than it was configured for. This is being investigated in https://github.com/ged-lab/khmer/issues/266
Some scripts only output FASTA even if given a FASTQ file. This issue is being tracked in https://github.com/ged-lab/khmer/issues/46
This is the v1.4 release of khmer featuring the results of our March and April (PyCon) coding sprints and the 16 new contributors; the use of the new v0.8 release of screed (the library we use for pure Python reading of nucleotide sequence files); and the addition of @luizirber's HyperLogLog counter for quick cardinality estimation.
Documentation is at https://khmer.readthedocs.org/en/v1.4/
Casava 1.8 read naming is now fully supported and in general the scripts no longer mangle read names. Side benefits: split-paired-reads.py will no longer drop reads with 'bad' names; count-median.py can generate output in CSV format. #759 #818 @ctb #873 @ahaerpfer
Most scripts now support a "broken" interleaved paired-read format for FASTA/FASTQ nucleotide sequence files. trim-low-abund.py has been promoted from the sandbox as well (with streaming support). #759 @ctb #963 @sguermond #933 @standage
The script to transform an interleaved paired-read nucleotide sequence file into two files now allows one to name the output files which can be useful in combination with named pipes for streaming processing #762 @ctb
Streaming everywhere: thanks to screed v0.8 we now support streaming of almost all inputs and outputs. #830 @aditi9783 #812 @mr-c #917 @bocajnotnef #882 @standage
Need a quick way to count the total number of unique k-mers in very low memory? The unique-kmers.py script in the sandbox uses a HyperLogLog counter to quickly (and with little memory) provide an estimate with a controllable error rate. #257 #738 #895 #902 @luizirber
normalize-by-median.py can now process both a paired interleaved sequence file and a file of unpaired reads in the same invocation, thus removing the need to write the counting table to disk as required in the workaround. #957 @susinmotion
Paired-end reads from Casava 1.8 no longer require renaming for use in normalize-by-median.py and abund-filter.py when used in paired mode #818 @ctb
Python version support clarified. We do not (yet) support Python 3.x #741 @mr-c
If a single output file mode is chosen for normalize-by-median.py we now default to overwriting the output. Appending the output is available by using the append redirection operator from the shell. #843 @drtamermansour
Scripts that consume sequence data using C++ will now properly throw an error on truncated files. #897 @kdmurray91 And while writing to disk we properly check for errors #856 #962 @mr-c
abundance-dist-single.py no longer fails with small files and many threads. #900 @mr-c
Many documentation updates #753 @PamelaM, #782 @bocajnotnef, #845 @alameldin, #804 @ctb, #870 @SchwarzEM, #953 #942 @safay, #929,@davelin1, #687 #912 #926 @mr-c
Installation instructions for Conda, Arch Linux, and Mac Ports have been added #723 @reedacartwright #952 @elmbeech #930 @ahaerpfer
The example script for the STAMPS database has been fixed to run correctly #781 @drtamermansour
split-paired-reads.py: added -o option to allow specification of an output directory #752 @bede
Fixed a string formatting and a boundary error in sample-reads-randomly.py #773 @qingpeng #995 @ctb
CSV output added to abundance-dist.py, abundance-dist-single.py, count-overlap.py, and readstats.py #831 #854 #855 @drtamermansour #959 @anotherthomas
TSV/JSON output of load-into-counting.py enhanced with the total number of reads processed #996 @kdmurray91
Output files are now also checked to be writable before loading the input files #672 @pgarland @bocajnotnef
interleave-reads.py now prints the output filename nicely #827 @kdmurray91
Cleaned up error for input file not existing #772 @jessicamizzi #851 @ctb
Fixed error in find-knots.py #860 @TheOneHyer
The help text for load-into-counting.py for the --no-bigcounts/-b flag has been clarified #857 @kdmurray91
@lexnederbragt confirmed an old bug has been fixed with his test for whitespace in sequence identifiers interacting with the extract-partitions.py script #979
Now safe to copy-and-paste from the user documentation as the smart quotes have been turned off. #967 @ahaerpfer
The script make-coverage.py has been restored to the sandbox. #920 @SherineAwad
normalize-by-median.py will warn if two of the input files have the same name #932 @elmbeech
Switched away from using --user install for developers #740 @mr-c @drtamermansour & #883 @standage
Developers can now see a summary of important Makefile targets via make help #783 @standage
The unused khmer.load_pe module has been removed #828 @kdmurray91
Versioneer bug due to new screed release was squashed #835 @mr-c
A Python 2.6 and 2.7.2 specific bug was worked around #869 @kdmurray91 @ctb
Added functions hash_find_all_tags_list and hash_get_tags_and_positions to CountingHash objects #749 #765 @ctb
The make diff-cover and ChangeLog formatting requirements have been added to checklist #766 @mr-c
A useful message is now presented if large tables fail to allocate enough memory #704 @mr-c
A checklist for developers adding new CPython types was added #727 @mr-c
The sandbox graduation checklist has been updated to include streaming support #951 @sguermond
Specific policies for sandbox/ and scripts/ content, and a process for adding new command line scripts into scripts/ have been added to the developer documentation #799 @ctb
Sandbox scripts update: corrected #! Python invocation #815 @Echelon9, executable bits, copyright headers, no underscores in filenames #823 #826 #850 @alameldin several scripts deleted, docs + requirements updated #852 @ctb
Avoid running big-memory tests on OS X #819 @ctb
Unused callback code was removed #698 @mr-c
The CPython code was updated to use the new checklist and follow additional best practices #785 #842 @luizirber
Added a read-only view of the raw counting tables #671 @camillescott #869 @kdmurray91
Added a Python method for quickly getting the number of underlying tables in a counting or presence table #879 #880 @kdmurray91
The C++ library can now be built separately for the brave and curious developer #788 @kdmurray91
The ReadParser object now keeps track of the number of reads processed #877 @kdmurray91
Documentation is now reproducible #886 @mr-c
Python future proofing: specify floor division #863 @mr-c
Miscellaneous spelling fixes; thanks codespell! #867 @mr-c
Debian package list update #984 @mr-c
khmer.kfile.check_file_status() has been renamed to check_input_files() #941 @proteasome
filter-abund.py now uses it to check the input counting table #931 @safay
normalize-by-median.py was refactored to not pass the ArgParse object around #965 @susinmotion
Developer communication has been clarified #969 @sguermond
Tests using the 'fail_okay=true' parameter to runscript have been updated to confirm the correct error occurred. 3 faulty tests were fixed and the docs were clarified #968 #971 @susinmotion
FASTA test added for extract-long-sequences.py #901 @jessicamizzi
'added silly test for empty file warning' #557 @wltrimbl @bocajnotnef
A couple tests were made more resilient and some extra error checking added in CPython land #889 @mr-c
Copyright added to pull request checklist #940 @sguermond
khmer_exceptions are now based on std::strings, which plugs a memory leak #938 @anotherthomas
Python docstrings were made PEP257 compliant #936 @ahaerpfer
Some C++ comments were converted to be Doxygen compliant #950 @josiahseaman
The counting and presence table warning logic was refactored and centralized #944 @susinmotion
The release checklist was updated to better run the post-install tests #911 @mr-c
The unused method find_all_tags_truncate_on_abundance was removed from the CPython API #924 @anotherthomas
OS X warnings quieted #887 @mr-c
All of these are pre-existing.
Some users have reported that normalize-by-median.py will utilize more memory than it was configured for. This is being investigated in https://github.com/ged-lab/khmer/issues/266
Some scripts only output FASTA even if given a FASTQ file. This issue is being tracked in https://github.com/ged-lab/khmer/issues/46
@ctb, @kdmurray91, @mr-c, @drtamermansour, @luizirber, @standage, @bocajnotnef, *@susinmotion, @jessicamizzi, *@elmbeech, *@anotherthomas, *@sguermond, *@ahaerpfer, *@alameldin, *@TheOneHyer, *@aditi9783, *@proteasome, *@bede, *@davelin1, @Echelon9, *@reedacartwright, @qingpeng, *@SchwarzEM, *@scottsievert, @PamelaM, @SherineAwad, *@josiahseaman, *@lexnederbragt,
* Indicates new contributors
@moorepants, @teshomem, @macmanes, @lexnederbragt, @r-gaia-cs, @magentashades
This is the v1.3 release of khmer featuring a new FAST[AQ] parser from the SeqAn project.
Docs at: https://khmer.readthedocs.org/en/v1.3/
Fixes the two multithreaded reading of sequence files issues: FASTQ parsing and the recently found read dropping issue. Several khmer scripts now support reading from non-seekable plain and gzipped FAST[AQ] files (a.k.a. pipe or streaming support). @mr-c #642
restore threading to load-graph.py #699 @mr-c
increase filter_abund.py coverage #568 @wrightmhw
Provide scripts/ testing coverage for check_space_for_hashtable #386 #678 #718 @b-wyss
Use absolute URI in CODE_OF_CONDUCT #684 @jsspencer
give SeqAn credit #712 @mr-c
Added testing to make sure all sandbox scripts are import-able and execfile-able. #709 @ctb
reduce memory requirements to run tests #701 @ctb
Two minor bug fixes to sandbox scripts #706 @ctb
Upgrade of trim-low-abund for better, more profitable streaming. #601 @ctb
Add --force or --expert or --ignore flag to all khmer scripts that do sanity checking #399 #647 @jessicamizzi
Add XDECREF for returned read tuple in ReadParser.read_pair_iterator() #693 @mr-c @camillescott
All of these are pre-existing.
Some users have reported that normalize-by-median.py will utilize more memory than it was configured for. This is being investigated in https://github.com/ged-lab/khmer/issues/266
If your k-mer table is truncated on write, an error may not be reported; this is being tracked in https://github.com/ged-lab/khmer/issues/443. However, khmer will now (correctly) fail when trying to read a truncated file (See #333).
Paired-end reads from Casava 1.8 currently require renaming for use in normalize-by-median and abund-filter when used in paired mode. The integration of a fix for this is being tracked in https://github.com/ged-lab/khmer/issues/23
Some scripts only output FASTA even if given a FASTQ file. This issue is being tracked in https://github.com/ged-lab/khmer/issues/46
A user reported that abundance-dist-single.py fails with small files and many threads. This issue is being tracked in https://github.com/ged-lab/khmer/issues/75
@mr-c, @ctb, @camillescott, @b-wyss, @wrightmhw, @jsspencer
This is the v1.2 release of khmer: minor new features and bug fixes. The start of this release cycle coincided with the Mozilla Science Lab Global Sprint 2014. We honor and thank the 19 new contributors (including four Michigan State University undergraduates) who volunteered their time to contribute!
Docs at: https://khmer.readthedocs.org/en/v1.2/
@mr-c and @ctb are proud to announce khmer's code of conduct http://khmer.readthedocs.org/en/v1.2/dev/CODE_OF_CONDUCT.html #664
All scripts list which files have been created during their execution #477 @bocajnotnef
All scripts now only output status messages to STDERR instead of STDOUT #626 @b-wyss
docs/ a fairly major re-organization and brand new developer docs @ctb @mr-c
load-into-counting.py: --summary-info: machine readable summary in JSON or TSV format #649 @kdmurray91
scripts/extract-partitions.py: added documentation for make install-dependencies
is useful for developers #539 @mr-c
Sandbox scripts have been cleaned up, or removed (see the sandbox/README.rst for details) #589 @ctb
do-partition.py's excessive spawning of threads fixed. #637 @camillescott
Fixed unique k-mer count reporting in load-graph, load-into-counting, and normalize-by-median. #562 @mr-c
Clarified and test the requirement for a 64-bit operating system #529 @Echelon9
Removed some of the broken multi-threading options #511 @majoras-masque
Fix table.get("wrong_length_string") gives core dump #585 @Echelon9
filter-abund lists parameters that it doesn't use #524 @jstapleton
Reduction of memory required to run the test suite #542 @leogargu
BibTeX included in CITATIONS #541 @HLWiencko
delete ScoringMatrix::assign as it is unused #502 @RodPic
Root all of our C++ exceptions to a common base exception #508 @iglpdc
deleted KhmerError #503 @drlabratory
normalize-by-median reporting output after main loop exits, in case it hadn't been triggered #586 @ctb
Many issues discovered by cppcheck cleaned up #506 @brtaylor92
Developers have a new Makefile target to autofix formatting: make format #612 @brtaylor92
normalize-by-median.py test coverage increased #361 @SherineAwad
Several unused functions were removed #599 @brtaylor92
Developer docs now link to the stdc++ docs as appropriate #629 @mr-c
Added tests for non-sequential access to input files #644 @bocajnotnef
Removed khmer/theading_args.py #653 @bocajnotnef
Improved test for maximum k value #658 @pgarland
ReadParser no longer crashes if n_threads = 0 #86 @jiarong
All of these are pre-existing.
Multithreaded reading will drop reads. This major issue has been present for several khmer releases and was only found via a much larger test case than we had been previously using. Credit to @camillescott. Workaround: disable threading. The next release will fix this and the other FAST[AQ] parsing issues. https://github.com/ged-lab/khmer/issues/681
Some users have reported that normalize-by-median.py will utilize more memory than it was configured for. This is being investigated in https://github.com/ged-lab/khmer/issues/266
Some FASTQ files confuse our parser when running with more than one thread. For example, while using load-into-counting.py. If you experience this then add "--threads=1" to your command line. This issue is being tracked in https://github.com/ged-lab/khmer/issues/249
If your k-mer table is truncated on write, an error may not be reported; this is being tracked in https://github.com/ged-lab/khmer/issues/443. However, khmer will now (correctly) fail when trying to read a truncated file (See #333).
Paired-end reads from Casava 1.8 currently require renaming for use in normalize-by-median and abund-filter when used in paired mode. The integration of a fix for this is being tracked in https://github.com/ged-lab/khmer/issues/23
Some scripts only output FASTA even if given a FASTQ file. This issue is being tracked in https://github.com/ged-lab/khmer/issues/46
A user reported that abundance-dist-single.py fails with small files and many threads. This issue is being tracked in https://github.com/ged-lab/khmer/issues/75
@mr-c, @ctb, *@bocajnotnef, *@Echelon9, *@jlippi, *@kdmurray91, @qingpeng, *@leogargu, *@jiarong, *@brtaylor92, *@iglpdc, @camillescott, *@HLWiencko, *@cowguru2000, *@drlabratory, *@jstapleton, *@b-wyss, *@jgluck, @fishjord, *@SherineAwad, *@pgarland, *@majoras-masque, @chuckpr, *@RodPic, @luizirber, *@jrherr
* Denotes new contributor
This is v1.1, a minor version release; this version adds several new scripts.
Docs at: https://khmer.readthedocs.org/en/v1.1/
Release notes w/links: https://github.com/ged-lab/khmer/releases/tag/v1.1
All of these are pre-existing.
Some users have reported that normalize-by-median.py will utilize more memory than it was configured for. This is being investigated in https://github.com/ged-lab/khmer/issues/266
Some FASTQ files confuse our parser when running with more than one thread. For example, while using load-into-counting.py. If you experience this then add "--threads=1" to your command line. This issue is being tracked in https://github.com/ged-lab/khmer/issues/249
If your k-mer table is truncated on write, an error may not be reported; this is being tracked in https://github.com/ged-lab/khmer/issues/443. However, khmer will now (correctly) fail when trying to read a truncated file (See #333).
Paired-end reads from Casava 1.8 currently require renaming for use in normalize-by-median and abund-filter when used in paired mode. The integration of a fix for this is being tracked in https://github.com/ged-lab/khmer/issues/23
Some scripts only output FASTA even if given a FASTQ file. This issue is being tracked in https://github.com/ged-lab/khmer/issues/46
A user reported that abundance-dist-single.py fails with small files and many threads. This issue is being tracked in https://github.com/ged-lab/khmer/issues/75
@mr-c, @ctb, @camillescott, @wrightmhw, @chuckpr, @luizirber, @accaldwell, @znruss
This is a bugfix release. Note: the installation instructions have been slightly simplified.
https://khmer.readthedocs.org/en/v1.0.1/
This release successfully installs and passes its unit tests on Debian 6.0 "Squeeze", Debian 7.0 "Wheezy", Fedora 19, OS X 7 "Lion", OS X 8 "Mountain Lion", Red Hat Enterprise Linux 6, Scientific Linux 6, Ubuntu 10.04 LTS, and Ubuntu 12.04 LTS. Thanks to the UW-Madison Build and Test Lab for their testing infrastructure.
fixed thread hanging issue #406 @ctb
Explicit python2 invocation #404 @mr-c
MANIFEST.in,setup.py: fix to correct zlib packaging #365 @mr-c
fixed check_space_for_hashtable to use args.n_tables #382 @ctb
Bug fix: make-initial-stoptags.py error on missing .ht input file, actual input file is .pt #391 @mr-c
include calc-best-assembly.py in v1.0.1 #409 @ctb
updated normalize-by-median documentation for loadtable #378 @ctb
updated diginorm for new FP rate info; corrected spelling error #398 @ctb
Add spellcheck to code review checklist. #397 @ctb
All of these are pre-existing.
Some users have reported that normalize-by-median.py will utilize more memory than it was configured for. This is being investigated in https://github.com/ged-lab/khmer/issues/266
Some FASTQ files confuse our parser when running with more than one thread. For example, while using load-into-counting.py. If you experience this then add "--threads=1" to your command line. This issue is being tracked in https://github.com/ged-lab/khmer/issues/249
If your k-mer table (hashfile) gets truncated, perhaps from a full filesystem, then our tools currently will get stuck. This is being tracked in https://github.com/ged-lab/khmer/issues/247 and https://github.com/ged-lab/khmer/issues/246
Paired-end reads from Casava 1.8 currently require renaming for use in normalize-by-median and abund-filter when used in paired mode. The integration of a fix for this is being tracked in https://github.com/ged-lab/khmer/issues/23
annotate-partitions.py only outputs FASTA even if given a FASTQ file. This issue is being tracked in https://github.com/ged-lab/khmer/issues/46
A user reported that abundance-dist-single.py fails with small files and many threads. This issue is being tracked in https://github.com/ged-lab/khmer/issues/75
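For the Casava 1.8 issue above, the renaming can be done before the reads reach the paired-mode scripts. The following is a minimal sketch, not the fix tracked in issue #23. It assumes the standard Casava 1.8 header layout ("NAME READ:FILTER:CONTROL:INDEX") and assumes that the older "NAME/1" and "NAME/2" naming is what paired mode expects; the function name and the example header are hypothetical.

```python
def casava18_to_legacy_name(header):
    """Rewrite 'NAME READ:FILTER:CONTROL:INDEX' as 'NAME/READ'.

    Headers that do not look like Casava 1.8 are returned unchanged.
    Assumption: the target naming is the older '/1' and '/2' suffix style.
    """
    name, sep, description = header.partition(" ")
    if not sep:
        # No description field, so there is nothing to convert.
        return header
    read_number = description.split(":", 1)[0]
    if read_number not in ("1", "2"):
        # The description does not start with a read number.
        return header
    return f"{name}/{read_number}"


# Hypothetical example header in Casava 1.8 format.
example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
print(casava18_to_legacy_name(example))
# prints "@EAS139:136:FC706VJ:2:2104:15343:197393/1"
```

Only the header line of each FASTQ record would be passed through this function; the sequence and quality lines stay untouched.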
@mr-c, @ctb, @luizirber, @RamRS, @ctSkennerton
582 changed files with 40,527 additions and 31,772 deletions.
The team has been hard at work since v0.8 to refine the codebase into a stable product.
https://khmer.readthedocs.org/en/latest/
With the 1.0 release we are making a commitment to using Semantic Versioning[0]: the version number will reflect the impact of the changes between releases. New major versions will likely require you to change how you use the project. Minor versions indicate new functionality that doesn't impact existing functionality. Patch versions indicate backwards-compatible fixes. Right now we are limiting this promise to the command-line interface. A future release will introduce a stable and mature Python API to the khmer project and at that time we will extend the version system to include that API.
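As an illustration of the major/minor/patch distinction, here is a small sketch that classifies the step between two version strings. It is not part of khmer; the function names are hypothetical, and it only handles plain "MAJOR.MINOR[.PATCH]" strings, not pre-release or build suffixes.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR[.PATCH]' into three ints; missing parts count as 0."""
    parts = [int(piece) for piece in text.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def change_level(old, new):
    """Return 'major', 'minor' or 'patch' for the step from old to new."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "major"  # users may need to change how they use the project
    if new_v[1] != old_v[1]:
        return "minor"  # new functionality, existing usage unaffected
    return "patch"      # backwards-compatible fixes only


print(change_level("1.0", "1.0.1"))  # prints "patch"
print(change_level("1.0.1", "1.1"))  # prints "minor"
print(change_level("1.4", "2.0"))    # prints "major"
```

Under this scheme, a pipeline built on the command-line interface should keep working across minor and patch upgrades and only needs review at a major version change.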
CITATION: Each script now outputs information on how to cite it. There is a new paper that describes the project overall: MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
The documentation for the scripts has undergone an overhaul. The scripts now output extensive notes, and the formal documentation website is generated from the scripts themselves, so the two will never be out of sync.
https://khmer.readthedocs.org/en/latest/scripts.html
git clone of the khmer repo reqs > 0.5 GiB #223 @mr-c
new khmer/file module #357 @RamRS
Floating point exception in count-overlap.py #282 @qingpeng
add documentation for sample-reads-randomly #192 @mr-c
only build zlib and bzip2 when needed #168 @mr-c
khmer tools should output intelligent error messages when fed empty files #135 @RamRS
set IParser::ParserState::ParserState:fill_id to zero at initialization #356 @mr-c
demote nose & sphinx to extra dependencies. #351 @mr-c
CID 1054792 (Medium) Uninitialized scalar field (UNINIT_CTOR) #179 @mr-c
CID 1077117 (Medium): Division or modulo by zero (DIVIDE_BY_ZERO) #182 @mr-c
if --savehash is specified then don't continue if there is not enough free disk space #245 @RamRS
finish fixing implicit downcasts #330 @mr-c
Clean up compile warnings in subset.cc #172 @mr-c
all scripts need to output their version #236 @mr-c
environmental variables need documenting #303 @mr-c
C++ code should be consistently formatted #261 @mr-c
Clean up ancillary files #146 @mr-c
squash option not implemented in abundance-dist-single.py #271 @RamRS
Add documentation on how to tie into a particular tagged version #29 @mr-c
pip install -e fails with compile error #352 @mr-c
remove the unused KTable object #337 @luizirber
zlib 1.2.3 -> zlib 1.2.8 #336 @mr-c
CID 1173035: Uninitialized scalar field (UNINIT_CTOR) #311 @mr-c
CID 1153101: Resource leak in object (CTOR_DTOR_LEAK) #309 @mr-c
remove khmer::read_parsers::IParser::ParserState::thread_id #323 @mr-c
several modifications about count-overlap.py script #324 @qingpeng
fixed runscript to handle SystemExit #332 @ctb
CID 1063852: Uninitialized scalar field (UNINIT_CTOR) #313 @mr-c
[infrastructure] update to new Doxyfile format, make version number autoupdate #315 @mr-c
Removed an extraneous using namespace khmer; in kmer.hh, #276 @fishjord
Minimum and recommended python version #94 @mr-c
KmerCount class appears to be unused #302 @mr-c
If loadhash is specified in e.g. normalize-by-median, don't complain about default hashsize parameters #117 @RamRS
All of these are pre-existing.
Some users have reported that normalize-by-median.py will utilize more memory than it was configured for. This is being investigated in https://github.com/ged-lab/khmer/issues/266
Some FASTQ files confuse our parser when running with more than one thread, for example while using load-into-counting.py. If you experience this, add "--threads=1" to your command line. This issue is being tracked in https://github.com/ged-lab/khmer/issues/249
If your k-mer table (hashfile) gets truncated, perhaps because of a full filesystem, our tools will currently get stuck. This is being tracked in https://github.com/ged-lab/khmer/issues/247 and https://github.com/ged-lab/khmer/issues/96 and https://github.com/ged-lab/khmer/issues/246
Paired-end reads from Casava 1.8 currently require renaming for use in normalize-by-median and abund-filter when used in paired mode. The integration of a fix for this is being tracked in https://github.com/ged-lab/khmer/issues/23
annotate-partitions.py only outputs FASTA even if given a FASTQ file. This issue is being tracked in https://github.com/ged-lab/khmer/issues/46
A user reported that abundance-dist-single.py fails with small files and many threads. This issue is being tracked in https://github.com/ged-lab/khmer/issues/75
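The "--threads=1" workaround for the FASTQ parsing issue above can be scripted when a pipeline assembles its own command lines. The sketch below only builds and prints the command; it does not run it. The helper name and the file names are hypothetical, the positional order (output table first, then input files) is an assumption to check against the script's --help output, and any other options you normally pass are omitted.

```python
import shlex


def single_threaded_command(script, output_table, input_files):
    """Build an argument list that forces the script to use one thread.

    Assumption: the script takes the output table name first, followed by
    one or more input sequence files.
    """
    command = [script, "--threads=1", output_table]
    command.extend(input_files)
    return command


# Hypothetical file names, for illustration only.
command = single_threaded_command("load-into-counting.py", "reads.kh", ["reads.fq"])
print(" ".join(shlex.quote(part) for part in command))
# prints "load-into-counting.py --threads=1 reads.kh reads.fq"
```

The resulting list can be handed to subprocess.run(command, check=True) once the arguments have been verified for your installed version.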
@camillescott, @mr-c, @ctb, @luizirber, @RamRS, @qingpeng