eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.2.0/tsv-utils-v2.2.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.2.0/tsv-utils-v2.2.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt file in the release package.
To be notified of new releases:
GitHub supports notification of new releases. Click the "Watch" button on the repository page and select "Releases Only".
Release 2.2.0 Changes:
tsv-filter: New feature, count matches rather than filtering (--c|count). This option causes the number of matching lines to be printed rather than the individual matching lines.
tsv-filter: New feature, marking records rather than filtering (--label). This option causes every record to be marked with an indication of whether it satisfied the test. Marking is done by appending a new field with an indicator value. See PR #338 for details.
--line-buffered: This option causes each line to be read and written as soon as it is available. This overrides the default buffering behavior. This is useful when reading from slow input streams. See PR #336 for details.
Other Changes
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.2/tsv-utils-v2.1.2_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.2/tsv-utils-v2.1.2_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt file in the release package.
Release 2.1.2 Changes
File.write to File.rawWrite. See PR #316.
-disable-fp-elim. This option is no longer available starting with LDC 1.24.0 (next version) and is a required change. See PR #316.
Prebuilt binaries have been built using the latest LDC compiler (ldc-1.23.0).
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.1/tsv-utils-v2.1.1_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.1/tsv-utils-v2.1.1_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt file in the release package.
Release 2.1.1 Changes
csv2tsv buffer utilization. Enables better performance of subsequent tasks in a pipeline due to more frequent writes to standard output (better parallelization). Minor performance benefits to csv2tsv by itself. See PR #305.
tsv-utils use in the D Language project tester. See PR #306.
Prebuilt binaries have been built using the latest LDC compiler (ldc-1.23.0).
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.0/tsv-utils-v2.1.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.1.0/tsv-utils-v2.1.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt file in the release package.
Release 2.1.0 Changes: csv2tsv
csv2tsv is significantly faster as a result of switching to a buffer-based conversion algorithm. The 2.1.0 version runs 40-60% faster than the 2.0.0 version on tests on Mac OS, depending on the type of file. See PR #301 for details.
--r|tab-replacement and --n|newline-replacement. See PR #303 for details.
Other Changes
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v2.0.0/tsv-utils-v2.0.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt file in the release package.
Release 2.0.0 Changes: Named Field Support
Release 2.0.0 adds named field support to all tools in the tsv-utils toolkit. This is a significant usability improvement.
Named fields can be used with any file or data stream that has a header line. Named fields are enabled by the --H|header option. Field numbers can be used as well, just as in prior versions of the toolkit. Glob-style wildcards can be used, and escapes can be used to specify field names containing special characters.
Details are available in the Field Syntax section of the Tools Reference manual.
Examples - Assume a file with the header fields:
1 test_name
2 run
3 elapsed_time
4 user_time
5 system_time
6 max_memory
Commands like the following can be used:
$ # Select individual fields, like 'cut'
$ tsv-select data.tsv -H -f user_time # Field 4
$ tsv-select data.tsv -H -f test_name,user_time # Fields 1,4
$ tsv-select data.tsv -H -f '*_time' # Fields 3,4,5
$ # Filter lines using numeric comparisons against individual fields
$ tsv-filter data.tsv -H --lt elapsed_time:100
$ tsv-filter data.tsv -H --gt elapsed_time:100 --lt system_time:20
$ # Statistical summaries
$ tsv-summarize data.tsv -H --median elapsed_time
$ tsv-summarize data.tsv -H --median '*_time'
$ tsv-summarize data.tsv -H --group-by test_name --median '*_time'
$ # Uniq'ing on a field
$ tsv-uniq data.tsv -H -f test_name
$ # Joins - Assume another file 'test_info.tsv' with 'test_name' and
$ # 'expected_time' fields. A join can be performed using column names.
$ tsv-join -H -f test_info.tsv data.tsv --key-fields test_name --append-fields expected_time
See the reference docs or online help for details on specific tools. There is also documentation in the Tools Overview section of the main project README file.
Named field support addresses enhancement request #25. It was implemented via PRs #284 through #300.
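As a rough illustration of how header names and glob-style wildcards can map onto field numbers, the sketch below resolves a field specification against the example header using Python's `fnmatch` module. The function name `resolve_fields` is hypothetical, and this is not how tsv-utils implements the feature; it only reproduces the field numbers noted in the `tsv-select` examples above.

```python
from fnmatch import fnmatchcase

# Header fields from the example file above.
HEADER = ["test_name", "run", "elapsed_time", "user_time", "system_time", "max_memory"]

def resolve_fields(spec, header):
    """Map a comma-separated spec of names, globs, or numbers to 1-based field numbers."""
    fields = []
    for item in spec.split(","):
        if item.isdigit():
            # Plain field numbers pass through unchanged.
            fields.append(int(item))
            continue
        matches = [i for i, name in enumerate(header, start=1) if fnmatchcase(name, item)]
        if not matches:
            raise ValueError(f"no field matches {item!r}")
        fields.extend(matches)
    return fields

print(resolve_fields("user_time", HEADER))            # [4]
print(resolve_fields("test_name,user_time", HEADER))  # [1, 4]
print(resolve_fields("*_time", HEADER))               # [3, 4, 5]
```

The sketch ignores escapes for special characters; see the Field Syntax section of the Tools Reference manual for the actual rules.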
Other Changes
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.1/tsv-utils-v1.6.1_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt file in the release package.
Release 1.6.1 Changes:
tsv-split --lines-per-file functionality (PR #280).
@safe attribution changes to enable Windows compilation of bufferedByLine (Issue #282, PR #283).
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.0/tsv-utils-v1.6.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.6.0/tsv-utils-v1.6.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt file in the release package.
Release 1.6.0 Changes:
Prebuilt binaries have been updated to use the latest LDC compiler (1.20.1).
tsv-select: New feature, the ability to exclude fields (PR #267).
Fields to exclude are specified with the --e|exclude option. Examples:
$ # Drop the first field, keep everything else.
$ # Equivalent to `cut -f 2- file.tsv`
$ tsv-select --exclude 1 file.tsv
$ # Drop fields 3-10, keep everything else
$ tsv-select --exclude 3-10 file.tsv
See the tsv-select reference for more information.
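To make the exclusion behavior concrete, here is a minimal Python sketch of dropping fields by number from a tab-separated line. The helper `exclude_fields` is a hypothetical name used for illustration only; it models the described behavior and is not taken from the tsv-select source.

```python
def exclude_fields(line, excluded, delim="\t"):
    """Return the line with the given 1-based field numbers dropped."""
    fields = line.split(delim)
    kept = [f for i, f in enumerate(fields, start=1) if i not in excluded]
    return delim.join(kept)

row = "a\tb\tc\td"

# Drop the first field, keep everything else (like `cut -f 2-`).
first_dropped = exclude_fields(row, {1})

# Drop fields 3-10; numbers beyond the last field are simply ignored here.
range_dropped = exclude_fields(row, set(range(3, 11)))

print(first_dropped.split("\t"))  # ['b', 'c', 'd']
print(range_dropped.split("\t"))  # ['a', 'b']
```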
New tool: tsv-split (PR #270)
tsv-split is used to split one or more input files into multiple output files. There are three modes of operation:
Fixed number of lines per file (--l|lines-per-file NUM): Each input block of NUM lines is written to a new file. This is similar to the Unix split utility.
Random assignment (--n|num-files NUM): Each input line is written to a randomly selected output file. Random selection is from NUM files.
Random assignment by key (--n|num-files NUM, --k|key-fields FIELDS): Input lines are written to output files using fields as a key. Each unique key is randomly assigned to one of NUM output files. All lines with the same key are written to the same file.
Examples:
$ # Split a file into files of 10,000 lines each.
$ tsv-split data.txt --lines-per-file 10000 --dir split_files
$ # Split a file into 1000 files with lines randomly assigned.
$ tsv-split data.txt --num-files 1000 --dir split_files
$ # Randomly assign lines to 1000 files using field 3 as a key.
$ tsv-split data.tsv --num-files 1000 --key-fields 3 --dir split_files
See the tsv-split reference for more information.
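The fixed-size and keyed modes described above can be sketched in a few lines of Python. This is an illustrative model only: `chunk_lines` and `assign_by_key` are hypothetical names, and remembering each key's bucket in a dictionary is an assumption made for clarity, not a description of how tsv-split assigns keys internally.

```python
import random

def chunk_lines(lines, lines_per_file):
    """Fixed-size mode: each block of lines_per_file lines forms one output group."""
    return [lines[i:i + lines_per_file] for i in range(0, len(lines), lines_per_file)]

def assign_by_key(lines, num_files, key_field, seed=None, delim="\t"):
    """Keyed mode: each unique key is randomly assigned to one of num_files buckets."""
    rng = random.Random(seed)
    bucket_of = {}  # key -> bucket index, chosen once per key
    buckets = [[] for _ in range(num_files)]
    for line in lines:
        key = line.split(delim)[key_field - 1]
        if key not in bucket_of:
            bucket_of[key] = rng.randrange(num_files)
        buckets[bucket_of[key]].append(line)
    return buckets

lines = ["a\t1", "b\t2", "a\t3", "c\t4", "b\t5"]
print(chunk_lines(lines, 2))                      # three groups of sizes 2, 2, 1
print(assign_by_key(lines, 3, key_field=1, seed=0))
```

Whatever bucket a key lands in, all lines sharing that key end up together, which is the property the keyed mode guarantees.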
To download and unpack prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.5.0/tsv-utils-v1.5.0_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.5.0/tsv-utils-v1.5.0_osx-x86_64_ldc2.tar.gz | tar xz
Installation instructions are in the ReleasePackageReadme.txt file in the release package.
Release 1.5.0 Changes:
Prebuilt binaries have been updated to use the latest LDC compiler (1.20.0).
tsv-filter: Field list support (PR #259).
Field lists provide a compact way to specify multiple fields for a command. Most tsv-utils tools already support field lists; now tsv-filter does as well. Examples:
$ # Select lines where fields 1-10 are not empty.
$ tsv-filter --not-empty 1-10 data.tsv
$ # Select lines where fields 1-5 and 17 are less than 100
$ tsv-filter --lt 1-5,17:100 data.tsv
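The following Python sketch shows one way a field list such as `1-5,17` could be expanded and applied to a numeric test. The names `parse_field_list` and `all_less_than` are hypothetical, only ascending ranges are handled, and the code is a simplified model of the behavior described above rather than the tsv-filter implementation.

```python
def parse_field_list(spec):
    """Expand a field list like '1-5,17' into 1-based field numbers."""
    fields = []
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            fields.extend(range(start, end + 1))
        else:
            fields.append(int(part))
    return fields

def all_less_than(line, spec, delim="\t"):
    """True when every listed field is numerically below the value; spec like '1-5,17:100'."""
    field_spec, value = spec.rsplit(":", 1)
    fields = line.split(delim)
    return all(float(fields[i - 1]) < float(value) for i in parse_field_list(field_spec))

print(parse_field_list("1-5,17"))  # [1, 2, 3, 4, 5, 17]

row = "\t".join(["5"] * 17)
print(all_less_than(row, "1-5,17:100"))  # True
```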
tsv-filter: New field length tests based on either characters or bytes (PR #258).
The new operators allow filtering on field length. Field length can be measured in either characters or bytes. (Characters can occupy multiple bytes in UTF-8.) Examples:
$ # Keep only lines where field 3 is less than 50 characters
$ tsv-filter --char-len-lt 3:50 data.tsv
$ # Find lines where field 5 is more than 20 bytes
$ tsv-filter --byte-len-gt 5:20 data.tsv
Character length tests have names of the form: --char-len-[eq|ne|lt|le|gt|ge]. Byte length tests have names of the form: --byte-len-[eq|ne|lt|le|gt|ge].
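The difference between the two measurements is easy to see in Python. The sketch below counts Unicode code points for the character length, which is an assumption about what "characters" means here, and UTF-8 encoded bytes for the byte length. The helper names are hypothetical.

```python
def char_len(field):
    """Length in characters (Unicode code points)."""
    return len(field)

def byte_len(field):
    """Length in bytes when the field is encoded as UTF-8."""
    return len(field.encode("utf-8"))

field = "na\u00efve"  # "naive" with a diaeresis on the i; that letter takes 2 bytes in UTF-8
print(char_len(field), byte_len(field))  # 5 6

# For pure ASCII text the two lengths agree.
print(char_len("plain"), byte_len("plain"))  # 5 5
```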
tsv-filter: Improved error messages when invalid regular expressions are used.
The error message printed by tsv-filter now includes the error text provided by the D regular expression engine. This is helpful when trying to debug complex regular expressions. Examples:
$ # Old error message (tsv-filter 1.4.4)
$ tsv-filter --regex 4:'abc(d|e' data.tsv
[tsv-filter] Error processing command line arguments: Invalid values in option: '--regex 4:abc(d|e'. Expected: '--regex <field>:<val>' where <field> is a number and <val> is a regular expression.
$ # New error message (tsv-filter 1.5.0)
[tsv-filter] Error processing command line arguments: Invalid regular expression: '--regex 4:abc(d|e'. no matching ')'
Pattern with error: `abc(d|e` <--HERE-- ``
Expected: '--regex <field>:<val>' or '--regex <field-list>:<val>' where <val> is a regular expression.
The formatting of the message can be improved and is likely to be updated in the future.
tsv-uniq: Performance improvements (PRs #234, #235).
Better memory management and other changes improved tsv-uniq performance by 5-35% depending on the operation.
tsv-sample: Performance improvements reading large data blocks from standard input (PR #238).
Sampling and shuffling operations requiring that all data be read into memory were unnecessarily slow when large amounts of data were read from standard input. Performance issues were noticed with data sizes larger than 10 GB. This is now fixed.
Sample bash scripts included in release package (PR #254).
Sample versions of the tsv-sort and tsv-sort-fast scripts described on the Tips and Tricks page are now included in the repository and in prebuilt binary packages.
Changes:
New tsv-sample option --i|inorder
This option preserves input order when using simple or weighted random sampling. These sampling modes are engaged when a sample size is selected via the --n|num NUM option. Documentation was updated to better reflect the distinction between shuffling the full data set and random sampling, which selects a subset of lines. (PR #226)
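A small Python sketch can illustrate what preserving input order means for simple random sampling. It covers only the unweighted case, the function name `sample_lines` is hypothetical, and sorting the chosen indices is simply one straightforward way to model the behavior, not a claim about how tsv-sample does it.

```python
import random

def sample_lines(lines, num, inorder=False, seed=None):
    """Pick num lines at random; with inorder=True, keep them in input order."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(lines)), min(num, len(lines)))
    if inorder:
        # Restore the original relative order of the selected lines.
        picked.sort()
    return [lines[i] for i in picked]

lines = [f"line {i}" for i in range(1, 21)]
print(sample_lines(lines, 5, seed=1))                # five lines in random order
print(sample_lines(lines, 5, inorder=True, seed=1))  # the same five lines, input order
```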
tsv-summarize --min and --max operators changed to preserve original input string
The prior behavior of the operators was to read the values to a double, then use numeric formatting to print the recorded double. In some cases this would cause the original input to change, especially if it was a long format number, for example, 16 digits long. (PR #220)
The prior behavior makes sense for calculations like mean and median, but not for min and max. In particular, preserving the original values allows them to be joined with or compared to the original data.
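The effect can be sketched in Python by comparing values numerically while returning the original strings. The function name `min_max_preserving` is hypothetical, and Python's float formatting differs from the D formatting used by the tool, so the reformatted value shown is only indicative of the kind of change the prior behavior could introduce.

```python
def min_max_preserving(values):
    """Compare as numbers, but return the original input strings."""
    lo = min(values, key=float)
    hi = max(values, key=float)
    return lo, hi

values = ["12345678901234567.5", "3.10", "42"]

# Strings come back exactly as they appeared in the input.
print(min_max_preserving(values))  # ('3.10', '12345678901234567.5')

# Round-tripping through a float can alter the text of a value.
print(str(float("3.10")))  # 3.1
```

Because the returned strings are unchanged, they can be matched directly against the source data, for example as join keys.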
Prebuilt binaries have been updated to use the latest LDC compiler (1.17.0).
To download and unpack the prebuilt binaries:
$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.4/tsv-utils-v1.4.4_linux-x86_64_ldc2.tar.gz | tar xz
$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.4/tsv-utils-v1.4.4_osx-x86_64_ldc2.tar.gz | tar xz