Tsv Utils Versions

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

v1.4.3

4 years ago

Two changes:

  • New tsv-pretty option --a|auto-preamble - Enables automatic detection of preambles, that is, lines at the start of the file that should be printed as is, without reformatting into pretty-printed columns. For more information and examples see PR #218.
  • Prebuilt binaries have been updated to use the latest LDC compiler (1.16.0).

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.3/tsv-utils-v1.4.3_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.3/tsv-utils-v1.4.3_osx-x86_64_ldc2.tar.gz | tar xz

v1.4.2

4 years ago

One change:

  • Fixes incorrect comma use in the dub.json file. The fix is needed to support planned changes in dub and is also needed for the dlang CI pipelines.

There are no changes to any of the tools.

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.2/tsv-utils-v1.4.2_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.2/tsv-utils-v1.4.2_osx-x86_64_ldc2.tar.gz | tar xz

v1.4.1

5 years ago

This release contains one new feature and several performance improvements:

  • tsv-uniq --number - Line numbering grouped by key (new feature). The key is either the whole line or a subset of fields. Each unique key gets its own set of line numbers. See the tsv-uniq reference for details.
  • Improved I/O read performance. This was achieved by using a buffered version of std.stdio.File.byLine. Especially effective for narrow files. Tools using byLine (most of the tools) typically see a 10-40% performance gain, depending on tool and type of file (measured on OS X). Implementation documentation: tsv_utils.common.utils.bufferedByLine.
  • Updated compiler to LDC 1.15.0 for prebuilt binaries (frontend/druntime/phobos 2.085.1). This includes an update to LLVM 8.0 and a couple of improvements to memory allocation and GC collection. The latter improved performance of several of the tools, especially tools like tsv-join that allocate large amounts of memory.
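The grouped line numbering described for tsv-uniq --number can be illustrated with a minimal Python sketch. This is not the tool's implementation (tsv-utils is written in D); the function name `number_by_key`, its parameters, and the output as (line, number) pairs are assumptions made for illustration. See the tsv-uniq reference for the actual output format.

```python
from collections import defaultdict

def number_by_key(lines, key_fields=None, delimiter="\t"):
    """Yield (line, n), where n is the running count of the line's key.

    key_fields: 1-based field indices forming the key, or None to use
    the whole line as the key (hypothetical parameter names).
    """
    counts = defaultdict(int)
    for line in lines:
        if key_fields is None:
            key = line
        else:
            fields = line.split(delimiter)
            key = tuple(fields[i - 1] for i in key_fields)
        counts[key] += 1          # each unique key has its own counter
        yield line, counts[key]

rows = ["red\t1", "blue\t2", "red\t3", "red\t1"]
by_field = list(number_by_key(rows, key_fields=[1]))   # key is field 1
by_line = list(number_by_key(rows))                    # key is the whole line
```

With field 1 as the key, the three "red" lines are numbered 1, 2, 3 and the "blue" line is numbered 1. With the whole line as the key, only the repeated line "red\t1" reaches 2.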

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.1/tsv-utils-v1.4.1_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.4.1/tsv-utils-v1.4.1_osx-x86_64_ldc2.tar.gz | tar xz

v1.3.2

5 years ago

This release modifies tsv-sample random value printing so most values are printed in decimal notation, without exponents. This is for subsequent processing by GNU sort. Sorting numbers with exponents requires "general numeric" order (option 'g'), which is much slower than "numeric" order (option 'n'). See Shuffling large files on the Tips and Tricks page for more info.
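The difference between exponent and decimal notation can be seen in a small Python sketch. It only illustrates the formatting idea; the helper name `format_decimal` and the choice of 10 fractional digits are assumptions, not the format tsv-sample actually uses.

```python
def format_decimal(value, digits=10):
    """Format a float in fixed-point notation, never with an exponent.

    The digit count is an arbitrary choice for this sketch.
    """
    return f"{value:.{digits}f}"

small = 0.00000123
default_text = repr(small)            # Python's default repr uses an exponent here
decimal_text = format_decimal(small)  # fixed-point text with no exponent
```

Values written in the fixed-point form can be ordered with "numeric" sort, while the exponent form would need the slower "general numeric" sort mentioned above.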

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.3.2/tsv-utils-v1.3.2_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.3.2/tsv-utils-v1.3.2_osx-x86_64_ldc2.tar.gz | tar xz

v1.3.1

5 years ago

In this release:

  • tsv-sample: Adds full-line as key to distinct sampling. This completes the work that has been done on sampling over the last few point releases. tsv-sample now supports a fair set of sampling modes. Performance is also good, in keeping with the tradition of the other tsv-utils tools.
  • Prebuilt binaries have been updated to use the latest LDC compiler (1.12.0). This is a significant performance boost to regex search in tsv-filter. Unfortunately csv2tsv is a little slower.
  • The build system now supports using LDC's LTO compiled druntime and phobos libraries (those shipped with the compiler). This eliminates the need to download the druntime and phobos source code at build time. This is more convenient and supports package managers better.
  • Code level documentation now generates good documentation when used with the dpldocs documentation system. Go to the tsv-utils code documentation to see the result.
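Distinct sampling with the full line as the key can be sketched as follows: every key is mapped deterministically to a value in [0, 1), and all lines sharing a key are kept or dropped together. This is a simplified illustration, not the tsv-sample implementation. The use of SHA-256 from `hashlib`, the `seed` parameter, and the function names are all assumptions made for this example.

```python
import hashlib

def keep_key(key, prob, seed=0):
    """Map a key deterministically to [0, 1) and keep it if below prob."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < prob

def distinct_sample(lines, prob, key_fields=None, delimiter="\t", seed=0):
    """Yield lines whose key is selected; key_fields=None uses the full line."""
    for line in lines:
        if key_fields is None:
            key = line
        else:
            fields = line.split(delimiter)
            key = delimiter.join(fields[i - 1] for i in key_fields)
        if keep_key(key, prob, seed):
            yield line

data = ["a\t1", "b\t2", "a\t1", "c\t3", "b\t2", "a\t1"]
sampled = list(distinct_sample(data, 0.5))
```

Because the decision depends only on the key, a duplicated line either appears in the output every time it occurs in the input or not at all.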

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.3.1/tsv-utils-v1.3.1_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.3.1/tsv-utils-v1.3.1_osx-x86_64_ldc2.tar.gz | tar xz

v1.2.3

5 years ago

This release adds several new sampling algorithms that improve runtime performance and memory utilization for a number of sampling use cases. There are no new forms of sampling, just additional algorithms. The new algorithms:

  • A skip sampling implementation of Bernoulli sampling.
  • An implementation of reservoir sampling "Algorithm R" used for unweighted random sampling.
  • A line order randomization algorithm based on array shuffling.
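The three techniques named above can be sketched in Python. These are textbook versions written for illustration under the assumption that the input fits the simple iterator and list model used here; they are not the tsv-sample code, and the function names are hypothetical.

```python
import math
import random

def bernoulli_skip_sample(lines, prob, rng):
    """Bernoulli sampling that draws one random number per selected line.

    The gap between selected lines is geometrically distributed, so the
    number of lines to skip is computed directly instead of testing each line.
    """
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    if prob == 1.0:
        yield from lines
        return
    log_q = math.log(1.0 - prob)
    it = iter(lines)
    while True:
        u = 1.0 - rng.random()              # uniform in (0, 1]
        skip = int(math.log(u) / log_q)     # lines to pass over
        try:
            for _ in range(skip):
                next(it)
            yield next(it)
        except StopIteration:
            return

def reservoir_sample(lines, k, rng):
    """Algorithm R: uniform random sample of k lines in a single pass."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = rng.randint(0, i)           # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir

def shuffle_lines(lines, rng):
    """Randomize line order with a Fisher-Yates array shuffle."""
    items = list(lines)
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)
        items[i], items[j] = items[j], items[i]
    return items

rng = random.Random(42)
data = [f"line{i}" for i in range(1000)]
picked = list(bernoulli_skip_sample(data, 0.1, rng))
subset = reservoir_sample(data, 10, rng)
shuffled = shuffle_lines(data, rng)
```

Skip sampling preserves input order and reads lines lazily, reservoir sampling holds only k lines in memory, and the array shuffle needs the full input in memory.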

Formal performance benchmarks have not been run. However, tests run on Mac OS as part of development show favorable results relative to other available tools, including GNU shuf.

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.3/tsv-utils-v1.2.3_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.3/tsv-utils-v1.2.3_osx-x86_64_ldc2.tar.gz | tar xz

v1.2.2

5 years ago

This release adds new capabilities and performance improvements to tsv-sample. Documentation was also updated to improve clarity. Key changes:

  • New feature: Simple random sampling with replacement - All lines from input sources are read in, then lines are repeatedly selected at random and written out. Lines can be output multiple times. The process continues until the specified number of samples has been written. Invoke using the -r|--replace and -n|--num NUM options.
  • New feature: Random value printing - A new feature was added for generating random values for all input lines. In the default case it shows the values used for Bernoulli sampling trials. It can also be used with 'distinct' sampling to show the sampling bucket a line is placed in based on the key-fields specified. This feature is invoked with the --gen-random-inorder option. A related feature, --print-random, was updated so that it is now supported by all applicable sampling modes.
  • Line order randomization performance improvements: One of the basic tsv-sample use cases is line order randomization. The case where all input lines are being permuted was rewritten and is now quite a bit faster and uses less memory. This applies to both weighted and unweighted sampling. (The case where a subsampling is being done via the -n|--num option uses reservoir sampling and was already fast.)
  • Command line option change - The option for specifying the probability used for Bernoulli sampling was changed from -r|--rate to -p|--prob. This was done to create a more consistent set of option names for new features and features that may be added in the future.
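Simple random sampling with replacement, as described in the first item above, can be sketched in a few lines of Python. This is an illustration of the technique rather than the tsv-sample implementation; the function name and the explicit `rng` argument are assumptions.

```python
import random

def sample_with_replacement(lines, num, rng):
    """Read all lines, then emit num lines chosen uniformly at random.

    The same line may be emitted more than once, and num may exceed
    the number of input lines.
    """
    pool = list(lines)          # all input is read before sampling starts
    if not pool:
        return
    for _ in range(num):
        yield pool[rng.randrange(len(pool))]

rng = random.Random(7)
pool = ["a", "b", "c"]
drawn = list(sample_with_replacement(pool, 10, rng))
```

Since every draw is independent of the others, asking for more samples than there are input lines is valid and simply produces repeats.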

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.2/tsv-utils-v1.2.2_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.2/tsv-utils-v1.2.2_osx-x86_64_ldc2.tar.gz | tar xz

v1.2.1

5 years ago

This release adds features for tsv-utils automated tests. There are no changes to any of the tools.

The new testing features add support for different correct output results for different compiler/library versions. The main case is for changes to error message text, which in some cases includes text from the phobos library.

Alternate test outputs were added for a planned change to Phobos in an upcoming release. This was bundled into a tagged release to support the D language project tester where tsv-utils is used.

To download and unpack the prebuilt binaries:

$ # Linux
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.1/tsv-utils-v1.2.1_linux-x86_64_ldc2.tar.gz | tar xz

$ # MacOS
$ curl -L https://github.com/eBay/tsv-utils/releases/download/v1.2.1/tsv-utils-v1.2.1_osx-x86_64_ldc2.tar.gz | tar xz

v1.2.0

5 years ago

This release changes the repository name from eBay/tsv-utils-dlang to eBay/tsv-utils. This better reflects the functionality provided by the TSV Utilities. There are no other changes. Please report any issues found with the name change on the Issues page.

v1.1.20

5 years ago

Release v1.1.20 contains a few minor updates:

  • tsv-summarize: unique-count operator - Performance improvement by avoiding unnecessary string copies. 40% faster on one benchmark.
  • Bash completion fixes