A set of Unix shell command line tools for quick and convenient batch processing of tabular text files (a.k.a., tab-delimited, tsv, csv, or flat data file format) with a header line. Provides column reference by name, automatic delimiter and compression detection for per-line transformations, sql-like group-by operation and relational join.
URL: https://github.com/stefan-schroedl/tabulator
Author: Stefan Schroedl
Unix/Linux comes with several tools, such as cut,
paste,
join,
sort,
to
process tabular text data files (a.k.a., tab-delimited, csv, tsv,
or flat file format). However, they have shortcomings that often requires
additional scripting and prevents them from being directly used in one-liners.
For example, having to count columns to pass as arguments is cumbersome and
error-prone; join
needs presorted files, and works with a single key column;
there is no direct 'group-by' functionality.
One remedy would be to load the data into a relational database or noSQL system (e.g., Hadoop/Pig) first, but these might not be available or be more time-consuming for short, ad-hoc tasks.
Tabulator
is a collection of command-line shell tools for Unix/Linux platforms
that build on native shell programs and can be used as filters, but make them more
easy and flexible to use. In particular, they
Installation is easy -- just unpack the tarball, add the unpacked directory to your PATH.
Tabulator is licensed under the MIT LICENSE, see LICENSE for details.
Here is a brief list of the programs together with their main functionality.
Each one provides more documentation and examples when called with the -m
or
-h
options. A common assumption is that the first line in input files contains
the column names.
tblcat:
concatenate files of the same data format without header repetition.tblcmd:
execute a program on the body of a file (e.g., sort
, uniq
),
without affecting the header.tbldesc:
for each column, summarize its type (e.g., char, int, float),
percentage of undefined values, min/max/mean/median/std, etc. Can also provide
correlation coefficients with a target column.Example: Suppose
file
is
name,house_nr,height,shoe_size
arthur,42,6.1,11.5
berta,101,5.5,8
chris,333,5.9,10
don,77,5.9,12.5
Then
tbldesc file
prints:
summarizing file_desc (4 lines, target column: shoe_size)
field name type char% uniq min max avg std mse corr prob%
1 name char 100 4 [arthur; berta; chris; don]
2 house_nr int 0 4 42 333 138 114 172 -0.287 71.25
3 height float 0 3 5.5 6.1 5.85 0.218 4.89 0.812 18.82
4 shoe_size float 0 4 8 12.5 10.5 1.7 0.0 1.0 0.00
tblmap:
simple line-wise ("map") computation similar to awk.Example: Compute ratio of columns
sales
andclients
for lines where columnregion
has valueus
:
tblmap -s'region=="us"' -c'sales_per_client=sales/client' file
tblred:
compute ("reduce") aggregations (e.g., sum
, min
, max
, avg
,
etc.) over groups of keys -- similar to the SQL group by
operator.Example:
tblred -k'region' 'sales_ratio=sales/sum(sales)'
computes for each line proportion of column
sales
to total sales for all lines with the same value of columnregion
.
tbljoin:
Relational join. In contrast to Unix join
, input files don't
need to be pre-sorted, and multiple join columns can be specified.Example: Suppose
file1
is
name,street,house
zorro,desert road,5
john,main st,2
arthur,pan-galactic bypass,42
arthur,main st,15
and
file2
is
name,street,phone
john,main st,654-321
arthur,main st,121-212
john,round cir,123-456
Then
tbljoin file1 file2
gives
name,street,house,phone
arthur,main st,15,121-212
john,main st,2,654-321
tblhist:
computation and ascii-plotting of the histogram of column values.tblsplit:
split a file into several ones based on a column value.Example: Suppose
file
is
continent,country
americas,us
americas,mx
europe,de
europe,fr
Then
tblsplit -rk'continent' file
generates two files,file.select.continent=americas
:
country
us
mx
and
file.select.continent=europe
:
country
de
fr
tblsort:
interface for unix sort.tbltex:
formatting for latex tables.tbltranspose:
transposition of rows and columns.tbluniq:
check for and cut out duplicate columns; also, discover functional
value dependencies.tblcolumn:
format columns in a more readable way, using the unix 'column'
program, and aligning and shortening numbers.tblless:
page through formatted column output (calls tblcolumn).tblmap
(since it translates into an cut
/awk
script) and for tbljoin
(using sort
and join
), For tblred
, presort the file first, then run it
with option -s
.tblmap
, tbljoin
, and tblcmd
work by first translating the command into
a standalone shell script.'
or "
; it is
the user's responsibility to ensure that the column delimiter cannot occur
within column values.tbljoin
, tblred
, tblhist
, tbluniq
, tblcat
, and tbltex
have some
restrictions when run as a filter (repeated reading is necessary in some cases).