qsv Release Notes

CSVs sliced, diced & analyzed.

0.127.0

2 weeks ago

📊 Enhanced Frequency Analysis 📊

This is a quick release adding several enhancements for more detailed frequency analysis. The frequency command now includes a percentage column, rolls up remaining values into an "Other" row, and supports limiting unique counts and negative limits. These options provide additional context for Datapusher+, qsv pro and describegpt so their metadata inferences are more accurate and comprehensive.

Previously, for a 775-row CSV file containing one column named state with entries for all 50 states, frequency only showed[^1]:

qsv frequency freq_state_example.csv | qsv table
field  value  count
state  NY     100
state  NJ     70
state  CA     60
state  MA     55
state  FL     45
state  TX     43
state  NM     40
state  AZ     39
state  NV     38
state  MI     35

Now, there's a new percentage column and an "Other" values rollup, both of which have configurable options:

qsv frequency freq_state_example.csv | qsv table
field  value       count  percentage
state  NY          100    12.90323
state  NJ          70     9.03226
state  CA          60     7.74194
state  MA          55     7.09677
state  FL          45     5.80645
state  TX          43     5.54839
state  NM          40     5.16129
state  AZ          39     5.03226
state  NV          38     4.90323
state  MI          35     4.51613
state  Other (40)  250    32.25806
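
The new limit options compose with this output. Here's a hedged sketch of how they might be combined - the values are purely illustrative, and the exact semantics are per qsv frequency --help:

# negative limit: only show values occurring at least 40 times
qsv frequency --limit -40 freq_state_example.csv

# compile a comprehensive tally when a column has fewer than 100 unique values
qsv frequency --limit 10 --lmt-threshold 100 freq_state_example.csv

# cap output for all-unique (ID-style) columns at 10 entries
qsv frequency --unq-limit 10 freq_state_example.csv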

This release is also out of cycle to address a big performance regression in the excel command caused by unnecessary formula info retrieval for the --error-format option introduced in 0.126.0. This has been fixed, and the excel command is now back to its speedy self.



Full Changelog: https://github.com/jqnatividad/qsv/compare/0.126.0...0.127.0

[^1]: With its default --limit setting of 10, frequency only shows the top 10 unique values in the column, sorted by occurrence.

0.126.0

3 weeks ago

🤖 Expanded Metadata Inferencing 🤖

describegpt headlines this release with its new ability to support local Large Language Models (LLMs) using popular tools that serve them through APIs, such as Ollama and Jan. This broadens the tool's utility in diverse AI environments. Beyond OpenAI, qsv can now use other popular LLMs like Llama 3, Mistral, and Gemma. It also unlocks expanded metadata inferencing capabilities in qsv pro.

Several commands got additional options: cat with --no-headers support in the rowskey subcommand; excel with new options like --error-format and a short --metadata mode; and foreach with a --dry-run option. frequency also got new options, including --unq-limit for limiting unique counts, support for negative limits, and a --lmt-threshold option for compiling comprehensive frequencies below a threshold. slice now supports negative indices and new JSON output options, providing more flexibility in data slicing.

This is all rounded out with sqlp improvements, including support for single-line comments in SQL scripts and a special SKIP_INPUT value to skip input preprocessing when using table functions directly in Polars SQL (e.g. read_csv() and read_parquet()) - all while increasing performance thanks to the Polars engine being upgraded to 0.39.2.


New Features

  • cat: Added --no-headers support to the rowskey subcommand.
  • describegpt: Added support for local Large Language Models (LLMs) served through tools such as Ollama and Jan, broadening the tool's utility in diverse AI environments.
  • excel: Introduced new options: --error-format for better error handling and a short --metadata JSON mode.
  • foreach: Added a --dry-run option, allowing users to preview the results of scripts without executing them.
  • frequency: New options: --unq-limit for limiting unique counts; support for negative limits to only show frequencies >= abs(limit); and --lmt-threshold to compile comprehensive frequencies when the number of unique values is below the threshold - all providing more detailed control over frequency analysis.
  • slice: Support for negative indices to slice from the end and new JSON output options (see the sketches after this list).
  • sqlp: Now supports single-line comments in SQL scripts and a special SKIP_INPUT value for more efficient data loading. The Polars engine has also been upgraded to 0.39.2, providing enhanced performance and stability.
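
A couple of hedged sketches of the new slice and sqlp options (file names are hypothetical, and the exact flag spellings and SKIP_INPUT placement are assumptions - check each command's --help):

# slice the last 5 rows (negative index) and emit JSON
qsv slice --start -5 --json data.csv

# skip input preprocessing and let Polars read a Parquet file directly
qsv sqlp SKIP_INPUT "select * from read_parquet('data.parquet') limit 10"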

Changes and Optimizations

  • Performance Enhancements: Micro-optimizations in the datefmt and validate commands, and an increased default --infer-len in sqlp for improved performance.
  • Dependency Updates: Numerous updates including bumping Luau, jql-runner, pyo3, and other dependencies to enhance stability and security.
  • Benchmarks Added: New sqlp vs duckdb performance benchmarks to ensure there are no performance regressions between releases. Right now, sqlp is faster than duckdb in most cases (thanks to Polars - see the latest TPC-H benchmarks), and we want to keep it that way.

Security and Robustness

  • Security Fixes: Updated rustls to fix a specific CVE, and other minor fixes to enhance the security and robustness of network and data processing features.
  • Bug Fixes: Various bug fixes including improvements in error formatting in excel and robustness in fetch and fetchpost commands.


Full Changelog: https://github.com/jqnatividad/qsv/compare/0.125.0...0.126.0

0.125.0

1 month ago

In this release, we focused on the 🏎️ need for even more speed 🏎️.

This was done primarily by tweaking several supporting qsv crates. qsv-docopt now parses command-line arguments slightly faster. qsv-stats, the crate behind commands like stats, schema, tojsonl, and frequency, has been further optimized for speed. qsv-dateparser has been updated to support new timezone handling options in datefmt. qsv-sniffer also got a speed boost.

Per the benchmark suite, stats is 25% faster (1.563 secs vs 2.067 secs) when computing the 13 "streaming" stats and 14% faster when computing --everything (17 columns of additional stats - 3.149 secs vs 3.656 secs) for the 1M row, 41 column, 520mb sample of NYC's 311 data.

The count command has been refactored to utilize Polars' SQLContext, which leverages LazyFrames evaluation to automagically count even very large files in just a few seconds. Previously, count was already using Polars, but it mistakenly fell back to a slower counting mode. Now, it consistently delivers fast performance, even without an index. On the same benchmark suite, it takes 0.052 secs vs 0.503 seconds - almost 10x faster!

As count is not just a top-level command but also a helper used by several other qsv commands, this gives the entire suite a nice performance boost.
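
For example, counting a large file now takes the fast Polars path with or without an index (file name hypothetical):

# near-instant row count via Polars' LazyFrames evaluation - no index needed
qsv count nyc_311_sample.csv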

Continuing on the performance front, the excel command now has a new short --metadata mode, allowing users to get a "shorter" version of the metadata report that only lists the workbook's top-level metadata (sheet index, sheet name, sheet type, visibility) instead of the full metadata report (which also has info like row counts, column metadata, etc.). On the benchmark suite, the short metadata report takes all of 0.005 secs vs 11.237 secs for the 1M row xlsx version of the same NYC 311 data - more than 3 orders of magnitude faster! (It may actually be even faster, as 0.005 secs is at the limit of what hyperfine can measure.)

The datefmt command also got some major enhancements with new timezone handling and timestamp parsing options, though at the cost of a small 15% performance penalty.

Lastly, we are excited to announce that qsv will be featured at the CSV,Conf,V8 conference in Puebla, Mexico on May 28-29. I'll be presenting a talk titled "qsv: A Blazing Fast CSV Data-Wrangling Toolkit". Hope to see you there!



Full Changelog: https://github.com/jqnatividad/qsv/compare/0.124.1...0.125.0

0.124.1

2 months ago

Datapusher+ "Speed of Insight" Release! 🚀🚀🚀

This release is all about speed, speed, speed! We've made qsv even faster by leveraging Polars' multithreaded, mem-mapped CSV reader to get near-instant row counts of large CSV files, and near-instant SQL queries and aggregations with Datapusher+ - automagically inferring metadata and giving you quick insights into your data in seconds!

We're demoing our qsv-powered Datapusher+ at the March 2024 installment of CKAN Monthly Live on March 20, 2024, 13:00-14:00 UTC. Join us!

Beyond pushing data reliably at speed into your CKAN Datastore (it pushes real good! 😉), DP+ does some extended analysis, processing and enrichment of the data so it can be readily Used.

Both fetch and fetchpost commands now also have a --disk-cache option and are fully synched - forming the foundation for high-speed data enrichment from Web Services - including datHere's forthcoming, fully-integrated Data Enrichment Service.
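
As a hedged sketch of the new caching option (column and file names are hypothetical, and --new-column is an assumption):

# fetch each URL in the 'url' column, caching web service responses to disk
qsv fetch url --new-column response --disk-cache data.csv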

🏇🏽 Hi-ho Quicksilver, away! 🏇🏽



Full Changelog: https://github.com/jqnatividad/qsv/compare/0.123.0...0.124.1

0.123.0

2 months ago

OPEN DATA DAY 2024 Release! 🎉🎉🎉

In celebration of Open Data Day, we're releasing qsv 0.123.0 - the biggest release ever with 330+ commits! qsv 0.123.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.

We've been baking qsv pro for a while now, and it's almost ready for release. qsv pro is a cross-platform Desktop Data Wrangling tool marrying an Excel-like UI with the power of qsv, backed by a cloud-based data cleaning, enrichment and enhancement service that's easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.

Stay tuned!

Highlights:

  • sqlp now has automatic read_csv() fast path optimization, often making optimized queries run dramatically faster - e.g. what took 6.09 seconds for a non-trivial SQL aggregation on an 18 column, 657mb CSV with 7.43 million rows now takes just 0.14 seconds with the optimization - 🚀 43.5x FASTER 🚀![^1]
# with fast path optimization turned off
/usr/bin/time qsv sqlp taxi.csv --no-optimizations "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
        6.09 real         6.82 user         0.16 sys

# with fast path optimization, fully exploiting Polars' multithreaded, mem-mapped CSV reader!
 /usr/bin/time qsv sqlp taxi.csv "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
        0.14 real         1.09 user         0.09 sys

# in contrast, csvq takes 72.46 seconds - 517.57x slower
/usr/bin/time csvq "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
+----------+---------------------+
| VendorID |  SUM(total_amount)  |
+----------+---------------------+
| 1        |  52377417.529256366 |
| 2        |    89959869.1264675 |
| 4        |   600584.6099999828 |
+----------+---------------------+
       72.46 real        65.15 user        75.17 sys

"Traditional" SQL engines

qsv and csvq both operate on "bare" CSVs. For comparison, let's contrast qsv's performance against "traditional" SQL engines that require setup and import (aka ETL). Not counting setup and import time (which alone takes several minutes), we get:

SQLite 3.43.2 takes 2.910 seconds - 20.79x slower

sqlite> .timer on
sqlite> select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID;
1,52377417.53
2,89959869.13
4,600584.61
Run Time: real 2.910 user 2.569494 sys 0.272972

PostgreSQL 15.6 using PgAdmin 4 v6.12 takes 18.527 seconds - 132.34x slower


Even with an index, qsv sqlp is still 5.96x faster.

  • sqlp now supports JSONL output format and adds compression support for Avro and Arrow output formats.
  • fetch now has a --disk-cache option, so you can cache web service responses to disk, complete with cache control and expiry handling!
  • jsonl is now multithreaded with additional --batch and --job options.
  • split now has three modes: split by record count, split by number of chunks, and split by file size (see the sketch after this list).
  • datefmt is a new top-level command for date formatting. We extracted it from apply to make it easier to use, and to set the stage for expanded date and timezone handling.
  • enum now has a --start option.
  • excel now has a --keep-zero-time option and improved datetime/duration parsing/handling with the upgrade of calamine from 0.23 to 0.24.
  • tojsonl now has --trim and --no-boolean options, and false positive boolean inferences have been eliminated.
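
Here's a hedged sketch of the three split modes (the --size, --chunks and --kb-size flag names are assumptions based on the mode descriptions - see qsv split --help):

# split into files of 10,000 records each
qsv split outdir --size 10000 data.csv

# split into 8 roughly equal chunks
qsv split outdir --chunks 8 data.csv

# split into files of roughly 500 KB each
qsv split outdir --kb-size 500 data.csv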


Full Changelog: https://github.com/jqnatividad/qsv/compare/0.122.0...0.123.0

[^1]: Measurements taken on an Apple Mac Mini 2023 model with an M2 Pro chip with 12 CPU cores & 32GB of RAM, running macOS Sonoma 14.4.

0.122.0

3 months ago

👉 REQUEST FOR USE CASES: 👈

Please help define the future of qsv. Add what you're currently using qsv for here - https://github.com/jqnatividad/qsv/discussions/1529

Not only does it help us catalog what use cases we should optimize for, posters will get higher priority access to the qsv pro preview.

Highlights:

  • qsvpy is now available in the prebuilt binaries for select platforms! It's a new qsv binary variant with the python feature, enabling the py command. Three subvariants are available - qsvpy310, qsvpy311 and qsvpy312, corresponding to Python 3.10, 3.11 and 3.12 respectively.
  • Removed the generate command, as its main dependency is unmaintained and relies on old dependencies. generate also saw little use - the test data it generated was not well suited for training models, and it was too slow - so we decided to remove it even before the synthesize command (#235) is ready.
  • reverse now has index support and can work in "streaming" mode and handle larger than memory CSV files.
  • sort and sample: users can now choose from three Random Number Generator (RNG) algorithms with the --rng option - standard, faster & cryptosecure.
  • pseudo now has --start, --increment & --formatstr options.
  • fmt now has a --no-final-newline option to suppress the final newline for better interoperability with other tools, specifically Excel. It also treats "T" as a special value for the tab character in the --out-delimiter option (see the sketches after this list).
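
Hedged sketches of two of the new options (file names are hypothetical; the RNG value and the "T" delimiter are per the descriptions above):

# sample 1,000 rows using the cryptosecure RNG
qsv sample --rng cryptosecure 1000 data.csv

# write tab-delimited output without a final newline
qsv fmt --out-delimiter T --no-final-newline data.csv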


Full Changelog: https://github.com/jqnatividad/qsv/compare/0.121.0...0.122.0

0.121.0

4 months ago

Two days ago, qsv 0.120.0 was released. Hours later, significant updates occurred in our ecosystem: Polars upgraded to version 0.36, Homebrew rolled out support for Rust 1.75.0, and our pull request for 'cached' was merged.

In light of these developments, we're releasing 0.121.0 out of cycle to leverage the new features, fixes and performance enhancements in these key components integral to qsv.


👉 REQUEST FOR USE CASES: 👈 Please help define the future of qsv. Add what you're currently using qsv for here - https://github.com/jqnatividad/qsv/discussions/1529

Not only does it help us catalog what use cases we should optimize for, posters will get higher priority access to the qsv pro preview.



Full Changelog: https://github.com/jqnatividad/qsv/compare/0.120.0...0.121.0

0.120.0

4 months ago

Happy New Year! 🎉🎉🎉 Here's the first release of 2024, the biggest ever with 280+ commits! qsv 0.120.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.

Apart from wrapping qsv with a User Interface, qsv pro also comes with a retinue of related cloud-based data cleaning, enrichment and enhancement services along with expanded metadata inferencing to make your Data Useful, Usable and Used!

qsv pro draws inspiration from OpenRefine, but is reimagined without its file size and speed limitations - qsv pro can process multi-gigabyte files in seconds.

It incorporates hard lessons we learned in the past 12 years deploying Data Portals and Data Pipelines to create a new Data/Metadata Wrangling and AI-assisted Data Publishing service that's easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.

But it's not quite ready for release yet, so stay tuned!

However, we're now taking signups for a preview release, so if you're interested, please sign up!

Excitingly, qsv was also mentioned in a Hacker News thread on Dec 23, 2023! As a result, we're now approaching 2,000 stars on GitHub, up from 900 stars on Dec 22! 🎉🎉🎉

Stay tuned for more advancements in 2024 – it's set to be a landmark year for qsv! 🦄🦄🦄



Full Changelog: https://github.com/jqnatividad/qsv/compare/0.119.0...0.120.0

0.119.0

5 months ago

Highlights:

As we prepare for version 1.0, we're focusing on performance, stability and reliability as we set the stage for qsv pro - a cloud-backed UI version of qsv powered by Tauri, set to be released in 2024. Stay tuned!

  • diff is now out of beta and blazingly fast! Give "the fastest CSV-diff in the world" a try 😉!
  • joinp now supports snappy automatic compression/decompression!
  • sqlp & joinp now recognize the QSV_COMMENT_CHAR environment variable, allowing you to skip comment lines in your input CSV files. They're also faster with the upgrade to Polars 0.35.4.
  • sqlp now supports subqueries, table aliases, and more!
  • luau: upgraded embedded Luau from 0.599 to 0.604; refactored code to reduce unneeded allocations and increase performance (more than doubling it!) as we prepare for extended recipe support.
  • cat is now even faster with the --flexible option. If you know your CSV files are valid, you can use this option to skip CSV validation and make cat run twice as fast!
  • qsv can now add a Byte Order Mark (BOM) header sequence to produce Excel-friendly CSVs with the QSV_OUTPUT_BOM environment variable (see the sketches after this list).
  • stats, sort, schema & validate are now faster with the use of atoi_simd to directly convert &[u8] to integer, skipping unnecessary utf8 validation, while also using SIMD CPU instructions for noticeably faster performance.
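
Hedged sketches of the environment variables and the cat option (the comment character, the QSV_OUTPUT_BOM value and the file names are hypothetical):

# skip '#' comment lines on input and emit a BOM for Excel-friendly output
QSV_COMMENT_CHAR='#' QSV_OUTPUT_BOM=1 qsv sqlp data.csv "select * from data" > out.csv

# skip CSV validation when concatenating known-valid files
qsv cat rows --flexible a.csv b.csv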


Full Changelog: https://github.com/jqnatividad/qsv/compare/0.118.0...0.119.0



0.118.0

6 months ago

Highlights:

  • With the Polars upgrade to 0.34.2, the sqlp and joinp commands enjoy expanded capabilities and a noticeable performance boost. 🦄🏇
  • We now publish the 500, 1000, 5000 and 15000 Geonames cities indices for the geocode command, with users able to easily switch indices with the index-load subcommand. As the name implies, the 500 index contains cities with populations of 500 or more, the 1000 index contains cities with populations of 1000 or more, and so on.
    The 15000 index (default) is the smallest (13mb) and fastest with ~26k cities. The 500 index is the largest (56mb) and slowest, with ~200k cities. The 5000 index is 21mb with ~53k cities. The 1000 index is 44mb with ~140k cities. 🎠
  • The geocode command now returns US Census FIPS codes for US places with the %json and %pretty-json formats, returning both US State and US County FIPS codes, with upcoming support for Cities and other US Census geographies (School Districts, Voting Districts, Congressional Districts, etc.) 🎠
  • Improved performance for stats, schema and tojsonl commands with the stats cache bincode refactor. This is especially noticeable for large CSV files as stats previously created large bincode cache files by default.
    The bincode cache allows other commands (currently, only schema and tojsonl) to skip recomputing statistics and deserialize the saved stats data structures directly into memory. Now, it will only create a bincode file if the --stats-binout option is specified (typically before using the schema and tojsonl commands; see the sketches after this list). stats will continue to create a stats CSV cache file by default, which is much smaller than the bincode file and universally applicable, unlike the bincode cache. 🏇
  • self-update will now verify updates. This is done by verifying the zipsign signature of the release zip archive before applying it. This should make it harder for malicious actors to compromise the self-update process. Version 0.118.0 has the verification code, and future releases will use this new verification process. Regardless, we will zipsign all zip archives starting with this release. Users can manually verify the signatures by downloading the zipsign public key and running the zipsign command line tool. See Verifying the Integrity of the Prebuilt Binaries Zip Archive for more info. 🦄
  • The frequency command now supports the --ignore-case option for case-insensitive frequency counts (see the sketches after this list). 🦄🎠
  • The schema command can now compile case-insensitive enum constraints. 🦄
  • Improved performance for apply and applydp commands with faster compile-time perfect hash functions for operations lookups. 🏇
  • Several minor performance improvements and bug fixes with snappy, sniff & cat commands. 🏇
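
Hedged sketches of two of the new options (file names are hypothetical):

# case-insensitive frequency counts
qsv frequency --ignore-case data.csv

# write the bincode stats cache so schema and tojsonl can reuse it
qsv stats --stats-binout data.csv
qsv schema data.csv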


Full Changelog: https://github.com/jqnatividad/qsv/compare/0.117.0...0.118.0