Cylondata Cylon Versions

Cylon is a fast, scalable, distributed-memory parallel runtime with a Pandas-like DataFrame.

v0.6.0

1 year ago

Cylon 0.6.0 is a major release. We are excited to present UCC and Gloo integration and more distributed operations.

Features

Cylon C++ and Python

  • Implementation of Slice, Head and Tail operations
  • Adding conda docker
  • UCC integration
  • Adding cylonflow as a submodule
  • Use generic operator
  • Summit fixes
  • Adding custom mpirun params CMake variable
  • Adding CMake parallelism flag
  • Gloo Python binding
  • Enabling Gloo CI
  • Add downloading Catch2 header dynamically
  • Distributed sort on CPU
  • Cylon Gloo integration
  • Adding distributed scalar aggregates
  • Extending datatypes
  • Allowing custom MPI_Comm for MPI
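
A distributed Slice, Head or Tail has to translate one global row range into a local range on every worker. The following is a minimal sketch of that bookkeeping in plain Python, assuming each worker holds a contiguous chunk of the global table in rank order. The function names are hypothetical and are not Cylon's API; the release notes do not describe how Cylon implements these operations.

```python
from itertools import accumulate


def local_slices(partition_sizes, offset, length):
    """Map a global [offset, offset + length) row range onto per-partition
    (local_start, local_length) pairs.

    Assumes partitions hold consecutive chunks of the global table in rank
    order. Hypothetical helper for illustration; not part of Cylon's API.
    """
    total = sum(partition_sizes)
    # Clamp the requested range to the rows that actually exist.
    start = max(0, min(offset, total))
    stop = max(start, min(offset + length, total))
    # Global index of the first row held by each partition.
    starts = [0] + list(accumulate(partition_sizes))[:-1]
    result = []
    for first, size in zip(starts, partition_sizes):
        lo = max(start, first)
        hi = min(stop, first + size)
        result.append((lo - first, hi - lo) if hi > lo else (0, 0))
    return result


def head(partition_sizes, n):
    """First n rows of the global table."""
    return local_slices(partition_sizes, 0, n)


def tail(partition_sizes, n):
    """Last n rows of the global table."""
    total = sum(partition_sizes)
    n = min(n, total)
    return local_slices(partition_sizes, total - n, n)
```

With partition sizes `[4, 3, 5]`, `head(sizes, 5)` takes four rows from the first worker, one from the second and none from the third. In a real MPI setting the partition sizes would first have to be exchanged between workers, which this sketch takes as given.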

Build

  • Updating to Arrow 9.0.x
  • Windows build support
  • MacOS build support
  • Conda build is the default build
  • Improving docker build

You can download the source code from GitHub. Conda binaries are available on Anaconda.

Commits

  • 91bdd54 Update conda-actions.yml (#645)
  • d1739ed Added buildable instructions for Rivanna (#643)
  • d9a6420 Arrow 9.0.0 and gcc-11 update (#601)
  • 4c867b1 Summit Fixes (#623)
  • 7f8a3b1 Fixing sample bug (#631)
  • ce12454 Cython binding for slice, head and tail (#619)
  • ef4c904 #610: SampleArray util method replaced by using arrow::compute::Take … (#612)
  • 4694a9e Minor fixes (#608)
  • 121b386 Fixing: Corrupted result when joining tables contain list data types #615 (#616)
  • 68fa598 Summit fixes (#607)
  • de3ec7b fixing bash splitting (#606)
  • 0a489fc adding cmake parallelism flag (#605)
  • 035fd70 Implement Slice, Head and Tail Operation in both centralize and distr… (#592)
  • d99a6f2 adding custom mpirun params cmake var (#604)
  • f20c119 Update README-summit.md (#603)
  • 4bc27f9 Create README-summit.md (#602)
  • e6b7306 Minor fixes (#596)
  • 2e6ac80 adding conda docker (#600)
  • 4dd359f Ucc integration (#591)
  • 61b4a82 adding cylonflow as a submodule (#593)
  • e4dd38b Use generic operator (#583)
  • 6c0dfa8 Gloo python binding (#587)
  • 773f11f Gloo python bindings (#585)
  • 2fc95be Add downloading catch2 header dynamically (#584)
  • c56ab2d Enabling gloo CI (#582)
  • a820ed8 Dist sort cpu (#574)
  • f68cc62 Adding UCC build (#579)
  • 2759a30 Cylon Gloo integration (#576)
  • b2c0820 Adding distributed scalar aggregates (#570)
  • 9c2fdc4 Extending datatypes (#568)
  • e3d553c Bump ua-parser-js from 0.7.22 to 0.7.31 in /docs (#566)
  • 3bafb75 Bump ssri from 6.0.1 to 6.0.2 in /docs (#565)
  • 814a463 minor fixes (#564)
  • be92253 Bump lodash from 4.17.20 to 4.17.21 in /docs (#561)
  • e87dd7c Bump shelljs from 0.8.4 to 0.8.5 in /docs (#562)
  • 71bd8bf Bump nanoid from 3.1.22 to 3.2.0 in /docs (#563)
  • 49b343d Allowing custom MPI_Comm for MPI (#559)
  • fa52dd4 Update contributors.md
  • 54d4a53 added io functions (#550)
  • 1a8c3d7 Fixing 554 (#558)
  • 887ea18 update arrow link (#557)
  • 1ce4c6b Fixing 552 (#553)
  • f5e31a1 Merging 0.5.0 release (#547)

Contributors

Ahmet Uyar, Chathura Widanage, Damitha Sandeepa Lenadora, dependabot[bot], Hasara Maithree, Kaiying Shan, niranda perera, Supun Kamburugamuve, Vibhatha Lakmal Abeykoon, Ziyao22, Arup Kumar Sarker, Mills Wellons Staylor, Gregor von Laszewski

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

0.5.0

2 years ago

Cylon 0.5.0 is a major release. We are excited to present GCylon, cudf-based distributed DataFrame for Nvidia GPUs, UCX integration, Anaconda support, and much more.

Features

Cylon C++ and Python

  • Adding UCX integration with MPI
  • Adding read distribution
  • Changing join column naming convention to match SQL and pandas
  • Adding DataFrame.applymap and DataFrame.isin
  • Add iloc operation to DataFrame
  • Adding null handling to table operators and Comparators
  • Adding Equal and distributed Equal operators
  • Adding array flattening
  • Adding Repartition
  • Adding mapreduce style group-by aggregators
  • Adding table level AllGather, Gather and Broadcast operators
  • Performance improvements and bug fixes
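
The "mapreduce style group-by aggregators" item can be pictured as three phases: combine locally, shuffle by key, then reduce. The sketch below illustrates that pattern for a sum aggregate in plain Python, with in-memory lists standing in for workers. All function names are hypothetical, and this is an illustration of the general technique rather than Cylon's implementation.

```python
from collections import defaultdict


def local_combine(rows):
    """Map/combine step: pre-aggregate (key, value) rows on one worker."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def shuffle(partials, num_workers):
    """Route every partial result to the worker that owns its key."""
    inboxes = [[] for _ in range(num_workers)]
    for partial in partials:
        for key, value in partial.items():
            inboxes[hash(key) % num_workers].append((key, value))
    return inboxes


def reduce_partials(inbox):
    """Reduce step: merge the partial sums that arrived for each key."""
    merged = defaultdict(int)
    for key, value in inbox:
        merged[key] += value
    return dict(merged)


def distributed_groupby_sum(worker_rows):
    """Run combine, shuffle and reduce over a list of per-worker row lists."""
    partials = [local_combine(rows) for rows in worker_rows]
    inboxes = shuffle(partials, len(worker_rows))
    return [reduce_partials(inbox) for inbox in inboxes]
```

Combining before the shuffle means each worker sends at most one record per distinct key, which is the main reason to prefer this shape over shuffling raw rows when keys repeat often.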

Build

  • Updating to Arrow 5.0.x
  • Windows build support
  • MacOS build support
  • Conda build is the default build
  • Improving docker build

GCylon

First release of GCylon, which supports distributed DataFrame processing on Nvidia GPUs using cuDF:

  • Implemented shuffling and distributed sorting
  • Distributed Join/merge
  • Distributed GroupBy
  • DataFrame Set operations
  • Repartitioning DataFrames
  • Distributed IO for reading/writing CSV, JSON and Parquet files

You can download the source code from GitHub. Conda binaries are available on Anaconda.

Commits

  • 3344bf95 Mapreduce style group-by aggregators (#535)
  • 50ef890b Remove minor warnings (#544)
  • 559e8eb3 Adding CPU serializer (#539)
  • abb44049 fixed unused variable/parameter and casting warnings (#542)
  • 62a3f080 Distributed IO (#533)
  • 15d06d6c Bump color-string from 1.5.4 to 1.7.4 in /docs (#534)
  • 810c4ed7 fixing RNG issue (#538)
  • fbb049bb fixing build error (#536)
  • a10e0528 Bump algoliasearch-helper from 3.3.3 to 3.6.2 in /docs (#532)
  • 112ea97f Repartition - CPU (#526)
  • 79c4b739 create a MacOS yml file (#530)
  • b9e7a8c4 Repartition - GPU (#528)
  • 2191b9f5 fixed function name change in cudf api from gcylon test files (#529)
  • 3e9036ee Upgrading to arrow 5.0.0 (#525)
  • 24d182ab Groupby values null handling (#527)
  • 54a5074b Null handling for Comparators (#524)
  • 0b9516e7 Adding array flattening (#522)
  • b3fc2a2a Implemented MergeOrSort when merging sorted tables (#523)
  • 1e061b2f Feature/equal (#499)
  • e378d1dc reformatted gcylon codes with tab size 2, non-functional changes (#521)
  • 8450d9b1 Added support for sliced tables in gather, broadcast and sorting (#520)
  • 92b8124c Update windows.yml
  • 1f9790d7 Update macos.yml
  • d33f9ac8 Update conda-actions.yml
  • 963d4914 Update c-cpp.yml
  • 2229981d added mpi datatype dispatching for primitive data types (#519)
  • d9936b4d Head tail operators (#512)
  • ac99d009 Formatting code (#518)
  • fff84ccb Code formatting (#517)
  • f32f04da Null handling in splitters and build arrays (#511)
  • 4cab7ca4 Delete files from CPP example folder that are not needed (#516)
  • d1744302 moving tutorial repo to (#514)
  • 9cd7911f Python example cleanup (#513)
  • fe4caf37 Distributed sorting (#510)
  • 2302f58f Minor improvements to the Table API (#508)
  • 71eb80a1 adding new test utils (#507)
  • 24b83dd3 Adding to docker docs (#498)
  • 6f2faf8f Update conda.md
  • 4f8f3c7f Gcylon docs (#501)
  • a7862580 Adding contributing guide to documentation (#496)
  • 8ab8b2d6 changing join column naming convention to match SQL and pandas (#487)
  • f18b91fe improvements to ucx build from conda (#484)
  • 912fb543 Windows build (#482)
  • 216758a2 making improvements to the build (#483)
  • 4e2894eb Add functions to dataframe (#481)
  • 1f1ddd9c Documentation update (#479)
  • e6233151 Bump tar from 6.1.5 to 6.1.11 in /docs (#477)
  • 1e5db7b6 improve docs (#476)
  • 58c0595d removing extra examples (#474)
  • 3c823f6f Gcylon integration (#470)
  • 92748eb5 Cpp example cleanup (#475)
  • fa14527d Docs improvements (#469)
  • 13062206 Bump url-parse from 1.4.7 to 1.5.3 in /docs (#473)
  • 8234ae7b Bump path-parse from 1.0.6 to 1.0.7 in /docs (#472)
  • c8b435b6 Bump tar from 6.0.5 to 6.1.5 in /docs (#471)
  • 1cc28dd3 Performance improvements (#453)
  • 9092bbf0 MacOS build (#464)
  • d59d91ea Add iloc operation to DataFrame (#465)
  • 8d7a8dc7 Removed glog files from the header files (#463)
  • ea62eef0 License updates (#462)
  • 2f562650 changed all relative Cylon header references to global (#461)
  • 123c93c3 Building in conda env without using conda-build (#457)
  • 3b3a2853 Compilation document improvements (#454)
  • 8578b1f1 Adding barrier at the end of the test case (#458)
  • e6eded5f Fix for empty df (#455)
  • 8f149924 Fixed mpi test case (#456)
  • cb069980 Changes to the Docs (#451)
  • 4ce1d7eb updates to the docker readme
  • e011e0f6 enhancing readme
  • adfa6c05 adding read distribution (#432)
  • bd2e024d UCX integration (#439)
  • a42d04ad Bump ws from 6.2.1 to 6.2.2 in /docs (#437)
  • 710b562e Bump dns-packet from 1.3.1 to 1.3.4 in /docs (#435)
  • 07aee740 adding new operators to DataFrame API (#429)
  • 71e57f84 Updating to arrow 4.0 (#418)
  • a490dc21 changing ctx to const reference in methods (#419)
  • 18a5447b missing docs (#428)
  • 38534f55 0.4.1 release (#427)
  • 10f5a6a3 Enabling scalars in df set_item (#425)
  • 0be78972 Op bench refactor (#417)
  • ec964d89 Bug fixes in dataframe (#420)
  • e0ba9643 Update c-cpp.yml
  • 0200c021 adding finalize check and removing destructor finalize call. (#412)
  • 149919c2 Update README.md
  • 016c5c92 adding missing test case
  • 56095357 Update README.md
  • e3ca0bf5 0.4.0 release (#411)

Contributors

Ahmet Uyar, Chathura Widanage, Damitha Sandeepa Lenadora, dependabot[bot], Hasara Maithree, Kaiying Shan, niranda perera, Supun Kamburugamuve, Vibhatha Lakmal Abeykoon, Ziyao22

0.4.1

3 years ago

Cylon 0.4.1 is a bug fix release.

0.4.0

3 years ago

Cylon 0.4.0 is a major release with the following features.

Major Features

Python

  • DataFrame API similar to Pandas supporting around 40 operators commonly used in Pandas.
  • Conda build and Conda-based binaries for installation on Linux.
  • Python bindings for all the operators added at the C++ level.
  • Compute functions backed by both Arrow and NumPy for filtering, math operations and comparison operators.
  • Added operator benchmarks.
  • Added new options for CSV reading supporting all the options in PyArrow for reading CSV.

C++

  • Added distributed multi-column operations on tables for join, union, intersection, set difference and sort.
  • Added improved hash operations using Bytell Hash Maps. Improved performance by 2 times for union, intersection, set difference and unique.
  • Added new aggregate operations for GroupBy operation (Mean, Variance, Std Dev, Quantile, NUnique, Median).
  • Implemented GroupBy aggregators using CRTP (Curiously recurring template pattern).
  • Improved indexing at the core by adding more types and improving the performance of indexed lookups.
  • Added a distributed unique operator.
  • Added temporal data types such as DateTime, Date32 (day resolution), Date64 (millisecond resolution) and Timestamp (with time zone information).
  • Other performance improvements and bug fixes.
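
Aggregates such as Mean, Variance and Std Dev can be computed in a distributed GroupBy only if each worker's partial result can be merged with the others. The sketch below shows one common way to do that: a (count, mean, sum of squared deviations) state with Welford's update and the pairwise merge formula of Chan et al. It is written in Python for brevity and does not demonstrate CRTP, which is a C++ idiom. The class and function names are hypothetical, and nothing here is taken from Cylon's source.

```python
import math
from dataclasses import dataclass


@dataclass
class VarState:
    """Partial state for mean/variance: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x):
        # Welford's online update for a single value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine two partial states (pairwise formula of Chan et al.).
        if other.n == 0:
            return VarState(self.n, self.mean, self.m2)
        if self.n == 0:
            return VarState(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return VarState(n, mean, m2)

    def variance(self, ddof=1):
        # Sample variance by default; requires n > ddof.
        return self.m2 / (self.n - ddof)

    def std(self, ddof=1):
        return math.sqrt(self.variance(ddof))


def summarize(values):
    """Build the partial state for the values held by one worker."""
    state = VarState()
    for v in values:
        state.update(v)
    return state
```

Two workers holding `[1, 2, 3]` and `[4, 5, 6, 7]` can each call `summarize` locally and exchange only three numbers; `a.merge(b)` then yields the same mean and variance as a single pass over all seven values.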

Build

  • Compiling against an external Apache Arrow installation (local/pip).

Applications and Benchmarks

  • Implementing a subset of TPCx-BB queries (Queries 6, 7, 9, 14, 22, 23); the rest are in progress.
  • Applications with connections to deep learning.

You can download the source code from GitHub.

v0.3.1

3 years ago

Cylon 0.3.1 is a bug fix release.

You can download the source code from GitHub.

v0.3.0

3 years ago

Cylon 0.3.0 adds the following features. Please note that this release may not be backward compatible with previous releases.

Major Features

C++

  • Adding order-by and distributed table sort operations
  • Multiple partitioning schemes (modulo, hash, and range)
  • C++ API refactoring
  • Performance improvements in the existing C++ API
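
The three partitioning schemes named above (modulo, hash and range) differ only in how a key is turned into a partition index. The sketch below shows that difference in plain Python. The function names and the `split_points` parameter are hypothetical, chosen for illustration, and do not reflect Cylon's C++ API.

```python
from bisect import bisect_right


def modulo_partition(key, num_partitions):
    """Modulo scheme: integer key modulo the partition count."""
    return key % num_partitions


def hash_partition(key, num_partitions):
    """Hash scheme: hash the key first, so non-integer keys work too."""
    return hash(key) % num_partitions


def range_partition(key, split_points):
    """Range scheme: split_points are the sorted, exclusive upper bounds
    of every partition except the last."""
    return bisect_right(split_points, key)


def partition_rows(rows, key_fn, assign, num_partitions):
    """Group rows into num_partitions buckets using the given scheme."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[assign(key_fn(row))].append(row)
    return buckets
```

Range partitioning is the scheme that keeps keys ordered across partitions, which is why it pairs naturally with a distributed sort: once every worker sorts its own bucket, the buckets read in rank order are globally sorted.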

Python (PyCylon)

  • Exposing table operators similar to Pandas (28 new operators).
    • Comparison operators
    • Logical Operators
    • Math operators
    • Null/NA value filtering and filling
    • Filtering and updating (including inplace ops)
    • Schema refactoring
    • Experimental indexing abstraction
  • Python bindings for distributed data sorting
  • Adding new examples for updated operations (https://github.com/cylondata/cylon/tree/master/python/examples)

You can download the source code from GitHub.

Examples

0.2.0

3 years ago

Cylon 0.2.0 adds the following features. Please note that this release may not be backward compatible with v0.1.0.

Major Features

C++

  • Adding aggregates and group-by API
  • Creating tables using std::vectors or cylon::Columns
  • C++ API refactoring
  • Major performance improvements in the existing C++ API

Python (PyCylon)

  • Extending the Cython API so that other Cython/Python libraries can build on it
  • Adding aggregates and group-by
  • Adding column-name-based relational algebra, aggregate and group-by operations
  • Major performance improvements in the existing Python API

Java (JCylon)

  • Performance improvements

You can download the source code from GitHub.

Examples

0.1.0

3 years ago

Cylon 0.1.0 is the first open-source public release of the Cylon project. We are excited to bring a high-performance data engineering toolkit that can work as a library as well as a standalone framework. This is the first step towards building a complete toolkit designed to work with AI/ML systems and integrate with data processing systems, with the vision "data engineering everywhere".

You can download the source code from GitHub.

Who should use Cylon?

  • Users of Pandas dataframes or SQL interface
  • Those needing parallel data engineering
  • Those needing Python, C++ and Java interoperability
  • HPC Python (Dask) and Big Data (Kubernetes) environments

Major Features in v0.1.0

  • Introducing Cylon C++ engine based on Apache Arrow.
  • Cylon C++, Python (PyCylon) and Java language bindings
  • Seamless integration with Pandas and NumPy
  • Distributed operations using MPI
  • Local and distributed operations (Select, Project, Joins, Intersection, Union, Subtract)
  • Jupyter notebook support and experimental Google Colab support
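
To make the idea of a distributed join concrete, the sketch below shows one common approach: hash-partition both tables on the join key so matching rows land on the same worker, then run an ordinary hash join locally. It is plain Python with lists standing in for workers and tuples standing in for rows. The function names are hypothetical, and this is not a description of how Cylon implements its join over MPI.

```python
from collections import defaultdict


def partition_by_key(rows, key_index, num_workers):
    """Send each row to the worker chosen by the hash of its join key."""
    parts = [[] for _ in range(num_workers)]
    for row in rows:
        parts[hash(row[key_index]) % num_workers].append(row)
    return parts


def local_hash_join(left, right, left_key, right_key):
    """Inner join on one worker: build a hash table on left, probe with right."""
    table = defaultdict(list)
    for row in left:
        table[row[left_key]].append(row)
    return [l + r for r in right for l in table.get(r[right_key], [])]


def distributed_inner_join(left, right, left_key, right_key, num_workers):
    """Partition both inputs by key, then join each pair of partitions."""
    left_parts = partition_by_key(left, left_key, num_workers)
    right_parts = partition_by_key(right, right_key, num_workers)
    # Matching keys now live on the same worker, so each join is purely local.
    return [
        local_hash_join(lp, rp, left_key, right_key)
        for lp, rp in zip(left_parts, right_parts)
    ]
```

The same partition-then-operate-locally shape applies to Union, Intersection and Subtract: once equal rows are co-located, each worker can apply the set operation to its own partition independently.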

Examples
