Desbordante is a high-performance data profiler that is capable of discovering and validating many different patterns in data using various algorithms. It also makes it possible to run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
The Discovery task is designed to identify all instances of a specified pattern type in a given dataset.
The Validation task is different: it is designed to check whether a specified pattern instance is present in a given dataset. This task not only returns True or False, but it also explains why the instance does not hold (e.g. it can list table rows with conflicting values).
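To make the difference between the two tasks concrete, here is a minimal pure-Python sketch for one pattern type, the functional dependency. It is not how Desbordante implements either task: the functions `validate_fd` and `discover_single_column_fds` and the four-row table are hypothetical and exist only for illustration. Validation checks one given dependency and reports the conflicting rows; discovery enumerates all dependencies of a (deliberately restricted) form.

```python
from collections import defaultdict


def validate_fd(rows, lhs, rhs):
    """Check whether lhs -> rhs holds in rows.

    Returns (holds, conflicts), where conflicts maps each violating
    left-hand-side value to the indices of the rows that share it
    but disagree on the right-hand side.
    """
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[col] for col in lhs)
        groups[key].append(index)

    conflicts = {}
    for key, indices in groups.items():
        rhs_values = {rows[i][rhs] for i in indices}
        if len(rhs_values) > 1:  # same LHS value, different RHS values
            conflicts[key] = indices
    return not conflicts, conflicts


def discover_single_column_fds(rows, columns):
    """Naive discovery, restricted to single-column left-hand sides."""
    return [(a, b) for a in columns for b in columns
            if a != b and validate_fd(rows, [a], b)[0]]


# Made-up toy table, not one of the Desbordante example datasets.
rows = [
    {"Professor": "Smith", "Course": "Math", "Semester": "Fall"},
    {"Professor": "Smith", "Course": "Math", "Semester": "Spring"},
    {"Professor": "Jones", "Course": "Physics", "Semester": "Fall"},
    {"Professor": "Jones", "Course": "Chemistry", "Semester": "Spring"},
]

holds, conflicts = validate_fd(rows, ["Professor"], "Course")
print(holds)      # False
print(conflicts)  # {('Jones',): [2, 3]}
print(discover_single_column_fds(rows, ["Professor", "Course", "Semester"]))
# [('Course', 'Professor')]
```

Real discovery algorithms avoid this brute-force enumeration, which grows quickly with the number of columns.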
The currently supported data patterns are:
The discovered patterns can have many uses:
Desbordante can be used via three interfaces:
A brief introduction to the tool and its use cases can be found here (in English) and here (in Russian). Next, a list of various articles and guides can be found here. Finally, an extensive list of tutorial examples that cover each supported pattern is available here.
Usage examples:
python3 cli.py --task=fd --table=../examples/datasets/university_fd.csv , True
[Course Classroom] -> Professor
[Classroom Semester] -> Professor
[Classroom Semester] -> Course
[Professor] -> Course
[Professor Semester] -> Classroom
[Course Semester] -> Classroom
[Course Semester] -> Professor
python3 cli.py --task=afd --table=../examples/datasets/inventory_afd.csv , True --error=0.1
[Id] -> ProductName
[Id] -> Price
[ProductName] -> Price
python3 cli.py --task=mfd_verification --table=../examples/datasets/theatres_mfd.csv , True --lhs_indices=0 --rhs_indices=2 --metric=euclidean --parameter=5
True
For more information consult documentation and help files.
Desbordante features can be accessed from within Python programs by employing the Desbordante Python library. The library is implemented in the form of Python bindings to the interface of the Desbordante C++ core library, using pybind11. Apart from discovery and validation of patterns, this interface is capable of providing valuable additional information which can, for example, describe why a given pattern does not hold. All this allows end users to solve various data quality problems by constructing ad-hoc Python programs. To show the power of this interface, we have implemented several demo scenarios:
There is also an interactive demo for all of them, and all of these Python scripts are here. The ideas behind them are briefly discussed in this preprint (Section 3).
Simple usage examples:
import desbordante
TABLE = 'examples/datasets/university_fd.csv'
algo = desbordante.fd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute()
result = algo.get_fds()
print('FDs:')
for fd in result:
    print(fd)
FDs:
[Course Classroom] -> Professor
[Classroom Semester] -> Professor
[Classroom Semester] -> Course
[Professor] -> Course
[Professor Semester] -> Classroom
[Course Semester] -> Classroom
[Course Semester] -> Professor
import desbordante
TABLE = 'examples/datasets/inventory_afd.csv'
ERROR = 0.1
algo = desbordante.afd.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(error=ERROR)
result = algo.get_fds()
print('AFDs:')
for fd in result:
    print(fd)
AFDs:
[Id] -> Price
[Id] -> ProductName
[ProductName] -> Price
import desbordante
TABLE = 'examples/datasets/theatres_mfd.csv'
METRIC = 'euclidean'
LHS_INDICES = [0]
RHS_INDICES = [2]
PARAMETER = 5
algo = desbordante.mfd_verification.algorithms.Default()
algo.load_data(table=(TABLE, ',', True))
algo.execute(lhs_indices=LHS_INDICES, metric=METRIC,
             parameter=PARAMETER, rhs_indices=RHS_INDICES)
if algo.mfd_holds():
    print('MFD holds')
else:
    print('MFD does not hold')
MFD holds
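For readers unfamiliar with metric functional dependencies, the check above can be pictured roughly as follows. This is a pure-Python sketch under the usual definition (rows that agree on the left-hand side must have right-hand-side values within `parameter` of each other), not the algorithm Desbordante runs. The helper `mfd_holds_naive` and the toy rows are hypothetical and are not taken from theatres_mfd.csv.

```python
import math
from collections import defaultdict
from itertools import combinations


def mfd_holds_naive(rows, lhs_indices, rhs_indices, parameter):
    """Quadratic check of a metric FD using the Euclidean metric."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in lhs_indices)
        point = tuple(float(row[i]) for i in rhs_indices)
        groups[key].append(point)

    # Within every group of equal LHS values, all pairs of RHS points
    # must lie at a distance of at most `parameter`.
    for points in groups.values():
        for p, q in combinations(points, 2):
            if math.dist(p, q) > parameter:
                return False
    return True


# Made-up rows: (title, theatre, duration in minutes).
rows = [
    ("Film 1", "Cinema A", 120),
    ("Film 1", "Cinema B", 123),
    ("Film 2", "Cinema A", 95),
    ("Film 2", "Cinema C", 101),
]

print(mfd_holds_naive(rows, [0], [2], parameter=5))  # False: 95 vs 101 differ by 6
print(mfd_holds_naive(rows, [0], [2], parameter=6))  # True
```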
>>> import desbordante
>>> import pandas as pd
>>> pyro = desbordante.afd.algorithms.Pyro() # same as desbordante.afd.algorithms.Default()
>>> df = pd.read_csv('examples/datasets/iris.csv', sep=',', header=None)
>>> pyro.load_data(table=df)
>>> pyro.execute(error=0.0)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[0 1 2] -> 4, [0 2 3] -> 4, [0 1 3] -> 4, [1 2 3] -> 4]
>>> pyro.execute(error=0.1)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [2] -> 3, [2] -> 1, [0] -> 2, [3] -> 0, [0] -> 3, [0] -> 1, [1] -> 3, [1] -> 0, [3] -> 2, [3] -> 1, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
>>> pyro.execute(error=0.2)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 0, [0] -> 2, [3] -> 2, [1] -> 2, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4, [3] -> 0, [1] -> 0, [2] -> 3, [2] -> 1, [0] -> 3, [0] -> 1, [1] -> 3, [3] -> 1]
>>> pyro.execute(error=0.3)
>>> print(f'[{", ".join(map(str, pyro.get_fds()))}]')
[[2] -> 1, [0] -> 2, [2] -> 0, [2] -> 3, [0] -> 1, [3] -> 2, [3] -> 1, [1] -> 2, [3] -> 0, [0] -> 3, [4] -> 1, [1] -> 0, [1] -> 3, [4] -> 2, [4] -> 3, [2] -> 4, [3] -> 4, [0] -> 4, [1] -> 4]
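The session above shows that the reported dependencies change as the error threshold grows. To give an intuition for what such a threshold measures, here is a sketch of one pair-based error measure from the approximate FD literature, often called g1: the fraction of ordered row pairs that agree on the left-hand side but disagree on the right-hand side. Whether the `error` parameter above is computed exactly this way is an assumption, and `g1_error` is a hypothetical helper, not part of the Desbordante API.

```python
from collections import Counter, defaultdict


def g1_error(rows, lhs, rhs):
    """Fraction of ordered row pairs that violate lhs -> rhs."""
    n = len(rows)
    if n < 2:
        return 0.0

    # For every LHS value, count how often each RHS value occurs.
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[i] for i in lhs)][row[rhs]] += 1

    violating = 0
    for counts in groups.values():
        size = sum(counts.values())
        # Ordered pairs inside the group minus those agreeing on the RHS.
        violating += size * size - sum(c * c for c in counts.values())
    return violating / (n * n - n)


# Made-up toy rows: (key, value).
rows = [(1, "a"), (1, "a"), (1, "b"), (2, "c")]
error = g1_error(rows, lhs=[0], rhs=1)
print(round(error, 3))  # 0.333
```

Under this measure the dependency from column 0 to column 1 would be rejected at a threshold of 0.3 and accepted at 0.4.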
While the Python interface makes building interactive applications possible, Desbordante also offers a web interface which is aimed specifically at interactive tasks. Such tasks typically involve multiple steps and require substantial user input on each of them. Interactive tasks usually originate from Python scenarios, i.e. we select the most interesting ones and implement them in the web version. Currently, only the typo detection scenario is implemented. The web interface is also useful for pattern discovery and validation tasks: a user may specify parameters, browse results, employ advanced visualizations and filters, all in a convenient way.
You can try the deployed web version here. You have to register in order to process your own datasets. Keep in mind that, due to high demand, various time and memory limits are enforced: processing is aborted if they are exceeded. The source code of the web interface is kept in a separate repo.
No worries! Desbordante offers a novel type of data profiling, which may require that you first familiarize yourself with its concepts and usage. The most challenging part of Desbordante is the primitives: their definitions and applications in practice. To help you get started, here's a step-by-step guide:
Here is a list of papers about patterns, organized in the recommended reading order in each item:
Desbordante is available at the Python Package Index (PyPI). Dependencies:
To install Desbordante, type:
$ pip install desbordante
However, as the Desbordante core uses C++, additional requirements are imposed on the machine, so this installation option may not work for everyone. Currently, only manylinux2014 (Ubuntu 20.04+, or any other Linux distribution with gcc 10+) is supported. If the above does not work for you, consider building from sources.
NOTE: Only Python 3.11+ is supported for CLI
Clone the repository, change the current directory to the project directory and run the following commands:
pip install -r cli/requirements.txt
python3 cli/cli.py --help
The following instructions were tested on Ubuntu 20.04+ LTS.
Prior to cloning the repository and attempting to build the project, ensure that you have the following software:
To use test datasets you will need:
Clone the repository, change the current directory to the project directory and run the following commands:
./build.sh
python3 -m venv venv
source venv/bin/activate
python3 -m pip install .
Now it is possible to import desbordante as a module from within the created virtual environment.
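If the import fails, a quick check like the one below can help tell apart "the virtual environment is not active" from "the module was not installed". It is only a sketch that uses the standard library; the helper names `in_virtualenv` and `module_available` are hypothetical.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True if the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


def module_available(name):
    """Return True if a top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


print(f"Python executable:       {sys.executable}")
print(f"Inside a virtualenv:     {in_virtualenv()}")
print(f"desbordante importable:  {module_available('desbordante')}")
```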
In order to build tests, pull the test datasets using the following command:
./pull_datasets.sh
then build the tests themselves:
./build.sh -j$(nproc)
The Python module can be built by providing the --pybind switch:
./build.sh --pybind -j$(nproc)
See ./build.sh --help for more available options.
The ./build.sh script generates the following file structure in /path/to/Desbordante/build/target:
├───input_data
│ └───some-sample-csv's.csv
├───Desbordante_test
├───desbordante.cpython-*.so
The input_data directory contains several .csv files that are used by Desbordante_test. Run Desbordante_test to perform unit testing:
cd build/target
./Desbordante_test --gtest_filter='*:-*HeavyDatasets*'
desbordante.cpython-*.so is a Python module, packaging Python bindings for the Desbordante core library. In order to use it, simply import it:
cd build/target
python3
>>> import desbordante
We use easyloggingpp to log (mostly debug) information in the core library. The Python bindings search for a configuration file in the working directory, so to configure logging, create logging.conf in the directory from which desbordante will be imported. In particular, when running the CLI with python3 ./relative/path/to/cli.py, logging.conf should be located in . (the current working directory).
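As a starting point, the sketch below writes a small logging.conf in the easyloggingpp configuration format. The option names (FORMAT, ENABLED, TO_FILE, TO_STANDARD_OUTPUT) and format specifiers are standard easyloggingpp ones, but the chosen values are only an illustration, and which options the Desbordante bindings actually honor is an assumption. The helper `write_logging_conf` is hypothetical.

```python
from pathlib import Path

# Illustrative easyloggingpp configuration: log to stdout only,
# and silence DEBUG-level messages.
LOGGING_CONF = """\
* GLOBAL:
    FORMAT               = "%datetime %level %msg"
    ENABLED              = true
    TO_FILE              = false
    TO_STANDARD_OUTPUT   = true
* DEBUG:
    ENABLED              = false
"""


def write_logging_conf(directory="."):
    """Write logging.conf into `directory` and return its path."""
    path = Path(directory) / "logging.conf"
    path.write_text(LOGGING_CONF, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Run this from the directory in which desbordante will be imported.
    print(f"Wrote {write_logging_conf()}")
```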
If, when cloning the repo with git lfs installed, git clone produces the following (or similar) error:
Cloning into 'Desbordante'...
remote: Enumerating objects: 13440, done.
remote: Counting objects: 100% (13439/13439), done.
remote: Compressing objects: 100% (3784/3784), done.
remote: Total 13440 (delta 9537), reused 13265 (delta 9472), pack-reused 1
Receiving objects: 100% (13440/13440), 125.78 MiB | 8.12 MiB/s, done.
Resolving deltas: 100% (9537/9537), done.
Updating files: 100% (478/478), done.
Downloading datasets/datasets.zip (102 MB)
Error downloading object: datasets/datasets.zip (2085458): Smudge error: Error downloading datasets/datasets.zip (2085458e26e55ea68d79bcd2b8e5808de731de6dfcda4407b06b30bce484f97b): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
delete the already cloned version, set the GIT_LFS_SKIP_SMUDGE=1 environment variable and clone the repo again:
GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:Mstrutov/Desbordante.git
If type hints don't work for you in Visual Studio Code, for example, then install stubs using the command:
pip install desbordante-stubs
NOTE: Stubs may not fully support the current version of the desbordante package, as they are updated independently.
If you use this software for research, please cite one of our papers:
If you have any questions regarding the tool usage, you can ask them in our Google group. To contact the dev team, email George Chernishev, Maxim Strutovsky or Nikita Bobrov.