Sherlock Project Versions

This repository provides data and scripts for using Sherlock, a deep-learning-based model for semantic data type detection: https://sherlock.media.mit.edu.

v1.0.0

2 years ago

This release provides:

  • a significant speedup and memory reduction of the feature extraction phase,
  • bugfixes in the feature extraction pipeline,
  • the code of the original model architecture (tensorflow keras),
  • alignment of the SherlockModel class with the scikit-learn API (i.e. with fit, predict, and predict_proba methods),
  • improved notebooks demonstrating 1) full reproduction of the feature extraction and model training/evaluation pipelines, 2) out-of-the-box usage of the Sherlock model for a given table, 3) how performance can be improved with additional classifiers.
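
To illustrate what alignment with the scikit-learn API means in practice, here is a minimal sketch of a classifier exposing fit, predict, and predict_proba. The class ToyColumnTypeClassifier, its nearest-centroid logic, and the feature values are all hypothetical stand-ins; this is not the SherlockModel implementation, only an example of the calling convention.

```python
import math


class ToyColumnTypeClassifier:
    """Hypothetical stand-in that follows the scikit-learn calling convention.

    This is NOT the real SherlockModel; it is a nearest-centroid toy.
    """

    def fit(self, X, y):
        # Group feature vectors by label and store one mean vector per class.
        self.classes_ = sorted(set(y))
        self.centroids_ = []
        for label in self.classes_:
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids_.append([sum(col) / len(rows) for col in zip(*rows)])
        return self  # scikit-learn convention: fit returns the estimator

    def predict_proba(self, X):
        # Softmax over negative squared distances to each class centroid.
        probabilities = []
        for x in X:
            scores = [-sum((a - b) ** 2 for a, b in zip(x, c))
                      for c in self.centroids_]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            probabilities.append([e / total for e in exps])
        return probabilities

    def predict(self, X):
        # Pick the class with the highest probability for each row.
        return [self.classes_[max(range(len(p)), key=p.__getitem__)]
                for p in self.predict_proba(X)]


# Made-up two-dimensional features for four columns (illustrative values only).
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
y_train = ["address", "address", "year", "year"]

model = ToyColumnTypeClassifier().fit(X_train, y_train)
predicted = model.predict([[0.85, 0.15], [0.15, 0.8]])
```

Because the three methods follow the usual convention, an estimator shaped like this can generally be swapped for another classifier with the same interface, which is the kind of interchangeability the notebook on additional classifiers relies on.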

Contributions by: @lowecg @madelonhulsebos

v0.1.0

2 years ago

This release reflects the code that was used for the experiments in the paper "Sherlock: a deep learning approach to semantic data type detection" (available on arXiv). This release provides code for:

  • Downloading the original train and test data used for the experimental results reported in the paper.
  • Feature extraction to numerically represent new columns.
  • Evaluating a trained Sherlock model on unseen table columns.
  • Retraining the original Sherlock model.
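
As a rough illustration of what "numerically representing a column" can look like, the sketch below maps a list of cell values to a fixed-length feature vector. The function extract_toy_features and its four statistics are hypothetical and chosen for brevity; they are not Sherlock's feature set or its extraction code.

```python
import statistics


def extract_toy_features(values):
    """Map a column (list of strings) to a fixed-length numeric vector.

    Hypothetical example features, not the ones used by Sherlock.
    """
    # Drop missing or whitespace-only cells before computing statistics.
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return [0.0, 0.0, 0.0, 0.0]
    lengths = [len(v) for v in cleaned]
    digit_chars = sum(ch.isdigit() for v in cleaned for ch in v)
    return [
        statistics.mean(lengths),          # mean value length
        statistics.pstdev(lengths),        # spread of value lengths
        digit_chars / sum(lengths),        # fraction of digit characters
        len(set(cleaned)) / len(cleaned),  # fraction of unique values
    ]


# A column of years yields short, all-digit values of constant length.
year_features = extract_toy_features(["1999", "2004", "2004", "2018"])
```

Every column, whatever its length, ends up as a vector of the same size, which is what allows a model to be trained and evaluated on columns it has not seen before.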

This release contains inefficiencies and bugs, so the latest release of this project is recommended for production settings and new research projects. More about this project can be found on the project website.