A knowledge base construction engine for richly formatted data
This is a long-awaited release with some performance improvements and some breaking changes. See the changelog for details.
* Added HOCRDocPreprocessor and HocrVisualParser to support hOCR as an input file format. (#476) (#519)
* fonduer.parser.visual_parser.hocr_visual_parser. (#534) (#542)
* fonduer.parser.Parser. (#494) (#544)
* @HiromuHota: Renamed VisualLinker to PdfVisualParser, which assumes the following: (#518)

  * pdf_path should be a directory path where the PDF files exist, and cannot be a file path.
  * The PDF file name (os.path.basename) should match that of the document, except for the extension. E.g., the PDF file should be either "123.pdf" or "123.PDF" for "123.html".

* @HiromuHota: Changed Parser's signature as follows: (#518)

  * Renamed vizlink to visual_parser.
  * Removed pdf_path. Now this is required only by PdfVisualParser.
  * Removed visual. Provide visual_parser if visual information is to be parsed.

* @YasushiMiyata: Changed UDFRunner's and UDF's data commit process as follows: (#545)

  * Removed the add process on a single thread in _apply in UDFRunner.
  * Added UDFRunner._add of y on multi-threads to Parser, Labeler and Featurizer.
  * Removed y of the document parsed result from out_queue in UDF.
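As a concrete illustration of the file-naming rule that PdfVisualParser assumes, the small standard-library sketch below computes the PDF file names that would be accepted for a given document; the helper `expected_pdf_names` is hypothetical and not part of Fonduer's API:

```python
import os


def expected_pdf_names(doc_file: str) -> list:
    """PDF file names (per the naming rule above) accepted for a given document."""
    stem = os.path.splitext(os.path.basename(doc_file))[0]
    # Both the lowercase and uppercase extensions are accepted.
    return [stem + ".pdf", stem + ".PDF"]


print(expected_pdf_names("data/html/123.html"))  # -> ['123.pdf', '123.PDF']
```

Under the new signature, such PDFs live in the directory passed as pdf_path to PdfVisualParser, rather than being pointed to file by file.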
This is a big release with a lot of changes. These changes are summarized here. Check the Changelog for more details.
* Added get_max_row_num to fonduer.utils.data_model_utils.tabular. (#469) (#480)
* Sentence and SpanMention. (#429)
* Added nullables to candidate_subclass() to allow a NULL mention in a candidate. (#496) (#497)
* data_model_utils.tabular to data_model_utils.textual. (#503) (#505)
* Changed get_cell_ngrams and get_neighbor_cell_ngrams to yield nothing when the mention is not tabular. (#471) (#504)
* Deprecated bbox_from_span and bbox_from_sentence. (#429)
* Deprecated visualizer.get_box in favor of span.get_bbox(). (#445) (#446)
* data_model_utils.tabular. (#503) (#505)
* Fixed get_horz_ngrams and get_vert_ngrams so that they work even when the input mention is not tabular. (#425) (#426)
* Fixed _get_axis_ngrams not to return None when the input is not tabular. (#481)
* Fixed Visualizer.display_candidates not to draw rectangles on wrong pages. (#488)

A summary of the changes in this release is below. Check the Changelog for more details.
* Labeling functions now use snorkel.labeling.labeling_function. (#400 <https://github.com/HazyResearch/fonduer/issues/400>) (#401 <https://github.com/HazyResearch/fonduer/pull/401>)

A summary of the changes in this release is below. Check the Changelog for more details.
Fonduer has a new mode argument to support switching between different learning modes (e.g., STL or MTL).
# Create a task for each relation.
tasks = create_task(
    task_names=TASK_NAMES,
    n_arities=N_ARITIES,
    n_features=N_FEATURES,
    n_classes=N_CLASSES,
    emb_layer=EMB_LAYER,
    model="LogisticRegression",
    mode=MODE,
)
* Added a mode argument in create_task to support STL and MTL.

A summary of the changes in this release is below. Check the Changelog for more details.
Rather than maintaining a separate learning engine, we have switched to Emmental, a deep learning framework for multi-task learning. Moving to a more general learning framework allows Fonduer to support more applications and multi-task learning.
# With Emmental, you need the following steps to perform learning:
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
# 2. Wrap candidates into an EmmentalDataLoader for training.
# 3. Training and inference (prediction).
import emmental
import fonduer
import numpy as np
from emmental.data import EmmentalDataLoader
from emmental.learner import EmmentalLearner
from emmental.model import EmmentalModel
from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task
from fonduer.learning.utils import collect_word_counter

# train_cands, F_train, test_cands, F_test, and ATTRIBUTE come from the
# earlier candidate extraction and featurization steps.

# Collect a word counter from the candidates, which is used by the LSTM model.
word_counter = collect_word_counter(train_cands)

# Initialize Emmental. To customize Emmental, please check here:
# https://emmental.readthedocs.io/en/latest/user/config.html
emmental.init(fonduer.Meta.log_path)

#######################################################################
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
#######################################################################

# Generate the special tokens that the LSTM model uses to locate mentions.
# In the LSTM model, we pad sentences with special tokens to help the LSTM
# learn those mentions. Example:
# Original sentence: Then Barack married Michelle.
# -> Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
arity = 2
special_tokens = []
for i in range(arity):
    special_tokens += [f"~~[[{i}", f"{i}]]~~"]

# Generate the word embedding module for the LSTM.
emb_layer = EmbeddingModule(
    word_counter=word_counter, word_dim=300, specials=special_tokens
)

# Create a task for each relation.
tasks = create_task(
    ATTRIBUTE,
    2,
    F_train[0].shape[1],
    2,
    emb_layer,
    mode="mtl",
    model="LogisticRegression",
)

# Create an Emmental model to learn the tasks.
model = EmmentalModel(name=f"{ATTRIBUTE}_task")

# Add the tasks to the model.
for task in tasks:
    model.add_task(task)

#######################################################################
# 2. Wrap candidates into an EmmentalDataLoader for training.
#######################################################################

# Here we only use the samples that have labels, i.e., we filter out the
# samples whose marginals are not significant.
diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
train_idxs = np.where(diffs > 1e-6)[0]

# Create a dataloader with the weakly supervised samples to learn the model.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE,
        train_cands[0],
        F_train[0],
        emb_layer.word2id,
        train_marginals,
        train_idxs,
    ),
    split="train",
    batch_size=100,
    shuffle=True,
)

# Create a test dataloader for prediction.
test_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
    ),
    split="test",
    batch_size=100,
    shuffle=False,
)

#######################################################################
# 3. Training and inference (prediction).
#######################################################################

# Learn the tasks.
emmental_learner = EmmentalLearner()
emmental_learner.learn(model, [train_dataloader])

# Predict with the learned model.
test_preds = model.predict(test_dataloader, return_preds=True)
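The marginal-filtering step in the snippet above keeps only the candidates whose label distribution deviates from uniform. In isolation, with made-up marginal values, it behaves like this:

```python
import numpy as np

# Rows are candidates; columns are per-class probabilities from weak supervision.
train_marginals = np.array([
    [0.5, 0.5],  # uniform marginals carry no signal and are filtered out
    [0.9, 0.1],
    [0.2, 0.8],
])

# A candidate is kept when max - min over its classes exceeds the tolerance.
diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
train_idxs = np.where(diffs > 1e-6)[0]
print(train_idxs)  # -> [1 2]
```

Only the indices in train_idxs are handed to FonduerDataset, so the uninformative candidates never reach the training loop.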