A knowledge base construction engine for richly formatted data
This is a long-awaited release with some performance improvements and some breaking changes. See the changelog for details.
* Added HOCRDocPreprocessor and HocrVisualParser to support hOCR as an input file format. (#476) (#519)
* fonduer.parser.visual_parser.hocr_visual_parser. (#534) (#542)
* fonduer.parser.Parser. (#494) (#544)
* @HiromuHota: Renamed VisualLinker to PdfVisualParser, which assumes the following: (#518)

  * pdf_path should be a directory path where the PDF files exist, and cannot be a file path.
  * The PDF file name (os.path.basename) should match that of the document, except for the extension. E.g., the PDF file should be either "123.pdf" or "123.PDF" for "123.html".

* @HiromuHota: Changed Parser's signature as follows: (#518)

  * Renamed vizlink to visual_parser.
  * Removed pdf_path. Now this is required only by PdfVisualParser.
  * Removed visual. Provide visual_parser if visual information is to be parsed.

* @YasushiMiyata: Changed UDFRunner's and UDF's data commit process as follows: (#545)

  * Removed the add process on a single thread in _apply in UDFRunner.
  * Added UDFRunner._add of y on multi-threads to Parser, Labeler and Featurizer.
  * Removed y of the document parsed result from out_queue in UDF.
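As a concrete illustration of the file-naming rule that PdfVisualParser assumes, the small standard-library sketch below computes the PDF file names that would be accepted for a given document; the helper `expected_pdf_names` is hypothetical and not part of Fonduer's API:

```python
import os


def expected_pdf_names(doc_file: str) -> list:
    """PDF file names (per the naming rule above) accepted for a given document."""
    stem = os.path.splitext(os.path.basename(doc_file))[0]
    # Both the lowercase and uppercase extensions are accepted.
    return [stem + ".pdf", stem + ".PDF"]


print(expected_pdf_names("data/html/123.html"))  # -> ['123.pdf', '123.PDF']
```

Under the new signature, such PDFs live in the directory passed as pdf_path to PdfVisualParser, rather than being pointed to file by file.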
This is a big release with a lot of changes. These changes are summarized here. Check the Changelog for more details.
* Added get_max_row_num to fonduer.utils.data_model_utils.tabular. (#469) (#480)
* Sentence and SpanMention. (#429)
* Added nullables to candidate_subclass() to allow a NULL mention in a candidate. (#496) (#497)
* data_model_utils.tabular to data_model_utils.textual. (#503) (#505)
* Changed get_cell_ngrams and get_neighbor_cell_ngrams to yield nothing when the mention is not tabular. (#471) (#504)
* Deprecated bbox_from_span and bbox_from_sentence. (#429)
* Deprecated visualizer.get_box in favor of span.get_bbox(). (#445) (#446)
* data_model_utils.tabular. (#503) (#505)
* Fixed get_horz_ngrams and get_vert_ngrams so that they work even when the input mention is not tabular. (#425) (#426)
* Fixed _get_axis_ngrams not to return None when the input is not tabular. (#481)
* Fixed Visualizer.display_candidates not to draw rectangles on wrong pages. (#488)

A summary of the changes in this release is below. Check the Changelog for more details.
* Labeling functions now use snorkel.labeling.labeling_function. (#400 <https://github.com/HazyResearch/fonduer/issues/400>) (#401 <https://github.com/HazyResearch/fonduer/pull/401>)

A summary of the changes in this release is below. Check the Changelog for more details.
Fonduer has a new mode argument to support switching between different learning modes (e.g., STL or MTL).
# Create a task for each relation.
tasks = create_task(
    task_names=TASK_NAMES,
    n_arities=N_ARITIES,
    n_features=N_FEATURES,
    n_classes=N_CLASSES,
    emb_layer=EMB_LAYER,
    model="LogisticRegression",
    mode=MODE,
)
* Added a mode argument in create_task to support STL and MTL.

A summary of the changes in this release is below. Check the Changelog for more details.
Rather than maintaining a separate learning engine, we have switched to Emmental, a deep learning framework for multi-task learning. Moving to a more general learning framework allows Fonduer to support more applications and multi-task learning.
# With Emmental, you need the following steps to perform learning:
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
# 2. Wrap candidates into an EmmentalDataLoader for training.
# 3. Training and inference (prediction).
import emmental
import fonduer
import numpy as np
from emmental.data import EmmentalDataLoader
from emmental.learner import EmmentalLearner
from emmental.model import EmmentalModel
from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task
from fonduer.learning.utils import collect_word_counter

# train_cands, F_train, test_cands, F_test, and ATTRIBUTE come from the
# earlier candidate extraction and featurization steps.

# Collect a word counter from the candidates, which is used by the LSTM model.
word_counter = collect_word_counter(train_cands)

# Initialize Emmental. To customize Emmental, please check here:
# https://emmental.readthedocs.io/en/latest/user/config.html
emmental.init(fonduer.Meta.log_path)

#######################################################################
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
#######################################################################

# Generate the special tokens that the LSTM model uses to locate mentions.
# In the LSTM model, we pad sentences with special tokens to help the LSTM
# learn those mentions. Example:
# Original sentence: Then Barack married Michelle.
# -> Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
arity = 2
special_tokens = []
for i in range(arity):
    special_tokens += [f"~~[[{i}", f"{i}]]~~"]

# Generate the word embedding module for the LSTM.
emb_layer = EmbeddingModule(
    word_counter=word_counter, word_dim=300, specials=special_tokens
)

# Create a task for each relation.
tasks = create_task(
    ATTRIBUTE,
    2,
    F_train[0].shape[1],
    2,
    emb_layer,
    mode="mtl",
    model="LogisticRegression",
)

# Create an Emmental model to learn the tasks.
model = EmmentalModel(name=f"{ATTRIBUTE}_task")

# Add the tasks to the model.
for task in tasks:
    model.add_task(task)

#######################################################################
# 2. Wrap candidates into an EmmentalDataLoader for training.
#######################################################################

# Here we only use the samples that have labels, i.e., we filter out the
# samples whose marginals are not significant.
diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
train_idxs = np.where(diffs > 1e-6)[0]

# Create a dataloader with the weakly supervised samples to learn the model.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE,
        train_cands[0],
        F_train[0],
        emb_layer.word2id,
        train_marginals,
        train_idxs,
    ),
    split="train",
    batch_size=100,
    shuffle=True,
)

# Create a test dataloader for prediction.
test_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
    ),
    split="test",
    batch_size=100,
    shuffle=False,
)

#######################################################################
# 3. Training and inference (prediction).
#######################################################################

# Learn the tasks.
emmental_learner = EmmentalLearner()
emmental_learner.learn(model, [train_dataloader])

# Predict with the learned model.
test_preds = model.predict(test_dataloader, return_preds=True)
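The marginal-filtering step in the snippet above keeps only the candidates whose label distribution deviates from uniform. In isolation, with made-up marginal values, it behaves like this:

```python
import numpy as np

# Rows are candidates; columns are per-class probabilities from weak supervision.
train_marginals = np.array([
    [0.5, 0.5],  # uniform marginals carry no signal and are filtered out
    [0.9, 0.1],
    [0.2, 0.8],
])

# A candidate is kept when max - min over its classes exceeds the tolerance.
diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
train_idxs = np.where(diffs > 1e-6)[0]
print(train_idxs)  # -> [1 2]
```

Only the indices in train_idxs are handed to FonduerDataset, so the uninformative candidates never reach the training loop.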