A Modern C++ Data Sciences Toolkit
- Use `MAKE_NUMERIC_IDENTIFIER` instead of `MAKE_NUMERIC_IDENTIFIER_UDL` on GCC 7.1.1.
- Fix a `cmake` issue where additional exception handling libraries would be linked in (causing a crash during indexing) by building the mman-win32 library as shared.
- Fix an issue in `murmur_hash`.

Model File Checksums (sha256):

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
- Add `xz{i,o}fstream` to `meta::io` if compiled with liblzma available.
- `util::disk_vector<const T>` can now be used to specify a read-only view of a disk-backed vector.
- `ir_eval::print_stats` now takes a `num_docs` parameter to properly display evaluation metrics at a certain cutoff point, which was always 5 beforehand. This fixes a bug in `query-runner` where the stats were not being computed according to the cutoff point specified in the configuration.
- `ir_eval::avg_p` now correctly stops computing after `num_docs`. Before, if you specified `num_docs` as a smaller value than the size of the result list, it would erroneously keep calculating until the end of the result list instead of stopping after `num_docs` elements.
- `{inverted,forward}_index` can now be loaded from read-only filesystems.

Model File Checksums (sha256):

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
- Add an `embedding_analyzer` that represents documents with their averaged word vectors.
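The representation such an analyzer produces can be sketched in a few lines of standalone C++ (the `average_embedding` helper below is illustrative, not MeTA's API): a document vector is the mean of the vectors of its in-vocabulary tokens.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Represent a document as the average of its tokens' word vectors.
// Tokens missing from the embedding table are simply skipped.
std::vector<double>
average_embedding(const std::vector<std::string>& tokens,
                  const std::unordered_map<std::string, std::vector<double>>& embed,
                  std::size_t dim)
{
    std::vector<double> doc(dim, 0.0);
    std::size_t found = 0;
    for (const auto& tok : tokens)
    {
        auto it = embed.find(tok);
        if (it == embed.end())
            continue; // out-of-vocabulary token
        for (std::size_t i = 0; i < dim; ++i)
            doc[i] += it->second[i];
        ++found;
    }
    if (found > 0)
        for (auto& x : doc)
            x /= found;
    return doc;
}
```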
- Add a `parallel::reduction` algorithm designed for parallelizing complex accumulation operations (like an E step in an EM algorithm).
- Parallelize feature counting in the feature selector using the new `parallel::reduction`.
- Add a `parallel::for_each_block` algorithm to run functions on (relatively) equal sub-ranges of an iterator range in parallel.
- Add a parallel merge sort as `parallel::sort`.
- Add a `util/traits.h` header for generally useful traits.
- Add a Markov model implementation in `sequence::markov_model`.
- Add a generic unsupervised HMM implementation. This implementation supports HMMs with discrete observations (what is used most often) and sequence observations (useful for log mining applications). The forward-backward algorithm is implemented using both the scaling method and the log-space method. The scaling method is used by default, but the log-space method is useful for HMMs with sequence observations to avoid underflow issues when the output probabilities themselves are very small.
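As a self-contained illustration of why the log-space method helps (this is a generic forward algorithm, not MeTA's implementation), the forward recurrence can be carried out entirely in log space with a log-sum-exp: probabilities that would underflow as long products stay representable as sums of logs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable log(sum_i exp(xs[i])).
double log_sum_exp(const std::vector<double>& xs)
{
    double mx = xs[0];
    for (double x : xs)
        mx = std::max(mx, x);
    double sum = 0.0;
    for (double x : xs)
        sum += std::exp(x - mx);
    return mx + std::log(sum);
}

// Log-space forward algorithm for a discrete-observation HMM.
// pi: initial state distribution, A: transitions, B: emissions.
// Returns log P(obs | model).
double forward_log(const std::vector<double>& pi,
                   const std::vector<std::vector<double>>& A,
                   const std::vector<std::vector<double>>& B,
                   const std::vector<int>& obs)
{
    auto n = pi.size();
    std::vector<double> alpha(n);
    for (std::size_t i = 0; i < n; ++i)
        alpha[i] = std::log(pi[i]) + std::log(B[i][obs[0]]);
    for (std::size_t t = 1; t < obs.size(); ++t)
    {
        std::vector<double> next(n);
        for (std::size_t j = 0; j < n; ++j)
        {
            std::vector<double> terms(n);
            for (std::size_t i = 0; i < n; ++i)
                terms[i] = alpha[i] + std::log(A[i][j]);
            next[j] = log_sum_exp(terms) + std::log(B[j][obs[t]]);
        }
        alpha = std::move(next);
    }
    return log_sum_exp(alpha);
}
```

The scaling method instead renormalizes each `alpha` column to sum to one and accumulates the log of the scaling factors; both approaches give the same log-likelihood.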
- Add the KL-divergence retrieval function using pseudo-relevance feedback with the two-component mixture-model approach of Zhai and Lafferty, called `kl_divergence_prf`. This ranker internally can use any `language_model_ranker` subclass like `dirichlet_prior` or `jelinek_mercer` to perform the ranking of the feedback set and the result documents with respect to the modified query. The EM algorithm used for the two-component mixture model is provided as the `index::feedback::unigram_mixture` free function and returns the feedback model.
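The shape of that EM procedure can be sketched generically (the `unigram_mixture_em` function below is an illustrative stand-in, and the fixed mixing weight `lambda` and background model are assumptions of the sketch, not the `unigram_mixture` signature): each word occurrence in the feedback documents is attributed either to the feedback model or to the fixed background model, and only the feedback model is re-estimated.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// EM for a two-component unigram mixture: word counts from the
// feedback documents are explained by a feedback model f or by a
// fixed background (collection) model bg, mixed with weight lambda
// on the background. Only f is re-estimated.
std::vector<double> unigram_mixture_em(const std::vector<double>& counts,
                                       const std::vector<double>& bg,
                                       double lambda, int iters)
{
    auto v = counts.size();
    std::vector<double> f(v, 1.0 / v); // uniform initialization
    for (int it = 0; it < iters; ++it)
    {
        std::vector<double> expected(v);
        double total = 0.0;
        for (std::size_t w = 0; w < v; ++w)
        {
            // E step: posterior probability the occurrence came from f
            double p_f = (1 - lambda) * f[w];
            double denom = p_f + lambda * bg[w];
            expected[w] = denom > 0 ? counts[w] * p_f / denom : 0.0;
            total += expected[w];
        }
        // M step: renormalize the expected counts into a distribution
        for (std::size_t w = 0; w < v; ++w)
            f[w] = expected[w] / total;
    }
    return f;
}
```

The background component absorbs common words, so the returned feedback model concentrates on terms that are unusually frequent in the feedback set.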
- Add the Rocchio algorithm (`rocchio`) for pseudo-relevance feedback in the vector space model.
- Breaking Change. To facilitate the above two changes, we have also broken the `ranker` hierarchy into one more level. At the top we have `ranker`, which has a pure virtual function `rank()` that can be overridden to provide entirely custom ranking behavior. This is the class the KL-divergence and Rocchio methods derive from, as we need to re-define what it means to rank documents (first retrieving a feedback set, then ranking documents with respect to an updated query). Most of the time, however, you will want to derive from the second level, `ranking_function`, which is what was called `ranker` before. This class provides a definition of `rank()` to perform document-at-a-time ranking, and expects deriving classes to instead provide `initial_score()` and `score_one()` implementations to define the scoring function used for each document. Existing code that derived from `ranker` prior to this version of MeTA likely needs to be changed to instead derive from `ranking_function`.
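In miniature, the two-level hierarchy looks like this (a self-contained sketch with simplified types, not MeTA's actual headers): the base class leaves `rank()` entirely open, while the second level fixes the ranking loop and asks subclasses only for a scoring function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct doc_score { uint64_t id; double score; };

// Top level: fully custom ranking behavior via a pure virtual rank().
class ranker
{
  public:
    virtual ~ranker() = default;
    virtual std::vector<doc_score>
    rank(const std::vector<std::vector<double>>& docs) = 0;
};

// Second level: implements rank() by scoring each document, and asks
// deriving classes only for initial_score() and score_one().
class ranking_function : public ranker
{
  public:
    std::vector<doc_score>
    rank(const std::vector<std::vector<double>>& docs) override
    {
        std::vector<doc_score> results;
        for (uint64_t d = 0; d < docs.size(); ++d)
        {
            double score = initial_score();
            for (double term_weight : docs[d])
                score += score_one(term_weight);
            results.push_back({d, score});
        }
        std::sort(results.begin(), results.end(),
                  [](const doc_score& a, const doc_score& b)
                  { return a.score > b.score; });
        return results;
    }

  protected:
    virtual double initial_score() const { return 0.0; }
    virtual double score_one(double term_weight) const = 0;
};

// A trivial scoring function: the sum of term weights.
class sum_ranker : public ranking_function
{
  protected:
    double score_one(double term_weight) const override { return term_weight; }
};
```

A feedback-based method would instead derive from the top-level class and override `rank()` itself, since it must first build a feedback set before scoring.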
- Add the `util::transform_iterator` class and `util::make_transform_iterator` function for providing iterators that transform their output according to a unary function.
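The idea can be shown with a minimal standalone version (MeTA's class satisfies the full iterator requirements; this sketch only covers dereference, increment, and comparison):

```cpp
#include <cassert>
#include <vector>

// Minimal transforming iterator: wraps an iterator and applies a
// unary function on dereference.
template <class Iter, class Fun>
class transform_iterator
{
  public:
    transform_iterator(Iter it, Fun fun) : it_(it), fun_(fun) {}

    auto operator*() const { return fun_(*it_); }

    transform_iterator& operator++()
    {
        ++it_;
        return *this;
    }

    bool operator!=(const transform_iterator& other) const
    {
        return it_ != other.it_;
    }

  private:
    Iter it_;
    Fun fun_;
};

// Deduction helper mirroring make_transform_iterator.
template <class Iter, class Fun>
transform_iterator<Iter, Fun> make_transform_iterator(Iter it, Fun fun)
{
    return {it, fun};
}
```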
- Breaking Change. `whitespace_tokenizer` now emits only word tokens by default, suppressing all whitespace tokens. The old default was to emit tokens containing whitespace in addition to actual word tokens. The old behavior can be obtained by passing `false` to its constructor, or setting `suppress-whitespace = false` in its configuration group in `config.toml`. (Note that whitespace tokens are still needed if using a `sentence_boundary` filter but, in nearly all circumstances, `icu_tokenizer` should be preferred.)
- Breaking Change. Co-occurrence counting for embeddings now uses history that crosses sentence boundaries by default. The old behavior (clearing the history when starting a new sentence) can be obtained by ensuring that a tokenizer is being used that emits sentence boundary tags and by setting `break-on-tags = true` in the `[embeddings]` table of `config.toml`.
- Breaking Change. All references in the embeddings library to "coocur" have changed to "cooccur". This means that some files and binaries have been renamed. Much of the co-occurrence counting part of the embeddings library has also been moved to the public API.
- Co-occurrence counting is now performed in parallel. The behavior of its merge strategy can be configured with the new `[embeddings]` config parameter `merge-fanout = n`, which specifies the maximum number of on-disk chunks to allow before kicking off a multi-way merge (default 8).
- `packed_write` and `packed_read` overloads for `std::pair`, `stats::dirichlet`, `stats::multinomial`, `util::dense_matrix`, and `util::sparse_vector`.
- `ranker_factory` now allows construction/loading of `language_model_ranker` subclasses (useful for the `kl_divergence_prf` implementation).
- Add a `util::make_fixed_heap` helper function to simplify the declaration of `util::fixed_heap` classes with lambda function comparators.
- Add a `cranfield` dataset that contains non-binary relevance judgments to facilitate these new tests.
- Fix `cmake` behavior when building a static ICU library. `meta-utf` is now forced to be a shared library, which (1) should save on binary sizes and (2) ensures that the statically built ICU is linked into the `libmeta-utf.so` library to avoid undefined references to ICU functions.
- Fix an issue where `identifiers.h` would change behavior based on the `NDEBUG` macro's setting. This behavior has been removed, and opaque identifiers are always on.
- `disk_index::doc_name` and `disk_index::doc_path` have been deprecated in favor of the more general (and less confusing) `metadata()`. They will be removed in a future major release.

Model File Checksums (sha256):

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
Please note that the embeddings model has changed. Please re-download.

- Honor the `indexer-num-threads` config option.
- Fix an issue with `parallel_for`.
- `filesystem::remove_all` now has a workaround for Windows systems to avoid spurious failures caused by virus scanners keeping files open after we deleted them.
- Fix a bug in `gzstreambuf::underflow`.

Model File Checksums (sha256):

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz

- Fix generation of `config.h` when MeTA is used as a sub-project via `add_subdirectory()`.

Model File Checksums (sha256):

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
- Add a minimal perfect hashing implementation for `language_model`, and unify the querying interface with the existing language model.
- Add a CMake `install()` command to install MeTA as a library (issue #143). For example, once the library is installed, users can do:

      find_package(MeTA 2.4 REQUIRED)

      add_executable(my-program src/my_program.cpp)
      target_link_libraries(my-program meta-index) # or whatever libs needed from MeTA
- Feature selection functionality added to `multiclass_dataset` and `binary_dataset` and views (issues #111, #149 and PR #150 thanks to @siddshuk):

      auto selector = features::make_selector(*config, training_vw);
      uint64_t total_features_selected = 20;
      selector->select(total_features_selected);
      auto filtered_dset = features::filter_dataset(dset, *selector);
- Users can now, similar to `hash_append`, declare standalone functions in the same scope as their type called `packed_read` and `packed_write`, which will be called by `io::packed::read` and `io::packed::write`, respectively, via argument-dependent lookup.
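The mechanism can be demonstrated with a toy stand-in for `io::packed` (the byte-level encoding here is illustrative, not MeTA's actual format): the generic overloads forward to `packed_write`/`packed_read` through an unqualified call, so overloads declared in the user type's own namespace are found by argument-dependent lookup.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>

namespace io
{
namespace packed
{
// Base case: raw integer serialization.
template <class Stream>
void write(Stream& os, uint64_t value)
{
    os.write(reinterpret_cast<const char*>(&value), sizeof(value));
}

template <class Stream>
void read(Stream& is, uint64_t& value)
{
    is.read(reinterpret_cast<char*>(&value), sizeof(value));
}

// Generic case: defer to packed_write/packed_read found via ADL on T.
template <class Stream, class T>
void write(Stream& os, const T& value)
{
    packed_write(os, value);
}

template <class Stream, class T>
void read(Stream& is, T& value)
{
    packed_read(is, value);
}
}
}

namespace user
{
struct point
{
    uint64_t x;
    uint64_t y;
};

// Standalone functions in the same scope as point: ADL finds them.
template <class Stream>
void packed_write(Stream& os, const point& p)
{
    io::packed::write(os, p.x);
    io::packed::write(os, p.y);
}

template <class Stream>
void packed_read(Stream& is, point& p)
{
    io::packed::read(is, p.x);
    io::packed::read(is, p.y);
}
}
```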
- `lm::diff`.
- `meta::hashing` library: `hash_append` overload for `std::vector`, manually-seeded hash function.
- `install()`.
- `std::vector` operations added to `io::packed`.

Model File Checksums (sha256):

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
- Forward and inverted indexes are now stored in one directory. To make use of your existing indexes, you will need to move their directories. For example, a configuration that used to look like the following

      dataset = "20newsgroups"
      corpus = "line.toml"
      forward-index = "20news-fwd"
      inverted-index = "20news-inv"

  will now look like the following

      dataset = "20newsgroups"
      corpus = "line.toml"
      index = "20news-index"

  and your folder structure should now look like

      20news-index
      ├── fwd
      └── inv

  You can do this by simply moving the old folders around like so:

      mkdir 20news-index
      mv 20news-fwd 20news-index/fwd
      mv 20news-inv 20news-index/inv

- `stats::multinomial` can now report the number of unique event types counted (`unique_events()`).
- `std::vector` can now be hashed via `hash_append`.
Model File Checksums (sha256):

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
- Add `wiki-page-rank`; see the website for more information on obtaining the required data.
- Fix a bug in `directed_graph::add_edge()`.
- Fix `find_first_of` and `find_last_of` in `util::string_view`.
- `forward_index` now knows how to tokenize a document down to a `feature_vector`, provided it was generated with a non-LIBSVM analyzer.
- `batch_train` no longer shuffles the data. Shuffling the data causes horrible access patterns in the postings file, so the data should instead be shuffled before indexing.
- `util::array_view`s can now be constructed as empty.
- `util::multiway_merge` has been made more generic. You can now specify both the comparison function and merging criteria as parameters, which default to `operator<` and `operator==`, respectively.
- `io::mifstream` and `io::mofstream` have been added for places where a moveable `ifstream` or `ofstream` is desired as a workaround for older standard libraries lacking these move constructors.
- The number of indexing threads can be controlled via `indexer-num-threads` (which defaults to the number of threads on the system), and the number of threads allowed to concurrently write to disk can be controlled via `indexer-max-writers` (which defaults to 8).

Model File Checksums (sha256):

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
- Add a `word_embeddings` class for loading and querying trained embeddings. To facilitate returning word embeddings, a simple `util::array_view` class was added.
- Import `fastapprox` into the `math` namespace.
- Fix `probe_map::extract()` for the `inline_key_value_storage` type; the old implementation forgot to delete all sentinel values before returning the vector.
- Fix `l1norm()` in `sgd_model`.
- Fix the `gmap` calculation, where an average precision of 0 was ignored.
- Fix progress output in `multiway_merge`.
- Improve the performance of `printing::progress`. Before, `progress::operator()` in tight loops could dramatically hurt performance, particularly due to frequent calls to `std::chrono::steady_clock::now()`. Now, `progress::operator()` simply sets an atomic iteration counter and a background thread periodically wakes to update the progress output.
- With `store-full-text = true` (default false) in the corpus config, the string metadata field "content" will be added. This is to simplify the creation of full text metadata: the user doesn't have to duplicate their dataset in `metadata.dat`, and `metadata.dat` will still be somewhat human-readable without large strings of full text added.
- Allow `make_index` to take a user-supplied corpus object.
- Run unit tests with `./unit-test` instead of `ctest`. There aren't really many advantages to using CTest at this point with the new unit test framework, so just use our unit test executable.
- Fix a bug where `metadata_parser` would not consume spaces in string metadata fields. Thanks to hopsalot on the forum for the bug report!
- Fix a build issue with `clang` related to their shipped version of `string_view` lacking a const `to_string()` method.
- The `./profile` executable now ensures that the file exists before operating on it. Thanks to @domarps for the PR!
- Add a `util::multiway_merge` algorithm for performing the merge-step of an external memory merge sort.