Resin

Vector space search engine. Available as an HTTP service or as an embedded library.

⍼ Resin.Search

HTTP search engine/embedded library

Launch a Resin HTTP server or use the Resin search library to search through any vector space. With hardware-accelerated vector operations from MathNet, Resin is especially well suited to problems that can be modeled as vector spaces.

Vector spaces are configured by implementing IModel<T>.
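
As a rough, language-neutral illustration of what a model does, the Python sketch below maps text to sparse bag-of-characters vectors, one per word. It is not Resin's C# API: the class name `BagOfCharsModel` and the method `tokenize` are hypothetical stand-ins for an IModel<T> implementation and its Tokenize method, and splitting on whitespace is an assumption made for brevity.

```python
from collections import Counter


class BagOfCharsModel:
    """Hypothetical stand-in for a text model; not Resin's actual IModel<T>."""

    def tokenize(self, text):
        """Return one sparse vector per whitespace-separated word.

        Each vector maps a Unicode code point (the dimension) to the number
        of times that character occurs in the word (the component value).
        """
        vectors = []
        for word in text.lower().split():
            counts = Counter(ord(ch) for ch in word)
            vectors.append(dict(counts))
        return vectors


model = BagOfCharsModel()
vectors = model.tokenize("Hello vector space")
```

Each word becomes a point in a space with one dimension per character, so words that share many characters end up at a small angle to each other.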

Document database

Resin stores data as document collections. It applies your preferred IModel<T> to your data as you write and query it. The write pipeline produces a set of indices (graphs), one for each document field, that you can interact with through the Resin web GUI, the Resin read/write JSON HTTP API, or programmatically.

Vector-based indices

Resin indices are binary search trees that, as you populate them with your data, cluster vectors that are similar to each other. Graph nodes are created in the Tokenize method of your model. When a node is added to the graph, its cosine angle, i.e. its similarity to other nodes, determines its position (path) within the graph.
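
The idea of letting cosine similarity decide a node's path can be sketched in a few lines of Python. This is a conceptual illustration only, not Resin's implementation: the two thresholds (`IDENTICAL`, `SIMILAR`), the left/right convention, and all function names are assumptions chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {dimension: value}."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class Node:
    def __init__(self, vector, doc_id):
        self.vector = vector
        self.doc_ids = [doc_id]  # documents that produced this vector
        self.left = None         # subtree of vectors similar to this one
        self.right = None        # subtree of vectors dissimilar to this one


# Assumed thresholds, chosen only for this illustration.
IDENTICAL = 0.999
SIMILAR = 0.5


def insert(root, vector, doc_id):
    """Walk down from the root; the similarity at each node picks the branch."""
    node = root
    while True:
        sim = cosine(vector, node.vector)
        if sim >= IDENTICAL:
            node.doc_ids.append(doc_id)  # same point in space: merge
            return node
        branch = "left" if sim > SIMILAR else "right"
        child = getattr(node, branch)
        if child is None:
            child = Node(vector, doc_id)
            setattr(node, branch, child)
            return child
        node = child


def find(root, vector):
    """Follow the same path as insert and return the best match seen."""
    best, best_sim = None, -1.0
    node = root
    while node is not None:
        sim = cosine(vector, node.vector)
        if sim > best_sim:
            best, best_sim = node, sim
        if sim >= IDENTICAL:
            break
        node = node.left if sim > SIMILAR else node.right
    return best, best_sim


def bag(word):
    """Bag-of-characters vector for a single word."""
    return {ord(c): float(word.count(c)) for c in set(word)}


root = Node(bag("resin"), 0)
insert(root, bag("resins"), 1)  # similar to "resin": goes left
insert(root, bag("graph"), 2)   # dissimilar: goes right
insert(root, bag("resin"), 3)   # identical: merged into the root
match, score = find(root, bag("grape"))
```

In this sketch similar vectors accumulate in the same subtree, which is what makes the tree behave like a set of clusters, and a lookup only visits one path instead of comparing against every node.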

Customizable vector spaces

Resin comes pre-loaded with two IModel vector space configurations: one for text and another for MNIST images. The text model has been tested by validating indices generated from Wikipedia search engine backup files as well as by parsing Common Crawl WAT, WET and WARC files, to determine at what scale Resin can operate and with what accuracy.

The image model is included mostly as an example of how to implement your own preferred machine-learning algorithm for building custom-made search indices. The error rate of the image classifier is ~5%.

Performance

Currently, Wikipedia-sized data sets produce indices capable of sub-second phrase searching.

You may also

  • build, validate and optimize indices using the command-line tool Sir.Cmd
  • read efficiently by specifying which fields to return in the JSON result
  • implement messaging formats such as XML (or any other, really) if JSON is not suitable for your use case
  • construct queries that join between fields and even between collections, which you may post as JSON to the read endpoint or create programmatically
  • construct any type of indexing scheme that produces any type of embeddings with virtually any dimensionality, using either sparse or dense vectors
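
To illustrate the last point, the Python sketch below shows that the choice between sparse and dense vectors is a storage decision: the cosine similarity comes out the same either way. The function names are hypothetical and the code is independent of Resin's own vector types.

```python
import math


def cosine_dense(a, b):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_sparse(a, b):
    """Cosine similarity of two {index: value} dicts holding only non-zeros."""
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def to_sparse(dense):
    """Keep only the non-zero components, keyed by their position."""
    return {i: x for i, x in enumerate(dense) if x != 0.0}


a = [0.0, 3.0, 0.0, 4.0]
b = [0.0, 3.0, 1.0, 0.0]

dense_score = cosine_dense(a, b)
sparse_score = cosine_sparse(to_sparse(a), to_sparse(b))
```

Sparse storage tends to pay off when most components are zero, as with bag-of-characters text vectors, while dense storage suits embeddings where nearly every component carries a value.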

Applications

Executables

  • Sir.HttpServer: HTTP search service with HTML GUI and HTTP JSON API for reading and writing.
  • Sir.Cmd: Command-line tool that executes commands implementing Sir.ICommand. Write, validate, optimize and more from the command line.

Libraries

  • Sir.Search: In-process search engine.
  • Sir.Core: Core types and shared interfaces, such as IModel, ICommand and IVector.
  • Sir.CommonCrawl: Command for downloading and indexing Common Crawl WAT and WET files.
  • Sir.Mnist: Command for training and testing the accuracy of an index of MNIST images.
  • Sir.Wikipedia: Command for indexing Wikipedia.

Roadmap

  • v0.1a - bag-of-characters vector space language model
  • v0.2a - HTTP API
  • v0.3a - query language
  • v0.4 - linear classifier image model
  • v0.5 - semantic language model
  • v1.0 - voice model
  • v2.0 - image-to-voice
  • v2.1 - voice-to-text
  • v2.2 - text-to-image
  • v2.3 - AI

Backlog

Huge

  • Distribute data sets across many servers (sharding, replication; RPC) or in other ways allow for horizontal scaling

Big

  • Memory mapping (to increase speed of querying and perhaps also writing; to increase scalability)
  • Update index (allow removal of documents; allow appending to an already persisted index token's postings list)
  • Async IO (for scalability)
  • Indexing of types other than string
  • Enable combining fields with different types in a document/model
  • Split application into "crawler" and "search"

README Source: kreeben/resin
