Fast Word Segmentation with Triangular Matrix
Fast Word Segmentation using a Triangular Matrix approach.
It is 2x faster, uses constant O(1) memory instead of linear O(n), scales better, and is more GC friendly.
For word segmentation using a dynamic programming approach, have a look at WordSegmentationDP.
For word segmentation with spelling correction, use WordSegmentation and LookupCompound of the SymSpell library.
- thequickbrownfoxjumpsoverthelazydog
+ the quick brown fox jumps over the lazy dog
- itwasabrightcolddayinaprilandtheclockswerestrikingthirteen
+ it was a bright cold day in april and the clocks were striking thirteen
- itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness
+ it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
Segmenting a 185-character string into 53 words takes 4 milliseconds (single core on a 2012 MacBook Pro).
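To make the examples above concrete, here is a minimal Python sketch of frequency-based word segmentation. It uses plain dynamic programming, not the triangular matrix algorithm of this library, and it is not the library's API. The function name `segment`, the toy word counts, and the length-based penalty for unknown words are all assumptions made for illustration.

```python
import math

def segment(text, word_counts, max_word_length=20):
    """Split `text` into the word sequence with the highest total log-probability.

    Illustrative only: `word_counts` is a hypothetical {word: count} mapping.
    """
    total = sum(word_counts.values())

    def log_prob(word):
        count = word_counts.get(word)
        if count:
            return math.log10(count / total)
        # Assumption: unknown words get a penalty that grows with their length.
        return math.log10(10.0 / (total * 10 ** len(word)))

    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_length), end):
            word = text[start:end]
            score, words = best[start]
            candidates.append((score + log_prob(word), words + [word]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]

# Toy counts, invented for this example only.
toy_counts = {"the": 500, "quick": 20, "brown": 15, "fox": 10,
              "jumps": 5, "over": 60, "lazy": 8, "dog": 25}
print(" ".join(segment("thequickbrownfoxjumpsoverthelazydog", toy_counts)))
# prints: the quick brown fox jumps over the lazy dog
```

With a real frequency dictionary in place of the toy counts, the same scoring idea prefers splits made of frequent words over splits that leave rare or unknown fragments.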
Fast Word Segmentation for noisy text
Sub-millisecond compound aware automatic spelling correction
SymSpell vs. BK-tree: 100x faster fuzzy string search & spell checking
WordSegmentationTM targets .NET Standard v2.0 and can be used in any .NET implementation that supports it.
The SymSpell, Demo, DemoCompound and Benchmark projects can be built with the free Visual Studio Code, which runs on Windows, macOS and Linux.
Dictionary quality is paramount for word segmentation quality. To achieve this, two data sources were combined by intersection: Google Books Ngram data, which provides representative word frequencies (but contains many entries with spelling errors), and SCOWL (Spell Checker Oriented Word Lists), which ensures genuine English vocabulary (but contains no word frequencies, which are required for ranking suggestions within the same edit distance).
The frequency_dictionary_en_82_765.txt was created by intersecting the two lists mentioned above. Through this reciprocal filtering, only words that appear in both lists are kept. Additional filters were applied, and the resulting list was truncated to the ≈ 80,000 most frequent words.
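The intersect-then-truncate step can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used to build the shipped dictionary: the function name `build_frequency_dictionary` and the sample data are hypothetical, and the unspecified additional filters are left out.

```python
def build_frequency_dictionary(ngram_counts, valid_words, max_entries=80_000):
    """Keep words found in both sources, ranked by frequency.

    ngram_counts: hypothetical {word: count} mapping (frequency source).
    valid_words:  hypothetical set of verified words (vocabulary source).
    """
    # Intersection: drop every counted word that the word list does not confirm.
    shared = {w: c for w, c in ngram_counts.items() if w in valid_words}
    # Rank by descending count and keep only the most frequent entries.
    ranked = sorted(shared.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_entries]

# Invented sample data: "teh" is a misspelling with a count, and
# "zyzzyva" is a valid word that has no count.
ngram_counts = {"the": 1000, "teh": 12, "fox": 40, "quick": 55}
valid_words = {"the", "fox", "quick", "zyzzyva"}

for word, count in build_frequency_dictionary(ngram_counts, valid_words, max_entries=2):
    print(word, count)
# prints:
# the 1000
# quick 55
```

The misspelling is dropped because the word list does not contain it, and the valid but uncounted word is dropped because it has no frequency to rank by.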
You can build your own frequency dictionary for your language or your specialized technical domain. Languages with non-Latin characters are supported, e.g. Cyrillic, Chinese or Georgian.
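As a rough starting point for a custom dictionary, the Python sketch below counts words in a corpus and formats them as one `word count` pair per line. That line format is an assumption, so check it against the format the dictionary loader expects. The helper names `count_words` and `format_dictionary` are hypothetical. The tokenizer relies on whitespace and punctuation between words, so it suits Cyrillic or Georgian text but not languages written without spaces, such as Chinese, which need a separate word source.

```python
import re
from collections import Counter

def count_words(text):
    """Count runs of letters (any Unicode script) in lowercased text."""
    # [^\W\d_]+ matches one or more letters, excluding digits and underscores.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))

def format_dictionary(counts):
    """Render counts as 'word count' lines, most frequent first (assumed format)."""
    return "\n".join(f"{word} {count}" for word, count in counts.most_common())

# Tiny invented Cyrillic corpus.
corpus = "Мама мыла раму. Мама устала."
counts = count_words(corpus)
print(format_dictionary(counts))
```

For a domain-specific dictionary, the same counting can be run over a representative corpus of that domain, optionally followed by intersection with a trusted word list to filter out misspellings.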
WordSegmentationTM is contributed by SeekStorm - the high performance Search as a Service & search API