Lug Save

Parsing expression grammar (PEG) embedded domain specific language and parsing machine for C++17

Project README

lug Build Status License

An embedded domain specific language for expressing parsers as extended parsing expression grammars (PEGs) in C++17

lug

Features

  • Natural syntax more akin to external parser generator languages
  • Separation of syntatic and lexical rules, with customizable implicit whitespace skipping
  • Direct and indirect left recursion with precedence levels to disambiguate subexpressions with mixed left/right recursion
  • Traditional PEG syntax has been extended to support attribute grammars
  • Cut operator to commit to currently matched parse prefix and prune all backtrack entries
  • Deferred evaluation of semantic actions, ensuring actions do not execute on failed branches or invalid input
  • Generated parsers are compiled to special-purpose bytecode and executed in a virtual parsing machine
  • UTF-8 text parsing with complete Level 1 and partial Level 2 support of the UTS #18 Unicode Regular Expressions technical standard
  • Automatic line and column tracking with customizable tab width and alignment
  • Uses expression template functors to implement the rules of the domain specific language
  • Header only library using C++17 language and library features
  • Relatively small with the intent of parser core to remain under 1500 lines of terse code

It is based on research introduced in the following papers:

Bryan Ford, Parsing expression grammars: a recognition-based syntactic foundation, Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, p.111-122, January 2004

Sérgio Medeiros et. al, A parsing machine for PEGs, Proceedings of the 2008 symposium on Dynamic Languages, p.1-12, July 2008

Kota Mizushima et. al, Packrat parsers can handle practical grammars in mostly constant space, Proceedings of the 9th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, p.29-36, June 2010

Sérgio Medeiros et. al, Left recursion in Parsing Expression Grammars, Science of Computer Programming, v.96 n.P2, p.177-190, December 2014

Leonardo Reis et. al, The formalization and implementation of Adaptable Parsing Expression Grammars, Science of Computer Programming, v.96 n.P2, p.191-210, December 2014

Sérgio Medeiros et. al, A parsing machine for parsing expression grammars with labeled failures, Proceedings of the 31st Annual ACM symposium on Applied Computing, p.1960-1967, April 2016

Building

As a header only library, lug itself does not need to be built. Simply ensure the lug header directory is in your include path and you're good to go.

As a baseline, the following compiler versions are known to work with lug.

Compiler Language Mode
Clang 5.0.0 (September 2017) -std=c++17 or -std=gnu++17
GCC 7.1.0 (May 2017) -std=c++17 or -std=gnu++17
Microsoft Visual C++ 2017 15.5 (December 2017) Platform Toolset: Visual Studio 2017 Toolset (v141), Language Standard: ISO C++17 Standard (/std:c++17)

To build the sample programs and unit tests, a makefile is provided for Linux and BSD platforms and a Visual Studio solution is available for use on Windows.

Syntax Reference

Operator Syntax
Sequence e1 > e2
Ordered Choice e1 | e2
Zero-or-More *e
One-or-More +e
Optional ~e
Positive Lookahead &e
Negative Lookahead !e
Terminal Description
chr(c) Matches the UTF-8, UTF-16, or UTF-32 character c
chr(c1, c2) Matches characters in the UTF-8, UTF-16, or UTF-32 interval [c1-c2]
str(s) Matches the sequence of characters in a string
bre(s) POSIX Basic Regular Expression (BRE)
any Matches any single character
any(flags) Matches a character exhibiting any of the character properties
all(flags) Matches a character with all of the character properties
none(flags) Matches a character with none of the character properties
eps Matches the empty string
eoi Matches the end of the input sequence
eol Matches a Unicode line-ending
nop No operation, does not emit any instructions
cut Emits a cut operation into the stream of semantic actions without matching
Literal Name Description
_cx Character Expression Matches the UTF-8, UTF-16, or UTF-32 character literal
_sx String Expression Matches the sequence of characters in a string literal
_rx Regular Expression POSIX Basic Regular Expression (BRE)
_icx Case Insensitive Character Expression Same as _cx but case insensitive
_isx Case Insensitive String Expression Same as _sx but case insensitive
_irx Case Insensitive Regular Expression Same as _rx but case insensitive
_scx Case Sensitive Character Expression Same as _cx but case sensitive
_ssx Case Sensitive String Expression Same as _sx but case sensitive
_srx Case Sensitive Regular Expression Same as _rx but case sensitive

TODO

  • parser error recovery
  • add an interactive processing mode flag to input sources?
  • handle exceptions thrown from semantic actions in semantics::accept?
  • feature: symbol tables and parsing conditions
  • feature: Adams-Nestra grammars and whitespace alignment
  • feature: syntax to specify number range of allowed iteration
  • optimization: tail recursion
  • optimization: reduce number of false-positive left-recursive calls even further by lazily evaluating rule mandate
  • optimization: additional instructions (test_char, test_any, test_range, test_class)
  • more samples, testing, and bug fixing
  • increase compiler warning level and fix any issues
  • documentation
Open Source Agenda is not affiliated with "Lug" Project. README Source: jwtowner/lug

Open Source Agenda Badge

Open Source Agenda Rating