Two pieces of core tsunami technology need a multi-stage parsing strategy that preserves whitespace:
The simplest situation occurs when we only have a lightweight markup syntax: we simply convert the markup to an AST and then render HTML text, tracking which fragment or fragments of plaintext markup were used to define each fragment of output HTML.
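This flow can be sketched in a few lines. A minimal sketch, with a hypothetical `*bold*` syntax standing in for the real markup: each AST node carries its (start, end) source offsets, and rendering emits HTML fragments alongside a mapping back to those offsets.

```python
import re

def parse(src):
    """Split src into ('bold'|'text', content, start, end) nodes."""
    nodes, pos = [], 0
    for m in re.finditer(r"\*([^*]+)\*", src):
        if m.start() > pos:
            nodes.append(("text", src[pos:m.start()], pos, m.start()))
        nodes.append(("bold", m.group(1), m.start(), m.end()))
        pos = m.end()
    if pos < len(src):
        nodes.append(("text", src[pos:], pos, len(src)))
    return nodes

def render(nodes):
    """Return (html, mapping): mapping pairs each HTML fragment with
    the plaintext span it was defined by."""
    html, mapping = [], []
    for kind, content, start, end in nodes:
        frag = f"<b>{content}</b>" if kind == "bold" else content
        html.append(frag)
        mapping.append((frag, (start, end)))
    return "".join(html), mapping

html, mapping = render(parse("plain *loud* plain"))
# html    -> "plain <b>loud</b> plain"
# mapping -> each fragment tagged with its source span
```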
enter new data types
The principle behind embedded data types is that we can take the basic lightweight markup syntax and add to it. Perhaps we want to recognize addresses: we want a modular way to allow text nodes to be analyzed and possibly "enhanced" in the AST (in other words, a single text node would be split into multiple nodes).
- do standard markup parsing (text -> AST)
- detect data types in text nodes and split those nodes into more nodes (AST -> AST)
- render the AST, allowing modules to override how nodes or subtrees are rendered
- the result: HTML, with a perfect mapping from each fragment of HTML to one or more fragments of plaintext
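The AST -> AST step above (detect data types, split nodes) might look like this sketch, using a simple URL pattern as a stand-in for the address detector, and `(kind, text)` tuples as hypothetical node shapes:

```python
import re

URL = re.compile(r"https?://\S+")

def enhance(nodes):
    """Split each ('text', s) node around detected URLs; other node
    kinds pass through untouched. A detector module would plug in here."""
    out = []
    for node in nodes:
        kind, s = node
        if kind != "text":
            out.append(node)
            continue
        pos = 0
        for m in URL.finditer(s):
            if m.start() > pos:
                out.append(("text", s[pos:m.start()]))
            out.append(("url", m.group()))  # the "enhanced" node
            pos = m.end()
        if pos < len(s):
            out.append(("text", s[pos:]))
    return out

ast = [("text", "see https://example.com for details"), ("bold", "hi")]
enhanced = enhance(ast)
# one text node becomes three nodes; the bold node is untouched
```

A real renderer would then let the module that contributed the `url` kind override how those nodes are emitted.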
PEG parsing is probably the best option when all text must be valid and tokenization really needs to depend on semantics. (In PEGs, tokenization and parsing are not split into separate phases.)
- packrat parsing is depth-first search with memoization
- Earley parsing is breadth-first with memoization
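The packrat idea in miniature: recursive descent where each (rule, position) result is memoized, so backtracking stays linear-time. A sketch for the hypothetical grammar `S <- 'a' S 'b' / ''`:

```python
from functools import lru_cache

def matches(src):
    """True iff src is in a^n b^n, via memoized recursive descent."""
    @lru_cache(maxsize=None)  # the packrat memo table, keyed by position
    def S(pos):
        """Return the end position of the match of S starting at pos."""
        if pos < len(src) and src[pos] == "a":
            mid = S(pos + 1)
            if mid < len(src) and src[mid] == "b":
                return mid + 1
        return pos  # the empty alternative always matches
    return S(0) == len(src)

# matches("aabb") -> True; matches("aab") -> False
```

With memoization each `S(pos)` is computed once, which is what makes packrat parsing linear in the input length despite unlimited lookahead.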
V archived V
Macros like Boo
LLVM (Low Level Virtual Machine)
Phoenix enables teaching and collaborative research in code generation, optimization, program analysis, binary transformation, and software correctness. Phoenix is used as a research platform by Microsoft Research and will be the universal compiler backend for upcoming Microsoft languages and development tools.
- GNU C Compiler Internals/Architecture
- Parsing Techniques - Second Edition