tsunami

tsunami parsing model

Luke Breuer
2010-12-28 04:05 UTC

introduction
There are two pieces of core tsunami technology that need a multi-stage, whitespace-preserving parsing strategy. The simplest situation occurs when we only have a lightweight markup syntax: we simply convert the markup to an AST and then render HTML text, tracking which fragment or fragments of plaintext markup were used to define each fragment of output HTML.
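As a rough sketch (hypothetical names, not actual tsunami code), each AST node can carry the span of plaintext that defined it, so the renderer can emit HTML alongside a source map:

  from dataclasses import dataclass

  @dataclass
  class Node:
      kind: str            # e.g. "em", "strong", "text"
      start: int           # offsets of the defining plaintext fragment
      end: int
      children: list = None
      text: str = ""

  def render(node, source_map):
      """Render a node to HTML, recording which plaintext span produced it."""
      if node.kind == "text":
          html = node.text
      else:
          inner = "".join(render(c, source_map) for c in node.children or [])
          html = f"<{node.kind}>{inner}</{node.kind}>"
      source_map.append((html, (node.start, node.end)))
      return html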
enter new data types
The principle behind embedded data types is that we can take the basic lightweight markup syntax and add to it. Perhaps we want to recognize addresses: we need a modular way to let text nodes be analyzed and possibly "enhanced" in the AST (in other words, a single text node would be split into multiple nodes).
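A minimal sketch of such an enhancer (the names are illustrative, not tsunami's actual API; it reuses the Node shape from the sketch above) might scan a text node with a regular expression and split it into plain text nodes and data-typed nodes:

  import re

  # toy pattern; a real address recognizer would be far more involved
  ADDRESS = re.compile(r"\d+ [A-Z][a-z]+ (?:St|Ave|Rd)\b")

  def enhance_text_node(node):
      """Split one text node into several, wrapping recognized addresses."""
      out, pos = [], 0
      for m in ADDRESS.finditer(node.text):
          if m.start() > pos:
              out.append(Node("text", node.start + pos, node.start + m.start(),
                              text=node.text[pos:m.start()]))
          out.append(Node("address", node.start + m.start(), node.start + m.end(),
                          text=m.group()))
          pos = m.end()
      if pos < len(node.text):
          out.append(Node("text", node.start + pos, node.end, text=node.text[pos:]))
      return out or [node]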
possible pipeline
  1. plaintext
  2. do standard markup parsing (text -> AST)
  3. detect data types in text nodes and split those nodes into more nodes (AST -> AST)
  4. render the AST, allowing modules to override how nodes or subtrees are rendered
  5. HTML, with a perfect mapping from each fragment of HTML to one or more fragments of plaintext
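Wiring those stages together might look like the following sketch; parse_markup, the modules' enhance/render hooks, and the renders attribute are all assumptions for illustration, not real tsunami functions:

  def run_pipeline(plaintext, modules):
      # 2. standard markup parsing (text -> AST); parse_markup is assumed here
      root = parse_markup(plaintext)

      # 3. let each module split text nodes into more nodes (AST -> AST);
      #    modules are assumed to return [node] unchanged for kinds they ignore
      def enhance(node):
          out = []
          for child in node.children or []:
              enhance(child)
              parts = [child]
              if child.kind == "text":
                  for mod in modules:
                      parts = [p for q in parts for p in mod.enhance(q)]
              out.extend(parts)
          node.children = out
      enhance(root)

      # 4. render, letting a module override how its node kinds become HTML
      # 5. the source map pairs each HTML fragment with its plaintext span
      source_map = []
      overrides = {kind: mod.render for mod in modules for kind in mod.renders}
      def to_html(node):
          if node.kind in overrides:
              html = overrides[node.kind](node)
          elif node.kind == "text":
              html = node.text
          else:
              inner = "".join(to_html(c) for c in node.children or [])
              html = f"<{node.kind}>{inner}</{node.kind}>"
          source_map.append((html, (node.start, node.end)))
          return html
      return to_html(root), source_map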
Parsing Expression Grammar (PEG)
This is probably the best option when all input text must parse successfully and tokenization really needs to depend on semantics. (In PEGs, tokenization and parsing are not split into separate phases.)
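A tiny scannerless sketch of the idea (toy grammar, illustrative only): every rule works directly on the character stream, and ordered choice stands in for a separate tokenizer, so what counts as a "token" depends on which rule is currently being tried:

  # grammar sketch:  value <- number / word ;  number <- [0-9]+ ;  word <- [a-z]+
  # each parse function returns (result, next_position) or None on failure

  def parse_number(s, pos):
      end = pos
      while end < len(s) and s[end].isdigit():
          end += 1
      return (("number", s[pos:end]), end) if end > pos else None

  def parse_word(s, pos):
      end = pos
      while end < len(s) and s[end].isalpha():
          end += 1
      return (("word", s[pos:end]), end) if end > pos else None

  def parse_value(s, pos):
      # ordered choice: try number first, fall back to word (PEG's "/" operator)
      return parse_number(s, pos) or parse_word(s, pos)

  print(parse_value("42abc", 0))   # (('number', '42'), 2)
  print(parse_value("abc42", 0))   # (('word', 'abc'), 3)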
Links
notes
(from tef)
  • packrat parsing is depth-first search with memoization (sketched below)
  • Earley parsing is breadth-first search with memoization
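For instance, packrat parsing can be sketched as recursive descent where each rule memoizes its result per input position, so the depth-first search never re-parses the same span with the same rule (toy grammar, illustrative only):

  from functools import lru_cache

  TEXT = "((()))"

  @lru_cache(maxsize=None)   # the memo table: one entry per position for this rule
  def parse_parens(pos):
      """parens <- "(" parens ")" / "" ; returns the position after the match."""
      if pos < len(TEXT) and TEXT[pos] == "(":
          inner = parse_parens(pos + 1)
          if inner < len(TEXT) and TEXT[inner] == ")":
              return inner + 1
      return pos               # the empty alternative always succeeds

  print(parse_parens(0) == len(TEXT))   # True: the whole input matches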
V archived V
dynamic parsing?
Macros, like those in Boo or Nemerle?
links
  • LLVM
    Low Level Virtual Machine
  • Phoenix
    Phoenix enables teaching and collaborative research in code generation, optimization, program analysis, binary transformation, and software correctness. It is used as a research platform by Microsoft Research and will be the universal compiler backend for upcoming Microsoft languages and development tools.
  • GNU C Compiler Internals/Architecture
  • Parsing Techniques - Second Edition