Extensible Parsing Systems

This page is a mirrored copy of an article originally posted on the LShift blog; see the archive index here.

Francis reminded me of the Packrat Parsing algorithm the other day, so I took another look at the Scheme implementation I’d thrown together a few months ago for my ThiNG prototype. Currently, it’s just the basic parsing combinators and support data-structures, but I’d like to make it into an extensible reader, just like for codename. I think I could do a better job this time, since the host language is Scheme and I wouldn’t have to write an interpreter for the macro language, since I’d already have Scheme’s eval available.

The current packrat-parser macro expands into a function representing the parser. To build a quality extensible parsing system, with support for debugging and introspection, I suspect I’d need to move away from representing parsers as functions, and start to move toward representing them in a more self-describing way, perhaps more as objects in the OO sense. For instance, when expanding:

(packrat-parser expr
                (expr ((a < - 'num '+ b <- 'num)
                       (+ a b))
                      ((a <- mulexp) a))
                (mulexp ((a <- 'num '* b <- 'num)
                         (* a b))))

the current implementation produces a parser built directly from packrat parsing combinators (hand-edited slightly for legibility):

(letrec ((expr (lambda (g1687)
                 (results->result g1687 'expr
                                  (lambda ()
                                    ((packrat-or (packrat-check-base 'num
                                                   (lambda (A)
                                                     (packrat-check-base '+
                                                       (lambda (dummy1)
                                                         (packrat-check-base 'num
                                                           (lambda (B)
                                                             (lambda (g1693)
                                                               (make-result (+ A B) g1693))))))))
                                                 (packrat-check mulexp
                                                   (lambda (g1696)
                                                     (lambda (g1697) (make-result g1696 g1697)))))
         (mulexp (lambda (g1698)
                   (results->result g1698 'mulexp
                                    (lambda ()
                                      ((packrat-check-base 'num
                                         (lambda (A)
                                           (packrat-check-base '*
                                             (lambda (dummy2)
                                               (packrat-check-base 'num
                                                 (lambda (B)
                                                   (lambda (g1704) (make-result (* A B) g1704))))))))

whereas an extensible parser would require some kind of reified representation of the composite parser, along the lines of

`((expr   (or (seq (num + num)   ,(lambda (a _ b) (+ a b)))
              (seq (mulexp)      ,(lambda (a) a))))
  (mulexp (or (seq (num * num)   ,(lambda (a _ b) (* a b))))))

which could then be interpreted, in order to parse some input, or stepped through, in order to debug a parser or parser extension, or used in various diagnostic printouts to help pinpoint errors in parsed text or parsing extension grammars.

These guys seem to be thinking along the same lines as me, inasmuch as they think packrat-parsing is a good fit for Scheme. I note that they specifically mention stateful packrat parsing as an area of interest; some kind of pseudo-stateful parsing is needed to allow for local grammar extensions (similar to Scheme’s let-syntax), and we’ve come up with a few ideas for solving the stateful-parsing problem that could be tried here.


On 26 May, 2008 at 2:30 am, Jon wrote:

Is the code available anywhere?

On 3 July, 2008 at 5:41 pm, tonyg wrote:

Hi Jon,

The precise code I was talking about is very project-specific, experimental, and now quite outdated, but is available here.

Much better, though, is the portable packrat parser library for Scheme I wrote a couple of weeks after this post, which is described here and available by following the instructions on that page.