From cowan Sat Mar 6 22:45:29 2010
Subject: The Lojban parser (was: LR(k) Lojban Grammar)
To: lojban@cuvmb.cc.columbia.edu (Lojban List)
From: cowan
Date: Mon, 4 Dec 1995 11:30:15 -0500 (EST)
In-Reply-To: <199512041239.HAA05204@locke.ccil.org> from "Paulo Barreto" at Dec 4, 95 07:27:00 am
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Content-Length: 2030
Status: OR
X-From-Space-Date: Mon Dec 4 11:30:15 1995
X-From-Space-Address: cowan
Message-ID:

> I wish to point out en passant that there *is* at least one
> context-sensitive construct in Lojban, one that is strictly syntactical
> in nature: the ZOI quote, that requires the chosen opening delimiter to
> be identical to the closing one. This is of course dealt with very
> simply by the preparser. (.oisai I wish it wasn't necessary...)

Since the subject of the preparser has come up, and since I'm the one
that understands the thing, I probably ought to outline what it does.
It is a "phased" implementation built as a series of nested routines
each of which calls the previous phase:

1. Isolate words from the input stream: folds case, maps digits to
   digit cmavo, removes punctuation. Whitespace and "." are word
   separators.

2. Break up compound cmavo; recognize cmene. Anything not a cmavo or a
   cmene is assumed to be a brivla (no check is made for cmavo clinging
   to the front of a brivla, or for brivla validity).

3. Process constructs involving "zo", "zoi", "la'o", "lo'u"/"le'u",
   changing the quoted words to instances of "any_word", "any_words",
   or "anything".

4. Assign selma'o to cmavo.

5. Append a "fa'o" to a text that doesn't end in one.

6. Glue together strings of words involving "zei" into
   brivla-equivalents.

7. Absorb BAhE into the next following word.

8. Absorb BU into the preceding word, changing effective selma'o to BY.

9. Absorb UI, CAI, Y, DAhO, FUhE, FUhO, UINAI, and CAINAI into previous
   word.

10. Attempt to compound the "lexer_?" rules.
This is done by a simpleminded backtracking recursive-descent parser,
built by hand using some C macros on the basis of the grammar rules.
The backtracking is very crude: any failure causes the whole structure
to be decomposed into its tokens and another rule is attempted. In the
end, each compound is reduced to a single token in the "lexer_?_7??"
series, which is what the YACC parser actually sees. Anything that
can't be compounded is passed directly to YACC.

-- 
John Cowan		cowan@ccil.org
		e'osai ko sarji la lojban.
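
As a rough illustration of the phased design outlined in the numbered list (this is not the actual C preparser), here is a minimal Python sketch of four of the phases: 1 (isolate words), 4 (assign selma'o), 5 (append "fa'o"), and 8 (absorb BU). Everything named here is an assumption made for the example: the function names, the `Token` type, and the tiny `TOY_SELMAHO` table are hypothetical, generators stand in for the nested routines that each call the previous phase, and treating every character other than letters, "'" and "," as punctuation is a guess at what "removes punctuation" means.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Optional

# Digit-to-cmavo mapping used by phase 1.
DIGIT_CMAVO = {"0": "no", "1": "pa", "2": "re", "3": "ci", "4": "vo",
               "5": "mu", "6": "xa", "7": "ze", "8": "bi", "9": "so"}

# Hypothetical, deliberately tiny selma'o table; the real one is far larger.
TOY_SELMAHO = {"li": "LI", "re": "PA", "a": "A", "bu": "BU", "fa'o": "FAhO"}


class Token(NamedTuple):
    text: str
    selmaho: str  # "?" when the toy table has no entry


def isolate_words(text: str) -> Iterator[str]:
    """Phase 1: fold case, split on whitespace and ".", map digits, drop punctuation."""
    for chunk in re.split(r"[\s.]+", text.lower()):
        letters: List[str] = []
        for ch in chunk:
            if ch in DIGIT_CMAVO:
                if letters:  # flush any word collected before the digit
                    yield "".join(letters)
                    letters = []
                yield DIGIT_CMAVO[ch]
            elif ch.isalpha() or ch in "',":
                letters.append(ch)
            # any other character is treated as punctuation and dropped
        if letters:
            yield "".join(letters)


def assign_selmaho(words: Iterable[str]) -> Iterator[Token]:
    """Phase 4: look each word up in the (toy) selma'o table."""
    for word in words:
        yield Token(word, TOY_SELMAHO.get(word, "?"))


def append_faho(tokens: Iterable[Token]) -> Iterator[Token]:
    """Phase 5: make sure the token stream ends in "fa'o"."""
    last: Optional[Token] = None
    for last in tokens:
        yield last
    if last is None or last.selmaho != "FAhO":
        yield Token("fa'o", "FAhO")


def absorb_bu(tokens: Iterable[Token]) -> Iterator[Token]:
    """Phase 8: merge each BU into the preceding token, which becomes BY."""
    pending: Optional[Token] = None
    for tok in tokens:
        if tok.selmaho == "BU" and pending is not None:
            pending = Token(pending.text + " " + tok.text, "BY")
        else:
            if pending is not None:
                yield pending
            pending = tok
    if pending is not None:
        yield pending


def preparse(text: str) -> List[Token]:
    """Chain the phases; each one pulls its input from the previous phase."""
    return list(absorb_bu(append_faho(assign_selmaho(isolate_words(text)))))
```

With these assumptions, `preparse("Li 2 .a bu!")` produces tokens whose texts are "li", "re", "a bu", and "fa'o", with the third carrying selma'o BY. Holding one token back in `absorb_bu` is what lets a later phase rewrite a word that an earlier phase has already handed over.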