Re: [lojban] LALR1 question
At 01:53 PM 8/27/01 -0700, Robin Lee Powell wrote:
On Mon, Aug 27, 2001 at 04:38:44PM -0400, Rob Speer wrote:
> On Mon, Aug 27, 2001 at 11:17:55AM -0600, Jay Kominek wrote:
> > Hrm. A parser with back tracking or more look ahead (or basically any
> > parser which can parse a larger class of languages) could take the same
> > YACC grammar and be able to parse "le broda joi le brode" correctly... I
> > think...
>
> So, what is the advantage to having Lojban interpreted by a parser which
> can't backtrack or look ahead more than one word?
All other parsers turn out to be _extremely_ inefficient, IIRC from my
course on such things.
So, basically, it's to make computers happy.
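A toy illustration of the efficiency point, for the curious (the grammar here
is invented and has nothing to do with Lojban's; it is only meant to show the
general phenomenon of backtracking blowing up):

    import functools

    # Toy ambiguous grammar:  S -> S S | 'a'
    # A naive backtracking recognizer tries every split point and re-derives
    # the same substrings over and over, so a failing input of n symbols
    # costs roughly 2^n work.  An LALR(1) parser never backtracks at all,
    # which is where its linear-time guarantee comes from.

    def derives_naive(s):
        if s == "a":
            return True
        return any(derives_naive(s[:i]) and derives_naive(s[i:])
                   for i in range(1, len(s)))

    @functools.lru_cache(maxsize=None)
    def derives_memo(s):
        # Same logic with memoization (CYK-flavored): polynomial rather than
        # exponential, but still more work per token than a deterministic
        # table-driven parse.
        if s == "a":
            return True
        return any(derives_memo(s[:i]) and derives_memo(s[i:])
                   for i in range(1, len(s)))

    # derives_naive("a" * 22 + "b") grinds for a noticeable while before
    # answering False; derives_memo("a" * 150 + "b") answers immediately.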
Some history which I think will answer some of the questions on this issue.
Originally, JCB thought to prove Loglan's unambiguity using the theories of
a guy named Yngve. I'm not entirely sure what those theories are or how
they were related to the problem. In 1977 or 1978, IIRC, a Loglanist, I
believe it was Doug Landauer, proposed using YACC as a more formal method
of proving the language unambiguous, and made a first cut at a machine
grammar. It was quickly found that Loglan as it was then was nowhere near
unambiguous - I think they were only able to come up with a working grammar
for around 30% of the language.
Two or three Loglanists worked with JCB over the next few years to devise a
machine grammar that would work. At the time, the machine grammar was
considered something distinct from the human grammar of the language, and
all that was needed was that it be able to parse a corpus of specially
designed test sentences in the same way that the human grammar would. Even
this proved difficult, and was not achieved until 1982. The major
milestones were Jeff Prothero's idea to use YACC's error recovery system to
handle elidable terminators (and Jeff was the first one to get a moderately
complete machine grammar as a result, though it still had problems), and a
six-month period during which Scott Layson lived with JCB to finish the last
remaining problems. Part of the problem was getting access to suitable
computers - this was the era of CP/M in home computers, and YACC ran on
mainframes that JCB had no access to. So these various other people used
university connections to get time on machines. Somewhere in here, Bob
McIvor worked to convert YACC to run on a home computer - since he reads
this forum, he may be able to fill us in on his role in all this.
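For those who haven't seen the error-recovery trick, here is the general
flavor in a very rough sketch - Python rather than YACC/C, with a made-up
two-rule toy grammar, and not the actual mechanism of either the TLI or the
LLG parser: run a strict parser that demands every terminator, and when it
fails, try re-parsing with an elidable terminator inserted at the failure
point.

    BRIVLA = {"broda", "brode", "nanmu", "klama"}   # toy word list
    GRAMMAR = ["le", "BRIVLA", "ku", "BRIVLA"]      # toy rule: description + selbri

    def parse_strict(tokens):
        """Match the toy rule exactly; return None on success,
        or the index where parsing first fails."""
        for i, want in enumerate(GRAMMAR):
            if i >= len(tokens):
                return i
            ok = tokens[i] in BRIVLA if want == "BRIVLA" else tokens[i] == want
            if not ok:
                return i
        return None if len(tokens) == len(GRAMMAR) else len(GRAMMAR)

    def parse_with_recovery(tokens, elidable=("ku",)):
        """On failure, try inserting an elidable terminator at the failure
        point; keep going as long as each insertion makes progress."""
        tokens = list(tokens)
        while True:
            fail = parse_strict(tokens)
            if fail is None:
                return tokens                      # fully terminated form
            for term in elidable:
                candidate = tokens[:fail] + [term] + tokens[fail:]
                new_fail = parse_strict(candidate)
                if new_fail is None or new_fail > fail:
                    tokens = candidate             # insertion helped; retry
                    break
            else:
                raise ValueError("cannot recover at position %d" % fail)

    # parse_with_recovery("le nanmu klama".split())
    #   -> ['le', 'nanmu', 'ku', 'klama']   (the elided "ku" is re-inserted)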
There were two major problems with the Layson/JCB machine grammar of
1982. First of all, it was known that the test corpus was incomplete - it
covered those things that JCB thought were important, but did not cover
everything that had been used with the language. Thus, parsing random
Loglan texts often failed because of things that had not been in the test
corpus. So JCB sought to slowly expand the corpus along with the machine
grammar to describe that corpus.
The second problem was that the machine grammar did not really work. Large
chunks of the language were hidden in C code routines as the
"Preparser". Unlike current Lojban, there was NO formalization of the
rules for the Preparser. Mostly it included identifying words, and then
glomming together some known sequences as unparseable units that would be
arbitrarily declared grammatical and flagged as such by invisible tokens
called "machine lexemes". It also included treating collections of cmavo
written as a single word without spaces as if they were a single
word/grammatical unit. Thus the TLI Loglan equivalent of "lenu" was
grammatically distinct from "le nu", and ANY string of cmavo starting with
a member of PA-equivalent was considered a number, while any string of
cmavo starting with a member of PU-equivalent (which then included all of
the tense and modal words) was considered a "tense".
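To make the glomming concrete, here is a rough sketch in Python of just the
number case (the word list and selma'o assignments are a tiny invented
subset; TLI's actual Preparser was C code with many more such cases, the
tenses and "lenu"-style compounds among them):

    SELMAHO = {                       # hypothetical mini-lexicon: word -> selma'o
        "pa": "PA", "re": "PA", "ci": "PA", "no": "PA",
        "pu": "PU", "ca": "PU", "ba": "PU",
        "le": "LE", "nu": "NU", "ku": "KU",
    }

    def preparse(words):
        # Any run of cmavo that starts with a PA word is fused into a single
        # NUMBER token (playing the role of a "machine lexeme") before the
        # grammar proper ever sees it.
        out, i = [], 0
        while i < len(words):
            w = words[i]
            if SELMAHO.get(w) == "PA":
                j = i
                while j < len(words) and words[j] in SELMAHO:
                    j += 1                               # swallow the whole cmavo run
                out.append(("NUMBER", "".join(words[i:j])))
                i = j
            else:
                out.append((SELMAHO.get(w, "BRIVLA"), w))
                i += 1
        return out

    # preparse("le nu pa re ci broda".split())
    #   -> [('LE','le'), ('NU','nu'), ('NUMBER','pareci'), ('BRIVLA','broda')]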
It took very little effort for people to find strings outside the test corpus
that the parser approved even though they were nonsense, or that it parsed
incorrectly. There thus came a period of debate as to whether the
"human grammar" or the "machine grammar" defined the language.
[2 paragraphs of context with no parser info follow, so feel free to skip
them.]
Right about then is when the community splintered. Jim Carter proposed
some extensions to the language which he had found useful in doing the
first extensive set of translations using the language. JCB disliked
almost all of these, and pc didn't like most of them, dubbing Carter's
usage and formalisms as "Nalgol" 'because he got everything in Loglan
backwards'. But Jim Carter persisted in advocating for his changes, and
Bob Chassell as then-editor of Lognet published his advocacy. This and
other things led JCB to feel a loss of control over the language, and he
took back essentially dictatorial power over TLI and the language. Almost
everyone else left the community in response.
I knew JCB personally in San Diego and was oblivious to most of the
politics, so I stayed on, and eventually started working on the dictionary
revision. But almost no one was doing anything, and my efforts bogged down
too. Finally in 1986, I attempted to get some new people in the DC area
involved, and made new efforts to get people going again. This became the
Washington Loglan Users Group, which a year later became LLG after JCB and
I split.
Before the split I contacted several old Loglanists, and got Scott Layson
to send me the YACC grammar, parser and corpus, which he had converted for
use on an MS-DOS PC. This was primarily so that I would have a reference
standard in teaching the language to the new people I had recruited, because,
to put it simply, I knew little more than they did. The split occurred
when JCB accused Nora and me of copyright violation in distributing
LogFlash with wordlists via Shareware on a BBS, and he seemed to think that
I intended to freely distribute Layson's parser as well (which I did not
intend to do, since we hadn't written it ourselves). Jeff Prothero then stepped in with
his own effort, a backtracking parser based on his own version of the YACC
grammar, which he claimed was in the public domain anyway since his
original work on it had been done as a student on U of Washington
computers, and he had never signed anything over to TLI. Prothero engaged
in several stunts, including compressing the YACC grammar into an
unreadable solid block of C code so that no one could practically compare
his version with the TLI version of the grammar. This led to lawsuit
threats and further heightened the sense that we needed a version of Loglan
derived independently of JCB's copyright-claims. That version is what
became Lojban.
In June of 1987, with me having essentially no knowledge of YACC or machine
parsing, Jeff Taylor and I started working on a new from-scratch Loglan
grammar and parser. Jeff had done an SLR(1) parser for Loglan for his
Master's work in computer science, and had the knowhow that I did
not. Over the next several months, we built up a new grammar, buying a
copy of Abraxas Software's PCYACC because all of the freeware versions of
YACC were unable to hold a grammar as large as Loglan's. (Indeed, PCYACC was
also unable to do so, and Abraxas eventually modified their program at our
behest to make the lookahead table large enough to hold the then-language.)
Thus I can answer the question of why we used YACC: JCB had established a
YACC-based "machine grammar" as the standard for the language, YACC was the
tool readily at hand for us to get our alternate Loglan standard in place
quickly, and the volunteers I had at the time knew YACC parsing well.
Nora, Jeff and I disliked the "hidden grammar" of the Preparser, as well as
the violation of audiovisual isomorphism that came from parsing "lenu"
differently from "le nu", so the essential difference between what became
the Lojban grammar and the TLI grammar was that our Preparser was a lexer
only that identified individual words by their token-type or "lexeme"
(which term became "selma'o" even though the word types of cmene and brivla
existed). The human Loglan grammar was ALREADY known NOT to be LR(k) for
any known value of k, because there were some potentially infinite
recursively defined strings that could occur at the point where YACC had to
use error logic to insert a terminator. We thus kept the "machine
lexemes" that JCB had used to get by the infinite strings problem - these
are the machine grammar rules numbered in the 800s. (In other words, we
knew about the problems that Curnow encountered).
However, we insisted that the full grammar WITH inserted machine lexemes be
verified unambiguous with YACC, and thus all of the lexer rules were fully
spelled out in the machine grammar. These are the 900-series rules, and
when we tested the grammar, all of those rules were tested as
well. However, the grammar with the lexer rules included was so big that it
blew up other YACC versions. So when John Cowan went to develop what became
the official parser after Jeff Taylor no longer had time to work on the
project (John was using a different YACC version), he chose to comment out
the lexer rules and thus simplify the part of the language that YACC had to
parse. This in effect resurrected the Preparser of TLI days, except that its
innards were fully documented, and the written, uncommented-out machine
grammar was declared to be the standard rather than the parser (which might
have errors in its hard-coded implementation of the lexer rules). Note
that there are a few "grammarless" tokens that were never implemented in
uncommented rules, like SI/SA/SU, because they effectively generated
null-grammar strings of up to infinite length. These were numbered in the
1100s, with a textual algorithm describing the order in which they should be
processed. (Hopefully this answers And.)
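To give a feel for why SI/SA/SU cannot sit in a YACC rule, here is a sketch
of just the SI and SU part of such an erasure pass (this is NOT the official
textual algorithm, and SA, which erases back to the start of a construct,
needs more machinery than shown here):

    def erase(words):
        # SI erases the single word before it; SU erases everything so far.
        # Since SIs can stack up, the erased material can be arbitrarily long
        # and contributes nothing to the parse -- hence a textual pass rather
        # than a grammar rule.
        out = []
        for w in words:
            if w == "si":
                if out:
                    out.pop()
            elif w == "su":
                out.clear()
            else:
                out.append(w)
        return out

    # erase("mi klama le zarci si si le zdani".split())
    #   -> ['mi', 'klama', 'le', 'zdani']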
At the same time, Prothero had developed a backtracking parser for the same
grammar that did not rely on YACC. But since Cowan was finding minor
things that needed revision in the YACC grammar as he implemented the
parser, keeping the two in sync was more than we could manage. But for a
brief time there was indeed a non-YACC parser for Lojban.
After Cowan finished the parser, but before we had baselined the grammar, I
tracked down Doug Landauer, the original Loglan YACCer. Among the things
we discussed one day was the idea of recasting the Lojban grammar as an
LR(2) to LR(4) version, and he was willing to write the YACC-equivalent tools
to support that effort. The advantage of LR(2) or LR(4) was that each level
would eliminate a chunk of the lexer rules, especially the compound logical
connectives that used NA and SE. Most of the other lexer rules used only
selma'o that never appeared in any non-lexer rules, but NA and SE did
(eventually, however, we had the FIhO/no-FIhO dichotomy which also broke
this concept).
We did not carry out this project because
1) Doug never actually did what he offered to do
2) Recasting the grammar in a new form would take a lot of time from people
like lojbab and Cowan, who had too little of it, with only a theoretical benefit;
there was no real doubt that the language was unambiguous because we were
able to test the lexer rules in PCYACC, and generating a software version
of those rules seemed to be fully in hand.
3) The grammar was essentially solid as it was, and the language was
starting to see use, and there was no reason to risk it; I was well-tired
of being a language "developer" and wanted a community of users.
4) The real bottom line was that YACC was universally available, understood
by lots of people in the community, and hence we would not be dependent on
one or two people who had the right software in the right version as well
as more specialized knowhow in order to be able to support the
grammar/parser development.
These arguments still remain. The technology apparently exists to do an
LR(4) version of the grammar, but we've baselined, and proving equivalence
is tougher than writing software. We similarly cannot formally prove that
the E-BNF is identical to the YACC grammar, which is why the latter takes
precedence over the former if ever a conflict is found.
To answer xod, my understanding is that a different grammar form would
ideally have the identical net result, but have a significantly smaller
number of rules because we could eliminate some of the kluges needed to get
YACC to group things the correct way. So Jay's answer:
If you were merely upping the amount of look ahead so that you could leave
out terminators in a lot more cases, then it would probably make things
easier to remember. (Humans can stick missing terminators back in rather
handily, in many cases.)
(The last 3 paragraphs were pulled almost entirely out of my ass. Should
they happen to be correct, then I'll be impressed. See, however, the note
about asking a linguist.)
happens to be exactly correct in describing my thinking. We would not be
upping the lookahead except to make the machine grammar match the human
grammar better, and thus at most eliminate some kluges. There was never
any intent to change the human language, only the way that we described it
for the machine so that it could be proven unambiguous.
Hopefully I've dealt with some of the questions usefully.
--
lojbab lojbab@lojban.org
Bob LeChevalier, President, The Logical Language Group, Inc.
2904 Beau Lane, Fairfax VA 22031-1303 USA 703-385-0273
Artificial language Loglan/Lojban: http://www.lojban.org