From lojbab@lojban.org Sat Sep 01 19:25:59 2001
Return-Path: <lojbab@lojban.org>
X-Sender: lojbab@lojban.org
X-Apparently-To: lojban@yahoogroups.com
Received: (EGP: mail-7_3_2); 2 Sep 2001 02:25:58 -0000
Received: (qmail 8832 invoked from network); 2 Sep 2001 02:25:58 -0000
Received: from unknown (10.1.10.27)
  by l9.egroups.com with QMQP; 2 Sep 2001 02:25:58 -0000
Received: from unknown (HELO stmpy-3.cais.net) (205.252.14.73)
  by mta2 with SMTP; 2 Sep 2001 02:25:56 -0000
Received: from user.lojban.org (224.dynamic.cais.com [207.226.56.224])
  by stmpy-3.cais.net (8.11.1/8.11.1) with ESMTP id f822Po847111
  for <lojban@yahoogroups.com>; Sat, 1 Sep 2001 22:25:50 -0400 (EDT)
Message-Id: <4.3.2.7.2.20010831185654.00bd77a0@pop.cais.com>
X-Sender: vir1036@pop.cais.com
X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
Date: Fri, 31 Aug 2001 20:49:15 -0400
To: lojban@yahoogroups.com
Subject: Re: [lojban] LALR1 question
In-Reply-To: <20010827135306.A31570@digitalkingdom.org>
References: <20010827163844.A919@twcny.rr.com>
  <Pine.GSO.4.33.0108271059530.15411-100000@ucsub.colorado.edu>
  <20010827163844.A919@twcny.rr.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
From: "Bob LeChevalier (lojbab)" <lojbab@lojban.org>

At 01:53 PM 8/27/01 -0700, Robin Lee Powell wrote:
>On Mon, Aug 27, 2001 at 04:38:44PM -0400, Rob Speer wrote:
> > On Mon, Aug 27, 2001 at 11:17:55AM -0600, Jay Kominek wrote:
> > > Hrm. A parser with back tracking or more look ahead (or basically any
> > > parser which can parse a larger class of languages) could take the same
> > > YACC grammar and be able to parse "le broda joi le brode" correctly... I
> > > think...
> >
> > So, what is the advantage to having Lojban interpreted by a parser
> > which can't backtrack or look ahead more than one word?
>
>All other parsers turn out to be _extremely_ inefficient, IIRC from my
>course on such things.
>
>So, basically, it's to make computers happy.

Some history which I think will answer some of the questions on this issue.

Originally, JCB thought to prove Loglan's unambiguity using the theories of 
a guy named Yngve. I'm not entirely sure what those theories are or how 
they were related to the problem. In 1977 or 1978, IIRC, a Loglanist, I 
believe it was Doug Landauer, proposed using YACC as a more formal method 
of proving the language unambiguous, and made a first cut at a machine 
grammar. It was quickly found that Loglan as it was then was nowhere near 
unambiguous - I think they were only able to come up with a working grammar 
for around 30% of the language.

Two or three Loglanists worked with JCB over the next few years to devise a 
machine grammar that would work. At the time, the machine grammar was 
considered something distinct from the human grammar of the language, and 
all that was needed was that it be able to parse a corpus of specially 
designed test sentences in the same way that the human grammar would. Even 
this proved difficult, and was not achieved until 1982. The major 
milestones were Jeff Prothero's idea to use YACC's error recovery system to 
handle elidable terminators (and Jeff was the first one to get a moderately 
complete machine grammar as a result, though it still had problems), and a 
six-month period during which Scott Layson lived with JCB to finish the last 
remaining problems. Part of the problem was getting access to suitable 
computers - this was the era of CP/M in home computers, and YACC ran on 
mainframes that JCB had no access to. So these various other people used 
university connections to get time on machines. Somewhere in here, Bob 
McIvor worked to convert YACC to run on a home computer - since he reads 
this forum, he may be able to fill us in on his role in all this.

There were two major problems with the Layson/JCB machine grammar of 
1982. First of all, it was known that the test corpus was incomplete - it 
covered those things that JCB thought were important, but did not cover 
everything that had been used with the language. Thus, parsing random 
Loglan texts often failed because of things that had not been in the test 
corpus. So JCB sought to slowly expand the corpus along with the machine 
grammar to describe that corpus.

The second problem was that the machine grammar did not really work. Large 
chunks of the language were hidden in C code routines as the 
"Preparser". Unlike current Lojban, there was NO formalization of the 
rules for the Preparser. Mostly it included identifying words, and then 
glomming together some known sequences as unparseable units that would be 
arbitrarily declared grammatical and flagged as such by invisible tokens 
called "machine lexemes". It also included treating collections of cmavo 
written as a single word without spaces as if they were a single 
word/grammatical unit. Thus the TLI Loglan equivalent of "lenu" was 
grammatically distinct from "le nu", and ANY string of cmavo starting with 
a member of PA-equivalent was considered a number, while any string of 
cmavo starting with a member of PU-equivalent (which then included all of 
the tense and modal words) was considered a "tense".
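To make that glomming behavior concrete, here is a minimal Python sketch; the word lists and token names below are invented for illustration (they are not TLI's actual word lists or lexeme names):

```python
# Hypothetical sketch of the TLI-era Preparser's glomming: any run of
# known cmavo starting with a PA member is collapsed into one opaque
# NUMBER unit, any run starting with a PU member into one TENSE unit.
# Word lists here are a tiny invented subset, purely for illustration.

PA = {"pa", "re", "ci"}   # digit-like cmavo (illustrative subset)
PU = {"pu", "ca", "ba"}   # tense/modal cmavo (illustrative subset)

def glom(tokens):
    out, i = [], 0
    while i < len(tokens):
        t = tokens[i]
        if t in PA or t in PU:
            j = i
            # absorb the entire following run of known cmavo
            while j < len(tokens) and tokens[j] in PA | PU:
                j += 1
            kind = "NUMBER" if t in PA else "TENSE"
            out.append((kind, "".join(tokens[i:j])))
            i = j
        else:
            out.append(("WORD", t))
            i += 1
    return out

print(glom(["pa", "re", "broda"]))  # -> [('NUMBER', 'pare'), ('WORD', 'broda')]
```

The point of the sketch is that the parser proper never sees inside the NUMBER or TENSE unit, which is exactly why nonsense strings could be "arbitrarily declared grammatical".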

It took very little effort for people to find strings outside the test 
corpus that the parser approved even though they were nonsense, or that it 
parsed incorrectly. There thus came a period of debate as to whether the 
"human grammar" or the "machine grammar" defined the language.

[2 paragraphs of context with no parser info follow, so feel free to skip 
them.]

Right about then is when the community splintered. Jim Carter proposed 
some extensions to the language which he had found useful in doing the 
first extensive set of translations using the language. JCB disliked 
almost all of these, and pc didn't like most of them, dubbing Carter's 
usage and formalisms as "Nalgol" 'because he got everything in Loglan 
backwards'. But Jim Carter persisted in advocating for his changes, and 
Bob Chassell as then-editor of Lognet published his advocacy. This and 
other things led JCB to feel a loss of control over the language, and he 
took back essentially dictatorial power over TLI and the language. Almost 
everyone else left the community in response.

I knew JCB personally in San Diego and was oblivious to most of the 
politics, so I stayed on, and eventually started working on the dictionary 
revision. But almost no one was doing anything, and my efforts bogged down 
too. Finally in 1986, I attempted to get some new people in the DC area 
involved, and made new efforts to get people going again. This became the 
Washington Loglan Users Group, which a year later became LLG after JCB and 
I split.

Before the split I contacted several old Loglanists, and got Scott Layson 
to send me the YACC grammar, parser and corpus, which he had converted for 
use on an MS-DOS PC. This was primarily so that I would have a reference 
standard in teaching the new people I had recruited the language, because 
to put it simply, I knew little more than they did. The split occurred 
when JCB accused Nora and me of copyright violation in distributing 
LogFlash with wordlists via Shareware on a BBS, and he seemed to think that 
I intended to freely distribute Layson's parser as well (which I wasn't, 
since we hadn't written it ourselves). Jeff Prothero then stepped in with 
his own effort, a backtracking parser based on his own version of the YACC 
grammar, which he claimed was in the public domain anyway since his 
original work on it had been done as a student on U of Washington 
computers, and he had never signed anything over to TLI. Prothero engaged 
in several stunts, including compressing the YACC grammar into an 
unreadable solid block of C code so that no one could practically compare 
his version with the TLI version of the grammar. This led to lawsuit 
threats and further heightened the sense that we needed a version of Loglan 
derived independently of JCB's copyright claims. That version is what 
became Lojban.

In June of 1987, with me having essentially no knowledge of YACC or machine 
parsing, Jeff Taylor and I started working on a new from-scratch Loglan 
grammar and parser. Jeff had done an SLR(1) parser for Loglan for his 
Master's work in computer science, and had the knowhow that I did 
not. Over the next several months, we built up a new grammar, buying a 
copy of Abraxas Software's PCYACC because all of the freeware versions of 
YACC were unable to hold a grammar as large as Loglan's. (Indeed, PCYACC 
was also unable to do so at first, and they eventually modified their 
program at our behest to make the lookahead table large enough to hold the 
then-current language.)

Thus I can answer the question of why we used YACC: JCB had established a 
YACC-based "machine grammar" as the standard for the language, YACC was the 
tool readily at hand for us to get our alternate Loglan standard in place 
quickly, and the volunteers I had at the time knew YACC parsing well.

Nora, Jeff and I disliked the "hidden grammar" of the Preparser, as well as 
the violation of audiovisual isomorphism that came from parsing "lenu" 
differently from "le nu", so the essential difference between what became 
the Lojban grammar and the TLI grammar was that our Preparser was only a 
lexer that identified individual words by their token-type or "lexeme" 
(which term became "selma'o" even though the word types of cmene and brivla 
existed). The human Loglan grammar was ALREADY known NOT to be LR(k) for 
any known value of k, because there were some potentially infinite 
recursively defined strings that could occur at the point where YACC had to 
use error logic to insert a terminator. We thus kept the "machine 
lexemes" that JCB had used to get by the infinite strings problem - these 
are the machine grammar rules numbered in the 800s. (In other words, we 
knew about the problems that Curnow encountered).
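As an illustration of why the lexer-only Preparser preserves audio-visual isomorphism, here is a hedged Python sketch. Real Lojban cmavo breakup is driven by word morphology (consonant/vowel shape), not a fixed word list, so the tiny list below is purely an assumption for the example:

```python
# Minimal sketch: split a compound cmavo cluster into its individual
# words, so that written "lenu" yields the same token stream as spoken
# "le nu". The word list is a tiny invented subset; the real rules
# work from morphological shape, not a lookup table.

CMAVO = {"le", "nu", "pu", "ba", "pa"}  # illustrative subset only

def split_cmavo(cluster):
    """Greedily peel known cmavo off the front of a written cluster."""
    words = []
    while cluster:
        for w in sorted(CMAVO, key=len, reverse=True):
            if cluster.startswith(w):
                words.append(w)
                cluster = cluster[len(w):]
                break
        else:
            raise ValueError("unrecognized cmavo cluster: " + cluster)
    return words

print(split_cmavo("lenu"))  # -> ['le', 'nu'], identical to "le nu"
```

Because the Preparser emits the same individual tokens either way, "lenu" and "le nu" cannot end up grammatically distinct, unlike in TLI Loglan.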

However, we insisted that the full grammar WITH inserted machine lexemes be 
verified unambiguous with YACC, and thus all of the lexer rules were fully 
spelled out in the machine grammar. These are the 900-series rules, and 
when we tested the grammar, all of those rules were tested as 
well. However, the language with the lexer rules included was so big that 
it blew up other YACC versions. So when John Cowan (who was using a 
different YACC version) went to develop what became the official parser, 
after Jeff Taylor no longer had time to work on the project, he chose to 
comment out the lexer rules and thus simplify the part of the language that 
YACC had to parse. This in effect resurrected the Preparser of TLI days, 
except that its innards were fully documented, and the written, 
uncommented-out machine grammar was declared to be the standard rather than 
the parser (which might have errors in its hard-coded implementation of the 
lexer rules). Note 
that there are a few "grammarless" tokens that were never implemented in 
uncommented rules, like SI/SA/SU, because they effectively generated 
null-grammar strings of up to infinite length. These were numbered in the 
1100s, and described in a textual algorithm specifying the order in which 
they should be processed. (Hopefully this answers And.)

At the same time, Prothero had developed a backtracking parser for the same 
grammar that did not rely on YACC. But since Cowan was finding minor 
things that needed revision in the YACC grammar as he implemented the 
parser, keeping the two in sync was more than we could manage. But for a 
brief time there was indeed a non-YACC parser for Lojban.

After Cowan finished the parser, but before we had baselined the grammar, I 
tracked down Doug Landauer, the original Loglan YACCer. Among the things 
we discussed one day was the idea of recasting the Lojban grammar as an 
LR(2) to LR(4) grammar, and he was willing to write the YACC-equivalent 
tools to 
support that effort. The advantage of LR(2) or LR(4) was that each level 
would eliminate a chunk of the lexer rules, especially the compound logical 
connectives that used NA and SE. Most of the other lexer rules used only 
selma'o that never appeared in any non-lexer rules, but NA and SE did 
(eventually, however, we had the FIhO/no-FIhO dichotomy which also broke 
this concept).

We did not carry out this project because
1) Doug never actually did what he offered to do
2) Recasting the grammar in a new form would take a lot of time from people 
like lojbab and Cowan who had too little, with only a theoretical benefit; 
there was no real doubt that the language was unambiguous because we were 
able to test the lexer rules in PCYACC, and generating a software version 
of those rules seemed to be fully in hand.
3) The grammar was essentially solid as it was, and the language was 
starting to see use, and there was no reason to risk it; I was well-tired 
of being a language "developer" and wanted a community of users.
4) The real bottom line was that YACC was universally available, understood 
by lots of people in the community, and hence we would not be dependent on 
one or two people who had the right software in the right version as well 
as more specialized knowhow in order to be able to support the 
grammar/parser development.

These arguments still remain. The technology apparently exists to do an 
LR(4) version of the grammar, but we've baselined, and proving equivalence 
is tougher than writing software. We similarly cannot formally prove that 
the E-BNF is identical to the YACC grammar, which is why the latter takes 
precedence over the former if ever a conflict is found.

To answer xod, my understanding is that a different grammar form would 
ideally have the identical net result, but have a significantly smaller 
number of rules because we could eliminate some of the kluges needed to get 
YACC to group things the correct way. So Jay's answer:
>If you were merely upping the amount of look ahead so that you could leave
>out terminators in a lot more cases, then it would probably make things
>easier to remember. (Humans can stick missing terminators back in rather
>handily, in many cases.)
>
>(The last 3 paragraphs were pulled almost entirely out of my ass. Should
>they happen to be correct, then I'll be impressed. See, however, the note
>about asking a linguist.)

happens to be exactly correct in describing my thinking. We would not be 
upping the lookahead except to make the machine grammar match the human 
grammar better, and thus at most eliminate some kluges. There was never 
any intent to change the human language, only the way that we described it 
for the machine so that it could be proven unambiguous.

Hopefully I've dealt with some of the questions usefully.
--
lojbab lojbab@lojban.org
Bob LeChevalier, President, The Logical Language Group, Inc.
2904 Beau Lane, Fairfax VA 22031-1303 USA 703-385-0273
Artificial language Loglan/Lojban: http://www.lojban.org


