Re: [lojban] LALR1 question



On Friday, August 31, 2001, at 08:49 PM, Bob LeChevalier (lojbab) wrote:

Originally, JCB thought to prove Loglan's unambiguity using the theories of
a guy named Yngve. I'm not entirely sure what those theories are or how
they were related to the problem. In 1977 or 1978, IIRC, a Loglanist, I
believe it was Doug Landauer, proposed using YACC as a more formal method
of proving the language unambiguous, and made a first cut at a machine
grammar. It was quickly found that Loglan as it was then was nowhere near
unambiguous - I think they were only able to come up with a working grammar
for around 30% of the language.

Two or three Loglanists worked with JCB over the next few years to devise a
machine grammar that would work. At the time, the machine grammar was
considered something distinct from the human grammar of the language, and
all that was needed was that it be able to parse a corpus of specially
designed test sentences in the same way that the human grammar would. Even
this proved difficult, and was not achieved until 1982. The major
milestones were Jeff Prothero's idea to use YACC's error recovery system to
handle elidable terminators (and Jeff was the first one to get a moderately
complete machine grammar as a result, though it still had problems), and a
6 months period during which Scott Layson lived with JCB to finish the last
remaining problems. Part of the problem was getting access to suitable
computers - this was the era of CP/M in home computers, and YACC ran on
mainframes that JCB had no access to. So these various other people used
university connections to get time on machines. Somewhere in here, Bob
McIvor worked to convert YACC to run on a home computer - since he reads
this forum, he may be able to fill us in on his role in all this.

This is essentially correct. YACC on a CP/M computer required about 45 minutes
per pass. (My present Mac, which is less than half the speed of current Macs, does a
much larger grammar in less than 1 second.)

There were two major problems with the Layson/JCB machine grammar of
1982. First of all, it was known that the test corpus was incomplete - it
covered those things that JCB thought were important, but did not cover
everything that had been used with the language. Thus, parsing random
Loglan tests often failed because of things that had not been in the test
corpus. So JCB sought to slowly expand the corpus along with the machine
grammar to describe that corpus.

The second problem was that the machine grammar did not really work. Large
chunks of the language were hidden in C code routines as the
"Preparser". Unlike current Lojban, there was NO formalization of the
rules for the Preparser. Mostly it included identifying words, and then
glomming together some known sequences as unparseable units that would be
arbitrarily declared grammatical and flagged as such by invisible tokens
called "machine lexemes". It also included treating collections of cmavo
written as a single word without spaces as if it were a single
word/grammatical unit. Thus the TLI Loglan equivalent of "lenu" was
grammatically distinct from "le nu", and ANY string of cmavo starting with
a member of PA-equivalent was considered a number, while any string of
cmavo starting with a member of PU-equivalent (which then included all of
the tense and modal words) was considered a "tense".
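
The compound-word behaviour described above can be pictured with a small
sketch. It is not the Preparser's actual code, which was never formalized.
The word lists are illustrative assumptions that borrow Lojban selma'o
members in place of the TLI Loglan classes, and the sketch assumes every
cmavo is two letters long. The function names are hypothetical.

```python
# Illustrative word lists only: Lojban selma'o members stand in for the
# TLI Loglan "PA-equivalent" and "PU-equivalent" classes (assumption).
PA = {"pa", "re", "ci", "vo", "mu"}  # number words
PU = {"pu", "ca", "ba"}              # tense words


def first_cmavo(word):
    """Return the leading cmavo, assuming every cmavo here is two letters."""
    return word[:2]


def classify(word):
    """Classify a written word by its first cmavo alone."""
    head = first_cmavo(word)
    if head in PA:
        return "NUMBER"
    if head in PU:
        return "TENSE"
    return "OTHER"


def preparse(text):
    """Treat each space-delimited written word as one grammatical unit."""
    return [(word, classify(word)) for word in text.split()]
```

Under these assumptions "pupa" is classified as a tense even though a number
word follows the tense word, because only the first cmavo is inspected.
Likewise preparse("lenu") yields one unit while preparse("le nu") yields two,
which is the written/spoken mismatch at issue.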

Although JCB insisted on calling it a preparser, it was never much more
than a lexer. When I took over the grammar, one of the first changes I made
was to eliminate the 'machine lexemes' by the equivalent of subscripting lexemes,
e.g. NO1 and NO2 for different negations. Another change was to allow a speaker
to pause virtually anywhere (except in the middle of a word) and still get a parse.
There is a slight ambiguity here: a pause between le and po (your "le nu") would
cause the parser to parse differently than an unpaused lepo. Pausing anywhere was
handled by rescanning whenever a pause did not make sense and eliminating it. The next
version concatenated all cmavo before lexing, and used a finite state grammar
to lex the concatenation. It is possible this finite state grammar could be converted
to a YACC grammar, but I have not attempted it yet. The lexer produces the correct
subscripted lexemes for input to the conflict-free YACC grammar. There is an
implementation in progress which will take a written Loglan sentence and break it down
into stress-marked syllables, and/or reconstruct a correctly punctuated written Loglan
sentence from a written string of stress-marked syllables (with required pauses marked).
As before, excess pauses are eliminated. In this version the two meanings of lepo can
be distinguished by stress: LEpo and lePO. By using stress, many of the unnatural
but necessary pauses can be eliminated. The syllable string is then submitted to the parser.
This latter phase is incomplete, largely because I haven't had the time to devote to
it. The remaining problems mainly have to do with proper recognition of acronyms and
'strong-quoted' non-Loglan words.
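
The rescanning idea mentioned above (drop a pause that does not make sense,
then scan again) might look roughly like the following. This is only a
sketch, not the actual lexer or parser. The function first_error is a toy
stand-in for a real parser, and its single rule, that a pause is meaningful
only between le and po, is an assumption made for the example. The words mi
and cluva are arbitrary filler.

```python
PAUSE = ","  # a written pause


def first_error(tokens):
    """Toy stand-in for a parser: index of the first rejected token, or None.

    Assumption for illustration: a pause is meaningful only between
    "le" and "po"; every other token is accepted.
    """
    for i, tok in enumerate(tokens):
        if tok == PAUSE:
            before = tokens[i - 1] if i > 0 else None
            after = tokens[i + 1] if i + 1 < len(tokens) else None
            if not (before == "le" and after == "po"):
                return i
    return None


def parse_dropping_excess_pauses(tokens):
    """Rescan after deleting each pause that the parser rejects."""
    tokens = list(tokens)
    while True:
        err = first_error(tokens)
        if err is None:
            return tokens  # every remaining pause is meaningful
        if tokens[err] != PAUSE:
            # A real parser can also reject non-pause tokens.
            raise ValueError("not parseable even without pauses")
        del tokens[err]  # eliminate the excess pause and rescan
```

With this toy rule, the input le , po mi , cluva keeps the pause between le
and po and loses the one before cluva.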

It took very little for people to find grammatical strings that the parser
approved which were nonsense or parsed incorrectly, but which were not part
of the test corpus. There thus came a period of debate as to whether the
"human grammar" or the "machine grammar" defined the language.

[2 paragraphs of context with no parser info follow, so feel free to skip
them.]

Right about then is when the community splintered. Jim Carter proposed
some extensions to the language which he had found useful in doing the
first extensive set of translations using the language. JCB disliked
almost all of these, and pc didn't like most of them, dubbing Carter's
usage and formalisms as "Nalgol" 'because he got everything in Loglan
backwards'. But Jim Carter persisted in advocating for his changes, and
Bob Chassell as then-editor of Lognet published his advocacy. This and
other things led JCB to feel a loss of control over the language, and he
took back essentially dictatorial power over TLI and the language. Almost
everyone else left the community in response.

I knew JCB personally in San Diego and was oblivious to most of the
politics, so I stayed on, and eventually started working on the dictionary
revision. But almost no one was doing anything, and my efforts bogged down
too.

That was largely because the people who had been doing something were the ones
mentioned above, whom JCB had ordered to have nothing further to do with Loglan for
at least one year.

Finally in 1986, I attempted to get some new people in the DC area
involved, and made new efforts to get people going again. This became the
Washington Loglan Users Group, which a year later became LLG after JCB and
I split.

Before the split I contacted several old Loglanists, and got Scott Layson
to send me the YACC grammar, parser and corpus, which he had converted for
use on an MS-DOS PC. This was primarily so that I would have a reference
standard in teaching the new people I had recruited the language, because
to put it simply, I knew little more than they did. The split occurred
when JCB accused Nora and me of copyright violation in distributing
LogFlash with wordlists via Shareware on a BBS, and he seemed to think that
I intended to freely distribute Layson's parser as well (which I wasn't,
since we hadn't written it ourselves). Jeff Prothero then stepped in with
his own effort, a backtracking parser based on his own version of the YACC
grammar, which he claimed was in the public domain anyway since his
original work on it had been done as a student on U of Washington
computers, and he had never signed anything over to TLI. Prothero engaged
in several stunts, including compressing the YACC grammar into an
unreadable solid block of C code so that no one could practically compare
his version with the TLI version of the grammar. This led to lawsuit
threats and further heightened the sense that we needed a version of Loglan
derived independently of JCB's copyright-claims. That version is what
became Lojban.

In June of 1987, with me having essentially no knowledge of YACC or machine
parsing, Jeff Taylor and I started working on a new from-scratch Loglan
grammar and parser. Jeff had done an SLR(1) parser for Loglan for his
Master's work in computer science, and had the knowhow that I did
not. Over the next several months, we built up a new grammar, buying a
copy of Abraxas Software's PCYACC because all of the freeware versions of
YACC were unable to hold a grammar as large as Loglan's. (Indeed, PCYACC was
also unable to do so, and they eventually modified their program at our
behest to make the lookahead table large enough to hold the then-language.)

Thus I can answer the question of why we used YACC: JCB had established a
YACC-based "machine grammar" as the standard for the language, YACC was the
tool readily at hand for us to get our alternate Loglan standard in place
quickly, and the volunteers I had at the time knew YACC parsing well.

Nora, Jeff and I disliked the "hidden grammar" of the Preparser, as well as
the violation of audiovisual isomorphism that came from parsing "lenu"
differently from "le nu",

As I indicated above, Loglanists are currently required to pause between le and pa
to give the two-word meaning, and the pause must be written (with a comma). The
pauseless one-word form is the commonest occurrence. Consecutive cmavo which
are parsed as a single lexeme may be written separately or combined, but will appear
as combined in the parse output.

I believe JCB learned his lesson from the split, and afterwards accepted open
discussion and criticism and never again attempted to impose his will in the fashion
described by Lojbab. Loglan has remained an open language, although changes now
are rare and mainly extensions, rather than changes to preexisting structures.

Sincerely,

Robert A McIvor