
Re: [lojban] zoi bug in camxes?



On Mon, Jan 24, 2011 at 11:20:40PM -0300, Jorge Llambías wrote:
> On Mon, Jan 24, 2011 at 10:03 PM, Robin Lee Powell
> <rlpowell@digitalkingdom.org> wrote:
> >
> > zoi gy gyrate gy fails in camxes; that seems like a bug (in camxes)
> > to me.  It seems to me that the final zoi delimiter must have a
> > pause on both ends.  But I haven't read the relevant CLL bit in
> > quite some time; what does it say about that?
> 
> CLL: "The cmavo “zoi” (of selma'o ZOI) is a quotation mark for quoting
> non-Lojban text. Its syntax is “zoi X. text .X”, where X is a Lojban
> word (called the delimiting word) which is separated from the quoted
> text by pauses, and which is not found in the written text or spoken
> phoneme stream."
> 
> It doesn't say that the first X need be preceded by a pause, nor that
> the final X need be followed by a pause.
> 
> But even the pauses that CLL does mention aren't always needed. For
> example camxes probably approves of "zoidadida".
> 
> > Certainly for
> >
> >  zoi gy. gyrations .gy.
> >
> > to "work" but
> >
> >  zoi gy gyrate gy
> >
> > to "not work" is a bug in camxes by my standards; it needs to be one
> > or the other.
> 
> Why? From a Lojbanic perspective "gyrations" is a single word, while
> "gyrate" are three words, so there doesn't seem to be a reason (unless
> you know English, but the Lojban parser doesn't) to treat it as one.
> 

I might not be able to forgive you, xorxes, for making me download
and read the source code to the official parser.  Looking at it, I
a) think we can do better and b) think I better understand why the
CLL is confusingly worded.

In the technical description of the parser, the following statement
is made:

    a. If the Lojban word "zoi" (selma'o ZOI) is identified, take the
   following Lojban word (which should be end delimited with a pause for
   separation from the following non-Lojban text) as an opening delimiter.
   Treat all text following that delimiter, until that delimiter recurs
   *after a pause*, as grammatically a single token (labelled 'anything_699'
   in this grammar).  There is no need for processing within this text
   except as necessary to find the closing delimiter.

This seems pretty clear-cut to me, but it has almost nothing to do
with the implementation, which contradicts the opening example in
this thread in how it processes anything_699.

(BTW, I'm not clear as to whether a pause is both space and '.', or
whether it is only '.'.  Help?)

The implementation is contained in filter.c, in particular the
following lines:

        case ZOI_START_MODE:
                tok = lex();
                if (isEnd(tok)) return tok;
                tok->type = any_word_698;
                mode = ZOI_STRING_MODE;
                delim = tok;
                return tok;
        case ZOI_STRING_MODE:
                result = newtoken();
                result->type = anything_699;
                for (;;) {
                        tok = lex();
                        if (isEnd(tok)) return tok;
                        if (strcmp(tok->text, delim->text) == 0) break;
                        tok->type = -1;
                        add(result, tok);
                        }
                mode = ZOI_END_MODE;
                return result;
        case ZOI_END_MODE:
                /* note: token has already been read */
                tok->type = any_word_698;
                mode = NORMAL_MODE;
                return tok;

If you follow lex(), you find getword(), which is the low-level
tokenizer in the parser.  It reads strings delimited by ' ' or '.',
which means it considers "pano" a single token rather than the two
cmavo "pa" and "no".

As a result, it behaves much like camxes does with "gyrations", but
I believe it would differ from camxes in parsing "gyrate", which
at this level of processing it would insist on treating as a single
token rather than three Lojban words.

In no case does it go looking for the delimiter inside individual
tokens, a behavior which camxes matches.

The code has the effect of treating everything between the delimiter
words as a single token, but misses edge cases because of the way
the tokenizer works.
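
To make the token-level behaviour concrete, here is a minimal sketch of
my reading of the code above.  It is not the parser's source:
next_token() and zoi_quoted_tokens() are hypothetical names, and
next_token() stands in for getword() on the assumption that it does
nothing more than split on ' ' and '.'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for getword(): copy the next run of characters
   containing neither ' ' nor '.' into buf.  Returns a pointer just past
   that run, or NULL when the input is exhausted.  Tokens longer than
   the buffer are truncated. */
static const char *next_token(const char *p, char *buf, size_t buflen)
{
    size_t len, copy;

    assert(buflen > 0);
    p += strspn(p, " .");               /* skip separators */
    if (*p == '\0')
        return NULL;
    len = strcspn(p, " .");             /* length of this token */
    copy = len < buflen - 1 ? len : buflen - 1;
    memcpy(buf, p, copy);
    buf[copy] = '\0';
    return p + len;
}

/* Mimics the three ZOI modes: expect "zoi", take the next token as the
   delimiter, then gather whole tokens until one equals the delimiter.
   Returns the number of tokens gathered, or -1 if the text does not
   start with "zoi" or the quote is never closed. */
static int zoi_quoted_tokens(const char *text)
{
    char delim[64], tok[64];
    const char *p = text;
    int count = 0;

    p = next_token(p, tok, sizeof tok);
    if (p == NULL || strcmp(tok, "zoi") != 0)
        return -1;
    p = next_token(p, delim, sizeof delim);     /* ZOI_START_MODE */
    if (p == NULL)
        return -1;
    while ((p = next_token(p, tok, sizeof tok)) != NULL) {
        if (strcmp(tok, delim) == 0)            /* whole-token compare only */
            return count;
        count++;                                /* ZOI_STRING_MODE */
    }
    return -1;                                  /* hit end of input first */
}
```

Under those assumptions "zoi gy. gyrations .gy." and "zoi gy gyrate gy"
both come back with one quoted token, since the sketch cannot tell a
space from a '.'.  "zoidadida" is rejected outright because it arrives
as a single token, and a delimiter buried inside a longer token (the
"gy" in "xgy") never closes the quote.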

-Alan
