[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [lojban] Questions on isolating utterances before completely parsing



On Thu, Oct 14, 2010 at 04:13:49PM -0700, symuyn wrote:
> I've got a hypothetical problem. It's pretty long, but please bear
> with me.
> 
> Let's say that, hypothetically, someone is creating a text editor for
> Lojban, one which shows the syntactical structure of whatever you've
> typed *while you type*. The text would be displayed somewhat like
> this:
> 
>    ‹mi ‹‹klama klama› ‹klama bo klama›››
> 
> Let's also imagine, hypothetically, that this person has made the
> editor pre-parse all whitespace/dot-separated chunks of text into the
> valsi that the chunks correspond to, identifying their selma'o and all
> that (e.g. "melo" → [<"me" in ME> <"lo" in LE>]). This is before
> checking the grammar of the text.
> 
> So this hypothetical text editor uses two parsers right now: a chunks-
> of-text-to-valsi parser and a sequence-of-valsi-to-textual-structures
> parser.
> 
> Let's also say that, hypothetically, in testing this text editor, that
> this person encountered a problem.
> 
> The hypothetical text editor becomes slower and slower when the text
> grows in size. This is because, unfortunately, the entire text has to
> be parsed whenever a new word is added or existing text is deleted.
> 
> What to do? The person hypothetically comes up with an idea! There
> could be a *third* parser between the already existing two parsers,
> one that converts sequences of valsi into unparsed utterances! The
> third parser would ignore everything except I, NIhO, LU, LIhU, TO,
> TOI, TUhE, and TUhU, using those selma'o to create a tree of unparsed
> utterances.
> 
> For instance, the third parser would convert the sequence of valsi [i
> cusku lu klama i klama li'u to mi cusku toi i cusku] into [[i cusku lu
> [[klama] [i klama]] li'u to [mi cusku] toi] [i cusku]].
> 
> Therefore, with this new parser, the hypothetical editor can keep
> track of what the boundaries of the utterance *currently being edited*
> is, and re-parse *only the current utterance* when it's edited.
> 
> But then, the person finds a problem with that solution! A fatal flaw:
> *LIhU, TOI, and TUhE are elidable*.
> 
> Because of that, it seems that it's impossible to isolate an utterance
> from the text it is in without parsing the whole text for complete
> grammar.
> 
> That's the end of the hypothetical situation. My questions are as
> following:
> 
> * Is it true that the fact that LIhU, TOI, and TUhE are elidable makes
> isolating an utterance impossible without completely parsing the text
> the utterance is in? (Just making sure.)
> * Should the person make the third parser anyway while making LIhU,
> TOI, and TUhE *required and non-elidable*?
> * Is there another practical solution for the editor?
> 
> Remember, the problem is that the hypothetical text editor is getting
> slow because otherwise it needs to parse the entire text for every
> edit.
> 

My first thought, in reading this, is that what you really want is
to have the state of the program (parse tree and token stream) saved
at the place right before the place the person makes the edit.

Then, after the edit is made, you restore the program to the state
it was right before it parsed the text just edited, and continue
parsing, picking up the newly edited text.  This is called a
continuation.[1]

Of course, you don't know where the user is going to be editing, so
you have to save multiple continuations during the parse, and restart
whichever one is closest and earlier than the edit, and throw away
the continuations that are now invalid because they occured later in
the text.

This doesn't have to be an excessive number of continuations.  You only
have to save as many continuations as you need to make your program fast
enough.  You should have some idea about how much text is "too much" for
performance--how much text makes your parser too slow.  So you can
create continuations every N bytes knowing you'll only have to parse M
bytes (where M<=N) after any edit.

This saves you the work of having three different parsers in
exchange for whatever overhead the continuation has.  That is going to
depend on other factors in your design, but you'd essentially be
trading memory for speed.

Edits early in the document are of course going to need to reparse
everything after the edit, but:

 * you'll parse the most recently changed text first, and you can
   provide feedback on this before the parse fully finishes.
 * Once you get to the next N boundary in your text (accounting for
   additions and deletions made, of course), you can compare the results
   from the current parse to the parse tree you got last time, and decide
   to stop once they start being identical.

I can't really imagine implementing this in any other way, but I'm
curious to read everyone else's ideas. 

1: http://en.wikipedia.org/wiki/Continuation

-Alan
-- 
.i ko djuno fi le do sevzi

-- 
You received this message because you are subscribed to the Google Groups "lojban" group.
To post to this group, send email to lojban@googlegroups.com.
To unsubscribe from this group, send email to lojban+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/lojban?hl=en.