From lojban-out@lojban.org Wed Aug 16 05:39:49 2006
Return-Path: <lojban-out@lojban.org>
X-Sender: lojban-out@lojban.org
X-Apparently-To: lojban@yahoogroups.com
Received: (qmail 25878 invoked from network); 16 Aug 2006 12:37:48 -0000
Received: from unknown (66.218.67.36)
  by m34.grp.scd.yahoo.com with QMQP; 16 Aug 2006 12:37:48 -0000
Received: from unknown (HELO chain.digitalkingdom.org) (64.81.49.134)
  by mta10.grp.scd.yahoo.com with SMTP; 16 Aug 2006 12:37:48 -0000
Received: from lojban-out by chain.digitalkingdom.org with local (Exim 4.62)
	(envelope-from <lojban-out@lojban.org>)
	id 1GDKeO-0008Ic-NU
	for lojban@yahoogroups.com; Wed, 16 Aug 2006 05:37:40 -0700
Received: from chain.digitalkingdom.org ([64.81.49.134])
	by chain.digitalkingdom.org with esmtp (Exim 4.62)
	(envelope-from <lojban-list-bounce@lojban.org>)
	id 1GDKbr-0008F6-2F; Wed, 16 Aug 2006 05:35:42 -0700
Received: with ECARTIS (v1.0.0; list lojban-list); Wed, 16 Aug 2006 05:34:45 -0700 (PDT)
Received: from nobody by chain.digitalkingdom.org with local (Exim 4.62)	(envelope-from <nobody@digitalkingdom.org>)	id 1GDKag-0008ET-EH	for lojban-list-real@lojban.org; Wed, 16 Aug 2006 05:33:47 -0700
Received: from nf-out-0910.google.com ([64.233.182.191])	by chain.digitalkingdom.org with esmtp (Exim 4.62)	(envelope-from <pdf23ds@gmail.com>)	id 1GDKaS-0008E1-Bg	for lojban-list@lojban.org; Wed, 16 Aug 2006 05:33:41 -0700
Received: by nf-out-0910.google.com with SMTP id x30so687758nfb        for <lojban-list@lojban.org>; Wed, 16 Aug 2006 05:33:31 -0700 (PDT)
Received: by 10.49.19.18 with SMTP id w18mr648390nfi;        Wed, 16 Aug 2006 05:33:31 -0700 (PDT)
Received: by 10.49.92.8 with HTTP; Wed, 16 Aug 2006 05:33:31 -0700 (PDT)
Message-ID: <737b61f30608160533g388659c4v7b8020357f7664c@mail.gmail.com>
Date: Wed, 16 Aug 2006 07:33:31 -0500
In-Reply-To: <1155767873.6227.13.camel@localhost.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <737b61f30608151434h6ed71ec2k123f043c1ad59838@mail.gmail.com>	 <1155767873.6227.13.camel@localhost.localdomain>
X-Spam-Score: -2.4 (--)
X-archive-position: 12468
X-ecartis-version: Ecartis v1.0.0
Errors-to: lojban-list-bounce@lojban.org
X-original-sender: pdf23ds@gmail.com
X-list: lojban-list
To: lojban@yahoogroups.com
X-Originating-IP: 64.81.49.134
X-eGroups-Msg-Info: 1:0:0:0
X-eGroups-From: "Chris Capel" <pdf23ds@gmail.com>
From: "Chris Capel" <lojban-out@lojban.org>
Reply-To: pdf23ds@gmail.com
Subject: [lojban] Re: parsing with error detection and recovery
X-Yahoo-Group-Post: member; u=116389790; y=6iUn0fShRlZE9Nhryj-Hb2DfzVI0Ufe5_R-vptavIiLm-WhmoQ
X-Yahoo-Profile: lojban_out
X-Yahoo-Message-Num: 26897

On 8/16/06, John Leuner <jewel@subvert-the-dominant-paradigm.net> wrote:

[rearranged a bit]

> I'm confused about how
> the parser would distinguish between rules failing "naturally" (in a
> successful parse there will be many points at which various rules
> failed) and those failures which would cause a larger unit to fail, eg a
> "sentence" or a "selbri".

Rules failing would never be an error. Instead, a group of rules would
be flagged in the grammar as error conditions when they *succeed*. The
flag is interpreted by the user of the parse tree and doesn't actually
affect the parser. These rules actively match the different ways to
mess up the input. This would require a lot of rules.
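
To make the idea concrete, here is a rough sketch in Python. It is a toy
grammar, not Lojban, and every name in it (Node, rule, group_unclosed,
find_errors) is made up for illustration. The parser treats the error
rule like any other rule; only the code that walks the tree afterwards
looks at the is_error flag.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    rule: str
    start: int
    end: int
    children: list = field(default_factory=list)
    is_error: bool = False  # for the consumer of the tree; the parser ignores it

def lit(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return Node(repr(s), pos, pos + len(s))
        return None
    return parse

def seq(*parsers):
    def parse(text, pos):
        children, cur = [], pos
        for p in parsers:
            node = p(text, cur)
            if node is None:
                return None  # an ordinary rule failure, never an error
            children.append(node)
            cur = node.end
        return Node("seq", pos, cur, children)
    return parse

def choice(*parsers):
    # PEG ordered choice: the first alternative that matches wins.
    def parse(text, pos):
        for p in parsers:
            node = p(text, pos)
            if node is not None:
                return node
        return None
    return parse

def rule(name, parser, is_error=False):
    # Wrap a parser in a named node, optionally flagged as an error rule.
    def parse(text, pos):
        node = parser(text, pos)
        if node is None:
            return None
        return Node(name, node.start, node.end, [node], is_error)
    return parse

# Toy grammar (hypothetical):
#   group          <- "(" digit ")" / group_unclosed
#   group_unclosed <- "(" digit          (flagged: matches a known mistake)
digit = choice(*[lit(d) for d in "0123456789"])
group_ok = rule("group", seq(lit("("), digit, lit(")")))
group_unclosed = rule("group_unclosed", seq(lit("("), digit), is_error=True)
group = choice(group_ok, group_unclosed)

def find_errors(node):
    # Walk the finished parse tree and collect nodes from flagged rules.
    found = [node] if node.is_error else []
    for child in node.children:
        found.extend(find_errors(child))
    return found
```

With this, group("(7)", 0) gives a tree with no flagged nodes, while
group("(7", 0) still parses successfully but find_errors reports the
group_unclosed node. The correct alternative comes first in the ordered
choice, so the error rule only gets a chance once the good one fails.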

> Error recovery seems to be quite a hard problem.

I think it's quite doable, but a tricky and lengthy process. My hunch
is that a mature set of the error rules will triple or quadruple the
size of the Lojban grammar.

> As a first step you could try modifying an existing PEG parser to
> produce simple error messages. As far as I know none of the existing
> ones do this yet.

Well, there are a couple ways in which PEG parsers themselves can
produce error messages, but the parsers are extremely simple, so
there's really not much room there. I think most of the work has to be
done to the grammar itself. On the other hand, I *am* planning on
doing a lot of experimenting with simple PEG grammars to prove the
general concept, and to teach myself how PEG grammars are written,
before I do anything with Lojban.

> My parser just tells you that the parse failed and the
> point at which it failed.

What are your criteria for this? Just that the input stream wasn't
fully eaten? I think that ideally, you'd be confident enough in your
grammar that you could pick a starting rule that isn't supposed to
parse to the end of the stream, and apply it repeatedly to the input
stream. Any errors would then show up as matches of error rules in
the grammar rather than as incompletely parsed input. This could be a
great memory optimization for suitable input, because the memoization
cache can be cleared after each substructure is parsed.

For Lojban, this basically amounts to calling the parser with the
'bridi' rule instead of the 'text' rule.
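
Here is a minimal sketch of that loop in Python, again with a toy
grammar rather than the Lojban one. The class Packrat, the rules
'word' and 'sentence', and the function parse_units are all
hypothetical names. The 'sentence' rule stands in for 'bridi': it is
not expected to reach the end of the input, so the driver applies it
over and over and empties the memo table after each unit.

```python
import re

WORD = re.compile(r"[a-z]+")

class Packrat:
    def __init__(self, text):
        self.text = text
        self.memo = {}  # (rule name, position) -> end position, or None on failure

    def apply(self, name, pos):
        # Memoized rule application, the core of a packrat parser.
        key = (name, pos)
        if key not in self.memo:
            self.memo[key] = getattr(self, name)(pos)
        return self.memo[key]

    def word(self, pos):
        m = WORD.match(self.text, pos)
        return m.end() if m else None

    def sentence(self, pos):
        # sentence <- word (" " word)* "."
        end = self.apply("word", pos)
        if end is None:
            return None
        while self.text.startswith(" ", end):
            nxt = self.apply("word", end + 1)
            if nxt is None:
                break
            end = nxt
        return end + 1 if self.text.startswith(".", end) else None

def parse_units(text, start_rule="sentence"):
    """Apply start_rule repeatedly; return the matched units and the stop position."""
    parser = Packrat(text)
    pos, units = 0, []
    while pos < len(text):
        end = parser.apply(start_rule, pos)
        if end is None:
            break  # with a full set of error rules this should not happen
        units.append(text[pos:end])
        parser.memo.clear()  # nothing before 'end' will be revisited
        pos = end
        while pos < len(text) and text[pos].isspace():
            pos += 1  # skip whitespace between units
    return units, pos
```

For example, parse_units("the cat sat. it slept.") returns the two
sentences separately, and the memo table never holds entries for more
than one sentence at a time. In this sketch an unparseable unit simply
stops the loop. The hope is that a grammar with enough error rules
would match the bad input instead, so the loop always advances.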

Chris Capel
-- 
"What is it like to be a bat? What is it like to bat a bee? What is it
like to be a bee being batted? What is it like to be a batted bee?"
-- The Mind's I (Hofstadter, Dennett)

