From nobody@digitalkingdom.org Tue Aug 15 14:35:20 2006 Received: with ECARTIS (v1.0.0; list lojban-list); Tue, 15 Aug 2006 14:35:21 -0700 (PDT) Received: from nobody by chain.digitalkingdom.org with local (Exim 4.62) (envelope-from ) id 1GD6Yu-0007hV-O3 for lojban-list-real@lojban.org; Tue, 15 Aug 2006 14:35:00 -0700 Received: from wx-out-0506.google.com ([66.249.82.234]) by chain.digitalkingdom.org with esmtp (Exim 4.62) (envelope-from ) id 1GD6Yt-0007hN-Ux for lojban-list@lojban.org; Tue, 15 Aug 2006 14:35:00 -0700 Received: by wx-out-0506.google.com with SMTP id r21so1626501wxc for ; Tue, 15 Aug 2006 14:34:58 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=beta; d=gmail.com; h=received:message-id:date:from:to:subject:mime-version:content-type:content-transfer-encoding:content-disposition; b=jdEB4qbOijSgoLFLm9+kNf6oHMmVjYNlSvTf1G/SQ4j72jlQZR0d4cdhlIggNOSt/9Nhg5GIkZqe+RCPpsqS6GfYeEA2C5m0yCIz6AuVoBXppBbyIXVUF8LMvVjHZ6+obfII311Qb3S8vihtpJ6PWAkpHz8BADMq0EltHxjvvaw= Received: by 10.49.8.15 with SMTP id l15mr1849194nfi; Tue, 15 Aug 2006 14:34:58 -0700 (PDT) Received: by 10.49.92.8 with HTTP; Tue, 15 Aug 2006 14:34:57 -0700 (PDT) Message-ID: <737b61f30608151434h6ed71ec2k123f043c1ad59838@mail.gmail.com> Date: Tue, 15 Aug 2006 16:34:57 -0500 From: "Chris Capel" To: lojban-list@lojban.org Subject: [lojban] parsing with error detection and recovery MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Content-Disposition: inline X-Spam-Score: -0.8 (/) X-archive-position: 12463 X-ecartis-version: Ecartis v1.0.0 Sender: lojban-list-bounce@lojban.org Errors-to: lojban-list-bounce@lojban.org X-original-sender: pdf23ds@gmail.com Precedence: bulk Reply-to: lojban-list@lojban.org X-list: lojban-list I'm looking into implementing a friendly PEG parser. The current PEG grammar (and morphology) are very unfriendly, in that invalid lojban text is simply not parsable, as opposed to being parsable with possible errors listed. But a parser with error detection could be easily based on the existing PEG grammars by adding additional rules (with lower precedence than any rules for valid Lojban) that are specially marked and are associated with descriptive error messages. Adding these rules would also add substantial error recovery/tolerance to parsers. For instance, the morphology rules in the BPFK Peg Morphology[1] will only parse consonants that don't appear in invalid consonant clusters. If a consonant cluster is invalid, it will stop parsing. But by adding error rules for consonants that don't check the validity (that only get matched if the ones that do check don't match) or that check for specific kinds of invalid pairs, the output of the parser could be more likely to finish, and could tell the user why the cluster is invalid. Composing a good set of these rules would definitely be quite an art, but seems like a good approach. So, my question is this: is there an easy way to prove the equivalence of PEG parser A with the parts of parser B that apply only to valid input? My first hunch is that as long as B is derived from A by only adding rules where the added rule is an error condition if ever matched in an input, and by only modifying existing rules either by renaming them (and all references to them) or by adding options to the end that point toward error rules, then parser B will return a parse tree with no matches on error rules if and only if parser A would be able to parse the input at all. But I'm not completely sure that's the case. Chris Capel -- "What is it like to be a bat? What is it like to bat a bee? What is it like to be a bee being batted? What is it like to be a batted bee?" -- The Mind's I (Hofstadter, Dennet) To unsubscribe from this list, send mail to lojban-list-request@lojban.org with the subject unsubscribe, or go to http://www.lojban.org/lsg2/, or if you're really stuck, send mail to secretary@lojban.org for help.