From phma@webjockey.net Wed Jan 15 05:14:35 2003
Return-Path: <phma@ixazon.dynip.com>
X-Sender: phma@ixazon.dynip.com
X-Apparently-To: lojban@yahoogroups.com
Received: (EGP: mail-8_2_3_0); 15 Jan 2003 13:14:34 -0000
Received: (qmail 9364 invoked from network); 15 Jan 2003 13:14:34 -0000
Received: from unknown (66.218.66.216)
  by m1.grp.scd.yahoo.com with QMQP; 15 Jan 2003 13:14:34 -0000
Received: from unknown (HELO blackcat.ixazon.lan) (208.150.110.21)
  by mta1.grp.scd.yahoo.com with SMTP; 15 Jan 2003 13:14:34 -0000
Received: by blackcat.ixazon.lan (Postfix, from userid 1001)
  id E851886DD; Wed, 15 Jan 2003 13:14:33 +0000 (UTC)
Organization: dis
To: lojban@yahoogroups.com
Subject: Re: [lojban] Word break algorithm so far
Date: Wed, 15 Jan 2003 08:14:33 -0500
User-Agent: KMail/1.5
References: <F24b3yJmCSZb5HOcfBz00012c78@hotmail.com>
In-Reply-To: <F24b3yJmCSZb5HOcfBz00012c78@hotmail.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200301150814.33387.phma@webjockey.net>
From: Pierre Abbat <phma@webjockey.net>
X-Yahoo-Group-Post: member; u=92712300

On Monday 13 January 2003 11:06, Jorge Llambias wrote:
> la pier cusku di'e
>
> >3. Pick the first piece that has not been resolved.
> > C. If the piece does not end in 'y' or a consonant and has no
> > consonant that is adjacent to a consonant when 'y' is removed:
> > I. Number the consonants starting with 1 and find the last one whose
> > number is a power of 2.
> > II. If this consonant is the first letter in the piece or there are
> > no consonants, resolve the string as a cmavo.
> > III.If this consonant is not the first letter, split before it.
>
> Why do you need I, II and III? Shouldn't you just split before
> every consonant at this point?

I wrote the program to split a piece at most once each time it examines it,
or at most twice if the two splits are of different kinds. Given that
constraint, splitting before the last power-of-2 consonant is the most
efficient way to break a piece that consists entirely of cmavo. If a piece
ends in a long string of BY, it hits another part of the algorithm that
takes quadratic time, so taking n log n time on this part is moot.

I have to check whether the consonant is the first letter; otherwise I would
break off a null piece, which is an error (though currently it gets marked as
a cmavo).
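
To make step 3.C concrete, here is a rough Python sketch of just that step. It
is not the actual program: the names split_point and break_cmavo, the consonant
set, and the sample strings are made up for illustration, and it assumes the
piece has already passed the tests in C (no final 'y' or consonant, no
consonant clusters).

```python
# Illustrative sketch only; names and consonant set are assumptions.
CONSONANTS = set("bcdfgjklmnprstvxz")

def split_point(piece):
    """Return the index to split before, or None if the piece is one cmavo."""
    positions = [i for i, ch in enumerate(piece) if ch in CONSONANTS]
    if not positions:
        return None  # II: no consonants, resolve as a cmavo
    # I: number the consonants from 1; take the last number that is a power of 2
    number = 1 << (len(positions).bit_length() - 1)
    index = positions[number - 1]
    # II: if that consonant is the first letter, splitting there would break
    # off a null piece, so resolve as a cmavo instead.
    # III: otherwise split before it.
    return index if index > 0 else None

def break_cmavo(piece):
    """Examine the first unresolved piece repeatedly, splitting once per look."""
    pending = [piece]
    resolved = []
    while pending:
        current = pending.pop(0)
        cut = split_point(current)
        if cut is None:
            resolved.append(current)
        else:
            # Put both halves back at the front so order is preserved.
            pending[:0] = [current[:cut], current[cut:]]
    return resolved

print(break_cmavo("lenuminelo"))
print(break_cmavo("ilemi"))
```

With five consonants, the first split falls before consonant number 4, giving
"lenumi" and "nelo"; each half is then examined again until every piece starts
with its only consonant or has none.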

phma

