From: Pierre Abbat <phma@webjockey.net>
To: lojban@yahoogroups.com
Subject: Re: [lojban] Word break algorithm so far
Date: Wed, 15 Jan 2003 08:14:33 -0500

On Monday 13 January 2003 11:06, Jorge Llambias wrote:
> la pier cusku di'e
>
> >3. Pick the first piece that has not been resolved.
> >  C. If the piece does not end in 'y' or a consonant and has no
> >     consonant that is adjacent to a consonant when 'y' is removed:
> >    I. Number the consonants starting with 1 and find the last one whose
> >       number is a power of 2.
> >    II. If this consonant is the first letter in the piece or there are
> >       no consonants, resolve the string as a cmavo.
> >    III.If this consonant is not the first letter, split before it.
>
> Why do you need I, II and III? Shouldn't you just split before
> every consonant at this point?

I wrote the program to split once each time it examines a piece, or at most twice, doing two different kinds of split. Given that constraint, this is the most efficient way to break a piece that consists entirely of cmavo.
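For illustration, here is a minimal Python sketch of step C as quoted above. It is not the actual program: the names CONSONANTS, split_point and split_cmavo are made up for this example, and it assumes the piece has already passed the test in C (no consonant clusters, not ending in 'y' or a consonant).

```python
# Hypothetical sketch of step C; not taken from the real program.
CONSONANTS = set("bcdfgjklmnprstvxz")  # assumed Lojban consonant set


def split_point(piece):
    """Return the index to split before, or None if the piece is a cmavo."""
    positions = [i for i, ch in enumerate(piece) if ch in CONSONANTS]
    if not positions:
        return None  # II: no consonants, resolve as a cmavo
    # I: consonants are numbered from 1; take the last whose number is a power of 2
    k = 1 << (len(positions).bit_length() - 1)
    idx = positions[k - 1]
    if idx == 0:
        return None  # II: first letter; splitting here would leave a null piece
    return idx  # III: split before this consonant


def split_cmavo(piece):
    """Apply one split per examination until every piece is resolved."""
    pending, resolved = [piece], []
    while pending:
        p = pending.pop(0)
        idx = split_point(p)
        if idx is None:
            resolved.append(p)
        else:
            pending[0:0] = [p[:idx], p[idx:]]  # re-examine both halves in order
    return resolved
```

With these assumptions, split_cmavo("mibaledo") first splits into "mibale" and "do", then breaks "mibale" down further, ending with mi, ba, le, do.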
If a piece ends in a long string of BY, it hits another part of the algorithm that takes quadratic time, so the n log n time spent on this step is moot. I do have to check whether the consonant is the first letter; otherwise I would break off a null piece, which is an error (though it is currently marked as a cmavo).

phma