From: Rob Speer <rob@twcny.rr.com>
Date: Wed, 24 Apr 2002 00:59:29 -0400
To: lojban@yahoogroups.com
Subject: Re: [lojban] cmavo frequency list

On Tue, Apr 23, 2002 at 08:32:27PM -0600, Jay Kominek wrote:
>
> On Tue, 23 Apr 2002, Rob Speer wrote:
>
> > I seem to remember that there is so far no accurate list of the
> > frequencies with which each cmavo is used.
>
> Wee
>
> > So I wrote a script which would search Lojban text for cmavo, even in
> > compounds, and count up the frequency for each one.
>
> Out of curiosity, are you using jbofi'e or vlatai or something along
> those lines to handle the lexing?

No.
It would probably be better if I did, but right now I match against this
regular expression to determine whether a word is a cmavo (or cmavo
compound):

^([bcdfgjklmnprstvxz\.]?[aeiou]'?[aeiou]*)+\.?$

I had to leave out cmavo with "y", because otherwise I'd get false
positives on lujvo like "ricyci'e".

> And, have you considered trying to include the IRC channel logs?

I considered it. Where could I get them? The problem there is that I'd
need some way to distinguish Lojban text from English.

> > Another script found the 121 cmavo which were not used anywhere. Some of
> > these were expected (lau) while others were quite surprising that they
> > have gone unused (ro'e). And of course most of the MEX words are in
> > there, but they are important nonetheless.
>
> I'd like to point out (for what little it is worth), that I've used the
> following:
>
> ke'e ko'o
> ci'i mo'a
> ro'e ro'o

I'm sure some of these, especially ro'e and ro'o, have been used many
times - but their usage didn't make it into any finished text.

--
Rob Speer
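
For illustration, the regular-expression check described above could be
sketched in Python roughly as follows. The message does not say what
language the script was written in or how compounds were split into
individual cmavo, so the names is_cmavo_like, count_cmavo and CMAVO_PIECE
are hypothetical, and the splitting rule is an assumption rather than the
method actually used.

```python
import re
from collections import Counter

# The pattern quoted in the message, unchanged. Words containing "y"
# are deliberately not matched.
CMAVO_WORD = re.compile(r"^([bcdfgjklmnprstvxz\.]?[aeiou]'?[aeiou]*)+\.?$")

# Assumption (not from the message): cut a compound into pieces made of
# an optional consonant followed by vowel groups joined by apostrophes.
CMAVO_PIECE = re.compile(r"[bcdfgjklmnprstvxz]?[aeiou]+(?:'[aeiou]+)*")


def is_cmavo_like(word):
    """Return True if the word matches the quoted cmavo/compound pattern."""
    return CMAVO_WORD.match(word) is not None


def count_cmavo(text):
    """Count cmavo pieces in whitespace-separated text."""
    counts = Counter()
    for word in text.split():
        if is_cmavo_like(word):
            # Leading or trailing dots are dropped by the piece pattern.
            counts.update(CMAVO_PIECE.findall(word))
    return counts


if __name__ == "__main__":
    sample = "mi klama lenu ricyci'e ro'e .i mi"
    for cmavo, n in count_cmavo(sample).most_common():
        print(cmavo, n)
```

In this sketch "klama" and "ricyci'e" are rejected by the word pattern,
while "lenu" is accepted as a compound and counted as "le" and "nu".
Punctuation and capital letters are not handled.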