From rob@twcny.rr.com Tue Apr 23 21:59:31 2002
Return-Path: <rob@twcny.rr.com>
X-Sender: rob@twcny.rr.com
X-Apparently-To: lojban@yahoogroups.com
Received: (EGP: mail-8_0_3_1); 24 Apr 2002 04:59:31 -0000
Received: (qmail 89396 invoked from network); 24 Apr 2002 04:59:31 -0000
Received: from unknown (66.218.66.216)
  by m8.grp.scd.yahoo.com with QMQP; 24 Apr 2002 04:59:31 -0000
Received: from unknown (HELO mailout5.nyroc.rr.com) (24.92.226.169)
  by mta1.grp.scd.yahoo.com with SMTP; 24 Apr 2002 04:59:30 -0000
Received: from mail1.twcny.rr.com (mail1-1.nyroc.rr.com [24.92.226.139])
  by mailout5.nyroc.rr.com (8.11.6/Road Runner 1.12) with ESMTP id g3O4xSH09036
  for <lojban@yahoogroups.com>; Wed, 24 Apr 2002 00:59:28 -0400 (EDT)
Received: from riff ([24.92.246.4]) by mail1.twcny.rr.com
  (Post.Office MTA v3.5.3 release 223
  ID# 0-59787U250000L250000S0V35) with ESMTP id com
  for <lojban@yahoogroups.com>; Wed, 24 Apr 2002 00:59:27 -0400
Received: from rob by riff with local (Exim 3.35 #1 (Debian))
  id 170EsL-0001By-00
  for <lojban@yahoogroups.com>; Wed, 24 Apr 2002 00:59:29 -0400
Date: Wed, 24 Apr 2002 00:59:29 -0400
To: lojban@yahoogroups.com
Subject: Re: [lojban] cmavo frequency list
Message-ID: <20020424045929.GB4465@twcny.rr.com>
References: <20020424002708.GA3992@twcny.rr.com> <Pine.GSO.4.40.0204232025580.16634-100000@ucsub.colorado.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.GSO.4.40.0204232025580.16634-100000@ucsub.colorado.edu>
User-Agent: Mutt/1.3.28i
X-Is-It-Not-Nifty: www.sluggy.com
Sender: Rob Speer <rob@riff>
From: Rob Speer <rob@twcny.rr.com>
Reply-To: rob@twcny.rr.com
X-Yahoo-Group-Post: member; u=2572649
X-Yahoo-Profile: squeekybobo

On Tue, Apr 23, 2002 at 08:32:27PM -0600, Jay Kominek wrote:
> 
> On Tue, 23 Apr 2002, Rob Speer wrote:
> 
> > I seem to remember that there is so far no accurate list of the
> > frequencies with which each cmavo is used.
> 
> Wee
> 
> > So I wrote a script which would search Lojban text for cmavo, even in
> > compounds, and count up the frequency for each one.
> 
> Out of curiosity, are you using jbofi'e or vlatai or something along
> those lines to handle the lexing?

No. It would probably be better if I did, but right now I match against
this regular expression to determine whether a word is a cmavo (or cmavo
compound):

^([bcdfgjklmnprstvxz\.]?[aeiou]'?[aeiou]*)+\.?$

I had to leave out cmavo with "y", because otherwise I'd get false
positives on lujvo like "ricyci'e".
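
For anyone who wants to try this, here is a minimal sketch in Python of how
that expression could drive a frequency count. It is not the script itself.
The names CMAVO_WORD, CMAVO_PIECE and count_cmavo are made up for
illustration, and splitting a compound greedily from left to right with the
inner group of the pattern is an assumption; it can mis-split words with
more than one apostrophe.

```python
import re
from collections import Counter

# The pattern quoted above: a whole word that looks like a cmavo or a
# cmavo compound. Words containing "y" never match.
CMAVO_WORD = re.compile(r"^([bcdfgjklmnprstvxz\.]?[aeiou]'?[aeiou]*)+\.?$")

# Assumption: one cmavo corresponds to one repetition of the inner group.
CMAVO_PIECE = re.compile(r"[bcdfgjklmnprstvxz\.]?[aeiou]'?[aeiou]*")


def count_cmavo(text):
    """Count cmavo in whitespace-separated text, splitting compounds."""
    counts = Counter()
    for word in text.split():
        # Only words that match the full pattern are treated as cmavo.
        if CMAVO_WORD.match(word):
            # Split a compound such as "lenu" into "le" and "nu".
            counts.update(CMAVO_PIECE.findall(word))
    return counts


if __name__ == "__main__":
    sample = "mi klama le zarci .i lenu ricyci'e cu ro'e"
    for cmavo, n in count_cmavo(sample).most_common():
        print(cmavo, n)
```

Words such as "klama" and "zarci" are rejected because the pattern has no
way to consume a consonant cluster, and "ricyci'e" is rejected because of
the "y".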

> And, have you considered trying to include the IRC channel logs?

I considered it. Where could I get them?

The problem there is that I'd need some way to distinguish Lojban text
from English.

> > Another script found the 121 cmavo which were not used anywhere. Some of
> > these were expected (lau), while it is quite surprising that others
> > have gone unused (ro'e). And of course most of the MEX words are in
> > there, but they are important nonetheless.
> 
> I'd like to point out (for what little it is worth), that I've used the
> following:
> 
> ke'e ko'o
> ci'i mo'a
> ro'e ro'o

I'm sure some of these, especially ro'e and ro'o, have been used many
times - but their usage didn't make it into any finished text.

-- 
Rob Speer


