From: Jay Kominek <jay.kominek@colorado.edu>
Date: Wed, 24 Apr 2002 17:42:25 -0600 (MDT)
To: lojban@yahoogroups.com
Subject: Re: [lojban] cmavo frequency list
In-Reply-To: <20020424045929.GB4465@twcny.rr.com>

On Wed, 24 Apr 2002, Rob Speer wrote:

> > And, have you considered trying to include the IRC channel logs?
>
> I considered it. Where could I get them?

http://miranda.org/~jkominek/lojban/

> The problem there is that I'd need some way to distinguish Lojban text
> from English.

Look at the log, line by line. Strip away any front matter on the line,
check for the [ english ] stuff some people use, and then look at letter
frequency. Lojban ought to have a distinctly different frequency curve
from English. You'd be surprised how effective that can be for language
identification.

The other alternative is to score each line (possibly including the
above method), with scores based on things like the presence of 'w',
'q', '!', and pseudo-common gismu and lujvo which aren't likely to be
mixed in with English.

Actually, the first method is just another scoring method. Pretend I made
sense.
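Roughly what I have in mind, as an untested-in-anger Python sketch. The
log layout (an optional [hh:mm] timestamp followed by <nick>), the two
toy sample texts, the penalty weight, and every function name here are
assumptions made up for illustration; real profiles would need to be
built from far more text than this.

```python
import math
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz'"

# Assumed log layout: optional "[hh:mm]" timestamp, then "<nick>".
FRONT_MATTER = re.compile(r"^\s*(?:\[\d{1,2}:\d{2}(?::\d{2})?\]\s*)?<[^>]*>\s*")
BRACKETED = re.compile(r"\[[^\]]*\]")  # the "[ english ]" asides


def letter_profile(sample):
    """Relative letter frequencies of `sample`, with add-one smoothing."""
    counts = Counter(ch for ch in sample.lower() if ch in LETTERS)
    total = sum(counts.values()) + len(LETTERS)
    return {ch: (counts[ch] + 1) / total for ch in LETTERS}


def clean(line):
    """Drop the front matter and any bracketed asides from a log line."""
    return BRACKETED.sub(" ", FRONT_MATTER.sub("", line)).strip().lower()


def lojban_score(line, lojban, english, penalty=2.0):
    """Log-likelihood ratio; positive means 'more like Lojban'."""
    text = clean(line)
    score = sum(math.log(lojban[ch] / english[ch]) for ch in text if ch in LETTERS)
    # Extra penalty for the giveaway characters 'w', 'q' and '!'.
    score -= penalty * sum(text.count(ch) for ch in "wq!")
    return score


def is_lojban(line, lojban, english):
    """True if the cleaned line has letters and scores above zero."""
    has_letters = any(ch in LETTERS for ch in clean(line))
    return has_letters and lojban_score(line, lojban, english) > 0


# Toy training samples (hypothetical, far too small for real use).
LOJBAN_SAMPLE = (
    "coi rodo .i mi klama le zarci .i le mlatu cu pinxe le ladru "
    ".i mi nelci la lojban .i do prami mi .i xu do tavla fo la lojban"
)
ENGLISH_SAMPLE = (
    "hello everyone, I am going to the store. the cat drinks the milk. "
    "I like this language. what do you think about that? "
    "we will see how it works"
)
LOJBAN = letter_profile(LOJBAN_SAMPLE)
ENGLISH = letter_profile(ENGLISH_SAMPLE)

if __name__ == "__main__":
    for line in ("<jay> mi klama le zarci .i do mo",
                 "<rob> what do you think, how will this work?"):
        print(is_lojban(line, LOJBAN, ENGLISH), line)
```

Both ideas end up in one score: the letter-frequency comparison is a
log-likelihood ratio, and the 'w'/'q'/'!' check is a flat penalty added
on top. Checking for common gismu and lujvo could be bolted on as one
more term in the same sum.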
- Jay Kominek
Plus ça change, plus c'est la même chose