From jay.kominek@colorado.edu Wed Apr 24 16:42:27 2002
Date: Wed, 24 Apr 2002 17:42:25 -0600 (MDT)
To: lojban@yahoogroups.com
Subject: Re: [lojban] cmavo frequency list
In-Reply-To: <20020424045929.GB4465@twcny.rr.com>
Message-ID: <Pine.GSO.4.40.0204241706550.7262-100000@ucsub.colorado.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
From: Jay Kominek <jay.kominek@colorado.edu>
X-Yahoo-Group-Post: member; u=20706630
X-Yahoo-Profile: jfkominek


On Wed, 24 Apr 2002, Rob Speer wrote:

> > And, have you considered trying to include the IRC channel logs?
>
> I considered it. Where could I get them?

http://miranda.org/~jkominek/lojban/

> The problem there is that I'd need some way to distinguish Lojban text
> from English.

Look at the log, line by line. Strip away any front matter on the line,
check for the [ english ] stuff some people use, and then look at letter
frequency. Lojban ought to have a distinctly different frequency curve
than English. You'd be surprised how effective that can be for language
identification.
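
A minimal sketch of that line-by-line check, in Python. Everything specific here is an assumption rather than something taken from the actual logs: the "[hh:mm] <nick>" front matter, the rule that a fully bracketed line counts as English, and the names letter_profile, cosine, strip_front_matter and guess_language. The two reference profiles would need to be built from a decent amount of known Lojban and known English text.

```python
import math
import re
from collections import Counter

# Symbols to count. The apostrophe is included because Lojban uses it as a letter.
ALPHABET = "abcdefghijklmnopqrstuvwxyz'"

# Assumed front matter: an optional "[hh:mm]" timestamp, then an optional "<nick>".
FRONT_MATTER = re.compile(r"^\s*(?:\[\d{1,2}:\d{2}(?::\d{2})?\]\s*)?(?:<[^>]*>\s*)?")


def letter_profile(text):
    """Return the relative frequency of each symbol in ALPHABET."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return {ch: 0.0 for ch in ALPHABET}
    return {ch: counts[ch] / total for ch in ALPHABET}


def cosine(p, q):
    """Cosine similarity between two profiles (1.0 means the same shape)."""
    dot = sum(p[ch] * q[ch] for ch in ALPHABET)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def strip_front_matter(line):
    """Drop the assumed timestamp and nick from the start of a log line."""
    return FRONT_MATTER.sub("", line, count=1)


def guess_language(line, lojban_profile, english_profile):
    """Label one log line as 'lojban', 'english' or 'unknown'."""
    text = strip_front_matter(line)
    # Assumption: a line wrapped entirely in [ ... ] is marked as English.
    if re.fullmatch(r"\s*\[.*\]\s*", text):
        return "english"
    profile = letter_profile(text)
    if not any(profile.values()):
        return "unknown"  # no letters to judge by
    if cosine(profile, lojban_profile) > cosine(profile, english_profile):
        return "lojban"
    return "english"
```

Build each reference profile by calling letter_profile on a large sample of the language, then call guess_language on every log line. Short lines give noisy profiles, so it may be worth skipping lines with only a handful of letters.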

The other alternative is to score each line (possibly including the above
method), with scores based on things like the presence of 'w', 'q', '!',
and pseudo-common gismu and lujvo which wouldn't likely be mixed in with
English. Actually, the first method is just another scoring method. Pretend I
made sense.
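
A rough sketch of the scoring idea, again in Python. The weights, the name score_line and the six-word gismu list are placeholders picked for illustration. A real word list would be drawn from the full gismu list, and the weights would need tuning against lines of known language.

```python
import re

# Symbols named above; 'w' and 'q' are not in the Lojban alphabet.
ENGLISH_MARKERS = set("wq!")

# Hypothetical, tiny sample of gismu; a real list would be far longer.
LOJBAN_WORDS = {"klama", "zarci", "prami", "cusku", "tavla", "djuno"}


def score_line(text, marker_weight=-1.0, word_weight=2.0):
    """Score a line: positive leans Lojban, negative leans English."""
    lowered = text.lower()
    # Each English-looking symbol pushes the score down.
    score = marker_weight * sum(ch in ENGLISH_MARKERS for ch in lowered)
    # Each recognised Lojban word pushes the score up.
    words = re.findall(r"[a-z']+", lowered)
    score += word_weight * sum(word in LOJBAN_WORDS for word in words)
    return score
```

With the default weights, score_line("mi klama le zarci") comes out at 4.0 and score_line("What? Wow!") at -4.0. A letter-frequency result could be folded in as one more weighted term.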

- Jay Kominek <jay.kominek@colorado.edu>
Plus ça change, plus c'est la même chose



