
Re: [lojban] cmavo frequency list



On Wed, 24 Apr 2002, Rob Speer wrote:

> > And, have you considered trying to include the IRC channel logs?
>
> I considered it. Where could I get them?

http://miranda.org/~jkominek/lojban/

> The problem there is that I'd need some way to distinguish Lojban text
> from English.

Look at the log, line by line. Strip away any front matter on the line,
check for the [ english ] stuff some people use, and then look at letter
frequency. Lojban ought to have a distinctly different frequency curve
than English. You'd be surprised how effective that can be for language
identification.
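
A minimal sketch of the letter-frequency idea, not a tested classifier. The
two sample strings are tiny illustrative stand-ins (an assumption; real
profiles would be built from much larger Lojban and English corpora), and the
names letter_profile, cosine, and guess_language are hypothetical. Each line
is compared against both profiles by cosine similarity, and the closer one
wins.

```python
import math
from collections import Counter

# Tiny illustrative samples (assumptions); real profiles need far more text.
LOJBAN_SAMPLE = (
    "coi rodo .i mi klama le zarci .i mi prami do .i xu do se bangu "
    "la lojban .i mi nelci le nu tavla fo la lojban"
)
ENGLISH_SAMPLE = (
    "hello everyone i am going to the store do you speak the language "
    "i like talking with other people about what they think"
)


def letter_profile(text):
    """Return the relative frequency of each letter a-z in text."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse letter-frequency profiles."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


PROFILES = {
    "lojban": letter_profile(LOJBAN_SAMPLE),
    "english": letter_profile(ENGLISH_SAMPLE),
}


def guess_language(line):
    """Pick the reference profile most similar to the line's letters."""
    profile = letter_profile(line)
    return max(PROFILES, key=lambda name: cosine(profile, PROFILES[name]))


print(guess_language("xu do klama"))
print(guess_language("what do you think about the weather"))
```

Very short lines carry little signal, so in practice it would probably make
sense to skip lines below some minimum length or to treat near-ties as
undecided.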

The other alternative is to score each line (possibly including the above
method), with scores based on things like the presence of 'w', 'q', '!',
and pseudo-common gismu and lujvo which wouldn't likely be mixed in with
English. Actually, the first method is just another scoring method. Pretend I
made sense.
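
A rough sketch of the per-line scoring, under several assumptions: the log
line layout (an optional timestamp followed by <nick>), the word list, and
the weights are all hypothetical and would need adjusting to the real logs.
Bracketed [ english ] glosses are dropped before scoring.

```python
import re

# Characters that count against a line being Lojban.
NON_LOJBAN_CHARS = set("wq!")

# Hypothetical short list of gismu; a real list would be much longer.
LOJBAN_WORDS = {"klama", "tavla", "prami", "nelci", "zarci", "bangu"}

# Assumed front matter: optional timestamp, then "<nick>".
FRONT_MATTER = re.compile(r"^\s*(\[?\d{1,2}:\d{2}(:\d{2})?\]?\s*)?<[^>]+>\s*")

# Bracketed English glosses such as "[ I talk ]".
BRACKETED = re.compile(r"\[[^\]]*\]")


def lojban_score(line):
    """Return a score; higher means more likely Lojban."""
    text = FRONT_MATTER.sub("", line)
    text = BRACKETED.sub(" ", text).lower()
    # One point off for every character that suggests English.
    score = -sum(1 for ch in text if ch in NON_LOJBAN_CHARS)
    # Two points (an arbitrary weight) for every known Lojban word.
    words = re.findall(r"[a-z']+", text)
    score += 2 * sum(1 for word in words if word in LOJBAN_WORDS)
    return score


print(lojban_score("<rob> mi klama le zarci"))
print(lojban_score("[12:34] <jay> what a quick answer!"))
```

A letter-frequency similarity could be folded in as one more weighted term,
which is the sense in which the first method is also a scoring method.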

- Jay Kominek <jay.kominek@colorado.edu>
  Plus ça change, plus c'est la même chose