Message-Id: <199511291044.FAA24120@locke.ccil.org>
Date:         Wed, 29 Nov 1995 05:33:29 -0500
Reply-To: Logical Language Group <lojbab@ACCESS.DIGEX.NET>
Sender: Lojban list <LOJBAN@CUVMB.BITNET>
From: Logical Language Group <lojbab@ACCESS.DIGEX.NET>
Subject:      Lojban frequency data on digex ftp site
To: John Cowan <cowan@LOCKE.CCIL.ORG>
Status: OR

Jorge and others have inquired at various times about Lojban word frequencies.
We have had data about gismu for a while, extracted from a large volume of
unfiltered Lojban text, and the frequency counts are stored in columns 161-164
of the gismu list.  But that unfiltered text included a lot of
embedded English, leading to two words (curve and since, which are also
English words) getting falsely high values, and it also included a lot of our
teaching documents, which are heavily dominated by a selected set of words
chosen to make learning easier.
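
For anyone who wants to pull the older counts out of the gismu list, here is
a minimal sketch in Python.  It assumes the column numbers are 1-based and
inclusive and that the field holds a right-justified integer or is blank;
the function name gismu_frequency is hypothetical and not part of any
distributed tool.

```python
def gismu_frequency(line, start_col=161, end_col=164):
    """Return the frequency count stored in a fixed-width gismu-list line.

    Assumption: columns are numbered from 1 and the range is inclusive.
    Returns None when the field is blank, missing, or not a plain integer.
    """
    field = line[start_col - 1:end_col].strip()
    return int(field) if field.isdigit() else None
```

A line shorter than 161 characters simply yields None rather than an error.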

I extracted a sample of 900K characters of Lojban text, using a quick and
unsophisticated filter: if a line was all Lojban text, it probably was
intended to be natural sentences.  I used that filter on 6 months of email
archives from 8/94 to 1/95, so as to get a good sampling of the
current wave of active Lojbanists, who were writing heavily in the language
during that period, and added it to a similar filtering of archives of
ckafybarja discussions, texts written by Nick, Ivan, Colin, David Twery, and
the local Lojban crew.  The result is still somewhat dominated by
text written by Nick, especially the 100K associated with his translation
of the texts from the original Colossal Cave Adventure game, a Lojban version
of which might eventually get completed %^), but it is more balanced, and is
NOT skewed by teaching materials or by discussion of controversial words in
the list, since such controversies tend to embed the Lojban in otherwise
English sentences.
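
A line-level filter of this kind could be sketched roughly as follows.  This
is not the filter actually used, whose details are not given here; it is an
illustration that assumes a token "looks Lojban" if it uses only letters of
the Lojban alphabet (no h, q, or w) and either ends in a vowel (as brivla
and cmavo do) or is followed by a period (as names are).  All function names
are hypothetical.

```python
import re

# Lojban letters: the Latin alphabet without h, q, w, plus apostrophe and comma.
# An optional period (pause) may appear before or after the word.
_LOJBAN_SHAPE = re.compile(r"\.?[abcdefgijklmnoprstuvxyz',]+\.?")
_VOWELS = "aeiouy"


def looks_lojban(token):
    """Heuristic: True if the token has the surface shape of a Lojban word."""
    t = token.lower()
    if not _LOJBAN_SHAPE.fullmatch(t):
        return False
    core = t.strip(".")
    # brivla and cmavo end in a vowel; names end in a consonant plus a pause.
    return core[-1] in _VOWELS or t.endswith(".")


def is_all_lojban(line):
    """True if the line is non-empty and every token looks like Lojban."""
    tokens = line.split()
    return bool(tokens) and all(looks_lojban(tok) for tok in tokens)


def filter_lojban_lines(lines):
    """Keep only the lines that consist entirely of Lojban-looking tokens."""
    return [ln for ln in lines if is_all_lojban(ln)]
```

A shape test like this will still accept English words that happen to fit
Lojban morphology, so requiring the whole line to pass is what does most of
the work.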

lujvo are a problem in that most older texts were written before the rafsi
list was finalized, and have never been updated to the new set of rafsi.
The frequencies generated, and especially the glosses for the lujvo, are
thus suspect unless you make the effort to figure out whether the word is
in its current form, and possibly match up various rafsi choices for the
same tanru, all of which result officially in the "same word" even though the
spelling may be different.

But then, relatively few lujvo have achieved enough frequency to be considered
statistically common, and Lojban grammar terms like "brivla", which are the most
well known, are justly reduced in frequency in filtered block text as opposed
to incidental appearance in English discussions of the language.

I reported two numbers for cmavo (and compounds), one being the raw frequency
of each cmavo or compound as a standalone word, and then a second which is the
frequency for each cmavo after breaking up compounds into individual words.
The latter is a more trustworthy set of values, since some people write
"le nu" as two words, but most write it as a compound "lenu".
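
The two kinds of count can be illustrated with a short Python sketch.  It
assumes that each cmavo inside a compound has the shape of at most one
consonant followed by vowels and apostrophes, so a compound can be cut
before each consonant; tokens that do not fit that shape (brivla, names) are
left whole.  The names split_compound and cmavo_frequencies are hypothetical,
and this is not the program that produced the posted data.

```python
import re
from collections import Counter

# One cmavo: optional pause, at most one consonant, then vowels/apostrophes.
_CMAVO_PIECE = re.compile(r"\.?[bcdfgjklmnprstvxz]?[aeiouy']+")


def split_compound(word):
    """Split a cmavo compound such as 'lenu' into ['le', 'nu'].

    If the word is not made up entirely of cmavo-shaped pieces,
    it is returned unsplit as a one-element list.
    """
    w = word.lower()
    pieces = _CMAVO_PIECE.findall(w)
    if "".join(pieces) != w:
        return [w]
    return [p.strip(".") for p in pieces]


def cmavo_frequencies(tokens):
    """Return (raw, split) Counters: words as written, and after splitting."""
    raw = Counter()
    split = Counter()
    for tok in tokens:
        raw[tok.lower().strip(".")] += 1
        split.update(split_compound(tok))
    return raw, split
```

With the tokens "lenu", "le", "nu", "lenu", "le", the raw count for "le" is 2
while the split count is 4, which is the kind of difference the second set
of numbers is meant to even out.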

The results are in ftp.access.digex.net /pub/lojbab/frqsumm.zip, which
is around 125K / 15000 lines.  I will be using the cmavo compound data to
determine which compounds to list in the dictionary, and I may adjust the
Logflash default order for words based on the new data, especially for cmavo
which had never previously been analyzed for frequency.

Enjoy.

lojbab