Received: from VMS.DC.LSOFT.COM (vms.dc.lsoft.com [205.186.43.2]) by locke.ccil.org (8.6.9/8.6.10) with ESMTP id FAA24120 for ; Wed, 29 Nov 1995 05:44:11 -0500
Message-Id: <199511291044.FAA24120@locke.ccil.org>
Received: from PEACH.EASE.LSOFT.COM (205.186.43.4) by VMS.DC.LSOFT.COM (LSMTP for OpenVMS v1.0a) with SMTP id E82946CB ; Wed, 29 Nov 1995 5:35:08 -0500
Date: Wed, 29 Nov 1995 05:33:29 -0500
Reply-To: Logical Language Group
Sender: Lojban list
From: Logical Language Group
Subject: Lojban frequency data on digex ftp site
X-To: lojban@cuvmb.cc.columbia.edu
To: John Cowan
Status: OR
X-From-Space-Date: Wed Nov 29 05:44:14 1995
X-From-Space-Address: LOJBAN%CUVMB.BITNET@UBVM.CC.BUFFALO.EDU

Jorge and others have inquired at various times about Lojban word frequencies. We have had data about gismu for a while, extracted from a large volume of unfiltered Lojban text, and the frequency counts are stored in columns 161-4 of the gismu list. But that unfiltered text included a lot of embedded English, leading to two words (curve and since) getting false values, and it also included a lot of our teaching documents, which are heavily dominated by a selective set of words chosen to make learning easier.

I extracted a sample of 900K characters of Lojban text, using a quick and unsophisticated filter - if a line was all Lojban text, it was probably intended to be natural sentences. I used that filter on 6 months of email archive from 8/94 to 1/95, so as to get a good sampling of the current wave of active Lojbanists, who were writing heavily in the language during that period, and added it to a similar filtering of archives of ckafybarja discussions and texts written by Nick, Ivan, Colin, David Twery, and the local Lojban crew.
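
The line filter described above might be sketched roughly as follows. This is only an illustration, not the filter that was actually used: the test for "all Lojban" (every token uses only Lojban letters, so anything containing h, q or w is rejected) and the names LOJBAN_TOKEN, is_all_lojban and word_frequencies are assumptions made for the example. A test this crude will still let through short English lines that happen to avoid those letters.

```python
import re
from collections import Counter

# Assumption: a token "looks Lojban" if it uses only letters of the
# Lojban alphabet (no h, q, w), plus apostrophe and comma inside words.
LOJBAN_TOKEN = re.compile(r"^[abcdefgijklmnoprstuvxyz',]+$")


def tokens_of(line):
    """Lowercase a line, split on whitespace, strip pause periods."""
    stripped = (t.strip(".") for t in line.lower().split())
    return [t for t in stripped if t]


def is_all_lojban(line):
    """True if the line is non-empty and every token looks Lojban."""
    tokens = tokens_of(line)
    return bool(tokens) and all(LOJBAN_TOKEN.match(t) for t in tokens)


def word_frequencies(lines):
    """Count words, using only lines that pass the all-Lojban test."""
    counts = Counter()
    for line in lines:
        if is_all_lojban(line):
            counts.update(tokens_of(line))
    return counts


if __name__ == "__main__":
    sample = [
        ".i mi klama le zarci",
        "The word klama means something like 'go'",
        "le nu mi klama",
    ]
    for word, n in word_frequencies(sample).most_common():
        print(n, word)
```

The second sample line is dropped because "The" and "something" contain h, so its embedded "klama" does not inflate the count, which is the effect the filtering was meant to achieve.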
The result is still somewhat dominated by text written by Nick, especially the 100K associated with his translation of the texts from the original Colossal Cave Adventure game (a Lojban version of which might eventually get completed %^), but it is more balanced, and is NOT skewed by teaching materials or by discussion of controversial words in the list, since such controversies tend to embed the Lojban in otherwise English sentences.

lujvo are a problem in that most older texts were written before the rafsi list was finalized, and have never been updated to the new set of rafsi. The frequencies generated, and especially the glosses for the lujvo, are thus suspect unless you make the effort to figure out whether the word is in its current form, and possibly match up various rafsi choices for the same tanru, all of which result officially in the "same word" even though the spelling may be different. But then, relatively few lujvo have achieved enough frequency to be considered statistically common, and Lojban grammar terms like "brivla", which are the best known, are justly reduced in frequency in filtered block text as opposed to incidental appearance in English discussions of the language.

I reported two numbers for cmavo (and compounds): one is the raw frequency of each cmavo or compound as a standalone word, and the second is the frequency of each cmavo after breaking up compounds into individual words. The latter is a more trustworthy set of values, since some people write "le nu" as two words, but most write it as a compound "lenu".

The results are in ftp.access.digex.net /pub/lojbab/frqsumm.zip, which is around 125K / 15000 lines.

I will be using the cmavo compound data to determine which compounds to list in the dictionary, and I may adjust the Logflash default order for words based on the new data, especially for cmavo which had never previously been analyzed for frequency.

Enjoy.

lojbab