Date: Thu, 16 Jul 92 00:15:32 -0400
From: lojbab@grebyn.com (Logical Language Group)
Message-Id: <9207160415.AA19973@daily.grebyn.com>
To: cowan@snark.thyrsus.com, nsn@mullian.ee.mu.oz.au
Subject: Zipf and Lojban, early data
Content-Length: 2545
Lines: 48


It occurred to me after responding to John's message last night that we
had data that might indicate whether Zipf's law is actually being
followed in the lujvo that are being made and used in text, and/or
whether John's approach of leaving final terms expanded was general or
specific to him (if even widespread in his usage).  So I looked at the
data.

Of 2591 words, 775 or 30% have the final term unreduced.  Now some
percent of these (indeed I suspect at least 20%) cannot be reduced - I
may be able to get data on actual vs potential reductions in a few days
with a program Nora's written for my testing the results of the rafsi
tuning.  But even with that caveat ignored, 30% is not a general practice
of leaving final terms expanded.

The data further shows some amount of Zipf correlation, though perhaps
not as much as I would expect.  The correlation is with number of terms:
among 2-term lujvo, some 32% of final terms are unreduced (of 1844
lujvo).  Among 3-termers (401 lujvo), there are only 22% with final
term unreduced.  Only 2 of the 58 lujvo with 4 or more terms have the
final term unreduced.

There is also a Zipf correlation between length of word and frequency of
use.  I looked at lujvo with more than 10 occurrences as a percentage of
the whole.  Among 2-termers, the percentage is 15%, dropping to 9% among
3-termers and 8% among 4-termers.

Finally, but only obvious at the two-term level, there is a Zipf cross
correlation between these two features.  Among 2-term words used fewer
than 10 times, 33% are unreduced, but among those used more than 10
times, only 27% are unreduced.  This seems large enough to be
significant, but I'm too rusty on my statistics to be sure.  If so, then
it seems to indicate that people may be more prone to propose words with
unreduced final terms, but that the words that catch on are those with
final terms reduced (or possibly that when they catch on they get
shortened, or people instinctively shorten words that they expect to use
often).  Of course that 6% difference may be a false correlation
actually reflecting the original tuning of the rafsi data, in that words
used often tend to be made up from words that get used in a lot of lujvo
and hence are more likely to have a usable short rafsi assigned to them.
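
A rough check of whether the 33% vs 27% gap is significant can be
sketched as a pooled two-proportion z-test.  The raw counts are not
given above, so the ones used here are assumptions reconstructed from
the rounded percentages (1844 two-term lujvo, about 15% of them used
more than 10 times); the result is indicative only.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# ASSUMPTION: counts rebuilt from the rounded percentages quoted in the
# message; the true counts may differ by a few words either way.
total_two_term = 1844
n_frequent = round(0.15 * total_two_term)   # used more than 10 times
n_rare = total_two_term - n_frequent        # everything else
x_rare = round(0.33 * n_rare)               # unreduced, rarely used
x_frequent = round(0.27 * n_frequent)       # unreduced, frequently used

z, p = two_proportion_z(x_rare, n_rare, x_frequent, n_frequent)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

With these reconstructed counts z comes out a little under 2, which is
right at the edge of the usual 5% threshold, so the rounding in the
inputs could tip the verdict either way.  The test also says nothing
about the rafsi-tuning confound described above.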

Whether all this is useful statistics or just playing with numbers, it
seems clear that these usage statistics may have a bunch of other
applications for learning how we are using Lojban than merely for tuning
the rafsi list.

Thought you-all might be interested.