Date: Thu, 16 Jul 92 00:15:32 -0400
From: lojbab@grebyn.com (Logical Language Group)
Message-Id: <9207160415.AA19973@daily.grebyn.com>
To: cowan@snark.thyrsus.com, nsn@mullian.ee.mu.oz.au
Subject: Zipf and Lojban, early data
Content-Length: 2545
Lines: 48

It occurred to me after responding to John's message last night that we had data that might indicate whether Zipf's law is actually being followed in the lujvo that are being made and used in text, and/or whether John's approach of leaving final terms expanded was general or specific to him (if even widespread in his usage). So I looked at the data.

Of 2591 words, 775 or 30% have the final term unreduced. Now some percent of these (indeed I suspect at least 20%) cannot be reduced - I may be able to get data on actual vs. potential reductions in a few days with a program Nora's written for my testing the results of the rafsi tuning. But even with that caveat ignored, 30% does not amount to a general practice of leaving final terms unreduced.

The data further show some amount of Zipf correlation, though perhaps not as much as I would expect. The correlation is with number of terms: among 2-term lujvo (1844 lujvo), some 32% of final terms are unreduced. Among 3-termers (401 lujvo), only 22% have the final term unreduced. Only 2 of the 58 lujvo with 4 or more terms (about 3%) have the final term unreduced.

There is also a Zipf correlation between length of word and frequency of use. I looked at lujvo with more than 10 occurrences as a percentage of the whole. Among 2-termers, the percentage is 15%, dropping to 9% among 3-termers and 8% among 4-termers.

Finally, but only obvious at the two-term level, there is a Zipf cross-correlation between these two features. Among 2-term words used fewer than 10 times, 33% are unreduced, but among those used more than 10 times, only 27% are unreduced. This seems large enough to be significant, but I'm too rusty on my statistics to be sure.
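
For anyone who wants to check the significance question, a two-proportion z-test is one standard way to compare 33% against 27% among the 2-term lujvo. What follows is only a sketch: the message gives percentages, not raw counts, so the counts are reconstructed from the rounded figures (15% of the 1844 two-term lujvo taken as the frequently used group) and are assumptions, as is the function name `two_proportion_z_test`. Words used exactly 10 times are not placed in either group by the figures quoted, so the split is approximate.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two_sided_p) under H0: both groups share one proportion.

    x1, x2 are the counts of unreduced final terms; n1, n2 the group sizes.
    Uses the pooled-proportion standard error and a normal approximation.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# ASSUMED counts, rebuilt from the rounded percentages in the message.
total_two_term = 1844
frequent = round(0.15 * total_two_term)        # used more than 10 times
rare = total_two_term - frequent               # everything else
rare_unreduced = round(0.33 * rare)            # 33% unreduced
frequent_unreduced = round(0.27 * frequent)    # 27% unreduced

z, p = two_proportion_z_test(rare_unreduced, rare, frequent_unreduced, frequent)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

Because the inputs are rebuilt from rounded percentages, the result is only a rough guide; the raw word counts would settle whether the difference is significant.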

If the difference is real, it seems to indicate that people may be more prone to propose words with unreduced final terms, but that the words that catch on are those with final terms reduced (or possibly that when they catch on they get shortened, or that people instinctively shorten words that they expect to use often). Of course, that 6-point difference may be a false correlation actually reflecting the original tuning of the rafsi data, in that words used often tend to be made up from words that get used in a lot of lujvo and hence are more likely to have a usable short rafsi assigned to them.

Whether all this is useful statistics or just playing with numbers, it seems clear that these usage statistics may have a number of applications for learning how we are using Lojban, beyond merely tuning the rafsi list.

Thought you-all might be interested.