Return-Path: <@FINHUTC.HUT.FI:LOJBAN@CUVMB.BITNET>
Received: from FINHUTC.hut.fi by xiron.pc.helsinki.fi with smtp
    (Linux Smail3.1.28.1 #14) id m0pT45z-0000Q9C; Sun, 6 Feb 94 09:39 EET
Message-Id:
Received: from FINHUTC.HUT.FI by FINHUTC.hut.fi (IBM VM SMTP V2R2)
    with BSMTP id 1853; Sun, 06 Feb 94 09:39:42 EET
Received: from SEARN.SUNET.SE (NJE origin MAILER@SEARN) by FINHUTC.HUT.FI
    (LMail V1.1d/1.7f) with BSMTP id 1851; Sun, 6 Feb 1994 09:39:42 +0200
Received: from SEARN.SUNET.SE (NJE origin LISTSERV@SEARN) by SEARN.SUNET.SE
    (LMail V1.2a/1.8a) with BSMTP id 5432; Sun, 6 Feb 1994 08:38:54 +0100
Date: Sun, 6 Feb 1994 02:37:18 -0500
Reply-To: Logical Language Group
Sender: Lojban list
From: Logical Language Group
Subject: GEN: More on Lojban Letter Frequencies
To: lojban@cuvmb.cc.columbia.edu
Content-Length: 3812
Lines: 71

John Cowan's letter frequency data posted the other day is probably
more accurate than he thought. I did the same exercise on a
significantly larger set of Lojban text (and one perhaps less weighted
by Nick's massive contribution to the corpus of Lojban text). The
results were almost identical, except for the letter 'o', and I suspect
the value for that letter may be an arithmetic or copying error on his
part, since his data sums to slightly less than 1000.

My data is based on 75315 words of Lojban text (367K) compared to
Cowan's 20K words, and probably includes the vast majority of such text
in the archives which is greater than single sentence length. I was
similarly careful in removing non-lojban from the text body, even to
removing the contents of zoi and la'o quotes manually.

My raw and normalized frequencies are in the two left columns below.
The third column is John's data. The 4th and 5th columns are the old
static data. The 6th column is the normalized static results based on
taking only 1 copy of each word in the raw Lojban text I used for the
dynamic data combined with the gismu list, cmavo list, and Nick's lujvo
list.
This approximates a maximal list of words that could appear in the
dictionary, though it probably has a small excess of meaningful cmavo
compounds.

                                 Old results          Current
                               static    static       static
letter       dynamic           no-lujvo  with-lujvo   with-lujvo/cmavo
          raw  Lojbab  Cowan
'       13888  048     045     037       028          057
a       30431  106     105     118       084          125
b        6016  021     021     025       024          024
c       11849  041     042     043       029          037
d        7123  025     023     026       024          023
e       26810  094     095     059       044          075
f        3678  013     013     017       017          013
g        4159  015     014     017       016          018
i       37295  130     132     124       076          107
j        5108  018     017     029       028          022
k        9546  033     033     034       031          031
l       21156  074     073     041       039          048
m        8971  031     032     030       029          034
n       15557  054     055     067       058          051
o       17890  062     057     047       029          042
p        6062  021     022     024       024          026
r       11410  040     039     054       084          058
s       10229  036     037     040       038          045
t        7762  027     026     043       038          034
u       21556  075     076     076       050          067
v        3310  012     010     014       013          015
x        2180  008     008     012       015          011
y        1487  005     004     002       158          022
z        3101  011     009     010       010          013
,          71
       ______
        75315 wds                                     9300 words/compounds
       367090 char                                    67361 char

The two dynamic data-sets gave identical rank-ordering, thus confirming
my observation that almost 4x the amount of data had little effect. The
new static data significantly differed from theory, and was not all
that far from the dynamic data ordering - no letter moved more than 4
positions from the dynamic rank except 'y' which is probably used
excessively in current Lojban text because people don't know the rafsi
well enough to use reduced forms all the times that they could.

both dynamic:    iaeul on'cr skmtd pbjgf vzxy
old no-lujvo:    iaune rotcl s'kmj dbpfg vxzy
old with-lujvo:  yarin uelts kcmoj 'pbdf gxvz
new static:      aieur 'nlso ctmkp bdjyg vfzx

lojbab
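
For anyone who wants to repeat the exercise, the following is a minimal
Python sketch of how the per-thousand figures and the rank-ordering
strings could be reproduced from raw counts. It is not the procedure
actually used for the post. The names count_letters, per_thousand and
rank_order are illustrative. The sketch assumes that the comma is left
out of the normalization total, that ties are broken alphabetically,
and that text is lowercased before counting; the post states none of
these.

```python
from collections import Counter

# The 24 characters tallied in the table (apostrophe included).
LOJBAN_LETTERS = "'abcdefgijklmnoprstuvxyz"

# Raw dynamic counts copied from the "raw" column of the table.
RAW_COUNTS = {
    "'": 13888, "a": 30431, "b": 6016, "c": 11849, "d": 7123, "e": 26810,
    "f": 3678, "g": 4159, "i": 37295, "j": 5108, "k": 9546, "l": 21156,
    "m": 8971, "n": 15557, "o": 17890, "p": 6062, "r": 11410, "s": 10229,
    "t": 7762, "u": 21556, "v": 3310, "x": 2180, "y": 1487, "z": 3101,
}


def count_letters(text):
    """Tally Lojban letters in text, ignoring every other character.

    Lowercasing first is an assumption of this sketch.
    """
    return Counter(ch for ch in text.lower() if ch in LOJBAN_LETTERS)


def per_thousand(counts):
    """Normalize counts to occurrences per 1000 letters, rounded."""
    total = sum(counts.values())
    return {letter: round(1000 * n / total) for letter, n in counts.items()}


def rank_order(counts, group=5):
    """Letters from most to least frequent, in space-separated groups.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    letters = "".join(sorted(counts, key=lambda l: (-counts[l], l)))
    return " ".join(letters[i:i + group] for i in range(0, len(letters), group))


if __name__ == "__main__":
    print(rank_order(RAW_COUNTS))
    for letter, value in sorted(per_thousand(RAW_COUNTS).items()):
        print(letter, format(value, "03d"))
```

Run on the raw column above, rank_order gives the same string as the
"both dynamic" line. To apply it to a new corpus, pass the text through
count_letters first.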