Message-Id: <199402032330.AA01378@eli.CS.YALE.EDU>
Date: Thu, 3 Feb 1994 14:52:44 -0500
Reply-To: Logical Language Group
Sender: Lojban list
From: Logical Language Group
Subject: Character frequencies for Lojban -- a first cut
X-To: lojban@cuvmb.cc.columbia.edu
To: Erik Rauch

Some time ago (back in JL9:33-34), lojbab generated a list of static
letter frequencies for Lojban: how often each letter a-z and ' occurs in:

  1) the gismu and cmavo lists;
  2) those plus a rough guess at what a lujvo list would look like
     (at that time, we didn't have one).

Of course, this data totally ignored the fact that some words occur more
often than others, so it was suitable for making a Lojban Scrabble set,
but not for Lojban cryptanalysis.

Well, I took 20,000 words of Lojban I had on my PC, very carefully
excluded all English stuff, folded case (upper case is so marginal in
Lojban it's not worth treating as separate), and stripped everything
except a-z and ' (the . character is really optional, though strongly
recommended, and some writers don't use it). Then I could generate a
first cut at dynamic frequencies of characters based on actual running
text.

Some of the text is not fully grammatical, but it's probably all
"lexically sound", which is all that really matters. I tried to make
sure that multiple versions of the same text weren't included, to avoid
biases.
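For anyone who wants to repeat the count on their own text, the procedure described above (fold case, keep only a-z and ', tally) might be sketched in Python roughly as follows. This is not the program that produced the numbers in this message. The function names are made up for illustration, the per-mille scaling is only a guess at the units of the table, and the alphabetical tie-break in the rank order is an arbitrary choice.

```python
from collections import Counter

# Characters that are counted: a-z plus the apostrophe.
# Everything else (digits, spaces, the optional "." and so on) is dropped.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz'")

def letter_frequencies(text):
    """Return a dict mapping each kept character to its rounded per-mille share.

    Assumption: frequencies are scaled to parts per thousand; the original
    message does not state its units.
    """
    kept = [ch for ch in text.lower() if ch in ALPHABET]  # fold case, then filter
    total = len(kept)
    if total == 0:
        return {}
    counts = Counter(kept)
    return {ch: round(1000 * n / total) for ch, n in counts.items()}

def rank_order(freqs):
    """Return the characters as one string, most frequent first.

    Assumption: ties are broken alphabetically (with ' sorting before a).
    """
    return "".join(sorted(freqs, key=lambda ch: (-freqs[ch], ch)))

if __name__ == "__main__":
    sample = "coi .i mi'e la DJAN."
    freqs = letter_frequencies(sample)
    print(rank_order(freqs))
```

Running the same two functions over a word list instead of running text would give the static counterpart of these figures.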
Here are the results, plus lojbab's old data:

                   static     static
letter   dynamic   no-lujvo   with-lujvo

  '        045       037        028
  a        105       118        084
  b        021       025        024
  c        042       043        029
  d        023       026        024
  e        095       059        044
  f        013       017        017
  g        014       017        016
  i        132       124        076
  j        017       029        028
  k        033       034        031
  l        073       041        039
  m        032       030        029
  n        055       067        058
  o        057       047        029
  p        022       024        024
  r        039       054        084
  s        037       040        038
  t        026       043        038
  u        076       076        050
  v        010       014        013
  x        008       012        015
  y        004       002        158
  z        009       010        010

Here are the three different rank orders:

  dynamic:     iaeul on'cr skmtd pbjgf vzxy
  no-lujvo:    iaune rotcl s'kmj dbpfg vxzy
  with-lujvo:  yarin uelts kcmoj 'pbdf gxvz

As you can see, the dynamic rank-ordering agrees fairly well with the
no-lujvo static rank-ordering, especially at the top and the bottom.
The with-lujvo rank-ordering puts "y" at the top, which reflects the
fact that the "lujvo-list" used to build it contained mostly proposals
that had never been used, many of them dating back to pre-Lojban days.
But otherwise it too is fairly sane.

As I said in the Subject header, all this is a first cut. We will need
our 50,000-word dictionary for honest static frequencies, and maybe
500,000 words of running text for honest dynamic frequencies. Watch
this space. :-)

--
John Cowan    sharing account for now
e'osai ko sarji la lojban.