From: Robin Lee Powell
Reply-To: rlpowell@digitalkingdom.org
To: lojban@yahoogroups.com
Date: Sat, 4 Dec 2004 10:46:29 -0800
Subject: [lojban] Updated Letter Frequency Data

I've just generated new letter frequency data based on all but the
first section of:

http://www.teddyb.org/~rlpowell/hobbies/lojban/grammar/test_sentences.txt

So basically, the cLL, Alice, and a bunch of IRC.
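
For anyone who wants to reproduce this kind of count on their own text, here is a minimal sketch in Python. It is not the script used for the numbers in this post. The function name letter_frequencies, the choice to lowercase the input, and the decision to count only the 23 Lojban letters plus the apostrophe are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: count the 23 Lojban letters plus the apostrophe. Everything
# else (spaces, digits, punctuation, h/q/w) is ignored.
LOJBAN_SYMBOLS = set("abcdefgijklmnoprstuvxyz'")


def letter_frequencies(text):
    """Return (symbol, count, ratio) tuples, most frequent first."""
    # Assumption: fold case, so stress-marking capitals count as the
    # same letter as their lowercase form.
    counts = Counter(ch for ch in text.lower() if ch in LOJBAN_SYMBOLS)
    total = sum(counts.values())
    # An empty input yields an empty list, so there is no division by zero.
    return [(sym, n, n / total) for sym, n in counts.most_common()]


if __name__ == "__main__":
    sample = "coi rodo mi'e la lojban"
    for sym, n, ratio in letter_frequencies(sample):
        print(f"{n:6d}  {ratio:.6f}  {sym}")
```

Each ratio is simply a symbol's count divided by the total number of counted symbols, so the ratios sum to 1. Skipping the first section of the test file is left out here, since how to do that depends on the file's layout.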

If people would like to suggest other non-trivially sized Lojban texts
to add, please let me know, but we've got ~650K characters here, so I
think the statistics are pretty good.

My data, sorted by number of occurrences:

 85004 i
 68959 a
 52225 e
 50517 u
 47944 o
 43807 l
 36358 n
 33169 c
 27097 m
 24514 r
 22989 s
 21356 d
 20536 '
 18317 t
 17749 k
 14459 b
 13359 p
 11990 j
  8810 g
  8007 z
  6857 v
  6616 x
  6288 f
  4580 y

As ratios:

0.130472888242183 i
0.105845370809523 a
0.080160305261493 e
0.077538691065483 u
0.073589385839292 o
0.067239492438300 l
0.055806000549495 n
0.050911195121464 c
0.041591264560472 m
0.037626610305031 r
0.035285883344307 s
0.032779386867677 d
0.031520766469124 '
0.028114816878406 t
0.027242992016969 k
0.022193161393507 b
0.020504768175936 p
0.018403486071523 j
0.013522494769818 g
0.012289967720991 z
0.010524829357167 v
0.010154917752226 x
0.009651469592805 f
0.007029855396795 y

The only previous work on this I'm aware of is:

http://www.lojban.org/files/papers/scrabble.unf

which, it turns out, is amazingly flawed (which is fine, because that
was a long time ago!).

Using the data without lujvo, we have:

i 1045
a  991
u  642
n  563
e  496
r  460
o  395
t  361
c  360
l  348
s  339
'  316
k  285
m  254
j  249
d  219
b  212
p  203
f  149
g  146
v  119
x  108
z   87
y   19

which is only marginally different from what I have.

Using the data with lujvo, however, which IIRC is what the Scrabble
frequencies were based on, we have the obviously biased:

y 5553
r 2979
a 2949
i 2678
n 2047
u 1755
e 1560
l 1395
s 1363
t 1359
k 1107
m 1048
o 1046
c 1040
' 1012
j 1008
p  872
b  865
d  862
f  616
g  589
x  532
v  490
z  359

-Robin

-- 
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/
Reason #237 To Learn Lojban: "Homonyms: Their Grate!"
Proud Supporter of the Singularity Institute - http://singinst.org/