From: Robin Lee Powell
Reply-To: rlpowell@digitalkingdom.org
To: lojban@yahoogroups.com
Date: Sat, 4 Dec 2004 19:06:22 -0800
Subject: [lojban] Re: Updated Letter Frequency Data
Message-ID: <20041205030622.GW25791@chain.digitalkingdom.org>
In-Reply-To: <20041204234414.GC6154@skunk.reutershealth.com>

On Sat, Dec 04, 2004 at 06:44:14PM -0500, John Cowan
wrote:
> Robin Lee Powell scripsit:
>
> > My data, sorted by number of occurrences:
>
> [snip]
>
> > The only previous work on this I'm aware of is:
> >
> > http://www.lojban.org/files/papers/scrabble.unf
> >
> > Which, it turns out, is amazingly flawed (which is fine, because
> > that was a long time ago!).
>
> The two sets of statistics aren't comparable, because the Scrabble
> data counts each distinct word only once, which is appropriate for
> Scrabble.  Your data (I assume) counts every letter in the running
> text.

I don't see how that's appropriate for Scrabble, actually, but I can
edit my data to work that way trivially:

    grep -v '^#' test_sentences.txt |
      sed 's/ -- .*//' |
      tr -d -c "aeiouybcdfgjklmnprstvxz' .A-Z" |
      tr ' .' '\n' |
      sort |
      uniq |
      tr -d -c "aeiouybcdfgjklmnprstvxz'" |
      sed 's/\(.\)/\1\n/g' |
      sort |
      uniq -c |
      sort -r

Gives:

    21732 i
    17703 a
    14387 o
    11890 e
    10319 u
     9585 n
     8434 c
     8011 r
     7560 l
     7084 s
     6816 m
     5780 '
     5496 t
     5144 d
     4290 k
     3870 b
     3453 p
     3124 j
     2720 g
     2032 x
     2010 v
     1915 z
     1749 y
     1632 f

Which is within spitting distance of identical to my previous result.

-Robin

-- 
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/
Reason #237 To Learn Lojban: "Homonyms: Their Grate!"
Proud Supporter of the Singularity Institute - http://singinst.org/
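
A minimal sketch of the two counting modes discussed above (every letter in the running text versus each distinct word counted once), assuming a POSIX shell. The function names count_running and count_distinct and the sample sentence are illustrative only and are not part of the original post. Unlike the pipeline above, this version keeps newlines as word separators.

```shell
# Same letter set as the pipeline in the message (hypothetical variable name).
LETTERS="aeiouybcdfgjklmnprstvxz'"

# Count every letter in the running text: repeated words count every time.
count_running() {
    # Keep only Lojban letters, put one character per line, then tally.
    tr -d -c "$LETTERS" | fold -w 1 | sort | uniq -c | sort -rn
}

# Count letters with each distinct word counted only once (Scrabble-style).
count_distinct() {
    # Split on spaces and dots, drop duplicate words, then tally letters.
    tr -s ' .' '\n\n' | sort -u | count_running
}

printf 'mi klama .i mi klama\n' | count_running
printf 'mi klama .i mi klama\n' | count_distinct
```

On this sample, "mi" and "klama" each appear twice, so the running count reports 4 for "m" while the distinct-word count reports 2.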