From lojban-out@lojban.org Sat Dec 04 19:07:06 2004
Date: Sat, 4 Dec 2004 19:06:22 -0800
Message-ID: <20041205030622.GW25791@chain.digitalkingdom.org>
Mail-Followup-To: lojban-list@lojban.org
References: <20041204184629.GU25791@chain.digitalkingdom.org> <20041204234414.GC6154@skunk.reutershealth.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20041204234414.GC6154@skunk.reutershealth.com>
User-Agent: Mutt/1.5.6+20040722i
X-archive-position: 9066
X-ecartis-version: Ecartis v1.0.0
Sender: lojban-list-bounce@lojban.org
Errors-to: lojban-list-bounce@lojban.org
X-original-sender: rlpowell@digitalkingdom.org
X-list: lojban-list
To: lojban@yahoogroups.com
X-eGroups-Remote-IP: 64.81.49.134
X-eGroups-From: Robin Lee Powell <rlpowell@digitalkingdom.org>
From: Robin Lee Powell <lojban-out@lojban.org>
Reply-To: rlpowell@digitalkingdom.org
Subject: [lojban] Re: Updated Letter Frequency Data

On Sat, Dec 04, 2004 at 06:44:14PM -0500, John Cowan wrote:
> Robin Lee Powell scripsit:
> 
> > My data, sorted by number of occurrences:
> 
> [snip]
> 
> > The only previous work on this I'm aware of is:
> > 
> > http://www.lojban.org/files/papers/scrabble.unf
> > 
> > Which, it turns out, is amazingly flawed (which is fine, because
> > that was a long time ago!).
> 
> The two sets of statistics aren't comparable, because the Scrabble
> data counts each distinct word only once, which is appropriate for
> Scrabble.  Your data (I assume) counts every letter in the running
> text.

I don't see how that's appropriate for Scrabble, actually, but I can
trivially rerun my count so that each distinct word is counted once:

grep -v '^#' test_sentences.txt \
  | sed 's/ -- .*//' \
  | tr -d -c "aeiouybcdfgjklmnprstvxz' .A-Z" \
  | tr ' .' '\n\n' \
  | sort -u \
  | tr -d -c "aeiouybcdfgjklmnprstvxz'" \
  | sed 's/\(.\)/\1\n/g' \
  | sort | uniq -c | sort -rn

Gives:

  21732 i
  17703 a
  14387 o
  11890 e
  10319 u
   9585 n
   8434 c
   8011 r
   7560 l
   7084 s
   6816 m
   5780 '
   5496 t
   5144 d
   4290 k
   3870 b
   3453 p
   3124 j
   2720 g
   2032 x
   2010 v
   1915 z
   1749 y
   1632 f

Which is within spitting distance of my previous result.
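
For anyone who wants to compare the two counting methods side by side,
here is a minimal sketch. It is not the pipeline I ran above. The
function name letter_counts and its "types"/"tokens" argument are made
up for illustration. It assumes plain text on stdin with words
separated by spaces, periods or newlines, and it counts only the
lowercase Lojban letters plus the apostrophe. "types" counts each
distinct word once (the Scrabble way); "tokens" counts every letter in
the running text.

```shell
# Hypothetical helper, not from the original post.
# Usage: letter_counts types|tokens < text
# Prints "count letter" lines, highest count first.
letter_counts() {
    mode=$1
    letters="aeiouybcdfgjklmnprstvxz'"
    # 1. One word per line: spaces, periods and newlines separate words.
    # 2. Drop every character that is not a Lojban letter or a newline.
    # 3. In "types" mode keep each distinct word once; otherwise keep all.
    # 4. Split into one character per line and tally.
    tr -s ' .\n' '\n\n\n' |
    tr -d -c "$letters\n" |
    grep -v '^$' |
    if [ "$mode" = types ]; then LC_ALL=C sort -u; else cat; fi |
    tr -d '\n' |
    fold -w 1 |
    LC_ALL=C sort |
    uniq -c |
    awk '{ print $1, $2 }' |
    LC_ALL=C sort -k1,1nr -k2,2
}
```

For example:

    letter_counts tokens < test_sentences.txt
    letter_counts types < test_sentences.txt

The fold -w 1 step does the same job as the sed 's/\(.\)/\1\n/g' in my
one-liner, without relying on sed understanding \n in the replacement.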

-Robin

-- 
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/
Reason #237 To Learn Lojban: "Homonyms: Their Grate!"
Proud Supporter of the Singularity Institute - http://singinst.org/