Message-Id: <199402032330.AA01378@eli.CS.YALE.EDU>
Date:         Thu, 3 Feb 1994 14:52:44 -0500
Reply-To: Logical Language Group <lojbab@ACCESS.DIGEX.NET>
Sender: Lojban list <LOJBAN%CUVMB.bitnet@YaleVM.YCC.YALE.EDU>
From: Logical Language Group <lojbab@ACCESS.DIGEX.NET>
Subject:      Character frequencies for Lojban -- a first cut
To: Erik Rauch <rauch>

Some time ago (back in JL9:33-34), lojbab generated a list of static
letter frequencies for Lojban, that is, how often each of the letters
a-z and ' occurs in (1) the gismu and cmavo lists, and (2) those lists
plus a rough guess at what a lujvo list would look like (at that time,
we didn't have one).

Of course, this data totally ignored the fact that some words occur
more often than others, so it was suitable for making a Lojban Scrabble set,
but not for Lojban cryptanalysis.
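
To make the static/dynamic distinction concrete, here is a minimal
sketch in Python.  The function names and the four-word sample are
hypothetical, chosen only for illustration; they are not lojbab's
actual procedure.

```python
from collections import Counter

def static_counts(words):
    """Count letters over the distinct words only (a word-list view)."""
    return Counter("".join(sorted(set(words))))

def dynamic_counts(words):
    """Count letters over every occurrence (a running-text view)."""
    return Counter("".join(words))

# Hypothetical sample: "le" occurs three times, "nanmu" once.
sample = ["le", "le", "le", "nanmu"]
static = static_counts(sample)    # "le" contributes its letters once
dynamic = dynamic_counts(sample)  # "le" contributes its letters three times
```

In the static view the common word "le" weighs no more than the rare
"nanmu"; in the dynamic view its letters are counted once per use.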

Well, I took 20,000 words of Lojban I had on my PC, very carefully
excluded all English stuff, folded case (upper case is so marginal in
Lojban it's not worth treating as separate), and stripped everything
except a-z and ' (the . character is really optional, though strongly
recommended, and some writers don't use it).

Then I could generate a first cut at dynamic frequencies of characters
based on actual running text.  Some of the text is not fully grammatical,
but it's probably all "lexically sound", which is all that really matters.
I tried to make sure that multiple versions of the same text weren't
included, to avoid biases.
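
The clean-up and counting steps described above can be sketched roughly
as follows.  This is an illustration in Python, not the program actually
used; the names clean and per_mille are made up, and the choice to round
down when scaling to parts per thousand is an assumption.

```python
from collections import Counter

# Characters that survive the clean-up: a-z and the apostrophe.
KEEP = set("abcdefghijklmnopqrstuvwxyz'")

def clean(text):
    """Fold case and strip everything except a-z and '."""
    return "".join(ch for ch in text.lower() if ch in KEEP)

def per_mille(text):
    """Return {char: occurrences per 1000 kept characters}.

    Assumption: integer division (rounding down); the message does not
    say how the published figures were rounded.
    """
    counts = Counter(clean(text))
    total = sum(counts.values())
    return {ch: (n * 1000) // total for ch, n in counts.items()}
```

Feeding the whole corpus to per_mille as a single string gives one
figure per character, comparable in form to the dynamic column.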

Here are the results, plus lojbab's old data.  All figures are per
thousand characters; h, q, and w are omitted because Lojban does not
use them:

                                        static          static
        letter          dynamic         no-lujvo        with-lujvo

        '               045             037             028
        a               105             118             084
        b               021             025             024
        c               042             043             029
        d               023             026             024
        e               095             059             044
        f               013             017             017
        g               014             017             016
        i               132             124             076
        j               017             029             028
        k               033             034             031
        l               073             041             039
        m               032             030             029
        n               055             067             058
        o               057             047             029
        p               022             024             024
        r               039             054             084
        s               037             040             038
        t               026             043             038
        u               076             076             050
        v               010             014             013
        x               008             012             015
        y               004             002             158
        z               009             010             010

Here are the three different rank orders:

        dynamic:        iaeul on'cr skmtd pbjgf vzxy
        no-lujvo:       iaune rotcl s'kmj dbpfg vxzy
        with-lujvo:     yarin uelts kcmoj 'pbdf gxvz
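
A rank order like these can be derived mechanically from a column of
the table.  The sketch below is a Python illustration using the dynamic
column; the function name rank_order is hypothetical.  How ties were
broken in the static columns (for example c and t, both 043 in the
no-lujvo column) is not stated, so this sketch simply keeps tied
letters in table order.

```python
# Dynamic column from the table, as plain integers.
DYNAMIC = {
    "'": 45, "a": 105, "b": 21, "c": 42, "d": 23, "e": 95,
    "f": 13, "g": 14, "i": 132, "j": 17, "k": 33, "l": 73,
    "m": 32, "n": 55, "o": 57, "p": 22, "r": 39, "s": 37,
    "t": 26, "u": 76, "v": 10, "x": 8, "y": 4, "z": 9,
}

def rank_order(freqs, group=5):
    """Sort letters by descending frequency and print them in groups."""
    # sorted() is stable, so tied letters keep their table order.
    ordered = sorted(freqs, key=lambda ch: -freqs[ch])
    letters = "".join(ordered)
    return " ".join(letters[i:i + group]
                    for i in range(0, len(letters), group))
```

The dynamic column has no ties, so rank_order(DYNAMIC) reproduces the
dynamic line given above.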

As you can see, the dynamic rank-ordering agrees fairly well with the
no-lujvo static rank-ordering, especially at the top and the bottom.
The with-lujvo rank-ordering puts "y" at the top.  Since "y" serves as
the hyphen that glues lujvo together, this reflects the fact that the
"lujvo-list" used to build it contained mostly proposals that had never
been used, many of them dating back to pre-Lojban days.
But otherwise it too is fairly sane.
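
One rough way to put a number on "agrees fairly well" is to ask how
many places each letter moves between two rank orders.  The Python
sketch below does this for the dynamic and no-lujvo orderings quoted
above; the measure and the name displacement are my own choice for
illustration, not something used to produce the original figures.

```python
# Rank orders copied from the message, with the grouping spaces removed.
DYNAMIC = "iaeul on'cr skmtd pbjgf vzxy".replace(" ", "")
NO_LUJVO = "iaune rotcl s'kmj dbpfg vxzy".replace(" ", "")

def displacement(order_a, order_b):
    """Return {letter: how many places it moves between the orders}."""
    pos_b = {ch: i for i, ch in enumerate(order_b)}
    return {ch: abs(i - pos_b[ch]) for i, ch in enumerate(order_a)}

shift = displacement(DYNAMIC, NO_LUJVO)
worst = max(shift, key=shift.get)  # letter that moves the farthest
```

On these two orderings the top two letters (i, a) and the bottom one
(y) do not move at all, and the largest shift is "t", which moves six
places.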

As I said in the Subject header, all this is a first cut.  We will need
our 50,000-word dictionary for honest static frequencies, and maybe
500,000 words of running text for honest dynamic frequencies.  Watch
this space.  :-)

--
John Cowan              sharing account <lojbab@access.digex.net> for now
                e'osai ko sarji la lojban.