Received: from mail-wi0-f186.google.com ([209.85.212.186]:64236)
    by stodi.digitalkingdom.org with esmtps (TLSv1:RC4-SHA:128)
    (Exim 4.76) (envelope-from )
    id 1US37A-0008GI-8n
    for lojban-list-archive@lojban.org; Tue, 16 Apr 2013 03:28:06 -0700
Received: by mail-wi0-f186.google.com with SMTP id hq12sf111868wib.3
    for ; Tue, 16 Apr 2013 03:27:45 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
    d=googlegroups.com; s=20120806;
    h=x-received:x-beenthere:x-received:received-spf:x-received
     :mime-version:x-originating-ip:in-reply-to:references:from:date
     :message-id:subject:to:x-original-sender
     :x-original-authentication-results:reply-to:precedence:mailing-list
     :list-id:x-google-group-id:list-post:list-help:list-archive:sender
     :list-subscribe:list-unsubscribe:content-type;
    bh=5LECQVcH7wvI8GGH7slEywNjHFKxQbhUKsXQonG2Pdk=;
    b=KDFiSAJHW3HZSNSVeHtf0AIQg+szkEeBVcdONpKba5sKKDazyu+GpNuDkIHlg3Ceoj
     HGj5qO+qrJfGPNg7LWxSZRF2kHcy5yBLsLJDQZwQM1F9YI1b6KnudX4aEk+S5FP1EB9+
     A9lM+M25A7/ldr5Lrw2RurKkqDhUZ0IPYTK883pKl5VA+tLKVn+IM+fkOHaChhi9PKtq
     9gHUvvN3XkI7IEpEEBFP+mEKTJlpROoyuNJVBoooRq7An2r1+Asq52WsCmC52pN2zkAQ
     KvQ5OzuExJrHPoQPEuQy6Mv8RctvQC+C6y3MYqKXv149XIZiZun0gCqhslRJuS7vOgby
     dqJg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
    d=google.com; s=20120113;
    h=x-received:x-beenthere:x-received:received-spf:x-received
     :mime-version:x-originating-ip:in-reply-to:references:from:date
     :message-id:subject:to:x-gm-message-state:x-original-sender
     :x-original-authentication-results:reply-to:precedence:mailing-list
     :list-id:x-google-group-id:list-post:list-help:list-archive:sender
     :list-subscribe:list-unsubscribe:content-type;
    bh=5LECQVcH7wvI8GGH7slEywNjHFKxQbhUKsXQonG2Pdk=;
    b=btIlZ0Mi2H9ZBjqMHmRVxcBIBWHPK9adK7MdIjqVyEBMJUogbVimUY/LVMm4BfyL+D
     OARahzqhowksK1PZOqZdlOxLeRNARtsVefn8wHMy62n+m/Llr648RBfYXvRbkLm3ugQs
     RWJJPrR0gxHX0Ee+OzyJOaBJeSyrsVN38sVP+AssPgFRwygY05Dw9UveiVDzieVHjAe8
     igYPsIyyHm2ip8F15WEu6E5k3OlDMU35IE24qBkP2cj6gbvD+ebOqZ1py9sXe3LdzRHN
     UsMxkOL9RZOQIwtI96pJiGbKxysAi50mUbqbXB8KJt6OwR3SYMtl7Z0DD0QUL8dnQwhV
     ojkw==
X-Received: by 10.180.76.108 with SMTP id j12mr90741wiw.3.1366108064789;
    Tue, 16 Apr 2013 03:27:44 -0700 (PDT)
X-BeenThere: lojban@googlegroups.com
Received: by 10.180.74.177 with SMTP id u17ls699648wiv.11.canary;
    Tue, 16 Apr 2013 03:27:39 -0700 (PDT)
X-Received: by 10.180.106.232 with SMTP id gx8mr546538wib.2.1366108059710;
    Tue, 16 Apr 2013 03:27:39 -0700 (PDT)
Received: from mail-ve0-f171.google.com (mail-ve0-f171.google.com [209.85.128.171])
    by gmr-mx.google.com with ESMTPS id fs5si62834wib.1.2013.04.16.03.27.37
    (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
    Tue, 16 Apr 2013 03:27:38 -0700 (PDT)
Received-SPF: neutral (google.com: 209.85.128.171 is neither permitted nor denied by best guess record for domain of ross@rossogilvie.id.au) client-ip=209.85.128.171;
Received: by mail-ve0-f171.google.com with SMTP id b10so261052vea.30
    for ; Tue, 16 Apr 2013 03:27:37 -0700 (PDT)
X-Received: by 10.52.230.197 with SMTP id ta5mr842857vdc.103.1366108056825;
    Tue, 16 Apr 2013 03:27:36 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.58.22.193 with HTTP; Tue, 16 Apr 2013 03:27:06 -0700 (PDT)
X-Originating-IP: [58.109.95.92]
In-Reply-To: <7e6e66ac-0d51-4d9d-a49f-6e96741629dc@googlegroups.com>
References: <20130415195125.GB11548@stodi.digitalkingdom.org>
    <20130416090920.GA18465@stodi.digitalkingdom.org>
    <7e6e66ac-0d51-4d9d-a49f-6e96741629dc@googlegroups.com>
From: Ross Ogilvie
Date: Tue, 16 Apr 2013 20:27:06 +1000
Message-ID:
Subject: Re: [lojban] Request for a full frequency list of all lojbanic words for an Android app.
To: lojban@googlegroups.com
X-Gm-Message-State: ALoCoQkatdflB+Rplilqo7jc06fhHgNUT4Wg5/1l1yFuNzWtTk0bXf79YYX3opqoX1f3mVOe5LrZ
X-Original-Sender: ross@rossogilvie.id.au
X-Original-Authentication-Results: gmr-mx.google.com; spf=neutral (google.com: 209.85.128.171 is neither permitted nor denied by best guess record for domain of ross@rossogilvie.id.au) smtp.mail=ross@rossogilvie.id.au
Reply-To: lojban@googlegroups.com
Precedence: list
Mailing-list: list lojban@googlegroups.com; contact lojban+owners@googlegroups.com
List-ID:
X-Google-Group-Id: 1004133512417
List-Post: ,
List-Help: ,
List-Archive:
Sender: lojban@googlegroups.com
List-Subscribe: ,
List-Unsubscribe: ,
Content-Type: multipart/alternative; boundary=089e0111d6b073bf7e04da77d14f
X-Spam-Score: 0.0 (/)
X-Spam_score: 0.0
X-Spam_score_int: 0
X-Spam_bar: /

--089e0111d6b073bf7e04da77d14f
Content-Type: text/plain; charset=ISO-8859-1

By corpus, I mean the collection of texts found at http://corpus.lojban.org/ . At the top of the page there is a link to download them all in one text file. I took that document, ran a word frequency sorter on it, then filtered out all the non-Lojban words using cmafi'e (available in this Arch package https://aur.archlinux.org/packages/jbofihe-git/ , thank you zorun).

I had a quick look and only spotted one English word in the first 1000: kinda. There are also some nonsense words like tene. Getting rid of these sorts of things would be much more time-consuming. But if you want me to try something specific, let me know.

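To make the counting step described above concrete, here is a minimal sketch that can be run without the corpus or cmafi'e installed. It assumes a POSIX shell with tr, sed, sort, uniq and awk; the function name count_words and the sample sentence are made up for illustration and are not part of the scripts I actually used. The filtering step is left out because it needs the jbofihe package.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): read text on stdin
# and print "count word" lines, most frequent first.
count_words() {
    # 1. Replace every run of characters that are not letters or
    #    apostrophes with a single newline, giving one word per line.
    # 2. Lowercase everything so "Mi" and "mi" are counted together.
    # 3. Drop empty lines, which appear when the input starts with
    #    punctuation or whitespace.
    # 4. Count identical lines, then sort by count (descending) and
    #    alphabetically within equal counts.
    # 5. Strip the padding that uniq -c puts in front of the count.
    tr -cs "A-Za-z'" '\n' |
        tr 'A-Z' 'a-z' |
        sed '/^$/d' |
        sort |
        uniq -c |
        sort -k1,1nr -k2 |
        awk '{print $1, $2}'
}

# Made-up sample text, only for illustration.
sample="mi klama le zarci .i mi prami do .i do klama"
printf '%s\n' "$sample" | count_words
```

The apostrophe is kept as a word character so that words like fu'ivla stay in one piece; everything else that is not a letter is treated as a separator.
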
btw, the two scripts needed are

#!/bin/bash
### word_count
# Split stdin into one word per line (letters and apostrophes only),
# lowercase it, then count and sort by descending frequency.
tr -cs "A-Za-z'" '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -k1,1nr -k2

#!/bin/zsh
##### filter_lojban
# Keep a "count word" line only if cmafihe prints something for its word
# that is not tagged CMENE.
while read -r i
do
    j=$(echo "$i" | sed 's/[0-9]* //' | cmafihe 2> /dev/null | grep -v -e "CMENE")
    if [[ -n $j ]]; then
        echo "$i"
    fi
done < "$1"

And then run them like this, assuming corpus.txt is your source

word_count < corpus.txt > freq.txt
filter_lojban freq.txt > filtered_freq.txt

-- Ross

On 16 April 2013 20:02, la gleki wrote:

>
>
> On Tuesday, April 16, 2013 1:34:16 PM UTC+4, Ross Ogilvie wrote:
>>
>> Okay, I filtered my previous frequency list of lojban words, removing all
>> cmene and non lojban words, then manually picked out some author's names
>> that are brivla.
>>
>>
> What do you mean by corpus? irc log saves only parsable sentences. But i
> still can see many english words. What is the source of this corpus?
>
> Also i think that we can trim the list to only first 5000 words/clusters.
> The rest can be added manually from jbovlaste.
>
>
>
>> Please find attached.
>>
>> -- Ross
>>
>> On 16 April 2013 19:09, Robin Lee Powell wrote:
>>
>>> On Tue, Apr 16, 2013 at 12:36:01AM -0700, la gleki wrote:
>>> >
>>> >
>>> > On Monday, April 15, 2013 11:51:25 PM UTC+4, Robin Powell wrote:
>>> > >
>>> > > On Fri, Apr 12, 2013 at 07:57:17AM -0700, la gleki wrote:
>>> > > > peeps, i need ur help.
>>> > > > we are gonna have Swype/Swipe feature for MultiLing android keyboard. I
>>> > > > need a list of all lojbanic words + frequency of each.
>>> > > > i know of a gismu frequency list. But it seems that not all gismu are there
>>> > > > (less than 1342). What about cmavo, fu'ivla?
>>> > > >
>>> > > > Of course, most rare words can be given the lowest rating but what are the
>>> > > > most frequent words?
>>> > > > Can we rerun the algorithm to count all the occurrences of all words?
>>> > >
>>> > >
>>> > > http://users.digitalkingdom.org/~rlpowell/hobbies/lojban/flashcards/?C=M;O=D
>>> > > -- the _freq lists should have everything.
>>> > >
>>> > > It should be pretty easy to regenerate this stuff with the latest
>>> > > from http://corpus.lojban.org/ , but I am (as usual) not
>>> > > volunteering.
>>> > >
>>> >
>>> > Is there a script that can generate such lists?
>>>
>>> The scripts I used are in that same directory; not sure what's what
>>> at this point, though.
>>>
>>> -Robin
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "lojban" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to lojban+un...@googlegroups.com.
>>> To post to this group, send email to loj...@googlegroups.com.
>>>
>>> Visit this group at http://groups.google.com/group/lojban?hl=en.
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>>
>>>
>> --
> You received this message because you are subscribed to the Google Groups
> "lojban" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to lojban+unsubscribe@googlegroups.com.
> To post to this group, send email to lojban@googlegroups.com.
> Visit this group at http://groups.google.com/group/lojban?hl=en.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>

--
You received this message because you are subscribed to the Google Groups "lojban" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lojban+unsubscribe@googlegroups.com.
To post to this group, send email to lojban@googlegroups.com.
Visit this group at http://groups.google.com/group/lojban?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

--089e0111d6b073bf7e04da77d14f--