Received: from mail-ye0-f186.google.com ([209.85.213.186]:59291) by stodi.digitalkingdom.org with esmtps (TLSv1:RC4-SHA:128) (Exim 4.76) (envelope-from ) id 1US3aR-0008UF-Ru for lojban-list-archive@lojban.org; Tue, 16 Apr 2013 03:58:22 -0700 Received: by mail-ye0-f186.google.com with SMTP id m9sf164696yen.13 for ; Tue, 16 Apr 2013 03:57:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlegroups.com; s=20120806; h=x-received:x-beenthere:x-received:date:from:to:message-id :in-reply-to:references:subject:mime-version:x-original-sender :reply-to:precedence:mailing-list:list-id:x-google-group-id :list-post:list-help:list-archive:sender:list-subscribe :list-unsubscribe:content-type; bh=R1SgGEysJOlIla1y6dVvPZseLJAFG6/5VB773LFG4pc=; b=nZs1WuBi1jgJA4o66BRGTqL+9RXJkvu/nmq9sPcVGOtMtPvegutQ1LRJDTeGge8pC5 o6UcSeZFyb95AtCwOPCIWUEhZ/iH2Y3REfrb+P9XRblmpAG6UilYw0LyzUo3SgTR3tXR 5BBx6Aih5DrJbF1lS/UTJ/fxVVgsstQ5XadDZtt5t25cUgUdRR7v4h14Ch+SXLhyhTNV XkWG/PYlqKP/OKQIm0yhEPhckrRM3Fhe66jQo2Mlsk+WbNZXmpM7gy/pLTwk8B2irRlE tZouGQNpCsmRvLDHHml95unW0GK7mvRU2olftxH6tnbFoeX0JhWUmWnD0eqkJDSy8GgO bSrQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=x-received:x-beenthere:x-received:date:from:to:message-id :in-reply-to:references:subject:mime-version:x-original-sender :reply-to:precedence:mailing-list:list-id:x-google-group-id :list-post:list-help:list-archive:sender:list-subscribe :list-unsubscribe:content-type; bh=R1SgGEysJOlIla1y6dVvPZseLJAFG6/5VB773LFG4pc=; b=QS6Pgo8s/j3hcllvDsC6LNWA13VwTRcxI+yDszeyuuFG8HRAdsqd2IPGg5DeB7bycq 1QPuF3N/5ZxtW36S21j7/KuPbo44rV2EHo6oEcIxkKcOXlTz52F5N2RlUel4eE4LUv0v Ah4ird2XoErRSbix3xwQxyBmJDczEPeoaceJptDcge4R+LXN1xMhgJ76DLoqL/T3+1vg WZGTeb0dmblrfySJMgHVok2pnAUqeaNGpapg4VWrNCkQ3HOMemOqpXdd7ebokKCMOQTC lP8QNjCg5iHzMcESec5EDBnQ2ZT/sOf13VYhxdG/cyyh8fLxCYcnI1q75C72C+SqIm7e yMWQ== X-Received: by 10.49.72.225 with SMTP id g1mr82136qev.36.1366109876287; Tue, 16 Apr 2013 03:57:56 -0700 (PDT) X-BeenThere: 
lojban@googlegroups.com Received: by 10.49.130.74 with SMTP id oc10ls254257qeb.53.gmail; Tue, 16 Apr 2013 03:57:55 -0700 (PDT) X-Received: by 10.49.121.200 with SMTP id lm8mr84124qeb.5.1366109875482; Tue, 16 Apr 2013 03:57:55 -0700 (PDT) Date: Tue, 16 Apr 2013 03:57:54 -0700 (PDT) From: la gleki To: lojban@googlegroups.com Message-Id: <7702eef1-c428-4296-a009-81221e21c34c@googlegroups.com> In-Reply-To: References: <20130415195125.GB11548@stodi.digitalkingdom.org> <20130416090920.GA18465@stodi.digitalkingdom.org> <7e6e66ac-0d51-4d9d-a49f-6e96741629dc@googlegroups.com> Subject: Re: [lojban] Request for a full frequency list of all lojbanic words for an Android app. MIME-Version: 1.0 X-Original-Sender: gleki.is.my.name@gmail.com Reply-To: lojban@googlegroups.com Precedence: list Mailing-list: list lojban@googlegroups.com; contact lojban+owners@googlegroups.com List-ID: X-Google-Group-Id: 1004133512417 List-Post: , List-Help: , List-Archive: Sender: lojban@googlegroups.com List-Subscribe: , List-Unsubscribe: , Content-Type: multipart/alternative; boundary="----=_Part_1313_30222024.1366109874921" X-Spam-Score: -0.1 (/) X-Spam_score: -0.1 X-Spam_score_int: 0 X-Spam_bar: / ------=_Part_1313_30222024.1366109874921 Content-Type: text/plain; charset=ISO-8859-1 On Tuesday, April 16, 2013 2:27:06 PM UTC+4, Ross Ogilvie wrote: > > By corpus, I mean the collection of texts found here > http://corpus.lojban.org/ At the top of the page there is a link to > download them all in one text file. I took that document, ran a word > frequency sorter on it, then filtered out all the non-lojban words using > cmafi'e (available in this arch package > https://aur.archlinux.org/packages/jbofihe-git/ , thank you zorun). > > I had a quick look and only spotted one english word in the first 1000: > kinda. "eimi" is among the first 30 words. It's a name. {kinda} can also be a joke gismu, though. And there are some nonsense words like tene. 
Getting rid of these sorts of > things would be much more time consuming. But if you want me to try > something specific, let me know. > I'll run another script on irc log. > > btw, the two scripts needed are > > #!/bin/bash > ### word_count > tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2 > > #!/bin/zsh > ##### filter_lojban > while read i > do > j=$(echo $i | sed 's/[0-9]* //'| cmafihe 2> /dev/null | grep -v -e "CMENE" > ) > if [[ -n $j ]]; then > echo $i > fi > done < $1 > > And then run them like this, assuming corpus.txt is your source > word_count < corpus.txt > freq.txt > filter_lojban freq.txt > filtered_freq.txt > > -- Ross > > On 16 April 2013 20:02, la gleki >wrote: > >> >> >> On Tuesday, April 16, 2013 1:34:16 PM UTC+4, Ross Ogilvie wrote: >>> >>> Okay, I filtered my previous frequency list of lojban words, removing >>> all cmene and non lojban words, then manually picked out some author's >>> names that are brivla. >>> >>> >> What do you mean by corpus? irc log saves only parsable sentences. But i >> still can see many english words. What is the source of this corpus? >> >> Also i think that we can trim the list to only first 5000 words/clusters. >> The rest can be added manually from jbovlaste. >> >> >> >>> Please find attached. >>> >>> -- Ross >>> >>> On 16 April 2013 19:09, Robin Lee Powell wrote: >>> >>>> On Tue, Apr 16, 2013 at 12:36:01AM -0700, la gleki wrote: >>>> > >>>> > >>>> > On Monday, April 15, 2013 11:51:25 PM UTC+4, Robin Powell wrote: >>>> > > >>>> > > On Fri, Apr 12, 2013 at 07:57:17AM -0700, la gleki wrote: >>>> > > > peeps, i need ur help. >>>> > > > we are gonna have Swype/Swipe feature for MultiLing android >>>> keyboard. I >>>> > > > need a list of all lojbanic words + frequency of each. >>>> > > > i know of a gismu frequency list. But it seems that not all gismu >>>> are >>>> > > there >>>> > > > (less than 1342). What about cmavo, fu'ivla? 
>>>> > > > >>>> > > > Of course, most rare words can be given the lowest rating but >>>> what are >>>> > > the >>>> > > > most frequent words? >>>> > > > Can we rerun the algorithm to count all the occurrencies of all >>>> words? >>>> > > >>>> > > >>>> > > http://users.digitalkingdom.**org/~rlpowell/hobbies/lojban/** >>>> flashcards/?C=M;O=D >>>> > > -- the _freq lists should have everything. >>>> > > >>>> > > It should be pretty easy to regenerate this stuff with the latest >>>> > > from http://corpus.lojban.org/ , but I am (as usual) not >>>> > > volunteering. >>>> > > >>>> > >>>> > Is there a script that can generate such lists? >>>> >>>> The scripts I used are in that same directory; not sure what's what >>>> at this point, though. >>>> >>>> -Robin >>>> >>>> -- >>>> You received this message because you are subscribed to the Google >>>> Groups "lojban" group. >>>> To unsubscribe from this group and stop receiving emails from it, send >>>> an email to lojban+un...@**googlegroups.com. >>>> To post to this group, send email to loj...@googlegroups.com. >>>> >>>> Visit this group at http://groups.google.com/**group/lojban?hl=en >>>> . >>>> For more options, visit https://groups.google.com/**groups/opt_out >>>> . >>>> >>>> >>>> >>> -- >> You received this message because you are subscribed to the Google Groups >> "lojban" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to lojban+un...@googlegroups.com . >> To post to this group, send email to loj...@googlegroups.com >> . >> Visit this group at http://groups.google.com/group/lojban?hl=en. >> For more options, visit https://groups.google.com/groups/opt_out. >> >> >> > > -- You received this message because you are subscribed to the Google Groups "lojban" group. To unsubscribe from this group and stop receiving emails from it, send an email to lojban+unsubscribe@googlegroups.com. To post to this group, send email to lojban@googlegroups.com. 
Visit this group at http://groups.google.com/group/lojban?hl=en. For more options, visit https://groups.google.com/groups/opt_out. ------=_Part_1313_30222024.1366109874921 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

On Tuesday, April 16, 2013 2:27:06 PM UTC+4, Ross Ogilvie wrote:
By corpus, I mean the collection of texts found here http://corpus.lojban.org/ At the top of the page there is a link to download them all in one text file. I took that document, ran a word frequency sorter on it, then filtered out all the non-lojban words using cmafi'e (available in this arch package https://aur.archlinux.org/packages/jbofihe-git/ , thank you zorun).

I had a quick look and only spotted one English word in the first 1000: kinda.

"eimi" is among the first 30 words. It's a name. {kinda} can also be a joke gismu, though.

And there are some nonsense words like tene. Getting rid of these sorts of things would be much more time-consuming. But if you want me to try something specific, let me know.

I'll run another script on the IRC log.

btw, the two scripts needed are

#!/bin/bash
### word_count
# Split stdin into one word per line, lowercase it, then count and rank by frequency.
tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr -k2

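#!/bin/zsh
##### filter_lojban
# Keep a line only if cmafihe prints something for its word that is not tagged CMENE.
while read -r i
do
    # Strip the leading count, then classify the bare word; errors are discarded.
    j=$(echo "$i" | sed 's/[0-9]* //' | cmafihe 2> /dev/null | grep -v -e "CMENE")
    if [[ -n $j ]]; then
        echo "$i"
    fi
done < "$1"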

And then run them like this, assuming corpus.txt is your source
word_count < corpus.txt > freq.txt
filter_lojban freq.txt > filtered_freq.txt
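
As a rough illustration of what the word_count step produces, here is a minimal sketch that wraps the same pipeline in a shell function and runs it on a two-line sample. The function name follows the word_count script; the sample text and the LC_ALL=C setting (used only to make the sort order predictable) are assumptions, not part of the original scripts.

```shell
#!/bin/bash
# Sketch only: same pipeline as the word_count script, wrapped in a function.
# LC_ALL=C is an assumption added here so that sorting does not depend on the locale.
word_count() {
    LC_ALL=C tr -cs "A-Za-z'" '\n' \
        | LC_ALL=C tr 'A-Z' 'a-z' \
        | LC_ALL=C sort \
        | uniq -c \
        | LC_ALL=C sort -k1,1nr -k2
}

# Made-up sample input (not taken from the corpus).
sample='mi klama le zarci
mi prami do'

# Print "count word" lines, most frequent first.
printf '%s\n' "$sample" | word_count
```

On this sample the line for mi (count 2) comes first, followed by the words that occur once in alphabetical order. The width of the count column depends on the uniq implementation, so anything that parses freq.txt should split on whitespace rather than on fixed columns.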

-- Ross

Please find attached.

-- Ross

On Tue, Apr 16, 2013 at 12:36:01AM -0700, la gleki wrote:
>
>
> On Monday, April 15, 2013 11:51:25 PM UTC+4, Robin Powell wrote:
> >
> > On Fri, Apr 12, 2013 at 07:57:17AM -0700, la gleki wrote:
> > > peeps, i need ur help.
> > > we are gonna have Swype/Swipe feature for MultiLing android keyboard. I
> > > need a list of all lojbanic words + frequency of each.
> > > i know of a gismu frequency list. But it seems that not all gismu are there
> > > (less than 1342). What about cmavo, fu'ivla?
> > >
> > > Of course, most rare words can be given the lowest rating but what are the
> > > most frequent words?
> > > Can we rerun the algorithm to count all the occurrences of all words?
> >
> >
> > http://users.digitalkingdom.org/~rlpowell/hobbies/lojban/flashcards/?C=M;O=D
> > -- the _freq lists should have everything.
> >
> > It should be pretty easy to regenerate this stuff with the latest
> > from http://corpus.lojban.org/ , but I am (as usual) not
> > volunteering.
> >
>
> Is there a script that can generate such lists?

The scripts I used are in that same directory; not sure what's what at this point, though.

-Robin

--
You received this message because you are subscribed to the Google Groups "lojban" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lojban+un...@googlegroups.com.
To post to this group, send email to loj...@googlegroups.com.


------=_Part_1313_30222024.1366109874921--