
Re: [lojban] Request for a full frequency list of all lojbanic words for an Android app.



By corpus, I mean the collection of texts found at http://corpus.lojban.org/ . At the top of that page there is a link to download them all as one text file. I took that file, ran a word-frequency counter over it, then filtered out all the non-Lojban words using cmafi'e (the cmafihe command, available in this Arch package: https://aur.archlinux.org/packages/jbofihe-git/ ; thank you, zorun).

I had a quick look and spotted only one English word in the first 1000: "kinda". There are also some nonsense words, such as "tene". Getting rid of these sorts of things would be much more time-consuming, but if you want me to try something specific, let me know.
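
One low-effort way to drop stray entries like "kinda" and "tene" is a hand-maintained exclusion list. The following is only a sketch of that idea, not something that was run on the corpus; the function name drop_listed and the file names exclude.txt and cleaned_freq.txt are hypothetical. It assumes the frequency list has "<count> <word>" lines as produced by uniq -c, and that the exclusion file has one word per line.

```bash
#!/bin/bash
# drop_listed EXCLUDE_FILE FREQ_FILE
# Prints every line of FREQ_FILE whose word (second field) is not in EXCLUDE_FILE.
# Comparing FILENAME with ARGV[1] keeps this correct even if EXCLUDE_FILE is empty.
drop_listed() {
    awk '
        FILENAME == ARGV[1] { skip[$1] = 1; next }  # first file: words to drop
        !($2 in skip)                               # second file: keep the rest
    ' "$1" "$2"
}
```

It would be used as: drop_listed exclude.txt filtered_freq.txt > cleaned_freq.txt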

btw, the two scripts needed are

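#!/bin/bash
### word_count
# Split stdin into one word per line (letters and apostrophes only), lowercase it,
# count duplicates, then sort by descending count and by word within each count.
tr -cs "A-Za-z'" '\n' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -k1,1nr -k2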

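#!/bin/zsh
##### filter_lojban
# Reads "<count> <word>" lines from the file given as $1 and prints only those
# for which cmafihe produces output that is not tagged CMENE.
while read -r i
do
    # Strip the leading count, then pass the bare word to cmafihe.
    j=$(echo "$i" | sed 's/^[0-9]* //' | cmafihe 2> /dev/null | grep -v -e "CMENE")
    if [[ -n $j ]]; then
        echo "$i"
    fi
done < "$1"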

Then run them like this, assuming corpus.txt is your source and both scripts are executable and on your PATH:
word_count < corpus.txt > freq.txt
filter_lojban freq.txt > filtered_freq.txt
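
To make the intermediate format concrete, here is a small sketch of the counting step on a made-up sentence (the sample text is invented, not taken from the corpus). It wraps the same pipeline in a function; forcing LC_ALL=C on the sorts is an addition here so the ordering does not depend on the local locale.

```bash
#!/bin/bash
# Same pipeline as word_count, wrapped in a function so it can be tried inline.
# LC_ALL=C is an assumption added for reproducible sort order.
word_count() {
    tr -cs "A-Za-z'" '\n' | tr 'A-Z' 'a-z' | LC_ALL=C sort | uniq -c \
        | LC_ALL=C sort -k1,1nr -k2
}

# Made-up sample input.
printf '%s\n' "mi klama le zarci .i mi prami do" | word_count
```

The first line of output is the count 2 followed by "mi"; the remaining six words each have count 1 and appear in alphabetical order. The padding in front of the count depends on the uniq implementation, which is why the filter script strips the count before handing the word to cmafihe.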

-- Ross

On 16 April 2013 20:02, la gleki <gleki.is.my.name@gmail.com> wrote:


On Tuesday, April 16, 2013 1:34:16 PM UTC+4, Ross Ogilvie wrote:
Okay, I filtered my previous frequency list of Lojban words, removing all cmene and non-Lojban words, then manually picked out some authors' names that are brivla.


What do you mean by corpus? The IRC log saves only parsable sentences, but I can still see many English words. What is the source of this corpus?

Also, I think we can trim the list to just the first 5000 words/clusters. The rest can be added manually from jbovlaste.
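
Since the list is already sorted by descending count, trimming it is a one-liner, and it is easy to check how much of the corpus a given cut-off covers. A minimal sketch, assuming "<count> <word>" lines as produced by uniq -c; the function names top_n and coverage and the output file top5000.txt are hypothetical.

```bash
#!/bin/bash
# top_n N FILE: keep the first N lines of a list sorted by descending count.
top_n() {
    head -n "$1" "$2"
}

# coverage N FILE: percentage of all counted occurrences that fall within
# the first N lines of FILE.
coverage() {
    awk -v n="$1" '
        { total += $1; if (NR <= n) top += $1 }
        END { if (total > 0) printf "%.1f\n", 100 * top / total }
    ' "$2"
}
```

For example: top_n 5000 filtered_freq.txt > top5000.txt, and coverage 5000 filtered_freq.txt to see what share of word occurrences those 5000 entries account for.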

 
Please find attached.

-- Ross

On 16 April 2013 19:09, Robin Lee Powell <rlpo...@digitalkingdom.org> wrote:
On Tue, Apr 16, 2013 at 12:36:01AM -0700, la gleki wrote:
>
>
> On Monday, April 15, 2013 11:51:25 PM UTC+4, Robin Powell wrote:
> >
> > On Fri, Apr 12, 2013 at 07:57:17AM -0700, la gleki wrote:
> > > Peeps, I need your help.
> > > We are going to have a Swype/Swipe feature for the MultiLing Android
> > > keyboard. I need a list of all Lojbanic words plus the frequency of each.
> > > I know of a gismu frequency list, but it seems that not all gismu are
> > > there (fewer than 1342). What about cmavo, fu'ivla?
> > >
> > > Of course, the rarest words can be given the lowest rating, but what are
> > > the most frequent words?
> > > Can we rerun the algorithm to count all the occurrences of all words?
> >
> >
> > http://users.digitalkingdom.org/~rlpowell/hobbies/lojban/flashcards/?C=M;O=D
> > -- the _freq lists should have everything.
> >
> > It should be pretty easy to regenerate this stuff with the latest
> > from http://corpus.lojban.org/ , but I am (as usual) not
> > volunteering.
> >
>
> Is there a script that can generate such lists?

The scripts I used are in that same directory; not sure what's what
at this point, though.

-Robin

--
You received this message because you are subscribed to the Google Groups "lojban" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lojban+un...@googlegroups.com.
To post to this group, send email to loj...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "lojban" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lojban+unsubscribe@googlegroups.com.
To post to this group, send email to lojban@googlegroups.com.
Visit this group at http://groups.google.com/group/lojban?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 
