
[lojban] N-grams of Lojban corpus



For various reasons we may need N-gram statistics from the Lojban corpus.

Not that it's hard to generate such stats.
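
To illustrate how little is involved, here is a minimal sketch in Python. It assumes the corpus is already a list of cleaned lines and that splitting on whitespace is good enough as tokenization; both are assumptions, and the function name count_ngrams is made up for this example.

```python
from collections import Counter

def count_ngrams(lines, n):
    """Count word N-grams; an N-gram never spans two lines."""
    counts = Counter()
    for line in lines:
        # Assumption: whitespace splitting is adequate tokenization.
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Tiny made-up sample, not taken from the real logs.
sample = ["coi ro do", "coi ro do mi klama", "mi klama le zarci"]
bigrams = count_ngrams(sample, 2)
for gram, freq in bigrams.most_common(3):
    print(" ".join(gram), freq)
```

Counting per line keeps an N-gram from straddling two unrelated messages, which seems preferable for chat logs.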

But we first need to preprocess our IRC logs:
http://www.lojban.org/irclogs/irclogs.zip

Messages from "mensi" and "livla" definitely must be removed.

Anything else?

I'd like to eventually develop an algorithm for preprocessing this log.
Any help is welcome.
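
As a possible starting point, here is a rough Python sketch of the nick filtering step. The line format it expects (an optional timestamp followed by "<nick> message") is only an assumption, since I have not described the actual layout of the files in irclogs.zip; the regular expression, the function name extract_messages, and the sample nicks are all hypothetical.

```python
import re

# Assumed line shape: optional "HH:MM" or "[HH:MM:SS]" timestamp,
# then "<nick> message". The real log format may differ.
LINE_RE = re.compile(
    r"^(?:\[?\d{1,2}:\d{2}(?::\d{2})?\]?\s+)?<\s*([^>\s]+)\s*>\s?(.*)$"
)

# Nicks whose messages must be dropped.
EXCLUDED_NICKS = {"mensi", "livla"}

def extract_messages(lines, excluded=EXCLUDED_NICKS):
    """Return message texts only, skipping excluded nicks and non-message lines."""
    kept = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # joins, parts, topic changes, or unrecognized lines
        nick, text = match.groups()
        nick = nick.lstrip("@+%")  # drop channel-mode prefixes, if present
        if nick.lower() in excluded:
            continue
        if text:
            kept.append(text)
    return kept

# Made-up sample lines in the assumed format.
sample_log = [
    "12:01 <alice> coi ro do",
    "12:02 <mensi> message to be dropped",
    "12:03 -!- someone has joined the channel",
    "[12:04:10] <@bob> mi klama le zarci",
    "12:05 <livla> message to be dropped",
]
messages = extract_messages(sample_log)
print(messages)
```

Anything else that needs removing (non-Lojban messages, URLs, and so on) could be added as further filters in the same loop.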


I started adding different lists of N-grams here: https://mw.lojban.org/papri/N-grams_of_Lojban_corpus

But spreadsheets might be needed instead, since the lists can be long.

PS. If you wonder where N-grams might be needed, the immediate application is to collect the most frequent phrases in Lojban and make a phrasebook out of them.
