[lojban] N-grams of Lojban corpus

To: "lojban@googlegroups.com" <lojban@googlegroups.com>

Subject: [lojban] N-grams of Lojban corpus

From: Gleki Arxokuna <gleki.is.my.name@gmail.com>

Date: Mon, 8 Feb 2016 16:35:14 +0300

For various reasons we may need N-gram statistics from the Lojban corpus.

Generating such stats is not hard in itself.

But we first need to preprocess the log of our history:

http://www.lojban.org/irclogs/irclogs.zip

Messages from "mensi" and "livla" must definitely be removed.

Anything else?

I'd like to eventually develop an algorithm for preprocessing this log.
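
As a possible starting point, here is a minimal sketch of the filtering step in Python. The log line format is an assumption (an optional timestamp followed by "<nick> text"); I have not checked it against the actual files in irclogs.zip, so LINE_RE will probably need adjusting. The names EXCLUDED_NICKS, LINE_RE and clean_log are made up for illustration.

```python
import re

# Nicks named in this message; extend the set as more are identified.
EXCLUDED_NICKS = {"mensi", "livla"}

# ASSUMED line format: optional timestamp, then "<nick> text".
# The real logs may differ, in which case this pattern must be changed.
LINE_RE = re.compile(
    r"^(?:\[?\d{1,2}:\d{2}(?::\d{2})?\]?\s+)?"   # optional [HH:MM] or HH:MM:SS
    r"<[@+ ]?(?P<nick>[^>]+)>\s*(?P<text>.*)$"   # <nick> with optional mode prefix
)

def clean_log(lines, excluded=EXCLUDED_NICKS):
    """Yield message text from kept speakers only.

    Lines that do not look like "<nick> text" (joins, parts, topic
    changes, anything unparsed) are dropped.
    """
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue
        if match.group("nick").lower() in excluded:
            continue
        text = match.group("text").strip()
        if text:
            yield text

# Hypothetical sample lines, not taken from the real logs.
sample_log = [
    "12:00 <alice> coi ro do",
    "12:01 <mensi> alice: some automatic reply",
    "12:02 -!- bob has joined",
    "<livla> valsi = word",
    "[12:03:04] <@bob> mi klama",
]
cleaned = list(clean_log(sample_log))
print(cleaned)
```

Dropping every line that does not parse is a deliberately conservative choice: it loses some text, but it keeps server notices out of the counts.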

Any help is welcome.

I started adding different lists of N-grams here: https://mw.lojban.org/papri/N-grams_of_Lojban_corpus

But spreadsheets might be needed instead, since the lists can be long.

PS. If you wonder where N-grams might be needed, the immediate application is "collect the most frequent phrases in Lojban and make a phrasebook out of them".
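
To make the phrasebook idea concrete, here is a rough sketch of counting word N-grams and listing the most frequent ones. It assumes the input is already cleaned, one message per line, and it splits on whitespace only, which is a simplification (punctuation and compound cmavo are not handled). The function names and the sample lines are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(lines, n):
    """Count word N-grams line by line, so no N-gram spans two messages."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # naive whitespace tokenization (assumption)
        counts.update(ngrams(tokens, n))
    return counts

# Hypothetical cleaned messages, not taken from the real corpus.
sample = ["coi ro do", "coi ro do mi klama", "mi klama le zarci"]

bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)

# Most frequent phrases first; these are the phrasebook candidates.
for phrase, freq in bigrams.most_common(3):
    print(freq, " ".join(phrase))
```

Counter.most_common(k) gives the k most frequent N-grams, and the same counts can be written out as CSV for a spreadsheet if the lists get too long for a wiki page.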
