From lojban-out@lojban.org Thu Nov 14 18:38:57 2002
Return-Path: <lojban-out@lojban.org>
X-Sender: lojban-out@lojban.org
X-Apparently-To: lojban@yahoogroups.com
Received: (EGP: mail-8_2_3_0); 15 Nov 2002 02:38:57 -0000
Received: (qmail 62182 invoked from network); 15 Nov 2002 02:38:57 -0000
Received: from unknown (66.218.66.217)
  by m6.grp.scd.yahoo.com with QMQP; 15 Nov 2002 02:38:57 -0000
Received: from unknown (HELO digitalkingdom.org) (204.152.186.175)
  by mta2.grp.scd.yahoo.com with SMTP; 15 Nov 2002 02:38:57 -0000
Received: from lojban-out by digitalkingdom.org with local (Exim 4.05)
  id 18CWNl-0008BG-00
  for lojban@yahoogroups.com; Thu, 14 Nov 2002 18:38:57 -0800
Received: from digitalkingdom.org ([204.152.186.175] helo=chain)
  by digitalkingdom.org with esmtp (Exim 4.05)
  id 18CWNh-0008Au-00; Thu, 14 Nov 2002 18:38:53 -0800
Received: with ECARTIS (v1.0.0; list lojban-list); Thu, 14 Nov 2002 18:38:52 -0800 (PST)
Received: from cs6668125-184.austin.rr.com ([66.68.125.184] ident=root)
  by digitalkingdom.org with esmtp (Exim 4.05)
  id 18CWNb-0008Ag-00
  for lojban-list@lojban.org; Thu, 14 Nov 2002 18:38:47 -0800
Received: from cs6668125-184.austin.rr.com (asdf@localhost [127.0.0.1])
  by cs6668125-184.austin.rr.com (8.12.3/8.12.3) with ESMTP id gAF2jLWF088884
  for <lojban-list@lojban.org>; Thu, 14 Nov 2002 20:45:21 -0600 (CST)
  (envelope-from fracture@cs6668125-184.austin.rr.com)
Received: (from fracture@localhost)
  by cs6668125-184.austin.rr.com (8.12.3/8.12.3/Submit) id gAF2jLwX088883
  for lojban-list@lojban.org; Thu, 14 Nov 2002 20:45:21 -0600 (CST)
Date: Thu, 14 Nov 2002 20:45:21 -0600
To: lojban-list@lojban.org
Subject: [lojban] Re: IRC logs and text archives - volunteers wanted
Message-ID: <20021115024521.GA88242@allusion.net>
References: <5.1.0.14.0.20021113231400.0337d580@pop.east.cox.net> <5.1.0.14.0.20021113231400.0337d580@pop.east.cox.net> <5.1.0.14.0.20021114043147.033943d0@pop.east.cox.net>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;	protocol="application/pgp-signature"; boundary="LpQ9ahxlCli8rRTG"
Content-Disposition: inline
In-Reply-To: <5.1.0.14.0.20021114043147.033943d0@pop.east.cox.net>
User-Agent: Mutt/1.4i
X-archive-position: 2613
X-ecartis-version: Ecartis v1.0.0
Sender: lojban-list-bounce@lojban.org
Errors-to: lojban-list-bounce@lojban.org
X-original-sender: fracture@allusion.net
Precedence: bulk
X-list: lojban-list
X-eGroups-From: Jordan DeLong <fracture@allusion.net>
From: Jordan DeLong <lojban-out@lojban.org>
Reply-To: fracture@allusion.net
X-Yahoo-Group-Post: member; u=116389790
X-Yahoo-Profile: lojban_out

--LpQ9ahxlCli8rRTG
Content-Type: multipart/mixed; boundary="2oS5YaxWCcQjTEyO"
Content-Disposition: inline

--2oS5YaxWCcQjTEyO
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Nov 14, 2002 at 05:05:58AM -0500, Robert LeChevalier wrote:
> At 10:43 PM 11/13/02 -0600, Jordan wrote:
> >On Wed, Nov 13, 2002 at 11:23:11PM -0500, Bob LeChevalier-Logical Language
> >Group wrote:
[...]
> >I have essentially noninterrupted logs (10 megs of em) since Sun
> >May 12 08:40:20 2002, when I first joined.
>
> That's a lot! I wonder if Robin has room for that much (and more if it
> keeps accumulating at that rate).
>
> What percentage of it would you say is IN Lojban, as opposed to being
> discussion in English (or other languages) ABOUT Lojban
[...]
> We need to find someone willing to index them (and perhaps to weed out any
> logs that do not have any substantial Lojban text - discussions about the
> language are interesting but are not a corpus of language usage), and to
> put them on a site where they can be looked at (lojban.org or
> elsewhere). And if they get put on a web site, I'd like the group I've
> asked for to maintain a list of web sites with Lojban text to include it.
[...]

So this morning I made a little script (I've attached it in case
anyone finds it useful) to weed out just the lines of text which
are Lojban. While doing this I found that part of the middle had
duplicate lines from when I used to run two clients, so after killing
those the log is only 7.3 megs.

Anyway, the way the lojbo culling worked was to take each line, run
each word through vlatai, and keep a tally of how many words were
lojbo and how many were glico (a cmene only counted as 0.2 of a word,
because a *lot* of English words are cmene). If the lojbo share of
the line was greater than 80%, the line made it through. Obviously
this is an error-prone way to do things, so there are a few things
in there (people saying things like "nice" and "sure") which aren't
Lojban, and it may also have missed some stuff which should be in
there (though I don't think much of that happened).
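
To make the scoring rule concrete, here's a rough sketch of just the
tally step. It is not the attached script: the %kind table is a
made-up stand-in for vlatai (the real thing asks vlatai about every
word), and the names lojban_score and keep_line are only for
illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for vlatai: a tiny hand-made lookup table.
# Anything not listed here is treated as glico.
my %kind = (
    mi  => 'lojbo', klama => 'lojbo', le    => 'lojbo', zarci => 'lojbo',
    coi => 'lojbo', and   => 'cmene', robin => 'cmene',
);

# Share of a line that counts as lojban: each lojbo word counts 1,
# each cmene counts 0.2, everything else counts 0.
sub lojban_score {
    my ($line) = @_;
    my @words = split ' ', $line;
    return 0 unless @words;
    my $score = 0;
    for my $word (@words) {
        my $k = $kind{lc $word} // 'glico';
        $score += $k eq 'lojbo' ? 1 : $k eq 'cmene' ? 0.2 : 0;
    }
    return $score / @words;
}

# A line is kept when its score is strictly above the 80% cutoff.
sub keep_line { return lojban_score($_[0]) > 0.80 }

for my $line ('mi klama le zarci', 'coi robin', 'mi klama and') {
    printf "%-20s %.2f %s\n", $line, lojban_score($line),
        keep_line($line) ? 'keep' : 'drop';
}
```

With this toy table, a short greeting like "coi robin" scores
(1 + 0.2) / 2 = 0.6 and gets dropped, which is the kind of miss the
cmene discount can cause on very short lines.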

All in all it comes out to either 8% or 11% Lojban, depending on
whether you count by lines or bytes. Not all of it represents actual
Lojban conversation though; some of it is snippets from English
discussions where someone broke into some Lojban, and some comes from
the translation game "zmitav".

I'll see if Robin wants the file for frequency counts and/or putting
on the web or whatever.

-- 
Jordan DeLong - fracture@allusion.net
lu zo'o loi censa bakni cu terzba le zaltapla poi xagrai li'u
sei la mark. tuen. cusku

--2oS5YaxWCcQjTEyO
Content-Type: application/x-perl
Content-Disposition: attachment; filename="find_loj.pl"
Content-Transfer-Encoding: 7bit

#!/usr/bin/env perl
#
# Locate lojban text from a file containing some lines which
# are lojban and some which are not.
#
# (there's hacks in here to make it irc-smart)
#
use strict;
use warnings;

# percent of words on a line which must be lojban
my $needed = 80;

##############################################################################

if (@ARGV != 2) {
    die "usage: $0 filename outfile\n";
}
my ($filename, $outfile) = @ARGV;

open(my $in, '<', $filename)
    or die "open $filename: $!";
open(my $out, '>', $outfile)
    or die "open $outfile: $!";
while (my $theline = <$in>) {
    my $line;

    # trim the irc-style formatting out of this
    if ($theline =~ /\[\d\d:\d\d\] <[^ ]+> (.*)$/) {
        # "[hh:mm] <nick> text"
        $line = $1;
    } elsif ($theline =~ /\[\d\d:\d\d\] \*\*\* [^ ]+ (.*)$/) {
        # "[hh:mm] *** nick text"
        $line = $1;
    } elsif ($theline =~ /-----/) {
        # keep log thingies.
        print $out $theline;
        next;
    } else {
        die "unknown line: $theline";
    }

    # init the lojbo, cmene and total counts
    my ($lcount, $cmene_count, $tcount) = (0, 0, 0);

    for my $word (split ' ', $line) {
        $tcount++;

        # words with digits or punctuation count toward the total only
        next if $word =~ /[^a-zA-Z']/;

        # list-form pipe open: no shell involved, so $word needs no quoting
        open(my $vlatai, '-|', 'vlatai', $word)
            or die "vlatai: $!";
        my $vlasays = <$vlatai>;
        close $vlatai;
        next unless defined $vlasays;

        if ($vlasays =~ /cmene/) {
            $cmene_count++;
        } elsif ($vlasays !~ /UNMATCHED/) {
            $lcount++;
        }
    }

    # a cmene is worth 0.2 of a lojbo word; need at least one real lojbo word
    if ($lcount > 0
        && ($lcount + $cmene_count * 0.2) / $tcount > $needed / 100) {
        print $out $theline;
    }
}
close $in;
close $out;
--2oS5YaxWCcQjTEyO--

--LpQ9ahxlCli8rRTG
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (FreeBSD)

iD8DBQE91F/ADrrilS51AZ8RAhIqAJ9ypQeUzV6BqSUBty9OOFObApbF+wCgsZ1W
2CDyoHcCDJnHq1l9lMRLPE4=
=TyEF
-----END PGP SIGNATURE-----

--LpQ9ahxlCli8rRTG--

