Date: Thu, 14 Nov 2002 20:45:21 -0600
From: Jordan DeLong
To: lojban-list@lojban.org
Subject: [lojban] Re: IRC logs and text archives - volunteers wanted
Message-ID: <20021115024521.GA88242@allusion.net>

On Thu, Nov 14, 2002 at 05:05:58AM -0500, Robert LeChevalier wrote:
> At 10:43 PM 11/13/02 -0600, Jordan wrote:
> >On Wed, Nov 13, 2002 at 11:23:11PM -0500, Bob LeChevalier-Logical Language
> >Group wrote:
[...]
> >I have essentially noninterrupted logs (10 megs of em) since Sun
> >May 12 08:40:20 2002, when I first joined.
>
> That's a lot!  I wonder if Robin has room for that much (and more if it
> keeps accumulating at that rate).
>
> What percentage of it would you say is IN Lojban, as opposed to being
> discussion in English (or other languages) ABOUT Lojban
[...]
> We need to find someone willing to index them (and perhaps to weed out any
> logs that do not have any substantial Lojban text - discussions about the
> language are interesting but are not a corpus of language usage), and to
> put them on a site where they can be looked at (lojban.org or
> elsewhere).  And if they get put on a web site, I'd like the group I've
> asked for to maintain a list of web sites with Lojban text to include it.
[...]

So this morning I made a little script (attached, in case anyone
finds it useful) to weed out just the lines of text which are Lojban.
While doing this I found that some of the middle of the log had
duplicate lines from when I used to run two clients, so after killing
those the log is only 7.3 megs.

Anyway, the way the lojbo culling worked was to take each line, run
each word through vlatai, and keep a tally of how many words were
lojbo and how many were glico (cmene only counted 0.2 because a *lot*
of English words are cmene); if the weighted lojbo share was greater
than 80%, the line made it through. Obviously this is an error-prone
way to do things, so there are a few things in there (people saying
things like "nice" and "sure") which aren't Lojban, and it may also
have missed some stuff which should be in there (though I don't think
as much of this happened).

All in all it comes out to either 8% or 11% Lojban, depending on
whether you count by lines or bytes.
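
To make the scoring rule concrete, here is a minimal sketch of the per-line decision. It is written in Python rather than the Perl of the attached script, and it assumes each word has already been classified, so it runs without vlatai. The labels "lojbo", "cmene", and "glico" and the function names are made up for this illustration. The extra requirement of at least one lojbo word is taken from the attached script, not from the description above.

```python
# Sketch of the per-line scoring rule. Word classes are passed in directly
# instead of being obtained from vlatai (an assumption made so the example
# runs without external tools).

NEEDED_PERCENT = 80   # threshold stated in the message
CMENE_WEIGHT = 0.2    # a cmene counts as a fifth of a Lojban word


def lojban_score(classes):
    """Return the weighted fraction of words on a line that look Lojban.

    `classes` holds one label per word: "lojbo", "cmene", or "glico"
    (hypothetical labels chosen for this sketch).
    """
    total = len(classes)
    if total == 0:
        return 0.0
    lojbo = classes.count("lojbo")
    cmene = classes.count("cmene")
    return (lojbo + CMENE_WEIGHT * cmene) / total


def keep_line(classes):
    """Keep a line only if it has at least one lojbo word and its
    weighted score is strictly greater than the threshold."""
    if classes.count("lojbo") == 0:
        return False
    return lojban_score(classes) > NEEDED_PERCENT / 100


if __name__ == "__main__":
    print(keep_line(["lojbo"] * 5))                # every word is lojbo
    print(keep_line(["lojbo"] * 9 + ["cmene"]))    # one cmene among nine lojbo words
    print(keep_line(["cmene", "cmene", "glico"]))  # no lojbo words at all
```

Because the comparison is strict, a line that scores exactly 80% is dropped, and a line made up only of cmene never passes however many it contains.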

Not all of it represents actual Lojban conversation, though: some
lines are snippets from English discussions where someone broke into
Lojban, and some come from the translation game "zmitav".

I'll see if Robin wants the file for frequency counts and/or putting
on the web or whatever.

-- 
Jordan DeLong - fracture@allusion.net
lu zo'o loi censa bakni cu terzba le zaltapla poi xagrai li'u
     sei la mark. tuen. cusku

Content-Type: application/x-perl
Content-Disposition: attachment; filename="find_loj.pl"

#!/usr/bin/env perl
#
# Locate lojban text from a file containing some lines which
# are lojban and some which are not.
#
# (there's hacks in here to make it irc-smart)
#
use strict;
use warnings;

# percent of words on a line which must be lojban
my $needed = 80;

##############################################################################

if (@ARGV != 2) {
    die "usage: $0 filename outfile\n";
}
my ($filename, $outfile) = @ARGV;

open(my $in, '<', $filename)
    or die "open $filename: $!";
open(my $out, '>', $outfile)
    or die "open $outfile: $!";

while (<$in>) {
    my $theline = $_;
    my $line;

    # trim the irc-style formatting out of this
    if (/\[\d\d:\d\d\] \<[^ ]+\> (.*)$/) {
        $line = $1;
    } elsif (/\[\d\d:\d\d\] \*\*\* [^ ]+ (.*)$/) {
        $line = $1;
    } elsif (/-----.*/) {
        # keep log thingies.
        print $out $theline;
        next;
    } else {
        die "unknown line; $_";
    }

    # init the lojbo, cmene and total word counts
    my $lcount = 0;
    my $cmene_count = 0;
    my $tcount = 0;

    while ($line =~ /^([ \t]*[^ ]+)/) {
        my $word = $1;

        $tcount++;

        $line =~ s/^[ \t]*[^ ]+//;
        $word =~ s/^[ \t]*//;

        # anything with non-letters counts toward the total only
        next if $word =~ /[^a-zA-Z']/;

        # list-form pipe open: the word never passes through a shell,
        # so apostrophes need no escaping
        open(my $vlatai, '-|', 'vlatai', $word)
            or die "run vlatai: $!";
        my $vlasays = <$vlatai>;
        close $vlatai;
        $vlasays = '' unless defined $vlasays;

        if ($vlasays =~ /cmene/) {
            $cmene_count++;
        } elsif ($vlasays !~ /UNMATCHED/) {
            $lcount++;
        }
    }

    # keep the line if it has at least one lojbo word and the weighted
    # share of lojban words exceeds the threshold
    if ($lcount > 0
        && ($lcount + $cmene_count * 0.2) / $tcount > $needed / 100) {
        print $out $theline;
    }
}
close $in;
close $out;