From: Jordan DeLong
Reply-To: fracture@allusion.net
To: lojban-list@lojban.org
Date: Thu, 14 Nov 2002 20:45:21 -0600
Subject: [lojban] Re: IRC logs and text archives - volunteers wanted
Message-ID: <20021115024521.GA88242@allusion.net>

On Thu, Nov 14, 2002 at 05:05:58AM -0500, Robert LeChevalier wrote:
> At 10:43 PM 11/13/02 -0600, Jordan wrote:
> >On Wed, Nov 13, 2002 at 11:23:11PM -0500, Bob LeChevalier-Logical Language
> >Group wrote:
[...]
> >I have essentially noninterrupted logs (10 megs of em) since Sun
> >May 12 08:40:20 2002, when I first joined.
>
> That's a lot!  I wonder if Robin has room for that much (and more if it
> keeps accumulating at that rate).
>
> What percentage of it would you say is IN Lojban, as opposed to being
> discussion in English (or other languages) ABOUT Lojban
[...]
> We need to find someone willing to index them (and perhaps to weed out any
> logs that do not have any substantial Lojban text - discussions about the
> language are interesting but are not a corpus of language usage), and to
> put them on a site where they can be looked at (lojban.org or
> elsewhere).  And if they get put on a web site, I'd like the group I've
> asked for to maintain a list of web sites with Lojban text to include it.
[...]

So this morning I made a little script (I've attached it in case
anyone finds it useful) to weed out just the lines of text which are
lojban.
While doing this I found that part of the middle had duplicate
lines from when I used to run two clients, so after killing those
the log is only 7.3 megs.

Anyway, the way the lojbo culling worked was to take each line, run
each word through vlatai, and keep a tally of how many words were
lojbo and how many were glico (cmene only counted .2, because a *lot*
of English words are cmene). If the lojbo share was greater than 80%,
the line made it through. Obviously this is an error-prone way to do
things, so there are a few things in there (people saying things like
"nice" and "sure") which aren't lojban, and it may also have missed
some stuff which should be in there (though I don't think much of
that happened).

All in all it comes out to either 8% or 11% lojban, depending on
whether you count by lines or bytes. Not all of it represents actual
lojban conversation, though: some lines are snippets from English
discussions where someone broke into some lojban, and some come from
the translation game "zmitav".

I'll see if Robin wants the file for freq counts and/or putting on
the web or whatever.

-- 
Jordan DeLong - fracture@allusion.net
lu zo'o loi censa bakni cu terzba le zaltapla poi xagrai li'u
                                     sei la mark. tuen. cusku

[Attachment: find_loj.pl (application/x-perl), content not displayed.]
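
The per-line culling described in the message can be sketched roughly as follows. This is an illustration in Python, not the attached Perl script (whose contents are not shown), and it rests on several assumptions: `classify` is a hypothetical stand-in for whatever vlatai reports about a word, `toy_classify` is a made-up demo classifier with a tiny word list, each log line is taken to be bare message text, and the ".2" weighting is read as "a cmene adds 0.2 to the lojbo tally, and the tally is divided by the number of words". The mail does not spell out that last detail, so other readings are possible.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical labels; the real vlatai output format is not shown in the mail.
LOJBO, CMENE, GLICO = "lojbo", "cmene", "glico"

CMENE_WEIGHT = 0.2  # from the mail: cmene only count .2
THRESHOLD = 0.8     # from the mail: a line passes above 80%


def lojban_score(line: str, classify: Callable[[str], str]) -> float:
    """Return the weighted share of lojbo words in one line (0.0 to 1.0)."""
    words = line.split()
    if not words:
        return 0.0
    tally = 0.0
    for word in words:
        kind = classify(word)
        if kind == LOJBO:
            tally += 1.0
        elif kind == CMENE:
            tally += CMENE_WEIGHT  # assumption: the rest counts as glico
    return tally / len(words)


def keep_lojban_lines(lines: Iterable[str],
                      classify: Callable[[str], str]) -> List[str]:
    """Keep only the lines whose score is strictly above the threshold."""
    return [ln for ln in lines if lojban_score(ln, classify) > THRESHOLD]


def lojban_share(all_lines: List[str], kept: List[str]) -> Tuple[float, float]:
    """Return (share by line count, share by byte count) of the kept lines."""
    total_bytes = sum(len(ln.encode("utf-8")) for ln in all_lines)
    if not all_lines or total_bytes == 0:
        return 0.0, 0.0
    kept_bytes = sum(len(ln.encode("utf-8")) for ln in kept)
    return len(kept) / len(all_lines), kept_bytes / total_bytes


# Demo-only classifier, NOT vlatai: a tiny hard-coded word list, and any
# other word ending in a consonant is treated as a cmene.
TOY_LOJBO = {"mi", "klama", "le", "zarci", "la"}


def toy_classify(word: str) -> str:
    bare = word.strip(".").lower()
    if bare in TOY_LOJBO:
        return LOJBO
    if bare and bare[-1].isalpha() and bare[-1] not in "aeiouy":
        return CMENE
    return GLICO
```

With a real classifier plugged in, `keep_lojban_lines(log_lines, classify)` gives the culled log and `lojban_share(log_lines, kept)` gives the two percentages. The by-lines and by-bytes figures differ whenever the kept lines are on average longer or shorter than the rest, which is one plausible reason for the 8% versus 11% split mentioned in the message.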