On Thu, Nov 14, 2002 at 05:05:58AM -0500, Robert LeChevalier wrote:
> At 10:43 PM 11/13/02 -0600, Jordan wrote:
> >On Wed, Nov 13, 2002 at 11:23:11PM -0500, Bob LeChevalier-Logical Language
> >Group wrote:
[...]
> >I have essentially noninterrupted logs (10 megs of em) since Sun
> >May 12 08:40:20 2002, when I first joined.
>
> That's a lot!  I wonder if Robin has room for that much (and more if it
> keeps accumulating at that rate).
>
> What percentage of it would you say is IN Lojban, as opposed to being
> discussion in English (or other languages) ABOUT Lojban
[...]
> We need to find someone willing to index them (and perhaps to weed out any
> logs that do not have any substantial Lojban text - discussions about the
> language are interesting but are not a corpus of language usage), and to
> put them on a site where they can be looked at (lojban.org or
> elsewhere).  And if they get put on a web site, I'd like the group I've
> asked for to maintain a list of web sites with Lojban text to include it.
[...]

So this morning I made a little script (I've attached it in case
anyone finds it useful) to weed out just the lines of text which are
lojban.  While doing this I found that some of the middle of the log
had duplicate lines from when I used to run two clients, so after
killing those the log is only 7.3 megs.

Anyway, the way the lojbo culling worked was to take each line, run
each word through vlatai, and keep a tally of how many words were
lojbo and how many were glico (cmene only counted 0.2, because a *lot*
of English words are cmene).  If the lojbo share of the line was
greater than 80%, the line made it through.

Obviously this is an error-prone way to do things, so there are a few
things in there which aren't lojban (people saying things like "nice"
and "sure"), and it may also have missed some stuff which should be in
there (though I don't think much of that happened).

All in all it comes to either 8% or 11% lojban, depending on whether
you count by lines or by bytes.
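For anyone who wants the gist of the culling rule without reading the
attachment, here is a rough sketch in Python.  It is not find_loj.pl
(which is Perl and calls vlatai); the classify argument is a
hypothetical stand-in for vlatai, toy_classify and its word list are
made up purely so the example runs, and the way cmene enter the
denominator is an assumption, since the description above doesn't pin
it down.

```python
from typing import Callable, Iterable, List, Tuple

CMENE_WEIGHT = 0.2  # from the post: cmene only count 0.2
THRESHOLD = 0.8     # from the post: keep a line if its score is above 80%


def lojban_score(words: List[str], classify: Callable[[str], str]) -> float:
    """Weighted share of lojbo words in one line.

    Assumption: every word counts once in the denominator, lojbo words
    add 1.0 to the numerator, cmene add 0.2, glico add nothing.
    """
    if not words:
        return 0.0
    score = 0.0
    for word in words:
        kind = classify(word)
        if kind == "lojbo":
            score += 1.0
        elif kind == "cmene":
            score += CMENE_WEIGHT
    return score / len(words)


def cull(lines: Iterable[str], classify: Callable[[str], str]) -> List[str]:
    """Keep only the lines whose score is above the threshold."""
    return [ln for ln in lines if lojban_score(ln.split(), classify) > THRESHOLD]


def lojban_share(all_lines: List[str], kept: List[str]) -> Tuple[float, float]:
    """Fraction of the log that was kept, counted by lines and by bytes."""
    total_bytes = sum(len(ln.encode("utf-8")) for ln in all_lines)
    if not all_lines or total_bytes == 0:
        return 0.0, 0.0
    kept_bytes = sum(len(ln.encode("utf-8")) for ln in kept)
    return len(kept) / len(all_lines), kept_bytes / total_bytes


# Hypothetical toy classifier, standing in for vlatai: a tiny word list
# for lojbo, and "ends in a consonant" as a crude cmene test.
TOY_LOJBO = {"mi", "klama", "le", "zarci", "coi", "do"}


def toy_classify(word: str) -> str:
    w = word.strip(".,!?").lower()
    if w in TOY_LOJBO:
        return "lojbo"
    if w and w[-1] not in "aeiouy":
        return "cmene"
    return "glico"


log = ["mi klama le zarci", "that sounds nice", "coi jordan"]
kept = cull(log, toy_classify)
print(kept)
print(lojban_share(log, kept))
```

With this weighting a short line like "coi jordan" scores 0.6 and gets
dropped, which is the sort of miss mentioned above; where exactly to
put the threshold is a judgment call.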
Not all of it represents actual lojban conversation, though: some of
it is snippets from English discussions where someone broke into some
lojban, and some comes from the translation game "zmitav".

I'll see if Robin wants the file for frequency counts and/or putting
on the web or whatever.

-- 
Jordan DeLong - fracture@allusion.net
lu zo'o loi censa bakni cu terzba le zaltapla poi xagrai li'u
sei la mark. tuen. cusku
Attachment: find_loj.pl
Description: Perl program

Attachment: pgp00259.pgp
Description: PGP signature