
Re: IRC logs and text archives - volunteers wanted



On Thu, Nov 14, 2002 at 05:05:58AM -0500, Robert LeChevalier wrote:
> At 10:43 PM 11/13/02 -0600, Jordan wrote:
> >On Wed, Nov 13, 2002 at 11:23:11PM -0500, Bob LeChevalier-Logical Language
> >Group wrote:
[...]
> >I have essentially noninterrupted logs (10 megs of em) since Sun
> >May 12 08:40:20 2002, when I first joined.
> 
> That's a lot!  I wonder if Robin has room for that much (and more if it 
> keeps accumulating at that rate).
> 
> What percentage of it would you say is IN Lojban, as opposed to being 
> discussion in English (or other languages) ABOUT Lojban
[...]
> We need to find someone willing to index them (and perhaps to weed out any 
> logs that do not have any substantial Lojban text - discussions about the 
> language are interesting but are not a corpus of language usage), and to 
> put them on a site where they can be looked at (lojban.org or 
> elsewhere).  And if they get put on a web site, I'd like the group I've 
> asked for to maintain a list of web sites with Lojban text to include it.
[...]

So this morning I made a little script (I've attached it in case
anyone finds it useful) to weed out just the lines of text which
are Lojban.  While doing this I found that part of the middle had
duplicate lines from when I used to run two clients, so after killing
those the log is only 7.3 megs.
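
A minimal sketch of that kind of duplicate removal, in Python rather
than the attached Perl, and not the attached script itself.  It assumes
every log line carries its own timestamp, so that an exact repeat can
only come from a second client logging the same message.  The function
name drop_duplicate_lines is hypothetical.

```python
from typing import Iterable, List


def drop_duplicate_lines(lines: Iterable[str]) -> List[str]:
    """Keep the first copy of every line and preserve the original order.

    Assumption: lines are timestamped, so exact repeats are logging
    artifacts and not someone saying the same thing twice.
    """
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "[08:40:20] <a> coi",
        "[08:40:20] <a> coi",
        "[08:41:02] <b> coi do",
    ]
    for line in drop_duplicate_lines(sample):
        print(line)
```

If the log had no timestamps, this would also throw away genuine
repeats, so a real cleanup might restrict itself to the stretch of the
log where two clients were running.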

Anyway, the way the lojbo culling worked was to take each line, run
each word through vlatai, and keep a tally of how many words were
lojbo and how many were glico (cmene only counted .2 because a *lot*
of English words are cmene).  If the lojbo share was greater than 80%,
the line made it through.  Obviously this is an error-prone way to do
things, so there are a few things in there (people saying things like
"nice" and "sure") which aren't Lojban, and it may also have missed
some stuff which should be in there (though I don't think much
of this happened).
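
The scoring described above can be sketched roughly as follows.  This
is Python, not the attached Perl, and it does not call vlatai: the
classifier is passed in as a function, and toy_classify is a made-up
stand-in with a six-word vocabulary.  One assumption goes beyond the
description: a cmene adds .2 to the lojbo tally while still counting
as a full word in the total.  All names here are hypothetical.

```python
from typing import Callable

CMENE_WEIGHT = 0.2  # from the post: a cmene only counts .2
THRESHOLD = 0.8     # from the post: keep lines scoring above 80%


def lojban_score(line: str, classify: Callable[[str], str]) -> float:
    """Return the weighted share of words in `line` classified as Lojban.

    `classify` must return "lojbo", "cmene", or "glico" for a word.
    Assumption: a cmene adds CMENE_WEIGHT to the tally but still counts
    as one whole word in the denominator.
    """
    words = line.split()
    if not words:
        return 0.0
    tally = 0.0
    for word in words:
        kind = classify(word)
        if kind == "lojbo":
            tally += 1.0
        elif kind == "cmene":
            tally += CMENE_WEIGHT
    return tally / len(words)


def keep_line(line: str, classify: Callable[[str], str]) -> bool:
    """True when the line scores above THRESHOLD."""
    return lojban_score(line, classify) > THRESHOLD


# Hypothetical stand-in for a real word classifier such as vlatai.
TOY_LOJBO = {"coi", "mi", "do", "klama", "le", "zarci"}


def toy_classify(word: str) -> str:
    w = word.strip(".,!?").lower()
    if w in TOY_LOJBO:
        return "lojbo"
    # Crude rule: treat any other consonant-final word as a cmene.
    if w.isalpha() and w[-1] not in "aeiou":
        return "cmene"
    return "glico"


if __name__ == "__main__":
    for text in ["mi klama le zarci", "coi robin", "sounds good to me"]:
        print(keep_line(text, toy_classify), text)
```

The low cmene weight is what keeps ordinary English lines out: a
consonant-final English word has the shape of a name, and at full
weight a line made of such words would score as Lojban.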

All in all it comes out as either 8% or 11% Lojban, depending on whether
you count by lines or bytes.  Not all of it represents actual Lojban
conversation though; some are snippets from English discussions
where someone broke into some Lojban, and some comes from the
translation game "zmitav".
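
The two ways of counting can give different answers because kept lines
and dropped lines need not have the same average length.  A small
sketch of both measures, in Python, with a hypothetical function name
and made-up sample lines (not data from the actual log):

```python
from typing import List, Tuple


def lojban_share(all_lines: List[str], kept_lines: List[str]) -> Tuple[float, float]:
    """Return (share by line count, share by byte count) of the kept lines."""
    total_bytes = sum(len(line.encode("utf-8")) for line in all_lines)
    kept_bytes = sum(len(line.encode("utf-8")) for line in kept_lines)
    by_lines = len(kept_lines) / len(all_lines) if all_lines else 0.0
    by_bytes = kept_bytes / total_bytes if total_bytes else 0.0
    return by_lines, by_bytes


if __name__ == "__main__":
    # Made-up sample, only to show that the two measures differ.
    everything = ["coi", "this is a long english line", "mi klama le zarci", "ok"]
    kept = ["coi", "mi klama le zarci"]
    by_lines, by_bytes = lojban_share(everything, kept)
    print(f"by lines: {by_lines:.1%}  by bytes: {by_bytes:.1%}")
```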

I'll see if Robin wants the file for frequency counts and/or putting on
the web or whatever.

-- 
Jordan DeLong - fracture@allusion.net
lu zo'o loi censa bakni cu terzba le zaltapla poi xagrai li'u
                                     sei la mark. tuen. cusku

Attachment: find_loj.pl
Description: Perl program

Attachment: pgp00259.pgp
Description: PGP signature