From opoudjis@optushome.com.au Wed Dec 11 03:09:51 2002
Return-Path: <opoudjis@optushome.com.au>
X-Sender: opoudjis@optushome.com.au
X-Apparently-To: lojban@yahoogroups.com
Date: Wed, 11 Dec 2002 22:09:49 +1100
Mime-Version: 1.0 (Apple Message framework v548)
Content-Type: text/plain; charset=US-ASCII; format=flowed
Subject: BPFK: Call for volunteers: Concordancer
To: lojban@yahoogroups.com
Content-Transfer-Encoding: 7bit
Message-Id: <1073208E-0CF9-11D7-8AA3-003065D4EC72@optushome.com.au>
X-Mailer: Apple Mail (2.548)
From: Nick Nicholas <opoudjis@optushome.com.au>
X-Yahoo-Group-Post: member; u=90350612
X-Yahoo-Profile: opoudjis

Lojbanists,

As Language Design Commission Chair, I hereby issue a call for 
volunteers to help construct a concordance-like search engine to 
traverse the Lojban corpus.

This is in accordance with the plan for the compilation of the Lojban 
baseline dictionary (lojbo vlaski ca'ircukta), outlined in:

http://www.lojban.org/wiki/index.php/Mini-dictionary
http://www.lojban.org/wiki/index.php/Mini-dictionary%20To-do

http://www.tlg.uci.edu/~opoudjis/dist/jbovlacku.html is a very succinct 
prototype. (I am referring to the "Concordancing Software", not the 
"Voting Software", which will come later.)

It is perfectly acceptable for volunteers to suggest an already 
existing piece of software, or to customise such where legal. However, 
there are particular features I would like this engine to have; I do 
not believe the current search engines used for Lojban corpora (htDig, 
Google, yahoogroups search) do all that is needed. I quote from my 
discussion on the Wiki:

> We do not need just a generic web search interface: we want something 
> that will pinpoint the word in context accurately (so that the user 
> can quickly work out whether the instance found is relevant or not). 
> This should also indicate the author where retrievable (and in email 
> digests this should be doable readily). And the search needs to be 
> open-ended: if there are 10,000 instances, the user should potentially 
> be able to browse through the lot. In the first instance, this can be 
> just a search for full words; substring searches might help for 
> commonly orthographically-compounded cmavo, but my impression is we 
> know which the usual compounded cmavo are.
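
As an illustration of the requirements quoted above (whole-word hits shown in context, with open-ended browsing), here is a minimal Python sketch. The function names, the context width and the page size are hypothetical, and the rule that a word is a run of letters and apostrophes is an assumption rather than a full account of Lojban morphology.

```python
import re
from itertools import islice


def concordance(lines, word, width=30):
    """Yield (line_number, left, match, right) for each whole-word hit."""
    # Assumption: letters and apostrophes make up a word, so a search for
    # "ca'i" must not match inside "ca'ircukta".
    pattern = re.compile(r"(?<![\w'])" + re.escape(word) + r"(?![\w'])")
    for number, line in enumerate(lines, start=1):
        for hit in pattern.finditer(line):
            left = line[max(0, hit.start() - width):hit.start()]
            right = line[hit.end():hit.end() + width]
            yield number, left, hit.group(0), right


def page(hits, page_number, page_size=20):
    """Return one page of hits (pages count from 0) without listing them all."""
    start = page_number * page_size
    return list(islice(hits, start, start + page_size))
```

Because concordance() is a generator, a front end could keep asking for further pages however many instances there are, which is one way of meeting the "browse through the lot" requirement.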

I do not quite know what the corpus will look like yet; but the 
majority of it will clearly be mail messages, with at least rudimentary 
mail headers. There will also be smidgeons of free-form text, IRC logs, 
HTML, XML, and mangled ASCII versions of palaeolithic Word files (Bob's 
JLs &c.). The corpus retrievable from yahoogroups may not have prolix 
mail headers; and the sometimes truncated mail headers in surviving 
records of the listserv list have caused web software problems in the 
past. So while it would be very nice for the concordancer to identify 
the author much of the time, not all text will be so formatted.
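
One way the author might be identified where headers survive, and quietly skipped where they do not, is sketched below using Python's standard email package. The function name author_of is hypothetical, and the sketch assumes each corpus item is available as a single string; it is not a description of how the eventual concordancer will work.

```python
from email import message_from_string
from email.utils import parseaddr


def author_of(text):
    """Return the sender's name (or address) from a From: header, else None."""
    message = message_from_string(text)
    raw = message.get("From")
    if not raw:
        # Free-form text, IRC logs and the like carry no From: header.
        return None
    name, address = parseaddr(str(raw))
    # Truncated headers may give only an address; fall back to that.
    return name or address or None
```

Text with no recognisable header simply yields None, so the search result can omit the author rather than fail.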

People should be able to click from the search result straight to the 
text; and the text should be displayed in the search result with a 
generous amount of context (a line). Text known to be quoted (in the 
standard ways, with > , or &gt; if this is an HTML-mangled file) should 
be optionally skipped in the search results.
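
The optional skipping of quoted text could be sketched in Python as follows. The names is_quoted and unquoted_lines are hypothetical, and the sketch assumes that quoting is always marked at the start of a line by > or &gt;, which will not cover every quoting style in the corpus.

```python
def is_quoted(line):
    """True if the line starts with a quote marker (> or its HTML-mangled form)."""
    stripped = line.lstrip()
    return stripped.startswith(">") or stripped.startswith("&gt;")


def unquoted_lines(lines, skip_quoted=True):
    """Yield (line_number, line), leaving out quoted lines when requested."""
    for number, line in enumerate(lines, start=1):
        if skip_quoted and is_quoted(line):
            continue
        # Original line numbers are kept so a hit can link back to the text.
        yield number, line
```

Passing skip_quoted=False returns every line, so the user can choose whether quoted material appears in the results.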

We do not need anything fancy, so if anyone thinks they can cook 
something appropriate up in an evening of Jolt-fuelled Perl, go right 
ahead. Please say so to the list first, though, so we don't have 
multiple people inventing the same wheel. The time for that can come 
later.

--
Edarh oni oroumene NICK NICHOLAS PhD, French/Italian,
kouraste na mpa"inei, University of Melbourne, Australia
apo ton kosmo entenh nickn@unimelb.edu.au
tsi naxei na orinei? http://www.opoudjis.net
--- Dhmhtzh Xouph, _O gerou-Kwstagkh_ (Tsakwniko poihma)


