BPFK: Call for volunteers: Concordancer

To: lojban@yahoogroups.com
Subject: BPFK: Call for volunteers: Concordancer
From: Nick Nicholas <opoudjis@optushome.com.au>
Date: Wed, 11 Dec 2002 22:09:49 +1100

Lojbanists,

As Language Design Commission Chair, I hereby issue a call for volunteers to help construct a concordance-like search engine to traverse the Lojban corpus.

This is in accordance with the plan for the compilation of the Lojban baseline dictionary (lojbo vlaski ca'ircukta), outlined in:


http://www.lojban.org/wiki/index.php/Mini-dictionary
http://www.lojban.org/wiki/index.php/Mini-dictionary%20To-do

http://www.tlg.uci.edu/~opoudjis/dist/jbovlacku.html is a very succinct prototype. (I am referring to the "Concordancing Software", not the "Voting Software", which will come later.)

It is perfectly acceptable for volunteers to suggest an already existing piece of software, or to customise such where legal. However, there are particular features I would like this engine to have; I do not believe the current search engines used for Lojban corpora (htDig, Google, yahoogroups search) do all that is needed. I quote from my discussion on the Wiki:

We do not need just a generic web search interface: we want something that will pinpoint the word in context accurately (so that the user can quickly work out whether the instance found is relevant or not). This should also indicate the author where retrievable (and in email digests this should be doable readily). And the search needs to be open-ended: if there are 10,000 instances, the user should potentially be able to browse through the lot. In the first instance, this can be just a search for full words; substring searches might help for commonly orthographically-compounded cmavo, but my impression is we know which the usual compounded cmavo are.
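
As a rough illustration of the full-word, word-in-context search described above, here is a minimal Python sketch. Everything named in it (the `Hit` record, the `concordance` function, the dictionary of documents keyed by a source label) is hypothetical and is not part of any existing Lojban tool. It assumes that a word boundary is any character other than a letter, digit, underscore or apostrophe, so that a search for "cukta" does not match inside "ca'ircukta". It is written as a generator so that a caller can browse as many hits as they care to, one at a time.

```python
import re
from typing import Dict, Iterator, NamedTuple


class Hit(NamedTuple):
    """One occurrence of the search word (hypothetical record type)."""
    source: str       # label of the document the hit came from
    line_number: int  # 1-based line number, usable as a link target
    context: str      # the whole line containing the word


def concordance(word: str, documents: Dict[str, str]) -> Iterator[Hit]:
    """Yield every line that contains `word` as a full word.

    Assumption: letters, digits, underscore and the apostrophe count as
    word characters, so the match must not touch any of them on either side.
    The search is case-sensitive.
    """
    pattern = re.compile(r"(?<![\w'])" + re.escape(word) + r"(?![\w'])")
    for source, text in documents.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield Hit(source, line_number, line.strip())


if __name__ == "__main__":
    sample = {
        "msg1": "mi klama le zarci",
        "msg2": "lojbo vlaski ca'ircukta\n.i le cukta cu melbi",
    }
    for hit in concordance("cukta", sample):
        print(f"{hit.source}:{hit.line_number}: {hit.context}")
```

Because the function yields hits lazily, a front end could page through 10,000 results without building the whole list first.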

I do not quite know what the corpus will look like yet; but the majority of it will clearly be mail messages, with at least rudimentary mail headers. There will also be smidgeons of free form text, irc logs, html, xml, and mangled ascii versions of palaeolithic Word files (Bob's JLs &c.) The corpus retrievable from yahoogroups may not have prolix mail headers; and the sometimes truncated mail headers in surviving records of the listserv list have caused web software problems in the past. So while it would be very nice for the concordancer to identify the author much of the time, not all text will be so formatted.
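
A minimal sketch of the "identify the author where retrievable" idea, using Python's standard `email` package. The function name `guess_author` is hypothetical, and the sketch assumes each corpus item is available as a single string that may or may not begin with mail headers. Where no From header can be found, it returns None rather than guessing.

```python
from email import message_from_string
from email.utils import parseaddr
from typing import Optional


def guess_author(raw_text: str) -> Optional[str]:
    """Return a display name or address from the From header, else None.

    Assumption: `raw_text` is one corpus item. Free-form text, irc logs
    and the like simply have no From header, in which case None is returned.
    """
    message = message_from_string(raw_text)
    from_header = message.get("From")
    if not from_header:
        return None
    name, address = parseaddr(str(from_header))
    # Prefer the human-readable name; fall back to the bare address,
    # then to whatever survived in a damaged header.
    return name or address or str(from_header).strip() or None


if __name__ == "__main__":
    mail = (
        "From: Nick Nicholas <opoudjis@optushome.com.au>\n"
        "Subject: BPFK: Call for volunteers: Concordancer\n"
        "\n"
        "Lojbanists,\n"
    )
    print(guess_author(mail))
    print(guess_author("coi rodo\nmi klama le zarci\n"))
```

Text whose first line happens to look like a header would confuse this approach, so a real tool would probably need per-source handling.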

People should be able to click from the search result straight to the text; and the text should be displayed in the search result with a generous amount of context (a line). Text known to be quoted (in the standard ways, with > , or &gt; if this is an html-mangled file) should be optionally skipped in the search results.

We do not need anything fancy, so if anyone thinks they can cook something appropriate up in an evening of Jolt-fuelled Perl, go right ahead. Please say so to the list first, though, so we don't have multiple people inventing the same wheel. The time for that can come later.
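
The optional skipping of quoted text could look something like the following Python sketch. The function names are hypothetical, and it assumes that quoted lines begin (after optional leading whitespace) with a literal > or, in html-mangled files, with the entity &gt;. Other quoting styles are not handled.

```python
from typing import List


def is_quoted(line: str) -> bool:
    """True if the line looks like quoted mail text.

    Assumption: quoting is marked by a leading '>' or, in html-mangled
    files, by the entity '&gt;'.
    """
    stripped = line.lstrip()
    return stripped.startswith(">") or stripped.startswith("&gt;")


def searchable_lines(text: str, skip_quoted: bool = True) -> List[str]:
    """Return the lines of `text`, dropping quoted ones when asked to."""
    return [
        line
        for line in text.splitlines()
        if not (skip_quoted and is_quoted(line))
    ]


if __name__ == "__main__":
    body = "> coi rodo\n&gt; mi klama\n.i mi tugni"
    print(searchable_lines(body))
    print(searchable_lines(body, skip_quoted=False))
```

Keeping the filter behind a flag leaves quoted text available to users who do want to search it.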


--
 Edarh oni oroumene          NICK NICHOLAS PhD, French/Italian,
 kouraste na mpa"inei,       University of Melbourne, Australia
 apo ton kosmo entenh        nickn@unimelb.edu.au
 tsi naxei na orinei?        http://www.opoudjis.net
    --- Dhmhtzh Xouph, _O gerou-Kwstagkh_ (Tsakwniko poihma)
