
[lojban] BPFK: Call for volunteers: Concordancer



Lojbanists,

As Language Design Commission Chair, I hereby issue a call for 
volunteers to help construct a concordance-like search engine to 
traverse the Lojban corpus.

This is in accordance with the plan for the compilation of the Lojban 
baseline dictionary (lojbo vlaski ca'ircukta), outlined in:

http://www.lojban.org/wiki/index.php/Mini-dictionary
http://www.lojban.org/wiki/index.php/Mini-dictionary%20To-do

http://www.tlg.uci.edu/~opoudjis/dist/jbovlacku.html is a very succinct 
prototype. (I am referring to the "Concordancing Software", not the 
"Voting Software", which will come later.)

It is perfectly acceptable for volunteers to suggest an already 
existing piece of software, or to customise such where legal. However, 
there are particular features I would like this engine to have; I do 
not believe the current search engines used for Lojban corpora (htDig, 
Google, yahoogroups search) do all that is needed. I quote from my 
discussion on the Wiki:

> We do not need just a generic web search interface: we want something 
> that will pinpoint the word in context accurately (so that the user 
> can quickly work out whether the instance found is relevant or not). 
> This should also indicate the author where retrievable (and in email 
> digests this should be doable readily). And the search needs to be 
> open-ended: if there are 10,000 instances, the user should potentially 
> be able to browse through the lot. In the first instance, this can be 
> just a search for full words; substring searches might help for 
> commonly orthographically-compounded cmavo, but my impression is we 
> know which the usual compounded cmavo are.
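
As a rough illustration of the full-word, word-in-context search described in the quote, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the names WORD and concordance, the 30-character context width, and the tokenizer, which simply treats a word as a run of letters and apostrophes. Because it is a generator, a caller can page through 10,000 hits without collecting them all first.

```python
import re

# Assumption: a word is a run of ASCII letters and apostrophes
# (enough for forms like ca'ircukta); not a real Lojban morphology.
WORD = re.compile(r"[A-Za-z']+")

def concordance(lines, target, width=30):
    """Yield (line_number, context) for every whole-word match of target."""
    target = target.lower()
    for number, line in enumerate(lines, start=1):
        for match in WORD.finditer(line):
            # Compare whole tokens, so substrings of longer words never match.
            if match.group().lower() == target:
                left = line[max(0, match.start() - width):match.start()]
                right = line[match.end():match.end() + width]
                # Mark the hit with brackets so it stands out in context.
                yield number, f"{left}[{match.group()}]{right}".strip()

if __name__ == "__main__":
    sample = ["lojbo vlaski ca'ircukta"]
    for number, context in concordance(sample, "vlaski"):
        print(number, context)
```

Substring search for compounded cmavo would need a different matching rule; this sketch deliberately covers only the full-word case.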

I do not quite know what the corpus will look like yet; but the 
majority of it will clearly be mail messages, with at least rudimentary 
mail headers. There will also be smidgeons of free-form text, IRC logs, 
HTML, XML, and mangled ASCII versions of palaeolithic Word files (Bob's 
JLs &c.). The corpus retrievable from yahoogroups may not have prolix 
mail headers; and the sometimes truncated mail headers in surviving 
records of the listserv list have caused web software problems in the 
past. So while it would be very nice for the concordancer to identify 
the author much of the time, not all text will be so formatted.
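
A hedged sketch of the "identify the author where retrievable" idea, using Python's standard email package. The function name author_of is hypothetical, and the sketch assumes each text arrives as one string; text without a usable From header simply yields None rather than an error.

```python
from email import message_from_string
from email.utils import parseaddr

def author_of(raw_text):
    """Return the sender's name (or address) from a From header, else None."""
    # The parser is lenient: text with no headers is treated as all body.
    message = message_from_string(raw_text)
    name, address = parseaddr(message.get("From", ""))
    # Prefer the display name, fall back to the bare address.
    return name or address or None

if __name__ == "__main__":
    mail = "From: Nick Nicholas <nickn@unimelb.edu.au>\nSubject: test\n\ncoi rodo\n"
    print(author_of(mail))
    print(author_of("coi rodo\nfree form text with no headers\n"))
```

Badly truncated headers may still defeat this; the point is only that a missing author degrades to "unknown" instead of breaking the search.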

People should be able to click from the search result straight to the 
text; and the text should be displayed in the search result with a 
generous amount of context (a line). Text known to be quoted (in the 
standard ways, with > , or &gt; if this is an html-mangled file) should 
be optionally skipped in the search results.
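
Optional skipping of quoted text could look something like the following Python sketch. It assumes a quoted line is one whose first non-blank characters are > or, in an html-mangled file, the entity &gt; and the names QUOTE_PREFIXES, is_quoted and unquoted_lines are made up for the example.

```python
# Assumption: quoting is marked only by a leading ">" or the HTML entity "&gt;".
QUOTE_PREFIXES = (">", "&gt;")

def is_quoted(line):
    """True when the line starts (after indentation) with a quote marker."""
    return line.lstrip().startswith(QUOTE_PREFIXES)

def unquoted_lines(lines, skip_quoted=True):
    """Yield (line_number, line), dropping quoted lines when skip_quoted is set."""
    for number, line in enumerate(lines, start=1):
        if skip_quoted and is_quoted(line):
            continue
        # Original line numbers are kept so a hit can link back to the text.
        yield number, line

if __name__ == "__main__":
    body = ["> quoted reply", "&gt; mangled quote", "new text"]
    for number, line in unquoted_lines(body):
        print(number, line)
```

Keeping the original line numbers means a result can still point straight at its place in the source text, whether or not quoted lines were skipped.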

We do not need anything fancy, so if anyone thinks they can cook 
something appropriate up in an evening of Jolt-fuelled Perl, go right 
ahead. Please say so to the list first, though, so we don't have 
multiple people inventing the same wheel. The time for that can come 
later.

--
  Edarh oni oroumene          NICK NICHOLAS PhD, French/Italian,
  kouraste na mpa"inei,       University of Melbourne, Australia
  apo ton kosmo entenh        nickn@unimelb.edu.au
  tsi naxei na orinei?        http://www.opoudjis.net
     --- Dhmhtzh Xouph, _O gerou-Kwstagkh_ (Tsakwniko poihma)

