BPFK: Call for volunteers: Concordancer
Lojbanists,
As Language Design Commission Chair, I hereby issue a call for
volunteers to help construct a concordance-like search engine to
traverse the Lojban corpus.
This is in accordance with the plan for the compilation of the Lojban
baseline dictionary (lojbo vlaski ca'ircukta), outlined in:
http://www.lojban.org/wiki/index.php/Mini-dictionary
http://www.lojban.org/wiki/index.php/Mini-dictionary%20To-do
http://www.tlg.uci.edu/~opoudjis/dist/jbovlacku.html is a very succinct
prototype. (I am referring to the "Concordancing Software", not the
"Voting Software", which will come later.)
It is perfectly acceptable for volunteers to suggest an already
existing piece of software, or to customise such where legal. However,
there are particular features I would like this engine to have; I do
not believe the current search engines used for Lojban corpora (htDig,
Google, yahoogroups search) do all that is needed. I quote from my
discussion on the Wiki:
We do not need just a generic web search interface: we want something
that will pinpoint the word in context accurately (so that the user
can quickly work out whether the instance found is relevant or not).
This should also indicate the author where retrievable (and in email
digests this should be doable readily). And the search needs to be
open-ended: if there are 10,000 instances, the user should potentially
be able to browse through the lot. In the first instance, this can be
just a search for full words; substring searches might help for
commonly orthographically-compounded cmavo, but my impression is we
know which the usual compounded cmavo are.
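To make the full-word requirement concrete, here is a minimal Python sketch of a word-in-context search. It is only an illustration, not a proposed design. The function name `kwic`, the tokenising pattern, and the decision to ignore capitalisation are all assumptions of this sketch. It yields matches lazily, so a caller can page through 10,000 hits without building them all at once.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: a word is a run of ASCII letters and apostrophes, so that
# "ca'ircukta" counts as one word. Periods and commas act as separators.
WORD = re.compile(r"[A-Za-z']+")

def kwic(lines: Iterable[str], target: str) -> Iterator[Tuple[int, str]]:
    """Lazily yield (line_number, line) for lines containing target as a full word."""
    wanted = target.lower()  # assumption: capitalisation is ignored when matching
    for number, line in enumerate(lines, start=1):
        for match in WORD.finditer(line):
            # Strip apostrophes used as quotation marks around a word.
            if match.group().strip("'").lower() == wanted:
                yield number, line.rstrip("\n")
                break  # report each line once, even with repeated hits
```

Since `kwic` is a generator, something like `itertools.islice(kwic(lines, "klama"), 0, 50)` would fetch the first page of results and leave the rest for later browsing.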
I do not quite know what the corpus will look like yet; but the
majority of it will clearly be mail messages, with at least rudimentary
mail headers. There will also be smidgeons of free form text, irc logs,
html, xml, and mangled ascii versions of palaeolithic Word files (Bob's
JLs &c.) The corpus retrievable from yahoogroups may not have prolix
mail headers; and the sometimes truncated mail headers in surviving
records of the listserv list have caused web software problems in the
past. So while it would be very nice for the concordancer to identify
the author much of the time, not all text will be so formatted.
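As a rough sketch of how author detection might degrade gracefully, the following uses Python's standard `email` package to read a From header where one exists, and returns nothing otherwise. The function name `find_author` is hypothetical. The sketch also assumes each corpus item is available as a single string, which may not hold for the real corpus.

```python
from email import message_from_string
from email.utils import parseaddr
from typing import Optional

def find_author(raw_text: str) -> Optional[str]:
    """Return the sender's display name or address, or None if no From header is found."""
    message = message_from_string(raw_text)
    header = message.get("From")
    if not header:
        # Free-form text, irc logs, or mail with truncated headers end up here.
        return None
    name, address = parseaddr(str(header))
    # Prefer the display name; fall back to the bare address.
    return name or address or None
```

Text without recognisable headers is simply treated as all body, so the concordancer can still index it and show the author as unknown.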
People should be able to click from the search result straight to the
text; and the text should be displayed in the search result with a
generous amount of context (a line). Text known to be quoted (in the
standard ways, with >, or &gt; if this is an html-mangled file) should
optionally be skipped in the search results.
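The optional skipping of quoted text could be sketched as a simple line filter. This assumes that quoted lines begin with `>` (possibly indented or nested), and that the html-mangled form of that marker is the entity `&gt;`. The names `is_quoted` and `unquoted_lines` are hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: a quoted line starts with optional whitespace, then ">" or "&gt;".
QUOTED = re.compile(r"^\s*(?:>|&gt;)")

def is_quoted(line: str) -> bool:
    """Return True if the line looks like quoted mail text."""
    return bool(QUOTED.match(line))

def unquoted_lines(lines: Iterable[str],
                   skip_quoted: bool = True) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line), leaving out quoted lines when skip_quoted is set."""
    for number, line in enumerate(lines, start=1):
        if skip_quoted and is_quoted(line):
            continue
        yield number, line
```

Original line numbers are kept, so a search result can still link straight to the right place in the full text.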
We do not need anything fancy, so if anyone thinks they can cook
something appropriate up in an evening of Jolt-fuelled Perl, go right
ahead. Please say so to the list first, though, so we don't have
multiple people inventing the same wheel. The time for that can come
later.
--
Edarh oni oroumene NICK NICHOLAS PhD, French/Italian,
kouraste na mpa"inei, University of Melbourne, Australia
apo ton kosmo entenh nickn@unimelb.edu.au
tsi naxei na orinei? http://www.opoudjis.net
--- Dhmhtzh Xouph, _O gerou-Kwstagkh_ (Tsakwniko poihma)