From: Nick Nicholas
To: lojban@yahoogroups.com
Date: Wed, 11 Dec 2002 22:09:49 +1100
Subject: [lojban] BPFK: Call for volunteers: Concordancer
Message-Id: <1073208E-0CF9-11D7-8AA3-003065D4EC72@optushome.com.au>

Lojbanists,

As Language Design Commission Chair, I hereby issue a call
for volunteers to help construct a concordance-like search engine to
traverse the Lojban corpus. This is in accordance with the plan for the
compilation of the Lojban baseline dictionary (lojbo vlaski
ca'ircukta), outlined in:

http://www.lojban.org/wiki/index.php/Mini-dictionary
http://www.lojban.org/wiki/index.php/Mini-dictionary%20To-do

http://www.tlg.uci.edu/~opoudjis/dist/jbovlacku.html is a very succinct
prototype.

(I am referring to the "Concordancing Software", not the "Voting
Software", which will come later.)

It is perfectly acceptable for volunteers to suggest an already
existing piece of software, or to customise such where legal. However,
there are particular features I would like this engine to have; I do
not believe the current search engines used for Lojban corpora (htDig,
Google, yahoogroups search) do all that is needed. I quote from my
discussion on the Wiki:

> We do not need just a generic web search interface: we want something
> that will pinpoint the word in context accurately (so that the user
> can quickly work out whether the instance found is relevant or not).
> This should also indicate the author where retrievable (and in email
> digests this should be doable readily). And the search needs to be
> open-ended: if there are 10,000 instances, the user should potentially
> be able to browse through the lot. In the first instance, this can be
> just a search for full words; substring searches might help for
> commonly orthographically-compounded cmavo, but my impression is we
> know which the usual compounded cmavo are.

I do not quite know what the corpus will look like yet; but the
majority of it will clearly be mail messages, with at least rudimentary
mail headers. There will also be smidgeons of free form text, irc logs,
html, xml, and mangled ascii versions of palaeolithic Word files (Bob's
JLs &c.)
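
As a rough illustration of the requirements quoted above (full-word
search, the word shown in its line of context, the author where
retrievable, and open-ended results), here is a minimal Python sketch.
It is not an existing Lojban tool. The names Hit and concordance, the
(doc_id, raw_text) input format, the case-insensitive matching, and the
choice to treat only letters and apostrophes as word characters are all
assumptions made for the example.

```python
import re
from dataclasses import dataclass
from email import message_from_string
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Hit:
    doc_id: str            # hypothetical identifier for the source text
    author: Optional[str]  # From: header if one was found, else None
    line_no: int           # 1-based line number within the message body
    context: str           # the whole line containing the word


def concordance(word: str, docs: Iterable[Tuple[str, str]]) -> Iterator[Hit]:
    """Lazily yield one Hit per line that contains `word` as a full word.

    `docs` is an iterable of (doc_id, raw_text) pairs; raw_text may be a
    mail message with headers or plain text without any.
    """
    # Assumption: a Lojban word is a run of letters and apostrophes, so
    # "ca" must not match inside "ca'ircukta". \b would split on the
    # apostrophe, hence the explicit lookarounds.
    pattern = re.compile(
        r"(?<![A-Za-z'])" + re.escape(word) + r"(?![A-Za-z'])",
        re.IGNORECASE,
    )
    for doc_id, raw_text in docs:
        # The stdlib parser tolerates header-less text: everything then
        # ends up in the body and the From: lookup returns None.
        msg = message_from_string(raw_text)
        author = msg.get("From")
        # Multipart messages are not unpacked in this sketch; fall back
        # to searching the raw text.
        body = raw_text if msg.is_multipart() else msg.get_payload()
        for line_no, line in enumerate(body.splitlines(), start=1):
            if pattern.search(line):
                yield Hit(doc_id, author, line_no, line.strip())
```

Because concordance is a generator, a caller can page through a very
large result set (for instance with itertools.islice) without building
the whole list first.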
The corpus retrievable from yahoogroups may not have prolix mail
headers; and the sometimes truncated mail headers in surviving records
of the listserv list have caused web software problems in the past. So
while it would be very nice for the concordancer to identify the author
much of the time, not all text will be so formatted.

People should be able to click from the search result straight to the
text; and the text should be displayed in the search result with a
generous amount of context (a line). Text known to be quoted (in the
standard ways, with > , or &gt; if this is an html-mangled file) should
be optionally skipped in the search results.

We do not need anything fancy, so if anyone thinks they can cook
something appropriate up in an evening of Jolt-fuelled Perl, go right
ahead. Please say so to the list first, though, so we don't have
multiple people inventing the same wheel. The time for that can come
later.

--
Edarh oni oroumene      NICK NICHOLAS PhD, French/Italian,
kouraste na mpa"inei,   University of Melbourne, Australia
apo ton kosmo entenh    nickn@unimelb.edu.au
tsi naxei na orinei?    http://www.opoudjis.net
--- Dhmhtzh Xouph, _O gerou-Kwstagkh_ (Tsakwniko poihma)