From dbtwery@bellatlantic.net Fri Aug 25 02:23:49 2000
Return-Path: <dbtwery@bellatlantic.net>
Received: (qmail 19381 invoked from network); 25 Aug 2000 09:23:49 -0000
Received: from unknown (10.1.10.27) by m1.onelist.org with QMQP; 25 Aug 2000 09:23:49 -0000
Received: from unknown (HELO smtp-out2.bellatlantic.net) (199.45.39.157) by mta2 with SMTP; 25 Aug 2000 09:23:49 -0000
Received: from voyou (adsl-141-151-15-117.bellatlantic.net [141.151.15.117]) by smtp-out2.bellatlantic.net (8.9.1/8.9.1) with SMTP id FAA20154 for <lojban@egroups.com>; Fri, 25 Aug 2000 05:23:42 -0400 (EDT)
Message-ID: <002a01c00e76$01d770c0$aa45fea9@voyou>
To: "Lojban List" <lojban@egroups.com>
References: <4.2.2.20000823084322.00a24cb0@127.0.0.1> <4.2.2.20000824085154.00a693e0@127.0.0.1>
Subject: Re: [lojban] le stura be la gihuste
Date: Fri, 25 Aug 2000 05:22:34 -0400
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2615.200
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2615.200
From: "David Twery" <dbtwery@bellatlantic.net>
X-Yahoo-Message-Num: 4034


----- Original Message -----
From: Bob LeChevalier (lojbab) <lojbab@lojban.org>
To: Lojban List <lojban@egroups.com>
Sent: Thursday, 24 August 2000 10:23

I may have been working on something that Elrond can use ...

> I think this is perhaps the main difference between the two
> approaches.  When translating, you are focusing on translating all material
> for a single word at one time.  On the other hand, when creating such
> lists, we found it most highly essential to be able to look at the same
> fields for large numbers of words at one time, which allows me to sort on
> the fly on any of the fields, and to compare the same fields of all words
> of certain types; to the extent that the place structure definitions are
> consistent, it has been because we were able to do such multi-word
> comparisons easily in a single page display.  This necessitates either
> fixed length fields, or perhaps a spreadsheet-style database (I haven't
> used spreadsheets in years, so I am not sure the state of the art in
> user-friendliness of display, but I am thinking of Excel, which Nora has
> used on occasion.) which does allow for variable length fields (but I have
> no idea how the data is stored internally).
>
> We have never had a qualm about adding a field later, when it was shown to
> be needed, and I don't think we have the design competence at this point to
> be certain what fields would be needed for a variety of tools, or even for
> a variety of natural languages.  It seems that you want to add all the
> fields now and do the translation for each word for all fields, but
> practically speaking, I don't think that we can reasonably produce even the
> English form of such a list with added fields for grammatical information
> any time soon.

I have been using Excel as a database in my current project, which is to
generate a list of synonyms. After some of the worst god-awful programming
you'd ever want to see, I managed to extract as many of the synonyms in the
baseline gismu list as possible. It was downright easy to import into Excel.

In fact, I was also able to import the cmavo, the tergismu (the "oblique"
file from the Glosser), and Nora's latest lujvo collection. It all took a
few hours, but I think it was well worth it.

I can either upload or directly send you the file, if you would like. I am
still picking out errors (and recovering from a nasty virus episode), but
will be done within a few days. The Excel97 workbook is about 3 MB and
performs quite well. It can also be used to export the data in any number of
standard formats, including MDB, DBF and WKS files.

My longer-range project is to compile various gismu subsets, mainly to
facilitate study (I absolutely HATE Logflash). I already have lists of
colors, numbers, culture and metric gismu, plants and animals and their
by-products, clothing, furniture, and a few more lists.

> We cannot even get people to do keywords for the lujvo and fu'ivla
> lists.  The dikyjvo place structure analysis per the Book is immensely time
> consuming.

You ain't kiddin', Bob. It takes about ten minutes to properly "compose" a
lujvo, and that's if you know the place structure development method.

My Excel database contains 5271 lujvo, from backemselrerkru (hyperbola) to
zvastejbu (conference registration table). About 40 of these have place
structures -- less than 1% if my math is correct.
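
As a rough check on those numbers, here is a back-of-the-envelope sketch in
Python. It assumes the "about 40" is exactly 40 and that every remaining lujvo
takes the full ten minutes mentioned above; both are simplifications, and the
variable names are made up for illustration.

```python
# Back-of-the-envelope figures; the inputs come from the message,
# the exactness of "40" and "10 minutes" is an assumption.
total_lujvo = 5271            # lujvo in the Excel workbook
with_place_structures = 40    # "about 40" have place structures
minutes_per_lujvo = 10        # rough time to "compose" one lujvo

# Share of lujvo that already have place structures, as a percentage.
share_done = with_place_structures / total_lujvo * 100

# Time to finish the rest at the assumed rate, in hours.
remaining = total_lujvo - with_place_structures
hours_left = remaining * minutes_per_lujvo / 60

print(f"{share_done:.2f}% done, {remaining} lujvo left, about {hours_left:.0f} hours of work")
```

So the "less than 1%" holds up, and the remaining work at that pace would run
to several hundred hours for one person.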

> In short, generating data for each of these fields of information is a
> major project in itself.  We have generally done this by getting one
> field/data-category filled in for several hundred words at a time, then
> verified for consistency in style and content over all the words, before
> going on to a new word.  You are in effect proposing that we add several
> more fields to each word, and the easy answer is that it will likely be
> years before they are done for the lujvo and fu'ivla even if we keep it real
> simple.

It could be a lot faster if there was an organized effort (he says, looking
at the ground ;-)

> >As for the implementation of these ideas, the best thing would be an
> >XML-like database format. Unfortunately, I do not (yet) know much about
> >XML parsers and therefore did not bother working on an appropriate set of
> >XML tags.
>
> I have no idea even what XML is, and the only kind of parser I know
> anything about is a YACC parser (and not much even then).

XML may be desirable at some point, but it is not necessary. A spreadsheet is
perfectly adequate: Excel on Windows, and I'm sure that many excellent
spreadsheets exist for the Unix/Linux, Mac, and even PalmPilot worlds.

Output from a spreadsheet can be XML-ized. It may not be a snap, and it may
require some custom programming, but I assure you, it can be done.
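
To give an idea of the kind of custom programming involved, here is a minimal
sketch in Python, assuming the sheet has first been saved as CSV text with a
header row. The function name rows_to_xml, the tag names "wordlist" and
"entry", and the column names "lujvo" and "gloss" are all made up for
illustration; this is not an agreed format, just one way it could go.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(csv_text, root_tag="wordlist", row_tag="entry"):
    """Turn CSV text (header row first) into an XML string.

    Each data row becomes one <entry>, and each column becomes a child
    element named after its header. Headers must therefore be valid XML
    tag names (no spaces, not starting with a digit).
    """
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = ET.SubElement(root, row_tag)
        for field, value in row.items():
            child = ET.SubElement(entry, field)
            child.text = value
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


# Two lujvo taken from this message; the column names are hypothetical.
sample = (
    "lujvo,gloss\n"
    "backemselrerkru,hyperbola\n"
    "zvastejbu,conference registration table\n"
)
print(rows_to_xml(sample))
```

Using the header row for the tag names means a field added to the sheet later
shows up in the XML without touching the code, at the cost of having to keep
the column headers tag-safe.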

> I'll let others tackle improvements.  From my perspective, just producing
> the English files would take man-weeks of effort that we have no one to
> spare; the result would then be used without being checked by the years of
> multiple reviewers and proofreading that have gone into the existing gismu
> list.  And we get no French gismu list until the English is complete.  This
> merely heightens the existing dependence of Lojban on its English roots,
> when I think the goal of translation into other languages is as much as
> possible to cut Lojban loose from English.
>
> The tradeoff is that in cutting loose from English, the solidity of the
> baseline is weakened - will a French Lojbanist working from a French
> translation of the gismu list communicate well with a Russian Lojbanist
> working from a Russian translation, with neither resorting to the English
> lists?  We cannot know until we have those translations.

This comes back to the original problem: Even with mechanical translation
doing 90% of the "grunt" work, someone is going to have to just plow through
the lists and make sure the translations are solid; preferably someone
fluent in both English and the target language(s). Not to mention Lojban
itself.

> If I sound negative, it is not that your ideas are bad, but rather that I
> think the job is too big for the people we have and their low levels of
> time-availability, and ill-suited for the diverse methods different
> volunteers will use in working on it on different computer platforms using
> different software.

My own approach would be to get an English-->French translation program, and
edit the output. Do 50 gismu per session. Then pass it on to a (Francophone)
friend. It would take about 30 sessions to comb through the list.

> (Nora, for example, does all of her word list work on
> paper while riding on the subway, ideally entering it later into the
> computer; she doesn't use any standard format for her notes and comments,
> so only she can enter them into the machine, and she usually does that as
> straight unformatted text.)

You know, if Nora sells 150 boxes of American Friendship® Greeting Cards,
she can win a PalmPilot. And for an extra 50, she can get the bike. (Is that
a _zo'ocai_ or what?)

--d