[lojban] Re: Evaluating gismu

To: lojban@yahoogroups.com
Subject: [lojban] Re: Evaluating gismu
From: <lojban-out@lojban.org>
Date: Mon, 3 Jul 2006 16:29:01 -0400
Importance: Normal
In-reply-to: <001001c69ed0$85fec3d0$a9eafea9@kaosorg>
Reply-to: sasxsek@nutter.net

Here it is (originally posted to alt.language.artificial).

---
Dana Nutter <dn20056@nutter.net> wrote:

>There is one thing I'd like to know about Lojban, which is the
>formula used for generating the word roots.  Is this published
>anywhere?

It is a multistep algorithm and not a formula.

1.  For each root, identify a similar word or words in the 6 source
languages that would serve as cognate memory hooks if the
Loglan/Lojban word were similar enough.  Don't worry about exact
matches in meaning.  Reduce the source word to root form, omitting any
suffixes or declension endings that are not relevant to the target
meaning.

2.  Write these words spelled out phonetically using the Lojban phoneme
structure.  Use consistent rules for phonemes that have no exact
Lojban match (e.g. English th->t).

3. In theory, score all possible 5 letter Lojban root forms using a
weighted scale (following).  In practice, we only scored forms that
used the letters/phonemes in the source languages, which greatly
reduced the computation.

a.  The generated 5 letter form is compared with the Lojbanized source
word.  Considering ONLY letters in their proper order, generate a
fraction whose numerator is 5 if all 5 Lojban letters occur in the
source word in that order, 4 if any 4 of the 5 occur in order, 3 if
any 3 occur in the source word in order.  2 is more complex, requiring
that there be a 2 letter match with either a) the two matching letters
being adjacent in the same order in both source and prospective word,
or b) the two letters being separated by exactly one non-matching
letter in both source and prospective word.  1 letter
matches are ignored.

b.  The denominator of the fraction is the total number of characters
in the Lojbanized source word.  E.g. English "green", Lojbanized as
"grin", would have gotten fraction 4/4 for prospective
forms "GRINo"/"GRINi", etc., fraction 3/4 for "GiRNi" (note that the
out of place "i"s do not count because they are in the wrong order)
and for the word that was actually used, "cRINo", and fraction 2/4 for
such forms as "maGRo" and "maRNo" and "mIRso" and "RoNgi" and "maGdI".

c.  Each of the source languages has a weight proportional to the
number of native speakers of that language plus one half the estimated
number of second language speakers of that language.  (By
proportional, I mean that I calculated the number of speakers for each
language, and then made this a percentage of the sum of the numbers
for all 6 languages.)  I have recalculated these weights several times
over the years, in case new words were to be made, though in fact no
new roots have been added since 1994.

http://lojban.org/publications/draft-dictionary/Working/LANGSTAT.99.txt
has the 1984-1999 values for that calculation, with data for the top
12 languages.

d.  For each prospective wordform, multiply the "fraction" for each
language by that language's weight, and sum the 6 scores.
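
A minimal Python sketch of steps a through d may make the scoring
easier to follow.  It is not the original program.  The function names
are illustrative, reading "letters in their proper order" as a longest
common subsequence is an assumption, and the speaker figures passed to
language_weights in the example are placeholders, not the real
LANGSTAT numbers.

```python
from typing import Dict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (letters in order, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def match_numerator(candidate: str, source: str) -> int:
    """Numerator from step a: 3, 4 or 5 in-order letters, or the special 2 rule."""
    n = lcs_length(candidate, source)
    if n >= 3:
        return n
    # Two-letter rule: the same ordered pair must be adjacent in both words
    # (gap 1) or separated by exactly one letter in both words (gap 2).
    for gap in (1, 2):
        cand_pairs = {(candidate[i], candidate[i + gap])
                      for i in range(len(candidate) - gap)}
        src_pairs = {(source[i], source[i + gap])
                     for i in range(len(source) - gap)}
        if cand_pairs & src_pairs:
            return 2
    return 0  # single-letter matches are ignored


def language_weights(native: Dict[str, float],
                     second: Dict[str, float]) -> Dict[str, float]:
    """Step c: native speakers plus half the second-language speakers, normalized."""
    raw = {lang: native[lang] + 0.5 * second.get(lang, 0.0) for lang in native}
    total = sum(raw.values())
    return {lang: value / total for lang, value in raw.items()}


def score(candidate: str, sources: Dict[str, str],
          weights: Dict[str, float]) -> float:
    """Step d: sum over languages of weight * (numerator / source length)."""
    return sum(weights[lang] * match_numerator(candidate, word) / len(word)
               for lang, word in sources.items())


# The "green" example from step b, English only.
for form in ("grino", "girni", "crino", "magro", "rongi", "magdi"):
    print(form, match_numerator(form, "grin"), "/", len("grin"))

# Placeholder speaker counts, for illustration only.
print(language_weights({"x": 100, "y": 50}, {"x": 0, "y": 100}))
```

The loop prints numerators 4, 3, 3, 2, 2 and 2 for the six forms.  Read
literally, the two-letter rule in (a) gives "maRNo" and "mIRso" no
credit against "grin", so the original program may have applied a
looser pair test than this sketch does.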

4.  List the 30 or 50 or N highest scoring wordforms.  In general the
highest scoring word would be used, but in case of collision (two
different concepts getting the same or too-similar highest scoring
word), we moved down one list or the other until an acceptable word
was found.  This process was manual and somewhat idiosyncratic.  In
addition, sometimes typos were made (the Lojban word for its roots
should have been gicmu and not gismu) but not discovered until too
late.

5.  We tried this algorithm, modified accordingly, with the top 12
languages, top 10 languages, top 8 languages, and top 6 languages.  We
found that for more than 6 languages, for the particular Lojban
morphology rules, the words came out almost like a random generator -
there were usually no especial similarities between the top scoring
word, second scoring word, etc., and the scores were all low and close
together - basically no wordforms were cognate in enough languages to
get a very high score.  With only 6 languages, and with Spanish often
reinforcing English where the latter uses Latinate roots, there was
usually a cluster of high scoring wordforms.

(Other morphology systems might do fine with a different number of
languages represented, but too many language families seems to muddy
the waters.)
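
The shortcut in step 3 and the list in step 4 can be sketched roughly
as follows, in Python.  This is an illustration under assumptions, not
the original program: the CVCCV and CCVCV shapes are the standard
Lojban root shapes rather than something stated above, the rules on
permitted consonant pairs are not modeled, the collision handling
(which was manual) is left out, and the scorer passed in the example is
a toy placeholder standing in for the weighted sum of step 3d.

```python
import heapq
from itertools import product
from typing import Callable, Iterable, List, Tuple

VOWELS = "aeiou"
CONSONANTS = "bcdfgjklmnprstvxz"


def candidate_forms(source_words: Iterable[str]) -> List[str]:
    """All CVCCV and CCVCV forms built only from letters in the source words."""
    letters = set("".join(source_words))
    vowels = sorted(letters & set(VOWELS))
    consonants = sorted(letters & set(CONSONANTS))
    forms: List[str] = []
    for shape in ("CVCCV", "CCVCV"):
        pools = [consonants if slot == "C" else vowels for slot in shape]
        forms.extend("".join(combo) for combo in product(*pools))
    return forms


def top_wordforms(source_words: Iterable[str],
                  scorer: Callable[[str], float],
                  n: int = 30) -> List[Tuple[float, str]]:
    """The n highest-scoring candidate forms as (score, form), best first."""
    scored = ((scorer(form), form) for form in candidate_forms(source_words))
    return heapq.nlargest(n, scored)


# Toy placeholder scorer: counts distinct letters shared with "grin".
def toy_scorer(form: str) -> int:
    return len(set(form) & set("grin"))


for value, form in top_wordforms(["grin"], toy_scorer, n=5):
    print(value, form)
```

Restricting the pools to letters that occur in the source words is what
keeps the search small: for the single source word "grin" there are
only 54 forms to score.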

http://www.lojban.org/publications/etymology/finprims.html
has the 6 Lojbanized source language words, the match numerators, and
the score and the selected wordform, for all of the Lojban words.
http://www.lojban.org/publications/etymology/etysample.txt
is a selective sampling with explanation of the format, if it isn't
obvious.

In retrospect, the Lojban algorithm is flawed in that:
a) Russian words are often much longer than 5 letters, so they seldom
got a good fraction, and they sometimes got a numerator 3 for a
non-cognate form of 3 vowels or some such that really wouldn't help a
speaker.
b) Chinese words were either really short (2-3 letters) so they had
really good fractions, or they were based on compounds and thus had
reduced cognate value.  Lojbanization of Chinese words was problematic
because too many sounds mapped to Lojban C and S (and j and z to a
lesser extent), and so many words ended in Lojbanized "-in" or "-an"
that a word could get a numerator 2 with no cognate value whatsoever.
On the whole, Chinese cognate value came out pretty high because
Chinese had the highest population weight, and those 100% fractions.
c) Arabic on the other hand came out extremely poorly.  It seldom had
reinforcement from any other language, except for those few words
where an Arabic root has been adopted internationally.  The algorithm
also weighted vowel matches the same as consonant matches, so that a
word could get a numerator "2" fraction matching two vowels (usually
separated by a consonant) in Arabic even though there is no cognate
value in Arabic to vowel matches (usually not in other languages
either, but Arabic's low population weight meant that this situation
affected Arabic's contribution more than others').

lojbab
-- 
lojbab
lojbabNOSPAM@lojban.org
Bob LeChevalier, Founder, The Logical Language Group
(Opinions are my own; I do not speak for the organization.)
Artificial language Loglan/Lojban:                 http://www.lojban.org



