From lojban-out@lojban.org Mon Jul 03 13:38:50 2006
Date: Mon, 3 Jul 2006 16:29:01 -0400
To: lojban@yahoogroups.com
From: <lojban-out@lojban.org>
Reply-To: sasxsek@nutter.net
Subject: [lojban] Re: Evaluating gismu

Here it is..  (originally posted to alt.language.artificial).

---
Dana Nutter <dn20056@nutter.net> wrote:

>There is one thing I'd like to know about Lojban, which is the
>formula used for generating the word roots.  Is this published
>anywhere?

It is a multistep algorithm and not a formula.

1.  For each root, identify similar word/words in the 6 source
languages that would serve as cognate memory hooks if the
Loglan/Lojban word was similar enough.  Don't worry about exact
matches in meaning.  Reduce the source word to root form, omitting any
suffixes or declension endings that are not relevant to the target
meaning.

2.  Write these words spelled out phonetically using the Lojban phoneme
structure.  Use consistent rules for phonemes that have no exact
Lojban match (e.g. English th->t).

3. In theory, score all possible 5 letter Lojban root forms using a
weighted scale (following).  In practice, we only scored forms that
used the letters/phonemes in the source languages, which greatly
reduced the computation.

a.  The generated 5 letter form is compared with the Lojbanized source
word.  Considering ONLY letters in their proper order, generate a
fraction whose numerator is 5 if all 5 Lojban letters occur in the
source word in that order, 4 if any 4 of the 5 occur in order, 3 if
any 3 occur in the source word in order.  2 is more complex, requiring
a 2 letter match in which either a) the two matching letters are
adjacent and in the same order in both source and prospective word, or
b) the two letters are separated by exactly one non-matching letter in
both source and prospective word.  1 letter matches are ignored.

b.  The denominator of the fraction is the total number of characters
in the Lojbanized source word.  For example, English "green",
Lojbanized as "grin", would have gotten fraction 4/4 for prospective
forms "GRINo"/"GRINi", etc., fraction 3/4 for "GiRNi" (note that the
out of place "i"s do not count because they are in the wrong order)
and for the word that actually was used, "cRINo", and fraction 2/4 for
such forms as "maGRo" and "maRNo" and "mIRso" and "RoNgi" and "maGdI".

c.  Each of the source languages has a weight proportional to the
number of native speakers of that language plus one half the estimated
number of second language speakers of that language.  (By
proportional, I mean that I calculated the number of speakers for each
language, and then made this a percentage of the sum of the numbers
for all 6 languages.)  I have recalculated these weights several times
over the years, in case new words were to be made, though in fact no
new roots have been added since 1994.

http://lojban.org/publications/draft-dictionary/Working/LANGSTAT.99.txt
has the 1984-1999 values for that calculation, with data for the top
12 languages.

d.  For each prospective wordform, multiply the "fraction" by the
weight, and sum the 6 scores.
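
Steps 3a-3d can be sketched in Python as follows.  This is only an
illustration under stated assumptions, not the program that was
actually used: "letters in their proper order" is read here as a
longest common subsequence, the function names are invented for the
example, and the speaker counts in the demo are placeholders rather
than real statistics.

```python
from typing import Dict, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def has_two_letter_match(candidate: str, source: str) -> bool:
    """True if two letters are adjacent in both words (gap 1), or
    separated by exactly one letter in both words (gap 2)."""
    for gap in (1, 2):
        source_pairs = {(source[i], source[i + gap])
                        for i in range(len(source) - gap)}
        for i in range(len(candidate) - gap):
            if (candidate[i], candidate[i + gap]) in source_pairs:
                return True
    return False


def match_numerator(candidate: str, source: str) -> int:
    """Numerator from step 3a: 3, 4 or 5 for in-order matches, 2 only
    under the adjacency rule, 0 otherwise (1 letter matches ignored)."""
    n = lcs_length(candidate, source)
    if n >= 3:
        return n
    return 2 if has_two_letter_match(candidate, source) else 0


def language_weights(speakers: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Step 3c: native speakers plus half the second language
    speakers, normalized so the weights sum to 1."""
    raw = {lang: native + 0.5 * second
           for lang, (native, second) in speakers.items()}
    total = sum(raw.values())
    return {lang: value / total for lang, value in raw.items()}


def score(candidate: str, sources: Dict[str, str],
          weights: Dict[str, float]) -> float:
    """Step 3d: sum over languages of weight * numerator / denominator,
    where the denominator is the source word length (step 3b)."""
    return sum(weights[lang] * match_numerator(candidate, word) / len(word)
               for lang, word in sources.items())


# Numerators against the Lojbanized English source word "grin".
for form in ("grino", "girni", "crino", "magro", "rongi", "magdi"):
    print(form, match_numerator(form, "grin"), "/", len("grin"))

# Placeholder speaker counts (native, second language); NOT real data.
demo_weights = language_weights({"lang_a": (100, 0), "lang_b": (50, 100)})
print(score("crino", {"lang_a": "grin", "lang_b": "xyz"}, demo_weights))
```

Under this reading the sketch gives 4/4 for "grino", 3/4 for "girni"
and "crino", and 2/4 for "magro", "rongi" and "magdi", as in the
examples above.  It gives 0 for "marno" and "mirso", so the rule that
was actually applied to 2 letter matches may have been looser than the
literal wording.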

4.  List the 30 or 50 or N highest scoring wordforms.  In general the
highest scoring word would be used, but in case of collision (two
different concepts getting the same or too-similar highest scoring
word), we moved down one list or the other until an acceptable word
was found.  This process was manual and somewhat idiosyncratic.  In
addition, sometimes typos were made (the Lojban word for its roots
should have been gicmu and not gismu) but not discovered until too
late.
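
The selection in step 4 might look roughly like the sketch below.  It
is a simplification under assumptions: the real process was manual,
either of two colliding concepts could be moved down its list, and the
post does not define "too-similar", so the too_similar() test here
(same length, at most one differing letter) is purely a placeholder.
All names and scores are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def top_candidates(scores: Dict[str, float], n: int = 30) -> List[Tuple[str, float]]:
    """Return the n highest scoring wordforms, best first.
    Ties are broken alphabetically so the order is deterministic."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:n]


def too_similar(a: str, b: str) -> bool:
    """Placeholder collision test (an assumption, not the real rule):
    same length and differing in at most one letter position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


def pick_word(ranked: Iterable[Tuple[str, float]],
              taken: Iterable[str],
              collides: Callable[[str, str], bool] = too_similar) -> Optional[str]:
    """Walk down the ranked list and return the first wordform that
    does not collide with any word already assigned."""
    taken = list(taken)
    for word, _score in ranked:
        if not any(collides(word, other) for other in taken):
            return word
    return None


# Invented scores for one concept, and one word already in use.
demo_scores = {"grino": 0.9, "crino": 0.8, "magro": 0.3}
ranked = top_candidates(demo_scores, n=30)
print(pick_word(ranked, taken={"grina"}))
```

Here the top form "grino" is rejected because it differs from the
already assigned "grina" in a single letter, so the next form on the
list is chosen instead.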

5.  We tried this algorithm, modified accordingly, with the top 12
languages, top 10 languages, top 8 languages, and top 6 languages.  We
found that for more than 6 languages, for the particular Lojban
morphology rules, the words came out almost like a random generator -
there were usually no especial similarities between the top scoring
word, second scoring word, etc., and the scores were all low and close
together - basically no wordforms were cognate in enough languages to
get a very high score.  With only 6 languages, and with Spanish often
reinforcing English where the latter uses Latinate roots, there was
usually a cluster of high scoring wordforms.

(Other morphology systems might do fine with a different number of
languages represented, but too many language families seems to muddy
the waters.)

http://www.lojban.org/publications/etymology/finprims.html
has the 6 Lojbanized source language words, the match numerators, and
the score and the selected wordform, for all of the Lojban words.
http://www.lojban.org/publications/etymology/etysample.txt
is a selective sampling with explanation of the format, if it isn't
obvious.

In retrospect, the Lojban algorithm is flawed in that:
a) Russian words are often much longer than 5 letters, so they seldom
got a good fraction, and they sometimes got a numerator 3 for a
non-cognate form of 3 vowels or some such that really wouldn't help a
speaker.
b) Chinese words were either really short (2-3 letters) so they had
really good fractions, or they were based on compounds and thus had
reduced cognate value.  Lojbanization of Chinese words was problematic
because too many sounds mapped to Lojban C and S (and j and z to a
lesser extent), and so many words ended in Lojbanized "-in" or "-an"
that a word could get a numerator 2 with no cognate value whatsoever.
On the whole, Chinese cognate value came out pretty high because
Chinese had the highest population weight, and those 100% fractions.
c) Arabic on the other hand came out extremely poorly.  It seldom had
reinforcement from any other language, except for those few words
where an Arabic root has been adopted internationally.  The algorithm
also weighted vowel matches the same as consonant matches, so that a
word could get a numerator "2" fraction matching two vowels (usually a
separated by a consonant) in Arabic even though there is no cognate
value in Arabic to vowel matches (usually not in other languages
either, but Arabic's low population weight meant that this situation
affected Arabic's contribution more than others).

lojbab
-- 
lojbab
lojbabNOSPAM@lojban.org
Bob LeChevalier, Founder, The Logical Language Group
(Opinions are my own; I do not speak for the organization.)
Artificial language Loglan/Lojban:                 http://www.lojban.org