Date: Mon, 3 Jul 2006 16:29:01 -0400
To: lojban@yahoogroups.com
Reply-To: sasxsek@nutter.net
Subject: [lojban] Re: Evaluating gismu

Here it is (originally posted to alt.language.artificial).

--- Dana Nutter wrote:
>There is one thing I'd like to know about Lojban, which is the
>formula used for generating the word roots. Is this published
>anywhere?

It is a multistep algorithm and not a formula.

1. For each root, identify a similar word or words in the 6 source languages that would serve as cognate memory hooks if the Loglan/Lojban word were similar enough. Don't worry about exact matches in meaning. Reduce the source word to root form, omitting any suffixes or declension endings that are not relevant to the target meaning.

2. Write these words out phonetically using the Lojban phoneme structure. Use consistent rules for phonemes that have no exact Lojban match (e.g. English th->t).

3. In theory, score all possible 5-letter Lojban root forms using a weighted scale (following). In practice, we only scored forms that used the letters/phonemes found in the source-language words, which greatly reduced the computation.

a. The generated 5-letter form is compared with the Lojbanized source word. Considering ONLY letters in their proper order, generate a fraction whose numerator is 5 if all 5 Lojban letters occur in the source word in that order, 4 if any 4 of the 5 occur in order, 3 if any 3 occur in the source word in order.
A numerator of 2 is more complex: there must be a 2-letter match with either a) the two matching letters adjacent and in the same order in both source and prospective word, or b) the two letters separated by exactly one non-matching letter in both source and prospective word. 1-letter matches are ignored.

b. The denominator of the fraction is the total number of characters in the Lojbanized source word.

e.g. for English "green", Lojbanized as "grin", the prospective forms "GRINo"/"GRINi", etc. would have gotten fraction 4/4; "GiRNi" (note that the out-of-place "i"s do not count because they are in the wrong order) and the word that actually was used, "cRINo", fraction 3/4; and such forms as "maGRo", "maRNo", "mIRso", "RoNgi" and "maGdI", fraction 2/4.

c. Each of the source languages has a weight proportional to the number of native speakers of that language plus one half the estimated number of second-language speakers of that language. (By proportional, I mean that I calculated the number of speakers for each language, and then made this a percentage of the sum of the numbers for all 6 languages.) I have recalculated these weights several times over the years, in case new words would be made, though in fact no new roots have been added since 1994.
http://lojban.org/publications/draft-dictionary/Working/LANGSTAT.99.txt has the 1984-1999 values for that calculation, with data for the top 12 languages.

d. For each prospective wordform, multiply the "fraction" by the weight, and sum the 6 scores.

4. List the 30 or 50 or N highest scoring wordforms. In general the highest scoring word would be used, but in case of collision (two different concepts getting the same or too-similar highest scoring word), we moved down one list or the other until an acceptable word was found. This process was manual and somewhat idiosyncratic.
In addition, sometimes typos were made (the Lojban word for its roots should have been gicmu and not gismu) but not discovered until too late.

5. We tried this algorithm, modified accordingly, with the top 12 languages, top 10 languages, top 8 languages, and top 6 languages. We found that for more than 6 languages, with the particular Lojban morphology rules, the words came out almost as if from a random generator: there were usually no particular similarities between the top scoring word, second scoring word, etc., and the scores were all low and close together. Basically no wordforms were cognate in enough languages to get a very high score. With only 6 languages, and with Spanish often reinforcing English where the latter uses Latinate roots, there was usually a cluster of high scoring wordforms. (Other morphology systems might do fine with a differing number of languages represented, but too many language families seems to muddy the waters.)

http://www.lojban.org/publications/etymology/finprims.html has the 6 Lojbanized source language words, the match numerators, the score, and the selected wordform, for all of the Lojban words.
http://www.lojban.org/publications/etymology/etysample.txt is a selective sampling with explanation of the format, if it isn't obvious.

In retrospect, the Lojban algorithm is flawed in that

a) Russian words are often much longer than 5 letters, so they seldom got a good fraction, and they sometimes got a numerator 3 for a non-cognate form of 3 vowels or some such that really wouldn't help a speaker

b) Chinese words were either really short (2-3 letters) so they had really good fractions, or they were based on compounds and thus had reduced cognate value. Lojbanization of Chinese words was problematic because too many sounds mapped to Lojban C and S (and j and z to a lesser extent), and so many words ended in Lojbanized "-in" or "-an" that a word could get a numerator 2 with no cognate value whatsoever.
On the whole, Chinese cognate value came out pretty high because Chinese had the highest population weight, and those 100% fractions.

c) Arabic on the other hand came out extremely poorly. It seldom had reinforcement from any other language, except for those few words where an Arabic root has been adopted internationally. The algorithm also weighted vowel matches the same as consonant matches, so that a word could get a numerator "2" fraction matching two vowels (usually separated by a consonant) in Arabic even though there is no cognate value in Arabic to vowel matches (usually not in other languages either, but Arabic's low population weight meant that this situation affected the Arabic contribution more than others).

lojbab

--
lojbab                                  lojbabNOSPAM@lojban.org
Bob LeChevalier, Founder, The Logical Language Group
(Opinions are my own; I do not speak for the organization.)
Artificial language Loglan/Lojban:      http://www.lojban.org
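
The scoring in steps 3a through 3d and the ranking in step 4 can be sketched in Python roughly as follows. This is only an illustrative sketch of the rules as worded above, not the program that was actually used. All function names are made up for illustration, weights are expressed as fractions of 1 rather than percentages, and the 2-letter rule is implemented literally (same spacing in both words), which may be stricter than whatever the original program did.

```python
def lcs_length(a, b):
    """Length of the longest run of letters common to a and b, in order
    (not necessarily adjacent): the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def match_numerator(candidate, source):
    """Step 3a: numerator of the match fraction for one source word."""
    n = lcs_length(candidate, source)
    if n >= 3:
        return n
    # 2-letter rule: the same two letters, in the same order, either
    # adjacent in both words (step 1) or one letter apart in both (step 2).
    # With fewer than 3 letters in common, the letter in between cannot
    # itself be a match, so it needs no separate check.
    for step in (1, 2):
        cand_pairs = {(candidate[i], candidate[i + step])
                      for i in range(len(candidate) - step)}
        src_pairs = {(source[j], source[j + step])
                     for j in range(len(source) - step)}
        if cand_pairs & src_pairs:
            return 2
    return 0  # 1-letter matches are ignored


def language_weights(speakers):
    """Step 3c: speakers maps language -> (native, second_language) counts.
    Returns each language's share of the adjusted total."""
    raw = {lang: native + 0.5 * second
           for lang, (native, second) in speakers.items()}
    total = sum(raw.values())
    return {lang: value / total for lang, value in raw.items()}


def score(candidate, sources, weights):
    """Steps 3b and 3d: sum of weight * numerator / source word length."""
    return sum(weights[lang] * match_numerator(candidate, word) / len(word)
               for lang, word in sources.items())


def top_forms(candidates, sources, weights, n=30):
    """Step 4: the n highest-scoring candidate forms, best first."""
    return sorted(candidates,
                  key=lambda c: score(c, sources, weights),
                  reverse=True)[:n]


# "grin" is the Lojbanized English word from the example in step 3.
print(match_numerator("grino", "grin"))  # 4
print(match_numerator("crino", "grin"))  # 3
print(match_numerator("magro", "grin"))  # 2
```

To score against all 6 languages, `sources` would map each language to its Lojbanized word and `weights` would come from `language_weights` applied to real speaker counts; neither is reproduced here. The manual collision handling described in step 4 is not modelled.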