From: Robert LeChevalier
Date: Thu, 09 Aug 2007 12:10:46 -0400
To: lojban-beginners@lojban.org
Subject: [lojban-beginners] Re: anti-Zipfian gismu rant

turnip wrote:
> Compare this pair of sentences:
> Old Italian squirrels are stupid, but zebras are smart.
> loi tolci'o natmritaliano bo ritcyratcu cu tolmencre .iki'u xirmrxipotigre cu
> mencre.

I don't know the significance of the English sentence.
Is an "old Italian squirrel" a particular species?

The point is that only someone seeking a literal translation of the English would say "tolci'o natmritaliano bo ritcyratcu" (and I wouldn't be using "mencre/tolmencre" for the presumed intelligence contrast, either - probably instead a pair of lujvo based on menli-kakne-zmadu/mleca).

Zipf's law is intended to deal with use of a language to express concepts ***in that language***. Much of the time, sentences in one language, when translated literally into another, come out longer, by amounts that vary unpredictably. When speaking in a language, speakers tend to use forms that are short.

All of our references to Zipf's Law are based on assumptions and predictions as to what the usage frequencies of words would be among fluent Lojban speakers speaking the language communicatively among themselves without reference to any external language (i.e. ignoring translation issues).

Looking at other-language word frequencies in isolation is grossly misleading. Some words are high frequency because they have multiple meanings that would require several different Lojban words to convey. Some words, specifically culture-related words, are high frequency in some cultures and low frequency in others. English speakers may refer to Italian- a lot. I doubt that Hindi, Chinese or Arabic speakers do.

Loglan/Lojban did attempt to consider usage frequency, but our basis was Helen Eaton's 1930s "Semantic Frequency List", which makes an effort to account for concepts as opposed to mere word forms. It covered only 4 European languages, but even that removed some of the English biases. "squirrel", for example, was in the sixth thousand in English frequency, but in French, Spanish, and German its frequency was too low to be rated (which means that it wasn't in the top 8000).

James Cooke Brown set a priority on having a gismu or a *short* lujvo for the top 2000 or 3000 concepts.
He also made sure he had covered some lists of what were considered "fundamental primitive concepts" that had root words in pretty much all languages (I believe Swadesh had such a list, but I can't recall whether JCB specifically used that one - to some extent, he made his own list of this sort using his own research).

After the initial gismu making, he and others added gismu freely, rather haphazardly and without any sort of frequency justification. This led to such oddities as a gismu for "billiards". This was in the era when fu'ivla had to be in the form of a gismu or lujvo because the other forms simply weren't allowed.

When we remade the gismu list, we pared off most of the accretions as belonging in fu'ivla space. "gymnast" barely survived because it was a category of Olympic sport and hence arguably an international concept - and we couldn't think of a good short lujvo (to which terms could be added to indicate particular kinds of gymnastics). At that point there was no concept of making lujvo using fu'ivla.

We put a lower priority on Eaton's frequencies (mostly because I didn't want to spend the time going through the list to determine JCB's justifications for his choices) and put a much higher priority on a word/concept's usefulness in making lujvo as opposed to its difficulty of being expressed as a lujvo.

Late in the process, we went through Roget's thesaurus specifically looking for concepts that could not be easily expressed as lujvo, and then decided for each whether it belonged as a gismu or a borrowing, and again whether it could be useful in making lujvo for other Roget words.

Finally, in an attempt to be culturally neutral and systematic, we looked at several sets of words that were incomplete, many because they included only Western-biased concepts. For animals, plants, and food staples, we sought out the most used concepts in non-Western cultures (hence the gismu for cassava/taro/starch roots and lotus).
> The Algerian gymnast's cassava is 10^-18 cubits long.
> le le jerxo zajba ku samcu cu xatsi gutci.

JCB had many of the metric prefixes, so we made the set complete. We added all of the SI (metric) fundamental units as well. Because of the need to translate non-metric words, we had a parallel set of words for non-metric units. But not wanting to be biased towards any one culture, we combined "foot" and "cubit". The keyword is "cubit" because "foot" is used as the keyword for the body part, and LogFlash requires unique keywords; it also stresses both the fact that the word is a measurement and that it is not limited to the English system of measurements - the English unit would be a lujvo based on glico-gutci.

> Note the difference in length between the two sentences and their English
> counterpart.

Totally arbitrary, especially since your sentences are obviously designed to maximize the effect. It is easy to come up with short, grammatically correct nonsense in any language that may not translate briefly into another language. Chomsky's "Colorless green ideas sleep furiously" is probably an English example.

> In the first, all the non-cmavo words are non-gismu, whereas in
> the second, all the non-cmavo words are gismu. The first sentence is almost
> twice as long as the English, whereas in the second, it is about 20% shorter.
> The English sentences are roughly the same length. The relative frequency of
> the "important" English words in the British National Corpus (appearance per
> million words)* are:
>
> old 524.86 Italian 47.91 squirrel 2.28 stupid 30.89 zebra 2.22 smart 17.32,
> average =105.913
>
> Algerian 2.18 gymnast 0.19 cassava 0.41 atto- 0.25 cubit 0.09
> Average=0.624
>
> So how come we have short words (gismu) for the latter set, but very long
> words for the former set?

An Algerian speaks one of our core languages.
And an Italian doesn't (unless you want to call Italian an eastern dialect of Spanish, or neo-Latin, in which case it would be a short lujvo).

> (On a side note, my kids have a raccoon puppet, which my brother likes to
> put on and say, "I'm Italian. Yay!" in a silly voice, knowing that it makes
> me laugh uncontrollably because of lojban's lack of gismu for Italian or
> raccoon :-)

And if we had gismu for both of those, then someone else would have a llama puppet, and they might be Vietnamese (and puppet isn't a gismu either). For that matter, if one of the people from the island south of Australia were speaking in his own language, he might be upset to find that the English translation for his cubit-long animal puppet is "sesquipedalian Tasmanian devil puppet", which might be expressed rather briefly in the native language.

lojbab
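
The two averages in the quoted British National Corpus figures can be rechecked with a few lines of Python. This is only a sketch: the variable and function names are illustrative, and the numbers are copied as they appear in the quoted message. With the figures as listed, the second average comes out to the quoted 0.624, while the first comes out to about 104.247 rather than the quoted 105.913. The message alone does not show whether a listed figure or the stated average is off.

```python
# Frequencies per million words, copied from the quoted message.
# The dictionary and function names here are illustrative only.
first_set = {
    "old": 524.86,
    "Italian": 47.91,
    "squirrel": 2.28,
    "stupid": 30.89,
    "zebra": 2.22,
    "smart": 17.32,
}
second_set = {
    "Algerian": 2.18,
    "gymnast": 0.19,
    "cassava": 0.41,
    "atto-": 0.25,
    "cubit": 0.09,
}


def mean_frequency(freqs):
    """Arithmetic mean of the per-million frequencies in a word -> frequency dict."""
    return sum(freqs.values()) / len(freqs)


first_mean = mean_frequency(first_set)
second_mean = mean_frequency(second_set)

# Round to three decimals to match the precision used in the quoted message.
print(round(first_mean, 3), round(second_mean, 3))  # prints: 104.247 0.624
```

Either way, the gap between the two sets (roughly two orders of magnitude) is unaffected by the small discrepancy in the first average.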