Return-Path: (5.61/UUNET-internet-primary) id AA18440; Wed, 25 Sep 91 04:12:24 -0400
Date: Wed, 25 Sep 91 01:01:22 -0400
From: cbmvax!uunet!grebyn.com!lojbab (Logical Language Group)
Message-Id: <9109250501.AA09660@grebyn.com>
To: conlang@buphy.bu.edu, lojban@cuvmb.cc.columbia.edu
Subject: recognition scores
Status: RO
X-From-Space-Date: Wed Sep 25 08:08:16 1991
X-From-Space-Address: cbmvax!uunet!grebyn!lojbab

Bruce Gilson writes:
>James C. Brown, the inventor of Loglan, used a formula which had the
>advantage of mathematical preciseness; however, insofar as it considers
>the recognizability for speakers of a given language, it seems to me to
>be fatally flawed. For one thing, it does not take into account the
>fact that the consonants, I believe, count for more than the vowels; for

The formula counts vowels and consonants equally, but this may be more
or less hidden, especially in Brown's versions, by the Loglan phonology
and morphology. For example, in "galuboi" as you mention below, it is
the two vowels "a" and "u" that cause the 'score'.

In Brown's early use of the algorithm, in Loglanizing the natural
language words, a vowel or consonant that did not match any Loglan one
was treated as a '-', i.e. it counted as an automatic non-match. After
Chuck Barton's essays around 1980, the tendency was to try for an
optimal mapping of all sounds of the target language to their closest
equivalent in Loglan and this is what we used in Lojban as well. Brown
was never that consistent in this, of course. As Carter notes, Brown
(usually, but not always) used Wade-Giles Chinese transcription which
is not phonetic as a basis for his Loglanization.

Because many English sounds are diphthongs, or made into schwa, Brown
tended to downweight English vowels because they never matched exactly.
Thus WE consider our word for "hammer", "mruli", to have a score of 2/4
from English "xamr". This difference is very significant.
I did a test on about 50 Esperanto words with their matching old Loglan
and Lojban words, and Esperanto, not used in making words at all, scored
as high as some of the languages that >were< used in recognition -
because of vowel matches. (For Lojban's reproducibility, I have complete
files on the input to every word made, by the way. Many megabytes.)

>another, I am quite certain that disconnected parts should be counted
>for less than connected.

They are and they aren't. First, a 2-phoneme match counts only if the
two phonemes are adjacent or separated by exactly 1 letter. In "galuboi"
the latter holds. Three or more phonemes match as long as they occur in
the same order, regardless of separation. This criterion would work if
we were dealing only with short words, but some Russian and Hindi words
are quite long, and you get garbage matches.

Incidentally, I think vowel matches are weighted too HIGHLY, and I'm
surprised Bruce, in particular, doesn't know why. In Arabic and the
other Semitic languages in particular, but also to a lesser extent in
Indo-European languages, the vowels ARE less significant to recognition
than the consonants. In Arabic, the vowels in effect are only
inflectional in purpose. In many European languages, the vowels towards
the end of the word change with inflection, but if you chop too much off
(as we did in making Lojban) you skew results in a different direction.
I think that the first vowel and the stressed vowel (if it is not the
first) are the only really important ones in recognition in English.
JCB did his minimal (and undocumented) "engineering" studies on English
speakers only, and his results are at least suspect. I in particular
feel that vowel pair or even triplet matches are worthless for
recognition in almost all words.

>Thus, for example, JCB's figure of 2/7 for
>Russian "galuboi" in "blanu" (in the original Sci. Amer. article) is
>probably a great overestimate of the amount by which "blanu" reminds a
>Russian speaker of "galuboi." I would count close to zero there.
>To the extent that Lojban keeps JCB's formula, it suffers from the same
>defects.

2/7 for Russian IS close to 0. With a language weight of less than 10 in
old Loglan, this contributed about 2 points to the total score, I think,
in the 70's. One problem indeed is that Russian and Hindi words, which
are longer, get lower scores overall than their language weight would
suggest.

>Another question is the fact that in English and French, at least,
>spelling may count more than pronunciation for recognizability.

Not if you are trying to maximize recognition of the spoken language, as
presumably Brown was. You do much better with Loglan/Lojban words spoken
aloud than in print. But we did make some compromises in this area,
especially with vowels. Thus the 'a' of 'cat' is Lojbanized as "a"
instead of the slightly closer "e".

Note that Brown, and to a lesser extent we, DO use visual recognition as
a factor in Lojbanizing names and borrowed words. But even there, people
want aural recognition. Thus one of our first students, a southern US
lady, rejected the simple Lojbanization of her name as "kim.", choosing
instead "ki,ym.", which is virtually unrecognizable without saying it
aloud.

Another factor not mentioned, and still a problem for us in helping
people learn the words, is that Brown's original plan involved
explicitly pointing out the sound hooks for each word as it was being
learned. You may not guess at random that "cfipu" means "confuse", but
upon being told that the English Lojbanizes as "knfiuz", and that the
match is the sequence "f-i-u", you have three of the 5 letters, and the
handy accidental match of the 'c' makes it even easier. We may find a
way to eventually incorporate this data in LogFlash for people learning
the language, but we haven't yet done so. Nor has JCB in any of his
teaching software.

>Suppose
>we were constructing a loglan (in the generic sense) in which only
>English was to be used in generating a recognizability score.
>For the word "nation," using pronunciation alone, probably the best
>word would be "necno" or something close to it (the final vowel would be
>arbitrary). But I guarantee that more _literate_ English speakers would
>recognize "natno" (again, with the variation of the final vowel
>permitted). If we do not confine ourselves to a Loglan type of
>CCVCV/CVCCV structure, Interlingua's "natione" is just about ideal in
>recognizability, though in pronunciation only the two n's would count in
>a pronunciation-based recognizability score.

If the purpose of your conlang is solely to be read, I would agree. But
we want people to speak Lojban. Why enshrine English spelling foibles in
your conlang? Now if you were building it off European languages, you
would almost certainly get the 'ti' or a 'ci'. The Lojbanization of the
word, by the way, is "neicn", and it came out "natmi" primarily because
of the lessened but real European influence.

My own ideas - I would give half-weight for any vowel in a diphthong,
accept any vowel as a match OR a null for Arabic, give some weight for
many letters matching but out of order, and give 1/2 credit for
consonants that match except for voicing. Maybe calculate both a visual
and an aural recognition score and use the average, etc. Lots of things
to try.

Luckily I will be able to try: in the new version of LogFlash now in
Beta test, we have instrumentation that records when a user gets a word
right or wrong, and follows the progress of each word through the
learning cycle. Probably from as few as a half dozen speakers who
seriously study the language using LogFlash, we will be able to see what
correlations there are between the Lojban recognition algorithm and
actual learning experience. We can also see if there is any correlation
with scoring as determined by other algorithms that weren't used, since
scores are relative values, not absolutes, and even the non-best word
has some score.
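
For anyone who wants to experiment with alternative scoring schemes, the
matching rule described in this message can be sketched in Python along
the following lines. This is only an illustration, not the program
actually used to make Loglan or Lojban words, and the function names are
hypothetical. It assumes that both words are already Lojbanized strings
with one letter per phoneme, that a two-letter match counts only when
the pair has the same spacing (adjacent, or one letter apart) in both
words, that a lone matching letter scores nothing, and that the score is
the number of matched letters divided by the length of the
source-language word (as in 2/7 for "galuboi").

```python
from fractions import Fraction


def lcs_length(a: str, b: str) -> int:
    """Length of the longest run of letters occurring in the same order
    in both words, with any amount of separation (classic LCS)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j in range(1, len(b) + 1):
            if ch == b[j - 1]:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def has_close_pair(a: str, b: str) -> bool:
    """True if some two letters are adjacent in both words, or separated
    by exactly one letter in both words (assumption: same spacing)."""
    for gap in (1, 2):
        pairs_a = {(a[i], a[i + gap]) for i in range(len(a) - gap)}
        pairs_b = {(b[i], b[i + gap]) for i in range(len(b) - gap)}
        if pairs_a & pairs_b:
            return True
    return False


def matched_letters(candidate: str, source: str) -> int:
    """Number of letters credited as matching under the rule above."""
    n = lcs_length(candidate, source)
    if n >= 3:
        return n          # three or more in order: separation ignored
    if has_close_pair(candidate, source):
        return 2          # exactly two, close together
    return 0              # assumption: anything less scores nothing


def recognition_score(candidate: str, source: str) -> Fraction:
    """Matched letters as a fraction of the source word's length."""
    return Fraction(matched_letters(candidate, source), len(source))


if __name__ == "__main__":
    for cand, src in [("mruli", "xamr"),
                      ("blanu", "galuboi"),
                      ("cfipu", "knfiuz")]:
        print(f"{cand} vs {src}: {matched_letters(cand, src)}/{len(src)}")
```

Run as a script, this gives 2/4 for "mruli" against "xamr", 2/7 for
"blanu" against "galuboi", and 3/6 for "cfipu" against "knfiuz" (the
"f-i-u" match). The ideas listed above (half-weight for diphthong
vowels, partial credit for voicing pairs, and so on) would each need a
different matching function in place of the plain letter equality used
here.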

Anyone interested in learning the Lojban vocabulary simply to be a major
'guinea pig' in a key scientific experiment is welcome to volunteer,
even if you have no other reason to learn Lojban. To give us good data,
you should have daily access to a PC-compatible for 3 months at an hour
a day, or 4-6 months at 20 minutes to 1/2 hour per day, and be
reasonably committed to following through - a gap of several days with
LogFlash doesn't hurt long-term learning, but it will muck up the
statistics, as well as give you deadly practice sessions for 4 or 5 days
after the gap. We will probably give refunds on the software and maybe
throw in a copy of the books when done for anyone who meets a reasonable
learning schedule and returns a usable data file to us; details of such
a promise are still to be worked out.

Note that we believe the new version of LogFlash to be as effective a
teacher of vocabulary for a new language as there is. You will learn the
words so well that you can't forget them. Tommy Whitlock dropped out for
almost 2 years after using LogFlash, but had no trouble with gismu
vocabulary when coming to Lojban conversation sessions. Nora and I still
occasionally pull up old Loglan words 3 1/2 years after completely
relearning the vocabulary after first learning the old words with
Loglan. We may start offering the program for use by conlangers who want
to try it with their other languages, but this is only possible if
special tailoring isn't needed.

In a later message Bruce adds:
>However I find that one
>major deterrent to my learning any significant number of Lojban gismu is
>that they have no obvious cognates and so it would be like learning
>Japanese. If the words were more easily related to something I knew, it
>would help.

Does this mean that if we had the sound cognate assistance that I
mentioned above, you'd learn Lojban???!!! Better watch out, we might
take you up on that.
I have 80% of a file of Lojbanized etymologies compiled, and we may put
them into a new release of LogFlash, perhaps next year. If it makes a
difference to someone who will learn the language, it will get higher
priority. We've otherwise considered the etymologies to be 'old stuff'
and low priority.

Note that the cognate problem has NOT been that significant. As with
Carter's example of old Loglan for "hammer", you can pick up a memory
hook for anything - regardless of cognate value. Nora and I have had
complete mastery of the Lojban word "manci", which means
"awe"/"wonder". She likes to eat bread as a snack, and we were driving
behind a 'Wonder Bread' truck on our honeymoon when she suddenly pointed
and said "manci" ("munchies"). Totally unforgettable word ever since.

----
lojbab = Bob LeChevalier, President, The Logical Language Group, Inc.
         2904 Beau Lane, Fairfax VA 22031-1303 USA
         703-385-0273
         lojbab@grebyn.com

NOTE THAT THIS IS A NEW NET ADDRESS AND SUPERSEDES OTHERS IN MY POSTINGS
OR LOGICAL LANGUAGE GROUP, INC. PUBLICATIONS

For information about Lojban, please provide a snail-post address to me
via mail or phone. We are funded solely by contributions, which are
encouraged for the purpose of defraying our costs, but are not
mandatory.