Date: Wed, 25 Sep 91 01:01:22 -0400
From: cbmvax!uunet!grebyn.com!lojbab (Logical Language Group)
Message-Id: <9109250501.AA09660@grebyn.com>
To: conlang@buphy.bu.edu, lojban@cuvmb.cc.columbia.edu
Subject: recognition scores

Bruce Gilson writes:
>James C. Brown, the inventor of Loglan, used a formula which had the
>advantage of mathematical preciseness; however, insofar as it considers
>the recognizability for speakers of a given language, it seems to me to
>be fatally flawed.  For one thing, it does not take into account the
>fact that the consonants, I believe, count for more than the vowels; for

The formula counts vowels and consonants equally, but this may be more or
less hidden, especially in Brown's versions, by the Loglan phonology
and morphology.  For example, in "galuboi" as you mention below, it is
the two vowels "a" and "u" that cause the 'score'.

In Brown's early use of the algorithm, in Loglanizing the natural
language words, a vowel or consonant that did not match any Loglan one
was treated as a '-', i.e. it counted as an automatic non-match.  After
Chuck Barton's essays around 1980, the tendency was to try for an
optimal mapping of all sounds of the target language to their closest
equivalent in Loglan and this is what we used in Lojban as well.  Brown
was never that consistent in this, of course.  As Carter notes, Brown
(usually, but not always) used Wade-Giles Chinese transcription, which is
not phonetic, as a basis for his Loglanization.  Because many English
sounds are diphthongs, or made into schwa, Brown tended to downweight
English vowels because they never matched exactly.  Thus WE consider
our word for "hammer", "mruli", to have a score of 2/4 from English
"xamr".

This difference is very significant.  I did a test on about 50 Esperanto
words with their matching old Loglan and Lojban words, and Esperanto,
not used in making words at all, scored as high as some of the languages
that >were< used in recognition - because of vowel matches.

(For Lojban's reproducibility, I have complete files on the input to
every word made, by the way.  Many Megabytes)

>another, I am quite certain that disconnected parts should be counted
>for less than connected.

They are and they aren't.  First, a 2 phoneme match counts only if
adjacent or separated by exactly 1 letter.  In "galuboi" the latter
holds.  Three or more phonemes match as long as they occur in the same
order regardless of separation.  This criterion would work if dealing
only with short words, but some Russian and Hindi words are quite long,
and you get garbage matches.
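
A minimal sketch of the matching rule just described, in Python.  It
is an illustration only, not the program actually used.  The function
names are invented here, and two points are assumptions: that the
"adjacent or separated by exactly 1 letter" test has to hold in both
words with the same spacing, and that the match count is divided by
the length of the Lojbanized source word (as the 2/7 and 2/4 figures
in this message suggest).

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest run of letters shared in the same order."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def has_pair_match(candidate: str, source: str) -> bool:
    """True if two letters match, adjacent in both words (gap 1) or
    separated by exactly one letter in both words (gap 2).
    ASSUMPTION: the same spacing is required in both words."""
    for gap in (1, 2):
        in_candidate = {(candidate[i], candidate[i + gap])
                        for i in range(len(candidate) - gap)}
        in_source = {(source[i], source[i + gap])
                     for i in range(len(source) - gap)}
        if in_candidate & in_source:
            return True
    return False


def matched_phonemes(candidate: str, source: str) -> int:
    """3 or more: any in-order match counts.  Exactly 2: only if the
    pair passes the spacing test.  Otherwise nothing counts."""
    n = lcs_length(candidate, source)
    if n >= 3:
        return n
    return 2 if has_pair_match(candidate, source) else 0


def recognition_score(candidate: str, source: str) -> float:
    """Matched phonemes over the length of the Lojbanized source."""
    return matched_phonemes(candidate, source) / len(source)


print(matched_phonemes("blanu", "galuboi"))   # the a-u pair, so 2 of 7
print(matched_phonemes("mruli", "xamr"))      # the adjacent m-r, so 2 of 4
```

Because the three-or-more case ignores separation entirely, a long
source word has many more ways to pick up an accidental in-order
match, which is where the garbage matches come from.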

Incidentally, I think vowel matches are weighted too HIGHLY, and I'm
surprised Bruce, in particular, doesn't know why.  In Arabic and the
Semitic languages in particular, but also to a lesser extent in
Indo-European languages, the vowels ARE less significant to recognition
than the consonants.  In Arabic, the vowels in effect are only
inflectionary in purpose.  In many European languages, the vowels
towards the end of the word change with inflection, but if you chop too
much off (as we did in making Lojban) you skew results in a different
direction.  I think that the first and stressed vowel (if not first) are
the only really important ones in recognition in English.  JCB did his
minimal (and undocumented) "engineering" studies on English speakers
only, and his results are at least suspect.

I in particular feel that vowel pair or even triplet matches are
worthless for recognition in almost all words.

>Thus, for example, JCB's figure of 2/7 for
>Russian "galuboi" in "blanu" (in the original Sci.  Amer. article) is
>probably a great overestimate of the amount by which "blanu" reminds a
>Russian speaker of "galuboi."  I would count close to zero there.  To
>the extent that Lojban keeps JCB's formula, it suffers from the same
>defects.

2/7 for Russian IS close to 0.  With a language weight of less than 10
in old Loglan, this contributed about 2 points to the total score, I think,
in the 70's.  One problem indeed is that Russian and Hindi words, which
are longer, get lower scores overall than their language weight.
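
The arithmetic can be sketched as follows, assuming the total is a sum
of (language weight) x (matched / source length) over the source
languages.  The weight of 7 is a made-up stand-in for "less than 10",
not the historical figure, and the function names are invented for
the illustration.

```python
from fractions import Fraction


def weighted_contribution(matched: int, source_length: int,
                          language_weight: int) -> Fraction:
    """Points one language adds to a candidate word's total."""
    return language_weight * Fraction(matched, source_length)


def total_score(per_language: dict) -> Fraction:
    """Sum over languages of weight * (matched / source length).
    per_language maps a language name to
    (matched, source_length, language_weight)."""
    return sum((weighted_contribution(m, n, w)
                for m, n, w in per_language.values()), Fraction(0))


# HYPOTHETICAL weight of 7 for Russian, chosen only for illustration.
print(weighted_contribution(2, 7, 7))   # 2 matches in a 7-letter word
print(weighted_contribution(2, 4, 7))   # same 2 matches, 4-letter word
```

With the same two matching letters and the same weight, the 7-letter
source word contributes 2 points and the 4-letter one 3 1/2, which is
the long-word penalty mentioned above.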

>Another question is the fact that in English and French, at least,
>spelling may count more than pronunciation for recognizability.

Not if you are trying to maximize recognition of the spoken language, as
presumably Brown was.  You do much better with Loglan/Lojban words spoken
aloud than in print.

But we did do some compromises in this area, especially with vowels.
Thus the 'a' of 'cat' is Lojbanized as "a" instead of the slightly
closer "e".  Note that Brown, and to a lesser extent we, DO use visual
recognition as a factor in Lojbanizing names and borrowed words.  But even
there, people want aural recognition.  Thus one of our first students,
a southern US lady, rejected the simple Lojbanization of her name as "kim.",
choosing instead "ki,ym.", which is virtually unrecognizable without
saying it aloud.

Another factor not mentioned, and still a problem for us in helping
people learn the words, is that Brown's original plan involved
explicitly pointing out the sound hooks for each word as it was being
learned.  You may not guess at random that "cfipu" means "confuse", but
upon being told that the English Lojbanizes as "knfiuz", and that the
match is the sequence "f-i-u" you have three of the 5 letters, and the
handy accidental match of the 'c' makes it even easier.
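
A sound hook of this kind could be pulled out mechanically.  The
sketch below is only an illustration (the name sound_hook is invented
here, and neither LogFlash nor JCB's software does this): it returns
one longest in-order run of letters shared by the Lojban word and the
Lojbanized source word.

```python
def sound_hook(gismu: str, source: str) -> str:
    """Return one longest in-order shared letter sequence, joined
    with hyphens, e.g. for showing to a learner."""
    rows, cols = len(gismu), len(source)
    # table[i][j] = longest shared run between gismu[i:] and source[j:]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if gismu[i] == source[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk forward through the table, collecting the matched letters.
    hook = []
    i = j = 0
    while i < rows and j < cols:
        if gismu[i] == source[j]:
            hook.append(gismu[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "-".join(hook)


print(sound_hook("cfipu", "knfiuz"))   # f-i-u
```

Note that this finds only true sound matches; the accidental visual
match of 'c' with English "confuse" would have to be pointed out
separately.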

We may find a way to eventually incorporate this data in LogFlash for
people learning the language, but we haven't yet done so.  Nor has JCB
in any of his teaching software.

>  Suppose
>we were constructing a loglan (in the generic sense) in which only
>English was to be used in generating a recognizability score.  For the
>word "nation," using pronunciation alone, probably the best word would
>be "necno" or something close to it (the final vowel would be
>arbitrary).  But I guarantee that more _literate_ English speakers would
>recognize "natno" (again, with the variation of the final vowel
>permitted).  If we do not confine ourselves to a Loglan type of
>CCVCV/CVCCV structure, Interlingua's "natione" is just about ideal in
>recognizability, though in pronunciation only the two n's would count in
>a pronunciation-based recognizability score.

If the purpose of your conlang is solely to be read, I would agree.  But
we want people to speak Lojban.  Why enshrine English spelling foibles
in your conlang?  Now if you were building it off European languages, you
would almost certainly get the 'ti' or a 'ci'.  The Lojbanization of the
word, by the way, is "neicn", and it came out "natmi" primarily because
of the lessened but real European influence.


My own ideas - I would give half-weight for any vowel in a diphthong,
accept any vowel as a match OR a null for Arabic, give some weight for
many letters matching but out of order, and give 1/2 credit for consonants
that match except for voicing.  Maybe calculate both a visual and aural
recognition score and use the average, etc.  Lots of things to try.  Luckily
I will be able to try:

In the new version of LogFlash now in Beta test, we have instrumentation
that records when a user gets a word right or wrong, and follows the
progress of each word through the learning cycle.  Probably from as few
as a half dozen speakers who seriously study the language using
LogFlash, we will be able to see what correlations there are between the
Lojban recognition algorithm and actual learning experience.  We can
also see if there is any correlation with scoring as determined by other
algorithms that weren't used, since scores are relative values, not
absolutes, and even the non-best word has some score.
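
As an example of the kind of alternative algorithm that could be
compared this way, here is a sketch of one idea from the list above:
half credit for consonants that match except for voicing.  It is an
untested proposal, not anything we have run.  The table uses the
Lojban voiced/unvoiced consonant pairs; the function names and the
choice to fold the credit into an in-order alignment are assumptions
made for the illustration.

```python
# Lojban unvoiced/voiced consonant pairs.
VOICING_PAIRS = [("p", "b"), ("t", "d"), ("k", "g"),
                 ("f", "v"), ("s", "z"), ("c", "j")]

PARTNER = {}
for unvoiced, voiced in VOICING_PAIRS:
    PARTNER[unvoiced] = voiced
    PARTNER[voiced] = unvoiced


def similarity(x: str, y: str) -> float:
    """1 for an exact match, 1/2 if the two differ only in voicing."""
    if x == y:
        return 1.0
    if PARTNER.get(x) == y:
        return 0.5
    return 0.0


def soft_match(candidate: str, source: str) -> float:
    """Best total credit over all in-order alignments of the two
    words.  ASSUMPTION: partial credit is simply added up along the
    alignment, with no spacing rule for short matches."""
    prev = [0.0] * (len(source) + 1)
    for x in candidate:
        curr = [0.0]
        for j, y in enumerate(source, start=1):
            curr.append(max(prev[j], curr[j - 1],
                            prev[j - 1] + similarity(x, y)))
        prev = curr
    return prev[-1]


print(soft_match("cfipu", "knfiuz"))   # 3.0, no voicing pairs involved
print(soft_match("blanu", "plan"))     # 3.5, b/p earns half credit
```

The second call uses "plan" purely as a made-up source string to show
the half credit at work; it is not an actual etymology of "blanu".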

Anyone interested in learning the Lojban vocabulary simply to be a major
'guinea pig' in a key scientific experiment is welcome to volunteer,
even if you have no other reason to learn Lojban.  You should have daily
ability to use a PC-compatible for 3 months at an hour a day or 4-6
months at 20 minutes to 1/2 hour per day, to give us good data, and be
reasonably committed to following through - a gap of several days with
LogFlash doesn't hurt long-term learning but it will muck up the
statistics, as well as give you deadly practice sessions for 4 or 5 days
after the gap.  We will probably give refunds on the software and maybe
throw in a copy of the books when done for anyone who meets a reasonable
learning schedule and returns a usable data file to us; details of such a
promise are still to be worked out.

Note that we believe the new version of LogFlash to be as effective a
teacher of vocabulary for a new language as there is.  You will learn
the words so well that you can't forget them.  Tommy Whitlock dropped
out for almost 2 years after using LogFlash, but had no trouble with
gismu vocabulary when he came to Lojban conversation sessions.  Nora and I
still occasionally pull up old Loglan words 3 1/2 years after completely
relearning the vocabulary after first learning the old words with
Loglan.  We may start offering the program for use by conlangers who
want to try it with their other languages, but this is only possible if
special tailoring isn't needed.

In a later message Bruce adds:
>However I find that one
>major deterrent to my learning any significant number of Lojban gismu is
>that they have no obvious cognates and so it would be like learning
>Japanese.  If the words were more easily related to something I knew, it
>would help.

Does this mean that if we had the sound cognate assistance that I
mentioned above, that you'd learn Lojban???!!!  Better watch out, we
might take you up on that.  I have 80% of a file of Lojbanized
etymologies compiled, and we may put them into a new release of
LogFlash, perhaps next year.  If it makes a difference to someone who
will learn the language, it will get higher priority.  We've otherwise
considered the etymologies to be 'old stuff' and low priority.

Note that the cognate problem has NOT been that significant.  As with
Carter's example of old Loglan for "hammer", you can pick up a memory
hook for anything - regardless of cognate value.  Nora and I have had
complete mastery of the Lojban word "manci" which means "awe"/"wonder".
She likes to eat bread as a snack, and we were driving behind a 'Wonder
Bread' truck on our honeymoon when she suddenly pointed and said "manci"
("munchies").  Totally unforgettable word ever since.
----
lojbab = Bob LeChevalier, President, The Logical Language Group, Inc.
         2904 Beau Lane, Fairfax VA 22031-1303 USA
         703-385-0273
         lojbab@grebyn.com

NOTE THAT THIS IS A NEW NET ADDRESS AND SUPERSEDES OTHERS IN MY POSTINGS
            OR LOGICAL LANGUAGE GROUP, INC. PUBLICATIONS

For information about Lojban, please provide a snail-post address to me
via mail or phone.  We are funded solely by contributions, which are
encouraged for the purpose of defraying our costs, but are not mandatory.