Re: [lojban] Etymology of future gismu (if they are to be created)
gleki wrote:
I know from experience that any and all translation programs are horrid at translation.
Furthermore, I don't see any need to include more languages into the algorithm.
Transliteration *may be* horrid indeed (especially in the case of Arabic). However, audio recordings can solve this issue.
The algorithm was chosen to make people from all over the world learn words quicker.
That wasn't quite the reason, though certainly JCB believed that it was
true. The primary reason was to create a lexicon that was (at least
apparently) NOT biased in favor of any one language to an extent that
exceeded its natural influence. "Cultural neutrality" was the
watchword. There were and are a lot of problems with how JCB formulated
the problem, and the dominance of American English semantics on the
MEANINGS of the words is what I most fear, but that is what we are stuck
with.
We did attempt to gather information using the old LogFlash program to
determine whether indeed recognition scores were predictive of word
learning. We got maybe a dozen data sets from different people, but my
lack of time and statistical analysis skills leaves the analysis of that
old data as one of my never-done tasks.
I suspect that there will be some correlation, but it might only exist
on those words with higher recognition scores. Since more languages
would lower the average score, learnability would likely be hurt.
If so why limit the number of source languages to 6?
Because any more than 6 was counterproductive, leading to essentially
random words, and even then Arabic in 6th place had very little Lojbanic
significance (in part because of the nature of Arabic morphology). The
extreme population dominance of Chinese and English (including 2nd
language speakers), and the existence of short roots in those languages
means that most Lojban words are basically an amalgamation of those two
languages, with sometimes a little coloration of one of the other languages.
Remember that a word has to match at least 2 letters (and if only 2,
they must be in the right place) in order to contribute to a Lojban
recognition score.
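
The matching rule just described can be sketched roughly as follows. This is only an illustration, not the original Turbo Pascal program: the function names are made up, "in the right place" is read here as "the two letters are adjacent in both words, or separated by exactly one letter in both words", and dividing by the length of the source word is likewise an assumption. The strings in the usage lines are invented, not real Lojbanized source words.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (same letters, same order)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def two_letter_match(candidate: str, source: str) -> bool:
    """Assumed reading of 'in the right place': the same pair of letters
    occurs adjacent in both words, or one letter apart in both words."""
    for gap in (1, 2):
        pairs = {(source[j], source[j + gap]) for j in range(len(source) - gap)}
        for i in range(len(candidate) - gap):
            if (candidate[i], candidate[i + gap]) in pairs:
                return True
    return False


def match_score(candidate: str, source: str) -> int:
    """Letters of the source word recognized in the candidate, or 0."""
    n = lcs_length(candidate, source)
    if n >= 3:
        return n
    if n == 2 and two_letter_match(candidate, source):
        return 2
    return 0


def weighted_score(candidate: str, sources: dict, weights: dict) -> float:
    """Sum of per-language scores, each scaled by the language weight and
    (an assumption) divided by the length of the source word."""
    return sum(
        weights[lang] * match_score(candidate, word) / len(word)
        for lang, word in sources.items()
    )


# Invented strings, for illustration only.
print(match_score("karto", "kart"))   # four letters in order
print(match_score("karto", "ka"))     # two adjacent letters
print(match_score("karto", "kxxxa"))  # two letters, but not in the right place
```

Under the length-division assumption, a long source word contributes less per matched letter than a short one.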
I suspect that any rigorous study would show that the Lojban morphology
cannot effectively represent contributions from more than 3 language
families (in essence, three languages with other languages possibly
reinforcing those three when their roots are similar, which happens most
often when they are in the same language family, or when there has been
significant borrowing). Most often, only two languages/families are
represented.
A couple percentage points different, and Lojban would look like an
amalgamation of Chinese and Hindi. Indeed, per the numbers below, that
is what would probably happen now.
We did experiments with more languages, ranging up to 12, but additional
languages merely gave lower recognition scores (sometimes leading to tie
scores between entirely different strings), and rarely, a letter might
change because it gave a couple more points.
If I had it to do over again, I would make a couple of changes in Chinese
transliteration (which would give us more "o" and less "a" in the
language), and perhaps try to find a way to decrease the reinforcing of
fricative sounds that aren't really alike in Chinese. And I would use
entirely different rules for Arabic, because vowels count so little in
their roots compared to consonants, but the Lojban algorithm weights
consonants and vowels more or less equally.
At one point in the 90s, I fiddled with the program to try to do this,
but the original program no longer works properly (parts had been coded
in assembler to speed up the innermost loops back in the 8086 era when a
single word run would take several minutes rather than a few seconds)
and I was a little too rusty on my coding skills.
Russian is no longer among the first 6.
Actually, I think it still is, though I haven't done the calculations in
recent years. The last time I did so, in 2004, it had dropped from 5th
into 6th place, but it was still solidly ahead of Bengali because of
second language speakers; it is probably closer now because Bengali
continues to grow, while Russian is stagnant or waning; both are
probably in the neighborhood of 250 million total speakers. But Russian
is no more influential in the word-making than Arabic is, though in its
case that is primarily because Russian roots are quite long. Bengali would
likely have a little more influence, but only to the extent that its
roots reinforce Hindi roots, skewing the language more towards the
Chinese/Hindi amalgam mentioned above.
Next after Bengali is Portuguese, because Indonesian is still primarily
a second language for most people who speak it, and second language
speakers are halved.
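
As a rough sketch of that ranking rule: count first-language speakers in full and second-language speakers at half, then rank. The function names and the speaker counts below are placeholders invented for illustration, not real figures, and normalizing the top languages so their weights sum to 1 is an assumption.

```python
def effective_speakers(first: float, second: float) -> float:
    """First-language speakers count fully; second-language speakers are halved."""
    return first + second / 2


def language_weights(counts: dict, top_n: int = 6) -> dict:
    """Rank languages by effective speakers and normalize the top_n
    so that their weights sum to 1 (the normalization is an assumption)."""
    eff = {lang: effective_speakers(f, s) for lang, (f, s) in counts.items()}
    top = sorted(eff, key=eff.get, reverse=True)[:top_n]
    total = sum(eff[lang] for lang in top)
    return {lang: eff[lang] / total for lang in top}


# Placeholder counts in millions: (first-language, second-language).
counts = {
    "lang_x": (50, 200),   # mostly second-language speakers, 250 total
    "lang_y": (180, 20),   # mostly first-language speakers, 200 total
}
for lang, (first, second) in counts.items():
    print(lang, effective_speakers(first, second))
```

Here lang_x has more total speakers than lang_y but ranks below it once second-language speakers are halved.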
The 2004 weighting would have been
Chinese .33
Hindi .21
English .18
Spanish .12
Arabic .09
Russian .07
The 1987 weights were
Chinese .36
Hindi .16
English .21
Spanish .12
Arabic .07
Russian .09
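
For anyone who wants to compare the two tables, a small sketch using only the numbers given above (the variable and function names are made up):

```python
WEIGHTS_1987 = {"Chinese": 0.36, "Hindi": 0.16, "English": 0.21,
                "Spanish": 0.12, "Arabic": 0.07, "Russian": 0.09}
WEIGHTS_2004 = {"Chinese": 0.33, "Hindi": 0.21, "English": 0.18,
                "Spanish": 0.12, "Arabic": 0.09, "Russian": 0.07}


def ranking(weights: dict) -> list:
    """Languages ordered from heaviest to lightest weight."""
    return sorted(weights, key=weights.get, reverse=True)


def changes(old: dict, new: dict) -> dict:
    """Per-language change in weight, rounded to two decimals."""
    return {lang: round(new[lang] - old[lang], 2) for lang in old}


print(ranking(WEIGHTS_1987))
print(ranking(WEIGHTS_2004))
print(changes(WEIGHTS_1987, WEIGHTS_2004))
```

The comparison shows Hindi moving ahead of English into second place and Russian slipping from 5th to 6th behind Arabic.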
If Bengali replaced Russian or were added, this would slightly
strengthen Hindi. But its weight would be on the same order as Russian,
not enough to actually participate in word-making except where it
reinforces the weight of a Hindi root. Even Spanish has insufficient
weight to participate in many words, except when it reinforces an
English root.
Portuguese would probably significantly reinforce Spanish, perhaps
enough to enable it to match English in weight, but otherwise would
never make any contribution.
Indonesian wouldn't reinforce anything except where it uses a borrowed
word, and thus would have even less effect than Arabic.
And do those 6 languages really represent the majority of the population of the planet?
Actually yes, but not by much. In 2004, the 6 languages represented 2.7
billion first language speakers and 1.5 billion 2nd language speakers
(with some overlap, especially in Hindi/English speakers, but probably
not so much that the total fails to exceed half of the current 7 billion).
But that wasn't the intent.
The most trustworthy answer is the following.
If adding more languages changes how the resulting words sound, then 6 languages are not enough.
Redoing the words with the current Hindi weighting would make a big
change in the language. So would the change in Chinese transliteration.
Any Arabic change would probably help some, but not enough to
significantly change the sound of the language. Adding additional
languages would probably not change the words much (though there might
be some randomization effects), but would lower the recognition scores.
(Masochists who know old Turbo Pascal might be able to do something with
the program, including running some trials with different weightings.
The source is still floating around somewhere on my machine. But IIRC,
the code is documented poorly enough that a good programmer could
write something from scratch almost as fast, which would allow them to
try additional languages and see for themselves that it doesn't buy much.)
lojbab