
Re: [lojban] Etymology of future gismu (if they are to be created)



gleki wrote:
I know from experience that any and all translation programs are horrid at translation.
Furthermore, I don't see any need to include more languages into the algorithm.

Transliteration *may be* horrid indeed (especially in case of Arabic). However, audio recordings can solve this issue.
The algorithm was chosen to make people from all over the world learn words quicker.

That wasn't quite the reason, though certainly JCB believed that it was true. The primary reason was to create a lexicon that was (at least apparently) NOT biased in favor of any one language to an extent that exceeded its natural influence. "Cultural neutrality" was the watchword. There were and are a lot of problems with how JCB formulated the problem, and the dominance of American English semantics on the MEANINGS of the words is what I most fear, but that is what we are stuck with.

We did attempt to gather information using the old LogFlash program to determine whether indeed recognition scores were predictive of word learning. We got maybe a dozen data sets from different people, but my lack of time and statistical analysis skills leaves the analysis of that old data as one of my never-done tasks.

I suspect that there will be some correlation, but it might only exist on those words with higher recognition scores. Since more languages would lower the average score, learnability would likely be hurt.

If so why limit the number of source languages to 6?

Because any more than 6 was counterproductive, leading to essentially random words, and even then Arabic in 6th place had very little Lojbanic significance (in part because of the nature of Arabic morphology). The extreme population dominance of Chinese and English (including 2nd language speakers), and the existence of short roots in those languages, mean that most Lojban words are basically an amalgamation of those two languages, with sometimes a little coloration from one of the other languages.

Remember that a word has to match at least 2 letters (and if only 2, they must be in the right place) in order to contribute to a Lojban recognition score.
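
A minimal sketch of that matching rule follows. It assumes details the paragraph above does not spell out: three or more shared letters count if they appear in the same order, exactly two count only if they are adjacent in both words or separated by one letter in both, and each raw match is divided by the source word's length before being multiplied by the language weight. The function names and the letter strings are hypothetical placeholders, not real transliterations, and this is not the original program.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest run of shared letters appearing in the same order."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def two_letters_in_place(candidate: str, source: str) -> bool:
    """Assumed reading of 'in the right place': the same two letters are
    adjacent in both words, or separated by exactly one letter in both."""
    for gap in (1, 2):
        source_pairs = {(source[i], source[i + gap])
                        for i in range(len(source) - gap)}
        if any((candidate[i], candidate[i + gap]) in source_pairs
               for i in range(len(candidate) - gap)):
            return True
    return False


def raw_match(candidate: str, source: str) -> int:
    """Matched-letter count, or 0 if the word contributes nothing."""
    common = lcs_length(candidate, source)
    if common >= 3:
        return common
    return 2 if two_letters_in_place(candidate, source) else 0


def recognition_score(candidate: str, sources: dict, weights: dict) -> float:
    """Weighted sum of (matched letters / source word length) per language."""
    return sum(weights[lang] * raw_match(candidate, word) / len(word)
               for lang, word in sources.items())


# Placeholder strings; the weights are the 1987 Chinese and English figures.
weights = {"Chinese": 0.36, "English": 0.21}
sources = {"Chinese": "ab", "English": "xbcdy"}
print(recognition_score("abcde", sources, weights))
```

Under these assumptions a short root that matches completely contributes its full weight, while a long root that matches only partly contributes a fraction of its weight, which is consistent with the point about short Chinese and English roots.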

I suspect that any rigorous study would show that the Lojban morphology cannot effectively represent contributions from more than 3 language families (in essence, three languages with other languages possibly reinforcing those three when their roots are similar, which happens most often when they are in the same language family, or when there has been significant borrowing). Most often, only two languages/families are represented.

A couple of percentage points' difference in the weights, and Lojban would look like an amalgamation of Chinese and Hindi. Indeed, per the numbers below, that is what would probably happen now.

We did experiments with more languages, ranging up to 12, but additional languages merely gave lower recognition scores (sometimes leading to tie scores between entirely different strings), and rarely, a letter might change because it gave a couple more points.

If I had it to do over again, I would make a couple of changes in Chinese transliteration (which would give us more "o" and less "a" in the language), and perhaps try to find a way to decrease the reinforcing of fricative sounds that aren't really alike in Chinese. And I would use entirely different rules for Arabic, because vowels count so little in Arabic roots compared to consonants, but the Lojban algorithm weights consonants and vowels more or less equally.

At one point in the 90s, I fiddled with the program to try to do this, but the original program no longer works properly (parts had been coded in assembler to speed up the innermost loops back in the 8086 era when a single word run would take several minutes rather than a few seconds) and I was a little too rusty on my coding skills.

Russian is no longer among first 6.

Actually, I think it still is, though I haven't done the calculations in recent years. The last time I did so, in 2004, it had dropped from 5th into 6th place, but it was still solidly ahead of Bengali because of second language speakers; it is probably closer now because Bengali continues to grow, while Russian is stagnant or waning; both are probably in the neighborhood of 250 million total speakers. But Russian is no more influential in the word-making than Arabic is, though in Russian's case that is primarily because its roots are quite long. Bengali would likely have a little more influence, but only to the extent that its roots reinforce Hindi roots, skewing the language more towards the Chinese/Hindi amalgam mentioned above.

Next after Bengali is Portuguese, because Indonesian is still primarily a second language for most people who speak it, and second language speakers are halved.
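
The halving rule can be sketched as follows. The only thing taken from the text is that second language speakers count half; normalizing the results so they sum to 1 is an assumption, and the language labels and speaker counts are made-up placeholders chosen to show the ranking effect, not real population figures.

```python
def weighted_speakers(first: float, second: float) -> float:
    """First-language speakers count fully, second-language speakers count half."""
    return first + second / 2


def language_weights(speakers: dict) -> dict:
    """Normalize weighted speaker counts so the weights sum to 1 (assumed step)."""
    weighted = {lang: weighted_speakers(first, second)
                for lang, (first, second) in speakers.items()}
    total = sum(weighted.values())
    return {lang: value / total for lang, value in weighted.items()}


# Hypothetical counts in millions: (first-language, second-language).
speakers = {
    "mostly_first": (200, 20),    # 220 total speakers, weighted 210
    "mostly_second": (40, 200),   # 240 total speakers, weighted 140
}
for lang, weight in sorted(language_weights(speakers).items(),
                           key=lambda item: item[1], reverse=True):
    print(f"{lang}: {weight:.2f}")
```

In this toy case the language with more total speakers still ranks lower once its second language speakers are halved, which is the effect described above for Indonesian relative to Portuguese.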

The 2004 weighting would have been
Chinese .33
Hindi   .21
English .18
Spanish .12
Arabic  .09
Russian .07

The 1987 weights were
Chinese .36
Hindi   .16
English .21
Spanish .12
Arabic  .07
Russian .09
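
A quick way to compare the two tables above is to ask which language is the strongest partner for Chinese under each weighting. The sketch below uses only the weights listed; treating the sum of two weights as the ceiling on what a two-language amalgam can score is a simplifying assumption, and the helper names are hypothetical.

```python
WEIGHTS_1987 = {"Chinese": 0.36, "Hindi": 0.16, "English": 0.21,
                "Spanish": 0.12, "Arabic": 0.07, "Russian": 0.09}
WEIGHTS_2004 = {"Chinese": 0.33, "Hindi": 0.21, "English": 0.18,
                "Spanish": 0.12, "Arabic": 0.09, "Russian": 0.07}


def pair_weight(weights: dict, first: str, second: str) -> float:
    """Combined weight of two languages (assumed ceiling for a two-way blend)."""
    return round(weights[first] + weights[second], 2)


def strongest_partner(weights: dict, language: str) -> str:
    """The other language with the largest weight."""
    others = {lang: w for lang, w in weights.items() if lang != language}
    return max(others, key=others.get)


for year, weights in (("1987", WEIGHTS_1987), ("2004", WEIGHTS_2004)):
    partner = strongest_partner(weights, "Chinese")
    print(year, partner, pair_weight(weights, "Chinese", partner))
```

With the 1987 weights the strongest pairing is Chinese with English (.57, against .52 for Chinese with Hindi); with the 2004 weights it flips to Chinese with Hindi (.54, against .51 for Chinese with English).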

If Bengali replaced Russian or were added, this would slightly strengthen Hindi. But its weight would be on the same order as Russian, not enough to actually participate in word-making except where it reinforces the weight of a Hindi root. Even Spanish has insufficient weight to participate in many words, except when it reinforces an English root.

Portuguese would probably significantly reinforce Spanish, perhaps enough to enable it to match English in weight, but otherwise would never make any contribution.

Indonesian wouldn't reinforce anything except where it uses a borrowed word, and thus would have even less effect than Arabic.

And do those 6 languages really represent the majority of the population of the planet?

Actually yes, but not by much. In 2004, the 6 languages represented 2.7 billion first language speakers and 1.5 billion 2nd language speakers (with some overlap, especially among Hindi/English speakers, but probably not so much that the total fails to exceed half of the current 7 billion).
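
The arithmetic behind "not by much" can be checked directly from the figures just given. The sketch assumes the first and second language counts are simply added, so the sum is an upper bound; the size of the actual overlap is not given, and only the room available for it is computed.

```python
FIRST_LANGUAGE_BILLIONS = 2.7   # 2004 figure from the text
SECOND_LANGUAGE_BILLIONS = 1.5  # 2004 figure from the text
WORLD_BILLIONS = 7.0            # "current 7 billion" from the text

# Upper bound: assumes no speaker is counted under two of the six languages.
upper_bound = FIRST_LANGUAGE_BILLIONS + SECOND_LANGUAGE_BILLIONS
half_of_world = WORLD_BILLIONS / 2

# How much double counting the total can absorb before dropping below half.
overlap_margin = upper_bound - half_of_world

print(f"upper bound: {upper_bound:.1f} billion")
print(f"half of world: {half_of_world:.1f} billion")
print(f"room for overlap: {overlap_margin:.1f} billion")
```

So the six languages cover a majority as long as the double-counted speakers number fewer than about 0.7 billion.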

But that wasn't the intent.

The most trustworthy answer is the following.
If adding more languages changes the resulting sounding then 6 languages are not enough.

Redoing the words with the current Hindi weighting would have a big change in the language. So would the change in Chinese transliteration. Any Arabic change would probably help some, but not enough to significantly change the sound of the language. Adding additional languages would probably not change the words much (though there might be some randomization effects), but would lower the recognition scores.

(Masochists who know old Turbo Pascal might be able to do something with the program, including running some trials with different weightings. The source is still floating around somewhere on my machine. But IIRC, the code is poorly-enough documented so that a good programmer could write something from scratch almost as fast, that would allow them to try additional languages and see for themselves that it doesn't buy much.)

lojbab
