[lojban-beginners] Re: anti-Zipfian gismu rant
turnip wrote:
> Compare this pair of sentences:
> Old Italian squirrels are stupid, but zebras are smart.
> loi tolci'o natmritaliano bo ritcyratcu cu tolmencre .iki'u xirmrxipotigre cu
> mencre.
I don't know the significance of the English sentence. Is an "old
Italian squirrel" a particular species?
The point is that only someone seeking a literal translation of the
English would say "tolci'o natmritaliano bo ritcyratcu" (and I wouldn't
be using "mencre/tolmencre" for the presumed intelligence contrast,
either - probably instead a pair of lujvo based on menli-kakne-zmadu/mleca).
Zipf's law is intended to deal with use of a language to express
concepts ***in that language***. Much of the time, sentences in a
language, when translated literally into another language, come out
longer, with amounts that vary unpredictably. When speaking in a
language, speakers tend to use forms that are short. All of our
references to Zipf's Law are based on assumptions and predictions as to
what the usage frequencies of words would be among fluent Lojban
speakers speaking the language communicatively among themselves without
reference to any external language (i.e. ignoring translation issues).
Looking at other-language word frequencies in isolation is grossly
misleading. Some words are high frequency because of multiple meanings
that would require several different Lojban words to convey. Some
words, specifically culture-related words, are high frequency in some
cultures and low frequency in others. English speakers may refer to
"Italian" a lot. I doubt that Hindi, Chinese, or Arabic speakers do.
Loglan/Lojban did attempt to consider usage frequency, but our basis was
Helen Eaton's 1930s "Semantic Frequency List" which makes an effort to
account for concepts as opposed to mere word forms. It only covered 4
European languages, but even that removed some of the English biases.
"Squirrel", for example, was among the sixth thousand in English
frequency, but in French, Spanish, and German it was too low a frequency
to be rated (which means that it wasn't in the top 8000).
James Cooke Brown set a priority on having a gismu or a *short* lujvo
for the top 2000 or 3000 concepts.
He also made sure he had covered some lists of what were considered
"fundamental primitive concepts" that had root words in pretty much all
languages (I believe Swadesh had such a list, but I can't recall whether
JCB specifically used that one - to some extent, he made his own list of
this sort using his own research).
After the initial gismu making, he and others freely added gismu rather
haphazardly and without any sort of frequency justification. This led
to such oddities as gismu for "billiards". This was in the era when
fu'ivla had to be in the form of a gismu or lujvo because the other
forms simply weren't allowed.
When we remade the gismu list, we pared off most of the accretions as
belonging in fu'ivla space. "gymnast" barely survived because it was a
category of Olympic sports and hence arguably an international concept -
and we couldn't think of a good short lujvo (to which terms could be
added to indicate particular kinds of gymnastics). At that point there
was no concept of making lujvo using fu'ivla.
We put a lower priority on Eaton's frequencies (mostly because I didn't
want to spend the time going through the list to determine JCB's
justifications for his choices) and put a much higher priority on a
word/concept's usefulness in making lujvo as opposed to its difficulty
of being expressed as a lujvo.
Late in the process, we went through Roget's thesaurus specifically
looking for concepts that could not be easily expressed as lujvo, and
then deciding for each whether it belonged as a gismu or a borrowing,
and again whether it could be useful in making lujvo for other Roget words.
Finally, in an attempt to be culturally neutral and systematic, we saw
several sets of words that were incomplete, many because they included
only Western biased concepts. For animals, plants, food staples, we
sought out the most used concepts in non-Western cultures (hence the
gismu for cassava/taro/starch roots and lotus).
> The Algerian gymnast's cassava is 10^-18 cubits long.
> le le jerxo zajba ku samcu cu xatsi gutci.
JCB had many of the metric prefixes, so we made the set complete. We
added all of the SI (metric) fundamental units as well. Because of the
need to translate non-metric words, we had a parallel set of words for
non-metric units. But not wanting to be biased towards any one culture,
"foot" and "cubit" were combined. The keyword is "cubit" because
"foot" is used as the keyword for the body part, and LogFlash requires
unique keywords; it also stresses both the fact that the word is a
measurement and that it is not limited to the English system of
measurements - the English unit would be a lujvo based on glico-gutci.
> Note the difference in length between the two sentences and their English
> counterpart.
Totally arbitrary, especially since your sentences are obviously
arbitrarily designed to maximize the effect. It is easy to come up with
short grammatically-correct nonsense in any language that may not
translate briefly into another language. Chomsky's
"Colorless green ideas sleep furiously" is probably an English example.
> In the first, all the non-cmavo words are non-gismu, whereas in
> the second, all the non-cmavo words are gismu. The first sentence is almost
> twice as long as the English, whereas the second is about 20% shorter.
> The English sentences are roughly the same length. The relative frequencies of
> the "important" English words in the British National Corpus (appearances per
> million words)* are:
>
> old 524.86, Italian 47.91, squirrel 2.28, stupid 30.89, zebra 2.22, smart 17.32
> Average = 105.913
>
> Algerian 2.18, gymnast 0.19, cassava 0.41, atto- 0.25, cubit 0.09
> Average = 0.624
>
> So how come we have short words (gismu) for the latter set, but very long
> words for the former set?
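
The averages quoted above can be rechecked with a few lines of Python. This is only a sketch: the figures are copied from the list as quoted, and the variable names are made up for illustration. Note that the six figures for the first sentence, as listed, average to about 104.25 rather than the quoted 105.913; the gap may come from rounding or a transcription slip, and it does not change the comparison.

```python
from statistics import mean

# Frequencies per million words, copied from the quoted BNC figures.
# The dictionary names are illustrative, not from the original post.
first_sentence = {
    "old": 524.86, "Italian": 47.91, "squirrel": 2.28,
    "stupid": 30.89, "zebra": 2.22, "smart": 17.32,
}
second_sentence = {
    "Algerian": 2.18, "gymnast": 0.19, "cassava": 0.41,
    "atto-": 0.25, "cubit": 0.09,
}

# Plain arithmetic mean of each set of frequencies.
first_avg = mean(list(first_sentence.values()))
second_avg = mean(list(second_sentence.values()))

print(f"first sentence average:  {first_avg:.3f}")   # about 104.247
print(f"second sentence average: {second_avg:.3f}")  # 0.624
```

The first average is dominated by "old" alone, which is one reason a simple mean over a handful of English words is a rough measure at best.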
An Algerian speaks one of our core languages. And an Italian doesn't
(unless you want to call Italian an eastern dialect of Spanish, or
neo-Latin, in which case it would be a short lujvo).
> (On a side note, my kids have a raccoon puppet, which my brother likes to
> put on and say, "I'm Italian. Yay!" in a silly voice, knowing that it makes
> me laugh uncontrollably because of Lojban's lack of gismu for Italian or
> raccoon :-)
And if we had gismu for both of those, then someone else would have a
llama puppet, and they might be Vietnamese (and puppet isn't a gismu
either).
For that matter, if one of the people from the island south of Australia
was speaking in his language, he might be upset to find the English
translation for his cubit-long animal puppet is "sesquipedalian
Tasmanian devil puppet", which might be expressed rather briefly in the
native language.
lojbab