From cbmvax!uunet!PRC.Unisys.COM!dave Tue Jun 4 16:50:34 1991
Return-Path:
From: cbmvax!uunet!PRC.Unisys.COM!dave
Message-Id: <9106042047.AA02798@gem.PRC.Unisys.COM>
Date: Tue, 4 Jun 91 16:45:28 EDT
To: chalmers@violet.berkeley.edu
Cc: lojban-list@snark.thyrsus.com
In-Reply-To: John H. Chalmers Jr.'s message of Tue, 4 Jun 91 05:15:10 PDT <9106041215.AA18545@violet.berkeley.edu>
Subject: Re: vocab sizes
Status: RO

> Can anyone help me with the following questions and/or
> furnish me with references: What languages have the
> largest vocabularies, what are the sizes,

The problem with these questions is that they are very vague. When you
make them specific enough to be answerable, the questions themselves
tend to lose interest.

An early problem you run into is defining the meaning of a "word." Is
"jump" the "same word" as "jumps" and "jumped"? If you count them the
same, then what about "swim", "swam", and "swum"? If you count every
lexicographically different word as different, then doesn't that
artificially inflate the number of words in a language? And how do you
compare the counts with languages in which the words are not similarly
inflected? For example, since Japanese has no noun plurals, by one
counting method it probably has about half as many nouns as English.

Is a "bank" that holds money the same word as a "bank" that holds a
river? How many different "words" in the sentence: "The hot dog ate a
hot dog."?

> how are
> vocabulary sizes measured (spoken words, literary sources,
> official dictionaries, etc.), and how reliable are the
> measures?

Yes. Any of these. Of course, they will all give different answers.
That's a consequence of the fact that a measurement without defining
the measuring instrument is meaningless.

> From what I had read, English was thought to have in
> excess of 500,000 words and Classical Arabic was second
> with about 350 K.
> Recently, a friend told me that English
> is now thought to have about 1.5 M words, Russian nearly 1 M
> and French about 500 K. She also stated that the vocabulary
> sizes of all speakers of natural languages are measured in the
> 100's of thousands of words. Even including transparent compounds,
> derivatives and inflected forms, these numbers seem up to an
> order of magnitude larger than the estimates I have seen in the
> older literature from my college days. Is my knowledge
> out of date?

Webster's Third International Dictionary has about 450,000 entries.
The OED has substantially more, but half the entries would have been
long since forgotten if it were not for the OED itself. Is a word "in"
a language if no one uses it any more?

I know a substantial fraction of the words in Webster's 3rd--I'm not
sure what the fraction is, but half is probably within an order of
magnitude. (This would be an interesting experiment....) So my
vocabulary is maybe 200,000 words. This is a VERY rough estimate.

I also know a few hundred words that are not in Webster's. Many are
from my profession, many are from other areas of interest to me, and
some are just new. (I have absolutely no idea why the word "geas" is
not in there.) So we can add at least a few tens of thousands of words
that are not in Webster's, but are in common use in some segment of
society.

There are also some words used only within my small circle of family
and friends, and nowhere else--should these count?

So essentially, I think the question as it stands is not meaningful.
It can be replaced by questions such as "How many lexicographically
distinct words occur in a random sample of one million words from the
most popular newspaper in the given language?" This question still
needs a few ambiguities cleaned out of it before it can be answered
properly--but it just doesn't have the zing of the original question.

Next week: How big is a ball?
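
The narrowed-down question above (how many distinct written forms
occur in a sample of text) is mechanical enough to sketch in code.
What follows is only a minimal Python sketch under stated assumptions:
a "word" is taken to be a maximal run of letters, with internal
apostrophes allowed, and upper and lower case are folded together. The
function name count_distinct_forms and both of those choices are
illustrative, not any standard measure. Note that it makes no attempt
to merge "jump", "jumps" and "jumped", or to split the two senses of
"bank", which are exactly the ambiguities still to be cleaned out.

```python
import re


def count_distinct_forms(text, fold_case=True):
    """Return (distinct_forms, running_words) for a text sample.

    Assumption: a word is a run of letters, optionally containing
    internal apostrophes ("don't"). Inflected forms are NOT merged and
    homographs are NOT separated.
    """
    if fold_case:
        text = text.lower()
    # [^\W\d_] matches any Unicode letter (word char, not digit or "_").
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text)
    return len(set(tokens)), len(tokens)


if __name__ == "__main__":
    # Five distinct spellings among seven running words.
    print(count_distinct_forms("The hot dog ate a hot dog."))  # (5, 7)
```

Changing the token pattern or the case-folding flag changes the
count, which is the point: the number depends on the measuring
instrument.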