Date: Mon, 3 Mar 2014 08:57:34 -0800 (PST)
From: Riley Martinez-Lynch
To: lojban@googlegroups.com
Message-Id: <1695f304-a592-4e0f-b5ab-c215c6e80fcc@googlegroups.com>
Subject: [lojban] Historical "finprims" gismu algorithm weights and scores

Hello!

I have been working on the Python implementation of the gismu algorithm
that Broca wrote. I made some updates and released it here:

https://github.com/teleological/gimyzba

I want to ensure that, despite changes to the program, its behavior
continues to conform to the definition of the gismu algorithm, so I
started looking for authoritative scores for the gismu. I think I found
them here:

http://www.lojban.org/publications/etymology/finprims

The "finprims" file includes the lojbanized source language words and the
score associated with each gismu, but doesn't specify the per-language
weights that were used to calculate the scores. CLL says that the
"language weights used to make most of the gismu", based on "1985
number-of-speakers data", were (order: Chinese, English, Hindi, Spanish,
Russian, Arabic):

0.36, 0.21, 0.16, 0.11, 0.09, 0.07
http://dag.github.io/cll/4/14

These look like rounded versions of the numbers given in the "langstat"
documents, which are described as the "1987 gismu-remaking" weights,
"based on the 1985 Brittanica BotY":

.360, .208, .156, .116, .087, .073
http://www.lojban.org/publications/etymology/langstat.99

Neither CLL nor the langstat documents mention finprims specifically, but
it seems plausible that finprims represents the "1987 gismu-remaking
process". Someone with knowledge of the history, please correct me if this
is not a valid assumption.

I plugged both the rounded (CLL) and unrounded (langstat) weights into the
gismu scoring algorithm, but the resulting scores for the gismu that I
tested were quite different from those in finprims. Less surprisingly, the
updated weights from langstat94, langstat95 and langstat99 also failed to
reproduce the finprims scores.

In retrospect, the finprims score of 98.00 for {mamta} -- the only gismu
that matches on all letters of all of the input words -- indicates that
none of these weight sets could have been used to produce that score: the
weight set used to score {mamta} sums to 98, whereas the weight sets from
CLL and the langstat documents sum to 100.

By comparing the scores for various gismu, I was able to deduce a set of
weights that appear to have been used to score finprims:

0.33, 0.18, 0.16, 0.12, 0.12, 0.07
(Chinese, English, Hindi, Spanish, Russian, Arabic)

I confirmed these values by using them to rescore the gismu. Aside from
some rounding errors, they appear to reproduce the finprims scores. Here,
for example, is {ninmu}, which has a match on each of the input words.
Finprims assigns it a score of 60.04. Here are the scores using each of
the other weight sets, with the difference from the finprims score in
parentheses:

CLL  : 62.56 (+2.52)
1987 : 62.58 (+2.54)
1994 : 61.97 (+1.93)
1995 : 61.92 (+1.88)
1999 : 61.19 (+1.15)

"CLL" represents the "language weights used to make most of the gismu" and
"1987" the "1987 gismu-remaking" weights from the langstat documents. The
weights for the other years are also drawn from the langstat documents.
None of these weight sets produces a score close to the finprims value,
but the weight set that I derived from finprims scores it exactly as
60.04.

Given all of this, I'd like to pose the following questions, particularly
to those who may be familiar with the genesis of the gismu:

1. Is the finprims document representative of the gismu-making process
   described in CLL and/or the "1987 gismu-remaking" process? Or were
   these separate efforts?

2. Can anyone confirm the weights that I derived from finprims, or
   alternatively, identify issues in the methodology I'm using to generate
   scores?

3. If these weights are confirmed, is there a record of how they were
   derived? Have they been previously published?

4. Does anyone with a memory of the gismu-making process remember how
   decimal precision and rounding were handled in calculating the scores?
   For example, the letter sequence length scores (2-5) for each input
   word are divided by the length of the corresponding input word. I'd be
   curious to know how the precision of these quotients was handled before
   they were multiplied by the language weights. I'd also like to know how
   the precision of the products was handled, before or after they were
   summed to make the scores.

Thank you for your consideration. I'm enjoying getting to know lojban!

--Riley

mi'e la mukti mu'o
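
The arithmetic behind the {mamta} argument and the {ninmu} comparison in
this message can be checked with a short Python sketch. It uses only
figures quoted in the message. The helper names (max_score, deviations)
are hypothetical, chosen for illustration, and the sketch assumes, as the
{mamta} reasoning does, that a word matching every letter of every input
word scores 100 times the sum of the weights. It does not implement the
gismu scoring algorithm itself.

```python
# Weight sets quoted in the message, in the order
# Chinese, English, Hindi, Spanish, Russian, Arabic.
WEIGHT_SETS = {
    "CLL": (0.36, 0.21, 0.16, 0.11, 0.09, 0.07),
    "langstat 1987": (0.360, 0.208, 0.156, 0.116, 0.087, 0.073),
    "derived from finprims": (0.33, 0.18, 0.16, 0.12, 0.12, 0.07),
}

# {ninmu} scores quoted in the message.
NINMU_FINPRIMS = 60.04
NINMU_SCORES = {
    "CLL": 62.56,
    "1987": 62.58,
    "1994": 61.97,
    "1995": 61.92,
    "1999": 61.19,
}


def max_score(weights):
    """Score of a perfect match, ASSUMING it equals 100 * sum(weights)."""
    return round(100 * sum(weights), 2)


def deviations(scores, reference):
    """Difference between each quoted score and the reference score."""
    return {name: round(score - reference, 2) for name, score in scores.items()}


for name, weights in WEIGHT_SETS.items():
    print(f"{name:22s} max score: {max_score(weights):6.2f}")

for name, diff in deviations(NINMU_SCORES, NINMU_FINPRIMS).items():
    print(f"{name:5s} differs from finprims by {diff:+.2f}")
```

Under that assumption, only the derived weight set is consistent with a
top score of 98.00; the CLL and langstat 1987 sets both give 100.00.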