From: sukender1@gmail.com
To: lojban
Date: Thu, 27 Jul 2017 06:43:39 -0700 (PDT)
Subject: Re: [lojban] Spaces in jbovlaste
In-Reply-To: <3c86d96b-e0ea-af6b-2ee8-51d4e0741fe5@gmail.com>
Reply-To: lojban@googlegroups.com

On Thursday, 27 July 2017 at 13:04:54 UTC+2, Ilmen wrote:
> If spell checkers are only concerned with identifying what is a correct
> word and what isn't,

Exactly! For now, my first concern is to take a first step towards spell/grammar checking for common software (see the other thread). That's clearly a "better than nothing" idea... and yes, it's clearly sub-optimal.

> then you should disregard Jbovlaste entries
> containing whitespace (they are multi-word lexemes), or even better,
> check all the words that compose them to see if any of them is missing
> from your spell-check whitelist (I strongly suspect there exist bu and
> zei compounds containing words that appear nowhere else in the
> dictionary…).

Great! I'll do that. Thanks.
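
A minimal Python sketch of that check, assuming the jbovlaste entries are already available as plain strings. The function names and the sample whitelist are hypothetical and not part of jbovlaste or of any spell-checking tool:

```python
def component_words(entries):
    """Collect every whitespace-separated word that appears in the entries."""
    words = set()
    for entry in entries:
        words.update(entry.split())
    return words


def missing_components(entries, whitelist):
    """Return words used inside multi-word entries but absent from the whitelist."""
    multi_word = [e for e in entries if len(e.split()) > 1]
    return sorted(component_words(multi_word) - set(whitelist))


# Hypothetical sample data: "zgabube" only occurs inside a zei compound.
entries = ["re zei zgabube", "tai da'i", "lo"]
whitelist = {"re", "zei", "tai", "da'i", "lo"}
print(missing_components(entries, whitelist))
```

Any word reported this way would have to be added to the word list, otherwise the spell checker would flag it inside its own compound.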

> "re zei zgabube" is indeed a sequence of three words. It is present in
> the dictionary because it is an independent lexeme: you cannot
> accurately derive its meaning from its parts. This occurs all the time
> in natlangs; think for example of the English "take off".

Okay. But as you mentioned, spell checkers only check spelling! So in the English ones, "take" and "off" are checked separately. The grammar checker, however, should detect the meaning of "take off" as a unit instead of "take" and "off" separately.

> As for cmavo sequences, people are allowed to chain them up without
> whitespaces in between (this causes no ambiguity), although nowadays it
> seems more common to always separate them with whitespaces. For a
> spell-checker, two strategies are possible: the lazy one would be to
> enforce the style of putting whitespaces between every cmavo, thus
> marking e.g. "lonu" as incorrect; the second strategy, more involved,
> would be to check any unknown letter string to see if it matches a
> sequence of cmavo, and allow it if it does (e.g. if the program hits
> "calonu" and is able to find it can be a sequence of cmavo ca+lo+nu,
> only then it would allow it). But I don't know if the software you're
> using is able to do that without an explicit and systematic list of all
> allowable cmavo strings…

You're right. I guess I'll insert both the "split" and the "merged" forms of jbovlaste entries ("tai da'i" and "taida'i"). But as long as the reference doesn't list ALL possible combinations ("ca lo nu", "ca lonu", "calonu", etc.), and as long as there are no subtle rules for generating "affixes" (i.e. compound-word generation for spell checkers), it will be hard to be precise.
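
To illustrate how quickly those combinations grow, here is a small Python sketch that enumerates every spaced/merged spelling of a cmavo sequence. The name `merge_variants` is made up for this example, and it simply treats every gap as optional without applying any Lojban morphology rule:

```python
from itertools import product


def merge_variants(words):
    """Yield every spelling of the words with each gap either kept or removed."""
    if not words:
        return
    # Each of the len(words) - 1 gaps is either a space or nothing.
    for gaps in product([" ", ""], repeat=len(words) - 1):
        spelling = words[0]
        for gap, word in zip(gaps, words[1:]):
            spelling += gap + word
        yield spelling


print(sorted(merge_variants(["ca", "lo", "nu"])))
# ['ca lo nu', 'ca lonu', 'calo nu', 'calonu']
```

A sequence of n words has 2^(n-1) such spellings, so listing them all explicitly in a word list does not scale beyond short sequences.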
I'll start with a very basic spell checker and maybe add rules later on... if there are enough people willing to help! I'm clearly too inexperienced in Lojban to easily find the rules which are the "most important". Can you think of a few rules that could be integrated?
I guess that the rule "a cmavo can follow a cmavo as a suffix" would be nice, but I don't know how to implement it. I'm currently struggling with https://www.systutorials.com/docs/linux/man/4-hunspell/#lbAI
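
Outside of Hunspell, the rule "a cmavo can follow a cmavo" can at least be prototyped as a word-break search over a list of known cmavo. This is only a sketch in Python: `split_cmavo` and the tiny cmavo set are hypothetical, it checks list membership only (no Lojban morphology), and it does not show how to express the rule in Hunspell's affix file format:

```python
def split_cmavo(text, cmavo):
    """Return one way to split text into known cmavo, or None if there is none."""
    # best[i] holds a segmentation of text[:i], or None if none was found yet.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in cmavo:
                best[end] = best[start] + [text[start:end]]
                break
    return best[len(text)]


# Hypothetical, very small cmavo list for demonstration.
cmavo = {"ca", "lo", "nu", "tai", "da'i"}
print(split_cmavo("calonu", cmavo))   # ['ca', 'lo', 'nu']
print(split_cmavo("taida'i", cmavo))  # ['tai', "da'i"]
print(split_cmavo("calonx", cmavo))   # None
```

An unknown string would be accepted only when the function returns a segmentation, which matches the "calonu" = ca+lo+nu example from the quoted message.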

> If the software were to need an explicit and exhaustive list of allowed
> words, I guess it wouldn't be very handy to use for very synthetic
> languages (e.g. Turkish, Quechua, Greenlandic…), which might have an
> infinite number of valid words.

Well, that's the "affix" stuff I just wrote about. I don't know anything about those languages, but surely they have "good" affix/replacement rules in their dictionaries.

Anyway, thank you very much for the clarification.

-- 
Sukender
