Sender: lojban@googlegroups.com
Date: Thu, 27 Jul 2017 06:43:39 -0700 (PDT)
From: sukender1@gmail.com
To: lojban <lojban@googlegroups.com>
Message-Id: <c239f7c9-39b7-4bf4-af60-b97c0df89603@googlegroups.com>
In-Reply-To: <3c86d96b-e0ea-af6b-2ee8-51d4e0741fe5@gmail.com>
References: <b8da37b0-bc48-417f-ad27-6ba85424a312@googlegroups.com>
 <3c86d96b-e0ea-af6b-2ee8-51d4e0741fe5@gmail.com>
Subject: Re: [lojban] Spaces in jbovlaste
MIME-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_763_1400951131.1501163019705"
Reply-To: lojban@googlegroups.com
Precedence: list
Mailing-list: list lojban@googlegroups.com; contact lojban+owners@googlegroups.com
X-Spam_score: -2.0
X-Spam_score_int: -19
X-Spam_bar: --

------=_Part_763_1400951131.1501163019705
Content-Type: multipart/alternative; 
	boundary="----=_Part_764_344432803.1501163019705"

------=_Part_764_344432803.1501163019705
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thursday, 27 July 2017 at 13:04:54 UTC+2, Ilmen wrote:
>
> If spell checkers are only concerned with identifying what is a correct=
=20
> word and what isn't,


Exactly! For now, my main goal is a first step towards spell/grammar
checking in common software (see the other thread). It is clearly a
"better than nothing" approach, and yes, clearly sub-optimal.
=20

> then you should disregard Jbovlaste entries=20
> containing whitespace (they are multi-words lexemes), or even better,=20
> check all the words that compose them to see if any of them is missing=20
> from your spell-check whitelist (I strongly suspect there exists bu and=
=20
> zei compounds containing words that appears nowhere else in the=20
> dictionary=E2=80=A6).=20
>

Great! I'll do that. Thanks.
=20
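A minimal sketch of that check in Python, assuming the jbovlaste entries
are already available as a plain list of strings (the names
build_whitelist and entries are hypothetical, not part of any jbovlaste
tool). Single-word entries go straight into the whitelist; multi-word
entries are split so that their component words are added too, and words
that only ever appear inside multi-word entries are reported separately.

```python
def build_whitelist(entries):
    """Return (whitelist, only_in_compounds) for a list of entry strings.

    whitelist: every single word, including parts of multi-word entries.
    only_in_compounds: words seen only inside multi-word entries.
    """
    single_words = set()
    compound_parts = set()
    for entry in entries:
        words = entry.split()
        if len(words) == 1:
            single_words.add(words[0])
        else:
            # Multi-word lexeme such as "re zei zgabube": keep its parts.
            compound_parts.update(words)
    only_in_compounds = compound_parts - single_words
    return single_words | compound_parts, only_in_compounds


# Tiny hand-made sample, not real jbovlaste output.
sample = ["re", "zei", "re zei zgabube", "tai da'i"]
whitelist, missing = build_whitelist(sample)
print(sorted(missing))
```

The second return value is the interesting one: it lists the words that
would have been lost by simply skipping entries containing whitespace.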

> "re zei zgabube" is indeed a sequence of three words. It is present in=20
> the dictionary because it is an independent lexeme, you cannot=20
> accurately derive its meaning from its parts. This occurs all the times=
=20
> in natlangs, think for example to the English "take off".=20
>

Okay. But as you mentioned, spell checkers only check spelling! So
English spell checkers treat "take" and "off" as separate words. A
grammar checker, however, should recognize "take off" as a unit rather
than as "take" and "off" separately.
=20

> As for cmavo sequences, people are allowed to chain them up without=20
> whitespaces in between (this causes no ambiguity), although nowadays it=
=20
> seems more common to always separate them with whitespaces. For a=20
> spell-checker, two strategy are possible: the lazy one would be to=20
> enforce the style of putting whitespaces between every cmavo, thus=20
> marking e.g. "lonu" as incorrect; the second strategy, more involved,=20
> would be to check any unknown letter string to see if it matchs a=20
> sequence of cmavo, and allow it if it does (e.g. if the program hits=20
> "calonu" and is able to find it can be a sequence of cmavo ca+lo+nu,=20
> only then it would allow it). But I don't know if the software you're=20
> using is able to do that without an explicit and systematic list of all=
=20
> allowable cmavo strings=E2=80=A6=20
>

You're right. I guess I'll insert both the "split" and "merged" forms
of jbovlaste entries ("tai da'i" and "taida'i"). But as long as the
reference doesn't list ALL possible combinations ("ca lo nu", "ca lonu",
"calonu", etc.), and as long as there are no rules for generating
"affixes" (i.e. compound-word generation for spell checkers), it will be
hard to be precise.
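
For reference, the second strategy (accept an unknown string only if it
can be cut into known cmavo) can be sketched outside of any spell
checker. This is a plain dynamic-programming word split in Python; the
function name split_cmavo and the sample set are hypothetical, and the
sketch only looks words up in a set, it does not implement the full
Lojban morphology rules.

```python
def split_cmavo(text, cmavo):
    """Split text into known cmavo; return None if impossible."""
    # best[i] holds one split of text[:i], or None if none was found.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in cmavo:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]


# Tiny sample set, not the real cmavo list.
cmavo = {"ca", "lo", "nu", "tai", "da'i"}
print(split_cmavo("calonu", cmavo))
print(split_cmavo("taida'i", cmavo))
print(split_cmavo("calonx", cmavo))
```

With this approach no list of merged forms is needed: only the
individual cmavo have to be known.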

I'll start with a very basic spell checker and maybe add rules later on...
if there are enough people willing to help! I'm clearly not experienced
enough in Lojban to easily tell which rules are the "most important". Can
you think of a few rules that could be integrated?
I guess that the rule "a cmavo can follow a cmavo as a suffix" could be
nice, but I don't know how to implement it. I'm currently struggling with
https://www.systutorials.com/docs/linux/man/4-hunspell/#lbAI
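
One way the "a cmavo can follow a cmavo" rule might be expressed is with
Hunspell's compounding directives rather than with affix rules:
COMPOUNDFLAG marks words that may be joined into compounds, and
COMPOUNDMIN lowers the minimum part length (3 by default, which is too
long for most cmavo). The Python sketch below only generates the text of
a tiny .aff/.dic pair; the function name hunspell_files and the flag
letter C are arbitrary choices, and the result has not been tried
against Hunspell itself.

```python
def hunspell_files(cmavo, other_words):
    """Return (aff_text, dic_text) for a tiny Hunspell dictionary."""
    aff_lines = [
        "SET UTF-8",
        # Words carrying flag C may be joined into compounds.
        "COMPOUNDFLAG C",
        # Allow compound parts shorter than the default of 3 letters.
        "COMPOUNDMIN 1",
    ]
    entries = [word + "/C" for word in sorted(cmavo)]
    entries += sorted(other_words)
    # The first line of a .dic file is the approximate entry count.
    dic_lines = [str(len(entries))] + entries
    return "\n".join(aff_lines) + "\n", "\n".join(dic_lines) + "\n"


# Tiny sample, not the real word lists.
aff_text, dic_text = hunspell_files({"ca", "lo", "nu"}, {"zgabube"})
print(aff_text)
print(dic_text)
```

If this works as the documentation suggests, "calonu" would be accepted
as ca+lo+nu without listing every merged form explicitly.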

> If the software were to need an explicit and exhaustive list of allowed
> words, I guess it wouldn't be very handy to use for very synthetic=20
> languages (e.g. Turkish, Quechua, Greenlandic=E2=80=A6), which might have=
 an=20
> infinite number of valid words.=20
>

Well, that's the "affix" stuff I just wrote about. I don't know anything
about those languages, but presumably their spell-checking dictionaries
include good affix/replacement rules.
=20
Anyway, thank you very much for the clarification.

--=20
Sukender

--=20
You received this message because you are subscribed to the Google Groups "=
lojban" group.
To unsubscribe from this group and stop receiving emails from it, send an e=
mail to lojban+unsubscribe@googlegroups.com.
To post to this group, send email to lojban@googlegroups.com.
Visit this group at https://groups.google.com/group/lojban.
For more options, visit https://groups.google.com/d/optout.

------=_Part_764_344432803.1501163019705--

------=_Part_763_1400951131.1501163019705--