
Re: [lojban] Spaces in jbovlaste



On Thursday, 27 July 2017 at 13:04:54 UTC+2, Ilmen wrote:
If spell checkers are only concerned with identifying what is a correct
word and what isn't,

Exactly! For now, my first concern is to get a first step towards spell/grammar checking for common software (see the other thread). That's clearly a "better than nothing" idea... and yes, it's clearly sub-optimal.
 
then you should disregard Jbovlaste entries
containing whitespace (they are multi-word lexemes), or even better,
check all the words that compose them to see if any of them is missing
from your spell-check whitelist (I strongly suspect there exist bu and
zei compounds containing words that appear nowhere else in the
dictionary…).

Great! I'll do that. Thanks.
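
A minimal sketch of that check, assuming the jbovlaste entries and the whitelist are already loaded as plain Python collections. The function name and the sample data are hypothetical; only "re zei zgabube" and "tai da'i" come from this thread.

```python
def missing_components(entries, whitelist):
    """Return the component words of multi-word entries that are not whitelisted."""
    missing = set()
    for entry in entries:
        words = entry.split()
        if len(words) < 2:
            continue  # single-word entries are assumed to be whitelisted as-is
        missing.update(w for w in words if w not in whitelist)
    return sorted(missing)


# Hypothetical sample data, for illustration only.
entries = ["re zei zgabube", "tai da'i", "klama"]
whitelist = {"re", "zei", "tai", "da'i", "klama"}
print(missing_components(entries, whitelist))
```

Any word this reports would have to be added to the whitelist by hand, since it appears only inside a compound.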
 
"re zei zgabube" is indeed a sequence of three words. It is present in
the dictionary because it is an independent lexeme, you cannot
accurately derive its meaning from its parts. This occurs all the time
in natlangs; think for example of the English "take off".

Okay. But as you mentioned, spell checkers only check spelling! So in the English ones, "take" and "off" are separated. The grammar checker, however, should detect the meaning of "take off" instead of "take" and "off" separately.
 
As for cmavo sequences, people are allowed to chain them up without
whitespace in between (this causes no ambiguity), although nowadays it
seems more common to always separate them with whitespace. For a
spell-checker, two strategies are possible: the lazy one would be to
enforce the style of putting whitespace between every cmavo, thus
marking e.g. "lonu" as incorrect; the second strategy, more involved,
would be to check any unknown letter string to see if it matches a
sequence of cmavo, and allow it if it does (e.g. if the program hits
"calonu" and is able to find it can be a sequence of cmavo ca+lo+nu,
only then would it allow it). But I don't know if the software you're
using is able to do that without an explicit and systematic list of all
allowable cmavo strings…
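
The second strategy could be sketched roughly as below. This is plain dictionary lookup with dynamic programming, not a real Lojban morphology parser, and it is independent of whatever the spell-checking software supports. The function name and the tiny cmavo set are hypothetical; a real list would come from jbovlaste.

```python
def split_cmavo(text, cmavo):
    """Return one split of text into known cmavo, or None if there is none."""
    # best[i] holds a segmentation of text[:i], or None if text[:i] cannot be split
    best = [None] * (len(text) + 1)
    best[0] = []
    longest = max(map(len, cmavo), default=0)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - longest), end):
            piece = text[start:end]
            if best[start] is not None and piece in cmavo:
                best[end] = best[start] + [piece]
                break
    return best[len(text)]


# Hypothetical, deliberately tiny cmavo list.
CMAVO = {"ca", "lo", "nu", "tai", "da'i"}
print(split_cmavo("calonu", CMAVO))
print(split_cmavo("calonx", CMAVO))
```

A string would be accepted only when the function returns a split; anything else stays flagged as unknown.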

You're right. I guess I'll insert both the "split" and "merged" forms of jbovlaste entries ("tai da'i" and "taida'i"). But as long as the reference doesn't list ALL possible combinations ("ca lo nu", "ca lonu", "calonu", etc.), and as long as there are no subtle rules for generating "affixes" (i.e. compound-word generation for spell checkers), it will be hard to be precise.
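
To give an idea of how many combinations that means, here is a small sketch that lists every spacing of a cmavo sequence. The function name is hypothetical. For n words there are 2^(n-1) variants, which is why an explicit list quickly becomes impractical.

```python
from itertools import product


def spacing_variants(words):
    """Return every way of writing the words with or without spaces between them."""
    if not words:
        return []
    variants = []
    # Each gap between two words is either empty or a single space.
    for seps in product(["", " "], repeat=len(words) - 1):
        pieces = [words[0]]
        for sep, word in zip(seps, words[1:]):
            pieces.append(sep + word)
        variants.append("".join(pieces))
    return variants


print(spacing_variants(["ca", "lo", "nu"]))
```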

I'll start with a very basic spell checker and maybe add rules later on... if there are enough people willing to help! I'm clearly not experienced enough in Lojban to easily find the rules which are the "most important". Can you think of a few rules that could be integrated?
I guess that the rule "a cmavo can follow a cmavo as suffix" could be nice, but I don't know how to implement it. I'm currently struggling with https://www.systutorials.com/docs/linux/man/4-hunspell/#lbAI

If the software were to need an explicit and exhaustive list of allowed
words, I guess it wouldn't be very handy to use for very synthetic
languages (e.g. Turkish, Quechua, Greenlandic…), which might have an
infinite number of valid words.

Well, that's the "affix" stuff I just wrote about. I don't know anything about those languages, but surely they have "good" affix/replacement rules in their dictionaries.
 
Anyway, thank you very much for the clarification.

-- 
Sukender
