
Re: [lojban] Spaces in jbovlaste



jbovlaste should already be filtered to contain only Lojban, and there are, broadly, three types of Lojban words:
cmevla are everything that ends in a consonant
brivla all contain a consonant cluster and end in a vowel
cmavo optionally start with a single consonant, and consist entirely of vowels and apostrophes after that.

So, I think you could filter all cmavo clusters by looking for anything that matches /.+[^aeiou'].*[aeiou]/ but doesn't match /[^aeiou'][^aeiou']/. That is, anything that contains a non-vowel somewhere after the first letter, ends in a vowel, and doesn't contain a consonant cluster.

At least, that seems like a good start. 
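
For illustration, here is a minimal Python sketch of that filter. The two patterns are copied from the message above; the function name and the sample words are made up for the example, and anchoring the first pattern to the whole word (so that "ends in a vowel" is actually enforced) is an assumption about what was intended.

```python
import re

# The two patterns from the message, unchanged.
CANDIDATE = re.compile(r".+[^aeiou'].*[aeiou]")
CLUSTER = re.compile(r"[^aeiou'][^aeiou']")


def looks_like_cmavo_cluster(word: str) -> bool:
    """Hypothetical helper: True if `word` looks like several cmavo glued together."""
    # fullmatch() anchors CANDIDATE to the whole word (an assumption, see above).
    has_inner_consonant = CANDIDATE.fullmatch(word) is not None
    # search() looks for two adjacent non-vowels anywhere in the word.
    has_cluster = CLUSTER.search(word) is not None
    return has_inner_consonant and not has_cluster


# Illustrative sample, not taken from jbovlaste.
words = ["lo", "nu", "lonu", "calonu", "klama", "djan", "ki'e"]
clusters = [w for w in words if looks_like_cmavo_cluster(w)]
print(clusters)  # ['lonu', 'calonu']
```

Single cmavo such as "lo" or "ki'e" are left alone because they have no consonant after the first letter, and "klama" is rejected by the second pattern because of "kl".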

On Fri, Jul 28, 2017 at 10:43 AM Sukender <sukender@free.fr> wrote:
Thanks for the clarification. I didn't even imagine that this big random compound cmavo would be valid! You made my evening! ;-)

About the {lonu} entry, I clearly agree. But I can't filter them all out... Or can I? If you have any (simple) idea for a rule that does that, then go ahead!

By the way, I already filtered out a few words. I did find some very long ones (even a weird one about the Macarena!). As they may be spam, I added an arbitrary rule that throws away every entry longer than 22 characters. Maybe a finer rule has to be found...

Cheers,


--
Sukender




On 28 July 2017 17:33:14 CEST, Adam Lopresto <adamlopresto@gmail.com> wrote:
If you're going to allow cmavo to be combined arbitrarily (which is probably appropriate), then there's no reason for {lonu} to have its own entry. So I'd suggest not adding any cmavo clusters.

And {lonulonucalo} can be grammatical, you just need the right text after it. {lonulonucalo nu jamna kei mi damba cu nandu mi cu se zungi mi}, "I feel guilty that it was hard for me to fight during the war." As you said, a full grammar checker would be needed to really get things right, and that's a separate problem.

On Fri, Jul 28, 2017 at 6:54 AM <sukender1@gmail.com> wrote:
coi la .ilmen.

I just applied your idea (added split entries) and added merged entries... And I also found a very simple way to add compound cmavo!
Indeed:
  • I created a script that splits jbovlaste entries into cmavo and non-cmavo, by using a simple regex (using rules listed in the CLL, chapter 4.2)
  • Then I tagged all cmavo with a flag "C", and added the Hunspell rule "CCC*" (~= "CC+"), which means you can "glue" 2 or more cmavo together.
Of course, this will allow ungrammatical things such as "lonulonucalo", but once again, catching those is not the spell-checker's role.
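
The two steps above can be sketched in Python as follows. This is only an illustration, not the actual script or Hunspell itself: the regex is a guess at the cmavo shape (an optional single consonant, then vowels separated by apostrophes, with "y" counted as a vowel), and the function names and sample entries are made up for the example.

```python
import re
from functools import lru_cache

# Assumed cmavo shape; the regex used by the real script is not shown in this thread.
CMAVO_SHAPE = re.compile(r"[bcdfgjklmnprstvxz]?[aeiouy]+(?:'[aeiouy]+)*")


def split_entries(entries):
    """Step 1: split entries into (cmavo, non-cmavo) by shape alone."""
    cmavo = {w for w in entries if CMAVO_SHAPE.fullmatch(w)}
    others = set(entries) - cmavo
    return cmavo, others


def is_compound_cmavo(word, cmavo):
    """Step 2: emulate the "CC+" idea, i.e. `word` is 2 or more cmavo glued together."""

    @lru_cache(maxsize=None)
    def can_split(start, need):
        # `need` is the number of pieces still required (never below 0).
        if start == len(word):
            return need == 0
        for end in range(start + 1, len(word) + 1):
            if word[start:end] in cmavo and can_split(end, max(need - 1, 0)):
                return True
        return False

    return can_split(0, 2)


# Illustrative entries, not the real jbovlaste dump.
cmavo, others = split_entries(["ca", "lo", "nu", "ki'e", "lonu", "klama", "djan"])
print(sorted(cmavo))                              # ['ca', "ki'e", 'lo', 'nu']
print(is_compound_cmavo("calonu", cmavo))         # True
print(is_compound_cmavo("lonulonucalo", cmavo))   # True
print(is_compound_cmavo("lo", cmavo))             # False: a single cmavo is not a compound
```

Like the Hunspell rule, this accepts any run of two or more known cmavo and says nothing about grammar.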

I tried your example "calonu". It seems the "lonu" entry exists, so my dictionary interprets that as a "normal word" (= non-simple-cmavo) instead of a "compound cmavo". But all of the following combinations are now valid:
  • ca, lo, nu
  • lo nu, lonu, ca lo, calo
  • ca lonu, calo nu, calonu
Only calo & calonu are detected as a compound (remember "lonu" is an entry), but anyway that works as expected.
Experimental cmavo support will be added soon.

Do you know of any other rules that would be worth integrating?
Please test ( https://github.com/Sukender/lojban-spell-check-dist ) and give feedback! ki'e

I still have issues with dots in LibreOffice (.i .a and such)... And some words of "le cmalu noltru" are not recognized yet. Is there any other word source I can use?

co'o

-- 
Sukender

--
You received this message because you are subscribed to the Google Groups "lojban" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lojban+unsubscribe@googlegroups.com.
To post to this group, send email to lojban@googlegroups.com.
Visit this group at https://groups.google.com/group/lojban.
For more options, visit https://groups.google.com/d/optout.
