jbovlaste should already be filtered to contain only Lojban, and there are, broadly, three types of Lojban words:
- cmevla: everything that ends in a consonant
- brivla: all contain a consonant cluster and end in a vowel
- cmavo: optionally start with a single consonant, and consist entirely of vowels and apostrophes after that
So, I think you could filter out all cmavo clusters by looking for anything that matches /.+[^aeiou'].*[aeiou]/ but doesn't match /[^aeiou'][^aeiou']/. In other words: the word contains a non-vowel somewhere after the first letter, ends in a vowel, and doesn't contain a consonant cluster.
At least, that seems like a good start.
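A minimal Python sketch of that test, for illustration only. The two patterns are copied from the message; the function name and the use of `fullmatch` (so that "ends in a vowel" applies to the whole word) are assumptions. Note that, as written, the patterns treat "y" as a non-vowel.

```python
import re

# Patterns copied from the message above.
SHAPE = re.compile(r".+[^aeiou'].*[aeiou]")    # non-vowel after the first letter, final vowel
CLUSTER = re.compile(r"[^aeiou'][^aeiou']")    # two adjacent non-vowels


def looks_like_cmavo_cluster(word: str) -> bool:
    """Hypothetical helper: True if the word has the shape of glued-together cmavo."""
    # fullmatch is an assumption: it anchors the pattern to the whole word.
    has_shape = SHAPE.fullmatch(word) is not None
    has_cluster = CLUSTER.search(word) is not None
    return has_shape and not has_cluster


if __name__ == "__main__":
    for w in ["lo", "lonu", "calonu", "ko'a", "jamna", "lojban"]:
        print(w, looks_like_cmavo_cluster(w))
```

Single cmavo such as "lo" or "ko'a" fail the first pattern, brivla such as "jamna" are rejected by the cluster check, and cmevla such as "lojban" fail because they end in a consonant.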
Thanks for the clarification. I didn't even imagine that this big random compound cmavo would be valid! You made my evening! ;-)
About the {lonu} entry, I clearly agree. But I can't filter them all out... Or can I? If you have any (simple) idea for a rule that does that, then go ahead!
By the way, I already filtered out a few words. I did find some very long ones (even a weird one about the Macarena!). As these may be spam, I added an arbitrary rule that throws away every entry longer than 22 characters. Maybe a finer rule has to be found...
Cheers,
--
Sukender
On 28 July 2017 17:33:14 CEST, Adam Lopresto <adamlopresto@gmail.com> wrote:
If you're going to allow cmavo to be combined arbitrarily (which is probably appropriate), then there's no reason for {lonu} to have its own entry. So I'd suggest not adding any cmavo clusters.
And {lonulonucalo} can be grammatical, you just need the right text after it. {lonulonucalo nu jamna kei mi damba cu nandu mi cu se zungi mi}, "I feel guilty that it was hard for me to fight during the war." As you said, a full grammar checker would be needed to really get things right, and that's a separate problem.
coi la .ilmen.
I just applied your idea (added split entries) and added merged entries... And I also found a very simple way to add compound cmavo!
Indeed:
- I created a script that splits jbovlaste entries into cmavo and non-cmavo, by using a simple regex (using rules listed in the CLL, chapter 4.2)
- Then I tagged all cmavo with a flag "C", and added the Hunspell rule "CCC*" (~= "CC+"), which means you can "glue" 2 or more cmavo together.
Of course, this will allow ungrammatical things such as "lonulonucalo", but once again, catching those is not the spell checker's role.
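The split-and-tag step could look roughly like the sketch below. This is not the actual script: the regex is a simplified reading of the cmavo shape (one optional consonant, then vowels and apostrophes), it ignores "y" and leading dots, and the names `is_simple_cmavo` and `to_dic_lines` are made up for the example. The output follows the usual Hunspell .dic layout, with the entry count on the first line and flags after a slash.

```python
import re

# Simplified cmavo shape: optional single consonant, then vowels/apostrophes.
# Assumption: "y" and leading dots are deliberately not handled here.
SIMPLE_CMAVO = re.compile(r"[bcdfgjklmnprstvxz]?[aeiou][aeiou']*")


def is_simple_cmavo(word: str) -> bool:
    """Hypothetical classifier for a single (non-compound) cmavo."""
    return SIMPLE_CMAVO.fullmatch(word) is not None


def to_dic_lines(words: list[str]) -> list[str]:
    """Build .dic lines, tagging every simple cmavo with the flag "C"."""
    entries = [f"{w}/C" if is_simple_cmavo(w) else w for w in words]
    # A Hunspell .dic file starts with the number of entries.
    return [str(len(entries)), *entries]


if __name__ == "__main__":
    print("\n".join(to_dic_lines(["ca", "lo", "nu", "lonu", "jamna"])))
```

With the entries tagged this way, the "CCC*" rule is what allows two or more "C" words to be glued together; "lonu" stays untagged because it is not a simple cmavo.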
I tried your example "calonu". It seems the "lonu" entry exists, so my dictionary interprets that as a "normal word" (= non-simple-cmavo) instead of a "compound cmavo". But all of the following combinations are now valid:
- ca, lo, nu
- lo nu, lonu, ca lo, calo
- ca lonu, calo nu, calonu
Only calo & calonu are detected as compounds (remember "lonu" is an entry), but anyway that works as expected.
Experimental cmavo support will be added soon.
Do you know of any other rules that would be worth integrating?
I still have issues with dots in LibreOffice (.i .a and such)... And some words of "le cmalu noltru" are not recognized yet. Is there any other word source I can use?
co'o
--
You received this message because you are subscribed to the Google Groups "lojban" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lojban+unsubscribe@googlegroups.com.
To post to this group, send email to lojban@googlegroups.com.
Visit this group at https://groups.google.com/group/lojban.
For more options, visit https://groups.google.com/d/optout.