
Re: [lojban] Spaces in jbovlaste



coi la .adam. .i coi ro do

Sorry for the late answer; I was tweaking my scripts and tools according to your advice and the CLL.
I did not take the exact regex you proposed, but I included your idea. So: thanks! Could you possibly review my regexes (see links to scripts below)?

For your information, and based on your idea and Ilmen's, I added three-step processing:
  1. Clean the input in a generic way: tabs/spaces, split entries with spaces, etc. (see the sed script at this point, or its latest version)
  2. Clean from a "Lojbanic" point of view: remove non-Lojban entries, prepend a dot to words starting with a vowel, etc. (current script / latest version)
  3. Split entries into cmevla, cmavo, compound cmavo, and a few other classes (current script / latest version)
The current results are:
38 "illegal" words and 430 duplicates (mainly produced by splitting, e.g. when the entries "lo nu", "lo", and "nu" are all processed)
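
As a rough illustration of step 2 and of where the duplicates come from, here is a minimal Python sketch. It is not the sed script mentioned above: the names `add_leading_dot` and `dedupe` are hypothetical, and it assumes the entries are already lowercase and separated by single spaces.

```python
VOWELS = "aeiou"


def add_leading_dot(word):
    """Prepend a dot to a word that starts with a vowel (assumed convention)."""
    if word and word[0] in VOWELS:
        return "." + word
    return word


def dedupe(words):
    """Drop repeated words while keeping the first-seen order."""
    return list(dict.fromkeys(words))


# Splitting "lo nu" yields "lo" and "nu", which are also entries of their own.
entries = ["lo nu", "lo", "nu", "a'a"]
words = [add_leading_dot(w) for entry in entries for w in entry.split()]
unique = dedupe(words)
```

Here `words` contains "lo" and "nu" twice each, which is the kind of duplicate counted above; `dedupe` keeps one copy of each.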

The splitter produces output like the following (only a few lines are shown for each class):
--- cmavo ---
.a
.a'a
.a'au
.a'e
.a'ei
.ai
.a'i
--- cmavo_compound ---
.a'acu'i
.a'anai
.a'enai
.a'icu'i
.aicu'i
.ainai
.a'inai
--- brivla ---
.a'anmo
.abniena
.abvele
.aclotlu
.adgalagda
.adji
.admine
--- vowel ---
.abu
.ebu
.ibu
--- consonant ---
by
cy
dy
--- cmevla ---
.abata'adj
.abgad
.acaman
.akev
.akrobat
.akuuas
.aleksandras
--- other ---
(empty list)



co'o

-- 
Sukender


On Friday, 28 July 2017 at 17:51:56 UTC+2, Adam Lopresto wrote:
jbovlaste should already be filtered to contain only Lojban, and there are, broadly, three types of Lojban words:
cmevla are everything that ends in a consonant
brivla all contain a consonant cluster and end in a vowel
cmavo optionally start with a single consonant, and consist entirely of vowels and apostrophes after that.
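
The three rules above can be sketched in Python roughly as follows. This is only an illustration of the rules as stated, not anyone's actual script: the function name `classify`, the explicit consonant list, and the "other" fallback are assumptions, and words containing `y` (such as "by") deliberately fall through to "other".

```python
import re

# Assumption: the Lojban consonants, listed explicitly; y is not included.
CONS = "bcdfgjklmnprstvxz"

CMEVLA = re.compile(rf".*[{CONS}]")                 # ends in a consonant
BRIVLA = re.compile(rf".*[{CONS}]{{2}}.*[aeiou]")   # has a cluster, ends in a vowel
CMAVO = re.compile(rf"[{CONS}]?[aeiou']+")          # optional consonant, then vowels/apostrophes


def classify(word):
    """Return 'cmevla', 'brivla', 'cmavo', or 'other' for a single word."""
    w = word.strip(".")  # ignore leading/trailing pause dots
    if CMEVLA.fullmatch(w):
        return "cmevla"
    if BRIVLA.fullmatch(w):
        return "brivla"
    if CMAVO.fullmatch(w):
        return "cmavo"
    return "other"
```

With these patterns a compound such as "a'anai" matches none of the three rules and is reported as "other".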

So, I think you could filter all cmavo clusters by looking for anything that matches /.+[^aeiou'].*[aeiou]/ but doesn't match /[^aeiou'][^aeiou']/. That is, the word contains a non-vowel somewhere after the first letter, ends in a vowel, and doesn't contain a consonant cluster.

At least, that seems like a good start. 
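
A minimal Python sketch of that filter, under a few assumptions: the name `is_cmavo_cluster` is hypothetical, the first pattern is anchored to the whole word with `fullmatch` (to honour "ends in a vowel"), and the input is a single lowercase word with spaces and dots already removed.

```python
import re

# First pattern: a non-vowel somewhere after the first letter, ending in a vowel.
CANDIDATE = re.compile(r".+[^aeiou'].*[aeiou]")
# Second pattern: two adjacent non-vowels, i.e. a consonant cluster.
CLUSTER = re.compile(r"[^aeiou'][^aeiou']")


def is_cmavo_cluster(word):
    """True if the word matches the first pattern and has no consonant cluster."""
    return bool(CANDIDATE.fullmatch(word)) and not CLUSTER.search(word)


samples = ["lonu", "lo", "klama", "a'anai", "a'a"]
clusters = [w for w in samples if is_cmavo_cluster(w)]
```

In this sketch "lonu" and "a'anai" are kept, "lo" and "a'a" are rejected because no non-vowel follows the first letter, and "klama" is rejected because of the cluster "kl".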
