From: sukender1@gmail.com
To: lojban
Date: Mon, 31 Jul 2017 03:34:30 -0700 (PDT)
Subject: Re: [lojban] Spaces in jbovlaste

coi la .adam. .i coi ro do

Sorry for the late answer; I was tweaking my scripts & tools according to your advice and the CLL.
I did not take the exact regex you proposed, but I did include your idea. So: thanks! Could you possibly review my regexes (see the links to the scripts below)?

For your information, based on your idea and Ilmen's, I added a three-step process:
  1. Clean the input in a generic way: tabs/spaces, splitting entries that contain spaces, etc. (see the sed script at this point, or its latest version)
  2. Clean from a "Lojbanic" point of view: remove non-Lojban entries, prepend a dot to words starting with a vowel, etc. (current script: https://github.com/Sukender/lojban-spell-check/blob/2e93ceb/wordlist/clean_lojban.sed / latest version)
  3. Split the entries into classes: cmevla, cmavo, compound cmavo, and a few others (current script / latest version)
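
To make steps 1 and 2 concrete, here is a minimal Python sketch. It is an illustration only, not a port of the sed scripts: the function names are hypothetical, and testing for "non-Lojban" entries by checking against the lowercase Lojban alphabet (no h, q or w) is my assumption about what such a filter could look like.

```python
import re

VOWELS = "aeiou"

# Assumption: an entry counts as "Lojban" if it uses only lowercase Lojban
# letters (the Latin alphabet without h, q, w) plus apostrophe, dot and comma.
LOJBAN_CHARS = re.compile(r"^[.,'abcdefgijklmnoprstuvxyz]+$")


def clean_generic(lines):
    """Step 1: normalise tabs/spaces and split entries that contain spaces."""
    words = []
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        # and drops leading/trailing whitespace.
        words.extend(line.split())
    return words


def clean_lojbanic(words):
    """Step 2: drop non-Lojban entries, prepend a dot to vowel-initial words."""
    kept = []
    for word in words:
        if not LOJBAN_CHARS.match(word):
            continue  # contains a character outside the assumed alphabet
        if word[0] in VOWELS:
            word = "." + word
        kept.append(word)
    return kept
```

For example, feeding the lines "lo\tnu", "  a'a " and "hello world" through both functions keeps "lo", "nu" and ".a'a" and drops the two English words.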
Current results are:
38 "illegal" words and 430 duplicates (mainly generated by splitting, such as when processing "lo nu", "lo", "nu")
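
The way splitting produces duplicates can be shown with a small Python sketch. The function name is hypothetical and the three entries are just the example mentioned here, not the real word list.

```python
from collections import Counter


def split_and_count_duplicates(entries):
    """Split entries on whitespace, then report unique words and duplicates."""
    words = [word for entry in entries for word in entry.split()]
    counts = Counter(words)
    unique = sorted(counts)
    # Every occurrence beyond the first one is a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return unique, duplicates
```

With the entries "lo nu", "lo" and "nu", the split yields four words, of which two are duplicates.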

The splitter generates output like this (only a few lines for each class, of course):
--- cmavo ---
.a
.a'a
.a'au
.a'e
.a'ei
.ai
.a'i
--- cmavo_compound ---
.a'acu'i
.a'anai
.a'enai
.a'icu'i
.aicu'i
.ainai
.a'inai
--- brivla ---
.a'anmo
.abniena
.abvele
.aclotlu
.adgalagda
.adji
.admine
--- vowel ---
.abu
.ebu
.ibu
--- consonant ---
by
cy
dy
--- cmevla ---
.abata'adj
.abgad
.acaman
.akev
.akrobat
.akuuas
.aleksandras
--- other ---
(empty list)



co'o

-- 
Sukender


On Friday, 28 July 2017 at 17:51:56 UTC+2, Adam Lopresto wrote:
> jbovlaste should already be filtered to contain only Lojban, and there are, broadly, three types of Lojban words:
> cmevla are everything that ends in a consonant
> brivla all contain a consonant cluster and end in a vowel
> cmavo optionally start with a single consonant, and consist entirely of vowels and apostrophes after that.
>
> So, I think you could filter all cmavo clusters by looking for anything that matches /.+[^aeiou'].*[aeiou]/ but doesn't match /[^aeiou'][^aeiou']/. Contains a non-vowel somewhere after the first letter, ends in a vowel, and doesn't contain a consonant cluster.
>
> At least, that seems like a good start.
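
The three rough rules and the two regexes quoted above can be tried out with a short Python sketch. Several things here are assumptions rather than anything from the actual scripts: the function name `classify` is hypothetical; `^` and `$` are added to the first regex so that "ends in a vowel" is really enforced; and the leading dot is stripped before matching. Since "y" is treated as a non-vowel, lerfu words such as "by" come out as cmevla here, and ".abu" comes out as a compound cmavo, whereas the splitter lists both in classes of their own.

```python
import re

VOWELS = "aeiou"

# Two adjacent characters that are neither vowels nor apostrophes.
CONSONANT_CLUSTER = re.compile(r"[^aeiou'][^aeiou']")
# The first quoted regex, anchored with ^ and $ (anchors are an assumption).
COMPOUND_CANDIDATE = re.compile(r"^.+[^aeiou'].*[aeiou]$")
# An optional single consonant, then only vowels and apostrophes.
SINGLE_CMAVO = re.compile(r"^[^aeiou']?[aeiou']+$")


def classify(word):
    """Classify a word using only the three rough rules quoted above."""
    w = word.lstrip(".")  # ignore the dot prepended to vowel-initial words
    if not w:
        return "other"
    if w[-1] not in VOWELS:
        return "cmevla"  # ends in a consonant
    if CONSONANT_CLUSTER.search(w):
        return "brivla"  # consonant cluster and a final vowel
    if SINGLE_CMAVO.match(w):
        return "cmavo"
    if COMPOUND_CANDIDATE.match(w):
        return "cmavo_compound"  # non-vowel after the first letter, no cluster
    return "other"
```

For instance, classify(".a'acu'i") returns "cmavo_compound" and classify(".adji") returns "brivla". This is only a first approximation of the morphology in the CLL, in the same spirit as "a good start".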
