From nobody@digitalkingdom.org Sat Nov 08 23:16:02 2008
Message-ID: <feed8cdd0811082315s5da6d4d9u99cd61890f90079@mail.gmail.com>
Date: Sat, 8 Nov 2008 23:15:49 -0800
From: "Stephen Pollei" <stephen.pollei@gmail.com>
To: "A. PIEKARSKI" <totus@rogers.com>
Subject: [lojban] Re: eSpeak and lojban
Cc: lojban-list@lojban.org, jonsd@users.sourceforge.net

On 11/8/08, A. PIEKARSKI <totus@rogers.com> wrote:
>  I just installed eSpeak and tried it out.  It is interesting, but has
>  enough problems to limit its usefulness.
>
>  Then I discovered some relevant messages in the mail archives with
>  your name appended - sounds like you were trying to make some
>  improvements in September.
>
>  Can you let me know the current status, please?  By the way, the thing
>  that I found most awkward is the way it races through without pausing
>  before the {ni'o}s.

I'm not sure of its current status: version 1.39 was released while I
was still modifying version 1.38, and I don't know whether the
upstream author made any changes based on what I was experimenting
with. I should check. I have also done very little since then, because
I'm not sure what to do about the x and z phonemes. I was completely
stuck on those.

I did make some updates to lihertadji.pl, including some incomplete
code to add commas to words. I also want to beef up the fixup
function so that only valid, pronounceable lojban words make it through
into the output. My worry there is that the definition of what is and
isn't a valid syllable isn't well defined or agreed upon. I thought
that adding the commas would help espeak pronounce syllable boundaries
a little better, but I haven't tested how espeak actually responds to
having commas all over the place.
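
To make the comma idea concrete, here is a naive sketch in Python. It
is not the code in lihertadji.pl; add_commas is a hypothetical name,
and the syllable rule (only ai, ei, oi and au count as diphthongs, and
the last consonant before a vowel starts the next syllable) is my own
simplifying assumption, which is exactly the kind of rule that isn't
agreed upon.

```python
import re

# Assumption: a syllable nucleus is one of four diphthongs or a single
# vowel (y included). Other diphthongs are deliberately ignored here.
NUCLEUS = re.compile(r"ai|ei|oi|au|[aeiouy]")

def add_commas(word):
    """Insert commas at (naively guessed) syllable boundaries."""
    pieces = []
    start = 0  # index where the current syllable begins
    nuclei = list(NUCLEUS.finditer(word))
    for current, following in zip(nuclei, nuclei[1:]):
        gap = word[current.end():following.start()]  # consonants in between
        # Keep all but the last consonant with the current syllable.
        cut = following.start() - 1 if gap else following.start()
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return ",".join(pieces)
```

With this rule, add_commas("broda") gives "bro,da" and
add_commas("lojban") gives "loj,ban"; a word with no vowel at all, such
as "l", comes back unchanged.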

I also have a long-term goal of having li'ertadji output SSML so you
can mix lojban and other languages with zoi, la'o and la'oi. I'm
attaching what I have even though it's ugly and has known errors and
incomplete spots in it. Yes, I know some of it is embarrassing, but
it's a work in progress.
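
A rough Python sketch of what that SSML output could look like. This
is not what li'ertadji does today. The to_ssml and voice names are
hypothetical, the foreign language is passed in as a parameter because
the text itself doesn't say what it is, and I'm assuming the
synthesizer honours xml:lang on a <voice> element. Delimiters are
matched as plain whitespace-separated words, which is a simplification.

```python
from xml.sax.saxutils import escape

def voice(text, lang):
    """Wrap foreign text in a <voice> element for the given language."""
    return f'<voice xml:lang="{lang}">{escape(text)}</voice>'

def to_ssml(text, foreign_lang="en"):
    """Wrap lojban text in SSML, switching voice for quoted foreign text."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in ("zoi", "la'o") and i + 1 < len(words):
            # zoi/la'o: delimiter word, foreign text, same delimiter again.
            delim = words[i + 1]
            try:
                end = words.index(delim, i + 2)
            except ValueError:
                end = len(words)  # unterminated quote runs to the end
            out.append(f"{word} {escape(delim)}")
            out.append(voice(" ".join(words[i + 2:end]), foreign_lang))
            if end < len(words):
                out.append(escape(delim))
            i = end + 1
        elif word == "la'oi" and i + 1 < len(words):
            # la'oi quotes exactly one following word.
            out.append(word)
            out.append(voice(words[i + 1], foreign_lang))
            i += 2
        else:
            out.append(escape(word))
            i += 1
    return '<speak version="1.0" xml:lang="jbo">' + " ".join(out) + "</speak>"
```

For example, to_ssml("mi cusku zoi gy hello world gy") keeps
"zoi gy ... gy" in the lojban voice and wraps only "hello world" in
the foreign one.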

I will check to see what, if anything, got changed for version 1.39.

OK, I just downloaded it and checked, and it looks exactly the same.

espeak-1.39-source]$ sha1sum dictsource/jbo_*
15e2321ccc869369e70bf5cf358261187d40598b  dictsource/jbo_list
4081ab03f16a9e1bd75c47146f8b7a844216dbe2  dictsource/jbo_rules

espeak-1.38-source]$ sha1sum dictsource/jbo_*
15e2321ccc869369e70bf5cf358261187d40598b  dictsource/jbo_list.orig
4081ab03f16a9e1bd75c47146f8b7a844216dbe2  dictsource/jbo_rules.orig
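
The same check can be scripted. A minimal Python sketch, assuming both
source trees are unpacked side by side; sha1_of and same_files are
hypothetical helper names, and the file names you pass in have to match
what is actually on disk (my 1.38 copies carry an .orig suffix).

```python
import hashlib
from pathlib import Path

def sha1_of(path):
    """Return the hex SHA-1 digest of a file, as sha1sum prints it."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()

def same_files(old_dir, new_dir, names):
    """Map each file name to True when both trees hold identical content."""
    return {
        name: sha1_of(Path(old_dir) / name) == sha1_of(Path(new_dir) / name)
        for name in names
    }
```

Calling same_files on the two dictsource directories with
["jbo_list", "jbo_rules"] returns a dict that is True for every file
whose content did not change between releases.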

I think the upstream guy thought that some of my changes were too extreme.
I will redo some of my work and only fix things that are clearly
wrong. Some of it wasn't wrong per se, just not really needed.

Short to-do list:
1) Get rid of { a } to { abu }, { e } to { ebu }, etc. from jbo_list.
That's obviously and clearly a proper fix. espeak would change {le
broda e le brode} into {le broda ebu le brode} and { u bu co'e} into {
ubu bu co'e}, which it goes without saying is bogus.

2) Get rid of { m } to { my }, { l } to { ly } from jbo_list.
espeak shouldn't be touching any of the single-consonant words. l, m,
n, and r should be syllabic and, according to at least some lojbanists,
should be valid cmevla.
  { b } to { by } shouldn't really be that harmful. The CLL would
classify such words as cmevla, but they aren't really made up of any
syllables. I would prefer it if they got altered as { b } to { yb };
I think I will make li'ertadji do that in the future. The preprocessor
should be able to handle more invalid words than espeak tries to.
The reason I prefer { b } to { yb } over { by } is that it
preserves the word's cmevla status and doesn't change it into a lerfu
valsi. I need feedback from others on whether these should be
considered valid one-letter click names and thus be left alone, or
whether they are invalid words that the CLL underspecified.
As far as I know, nobody uses single-letter cmevla in real life, so
it's not a big deal.

The upstream maintainer was reluctant to take a patch from me that got
rid of them all, because he said some of the consonants would sound
merely like a click noise. I think that is a valid concern, but it
should be the responsibility of the writer not to give bogus input to
espeak. Using the preprocessor with fixups will hopefully ensure that
little to no invalid input reaches espeak in the first place.
Also, I am not sure about this: I didn't test whether the syllabics
sound like clicks or whether they are recognizable.

3) Get rid of the stuff which stresses cmavo from jbo_list, but leave
the upstream author's pausing in place. cmavo should never be stressed
unless someone capitalized some of their letters. The stress doesn't
invalidate the meaning of the words, so it isn't critically wrong, but
it isn't strictly correct either.

espeak will also stress cmevla that have no capital letters, which
isn't really wrong either; it might even be the right thing to do
according to some people I've talked to. I'd prefer it if espeak
didn't stress cmevla and left that up to the writer, but I could be
wrong about that. I request feedback.
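
To show why item 1 is a real bug, here is a toy model in Python. It is
not espeak's actual dictionary lookup, just a whole-word substitution
table that I'm assuming behaves the same way for these inputs;
substitute and VOWEL_ENTRIES are hypothetical names.

```python
# Toy stand-in for the vowel entries in jbo_list (not espeak's real code).
VOWEL_ENTRIES = {"a": "abu", "e": "ebu", "i": "ibu", "o": "obu", "u": "ubu"}

def substitute(text, table):
    """Replace every whole word that has an entry in the table."""
    return " ".join(table.get(word, word) for word in text.split())

# The connective "e" gets turned into the letter name "ebu".
print(substitute("le broda e le brode", VOWEL_ENTRIES))
# An explicit "u bu" ends up with a doubled "bu".
print(substitute("u bu co'e", VOWEL_ENTRIES))
```

With an empty table the text passes through untouched, which is what
dropping those entries from jbo_list is meant to achieve.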

Maybe nice to have, but dropped:
1) In jbo_rules: h, q, and w aren't really legal lojban characters, so
pause if they occur at the end of a word; y always needs a pause if
it occurs at the end of a word. li'ertadji's foreign-word detection and
fixup routines should handle the h, q, and w cases even better than
these espeak rules. The preprocessor also handles the y cases very well.
2) dj and tc, but not ts and dz, have special support in jbo_rules. My
earlier patch dropped that; I'm not sure whether that's an improvement
or not. Someone who knows better should figure out whether the
affricates are at their best.
3) Someone should check whether the rules that change the sounds for
l, n, and r based on adjacent letters make sense. My earlier patch just
made them more consistent.
4) Give extra pause to lo'u and le'u (the error-quote words) and to zo
and zoi (the regular quote words).
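
For item 1, the preprocessor side of the check is simple enough to
sketch in Python. This is an illustration, not the li'ertadji code;
foreign_letters and needs_pause_after are hypothetical names, and
treating the apostrophe, comma and period as ordinary word characters
is my assumption.

```python
# The lojban letters, plus apostrophe, comma and period.
LOJBAN_LETTERS = set("abcdefgijklmnoprstuvxyz',.")

def foreign_letters(word):
    """Return the characters in word that lojban does not use, sorted."""
    return sorted(set(word.lower()) - LOJBAN_LETTERS)

def needs_pause_after(word):
    """True when the word ends in y, ignoring any trailing periods."""
    return word.rstrip(".").endswith("y")
```

A word such as "what" comes back with ['h', 'w'] and can be routed to
the foreign-word fixup, while "lojban" and "co'e" come back empty.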

I had also just ripped out some of the pausing, since the rules didn't
really require it, but it doesn't hurt anything AFAIK. Unless anyone
complains about too much pausing, I won't touch it again. I'm more
likely to push for a few more things to get pauses.

Also, I did nothing to change the x and the z, and I doubt the upstream
espeak maintainer did anything either. I haven't tested yet, and
hopefully I'm wrong.

I'm not sure how much pause you want before ni'o and no'i. It shouldn't
be that fast, especially since most people put newlines before sections
as well as using ni'o, and I think espeak is whitespace-sensitive
enough to add extra pause if you have a few newlines in place.
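
If the text has no newlines, a preprocessor could add them. A small
Python sketch, assuming (as I do above, untested) that blank lines make
espeak pause longer; add_section_breaks is a hypothetical name, and note
that it collapses all other whitespace to single spaces.

```python
SECTION_WORDS = {"ni'o", "no'i"}

def add_section_breaks(text):
    """Put a blank line before each ni'o / no'i except at the very start."""
    out = []
    for word in text.split():
        # A leading period (pause marker) on the word is tolerated.
        if word.lstrip(".") in SECTION_WORDS and out:
            out.append("\n\n" + word)
        else:
            out.append(word)
    # Drop the space that join() leaves in front of each inserted break.
    return " ".join(out).replace(" \n\n", "\n\n")
```

So "mi klama ni'o do cadzu" becomes two paragraphs, with the second one
starting at ni'o.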

The critical fix is getting rid of the { .a } to { .abu } snafu. If
1.40 had nothing else but that, it would be a significant improvement.
Getting rid of the stressing of some cmavo and not adding "y" to
single-consonant cmevla would be nice. The x and z issue is more
important, but sadly I don't think I can do anything about it.

I also have no idea what upstream's schedule is like.

Anyway, it's getting late and tomorrow I'm traveling out of town with
some friends, so I won't be able to do anything until Monday afternoon.

I attached the simple but critical patch. Something like the below
should be done before 1.40 is released.

diff -U2 jbo_list jbo_list.crit
--- jbo_list    2008-05-16 09:46:20.000000000 -0700
+++ jbo_list.crit       2008-11-08 23:17:52.000000000 -0800
@@ -19,23 +19,14 @@


-_a     abu
 b      b@
 c      S@
 d      d@
-_e     ebu
 f      f@
 g      g@
-_i     ibu
 j      Z@
 k      k@
-l      l@
-m      m@
-n      n@
-_o     obu
 p      p@
-r      R@
 s      s@
 t      t@
-_u     ubu
 v      v@
 x      x@
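
For anyone who would rather script the change than apply the patch,
the same effect can be sketched in Python. This assumes each jbo_list
entry is a key followed by whitespace and its phonemes, as in the hunk
above; drop_entries and DROPPED_KEYS are hypothetical names, and I
haven't checked whether the same keys appear elsewhere in the file.

```python
# Keys the patch deletes: vowel letter names and the syllabic consonants.
DROPPED_KEYS = {"_a", "_e", "_i", "_o", "_u", "l", "m", "n", "r"}

def drop_entries(lines):
    """Return the lines whose first field is not one of the dropped keys."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] in DROPPED_KEYS:
            continue  # skip the entry the patch removes
        kept.append(line)
    return kept
```

Blank lines and all other entries (b, c, d and so on) pass through
unchanged.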

