When I first started learning Lojban I wrote a quick-and-dirty script to make looking up words faster and easier. gismu and cmavo were easy, but I could never figure out lujvo, so I'm taking another stab at it. I currently have something that works for the general cases of {bajdri}, {ba'udri}, and {bagypau}, but I'm not sure how to deal with 4-letter rafsi and buffer letters other than "y".
To deal with the buffer letters other than "y", I thought I could just do this:
strip all "y" from the word
get first three non "'" chars
if the first letter is "r", "l", "m", or "n" and the second letter is a consonant, then chop off the first letter and grab another letter from the right
(So if I were parsing "bacru zei bevri" = "ba'urbei", then after handling "ba'u" in the first iteration I would end up with "rbe". Because of the step above, I'd strip off the "r" and grab the next letter, ending with "bei", which is the right result.)
But this produces strange results, because there ARE cases where one of those letters is followed by a consonant without being a buffer letter (morsi, for instance).
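To make the problem concrete, here is a minimal Python sketch of the steps above. The names `take_chunk` and `split_lujvo_naive` are made up for illustration, and it only handles 3-letter rafsi (counting "'" as not a letter), so it is the naive heuristic and not a real lujvo parser. The failing example assumes the lujvo "mrobi'o" (morsi + binxo, rafsi "mro" + "bi'o"), which I picked myself to show the morsi case.

```python
VOWELS = set("aeiou")
BUFFER_CANDIDATES = set("rlmn")  # letters the heuristic treats as possible buffers


def take_chunk(word, start, n_letters=3):
    """Return (chunk, end): text from start containing n_letters non-"'" chars."""
    count = 0
    i = start
    while i < len(word) and count < n_letters:
        if word[i] != "'":
            count += 1
        i += 1
    return word[start:i], i


def split_lujvo_naive(lujvo):
    """Split a lujvo into rafsi-sized pieces using the naive heuristic."""
    word = lujvo.replace("y", "")  # step 1: strip all "y"
    pieces = []
    pos = 0
    while pos < len(word):
        # step 2: get the first three non-"'" chars
        chunk, end = take_chunk(word, pos)
        letters = chunk.replace("'", "")
        # step 3: r/l/m/n followed by a consonant is assumed to be a buffer,
        # so drop it and grab another letter from the right
        if (len(letters) >= 2
                and letters[0] in BUFFER_CANDIDATES
                and letters[1] not in VOWELS):
            pos += 1
            chunk, end = take_chunk(word, pos)
        pieces.append(chunk)
        pos = end
    return pieces


print(split_lujvo_naive("bajdri"))    # ['baj', 'dri']
print(split_lujvo_naive("ba'udri"))   # ["ba'u", 'dri']
print(split_lujvo_naive("bagypau"))   # ['bag', 'pau']
print(split_lujvo_naive("ba'urbei"))  # ["ba'u", 'bei']
print(split_lujvo_naive("mrobi'o"))   # ['rob', "i'o"]  (wrong: the "m" got eaten)
```

The first four come out right, but in the last one the "m" of "mro" is mistaken for a buffer letter, which is exactly the ambiguity I can't see how to resolve by looking only at the letters.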
Is there a way to unambiguously and algorithmically break a lujvo down into its component gismu?