I have run these four files through jbogenturfahi with the --rafske
option. I have attached both the raw output[1] and the post-processed
output[2]. The post-processed output is hopefully what you want: a
sorted list of the words that appear in each input file, one per line.

1: The raw output is in Scheme, and contains more information but is
   also more difficult to parse without a Scheme reader.

2: The program I used to perform post-processing is attached as well,
   though it also requires having Scheme. I include it for
   informational purposes.

-Alan

On Fri, Apr 22, 2011 at 06:45:28PM +0200, Johan Pretorius wrote:
> Hi Alan, all
>
> Alan, can I please ask you to run the attached four files through
> jbogenturfa'i, and send me back the results? I have a visual tool (kdiff3)
> to compare them to my results, which makes it easier for me to figure out
> what is going on.
>
> New release! Get it here:
> [1]http://sourceforge.net/projects/vlastezba/files/vlastezba_21.jar/download
>
> In this release, I have fixed a bunch of things:
> - Dots are no longer assumed to be an integral part of a word. In fact,
> now, if a dot is found, it is assumed to be a word separator, in exactly
> the same way as a space. Beyond this they are completely ignored, and
> indeed, removed from the input stream.
> - "ybu" and "y'y" now parse. Since no clarity was to be had about whether
> or not y is a vowel, consonant, neither or both, I just added those two as
> special cases... I already had a free-standing "y" as a special case in
> there, because it is explicitly mentioned in CLL (section 4.3, I think).
> - The last cmavo cluster in a file is no longer misparsed. Specifically, I
> added a regression test and unit test for "coirodo" appearing on a single
> line in its own file, and it finds 3 words as you would expect it to.
> - Output is now always ordered alphabetically. Previously it was in any
> old order because I used an unordered HashMap to store the words in.
> - Previously we seemed to produce some duplicates (I guess this could
> happen if there were extra whitespace in the words). This only happened in
> about 0.5% of cases. I did not consciously fix this, but it seems to no
> longer happen.
> - Internally, the logic is much better organized - the parsing logic is no
> longer all stuffed into a single class. Instead there is a class hierarchy
> specifically to represent each word class; the idea is that each will have
> its own specialized processing. The main point of doing this was to enrich
> the results returned by the tokenizer, which means in future we can get
> all flexible (like, if we find a lujvo, we will know what its rafsi are,
> so that we can decide to give the user a list of those, look up their
> gismu definitions, or whatever).
> - Added regression tests. There are 4 files: the Terry the Tiger story,
> the Berenstain Bears story, a file containing only "coirodo" on a single
> line, and a file containing a list of all recognized cmavo (about 1000
> lines). I also added a script that runs all these through vlastezba,
> compares the outputs against "expected" results, and spits the diffs into
> a single file (test_result.txt). It should be noted that the "expected"
> results are baselined off of this release, so the tests cannot report any
> problems yet. However, next time a change is made, it will be possible to
> see how the regression tests are affected. The expected results can then
> be manually updated to be more correct, thus causing the test to become
> more correct over time.
> - Added 2 unit tests to the ones already existing, specifically to test
> these two cases: "coirodo" and "ybu"... since both were problems that got
> fixed in this release.
>
> By the way, does anybody know how to do a formal release on SourceForge?
> Aside from just uploading the jar file, which is what I'm doing currently.
>
> Regards,
> iu'an
>
> --
> You received this message because you are subscribed to the Google Groups
> "Lojban Beginners" group.
> To post to this group, send email to lojban-beginners@googlegroups.com.
> To unsubscribe from this group, send email to
> lojban-beginners+unsubscribe@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/lojban-beginners?hl=en.
>
> References
>
> Visible links
> 1. http://sourceforge.net/projects/vlastezba/files/vlastezba_21.jar/download

--
.i ma'a lo bradi ku penmi gi'e du
Attachment:
jbogenturfahi-cipra.zip
Description: Zip compressed data
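
For anyone following the thread, here is a rough Java sketch of two of the
behaviours described in the quoted release notes: dots treated as word
separators exactly like spaces, and output that is deduplicated and
alphabetically ordered. This is only an illustration under assumptions, not
the vlastezba source. The class name WordListSketch and the method
sortedUniqueWords are made up for this example, and the use of a TreeSet in
place of the unordered HashMap is my guess at one way to get sorted output.
It does no Lojban morphology at all, so it will not split a cmavo cluster
such as "coirodo" into its three words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; not code from vlastezba or jbogenturfa'i.
final class WordListSketch {

    // Splits on runs of whitespace and dots, so a dot acts as a separator
    // just like a space and never ends up in the output.
    // The TreeSet removes duplicates and keeps the words in String order.
    static List<String> sortedUniqueWords(String text) {
        TreeSet<String> words = new TreeSet<>();
        for (String token : text.split("[\\s.]+")) {
            // A leading separator produces one empty token; skip it.
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        // Prints one word per line, like the post-processed output format.
        for (String word : sortedUniqueWords(".i coi ro do .i  coi")) {
            System.out.println(word);
        }
    }
}
```

Running main on that sample prints coi, do, i and ro, one per line. The
apostrophe is not a separator here, so a word like "y'y" stays in one piece.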