I've decided to release an alpha version of my parser for testing and comments as I must stop developing it for a while - it is causing me too much loss of sleep and disturbance of domestic peace :)
I've checked the code once more and cleaned it a little bit. There shouldn't be any major problems, but as I haven't yet been able to do extensive testing, I decided to publish yhe program as an alpha version and only on my own site. There may be obscure errors in the PEG so the output of this version shouldn't be taken as gospel. THIS ISN'T A CERTIFIED PARSER. Anyway, it handles most of "Alice".
I've appended the README file from the package.
Veijo
LLLP = Lua LPeg Lojban Parser
(Version = alpha)
Requirements:
lua5.1x (or luajit2) and the LPeg library (either built-in or external)
LuaJIT doesn't seem to offer any benefit for the PEG but makes a difference
in auxiliary operations. However, for reasonable sized texts the difference
is negligible.
LLLP files:
lllp.lua the main program script
lllp_morphology.lua the Lojban morphology PEG
lllp_syntax_r.lua the Lojban syntax PEG, a reduced output version
lllp_syntax_f.lua the Lojban syntax PEG, a "full" output version
The reduced output version of the syntax PEG omits the numbered intermediate
rules (e.g. term-1) from the output because the depth of the "full" parse
tree can exceed the maximum number of syntax levels an unmodified lua/luajit
interpreter can handle (200 levels), and increasing the limit can be unsafe.
While parsing "Alice" using the "full" output version, the program hit the
limit at three points. I've set the program to use the reduced output version
as the full output isn't usually required. The version to use can be changed
by editing the main program script.
A luajit linux binary with built-in LPeg library is included in the package.
Running:
luajit lllp.lua lojban_file_name
NB. the output goes to STDOUT and can be re-directed as required
NB. the output parameters can only be set by editing lllp.lua
NB. input text is sliced at blank lines and handled block by block.
This means that terminated structures MUST NOT span blank lines!
NB. punctuation handling is still deficient (this is an alpha version)
Lua commenting conventions can be used within the Lojban files:
-- two or more adjacent dashes mark the rest of the line as comment
--[[ starts a multi-line comment ending at --]] or EOF
A space between -- and [[ can be used to de-activate the commenting
The output can be fine-tuned quite extensively by editing lllp.lua. It is
also possible to add processing stages at various points.
** The program gathers statistics about word usage.
There is no error handling. The processing of a block terminates when a
syntax error is found, and the program continues with the next block if any.
I've tested the parser with both single sentences and the full "Alice", and
there don't seem to be any major problems. Alice does contain a number of
blocks which don't pass the parser, but most do. On a decent PC the process
takes about one minute, and the reduced tree output interleaved with the
source text blocks is about 1,200 A4 pages long (1,300 Letter size), the full
tree would be about 16,000 pages.
--