LALR Parser Generator

Visual Prolog Commercial Edition contains an LALR(1) parser generator in the examples (in the directory vipLalrGen).

The example directory also contains a program exprEval, which uses a generated parser to parse arithmetical expressions.

The parser generator itself also uses such a parser to parse grammar files, so it can be seen as another example.

The directory $(ProDir)/vip/vipSyntax contains a grammar file vipSyntax.vipgrm defining the Visual Prolog syntax. The directory also contains the corresponding generated files, and all other files necessary to parse Visual Prolog programs. But that is not the topic of this article; it is only mentioned for reference.

The theory of parsers is huge and cannot be covered here. An LALR(1) parser parses from Left to right, and produces an inverse Rightmost derivation using 1 Look-Ahead symbol. See for example wikipedia:LALR parser for more information.

Parsing

The purpose of a parser is to validate that an input string fulfills a certain grammar specification, and:

  • if the grammar is fulfilled: construct a corresponding parse tree.
  • if the grammar is not fulfilled: output "syntax error" messages describing the problems.

The overall parsing process consists of splitting the input into lexical elements known as terminals and parsing these according to the grammar specification.

vipLalrGen will create a combined lexer and parser that performs the lexing and parsing intermixed in a single start-to-end scan of the text (without backtracking).

Parser structure

The parser of the exprEval demo can be used as an illustration of the parser components and how the overall parsing works.

The grammar file and the parser components are in the subdirectory/package expressionGrm.

  • expressionGrm.vipgrm is the grammar specification.
  • expressionGrm.i, expressionGrm.cl and expressionGrm.pro contain the parser and lexer.
  • expressionSem contains a support class for building the resulting parse tree.

expressionGrm.i, expressionGrm.cl and expressionGrm.pro are generated by the vipLalrGen program from the grammar specification expressionGrm.vipgrm.

The subdirectory/package expressionGrmSem contains support predicates for the semantic actions in the grammar.

vipLalrGen

The parser generator itself is in the vipLalrGen subdirectory (i.e. in <examples root>\vipLalrGen\vipLalrGen), and it must be built before it can be used.

The vipLalrGen program will read grammar files and generate LALR(1) parsers as Visual Prolog source code.

vipLalrGen.exe - LALR parser generator for Visual Prolog
 
Usage:
 
    vipLalrGen.exe [options] <grammar files>
 
Options:
 
    @<File>     Read options from <File>
    -help       Displays the help message
    -out <OutDir>       Generate files in <OutDir> (default: the directory containing the grammar file)
    -details    Add details to the log file
    -nodetails  Don't add details to the log file (default)

It is recommended to have grammars in files with extension vipgrm; the IDE will token color files with that extension.

Example

To run vipLalrGen on the grammar file in the exprEval example from a command console:
>cd <example path>\vipLalrGen
>vipLalrGen\Exe\vipLalrGen.exe exprEval\expressionGrm\expressionGrm.vipgrm
OK: exprEval\expressionGrm\expressionGrm.vipgrm

When running vipLalrGen on a grammar file it produces the following files:

  • log/<grammar>.log containing detailed information about the grammar
  • If successful: <grammar>.i, <grammar>.cl, <grammar>.pro containing the generated parser

Grammar files

The input to vipLalrGen is a grammar file. As mentioned, the IDE supports token coloring if the extension is vipgrm.

A grammar file contains a named grammar:

grammar expressionGrm
    open expression, expressionGrmSem
 
terminals
    [number] : [integer] [real].
    [boolean] : ["true"] ["false"].
    [compare] :  ["<"] ["<="] ["="] [">"] [">="] ["<>"].
    [plus] : ["+"] ["-"].
    [mult] : ["*"] ["/"].
    ["^"].
    ["("].
    [")"].
 
precedence
    [compare] nonassoc.
    [plus] left.
    [mult] left.
    ["^"] right.
 
startsymbols
    exp : expression.
rules
    exp { mkBinOp(Op, A, B) } ==>
        exp { A },
        [compare] { Op },
        exp { B }.
 
    exp { mkBinOp(Op, A, B) } ==>
        exp { A },
        ["^"] { Op },
        exp { B }.
 
    exp { mkBinOp(Op, A, B) } ==>
        exp { A },
        [mult] { Op },
        exp { B }.
 
    exp { mkBinOp(Op, A, B) } ==>
        exp { A },
        [plus] { Op },
        exp { B }.
 
    exp { E } ==>
        ["("],
        exp { E },
        [")"].
 
    exp { number(toTerm(real, N)) } ==>
        [number] { N }.
 
    exp { bool(toTerm(boolean, N)) } ==>
        [boolean] { N }.
 
end grammar expressionGrm

terminals and lexing

The grammar file contains (among other things) a terminals section. This section defines the lexical elements of the language.

terminals
    [number] : [integer] [real].
    [boolean] : ["true"] ["false"].
    [compare] :  ["<"] ["<="] ["="] [">"] [">="] ["<>"].
    [plus] : ["+"] ["-"].
    [mult] : ["*"] ["/"].
    ["^"].
    ["("].
    [")"].

The form without a colon (e.g. ["^"].) is shorthand for the form with a colon (i.e. ["^"] : ["^"].).
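
That is, these two terminal declarations are equivalent (a small illustration of the shorthand just described; comments in grammar files follow Visual Prolog syntax):

["^"].            % shorthand
["^"] : ["^"].    % the equivalent explicit form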

Each line above defines the terminal symbol in front of the colon as the set of tokens after the colon. Terminal symbols are the symbols used in the grammar, whereas tokens are the actual strings in the parsed text.

So whenever true appears in the text, the lexer will return the [boolean] terminal symbol to the parser.
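
For illustration, here is how a small piece of input text is turned into terminal symbols by the lexer, given the terminals section above (a conceptual sketch; the actual token strings, such as "7" and "true", are passed to the parser together with the terminal symbols):

input text:        7 < true
terminal symbols:  [number] [compare] [boolean]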

The lexer used in the parsing is the same one that is used to parse Visual Prolog programs. Certain aspects of the lexing are therefore bound to conform to Visual Prolog. Comments are like in Visual Prolog (see the Comments section below) and cannot be used as terminal symbols. Furthermore, the following tokens (i.e. token sets) are predefined:

  • [integer] corresponds to Visual Prolog integer literals (including the octal and hex formats)
  • [real] corresponds to Visual Prolog real literals
  • [char] corresponds to Visual Prolog character literals
  • [string] corresponds to Visual Prolog string literals (including the @-forms)
  • [lowercaseId] corresponds to Visual Prolog lowercase identifiers
  • [uppercaseId] corresponds to Visual Prolog uppercase identifiers

Notice that to use any of these in the grammar they must be stated in the terminals section.

Above, the [number] terminal is returned whenever an [integer] or [real] token is met. But if the text contains a (Visual Prolog) string literal, it will give an error because [string] is not mentioned in a terminal definition.

Each token can only be recognized as one terminal symbol, so given the [boolean] definition above, the word true is no longer recognized as a [lowercaseId].
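
As a sketch (hypothetical; not part of the exprEval example), a grammar that wants to accept string literals, identifiers and a couple of keywords could reuse the predefined token sets like this:

terminals
    [string].
    [name] : [lowercaseId] [uppercaseId].
    ["if"].
    ["then"].

With such declarations a string literal is returned as the [string] terminal, identifiers are returned as the (hypothetical) [name] terminal, and the words if and then are returned as their own keyword terminals rather than as [name], just like true above.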

nonterminals & rules

The grammar file also contains nonterminals and rules sections.

nonterminals sections declare nonterminal symbols and the type of the corresponding parse trees; a startsymbols section uses the same syntax to declare the grammar's start symbols. The startsymbols section above states that exp is a nonterminal symbol and that it produces parse trees of the (Visual Prolog) type expression.

rules sections contain rules that define both what valid derivations of the nonterminal symbols look like and how the corresponding parse tree is constructed.

Everything in braces has to do with the parse trees. If we initially disregard the braces, we only see what has to do with the derivations.

rules
    exp ==> exp, [compare], exp.
 
    exp ==> exp, ["^"], exp.
 
    exp ==> exp, [mult], exp.
 
    exp ==> exp, [plus], exp.
 
    exp ==> ["("], exp, [")"].
 
    exp ==> [number].
 
    exp ==> [boolean].

The first rule says that from the nonterminal symbol exp we can derive exp followed by [compare] followed by exp, where [compare] is the terminal symbol compare.

Example

Here is a derivation that corresponds to the expression 7 < 5 (recall that [number] terminals are numbers and [compare] terminals are comparison operators):

exp
    ==> exp, [compare], exp 
    ==> exp, [compare], [number] 
    ==> [number], [compare], [number]

The derivation in the example is a rightmost derivation, because in each step we derive something from the rightmost nonterminal symbol. An LR parser will make reductions in the inverse order of the derivations in a rightmost derivation. LR parsing means that we scan tokens from Left to right and produce an inverse Rightmost derivation.

As mentioned, the braces describe how to construct a corresponding parse tree. The braces on the left-hand side contain a Visual Prolog expression that constructs the node in the parse tree. The braces on the right-hand side of the rules define variable names for the corresponding sub-trees.

Given the rule

   exp { mkBinOp(Op, A, B) } ==>
        exp { A },
        [compare] { Op },
        exp { B }.

A and B contain the parse trees of the two exps and Op contains the token string of the terminal symbol compare. The resulting parse tree is calculated as mkBinOp(Op, A, B).
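
To make this concrete, here is roughly what the parser computes for the input 7 < 5 with the rules shown above (a sketch; the actual parse tree representation is defined by the expression domain and the predicates in expressionGrmSem):

input: 7 < 5

A  = number(toTerm(real, "7"))     % from  exp { number(toTerm(real, N)) } ==> [number] { N }
B  = number(toTerm(real, "5"))
Op = "<"                           % the token string of the [compare] terminal
result: mkBinOp("<", A, B)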

Precedence

The grammar rules in the expression grammar are by themselves ambiguous, in the sense that 3 + 4 * 5 can be derived by both of these rightmost derivations:

exp 
    ==> exp, [plus], exp 
    ==> exp, [plus], exp, [mult], exp 
    ==> exp, [plus], exp, [mult], [number] 
    ==> exp, [plus], [number], [mult], [number] 
    ==> [number], [plus], [number], [mult], [number]
exp 
    ==> exp, [mult], exp 
    ==> exp, [mult], [number] 
    ==> exp, [plus], exp, [mult], [number] 
    ==> exp, [plus], [number], [mult], [number] 
    ==> [number], [plus], [number], [mult], [number]

The ambiguity is not relevant with regard to whether the expression is a valid expression or not, but the two derivations correspond to two different parse trees (corresponding to the expressions (3 + 4) * 5 and 3 + (4 * 5), respectively). Obviously, we are interested in a particular parse tree (i.e. the latter).

Moreover, the ambiguity in the grammar actually makes it impossible for the parser generator to produce a parser (see the next section).

This particular kind of ambiguity can be solved by stating the precedence of the terminal symbols corresponding to the operators:

 
precedence
    [compare] nonassoc.
    [plus] left.
    [mult] left.
    ["^"] right.

A precedence declaration lists the terminals and one of the associativities (nonassoc, left or right).

The terminals in a precedence declaration have the same precedence level and associativity. The declarations are stated in increasing precedence (i.e. the first declaration has the lowest precedence). In the precedence section above this means that [compare] has the lowest precedence and ["^"] the highest.

Given A [t1] B [t2] C, the parser decides as follows (a worked example follows the list):

  • If t1 has higher precedence than t2 then A [t1] B will be reduced to AB and then AB [t2] C will be reduced.
  • If t1 has lower precedence than t2 then B [t2] C will be reduced to BC and then A [t1] BC will be reduced.
  • If t1 and t2 have the same precedence, then
    • If they are left then A [t1] B will be reduced to AB and then AB [t2] C will be reduced.
    • If they are right then B [t2] C will be reduced to BC and then A [t1] BC will be reduced.
    • If they are nonassoc the construction is illegal and the parser will issue a syntax error.
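
For example, 3 + 4 * 5 is parsed roughly as follows with the precedence declarations above (a sketch; since the declarations are stated in increasing precedence, [mult] has higher precedence than [plus]):

stack: (empty)                     input: [number] [plus] [number] [mult] [number]
    shift [number], reduce exp ==> [number]
stack: exp                         input: [plus] [number] [mult] [number]
    shift [plus], shift [number], reduce exp ==> [number]
stack: exp [plus] exp              input: [mult] [number]
    [mult] has higher precedence than [plus], so shift [mult], shift [number], reduce exp ==> [number]
stack: exp [plus] exp [mult] exp   input: (end of input)
    reduce exp ==> exp, [mult], exp     % i.e. 4 * 5 is reduced first
stack: exp [plus] exp
    reduce exp ==> exp, [plus], exp     % i.e. 3 + (4 * 5)
stack: exp                         accept

This corresponds to the parse tree for 3 + (4 * 5), which is the one we wanted.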

Error recovery

During parsing the parser will look at the next terminal symbol and:

  • Decide that parsing is successfully ended and thus accept the input
  • shift the terminal symbol on its stack
  • reduce the top of the stack
  • or else conclude that there is a syntax error in the input

In the last case the parser will report the error and it will not produce a parse tree. It is, however, desirable to try to detect additional syntax errors in the remaining input. Error recovery attempts to get the parser and the input back in synchronization, so that additional syntax errors can be detected.

The parsing machinery in PFC can use a technique involving a special error terminal. In the grammar files the terminal [error] is reserved for the error terminal.

The basic use of the [error] terminal is to add grammar rules of the form:

rules
    ...
    something ==>
        [error].

When the parser detects an error it will do recovery in the following way:

  • It will pop entities from its stack (corresponding to skipping backwards in the things that are in progress of being parsed), until it gets to a state where the [error] terminal can be shifted onto the stack. Given the grammar rule above this would be in a place where something is a valid "next thing".
  • It will then perform actions corresponding to the grammar rule (shift the error terminal and perform the reduction)
  • Finally it will skip input until it finds a terminal that can be shifted onto the stack

If all that succeeds recovery is completed and parsing resumes normally.

Example

This strategy works quite well if the language has some clear "synchronization" points. In Visual Prolog the section keywords predicates, clauses, etc. are very good synchronization points: it is quite clear how to parse things after one of these keywords.

rules
    sectionList ==> .
    sectionList ==> sectionList, section.
rules
    section ==> [t_predicates], ...
    section ==> [t_clauses], ...
    ...
    section ==> [error].

If a syntax error is detected in a section, the parser stack will contain a sectionList followed by whatever the parser is in the middle of parsing in the current section.

It will then pop until it reaches the sectionList, because at that point our error-section (and thus the [error] terminal) would be a valid next thing.

It will shift and reduce the error-section and then skip terminal symbols until it meets one that can legally follow a section, like a section keyword, end interface, ...

Cursor

For many reasons (for example to give error messages and create browse and debug info) the compiler deals with positions in the input. Every rule has access to a (predefined) variable named Cursor. This variable holds four pieces of information:

  • The start position of the construction
  • The end position of the construction
  • Comments before the construction (comments are discussed in the next section)
  • Comments after the construction

The parser calculates such a cursor for a rule from the cursors of the components of the rule. The position handling is quite simple, given a rule:

b  ==> a1, a2, ... an.
  • The start position of b is the start position of a1
  • The end position of b is the end position of an.

The Cursor variable can be used in the semantic action in the rule head together with the variables defined for the body components:

b { mkB(Cursor, A1, A2, ... An ) } ==> a1 { A1 }, a2 { A2 }, ... an { An }.
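
For example, a hypothetical variant of the parenthesis rule from the expression grammar could store the position of the whole parenthesized expression (assuming a constructor parenExp in the expression domain; the real exprEval grammar simply passes E through):

rules
    exp { parenExp(Cursor, E) } ==>
        ["("],
        exp { E },
        [")"].

Here the start position of Cursor is the start position of ["("] and its end position is the end position of [")"].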

Comments

In most parsing schemes comments are simply discarded by the lexer, so the parser never sees any comments at all. This is fine for many purposes, but there may be situations where you want to use the comments. Visual Prolog programs can, for example, contain documentation comments which it may be nice to deal with. And if you want to pretty-print or restructure a program programmatically, you don't want to discard the comments.

As mentioned above, comments are placed in the cursors. Many cursors are however discarded during parsing, but the comments in these cursors should not be discarded. Therefore the parser moves comments from discarded cursors to cursors that are not discarded.

The parser assumes that cursors that are used in a semantic action by means of the Cursor variable will stay alive, and therefore assumes that the comments on such a cursor will survive. Subsequently it may move additional comments to such a cursor.

Consider a rule for an if-then (for simplicity without an else part) construction:

term { mkIfThen(Cursor, Cond, Then) } ==>
     ["if"],
     term { Cond },
     ["then"],
     term { Then },
     ["end"],
     ["if"].

When the parser has to reduce this rule it has access to the six cursors corresponding to the sub-components; it also knows which of them can carry comments and which will be discarded.

In general pre-comments are moved leftwards and post-comments are moved rightwards. So the pre-comments of ["then"] will be moved to become post-comments of Cond (given that Cond can carry comments). Likewise the post-comments of ["then"] will be moved to become pre-comments of Then (given that Then can carry comments).
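
As an illustration, consider an input fragment for the if-then rule above with two comments around the then keyword (assuming that Cond and Then can carry comments):

input:  if C /*a*/ then /*b*/ T end if

/*a*/ is a pre-comment of ["then"]   and is moved to the post-comments of Cond (i.e. of C)
/*b*/ is a post-comment of ["then"]  and is moved to the pre-comments of Then (i.e. of T)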

If the first symbol is going to be discarded its pre-comments will move to become pre-comments in the parent cursor. And likewise the post-comments of the last symbol will be moved to the post-comments of the parent cursor.

If several adjacent symbols cannot carry comments then the exact movement of comments depends on the placement in the rule. For the if-then rule above:

  • the pre-comments of ["end"] will become post-comments of Then, and
  • the post-comments of ["end"] and all comments of ["if"] will go to the parent cursor

The same principle of "most goes to the parent" is used at the beginning of the rule. In the middle of a rule most comments will move to the left.

It is important to notice that comments end up on cursors that have been used with the Cursor variable, so to preserve all comments it is important to retain all such referenced cursors. Or, to put it differently, if a semantic action uses the Cursor variable it also takes responsibility for the comments on that cursor.

It is also important to notice that the parser moves comments around by making destructive updates to the cursor structures, so it is not only important to retain cursors received during parsing: you must retain exactly those cursors your semantic actions receive. To be safe you should not assume that comments are stable until the entire parsing is completed.

In some cases you may need access to a cursor in a grammar rule that will not be able to carry comments. In that case you can use the variable CursorNC instead of the Cursor variable. When using CursorNC the parser will assume that the cursor is not saved and can therefore not carry comments.


Example

In the Visual Prolog grammar the equal operator "=" is put into the [t_rel_op] terminal symbol:
terminals
    [t_rel_op] :  ["<"] ["<="] ["<>"] ["="] ["=="] [">"] ["><"] [">="].

But there are some places where equal is used and where the other [t_rel_op] tokens are illegal. This is handled by the following grammar rule:

nonterminals
    equalOperator : string.
rules
    equalOperator { mustBe(CursorNC, "=", RelOp) } ==>
        [t_rel_op] { RelOp }.
The mustBe predicate will report a syntax error if RelOp is not "=". It needs a cursor to report the error (if there is one), but the cursor is not placed in the resulting parse tree (which is simply a string), so if the parser moved any comments to that cursor they would be lost. Therefore the mustBe predicate is called with CursorNC, indicating that the cursor will not be able to carry comments.

Conflicts

As briefly mentioned above there are situations where the parser generator cannot produce a parser. Briefly speaking, the parser generator generates parser tables that in a given situation instruct the parser either to shift the next terminal symbol onto a parser stack or to reduce the top elements of the parser stack using one of the grammar rules (aka production rules).

But when calculating the parser tables it may turn out that it would be equally good/bad to shift the next terminal onto the stack and to reduce the stack top. This is a so-called shift-reduce conflict. Likewise it can turn out that there are two equally good reductions that could be applied to the top of the parse stack, i.e. a so-called reduce-reduce conflict.

Details about parsing conflicts can be found in the produced log file.

A grammar for a certain language can be written in many ways, and it may often be possible to rewrite a grammar that has conflicts into another one that doesn't have conflicts (but recognizes the same language).

It is outside the scope of this article to go into details about conflicts and how to resolve them; unfortunately, it may require a relatively deep understanding of parsing. As mentioned above, vipLalrGen is an LALR(1) parser generator like YACC and the like, so there is lots of existing material about the subject "out there".

External References

Basic reading is the "Dragon Book".