# LALR Parser Generator

Visual Prolog 7.5 Commercial Edition (build 7501+) contains an LALR(1) parser generator in the examples (in the directory **vipLalrGen**). The example directory also contains a program **exprEval**, which uses a generated parser to parse arithmetical expressions.

The parser generator itself also uses such parser to parse grammar files, so it can be seen as another example.

The directory **$(ProDir)/vip/vipSyntax** contains a grammar file **vipSyntax.vipgrm** defining the Visual Prolog syntax. The directory also contains the corresponding generated files, and all other files necessary to parse Visual Prolog programs. But that is not the topic of this article, it is only mentioned for reference.

The theory of parsers is huge and cannot be covered here. An LALR(1) parser parses from *Left* to right, and produces an inverse *Rightmost* derivation using 1 *Look-Ahead* symbol. See for example wikipedia:LALR parser for more information.

## Contents

### Parsing

The purpose of a parser is to validate that an input string fulfills a certain grammar specification, and:

- if the grammar is fulfilled: construct a corresponding parse tree.
- if the grammar is not fulfilled: output "syntax error" messages describing the problems.

The overall parsing process consist of splitting the input into lexical elements known as **terminals** and parsing these according to the grammar specification.

The lexing and the parsing is performed intermixed in a single start-to-end scan of the text (without backtracking).

### Parser structure

The parser of the **exprEval** demo can be used as illustration of the parser components and how the overall parsing works.

The grammar file and the parser components are in the sub directory/package **expressionGrm**.

**expcessionGrm.vipgrm**the grammar specification.**expcessionGrm.i**,**expcessionGrm.cl**and**expcessionGrm.pro**the parser.**expressionLexer**contains the lexer.**expressionSem**contains a support class for building the resulting parse tree.

**expcessionGrm.i**, **expcessionGrm.cl** and **expcessionGrm.pro** are generated by the **vipLalrGen** program from the grammar specification **expcessionGrm.vipgrm**.

The directory sub directory/package **expressionGrmSem** contains support predicates for the semantic actions in the grammar.

The directory sub directory/package **expressionLexer** contains the lexical analyzer. It is based on the class **pfc\syntax\lexer_string**, which makes it very easy to implement lexers uses Visual Prolog number, string and comment syntax. Basically the programmer only needs to define things like keywords and operators.

### vipLalrGen

The parser generator itself is in the **vipLalrGen** sub directory (i.e. in **<examples root>\vipLalrGen\vipLalrGen**, and it should be built before it can be used.

The **vipLalrGen** program will read grammar files and generate LALR(1) parsers as Visual Prolog source code.

vipLalrGen.exe - LALR parser generator for Visual Prolog Usage: vipLalrGen.exe [options] <grammar files> Options: @<File> Read options from <File> -help Displays the help message -out <OutDir> Generate files in <OutDir> (default: the directory containing the grammar file) -details Add details to the log file -nodetails Don't add details to the log file (default)

It is recommended to have grammars in files with extension **vipgrm**; the IDE will token color files with that extension.

**exprEval**example from a command console:

>cd <example path>\vipLalrGen >vipLalrGen\Exe\vipLalrGen.exe exprEval\expressionGrm\expressionGrm.vipgrm OK: exprEval\expressionGrm\expressionGrm.vipgrm

When running vipLalrGen on a grammar file it always produces the file:

**log/<grammar>.log**containing detailed information about the grammar- If successful:
**<grammar>.i**,**<grammar>.cl**,**<grammar>.pro**containing the generated parser

### Grammar files

The input **vipLalrGen** is a grammar file. As mentioned the IDE supports token coloring if the extension is vipgrm.

A grammar file contains a named grammar:

grammar expressionGrm open expression, expressionGrmSem nonassoc t_cmp. left t_plus. left t_mult. right t_power. nonterminals exp : expression. rules exp { mkBinOp(Op, A, B) } ==> exp { A }, [t_cmp] { Op }, exp { B }. exp { mkBinOp(Op, A, B) } ==> exp { A }, [t_power] { Op }, exp { B }. exp { mkBinOp(Op, A, B) } ==> exp { A }, [t_mult] { Op }, exp { B }. exp { mkBinOp(Op, A, B) } ==> exp { A }, [t_plus] { Op }, exp { B }. exp { E } ==> [t_lpar], exp { E }, [t_rpar]. exp { number(toTerm(real, N)) } ==> [t_number] { N }. exp { bool(toTerm(boolean, N)) } ==> [t_boolean] { N }. end grammar expressionGrm

#### nonterminals & rules

The grammar file (among other) contains **nonterminals** and **rules** sections.

**nonterminals** sections declares nonterminal symbols and the type of the corresponding parse trees. The nonterminals section above states that **exp** is a nonterminal symbol and that it produces parse trees of the (Visual Prolog) type **expression**.

**rules** sections contains rules that both defines how valid *derivations* of nonterminal system looks and how the corresponding parse tree is constructed.

Everything in braces have to do with the parse trees. If we initially disregard it, we only see what have to do with the *derivations*

rules exp ==> exp, [t_cmp], exp. exp ==> exp, [t_power], exp. exp ==> exp, [t_mult], exp. exp ==> exp, [t_plus], exp. exp ==> [t_lpar], exp, [t_rpar]. exp ==> [t_number]. exp ==> [t_boolean].

The first rule says that from the nonterminal symbol **exp** we can derive **exp** followed by **[t_cmp]** followed by **exp**, where **[t_cmp]** is the terminal symbol **t_cmp**.

Here is a derivation that corresponds to the expression **7 < 5** (assuming that **t_number** are numbers and **t_cmp** are comparison operators):

exp ==> exp, [t_cmp], exp ==> exp, [t_cmp], [t_number] ==> [t_number], [t_cmp], [t_number]

The derivation in the example is a **rightmost** derivation, because in each step we derive something from the rightmost nonterminal symbol. An LR parser will make reductions in the inverse order of the derivations in a rightmost derivation. **LR** parsing means that we scan tokens from **L**eft to right and produce an inverse **R**ightmost derivation.

As mentioned the braces describes how to construct a corresponding parse tree. The braces on the left hand side contains a Visual Prolog expression that constructs the node in the parse tree. The braces in the right hand side of the rules defines variable names for the corresponding sub-trees.

Given the rule

exp { mkBinOp(Op, A, B) } ==> exp { A }, [t_cmp] { Op }, exp { B }.

**A** and **B** contains the parse-trees of the two **exp**s and **Op** contains the string of the terminal symbol **t_cmp**. The resulting parse tree is calculated as **mkBinOp(Op, A, B)**.

#### Precedence

The grammar rules in the expression grammar are by themselves ambiguous, in the sense that 3 + 4 * 5 can derived by both these rightmost derivations:

exp ==> exp, [t_add], exp ==> exp, [t_add], exp, [t_mult], exp ==> exp, [t_add], exp, [t_mult], [t_number] ==> exp, [t_add], [t_number], [t_mult], [t_number] ==> [t_number], [t_add], [t_number], [t_mult], [t_number]

exp ==> exp, [t_mult], exp ==> exp, [t_mult], [t_number] ==> exp, [t_add], exp, [t_mult], [t_number] ==> exp, [t_add], [t_number], [t_mult], [t_number] ==> [t_number], [t_add], [t_number], [t_mult], [t_number]

The ambiguity is not relevant with regards to whether the expression is a valid expression or not, but the two derivations corresponds to two different parse trees (corresponding to the expressions **(3 + 4) * 5** and **3 + (4 * 5)**, respectively). Obviously, we are interested in a particular parse tree (i.e. the latter).

Moreover the ambiguity in the grammar actually makes it impossible for the parser generator to produce a parser (see next section).

This particular kind of ambiguity can be solved by stating the precedence of the terminal symbols corresponding to the operators:

nonassoc t_cmp. left t_plus. left t_mult. right t_power.

A precedence declaration states one of the associativites (**nonassoc**, **left** or **right**) followed by a comma separated sequence of terminal symbols and terminated by a dot.

The terminals in a precedence declaration have the same precedence level and associativity. The declarations are stated in increasing precedence (i.e. the first declaration have lowest precedence).

Given **A [t1] B [t2] C**

- If
**t1**has higher precedence than**t2**then**A [t1] B**will be reduced to**AB**and then**AB [t2] C**will be reduced. - If
**t1**has lower precedence than**t2**then**B [t2] C**will be reduced to**BC**and then**A [t1] BC**will be reduced. - If
**t1**and**t2**have same precedence, then- If they are
**left**then**A [t1] B**will be reduced to**AB**and then**AB [t2] C**will be reduced. - If they are
**right**then**B [t2] C**will be reduced to**BC**and then**A [t1] BC**will be reduced. - If they are
**nonassoc**the construction is illegal and the parser will issue s syntax error.

- If they are

### Conflicts

As briefly mentioned above there are situations where the parser generator cannot produce a parser. Briefly speaking the parser generator generates parset tables that in a given situation instructs the parser either to shift the next terminal symbol onto an parser stack or to reduce the top elements of the parser stack using one of the grammar rules (aka production rule).

But when calculating the parser table it may turn out that it would both be equally good/bad to shift the next terminal onto the stack and to reduce the stack top. This is a so called shift-reduce conflict. Likewise it can turn out that there are two equally good reductions that could be applied to the top of the parse stack, i.e. a so called reduce-reduce conflict.

Details about parsing conflicts can be found in the produced log file.

A grammar for a certain language can be written in many ways (just like a program with a certain functionality can be written in many ways), and if may often be possible to rewrite a grammar that have conflicts to another one that doesn't have conflicts (but recognizes the same language).

It is outside the scope of this article to go into details about conflicts and how to resolve them, unfortunately it may require relatively deep understanding of parsing. As mentioned above vipLalrGen is an LALR(1) parser generator like YACC and the like, so there is lots of existing material about the subject "out there".

### External References

Basic reading is the "Dragon Book".