Text grammar format

Introduction

The XML representation of a grammar is not meant to be read or written by hand; it is designed to be easy for the Chaperon components to process. The recommended approach is to write the grammar in this text format and convert it to the XML representation.

Structure

The text grammar consists of two parts, separated by a line containing "%%". The first part contains the token definitions and the special instruction declarations. The second part contains the productions.

[tokens] 
[special instructions]

%start "Symbol of the production" ;

%%
[productions]

The "%start" declaration names the symbol whose production is the root of the result document.
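
As a rough illustration of this layout, the following sketch assembles a small but complete grammar from the constructs described in the sections below. The symbol names (WORD, PUNCTUATION, whitespace, text, sentence, words) are made up for this example, and writing the start symbol as a bare name after "%start" is an assumption based on the template above.

%token WORD "[A-Za-z][a-z]*";
%token PUNCTUATION "[\.,\;\?!]";

%ignore whitespace "[\n\r\ ]";

%start text;

%%

text : text sentence
     | sentence
     ;

sentence : words PUNCTUATION
         ;

words : words WORD
      | WORD
      ;

Everything before "%%" describes what the lexer recognizes; everything after it describes how those tokens are combined, starting from the symbol given to "%start".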

Lexical tokens

Tokens correspond to the tokens of the XML grammar. In the text grammar, each token is defined by a regular expression:

%token WORD "[A-Za-z][a-z]*";

If you use '%left' or '%right' instead of '%token', the token gets left or right associativity.

%right WORD "[A-Za-z][a-z]*";
%left PUNCTUATION "[\.,\;\?!]";

A token that is declared earlier gets a higher priority than the tokens declared after it.
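
A typical case where declaration order matters is a keyword that also matches a more general token. In the sketch below, the token names IF and WORD are only illustrative. The text "if" matches both patterns; going by the priority rule above, IF should win because it is declared first.

%token IF "if";
%token WORD "[A-Za-z][a-z]*";

Reversing the two declarations would presumably cause "if" to be reported as WORD instead, so specific tokens are best declared before general ones.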

Alternations

Alternation means that one of the contained elements must match.

%token CHAR "[A-Za-z] | [0-9]";

Concatenations

Concatenation means that all elements in a sequence must match.

%token IDENTIFIER "[A-Za-z] [A-Za-z0-9_]*";

Character classes

A character class matches a single character against the set of characters it contains. There are two forms: a plain character class, which matches any character in the set, and a negated character class (written with a leading "^"), which matches any character that is not in the set.

%token PUNCTUATION "[\.,\;\?!]";
%token NOTNUMBER "[^0-9]";

Universal character

The universal character "." matches any character except carriage return and line feed.

%token COMMENT "// .*";

Beginning of line

The symbol "^" matches the beginning of a line.

%token NOTE "^ \[ [0-9]+ \]";

End of line

The symbol "$" matches the end of a line.

%token BREAK "\\ \\ $";

Abbreviations

If a regular expression is used often, you can define an abbreviation for it and refer to it by name in angle brackets:

%ab NUMBER "[0-9]";
%token FLOAT "<NUMBER>+ \. <NUMBER>+";
%token INT "<NUMBER>+";

Comments and whitespace

These are two special tokens which can appear at any position in the parsed text. The parser reads these tokens and then discards them.

%ignore whitespace "[\n\r\ ]";
%ignore comment "// .*";

Productions

Productions are handled much like the productions in the XML grammar. A symbol can have more than one definition; the alternatives are separated by "|":

[Symbol of the production] : [Symbol1] [Symbol2] [..]
                           | [Symbol1] [..]
                           ;

To set the precedence of a production, use "%prec" at the end of the definition:

example : WORD float %prec PLUS
        | WORD
        ;
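
To show how "%prec" might combine with the associativity declarations from the token section, here is a sketch of a small expression grammar. It is an illustration under assumptions, not taken from the Chaperon distribution: the names MULT, PLUS, MINUS, NUMBER and exp are invented, the escapes "\*", "\+" and "\-" are assumed to work like the escapes shown earlier, "%start" is assumed to accept a bare symbol name, and the declaration-order priority of tokens is assumed to act as their precedence in the parser.

%left MULT "\*";
%left PLUS "\+";
%token MINUS "\-";
%token NUMBER "[0-9]+";

%start exp;

%%

exp : exp PLUS exp
    | exp MULT exp
    | MINUS exp %prec MULT
    | NUMBER
    ;

Under those assumptions, the third alternative takes the precedence of MULT rather than that of MINUS, so a leading minus sign would bind more tightly than addition.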

Error productions

Error productions let you control the level of error recovery. They use the special symbol "error" as a placeholder that holds the text which could not be parsed by any other part of the grammar.

line : error CR
     ;
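
For context, an error production usually sits next to the regular definitions of the same symbol. In the following sketch, WORD and CR are assumed to be tokens declared in the first part of the grammar, and the symbols lines and words are invented for the example.

lines : lines line
      | line
      ;

line : words CR
     | error CR
     ;

words : words WORD
      | WORD
      ;

With a grammar of this shape, a line that cannot be parsed as a sequence of words would be covered by the "error CR" alternative, so the damage should stay limited to that line and parsing can continue after the next CR.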

by Stephan Michels

Copyright © 2003 Chaperon Project. All rights reserved.