XML lexicon format

Structure

The lexicon format defines a set of lexemes (tokens) that can be recognized in the input stream. It is similar to an ordinary lexicon, which defines the words of a natural language. Each token is represented by a symbol and a definition.

<lexicon>
  <lexeme symbol="...">[lexeme definitions]</lexeme>
  <lexeme symbol="...">[lexeme definitions]</lexeme>
  <lexeme symbol="...">[lexeme definitions]</lexeme>
</lexicon>
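
As a rough illustration, a minimal lexicon with two tokens might look like the following. The symbol names "plus" and "digit" are invented for this sketch, the definition elements are the ones described in the sections below, and the interval bounds are assumed to be inclusive.

<lexicon>
  <!-- The literal plus sign -->
  <lexeme symbol="plus">
    <cstring content="+"/>
  </lexeme>
  <!-- A single decimal digit (assuming inclusive interval bounds) -->
  <lexeme symbol="digit">
    <cclass>
      <cinterval min="0" max="9"/>
    </cclass>
  </lexeme>
</lexicon>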

Lexical tokens

Every token has an entry and is mapped to a terminal symbol. A terminal symbol is one that cannot be broken down into smaller structures. If no symbol is specified, the token is still recognized, but it is discarded.

<lexeme symbol="Name of the symbol">
 [definition of the lexeme]
</lexeme>
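
A lexeme without a symbol is a natural fit for input that should be skipped, such as blanks. The following is only a sketch of that idea: it matches a single space character, and because the symbol attribute is left out, the match would be recognized and then dropped.

<lexeme>
 <!-- No symbol attribute: a single blank is recognized but not passed on -->
 <cstring content=" "/>
</lexeme>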

To define tokens, Chaperon uses a structure similar to regular expressions. It consists of alternations, concatenations, character classes, and so on.

Every element can carry the attributes "minOccurs" and "maxOccurs", which specify the minimum and maximum number of times that part may occur.
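
For example, a token made of one to three decimal digits could be sketched as follows. The symbol name "number" is hypothetical, and the sketch assumes that both attributes take plain decimal numbers and that interval bounds are inclusive; the format for an unbounded maximum is not described here, so a fixed bound is used.

<lexeme symbol="number">
 <!-- Between one and three characters from the range 0 to 9 -->
 <cclass minOccurs="1" maxOccurs="3">
  <cinterval min="0" max="9"/>
 </cclass>
</lexeme>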

Alternations

Alternation means that one of the contained elements must match.

<lexeme symbol="Name of the symbol">
 <alt>
  [element 1]
  [element 2]
  [element 3]
 </alt>
</lexeme>
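
With concrete elements filled in, an alternation might look like this sketch, which accepts either of two keywords. The symbol name "boolean" is made up for the example.

<lexeme symbol="boolean">
 <alt>
  <!-- Either keyword is accepted -->
  <cstring content="true"/>
  <cstring content="false"/>
 </alt>
</lexeme>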

Concatenations

Concatenation means that all elements in a sequence must match.

<lexeme symbol="Name of the symbol">
 <concat>
  [element 1]
  [element 2]
  [element 3]
 </concat>
</lexeme>
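
A concrete concatenation could look like the sketch below: a fixed prefix followed by one to four hexadecimal digits. The symbol name "hexnumber" is hypothetical, and the sketch assumes numeric occurrence values and inclusive interval bounds.

<lexeme symbol="hexnumber">
 <concat>
  <!-- The fixed prefix comes first -->
  <cstring content="0x"/>
  <!-- Then one to four hexadecimal digits -->
  <cclass minOccurs="1" maxOccurs="4">
   <cinterval min="0" max="9"/>
   <cinterval min="a" max="f"/>
   <cinterval min="A" max="F"/>
  </cclass>
 </concat>
</lexeme>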

Character classes

A character class matches a single input character against the characters that the class contains. There are two kinds: a normal character class and an exclusive character class. An exclusive character class matches only if the input character is not one of the characters in the class.

<lexeme symbol="Name of the symbol">
 <cclass>
  [Characters, which should match]
 </cclass>

 <cclass exclusive="true">
  [Characters, which shouldn't match]
 </cclass>
</lexeme>

A character class can contain two kinds of elements:

Character sets
Character intervals

<lexeme symbol="Name of the symbol">
 <cclass>
  <cset content="abcd"/>
  <cinterval min="e" max="z"/>
 </cclass>
</lexeme>

A character set lists the characters that the character class should include. A character interval defines a range between two characters.

In a cset, you can also specify a single character by its Unicode code point using the attribute code.
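
The code attribute is useful for characters that are awkward to write literally. The sketch below matches a tab or a space; it assumes that code takes a decimal code point (9 for the tab character), which is not stated here, and the symbol name "whitespace" is invented for the example.

<lexeme symbol="whitespace">
 <cclass>
  <!-- Tab, given by code point (decimal notation is an assumption) -->
  <cset code="9"/>
  <!-- Space, given literally -->
  <cset content=" "/>
 </cclass>
</lexeme>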

Strings

A string matches its characters one after another, in the given order. Instead of the attribute content, you can specify a single character by its Unicode code point using the attribute code.

<lexeme symbol="Name of the symbol">
 <cstring content="Sequence of characters"/>
</lexeme>

Universal character

This element matches any character, including carriage return and line feed.

<lexeme symbol="Name of the symbol">
 <cuniversal/>
</lexeme>

Beginning of line

This symbol matches the beginning of a line.

<lexeme symbol="Name of the symbol">
 <bol/>
</lexeme>

End of line

This symbol matches the end of a line, that is, the position just before the first carriage return or line feed.

<lexeme symbol="Name of the symbol">
 <eol/>
</lexeme>
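
The two line anchors can be combined with the other elements. The following sketch describes a comment that starts with "#" at the beginning of a line and runs to the end of that line. The symbol name "comment" and the limit of 80 characters are arbitrary choices for the example, and the sketch assumes numeric occurrence values and inclusive interval bounds.

<lexeme symbol="comment">
 <concat>
  <!-- Only match at the start of a line -->
  <bol/>
  <cstring content="#"/>
  <!-- Up to 80 printable ASCII characters, space through tilde -->
  <cclass minOccurs="0" maxOccurs="80">
   <cinterval min=" " max="~"/>
  </cclass>
  <!-- The comment must reach the end of the line -->
  <eol/>
 </concat>
</lexeme>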

by Stephan Michels

Copyright © 2003 Chaperon Project. All rights reserved.