Using Apache Cocoon

The Chaperon project contains a generator and transformers for the Apache Cocoon project. These enable Cocoon to read and transform text documents.

The project provides three main components: the TextGenerator, the LexicalTransformer and the ParserTransformer.

TextGenerator

The TextGenerator creates a SAX stream from a text file. It simply reads the text file and wraps its content in an XML element. To use the generator, include its declaration in the sitemap, as in the following example.

<map:generators>
 [...] 
 <map:generator name="text" 
                src="org.apache.cocoon.generation.TextGenerator" 
                logger="sitemap.generator.textgenerator"/>
 [...]
</map:generators>

The generator can then be used in a pipeline to generate the SAX stream.

<map:match pattern="*.xml">
 <map:generate  type="text" src="example/{1}.txt"/>
 <map:serialize type="xml"/>
</map:match>

The generator produces the following output:

<text xmlns="http://chaperon.sourceforge.net/schema/text/1.0">
 My text is not a text, if a text is a ....
</text>

LexicalTransformer

The LexicalTransformer uses these text elements to split the text into lexemes. To use this transformer, add the following declaration to the sitemap and use it in a pipeline.

<map:transformers>
 [...]
 <map:transformer name="lexer" 
                  src="org.apache.cocoon.transformation.LexicalTransformer" 
                  logger="sitemap.transformer.lexicaltransformer"/>
 [...]
</map:transformers>

<map:pipelines>
 <map:pipeline>
  <map:match pattern="*.xml">
   <map:generate  type="text"  src="example/{1}.txt"/>
   <map:transform type="lexer" src="lexicon.xlex"/>
   <map:serialize type="xml"/>
  </map:match>
 </map:pipeline>
</map:pipelines>

The output of the transformer has the following structure.

<lexemes xmlns="http://chaperon.sourceforge.net/schema/lexemes/1.0">
 <lexeme symbol="word" text="My"/>
 <lexeme symbol="word" text="text"/>
 <lexeme symbol="word" text="is"/>
</lexemes>

Optionally, you can use a parameter to specify the encoding to be used.

<map:transformer name="lexer" 
                 src="org.apache.cocoon.transformation.LexicalTransformer" 
                 logger="sitemap.transformer.lexicaltransformer">
 <map:parameter name="encoding" value="ISO-8859-1"/>
</map:transformer>

The following parameters can be used.

Attribute    Description
recovery     Whether the transformer should try to recover from errors that occur.
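
As a sketch of how the recovery parameter listed above might be set, the declaration below follows the same pattern as the encoding example. The value "true" is an assumption; the source does not state which values the parameter accepts.

<map:transformer name="lexer" 
                 src="org.apache.cocoon.transformation.LexicalTransformer" 
                 logger="sitemap.transformer.lexicaltransformer">
 <!-- Assumption: "recovery" takes a boolean value -->
 <map:parameter name="recovery" value="true"/>
</map:transformer>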

ParserTransformer

The ParserTransformer uses these lexemes to build the syntax tree.

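Warning
With large grammars, the transformer can take minutes to start up. It needs this time to build the parser automaton, which happens only once so that later processing is as fast as possible.

To use this transformer, add the following declaration to the sitemap and use it in a pipeline.
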
<map:transformers>
 [...]
 <map:transformer name="parser" 
                  src="org.apache.cocoon.transformation.ParserTransformer" 
                  logger="sitemap.transformer.parsertransformer"/>
 [...]
</map:transformers>

<map:pipelines>
 <map:pipeline>
  <map:match pattern="*.xml">
   <map:generate  type="text"   src="example/{1}.txt"/>
   <map:transform type="lexer"  src="lexicon.xlex"/>
   <map:transform type="parser" src="grammar.xgrm"/>
   <map:serialize type="xml"/>
  </map:match>
 </map:pipeline>
</map:pipelines>

The output of the transformer has the following structure.

<sentence xmlns="http://chaperon.sourceforge.net/schema/syntaxtree/1.0">
 <preposition>
  <word>My</word>
 </preposition>
 <subject>
  <word>text</word>
 </subject>
 <verb>
  <word>is</word>
 </verb>
 [...]
</sentence>

An additional parameter is 'flatten', which is used to reduce the depth of the produced XML hierarchy. It collapses nested elements that have the same symbol.

<map:transformer name="parser" 
                 src="net.sourceforge.chaperon.adapter.cocoon.ParserTransformer" 
                 logger="sitemap.transformer.parsertransformer">
 <map:parameter name="flatten" value="true"/>
</map:transformer>

The following parameters can be used.

Attribute    Description
flatten      Whether the transformer should produce a flatter XML hierarchy, in which nested elements with the same name are collapsed.
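
To illustrate what collapsing means, here is a hypothetical sketch. The element names 'list' and 'item' are invented for illustration and do not come from a real grammar, and the exact output of the transformer may differ. Without flatten, a recursive grammar rule might yield nested elements with the same symbol:

<!-- Hypothetical output without flatten -->
<list>
 <list>
  <list>
   <item>a</item>
  </list>
  <item>b</item>
 </list>
 <item>c</item>
</list>

With flatten set to true, the nested 'list' elements would be expected to collapse into a single element:

<!-- Hypothetical output with flatten="true" -->
<list>
 <item>a</item>
 <item>b</item>
 <item>c</item>
</list>
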
by Stephan Michels