Multi-level Non-uniform Grouping of Very Large Flat Structured Documents

Grouping In XSLT v1.0

Grouping together different elements in a XML source document [XML]into an alternative structure in the result document is a common problem when developing XSL stylesheets [XSLT]. There are many variations on the problem of grouping, one example is as follows. Given a source XML document that lists AusWeb conference chairs:

Example 1. 

Source Document
<chairs>
  <chair>
    <name>Paul Thistlewaite</name>
    <year>1995</year>
    <year>1996</year>
    <year>1997</year>
    <year>1998</year>
  </chair>
  <chair>
    <name>Helen Ashman</name>
    <year>1995</year>
    <year>1996</year>
    <year>1997</year>
    <year>1998</year>
  </chair>
  <chair>
    <name>Roger Debreceny</name>
    <year>1999</year>
  </chair>
  <chair>
    <name>Allan Ellis</name>
    <year>1999</year>
    <year>2000</year>
    <year>2001</year>
    <year>2002</year>
    <year>2003</year>
    <year>2004</year>
  </chair>
</chairs>

Note that the example document has a uniform structure: the chairs element contains a sequence of chair element, which in turn contain a single name element followed by a sequence of year elements. For display purposes it may be desirable to rearrange this list of names into a two-column table, such as:

Example 2. 

Desired Result Document 1
<table>
  <row>
    <entry>Paul Thistlewaite</entry>
    <entry>Helen Ashman</entry>
  </row>
  <row>
    <entry>Roger Debreceny</entry>
    <entry>Allan Ellis</entry>
  </row>
</table>

This is an example of positional grouping. Solving this type of problem using XSLT involves the following steps:

  1. Identify the nodes that are to be grouped

  2. Process only one of the nodes, typically the first

  3. When processing that node, find the other nodes in that group

Figure 1. 

Positional Grouping

So, for the above problem a solution might be as follows:

Example 3. 

XSL Stylesheet 1
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

  <xsl:strip-space elements='*'/>

  <xsl:template match='chairs'>

    <!-- Step 1: We need to grouped together chair
	 elements pair-wise.
      -->

    <table>
      <!-- Step 2: Select for processing all the
	   odd-numbered chair elements.
	-->
      <xsl:apply-templates
	 select='chair[position() mod 2 = 1]'/>
    </table>
  </xsl:template>

  <!-- This template will only process the odd-numbered
       chair elements.
    -->
  <xsl:template match='chair'>
    <row>
      <entry>
	<xsl:apply-templates select='name'/>
      </entry>

      <!-- Step 3: Now find the even-numbered
	   chair elements.
	-->

      <entry>
	<xsl:apply-templates
	   select='following-sibling::chair[1]/name'/>
      </entry>
    </row>
  </xsl:template>

</xsl:stylesheet>

Another type of grouping problem [Tennison] is grouping by content. For example, given the same source document as above it may be desirable to find the conference chairs for a given year:

Example 4. 

Desired Result Document 2
<chairs year='1997'>
  <name>Paul Thistlewaite</name>
  <name>Helen Ashman</name>
</chairs>
<chairs year='2002'>
  <name>Allan Ellis</name>
</chairs>

A stylesheet to produce this result is as follows:

Example 5. 

XSL Stylesheet 2
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

  <xsl:strip-space elements='*'/>

  <!-- Step 1: Group by year -->

  <xsl:template match='/'>
    <!-- Step 2: Select the first node that is
	 the year we are interested in.
      -->
    <xsl:apply-templates
       select='/chairs/chair/year[. = "1997"][1]'/>
    <xsl:apply-templates
       select='/chairs/chair/year[. = "2002"][1]'/>
  </xsl:template>

  <!-- Step 3: For a particular year,
       find all chairs.
    -->
  <xsl:template match='year'>
    <chairs year='{.}'>
      <xsl:apply-templates
	 select='preceding-sibling::name|
		 ancestor::chair/following-sibling::chair
		   [year = current()]/name'/>
    </chairs>
  </xsl:template>

  <xsl:template match='name'>
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>

A common use of either type of grouping is to derive structure from a flat-structured data source [Pawson]. That is, to have a deeper element hierarchy in the result document than existed in the source document. As an example, below is an extract of another version of the sample XML source document (the full document may be found in flatsource.xml). Note that its structure is also uniform: it is a recurring sequence of title, year, chair (repeatable) and location elements.

Example 6. 

Source Document
<conferences>
  <title>AusWeb 95</title>
  <year>1995</year>
  <chair>Paul Thistlewaite</chair>
  <chair>Helen Ashman</chair>
  <location>Ballina Beach Resort</location>

  <title>AusWeb 96</title>
  <year>1996</year>
  <chair>Paul Thistlewaite</chair>
  <chair>Helen Ashman</chair>
  <location>Conrad Jupiters Casino</location>

  <title>AusWeb 97</title>
  <year>1997</year>
  <chair>Paul Thistlewaite</chair>
  <chair>Helen Ashman</chair>
  <location>Conrad Jupiters Casino</location>
.
.
.
</conferences>

The structure of the above sample document is very flat - the element hierarchy is only two elements deep. It would be much better to have a deeper element hierarchy, as shown below (the complete document may be found in deeper.xml).

Example 7. 

Desired Structured Result Document
<conferences>
  <conference>
    <title>AusWeb 95</title>
    <year>1995</year>
    <location>Ballina Beach Resort</location>
    <chairs>
      <name>Paul Thistlewaite</name>
      <name>Helen Ashman</name>
    </chairs>
  </conference>
  <conference>
    <title>AusWeb 96</title>
    <year>1996</year>
    <location>Conrad Jupiters Casino</location>
    <chairs>
      <name>Paul Thistlewaite</name>
      <name>Helen Ashman</name>
    </chairs>
  </conference>
  <conference>
    <title>AusWeb 97</title>
    <year>1997</year>
    <location>Conrad Jupiters Casino</location>
    <chairs>
      <name>Paul Thistlewaite</name>
      <name>Helen Ashman</name>
    </chairs>
  </conference>
.
.
.
</conferences>

The desired result document now has an element hierarchy depth of 4 elements. In order to achieve the required structure, grouping by content is necessary. In this case, the Meunchian method [Meunch] may be used to identify which nodes belong to a particular group. The Meunchian method involves the following steps:

  1. First find that node which delimits the desired group.

  2. Select a set of nodes, usually in a particular axis, that includes all nodes to be grouped. This set will probably contain nodes that are not part of the group.

  3. Apply a predicate to the selected set of nodes that compares the identity of the next delimiting node to the actual delimiting node found in step 1. There are two common methods for identifying a node:

    • Use the generate-id() function. This function always returns the same, unique string for a given node.

    • Use the | set union operator combined with the count() function. The union operation eliminates duplicates, so the count of a node unioned with itself will be 1.

Further, this is a two-level grouping problem: first, the content is grouped by conference and second, the conference chairs are grouped together.

An XSL stylesheet to transform the original source document into the desired result document is as follows:

Example 8. 

XSL Stylesheet 3
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

  <xsl:strip-space elements='*'/>

  <!-- Step 1: Group title, year, chair and location
       together.
    -->

  <xsl:template match='conferences'>
    <conferences>

      <!-- Step 2: Process the title elements,
	   which being each group.
	-->
      <xsl:apply-templates select='title'/>
    </conferences>
  </xsl:template>

  <!-- Step 3: For each title, find the related
       nodes.  These occur before the next title.
    -->

  <xsl:template match='title'>

    <!-- Find the next location element 
	 which delimits this group.
      -->
    <xsl:variable
       name='nextLocation'
       select='following-sibling::location[1]'/>

    <conference>
      <xsl:apply-templates
	 select='.|following-sibling::year[1]|
		 $nextLocation'/>

      <!-- Now group together the chair elements
	   for this particular conference.
	-->
      <chairs>
	<xsl:apply-templates
	   select='following-sibling::chair
		   [count($nextLocation|
		   following-sibling::location[1]) = 1]'/>
      </chairs>
    </conference>
  </xsl:template>

  <xsl:template match='chair'>
    <name>
      <xsl:apply-templates/>
    </name>
  </xsl:template>
  <xsl:template match='*'>
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>

This solution utilizes positional grouping for the first level of grouping (conferences) and Meunchian grouping for the second level of grouping (conference chairs).

Handling Large, Multi-level, Non-uniform, Flat-structured Documents

Methods for grouping in XSLT are suitable when the total number of nodes being grouped is relatively small. Also, grouping methods are easily applied when the document is uniform and the grouping is only one- or two-levels deep. However, there are significant classes of documents where these conditions are not the case. One such class of documents is word processing documents that are saved as XML; both Microsoft Word 2003 [MSWord] and Open Office [OpenOffice] have this capability.

Word processing documents are fundamentally unstructured. Where they do have structure, it is related to the physical structuring of the visual appearance of the document. For example, a document may have a section in single column followed by a section in two columns. Many organisations wish to convert word processing documents into a logically structured format, such as DocBook [Walsh] or TEI [TEI]. In this case, physical structures are of no consequence and so should be discarded. Instead, the logical structures must be identified; this is most easily and reliably achieved by marking-up the document using "styles" but can also be done by applying heuristics.

When converting a word processing document into a structured form, the first step is to "normalise" the document. This involves applying a XSL stylesheet that for the most part copies the document's data identically to the result, but removes the physical structures. Other unnecessary data may also be removed during this process, such as style information. The data that remains is a flat list of paragraphs, as shown below:

Figure 2. 

Open Office Document

Example 9. 

Extract Of Open Office Document Content
<office:document-content
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   office:class="text" office:version="1.0">
  <office:body>
    <text:p text:style-name="chair">Paul Thistlewaite</text:p>
    <text:p text:style-name="Standard">1995</text:p>
    <text:p text:style-name="Standard">1996</text:p>
    <text:p text:style-name="Standard">1997</text:p>
    <text:p text:style-name="Standard">1998</text:p>
    <text:p text:style-name="chair">Helen Ashman</text:p>
    <text:p text:style-name="Standard">1995</text:p>
    <text:p text:style-name="Standard">1996</text:p>
    <text:p text:style-name="Standard">1997</text:p>
    <text:p text:style-name="Standard">1998</text:p>
    <text:p text:style-name="chair">Roger Debreceny</text:p>
    <text:p text:style-name="Standard">1999</text:p>
    <text:p text:style-name="chair">Allan Ellis</text:p>
    <text:p text:style-name="Standard">1999</text:p>
    <text:p text:style-name="Standard">2000</text:p>
    <text:p text:style-name="Standard">2001</text:p>
    <text:p text:style-name="Standard">2002</text:p>
    <text:p text:style-name="Standard">2003</text:p>
    <text:p text:style-name="Standard">2004</text:p>
    <text:p text:style-name="Standard"/>
  </office:body>
</office:document-content>

This small example contains 20 paragraphs. However, some applications must routinely deal with documents that contain well in excess of 2000 paragraphs.

Structured document schemata, such as DocBook or TEI or of an equivalent complexity, often have a number of sectioning components. For example, DocBook has a sectioning hierarchy as follows:

set
book, article
preface, chapter, appendix
section, sect1
sect2

In fact, DocBook defines 5 numbered sections so potentially the depth of the sectioning element hierarchy is 8. From the perspective of grouping, this is a multi-level grouping problem and is far more complex than one- or two-level grouping.

An XSL stylsheet that transforms the Open Office document above into a structured form shown earlier is as follows:

Example 10. 

XSL Stylesheet 4
<xsl:stylesheet version='1.0'
   xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>

  <xsl:strip-space elements='*'/>

  <xsl:template match='office:body'>
    <chairs>
      <xsl:apply-templates
	 select='text:p[@text:style-name = "chair"]'/>
    </chairs>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "chair"]'>
    <chair>
      <name>
	<xsl:apply-templates/>
      </name>

      <xsl:variable name='nextChair'
	select='following-sibling::text:p
		  [@text:style-name = "chair"][1]'/>

      <xsl:choose>
	<xsl:when test='$nextChair'>

	  <xsl:apply-template
	     select='following-sibling::text:p
	       [generate-id(
		  following-sibling::text:p[@text:style-name = "chair"][1]) = 
	        generate-id($nextChair)]'/>

	</xsl:when>
	<xsl:otherwise>

	  <!-- Take all following paragraphs
	       to the end of the document
	    -->

	  <xsl:apply-templates select='following-sibling::text:p'/>
	</xsl:otherwise>
      </xsl:choose>
    </chair>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "Standard"]'>
    <year>
      <xsl:apply-templates/>
    </year>
  </xsl:template>

</xsl:stylesheet>

In addition to the complexity of multi-level grouping, structured text documents exhibit a non-uniform pattern of sectioning components. That is, although the sectioning components are well-defined (books contain chapters, chapters contain sect1s, and so on) it is not necessarily the case that section components will appear at certain levels. With some schemata, it is also the case that certain elements may appear to different levels in the hierarchy. For example, some sect1 elements may contain sect2 children, whereas others may not. Also, lists or tables may occur at any level in the component hierarchy.

It is possible to deal with these various characteristics when converting word processing documents to structured documents. Handling multi-level documents causes the complexity of the application of Meunchian methods to rise markedly, but the problem is still tractable. However, the application of Muenchian methods to large, flat structures results in very poor performance. This is because the method is essentially calculating the intersection of two sets.

Figure 3. 

Intersection Of Two Sets

Determining a set nodes along a single axis is efficient, but applying a predicate to every node in the set is very costly. This cost is incurred for every node in every group. The performance cost for documents of different sizes is given below. It is important to note that the result of using the Meunchian method is correct, the desired result document is produced, it is only the runtime performance that is undesirable. Also, some XSLT processors provide an extension function to calculate the intersection of two nodesets, which obviates the need for the approaches described here. However, for maximum portability it is best to avoid the use of extensions.

Muenchian Method Performance

Description

Small

Medium

Large

Level 1

1.85s

814s

24+ hrs

Level 2

0.45s

4.6s

?

Notes:

  • All times were measured on an Apple Macintosh PowerBook, 800MHz G4 PowerPC, 512MB RAM, using Gnome libxml2 and libxslt [Gnome].

  • In order to achieve two-level grouping, the processing was performed using two separate stylesheets; one to find the first level groups and a second to find all second level groups. Note how finding the second level groups is much faster, because the sibling chains have been shortened by the first grouping stylesheet.

  • The "Small" document is approximately 100 paragraphs. The "Medium" document is approximately 2000 paragraphs. The "Large" document is approximately 10000 paragraphs.

Although it has been shown that the Muenchian method performs poorly on word processing style documents, it must be realised that the problem is not the size of the document per se; the real problem is that word processing style documents have large, flat structures. The Meunchian method has unacceptable behaviour when grouping occurs in structures that have large numbers of sibling elements. Large (semi-)structured documents may not necessarily exhibit the same bad performance. In fact, once some structures have been established in the document (perhaps in a processing chain) and the sibling lengths have been reduced the Meunchian method may once again be successfully applied.

So far, only a simple use of the Meunchian method has been shown. It is in fact possible to dramatically speed-up processing by using keys, via the key() function. In order to take advantage of keys, a key is created for each level of grouping desired in the result document. Each key then provides a fast lookup for the nodes in each group at that level. The key for each level of grouping must combine the key at previous levels, and so the complexity of this approach increases with the depth of grouping.

The XSL stylesheet solution shown above is modified to use a key:

Example 11. 

XSL Stylesheet 5
<xsl:stylesheet version='1.0'
   xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>

  <xsl:strip-space elements='*'/>

  <xsl:key name='immediate-nodes'
	   match='node()[not([@text:style-name = "chair"])]'
	   use='generate-id(preceding-sibling::text:p
		[@text:style-name = "chair"][1])'/>

  <xsl:template match='office:body'>
    <chairs>
      <xsl:apply-templates
	 select='text:p[@text:style-name = "chair"]'/>
    </chairs>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "chair"]'>
    <chair>
      <name>
	<xsl:apply-templates/>
      </name>

      <xsl:apply-templates
	 select='key("immediate-nodes", generate-id())'/>

    </chair>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "Standard"]'>
    <year>
      <xsl:apply-templates/>
    </year>
  </xsl:template>

</xsl:stylesheet>

This solution improves the runtime performance, as shown in the table below. Note that the time required to build the key(s) is now significant; the actual application of the templates is very fast.

Muenchian Method With Keys Performance

Small

Medium

Large

2.43s

141s

3.9hrs

Notes:

  • The same documents were used for timing these results, as well as the same hardware and software.

  • A single stylesheet was used to perform both first- and second-level grouping.

Grouping in Large, Multi-level, Non-uniform, Flat-structured Documents

All of the methods described so far result in performance times that are unacceptable. They are also difficult to use with non-uniform structures. A new method has been developed that results in acceptable performance. This method is based on an approach described by Mike Kay in [Pawson], but extended to handle multi-level grouping. Instead of using the Meunchian method to calculate the intersection of two sets, this method iterates through the document one node at a time and "remembers" the context of the node in order to build the desired structures in the result document. Template modes are used to set the current context.

The new method uses the same basic techniques of grouping: identify the nodes to be grouped, process the node that is the start of the group and while processing that node gather together the nodes in that group. In addition, to achieve high performance some additional rules are introduced:

  • Use the idiom following-sibling::*[1] to step through the source document a single node at a time.

  • When performing an expensive computation, save the result in a variable and pass that variable as a parameter. That is, avoid re-computing the same (expensive) result.

  • Send duplicate nodesets for processing, but also include a parameter that determines when to stop processing, to avoid duplicating data. Use a mode to set the component level.

  • Separate grouping problems into different stylesheets, and chain the processing of a document through those stylesheets. This simplifies the development of the stylesheets and can improve performance (by reducing the number of templates in a stylesheet).

To illustrate the new method, a XSL stylesheet is shown below that transforms the Open Office example document into a structured form, using the new approach:

Example 12. 

XSL Stylesheet 5
<xsl:stylesheet version='1.0'
   xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>

  <xsl:strip-space elements='*'/>

  <xsl:template match='office:body'>
    <chairs>

      <!-- Step 1: Group "chair" nodes together.
	-->

      <xsl:apply-templates
	 select='text:p[@text:style-name = "chair"]'/>
    </chairs>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "chair"]'>
    <chair>
      <name>
	<xsl:apply-templates/>
      </name>

      <!-- Find the end delimiter for this group.
	-->

      <xsl:variable name='nextChair'
	select='following-sibling::text:p
		[@text:style-name = "chair"][1]'/>

      <xsl:choose>
	<xsl:when test='$nextChair'>

	  <!-- Process the next paragraph.
	       Set "year" mode to signify that we're
	       looking for year values.
	    -->

	  <xsl:apply-templates
	     select='following-sibling::text:p[1]'
	     mode='year'>

	    <!-- Include the end delimiter so that
		 we know when to stop.
	      -->

	    <xsl:with-param name='nextChair'
			    select='$nextChair'/>
	  </xsl:apply-templates>
	</xsl:when>

	<xsl:otherwise>
	  <xsl:apply-templates
	     select='following-sibling::text:p[1]'
	     mode='year'/>
	</xsl:otherwise>
      </xsl:choose>
    </chair>
  </xsl:template>

  <!-- This template only has effect when we're
       looking for year values.
    -->

  <xsl:template
     match='text:p[@text:style-name = "Standard"]'
     mode='year'>

    <xsl:param name='nextChair' select='/..'/>

    <xsl:choose>
      <xsl:when test='not($nextChair)'>
	<!-- This is the last group -->
	<year>
	  <xsl:apply-templates/>
	</year>

	<xsl:apply-templates
	   select='following-sibling::text:p[1]'
	   mode='year'/>

      </xsl:when>

      <!-- Have we reached the end of this group? -->
      <xsl:when
	 test='generate-id(.) = 
	       generate-id($nextChair)'/>

      <xsl:otherwise>
	<!-- Still in a group -->
	<year>
	  <xsl:apply-templates/>
	</year>

	<xsl:apply-templates
	   select='following-sibling::text:p[1]'
	   mode='year'/>

      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Eliminate all unmatched elements to avoid
       duplicates.
    -->
  <xsl:template match='text:p'/>

</xsl:stylesheet>

Extending this solution to multiple component levels is very straight-forward. The template that matches each level must check for a terminating condition. Any combination of explicit parameters or heuristics may be used as a terminating condition. The template is also passed a nodeset of further nodes to process.

As an example, the following word processing document is to be transformed into a DocBook document:

Figure 4. 

Example Word Processing Document

This document has three levels of headings, using styles named (surprisingly) Heading 1, Heading 2 and Heading 3. A common heuristic is that each level of heading delimits the content of the heading before it. Hence this is a three-level grouping problem. An extract of an XSL stylesheet to transform this document into DocBook is shown below (the complete stylesheet may be found in stylesheet6.xsl). This extract shows the templates for handling one heading level.

Example 13. 

XSL Stylesheet 6
<xsl:stylesheet version='1.0'
   xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>

  <xsl:strip-space elements='*'/>

  <xsl:template match='office:body'>
    <article>
      <title>
	<xsl:apply-templates
	   select='text:h[@text:style-name = "Heading 1"]'/>
      </title>

      <!-- Handle any paragraphs appearing in the "front-matter" -->

      <xsl:apply-templates
	 select='text:h[@text:style-name = "Heading 2"][1]/
		 preceding-sibling::text:p'/>

      <!-- Now create the sections.
	   Find all of the candidate sections,
	   store in a variable,
	   then kick off the processing with the first section -->

      <xsl:variable name='sections'
		    select='text:h[@text:style-name = "Heading 2"]'/>

      <xsl:apply-templates
	 select='$sections[1]' mode='section'>

	<xsl:with-param name='sections' select='$sections'/>
      </xsl:apply-templates>
    </article>
  </xsl:template>

  <xsl:template
     match='text:h[@text:style-name = "Heading 2"]'
     mode='section'>

    <xsl:param name='sections' select='/..'/>

    <sect1>
      <title>
	<xsl:apply-templates/>
      </title>

      <!-- Edge condition: the last section must be handled separately -->

      <xsl:choose>
	<xsl:when test='$sections'>

	  <!-- Are there any subsections in this section?
	       If so, store them in a variable -->

	  <xsl:variable
	     name='subsections'
	     select='$sections[1]/
		     preceding-sibling::text:h
		     [@text:style-name = "Heading 2"]'/>

	  <xsl:apply-templates
	     select='following-sibling::*[1]'
	     mode='subsection'>

	    <xsl:with-param
	       name='nextSection'
	       select='$sections[1]'/>

	    <xsl:with-param
	       name='subsections'
	       select='$subsections'/>
	  </xsl:apply-templates>
	</xsl:when>

	<xsl:otherwise>

	  <!-- This is similar, but the subsections
	       are determined differently -->
...
	</xsl:otherwise>
      </xsl:choose>
    </sect1>

    <!-- Now continue with the next section -->

    <xsl:apply-templates
       select='$sections[1]'
       mode='section'>
      <xsl:with-param name='$sections[position() != 1]'/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match='text:h'
		mode='subsection'>
    <xsl:param name='nextSection'
	       select='/..'/>
    <xsl:param name='subsections'
	       select='/..'/>

    <xsl:choose>
      <!-- Have we reached the end delimiter
	   for this subsection?  If so stop -->
      <xsl:when test='generate-id() = generate-id($nextSection)'/>

      <xsl:when test='@text:style-name = "Heading 2"'>

	<!-- Find the (potential) end delimiter
	     for this subsection -->

	<xsl:variable
	   name='nextSubsection'
	   select='following-sibling::text:h
		   [@text:style-name = "Heading 2"][1]'/>

	<!-- Find the subsubsections, if any.
	     Store them in a variable -->

	<xsl:variable
	   name='subsubsections'
	   select='following-sibling::text:h
		   [@text:style-name = "Heading 3"]'/>

	<sect2>
...
	</sect2>

	<!-- Are there more subsections *within this group*
	     to continue with? -->

	<xsl:if test='count($subsections|$nextSubsection) = 
		      count($subsections)'>

	  <xsl:apply-templates
	     select='$nextSubsection'
	     mode='subsection'>
...
	  </xsl:apply-templates>
	</xsl:if>
      </xsl:when>
...
    </xsl:choose>
  </xsl:template>
.
.
.
</xsl:stylesheet>

This solution improves the runtime performance to an acceptable level, as shown in the table below:

New Approach Performance

Small

Medium

Large

0.48s

10.57s

1020s

Notes:

  • The same documents were used for timing these results, as well as the same hardware and software.

  • The stylesheet used performs two-level grouping.

XSLT Version 2.0

There is a W3C Working Group currently developing a specification for the next version of XSLT; version 2.0 [XSLT2]. Version 2.0 will have support for grouping built into the language and it would appear that this feature solves the problems described above.

Grouping is performed in XSLT v2.0 by using the new xsl:for-each-group element along with the new current-group() XPath function. The xsl:for-each-group element may be used to achieve positional as well as content-based grouping. Like the xsl:for-each element, the xsl:for-each-group element has a select attribute to select a set a nodes that will be iterated upon. However, these are not processed individually, but rather they are grouped together into subsets and it is these subsets that the operation iterates over. One of three attributes (and only one) determines how these subsets are chosen: group-by, group-adjacent or group-starting-with, which will be explained below. Within the content of the xsl:for-each-group element the current-group() XPath function accesses the group that is the current subject of the iteration.

The group-by attribute establishes a grouping key. The value of the attribute is an XPath expression that is evaluated for every node being grouped and the result of the expression is interpreted as a string. All nodes to be grouped that possess the same grouping key are in the same group. For example, XSL Stylesheet 2 may be rewritten as follows:

Example 14. 

XSLT Stylesheet 2, Using XSLT 2.0
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:strip-space elements='*'/>

  <xsl:template match='chairs'>
    <xsl:for-each-group select='chair/year'
      group-by='.'>
      <chairs year='{current-group()[1]}'>
	<xsl:for-each select='current-group()'>
	  <xsl:apply-templates select='../name'/>
	</xsl:for-each>
      </chairs>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:template match='name'>
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>

      

The group-starting-with attribute is also evaluated for every node to be grouped. Its value is a pattern and if a node being grouped matches the pattern then a new group is started and the node becomes its first member. Every node that does not match the pattern joins the group of the previous matching node. For example, XSL stylesheet 1 may be rewritten as follows:

Example 15. 

XSLT Stylesheet 1, Using XSLT 2.0
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:strip-space elements='*'/>

  <xsl:template match='chairs'>
    <table>
      <xsl:for-each-group select='chair/name'
	group-starting-with='position() mod 2 = 1'>
	<tr>
	  <xsl:apply-templates select='current-group()'/>
	</tr>
      </xsl:for-each-group>
    </table>
  </xsl:template>

  <xsl:template match='name'>
    <td>
      <xsl:apply-templates/>
    </td>
  </xsl:template>
</xsl:stylesheet>

Using group-starting-with the attribute will solve the problem of non-uniform grouping. Such a stylesheet may be written as follows:

Example 16. 

Non-Uniform Grouping Using XSLT 2.0
<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:office="http://openoffice.org/2000/office"
   xmlns:text="http://openoffice.org/2000/text"
   exclude-result-prefixes='office text'>
  <xsl:strip-space elements='*'/>

  <xsl:template match='office:body'>
    <chairs>
      <xsl:for-each-group select='*'
	group-starting-with='@text:style-name = "chair"'>
	<chair>
	  <xsl:apply-templates select='current-group()'/>
	</chair>
      </xsl:for-each-group>
    </chairs>
  </xsl:template>

  <xsl:template match='text:p[@text:style-name = "chair"]'>
    <name>
      <xsl:apply-templates/>
    </name>
  </xsl:template>
  <xsl:template match='text:p[@text:style-name = "Standard"]'>
    <year>
      <xsl:apply-templates/>
    </year>
  </xsl:template>
</xsl:stylesheet>

The group-adjacent attribute also establishes a grouping key for each node to be grouped. If the previous node has the same grouping key then they join the same group, otherwise a new group is started. This can be used to group nodes by content.

Conclusion

Most modern word processing applications offer the ability to save a document in an XML format. However, the schema used by the word processing application is usually not suitable for storage or further processing of the document. Word processing documents are essentially flat structures; they are a long list of paragraphs. Building up deeper structures in the document using XSLT is a particular type of grouping problem.

Grouping is a common problem in XSLT and a number of techniques to achieve grouping have been shown. These include positional and Meunchian grouping, both with and without using keys. Although these techniques do produce the correct result document, their runtime performance (even with keys) has been shown to be unacceptable.

A new technique has been developed, based on the use of modal template processing. This technique has been shown to be able to handle multi-level grouping tasks, as well as non-uniform grouping structures. Although the technique is verbose (even by XML standards!), the application of the technique is mostly boilerplate. Future work is aimed at developing a code-generator to automatically produce the boilerplate templates given a description of the input start/end delimiters and the desired result structures.

The new technique described achieves high performance. The comparison of the various technqies on documents of different sizes is summarised below. For a baseline comparison, the timing of an identity transformation is given.

Muenchian Method Performance

Description

Small

Medium

Large

(Identity)

0.25s

1.29s

14s

Meunchian level 1

1.85s

814s

24+ hrs

Meunchian level 2

0.45s

4.6s

?

Meunchian with keys

2.43s

141s

3.9hrs

Modal templates

0.48s

10.57s

1020s

While this technique is very useful for XSLT version 1.0, XSLT version 2.0 introduces direct support for grouping that makes this and other grouping techniques obsolete.

References

[XML]

eXtensible Markup Language, Second Edition. Tim Bray (Ed). W3C Recommendation October 2000.

[XSLT]

XSL Transformations (XSLT) Version 1.0. James Clark. W3C Recommendation November 1999.

[XSLT2]

XSL Transformations (XSLT) Version 2.0. Michael Kay (editor). W3C Working Draft December 2001.

[Tennison]

Jeni Tennison. Jeni's XSLT Pages: Grouping.

[Pawson]

XSLT FAQ. Flat file transformation.

[Meunch]

XSLT FAQ. Grouping.

[MSWord]

Microsoft Word 2003. Microsoft Office Online Home Page.

[OpenOffice]

Open Office. OpenOffice.org: Home Page.

[Walsh]

DocBook: The Definitive Guide. Norm Walsh. Read The Book.

[TEI]

Text Encoding Initiative. TEI Website.

[Gnome]

Gnome libxml. Daniel Veillard. xmlsoft.org.


Copyright © 1998-2004 Zveno Pty Ltd. All rights reserved. ABN 64 074 383 163. Legal notices. Comments or questions about this website? Contact the Zveno webperson.