![]() |
![]() |
|||||||||
Multi-level Non-uniform Grouping of Very Large Flat Structured Documents
Grouping together different elements in a XML source document [XML]into an alternative structure in the result document is a common problem when developing XSL stylesheets [XSLT]. There are many variations on the problem of grouping, one example is as follows. Given a source XML document that lists AusWeb conference chairs: Example 1. Source Document
<chairs>
<chair>
<name>Paul Thistlewaite</name>
<year>1995</year>
<year>1996</year>
<year>1997</year>
<year>1998</year>
</chair>
<chair>
<name>Helen Ashman</name>
<year>1995</year>
<year>1996</year>
<year>1997</year>
<year>1998</year>
</chair>
<chair>
<name>Roger Debreceny</name>
<year>1999</year>
</chair>
<chair>
<name>Allan Ellis</name>
<year>1999</year>
<year>2000</year>
<year>2001</year>
<year>2002</year>
<year>2003</year>
<year>2004</year>
</chair>
</chairs>
Note that the example document has a uniform structure: the chairs element contains a sequence of chair element, which in turn contain a single name element followed by a sequence of year elements. For display purposes it may be desirable to rearrange this list of names into a two-column table, such as: Example 2. Desired Result Document 1
<table>
<row>
<entry>Paul Thistlewaite</entry>
<entry>Helen Ashman</entry>
</row>
<row>
<entry>Roger Debreceny</entry>
<entry>Allan Ellis</entry>
</row>
</table>
This is an example of positional grouping. Solving this type of problem using XSLT involves the following steps:
So, for the above problem a solution might be as follows: Example 3. XSL Stylesheet 1
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<xsl:template match='chairs'>
<!-- Step 1: We need to grouped together chair
elements pair-wise.
-->
<table>
<!-- Step 2: Select for processing all the
odd-numbered chair elements.
-->
<xsl:apply-templates
select='chair[position() mod 2 = 1]'/>
</table>
</xsl:template>
<!-- This template will only process the odd-numbered
chair elements.
-->
<xsl:template match='chair'>
<row>
<entry>
<xsl:apply-templates select='name'/>
</entry>
<!-- Step 3: Now find the even-numbered
chair elements.
-->
<entry>
<xsl:apply-templates
select='following-sibling::chair[1]/name'/>
</entry>
</row>
</xsl:template>
</xsl:stylesheet>
Another type of grouping problem [Tennison] is grouping by content. For example, given the same source document as above it may be desirable to find the conference chairs for a given year: Example 4. Desired Result Document 2
<chairs year='1997'> <name>Paul Thistlewaite</name> <name>Helen Ashman</name> </chairs> <chairs year='2002'> <name>Allan Ellis</name> </chairs> A stylesheet to produce this result is as follows: Example 5. XSL Stylesheet 2
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<!-- Step 1: Group by year -->
<xsl:template match='/'>
<!-- Step 2: Select the first node that is
the year we are interested in.
-->
<xsl:apply-templates
select='/chairs/chair/year[. = "1997"][1]'/>
<xsl:apply-templates
select='/chairs/chair/year[. = "2002"][1]'/>
</xsl:template>
<!-- Step 3: For a particular year,
find all chairs.
-->
<xsl:template match='year'>
<chairs year='{.}'>
<xsl:apply-templates
select='preceding-sibling::name|
ancestor::chair/following-sibling::chair
[year = current()]/name'/>
</chairs>
</xsl:template>
<xsl:template match='name'>
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
A common use of either type of grouping is to derive structure from a flat-structured data source [Pawson]. That is, to have a deeper element hierarchy in the result document than existed in the source document. As an example, below is an extract of another version of the sample XML source document (the full document may be found in flatsource.xml). Note that its structure is also uniform: it is a recurring sequence of title, year, chair (repeatable) and location elements. Example 6. Source Document
<conferences> <title>AusWeb 95</title> <year>1995</year> <chair>Paul Thistlewaite</chair> <chair>Helen Ashman</chair> <location>Ballina Beach Resort</location> <title>AusWeb 96</title> <year>1996</year> <chair>Paul Thistlewaite</chair> <chair>Helen Ashman</chair> <location>Conrad Jupiters Casino</location> <title>AusWeb 97</title> <year>1997</year> <chair>Paul Thistlewaite</chair> <chair>Helen Ashman</chair> <location>Conrad Jupiters Casino</location> . . . </conferences> The structure of the above sample document is very flat - the element hierarchy is only two elements deep. It would be much better to have a deeper element hierarchy, as shown below (the complete document may be found in deeper.xml). Example 7. Desired Structured Result Document
<conferences>
<conference>
<title>AusWeb 95</title>
<year>1995</year>
<location>Ballina Beach Resort</location>
<chairs>
<name>Paul Thistlewaite</name>
<name>Helen Ashman</name>
</chairs>
</conference>
<conference>
<title>AusWeb 96</title>
<year>1996</year>
<location>Conrad Jupiters Casino</location>
<chairs>
<name>Paul Thistlewaite</name>
<name>Helen Ashman</name>
</chairs>
</conference>
<conference>
<title>AusWeb 97</title>
<year>1997</year>
<location>Conrad Jupiters Casino</location>
<chairs>
<name>Paul Thistlewaite</name>
<name>Helen Ashman</name>
</chairs>
</conference>
.
.
.
</conferences>
The desired result document now has an element hierarchy depth of 4 elements. In order to achieve the required structure, grouping by content is necessary. In this case, the Meunchian method [Meunch] may be used to identify which nodes belong to a particular group. The Meunchian method involves the following steps:
Further, this is a two-level grouping problem: first, the content is grouped by conference and second, the conference chairs are grouped together. An XSL stylesheet to transform the original source document into the desired result document is as follows: Example 8. XSL Stylesheet 3
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<!-- Step 1: Group title, year, chair and location
together.
-->
<xsl:template match='conferences'>
<conferences>
<!-- Step 2: Process the title elements,
which being each group.
-->
<xsl:apply-templates select='title'/>
</conferences>
</xsl:template>
<!-- Step 3: For each title, find the related
nodes. These occur before the next title.
-->
<xsl:template match='title'>
<!-- Find the next location element
which delimits this group.
-->
<xsl:variable
name='nextLocation'
select='following-sibling::location[1]'/>
<conference>
<xsl:apply-templates
select='.|following-sibling::year[1]|
$nextLocation'/>
<!-- Now group together the chair elements
for this particular conference.
-->
<chairs>
<xsl:apply-templates
select='following-sibling::chair
[count($nextLocation|
following-sibling::location[1]) = 1]'/>
</chairs>
</conference>
</xsl:template>
<xsl:template match='chair'>
<name>
<xsl:apply-templates/>
</name>
</xsl:template>
<xsl:template match='*'>
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
This solution utilizes positional grouping for the first level of grouping (conferences) and Meunchian grouping for the second level of grouping (conference chairs). Methods for grouping in XSLT are suitable when the total number of nodes being grouped is relatively small. Also, grouping methods are easily applied when the document is uniform and the grouping is only one- or two-levels deep. However, there are significant classes of documents where these conditions are not the case. One such class of documents is word processing documents that are saved as XML; both Microsoft Word 2003 [MSWord] and Open Office [OpenOffice] have this capability. Word processing documents are fundamentally unstructured. Where they do have structure, it is related to the physical structuring of the visual appearance of the document. For example, a document may have a section in single column followed by a section in two columns. Many organisations wish to convert word processing documents into a logically structured format, such as DocBook [Walsh] or TEI [TEI]. In this case, physical structures are of no consequence and so should be discarded. Instead, the logical structures must be identified; this is most easily and reliably achieved by marking-up the document using "styles" but can also be done by applying heuristics. When converting a word processing document into a structured form, the first step is to "normalise" the document. This involves applying a XSL stylesheet that for the most part copies the document's data identically to the result, but removes the physical structures. Other unnecessary data may also be removed during this process, such as style information. The data that remains is a flat list of paragraphs, as shown below: Example 9. Extract Of Open Office Document Content
<office:document-content
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
office:class="text" office:version="1.0">
<office:body>
<text:p text:style-name="chair">Paul Thistlewaite</text:p>
<text:p text:style-name="Standard">1995</text:p>
<text:p text:style-name="Standard">1996</text:p>
<text:p text:style-name="Standard">1997</text:p>
<text:p text:style-name="Standard">1998</text:p>
<text:p text:style-name="chair">Helen Ashman</text:p>
<text:p text:style-name="Standard">1995</text:p>
<text:p text:style-name="Standard">1996</text:p>
<text:p text:style-name="Standard">1997</text:p>
<text:p text:style-name="Standard">1998</text:p>
<text:p text:style-name="chair">Roger Debreceny</text:p>
<text:p text:style-name="Standard">1999</text:p>
<text:p text:style-name="chair">Allan Ellis</text:p>
<text:p text:style-name="Standard">1999</text:p>
<text:p text:style-name="Standard">2000</text:p>
<text:p text:style-name="Standard">2001</text:p>
<text:p text:style-name="Standard">2002</text:p>
<text:p text:style-name="Standard">2003</text:p>
<text:p text:style-name="Standard">2004</text:p>
<text:p text:style-name="Standard"/>
</office:body>
</office:document-content>
This small example contains 20 paragraphs. However, some applications must routinely deal with documents that contain well in excess of 2000 paragraphs. Structured document schemata, such as DocBook or TEI or of an equivalent complexity, often have a number of sectioning components. For example, DocBook has a sectioning hierarchy as follows:
In fact, DocBook defines 5 numbered sections so potentially the depth of the sectioning element hierarchy is 8. From the perspective of grouping, this is a multi-level grouping problem and is far more complex than one- or two-level grouping. An XSL stylsheet that transforms the Open Office document above into a structured form shown earlier is as follows: Example 10. XSL Stylesheet 4
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:template match='office:body'>
<chairs>
<xsl:apply-templates
select='text:p[@text:style-name = "chair"]'/>
</chairs>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "chair"]'>
<chair>
<name>
<xsl:apply-templates/>
</name>
<xsl:variable name='nextChair'
select='following-sibling::text:p
[@text:style-name = "chair"][1]'/>
<xsl:choose>
<xsl:when test='$nextChair'>
<xsl:apply-template
select='following-sibling::text:p
[generate-id(
following-sibling::text:p[@text:style-name = "chair"][1]) =
generate-id($nextChair)]'/>
</xsl:when>
<xsl:otherwise>
<!-- Take all following paragraphs
to the end of the document
-->
<xsl:apply-templates select='following-sibling::text:p'/>
</xsl:otherwise>
</xsl:choose>
</chair>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "Standard"]'>
<year>
<xsl:apply-templates/>
</year>
</xsl:template>
</xsl:stylesheet>
In addition to the complexity of multi-level grouping, structured text documents exhibit a non-uniform pattern of sectioning components. That is, although the sectioning components are well-defined (books contain chapters, chapters contain sect1s, and so on) it is not necessarily the case that section components will appear at certain levels. With some schemata, it is also the case that certain elements may appear to different levels in the hierarchy. For example, some sect1 elements may contain sect2 children, whereas others may not. Also, lists or tables may occur at any level in the component hierarchy. It is possible to deal with these various characteristics when converting word processing documents to structured documents. Handling multi-level documents causes the complexity of the application of Meunchian methods to rise markedly, but the problem is still tractable. However, the application of Muenchian methods to large, flat structures results in very poor performance. This is because the method is essentially calculating the intersection of two sets. Determining a set nodes along a single axis is efficient, but applying a predicate to every node in the set is very costly. This cost is incurred for every node in every group. The performance cost for documents of different sizes is given below. It is important to note that the result of using the Meunchian method is correct, the desired result document is produced, it is only the runtime performance that is undesirable. Also, some XSLT processors provide an extension function to calculate the intersection of two nodesets, which obviates the need for the approaches described here. However, for maximum portability it is best to avoid the use of extensions.
Notes:
Although it has been shown that the Muenchian method performs poorly on word processing style documents, it must be realised that the problem is not the size of the document per se; the real problem is that word processing style documents have large, flat structures. The Meunchian method has unacceptable behaviour when grouping occurs in structures that have large numbers of sibling elements. Large (semi-)structured documents may not necessarily exhibit the same bad performance. In fact, once some structures have been established in the document (perhaps in a processing chain) and the sibling lengths have been reduced the Meunchian method may once again be successfully applied. So far, only a simple use of the Meunchian method has been shown. It is in fact possible to dramatically speed-up processing by using keys, via the key() function. In order to take advantage of keys, a key is created for each level of grouping desired in the result document. Each key then provides a fast lookup for the nodes in each group at that level. The key for each level of grouping must combine the key at previous levels, and so the complexity of this approach increases with the depth of grouping. The XSL stylesheet solution shown above is modified to use a key: Example 11. XSL Stylesheet 5
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:key name='immediate-nodes'
match='node()[not([@text:style-name = "chair"])]'
use='generate-id(preceding-sibling::text:p
[@text:style-name = "chair"][1])'/>
<xsl:template match='office:body'>
<chairs>
<xsl:apply-templates
select='text:p[@text:style-name = "chair"]'/>
</chairs>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "chair"]'>
<chair>
<name>
<xsl:apply-templates/>
</name>
<xsl:apply-templates
select='key("immediate-nodes", generate-id())'/>
</chair>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "Standard"]'>
<year>
<xsl:apply-templates/>
</year>
</xsl:template>
</xsl:stylesheet>
This solution improves the runtime performance, as shown in the table below. Note that the time required to build the key(s) is now significant; the actual application of the templates is very fast.
Notes:
All of the methods described so far result in performance times that are unacceptable. They are also difficult to use with non-uniform structures. A new method has been developed that results in acceptable performance. This method is based on an approach described by Mike Kay in [Pawson], but extended to handle multi-level grouping. Instead of using the Meunchian method to calculate the intersection of two sets, this method iterates through the document one node at a time and "remembers" the context of the node in order to build the desired structures in the result document. Template modes are used to set the current context. The new method uses the same basic techniques of grouping: identify the nodes to be grouped, process the node that is the start of the group and while processing that node gather together the nodes in that group. In addition, to achieve high performance some additional rules are introduced:
To illustrate the new method, a XSL stylesheet is shown below that transforms the Open Office example document into a structured form, using the new approach: Example 12. XSL Stylesheet 5
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:template match='office:body'>
<chairs>
<!-- Step 1: Group "chair" nodes together.
-->
<xsl:apply-templates
select='text:p[@text:style-name = "chair"]'/>
</chairs>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "chair"]'>
<chair>
<name>
<xsl:apply-templates/>
</name>
<!-- Find the end delimiter for this group.
-->
<xsl:variable name='nextChair'
select='following-sibling::text:p
[@text:style-name = "chair"][1]'/>
<xsl:choose>
<xsl:when test='$nextChair'>
<!-- Process the next paragraph.
Set "year" mode to signify that we're
looking for year values.
-->
<xsl:apply-templates
select='following-sibling::text:p[1]'
mode='year'>
<!-- Include the end delimiter so that
we know when to stop.
-->
<xsl:with-param name='nextChair'
select='$nextChair'/>
</xsl:apply-templates>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates
select='following-sibling::text:p[1]'
mode='year'/>
</xsl:otherwise>
</xsl:choose>
</chair>
</xsl:template>
<!-- This template only has effect when we're
looking for year values.
-->
<xsl:template
match='text:p[@text:style-name = "Standard"]'
mode='year'>
<xsl:param name='nextChair' select='/..'/>
<xsl:choose>
<xsl:when test='not($nextChair)'>
<!-- This is the last group -->
<year>
<xsl:apply-templates/>
</year>
<xsl:apply-templates
select='following-sibling::text:p[1]'
mode='year'/>
</xsl:when>
<!-- Have we reached the end of this group? -->
<xsl:when
test='generate-id(.) =
generate-id($nextChair)'/>
<xsl:otherwise>
<!-- Still in a group -->
<year>
<xsl:apply-templates/>
</year>
<xsl:apply-templates
select='following-sibling::text:p[1]'
mode='year'/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- Eliminate all unmatched elements to avoid
duplicates.
-->
<xsl:template match='text:p'/>
</xsl:stylesheet>
Extending this solution to multiple component levels is very straight-forward. The template that matches each level must check for a terminating condition. Any combination of explicit parameters or heuristics may be used as a terminating condition. The template is also passed a nodeset of further nodes to process. As an example, the following word processing document is to be transformed into a DocBook document: This document has three levels of headings, using styles named (surprisingly) Heading 1, Heading 2 and Heading 3. A common heuristic is that each level of heading delimits the content of the heading before it. Hence this is a three-level grouping problem. An extract of an XSL stylesheet to transform this document into DocBook is shown below (the complete stylesheet may be found in stylesheet6.xsl). This extract shows the templates for handling one heading level. Example 13. XSL Stylesheet 6
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:template match='office:body'>
<article>
<title>
<xsl:apply-templates
select='text:h[@text:style-name = "Heading 1"]'/>
</title>
<!-- Handle any paragraphs appearing in the "front-matter" -->
<xsl:apply-templates
select='text:h[@text:style-name = "Heading 2"][1]/
preceding-sibling::text:p'/>
<!-- Now create the sections.
Find all of the candidate sections,
store in a variable,
then kick off the processing with the first section -->
<xsl:variable name='sections'
select='text:h[@text:style-name = "Heading 2"]'/>
<xsl:apply-templates
select='$sections[1]' mode='section'>
<xsl:with-param name='sections' select='$sections'/>
</xsl:apply-templates>
</article>
</xsl:template>
<xsl:template
match='text:h[@text:style-name = "Heading 2"]'
mode='section'>
<xsl:param name='sections' select='/..'/>
<sect1>
<title>
<xsl:apply-templates/>
</title>
<!-- Edge condition: the last section must be handled separately -->
<xsl:choose>
<xsl:when test='$sections'>
<!-- Are there any subsections in this section?
If so, store them in a variable -->
<xsl:variable
name='subsections'
select='$sections[1]/
preceding-sibling::text:h
[@text:style-name = "Heading 2"]'/>
<xsl:apply-templates
select='following-sibling::*[1]'
mode='subsection'>
<xsl:with-param
name='nextSection'
select='$sections[1]'/>
<xsl:with-param
name='subsections'
select='$subsections'/>
</xsl:apply-templates>
</xsl:when>
<xsl:otherwise>
<!-- This is similar, but the subsections
are determined differently -->
...
</xsl:otherwise>
</xsl:choose>
</sect1>
<!-- Now continue with the next section -->
<xsl:apply-templates
select='$sections[1]'
mode='section'>
<xsl:with-param name='$sections[position() != 1]'/>
</xsl:apply-templates>
</xsl:template>
<xsl:template match='text:h'
mode='subsection'>
<xsl:param name='nextSection'
select='/..'/>
<xsl:param name='subsections'
select='/..'/>
<xsl:choose>
<!-- Have we reached the end delimiter
for this subsection? If so stop -->
<xsl:when test='generate-id() = generate-id($nextSection)'/>
<xsl:when test='@text:style-name = "Heading 2"'>
<!-- Find the (potential) end delimiter
for this subsection -->
<xsl:variable
name='nextSubsection'
select='following-sibling::text:h
[@text:style-name = "Heading 2"][1]'/>
<!-- Find the subsubsections, if any.
Store them in a variable -->
<xsl:variable
name='subsubsections'
select='following-sibling::text:h
[@text:style-name = "Heading 3"]'/>
<sect2>
...
</sect2>
<!-- Are there more subsections *within this group*
to continue with? -->
<xsl:if test='count($subsections|$nextSubsection) =
count($subsections)'>
<xsl:apply-templates
select='$nextSubsection'
mode='subsection'>
...
</xsl:apply-templates>
</xsl:if>
</xsl:when>
...
</xsl:choose>
</xsl:template>
.
.
.
</xsl:stylesheet>
This solution improves the runtime performance to an acceptable level, as shown in the table below:
Notes:
There is a W3C Working Group currently developing a specification for the next version of XSLT; version 2.0 [XSLT2]. Version 2.0 will have support for grouping built into the language and it would appear that this feature solves the problems described above. Grouping is performed in XSLT v2.0 by using the new xsl:for-each-group element along with the new current-group() XPath function. The xsl:for-each-group element may be used to achieve positional as well as content-based grouping. Like the xsl:for-each element, the xsl:for-each-group element has a select attribute to select a set a nodes that will be iterated upon. However, these are not processed individually, but rather they are grouped together into subsets and it is these subsets that the operation iterates over. One of three attributes (and only one) determines how these subsets are chosen: group-by, group-adjacent or group-starting-with, which will be explained below. Within the content of the xsl:for-each-group element the current-group() XPath function accesses the group that is the current subject of the iteration. The group-by attribute establishes a grouping key. The value of the attribute is an XPath expression that is evaluated for every node being grouped and the result of the expression is interpreted as a string. All nodes to be grouped that possess the same grouping key are in the same group. For example, XSL Stylesheet 2 may be rewritten as follows: Example 14.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<xsl:template match='chairs'>
<xsl:for-each-group select='chair/year'
group-by='.'>
<chairs year='{current-group()[1]}'>
<xsl:for-each select='current-group()'>
<xsl:apply-templates select='../name'/>
</xsl:for-each>
</chairs>
</xsl:for-each-group>
</xsl:template>
<xsl:template match='name'>
<xsl:copy/>
</xsl:template>
</xsl:stylesheet>
The group-starting-with attribute is also evaluated for every node to be grouped. Its value is a pattern and if a node being grouped matches the pattern then a new group is started and the node becomes its first member. Every node that does not match the pattern joins the group of the previous matching node. For example, XSL stylesheet 1 may be rewritten as follows: Example 15.
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:strip-space elements='*'/>
<xsl:template match='chairs'>
<table>
<xsl:for-each-group select='chair/name'
group-starting-with='position() mod 2 = 1'>
<tr>
<xsl:apply-templates select='current-group()'/>
</tr>
</xsl:for-each-group>
</table>
</xsl:template>
<xsl:template match='name'>
<td>
<xsl:apply-templates/>
</td>
</xsl:template>
</xsl:stylesheet>
Using group-starting-with the attribute will solve the problem of non-uniform grouping. Such a stylesheet may be written as follows: Example 16. Non-Uniform Grouping Using XSLT 2.0
<xsl:stylesheet version='1.0'
xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:office="http://openoffice.org/2000/office"
xmlns:text="http://openoffice.org/2000/text"
exclude-result-prefixes='office text'>
<xsl:strip-space elements='*'/>
<xsl:template match='office:body'>
<chairs>
<xsl:for-each-group select='*'
group-starting-with='@text:style-name = "chair"'>
<chair>
<xsl:apply-templates select='current-group()'/>
</chair>
</xsl:for-each-group>
</chairs>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "chair"]'>
<name>
<xsl:apply-templates/>
</name>
</xsl:template>
<xsl:template match='text:p[@text:style-name = "Standard"]'>
<year>
<xsl:apply-templates/>
</year>
</xsl:template>
</xsl:stylesheet>
The group-adjacent attribute also establishes a grouping key for each node to be grouped. If the previous node has the same grouping key then they join the same group, otherwise a new group is started. This can be used to group nodes by content. Most modern word processing applications offer the ability to save a document in an XML format. However, the schema used by the word processing application is usually not suitable for storage or further processing of the document. Word processing documents are essentially flat structures; they are a long list of paragraphs. Building up deeper structures in the document using XSLT is a particular type of grouping problem. Grouping is a common problem in XSLT and a number of techniques to achieve grouping have been shown. These include positional and Meunchian grouping, both with and without using keys. Although these techniques do produce the correct result document, their runtime performance (even with keys) has been shown to be unacceptable. A new technique has been developed, based on the use of modal template processing. This technique has been shown to be able to handle multi-level grouping tasks, as well as non-uniform grouping structures. Although the technique is verbose (even by XML standards!), the application of the technique is mostly boilerplate. Future work is aimed at developing a code-generator to automatically produce the boilerplate templates given a description of the input start/end delimiters and the desired result structures. The new technique described achieves high performance. The comparison of the various technqies on documents of different sizes is summarised below. For a baseline comparison, the timing of an identity transformation is given.
While this technique is very useful for XSLT version 1.0, XSLT version 2.0 introduces direct support for grouping that makes this and other grouping techniques obsolete.
|
Copyright © 1998-2004 Zveno Pty Ltd. All rights reserved. ABN 64 074 383 163. Legal notices. Comments or questions about this website? Contact the Zveno webperson.