XSLT: How do I parse a document to retain only interesting nodes, along with all parents, at any depth?
Given the document below:
<root>
  <a>
    <b>
      <c>sometext</c>
    </b>
    <b>
      <d/>
    </b>
    <b>
      <e>
        <f>some interesting__text more</f>
      </e>
    </b>
  </a>
  <h>
    <g>another piece of very_interesting__text</g>
  </h>
</root>
I want the following out:
<root>
  <a>
    <b>
      <e>
        <f>some interesting__text more</f>
      </e>
    </b>
  </a>
  <h>
    <g>another piece of very_interesting__text</g>
  </h>
  <interesting>interesting__text</interesting>
  <interesting>very_interesting__text</interesting>
</root>
Essentially, I need to output the parent nodes of any node that contains interesting text, which can be matched using the regex \w+__\w+. As a bonus, I'd like the interesting pieces added somewhere at the end of the document.
The nodes that can contain the interesting pieces can be named anything, so dependencies on specific node names cannot be part of the solution.
I'm thinking XSLT is the way to go at this, but I'm having trouble putting a stylesheet together. Obviously, I could do this in code, but I would prefer a stylesheet since I'm using others in the script, and it would simplify things some.
Thanks in advance.
Edit: There was an error in the sample XML, and a comment asked why one tag was being transformed into a different tag. This is corrected in the above.
Here is an XSLT 2.0 suggestion:
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:param name="pattern" select="'\w+__\w+'"/>

  <xsl:output indent="yes"/>

  <!-- text nodes that contain a match for the pattern -->
  <xsl:variable name="text" select="//text()[matches(., $pattern)]"/>
  <!-- those text nodes plus all of their ancestors -->
  <xsl:variable name="nodes" select="$text/ancestor-or-self::node()"/>

  <!-- root element: copy the retained content, then append the interesting pieces -->
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates select="@* , node()[$nodes intersect .]"/>
      <xsl:apply-templates select="$text" mode="interesting"/>
    </xsl:copy>
  </xsl:template>

  <!-- any other node: copy it and descend only into retained children -->
  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="@* , node()[$nodes intersect .]"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" mode="interesting">
    <interesting><xsl:value-of select="."/></interesting>
  </xsl:template>

</xsl:stylesheet>
Using Saxon 9.5, that transforms
<root>
  <a>
    <b>
      <c>sometext</c>
    </b>
    <b>
      <d/>
    </b>
    <b>
      <e>
        <f>some interesting__text more</f>
      </e>
    </b>
  </a>
  <f>
    <g>another piece of very_interesting__text</g>
  </f>
</root>
into
<root>
  <a>
    <b>
      <e>
        <f>some interesting__text more</f>
      </e>
    </b>
  </a>
  <f>
    <g>another piece of very_interesting__text</g>
  </f>
  <interesting>some interesting__text more</interesting>
  <interesting>another piece of very_interesting__text</interesting>
</root>
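Since the question mentions that this could also be done in code, the same pruning idea can be cross-checked outside XSLT. The following is only a minimal Python sketch, not a port of the stylesheet: the names prune, has_match and filter_document are made up for illustration, the pattern is an assumed ASCII approximation of how XPath treats \w (no underscore), and text that follows a child element (mixed content) is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: an ASCII approximation of XPath's \w, which does not match "_".
PATTERN = re.compile(r"[^\W_]+__[^\W_]+")


def has_match(elem):
    """True if the element's own leading text contains an interesting piece."""
    return bool(elem.text and PATTERN.search(elem.text))


def prune(elem):
    """Copy elem, keeping only children that lead to matching text.

    Returns None when neither elem nor any descendant matches.
    Tail text (mixed content) is ignored in this sketch.
    """
    kept = [c for c in map(prune, elem) if c is not None]
    if not kept and not has_match(elem):
        return None
    copy = ET.Element(elem.tag, dict(elem.attrib))
    if has_match(elem):
        copy.text = elem.text
    copy.extend(kept)
    return copy


def filter_document(xml_text):
    """Prune the document and append one <interesting> element per match."""
    root = ET.fromstring(xml_text)
    result = prune(root)
    if result is None:
        result = ET.Element(root.tag, dict(root.attrib))  # always keep the root
    for elem in root.iter():
        if has_match(elem):
            ET.SubElement(result, "interesting").text = elem.text
    return result


# Same input as the Saxon run above, written without whitespace between tags.
SAMPLE = (
    "<root><a><b><c>sometext</c></b><b><d/></b>"
    "<b><e><f>some interesting__text more</f></e></b></a>"
    "<f><g>another piece of very_interesting__text</g></f></root>"
)

result = filter_document(SAMPLE)
print(ET.tostring(result, encoding="unicode"))
```

Like the first version of the stylesheet, this appends the full text of each matching node rather than only the matched substring.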
The sample nodes don't have any attributes; if your real XML can have them, add the template
<xsl:template match="@*">
  <xsl:copy/>
</xsl:template>
To filter the text nodes in the collection of interesting elements down to the matching substring, you can use analyze-string by changing the template for text() to
<xsl:template match="text()" mode="interesting">
  <interesting>
    <xsl:analyze-string select="." regex="{$pattern}">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </interesting>
</xsl:template>
The result then changes to
<root>
  <a>
    <b>
      <e>
        <f>some interesting__text more</f>
      </e>
    </b>
  </a>
  <f>
    <g>another piece of very_interesting__text</g>
  </f>
  <interesting>interesting__text</interesting>
  <interesting>interesting__text</interesting>
</root>
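You might need to change or adapt the pattern if the very_ substring should be extracted too. In XPath regular expressions \w does not match the underscore, which is why only interesting__text is picked up; something like [\w_]+__\w+ should be one way to include the leading part.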
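The underscore behaviour is easy to trip over when testing the pattern in another tool, because many regex flavours treat \w differently from XPath. As a rough illustration only, the Python sketch below compares Python's own \w (which includes the underscore) with an assumed ASCII approximation of the XPath behaviour; the names PYTHON_STYLE, XPATH_STYLE and pieces are hypothetical.

```python
import re

# Python's \w includes "_", so the leading part may contain underscores.
PYTHON_STYLE = re.compile(r"\w+__\w+")
# Assumed ASCII approximation of XPath's \w: word characters minus "_".
XPATH_STYLE = re.compile(r"[^\W_]+__[^\W_]+")


def pieces(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


first = "some interesting__text more"
second = "another piece of very_interesting__text"

# Both flavours agree when there is no extra underscore before the match.
print(pieces(PYTHON_STYLE, first), pieces(XPATH_STYLE, first))
# They differ on the second sample: only the Python flavour keeps "very_".
print(pieces(PYTHON_STYLE, second), pieces(XPATH_STYLE, second))
```

If the very_ prefix is wanted in the XSLT result, the pattern passed to the stylesheet has to allow underscores explicitly, since \w alone will not.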