{"id":446,"date":"2015-01-16T22:44:47","date_gmt":"2015-01-17T03:44:47","guid":{"rendered":"http:\/\/pmcgovern.ca\/wp\/?p=446"},"modified":"2021-12-12T12:43:21","modified_gmt":"2021-12-12T17:43:21","slug":"replace-xml-entities-with-elements","status":"publish","type":"post","link":"https:\/\/pmcgovern.ca\/wp\/?p=446","title":{"rendered":"Replace XML Entities with Elements"},"content":{"rendered":"<p>Stuffing XML into someone else&#8217;s system is always a treat when there is no schema or DTD. It&#8217;s even more fun when the original dev teams on both sides have moved on, leaving no documentation whatever. The particular system I&#8217;m working with gets a tummy ache when any character outside 7-bit ASCII goes down its pie-hole. Nobody knows why. It is simply a brute fact, as unyielding and unanswerable as the tides.<\/p>\n<p>Stranger still, the XML input to the legacy system is UTF-8, but certain characters used in chemical notation ( \u03b1, \u03b2, \u03b3 ) must be replaced with elements containing a path to images (<em>sic<\/em>) unless these characters appear in an attribute value. A <strong>character-map<\/strong> cannot be used because it applies to the whole serialized document, attribute values included, and attribute values must be left unmodified. 
There is no good way to fix this problem, only some that are less bad than others.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"xml\"><?xml version=\"1.0\" encoding=\"utf-8\"?>\n<xsl:stylesheet version=\"2.0\" xmlns:xsl=\"http:\/\/www.w3.org\/1999\/XSL\/Transform\">\n\n  <xsl:output method=\"xml\" indent=\"yes\" encoding=\"UTF-8\"\/>\n\n  <xsl:variable name=\"subs\">\n    <map>\n      <entry key=\"\u03b1\">alpha.gif<\/entry>\n      <entry key=\"\u03b2\">beta.gif<\/entry>\n      <entry key=\"\u03b3\">gamma.gif<\/entry>\n    <\/map>\n  <\/xsl:variable>\n\n  <xsl:template match=\"*\">\n    <xsl:element name=\"{name()}\">\n      <xsl:apply-templates select=\"@*\"\/>\n      <xsl:apply-templates\/>\n    <\/xsl:element>\n  <\/xsl:template>\n\n  <xsl:template match=\"@*\">\n    <xsl:attribute name=\"{name()}\">\n      <xsl:value-of select=\".\"\/>\n    <\/xsl:attribute>\n  <\/xsl:template>\n\n  <xsl:template match=\"text()\">\n    <xsl:for-each select=\"string-to-codepoints( . )\">\n      <xsl:variable name=\"chr\" select=\".\"\/>\n      <xsl:choose>\n        <xsl:when test=\"$subs\/\/entry[@key=codepoints-to-string($chr)]\">\n          <xsl:element name=\"symbol\">\n            <xsl:attribute name=\"src\" select=\"$subs\/\/entry[@key=codepoints-to-string($chr)][1]\"\/>\n          <\/xsl:element>\n        <\/xsl:when>\n        <xsl:otherwise>\n          <xsl:value-of select=\"codepoints-to-string($chr)\"\/>\n        <\/xsl:otherwise>\n      <\/xsl:choose>\n    <\/xsl:for-each>\n  <\/xsl:template>\n\n<\/xsl:stylesheet>\n<\/pre>\n<p>The answer I came up with uses an XML variable, <strong>$subs<\/strong>, to hold a mapping between the troublesome characters and their image counterparts. 
It then splits matched text into a sequence of integer codepoints with <strong>string-to-codepoints<\/strong>, iterates through them, and looks each one up in the table. A separate template handles attributes, forgoing the conversion.<\/p>\n<p>The runtime performance is crap, but if you are serving out mark-up that uses images like this you deserve what you get. The principal advantage is that the lookup mapping makes it clear which characters are being munged, and it gives a single point of change should the need arise.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Stuffing XML into someone else&#8217;s system is always a treat when there is no schema or DTD. It&#8217;s even more fun when the original dev teams on both sides have moved on, leaving no documentation whatever. The particular system I&#8217;m&#8230;<\/p>\n","protected":false},"author":1,"featured_media":394,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[10],"class_list":["post-446","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-programming","tag-xsl"],"_links":{"self":[{"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/posts\/446","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=446"}],"version-history":[{"count":8,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/posts\/446\/revisions"}],"predecessor-version":[{"id":883,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/posts\/446\/revisions\/883"}],"wp:featuredmedia":[{"embeddable":
true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/media\/394"}],"wp:attachment":[{"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=446"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=446"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=446"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}