Replace XML Entities with Elements

Stuffing XML into someone else’s system is always a treat when there is no schema or DTD. It’s even more fun when the original dev teams on both sides have moved on, leaving no documentation whatsoever. The particular system I’m working with gets a tummy ache when any character outside 7-bit ASCII goes down its pie-hole. Nobody knows why. It is simply a brute fact, as unyielding and unanswerable as the tides.

Stranger still, the XML input to the legacy system is UTF-8, but certain characters used in chemical notation (α, β, γ) must be replaced with elements containing a path to images (sic) unless these characters appear in an attribute value. A character map cannot be used because it is applied to the whole serialized document, attribute values included, and attribute values must be left unmodified. There is no good way to fix this problem, only some that are less bad than others.

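[XSLT stylesheet listing; the markup was lost, and only the image names from the lookup table survive: alpha.gif, beta.gif, gamma.gif]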

The answer I came up with uses an XML variable, $subs, to hold a mapping between the troublesome characters and their image counterparts. It then splits each matched text node into a sequence of integers with string-to-codepoints, iterates through them, and looks each one up in the table. A separate template handles attributes, forgoing the conversion.
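
For illustration, the approach described above might look roughly like the following XSLT 2.0 sketch. It is a reconstruction under assumptions, not the original stylesheet: the `sub` entries, the `char` attribute, and the `image` output element are hypothetical names, and only `$subs`, `string-to-codepoints`, and the three image file names come from the description.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Lookup table: one entry per character to replace.
       The sub element and char attribute are assumed names. -->
  <xsl:variable name="subs">
    <sub char="&#x3B1;">alpha.gif</sub> <!-- U+03B1 Greek small alpha -->
    <sub char="&#x3B2;">beta.gif</sub>  <!-- U+03B2 Greek small beta -->
    <sub char="&#x3B3;">gamma.gif</sub> <!-- U+03B3 Greek small gamma -->
  </xsl:variable>

  <!-- Identity copy for elements, comments and processing instructions. -->
  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Attributes are copied untouched, so their values keep the original characters. -->
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>

  <!-- Text nodes: walk the codepoints one at a time and consult the table. -->
  <xsl:template match="text()">
    <xsl:for-each select="string-to-codepoints(.)">
      <xsl:variable name="ch" select="codepoints-to-string(.)"/>
      <xsl:variable name="hit" select="$subs/sub[@char = $ch]"/>
      <xsl:choose>
        <xsl:when test="$hit">
          <!-- The image element name is an assumption; the post only says
               the element contains a path to an image. -->
          <image><xsl:value-of select="$hit"/></image>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$ch"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Given an input such as `<p title="α-helix">α-pinene and β-carotene</p>`, a sketch like this should leave the `title` attribute alone and turn the text into `<image>alpha.gif</image>-pinene and <image>beta.gif</image>-carotene`. Keeping the table in a variable rather than hard-coding tests in the template is what makes the per-character lookup slow, but also what keeps the list of affected characters in one place.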

The runtime performance is crap, but if you are serving out mark-up that uses images like this you deserve what you get. The principal advantage is that the lookup mapping makes it clear which characters are being munged, and gives a single point of change should the need arise.