BOM in XML files
System
When using XML::XPath to parse an XML file with the version of Perl that ships with TeamSite 5.5.2 on Solaris, the perl binary core dumps if the XML file contains a BOM. I don't know much about BOMs, but if I remove the first three bytes of the file, Perl does not core dump.
Does anyone know how to deal with this? I know I can strip the BOM manually before creating the XML::XPath object, but that's a hassle everywhere I need to work with XML files...
Comments
stingray
You might try opening the file as binary and passing the file handle to XML::XPath as an ioref. Not sure if that'd work, but those characters are likely what's gagging the ancient version of perl that Interwoven refuses to update.
FYI, the BOM is the Byte Order Mark. It tells readers what order the bytes of multibyte characters are in. It serves no purpose with UTF-8, where the byte order is fixed and the BOM is just the three bytes EF BB BF at the start of the file. It's only relevant with UTF-16 or UTF-32, so if you can prevent BOMs from being put in the files to begin with, I'd suggest that.
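A minimal sketch of the file-handle idea above, assuming the file is UTF-8 and the BOM is the three bytes EF BB BF. The helper name `open_skipping_bom` is made up for illustration, and whether the old parser in iw-perl honors the handle's read position is untested.

```perl
use strict;
use IO::File;

# Hypothetical helper: open $path in binary mode and leave the read
# position just past a UTF-8 BOM (bytes EF BB BF) when one is present.
sub open_skipping_bom {
    my ($path) = @_;
    my $fh = IO::File->new($path, "r") or die "Cannot open $path: $!";
    binmode($fh);

    my $head = '';
    my $got = read($fh, $head, 3);
    die "Cannot read $path: $!" unless defined $got;

    # No BOM: rewind so the parser sees the file from the start.
    seek($fh, 0, 0) unless $head eq "\xEF\xBB\xBF";
    return $fh;
}

# Usage (assumes XML::XPath is installed; 'main.xml' is a placeholder):
#   my $xp = XML::XPath->new(ioref => open_skipping_bom('main.xml'));
```

This only helps for the file you open yourself. Files pulled in as external entities are opened by the parser, so they would still carry their BOMs.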
Migrateduser
Thanks for the info. Unfortunately the XML I am creating the object from references other XML files which are the ones with the BOMs. I asked the developers to ensure no BOMs are in their data but these things seem to have a way of slipping back in. I modified the code to remove BOMs from all XML files in the directory containing the main XML so as long as they're all in one directory the script will function. I guess I could subclass XML::XPath and use that instead of adding the BOM stripping code to all scripts?
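For reference, the directory clean-up described above might look roughly like this. It is a sketch, not the actual script: the function names are hypothetical, it only looks at `*.xml` files directly inside the given directory, and it rewrites files in place without a backup.

```perl
use strict;
use IO::File;

# Hypothetical helper: rewrite $path without its leading UTF-8 BOM.
# Returns 1 if a BOM was removed, 0 if the file was left untouched.
sub strip_bom_in_place {
    my ($path) = @_;
    my $in = IO::File->new($path, "r") or die "Cannot open $path: $!";
    binmode($in);
    my $data = do { local $/; <$in> };   # slurp the whole file
    $in->close;
    $data = '' unless defined $data;

    return 0 unless substr($data, 0, 3) eq "\xEF\xBB\xBF";

    my $out = IO::File->new($path, "w") or die "Cannot rewrite $path: $!";
    binmode($out);
    $out->print(substr($data, 3));
    $out->close or die "Cannot close $path: $!";
    return 1;
}

# Strip BOMs from every *.xml file directly inside $dir (not recursive).
# Returns the number of files that were changed.
sub strip_boms_in_dir {
    my ($dir) = @_;
    opendir(DIR, $dir) or die "Cannot read directory $dir: $!";
    my @files = grep { /\.xml$/i && -f "$dir/$_" } readdir(DIR);
    closedir(DIR);

    my $count = 0;
    foreach my $name (@files) {
        $count += strip_bom_in_place("$dir/$name");
    }
    return $count;
}
```

Running `strip_boms_in_dir` on the directory that holds the main XML file before building the XML::XPath object covers the referenced files too, as long as they all live in that one directory.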
stingray
Looks like your problem might ultimately reside in expat. XML::XPath uses XML::XPath::XMLParser, which uses XML::Parser, which is a binding on top of expat. There was a known bug, which was fixed back in Feb. 2001, regarding UTF-8 BOMs causing expat to crash. The expat files in iw-perl all date from 1998 and 1999. Here's the bug report:
http://sourceforge.net/tracker/index.php?func=detail&aid=223767&group_id=10127&atid=110127
Maybe someday Interwoven will upgrade their perl version to something that actually supports UTF-8 and was released a little later than 1997.
Any possibility you can install a newer version of perl, XML::Parser, and XML::XPath? We had to do that to get decent XML/XPath/SOAP support and it seems to work (however annoying it might be).
Migrateduser
Thanks again. Unfortunately there are too many environments that I'm not responsible for so I can't install Perl.
I think Interwoven must be scared that customizations they sold to customers as consulting services might break if they upgrade Perl, which could result in much-needed class action lawsuits.
stingray
Take a look at TeamSite::XMLParser. They've already had to jump through considerable hoops to get UTF-8 support on a version of perl that doesn't actually support it. They seem to be so busy trying to add buzzword-compliant features that they probably don't want to spend the time to undo the ugly hacks they added to make UTF-8 work.
If you want to be really daring, you could subclass XML::Parser and replace its file_ext_ent_handler method. That method creates an IO::File object for external entities. All you'd have to do is replicate the method, but seek beyond the first three bytes of the file (if the BOM exists) before returning the filehandle.
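A rough sketch of what such a handler could look like. Rather than subclassing, it is written as a standalone function using the `ExternEnt` handler signature that XML::Parser documents (expat object, base, system ID, public ID). The name `bom_skipping_ext_ent` and the simplified path resolution are assumptions, it is not a copy of the real file_ext_ent_handler, and getting XML::XPath::XMLParser to use it depends on the installed version and is not shown.

```perl
use strict;
use IO::File;

# Hypothetical ExternEnt handler. Returns an open filehandle positioned
# past a UTF-8 BOM if there is one, or undef if the file cannot be opened
# (XML::Parser treats undef as "entity not found").
sub bom_skipping_ext_ent {
    my ($expat, $base, $sysid, $pubid) = @_;

    my $path = $sysid;
    # Resolve a relative system ID against the directory part of $base.
    if (defined $base && $path !~ m{^(?:[\\/]|\w+:)}) {
        (my $dir = $base) =~ s{[^\\/]*$}{};
        $path = $dir . $path;
    }

    my $fh = IO::File->new($path, "r") or return undef;
    binmode($fh);

    # Peek at the first three bytes; rewind unless they are EF BB BF.
    my $head = '';
    read($fh, $head, 3);
    seek($fh, 0, 0) unless $head eq "\xEF\xBB\xBF";
    return $fh;
}

# Usage with a plain XML::Parser (assumes the module is installed):
#   my $parser = XML::Parser->new(
#       Handlers => { ExternEnt => \&bom_skipping_ext_ent },
#   );
```

Unlike the built-in handler, this sketch does not update the parser's base for nested entities, so entities that themselves reference further relative paths would need more work.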