Simple XML datasource failing with & character in XML file

Hi,
 
I have an xml file as a simple xml datasource. Within it is a line like ...
 
<div>
<pre class="_prettyXprint _lang-xml">
<lineitem><item>AB Allen & Bradley</item></lineitem>
</pre>
</div>
When I try to set up some row mappings with the designer for my dataset, it fails, saying the source file is invalid.
 
If I change the & character to 'and', for example, or remove it completely, everything works OK.
 
How can I handle these characters in the xml file successfully? Is there some way to escape them or something?
 
Thanks,

Comments

warwick.baker@gmail.com

I guess most people will tell me to wrap the content of the tag in <![CDATA[AB Allen & Bradley]]>. But what if I don't have control of the XML because I just pull it off a feed?
 
As a follow-up to this initial problem, could anyone tell me if there is a way of preserving the whitespace inside XML elements as is?
 
Thanks.

Clement Wong

Per the XML spec (http://www.w3.org/TR/REC-xml/#syntax):
"The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section."
 
 
Since the incoming data is not well-formed, you should read the stream in and then "fix" it by adding a CDATA section where needed.
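If you would rather escape than wrap, the spec text above also allows replacing each bare & with &amp;. Here is a minimal sketch of that idea in plain JavaScript. The helper name escapeBareAmpersands is hypothetical (it is not a BIRT or E4X function), and it assumes the feed contains no CDATA sections already, since an & inside an existing CDATA section would be escaped too.

```javascript
// Hypothetical helper, not part of BIRT: escape only the ampersands that do
// not already start an entity reference or a character reference.
// Assumption: the input has no existing CDATA sections.
function escapeBareAmpersands(xmlText) {
  // The negative lookahead skips &name;, &#123; and &#x1F; style references,
  // so text that is already escaped is not escaped a second time.
  return xmlText.replace(/&(?![A-Za-z_][\w.-]*;|#\d+;|#x[0-9A-Fa-f]+;)/g, "&amp;");
}

var raw = "<lineitem><item>AB Allen & Bradley</item><item>A &amp; B</item></lineitem>";
var fixed = escapeBareAmpersands(raw);
// The bare & in the first item becomes &amp;, the second item is left alone.
```

The result can then be handed to whatever parser you use. Note that a name-like sequence such as "AT&T;" would be treated as an entity reference and left untouched, so check the output if your data can contain that pattern.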
 
Check out the DevShare I recently wrote about parsing XML from an RSS feed with E4X: http://developer.actuate.com/community/forum/index.php?/files/file/1122-parsing-xml-in-birt-with-e4x/
 
After the report design has read in the XML feed, you can use a regex search and replace to add the CDATA tags in. This will also maintain your whitespace requirement.
 
This code snippet takes a simple XML string like your example and "fixes" it. The magic is in this line; everything else is just variable initialization and debug statements:
<pre class="_prettyXprint">
myXML2 = myXML2.replace(/\<item\>/g, "<item><![CDATA[").replace(/\<\/item\>/g, "]]></item>");</pre>
Demo:
<pre class="_prettyXprint">
//The logger only works in commercial BIRT and will show the output to Eclipse's Error Log in the IDE
logger = java.util.logging.Logger.getLogger("birt.report.logger");

myXML = "<feed><lineitem><item><![CDATA[AB Allen & Bradley]]></item><item><![CDATA[CW Clement & Wong]]></item></lineitem></feed>";
myXML2 = "<feed><lineitem><item>AB Allen & Bradley</item><item>CW Clement & Wong</item></lineitem></feed>";

//& is an invalid character without CDATA
//
//If you don't search/replace, you will get the following error:
// TypeError: The entity name must immediately follow the '&' in the entity reference. (/report/method[@name="beforeFactory"]#14)
//
myXML2 = myXML2.replace(/\<item\>/g, "<item><![CDATA[").replace(/\<\/item\>/g, "]]></item>");

logger.warning (myXML); // Show the XML stream that was well-formed to begin with
logger.warning (myXML2); // Show the repaired XML stream (CDATA added by the replace above)

rss = new XML(myXML); // Convert to XML literal using E4X
rss2 = new XML (myXML2); // Convert to XML literal using E4X

totalItems = rss.lineitem.item.length(); // Easy E4X dot notation
totalItems2 = rss2.lineitem.item.length();

logger.warning (totalItems); // Shows 2 <item>
logger.warning (totalItems2);

logger.warning (rss2.lineitem.item[0]); // Shows each <item> element
logger.warning (rss2.lineitem.item[1]);
</pre>
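One caveat with the plain search and replace: it would double-wrap an item that already has a CDATA section, and an item whose text contains "]]>" would end the section early. Below is a slightly more defensive sketch in plain JavaScript (no E4X needed). The function name wrapItemsInCdata is made up for illustration, and it assumes <item> elements have no attributes and no child elements, as in the sample data.

```javascript
// Hypothetical helper: wrap the text of every <item> element in a CDATA section.
// Assumption: <item> has no attributes and contains only text.
function wrapItemsInCdata(xmlText) {
  return xmlText.replace(/<item>([\s\S]*?)<\/item>/g, function (match, body) {
    if (body.indexOf("<![CDATA[") === 0) {
      return match; // already wrapped, leave it untouched
    }
    // "]]>" would close the CDATA section early, so split it across two sections.
    var safe = body.split("]]>").join("]]]]><![CDATA[>");
    return "<item><![CDATA[" + safe + "]]></item>";
  });
}

var feed = "<feed><item>AB Allen & Bradley</item>" +
           "<item><![CDATA[CW Clement & Wong]]></item>" +
           "<item> 17 Brand Codes Listed.</item></feed>";
var repaired = wrapItemsInCdata(feed);
```

Because this works on the raw string, leading and trailing whitespace inside each item is carried over unchanged. Whether it survives later depends on the parser and report settings.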

warwick.baker@gmail.com

Thanks Clement, this is great help, I'll give it a try and see how I go.
 
Regards,
Warwick Baker.

warwick.baker@gmail.com

Hello Clement,
 
I tried wrapping things in a bit of Java code, as you suggested. For example, after manipulating the input XML feed I ended up with ...
 
<div>
<pre class="_prettyXprint">
<reportbody>
<lineitem><item><![CDATA[Brand Description]]></item></lineitem>
<lineitem><item><![CDATA[

]]></item></lineitem>
<lineitem><item><![CDATA[AB Pickle & Smitherns]]></item></lineitem>
<lineitem><item><![CDATA[HANS Hansen]]></item></lineitem>
<lineitem><item><![CDATA[ 17 Brand Codes Listed.]]></item></lineitem>
</reportbody>
</pre>
However, when I feed this into the report engine, the & is handled OK, but the whitespace at the start of the last line is not preserved in the generated report. Am I missing something?
 
Thanks,
Warwick Baker.
</div>

Clement Wong

Two things are needed to preserve the leading whitespace:
 
E4X setting -- add this before rss = new XML(myXML);
<pre class="_prettyXprint">
XML.ignoreWhitespace = false;</pre>
BIRT setting for the text report item:
 
Properties > Format String > Custom > Custom settings > Preserve white spaces
 
 
See attached for an example.

warwick.baker@gmail.com

Thanks Clement, I'm staying away from the E4X stuff and doing the tweaking manually in Java code before passing it to the BIRT engine.
 
I'll keep mucking about with it ...