Japanese will render correctly in UTF-8, but there are a few details to make certain of.

First, you list the encoding as UTF-8 with no BOM - that is bad. You need a BOM; it is a 3-byte mark at the very beginning of the file. I use this in my TPL to generate the BOM:

[html]
my $UTF8_BOM = "\xEF\xBB\xBF";
iwpt_output($UTF8_BOM);
[/html]

You can modify that for your needs.

Second, make certain that your browser has the full Japanese character set. From my experience, Mozilla 2.0 and IE 7 do, while IE 6 and Safari don't - that being from the base install. Find a Japanese site and see if it renders correctly.

HTH
Andy
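For illustration, here is a minimal plain-Perl sketch of the same idea, for the case where you write the XML to a file yourself rather than through iwpt_output(); the file name and content are just examples:

[html]
# Sketch: emit the 3-byte UTF-8 BOM, then the XML, in :raw mode so
# Perl does not re-encode the bytes. output.xml is a hypothetical name.
open my $fh, '>:raw', 'output.xml' or die "Cannot open output.xml: $!";
print {$fh} "\xEF\xBB\xBF";                                # UTF-8 BOM
print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n};  # XML declaration
print {$fh} "<greeting>\xE3\x81\x93\xE3\x82\x93\xE3\x81\xAB\xE3\x81\xA1\xE3\x81\xAF</greeting>\n";  # Japanese text as raw UTF-8 bytes
close $fh or die "Cannot close output.xml: $!";
[/html]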
Thanks Andy, I did try that also (from your earlier post), but the encoding of the XML file still remains 'UTF-8 without BOM'. This is what shows up in my Notepad++ editor. Is there any way to force the encoding of the actual physical XML file to UTF-8?
All that BOM/no-BOM rigmarole really depends on the requirements of your generated XML's *consumer*, for lack of a better term. For what it's worth, from my personal experience it is almost always sufficient to generate UTF-8 XML without a BOM marker (that's right, no BOM!) but with the XML encoding declaration. Like this:

[html]
<?xml version="1.0" encoding="UTF-8"?>
[/html]

Note that it's a bad idea to lie to the XML parser in this manner. If you declare the XML file as UTF-8 encoded, it had better BE UTF-8 encoded, i.e. contain only valid UTF-8 (multi-byte) characters.

Note also that the two flavors of UTF-16, LE and BE, *seem* to be a totally different story. There the correct BOM (one of the two) is almost always required.
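A minimal sketch of that approach in Perl, assuming you control the file-writing code (the file name and content are hypothetical); the :encoding(UTF-8) layer guarantees the bytes really are UTF-8, so the declaration is not lying to the parser:

[html]
use strict;
use warnings;

# Sketch: UTF-8 XML with no BOM, but with the encoding declaration.
open my $fh, '>:encoding(UTF-8)', 'no_bom.xml' or die "Cannot open no_bom.xml: $!";
print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print {$fh} "<word>\x{65E5}\x{672C}\x{8A9E}</word>\n";  # Japanese text as Perl character data
close $fh;
[/html]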
The *consumer* here is a ColdFusion (.cfm) file which parses the XML. Somehow it still treats the XML as 'UTF-8 without BOM' even if I put that XML declaration on the first line. Not sure how to make the Perl script (.ipl) force the encoding of the XML.
Look at the output in a binary editor; it should start something like this:

FF FE 0A 00 3C

which looks like this: ÿþ
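If you don't have a binary editor handy, a quick Perl sketch can show the same thing (the file name is an example):

[html]
# Sketch: print the first 8 bytes of the generated file in hex -
# enough to see whether a BOM (EF BB BF, or FF FE / FE FF) is present.
open my $fh, '<:raw', 'output.xml' or die "Cannot open output.xml: $!";
read $fh, my $head, 8;
printf '%02X ', $_ for unpack 'C*', $head;
print "\n";
close $fh;
[/html]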
Gosh... that makes it completely unparseable by the CFM. Surprisingly, if I view just the XML in a browser it renders properly, but not when it gets parsed by the CFM. Still wondering...
You need to find out from whoever sells ColdFusion. That is a standard way of putting out UTF-8 (including the BOM). If they cannot process the BOM, then you may have to encode the data before you put it in the XML.
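One sketch of that workaround: escape everything non-ASCII as XML numeric character references before it goes into the document, so the file is pure ASCII and the BOM question disappears (the sample string is hypothetical):

[html]
use strict;
use warnings;

# Sketch: replace every non-ASCII character with a numeric character
# reference, e.g. &#x65E5; - the resulting XML is 7-bit ASCII.
my $text = "\x{65E5}\x{672C}\x{8A9E}";   # Japanese sample as Perl characters
(my $ascii = $text) =~ s/([^\x00-\x7F])/sprintf '&#x%X;', ord $1/ge;
print "<word>$ascii</word>\n";           # <word>&#x65E5;&#x672C;&#x8A9E;</word>
[/html]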
Yeah... I am researching that... but going back to the original issue: is there any way to have the Perl script (.ipl) generate the XML in plain vanilla UTF-8 instead of 'UTF-8 without BOM'?
It's a pretty safe bet that your problem has nothing to do with the presence or absence of the BOM marker; it's optional. Attach your XML (do not copy/paste it into the post, attach it). Chances are, you have some invalid (non-UTF-8) symbols in there.
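One way to check for that yourself, as a sketch (the file name is hypothetical): the core Encode module will die on the first malformed byte when asked for a strict decode.

[html]
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Sketch: slurp the file as raw bytes and try a strict UTF-8 decode.
open my $fh, '<:raw', 'suspect.xml' or die "Cannot open suspect.xml: $!";
my $bytes = do { local $/; <$fh> };
close $fh;

if ( eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ) {
    print "Valid UTF-8\n";
} else {
    print "Invalid UTF-8: $@";
}
[/html]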
Please find the XML attached, saved as .txt. Please note the encoding is 'UTF-8 without BOM'.
Your UTF-8 encoding seems to be correct! I've attached a copy of your file with the UTF-8 BOM marker. Copy it AS IS and try it, see if that changes anything. Unless you are absolutely sure how exactly your editors treat Unicode and the BOM, do not edit the file.
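For reference, this is roughly how one might add the BOM programmatically rather than through an editor; a sketch with hypothetical file names:

[html]
use strict;
use warnings;

# Sketch: write a BOM-prefixed copy of an existing UTF-8 file,
# leaving the original untouched.
open my $in,  '<:raw', 'original.xml' or die "Cannot open original.xml: $!";
open my $out, '>:raw', 'with_bom.xml' or die "Cannot open with_bom.xml: $!";
print {$out} "\xEF\xBB\xBF";          # the 3 BOM bytes
local $/;                             # slurp mode
print {$out} scalar <$in>;
close $in;
close $out;
[/html]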
That didn't work either... as I mentioned earlier in the post, the CFM is not even able to parse this new XML now. Also, I now see the encoding of the file you provided as 'UTF-8' and not 'UTF-8 without BOM'. Seems more fishy on the ColdFusion side to me...
OK. Let me repeat what I've already said, one last time: your problem has nothing whatever to do with the presence or absence of the BOM marker! Something fails somewhere in your application. Why do you think it's encoding-related? Sure, if your software expects (for example) UTF-16BE or UTF-32LE or Windows-1252 or whatever, you can feed it flawless UTF-8 and it'll still fail. For all we know, your problem may not even be related to encoding.
Once I save the same XML in the editor as plain 'UTF-8' and deploy it, everything works fine.
Make a copy of the file "before". Save it as 'UTF-8' in whatever editor you are using. Compare the files "before"/"after"; do it in hexadecimal mode if needed. What's the difference?
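A small sketch for that comparison, assuming hypothetical file names; if only the BOM differs, the "after" line should simply start with EF BB BF:

[html]
use strict;
use warnings;

# Sketch: print the first 16 bytes of both files in hex, side by side.
for my $file ('before.xml', 'after.xml') {
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    read $fh, my $head, 16;
    printf "%-12s %s\n", $file, join ' ', map { sprintf '%02X', $_ } unpack 'C*', $head;
    close $fh;
}
[/html]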