Templating default encoding

Andy, did you try the meta http-equiv tag to set the browser's encoding to UTF-8?

When you open your HTML page in the browser and change the encoding to UTF-8, does it still display garbage characters?

OK, so I am really confused here.

We are translating some pages, and the HTML everywhere says UTF-8. These are all generated from templates, on Windows 2003, TS 6.7.1.

When the 3rd party sent the DCRs, they had the funky characters (ü) rather than the encoded &#1234;

No big deal, the page should support that.

So I generate the page and I get garbage. I open the HTML in Notepad, save it explicitly as UTF-8, and it works. The funny thing is that, according to iwpt_encoding.ipl, UTF-8 is the only option.

This is the same for regen, preview or workflow (iwpt_compile). Nowhere do I override the default.

Any ideas?

I have no freaking idea what is going on.

Andy

I get it now. The DCRs you have got use some other encoding for their content. When you generate, your page is generated as UTF-8, and that's the cause of the garbage characters.

Find the encoding of the DCRs, and then use the meta http tag in your PT to set the encoding to that particular encoding. Your generate should work fine then.

You can test this by simply previewing the generated HTML in your browser and then changing the encoding in your browser.

The DCRs are all set to UTF-8

The TPL generates code for UTF-8

My browser is set to display UTF-8

I get garbage

If I run my text through:
use HTML::Entities;
$newText = encode_entities($node->value('Text'));

it works. I should not have to do that.

Finally, if I open the generated page in notepad and save as, explicitly specifying UTF-8 it works fine.

I changed my iwpt_compile to add -oenc UTF-8, but that did not help.

Obviously I can open/save every file as UTF-8. I should not need to do that.

I am pulling my hair out.

OK, so I am getting closer.

I have decided it is a BOM (byte order mark) issue.

I generated a new file, submitted it, opened and saved it via Notepad, and brought it up in UltraCompare hex mode.

There are 3 bytes at the beginning of the file: EF BB BF

I opened/saved the TPL in Notepad, and then it worked.
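
For anyone who wants to check for the signature without a hex editor, here is a minimal Perl sketch that reports whether a file starts with the UTF-8 BOM bytes EF BB BF (the same three bytes as the "\xEF\xBB\xBF" string used elsewhere in this thread). It is plain command-line Perl, not TeamSite code, and the names has_utf8_bom, file_has_utf8_bom and check_bom.pl are made up for illustration.

```perl
use strict;
use warnings;

# Return 1 if the given byte string begins with the UTF-8 BOM (EF BB BF).
sub has_utf8_bom {
    my ($bytes) = @_;
    return substr($bytes, 0, 3) eq "\xEF\xBB\xBF" ? 1 : 0;
}

# Read the first three bytes of a file in raw mode and test them.
sub file_has_utf8_bom {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $count = read($fh, my $head, 3);
    close($fh);
    return 0 unless defined $count && $count == 3;
    return has_utf8_bom($head);
}

# Hypothetical usage: perl check_bom.pl main.tpl generated.html
for my $path (@ARGV) {
    printf "%s: %s\n", $path,
        file_has_utf8_bom($path) ? 'BOM present' : 'no BOM';
}
```

Running it against a TPL before and after the Notepad save should show whether the save is what added the signature.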

Andy, I ran into a similar issue, and I am sure I am on the right track here.

1. Your new DCRs use some other encoding for their content, say ISO something.
2. You bring these DCRs to your UTF-8 system and try generating from them.
3. When you generate from these DCRs, the generated HTML is either written as UTF-8, or forces your browser to use UTF-8, or both.
4. The characters from the other encoding come out as garbage.

Try this now:

1. Find the encoding of your new DCRs. Answer this question: what encoding do they use currently?
2. In your PT, set the meta http tag to force your browser to use this encoding and not UTF-8.
3. Generate using this PT now.

If the only encoding you want to use is UTF-8, then you may have to keep doing what you are doing currently, saving the DCRs in UTF-8 format, or you may search for a more elegant solution to do the same.
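
One rough way to approach the "what encoding do they use" question is to test whether a DCR's raw bytes decode as strict UTF-8. The sketch below uses the core Encode module and runs outside TeamSite; is_valid_utf8 and slurp_raw are hypothetical helper names. A pass only shows that the bytes are well-formed UTF-8 (pure ASCII passes too), and a failure does not say which legacy encoding the file actually uses.

```perl
use strict;
use warnings;
use Encode ();

# Return 1 if the byte string is well-formed (strict) UTF-8, else 0.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode() die on the first malformed sequence;
        # LEAVE_SRC keeps decode() from modifying our copy of the input.
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# Read a whole file as raw bytes, with no decoding layer applied.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;    # slurp mode
    my $bytes = <$fh>;
    close($fh);
    return defined $bytes ? $bytes : '';
}

# Report on every file named on the command line.
for my $path (@ARGV) {
    printf "%s: %s\n", $path,
        is_valid_utf8(slurp_raw($path))
            ? 'well-formed UTF-8'
            : 'NOT UTF-8 (probably a legacy encoding)';
}
```

For example, ü stored as ISO-8859-1 is the single byte FC, which is not valid UTF-8 on its own, while the UTF-8 form is the two bytes C3 BC.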

The implementation I'm currently working on has dozens of language requirements, so I also wrestled with similar issues. Playing with TeamSite and HTML settings proved useless... no matter what I tried to set the encoding to, nothing worked.

The key is for the TPL to be saved with the proper encoding and a signature (also known as the byte order mark -- BOM). Many applications use this byte order mark to recognize the file as UTF-8 and interpret encoded characters properly (go to unicode.org for more info on the BOM). The BOM is added by Notepad if you save the TPL as UTF-8, or by Visual Studio if you save the file with signature.

Give it a try and let me know if it works. We saved all of our DCTs and TPLs this way, and have been running French, Spanish, Arabic, Chinese, Japanese, etc. sites ever since with no problems whatsoever. We can bring in existing data with extended characters, encoded characters, etc... and all characters are displayed properly during preview and after page generation...

I'm not sure if our problems are exactly the same, but it did solve most of our issues.

G'luck.

Yup that was what I found as well. I have not tried editing a TPL with say UltraEdit (my editor of choice) after I put the BOM in place.

Please don't tell me I have to use notepad for the rest of my life.

UltraEdit gives you the BOM option, no need for Notepad.

The funny thing is they call it a byte order mark, but here it simply identifies the data as UTF-8. UTF-8 does NOT have byte order issues like other encodings (UTF-16, for example), so for UTF-8 it is not "actually" marking a byte order...

On a side note, be wary if you use multiple TPLs (includes) to generate a page. The BOM has to be the first thing in the output in order for the encoded data to be interpreted properly. If you add the BOM to includes, it ends up in the middle of the page, where it is treated as a character and gives you unwanted spacing... so only apply it to the "main" TPL.
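
To track down a stray BOM in an include or in a generated page, something like the following Perl sketch may help. It lists the byte offset of every EF BB BF sequence in a file and offers a helper that strips BOMs from the start of a byte string. It is ordinary command-line Perl, not part of TeamSite, and the names bom_offsets and strip_leading_bom are hypothetical.

```perl
use strict;
use warnings;

# Return the byte offsets of every UTF-8 BOM sequence in a byte string.
sub bom_offsets {
    my ($bytes) = @_;
    my @offsets;
    my $pos = 0;
    while (($pos = index($bytes, "\xEF\xBB\xBF", $pos)) >= 0) {
        push @offsets, $pos;
        $pos += 3;    # skip past the three BOM bytes just found
    }
    return @offsets;
}

# Remove one or more BOMs from the very start of a byte string.
# BOMs elsewhere in the string are left alone.
sub strip_leading_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A(?:\xEF\xBB\xBF)+//;
    return $bytes;
}

# Report BOM positions for every file named on the command line.
for my $path (@ARGV) {
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close($fh);
    $bytes = '' unless defined $bytes;
    my @at = bom_offsets($bytes);
    print "$path: BOM at byte offset(s) @at\n" if @at;
}
```

An offset of 0 is the expected signature at the top of the file; any other offset points at a BOM that probably came in from an include.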

That is strange. When I tried saving with UltraEdit and TextEdit, even specifying UTF-8, it did not work. I had to use Notepad.

I also do use includes (as well as iw_ostream) so I will need to test this pretty thoroughly.

I certainly appreciate the help

Andy

Let me know if it worked... I'm pretty sure the implementation I'm working on has more language requirements than most (closing in on 50 now... ugh!), and I've heard so many encoding nightmares from other developers that could have simply been solved using proper encoding rather than complex scripting/code...

Good finding, LMX and Andy!

Looks like it is working. I put the BOM in all my TPLs, but I am lucky in that my main TPL is a stub which references the 2nd, so there are 2 BOMs on the 1st line. I can't tell.

I am curious, my generated file now includes the
And it also looks like UE will maintain the BOM, so I can edit my TPL with UE; I just have to "save as" once with Notepad.

Andy

Sweet. Glad to hear things worked out. I had 2 BOMs in a row at the beginning of my document, and it caused unwanted spacing.

The only "language" and character problem I have yet to resolve is how to get Arabic data managed within TinyMCE (right justified, and it has to be edited/read from right to left).

ugh...

Yea, I was seeing that too. I took your advice and only saved the primary TPL as UTF-8, and then also removed the 1st <iw_pt, and it is working better. I still have one page with a funky line from the BOM (I think) but cannot find it.

ISCBorisB

I wonder if your problems are somehow related to 6.7. Do you use true multi-byte symbols within your PT code itself?
Do you use <iwov_xslt>?

If both answers are "No", you should be able to do what you want without messing up the presentation template code.
(I *think* it should be possible even if both answers are "Yes", though not without extra coding.)
In my 6.5 I do it from English to Kanji and anything in between (Russian, Arabic, you name it) with plain vanilla ASCII PTs.
Just like you, we also use translation services that return DCRs with true UTF-8 multi-byte symbols.

If all you need is a BOM in the output, then:

my $UTF8_BOM = "\xEF\xBB\xBF"; # Simpler "\x{EFBBBF}" may cause stray "Wide character in print" warnings, do not use it
iwpt_output($UTF8_BOM);

Beware! That must be the very beginning of the output file. If you do not do it all in Perl, you will have to play with the first line of your PT: no \n's before it, etc.
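
As a sanity check on the byte string above, this small standalone Perl sketch (not TeamSite code) compares the three-byte literal with the UTF-8 encoding of U+FEFF and shows why the "\x{EFBBBF}" form is a different thing: it is one character with a very large code point, not three bytes. The helper prepend_bom_once is a hypothetical name, added only to illustrate guarding against two BOMs in a row.

```perl
use strict;
use warnings;
use Encode ();

# The BOM is the character U+FEFF; encoded as UTF-8 it becomes three bytes.
my $bom_from_feff = Encode::encode('UTF-8', "\x{FEFF}");
my $bom_literal   = "\xEF\xBB\xBF";    # three one-byte characters
my $wide_char     = "\x{EFBBBF}";      # ONE character, code point 0xEFBBBF

# Prepend the BOM only when the byte string does not already start with
# one, so the output never begins with two BOMs in a row.
sub prepend_bom_once {
    my ($bytes) = @_;
    return $bytes if substr($bytes, 0, 3) eq "\xEF\xBB\xBF";
    return "\xEF\xBB\xBF" . $bytes;
}

printf "encode(U+FEFF) matches the literal: %s\n",
    ($bom_from_feff eq $bom_literal ? 'yes' : 'no');
printf "length of the byte literal: %d\n", length($bom_literal);
printf "length of \\x{EFBBBF}: %d\n",      length($wide_char);
```

The lengths differ (3 versus 1), which is consistent with the warning in the comment above: printing a character above 0xFF to a handle without an encoding layer is what triggers "Wide character in print".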