Help with weird Windows-1252 character

Hi Folks,

I'm in the process of writing a routine that will automatically replace any "known" Windows-1252 characters with an equivalent HTML-encoded character (as I have specified myself). I thought I had it nailed, until I parsed all of our existing HTML pages (thousands, spanning 10 years of development). Then I came across this weird phenomenon where I have this character (it "seems" like it is an e acute, but I don't really know what it is!).

On our Sun box, it shows up in a putty terminal as (using cat):

CommuniquÃ©

If I "vi" it, it shows up like this:

Communiqu\303\251

In my Windows text editor (TextPad, file encoding is UTF-8), it shows up like this:

Communiqué

And if I run it through a Perl script using Devel::Peek, I get this information for "two" characters:

SV = PVIV(0x238efc) at 0x18ebab0
REFCNT = 2
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 195
PV = 0x1909694 "195"\0
CUR = 3
LEN = 4
SV = PVIV(0x238f0c) at 0x18ebabc
REFCNT = 2
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 169
PV = 0x190a7b4 "169"\0
CUR = 3
LEN = 4

That's it! Two characters for what I thought was one Windows-1252 e acute.

(Oh...and I wrote a basic C program to count the characters also, and it counts the last e acute as two characters also).

Now, I admit that my character encoding knowledge is rudimentary, but I don't understand this at all. Is it possible for one character to be represented by two characters? What am I missing?

Any pointers are gratefully appreciated.

Comments

Migrateduser

Maybe this is OS-specific? From your post, it sounds like all your Unix utilities (cat, vi, etc.) show two characters, while your Windows program (TextPad) shows one. Perhaps you can try other Windows programs, such as MS Word, UltraEdit, Notepad (gasp!) or WordPad, to see whether they also show one character?

Jamik

What you're missing is that this particular character is stored in UTF-8, possibly as a side effect of your find/replace. It is indeed an e acute, but UTF-8 encodes it as the two bytes 0xC3 0xA9 (decimal 195 and 169, octal \303 \251). A program that treats the file as ISO-8859-1 or Windows-1252 displays each byte as its own character, which gives the two you list (Ã and ©). If you change your PuTTY session to use UTF-8, you'll see the character come across as a single e acute.
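To see this outside of PuTTY and vi, here is a minimal Python sketch (Python is my choice for illustration; the thread itself uses Perl and C). It rebuilds the word from the two byte values Devel::Peek reported and decodes it both ways. The variable names are made up for the example.

```python
# Byte values 195 and 169 are the ones reported by Devel::Peek;
# vi shows the same two bytes in octal as \303\251.
raw = b"Communiqu" + bytes([195, 169])

as_utf8 = raw.decode("utf-8")      # one character: e acute (U+00E9)
as_cp1252 = raw.decode("cp1252")   # two characters: A tilde + copyright sign

# ascii() escapes non-ASCII characters, so the output is terminal-independent.
print(ascii(as_utf8))              # 'Communiqu\xe9'
print(ascii(as_cp1252))            # 'Communiqu\xc3\xa9'
print(len(raw), len(as_utf8), len(as_cp1252))   # 11 10 11
print(oct(195), oct(169))          # 0o303 0o251
```

The byte count (11) is what a byte-oriented C program would report, which is why it counted the e acute twice.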

So you can either change the charset declared in the content-type of your HTML/XML pages to UTF-8, or re-encode the e acute as the proper ISO-8859-1 byte (0xE9), or change it to an HTML entity:

&eacute; &#233; or &#xe9;
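The second and third options can be sketched in Python as follows, assuming the file really does contain UTF-8 (the first option is only a header change, so it is not shown). The variable names are illustrative, and this is not the routine from the original post.

```python
raw = b"Communiqu\xc3\xa9"          # UTF-8 bytes as stored in the file
text = raw.decode("utf-8")          # decode first, so we work on characters

# Option 2: re-encode as ISO-8859-1, where e acute is the single byte 0xE9.
latin1 = text.encode("iso-8859-1")

# Option 3: replace every non-ASCII character with a numeric HTML entity.
entities = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(latin1)      # b'Communiqu\xe9'
print(entities)    # Communiqu&#233;
```

Note that `xmlcharrefreplace` always produces decimal numeric entities, so named entities such as &eacute; would need a lookup table of your own.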

brizrobbo

Aha... I have just learned about UTF-8 multibyte characters! That's what's going on here.

This document was really helpful:

http://perldoc.perl.org/perluniintro.html

However, it is early days for me in terms of dealing with the dreaded Windows-1252 characters. I still haven't got it pinned down yet, but am working on it.

Cheers, Robbo