Help with weird Windows-1252 character

Hi Folks,

I'm writing a routine that automatically replaces any "known" Windows-1252 characters with an equivalent HTML-encoded character (using a mapping I have specified myself). I thought I had it nailed until I parsed all of our existing HTML pages (thousands, spanning 10 years of development). Then I came across a weird phenomenon: a character that "seems" to be an e acute, but I don't really know what it is!

On our Sun box, cat shows it in a PuTTY terminal as:

Communiqué



If I "vi" it, it shows up like this:

Communiqu\303\251



In my Windows text editor (TextPad, file encoding set to UTF-8), it shows up like this:

Communiqué



And if I run it through a Perl script using Devel::Peek, I get this information for "two" characters:

SV = PVIV(0x238efc) at 0x18ebab0
REFCNT = 2
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 195
PV = 0x1909694 "195"\0
CUR = 3
LEN = 4
SV = PVIV(0x238f0c) at 0x18ebabc
REFCNT = 2
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 169
PV = 0x190a7b4 "169"\0
CUR = 3
LEN = 4

That's it! Two characters for what I thought was one Windows-1252 e acute.

(Oh, and I also wrote a basic C program to count the characters, and it counts the final e acute as two characters as well.)

I admit my character-encoding knowledge is rudimentary, but I don't understand this at all. Is it possible for one character to be represented by two characters? What am I missing?

Any pointers are gratefully appreciated.
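
The three views above can be reproduced in a few lines. This is only a sketch, written in Python for illustration (the post itself used Perl and C), and the variable names are made up for the example. It assumes the file really ends in the two bytes that vi prints as \303\251, shows them in octal, decimal and hex, and then decodes the same bytes two different ways.

```python
# The two trailing bytes, written with the same octal escapes vi displays.
raw = b"Communiqu\303\251"

last_two = raw[-2:]
octal = [oct(b) for b in last_two]    # the vi view
decimal = list(last_two)              # the values in the Perl dump (IV = 195, IV = 169)
hexa = [hex(b) for b in last_two]

# Decoded as Windows-1252, each byte becomes its own character (the PuTTY view).
as_cp1252 = raw.decode("cp1252")

# Decoded as UTF-8, the two bytes together form one character (the TextPad view).
as_utf8 = raw.decode("utf-8")

print(octal, decimal, hexa)
print(len(raw), len(as_cp1252), len(as_utf8))
print(hex(ord(as_utf8[-1])))  # code point of the single decoded character
```

Anything that counts bytes, such as a simple C loop over the file, sees the 11 bytes in `raw`; only a UTF-8 decoder sees 10 characters.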

Comments

  • Maybe this is OS specific? From your post, it sounds like all your Unix utilities (cat, vi, etc.) show two characters, while your Windows program (TextPad) shows one. Perhaps you could try other Windows programs, such as MS Word, UltraEdit, Notepad (gasp!) or WordPad, to see whether they also show one character?
  • What you're missing is that you actually UTF-8 encoded this particular character when you did your find/replace. The character is indeed the e acute, but it is stored as the UTF-8 byte sequence 0xC3 0xA9. Viewed in a program that uses ISO-8859-1 or Windows-1252, those two bytes come across as two characters (the two that you list). If you change your PuTTY session to use UTF-8, you'll see that the character does come across as the e acute.

    So you can change the content type of your HTML/XML pages to UTF-8, change the e acute to its proper ISO-8859-1 encoding (the single byte 0xE9), or change it to an HTML entity:

    &eacute;, &#233; or &#xe9;
  • Aha... I have just learned about UTF-8 multibyte characters! That's what's going on here.

    This document was really helpful:

    http://perldoc.perl.org/perluniintro.html

    However, it is early days for me in terms of dealing with the dreaded Windows-1252 characters. I haven't got it pinned down yet, but I'm working on it.

    Cheers, Robbo
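
To tie the thread together, here is a minimal sketch of the kind of replacement routine described in the original post, written in Python only for illustration. The function name `to_html_entities` and the "try UTF-8 first, fall back to Windows-1252" rule are assumptions of this example, not something the thread specifies. The fallback is a heuristic: a lone Windows-1252 byte such as 0xE9 is not valid UTF-8 by itself, so strict UTF-8 decoding fails and the legacy decoding is used instead.

```python
def to_html_entities(raw: bytes) -> str:
    """Decode bytes of uncertain encoding and return pure-ASCII HTML text."""
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        text = raw.decode("cp1252")
    # xmlcharrefreplace turns every non-ASCII character into a &#NNN; reference.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


utf8_bytes = b"Communiqu\xc3\xa9"   # e acute stored as UTF-8 (two bytes)
cp1252_bytes = b"Communiqu\xe9"     # e acute stored as Windows-1252 (one byte)

print(to_html_entities(utf8_bytes))    # Communiqu&#233;
print(to_html_entities(cp1252_bytes))  # Communiqu&#233;
```

Both inputs end up as the same numeric entity, which is one of the three fixes suggested above. Two caveats: a page that mixes both encodings would need to be handled piece by piece rather than as a whole, and Python's cp1252 codec raises an error on the five byte values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).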