Help with weird Windows-1252 character
Hi Folks,
I'm in the process of writing a routine that will automatically replace any "known" Windows-1252 characters with an equivalent HTML-encoded character (as I have specified myself). I thought I had it nailed, until I parsed all of our existing HTML pages (thousands, spanning 10 years of development). Then I came across this weird phenomenon where I have this character (it "seems" like it is an e acute, but I don't really know what it is!).
On our Sun box, it shows up in a putty terminal as (using cat):
Communiqué
If I "vi" it, it shows up like this:
Communiqu\303\251
In my Windows text editor (TextPad, file encoding is UTF-8), it shows up like this:
Communiqué
And if I run it through a Perl script using Devel::Peek, I get this information for "two" characters:
SV = PVIV(0x238efc) at 0x18ebab0
REFCNT = 2
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 195
PV = 0x1909694 "195"\0
CUR = 3
LEN = 4
SV = PVIV(0x238f0c) at 0x18ebabc
REFCNT = 2
FLAGS = (IOK,POK,pIOK,pPOK)
IV = 169
PV = 0x190a7b4 "169"\0
CUR = 3
LEN = 4
That's it! Two characters for what I thought was one Windows-1252 e acute.
(Oh... and I also wrote a basic C program to count the characters, and it counts the final e acute as two characters as well.)
I admit that my character encoding knowledge is rudimentary, but I'm not understanding this at all. Is it possible for one character to be represented by two characters? What am I missing?
Any pointers are gratefully appreciated.
Comments
-
Maybe this is OS specific? From your post, it sounds like all your Unix utilities (cat, vi, etc.) show two characters, while your Windows program (TextPad) shows it as one character. Perhaps you could try other Windows programs, such as MS Word, UltraEdit, Notepad (gasp!), or WordPad, to see whether you get one character there too?
-
What you're missing is that you've actually produced a UTF-8 encoding of this particular character when you did your find/replace. The character is indeed the e acute, but it is stored as the two UTF-8 bytes 0xC3 0xA9 (decimal 195 and 169, octal 303 and 251, which is what vi and your Perl dump are showing). If you view those bytes in a program that assumes ISO-8859-1 or Windows-1252, each byte is displayed as its own character, so you get the two characters you list. If you change your PuTTY session to use UTF-8, you'll see that the character does come across as the e acute.
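To make the byte-level picture concrete, here is a minimal sketch in Python 3 (not the Perl or C used in the original post, just a convenient way to poke at the bytes). The variable names are only for illustration. It shows that the single character U+00E9 becomes two bytes in UTF-8, that those bytes match the 195/169 and \303\251 values reported above, and what happens when they are misread as Windows-1252.

```python
# "é" is U+00E9 (LATIN SMALL LETTER E WITH ACUTE); written as an escape
# so the example does not depend on the encoding of this file.
text = "Communiqu\u00e9"

# Encode to UTF-8 and look at the last two bytes (the encoded e acute).
utf8_bytes = text.encode("utf-8")
tail = utf8_bytes[-2:]

decimal_values = list(tail)            # what the Devel::Peek dump reported
octal_values = [oct(b) for b in tail]  # what vi displayed as \303\251
hex_values = [hex(b) for b in tail]    # 0xC3 0xA9

# One character, but two bytes once encoded as UTF-8.
char_count = len(text)
byte_count = len(utf8_bytes)

# Reading the UTF-8 bytes as Windows-1252 yields two characters,
# which is what a non-UTF-8 terminal session displays.
misread = utf8_bytes.decode("cp1252")

# In Windows-1252 (and ISO-8859-1) the same character is the single byte 0xE9.
cp1252_bytes = text.encode("cp1252")

print(decimal_values, octal_values, hex_values)
print(char_count, byte_count)
print(misread)
print(cp1252_bytes)
```

A byte-oriented count (like a C loop over `char`) sees `byte_count`, while a decoding-aware tool sees `char_count`, which would explain why the C program reported two "characters" for the e acute.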
So you can either change the content type of your HTML/XML pages to UTF-8, change the e acute to the proper ISO-8859-1 encoding (the single byte 0xE9), or change it to an HTML entity:
&eacute;, &#233;, or &#xe9;
-
Aha... I have just learned about UTF-8 multibyte characters!!! That's what's going on here.
This document was really helpful:
http://perldoc.perl.org/perluniintro.html
However, it is early days for me, in terms of dealing with the dreaded windows 1252 characters. I still haven't got it pinned yet, but am working on it.
Cheers, Robbo
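For anyone landing here with the same mix of UTF-8 and Windows-1252 pages, the entity option suggested above could be sketched roughly like this in Python 3. This is only an illustration under stated assumptions, not the routine from the original post: `decode_page` and `to_html_entities` are hypothetical helper names, and the "try UTF-8 first, then fall back to Windows-1252" rule is a common heuristic rather than a guaranteed detector.

```python
def decode_page(raw):
    """Decode page bytes, trying strict UTF-8 first.

    Assumption: anything that is not valid UTF-8 is Windows-1252.
    Bytes that Windows-1252 leaves undefined become U+FFFD.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def to_html_entities(text):
    """Replace every non-ASCII character with a numeric HTML entity."""
    # The built-in "xmlcharrefreplace" error handler emits &#NNN; references.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# The same word stored two different ways ends up as the same entity.
from_utf8 = to_html_entities(decode_page(b"Communiqu\xc3\xa9"))
from_cp1252 = to_html_entities(decode_page(b"Communiqu\xe9"))

# Windows-1252 "smart quotes" (0x93 / 0x94) are handled by the fallback.
quotes = to_html_entities(decode_page(b"\x93hi\x94"))

print(from_utf8)
print(from_cp1252)
print(quotes)
```

Decoding before replacing is the key design choice here: working on decoded characters rather than raw bytes means a two-byte UTF-8 e acute and a one-byte Windows-1252 e acute are treated as the same character. Short Windows-1252 byte runs can occasionally happen to be valid UTF-8, so spot-checking the output on a sample of pages would be sensible.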