Web CMS (TeamSite)
finding bad HTML in perl
nipper
Hoping for a point in the right direction. Not certain which PMs to use.
I am on site with a load of HTML (thousands of files) that I am cracking and running through MetaTagger. However, some of the HTML has bad characters. The files are listed as UTF-8 but contain some characters (I assume trademark, copyright, etc.) that show up as PARTNAMEâ„¢ or PARTNAME\232, along with other unprintable characters.
When this happened to DCRs, I ran a quick XML parser over all the directories and trapped the bad data. Is there a similar parser for HTML? Point me at the PM and I can find my way from there.
Thanks
Andy
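One quick way to trap the bad characters before reaching for a full HTML parser is to flag every file containing bytes outside printable ASCII, which is exactly what the mojibake above is. A minimal shell sketch (the `find_bad_chars` name and the `htdocs` path in the usage note are illustrative, not from the thread):

```shell
# Print the name of every file under the given paths that contains a byte
# outside printable ASCII (tab, space through tilde) -- the usual sign of
# mis-encoded trademark/copyright characters.
find_bad_chars() {
  # LC_ALL=C makes grep match raw bytes; grep's exit status 1 (no matches)
  # is not an error here, so it is swallowed with "|| true".
  LC_ALL=C grep -rl $'[^\t -~]' "$@" || true
}
```

Invoked as, e.g., `find_bad_chars htdocs`, it lists only the files that need attention.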
Comments
gzevin
use HTML::****;
Greg Zevin, Ph.D. Comp. Sc.
Independent Interwoven Consultant/Architect
Sydney, AU
jed
It's not a perl solution, but you might be able to use tidy:
http://tidy.sourceforge.net/
http://www.w3.org/People/Raggett/tidy/
--
Jed Michnowicz
jedm@sun.com
Content Management Engineer
Sun Microsystems
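As a sketch of how tidy could handle the bulk check Andy wants: tidy exits with status 0 when a file is clean, 1 when there are only warnings, and 2 when there are errors, so a small wrapper can list just the broken files. (The `check_html` helper name is illustrative; the tidy options are standard.)

```shell
# List the files that HTML Tidy reports as having errors.
# check_html is an illustrative helper name; pass it the files to check.
check_html() {
  for f in "$@"; do
    status=0
    # -q suppresses the info banner, -e reports errors only (no cleaned
    # output is written); exit status 2 or higher signals real errors.
    tidy -q -e "$f" >/dev/null 2>&1 || status=$?
    if [ "$status" -ge 2 ]; then
      echo "$f"
    fi
  done
}
```

Invoked as, e.g., `check_html htdocs/*.html`; once you trust tidy's output, swapping `-e` for `-m` makes it rewrite the files in place.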
tonyarnold
That's exactly what I did. Compile tidy on the box you're going to use it on, then look at the scripts on this page for more ideas:
http://users.rcn.com/creitzel/tidy.html
I used the perl SWIG modules from here to parse *all* HTML entered by the user (including the monstrous mess that visualformat would like to refer to as XHTML). Now all content that is generated is valid XHTML. Let me know if you'd like more guidance - it's all pretty straightforward stuff...