Web CMS (TeamSite)
finding bad HTML in perl
nipper
Hoping for a point in the right direction. Not certain which PMs to use.
I am on site with a load of HTML (thousands of files) that I am cracking and running through MetaTagger. However, some of the HTML has bad characters. The files are listed as UTF-8 but contain some characters (I assume trademark, copyright, etc.) that show up as PARTNAMEâ„¢ or PARTNAME\232, along with other unprintable characters.
When this happened to DCRs, I ran a quick XML parser over all the directories and trapped the bad data. Is there a similar parser for HTML? Point me at the PM and I can find my way from there.
Thanks
Andy
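One quick way to trap the bad characters before reaching for a full HTML parser is to flag every file containing bytes outside printable ASCII, which is exactly what the mojibake above is. A minimal shell sketch (the `find_bad_chars` name and the `htdocs` path in the usage note are illustrative, not from the thread):

```shell
# Print the name of every file under the given paths that contains a byte
# outside printable ASCII (tab, space through tilde) -- the usual sign of
# mis-encoded trademark/copyright characters.
find_bad_chars() {
  # LC_ALL=C makes grep match raw bytes; grep's exit status 1 (no matches)
  # is not an error here, so it is swallowed with "|| true".
  LC_ALL=C grep -rl $'[^\t -~]' "$@" || true
}
```

Invoked as, e.g., `find_bad_chars htdocs`, it lists only the files that need attention.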
Comments
gzevin
use HTML::****;
Greg Zevin, Ph.D. Comp. Sc.
Independent Interwoven Consultant/Architect
Sydney, AU
jed
It's not a perl solution, but you might be able to use tidy:
http://tidy.sourceforge.net/
http://www.w3.org/People/Raggett/tidy/
--
Jed Michnowicz
jedm@sun.com
Content Management Engineer
Sun Microsystems
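As a sketch of how tidy could handle the bulk check Andy wants: tidy exits with status 0 when a file is clean, 1 when there are only warnings, and 2 when there are errors, so a small wrapper can list just the broken files. (The `check_html` helper name is illustrative; the tidy options are standard.)

```shell
# List the files that HTML Tidy reports as having errors.
# check_html is an illustrative helper name; pass it the files to check.
check_html() {
  for f in "$@"; do
    status=0
    # -q suppresses the info banner, -e reports errors only (no cleaned
    # output is written); exit status 2 or higher signals real errors.
    tidy -q -e "$f" >/dev/null 2>&1 || status=$?
    if [ "$status" -ge 2 ]; then
      echo "$f"
    fi
  done
}
```

Invoked as, e.g., `check_html htdocs/*.html`; once you trust tidy's output, swapping `-e` for `-m` makes it rewrite the files in place.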
tonyarnold
That's exactly what I did. Compile tidy on the box you're going to use it on, then look at the scripts on this page for more ideas:
http://users.rcn.com/creitzel/tidy.html
I used the perl SWIG modules from here to parse *all* HTML entered by the user (including the monstrous mess that visualformat would like to refer to as XHTML). Now all content that is generated is valid XHTML. Let me know if you'd like more guidance - it's all pretty straightforward stuff...