Extracting HTML tables and nested tables from HTML

I'm trying to write a script that extracts an HTML table from many different HTML files.

I need to copy the HTML that makes up the table (as well as nested tables) into a field on a new DCR. I need the HTML like .... not just the contents of the table cells.

The common characteristic of the table that I need to extract is that the width of the table is 602.

I'm fairly certain that the width attribute is in the same location of each tag across the files. So I need to copy all HTML beginning with:

and ending with:

ignoring all nested tables and paste it to a new file.

I thought that HTML::TableExtract might be the way to go because I could pinpoint the location of the table, but my code just gives me the contents of the table cells. I'm thinking that it may be possible to use HTML::LinkExtor to get the table I want, but I'm not sure how to ignore all the nested tables in the file.

Thanks much,

taiyo

Comments

jbonifaci

Your post seems to have had a few things stripped out. If you could attach an example of an html page you are trying to strip the table out of, I should be able to give you something that can do this that doesn't need any modules. Make sure to attach an example with nested tables.

Also, for future reference, this would have been a perfect post for the new PERL forum here on devnet.

taiyo

Hey, thanks for the note about the new Perl forum as well as pointing out the sloppiness of my post. Sorry about that. I've attached an example HTML page for you to look at. I really just need the table that begins with:

<table width="602" border="0" cellspacing="0" cellpadding="0" bgcolor="white" height="100%">

and ends with:

</table>

cheers,

taiyo

jbonifaci

Heya,

I've done similar things to this in the past in a single regular expression, but didn't have it in me today. Here's what I came up with, though. Assuming you've read the HTML into a variable named $html, the code below should do what you're looking for. The highest-level table with width 602 will be in the variable $table.

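# Grab everything from the first width-602 <table> tag to the first </table> after it.
my ($table) = $html =~ /(<table\b[^>]*\bwidth\s*=\s*['"]?602\b['"]?[^>]*>.*?<\/table\b[^>]*>)/is;

# Nested tables close first, so keep extending the match to the next </table>
# until the opening and closing tag counts are equal.
while (defined $table)
{
    my $open  = () = $table =~ /<table\b[^>]*>/gi;
    my $close = () = $table =~ /<\/table\b[^>]*>/gi;
    last if $open == $close;

    # \Q...\E quotes every regex metacharacter in the text matched so far.
    # If no further </table> exists, $table becomes undef and the loop ends.
    ($table) = $html =~ /(\Q$table\E.*?<\/table\b[^>]*>)/is;
}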
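
For comparison, here is a different technique from the one posted above: a single pass that tracks nesting depth instead of re-matching the growing string. It is only a sketch. The name extract_table and the sample markup are made up for illustration, and it assumes the table tags are not hidden inside comments or scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first table with the given width, nested
# tables included, or undef if none is found or the tags never balance.
sub extract_table {
    my ($html, $width) = @_;

    # Locate the opening tag of the wanted table.
    return unless $html =~ /<table\b[^>]*\bwidth\s*=\s*['"]?\Q$width\E\b['"]?[^>]*>/i;
    my $start = $-[0];     # offset where the opening tag begins
    pos($html) = $+[0];    # resume scanning right after that tag
    my $depth = 1;

    # Walk every later <table> or </table> tag, tracking nesting depth.
    while ($html =~ /<(\/?)table\b[^>]*>/gi) {
        $depth += $1 ? -1 : 1;
        return substr($html, $start, pos($html) - $start) if $depth == 0;
    }
    return;    # unbalanced markup
}

# Made-up sample: the width-602 table sits inside an outer table and
# contains one nested table of its own.
my $html = '<table width="100"><tr><td>'
         . '<table width="602" border="0"><tr><td>'
         . '<table><tr><td>inner</td></tr></table>'
         . '</td></tr></table>'
         . '</td></tr></table>';

my $table = extract_table($html, 602);
print defined $table ? "$table\n" : "no balanced table found\n";
```

To run it against a real file, read the whole file into a string first (for example by setting local $/ to undef before reading the filehandle) and pass that string in place of the sample.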