Extraction of HTML tables with Perl

I'm trying to write a script that extracts the HTML that makes up a particular table across many different HTML files.

I need the HTML that makes up the table, not just the contents of the table cells.

The common characteristic of the table that I need to extract is that the width of the table is 602.

I'm fairly certain that the width attribute is in the same position in each <TABLE> tag across the files. So I need to copy all the HTML beginning with:

<table width="602" border="0" cellspacing="0" cellpadding="0" bgcolor="white" height="100%">

and ending with:

</table>

ignoring all nested tables and paste it to a new file.

I thought that HTML::TableExtract might be the way to go because I could pinpoint the location of the table, but my code only gives me the contents of the table cells. It may be possible to use HTML::LinkExtor to get the table I want, but I'm not sure how to ignore all the nested tables in the file.

thanks much !

Comments

jbonifaci

Heya,

I've done similar things in the past with a single regular expression, but didn't have it in me today. Here's what I came up with, though. Assuming you've read the HTML into a variable named $html, the code below should do what you're looking for. The outermost table with width 602 will end up in the variable $table.

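# Grab from the width-602 opening tag to the first closing </table>.
my ($table) = $html =~ /(<table\b[^>]*width\s*=\s*['"]?602['"]?[^>]*>(.|\n)*?<\/table\b[^>]*>)/i;

# Keep extending the match until opening and closing table tags balance.
my $balanced = 1;
while ($balanced != 0)
{
    my @openTable  = $table =~ /(<table\b[^>]*>)/gi;
    my @closeTable = $table =~ /(<\/table\b[^>]*>)/gi;

    # $#array is the last index, so this is the difference in tag counts.
    $balanced = $#openTable - $#closeTable;

    if ($balanced != 0)
    {
        # Escape regex metacharacters so $table can be used as a pattern.
        $table =~ s/([\\\|\[\]\^\$])/\\$1/g;
        # Extend the match to the next closing tag.
        ($table) = $html =~ /($table(.|\n)*?<\/table>)/i;
    }
}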
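For comparison, the same balancing idea can be sketched as a single left-to-right scan that keeps a nesting-depth counter instead of re-matching against $html. This is an illustrative alternative, not the code from the post above; the sub name extract_table and the sample HTML are made up for the example, and it assumes no stray table tags inside comments or scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first <table> whose width attribute equals
# $width, nested tables included, or undef if none is found or it never closes.
sub extract_table {
    my ($html, $width) = @_;

    # Find the opening tag of the wanted table; /g records where the match ended.
    return unless $html =~ /<table\b[^>]*\bwidth\s*=\s*['"]?\Q$width\E['"]?(?=[\s>])[^>]*>/gi;
    my $start = $-[0];    # offset of the opening tag
    my $depth = 1;        # one table is open so far

    # Keep scanning from that point: +1 per <table ...>, -1 per </table>.
    while ($html =~ /<(\/?)table\b[^>]*>/gi) {
        $depth += $1 ? -1 : 1;
        return substr($html, $start, $+[0] - $start) if $depth == 0;
    }
    return;    # ran out of input before the table closed
}

# Made-up sample: a width-602 table with one nested table.
my $sample = '<p>intro</p><table width="602"><tr><td>'
           . '<table width="100%"><tr><td>inner</td></tr></table>'
           . '</td></tr></table><p>outro</p>';

my $table = extract_table($sample, 602);
print defined $table ? "$table\n" : "no matching table\n";
```

Because the counter only reaches zero at the closing tag that pairs with the first opening tag, nested tables are carried along without any escaping of the matched text.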

Adam Stoller

Not sure if the attached is what you want, but I *think* it does what you're looking for.

--fish
Senior Consultant, Quotient Inc.
http://www.quotient-inc.com

Johnny

Have a look at HTML::TreeBuilder.

It gives a very clean interface for manipulating HTML code.

John Cuiuli

taiyo

Hey jbonifaci,

thanks so much for your regex --- it's pretty sweet and I've got it working for simple pages but it complains as follows:

Use of uninitialized value at html_extract.ipl line 15, <DATA> chunk 1.
Use of uninitialized value at html_extract.ipl line 16, <DATA> chunk 1.
Use of uninitialized value at html_extract.ipl line 26, <DATA> chunk 1.

when I run it against more complex pages. I'm thinking that the special characters present in the HTML, like "{", "}", "< --", "!" and ">", may be causing problems... or could the string be too large?

I've attached a sample HTML file that fails, and your script as I modified it is below. Thanks again!

#!/usr/bin/perl -w
#=================================================
use strict;
my $html = '';
{
    local $/;
    $html = <DATA>;

}
my ($table) = $html =~ /(<table\b[^>]*width\s*=\s*['"]?600['"]?[^>]*>(.|\n)*?<\/table\b[^>]*>)/i;
#print "$table \n";
my $balanced = 1;
while($balanced != 0)
{
    my @openTable = $table =~ /(<table\b[^>]*>)/gi;
    my @closeTable = $table =~ /(<\/table\b[^>]*>)/gi;

    $balanced = $#openTable - $#closeTable;

    if ($balanced != 0)
    {
        $table =~ s/([\\\|\[\]\^\$])/\\$1/g;
        ($table) = $html =~ /($table(.|\n)*?<\/table>)/i;
    }
}
print "=" x 70, "\n$table\n", "=" x 70, "\n";
#---------------------------------------------------------------------
__DATA__
<html>
<head><title>foo</title></head>
<body>
<h1>This is a heading</h1>
<table width="602" border="0" cellspacing="0" cellpadding="0" bgcolor="white" height="100%">
<tr><th>Col1</th>
<th>Col2</th>
<th colspan="2">Col3 and 4</th>
</tr>
<tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
<tr><td colspan="4">
<table border='0' width="100%">
<tr><th>Col-A</th>
<th>Col-B</th>
<th colspan="2">Col-C and D</th>
</tr>
<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
</table>
</td>
<tr><td>aa</td><td>bb</td><td>cc</td><td>dd</td></tr>
</table>
<p>
More Text with a table above and nothing below
</p>
</body>
</html>

Edited by taiyo on 11/20/03 11:04 AM (server time).

taiyo

hey fish --

thanks for your post -- got your script working, but it fails if the table is nested in the HTML. I know I should be able to figure out a way to overcome this ---- just wanted to say thanks!!!

taiyo

hey john,

had a go with HTML::TreeBuilder and was able to get the HTML parsed into a tree, but I still need to figure out how to pull out the useful stuff and ignore the rest....

I believe that the dump() method should help, but I'm not sure if I should continue down this path if I can tweak the regex above....cheers
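One way to pull just the wanted table out of a parsed tree is look_down followed by as_HTML. The sketch below assumes HTML::TreeBuilder is installed from CPAN (it is not a core module) and uses a made-up sample document. Note that as_HTML re-serializes the parse tree, so the output is equivalent HTML rather than a byte-for-byte copy of the source.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, assumed to be installed

# Made-up sample: the wanted table contains a nested table.
my $html = <<'HTML';
<html><body>
<table width="602" border="0">
<tr><td>outer</td></tr>
<tr><td><table width="100%"><tr><th>Col-A</th></tr></table></td></tr>
</table>
<p>More text</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element; the
# search runs top-down, so an outer table is found before tables inside it.
my $table = $tree->look_down(_tag => 'table', width => '602');

# Serialize the element and everything under it, nested table included.
# Arguments: entities to encode, indent string, optional end tags (none omitted).
my $out = $table ? $table->as_HTML('<>&', '  ', {}) : '';
print $out, "\n";

$tree->delete;    # free the tree once the string has been extracted
```

Compared with dump(), which prints a debugging view of the tree, as_HTML returns markup that can be written straight to a new file.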

jbonifaci

taiyo,

The reason you're getting the three warnings in the example script you posted is that the regex looks for a table of width 600, while the HTML in your __DATA__ section has no table with width 600, only 602. The match fails, so $table is left undefined. I noticed you did this because the HTML you attached has a width of 600 instead of 602. You could change the regex to look for 60[02] instead of switching it between 600 and 602.
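A minimal sketch of that failure mode, with a guard that avoids the warnings: a list-context match that fails returns an empty list, so the variable stays undef and every later use of it warns. The helper name first_table_with_width and the one-line sample HTML are made up for the example, and the /s modifier stands in for the (.|\n) construct used in the thread.

```perl
use strict;
use warnings;

my $html = '<body><table width="602"><tr><td>x</td></tr></table></body>';

# Hypothetical helper: returns the first table with the given width, or undef.
sub first_table_with_width {
    my ($html, $width) = @_;
    # List-context match: on failure the list is empty and $table stays undef.
    my ($table) = $html =~ /(<table\b[^>]*width\s*=\s*['"]?\Q$width\E['"]?[^>]*>.*?<\/table\b[^>]*>)/is;
    return $table;
}

for my $width (600, 602) {
    my $table = first_table_with_width($html, $width);
    if (defined $table) {
        print "width $width: found ", length($table), " characters\n";
    } else {
        # Checking with defined() here prevents "Use of uninitialized value".
        print "width $width: no such table, skipping\n";
    }
}
```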

And as far as the reason the attached file is failing, I left out a few regex metacharacters in this line:

$table =~ s/([\\\|\[\]\^\$])/\\$1/g;

Replace the above line with this:

$table =~ s/([\\\|\[\]\{\}\^\$\?\+\*])/\\$1/g;

And you should be good to go. I think I've got all the metacharacters covered now.
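A hand-written escape list is easy to leave incomplete: the class above still omits "(", ")" and ".", which are also regex metacharacters. Perl's built-in quotemeta, or equivalently \Q...\E inside a pattern, escapes every non-word character. The sketch below uses a made-up fragment containing parentheses to show the difference.

```perl
use strict;
use warnings;

my $html     = '<td>Price (USD): 3.50 + tax?</td>';
my $fragment = '<td>Price (USD)';

# Interpolated raw, the parentheses become a capture group, so the pattern
# looks for the literal text "Price USD" and does not match.
my $raw_hit = $html =~ /$fragment/ ? 1 : 0;

# \Q...\E escapes all metacharacters, so the fragment matches literally.
my $quoted_hit = $html =~ /\Q$fragment\E/ ? 1 : 0;

print "raw: $raw_hit, quoted: $quoted_hit\n";    # prints "raw: 0, quoted: 1"

# Applied to the loop in this thread, the substitution line could be dropped
# and the following match written as (assuming $table and $html as above):
#   ($table) = $html =~ /(\Q$table\E.*?<\/table>)/is;
```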

Let me know if you have any further issues.

Jeff

taiyo

dude -- you rip!!!

that thing works like a charm --- much props to you!!!

so what is the significance of the brackets around the "02" in 60[02], compared with changing 600 to 602?

thanks again!

jbonifaci

Glad I could help out, =].

Square brackets denote a character set (also called a character class). [02] matches exactly one character from that set, so 60[02] matches either 600 or 602. If it were 60[02a#], it would match 600, 602, 60a or 60#.
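The behaviour described above can be checked with a short script. The candidate values are chosen for illustration, and the patterns are anchored with ^ and $ so that each one has to match the whole string.

```perl
use strict;
use warnings;

my @candidates = ('600', '601', '602', '60a', '60#');
my (%narrow, %wide);

for my $value (@candidates) {
    # [02] accepts exactly one character: either 0 or 2.
    $narrow{$value} = $value =~ /^60[02]$/ ? 1 : 0;
    # [02a#] accepts exactly one of 0, 2, a or #.
    $wide{$value} = $value =~ /^60[02a#]$/ ? 1 : 0;

    printf "%-3s  60[02]: %-3s  60[02a#]: %s\n",
        $value,
        $narrow{$value} ? 'yes' : 'no',
        $wide{$value}   ? 'yes' : 'no';
}
```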

Hope that makes sense,
Jeff

taiyo

yep --- got it --- thanks again !