Extraction of HTML tables with Perl
taiyo
I'm trying to write a script that extracts the HTML that makes up a particular table across many different HTML files.
I need the HTML that makes up the table, not just the contents of the table cells.
The common characteristic of the tables that I need to extract is that the width of the table is 602.
I'm fairly certain that the width attribute is in the same position in each <TABLE> tag across the files. So I need to copy all HTML beginning with:
<table width="602" border="0" cellspacing="0" cellpadding="0" bgcolor="white" height="100%">
and ending with:
</table>
including any tables nested inside it (they should not end the match early), and paste it to a new file.
I thought that HTML::TableExtract might be the way to go because I could pinpoint the location of the table, but my code just gives me the contents of the table cells. I'm thinking that it may be possible to use HTML::LinkExtor to get the table I want, but I'm not sure how to handle all the nested tables in the file.
thanks much !
Comments
jbonifaci
Heya,
I've done similar things in the past with a single regular expression, but didn't have it in me today. Here's what I came up with, though. Assuming you've read the HTML into a variable named $html, the code below should do what you're looking for. The highest-level table with width 602 will end up in the variable $table.
my ($table) = $html =~ /(<table\b[^>]*width\s*=\s*['"]?602['"]?[^>]*>(.|\n)*?<\/table\b[^>]*>)/i;
my $balanced = 1;
while ($balanced != 0)
{
    my @openTable = $table =~ /(<table\b[^>]*>)/gi;
    my @closeTable = $table =~ /(<\/table\b[^>]*>)/gi;
    $balanced = $#openTable - $#closeTable;
    if ($balanced != 0)
    {
        $table =~ s/([\\\|\(\)\[\]\^\$])/\\$1/g;
        ($table) = $html =~ /($table(.|\n)*?<\/table>)/i;
    }
}
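For comparison, the same goal can be reached by walking the table tags once and counting nesting depth, rather than re-matching until the tags balance. This is only a minimal sketch using core Perl: the sub name extract_table and the sample HTML are made up for illustration, and it does not handle table tags that appear inside comments or scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the source of the first <table> whose width
# attribute equals $width, including any nested tables, or nothing if the
# table is not found or its tags never balance.
sub extract_table {
    my ($html, $width) = @_;

    # Locate the opening tag of the wanted table.
    return
        unless $html =~ /<table\b[^>]*\bwidth\s*=\s*['"]?\Q$width\E['"]?(?=[\s>])[^>]*>/i;
    my $start = $-[0];     # offset where the opening tag begins
    pos($html) = $+[0];    # resume scanning just after the opening tag

    # Walk every later table tag: +1 for an opener, -1 for a closer.
    my $depth = 1;
    while ($html =~ /<(\/?)table\b[^>]*>/gi) {
        $depth += $1 ? -1 : 1;
        return substr($html, $start, pos($html) - $start) if $depth == 0;
    }
    return;                # unbalanced markup: no matching </table>
}

# Made-up sample input with one nested table.
my $html = <<'HTML';
<body>
<table width="602" border="0">
<tr><td>
<table border='0' width="100%"><tr><td>1</td></tr></table>
</td></tr>
</table>
<p>after</p>
</body>
HTML

my $table = extract_table($html, 602);
print defined $table ? "$table\n" : "no table with that width\n";
```

Because the depth counter only returns when it gets back to zero, the nested table's closing tag cannot end the extraction early, and no part of the HTML has to be escaped and fed back into a pattern.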
Adam Stoller
Not sure if the attached is what you want, but I *think* it does what you're looking for.
--fish
Senior Consultant, Quotient Inc.
http://www.quotient-inc.com
Johnny
have a look at HTML::TreeBuilder
It gives a very clean interface to manipulate HTML code
John Cuiuli
taiyo
Hey jbonifaci,
thanks so much for your regex --- it's pretty sweet and I've got it working for simple pages but it complains as follows:
Use of uninitialized value at html_extract.ipl line 15, <DATA> chunk 1.
Use of uninitialized value at html_extract.ipl line 16, <DATA> chunk 1.
Use of uninitialized value at html_extract.ipl line 26, <DATA> chunk 1.
when I run it against more complex pages. I'm thinking that all the special characters that are present in the HTML, like "{,},< --,!, >", may be causing problems... or could the string be too large?
I've attached a sample htm file that fails....and your script that i modified is below --- Thanks again!
#!/usr/bin/perl -w
#=================================================
use strict;
my $html = '';
{
local $/;
$html = <DATA>;
}
my ($table) = $html =~ /(<table\b[^>]*width\s*=\s*['"]?600['"]?[^>]*>(.|\n)*?<\/table\b[^>]*>)/i;
#print "$table \n";
my $balanced = 1;
while ($balanced != 0)
{
    my @openTable = $table =~ /(<table\b[^>]*>)/gi;
    my @closeTable = $table =~ /(<\/table\b[^>]*>)/gi;
    $balanced = $#openTable - $#closeTable;
    if ($balanced != 0)
    {
        $table =~ s/([\\\|\(\)\[\]\^\$])/\\$1/g;
        ($table) = $html =~ /($table(.|\n)*?<\/table>)/i;
    }
}
print "=" x 70, "\n$table\n", "=" x 70, "\n";
#---------------------------------------------------------------------
__DATA__
<html>
<head><title>foo</title></head>
<body>
<h1>This is a heading</h1>
<table width="602" border="0" cellspacing="0" cellpadding="0" bgcolor="white" height="100%">
<tr><th>Col1</th>
<th>Col2</th>
<th colspan="2">Col3 and 4</th>
</tr>
<tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
<tr><td colspan="4">
<table border='0' width="100%">
<tr><th>Col-A</th>
<th>Col-B</th>
<th colspan="2">Col-C and D</th>
</tr>
<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
</table>
</td>
<tr><td>aa</td><td>bb</td><td>cc</td><td>dd</td></tr>
</table>
<p>
More Text with a table above and nothing below
</p>
</body>
</html>
Edited by taiyo on 11/20/03 11:04 AM (server time).
taiyo
hey fish --
thanks for your post -- got your script working but it fails if the table is nested in the html. i know i should be able to figure out a way to overcome this ---- just wanted to say thanks!!!
taiyo
hey john,
had a go with HTML::TreeBuilder and was able to get the html into the format but need to figure out how to pull out the useful stuff and ignore the rest....
i believe that "dump()" should help but not sure if i should continue down this path if i can tweak the regex above....cheers
jbonifaci
taiyo,
The reason you're getting the three warnings in the example script you posted is that the regex looks for a table of width 600, while the __DATA__ section only has a table of width 602, so the match fails and $table is left undefined. I noticed you did this because the HTML you attached uses width 600 instead of 602. Rather than switching the regex between 600 and 602, you could change it to look for 60[02].
As for why the attached file is failing: I left out a few regex metacharacters in this line:
$table =~ s/([\\\|\(\)\[\]\^\$])/\\$1/g;
Replace the above line with this:
$table =~ s/([\\\|\(\)\[\]\{\}\^\$\?\+\*])/\\$1/g;
And you should be good to go. I think I've got all the metacharacters covered now.
Let me know if you have any further issues.
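As a side note on the escaping step: Perl's built-in quotemeta function, and the equivalent \Q...\E escape inside a pattern, backslash-escape every non-word character, so the list of metacharacters does not have to be maintained by hand. A small sketch with a made-up fragment (the variable names and sample strings are only for illustration):

```perl
use strict;
use warnings;

# Made-up fragment containing characters that are special in a regex.
my $fragment = '<td>cost (USD)? {x}+[y]*|z^$ 1.5</td>';
my $html     = "<tr>$fragment</tr>";

# quotemeta backslash-escapes every character that is not a word character,
# so the fragment is matched literally when interpolated into a pattern.
my $escaped = quotemeta($fragment);
print "literal match via quotemeta\n" if $html =~ /$escaped/;

# \Q...\E applies the same escaping inline.
print "literal match via \\Q...\\E\n" if $html =~ /\Q$fragment\E/;
```

Applied to the loop above, this would presumably mean interpolating the variable as \Q$table\E in the second match instead of escaping $table with the substitution first.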
Jeff
taiyo
dude -- you rip !!!
that thing works like a charm --- much props to you!!!
so what is the significance of using the brackets around the "02" when i change from 600 to 602?
thanks again!
jbonifaci
Glad I could help out, =].
Square brackets denote a character set (also called a character class). So [02] will match any one character within that set, meaning 60[02] will match either 600 or 602. If it were 60[02a#], it would match 600, 602, 60a or 60#.
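A quick way to check this is to run the pattern against a few values. The list of test widths here is made up, and the ^ and $ anchors are added only so that each whole string is tested:

```perl
use strict;
use warnings;

# 60[02] matches "60" followed by exactly one character from the set {0, 2}.
for my $width (qw(600 601 602 60a)) {
    my $result = $width =~ /^60[02]$/ ? 'matches' : 'does not match';
    print "$width $result\n";
}
```

Only 600 and 602 are reported as matching; 601 and 60a are not.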
Hope that makes sense,
Jeff
taiyo
yep --- got it --- thanks again !