Omitting HTML attributes and images from keywords

We've upgraded from MT 3.6 to MT 4.1.2 on a Windows 2000 SP4 server. When using TeamSite 6.5 or TeamSite 6.7.1's tagging subsystem to retrieve metadata for JSP files, we're seeing MT suggested keywords such as:

width=,height=,/images/dashlLine370.gif

We don't want such terms. How can we omit HTML attributes, file paths, and the like? We are not seeing the HTML tags themselves (e.g., "td") included in the keywords, which is good.

Here's the configuration of the Metatagger Keywords.smf manifest file we're using:

summarize -db Summarizer -keywords 10 FREE_TEXT

And here's how we built the index:

Summarizer
jsp
<script>parse -additive -case -mode phrase</script>
Y:/default/main/www/maint/WORKAREA/wa_content/Web/webApplication

In metatagger.cfg, we have:

html
htm
jsp

html
html
General.Keywords
General.Summary
File_Path
Native.Title
Author
Taxonomy.Industry
Places
Specifier.Date
Taxonomy.Subject
Metatagger Keywords
Metatagger Keyphrases
htmlpre

iwmthtmlproc
filePath.pl
iwmtfilternativemd.exe -native Author -native Title -native title
datePreprocessor.pl

We don't have any custom preprocessors or postprocessors.

Any suggestions?

Thanks,

S.

Comments

Migrateduser

The reason you are not seeing the HTML tags themselves is that, for HTML, MT comes preconfigured with the iwmthtmlproc preprocessor, which removes all HTML tags but leaves the tag text, attribute names, and attribute values in the cracked text.

The solution to your problem would be to write another custom preprocessor that further chews on the cracked text and removes the HTML-ish content you don't want: for example, remove width, height, id, align, valign, and so on, and probably keep src, alt, text, etc.
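A minimal sketch of such a filter, in Python, is shown here for illustration only. It assumes the preprocessor receives the cracked text on standard input and writes the cleaned text to standard output; how MT actually invokes a preprocessor is not confirmed in this thread, so check the product documentation before wiring it in. The attribute lists and the names `DROP_ATTRS`, `KEEP_VALUE_ATTRS`, and `clean_cracked_text` are hypothetical choices, not part of MT.

```python
import re
import sys

# Hypothetical lists: attributes removed together with their values,
# and attributes whose value is kept while the "name=" part is removed.
DROP_ATTRS = {"width", "height", "id", "align", "valign", "border",
              "cellpadding", "cellspacing", "class", "style"}
KEEP_VALUE_ATTRS = {"src", "alt", "title"}

# name=value where value is double-quoted, single-quoted, or a bare token
# (possibly empty, as in a dangling "width=").
ATTR_RE = re.compile(r"""\b([A-Za-z][\w:-]*)\s*=(?:"([^"]*)"|'([^']*)'|([^\s"'>]*))""")

# Anything that looks like a path or URL ending in a common image extension.
IMAGE_PATH_RE = re.compile(r"""[^\s"'=<>]+\.(?:gif|jpe?g|png|bmp)\b""", re.IGNORECASE)


def _replace_attr(match):
    name = match.group(1).lower()
    value = next((g for g in match.group(2, 3, 4) if g), "")
    if name in DROP_ATTRS:
        return " "
    if name in KEEP_VALUE_ATTRS:
        return " " + value + " "
    return match.group(0)  # unknown attribute: leave untouched


def clean_cracked_text(text):
    """Remove unwanted attribute residue and image paths from cracked text."""
    text = ATTR_RE.sub(_replace_attr, text)
    text = IMAGE_PATH_RE.sub(" ", text)
    # Collapse runs of spaces and tabs but keep line breaks.
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    # Assumed calling convention: cracked text in on stdin, cleaned text out on stdout.
    sys.stdout.write(clean_cracked_text(sys.stdin.read()))
```

With this sketch, an input such as `width="370" height=1 src="/images/dashlLine370.gif" alt="Dashed line" Welcome to our site` is reduced to `Dashed line Welcome to our site`. The image path is dropped even though it arrives via `src`, because file paths are among the unwanted keywords in the original question; remove the `IMAGE_PATH_RE` step if you do want them indexed.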

Migrateduser

Are you tagging HTML files, or DCRs with Visual Format (VF) fields in them? iwmthtmlproc doesn't handle VF well because VF content is HTML-escaped HTML. For example, VF will format a single paragraph as follows:

Pretty, isn't it? Also, if you cut and paste from an MS application that does not generate valid HTML, such as Word 2003, the "HTML" contents may not be valid.

If you are using VF, my recommendation would be to use the XSLT transconverter from the "Cooking with Plugins" Webinar, then use Tidy or JTidy to clean up the VF content.
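To illustrate why escaped markup slips past a tag stripper, here is a small Python sketch. It is not iwmthtmlproc, the XSLT transconverter, or Tidy; the regex-based `strip_tags` is only a naive stand-in, and the sample string is a made-up example of HTML-escaped content, not actual VF output.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Naive tag stripper: replaces anything between '<' and '>' with a space."""
    return TAG_RE.sub(" ", text)


def crack_escaped_html(escaped):
    """Unescape once so the markup becomes real tags, then strip them."""
    unescaped = html.unescape(escaped)
    return re.sub(r"\s+", " ", strip_tags(unescaped)).strip()


# Hypothetical HTML-escaped paragraph.
sample = "&lt;p align=&quot;left&quot;&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;"

# The escaped form contains no literal '<', so the stripper leaves it untouched.
print(strip_tags(sample) == sample)
# After unescaping, the same stripper removes the tags and their attributes.
print(crack_escaped_html(sample))
```

The point of the sketch is the ordering: unescape first, then clean up and strip. A real pipeline would still want Tidy or JTidy in between, since pasted content may not be valid HTML.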

fiquebem

Thanks for your replies, folks!

Actually, we're tagging JSP files.

We were using a custom summarizer/index with MT 3.6 also, but we didn't see the extraneous keywords (such as "height=") we're seeing with MT 4.1.2.

>Now the solution to your problem would be to write another custom pre-processor to further chew on the cracked text to remove all HTML-ish content that you don't want, like remove width,height,id,align,valign... and probably keep src, alt, text etc...

Do you (or anyone) have source code that can be easily adapted to do the above? If so, would you please post it?

Thanks,

S.

IwovWill

To expand on Clark's suggestion, the "Cooking with Plug-In's" webinar comes with a 7 MB zip file of example code.

Get it here: http://devnet.interwoven.com/site.fcgi/webcasts/docs/webcast-recordings.html#webcast07-06