Omitting HTML attributes and images from keywords
fiquebem
We've upgraded from MT 3.6 to MT 4.1.2 on a Windows 2000 SP4 server. When using TeamSite 6.5 or TeamSite 6.7.1's tagging subsystem to retrieve metadata for JSP files, we're seeing MT suggested keywords such as:
width=,height=,/images/dashlLine370.gif
We don't want such terms. How can we omit the HTML attributes, file paths, etc.? We are not seeing the HTML tags themselves (e.g., "td") included in the keywords, which is good.
Here's the configuration of the Metatagger Keywords.smf manifest file we're using:
summarize -db Summarizer -keywords 10
FREE_TEXT
And here's how we built the index:
Summarizer
jsp
<script>parse -additive -case -mode phrase</script>
Y:/default/main/www/maint/WORKAREA/wa_content/Web/webApplication
In metatagger.cfg, we have:
html
htm
jsp
html
html
General.Keywords
General.Summary
File_Path
Native.Title
Author
Taxonomy.Industry
Places
Specifier.Date
Taxonomy.Subject
Metatagger Keywords
Metatagger Keyphrases
htmlpre
iwmthtmlproc
filePath.pl
iwmtfilternativemd.exe -native Author -native Title -native title
datePreprocessor.pl
We don't have any custom preprocessors or postprocessors.
Any suggestions?
Thanks,
S.
Comments
Migrateduser
The reason you are not seeing the HTML tags themselves is that, for HTML, MT comes pre-configured with the iwmthtmlproc pre-processor, which removes all HTML tags but leaves the tag text, attributes, and attribute values in the cracked text.
The solution to your problem would be to write another custom pre-processor that works on the cracked text and removes the HTML-ish content you don't want: for example, remove width, height, id, align, valign... and probably keep src, alt, text, etc.
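As a rough illustration of that kind of cleanup step, here is a minimal Python sketch. It only shows the text filtering. How a custom pre-processor is registered in metatagger.cfg and how MT passes it the cracked text are not covered, and the function name `strip_html_residue`, the attribute lists, and the image extensions are all assumptions made for this example. Because the original question asked for file paths to be dropped, the sketch removes image paths even when they come from a kept `src` value.

```python
import re

# Attribute names whose residue is dropped entirely (assumed list, taken from
# the examples mentioned in this thread).
DROP_ATTRS = ("width", "height", "id", "align", "valign")

# Attribute names whose value is kept while the name= wrapper is dropped (assumed).
KEEP_ATTRS = ("alt", "src")

# Matches name=, name="value", name='value', or name=value for dropped attributes.
ATTR_RE = re.compile(
    r"""\b(?:%s)\s*=(?:"[^"]*"|'[^']*'|[^\s"'>,]*)""" % "|".join(DROP_ATTRS),
    re.IGNORECASE,
)

# Matches name="value" or name='value' for kept attributes.
KEEP_RE = re.compile(
    r"""\b(?:%s)\s*=(?:"([^"]*)"|'([^']*)')""" % "|".join(KEEP_ATTRS),
    re.IGNORECASE,
)

# Matches paths ending in a common image extension, e.g. /images/dashlLine370.gif
IMAGE_PATH_RE = re.compile(r"""[^\s"'=,]*\.(?:gif|jpe?g|png|bmp)\b""", re.IGNORECASE)


def strip_html_residue(text: str) -> str:
    """Remove leftover attribute and image-path tokens from cracked text."""
    text = ATTR_RE.sub(" ", text)
    # Replace name="value" with just the value for the kept attributes.
    text = KEEP_RE.sub(lambda m: m.group(1) or m.group(2) or "", text)
    text = IMAGE_PATH_RE.sub(" ", text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"[ \t]+", " ", text).strip()


print(strip_html_residue("width= height= /images/dashlLine370.gif Welcome"))  # Welcome
```

The regular expressions are deliberately simple. Attribute values containing escaped quotes, or unquoted values after a space, would need a more careful parser.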
Migrateduser
Are you tagging HTML files or DCRs with Visual Format fields in them? VF isn't handled well by iwmthtmlproc, as the VF is HTML-escaped HTML. For example, VF will format a single paragraph as follows:
&lt;p&gt;Cats &amp;amp; Dogs&lt;/p&gt;
Pretty, isn't it? Also, if you cut and paste from an MS application that does not generate valid HTML, such as Word 2003, the "HTML" contents may not be valid.
If you are using VF, my recommendation would be to use the XSLT transconverter from the "Cooking with Plugins" Webinar, then use Tidy or JTidy to clean up the VF content.
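To show what "HTML-escaped HTML" means in practice, here is a small Python sketch that undoes one level of escaping with the standard library, so that a tag-stripping step would see ordinary markup. The sample string is an assumed example of stored VF content, and this is not the XSLT transconverter from the webinar.

```python
import html


def unescape_vf(escaped: str) -> str:
    """Undo one level of HTML escaping (a single pass, not repeated)."""
    return html.unescape(escaped)


# Assumed sample of a VF field as stored: the markup itself is escaped.
sample = "&lt;p&gt;Cats &amp;amp; Dogs&lt;/p&gt;"
markup = unescape_vf(sample)
print(markup)  # <p>Cats &amp; Dogs</p>
```

After one pass the content is regular HTML again, which can then be cleaned up with Tidy or JTidy as suggested above.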
fiquebem
Thanks for your replies, folks!
Actually, we're tagging JSP files.
We were using a custom summarizer/index with MT 3.6 also, but we didn't see the extraneous keywords (such as "height=") we're seeing with MT 4.1.2.
>Now the solution to your problem would be to write another custom pre-processor to further chew on the cracked text to remove all HTML-ish content that you don't want, like remove width,height,id,align,valign... and probably keep src, alt, text etc...
Do you (or anyone) have source code that can be easily adapted to do the above? If so, would you please post it?
Thanks,
S.
IwovWill
To expand on Clark's suggestion, the "Cooking with Plugins" webinar comes with a 7 MB zip file of example code.
Get it here:
http://devnet.interwoven.com/site.fcgi/webcasts/docs/webcast-recordings.html#webcast07-06