html file TITLE native metadata
System
Hi DevNetters-
I've been working on generating metadata from a variety of documents. Here's an issue I've found with html files:
There are 2 types of "title" metadata for an html file:
- one in the HEAD section called TITLE, such as in the example
<TITLE>Page Title Here</TITLE>
and
- one in the META section, like this
<META name="Title" content="Page Title Here">
The one that MetaTagger considers native is the bottom META one, yet almost all of the files I'm dealing with use the top TITLE version.
Is there an option to have MetaTagger see the TITLE version as native, or am I going to have to build a recognizer to strip out the contents between <TITLE> and </TITLE>?
Thanks,
Wally Box
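For readers who want to see the two title sources side by side, here is a minimal sketch in Python. It is not MetaTagger code and does not use the .ism preprocessor syntax; it only uses the standard library's html.parser, and the class and function names are made up for illustration. Matching the META name case-insensitively is an assumption.

```python
from html.parser import HTMLParser


class TitleSources(HTMLParser):
    """Collect the <TITLE> element text and the META "Title" content separately."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.element_title = ""  # text between <TITLE> and </TITLE>
        self.meta_title = None   # content of <META name="Title" ...>, if present

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <TITLE> arrives as "title".
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            # Assumption: treat name="Title", "title" and "TITLE" alike.
            if (attributes.get("name") or "").lower() == "title":
                self.meta_title = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.element_title += data


def title_sources(html_text):
    """Return (title element text, META title content or None)."""
    parser = TitleSources()
    parser.feed(html_text)
    parser.close()
    return parser.element_title.strip(), parser.meta_title
```

A page that has only the TITLE element yields None for the second value, which is the situation described above: nothing for a tool that only looks at the META form.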
Comments
Migrateduser
What you need to do is open up HTMLGeneric.ism and change this token:
TITLE "title>";
to this:
TITLE "title>", "TITLE>";
and then recompile it by running
iwcmppreprocessor HTMLGeneric.ism
Make sure you make a copy of HTMLGeneric.ism before you modify it.
Migrateduser
I tried that, it won't work. Here's why:
HTMLGeneric.ipr is not built to capture the contents between the <TITLE> and </TITLE> tags. It IS built to capture, if present, the contents of a META line that has
<META name="Title" content="some content"> or, for that matter, any meta info.
The product does come with an example of how to capture the TITLE info (HTMLTitlepreprocessor.ism), but it's intended as a demo file for understandability and doesn't include the rest of the processing necessary for html files. Since an extension can have only one .ipr file to process it, that doesn't help.
So, it becomes necessary to try to modify the HTMLGeneric.ism, which isn't developed for understandability (no comments whatsoever in it), and try to merge in the intelligence of the HTMLTitlepreprocessor.ism file and then use the CLTs to debug it - but the structure of the processing in the 2 .ism files isn't even vaguely similar. I've been at it about 2 days now and don't feel any closer to a solution.
What a pain! I've taken the MT Admin class, reviewed the notes and materials, and have also been looking at the MT Admin guide. Nothing in any of those gives me a better feel for creating or modifying a preprocessor.
Migrateduser
Hi,
Actually, HTMLGeneric is designed to get contents out of title tags, but it obviously is not working for your files, so there is a bug. If you could attach a file to your next post, I will take a look at it to see what's going on.
Lisa
Migrateduser
Hi Lisa - Thanks for looking. I went over this with someone from Interwoven for quite a bit yesterday, so I don't think I'm just mistaken.
HTMLGeneric.ism does grab title info out of html files if you set them up as Meta tags - but won't pull them out of the <TITLE> </TITLE> line...
Unless, of course, there's a newer version that I didn't get with the 3.5.1 product?
I've attached my copy of the file.
Thanks for looking!
Migrateduser
Hi,
I think you just posted your preprocessor. What I'll need to check this out is one of your HTML files that you want to use the preprocessor on.
Thanks,
Lisa
Migrateduser
Yep, I figured that the preprocessor is what's doing the work and where the issue is.
I'm attaching the file I was testing against.
Migrateduser
Hi,
I fixed HTMLGeneric for you. It will now extract titles from META tags (along with all other META tags) and also from these forms of the title tag: <TITLE>, <title> and <Title>.
The problem was that the string "title" was being used for more than one buffer. Also, just in case: when you add <category> elements to metatagger.cfg, make sure the <tag> value is *exactly* what the preprocessor is seeing, otherwise you won't get anything out.
Attached is the HTMLGeneric.ism file. You'll have to get rid of the .txt extension (a devnet attachment issue) and compile it with iwcmppreprocessor. I checked to make sure that it all looked good. Below is a copy of the metadata record I was able to generate with iwgenmetadata that looked reasonable.
Let me know if you have any more problems.
Lisa
<?xml version="1.0" encoding="ISO-8859-1" ?>
<metadata>
<attribute>
<name>TITLE</name>
<value>Con Games Played on Travelers</value>
</attribute>
<attribute>
<name>Author</name>
<value>Jane Cantrell</value>
</attribute>
<attribute>
<name>Title</name>
<value>Nike Travel</value>
</attribute>
<facet>
<facetName>sector</facetName>
<descriptor>
<vocab>naics</vocab>
<code>522130</code>
<label>Credit Unions CAN</label>
</descriptor>
<descriptor>
<vocab>naics</vocab>
<code>454110</code>
</descriptor>
<descriptor>
<vocab>naics</vocab>
<code>523120</code>
<label>Securities Brokerage CAN</label>
</descriptor>
<descriptor>
<vocab>naics</vocab>
<code>926150</code>
<label>Regulation, Licensing, and Inspection of Miscellaneous Commercial Sectors US</label>
</descriptor>
</facet>
</metadata>
Migrateduser
Lisa -
Thanks once again for your assistance.
I compiled and tested the version you sent me and it works as advertised.
Unfortunately, it doesn't solve the problem from my 1st posting in this thread. With the version you fixed up for me, I now have a TITLE element and a Title element associated with an html document.
I use the same datacapture form for some other types of files as well (use .doc as an example). The other ones use Title and not TITLE as the metadata - so when I switch metatagger.cfg and datacapture.cfg over to using TITLE, I start getting results for .html files, but lose results for .doc files.
The solution I'm looking for is to have HTMLGeneric.ism strip out the info between <TITLE> and </TITLE> and assign it to the Title metadata; in other words, to switch what the preprocessor views as the "native" Title metadata, not to add a new metadata type called TITLE.
Does this make sense? I'm trying to use native metadata from several file types and want to change the behavior of the html preprocessor so that it derives the native Title metadata from a different place than it does now.
Thanks again,
Wally
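The name mismatch can be seen directly in the record posted above. The following is only an illustrative Python sketch using the standard library's xml.etree.ElementTree; the record is abbreviated to its attribute elements, and the function name is made up here. It is not a MetaTagger API.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the generated record (facets omitted).
RECORD = """<metadata>
  <attribute><name>TITLE</name><value>Con Games Played on Travelers</value></attribute>
  <attribute><name>Author</name><value>Jane Cantrell</value></attribute>
  <attribute><name>Title</name><value>Nike Travel</value></attribute>
</metadata>"""


def attribute_values(record_xml):
    """Map each attribute name to its value; names are compared exactly."""
    root = ET.fromstring(record_xml)
    return {
        attribute.findtext("name"): attribute.findtext("value")
        for attribute in root.findall("attribute")
    }


values = attribute_values(RECORD)
# "TITLE" and "Title" are two separate keys, so anything keyed on "Title"
# gets the META value rather than the text of the page's <TITLE> element.
```

Because the lookup is case-sensitive, a form configured for "Title" and a form configured for "TITLE" read different values from the same html document.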
Migrateduser
OK. I see the problem now. So there are two solutions that I can think of:
1) Stop using a "Title" META tag, because as the HTMLGeneric preprocessor is currently written, the META value will overwrite the value from the <TITLE> tag if you use the string "Title" as your category name; or
2) Edit the HTMLGeneric preprocessor so that when it encounters a "Title" META tag, it either ignores it or assigns it to a category name other than "Title".
To pick one of these options, we'd need to know why the "Title" META tag is getting generated and if anyone is using that data.
Lisa
Migrateduser
I think it'd have to be the 2nd option, since the other doc types seem to use Title pretty consistently as a native metadata type and I don't think I could prevent the users from putting it into their pages.
TITLE is used consistently here: it controls the title displayed at the top of the browser for an html page, and our intranet search engine also treats it as the native Title metadata. The META Title is not required, is used inconsistently, and most likely never differs from the TITLE version.
The steps to modify the preprocessor for #2 would be to:
- find where TITLE is being set and set it to Title instead
- put in some exception handling for Title within the HANDLE_META state.
Sound about right? I think I'll give it a shot.
Wally
Migrateduser
Hi,
That's pretty much it. I think you're right that it's the "HANDLE_META" state you want to modify. You'll need to define a token for "Title" and when that is matched, use the "NOTHING" action and stay within the HANDLE_META state. It might take a few testing iterations to make sure that this is working right. The iwdumppreprocessor tool will be very useful and I would also suggest using iwgenmetadata on a file once you think everything looks good.
Let me know how it turns out.
Lisa
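The target behavior discussed in the last few posts can be sketched outside MetaTagger. The Python below is not .ism syntax and does not reproduce the HANDLE_META state or the NOTHING action; it is a rough analogue built on the standard library's html.parser, with hypothetical class and function names, and it assumes the META name should be matched case-insensitively.

```python
from html.parser import HTMLParser


class NativeTitleParser(HTMLParser):
    """Build metadata in which "Title" comes only from the <TITLE> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_text = []
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if not name or content is None:
                return
            if name.lower() == "title":
                # Analogue of matching a "Title" token and doing nothing:
                # skip this tag and keep handling the remaining META tags.
                return
            self.metadata[name] = content

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            # Store under "Title", the name the other document types use.
            self.metadata["Title"] = "".join(self.title_text).strip()

    def handle_data(self, data):
        if self.in_title:
            self.title_text.append(data)


def extract_metadata(html_text):
    """Return a dict of metadata names to values for one HTML document."""
    parser = NativeTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.metadata
```

Run against a page that carries both forms, this produces a single "Title" entry taken from the TITLE element, while other META tags such as Author pass through unchanged. That is the shape of record to look for when checking the modified preprocessor's output with iwgenmetadata.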