adding metadata to a pdf or doc file
thriventman
Good Morning,
I was wondering if anyone has ever added or attached metadata to a Word or PDF file, along the lines of the last post about attaching metadata/tags to an HTML file.
Many thanks!
Comments
Michael
Hi
Perhaps you could clarify?
-- Do you simply wish to capture metadata for the Word or PDF files, so that this information is attached to your file in TeamSite as extended attributes?
-- Or do you also wish to include these metadata values in a generated Word or PDF file?
The more detail you can provide about what you are trying to achieve the more likely the responses you receive will be helpful!
Cheers
Michael
thriventman
Michael,
Following the last posting, I was able to set HTML <META> tags within an HTML file from TeamSite after setting metadata in the Tagger GUI. I could then download the HTML file to my desktop, look at the source code, and see the <META> tags embedded.
I want the same thing for PDF or DOC files. I know they are binary. However, a PDF file, for example, has a Summary section in its properties. I was hoping there was a way to embed the metadata values I set within the file itself.
The purpose of having those files physically tagged is that we OpenDeploy the MetaTagger-tagged TeamSite content to a web server. A 3rd party search engine could then spider the information and improve its search...
I hope I made sense.
Thanks again!
Michael
Right, now I have a better understanding of what you are trying to achieve.
When you say that the PDF file has a Summary section in properties -- I am presuming that you are doing everything in Windows; I am guessing that these properties are Windows specific.
Do you know for sure that your 3rd party search engine would actually look at these properties? I haven't come across this before.
My experience is that generally you would need to do a different form of integration to provide metadata information to your search engine. This may involve things like
- using DataDeploy to place the metadata details in a database used by the search engine.
- dumping the metadata information to an accompanying text / xml file which is sent along with your pdf file and is read by the search engine crawler.
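The second option could be sketched roughly as below. This is only an illustration: the function name `write_sidecar`, the `.meta.xml` suffix, and the `<metadata>`/`<attribute>` element names are all made up for the example, and whether a crawler can read such a file depends entirely on the search engine. It assumes the metadata values have already been pulled out of TeamSite into a plain dictionary.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_sidecar(doc_path, metadata):
    """Write metadata next to doc_path as '<name>.meta.xml'; return the sidecar path.

    The file layout here is hypothetical, not a format required by any
    particular search engine.
    """
    doc_path = Path(doc_path)
    root = ET.Element("metadata", {"file": doc_path.name})
    # Sort the keys so the output is stable from one deployment to the next.
    for name, value in sorted(metadata.items()):
        item = ET.SubElement(root, "attribute", {"name": name})
        item.text = value
    sidecar = str(doc_path.with_name(doc_path.name + ".meta.xml"))
    ET.ElementTree(root).write(sidecar, encoding="utf-8", xml_declaration=True)
    return sidecar
```

Calling `write_sidecar("report.pdf", {"keywords": "annuity, rates"})` would leave `report.pdf.meta.xml` beside the PDF, ready to be deployed along with it.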
Perhaps you could post back with which search tool you are using. Then someone who has integrated it before could provide some specific advice.
Cheers
Michael
thriventman
Good Morning Michael,
Thank you for your comments. The 3rd party search engine is called Atomz, at www.atomz.com.
Supposedly, there is no connection between TeamSite and this product. All it does is crawl the HTML etc. on your website and make searching easier. However, the drawback is that it doesn't meta tag HTML, PDFs, etc. the way MetaTagger does.
Thus, the whole idea is to tag the content in TeamSite with MetaTagger and OpenDeploy it to a web server where Atomz can read it. I do know that Atomz cannot read XML files.
So far we can add meta tags through workflow only in HTML files, and I was hoping someone had tagged other files, such as PDFs or DOCs.
Thanks again!
Michael
Hi
Adding metadata in TeamSite to any type of file is not a problem -- it doesn't care what type of file you are adding metadata to, it just creates the extended attributes on the file for you.
Your problem appears to be in doing something with these values -- in your case, making them available to your search engine.
For html files this is easy as you can place them in meta tags when you generate the file. The search engine can then crawl these pages and extract the metadata. For other file types a different approach is required.
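For the html case, the idea can be sketched as below. This is not TeamSite or MetaTagger code; the helper names `meta_tags` and `inject_meta` are invented for the example, and it assumes the extended attribute values are already available as a dictionary and that the page has a closing `</head>` tag.

```python
from html import escape


def meta_tags(metadata):
    """Render a dict of name -> value as one <meta> tag per line."""
    return "\n".join(
        '<meta name="{}" content="{}">'.format(
            escape(name, quote=True), escape(value, quote=True)
        )
        for name, value in sorted(metadata.items())
    )


def inject_meta(html_text, metadata):
    """Insert the meta tags just before the first </head>.

    The page is returned unchanged if it has no </head> tag.
    """
    idx = html_text.lower().find("</head>")
    if idx == -1:
        return html_text
    return html_text[:idx] + meta_tags(metadata) + "\n" + html_text[idx:]
```

The values are escaped so that quotes or angle brackets in a metadata value cannot break the tag. A binary format such as pdf or doc has no equivalent place to splice text in, which is why a different approach is needed there.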
I have had a look at the Atomz site. It appears to be an externally hosted 3rd party search engine. This could well make any integration difficult.
On this page
http://www.atomz.com/search/features/standards.htm
it has:
Content Types
Atomz Search supports a wide range of content types including HTML, .txt files, Adobe PDF files, Microsoft Office file formats (Word, Excel, and PowerPoint), MP3 files, and Macromedia Flash (including Flash MX) files.
But on this page
http://www.atomz.com/search/features/metadata.htm
it has:
Metadata Management Interface
This feature allows customers to inject metadata information directly from the Atomz Search interface, without ever having to touch the HTML or Web page source code. Users can append, replace and manage all metadata with just a few clicks of a mouse. With this feature, customers can assign metadata to documents that could never before be tagged, such as Adobe PDF files, Postscript documents, images, and other files for which customers do not have access to the document source.
This makes me think that if you want to have metadata available to Atomz, for say a PDF, then you will need to enter it via the interface they provide.
I suggest you contact your supplier regarding this. They may have another solution.
thriventman
Michael,
Thanks for looking at my situation. I'll let you know if I find such a solution.
Have a great day!