Scan files for metadata with workflow-integration
edtmnjn
Hi,
I need to scan all documents, or rather all documents in the latest edition of each branch, within TeamSite for particular metadata (contained in html meta-tags, e.g. expiry date).
This data would then be analyzed and used to launch a specialized workflow. In the expiry-date example, it would start a workflow for files that have expired so that the branch owner could take appropriate action on the document.
Has anyone done something similar? Are there tools within TeamSite that I can use for this kind of thing, or would I have to write a script (Perl, shell or similar) that scans the files in the directory structure and starts jobs with iwinvokejob/iwjobc?
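For context, the scan-and-flag part can be sketched in a few lines of script. This is a minimal illustration only (shown in Python rather than Perl): the `expiry-date` meta tag name, the ISO date format, and the directory layout are all assumptions, not TeamSite specifics, and the actual job invocation (iwinvokejob/iwjobc) is deliberately left as a comment since its arguments depend on your job spec.

```python
"""Sketch: walk a directory tree and flag HTML documents whose expiry
meta tag has passed. Assumes tags like:
    <meta name="expiry-date" content="2005-01-31">
Tag name and date format are illustrative assumptions."""
import os
import re
from datetime import date

# Hypothetical meta tag; adjust the name/format to your documents.
META_RE = re.compile(
    r'<meta\s+name="expiry-date"\s+content="(\d{4})-(\d{2})-(\d{2})"',
    re.IGNORECASE)

def find_expired(root, today=None):
    """Return paths of HTML files whose expiry date is before `today`."""
    today = today or date.today()
    expired = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip anything without an applicable extension.
            if not name.lower().endswith((".htm", ".html", ".xhtml")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                match = META_RE.search(fh.read())
            if match and date(*map(int, match.groups())) < today:
                expired.append(path)
                # Hand the file to a TeamSite job here, e.g. via
                # iwinvokejob/iwjobc; invocation details omitted.
    return expired
```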
Comments
Adam Stoller
Are you looking to do this after-the-fact of a new edition being created - or at the time of submission to the staging area?
Is the data only contained *within* HTML documents, or are you using TeamSite's Extended Attributes to attach such metadata to the asset (and thus be able to mark *any* kind of asset: HTML, other ASCII, binary, etc.)?
If you're using EAs, you could use DataDeploy's DAS and the existing MetaData Search CGI to cover most of the ground desired.
If you're just talking about HTML and embedded meta tags, there are probably search engines out there that already provide tools for mining such data (if you want off-the-shelf programs rather than in-house creations). Otherwise I'd recommend Perl, as that is what it was designed for (Practical Extraction and Report Language).
It shouldn't be too difficult to write, as you can quickly eliminate any files that don't have applicable extensions (.htm, .html, .xhtml, etc.) and then either use brute-force extraction or one of the various *::Parser modules that exist for processing such file types.
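To illustrate the parser-module approach (shown here with Python's standard-library HTMLParser rather than a Perl *::Parser module, purely as a sketch): subclassing a real parser is more robust than brute-force regexes because it handles attribute ordering, quoting, and case for you. The tag names below are whatever your documents actually use.

```python
"""Sketch: collect name/content pairs from all <meta> tags in a
document using the standard library's HTML parser."""
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Accumulates a dict of meta-tag name -> content."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"].lower()] = d["content"]

def extract_meta(html_text):
    """Return {meta_name: content} for the given HTML text."""
    parser = MetaCollector()
    parser.feed(html_text)
    return parser.meta
```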
Of course, you'd want to have some idea of what you plan to do with the data once extracted - are you going to put it into a flat file or load it into a DB (or perhaps a combination)?
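The combination approach mentioned above is cheap to sketch: append each extracted record to a flat file as an audit trail, and also load it into a small database for querying. This example uses Python's standard-library csv and sqlite3 modules; the table and column names are illustrative assumptions.

```python
"""Sketch: store extracted metadata both in a CSV flat file (append-only
audit log) and in SQLite (queryable, last-value-wins per file/name)."""
import csv
import sqlite3

def store(rows, csv_path, db_path):
    """rows: iterable of (file_path, meta_name, meta_value) tuples."""
    rows = list(rows)
    # Flat file: every run appends, preserving history.
    with open(csv_path, "a", newline="") as fh:
        csv.writer(fh).writerows(rows)
    # Database: upsert so queries see the latest value per (path, name).
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS doc_meta "
                "(path TEXT, name TEXT, value TEXT, "
                "PRIMARY KEY (path, name))")
    con.executemany("INSERT OR REPLACE INTO doc_meta VALUES (?, ?, ?)",
                    rows)
    con.commit()
    con.close()
```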
--fish
(Interwoven, Curriculum Development)