How to identify the non-searchable/non-OCR documents in the repository?

jeenathreddy.gade
edited December 6, 2012 in Documentum #1

Hello Everyone,

Can someone help me identify the non-searchable documents in the whole repository?

Somehow a few documents entered the system without OCR, so we need to find those non-searchable documents in the repository.

Please provide some input on this requirement.

I would appreciate help on this.

Thanks in advance.

Regards,

JeenatH

Comments

  • bacham2
    edited December 4, 2012 #2

    It depends on how you manage OCR. If you use searchable PDFs (i.e. the PDF contains a bitmap layer plus an OCRed text layer), there's not much you can do to identify them unless you did something upstream to tag them (e.g. you set some metadata to identify them or created an audit record). If the OCR output is stored as a text rendition, you can query for those.

  • jeenathreddy.gade
    edited December 4, 2012 #3

    Thanks bacham for your quick response.

    The non-searchable documents are already in the repository, so I can't set metadata on them now. I want to find them using any metadata that distinguishes searchable from non-searchable documents. Is there a specific attribute that tells searchable from non-searchable? Does Content Server update any metadata by default for all documents?

    How can I identify which documents have a text rendition?

  • DCTM_Guru
    edited December 4, 2012 #4

    As bacham said, if the PDF file contains the text inside the image, there is no way you can tell whether the file is searchable or just a PDF-wrapped image (that doesn't contain text). That being said, if you exclude this format and you want to tell whether a TIFF image has a text rendition, then look for two dmr_content objects associated with the same dm_sysobject:

    select * from dmr_content where any i_parent_id = "<r_object_id>"

    Use this with a group by / count function to get what you want.

  • Julien.Fontaine
    edited December 5, 2012 #5

    We use a custom development to identify whether the documents integrated into the repository are OCRed PDFs or not. It is written in Java and uses PDFBox.

  • jeenathreddy.gade
    edited December 6, 2012 #6

    Thanks Johnny, can you please give me the complete query on dmr_content? I did not understand it. Thanks.

  • DCTM_Guru
    edited December 6, 2012 #7

    You can have multiple content files (e.g. renditions) associated with a single sysobject. To find which sysobjects have multiple content files, you will need to use a group by / count function. I'm not a SQL guru, so I don't know the syntax off the top of my head.
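
The group by / count idea from the last comment can be sketched outside the repository. The following is only an illustration in Python, not a DQL query: it assumes the content rows have already been exported as (parent object ID, format name) pairs, one pair per dmr_content object. The function name, the sample IDs, and the format names are hypothetical.

```python
from collections import Counter


def objects_with_single_content(content_rows):
    """Return parent IDs that have fewer than two content objects.

    content_rows: iterable of (parent_id, format_name) pairs, one per
    content object. This input shape is an assumption for illustration;
    it is not something the repository produces in this form by itself.
    """
    # "Group by" the parent ID and count the content objects per parent.
    counts = Counter(parent_id for parent_id, _ in content_rows)
    # A parent with only one content object has no rendition at all.
    return sorted(pid for pid, n in counts.items() if n < 2)


# Hypothetical sample data: doc_b has a primary file but no rendition.
rows = [
    ("doc_a", "tiff"),
    ("doc_a", "text"),
    ("doc_b", "tiff"),
    ("doc_c", "tiff"),
    ("doc_c", "text"),
]

print(objects_with_single_content(rows))  # prints ['doc_b']
```

Note that a plain count cannot tell a text rendition from any other second content file, so in practice the format of the extra content would probably need to be checked as well.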