How to identify the non-searchable/non-OCR documents in the repository?

Hello Everyone,

Can someone help me identify the non-searchable documents in the whole repository?

Somehow a few documents were entered into the system without OCR, so we need to find those non-searchable documents in the repository.

Please provide some input on this requirement.

I would appreciate help on this.

Thanks in advance.

Regards,

JeenatH

Comments

bacham2

It depends on how you manage OCR. If you use a searchable PDF (i.e. the PDF contains a bitmap layer plus an OCRed text layer), there's not much you can do to identify those unless you did something upstream to tag them (e.g. you set some metadata to identify them or created an audit record). If it's a text rendition, you can query those.

jeenathreddy.gade

Thanks bacham for your quick response.

The non-searchable documents are already in the repository, so I can't set metadata on them now. I want to find them using any metadata that identifies a document as searchable or non-searchable. Is there a specific attribute that distinguishes searchable from non-searchable documents? Does Content Server update any metadata by default for all documents?

How can I identify which documents have a text rendition?

DCTM_Guru

As bacham said, if the OCRed text is stored inside the PDF file itself, there is no way to tell whether the file is searchable or just a PDF-wrapped image (that doesn't contain text). That being said, if you exclude this format and want to tell whether a TIFF image has a text rendition, look for two dmr_content objects associated with the same dm_sysobject:

select * from dmr_content where any i_parent_id = '<r_object_id>'

Use this with a group by / count function to get what you want.

Julien.Fontaine

We use custom development to check whether ingested documents are OCRed PDFs or not. It is written in Java and uses PDFBox.
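
The custom check mentioned above is not shown in the thread. As a rough sketch of how the decision step of such a check could work, assume the text of each page has already been extracted with a library such as PDFBox (for example with its PDFTextStripper). A document can then be flagged as non-searchable when too few of its pages yield real text. The class name, the method names, and both thresholds are hypothetical choices for illustration, not part of PDFBox or Documentum.

```java
import java.util.List;

// Hypothetical helper: decides whether already-extracted page text
// looks like it came from a PDF with an OCR text layer.
final class OcrTextCheck {

    /** Counts letters and digits, ignoring whitespace and punctuation. */
    static int countTextChars(String pageText) {
        int count = 0;
        for (int i = 0; i < pageText.length(); i++) {
            if (Character.isLetterOrDigit(pageText.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns true when at least minPageRatio of the pages contain at least
     * minCharsPerPage letters or digits. Both thresholds are assumptions to
     * be tuned on real documents. An empty page list counts as non-searchable.
     */
    static boolean looksSearchable(List<String> pageTexts,
                                   int minCharsPerPage,
                                   double minPageRatio) {
        if (pageTexts.isEmpty()) {
            return false;
        }
        int pagesWithText = 0;
        for (String text : pageTexts) {
            if (text != null && countTextChars(text) >= minCharsPerPage) {
                pagesWithText++;
            }
        }
        return (double) pagesWithText / pageTexts.size() >= minPageRatio;
    }
}
```

Counting letters and digits per page, rather than testing for any non-empty text, keeps a lone stamp or page number from making an image-only scan look searchable.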

jeenathreddy.gade

Thanks Johnny, can you please give me the complete query on dmr_content? I did not get it. Thanks.

DCTM_Guru

You can have multiple content files (e.g. renditions) associated with a single sysobject. To find which sysobjects have multiple content files, you will need to use a group by / count function. I'm not a SQL guru, so I don't know the syntax off the top of my head.
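
The group by / count step described above can also be done on the client side once the dmr_content rows have been exported. Here is a minimal Java sketch, assuming each row has been reduced to a pair of parent object ID and format name. The `ContentRow` class, the method names, and the sample IDs in the usage note are hypothetical, and the format name of a text rendition (`crtext` below) is an assumption to check against the formats defined in your repository.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups exported dmr_content rows by parent object.
final class RenditionCounter {

    /** One dmr_content row reduced to the two values this sketch needs. */
    static final class ContentRow {
        final String parentId;
        final String fullFormat;

        ContentRow(String parentId, String fullFormat) {
            this.parentId = parentId;
            this.fullFormat = fullFormat;
        }
    }

    /** Counts content objects per parent, like GROUP BY parent with COUNT(*). */
    static Map<String, Integer> countByParent(List<ContentRow> rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (ContentRow row : rows) {
            counts.merge(row.parentId, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the parents that have no content object in the given format. */
    static List<String> parentsWithoutFormat(List<ContentRow> rows, String textFormat) {
        Map<String, Boolean> hasText = new LinkedHashMap<>();
        for (ContentRow row : rows) {
            boolean isText = textFormat.equals(row.fullFormat);
            hasText.merge(row.parentId, isText, Boolean::logicalOr);
        }
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : hasText.entrySet()) {
            if (!entry.getValue()) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }
}
```

For example, given the rows (`doc1`, `tiff`), (`doc1`, `crtext`), and (`doc2`, `tiff`), `countByParent` reports two content objects for `doc1` and one for `doc2`, and `parentsWithoutFormat(rows, "crtext")` returns only `doc2`, the candidate for a missing OCR text rendition.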