How to identify the non-searchable/non-OCR documents in the repository?
Hello Everyone,
Can someone help me identify the non-searchable documents across the whole repository?
Somehow a few documents were entered into the system without OCR, so we need to find those non-searchable documents in the repository.
Please provide some input on this requirement.
I would appreciate help on this.
Thanks in advance.
Regards,
JeenatH
Comments
-
It depends on how you manage OCR. If you use searchable PDFs (i.e. the PDF contains a bitmap layer plus an OCRed text layer), there's not much you can do to identify those unless you did something upstream to tag them (e.g. you set some metadata to identify them or created an audit record). If the OCR output is stored as a text rendition, you can query for those.
0 -
Thanks bacham for your quick response.
The non-searchable documents are already in the repository, so I can't set metadata on them now. I want to find them using any metadata that distinguishes searchable from non-searchable documents. Is there a specific attribute that tells searchable versus non-searchable? Does Content Server update any metadata by default for all documents?
How can I identify which documents have a text rendition?
0 -
As bacham said, if the OCR text is stored inside the PDF file itself, there is no way to tell from the repository whether the file is searchable or just an image wrapped in a PDF (one that doesn't contain text). That being said, if you exclude this format and want to tell whether a TIFF image has a text rendition, look for two dmr_content objects associated with the same dm_sysobject:
select * from dmr_content where any i_parent_id = '<r_object_id>'
Combine this with GROUP BY and a count function to get what you want: a document with only one dmr_content object has no text rendition.
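The grouping and counting step can also be done outside the repository, for example on exported query results. Below is a minimal sketch in Java (the language mentioned elsewhere in this thread), assuming each exported row holds a content object ID and the ID of its parent document. The class name `RenditionCheck`, the method name, and the sample IDs are hypothetical and not part of any Documentum API. It also assumes single-page documents; a multi-page document may have several content objects without any rendition, so results would need checking against the format of each content object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a Documentum API.
class RenditionCheck {

    // Each row is {contentObjectId, parentDocumentId}, e.g. one line of exported query output.
    static List<String> parentsWithoutRendition(List<String[]> rows) {
        // Equivalent of GROUP BY parent id with COUNT(*) on content objects.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] row : rows) {
            counts.merge(row[1], 1, Integer::sum);
        }
        // Equivalent of HAVING COUNT(*) < 2: only the primary content exists.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < 2) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"content-1", "doc-A"},  // primary content of doc-A
            new String[] {"content-2", "doc-A"},  // second content object (rendition) of doc-A
            new String[] {"content-3", "doc-B"}); // doc-B has primary content only
        System.out.println(parentsWithoutRendition(rows)); // prints [doc-B]
    }
}
```

Documents returned by this check are candidates for re-OCR, not a confirmed list.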
0 -
We use a custom development to check whether ingested documents are OCRed PDFs or not. It is written in Java and uses PDFBox.
0 -
Thanks Johnny. Can you please give me the complete query on dmr_content? I did not fully understand it. Thanks.