Hello,
Wanted to see if others have had to deal with this type of issue.
We are working on a DFC program to apply relationships between documents and another object type. This program will have to process every document in all of our repositories. One repository in particular has 3-4 million documents, so this process needs to be efficient and needs to run in batch mode - where it will process a set number of docs, stop, then run again on a new set of documents.
The one issue I see is how the program will batch up the object ids to process and how to avoid getting the same set of documents in the query. So basically, the program will do the following:
1. Get fixed set of documents to process for a particular run
2. Create the relation object linking the child and parent ids (see the sketch after this list).
3. Complete processing the fixed set and get the next set of documents that have not been processed
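For step 2, I'm picturing something like the sketch below. This is just a minimal example, not our actual code - the relation_name value "custom_doc_relation" is a placeholder for whatever name we end up registering, and I'm assuming a plain dm_relation object rather than a custom subtype:

import com.documentum.fc.client.IDfPersistentObject;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.common.DfException;
import com.documentum.fc.common.DfId;

public class RelationHelper {
    // Creates a dm_relation object tying a parent object to a child object.
    // "custom_doc_relation" is a placeholder relation_name for illustration.
    public static void createRelation(IDfSession session,
                                      String parentId,
                                      String childId) throws DfException {
        IDfPersistentObject rel = session.newObject("dm_relation");
        rel.setString("relation_name", "custom_doc_relation");
        rel.setId("parent_id", new DfId(parentId));
        rel.setId("child_id", new DfId(childId));
        rel.save();
    }
}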
I've been considering using the r_object_id to order the document list. So I would order the documents by r_object_id and only get the first N documents, then store the last r_object_id processed. On the next run, get the next N documents where r_object_id > the last r_object_id processed.
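The per-batch query would look something like this sketch. The batch size of 500 and the starting id of all zeros are just assumptions, and I know the RETURN_TOP hint behaves differently depending on the underlying database, so take that part with a grain of salt:

import com.documentum.fc.client.DfQuery;
import com.documentum.fc.client.IDfCollection;
import com.documentum.fc.client.IDfQuery;
import com.documentum.fc.client.IDfSession;
import com.documentum.fc.common.DfException;

public class BatchFetcher {
    private static final int BATCH_SIZE = 500; // assumed batch size

    // Processes one batch and returns the last r_object_id seen, so the
    // next run can resume from there. Returns null when no rows come back,
    // i.e. all documents have been processed.
    public static String processBatch(IDfSession session, String lastId)
            throws DfException {
        String dql = "SELECT r_object_id FROM dm_document"
                + " WHERE r_object_id > '" + lastId + "'"
                + " ORDER BY r_object_id"
                + " ENABLE (RETURN_TOP " + BATCH_SIZE + ")";
        IDfQuery query = new DfQuery();
        query.setDQL(dql);
        IDfCollection results = query.execute(session, IDfQuery.DF_READ_QUERY);
        String newLastId = null;
        try {
            while (results.next()) {
                newLastId = results.getString("r_object_id");
                // ... create the relation for this document here ...
            }
        } finally {
            results.close(); // always release the collection
        }
        return newLastId;
    }
}

The first run would pass in a lastId of "0000000000000000". My thinking is that since r_object_ids are fixed-length hex strings, the string comparison and ORDER BY should give a stable, repeatable ordering across runs.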
Has anyone else used the r_object_id in this way? Will using the r_object_id burn us in any way?
Any help would be appreciated.