Not sure what can be done here - hopefully y'all can help me out.
We're using BIRT in an enterprise application where the data against which it reports can often be quite large. In some cases users are able to provide combinations of filtering criteria to our reports that result in 100,000s, sometimes millions of rows being reported on. For those rows, often there are detail subqueries, so we're talking lots and lots of pages!
(Anecdotally, one of my production operations guys told me he found a 27 GB report file sitting in a temp folder on the server! We've since made some changes to preclude such things...)
OK, so the problem is that we're seeing a large memory footprint for reports that are rather simple but that result in many pages. We're not storing the output - we stream it directly to the browser or to a file - but watching the heap during the execution of these reports shows that it takes a huge amount of memory to process the report. (In my test case, for instance, I am working against 1.7 million rows for a report that has just a few lines of text per page - it's a certificate-of-completion report for a course. I have never had it finish, even with a 6 GB heap; it had generated about 600 MB of output before it ran out of heap memory.)
This feels wrong to me. If we're simply generating a report and streaming the results directly to a file or browser, I would expect the memory footprint not to correlate with the number of pages, but to quickly plateau into a steady state. Whether we do PDF or HTML, however, it is clear that more pages = more memory. (And for HTML, a "page" is just the same content that would have been on a page in PDF.)
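For reference, our streaming path is roughly the following (a sketch using the standard BIRT engine API; `engine` and `design` stand in for our already-initialized report engine and opened report design, and the method name is just illustrative):

```java
import java.io.OutputStream;
import org.eclipse.birt.report.engine.api.HTMLRenderOption;
import org.eclipse.birt.report.engine.api.IReportEngine;
import org.eclipse.birt.report.engine.api.IReportRunnable;
import org.eclipse.birt.report.engine.api.IRunAndRenderTask;

// Sketch: run-and-render straight to the servlet/file OutputStream,
// so the finished report is never held server-side in full. Despite
// this, heap usage still grows with the page count.
void streamReport(IReportEngine engine, IReportRunnable design,
                  OutputStream out) throws Exception {
    IRunAndRenderTask task = engine.createRunAndRenderTask(design);
    HTMLRenderOption options = new HTMLRenderOption();
    options.setOutputFormat("html");  // or "pdf" via PDFRenderOption
    options.setOutputStream(out);     // stream directly, no temp copy
    task.setRenderOption(options);
    try {
        task.run();
    } finally {
        task.close();
    }
}
```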
What we've done so far is to at least set DataEngine.MEMORY_BUFFER_SIZE in the AppContext to a reasonable number (10 MB is where I settled, but maybe that is too high?). Prior to this, 1.7 million rows would not even result in any output - the result-set caching that happens before the report is generated would itself blow the heap.
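Concretely, the setting looks like this (a sketch; `task` stands for our run-and-render task, and the 10 MB value is just where we landed, not a recommendation):

```java
import java.util.HashMap;
import java.util.Map;
import org.eclipse.birt.data.engine.api.DataEngine;

// Cap the data engine's in-memory result-set buffer so that rows
// beyond the cap spill to disk instead of accumulating on the heap.
Map<String, Object> appContext = new HashMap<>();
appContext.put(DataEngine.MEMORY_BUFFER_SIZE, 10); // interpreted as MB in our setup
task.setAppContext(appContext); // task is an IRunAndRenderTask (or IRunTask)
```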
So the question is: is this just normal BIRT behavior? Are there other configuration options we can set that would restrict report memory? (And limiting the number of rows the queries can process is not an option - that would break the analytic reports, for instance...)
What can we do to get this under control? Or is BIRT not going to scale to situations like this? (Keep in mind that one of our output formats is CSV, and so a million-row CSV report, while cumbersome, is still in the realm of real-world reporting.)