Which is faster: having one dataset for all charts or one dataset per chart?

Bryan Leow
edited February 11, 2022 in Analytics #1
Suppose I have a chunk of SQL queries to extract transactionCount and warehouseID from a hundred or so warehouses:

--this is just an illustration and not the main part of my question
select transactionCount, warehouseID from warehouse1.itrn
union select transactionCount, warehouseID from warehouse2.itrn
union select transactionCount, warehouseID from warehouse3.itrn
.
.
.
union select transactionCount, warehouseID from warehouse117.itrn

From here, I then want to:
a) group by country (warehouseID contains a country-code identifier),
b) count the total transactions, and
c) get the top ten warehouses with the highest transactionCount.
Each of a, b and c will be displayed on a separate chart.


My question would be: which is faster,
1) having all of the above (chunk+a+b+c) in one master data set in BIRT and linking that data set to all the charts,
or
2) having each chunk+a, chunk+b, chunk+c in three separate data sets and linking those to each chart individually?


#1 is easier to maintain if the chunk ever requires editing (e.g. when a new warehouse is added), but #2 produces more readable table results (#1 would require FULL JOINing the results of a, b and c into one table), so I guess performance will be the deciding factor.
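
To illustrate the FULL JOIN I mean in #1, here is a rough sketch (a_t, b_t and c_t are just placeholders for the results of a, b and c - there is no shared join key, so the rows simply sit side by side, padded with NULLs):

--rough sketch only, all names are placeholders
select a_t.country, a_t.transactionCount as countryCount,
       b_t.totalCount,
       c_t.warehouseID, c_t.transactionCount as top10Count
from a_t
full join b_t on 1 = 0
full join c_t on 1 = 0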


I guess a higher-level question would be how BIRT processes data sets: does it strictly run them one by one for each chart, regardless of repeated queries, or does it "cache" query results so that it can skip repeated queries within the same view session?



For context: my database is Oracle, I am on BIRT 4.6.0, and I'm running the chunk+a+b+c as a single WITH x AS (...), y AS (...), z AS (...) statement.

Comments

  • jfranken
    edited April 17, 2020 #2
The attached example shows an option that will work well for this scenario, assuming I understand your requirements. Here are the steps:
    • Create one data set for all of the data
    • Add a table to the report
    • Add a group on the table
    • Add an aggregation element in the table footer to calculate the total for the group
    • Edit the table group: set the filter to "Top n", where the column is the aggregation and n = 10
    • Set the group sort to descending
    • Add a group footer row and insert a chart element
    • Set the chart to inherit data from the table, set the x and y axes, and save the chart
    • Select the table detail row, go to Properties and turn off its visibility
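
    If you'd rather push the Top 10 into Oracle instead of using the table group filter, an analytic function works too. This is just a sketch - it assumes a combined result set (here the placeholder your_result_set) with the warehouse NAME and the transaction count renamed to CNT, since COUNT clashes with the function name:

    select NAME, CNT
    from (
      select NAME, CNT, ROW_NUMBER() OVER (order by CNT DESC) as RN
      from your_result_set  --placeholder for the combined warehouse query
    )
    where RN <= 10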

  • Hi Jeff, thanks for the reply and the aggregate table. My requirements of a) group by country, b) count total transactions and c) top ten warehouses are actually three separate requirements - I need each of a, b and c on three different charts.

    For reference, this is the un-simplified SQL query:
    
    WITH
    params AS
    (SELECT ? AS inputDate FROM dual),
    
    
    picking_t AS 
    (
    /*PICKING*/
    select whseid, count(itrnkey) as "COUNT" from wmwhse10.itrn, params where inputDate = params.inputDate and sourcetype='PICKING' and trantype='MV' and TOLOC='PICKTO' and to_char(adddate,'yyyyMMdd') = inputDate group by whseid   
    union select whseid, count(itrnkey) as "COUNT" from wmwhse100.itrn, params where inputDate = params.inputDate and sourcetype='PICKING' and trantype='MV' and TOLOC='PICKTO' and to_char(adddate,'yyyyMMdd') = inputDate group by whseid
    .
    .
    union select whseid, count(itrnkey) as "COUNT" from wmwhse99.itrn, params where inputDate = params.inputDate and sourcetype='PICKING' and trantype='MV' and TOLOC='PICKTO' and to_char(adddate,'yyyyMMdd') = inputDate group by whseid
    ),
    
    name_t AS
    (select db_logid,db_alias from  wmsadmin.pl_db where db_alias not like '%BPV' and  db_alias not like 'PRDO%'),
    
    full_picking_t AS
    (
    SELECT picking_t.COUNT, name_t.db_alias as NAME, SUBSTR(name_t.db_alias, 1, 2) AS COUNTRY
    FROM picking_t
    LEFT JOIN name_t
    ON picking_t.whseid=UPPER(name_t.db_logid)
    ),
    
    ------------------------------------------------------------------------------
    -- full_picking_t == PICKING COUNT for each warehouse NAME and its COUNTRY
    ------------------------------------------------------------------------------
    
    TOPTEN_PICKING AS
    (
    select * from
    (
    select COUNT, NAME from full_picking_t order by COUNT DESC
    ) WHERE ROWNUM <= 10
    ),
    
    
    COUNTRY_PICKING AS
    (
    select SUM(COUNT) as COUNT, COUNTRY from full_picking_t
    GROUP BY COUNTRY
    ),
    
    
    TOTAL_PICKING AS
    (
    select SUM(COUNT) as COUNT, 'PICKING' AS ACTIVITY from picking_t, dual
    )
    
    
    --------------------------------------------------------------------------------
    
    
    -- (with the split-data-set approach below, one of these is uncommented
    --  as each data set's final SELECT)
    --select * from TOPTEN_PICKING
    --select * from COUNTRY_PICKING
    --select * from TOTAL_PICKING
    
    Requirement A) group by country is fulfilled by COUNTRY_PICKING,
    Requirement B) count total transactions is fulfilled by TOTAL_PICKING,
    Requirement C) top ten warehouses is fulfilled by TOPTEN_PICKING


    My issue now is deciding between
    1) having the above dataset for all 3 charts, or
    2) splitting the above dataset into three separate datasets, i.e.
      - dataset 1: params + picking_t + name_t + full_picking_t + COUNTRY_PICKING for requirement A
      - dataset 2: params + picking_t + name_t + full_picking_t + TOTAL_PICKING for requirement B
      - dataset 3: params + picking_t + name_t + full_picking_t + TOPTEN_PICKING for requirement C
    and binding the 3 charts to their respective dataset


    and seeing which method, #1 or #2, is faster, because currently I'm using #2 and the report takes 10+ minutes to run. My question is hence more of a theoretical one: which method of saving and running the data sets is more time-efficient?

  • Hi Bryan,

    By default, data sets cache the data and use the same data for all bindings in the report.  There is a setting in the data set advanced properties called "Needs cache for data-engine" that controls the caching.  I recommend leaving it set to true so that the data set only retrieves the data once.

    One way to see how much time is being taken in each data set is to add debug timing statements.  Here are the instructions:
    1. Start the designer using eclipsec.exe instead of eclipse.exe.  This will open a separate console window.
    2. In the beforeOpen event of the first data set, add the code:  java.lang.System.out.println("beforeOpen of data set 1: " + BirtDateTime.now());
    3. Add a similar statement in the data set's beforeClose event, changing the label, e.g. java.lang.System.out.println("beforeClose of data set 1: " + BirtDateTime.now());
    4. Do the same for the other data sets, changing the label each time.
    5. Run the report and look at the console window.  It will show the start and end time for each data set and you will be able to see how many times they run.
    Also, in the home install directory of the designer, there is a file called eclipse.ini. One of the lines, probably the last line in the file, will be something like:
    -Xmx1024m
    It specifies the maximum memory that eclipse can use, in this case 1024 MB. You could be hitting that limit when retrieving the data; increasing the number (e.g. -Xmx2048m) and restarting the designer might improve performance. Going too high could impact your OS and other apps.

    Another good test is to run your queries from the designer machine, but outside of the designer using a query tool.  How long does it take to retrieve the data outside the designer but on the same machine?
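
    For example, in SQL*Plus (SQL Developer's script runner honors the same command):

    SET TIMING ON
    --then paste the full WITH ... query, ending in one of its final SELECTs,
    --and note the elapsed time that SQL*Plus prints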

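    One more SQL-side check: every branch of picking_t already aggregates to one row per warehouse, so a plain UNION makes Oracle sort and de-duplicate across all ~100 branches. Assuming no two warehouse schemas can return an identical (whseid, count) row, UNION ALL should give the same rows without that sort, for example:

    select whseid, count(itrnkey) as "COUNT"
    from wmwhse10.itrn, params
    where sourcetype='PICKING' and trantype='MV' and TOLOC='PICKTO'
    and to_char(adddate,'yyyyMMdd') = params.inputDate
    group by whseid
    union all
    select whseid, count(itrnkey) as "COUNT"
    from wmwhse100.itrn, params
    where sourcetype='PICKING' and trantype='MV' and TOLOC='PICKTO'
    and to_char(adddate,'yyyyMMdd') = params.inputDate
    group by whseid
    --...and so on, with UNION ALL between each remaining warehouse schema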