partition outer join in oracle 10g
Partition outer join is a new mechanism in 10g to "invent" data to fill the gaps in non-contiguous results. In 10g there are many methods to deal with such a problem (including the awe-inspiring, but equally terrifying, MODEL clause). In older versions of Oracle, "data-densification" was not as simple and certainly less efficient than it has now become.
the problem
This article has been motivated by a response I gave to a problem raised on an Oracle developer forum. Our requirement is to produce a report that details customer spending for each month of the year. Our database only records actual spend, so for any given month, data for dormant or idle customers will have to be generated.
setup
First, we'll create a mock CUSTOMER_ORDERS table with sparse data to represent customer spending. To keep the example simple, we'll denormalise the customer name onto the orders table.
SQL> CREATE TABLE customer_orders (name, dt, amt) 2 AS 3 SELECT * 4 FROM ( 5 SELECT owner 6 , TRUNC(created) + MOD(ROWNUM,6) 7 , TRUNC(object_id/ROWNUM) 8 FROM all_objects 9 WHERE created > TRUNC(SYSDATE,'YEAR') 10 AND owner IN ('ORDSYS','WKSYS') 11 ORDER BY 12 DBMS_RANDOM.RANDOM 13 ) 14 WHERE ROWNUM <= 1000;
Table created.
a sparse report
With our customer orders data as sparse as it is, a monthly report for purchases by customer would look as follows.
SQL> SELECT name 2 , TO_CHAR(dt,'YYYYMM') AS year_month 3 , SUM(amt) AS total_amt 4 FROM customer_orders 5 GROUP BY 6 name 7 , TO_CHAR(dt,'YYYYMM') 8 ORDER BY 2, 1;
NAME YEAR_MONTH TOTAL_AMT --------------- ------------ ---------- ORDSYS 200405 3668 WKSYS 200405 93 ORDSYS 200406 21427 WKSYS 200406 371 ORDSYS 200407 156864 WKSYS 200407 2163 6 rows selected.
adding the missing months
We can see from the data that we are missing most months of the year for our two customers. Remember that our requirement is to show a report for every month in 2004 for every customer. First we will build a "time dimension" set (using subquery factoring) and outer join it to our orders table.
SQL> WITH year_months AS ( 2 SELECT TO_CHAR( 3 ADD_MONTHS( 4 DATE '2004-01-01', ROWNUM-1 ), 'YYYYMM' ) AS year_month 5 FROM dual 6 CONNECT BY ROWNUM < 12 7 ) 8 SELECT co.name 9 , ym.year_month 10 , NVL(SUM(co.amt),0) AS total_amount 11 FROM year_months ym 12 LEFT OUTER JOIN 13 customer_orders co 14 ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month) 15 GROUP BY 16 co.name 17 , ym.year_month 18 ORDER BY 19 co.name 20 , ym.year_month;
NAME YEAR_MONTH TOTAL_AMOUNT --------------- ------------ ------------ ORDSYS 200405 3668 ORDSYS 200406 21427 ORDSYS 200407 156864 WKSYS 200405 93 WKSYS 200406 371 WKSYS 200407 2163 200401 0 200402 0 200403 0 200404 0 200408 0 200409 0 200410 0 200411 0 200412 0 15 rows selected.
We can see that this hasn't quite worked. We have the zero sums and the year-months, but we are missing customer names. This is because we outer joined to CUSTOMER_ORDERS on the year-months, so any customer columns would show as NULL for deficient rows. Until PARTITION OUTER JOIN appeared in Oracle 10g, we couldn't "invent" data easily, though the next section shows that it is possible in prior versions.
data-densification without partition outer join
We can now take the previous example further and generate the missing customer names for our report. We need to create a set of distinct customer names in addition to the year-months that we generated by subquery. We can combine the two generated "dimensions" (name and year-month) to drive the report as follows.
SQL> WITH year_months AS ( 2 SELECT TO_CHAR( 3 ADD_MONTHS( 4 DATE '2004-01-01', ROWNUM-1 ), 'YYYYMM' ) AS year_month 5 FROM dual 6 CONNECT BY ROWNUM < 12 7 ) 8 , customer_names AS ( 9 SELECT DISTINCT name 10 FROM customer_orders 11 ) 12 SELECT sq.name 13 , sq.year_month 14 , NVL(co.total_amount,0) AS total_amount 15 FROM ( 16 SELECT name 17 , TO_CHAR(dt,'YYYYMM') AS year_month 18 , SUM(amt) AS total_amount 19 FROM customer_orders 20 GROUP BY 21 name 22 , TO_CHAR(dt,'YYYYMM') 23 ) co 24 , ( 25 SELECT * 26 FROM customer_names 27 , year_months 28 ) sq 29 WHERE sq.year_month = co.year_month (+) 30 AND sq.name = co.name (+) 31 ORDER BY 32 sq.name 33 , sq.year_month;
NAME YEAR_MONTH TOTAL_AMOUNT --------------- ------------ ------------ ORDSYS 200401 0 ORDSYS 200402 0 ORDSYS 200403 0 ORDSYS 200404 0 ORDSYS 200405 3668 ORDSYS 200406 21427 ORDSYS 200407 156864 ORDSYS 200408 0 ORDSYS 200409 0 ORDSYS 200410 0 ORDSYS 200411 0 ORDSYS 200412 0 WKSYS 200401 0 WKSYS 200402 0 WKSYS 200403 0 WKSYS 200404 0 WKSYS 200405 93 WKSYS 200406 371 WKSYS 200407 2163 WKSYS 200408 0 WKSYS 200409 0 WKSYS 200410 0 WKSYS 200411 0 WKSYS 200412 0 24 rows selected.
We can see that this has now become quite complicated and unintuitive. Using the generated dimensions, we built a cartesian product of our range of months and customer names to derive the full range of dates per customer. We then outer joined this cartesian product to CUSTOMER_ORDERS to fill in the gaps for our report. We can imagine the potential cost of such a method if our CUSTOMER_ORDERS table had millions of rows and we had to generate a distinct set of names and then join this back to the table.
In the following section, we'll see just how much simpler the PARTITION OUTER JOIN solution is.
partition outer join
In 10g, the PARTITION OUTER JOIN enables us to avoid the second access of CUSTOMER_ORDERS, and makes the report much easier to build. We can see this with a re-write of the previous example.
SQL> WITH year_months AS ( 2 SELECT TO_CHAR( 3 ADD_MONTHS( 4 DATE '2004-01-01', ROWNUM-1), 'YYYYMM') AS year_month 5 FROM dual 6 CONNECT BY ROWNUM < 12 7 ) 8 SELECT co.name 9 , ym.year_month 10 , NVL(SUM(co.amt),0) AS total_amount 11 FROM year_months ym 12 LEFT OUTER JOIN 13 customer_orders co PARTITION BY (co.name) 14 ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month) 15 GROUP BY 16 co.name 17 , ym.year_month 18 ORDER BY 19 co.name 20 , ym.year_month;
NAME YEAR_MONTH TOTAL_AMOUNT --------------- ------------ ------------ ORDSYS 200401 0 ORDSYS 200402 0 ORDSYS 200403 0 ORDSYS 200404 0 ORDSYS 200405 3668 ORDSYS 200406 21427 ORDSYS 200407 156864 ORDSYS 200408 0 ORDSYS 200409 0 ORDSYS 200410 0 ORDSYS 200411 0 ORDSYS 200412 0 WKSYS 200401 0 WKSYS 200402 0 WKSYS 200403 0 WKSYS 200404 0 WKSYS 200405 93 WKSYS 200406 371 WKSYS 200407 2163 WKSYS 200408 0 WKSYS 200409 0 WKSYS 200410 0 WKSYS 200411 0 WKSYS 200412 0 24 rows selected.
This is so much easier! The PARTITION BY clause on the outer join has enabled us to split our deficient data into groups and outer join to each group. In the context of our example, this means that Oracle knows the customer name for each group that we are outer joining to. Hence we did not need to generate a separate set of names and this in turn keeps our resource usage to a minimum.
If we fetch the execution plan for the previous query using DBMS_XPLAN.DISPLAY_CURSOR, we can see the presence of the PARTITION OUTER JOIN as follows.
SQL> SELECT plan_table_output 2 FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);
PLAN_TABLE_OUTPUT ------------------------------------------------------------------------------------------------------ SQL_ID 2w0usftp0k4un, child number 0 ------------------------------------- WITH year_months AS ( SELECT TO_CHAR( ADD_MONTHS( DATE '2004-01-01', ROWNUM-1), 'YYYYMM') AS year_month FROM dual CONNECT BY ROWNUM < 12 ) SELECT co.name , ym.year_month , NVL(SUM(co.amt),0) AS total_amount FROM year_months ym LEFT OUTER JOIN customer_orders co PARTITION BY (co.name) ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month) GROUP BY co.name , ym.year_month ORDER BY co.name , ym.year_month Plan hash value: 2660043991 ------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | | | 7 (100)| | | 1 | SORT GROUP BY | | 1 | 79 | 6 (34)| 00:00:01 | | 2 | VIEW | | 1 | 79 | 5 (20)| 00:00:01 | | 3 | NESTED LOOPS PARTITION OUTER | | 1 | 44 | 5 (20)| 00:00:01 | | 4 | BUFFER SORT | | | | | | | 5 | VIEW | | 1 | 5 | 2 (0)| 00:00:01 | | 6 | COUNT | | | | | | | 7 | CONNECT BY WITHOUT FILTERING| | | | | | | 8 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 | |* 9 | FILTER | | | | | | | 10 | SORT PARTITION JOIN | | 10 | 390 | 3 (34)| 00:00:01 | | 11 | TABLE ACCESS FULL | CUSTOMER_ORDERS | 10 | 390 | 2 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 9 - filter("YM"."YEAR_MONTH"=TO_CHAR("CO"."DT",'YYYYMM')) Note ----- - dynamic sampling used for this statement 37 rows selected.
The effects of the PARTITION OUTER JOIN can be seen in steps 3 and 10. Step 10 is where Oracle sorts the customer data to partition by name and step 3 is where the partitions are outer joined to the year-month data. Note that the CONNECT BY steps are a result of the method we chose to generate the time dimension data.
further reading
For more information on PARTITION OUTER JOIN and how it can be applied, see the online documentation. Jonathan Gennick has written about this feature in a wider article on the MODEL clause in Oracle Magazine and this thread on Ask Tom shows the original forum problem.
source code
The source code for the examples in this article can be downloaded from here.
Adrian Billington, June 2004
Back to Top