partition outer join in oracle 10g

Partition outer join is a new mechanism in 10g to "invent" data to fill the gaps in non-contiguous results. In 10g there are many methods to deal with such a problem (including the awe-inspiring, but equally terrifying, MODEL clause). In older versions of Oracle, "data-densification" was not as simple and certainly less efficient than it has now become.

the problem

This article has been motivated by a response I gave to a problem raised on an Oracle developer forum. Our requirement is to produce a report that details customer spending for each month of the year. Our database only records actual spend, so for any given month, data for dormant or idle customers will have to be generated.

setup

First, we'll create a mock CUSTOMER_ORDERS table with sparse data to represent customer spending. To keep the example simple, we'll denormalise the customer name onto the orders table.

SQL> CREATE TABLE customer_orders (name, dt, amt)
  2  AS
  3     SELECT *
  4     FROM  (
  5            SELECT owner
  6            ,      TRUNC(created) + MOD(ROWNUM,6)
  7            ,      TRUNC(object_id/ROWNUM)
  8            FROM   all_objects
  9            WHERE  created > TRUNC(SYSDATE,'YEAR')
 10            AND    owner IN ('ORDSYS','WKSYS')
 11            ORDER  BY
 12                   DBMS_RANDOM.RANDOM
 13           )
 14     WHERE  ROWNUM <= 1000;

Table created.

a sparse report

With our customer orders data as sparse as it is, a monthly report for purchases by customer would look as follows.

SQL> SELECT name
  2  ,      TO_CHAR(dt,'YYYYMM') AS year_month
  3  ,      SUM(amt)             AS total_amt
  4  FROM   customer_orders
  5  GROUP  BY
  6         name
  7  ,      TO_CHAR(dt,'YYYYMM')
  8  ORDER  BY 2, 1;

NAME            YEAR_MONTH    TOTAL_AMT
--------------- ------------ ----------
ORDSYS          200405             3668
WKSYS           200405               93
ORDSYS          200406            21427
WKSYS           200406              371
ORDSYS          200407           156864
WKSYS           200407             2163

6 rows selected.

adding the missing months

We can see from the data that we are missing most months of the year for our two customers. Remember that our requirement is to show a report for every month in 2004 for every customer. First we will build a "time dimension" set (using subquery factoring) and outer join it to our orders table.

SQL> WITH year_months AS (
  2         SELECT TO_CHAR(
  3                   ADD_MONTHS(
  4                      DATE '2004-01-01', ROWNUM-1 ), 'YYYYMM' ) AS year_month
  5         FROM   dual
  6         CONNECT BY ROWNUM < 12
  7         )
  8  SELECT co.name
  9  ,      ym.year_month
 10  ,      NVL(SUM(co.amt),0) AS total_amount
 11  FROM   year_months ym
 12         LEFT OUTER JOIN
 13         customer_orders co
 14         ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month)
 15  GROUP  BY
 16         co.name
 17  ,      ym.year_month
 18  ORDER  BY
 19         co.name
 20  ,      ym.year_month;

NAME            YEAR_MONTH   TOTAL_AMOUNT
--------------- ------------ ------------
ORDSYS          200405               3668
ORDSYS          200406              21427
ORDSYS          200407             156864
WKSYS           200405                 93
WKSYS           200406                371
WKSYS           200407               2163
                200401                  0
                200402                  0
                200403                  0
                200404                  0
                200408                  0
                200409                  0
                200410                  0
                200411                  0
                200412                  0

15 rows selected.

We can see that this hasn't quite worked. We have the zero sums and the year-months, but we are missing customer names. This is because we outer joined to CUSTOMER_ORDERS on the year-months, so any customer columns would show as NULL for deficient rows. Until PARTITION OUTER JOIN appeared in Oracle 10g, we couldn't "invent" data easily, though the next section shows that it is possible in prior versions.

data-densification without partition outer join

We can now take the previous example further and generate the missing customer names for our report. We need to create a set of distinct customer names in addition to the year-months that we generated by subquery. We can combine the two generated "dimensions" (name and year-month) to drive the report as follows.

SQL> WITH year_months AS (
  2          SELECT TO_CHAR(
  3                    ADD_MONTHS(
  4                       DATE '2004-01-01', ROWNUM-1 ), 'YYYYMM' ) AS year_month
  5          FROM   dual
  6          CONNECT BY ROWNUM < 12
  7          )
  8  ,    customer_names AS (
  9          SELECT DISTINCT name
 10          FROM   customer_orders
 11          )
 12  SELECT sq.name
 13  ,      sq.year_month
 14  ,      NVL(co.total_amount,0) AS total_amount
 15  FROM  (
 16         SELECT name
 17         ,      TO_CHAR(dt,'YYYYMM') AS year_month
 18         ,      SUM(amt)             AS total_amount
 19         FROM   customer_orders
 20         GROUP  BY
 21                name
 22         ,      TO_CHAR(dt,'YYYYMM')
 23        ) co
 24  ,    (
 25        SELECT *
 26        FROM   customer_names
 27        ,      year_months
 28       ) sq
 29  WHERE  sq.year_month = co.year_month (+)
 30  AND    sq.name = co.name (+)
 31  ORDER  BY
 32         sq.name
 33  ,      sq.year_month;

NAME            YEAR_MONTH   TOTAL_AMOUNT
--------------- ------------ ------------
ORDSYS          200401                  0
ORDSYS          200402                  0
ORDSYS          200403                  0
ORDSYS          200404                  0
ORDSYS          200405               3668
ORDSYS          200406              21427
ORDSYS          200407             156864
ORDSYS          200408                  0
ORDSYS          200409                  0
ORDSYS          200410                  0
ORDSYS          200411                  0
ORDSYS          200412                  0
WKSYS           200401                  0
WKSYS           200402                  0
WKSYS           200403                  0
WKSYS           200404                  0
WKSYS           200405                 93
WKSYS           200406                371
WKSYS           200407               2163
WKSYS           200408                  0
WKSYS           200409                  0
WKSYS           200410                  0
WKSYS           200411                  0
WKSYS           200412                  0

24 rows selected.

We can see that this has now become quite complicated and unintuitive. Using the generated dimensions, we built a cartesian product of our range of months and customer names to derive the full range of dates per customer. We then outer joined this cartesian product to CUSTOMER_ORDERS to fill in the gaps for our report. We can imagine the potential cost of such a method if our CUSTOMER_ORDERS table had millions of rows and we had to generate a distinct set of names and then join this back to the table.

In the following section, we'll see just how much simpler the PARTITION OUTER JOIN solution is.

partition outer join

In 10g, the PARTITION OUTER JOIN enables us to avoid the second access of CUSTOMER_ORDERS, and makes the report much easier to build. We can see this with a re-write of the previous example.

SQL> WITH year_months AS (
  2         SELECT TO_CHAR(
  3                   ADD_MONTHS(
  4                      DATE '2004-01-01', ROWNUM-1), 'YYYYMM') AS year_month
  5         FROM   dual
  6         CONNECT BY ROWNUM < 12
  7         )
  8  SELECT co.name
  9  ,      ym.year_month
 10  ,      NVL(SUM(co.amt),0) AS total_amount
 11  FROM   year_months ym
 12         LEFT OUTER JOIN
 13         customer_orders co PARTITION BY (co.name)
 14         ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month)
 15  GROUP  BY
 16         co.name
 17  ,      ym.year_month
 18  ORDER  BY
 19         co.name
 20  ,      ym.year_month;

NAME            YEAR_MONTH   TOTAL_AMOUNT
--------------- ------------ ------------
ORDSYS          200401                  0
ORDSYS          200402                  0
ORDSYS          200403                  0
ORDSYS          200404                  0
ORDSYS          200405               3668
ORDSYS          200406              21427
ORDSYS          200407             156864
ORDSYS          200408                  0
ORDSYS          200409                  0
ORDSYS          200410                  0
ORDSYS          200411                  0
ORDSYS          200412                  0
WKSYS           200401                  0
WKSYS           200402                  0
WKSYS           200403                  0
WKSYS           200404                  0
WKSYS           200405                 93
WKSYS           200406                371
WKSYS           200407               2163
WKSYS           200408                  0
WKSYS           200409                  0
WKSYS           200410                  0
WKSYS           200411                  0
WKSYS           200412                  0

24 rows selected.

This is so much easier! The PARTITION BY clause on the outer join has enabled us to split our deficient data into groups and outer join to each group. In the context of our example, this means that Oracle knows the customer name for each group that we are outer joining to. Hence we did not need to generate a separate set of names and this in turn keeps our resource usage to a minimum.

If we fetch the execution plan for the previous query using DBMS_XPLAN.DISPLAY_CURSOR, we can see the presence of the PARTITION OUTER JOIN as follows.

SQL> SELECT plan_table_output
  2  FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------
SQL_ID  2w0usftp0k4un, child number 0
-------------------------------------
WITH year_months AS (        SELECT TO_CHAR(                  ADD_MONTHS(
DATE '2004-01-01', ROWNUM-1), 'YYYYMM') AS year_month        FROM   dual        CONNECT BY
ROWNUM < 12        ) SELECT co.name ,      ym.year_month ,      NVL(SUM(co.amt),0) AS
total_amount FROM   year_months ym        LEFT OUTER JOIN        customer_orders co PARTITION
BY (co.name)        ON (TO_CHAR(co.dt,'YYYYMM') = ym.year_month) GROUP  BY        co.name ,
   ym.year_month ORDER  BY        co.name ,      ym.year_month

Plan hash value: 2660043991

------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                 |       |       |     7 (100)|          |
|   1 |  SORT GROUP BY                     |                 |     1 |    79 |     6  (34)| 00:00:01 |
|   2 |   VIEW                             |                 |     1 |    79 |     5  (20)| 00:00:01 |
|   3 |    NESTED LOOPS PARTITION OUTER    |                 |     1 |    44 |     5  (20)| 00:00:01 |
|   4 |     BUFFER SORT                    |                 |       |       |            |          |
|   5 |      VIEW                          |                 |     1 |     5 |     2   (0)| 00:00:01 |
|   6 |       COUNT                        |                 |       |       |            |          |
|   7 |        CONNECT BY WITHOUT FILTERING|                 |       |       |            |          |
|   8 |         FAST DUAL                  |                 |     1 |       |     2   (0)| 00:00:01 |
|*  9 |     FILTER                         |                 |       |       |            |          |
|  10 |      SORT PARTITION JOIN           |                 |    10 |   390 |     3  (34)| 00:00:01 |
|  11 |       TABLE ACCESS FULL            | CUSTOMER_ORDERS |    10 |   390 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   9 - filter("YM"."YEAR_MONTH"=TO_CHAR("CO"."DT",'YYYYMM'))

Note
-----
   - dynamic sampling used for this statement

37 rows selected.

The effects of the PARTITION OUTER JOIN can be seen in steps 3 and 10. Step 10 is where Oracle sorts the customer data to partition by name and step 3 is where the partitions are outer joined to the year-month data. Note that the CONNECT BY steps are a result of the method we chose to generate the time dimension data.

source code

The source code for the examples in this article can be downloaded from here.

Adrian Billington, June 2004

oracle-developer.net