collect enhancements in 11g release 2

The COLLECT aggregate function was introduced in Oracle 10g and enables us to aggregate data into a collection in SQL. Oracle 11g Release 2 includes two enhancements to this function: the ability to order and de-duplicate data as it is aggregated. In this article, we will examine these new features and demonstrate a way to replicate their functionality in 10g and 11g Release 1.

a brief reminder

Before we demonstrate the COLLECT function's 11.2 new features, we will remind ourselves of its original 10g functionality. We will begin by querying the EMP table and aggregating the employee names into a collection for each department, as follows.

SQL> break on deptno skip 1

SQL> SELECT deptno
  2  ,      COLLECT(ename) AS emps
  3  FROM   emp
  4  GROUP  BY
  5         deptno;

  DEPTNO EMPS
------- ------------------------------------------------------------------------------------
     10 SYSTPd/UcOeePQOrgQKjAAjgR1A==('CLARK', 'MILLER', 'KING')

     20 SYSTPd/UcOeePQOrgQKjAAjgR1A==('SMITH', 'FORD', 'ADAMS', 'SCOTT', 'JONES')

     30 SYSTPd/UcOeePQOrgQKjAAjgR1A==('ALLEN', 'JAMES', 'TURNER', 'BLAKE', 'MARTIN', 'WARD')

3 rows selected.

Using COLLECT, we have aggregated a single collection per department as required but the nested table type is system-generated and not very usable. By using our own collection type, we can make this much more usable and flexible. First, we will create a general collection type, as follows.

SQL> CREATE TYPE varchar2_ntt AS TABLE OF VARCHAR2(4000);
  2  /

Type created.

We will now repeat our query but this time CAST the system-generated collection type to our own user-defined type, as follows.

SQL> SELECT deptno
  2  ,      CAST(COLLECT(ename) AS varchar2_ntt) AS emps
  3  FROM   emp
  4  GROUP  BY
  5         deptno;

    DEPTNO EMPS
---------- -------------------------------------------------------------------
        10 VARCHAR2_NTT('CLARK', 'MILLER', 'KING')

        20 VARCHAR2_NTT('SMITH', 'FORD', 'ADAMS', 'SCOTT', 'JONES')

        30 VARCHAR2_NTT('ALLEN', 'JAMES', 'TURNER', 'BLAKE', 'MARTIN', 'WARD')


3 rows selected.

Having cast the collection in this way, it is now available to use throughout our application (i.e. wherever a parameter, variable or column of VARCHAR2_NTT is required). This summarises, in general, how we use the COLLECT function.

This is a short reminder of the COLLECT function's 10g usage. Readers who are not familiar with the function are advised to read this oracle-developer.net article for further discussion and detail on the concepts described above.

new feature: ordering collection elements

Sometimes we might wish to populate our collection with elements in a specific order. From 11g Release 2, COLLECT officially supports the ordering of elements during aggregation. Note that we use the term "officially supports" because this has been a hidden (and therefore unsupported) feature of COLLECT since its introduction in 10g Release 1. It is only with the release of 11g Release 2 that this option has been documented.

With this in mind, in the following example we will collect the employee names in alphabetical order by using the new extended syntax.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(ename ORDER BY ename)
  4               AS varchar2_ntt) AS ordered_emps
  5  FROM   emp
  6  GROUP  BY
  7         deptno;

    DEPTNO ORDERED_EMPS
---------- -------------------------------------------------------------------
        10 VARCHAR2_NTT('CLARK', 'KING', 'MILLER')

        20 VARCHAR2_NTT('ADAMS', 'FORD', 'JONES', 'SCOTT', 'SMITH')

        30 VARCHAR2_NTT('ALLEN', 'BLAKE', 'JAMES', 'MARTIN', 'TURNER', 'WARD')


3 rows selected.

We can see that the collection elements are now in ENAME order. Of course, we don't need to order the collection according to the element we are collecting. We can order our set using other columns or expressions. In the following example, we will collect and order the employees according to their hiredates.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(ename ORDER BY hiredate)
  4               AS varchar2_ntt) AS ordered_emps
  5  FROM   emp
  6  GROUP  BY
  7         deptno;

    DEPTNO ORDERED_EMPS
---------- ----------------------------------------------------------------------
        10 VARCHAR2_NTT('CLARK', 'KING', 'MILLER')

        20 VARCHAR2_NTT('SMITH', 'JONES', 'FORD', 'SCOTT', 'ADAMS')

        30 VARCHAR2_NTT('ALLEN', 'WARD', 'BLAKE', 'TURNER', 'MARTIN', 'JAMES')


3 rows selected.

We can verify these results using a simple query of the EMP table, as follows.

SQL> break on deptno skip 1

SQL> SELECT deptno
  2  ,      ename
  3  ,      hiredate
  4  FROM   emp
  5  ORDER  BY
  6         deptno
  7  ,      hiredate;

    DEPTNO ENAME      HIREDATE
---------- ---------- -----------
        10 CLARK      09/06/1981 
           KING       17/11/1981 
           MILLER     23/01/1982 

        20 SMITH      17/12/1980 
           JONES      02/04/1981 
           FORD       03/12/1981 
           SCOTT      09/12/1982 
           ADAMS      12/01/1983 

        30 ALLEN      20/02/1981 
           WARD       22/02/1981 
           BLAKE      01/05/1981 
           TURNER     08/09/1981 
           MARTIN     28/09/1981 
           JAMES      03/12/1981 

14 rows selected.

new feature: distinct collection elements

In addition to ordered elements, we can also request distinct collection elements in 11.2 (note that this option is also available unsupported in earlier releases). Before we demonstrate this extension, we will populate non-unique collections of jobs from our EMP table for reference, as follows.

SQL> SELECT deptno
  2  ,      CAST(COLLECT(job) AS varchar2_ntt) AS jobs
  3  FROM   emp
  4  GROUP  BY
  5         deptno;

 DEPTNO JOBS
------- --------------------------------------------------------------------------------
     10 VARCHAR2_NTT('MANAGER', 'CLERK', 'PRESIDENT')

     20 VARCHAR2_NTT('CLERK', 'ANALYST', 'CLERK', 'ANALYST', 'MANAGER')

     30 VARCHAR2_NTT('SALESMAN', 'CLERK', 'SALESMAN', 'MANAGER', 'SALESMAN', 'SALESMAN')

3 rows selected.

We can see repeating elements in two of our collections, which we can de-duplicate with the COLLECT extended syntax as follows.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(DISTINCT job)
  4               AS varchar2_ntt) AS distinct_jobs
  5  FROM   emp
  6  GROUP  BY
  7         deptno;

    DEPTNO DISTINCT_JOBS
---------- --------------------------------------------------
        10 VARCHAR2_NTT('CLERK', 'MANAGER', 'PRESIDENT')

        20 VARCHAR2_NTT('ANALYST', 'CLERK', 'MANAGER')

        30 VARCHAR2_NTT('CLERK', 'MANAGER', 'SALESMAN')

3 rows selected.

As an alternative, we can use the UNIQUE keyword (this is synonymous with DISTINCT in the same way as SELECT UNIQUE is synonymous with SELECT DISTINCT in SQL).

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(UNIQUE job)
  4               AS varchar2_ntt) AS distinct_jobs
  5  FROM   emp
  6  GROUP  BY
  7         deptno;

    DEPTNO DISTINCT_JOBS
---------- --------------------------------------------------
        10 VARCHAR2_NTT('CLERK', 'MANAGER', 'PRESIDENT')

        20 VARCHAR2_NTT('ANALYST', 'CLERK', 'MANAGER')

        30 VARCHAR2_NTT('CLERK', 'MANAGER', 'SALESMAN')

3 rows selected.

As an aside (and to continue the SQL comparison), the non-unique alternative is ALL, which we can use in our COLLECT syntax, as follows.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(ALL job)
  4               AS varchar2_ntt) AS distinct_jobs
  5  FROM   emp
  6  GROUP  BY
  7         deptno;

    DEPTNO DISTINCT_JOBS
---------- -----------------------------------------------------------------------------------
        10 VARCHAR2_NTT('MANAGER', 'CLERK', 'PRESIDENT')

        20 VARCHAR2_NTT('CLERK', 'ANALYST', 'CLERK', 'ANALYST', 'MANAGER')

        30 VARCHAR2_NTT('SALESMAN', 'CLERK', 'SALESMAN', 'MANAGER', 'SALESMAN', 'SALESMAN')

3 rows selected.

Collecting all elements is, of course, the default behaviour and is implicit in the same way that SELECT ALL is implied in SQL.

combining both new features

The syntax diagram in the documentation suggests that both extensions can be combined to build ordered, distinct collections for each group. However, in version 11.2.0.1, this doesn't appear to work. In the following query, we will attempt to collect the distinct jobs in each department and then order them.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(
  4               DISTINCT job
  5               ORDER BY job
  6               ) AS varchar2_ntt) AS distinct_ordered_jobs
  7  FROM   emp
  8  GROUP  BY
  9         deptno;

    DEPTNO DISTINCT_ORDERED_JOBS
---------- --------------------------------------------------------------------------------
        10 VARCHAR2_NTT('CLERK', 'MANAGER', 'PRESIDENT')

        20 VARCHAR2_NTT('ANALYST', 'ANALYST', 'CLERK', 'CLERK', 'MANAGER')

        30 VARCHAR2_NTT('CLERK', 'MANAGER', 'SALESMAN', 'SALESMAN', 'SALESMAN', 'SALESMAN')

3 rows selected.

We can see duplicate elements in our ordered collections so something is clearly not working as of 11.2.0.1.

using complex types

Our examples so far have used a simple scalar collection type. We can also use COLLECT to aggregate collections of more complex object types and perform the new operations on them. We will see an example of this below. First we will create an object type based on a couple of the attributes of the EMP table, as follows.

SQL> CREATE TYPE empsal_ot AS OBJECT
  2  ( ename VARCHAR2(30)
  3  , sal   NUMBER
  4  );
  5  /

Type created.

We also need a collection type of this object, which we create as follows.

SQL> CREATE TYPE empsal_ntt AS TABLE OF empsal_ot;
  2  /

Type created.

As with our previous examples, it is simple to COLLECT data of this type, as the following query demonstrates.

SQL> col empsals format a50 wrap
SQL> break on deptno skip 1

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(
  4               empsal_ot(ename, sal)
  5               ) AS empsal_ntt) AS empsals
  6  FROM   emp
  7  GROUP  BY
  8         deptno;

    DEPTNO EMPSALS(ENAME, SAL)
---------- --------------------------------------------------
        10 EMPSAL_NTT(EMPSAL_OT('CLARK', 2450), EMPSAL_OT('MI
           LLER', 1300), EMPSAL_OT('KING', 5000))


        20 EMPSAL_NTT(EMPSAL_OT('SMITH', 800), EMPSAL_OT('FOR
           D', 3000), EMPSAL_OT('ADAMS', 1100), EMPSAL_OT('SC
           OTT', 3000), EMPSAL_OT('JONES', 2975))


        30 EMPSAL_NTT(EMPSAL_OT('ALLEN', 1600), EMPSAL_OT('JA
           MES', 950), EMPSAL_OT('TURNER', 1500), EMPSAL_OT('
           BLAKE', 2850), EMPSAL_OT('MARTIN', 1250), EMPSAL_O
           T('WARD', 1250))

3 rows selected.

If we wish to order each collection by one of the input columns (in our case, either the ENAME or SAL column from EMP), we can use the new ordering extension to COLLECT, as below.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(
  4               empsal_ot(ename, sal) ORDER BY sal
  5               ) AS empsal_ntt) AS empsals
  6  FROM   emp
  7  GROUP  BY
  8         deptno;

    DEPTNO EMPSALS(ENAME, SAL)
---------- --------------------------------------------------
        10 EMPSAL_NTT(EMPSAL_OT('MILLER', 1300), EMPSAL_OT('C
           LARK', 2450), EMPSAL_OT('KING', 5000))


        20 EMPSAL_NTT(EMPSAL_OT('SMITH', 800), EMPSAL_OT('ADA
           MS', 1100), EMPSAL_OT('JONES', 2975), EMPSAL_OT('S
           COTT', 3000), EMPSAL_OT('FORD', 3000))


        30 EMPSAL_NTT(EMPSAL_OT('JAMES', 950), EMPSAL_OT('WAR
           D', 1250), EMPSAL_OT('MARTIN', 1250), EMPSAL_OT('T
           URNER', 1500), EMPSAL_OT('ALLEN', 1600), EMPSAL_OT
           ('BLAKE', 2850))

3 rows selected.

We can see that the elements of each collection are ordered according to salary. However, we can't order the collection by each entire object instance it contains, as the following query demonstrates.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(
  4               empsal_ot(ename, sal)
  5                  ORDER BY empsal_ot(ename, sal)
  6               ) AS empsal_ntt) AS empsals
  7  FROM   emp
  8  GROUP  BY
  9         deptno;
                ORDER BY empsal_ot(ename, sal)
                         *
ERROR at line 5:
ORA-22950: cannot ORDER objects without MAP or ORDER method

Oracle tells us that we need a MAP or ORDER method to be able to sort instances of the object type. The same is also true if we wish to use the new distinct option, as follows.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(
  4               DISTINCT empsal_ot(ename, sal)
  5               ) AS empsal_ntt) AS empsals
  6  FROM   emp
  7  GROUP  BY
  8         deptno;
             DISTINCT empsal_ot(ename, sal)
                      *
ERROR at line 4:
ORA-22950: cannot ORDER objects without MAP or ORDER method

This makes sense as we cannot expect Oracle to be able to order or compare multi-attribute instances of EMPSAL_OT unless we define the sort criteria. A MAP method (or alternatively an ORDER method) is required to tell Oracle how to sort or compare data of this type. We will therefore remove our example types and start again, as follows.

SQL> DROP TYPE empsal_ntt;

Type dropped.

SQL> DROP TYPE empsal_ot FORCE;

Type dropped.

SQL> CREATE TYPE empsal_ot AS OBJECT
  2  ( ename VARCHAR2(30)
  3  , sal   NUMBER
  4  , MAP MEMBER FUNCTION sort_key RETURN VARCHAR2
  5  );
  6  /

Type created.

Note the MAP method definition in our new EMPSAL_OT type specification. It returns a simple scalar value to represent a comparison key for the data held in the type. We will define the sort key in the type body below.

SQL> CREATE TYPE BODY empsal_ot AS
  2     MAP MEMBER FUNCTION sort_key RETURN VARCHAR2 IS
  3     BEGIN
  4        RETURN ename || TO_CHAR(sal,'fm0000');
  5     END;
  6  END;
  7  /

Type body created.

For the purposes of this example, we have defined our sort key to be a concatenation of the type attributes. To complete the setup, we will re-create our collection type, as follows.

SQL> CREATE TYPE empsal_ntt AS TABLE OF empsal_ot;
  2  /

Type created.

Now that Oracle has a method to sort and compare instances of EMPSAL_OT, we can retry our failed queries with the 11.2 COLLECT new features, starting with the ordered option, as follows.

SQL> SELECT CAST(
  2            COLLECT(
  3               DISTINCT empsal_ot(ename, sal)
  4               ) AS empsal_ntt) AS empsals
  5  FROM   emp;

EMPSALS(ENAME, SAL)
------------------------------------------------------------
EMPSAL_NTT(EMPSAL_OT('ADAMS', 1100), EMPSAL_OT('ALLEN', 1600
), EMPSAL_OT('BLAKE', 2850), EMPSAL_OT('CLARK', 2450), EMPSA
L_OT('FORD', 3000), EMPSAL_OT('JAMES', 950), EMPSAL_OT('JONE
S', 2975), EMPSAL_OT('KING', 5000), EMPSAL_OT('MARTIN', 1250
), EMPSAL_OT('MILLER', 1300), EMPSAL_OT('SCOTT', 3000), EMPS
AL_OT('SMITH', 800), EMPSAL_OT('TURNER', 1500), EMPSAL_OT('W
ARD', 1250))

1 row selected.

To retry the distinct option, we will first add a duplicate row for 'ALLEN' into EMP and then repeat our query, as follows.

SQL> INSERT INTO emp (empno, ename, sal)
  2     VALUES (9999, 'ALLEN', 1600);

1 row created.

SQL> SELECT CAST(
  2            COLLECT(
  3               DISTINCT empsal_ot(ename, sal)
  4               ) AS empsal_ntt) AS empsals
  5  FROM   emp;

EMPSALS(ENAME, SAL)
------------------------------------------------------------
EMPSAL_NTT(EMPSAL_OT('ADAMS', 1100), EMPSAL_OT('ALLEN', 1600
), EMPSAL_OT('BLAKE', 2850), EMPSAL_OT('CLARK', 2450), EMPSA
L_OT('FORD', 3000), EMPSAL_OT('JAMES', 950), EMPSAL_OT('JONE
S', 2975), EMPSAL_OT('KING', 5000), EMPSAL_OT('MARTIN', 1250
), EMPSAL_OT('MILLER', 1300), EMPSAL_OT('SCOTT', 3000), EMPS
AL_OT('SMITH', 800), EMPSAL_OT('TURNER', 1500), EMPSAL_OT('W
ARD', 1250))

1 row selected.

SQL> ROLLBACK;

Rollback complete.

We can see that our collection has been de-duplicated and an object instance for 'ALLEN' appears just once in our collection.

emulating collect new features before 11g release 2

As noted earlier, the ordering or de-duplication of elements during a COLLECT function call has been available (but unsupported) since the function was first released. Both extensions can be emulated quite simply in all versions since 10g Release 1. We will see some examples below that use supported methods only.

ordering collection elements before 11.2

To order collection elements in a deterministic way, we can write a simple function to sort the collection once it has been populated, as follows.

SQL> CREATE FUNCTION sort_collection (
  2                  p_collection IN varchar2_ntt
  3                  ) RETURN varchar2_ntt IS
  4     v_collection varchar2_ntt;
  5  BEGIN
  6     SELECT column_value
  7     BULK   COLLECT INTO v_collection
  8     FROM   TABLE(p_collection)
  9     ORDER  BY
 10            column_value;
 11     RETURN v_collection;
 12  END sort_collection;
 13  /

Function created.

Our function is very simple. It uses the TABLE() operator to cast our collection to a rowsource, which we then sort using a SQL ORDER BY clause and finally BULK COLLECT into a new collection. Using the function is also very simple. In the following example, we will populate and order a collection of employee names using COLLECT and our SORT_COLLECTION function.

SQL> SELECT deptno
  2  ,      SORT_COLLECTION(
  3            CAST(COLLECT(ename) AS varchar2_ntt)
  4            ) AS ordered_emps
  5  FROM   emp
  6  GROUP  BY
  7         deptno;

    DEPTNO ORDERED_EMPS
---------- --------------------------------------------------------------------
        10 VARCHAR2_NTT('CLARK', 'KING', 'MILLER')

        20 VARCHAR2_NTT('ADAMS', 'FORD', 'JONES', 'SCOTT', 'SMITH')

        30 VARCHAR2_NTT('ALLEN', 'BLAKE', 'JAMES', 'MARTIN', 'TURNER', 'WARD')

3 rows selected.

Of course, this method doesn't completely emulate the COLLECT ordering feature. Remember that the COLLECT function can order a collection using columns/expressions that are not being collected. To achieve the same prior to 11.2, we can use the MULTISET function. In the following example, we will populate collections of enames ordered by their hire dates. First, we will see the correct order using the new COLLECT feature below.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(ename ORDER BY hiredate)
  4               AS varchar2_ntt) AS ordered_emps
  5  FROM   emp
  6  GROUP  BY
  7         deptno;

    DEPTNO ORDERED_EMPS
---------- ---------------------------------------------------------------------------
        10 VARCHAR2_NTT('CLARK', 'KING', 'MILLER')

        20 VARCHAR2_NTT('SMITH', 'JONES', 'FORD', 'SCOTT', 'ADAMS')

        30 VARCHAR2_NTT('ALLEN', 'WARD', 'BLAKE', 'TURNER', 'MARTIN', 'JAMES')

3 rows selected.

We can emulate this with MULTISET but it requires slightly more work, as the following query demonstrates.

SQL> SELECT e.deptno
  2  ,      CAST(
  3            MULTISET(
  4               SELECT e2.ename
  5               FROM   emp e2
  6               WHERE  e2.deptno = e.deptno
  7               ORDER  BY
  8                      e2.hiredate
  9               ) AS varchar2_ntt) AS ordered_emps
 10  FROM   emp e
 11  GROUP  BY
 12         e.deptno
 13  ORDER  BY
 14         e.deptno;

    DEPTNO ORDERED_EMPS
---------- --------------------------------------------------------------------
        10 VARCHAR2_NTT('CLARK', 'KING', 'MILLER')

        20 VARCHAR2_NTT('SMITH', 'JONES', 'FORD', 'SCOTT', 'ADAMS')

        30 VARCHAR2_NTT('ALLEN', 'WARD', 'BLAKE', 'TURNER', 'MARTIN', 'JAMES')

3 rows selected.

The ORDER BY clause in the MULTISET subquery enables us to order the inputs to each collection and we can see that they are correctly ordered according to hire dates. However, with this method we have two EMP accesses which will naturally lead to a greater workload. The Cost Based Optimiser is smart enough, however, to avoid executing the MULTISET subquery for every EMP record in the outer query block (to verify this, repeat the query with statistics level set to ALL, query DBMS_XPLAN.DISPLAY_CURSOR with the format option "IOSTATS" and look at the number of Starts for the subquery, which will be equivalent to the number of distinct departments).

To measure the additional workload that this technique generates, we will compare the COLLECT and MULTISET methods with a larger dataset. Using the ALL_OBJECTS view, we will populate collections of OBJECT_NAME by OWNER, with the collection elements ordered by OBJECT_ID. We will compare their performance using Autotrace statistics and a wall-clock (having queried ALL_OBJECTS several times beforehand to eliminate physical I/O interference).

First, we will use the new COLLECT method, as follows.

SQL> set timing on
SQL> set autotrace traceonly statistics

SQL> SELECT owner
  2  ,      object_type
  3  ,      CAST(
  4            COLLECT(object_name ORDER BY object_id)
  5               AS varchar2_ntt) AS object_names
  6  FROM   all_objects
  7  GROUP  BY
  8         owner
  9  ,      object_type;

272 rows selected.

Elapsed: 00:00:14.92

Statistics
----------------------------------------------------------
        222  recursive calls
          0  db block gets
      95805  consistent gets
          0  physical reads
          0  redo size
    1882434  bytes sent via SQL*Net to client
      57167  bytes received via SQL*Net from client
        548  SQL*Net roundtrips to/from client
       1464  sorts (memory)
          0  sorts (disk)
        272  rows processed

We will now repeat the query using MULTISET, as follows.

SQL> SELECT a.owner
  2  ,      a.object_type
  3  ,      CAST(
  4            MULTISET(
  5               SELECT b.object_name
  6               FROM   all_objects b
  7               WHERE  a.owner = b.owner
  8               AND    a.object_type = b.object_type
  9               ORDER  BY
 10                      b.object_id
 11               ) AS varchar2_ntt) AS object_names
 12  FROM   all_objects a
 13  GROUP  BY
 14         a.owner
 15  ,      a.object_type;

272 rows selected.

Elapsed: 00:00:20.35

Statistics
----------------------------------------------------------
        427  recursive calls
          0  db block gets
     308412  consistent gets
        140  physical reads
          0  redo size
    1883832  bytes sent via SQL*Net to client
      57375  bytes received via SQL*Net from client
        550  SQL*Net roundtrips to/from client
       3198  sorts (memory)
          0  sorts (disk)
        272  rows processed

As expected, the MULTISET method uses significantly more resources (particularly LIOs and sorts) to answer the same query, which naturally leads to a longer elapsed time (almost 40% longer in the example above).

de-duplicating collections before 11.2

To emulate the new distinct extension to COLLECT, we can use the MULTISET functions and operators introduced in 10g. Note that these will only work with collections of nested table types (i.e. not VARRAYs), but they are simple to use. Furthermore, we need a MAP or ORDER method to use these with complex collections, as demonstrated by our earlier examples. For simplicity, however, we will use our simple scalar collection type for the following examples.

First, we can use the SET function to de-duplicate a collection, as below.

SQL> SELECT deptno
  2  ,      SET(
  3            CAST(
  4               COLLECT(job)
  5                  AS varchar2_ntt)) AS distinct_jobs
  6  FROM   emp
  7  GROUP  BY
  8         deptno;

    DEPTNO DISTINCT_JOBS
---------- -------------------------------------------------
        10 VARCHAR2_NTT('MANAGER', 'CLERK', 'PRESIDENT')

        20 VARCHAR2_NTT('CLERK', 'ANALYST', 'MANAGER')

        30 VARCHAR2_NTT('SALESMAN', 'CLERK', 'MANAGER')

3 rows selected.

Note the placement of the SET function. It must be outside the CAST call, because SET will not work with Oracle's system-generated types that are created by our use of the COLLECT function. We will demonstrate this below.

SQL> SELECT deptno
  2  ,      CAST(
  3            SET(COLLECT(job))
  4               AS varchar2_ntt) AS distinct_jobs
  5  FROM   emp
  6  GROUP  BY
  7         deptno;
          SET(COLLECT(job))
              *
ERROR at line 3:
ORA-00932: inconsistent datatypes: expected UDT got -

An alternative MULTISET method we can use is the MULTISET UNION DISTINCT operator. This is used to perform a union of two collections, as the following query demonstrates.

SQL> SELECT deptno
  2  ,      CAST(
  3            COLLECT(job)
  4               AS varchar2_ntt
  5            ) MULTISET UNION DISTINCT varchar2_ntt() AS distinct_jobs
  6  FROM   emp
  7  GROUP  BY
  8         deptno;

    DEPTNO DISTINCT_JOBS
---------- -------------------------------------------------
        10 VARCHAR2_NTT('MANAGER', 'CLERK', 'PRESIDENT')

        20 VARCHAR2_NTT('CLERK', 'ANALYST', 'MANAGER')

        30 VARCHAR2_NTT('SALESMAN', 'CLERK', 'MANAGER')

3 rows selected.

We have performed a distinct union on our collection by simply appending an empty collection of the same type with the MULTISET UNION DISTINCT operator. This is the semantic equivalent of a SQL UNION.

Using the ALL_OBJECTS dataset and a wall-clock, we will compare the performance of the two MULTISET techniques and the 11.2 COLLECT function below. We will begin by enabling Autotrace to suppress the resultset and use the new COLLECT function, as follows.

SQL> set timing on
SQL> set autotrace traceonly statistics

SQL> SELECT owner
  2  ,      object_type
  3  ,      CAST(
  4            COLLECT(DISTINCT object_name)
  5               AS varchar2_ntt) AS object_names
  6  FROM   all_objects
  7  GROUP  BY
  8         owner
  9  ,      object_type;

272 rows selected.

Elapsed: 00:00:12.35

Statistics
----------------------------------------------------------
        222  recursive calls
          0  db block gets
      95811  consistent gets
          0  physical reads
          0  redo size
    1883131  bytes sent via SQL*Net to client
      57375  bytes received via SQL*Net from client
        550  SQL*Net roundtrips to/from client
       1464  sorts (memory)
          0  sorts (disk)
        272  rows processed

The COLLECT-based query completes in approximately 12 seconds, with all sorts being performed in memory. We can compare this to the SET-based query, as follows.

SQL> SELECT owner
  2  ,      object_type
  3  ,      SET(
  4            CAST(
  5               COLLECT(object_name)
  6                  AS varchar2_ntt)) AS object_names
  7  FROM   all_objects
  8  GROUP  BY
  9         owner
 10  ,      object_type;

272 rows selected.

Elapsed: 00:01:09.42

Statistics
----------------------------------------------------------
        222  recursive calls
          0  db block gets
      95811  consistent gets
          0  physical reads
          0  redo size
    1876306  bytes sent via SQL*Net to client
      57375  bytes received via SQL*Net from client
        550  SQL*Net roundtrips to/from client
       1464  sorts (memory)
          0  sorts (disk)
        272  rows processed

The SET query takes significantly longer to complete (almost 70 seconds compared with 12 seconds for the COLLECT query) but with the same resource profile being reported by Autotrace.

Finally, we will test the performance of the MULTISET UNION DISTINCT method below.

SQL> SELECT owner
  2  ,      object_type
  3  ,      CAST(
  4            COLLECT(object_name)
  5               AS varchar2_ntt
  6            ) MULTISET UNION DISTINCT varchar2_ntt() AS object_names
  7  FROM   all_objects
  8  GROUP  BY
  9         owner
 10  ,      object_type;
ERROR:
ORA-22813: operand value exceeds system limits

no rows selected

Elapsed: 00:00:07.51

Statistics
----------------------------------------------------------
        228  recursive calls
          0  db block gets
      94179  consistent gets
          0  physical reads
          0  redo size
      26650  bytes sent via SQL*Net to client
        799  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
       1464  sorts (memory)
          0  sorts (disk)
          0  rows processed

After just 7 seconds and the same resource profile as the previous queries, our MULTISET UNION DISTINCT method gives up with an ORA-22813 exception due to collection size (note that this was tested on an 11.1.0.7 database). We can therefore conclude that for collection de-duplication, the new COLLECT function is more efficient than the pre-11g Release 2 alternatives (and that the alternatives will not necessarily be appropriate when working with larger collections).

acknowledgements

Thanks to Jonathan Heller for suggesting the use of the SET function to emulate the new distinct feature in versions prior to 11g Release 2.

further reading

For more information on COLLECT, see the online SQL Reference. More information on MULTISET features for working with collections is available in this oracle-developer.net article. A discussion of collection sorting techniques can be found in this oracle-developer.net article and a utility for sorting and de-duplicating collections is available as an oracle-developer.net utility.

source code

The source code for the examples in this article can be downloaded from here.

Adrian Billington, November 2009

Back to Top