emulating string-to-table functionality using sql
This short article demonstrates a number of methods for emulating "string-to-table" functionality in SQL only (i.e. without a PL/SQL function). I've put this together following my response to a forum posting on DBA Support. The forum question asked how to split a column of single, delimited strings into its constituent elements, but in SQL only. Each element must be represented in its own row, such that 'A,B,C' generates three records.
what's included?
In addition to my solution posted to DBA Support, this article also includes follow-up examples by William Robertson, Adrian Christie and Mohit Agarwal. In total, five methods are included:
- my SQL method, which requires a collection type as it uses TABLE(CAST(MULTISET())) functionality. This works on all versions that support subquery factoring (9i+);
- William Robertson's method, which uses XML features of Oracle 9.2+ to convert each string to a collection of elements in an XMLType, from which the elements can be extracted;
- Adrian Christie's method, which uses the MODEL clause for a 10g+ implementation;
- Mohit Agarwal's method, which uses regular expressions and a hierarchical query technique for a 10g+ implementation; and
- an 8i version of my SQL method, provided for completeness.
setup
First we will create a table with some test data and a collection type, starting with the table (with just a couple of rows for demonstration purposes).
SQL> CREATE TABLE t (str VARCHAR2(30));
Table created.
SQL> INSERT INTO t VALUES ( 'X,Y,Z' );
1 row created.
SQL> INSERT INTO t VALUES ( 'XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG' );
1 row created.
Next we create a collection type for use in the SQL-only implementation.
SQL> CREATE OR REPLACE TYPE number_ntt AS TABLE OF NUMBER; 2 /
Type created.
sql solution
As noted, the following SQL-only solution works for all versions of 9i and onwards. It transforms each delimited string into multiple rows. Note that in this example, the string-delimiter is hard-coded to a comma. Some further comments on the SQL follow the example.
SQL> WITH ilv AS ( 2 SELECT str || ',' AS str 3 , (LENGTH(str) - LENGTH(REPLACE(str, ','))) + 1 AS no_of_elements 4 FROM t 5 ) 6 SELECT RTRIM(str, ',') AS original_string 7 , SUBSTR(str, start_pos, (next_pos-start_pos)) AS single_element 8 , element_no 9 FROM ( 10 SELECT ilv.str 11 , nt.column_value AS element_no 12 , INSTR( 13 ilv.str, 14 ',', 15 DECODE(nt.column_value, 1, 0, 1), 16 DECODE(nt.column_value, 1, 1, nt.column_value-1)) + 1 AS start_pos 17 , INSTR( 18 ilv.str, 19 ',', 20 1, 21 DECODE(nt.column_value, 1, 1, nt.column_value)) AS next_pos 22 FROM ilv 23 , TABLE( 24 CAST( 25 MULTISET( 26 SELECT ROWNUM FROM dual CONNECT BY ROWNUM < ilv.no_of_elements 27 ) AS number_ntt )) nt 28 );
ORIGINAL_STRING SINGLE_ELEMENT ELEMENT_NO ------------------------------- --------------- ---------- X,Y,Z X 1 X,Y,Z Y 2 X,Y,Z Z 3 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG XXX 1 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG Y 2 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG ZZ 3 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG AAAAA 4 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG B 5 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG CCC 6 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG D 7 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG E 8 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG F 9 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG GGG 10 13 rows selected.
Note that the above SQL performs the following steps:
- determines how many elements are in each string (WITH clause);
- for each string, generates a collection of n elements (TABLE expression), where n is the derived number of elements in the string. Note in particular the use of "less than" in the "CONNECT BY ROWNUM < ilv.no_of_elements" on line 26. In all versions other than 10.1.x, this will need to be "CONNECT BY ROWNUM <= ilv.no_of_elements" (i.e. "less than or equal to"). There is an unusual bug with this row-generation technique in 10.1 that generates an additional row from the CONNECT BY;
- uses the generated rows in a Cartesian Product with the original data to generate n rows per string, based on the above definition of n;
- calculates the start and end position of each element in each string (INSTR); and
- cuts each element from each string (SUBSTR).
model solution
Adrian Christie's method uses the MODEL clause to generate the rows to "pivot" each string. MODEL is supported for all versions of 10g onwards. We'll look at the SQL first and then some points of note.
SQL> col element_count noprint SQL> WITH ilv AS ( 2 SELECT str AS orig_str 3 , ',' || str || ',' AS mod_str 4 , 1 AS start_pos 5 , LENGTH(str) AS end_pos 6 , (LENGTH(str) - LENGTH(REPLACE(str, ','))) + 1 AS element_count 7 , 0 AS element_no 8 , ROWNUM AS rn 9 FROM t 10 ) 11 SELECT orig_str AS original_string 12 , SUBSTR(mod_str, start_pos, end_pos-start_pos) AS single_element 13 , element_no 14 , element_count 15 FROM ( 16 SELECT * 17 FROM ilv 18 MODEL 19 PARTITION BY (rn, orig_str, mod_str) 20 DIMENSION BY (element_no) 21 MEASURES (start_pos, end_pos, element_count) 22 RULES 23 ITERATE (2000) 24 UNTIL (ITERATION_NUMBER+1 = element_count[0]) 25 ( 26 start_pos[ITERATION_NUMBER+1] 27 = INSTR(CV(mod_str), ',', 1, CV(element_no)) + 1, 28 end_pos[ITERATION_NUMBER+1] 29 = INSTR(CV(mod_str), ',', 1, CV(element_no) + 1) 30 ) 31 ) 32 WHERE element_no != 0 33 ORDER BY 34 mod_str 35 , element_no;
ORIGINAL_STRING SINGLE_ELEMENT ELEMENT_NO ------------------------------ --------------- ---------- X,Y,Z X 1 X,Y,Z Y 2 X,Y,Z Z 3 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG XXX 1 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG Y 2 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG ZZ 3 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG AAAAA 4 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG B 5 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG CCC 6 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG D 7 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG E 8 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG F 9 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG GGG 10 13 rows selected.
Some points to note are as follows.
- the maximum number of iterations have to be hard-coded. Because this is a VARCHAR2 column (maximum 4000 bytes), the maximum number of elements possible will be 2,000 (2,000 characters + 2,000 commas). The count of elements will serve as a stopkey to avoid MODEL iterating too much for each string;
- there appears to be a bug with the projection of the element_count column, so it is excluded from the output. If we comment it out completely, it doesn't get projected at all and hence there's no stopkey to the MODEL iterations; and
- the combination of PARTITION BY and DIMENSION BY must be unique, hence the use of ROWNUM to force a unique key.
xml solution
William Robertson's method uses the XML features of Oracle 9.2 upwards. Each string is transformed into a collection of XML elements by replacing the commas with tags and casting the strings into XMLTypes. Once in an XMLType, the elements can be extracted using XPath expressions. Unfortunately, this method will be slow at anything other than low volumes.
SQL> SELECT str AS original_string 2 , EXTRACTVALUE( xt.column_value, 'element' ) AS single_element 3 FROM t 4 , TABLE( 5 XMLSEQUENCE( 6 EXTRACT( 7 XMLTYPE( 8 '' 11 ), 12 '/coll/element' 13 ) 14 ) 15 ) xt; ' || 9 REPLACE( t.str, ',', ' ' ) || 10 '
ORIGINAL_STRING SINGLE_ELEMENT ------------------------------ --------------- X,Y,Z X X,Y,Z Y X,Y,Z Z XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG XXX XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG Y XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG ZZ XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG AAAAA XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG B XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG CCC XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG D XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG E XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG F XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG GGG 13 rows selected.
hierarchical solution
Mohit Agarwal's solution utilises a hierarchical query (CONNECT BY) combined with regular expressions to strip the tokens from each string. This is a much simpler version to read and uses a single in-line view, as follows.
SQL> SELECT str 2 , REGEXP_SUBSTR(str, '[^,]+', 1, LEVEL) AS single_element 3 , LEVEL AS element_no 4 FROM ( 5 SELECT ROWNUM AS id 6 , str 7 FROM t 8 ) 9 CONNECT BY INSTR(str, ',', 1, LEVEL-1) > 0 10 AND id = PRIOR id 11 AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;
STR SINGLE_ELEMENT ELEMENT_NO ------------------------------ ------------------------------ ---------- X,Y,Z X 1 X,Y,Z Y 2 X,Y,Z Z 3 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG XXX 1 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG Y 2 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG ZZ 3 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG AAAAA 4 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG B 5 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG CCC 6 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG D 7 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG E 8 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG F 9 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG GGG 10 13 rows selected.
Note that the in-line view is required to alias ROWNUM (as it cannot be used directly in the PRIOR clause on line 10).
sql solution for 8i
We can easily modify the original SQL version to be 8i-compatible. We'll see the SQL first then examine the key differences from the 9i+ version.
SQL> SELECT RTRIM(str, ',') AS original_string 2 , SUBSTR(str, start_pos, (next_pos-start_pos)) AS single_element 3 , element_no 4 FROM ( 5 SELECT ilv.str 6 , nt.column_value AS element_no 7 , INSTR( 8 ilv.str, 9 ',', 10 DECODE(nt.column_value, 1, 0, 1), 11 DECODE(nt.column_value, 1, 1, nt.column_value-1)) + 1 AS start_pos 12 , INSTR( 13 ilv.str, 14 ',', 15 1, 16 DECODE(nt.column_value, 1, 1, nt.column_value)) AS next_pos 17 FROM ( 18 SELECT str || ',' AS str 19 , (LENGTH(str) - LENGTH(REPLACE(str, ','))) + 1 AS no_of_elements 20 FROM t 21 ) ilv 22 , TABLE( 23 CAST( 24 MULTISET( 25 SELECT ROWNUM FROM all_objects WHERE ROWNUM <= ilv.no_of_elements 26 ) AS number_ntt )) nt 27 );
ORIGINAL_STRING SINGLE_ELEMENT ELEMENT_NO ------------------------------- --------------- ---------- X,Y,Z X 1 X,Y,Z Y 2 X,Y,Z Z 3 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG XXX 1 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG Y 2 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG ZZ 3 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG AAAAA 4 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG B 5 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG CCC 6 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG D 7 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG E 8 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG F 9 XXX,Y,ZZ,AAAAA,B,CCC,D,E,F,GGG GGG 10 13 rows selected.
The main differences are as follows.
- the T translation is inside a further in-line view, as subquery factoring (WITH clause) was unavailable before 9i. Subquery factoring in this example makes no tangible difference, other than making the SQL easier to read; and
- the row generation is done using a stopkey against a dictionary view, rather than a CONNECT BY against DUAL, which wasn't supported in 8i. The potential downsides of using a view such as ALL_OBJECTS are that it must contain at least the same number of rows as the string with the most elements and also it is a query in itself after all.
summary timings
To test the performance of each of these methods, the T table was scaled up to 100,000 rows of 4 elements. We can see the average "wall-clock" timings of each method below.
Method | Timing (s) |
9i+ SQL | 11 |
8i+ SQL (running on 8i) | 14 |
10g+ MODEL | 27 |
10g+ hierarchical SQL | 32 |
8i+ SQL (running on 10g) | 114 |
9.2+ XML | cancelled after 600 |
further reading
The original thread that prompted this article can be found here.
acknowledgements
Thanks to Adrian Christie, William Robertson and Mohit Agarwal for contributing to this article.
source code
The source code for the examples in this article can be downloaded from here.
Adrian Billington, July 2005 (updated August 2007)
Back to Top