regular expression enhancements in 11g

Regular expressions for SQL and PL/SQL were introduced in 10g Release 1 (as described in this oracle-developer.net article) and enhanced in 10g Release 2 (this included support for Perl-style expressions). Until 11g, Oracle's regular expression support comprised four functions (REGEXP_LIKE, REGEXP_SUBSTR, REGEXP_INSTR and REGEXP_REPLACE), but with the new version, Oracle has added a fifth, REGEXP_COUNT. In addition, 11g includes some usability enhancements to two of the existing functions.

In this article, we will explore all of the 11g new features for regular expressions. Note that this not a primer in regular expressions and does not work through the metacharacters and their usage. It is assumed that readers are familiar with Oracle's original regular expression support in 10g. For readers who require some background reading, see the article referenced above.

setup

To keep the examples simple and concise, we will use a single string to test our regular expressions. We will select this string from a view, as follows.

SQL> CREATE VIEW v
  2  AS
  3     SELECT 'www.oracle-developer.net' AS string
  4     FROM   dual;

View created.

SQL> SELECT * FROM v;

STRING
------------------------
www.oracle-developer.net

1 row selected.

regexp_count

We will begin with a new function, REGEXP_COUNT. As its name suggests, this 11g function counts the number of occurrences of an expression within a string. The parameters to this function follow a similar pattern to some of the existing REGEXP_% functions (REGEXP_COUNT's parameters are string, expression, starting position and match parameter) and we will see examples of their usage below.

In the first example, we will search for the number of occurrences of "e" in this website's address, as follows.

SQL> SELECT REGEXP_COUNT(string, 'e') AS "e Count"
  2  FROM   v;

   e Count
----------
         5

1 row selected.

Using the third argument to REGEXP_COUNT, we can also specify at which point in the base string we wish to begin our search. In the following example, we will count the number of "e" occurrences starting at the 20th character.

SQL> SELECT REGEXP_COUNT(string, 'e', 20) AS "e Count From Position 20"
  2  FROM   v;

e Count From Position 20
------------------------
                       1

1 row selected.

Readers who are familiar with REGEXP_SUBSTR, REGEXP_INSTR and REGEXP_REPLACE will recognise the general pattern of arguments to REGEXP_COUNT. The fourth argument to REGEXP_COUNT is the "match parameter", which enables us to apply some additional control over our search. For example, we can tell Oracle to ignore case in our search, as the following example demonstrates.

SQL> SELECT REGEXP_COUNT(string, 'E')         AS "E Count"
  2  ,      REGEXP_COUNT(string, 'E', 1, 'i') AS "E Count Ignore Case"
  3  FROM   v;

   E Count E Count Ignore Case
---------- -------------------
         0                   5

1 row selected.

We can see that our first expression found zero capital "E" characters, but when we ignore case (match parameter = 'i'), Oracle finds five occurrences. Incidentally, as with all regular expressions, special characters must be "escaped". In the following example, we will search for the number of full-stops in our string, both with and without an escaping character.

SQL> SELECT REGEXP_COUNT(string, '.')  AS "Dot No Escape (Any Character)"
  2  ,      REGEXP_COUNT(string, '\.') AS "Dot With Escape (Actual Dot)"
  3  FROM   v;

Dot No Escape (Any Character) Dot With Escape (Actual Dot)
----------------------------- ----------------------------
                           24                            2

1 row selected.

The "dot" character in a regular expression means "match any character", so every character in "www.oracle-developer.net" is a match for our first expression. When we escape this character in our second function call, we receive the correct answer of two occurrences.

In the examples so far, we have searched for single characters. REGEXP_COUNT is also useful for counting occurrences of multiple characters. In the following example, we will count the number of times "www" appears in this website's address. We will do this in three different ways.

SQL> SELECT REGEXP_COUNT(string, 'www')       AS "www Count Literal"
  2  ,      REGEXP_COUNT(string, 'w{3}')      AS "www Count Metacharacters"
  3  ,      REGEXP_COUNT(string, '(w)(w)(w)') AS "www Count Subexpressions"
  4  FROM   v;

www Count Literal www Count Metacharacters www Count Subexpressions
----------------- ------------------------ ------------------------
                1                        1                        1

1 row selected.

Of course, we knew the answer already, but it is useful to demonstrate how REGEXP_COUNT ignores subexpressions. In the third function call above, we treated each "w" in the search expression as an expression in its own right (we wrapped it in parentheses to create a "subexpression"). These subexpressions can be extremely useful. In 10g, they are primarily used in REGEXP_REPLACE calls, but in 11g they extend to the REGEXP_SUBSTR and REGEXP_INSTR functions, as we will see later in this article.

counting occurrences pre-11g

The REGEXP_COUNT function is a convenient new way of searching for the occurrences of strings or patterns, but it is possible to do this with pre-11g built-ins. If we are searching for a literal string, the following example will work in as many versions of Oracle as most readers have used.

SQL> SELECT LENGTH(string) - LENGTH(REPLACE(string, 'e')) AS "e Count Any Version"
  2  FROM   v;

e Count Any Version
-------------------
                  5

1 row selected.

We have correctly identified that the "e" character appears five times in our sample string. If we wish to count the occurrences of strings greater than one character in length, we need to amend the algorithm slightly to factor in the search string length. In the following example, we calculate the number of times "www" appears in our string using simple built-ins.

SQL> SELECT (LENGTH(string) - LENGTH(REPLACE(string, 'www'))) /
  2          LENGTH('www') AS "www Count Any Version"
  3  FROM   v;

www Count Any Version
---------------------
                    1
                    
1 row selected.

Prior to 11g, it is also possible to search for the number of times an expression (rather than a literal) occurs in a string. In all versions of 10g, we can use REGEXP_REPLACE instead of REPLACE in our algorithm. In the following example, we will search for the number of non-alpha characters in the website address.

SQL> SELECT LENGTH(string)
  2           - LENGTH(REGEXP_REPLACE(string, '[^a-z]')) AS "Non-Alpha Count 10g"
  3  FROM   v;

Non-Alpha Count 10g
-------------------
                  3

1 row selected.

Oracle has correctly counted the three non-alpha characters in our base string.

performance considerations for regexp_count

The new REGEXP_COUNT function is much more simple to use than the pre-11g alternatives. However, being a regular expression, it will be far more intensive in terms of the work it needs to do to calculate the occurrences of an expression or literal. In the following performance tests, we will compare the time it takes REGEXP_COUNT to find the "e" character in our string with the "any version" method across one million iterations.

Because of the simplicity of this test, we will start by setting the PL/SQL optimisation level to 0 (to prevent Oracle from "optimising out" our constant assignment in the loop).

SQL> ALTER SESSION SET PLSQL_OPTIMIZE_LEVEL = 0;

Session altered.

We will begin the comparison with the REGEXP_COUNT version of our code, as follows.

SQL> set timing on

SQL> DECLARE
  2     v_string    VARCHAR2(30) := 'www.oracle-developer.net';
  3     v_delimiter VARCHAR2(1)  := 'e';
  4     v_count     PLS_INTEGER;
  5  BEGIN
  6     FOR i IN 1 .. 1000000 LOOP
  7        v_count := REGEXP_COUNT(v_string, v_delimiter);
  8     END LOOP;
  9  END;
 10  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:07.89

One million executions of the REGEXP_COUNT function against our website string in PL/SQL took almost eight seconds (repeated executions of this example gives similar results). The "any version" alternative, using simple built-ins, is as follows.

SQL> DECLARE
  2     v_string    VARCHAR2(30) := 'www.oracle-developer.net';
  3     v_delimiter VARCHAR2(1)  := 'e';
  4     v_count     PLS_INTEGER;
  5  BEGIN
  6     FOR i IN 1 .. 1000000 LOOP
  7        v_count := (LENGTH(v_string) -
  8                    LENGTH(REPLACE(v_string, v_delimiter))) /
  9                    LENGTH(v_delimiter);
 10     END LOOP;
 11  END;
 12  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.85

This completes in less than one second. The flexibility and ease of the REGEXP_COUNT function comes at a price, therefore, and if our processing requirements are reasonably intensive, it might be better to use the pre-11g algorithm for literal searches. For most developers, however, the performance "hit" will probably be minor, depending on the type of application, queries and overall intensity of usage. If milliseconds are not hugely important (as in batch systems, for example), then REGEXP_COUNT's performance cost might be worth bearing for the simplicity it offers.

using regexp_count in "string-to-table" queries

To complete this section on REGEXP_COUNT, we will see a small example whereby counting string occurrences can be very useful. One of the FAQs on Oracle web forums is how to split a delimited string into its constituent elements. The following simple example highlights how we can use REGEXP_COUNT to help us to do this in SQL. We will imagine the website address as a delimited string and define the "." character as our delimiter.

SQL> SELECT string
  2  ,      LEVEL                                     AS element_no
  3  ,      REGEXP_SUBSTR(string, '[^\.]+', 1, LEVEL) AS single_element
  4  FROM   v
  5  CONNECT BY LEVEL <= REGEXP_COUNT(string, '\.') + 1;

STRING                   ELEMENT_NO SINGLE_ELEMENT
------------------------ ---------- ------------------------
www.oracle-developer.net          1 www
www.oracle-developer.net          2 oracle-developer
www.oracle-developer.net          3 net

3 rows selected.

There are many variations on this technique. For each string, we need to generate as many rows as there are constituent pieces. We do this by adding one to the count of delimiters and using a CONNECT BY "trick" to generate this number of rows. We can then use either REGEXP_SUBSTR (as in our example) or a combination of SUBSTR and INSTR to extract each element.

regexp_substr

As stated earlier, REGEXP_SUBSTR is one of the original regular expressions introduced in 10g. This is a useful function as it enables us to cut a literal or expression from within a string without needing to know its starting or ending position. In the following example, we will use REGEXP_SUBSTR to extract "oracle-developer" from the website's address (using an expression rather than literal for demonstration purposes).

SQL> SELECT REGEXP_SUBSTR(string, '\.[a-z-]+\.') AS url_middle_10g
  2  FROM   v;

URL_MIDDLE_10G
------------------
.oracle-developer.

1 row selected.

We can see that we almost have our result. We defined our expression as a string of lower-case alpha characters and a hyphen bounded by full-stops. Unfortunately, we also have our boundary characters (the full-stops). In 10g, we need to strip these using functions such as REPLACE, LTRIM, RTRIM or even SUBSTR. For example, we will remove the full-stops using REPLACE, as follows.

SQL> SELECT REPLACE(
  2            REGEXP_SUBSTR(string, '\.[a-z-]+\.'),
  3            '.') AS url_middle_cleaned_10g
  4  FROM   v;

URL_MIDDLE_CLEANED_10G
----------------------
oracle-developer

1 row selected.

We now have the string we actually wanted and this leads us nicely on to the 11g enhancement to REGEXP_SUBSTR. Oracle has added a new parameter to the function to enable us to extract the results of a subexpression, rather than an entire expression. In our example, the regular expression has three logical components or "subexpressions". First, we have a full-stop to the left of the string we want to extract. Second, we have the expression to find the string we actually want to extract. Third, we have the terminating full-stop.

We can use parentheses to define each of these subexpressions which turns our original expression \.[a-z-]+\. into (\.)([a-z-]+)(\.). Now that we have three subexpressions, we can ask REGEXP_SUBSTR to return any one of them, rather than the entire result. In the following example, therefore, we will request the results of the second subexpression only.

SQL> SELECT REGEXP_SUBSTR(
  2            string,
  3            '(\.)([a-z-]+)(\.)', --<-- expression with subexpressions
  4            1,                   --<-- starting position
  5            1,                   --<-- nth occurrence
  6            'i',                 --<-- match parameter (ignore case)
  7            2                    --<-- 11g: subexpression to return
  8            ) AS url_middle_11g
  9  FROM   v;

URL_MIDDLE_11G
----------------
oracle-developer

1 row selected.

Note that we can only specify one subexpression to return. However, this new feature is an extremely useful addition to REGEXP_SUBSTR and one which greatly simplifies its use.

regexp_instr

Oracle has added the same subexpression parameter to REGEXP_INSTR in 11g. Continuing with the example we used for REGEXP_SUBSTR above, we will use REGEXP_INSTR in its 10g form to determine the start and offset positions of the "oracle-developer" part of the sample website address. As before, we will use regular expressions rather than literals for demonstration purposes. We will begin by finding the start position of the search expression, as follows.

SQL> SELECT string
  2  ,      REGEXP_INSTR(string, '\.[a-z-]+\.') AS url_middle_start_10g
  3  FROM   v;

STRING                   URL_MIDDLE_START_10G
------------------------ --------------------
www.oracle-developer.net                    4

1 row selected.

The REGEXP_INSTR function returns the starting position of the entire expression, which in the case is the position of the full-stop. Using REGEXP_INSTR's offset parameter (described in the background reading), we will try to determine the position immediately following the end of the expression, as follows.

SQL> SELECT string
  2  ,      REGEXP_INSTR(
  3            string,
  4            '\.[a-z-]+\.',             --<-- expression
  5            1,                         --<-- starting position
  6            1,                         --<-- nth occurrence
  7            1                          --<-- offset indicator
  8            ) AS url_middle_offset_10g
  9  FROM   v;

STRING                   URL_MIDDLE_OFFSET_10G
------------------------ ---------------------
www.oracle-developer.net                    22

1 row selected.

We will use the SUBSTR function to summarise our results, as follows.

SQL> SELECT SUBSTR(string, 4, 1)  AS instr_start_10g
  2  ,      SUBSTR(string, 22, 1) AS instr_offset_10g
  3  FROM   v;

INSTR_START_10G INSTR_OFFSET_10G
--------------- ----------------
.               n

1 row selected.

These are clearly not what we wanted, although they are correct in the context of the 10g version of REGEXP_INSTR. Using the new 11g subexpression parameter, however, we can more accurately define our results, as follows.

SQL> SELECT string
  2  ,      REGEXP_INSTR(
  3            string,
  4            '(\.)([a-z-]+)(\.)',      --<-- expression with subexpressions
  5            1,                        --<-- starting position
  6            1,                        --<-- nth occurrence
  7            0,                        --<-- offset indicator
  8            'i',                      --<-- match parameter (ignore case)
  9            2                         --<-- 11g: subexpression to return
 10            ) AS url_middle_start_11g
 11  FROM   v;

STRING                   URL_MIDDLE_START_11G
------------------------ --------------------
www.oracle-developer.net                    5

1 row selected.

By specifying that we are only interested in the second subexpression, we have correctly identified the start of the "oracle-developer" string within the website address. We can do the same for the position immediately after this string, either by using the offset parameter with subexpression two or by finding the start of the third subexpression, as follows.

SQL> SELECT string
  2  ,      REGEXP_INSTR(
  3            string,
  4            '(\.)([a-z-]+)(\.)',      --<-- expression with subexpressions
  5            1,                        --<-- starting position
  6            1,                        --<-- nth occurrence
  7            1,                        --<-- offset indicator
  8            'i',                      --<-- match parameter (ignore case)
  9            2                         --<-- 11g: subexpression to return
 10            ) AS url_middle_offset_1_11g
 11  ,      REGEXP_INSTR(
 12            string,
 13            '(\.)([a-z-]+)(\.)',      --<-- expression with subexpressions
 14            1,                        --<-- starting position
 15            1,                        --<-- nth occurrence
 16            0,                        --<-- no offset indicator
 17            'i',                      --<-- match parameter (ignore case)
 18            3                         --<-- 11g: subexpression to return
 19            ) AS url_middle_offset_2_11g
 20  FROM   v;

STRING                   URL_MIDDLE_OFFSET_1_11G URL_MIDDLE_OFFSET_2_11G
------------------------ ----------------------- -----------------------
www.oracle-developer.net                      21                      21

1 row selected.

We can see that either method returns the same result. We can verify the results of the 11g version of REGEXP_INSTR as follows.

SQL> SELECT SUBSTR(string, 5, 1)  AS instr_start_11g
  2  ,      SUBSTR(string, 21, 1) AS instr_offset_11g
  3  FROM   v;

INSTR_START_11G INSTR_OFFSET_11G
--------------- ----------------
o               .

1 row selected.

As we can see, the subxpressions parameter enables us to define the results we want from the REGEXP_INSTR function with greater precision.

source code

The source code for the examples in this article can be downloaded from here.

Adrian Billington, May 2008

oracle-developer.net