Advertisements
This article describes the new LISTAGG function in Oracle 11g Release 2. LISTAGG is a built-in function that enables us to perform string aggregation natively. String aggregation is a popular technique, and there are several methods available on the web, so we will compare their performance to the new LISTAGG function later in this article.
sample data
For reference, we will use the following sample data for our examples.
DEPTNO ENAME HIREDATE
---------- ---------- ----------
10 CLARK 09/06/1981
10 KING 17/11/1981
10 MILLER 23/01/1982
20 ADAMS 12/01/1983
20 FORD 03/12/1981
20 JONES 02/04/1981
20 SCOTT 09/12/1982
20 SMITH 17/12/1980
30 ALLEN 20/02/1981
30 BLAKE 01/05/1981
30 JAMES 03/12/1981
30 MARTIN 28/09/1981
30 TURNER 08/09/1981
30 WARD 22/02/1981
what is string aggregation?
String aggregation is simply the grouping and concatenation of multiple rows of data into a single row per group. For example, consider the following resultset:
DEPTNO ENAME
--------- ----------
10 CLARK
10 KING
10 MILLER
20 ADAMS
20 FORD
20 JONES
With string aggregation, this resultset would be grouped (by DEPTNO) as follows:
DEPTNO AGGREGATED_ENAMES
--------- -------------------------
10 CLARK,KING,MILLER
20 ADAMS,FORD,JONES
We can see that the employee names have simply been grouped and concatenated into a single column (values are delimited by comma) per group. As stated, there are several techniques available to perform this aggregation (references are provided at the end of this article), but the new LISTAGG function makes it much easier, as we will see below.
listagg syntax overview
The LISTAGG function has the following syntax structure:
LISTAGG( [,]) WITHIN GROUP (ORDER BY ) [OVER (PARTITION BY )]
LISTAGG is an aggregate function that can optionally be used as an analytic (i.e. the optional OVER() clause). The following elements are mandatory:
the column or expression to be aggregated;
the WITHIN GROUP keywords;
the ORDER BY clause within the grouping.
We will now see some examples of the function below.
listagg as an aggregate function
We will begin with a simple example that aggregates the employee names for each department in the EMP table, using a comma as delimiter.
SQL> SELECT deptno
2 , LISTAGG(ename, ',') WITHIN GROUP (ORDER BY ename) AS employees
3 FROM emp
4 GROUP BY
5 deptno;
DEPTNO EMPLOYEES
---------- ------------------------------------------------------------
10 CLARK,KING,MILLER
20 ADAMS,FORD,JONES,SCOTT,SMITH
30 ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
3 rows selected.
Note that we chose to order the employees within each aggregation by the employee name. It should be noted that ordering the elements of a string aggregation is not a trivial task in some of the alternative techniques to LISTAGG.
In the following example, we will aggregate the employee names but order them by their respective hire dates.
SQL> SELECT deptno
2 , LISTAGG(ename, ',') WITHIN GROUP (ORDER BY hiredate) AS employees
3 FROM emp
4 GROUP BY
5 deptno;
DEPTNO EMPLOYEES
---------- ------------------------------------------------------------
10 CLARK,KING,MILLER
20 SMITH,JONES,FORD,SCOTT,ADAMS
30 ALLEN,WARD,BLAKE,TURNER,MARTIN,JAMES
3 rows selected.
We can see that the order of employee names within each group differs from the first example.
listagg as an analytic function
As with many aggregate functions, LISTAGG can be converted to an analytic function by adding the OVER() clause. The following example demonstrates the analytic equivalent of our previous example (for each department, aggregate employee names in hire date order).
SQL> SELECT deptno
2 , ename
3 , hiredate
4 , LISTAGG(ename, ',')
5 WITHIN GROUP (ORDER BY hiredate)
6 OVER (PARTITION BY deptno) AS employees
7 FROM emp;
DEPTNO ENAME HIREDATE EMPLOYEES
---------- ---------- ----------- -------------------------------------
10 CLARK 09/06/1981 CLARK,KING,MILLER
10 KING 17/11/1981 CLARK,KING,MILLER
10 MILLER 23/01/1982 CLARK,KING,MILLER
20 SMITH 17/12/1980 SMITH,JONES,FORD,SCOTT,ADAMS
20 JONES 02/04/1981 SMITH,JONES,FORD,SCOTT,ADAMS
20 FORD 03/12/1981 SMITH,JONES,FORD,SCOTT,ADAMS
20 SCOTT 19/04/1987 SMITH,JONES,FORD,SCOTT,ADAMS
20 ADAMS 23/05/1987 SMITH,JONES,FORD,SCOTT,ADAMS
30 ALLEN 20/02/1981 ALLEN,WARD,BLAKE,TURNER,MARTIN,JAMES
30 WARD 22/02/1981 ALLEN,WARD,BLAKE,TURNER,MARTIN,JAMES
30 BLAKE 01/05/1981 ALLEN,WARD,BLAKE,TURNER,MARTIN,JAMES
30 TURNER 08/09/1981 ALLEN,WARD,BLAKE,TURNER,MARTIN,JAMES
30 MARTIN 28/09/1981 ALLEN,WARD,BLAKE,TURNER,MARTIN,JAMES
30 JAMES 03/12/1981 ALLEN,WARD,BLAKE,TURNER,MARTIN,JAMES
14 rows selected.
Remember that when use an analytic function, we do not lose any rows from our resultset: rather we see the results of the aggregation on every source row (depending on the analytic function and the ordering/windowing clauses). We can clearly see this above. The string aggregation for each department is available on every corresponding row.
more on ordering
As stated earlier, the ORDER BY clause is mandatory, as following example demonstrates.
SQL> SELECT deptno
2 , LISTAGG(ename, ',') WITHIN GROUP () AS employees
3 FROM emp
4 GROUP BY
5 deptno;
, LISTAGG(ename, ',') WITHIN GROUP () AS employees
*
ERROR at line 2:
ORA-30491: missing ORDER BY clause
If the order of the aggregated elements is irrelevant to us, we can use a constant expression such as NULL, as follows.
SQL> SELECT deptno
2 , LISTAGG(ename, ',') WITHIN GROUP (ORDER BY NULL) AS employees
3 FROM emp
4 GROUP BY
5 deptno;
DEPTNO EMPLOYEES
---------- ------------------------------------------------------------
10 CLARK,KING,MILLER
20 ADAMS,FORD,JONES,SCOTT,SMITH
30 ALLEN,BLAKE,JAMES,MARTIN,TURNER,WARD
3 rows selected.
In this example, the elements have been aggregated alphabetically, despite the NULL ordering clause. This appears to be the default behaviour when using a constant ORDER BY expression such as the above.
delimiters
We can use a range of constants or expressions as a delimiter for our aggregated strings. In fact, the delimiter is optional and can be excluded altogether, as the following example demonstrates.
SQL> SELECT deptno
2 , LISTAGG(ename) WITHIN GROUP (ORDER BY ename) AS employees
3 FROM emp
4 GROUP BY
5 deptno;
DEPTNO EMPLOYEES
---------- ------------------------------------------------------------
10 CLARKKINGMILLER
20 ADAMSFORDJONESSCOTTSMITH
30 ALLENBLAKEJAMESMARTINTURNERWARD
3 rows selected.
One restriction is that delimiters have to be either constants (such as a literal) or based on a deterministic expression that includes a column or expression used in the grouping. For example, we cannot use ROWNUM as a delimiter, as we see below.
SQL> SELECT deptno
2 , LISTAGG(ename, '(' || ROWNUM || ')')
3 WITHIN GROUP (ORDER BY hiredate) AS employees
4 FROM emp
5 GROUP BY
6 deptno;
, LISTAGG(ename, '(' || ROWNUM || ')')
*
ERROR at line 2:
ORA-30497: Argument should be a constant or a function of expressions in GROUP BY.
The error message is clear: ROWNUM is neither a constant nor an expression that involves our grouping column (i.e. DEPTNO). If we try to use our grouping column, we are limited to the type of expression we can use, as we can see below.
SQL> SELECT deptno
2 , LISTAGG(ename, '(' || MAX(deptno) || ')')
3 WITHIN GROUP (ORDER BY hiredate) AS employees
4 FROM emp
5 GROUP BY
6 deptno;
, LISTAGG(ename, '(' || MAX(deptno) || ')')
*
ERROR at line 2:
ORA-30496: Argument should be a constant.
This time, Oracle recognises that we tried to use our grouping column, but we did not use a valid expression (hence we receive another error message; albeit different to the previous example). The following is an example of a deterministic expression on our grouping column that Oracle will accept.
SQL> SELECT deptno
2 , LISTAGG(ename, '(' || CHR(deptno+55) || '); ')
3 WITHIN GROUP (ORDER BY hiredate) AS employees
4 FROM emp
5 GROUP BY
6 deptno;
DEPTNO EMPLOYEES
---------- ------------------------------------------------------------
10 CLARK(A); KING(A); MILLER
20 SMITH(K); JONES(K); FORD(K); SCOTT(K); ADAMS
30 ALLEN(U); WARD(U); BLAKE(U); TURNER(U); MARTIN(U); JAMES
3 rows selected.
This is a contrived example, but we've simply converted the DEPTNO into an ASCII character to use as a delimiter. This is a deterministic expression on the value of the grouping column and Oracle allows it.
other restrictions
The results of LISTAGG are constrained to the maximum size of VARCHAR2 in SQL (i.e. 4000) as we can see below.
SQL> SELECT LISTAGG(object_name) WITHIN GROUP (ORDER BY NULL)
2 FROM all_objects;
FROM all_objects
*
ERROR at line 2:
ORA-01489: result of string concatenation is too long
There is no CLOB or larger VARCHAR2 equivalent, so for larger strings we would need to use an alternative means of gathering the elements (such as a collection or a user-defined PL/SQL function).
performance considerations
We will compare the performance of LISTAGG to some of the alternatives that pre-date it. The methods we will compare are as follows:
LISTAGG (11g Release 2);
COLLECT + PL/SQL function(10g);
Oracle Data Cartridge - user-defined aggregate function (9i)
MODEL SQL (10g).
The main difference with LISTAGG is that it is a built-in function, so we should expect its performance to be at the very least comparable to its alternatives.
setup
For the performance comparisons, we will use a larger source dataset of 1 million rows spread evenly between 2000 groups, as follows.
SQL> CREATE TABLE t
2 AS
3 SELECT ROWNUM AS id
4 , MOD(ROWNUM,2000) AS grp
5 , DBMS_RANDOM.STRING('u',5) AS val
6 , DBMS_RANDOM.STRING('u',30) AS pad
7 FROM dual
8 CONNECT BY ROWNUM <= 1000000;
Table created.
SQL> SELECT COUNT(*) FROM t;
COUNT(*)
----------
1000000
1 row selected.
SQL> exec DBMS_STATS.GATHER_TABLE_STATS(USER, 'T');
PL/SQL procedure successfully completed.
We will compare LISTAGG to the alternatives using both Autotrace and the wall-clock. Note that the sample data is cached prior to the comparisons. We will begin by preparing our environment as follows.
SQL> set autotrace traceonly statistics
SQL> set timing on
SQL> set arrays 500
listagg
Our first test will be with LISTAGG. We will aggregate and order the value string within each of our 2000 groups, as follows.
SQL> SELECT grp
2 , LISTAGG(val, ',') WITHIN GROUP (ORDER BY val) AS vals
3 FROM t
4 GROUP BY
5 grp;
2000 rows selected.
Elapsed: 00:00:05.85
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
7092 consistent gets
0 physical reads
0 redo size
6039067 bytes sent via SQL*Net to client
552 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
2000 rows processed
This executes in just under 6 seconds on the test database, with no physical I/O and all sorting in memory.
stragg/wm_concat
Next we will use the best known technique for string aggregation: Tom Kyte's user-defined aggregate function known as STRAGG. In 10g, Oracle implemented a similar function in the WMSYS schema, so we will assume this to be a direct equivalent of STRAGG and use it in our test below. Note that one of the details of the STRAGG method is that it doesn't enable the strings to be ordered.
SQL> SELECT grp
2 , WMSYS.WM_CONCAT(val) AS vals --<-- WM_CONCAT ~= STRAGG
3 FROM t
4 GROUP BY
5 grp;
2000 rows selected.
Elapsed: 00:00:19.45
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
7206 consistent gets
0 physical reads
0 redo size
6039067 bytes sent via SQL*Net to client
552 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
2000 rows processed
We can see that this is over 3 times slower than LISTAGG (even without element ordering). The user-defined aggregate function carries with it the general overhead of PL/SQL functions (i.e. context-switching) and we can see the impact of this in the results above.
collect (without ordering)
When 10g was released, I devised a quicker alternative to STRAGG using the COLLECT function combined with a "collection-to-string" PL/SQL function (described in this oracle-developer.net article). The 10g version of COLLECT had no ordering facility so we'll start with this variant, as follows. Note that the TO_STRING source code is available in the referenced article.
SQL> SELECT grp
2 , TO_STRING(
3 CAST(COLLECT(val) AS varchar2_ntt)
4 ) AS vals
5 FROM t
6 GROUP BY
7 grp;
2000 rows selected.
Elapsed: 00:00:02.90
Statistics
----------------------------------------------------------
10 recursive calls
0 db block gets
7197 consistent gets
0 physical reads
0 redo size
6039067 bytes sent via SQL*Net to client
552 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
2000 rows processed
Without sorting the elements, the original COLLECT/TO_STRING method is twice as fast as LISTAGG. However, this isn't a fair comparison because LISTAGG always incurs the cost of ordering the elements. If the order of the elements is irrelevant, however, the original COLLECT technique will be the quickest.
collect (with ordering)
For a fair comparison between COLLECT and LISTAGG, we must include the sorting of the elements in the COLLECT call (an 11g Release 1 new feature described in this oracle-developer.net article). The following is equivalent to our LISTAGG example.
SQL> SELECT grp
2 , TO_STRING(
3 CAST(COLLECT(val ORDER BY val) AS varchar2_ntt)
4 ) AS vals
5 FROM t
6 GROUP BY
7 grp;
2000 rows selected.
Elapsed: 00:00:07.08
Statistics
----------------------------------------------------------
10 recursive calls
0 db block gets
7197 consistent gets
0 physical reads
0 redo size
6039067 bytes sent via SQL*Net to client
552 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
2000 rows processed
This time, when we include the ordering of the elements, the COLLECT method is actually slower than the new LISTAGG function.
model
Our final performance comparison is with an example of string aggregation that uses the MODEL clause. The following example was sourced from Rob van Wijk's About Oracle blog and has been modified to fit our sample data.
SQL> SELECT grp
2 , vals
3 FROM (
4 SELECT grp
5 , RTRIM(vals, ',') AS vals
6 , rn
7 FROM t
8 MODEL
9 PARTITION BY (grp)
10 DIMENSION BY (ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val) AS rn)
11 MEASURES (CAST(val AS VARCHAR2(4000)) AS vals)
12 RULES
13 ( vals[ANY] ORDER BY rn DESC = vals[CV()] || ',' || vals[CV()+1]
14 )
15 )
16 WHERE rn = 1
17 ORDER BY
18 grp;
2000 rows selected.
Elapsed: 00:03:28.15
Statistics
----------------------------------------------------------
3991 recursive calls
0 db block gets
7092 consistent gets
494791 physical reads
0 redo size
6039067 bytes sent via SQL*Net to client
553 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
130 sorts (memory)
0 sorts (disk)
2000 rows processed
This ran for over 3 minutes! We can see from the statistics that it has incurred a significant number of physical reads, recursive calls and in-memory sorting. In fact, this poor timing is largely due to an enormous number of direct path reads/writes to/from the temporary tablespace during query execution (although the disk sorts are not reported by the Autotrace statistics).
The execution plan for the MODEL string aggregation is as follows.
--------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc|
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | SORT ORDER BY | | 1000K| 1934M| 1953M|
|* 2 | VIEW | | 1000K| 1934M| |
| 3 | SQL MODEL ORDERED | | 1000K| 9765K| |
| 4 | WINDOW SORT | | 1000K| 9765K| 19M|
| 5 | TABLE ACCESS FULL| T | 1000K| 9765K| |
--------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("RN"=1)
According to a SQL monitor report (using DBMS_SQLTUNE.REPORT_SQL_MONITOR), the ordering of data within the SQL MODEL ORDERED operation at step 3 accounted for almost 4Gb of temporary space! An alternative example from Gary Myers' Sydney Oracle Lab blog displays similar performance characteristics.
performance summary
To summarise, therefore, the new LISTAGG function is the fastest technique for string aggregation and has the additional benefit of being a simple built-in function. If the ordering of elements is irrelevant, a technique that uses COLLECT without sorting is quicker still, but for ordered string aggregation, LISTAGG is unbeatable. Note that there is an issue with the MODEL examples used in the performance comparisons, but this might be isolated to either the version of Oracle used for the examples (11.2.0.1 Windows 64-bit) or the test database itself.