Optimizing `IN`/`=ANY` Subqueries

Certain optimizations are applicable to comparisons that use the IN operator to test subquery results (or that use =ANY, which is equivalent). This section discusses these optimizations, particularly with regard to the challenges that NULL values present. Suggestions on what you can do to help the optimizer are given at the end of the discussion.

Consider the following subquery comparison:

outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)

MySQL evaluates queries "from outside to inside." That is, it first obtains the value of the outer expression outer_expr, and then runs the subquery and captures the rows that it produces.

A very useful optimization is to "inform" the subquery that the only rows of interest are those where the inner expression inner_expr is equal to outer_expr. This is done by pushing down an appropriate equality into the subquery's WHERE clause. That is, the comparison is converted to this:

EXISTS (SELECT 1 FROM ... WHERE subquery_where AND outer_expr=inner_expr)

After the conversion, MariaDB can use the pushed-down equality to limit the number of rows that it must examine when evaluating the subquery.

More generally, a comparison of N values to a subquery that returns N-value rows is subject to the same conversion. If oe_i and ie_i represent corresponding outer and inner expression values, this subquery comparison:

(oe_1, ..., oe_N) IN
 (SELECT ie_1, ..., ie_N FROM ... WHERE subquery_where)

Becomes:

EXISTS (SELECT 1 FROM ... WHERE subquery_where
 AND oe_1 = ie_1
 AND ...
 AND oe_N = ie_N)

The following discussion assumes a single pair of outer and inner expression values for simplicity.

The conversion just described has its limitations. It is valid only if we ignore possible NULL values. That is, the "pushdown" strategy works as long as both of these two conditions are true:

outer_expr and inner_expr cannot be NULL.
You do not need to distinguish NULL from FALSE subquery results. (If the subquery is a part of an OR or AND expression in the WHERE clause, MariaDB assumes that you don't care.)

When either or both of those conditions do not hold, optimization is more complex.

Suppose that outer_expr is known to be a non-NULL value but the subquery does not produce a row such that outer_expr = inner_expr. Then outer_expr IN (SELECT ...) evaluates as follows:

NULL, if the SELECT produces any row where inner_expr is NULL
FALSE, if the SELECT produces only non-NULL values or produces nothing

In this situation, the approach of looking for rows with outer_expr = inner_expr is no longer valid. It is necessary to look for such rows, but if none are found, also look for rows where inner_expr is NULL. Roughly speaking, the subquery can be converted to:

EXISTS (SELECT 1 FROM ... WHERE subquery_where AND
 (outer_expr=inner_expr OR inner_expr IS NULL))

The need to evaluate the extra IS NULL condition is why MariaDB has the ref_or_null access method:

mysql> EXPLAIN
 -> SELECT outer_expr IN (SELECT t2.maybe_null_key
 -> FROM t2, t3 WHERE ...)
 -> FROM t1;
*************************** 1. row ***************************
 id: 1
 select_type: PRIMARY
 table: t1
...
*************************** 2. row ***************************
 id: 2
 select_type: DEPENDENT SUBQUERY
 table: t2
 type: ref_or_null possible_keys: maybe_null_key
 key: maybe_null_key
 key_len: 5
 ref: func
 rows: 2
 Extra: Using where; Using index
...

The unique-subquery and index_subquery subquery-specific access methods also have or-null variants. However, they are not visible in EXPLAIN output, so you must use EXPLAIN EXTENDED followed by SHOW WARNINGS (note the checking NULL in the warning message):

mysql> EXPLAIN EXTENDED
 -> SELECT outer_expr IN (SELECT maybe_null_key FROM t2) FROM t1\G
*************************** 1. row ***************************
 id: 1
 select_type: PRIMARY
 table: t1
...
*************************** 2. row ***************************
 id: 2
 select_type: DEPENDENT SUBQUERY
 table: t2
 type: index_subquery possible_keys: maybe_null_key
 key: maybe_null_key
 key_len: 5
 ref: func
 rows: 2
 Extra: Using index mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
 Level: Note
 Code: 1003
Message: select (`test`.`t1`.`outer_expr`,
 (((`test`.`t1`.`outer_expr`) in t2 on
 maybe_null_key checking NULL))) AS `outer_expr IN (SELECT
 maybe_null_key FROM t2)` from `test`.`t1`

The additional OR ... IS NULL condition makes query execution slightly more complicated (and some optimizations within the subquery become inapplicable), but generally this is tolerable.

The situation is much worse when outer_expr can be NULL. According to the SQL interpretation of NULL as "unknown value," NULL IN (SELECT inner_expr ...) should evaluate to:

NULL, if the SELECT produces any rows
FALSE, if the SELECT produces no rows

For proper evaluation, it is necessary to be able to check whether the SELECT has produced any rows at all, so outer_expr = inner_expr cannot be pushed down into the subquery. This is a problem, because many real world subqueries become very slow unless the equality can be pushed down.

Essentially, there must be different ways to execute the subquery depending on the value of outer_expr.

The optimizer chooses SQL compliance over speed, so it accounts for the possibility that outer_expr might be NULL.

If outer_expr is NULL, to evaluate the following expression, it is necessary to run the SELECT to determine whether it produces any rows:

NULL IN (SELECT inner_expr FROM ... WHERE subquery_where)

It is necessary to run the original SELECT here, without any pushed-down equalities of the kind mentioned earlier.

On the other hand, when outer_expr is not NULL, it is absolutely essential that this comparison:

outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)

be converted to this expression that uses a pushed-down condition:

EXISTS (SELECT 1 FROM ... WHERE subquery_where AND outer_expr=inner_expr)

Without this conversion, subqueries will be slow. To solve the dilemma of whether to push down or not push down conditions into the subquery, the conditions are wrapped in "trigger" functions. Thus, an expression of the following form:

outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)

is converted into:

EXISTS (SELECT 1 FROM ... WHERE subquery_where
 AND trigcond(outer_expr=inner_expr))

More generally, if the subquery comparison is based on several pairs of outer and inner expressions, the conversion takes this comparison:

(oe_1, ..., oe_N) IN (SELECT ie_1, ..., ie_N FROM ... WHERE subquery_where)

and converts it to this expression:

EXISTS (SELECT 1 FROM ... WHERE subquery_where
 AND trigcond(oe_1=ie_1)
 AND ...
 AND trigcond(oe_N=ie_N)
 )

Each trigcond(X) is a special function that evaluates to the following values:

X when the "linked" outer expression oe_i is not NULL
TRUE when the "linked" outer expression oe_i is NULL

Note that trigger functions are not triggers of the kind that you create with CREATE TRIGGER.

Equalities that are wrapped into trigcond() functions are not first class predicates for the query optimizer. Most optimizations cannot deal with predicates that may be turned on and off at query execution time, so they assume any trigcond(X) to be an unknown function and ignore it. At the moment, triggered equalities can be used by those optimizations:

Reference optimizations: trigcond(X=Y [OR Y IS NULL]) can be used to construct ref, eq-ref, or ref_or_null table accesses.
Index lookup-based subquery execution engines: trigcond(X=Y) can be used to construct unique_subquery or index_subquery accesses.
Table-condition generator: If the subquery is a join of several tables, the triggered condition will be checked as soon as possible.

When the optimizer uses a triggered condition to create some kind of index lookup-based access (as for the first two items of the preceding list), it must have a fallback strategy for the case when the condition is turned off. This fallback strategy is always the same: Do a full table scan. In EXPLAIN output, the fallback shows up as Full scan on NULL key in the Extra column:

mysql> EXPLAIN SELECT t1.col1,
 -> t1.col1 IN (SELECT t2.key1 FROM t2 WHERE t2.col2=t1.col2) FROM t1\G
*************************** 1. row ***************************
 id: 1
 select_type: PRIMARY
 table: t1
 ...
*************************** 2. row ***************************
 id: 2
 select_type: DEPENDENT SUBQUERY
 table: t2
 type: index_subquery possible_keys: key1
 key: key1
 key_len: 5
 ref: func
 rows: 2
 Extra: Using where; Full scan on NULL key

If you run EXPLAIN EXTENDED followed by SHOW WARNINGS, you can see the triggered condition:

*************************** 1. row ***************************
 Level: Note
 Code: 1003
Message: select `test`.`t1`.`col1` AS `col1`,
 <in_optimizer>(`test`.`t1`.`col1`,
 <exists>(<index_lookup>(<cache>(`test`.`t1`.`col1`) in t2
 on key1 checking NULL
 where (`test`.`t2`.`col2` = `test`.`t1`.`col2`) having
 trigcond(<is_not_null_test>(`test`.`t2`.`key1`))))) AS
 `t1.col1 IN (select t2.key1 from t2 where t2.col2=t1.col2)`
 from `test`.`t1`

The use of triggered conditions has some performance implications. A NULL IN (SELECT ...) expression now may cause a full table scan (which is slow) when it previously did not. This is the price paid for correct results (the goal of the trigger-condition strategy was to improve compliance and not speed).

For multiple-table subqueries, execution of NULL IN (SELECT ...) will be particularly slow because the join optimizer doesn't optimize for the case where the outer expression is NULL. It assumes that subquery evaluations with NULL on the left side are very rare, even if there are statistics that indicate otherwise. On the other hand, if the outer expression might be NULL but never actually is, there is no performance penalty.

To help the query optimizer better execute your queries, use these tips:

A column must be declared as NOT NULL if it really is. (This also helps other aspects of the optimizer.)
If you don't need to distinguish a NULL from FALSE subquery result, you can easily avoid the slow execution path. Replace a comparison that looks like this:
```
outer_expr IN (SELECT inner_expr FROM ...)
```
with this expression:
```
(outer_expr IS NOT NULL) AND (outer_expr IN (SELECT inner_expr FROM ...))
```
Then NULL IN (SELECT ...) will never be evaluated because MariaDB stops evaluating AND parts as soon as the expression result is clear.

Optimizing IN/=ANY Subqueries

Retornar

Optimizing `IN`/`=ANY` Subqueries