How to misuse GROUP BY in SQL
How to misuse GROUP BY in SQL
Recently, we upgraded MySQL in our project from version 5.6 to version 5.7 (if it depended on me, I’d upgrade to 8, but that was out of question). After a while we started to receive feedback that one of our functions no longer works correctly.
That function is a selection of a best matching record from the DB, based on several criteria. We started to look closely and found some heavy legacy code that did the following:
- Select a whole bunch of records, at the same time making a “unique_id” for every record, by concatenation of several fields. This SELECT had a proper ORDER BY clause, which puts more specific matches (more relevant) on top.
- Those “unique_ids” were often not so unique. That is, the whole SELECT produced more than 1 record with almost every given “unique_id”.
- In order to remove duplicates, and leave just the most relevant rows, the following trick was used.
The above SELECT was wrapped into another SELECT, with a GROUP BY, like this:
SELECT * FROM (SELECT * FROM t1 WHERE cond ORDER BY relevance) GROUP BY unique;
It’s pretty obvious that this query is not a good example of SQL usage. GROUP BY is for aggregating, not for filtering which is essentially what happened. Moreover, this query is not portable, because even MySQL is moving in direction of stricter standard compliance, and according to standard, in a query with GROUP BY, every selected column must be either used in GROUP BY or used in an aggregate function like
SUM()
orAVG()
.
A fresh MySQL version with default settings (@sql_mode
, to be precise), will reject this query.
However, this code worked just fine with the MySQL installation used in our project.
Also, because we didn’t want to rewrite possibly half of the SQL code after upgrading to MySQL 5.7,
we set @sql_mode
to maximum tolerance, in order to avoid the whole system malfunctioning.
As such, this query didn’t fail. What happened was that it started to produce different results then MySQL 5.6. Namely, the result of the outer query contained the least relevant rows instead of the most relevant.
This is not a MySQL bug, if anyone thinks I’m trying to say that. This query is non-standard, no-one should ever use this kind of queries because the results are ambiguous.
What lesson did I learn from this bug?
- Test your code with unit and integration tests. It’s about database, so this is subject to an integration test. We’d never see this on production if we had this covered with a test. Unfortunately, this is a legacy project, which only starts to be covered with tests.
- Please, oh please, don’t ever write code that you don’t understand. In this case, the GROUP BY was obviously misused. I will probably rewrite the query in a more clear and efficient way, which would only select what is needed. In the meanwhile, I simply removed the outer query with GROUP BY and implemented the same functionality with PHP, and I did write a unit test for it! It was just 5 lines.
Discussion of my blog post about SQL and _not_ to use it: https://t.co/O0Q9q8DTtK
— Victor Bolshov (@crocodile2u) December 11, 2020