Categories
aggregate-functions greatest-n-per-group groupwise-maximum mysql sql

SQL select only rows with max value on a column [duplicate]

1540

I have this table for documents (simplified version here):

idrevcontent
11
21
12
13

How do I select one row per id and only the greatest rev?
With the above data, the result should contain two rows: [1, 3, ...] and [2, 1, ..]. I’m using MySQL.

Currently I use checks in the while loop to detect and over-write old revs from the resultset. But is this the only method to achieve the result? Isn’t there a SQL solution?

11

2408

At first glance…

All you need is a GROUP BY clause with the MAX aggregate function:

SELECT id, MAX(rev)
FROM YourTable
GROUP BY id

It’s never that simple, is it?

I just noticed you need the content column as well.

This is a very common question in SQL: find the whole data for the row with some max value in a column per some group identifier. I heard that a lot during my career. Actually, it was one the questions I answered in my current job’s technical interview.

It is, actually, so common that Stack Overflow community has created a single tag just to deal with questions like that: .

Basically, you have two approaches to solve that problem:

Joining with simple group-identifier, max-value-in-group Sub-query

In this approach, you first find the group-identifier, max-value-in-group (already solved above) in a sub-query. Then you join your table to the sub-query with equality on both group-identifier and max-value-in-group:

SELECT a.id, a.rev, a.contents
FROM YourTable a
INNER JOIN (
    SELECT id, MAX(rev) rev
    FROM YourTable
    GROUP BY id
) b ON a.id = b.id AND a.rev = b.rev

Left Joining with self, tweaking join conditions and filters

In this approach, you left join the table with itself. Equality goes in the group-identifier. Then, 2 smart moves:

  1. The second join condition is having left side value less than right value
  2. When you do step 1, the row(s) that actually have the max value will have NULL in the right side (it’s a LEFT JOIN, remember?). Then, we filter the joined result, showing only the rows where the right side is NULL.

So you end up with:

SELECT a.*
FROM YourTable a
LEFT OUTER JOIN YourTable b
    ON a.id = b.id AND a.rev < b.rev
WHERE b.id IS NULL;

Conclusion

Both approaches bring the exact same result.

If you have two rows with max-value-in-group for group-identifier, both rows will be in the result in both approaches.

Both approaches are SQL ANSI compatible, thus, will work with your favorite RDBMS, regardless of its “flavor”.

Both approaches are also performance friendly, however your mileage may vary (RDBMS, DB Structure, Indexes, etc.). So when you pick one approach over the other, benchmark. And make sure you pick the one which make most of sense to you.

6

  • 7

    This is a really bad idea because the field that you want to maximize might be a double, and comparing doubles for equality is non-deterministic. I think only the O(n^2) algorithm works here.

    Mar 21, 2021 at 16:31


  • @Adriano how would this work if there would be an extra column user_id , and you want to limit your results to that user_id ? I suppose that filter should happen quite early on to avoid it’s joining stuff together from irrelevant user_ids that will be ditched later?

    Aug 26, 2021 at 16:14

  • I’m not sure the two approaches will “bring the exact same result”: I think the second approach will keep records for which the rev field is NULL (they will have no match in the join), but the first approach will not keep them (their rev is not the max so they are not selected).

    – a3nm

    Sep 16, 2021 at 12:01

  • 2

    Another way is by using window functions. They seem to offer better performance. I would do something like: SELECT DISTINCT id, MAX(rev) OVER (PARTITION BY id), FIRST_VALUE(content) OVER (PARTITION BY id ORDER BY rev DESC) FROM YourTable

    – Marcos

    Dec 7, 2021 at 18:37


  • 4

    @mk3009hppw: Comparing doubles for equality is entirely deterministic, although the idea that it’s somehow not is a common misconception. What people usually mean by it (if they’re not just parroting something they heard from elsewhere) is that inexact floating point calculations (which could be as simple as 0.1 + 0.2) may not return exactly the “expected” result (0.3) due to rounding, or that comparing numeric types with different precision may behave unexpectedly. But neither of those happens here.

    Dec 8, 2021 at 0:15

368

My preference is to use as little code as possible…

You can do it using IN
try this:

SELECT * 
FROM t1 WHERE (id,rev) IN 
( SELECT id, MAX(rev)
  FROM t1
  GROUP BY id
)

to my mind it is less complicated… easier to read and maintain.

12

  • 36

    Curious – which database engine can we use this type of WHERE clause in? This is not supported in SQL Server.

    – Kash

    Nov 17, 2011 at 17:04

  • 32

    oracle & mysql (not sure about other databases sorry)

    Nov 17, 2011 at 18:03


  • 34

    Works on PostgreSQL too.

    – lcguida

    Jan 15, 2014 at 17:43

  • 19

    Confirmed working in DB2

    Jan 29, 2014 at 2:32

  • 24

    Does not work with SQLite.

    Oct 26, 2014 at 20:32

184

I am flabbergasted that no answer offered SQL window function solution:

SELECT a.id, a.rev, a.contents
  FROM (SELECT id, rev, contents,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY rev DESC) ranked_order
          FROM YourTable) a
 WHERE a.ranked_order = 1 

Added in SQL standard ANSI/ISO Standard SQL:2003 and later extended with ANSI/ISO Standard SQL:2008, window (or windowing) functions are available with all major vendors now. There are more types of rank functions available to deal with a tie issue: RANK, DENSE_RANK, PERSENT_RANK.

6

  • 5

    intuition is tricky thing. I find it more intuitive than other answers as it builds explicit data structure that answers the question. But, again, intuition is the other side of bias…

    – topchef

    Jan 10, 2017 at 18:22


  • 13

    This might work in MariaDB 10.2 and MySQL 8.0.2, but not before.

    Apr 1, 2017 at 22:01

  • 2

    The approach of window functions should be preferred due to simplicity.

    Mar 30, 2021 at 20:01

  • Yes, window function seems to be a better approach. At least it has better performance. I would use MAX and FIRST_VALUE functions, though: SELECT DISTINCT id, MAX(rev) OVER (PARTITION BY id), FIRST_VALUE(content) OVER (PARTITION BY id ORDER BY rev DESC) FROM YourTable

    – Marcos

    Dec 7, 2021 at 18:35

  • 2

    This is the more efficient way of doing this compared to correlated queries (performance killer) or other aggregate functions. This should now be marked as accepted answer.

    Feb 1 at 15:00