Categories
greatest-n-per-group sql sql-server sql-server-2005 tsql

Get top 1 row of each group

696

I have a table which I want to get the latest entry for each group. Here’s the table:

DocumentStatusLogs Table

|ID| DocumentID | Status | DateCreated |
| 2| 1          | S1     | 7/29/2011   |
| 3| 1          | S2     | 7/30/2011   |
| 6| 1          | S1     | 8/02/2011   |
| 1| 2          | S1     | 7/28/2011   |
| 4| 2          | S2     | 7/30/2011   |
| 5| 2          | S3     | 8/01/2011   |
| 6| 3          | S1     | 8/02/2011   |

The table will be grouped by DocumentID and sorted by DateCreated in descending order. For each DocumentID, I want to get the latest status.

My preferred output:

| DocumentID | Status | DateCreated |
| 1          | S1     | 8/02/2011   |
| 2          | S3     | 8/01/2011   |
| 3          | S1     | 8/02/2011   |
  • Is there any aggregate function to get only the top from each group? See pseudo-code GetOnlyTheTop below:

    SELECT
      DocumentID,
      GetOnlyTheTop(Status),
      GetOnlyTheTop(DateCreated)
    FROM DocumentStatusLogs
    GROUP BY DocumentID
    ORDER BY DateCreated DESC
    
  • If such function doesn’t exist, is there any way I can achieve the output I want?

  • Or at the first place, could this be caused by unnormalized database? I’m thinking, since what I’m looking for is just one row, should that status also be located in the parent table?

Please see the parent table for more information:

Current Documents Table

| DocumentID | Title  | Content  | DateCreated |
| 1          | TitleA | ...      | ...         |
| 2          | TitleB | ...      | ...         |
| 3          | TitleC | ...      | ...         |

Should the parent table be like this so that I can easily access its status?

| DocumentID | Title  | Content  | DateCreated | CurrentStatus |
| 1          | TitleA | ...      | ...         | s1            |
| 2          | TitleB | ...      | ...         | s3            |
| 3          | TitleC | ...      | ...         | s1            |

UPDATE
I just learned how to use “apply” which makes it easier to address such problems.

4

891

;WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY DocumentID ORDER BY DateCreated DESC) AS rn
   FROM DocumentStatusLogs
)
SELECT *
FROM cte
WHERE rn = 1

If you expect 2 entries per day, then this will arbitrarily pick one. To get both entries for a day, use DENSE_RANK instead

As for normalised or not, it depends if you want to:

  • maintain status in 2 places
  • preserve status history

As it stands, you preserve status history. If you want latest status in the parent table too (which is denormalisation) you’d need a trigger to maintain “status” in the parent. or drop this status history table.

12

  • 5

    And… What is Partition By? With is new to me also 🙁 I’m using mssql 2005 anyway.

    – dpp

    Jul 27, 2011 at 8:48

  • 8

    @domanokz: Partition By resets the count. So in this case, it says to count per DocumentID

    – gbn

    Jul 27, 2011 at 8:50

  • 2

    Hm, I worry about the performance, I’ll be querying millions of rows. Is SELECT * FROM (SELECT …) affects the performance? Also, is ROW_NUMBER some kind of a subquery for each row?

    – dpp

    Jul 27, 2011 at 9:21

  • 1

    @domanokz: no, it’s not a subquery. If you have correct indexes then millions shouldn’t be a problem. There are only 2 set based ways anyway: this and the aggregate (Ariel’s solution). So try them both…

    – gbn

    Jul 27, 2011 at 9:30

  • 1

    @domanokz: Just change ORDER BY DateCreated DESC to ORDER BY ID DESC

    – gbn

    Jul 27, 2011 at 9:52

233

I just learned how to use cross apply. Here’s how to use it in this scenario:

 select d.DocumentID, ds.Status, ds.DateCreated 
 from Documents as d 
 cross apply 
     (select top 1 Status, DateCreated
      from DocumentStatusLogs 
      where DocumentID = d.DocumentId
      order by DateCreated desc) as ds

16

  • 2

    That actually makes no difference since the issue is still addressed.

    – dpp

    Sep 5, 2012 at 5:57


  • 30

    I just posted the results of my timing tests against all of the proposed solutions and yours came out on top. Giving you an up vote 🙂

    Mar 7, 2015 at 15:00

  • 5

    +1 for huge speed improvement. This is much faster than a windowing function such as ROW_NUMBER(). It would be nice if SQL recognized ROW_NUMBER() = 1 like queries and optimized them into Applies. Note: I used OUTER APPLY as I needed results, even if they didn’t exist in the apply.

    Oct 19, 2015 at 14:17

  • 9

    @TamusJRoyce you can’t extrapolate that just because it was faster once this is always the case. It depends. As described here sqlmag.com/database-development/optimizing-top-n-group-queries

    Jun 3, 2016 at 21:26

  • 3

    This works well when you already have a separate Documents table that gives one row per group, as desired in the output. But if you’re only working with the one table (DocumentStatusLogs in this case), you’d first have to do some sort of DISTINCT operation on DocumentID (or ROW_NUMBER(), MAX(ID), etc.), losing all that gained performance.

    Aug 3, 2020 at 4:25


181

I know this is an old thread but the TOP 1 WITH TIES solutions is quite nice and might be helpful to some reading through the solutions.

select top 1 with ties
   DocumentID
  ,Status
  ,DateCreated
from DocumentStatusLogs
order by row_number() over (partition by DocumentID order by DateCreated desc)

The select top 1 with ties clause tells SQL Server that you want to return the first row per group. But how does SQL Server know how to group up the data? This is where the order by row_number() over (partition by DocumentID order by DateCreated desc comes in. The column/columns after partition by defines how SQL Server groups up the data. Within each group, the rows will be sorted based on the order by columns. Once sorted, the top row in each group will be returned in the query.

More about the TOP clause can be found here.

8

  • 15

    This is the most elegant solution imo

    Oct 16, 2018 at 13:20

  • 2

    agreed – this best replicates what is very easy to do in other versions of SQL and other languages imo

    Nov 12, 2019 at 21:38

  • 3

    Wish I could upvote more than once. I have returned to this answer about 7.000 times already. There might come a day, when I take the time to understand this, so I wouldn’t have to come back. But it is not this day.

    – mpn275

    Sep 9, 2020 at 9:56

  • 3

    Hmm, ‘With Ties’ might cause more rows to be returned than the value specified in expression (TOP 1). If the OP wants only 1, then you need to remove this phrase, right?

    – TK Bruin

    Oct 27, 2020 at 17:26

  • 2

    @TKBruin that is why the order by row_number() is required. This allows the top record per partition to be retrieved.

    Oct 28, 2020 at 11:48