Tags: dataframe, group-by, max, pandas, python

Get the row(s) which have the max value in groups using groupby


How do I find all rows in a pandas DataFrame which have the max value for count column, after grouping by ['Sp','Mt'] columns?

Example 1: the following DataFrame, which I group by ['Sp','Mt']:

   Sp   Mt Value   count
0  MM1  S1   a     **3**
1  MM1  S1   n       2
2  MM1  S3   cb    **5**
3  MM2  S3   mk    **8**
4  MM2  S4   bg    **10**
5  MM2  S4   dgd     1
6  MM4  S2   rd      2
7  MM4  S2   cb      2
8  MM4  S2   uyi   **7**

Expected output is to get the result rows whose count is max in each group, like this:

0  MM1  S1   a      **3**
2  MM1  S3   cb     **5**
3  MM2  S3   mk     **8**
4  MM2  S4   bg     **10** 
8  MM4  S2   uyi    **7**

Example 2: this DataFrame, which I group by ['Sp','Mt']:

   Sp   Mt   Value  count
4  MM2  S4   bg     10
5  MM2  S4   dgd    1
6  MM4  S2   rd     2
7  MM4  S2   cb     8
8  MM4  S2   uyi    8

Expected output is to get all the rows where count equals max in each group like this:

   Sp   Mt   Value  count
4  MM2  S4   bg     10
7  MM4  S2   cb     8
8  MM4  S2   uyi    8


In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count

To select the matching rows of the original DataFrame you can build a boolean mask:

In [3]: idx = df.groupby(['Mt'])['count'].transform('max') == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Note that if you have multiple max values per group, all will be returned.
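For instance, with made-up toy data (not from the question) where one group has a tied max, the mask keeps every tied row:

```python
import pandas as pd

# Toy frame with a tie: both 'S2' rows share the group max of 8.
df = pd.DataFrame({'Mt': ['S2', 'S2', 'S4', 'S4'],
                   'count': [8, 8, 10, 1]})

# The mask compares each row's count to its group's max,
# so every row that ties the max stays in the result.
mask = df.groupby('Mt')['count'].transform('max') == df['count']
result = df[mask]
```

Here `result` holds three rows: both tied `S2` rows and the single `S4` max.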

Update

On a hail mary chance that this is what the OP is requesting:

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform('max')

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7


  • @Zelazny7, is there a way to adapt this answer to grouping by one column, then looking at two columns and taking the greater of the two? I can't get that to work. What I currently have is:

        def Greater(Merge, maximumA, maximumB):
            a = Merge[maximumA]
            b = Merge[maximumB]
            return max(a, b)

        Merger.groupby("Search_Term").apply(Greater, "Ratio_x", "Ratio_y")

    Nov 15, 2017 at 20:35


  • @Zelazny7 I'm using the second, idx approach. But I can only keep a single maximum per group (and my data has a few duplicate maxima). Is there a way around this with your solution?

    – 3pitt

    Jan 3, 2018 at 20:36

  • Actually, that does not work for me. I can't track down the problem because the DataFrame is quite big, but the solution by @Rani works well.

    Feb 18, 2018 at 18:09

  • Hi Zelazny, if I want to take the top 3 maximum rows instead of one max value, how can I tweak your code?

    – Zephyr

    Nov 13, 2018 at 15:51

  • The transform method may have poor performance when the data set is large enough; getting the max value first and then merging the DataFrames will be better.

    Apr 10, 2019 at 2:54
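The max-then-merge idea from the last comment could look like the sketch below (toy data, not benchmarked here); like the transform mask, an inner join on both the group key and the count keeps all tied rows:

```python
import pandas as pd

# Toy data: the 'S4' group has a tied max of 10.
df = pd.DataFrame({'Mt': ['S1', 'S1', 'S4', 'S4'],
                   'count': [3, 2, 10, 10]})

# Compute each group's max once, then inner-join it back:
# only rows whose count equals their group's max survive the merge.
maxes = df.groupby('Mt', as_index=False)['count'].max()
result = df.merge(maxes, on=['Mt', 'count'])
```

For the earlier comment asking about the top 3 rows per group, `df.sort_values('count', ascending=False).groupby('Mt').head(3)` would be one starting point.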


You can sort the DataFrame by count and then remove duplicates. I think it's easier:

df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
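One caveat worth checking with toy data (made up here, not from the answer): with tied maxima, this keeps only one row per group, unlike the transform mask above.

```python
import pandas as pd

# Toy data with a tie in the (MM4, S2) group: two rows share count == 8.
df = pd.DataFrame({'Sp': ['MM4', 'MM4', 'MM4'],
                   'Mt': ['S2', 'S2', 'S2'],
                   'Value': ['cb', 'uyi', 'rd'],
                   'count': [8, 8, 2]})

# drop_duplicates keeps only the first row per (Sp, Mt) after sorting,
# so one of the two tied rows is silently dropped.
best = df.sort_values('count', ascending=False).drop_duplicates(['Sp', 'Mt'])
```

Here `best` is a single row with count 8, even though two rows tied for the max.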


  • Very nice! Fast with largish frames (25k rows).

    Sep 27, 2017 at 18:23

  • For those who are somewhat new to Python: you will need to assign the result to a new variable; it doesn't change the current df variable.

    – Tyler

    Dec 27, 2018 at 17:14

  • @Samir Or use inplace=True as an argument to drop_duplicates.

    – TMrtSmith

    Feb 4, 2019 at 13:11


  • This is a great answer when you need only one of the rows with the same max value; however, it won't work as expected if you need all the rows with max values.

    Apr 10, 2019 at 2:50

  • I mean, if the dataframe is pd.DataFrame({'sp': [1, 1, 2], 'mt': [1, 1, 2], 'value': [2, 2, 3]}), then there will be 2 rows with the same max value 2 in the group where sp == 1 and mt == 1. @Rani

    Apr 11, 2019 at 9:37


An easy solution is to apply the idxmax() function to get the indices of the rows with max values.
This filters the frame down to the single max-count row of each group.

In [365]: import pandas as pd

In [366]: df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

In [367]: df                                                                                                       
Out[367]: 
   count  mt   sp  val
0      3  S1  MM1    a
1      2  S1  MM1    n
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
5      1  S4  MM2  dgb
6      2  S2  MM4   rd
7      2  S2  MM4   cb
8      7  S2  MM4  uyi


### Apply idxmax() and use .loc[] on the dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]                                                       
Out[368]: 
   count  mt   sp  val
0      3  S1  MM1    a
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
8      7  S2  MM4  uyi

### Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values                                                        
Out[369]: array([0, 2, 3, 4, 8])
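As the comment below points out, idxmax picks only the first occurrence of each group's max. A small check with made-up tied data (not from the answer) makes the difference visible:

```python
import pandas as pd

# Toy group with a tie: both 'S2' rows share the max count of 8.
df = pd.DataFrame({'mt': ['S2', 'S2', 'S4'],
                   'count': [8, 8, 10]})

# idxmax returns one index per group (the first max), so the
# second tied 'S2' row is not selected.
picked = df.loc[df.groupby('mt')['count'].idxmax()]
```

Here `picked` has two rows (indices 0 and 2); the tied row at index 1 is dropped, whereas the transform-mask approach would keep it.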


  • The questioner here specified "I want to get ALL the rows where count equals max in each group", while idxmax "[r]eturn[s] index of first occurrence of maximum over requested axis" according to the docs (0.21).

    – Max Power

    Dec 19, 2017 at 11:55


  • This is a great solution, but for a different problem.

    Oct 27, 2019 at 18:40