Categories
aggregate pandas pandas-groupby python

Keep other columns when doing groupby

96

I’m using groupby on a pandas dataframe to drop all rows that don’t have the minimum of a specific column. Something like this:

df1 = df.groupby("item", as_index=False)["diff"].min()

However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?

My data looks like:

    item    diff   otherstuff
   0   1       2            1
   1   1       1            2
   2   1       3            7
   3   2      -1            0
   4   2       1            3
   5   2       4            9
   6   2      -6            2
   7   3       0            0
   8   3       2            9

and should end up like:

    item   diff  otherstuff
   0   1      1           2
   1   2     -6           2
   2   3      0           0

but what I’m getting is:

    item   diff
   0   1      1           
   1   2     -6           
   2   3      0                 

I’ve been looking through the documentation and can’t find anything. I tried:

df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()

df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]

df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()

But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).

1

  • 1

    So you want to keep the entire row which contains the minimum of ‘diff’ column, including whatever values the other columns are. Yes this isn’t covered by pandas doc but should be, please raise a docbug.

    – smci

    Sep 23, 2019 at 16:45


162

Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.

3

  • 4

    The first returns all NaN, but the second method works. Thank you!

    – PointXIV

    Apr 30, 2014 at 17:48

  • 1

    Thank you for providing two ways of doing this. I have found multiple instances that one works when the other does not and visa versa. Seems like it depends on if your elements are numerical or not. Good to have both.

    – sfortney

    Aug 21, 2014 at 20:57


  • 3

    Thanks for this! I found the first one to be very slow and incorrect, but the second one was fast and did just what I wanted. I was using multiple indices simultaneously, and that may have messed with things.

    – Fractaly

    Jan 26, 2019 at 4:59

13

You can use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.sort_values(by='diff').drop_duplicates(subset="item")
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2

If possible multiple minimal values per groups and want all min rows use boolean indexing with transform for minimal values per groups:

print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64

df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0

    1

    The above answer worked great if there is / you want one min. In my case there could be multiple mins and I wanted all rows equal to min which .idxmin() doesn’t give you. This worked

    def filter_group(dfg, col):
        return dfg[dfg[col] == dfg[col].min()]
    
    df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
    df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))
    

    As an aside, .filter() is also relevant to this question but didn’t work for me.