Strange behavior with Pandas median

Consider the following dataframe:

       b           c     d     e  f     g     h
0   6.25  2018-04-01  True   NaN  7  54.0  64.0
1  32.50  2018-04-01  True   NaN  7  54.0  64.0
2  16.75  2018-04-01  True   NaN  7  54.0  64.0
3  29.25  2018-04-01  True   NaN  7  54.0  64.0
4  21.75  2018-04-01  True   NaN  7  54.0  64.0
5  21.75  2018-04-01  True  True  7  54.0  64.0
6   7.75  2018-04-01  True  True  7  54.0  64.0
7  23.25  2018-04-01  True  True  7  54.0  64.0
8  12.25  2018-04-01  True  True  7  54.0  64.0
9  30.50  2018-04-01  True   NaN  7  54.0  64.0

(Copy the table above and run df = pd.read_clipboard() to create the dataframe.)
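If the clipboard route is inconvenient, the same frame can be built by hand. This is a minimal sketch, and it rests on two assumptions that are not stated above: column c is held as plain strings, and column e is an object column mixing NaN with True (which matches the object dtype reported further down).

```python
import numpy as np
import pandas as pd

# Values copied from the table above.
df = pd.DataFrame({
    "b": [6.25, 32.50, 16.75, 29.25, 21.75, 21.75, 7.75, 23.25, 12.25, 30.50],
    "c": ["2018-04-01"] * 10,   # assumption: dates kept as strings
    "d": [True] * 10,
    # assumption: object dtype holding NaN and True, as in the question
    "e": pd.Series([np.nan] * 5 + [True] * 4 + [np.nan], dtype=object),
    "f": [7] * 10,
    "g": [54.0] * 10,
    "h": [64.0] * 10,
})

# Inspect the dtypes before calling any reduction.
print(df.dtypes)
```

Printing the dtypes first makes it easy to confirm that column e really is object before reproducing the steps below.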

Finding the medians initially works with no problem:

df.median()
b 21.75
d 1.00
e 1.00
f 7.00
g 54.00
h 64.00
dtype: float64

However, if column b is dropped before taking the median, the median for column e disappears from the result:

new_df = df.drop(columns=['b'])
new_df.median()
d 1.0
f 7.0
g 54.0
h 64.0
dtype: float64

This is unexpected, especially since finding the median for column e by itself still works:

new_df['e'].median()
1.0

Using skipna=False does not make a difference:

new_df.median(skipna=False)
d 1.0
f 7.0
g 54.0
h 64.0
dtype: float64

It does make a difference for the original dataframe, where the median for column e becomes NaN:

df.median(skipna=False)
b 21.75
d 1.00
e NaN
f 7.00
g 54.00
h 64.00
dtype: float64

Column e has dtype object in both df and new_df, and the only difference between the two dataframes is that new_df lacks column b. Adding the column back into new_df does not resolve the issue. The problem occurs only when the first column, b, is dropped, and it does not occur if column e has a float or integer dtype.
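Since the problem does not appear when column e is numeric, one possible workaround is to cast it to float before taking the median. The sketch below is only an illustration under assumptions: it uses a cut-down stand-in for new_df (columns e, g and h only), and the name fixed is hypothetical. It does not explain why the object column is dropped in the first place.

```python
import numpy as np
import pandas as pd

# Cut-down stand-in for new_df: an object column plus two float columns.
new_df = pd.DataFrame({
    "e": pd.Series([np.nan] * 5 + [True] * 4 + [np.nan], dtype=object),
    "g": [54.0] * 10,
    "h": [64.0] * 10,
})

# Cast the object column to float: True becomes 1.0, NaN stays NaN.
fixed = new_df.assign(e=new_df["e"].astype(float))

# With every column numeric, the reduction has no object column to skip.
medians = fixed.median()
print(fixed.dtypes)
print(medians)
```

The cast returns a new frame and leaves new_df untouched, so the original object column is still available for comparison.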

This behavior is present in both pandas==0.22.0 and pandas==0.24.1.

There is now an open GitHub issue for anyone to try and solve this!