Categories
dataframe numpy pandas python

Pandas conditional creation of a series/dataframe column

447

How do I add a color column to the following dataframe so that color="green" if Set == 'Z', and color="red" otherwise?

    Type       Set
1    A          Z
2    B          Z           
3    B          X
4    C          Y

0

    958

    If you only have two choices to select from:

    df['color'] = np.where(df['Set']=='Z', 'green', 'red')
    

    For example,

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
    df['color'] = np.where(df['Set']=='Z', 'green', 'red')
    print(df)
    

    yields

      Set Type  color
    0   Z    A  green
    1   Z    B  green
    2   X    B    red
    3   Y    C    red
    

    If you have more than two conditions then use np.select. For example, if you want color to be

    • yellow when (df['Set'] == 'Z') & (df['Type'] == 'A')
    • otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B')
    • otherwise purple when (df['Type'] == 'B')
    • otherwise black,

    then use

    df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
    conditions = [
        (df['Set'] == 'Z') & (df['Type'] == 'A'),
        (df['Set'] == 'Z') & (df['Type'] == 'B'),
        (df['Type'] == 'B')]
    choices = ['yellow', 'blue', 'purple']
    df['color'] = np.select(conditions, choices, default="black")
    print(df)
    

    which yields

      Set Type   color
    0   Z    A  yellow
    1   Z    B    blue
    2   X    B  purple
    3   Y    C   black
    

    4

    • 1

      what is the reason for having to use numpy?

      – cikatomo

      Mar 26, 2021 at 3:59

    • It’s the library pandas was built with. As it already has this functionality there’s no need for pandas to implement it.

      – aydow

      Jul 9, 2021 at 5:22

    • 2

      This code now (January 2022) returns A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

      – Luis

      Jan 17 at 16:48

    • 1

      @Luis , for your case, it’s not related to the np.select function, but rather caused by how you assign new Series / DataFrame values. And the message is simply a warning. Please check this out: pandas.pydata.org/docs/user_guide/…

      – Azhar

      Jun 19 at 13:46

    166

    List comprehension is another way to create another column conditionally. If you are working with object dtypes in columns, like in your example, list comprehensions typically outperform most other methods.

    Example list comprehension:

    df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
    

    %timeit tests:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
    %timeit df['color'] = ['red' if x == 'Z' else 'green' for x in df['Set']]
    %timeit df['color'] = np.where(df['Set']=='Z', 'green', 'red')
    %timeit df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
    
    1000 loops, best of 3: 239 µs per loop
    1000 loops, best of 3: 523 µs per loop
    1000 loops, best of 3: 263 µs per loop
    

    6

    • 13

      Note that, with much larger dataframes (think pd.DataFrame({'Type':list('ABBC')*100000, 'Set':list('ZZXY')*100000})-size), numpy.where outpaces map, but the list comprehension is king (about 50% faster than numpy.where).

      – blacksite

      Apr 20, 2017 at 16:45

    • 5

      Can the list comprehension method be used if the condition needs information from multiple columns? I am looking for something like this (this does not work): df['color'] = ['red' if (x['Set'] == 'Z') & (x['Type'] == 'B') else 'green' for x in df]

      – Matti

      Jan 1, 2019 at 6:42

    • 2

      Add iterrows to the dataframe, then you can access multiple columns via row: [‘red’ if (row[‘Set’] == ‘Z’) & (row[‘Type’] == ‘B’) else ‘green’ for index, row in in df.iterrows()]

      Jan 14, 2019 at 1:38


    • 1

      Note this nice solution will not work if you need to take replacement values from another series in the data frame, such as df['color_type'] = np.where(df['Set']=='Z', 'green', df['Type'])

      Sep 17, 2019 at 15:28

    • 2

      @cheekybastard Or don’t, since .iterrows() is notoriously sluggish and the DataFrame shouldn’t be modified while iterating.

      – AMC

      Feb 10, 2020 at 1:51

    28

    Another way in which this could be achieved is

    df['color'] = df.Set.map( lambda x: 'red' if x == 'Z' else 'green')
    

    0