Categories
concatenation join merge pandas python

Pandas Merging 101

786

  • How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
  • How do I add NaNs for missing rows after a merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • How do I merge multiple DataFrames?
  • Cross join with pandas
  • merge? join? concat? update? Who? What? Why?!

… and more. I’ve seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.


Table of Contents

For ease of access.

0

    1099

    +800

    This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

    In particular, here’s what this post will go through:

    • The basics – types of joins (LEFT, RIGHT, OUTER, INNER)

      • merging with different column names
      • merging with multiple columns
      • avoiding duplicate merge key column in output

    What this post (and other posts by me on this thread) will not go through:

    • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
    • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

    Note
    Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

    Furthermore, all the DataFrames here can be copied and replicated so
    you can play with them. Also, see this
    post

    on how to read DataFrames from your clipboard.

    Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.



    Enough talk – just show me how to use merge!

    Setup & Basics

    np.random.seed(0)
    left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
    right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
    
    left
    
      key     value
    0   A  1.764052
    1   B  0.400157
    2   C  0.978738
    3   D  2.240893
    
    right
    
      key     value
    0   B  1.867558
    1   D -0.977278
    2   E  0.950088
    3   F -0.151357
    

    For the sake of simplicity, the key column has the same name (for now).

    An INNER JOIN is represented by

    Note
    This, along with the forthcoming figures all follow this convention:

    • blue indicates rows that are present in the merge result
    • red indicates rows that are excluded from the result (i.e., removed)
    • green indicates missing values that are replaced with NaNs in the result

    To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

    left.merge(right, on='key')
    # Or, if you want to be explicit
    # left.merge(right, on='key', how='inner')
    
      key   value_x   value_y
    0   B  0.400157  1.867558
    1   D  2.240893 -0.977278
    

    This returns only rows from left and right which share a common key (in this example, “B” and “D).

    A LEFT OUTER JOIN, or LEFT JOIN is represented by

    This can be performed by specifying how='left'.

    left.merge(right, on='key', how='left')
    
      key   value_x   value_y
    0   A  1.764052       NaN
    1   B  0.400157  1.867558
    2   C  0.978738       NaN
    3   D  2.240893 -0.977278
    

    Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

    And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is…

    …specify how='right':

    left.merge(right, on='key', how='right')
    
      key   value_x   value_y
    0   B  0.400157  1.867558
    1   D  2.240893 -0.977278
    2   E       NaN  0.950088
    3   F       NaN -0.151357
    

    Here, keys from right are used, and missing data from left is replaced by NaN.

    Finally, for the FULL OUTER JOIN, given by

    specify how='outer'.

    left.merge(right, on='key', how='outer')
    
      key   value_x   value_y
    0   A  1.764052       NaN
    1   B  0.400157  1.867558
    2   C  0.978738       NaN
    3   D  2.240893 -0.977278
    4   E       NaN  0.950088
    5   F       NaN -0.151357
    

    This uses the keys from both frames, and NaNs are inserted for missing rows in both.

    The documentation summarizes these various merges nicely:

    Enter image description here


    Other JOINs – LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

    If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs in two steps.

    For LEFT-Excluding JOIN, represented as

    Start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right),

    (left.merge(right, on='key', how='left', indicator=True)
         .query('_merge == "left_only"')
         .drop('_merge', 1))
    
      key   value_x  value_y
    0   A  1.764052      NaN
    2   C  0.978738      NaN
    

    Where,

    left.merge(right, on='key', how='left', indicator=True)
    
      key   value_x   value_y     _merge
    0   A  1.764052       NaN  left_only
    1   B  0.400157  1.867558       both
    2   C  0.978738       NaN  left_only
    3   D  2.240893 -0.977278       both

    And similarly, for a RIGHT-Excluding JOIN,

    (left.merge(right, on='key', how='right', indicator=True)
         .query('_merge == "right_only"')
         .drop('_merge', 1))
    
      key  value_x   value_y
    2   E      NaN  0.950088
    3   F      NaN -0.151357

    Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

    You can do this in similar fashion—

    (left.merge(right, on='key', how='outer', indicator=True)
         .query('_merge != "both"')
         .drop('_merge', 1))
    
      key   value_x   value_y
    0   A  1.764052       NaN
    2   C  0.978738       NaN
    4   E       NaN  0.950088
    5   F       NaN -0.151357
    

    Different names for key columns

    If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

    left2 = left.rename({'key':'keyLeft'}, axis=1)
    right2 = right.rename({'key':'keyRight'}, axis=1)
    
    left2
    
      keyLeft     value
    0       A  1.764052
    1       B  0.400157
    2       C  0.978738
    3       D  2.240893
    
    right2
    
      keyRight     value
    0        B  1.867558
    1        D -0.977278
    2        E  0.950088
    3        F -0.151357
    
    left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
    
      keyLeft   value_x keyRight   value_y
    0       B  0.400157        B  1.867558
    1       D  2.240893        D -0.977278
    

    Avoiding duplicate key column in output

    When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

    left3 = left2.set_index('keyLeft')
    left3.merge(right2, left_index=True, right_on='keyRight')
    
        value_x keyRight   value_y
    0  0.400157        B  1.867558
    1  2.240893        D -0.977278
    

    Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you’ll notice keyLeft is missing. You can figure out what column to keep based on which frame’s index is set as the key. This may matter when, say, performing some OUTER JOIN operation.


    Merging only a single column from one of the DataFrames

    For example, consider

    right3 = right.assign(newcol=np.arange(len(right)))
    right3
      key     value  newcol
    0   B  1.867558       0
    1   D -0.977278       1
    2   E  0.950088       2
    3   F -0.151357       3
    

    If you are required to merge only “newcol” (without any of the other columns), you can usually just subset columns before merging:

    left.merge(right3[['key', 'newcol']], on='key')
    
      key     value  newcol
    0   B  0.400157       0
    1   D  2.240893       1
    

    If you’re doing a LEFT OUTER JOIN, a more performant solution would involve map:

    # left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
    left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
    
      key     value  newcol
    0   A  1.764052     NaN
    1   B  0.400157     0.0
    2   C  0.978738     NaN
    3   D  2.240893     1.0
    

    As mentioned, this is similar to, but faster than

    left.merge(right3[['key', 'newcol']], on='key', how='left')
    
      key     value  newcol
    0   A  1.764052     NaN
    1   B  0.400157     0.0
    2   C  0.978738     NaN
    3   D  2.240893     1.0
    

    Merging on multiple columns

    To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

    left.merge(right, on=['key1', 'key2'] ...)
    

    Or, in the event the names are different,

    left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
    

    Other useful merge* operations and functions

    This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.



    Continue Reading

    Jump to other topics in Pandas Merging 101 to continue learning:

    *You are here.

    2

    • 5

      If anyone is confused by the table of contents at the end of each post, I split up this massive answer into 4 separate ones, 3 on this question and 1 on another. The way it was setup previously made it harder to reference folks to specific topics. This allows you to bookmark separate topics easily now!

      – cs95

      Dec 17, 2020 at 10:17


    • 2

      This is an awesome resource! The only question I still have is why call it merge instead of join, and join instead of merge?

      Apr 8, 2021 at 22:31

    88

    A supplemental visual view of pd.concat([df0, df1], kwargs).
    Notice that, kwarg axis=0 or axis=1 ‘s meaning is not as intuitive as df.mean() or df.apply(func)


    on pd.concat([df0, df1])

    7

    • 12

      This is a nice diagram. May I ask how you produced it?

      – cs95

      May 20, 2019 at 17:27

    • 8

      google doc’s built-in “insert ==> drawing… ==> new” (as of 2019-May). But, to be clear: the only reason I used google doc for this picture is because my notes is stored in google doc, and I would like a picture that can be modified quickly within google doc itself. Actually now you mentioned it, the google doc’s drawing tool is pretty neat.

      – eliu

      May 21, 2019 at 18:19

    • 1

      Wow, this is great. Coming from the SQL world, “vertical” join is not a join in my head, as the table’s structure is always fixed. Now even think pandas should consolidate concat and merge with a direction parameter being horizontal or vertical.

      – Ufos

      Aug 1, 2019 at 11:16

    • 3

      @Ufos Isn’t that exactly what axis=1 and axis=0 is?

      – cs95

      Aug 6, 2019 at 17:31


    • 4

      yes, there’re now merge and concat and axis and whatever. However, as @eliu shows, it’s all just the same concept of merge with “left” and “right” and “horizontal” or “vertical”. I, personally, have to look into the documentation every time I have to remember which “axis” is 0 and which is 1.

      – Ufos

      Aug 19, 2019 at 9:32

    51

    Joins 101

    These animations might be better to explain you visually.
    Credits: Garrick Aden-Buie tidyexplain repo

    Inner Join

    enter image description here

    Outer Join or Full Join

    enter image description here

    Right Join

    enter image description here

    Left Join

    enter image description here

    2

    • 5

      These are awesome!

      Dec 1, 2021 at 4:05

    • 2

      I appreciate the effort put in to achieve this. Beautifully done.

      Feb 25 at 5:22