Categories
r r-faq

How to join (merge) data frames (inner, outer, left, right)

1470

Given two data frames:

df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
#  CustomerId Product
#           1 Toaster
#           2 Toaster
#           3 Toaster
#           4   Radio
#           5   Radio
#           6   Radio

df2
#  CustomerId   State
#           2 Alabama
#           4 Alabama
#           6    Ohio

How can I do database style, i.e., sql style, joins? That is, how do I get:

  • An inner join of df1 and df2:
    Return only the rows in which the left table have matching keys in the right table.
  • An outer join of df1 and df2:
    Returns all rows from both tables, join records from the left which have matching keys in the right table.
  • A left outer join (or simply left join) of df1 and df2
    Return all rows from the left table, and any rows with matching keys from the right table.
  • A right outer join of df1 and df2
    Return all rows from the right table, and any rows with matching keys from the left table.

Extra credit:

How can I do a SQL style select statement?

3

1581

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass “CustomerId” to R as the matching variable. I think it’s almost always best to explicitly state the identifiers on which you want to merge; it’s safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)

12

  • 3

    @MattParker I have been using sqldf package for a whole host of complex queries against dataframes, really needed it to do a self-cross join (ie data.frame cross-joining itself) I wonder how it compares from a performance perspective….???

    Feb 12, 2013 at 4:42

  • 10

    @ADP I’ve never really used sqldf, so I’m not sure about speed. If performance is a major issue for you, you should also look into the data.table package – that’s a whole new set of join syntax, but it’s radically faster than anything we’re talking about here.

    Feb 12, 2013 at 16:22

  • 6

    With more clarity and explanation….. mkmanu.wordpress.com/2016/04/08/…

    Apr 7, 2016 at 20:08


  • 55

    A minor addition that was helpful for me – When you want to merge using more than one column: merge(x=df1,y=df2, by.x=c("x_col1","x_col2"), by.y=c("y_col1","y_col2"))

    Dec 1, 2016 at 10:40

  • 10

    This works in data.table now, same function just faster.

    – marbel

    Dec 2, 2016 at 19:44

251

I would recommend checking out Gabor Grothendieck’s sqldf package, which allows you to express these operations in SQL.

library(sqldf)

## inner join
df3 <- sqldf("SELECT CustomerId, Product, State 
              FROM df1
              JOIN df2 USING(CustomerID)")

## left join (substitute 'right' for right join)
df4 <- sqldf("SELECT CustomerId, Product, State 
              FROM df1
              LEFT JOIN df2 USING(CustomerID)")

I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).

See Gabor’s sqldf GitHub for more information on joins.

0

    231

    There is the data.table approach for an inner join, which is very time and memory efficient (and necessary for some larger data.frames):

    library(data.table)
    
    dt1 <- data.table(df1, key = "CustomerId") 
    dt2 <- data.table(df2, key = "CustomerId")
    
    joined.dt1.dt.2 <- dt1[dt2]
    

    merge also works on data.tables (as it is generic and calls merge.data.table)

    merge(dt1, dt2)
    

    data.table documented on stackoverflow:
    How to do a data.table merge operation
    Translating SQL joins on foreign keys to R data.table syntax
    Efficient alternatives to merge for larger data.frames R
    How to do a basic left outer join with data.table in R?

    Yet another option is the join function found in the plyr package

    library(plyr)
    
    join(df1, df2,
         type = "inner")
    
    #   CustomerId Product   State
    # 1          2 Toaster Alabama
    # 2          4   Radio Alabama
    # 3          6   Radio    Ohio
    

    Options for type: inner, left, right, full.

    From ?join: Unlike merge, [join] preserves the order of x no matter what join type is used.

    5

    • 10

      +1 for mentioning plyr::join. Microbenchmarking indicates, that it performs about 3 times faster than merge.

      May 30, 2013 at 11:28

    • 25

      However, data.table is much faster than both. There is also great support in SO, i don’t see many package writers answering questions here as often as the data.table writer or contributors.

      – marbel

      Jan 2, 2014 at 2:36


    • 6

      Please note: dt1[dt2] is a right outer join (not a “pure” inner join) so that ALL rows from dt2 will be part of the result even if there is no matching row in dt1. Impact: You result has potentially unwanted rows if you have key values in dt2 that do not match the dt1’s key values.

      – R Yoda

      Nov 11, 2015 at 7:24


    • 14

      @RYoda you can just specify nomatch = 0L in that case.

      Nov 11, 2015 at 21:20

    • Neat and quick solution! It’d still be great to know how to merge >2 df using plyr.

      Feb 2 at 7:29