Categories
dataframe r r-faq sorting

Sort (order) data frame rows by multiple columns

1451

I want to sort a data frame by multiple columns. For example, with the data frame below I would like to sort by column ‘z’ (descending) then by column ‘b’ (ascending):

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2

0

    1755

    You can use the order() function directly without resorting to add-on tools — see this simpler answer which uses a trick right from the top of the example(order) code:

    R> dd[with(dd, order(-z, b)), ]
        b x y z
    4 Low C 9 2
    2 Med D 3 1
    1  Hi A 8 1
    3  Hi A 9 1
    

    Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function:

    R> dd[order(-dd[,4], dd[,1]), ]
        b x y z
    4 Low C 9 2
    2 Med D 3 1
    1  Hi A 8 1
    3  Hi A 9 1
    R> 
    

    rather than using the name of the column (and with() for easier/more direct access).

    4

    • 16

      Should work the same way, but you can’t use with. Try M <- matrix(c(1,2,2,2,3,6,4,5), 4, 2, byrow=FALSE, dimnames=list(NULL, c("a","b"))) to create a matrix M, then use M[order(M[,"a"],-M[,"b"]),] to order it on two columns.

      Mar 27, 2012 at 12:41

    • 7

      Easy enough: dd[ order(-dd[,4], dd[,1]), ], but can’t use with for name-based subsetting.

      Oct 21, 2012 at 14:34


    • why is dd[ order(-dd[,4],, ] not valid or ‘dd[ order(-dd[,4], ]’ basically why is dd[,1] required? is -dd[,4] not enough if you just want to sort by 1 column?

      Jul 30, 2014 at 22:11


    • 27

      The “invalid argument to unary operator” error occurs when you use minus with a character column. Solve it by wrapping the column in xtfrm, for example dd[ order(-xtfrm(dd[,4]), dd[,1]), ].

      Mar 24, 2015 at 11:40

    551

    Your choices

    • order from base
    • arrange from dplyr
    • setorder and setorderv from data.table
    • arrange from plyr
    • sort from taRifx
    • orderBy from doBy
    • sortData from Deducer

    Most of the time you should use the dplyr or data.table solutions, unless having no-dependencies is important, in which case use base::order.


    I recently added sort.data.frame to a CRAN package, making it class compatible as discussed here:
    Best way to create generic/method consistency for sort.data.frame?

    Therefore, given the data.frame dd, you can sort as follows:

    dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
          levels = c("Low", "Med", "Hi"), ordered = TRUE),
          x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
          z = c(1, 1, 1, 2))
    library(taRifx)
    sort(dd, f= ~ -z + b )
    

    If you are one of the original authors of this function, please contact me. Discussion as to public domaininess is here: https://chat.stackoverflow.com/transcript/message/1094290#1094290


    You can also use the arrange() function from plyr as Hadley pointed out in the above thread:

    library(plyr)
    arrange(dd,desc(z),b)
    

    Benchmarks: Note that I loaded each package in a new R session since there were a lot of conflicts. In particular loading the doBy package causes sort to return “The following object(s) are masked from ‘x (position 17)’: b, x, y, z”, and loading the Deducer package overwrites sort.data.frame from Kevin Wright or the taRifx package.

    #Load each time
    dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
          levels = c("Low", "Med", "Hi"), ordered = TRUE),
          x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
          z = c(1, 1, 1, 2))
    library(microbenchmark)
    
    # Reload R between benchmarks
    microbenchmark(dd[with(dd, order(-z, b)), ] ,
        dd[order(-dd$z, dd$b),],
        times=1000
    )
    

    Median times:

    dd[with(dd, order(-z, b)), ] 778

    dd[order(-dd$z, dd$b),] 788

    library(taRifx)
    microbenchmark(sort(dd, f= ~-z+b ),times=1000)
    

    Median time: 1,567

    library(plyr)
    microbenchmark(arrange(dd,desc(z),b),times=1000)
    

    Median time: 862

    library(doBy)
    microbenchmark(orderBy(~-z+b, data=dd),times=1000)
    

    Median time: 1,694

    Note that doBy takes a good bit of time to load the package.

    library(Deducer)
    microbenchmark(sortData(dd,c("z","b"),increasing= c(FALSE,TRUE)),times=1000)
    

    Couldn’t make Deducer load. Needs JGR console.

    esort <- function(x, sortvar, ...) {
    attach(x)
    x <- x[with(x,order(sortvar,...)),]
    return(x)
    detach(x)
    }
    
    microbenchmark(esort(dd, -z, b),times=1000)
    

    Doesn’t appear to be compatible with microbenchmark due to the attach/detach.


    m <- microbenchmark(
      arrange(dd,desc(z),b),
      sort(dd, f= ~-z+b ),
      dd[with(dd, order(-z, b)), ] ,
      dd[order(-dd$z, dd$b),],
      times=1000
      )
    
    uq <- function(x) { fivenum(x)[4]}  
    lq <- function(x) { fivenum(x)[2]}
    
    y_min <- 0 # min(by(m$time,m$expr,lq))
    y_max <- max(by(m$time,m$expr,uq)) * 1.05
      
    p <- ggplot(m,aes(x=expr,y=time)) + coord_cartesian(ylim = c( y_min , y_max )) 
    p + stat_summary(fun.y=median,fun.ymin = lq, fun.ymax = uq, aes(fill=expr))
    

    microbenchmark plot

    (lines extend from lower quartile to upper quartile, dot is the median)


    Given these results and weighing simplicity vs. speed, I’d have to give the nod to arrange in the plyr package. It has a simple syntax and yet is almost as speedy as the base R commands with their convoluted machinations. Typically brilliant Hadley Wickham work. My only gripe with it is that it breaks the standard R nomenclature where sorting objects get called by sort(object), but I understand why Hadley did it that way due to issues discussed in the question linked above.

    6

    • 5

      The ggplot2 microbenchmark function above is now available as taRifx::autoplot.microbenchmark.

      Jun 1, 2012 at 1:23

    • @AriB.Friedman using ‘arrange’, how do we sort by ascending? I never see examples sorting in ascending order. I tried ‘asc’ instead of ‘desc’ and it doesn’t work. thanks

      – AME

      Oct 12, 2013 at 6:37

    • 4

      @AME look at how b is sorted in the sample. The default is sort by ascending, so you just don’t wrap it in desc. Ascending in both: arrange(dd,z,b) . Descending in both: arrange(dd,desc(z),desc(b)).

      Oct 12, 2013 at 10:16

    • 4

      As per ?arrange: “# NOTE: plyr functions do NOT preserve row.names”. This makes the excellent arrange() function suboptimal if one wants to keep row.names.

      – landroni

      Mar 10, 2014 at 16:31


    • Some of these that use order might be a bit faster if you use sort.list(x, method=“radix”) instead.

      Jul 2, 2015 at 11:00

    168

    Dirk’s answer is great. It also highlights a key difference in the syntax used for indexing data.frames and data.tables:

    ## The data.frame way
    dd[with(dd, order(-z, b)), ]
    
    ## The data.table way: (7 fewer characters, but that's not the important bit)
    dd[order(-z, b)]
    

    The difference between the two calls is small, but it can have important consequences. Especially if you write production code and/or are concerned with correctness in your research, it’s best to avoid unnecessary repetition of variable names. data.table
    helps you do this.

    Here’s an example of how repetition of variable names might get you into trouble:

    Let’s change the context from Dirk’s answer, and say this is part of a bigger project where there are a lot of object names and they are long and meaningful; instead of dd it’s called quarterlyreport. It becomes :

    quarterlyreport[with(quarterlyreport,order(-z,b)),]
    

    Ok, fine. Nothing wrong with that. Next your boss asks you to include last quarter’s report in the report. You go through your code, adding an object lastquarterlyreport in various places and somehow (how on earth?) you end up with this :

    quarterlyreport[with(lastquarterlyreport,order(-z,b)),]
    

    That isn’t what you meant but you didn’t spot it because you did it fast and it’s nestled on a page of similar code. The code doesn’t fall over (no warning and no error) because R thinks it is what you meant. You’d hope whoever reads your report spots it, but maybe they don’t. If you work with programming languages a lot then this situation may be all to familiar. It was a “typo” you’ll say. I’ll fix the “typo” you’ll say to your boss.

    In data.table we’re concerned about tiny details like this. So we’ve done something simple to avoid typing variable names twice. Something very simple. i is evaluated within the frame of dd already, automatically. You don’t need with() at all.

    Instead of

    dd[with(dd, order(-z, b)), ]
    

    it’s just

    dd[order(-z, b)]
    

    And instead of

    quarterlyreport[with(lastquarterlyreport,order(-z,b)),]
    

    it’s just

    quarterlyreport[order(-z,b)]
    

    It’s a very small difference, but it might just save your neck one day. When weighing up the different answers to this question, consider counting the repetitions of variable names as one of your criteria in deciding. Some answers have quite a few repeats, others have none.

    2

    • 11

      +1 This is a great point, and gets at a detail of R’s syntax that has often irritated me. I sometimes use subset() just to avoid having to repeatedly refer to the same object within a single call.

      May 25, 2012 at 20:45


    • 8

      I guess you could add the new setorder function too here, as this thread is where we send all the order type dupes.

      Jan 8, 2015 at 19:18