How to sort a dataframe by multiple column(s)


Question

I want to sort a data.frame by multiple columns. For example, with the data.frame below I would like to sort by column z (descending) then by column b (ascending):

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
dd
    b x y z
1  Hi A 8 1
2 Med D 3 1
3  Hi A 9 1
4 Low C 9 2
1
1245
4/3/2019 4:37:14 PM

Accepted Answer

You can use the order() function directly without resorting to add-on tools -- see this simpler answer which uses a trick right from the top of the example(order) code:

R> dd[with(dd, order(-z, b)), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function:

R> dd[order(-dd[,4], dd[,1]), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
R> 

rather than using the name of the column (and with() for easier/more direct access).

1550
9/21/2018 1:42:37 PM

Your choices

  • order from base
  • arrange from dplyr
  • setorder and setorderv from data.table
  • arrange from plyr
  • sort from taRifx
  • orderBy from doBy
  • sortData from Deducer

Most of the time you should use the dplyr or data.table solutions, unless having no-dependencies is important, in which case use base::order.


I recently added sort.data.frame to a CRAN package, making it class compatible as discussed here: Best way to create generic/method consistency for sort.data.frame?

Therefore, given the data.frame dd, you can sort as follows:

dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
library(taRifx)
sort(dd, f= ~ -z + b )

If you are one of the original authors of this function, please contact me. Discussion as to public domaininess is here: http://chat.stackoverflow.com/transcript/message/1094290#1094290


You can also use the arrange() function from plyr as Hadley pointed out in the above thread:

library(plyr)
arrange(dd,desc(z),b)

Benchmarks: Note that I loaded each package in a new R session since there were a lot of conflicts. In particular loading the doBy package causes sort to return "The following object(s) are masked from 'x (position 17)': b, x, y, z", and loading the Deducer package overwrites sort.data.frame from Kevin Wright or the taRifx package.

#Load each time
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"), 
      levels = c("Low", "Med", "Hi"), ordered = TRUE),
      x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
      z = c(1, 1, 1, 2))
library(microbenchmark)

# Reload R between benchmarks
microbenchmark(dd[with(dd, order(-z, b)), ] ,
    dd[order(-dd$z, dd$b),],
    times=1000
)

Median times:

dd[with(dd, order(-z, b)), ] 778

dd[order(-dd$z, dd$b),] 788

library(taRifx)
microbenchmark(sort(dd, f= ~-z+b ),times=1000)

Median time: 1,567

library(plyr)
microbenchmark(arrange(dd,desc(z),b),times=1000)

Median time: 862

library(doBy)
microbenchmark(orderBy(~-z+b, data=dd),times=1000)

Median time: 1,694

Note that doBy takes a good bit of time to load the package.

library(Deducer)
microbenchmark(sortData(dd,c("z","b"),increasing= c(FALSE,TRUE)),times=1000)

Couldn't make Deducer load. Needs JGR console.

esort <- function(x, sortvar, ...) {
attach(x)
x <- x[with(x,order(sortvar,...)),]
return(x)
detach(x)
}

microbenchmark(esort(dd, -z, b),times=1000)

Doesn't appear to be compatible with microbenchmark due to the attach/detach.


m <- microbenchmark(
  arrange(dd,desc(z),b),
  sort(dd, f= ~-z+b ),
  dd[with(dd, order(-z, b)), ] ,
  dd[order(-dd$z, dd$b),],
  times=1000
  )

uq <- function(x) { fivenum(x)[4]}  
lq <- function(x) { fivenum(x)[2]}

y_min <- 0 # min(by(m$time,m$expr,lq))
y_max <- max(by(m$time,m$expr,uq)) * 1.05

p <- ggplot(m,aes(x=expr,y=time)) + coord_cartesian(ylim = c( y_min , y_max )) 
p + stat_summary(fun.y=median,fun.ymin = lq, fun.ymax = uq, aes(fill=expr))

microbenchmark plot

(lines extend from lower quartile to upper quartile, dot is the median)


Given these results and weighing simplicity vs. speed, I'd have to give the nod to arrange in the plyr package. It has a simple syntax and yet is almost as speedy as the base R commands with their convoluted machinations. Typically brilliant Hadley Wickham work. My only gripe with it is that it breaks the standard R nomenclature where sorting objects get called by sort(object), but I understand why Hadley did it that way due to issues discussed in the question linked above.


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon