Categories
dataframe r sqldf

Select the first row by group

109

From a dataframe like this

test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
test <- test[order(test$id), ]
rownames(test) <- 1:10

> test
    id string
 1   1      A
 2   1      F
 3   2      B
 4   2      G
 5   3      C
 6   3      H
 7   4      D
 8   4      I
 9   5      E
 10  5      J

I want to create a new one with the first row of each id / string pair. If sqldf accepted R code within it, the query could look like this:

res <- sqldf("select id, min(rownames(test)), string 
              from test 
              group by id, string")

> res
    id string
 1   1      A
 3   2      B
 5   3      C
 7   4      D
 9   5      E

Is there a solution short of creating a new column like

test$row <- rownames(test)

and running the same sqldf query with min(row)?

4

  • possible duplicate of Collapsing data frame by selecting one row per group

    – Matthew

    Sep 2, 2014 at 1:56

  • 1

    @Matthew, my question is older.

    – dmvianna

    Sep 2, 2014 at 2:23

  • 2

    Your question is 1 year old, and the other question is 4 years old, no? There are so many duplicates of this question

    – Matthew

    Sep 2, 2014 at 2:34


  • @Matthew Sorry, I must have misread the dates.

    – dmvianna

    Sep 4, 2014 at 4:31

137

You can use duplicated to do this very quickly.

test[!duplicated(test$id),]

Benchmarks, for the speed freaks:

ju <- function() test[!duplicated(test$id),]
gs1 <- function() do.call(rbind, lapply(split(test, test$id), head, 1))
gs2 <- function() do.call(rbind, lapply(split(test, test$id), `[`, 1, ))
jply <- function() ddply(test,.(id),function(x) head(x,1))
jdt <- function() {
  testd <- as.data.table(test)
  setkey(testd,id)
  # Initial solution (slow)
  # testd[,lapply(.SD,function(x) head(x,1)),by = key(testd)]
  # Faster options :
  testd[!duplicated(id)]               # (1)
  # testd[, .SD[1L], by=key(testd)]    # (2)
  # testd[J(unique(id)),mult="first"]  # (3)
  # testd[ testd[,.I[1L],by=id] ]      # (4) needs v1.8.3. Allows 2nd, 3rd etc
}

library(plyr)
library(data.table)
library(rbenchmark)

# sample data
set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]

benchmark(ju(), gs1(), gs2(), jply(), jdt(),
    replications=5, order="relative")[,1:6]
#     test replications elapsed relative user.self sys.self
# 1   ju()            5    0.03    1.000      0.03     0.00
# 5  jdt()            5    0.03    1.000      0.03     0.00
# 3  gs2()            5    3.49  116.333      2.87     0.58
# 2  gs1()            5    3.58  119.333      3.00     0.58
# 4 jply()            5    3.69  123.000      3.11     0.51

Let’s try that again, but with just the contenders from the first heat and with more data and more replications.

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
benchmark(ju(), jdt(), order="relative")[,1:6]
#    test replications elapsed relative user.self sys.self
# 1  ju()          100    5.48    1.000      4.44     1.00
# 2 jdt()          100    6.92    1.263      5.70     1.15

12

  • The winner: system.time(dat3[!duplicated(dat3$id),]) user system elapsed 0.07 0.00 0.07

    – dmvianna

    Nov 7, 2012 at 23:18

  • 2

    @dmvianna: I don’t have it installed and didn’t feel like bothering with it. 🙂

    Nov 7, 2012 at 23:33

  • Are we sure that my data.table code is as efficient as possible? I’m not confident in my ability to eke the best performance out of that tool.

    – joran

    Nov 7, 2012 at 23:36

  • 2

    Also, I reckon, if you are going to benchmark the data.table , keying you should include the ordering by id within the base calls.

    – mnel

    Nov 8, 2012 at 0:42

  • 1

    @JoshuaUlrich One more question: why is the first sentence needed i.e. assumption that data is already sorted. !duplicated(x) finds the first of each group even if it isn’t sorted, iiuc.

    Nov 8, 2012 at 9:24

74

I favor the dplyr approach.

group_by(id) followed by either

  • filter(row_number()==1) or
  • slice(1) or
  • slice_head(1) #(dplyr => 1.0)
  • top_n(n = -1)
    • top_n() internally uses the rank function.
      Negative selects from the bottom of rank.

In some instances arranging the ids after the group_by can be necessary.

library(dplyr)

# using filter(), top_n() or slice()

m1 <-
test %>% 
  group_by(id) %>% 
  filter(row_number()==1)

m2 <-
test %>% 
  group_by(id) %>% 
  slice(1)

m3 <-
test %>% 
  group_by(id) %>% 
  top_n(n = -1)

All three methods return the same result

# A tibble: 5 x 2
# Groups:   id [5]
     id string
  <int> <fct> 
1     1 A     
2     2 B     
3     3 C     
4     4 D     
5     5 E

4

  • 2

    Worth giving a shout-out to slice as well. slice(x) is a shortcut for filter(row_number() %in% x).

    Jun 20, 2018 at 18:53


  • Very elegant. Do you know why I have to convert my data.table to a data.frame for this to work?

    Jan 13, 2019 at 3:09

  • @JamesHirschorn I’m not an expert on the all the differences. But data.table inherits from the data.frame so in many cases you can use dplyr commands on a data.table. The example above e.g also works if test is a data.table. See e.g. stackoverflow.com/questions/13618488/… for a deeper explanantion

    – Kresten

    Jan 14, 2019 at 9:06

  • This is a tidyverse way to do it and as you see the data.frame is actually a tibble here. I personally advise you to work always with tibbles also because ggplot2 is built in a similar manner.

    – Garini

    Jan 15, 2020 at 10:28

18

What about

DT <- data.table(test)
setkey(DT, id)

DT[J(unique(id)), mult = "first"]

Edit

There is also a unique method for data.tables which will return the the first row by key

jdtu <- function() unique(DT)

I think, if you are ordering test outside the benchmark, then you can removing the setkey and data.table conversion from the benchmark as well (as the setkey basically sorts by id, the same as order).

set.seed(21)
test <- data.frame(id=sample(1e3, 1e5, TRUE), string=sample(LETTERS, 1e5, TRUE))
test <- test[order(test$id), ]
DT <- data.table(DT, key = 'id')
ju <- function() test[!duplicated(test$id),]

jdt <- function() DT[J(unique(id)),mult="first"]


 library(rbenchmark)
benchmark(ju(), jdt(), replications = 5)
##    test replications elapsed relative user.self sys.self 
## 2 jdt()            5    0.01        1      0.02        0        
## 1  ju()            5    0.05        5      0.05        0         

and with more data

** Edit with unique method**

set.seed(21)
test <- data.frame(id=sample(1e4, 1e6, TRUE), string=sample(LETTERS, 1e6, TRUE))
test <- test[order(test$id), ]
DT <- data.table(test, key = 'id')
       test replications elapsed relative user.self sys.self 
2  jdt()            5    0.09     2.25      0.09     0.00    
3 jdtu()            5    0.04     1.00      0.05     0.00      
1   ju()            5    0.22     5.50      0.19     0.03        

The unique method is fastest here.

2

  • 5

    You don’t even have to set the key. unique(DT,by="id") works directly

    – Matthew

    Sep 2, 2014 at 1:43


  • FYI as of data.table version >= 1.9.8, the default by argument for unique is by = seq_along(x) (all columns), instead of the previous default by = key(x)

    Dec 31, 2018 at 21:03