Categories
aggregate data.table dataframe r r-faq

Aggregate / summarize multiple variables per group (e.g. sum, mean)

183

From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?

Below are some sample data:

library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05)) 
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)

I would like to simultaneously aggregate the x1 and x2 variables from the df2 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?

### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)

0

    47

    Where is this year() function from?

    You could also use the reshape2 package for this task:

    require(reshape2)
    df_melt <- melt(df1, id = c("date", "year", "month"))
    dcast(df_melt, year + month ~ variable, sum)
    #  year month         x1           x2
    1  2000     1  -80.83405 -224.9540159
    2  2000     2 -223.76331 -288.2418017
    3  2000     3 -188.83930 -481.5601913
    4  2000     4 -197.47797 -473.7137420
    5  2000     5 -259.07928 -372.4563522
    

    1

    • 9

      The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

      – Jaap

      May 13, 2016 at 6:17


    59

    With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:

    library(dplyr)
    # summarising all non-grouping variables
    df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)
    
    # summarising a specific set of non-grouping variables
    df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
    df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)
    
    # summarising a specific set of non-grouping variables using select_helpers
    # see ?select_helpers for more options
    df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(starts_with('x')), sum)
    df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(matches('.*[0-9]')), sum)
    
    # summarising a specific set of non-grouping variables based on condition (class)
    df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
    

    The result of the latter two options:

        year month        x1         x2
       <dbl> <dbl>     <dbl>      <dbl>
    1   2000     1 -73.58134  -92.78595
    2   2000     2 -57.81334 -152.36983
    3   2000     3 122.68758  153.55243
    4   2000     4 450.24980  285.56374
    5   2000     5 678.37867  384.42888
    6   2000     6 792.68696  530.28694
    7   2000     7 908.58795  452.31222
    8   2000     8 710.69928  719.35225
    9   2000     9 725.06079  914.93687
    10  2000    10 770.60304  863.39337
    # ... with 14 more rows
    

    Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.


    As mentioned in my comment above, you can also use the recast function from the reshape2-package:

    library(reshape2)
    recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
    

    which will give you the same result.

    0

      56

      Using the data.table package, which is fast (useful for larger datasets)

      https://github.com/Rdatatable/data.table/wiki

      library(data.table)
      df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
      setDF(df2) # convert back to dataframe
      

      Using the plyr package

      require(plyr)
      df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
      

      Using summarize() from the Hmisc package
      (column headings are messy in my example though)

      # need to detach plyr because plyr and Hmisc both have a summarize()
      detach(package:plyr)
      require(Hmisc)
      df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
      

      1

      • why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?

        – Bulat

        Oct 13, 2018 at 12:00