R - How do I do a conditional sum which only looks between certain date criteria?

Say I have data that looks like this:

    date, user, items_bought, event_number
    2013-01-01, x, 2, 1
    2013-01-02, x, 1, 2
    2013-01-03, x, 0, 3
    2013-01-04, x, 0, 4
    2013-01-04, x, 1, 5
    2013-01-04, x, 2, 6
    2013-01-05, x, 3, 7
    2013-01-06, x, 1, 8
    2013-01-01, y, 1, 1
    2013-01-02, y, 1, 2
    2013-01-03, y, 0, 3
    2013-01-04, y, 5, 4
    2013-01-05, y, 6, 5
    2013-01-06, y, 1, 6

To get the cumulative sum per user per data point, I am doing:

    data.frame(cum_items_bought = unlist(tapply(as.numeric(data$items_bought), data$user, FUN = cumsum)))

The output looks like this:

    date, user, items_bought
    2013-01-01, x, 2
    2013-01-02, x, 3
    2013-01-03, x, 3
    2013-01-04, x, 3
    2013-01-04, x, 4
    2013-01-04, x, 6
    2013-01-05, x, 9
    2013-01-06, x, 10
    2013-01-01, y, 1
    2013-01-02, y, 2
    2013-01-03, y, 2
    2013-01-04, y, 7
    2013-01-05, y, 13
    2013-01-06, y, 14

However, I want to restrict the sum so that it only adds up what happened within 3 days of each row (relative to the user), i.e. the output needs to look like this:

    date, user, cum_items_bought_3_days
    2013-01-01, x, 2
    2013-01-02, x, 3
    2013-01-03, x, 3
    2013-01-04, x, 1
    2013-01-04, x, 2
    2013-01-04, x, 4
    2013-01-05, x, 6
    2013-01-06, x, 7
    2013-01-01, y, 1
    2013-01-02, y, 2
    2013-01-03, y, 2
    2013-01-04, y, 6
    2013-01-05, y, 11
    2013-01-06, y, 12

Here's a dplyr solution which will produce the desired result (14 rows) as specified in the question. Note that it takes care of duplicate date entries, for example, 2013-01-04 for user x.

    # define custom function to be used in the dplyr chain
    myfunc <- function(x){
      with(x, sapply(event_number, function(y)
        sum(items_bought[event_number <= event_number[y] & date[y] - date <= 2])))
    }

    require(dplyr)                 # install and load the library

    df %>%
      mutate(date = as.Date(as.character(date))) %>%
      group_by(user) %>%
      do(data.frame(., cum_items_bought_3_days = myfunc(.))) %>%
      select(-c(items_bought, event_number))

    #         date user cum_items_bought_3_days
    #1  2013-01-01    x                       2
    #2  2013-01-02    x                       3
    #3  2013-01-03    x                       3
    #4  2013-01-04    x                       1
    #5  2013-01-04    x                       2
    #6  2013-01-04    x                       4
    #7  2013-01-05    x                       6
    #8  2013-01-06    x                       7
    #9  2013-01-01    y                       1
    #10 2013-01-02    y                       2
    #11 2013-01-03    y                       2
    #12 2013-01-04    y                       6
    #13 2013-01-05    y                      11
    #14 2013-01-06    y                      12

In the answer I use a custom function myfunc inside a dplyr chain. This is done using the do operator from dplyr. The custom function is passed the df subsetted by user groups. It then uses sapply to pass each event_number and calculate the sums of items_bought. The last line of the dplyr chain deselects the undesired columns.
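To make the condition inside myfunc easier to follow, here is a minimal sketch that evaluates it by hand for a single row. It assumes a small data frame called x holding only user x's rows from the question; the names y and in_window are only illustrative and are not part of the original answer.

```r
# Rows of user x, copied from the question's example data
x <- data.frame(
  date = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
                   "2013-01-04", "2013-01-04", "2013-01-05", "2013-01-06")),
  items_bought = c(2, 1, 0, 0, 1, 2, 3, 1),
  event_number = 1:8
)

# Pick one row to inspect (the first of the three 2013-01-04 events)
y <- 4

# A row is inside the window if it is not a later event
# and its date is at most 2 days before the date of row y
in_window <- x$event_number <= x$event_number[y] &
  as.numeric(x$date[y] - x$date) <= 2

in_window
sum(x$items_bought[in_window])
```

Only rows 2 to 4 fall inside the window here, so the sum is 1, which is the value shown for the first 2013-01-04 entry of user x in the desired output.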

Let me know if you'd like a more detailed explanation.

Edit after comment by OP:

If you need more flexibility to also conditionally sum up other columns, you can adjust the code as follows. I assume here that the other columns should be summed up the same way as items_bought. If that is not correct, please specify how you want to sum up the other columns.

I first create 2 additional columns with random numbers in the data (I'll post a dput of the data at the bottom of my answer):

    set.seed(99)   # for reproducibility

    df$newcol1 <- sample(0:10, 14, replace = TRUE)
    df$newcol2 <- runif(14)

    df
    #         date user items_bought event_number newcol1     newcol2
    #1  2013-01-01    x            2            1       6 0.687800094
    #2  2013-01-02    x            1            2       1 0.640190769
    #3  2013-01-03    x            0            3       7 0.357885360
    #4  2013-01-04    x            0            4      10 0.102584999
    #5  2013-01-04    x            1            5       5 0.097790922
    #6  2013-01-04    x            2            6      10 0.182886256
    #7  2013-01-05    x            3            7       7 0.227903474
    #8  2013-01-06    x            1            8       3 0.080524150
    #9  2013-01-01    y            1            1       3 0.821618422
    #10 2013-01-02    y            1            2       1 0.591113977
    #11 2013-01-03    y            0            3       6 0.773389019
    #12 2013-01-04    y            5            4       5 0.350085977
    #13 2013-01-05    y            6            5       2 0.006061323
    #14 2013-01-06    y            1            6       7 0.814506223

Next, you can modify myfunc to take 2 arguments instead of 1. The first argument will remain the subsetted data.frame as before (represented by . inside the dplyr chain and x in the function definition of myfunc), while the second argument to myfunc will specify the column to sum up (colname).

    myfunc <- function(x, colname){
      with(x, sapply(event_number, function(y)
        sum(x[event_number <= event_number[y] & date[y] - date <= 2, colname])))
    }

Then, you can use myfunc several times if you want to conditionally sum up several columns:

    df %>%
      mutate(date = as.Date(as.character(date))) %>%
      group_by(user) %>%
      do(data.frame(., cum_items_bought_3_days = myfunc(., "items_bought"),
                       newcol1sums = myfunc(., "newcol1"),
                       newcol2sums = myfunc(., "newcol2"))) %>%
      select(-c(items_bought, event_number, newcol1, newcol2))

    #         date user cum_items_bought_3_days newcol1sums newcol2sums
    #1  2013-01-01    x                       2           6   0.6878001
    #2  2013-01-02    x                       3           7   1.3279909
    #3  2013-01-03    x                       3          14   1.6858762
    #4  2013-01-04    x                       1          18   1.1006611
    #5  2013-01-04    x                       2          23   1.1984520
    #6  2013-01-04    x                       4          33   1.3813383
    #7  2013-01-05    x                       6          39   0.9690510
    #8  2013-01-06    x                       7          35   0.6916898
    #9  2013-01-01    y                       1           3   0.8216184
    #10 2013-01-02    y                       2           4   1.4127324
    #11 2013-01-03    y                       2          10   2.1861214
    #12 2013-01-04    y                       6          12   1.7145890
    #13 2013-01-05    y                      11          13   1.1295363
    #14 2013-01-06    y                      12          14   1.1706535

Now you have created conditional sums of the columns items_bought, newcol1 and newcol2. You can leave any of the sums out of the dplyr chain or add more columns to sum up.
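If the list of columns grows, the same idea can be looped over column names instead of writing one call per column. The following is only a sketch in base R for a single user: window_sum, one_user and cols are hypothetical names introduced here, and the helper is a rewrite of the windowed sum, not the myfunc from this answer.

```r
# Hypothetical helper: windowed sum of one column for a single user's rows
window_sum <- function(x, colname, days = 2) {
  sapply(seq_len(nrow(x)), function(i) {
    # keep rows that are not later events and are at most `days` days older
    keep <- x$event_number <= x$event_number[i] &
      as.numeric(x$date[i] - x$date) <= days
    sum(x[[colname]][keep])
  })
}

# Rows of user y, copied from the example data in this answer
one_user <- data.frame(
  date = as.Date("2013-01-01") + 0:5,
  event_number = 1:6,
  items_bought = c(1, 1, 0, 5, 6, 1),
  newcol1 = c(3, 1, 6, 5, 2, 7)
)

# Apply the helper to every column of interest; the result is a matrix
# with one column per entry in `cols`
cols <- c("items_bought", "newcol1")
sums <- sapply(cols, function(cn) window_sum(one_user, cn))
sums
```

The two columns of sums agree with the cum_items_bought_3_days and newcol1sums values shown for user y above.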

Edit #2 after comment by OP:

To calculate the cumulative sum of distinct (unique) items bought per user, you can define a second custom function myfunc2 and use it inside the dplyr chain. This function is as flexible as myfunc, so you can define the columns to which you want to apply the function.

The code would then be:

    myfunc <- function(x, colname){
      with(x, sapply(event_number, function(y)
        sum(x[event_number <= event_number[y] & date[y] - date <= 2, colname])))
    }

    myfunc2 <- function(x, colname){
      # x[[colname]] extracts a plain vector, so %in% also works when x is a tibble
      cumsum(sapply(seq_along(x[[colname]]), function(y)
        ifelse(y != 1 & x[[colname]][y] %in% x[[colname]][1:(y-1)], 0, 1)))
    }

    require(dplyr)                 # install and load the library

    df %>%
      mutate(date = as.Date(as.character(date))) %>%
      group_by(user) %>%
      do(data.frame(., cum_items_bought_3_days = myfunc(., "items_bought"),
                       newcol1sums = myfunc(., "newcol1"),
                       newcol2sums = myfunc(., "newcol2"),
                       distinct_items_bought = myfunc2(., "items_bought"))) %>%
      select(-c(items_bought, event_number, newcol1, newcol2))
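As a cross-check on the running count of distinct values, base R's duplicated() marks every repeat of an earlier value, so the cumulative sum of its negation should give the same counts for a plain vector. This is a sketch on user x's items_bought values only; items and distinct_so_far are illustrative names.

```r
# items_bought for user x, copied from the example data
items <- c(2, 1, 0, 0, 1, 2, 3, 1)

# !duplicated(items) is TRUE the first time a value appears and FALSE afterwards,
# so its cumulative sum counts the distinct values seen so far
distinct_so_far <- cumsum(!duplicated(items))
distinct_so_far
```

For this vector the count rises to 3 after the first three rows, stays there while 0, 1 and 2 repeat, and reaches 4 when the value 3 first appears.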

Here is the data used:

    dput(df)
    structure(list(date = structure(c(1L, 2L, 3L, 4L, 4L, 4L, 5L,
    6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("2013-01-01", "2013-01-02",
    "2013-01-03", "2013-01-04", "2013-01-05", "2013-01-06"), class = "factor"),
        user = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
        2L, 2L, 2L, 2L), .Label = c(" x", " y"), class = "factor"),
        items_bought = c(2L, 1L, 0L, 0L, 1L, 2L, 3L, 1L, 1L, 1L,
        0L, 5L, 6L, 1L), event_number = c(1L, 2L, 3L, 4L, 5L, 6L,
        7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L), newcol1 = c(6L, 1L, 7L,
        10L, 5L, 10L, 7L, 3L, 3L, 1L, 6L, 5L, 2L, 7L), newcol2 = c(0.687800094485283,
        0.640190769452602, 0.357885359786451, 0.10258499882184, 0.0977909218054265,
        0.182886255905032, 0.227903473889455, 0.0805241498164833,
        0.821618422167376, 0.591113976901397, 0.773389018839225,
        0.350085976999253, 0.00606132275424898, 0.814506222726777
        )), .Names = c("date", "user", "items_bought", "event_number",
    "newcol1", "newcol2"), row.names = c(NA, -14L), class = "data.frame")
