r - How do I do a conditional sum which only looks between certain date criteria -
say have data looks like
date, user, items_bought, event_number 2013-01-01, x, 2, 1 2013-01-02, x, 1, 2 2013-01-03, x, 0, 3 2013-01-04, x, 0, 4 2013-01-04, x, 1, 5 2013-01-04, x, 2, 6 2013-01-05, x, 3, 7 2013-01-06, x, 1, 8 2013-01-01, y, 1, 1 2013-01-02, y, 1, 2 2013-01-03, y, 0, 3 2013-01-04, y, 5, 4 2013-01-05, y, 6, 5 2013-01-06, y, 1, 6
to cumulative sum per user per data point doing
data.frame(cum_items_bought=unlist(tapply(as.numeric(data$items_bought), data$user, fun = cumsum)))
output looks like
date, user, items_bought 2013-01-01, x, 2 2013-01-02, x, 3 2013-01-03, x, 3 2013-01-04, x, 3 2013-01-04, x, 4 2013-01-04, x, 6 2013-01-05, x, 9 2013-01-06, x, 10 2013-01-01, y, 1 2013-01-02, y, 2 2013-01-03, y, 2 2013-01-04, y, 7 2013-01-05, y, 13 2013-01-06, y, 14
however want restrict sum add happened within 3 days of each row (relative user). i.e. output needs this:
date, user, cum_items_bought_3_days 2013-01-01, x, 2 2013-01-02, x, 3 2013-01-03, x, 3 2013-01-04, x, 1 2013-01-04, x, 2 2013-01-04, x, 4 2013-01-05, x, 6 2013-01-06, x, 7 2013-01-01, y, 1 2013-01-02, y, 2 2013-01-03, y, 2 2013-01-04, y, 6 2013-01-05, y, 11 2013-01-06, y, 12
here's dplyr
solution produce desired result (14 rows) specified in question. note takes care of duplicate date entries, example, 2013-01-04 user x.
# define custom function used in dplyr chain myfunc <- function(x){ with(x, sapply(event_number, function(y) sum(items_bought[event_number <= event_number[y] & date[y] - date <= 2]))) } require(dplyr) #install , load library df %>% mutate(date = as.date(as.character(date))) %>% group_by(user) %>% do(data.frame(., cum_items_bought_3_days = myfunc(.))) %>% select(-c(items_bought, event_number)) # date user cum_items_bought_3_days #1 2013-01-01 x 2 #2 2013-01-02 x 3 #3 2013-01-03 x 3 #4 2013-01-04 x 1 #5 2013-01-04 x 2 #6 2013-01-04 x 4 #7 2013-01-05 x 6 #8 2013-01-06 x 7 #9 2013-01-01 y 1 #10 2013-01-02 y 2 #11 2013-01-03 y 2 #12 2013-01-04 y 6 #13 2013-01-05 y 11 #14 2013-01-06 y 12
in answer use custom function myfunc
inside dplyr
chain. done using do
operator dplyr
. custom function passed subsetted df user
groups. uses sapply
pass each event_number
, calculate sums of items_bought
. last line of dplyr
chain deselects undesired columns.
let me know if you'd more detailed explanation.
edit after comment op:
if need more flexibility conditionally sum other columns, can adjust code follows. assume here, other columns should summed same way items_bought
. if not correct, please specify how want sum other columns.
i first create 2 additional columns random numbers in data (i'll post dput
of data @ bottom of answer):
set.seed(99) # reproducibility df$newcol1 <- sample(0:10, 14, replace=t) df$newcol2 <- runif(14) df # date user items_bought event_number newcol1 newcol2 #1 2013-01-01 x 2 1 6 0.687800094 #2 2013-01-02 x 1 2 1 0.640190769 #3 2013-01-03 x 0 3 7 0.357885360 #4 2013-01-04 x 0 4 10 0.102584999 #5 2013-01-04 x 1 5 5 0.097790922 #6 2013-01-04 x 2 6 10 0.182886256 #7 2013-01-05 x 3 7 7 0.227903474 #8 2013-01-06 x 1 8 3 0.080524150 #9 2013-01-01 y 1 1 3 0.821618422 #10 2013-01-02 y 1 2 1 0.591113977 #11 2013-01-03 y 0 3 6 0.773389019 #12 2013-01-04 y 5 4 5 0.350085977 #13 2013-01-05 y 6 5 2 0.006061323 #14 2013-01-06 y 1 6 7 0.814506223
next, can modify myfunc
take 2 arguments, instead of 1. first argument remain subsetted data.frame before (represented .
inside dplyr chain , x
in function definition of myfunc
), while second argument myfunc
specify column sum (colname
).
myfunc <- function(x, colname){ with(x, sapply(event_number, function(y) sum(x[event_number <= event_number[y] & date[y] - date <= 2, colname]))) }
then, can use myfunc
several times if want conditionally sum several columns:
df %>% mutate(date = as.date(as.character(date))) %>% group_by(user) %>% do(data.frame(., cum_items_bought_3_days = myfunc(., "items_bought"), newcol1sums = myfunc(., "newcol1"), newcol2sums = myfunc(., "newcol2"))) %>% select(-c(items_bought, event_number, newcol1, newcol2)) # date user cum_items_bought_3_days newcol1sums newcol2sums #1 2013-01-01 x 2 6 0.6878001 #2 2013-01-02 x 3 7 1.3279909 #3 2013-01-03 x 3 14 1.6858762 #4 2013-01-04 x 1 18 1.1006611 #5 2013-01-04 x 2 23 1.1984520 #6 2013-01-04 x 4 33 1.3813383 #7 2013-01-05 x 6 39 0.9690510 #8 2013-01-06 x 7 35 0.6916898 #9 2013-01-01 y 1 3 0.8216184 #10 2013-01-02 y 2 4 1.4127324 #11 2013-01-03 y 2 10 2.1861214 #12 2013-01-04 y 6 12 1.7145890 #13 2013-01-05 y 11 13 1.1295363 #14 2013-01-06 y 12 14 1.1706535
now created conditional sums of columns items_bought
, newcol1
, newcol2
. can leave out of sums in dplyr chain or add more columns sum up.
edit #2 after comment op:
to calculate cumulative sum of distinct (unique) items bought per user, define second custom function myfunc2
, use inside dplyr chain. function flexible myfunc
can define columns want apply function.
the code be:
myfunc <- function(x, colname){ with(x, sapply(event_number, function(y) sum(x[event_number <= event_number[y] & date[y] - date <= 2, colname]))) } myfunc2 <- function(x, colname){ cumsum(sapply(seq_along(x[[colname]]), function(y) ifelse(!y == 1 & x[y, colname] %in% x[1:(y-1), colname], 0, 1))) } require(dplyr) #install , load library dd %>% mutate(date = as.date(as.character(date))) %>% group_by(user) %>% do(data.frame(., cum_items_bought_3_days = myfunc(., "items_bought"), newcol1sums = myfunc(., "newcol1"), newcol2sums = myfunc(., "newcol2"), distinct_items_bought = myfunc2(., "items_bought"))) %>% select(-c(items_bought, event_number, newcol1, newcol2))
here data used:
dput(df) structure(list(date = structure(c(1l, 2l, 3l, 4l, 4l, 4l, 5l, 6l, 1l, 2l, 3l, 4l, 5l, 6l), .label = c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-05", "2013-01-06"), class = "factor"), user = structure(c(1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 2l, 2l, 2l, 2l, 2l, 2l), .label = c(" x", " y"), class = "factor"), items_bought = c(2l, 1l, 0l, 0l, 1l, 2l, 3l, 1l, 1l, 1l, 0l, 5l, 6l, 1l), event_number = c(1l, 2l, 3l, 4l, 5l, 6l, 7l, 8l, 1l, 2l, 3l, 4l, 5l, 6l), newcol1 = c(6l, 1l, 7l, 10l, 5l, 10l, 7l, 3l, 3l, 1l, 6l, 5l, 2l, 7l), newcol2 = c(0.687800094485283, 0.640190769452602, 0.357885359786451, 0.10258499882184, 0.0977909218054265, 0.182886255905032, 0.227903473889455, 0.0805241498164833, 0.821618422167376, 0.591113976901397, 0.773389018839225, 0.350085976999253, 0.00606132275424898, 0.814506222726777 )), .names = c("date", "user", "items_bought", "event_number", "newcol1", "newcol2"), row.names = c(na, -14l), class = "data.frame")
Comments
Post a Comment