R - Keeping specific column data while aggregating and summing other columns
I'm new to R, and I'm using a medium-sized retail store's transactional data for practice. I'd like to create a data frame that has each customer's percentage of purchases in different categories of products, plus the sum of total purchases. That way, I can send marketing emails to people who have demonstrated a preference in a given category, and exclude people who have purchased fewer than 5 times.
Sample data (except that there are 100 categories in reality, and 250,000 rows):
+-------------+-------------+--------------------+------+------+------+
| transaction | customer_id | email              | cat1 | cat2 | cat3 |
+-------------+-------------+--------------------+------+------+------+
|          55 |           1 | email@address.com  |    1 |    0 |    0 |
|          55 |           1 | email@address.com  |    1 |    0 |    0 |
|          56 |           2 | email2@address.com |    0 |    0 |    2 |
|          57 |           3 | email3@address.com |    3 |    0 |    0 |
+-------------+-------------+--------------------+------+------+------+
Step 1: To aggregate by customer ID, I've used the following code:
segmented <- aggregate(df[4:6], list(customer_id = df$customer_id), FUN = sum)
Step 2: To make the aggregated numbers percentages, I used the following code:
segmented_percentage <- cbind(id = segmented[, 1], segmented[, -1]/rowSums(segmented[, -1])*100)
However, I lost the email addresses in step 1, and when I try to merge the data frames as below, it never finishes processing (and I've waited a few hours).
merge(segmented_percentage, df)
In short: how do I put these pieces together to get emails along with demonstrated preference and total purchases?
(Many thanks to Stack Overflow's other answers. What I've accomplished above is entirely the result of googling and finding answers here.)
We can use 'email' as a grouping variable to keep the 'email' column in 'segmented', assuming that a particular 'customer_id' always has the same 'email'.
segmented <- aggregate(. ~ customer_id + email, df1[-1], FUN = sum)
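As a minimal sketch of how the pieces could fit together in base R, the block below aggregates by 'customer_id' and 'email', adds a total, converts the counts to percentages, and filters on a minimum. The names 'cat_cols', 'total', 'min_purchases' and 'targets' are made up for illustration, and it assumes "total purchases" means the sum of the category counts, which the question does not spell out.

```r
# Sample data from the question
df1 <- data.frame(
  transaction = c(55L, 55L, 56L, 57L),
  customer_id = c(1L, 1L, 2L, 3L),
  email = c("email@address.com", "email@address.com",
            "email2@address.com", "email3@address.com"),
  cat1 = c(1L, 1L, 0L, 3L),
  cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L),
  stringsAsFactors = FALSE
)

# Sum every category column per customer, keeping email as a grouping key
segmented <- aggregate(. ~ customer_id + email, data = df1[-1], FUN = sum)

# Assumption: total purchases = sum of all category counts for the customer
cat_cols <- grep("^cat", names(segmented), value = TRUE)
segmented$total <- rowSums(segmented[cat_cols])

# Convert the counts to row percentages (each row is divided by its own total)
segmented[cat_cols] <- 100 * segmented[cat_cols] / segmented$total

# Hypothetical threshold: the question uses 5; 3 is used here so that the
# tiny sample still returns a row
min_purchases <- 3
targets <- segmented[segmented$total >= min_purchases, ]
```

With 250,000 rows this avoids merge() altogether, because the email address is carried through the aggregation instead of being joined back afterwards. A customer whose total is 0 would get NaN percentages, so such rows may need to be dropped first.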
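If we want to create the columns in the original dataset, we can use mutate_each from dplyr.
library(dplyr)
# mutate_each() and funs() are deprecated in current dplyr; mutate(across(...)) is the replacement
df2 <- df1 %>%
  group_by(customer_id) %>%
  mutate_each(funs(sum = sum(., na.rm = TRUE)), starts_with('cat'))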
We get the percentage for the 'cat' columns and assign the output back, replacing those columns with the percentages.
ind <- grep('cat', names(df2))
df2[ind] <- df2[ind]/rowSums(df2[ind])*100
Or we can use prop.table with margin=1
df2[ind] <- 100*prop.table(as.matrix(df2[ind] ), 1)
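To illustrate what margin=1 does, here is a small self-contained sketch on a made-up count matrix ('counts' and its values are hypothetical, not taken from the question). It compares prop.table with the explicit rowSums division used earlier.

```r
# Hypothetical count matrix: rows are customers, columns are categories
counts <- matrix(c(2, 0, 0,
                   0, 0, 2,
                   3, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("cat1", "cat2", "cat3")))

# margin = 1 divides each entry by the total of its row
pct_prop <- 100 * prop.table(counts, 1)

# The same result via explicit division by the row sums
pct_manual <- 100 * counts / rowSums(counts)
```

Both forms give row percentages that add up to 100; prop.table simply makes the row-wise intent explicit.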
We can also do this using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), then change the class of the columns that we want to change to numeric (lapply(.SD, as.numeric)). The columns to be selected can be specified in .SDcols, and we can assign (:=) the output back to the columns by numeric column index. Grouped by 'customer_id', we loop through columns 4:6 using lapply and get the sum. We use Reduce with + to get the elementwise sum of the lapply output (which is similar to rowSums), divide each sum by the Reduce output within Map, and assign the output to the 4:6 columns.
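library(data.table)
# f1 is not defined in the original post; it is assumed here to be
# elementwise division (multiply by 100 for percentages)
f1 <- function(x, y) x / y
setDT(df1)[, (4:6) := lapply(.SD, as.numeric), .SDcols = 4:6][
  , (4:6) := {
      tmp <- lapply(.SD, sum, na.rm = TRUE)
      Map(f1, tmp, Reduce(`+`, tmp))
    }, by = customer_id, .SDcols = 4:6]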
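For readers who want to see the same computation without data.table, the following is a rough base R sketch using ave() for the grouped sums. It is not the method from the answer above, only an illustration of what the grouped step computes; 'df_ex', 'cat_cols', 'group_sums', 'group_total' and 'df_share' are hypothetical names, and the division is assumed to produce shares between 0 and 1.

```r
# Sample data from the question (only the columns needed here)
df_ex <- data.frame(
  customer_id = c(1L, 1L, 2L, 3L),
  cat1 = c(1L, 1L, 0L, 3L),
  cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L)
)
cat_cols <- c("cat1", "cat2", "cat3")

# Per-customer sum of each category, repeated on every row of that customer
group_sums <- lapply(df_ex[cat_cols], function(x) {
  ave(as.numeric(x), df_ex$customer_id,
      FUN = function(v) sum(v, na.rm = TRUE))
})

# Elementwise sum across the category sums (the role of Reduce with `+`)
group_total <- Reduce(`+`, group_sums)

# Divide each category sum by the customer's total (the role of Map)
df_share <- df_ex
df_share[cat_cols] <- Map(`/`, group_sums, group_total)
```

Each row of 'df_share' then carries its customer's category shares, so both rows of customer 1 show 1 for 'cat1' and 0 elsewhere.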
data
df1 <- structure(list(transaction = c(55L, 55L, 56L, 57L),
    customer_id = c(1L, 1L, 2L, 3L),
    email = c("email@address.com", "email@address.com",
              "email2@address.com", "email3@address.com"),
    cat1 = c(1L, 1L, 0L, 3L), cat2 = c(0L, 0L, 0L, 0L),
    cat3 = c(0L, 0L, 2L, 0L)),
    .Names = c("transaction", "customer_id", "email", "cat1", "cat2", "cat3"),
    class = "data.frame", row.names = c(NA, -4L))