R - Keeping specific column data while aggregating and summing other columns


I am new to R and am using a medium-sized retail store's transactional data for practice. I'd like to create a data frame that has each customer's percentage of purchases in different categories of products, plus the sum of total purchases. That way, I can send marketing emails to people who have demonstrated a preference in a given category, and exclude people who have purchased fewer than 5 times.

Sample data (except that there are 100 categories in reality, and 250,000 rows):

+-------------+-------------+--------------------+------+------+------+
| transaction | customer_id | email              | cat1 | cat2 | cat3 |
+-------------+-------------+--------------------+------+------+------+
| 55          | 1           | email@address.com  | 1    | 0    | 0    |
| 55          | 1           | email@address.com  | 1    | 0    | 0    |
| 56          | 2           | email2@address.com | 0    | 0    | 2    |
| 57          | 3           | email3@address.com | 3    | 0    | 0    |
+-------------+-------------+--------------------+------+------+------+

Step 1: To aggregate by customer ID, I've used the following code:

segmented <- aggregate(df[4:6], list(customer_id=df$customer_id), FUN = sum)

Step 2: To turn the aggregated numbers into percentages, I used the following code:

segmented_percentage <- cbind(id = segmented[, 1], segmented[, -1]/rowSums(segmented[, -1])*100)

However, I lost the email addresses in step 1, and when I try to merge the data frames as below, it never finishes processing (and I've waited a few hours).

merge(segmented_percentage, df) 

In short: how do I put these many pieces together to get emails, demonstrated preference, and total purchases?

(Many thanks to Stack Overflow's other answers. What I accomplished above is entirely the result of Googling and finding answers here.)

We can use email as a grouping variable to keep the 'email' column in 'segmented', assuming that a particular 'customer_id' always has the same 'email'.

segmented <- aggregate(.~customer_id+email, df1[-1], FUN=sum)
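To cover the rest of the question (percentages, total purchases, and the purchase cutoff), the aggregated result could be extended along these lines. This is only a sketch on the sample data: `cat_cols`, `total`, `pct`, `min_purchases` and `result` are names made up for illustration, and it assumes that "purchases" means the sum of the category counts.

```r
# Sample data from the post (4 transactions, 3 customers).
df1 <- data.frame(
  transaction = c(55L, 55L, 56L, 57L),
  customer_id = c(1L, 1L, 2L, 3L),
  email = c("email@address.com", "email@address.com",
            "email2@address.com", "email3@address.com"),
  cat1 = c(1L, 1L, 0L, 3L),
  cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L),
  stringsAsFactors = FALSE
)

# One row per customer, keeping the email next to the summed categories.
segmented <- aggregate(. ~ customer_id + email, df1[-1], FUN = sum)

cat_cols <- grep("^cat", names(segmented), value = TRUE)

# Total purchases per customer (assumption: sum over all category counts).
segmented$total <- rowSums(segmented[cat_cols])

# Replace the category counts with row percentages.
pct <- segmented
pct[cat_cols] <- 100 * pct[cat_cols] / pct$total

# Hypothetical cutoff: the question asks for 5, but the sample data is
# too small for that, so 3 is used here.
min_purchases <- 3
result <- pct[pct$total >= min_purchases, ]
```

On the real data, `min_purchases` would be set to 5; `result` then holds one row per qualifying customer with email, category percentages, and total.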

If we want to create the columns in the original dataset, we can use mutate from library(dplyr).

library(dplyr)
df2 <- df1 %>%
          group_by(customer_id) %>%
          mutate_each(funs(sum= sum(., na.rm=TRUE)), starts_with('cat'))
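Note that mutate_each() and funs() are deprecated in current dplyr releases. A roughly equivalent sketch with across() (available from dplyr 1.0.0) could look like the following. It overwrites the 'cat' columns in place rather than adding suffixed columns, which is an assumption about the intended result, and it calls ungroup() so that later row-wise arithmetic is not affected by the grouping.

```r
library(dplyr)

# Sample data from the post.
df1 <- data.frame(
  transaction = c(55L, 55L, 56L, 57L),
  customer_id = c(1L, 1L, 2L, 3L),
  email = c("email@address.com", "email@address.com",
            "email2@address.com", "email3@address.com"),
  cat1 = c(1L, 1L, 0L, 3L),
  cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L),
  stringsAsFactors = FALSE
)

# Every transaction row keeps its place; each 'cat' column is replaced
# by the per-customer sum of that column.
df2 <- df1 %>%
  group_by(customer_id) %>%
  mutate(across(starts_with("cat"), ~ sum(.x, na.rm = TRUE))) %>%
  ungroup()
```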

We get the percentage for the 'cat' columns by dividing them by their row sums, and assign the output back to replace those columns.

ind <- grep('cat', names(df2))
df2[ind] <- df2[ind]/rowSums(df2[ind])*100

Or we can use prop.table with margin=1

df2[ind] <-  100*prop.table(as.matrix(df2[ind] ), 1) 
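For reference, here is what prop.table() with margin = 1 does on a small matrix of per-customer totals. The values come from the sample data; the names `m`, `pct` and `pct_manual` are just for illustration. Each row is divided by its own sum, so a customer with zero purchases in every category would produce NaN.

```r
# Per-customer category totals from the sample data (rows = customers).
m <- matrix(c(2, 0, 0,
              0, 0, 2,
              3, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("1", "2", "3"), c("cat1", "cat2", "cat3")))

# Row-wise proportions, scaled to percentages.
pct <- 100 * prop.table(m, 1)

# The same computation written out with rowSums().
pct_manual <- 100 * m / rowSums(m)
```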

We can also do this using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), then change the class of the columns that we want to change to numeric (lapply(.SD, as.numeric)). The columns to be selected can be specified in .SDcols, and we can assign (:=) the output to the columns by numeric column index. Grouped by 'customer_id', we loop through columns 4:6 using lapply and get the sum. We use Reduce with + to get the elementwise sum of the lapply output (which is similar to rowSums), divide each sum by the Reduce output within Map, and assign the output to the 4:6 columns.

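library(data.table)
# f1 is not defined anywhere in the post; `/` is assumed here from the description above
# (this yields proportions; multiply by 100 for percentages)
setDT(df1)[, (4:6) := lapply(.SD, as.numeric), .SDcols=4:6][,
   (4:6) := {tmp <- lapply(.SD, sum, na.rm=TRUE)
             Map(`/`, tmp, Reduce(`+`, tmp))}, by = customer_id, .SDcols=4:6]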
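A variant that collapses to one row per customer, instead of modifying every transaction row in place, might look like this. It is a sketch only: `dt`, `cat_cols`, `totals` and `total` are illustrative names, grouping by both 'customer_id' and 'email' assumes each customer has a single email, and as.data.table() is used so that the original df1 is left untouched.

```r
library(data.table)

# Sample data from the post.
df1 <- data.frame(
  transaction = c(55L, 55L, 56L, 57L),
  customer_id = c(1L, 1L, 2L, 3L),
  email = c("email@address.com", "email@address.com",
            "email2@address.com", "email3@address.com"),
  cat1 = c(1L, 1L, 0L, 3L),
  cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L),
  stringsAsFactors = FALSE
)

dt <- as.data.table(df1)  # copy, so df1 itself is not modified
cat_cols <- grep("^cat", names(dt), value = TRUE)

# Sum each category per customer. as.numeric() keeps the sums as doubles,
# so the percentages assigned below are not stored in integer columns.
totals <- dt[, lapply(.SD, function(x) sum(as.numeric(x), na.rm = TRUE)),
             by = .(customer_id, email), .SDcols = cat_cols]

# Total purchases per customer, then category counts as percentages.
totals[, total := rowSums(.SD), .SDcols = cat_cols]
totals[, (cat_cols) := lapply(.SD, function(x) 100 * x / total),
       .SDcols = cat_cols]
```

Customers below the purchase threshold can then be dropped with an ordinary filter such as `totals[total >= 5]`.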

Data

df1 <- structure(list(transaction = c(55L, 55L, 56L, 57L),
  customer_id = c(1L, 1L, 2L, 3L),
  email = c("email@address.com", "email@address.com",
            "email2@address.com", "email3@address.com"),
  cat1 = c(1L, 1L, 0L, 3L), cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L)),
  .Names = c("transaction", "customer_id", "email", "cat1", "cat2", "cat3"),
  class = "data.frame", row.names = c(NA, -4L))
