r - Keeping specific column data while aggregating and summing other columns -

I am new to R, and I am using a medium-sized retail store's transactional data for practice. I'd like to create a data frame that has each customer's percentage of purchases in different categories of products, and the sum of total purchases. That way, I can send marketing emails to people who have demonstrated a preference in a given category, and exclude people who have purchased fewer than 5 times.

Sample data (except that there are 100 categories in reality, and 250,000 rows):

+-------------+-------------+--------------------+------+------+------+
| transaction | customer_id | email              | cat1 | cat2 | cat3 |
+-------------+-------------+--------------------+------+------+------+
| 55          | 1           | email@address.com  | 1    | 0    | 0    |
| 55          | 1           | email@address.com  | 1    | 0    | 0    |
| 56          | 2           | email2@address.com | 0    | 0    | 2    |
| 57          | 3           | email3@address.com | 3    | 0    | 0    |
+-------------+-------------+--------------------+------+------+------+

Step 1: To aggregate by customer ID, I've used the following code:

segmented <- aggregate(df[4:6], list(customer_id = df$customer_id), FUN = sum)

Step 2: To make the aggregated numbers percentages, I used the following code:

segmented_percentage <- cbind(id = segmented[, 1], segmented[, -1]/rowSums(segmented[, -1])*100)

However, I lost the email addresses in step 1, and when I try to merge the data frames as below, it never finishes processing (and I've waited a few hours).

merge(segmented_percentage, df)

In short: how do I put these many pieces together to get emails, demonstrated preference, and total purchases?

(Many thanks to Stack Overflow's other answers. What I accomplished above is entirely the result of googling and finding answers here.)

We can use 'email' as a grouping variable to keep the 'email' column in 'segmented', assuming that a particular 'customer_id' always has the same 'email'.

segmented <- aggregate(. ~ customer_id + email, df1[-1], FUN = sum)
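As a rough sketch of how the pieces might fit together for the question's end goal (emails, category percentages, and total purchases, dropping customers below a purchase threshold), assuming the sample 'df1' from the data section and that each 'cat' value is a purchase count. The names 'cat_cols', 'total_purchases', and 'min_purchases' are made up for illustration and do not come from the original post.

```r
# Sample data, matching the 'data' section of this answer
df1 <- data.frame(
  transaction = c(55L, 55L, 56L, 57L),
  customer_id = c(1L, 1L, 2L, 3L),
  email = c("email@address.com", "email@address.com",
            "email2@address.com", "email3@address.com"),
  cat1 = c(1L, 1L, 0L, 3L),
  cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L),
  stringsAsFactors = FALSE
)

# Sum every 'cat' column per customer, keeping 'email' as a grouping column
segmented <- aggregate(. ~ customer_id + email, df1[-1], FUN = sum)

# Hypothetical helper names (assumptions, not from the original post)
cat_cols <- grep("^cat", names(segmented))
min_purchases <- 3  # the question uses 5; lowered so the toy data keeps a row

# Total purchases per customer, then each category as a percentage of it
segmented$total_purchases <- rowSums(segmented[cat_cols])
segmented[cat_cols] <- 100 * segmented[cat_cols] / segmented$total_purchases

# Drop customers below the purchase threshold
result <- segmented[segmented$total_purchases >= min_purchases, ]
print(result)
```

Because 'email' is carried through the aggregation, no later merge back onto the transaction-level data is needed.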

If we want to create the columns in the original dataset, we can use mutate_each from library(dplyr)

library(dplyr)
# Replace each 'cat' column with its per-customer sum
df2 <- df1 %>%
          group_by(customer_id) %>%
          mutate_each(funs(sum = sum(., na.rm = TRUE)), starts_with('cat'))
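In more recent dplyr releases, 'mutate_each()' and 'funs()' are deprecated. A roughly equivalent sketch with 'across()' (available from dplyr 1.0.0), assuming the same 'df1' as in the data section, might look like this. The trailing 'ungroup()' is an addition here, so that later row-wise steps do not run on a grouped tibble.

```r
library(dplyr)

# Sample data, matching the 'data' section of this answer
df1 <- data.frame(
  transaction = c(55L, 55L, 56L, 57L),
  customer_id = c(1L, 1L, 2L, 3L),
  email = c("email@address.com", "email@address.com",
            "email2@address.com", "email3@address.com"),
  cat1 = c(1L, 1L, 0L, 3L),
  cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L),
  stringsAsFactors = FALSE
)

# Overwrite each 'cat' column with its per-customer sum; the row count
# stays the same because mutate() (unlike summarise()) keeps every row
df2 <- df1 %>%
  group_by(customer_id) %>%
  mutate(across(starts_with("cat"), ~ sum(.x, na.rm = TRUE))) %>%
  ungroup()

print(df2)
```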

We get the percentage for the 'cat' columns and assign the output back, replacing those columns with the percentages.

ind <- grep('cat', names(df2))
df2[ind] <- df2[ind]/rowSums(df2[ind])*100

Or we can use prop.table with margin=1

df2[ind] <- 100*prop.table(as.matrix(df2[ind]), 1)
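To make the 'prop.table' step concrete, here is a small base R illustration on a made-up matrix of per-customer category sums (the values and the name 'sums' are only for demonstration). With margin = 1 each entry is divided by its row total, so every row adds up to 100 after scaling.

```r
# Made-up per-customer sums: one row per customer, one column per category
sums <- matrix(c(2, 0, 0,
                 0, 0, 2,
                 3, 1, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, c("cat1", "cat2", "cat3")))

# margin = 1 means "proportions within each row"
pct <- 100 * prop.table(sums, margin = 1)
print(pct)

# The same values computed by dividing by the row totals
pct_manual <- 100 * sums / rowSums(sums)
```

Both forms give the same numbers; 'prop.table' simply states the intent (row-wise proportions) more directly.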

We can also do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)) and change the class of the columns that we want to change to numeric (lapply(.SD, as.numeric)). The columns to be selected can be specified in .SDcols, and we can assign (:=) the output back to the columns by their numeric column index. Grouped by 'customer_id', we loop through columns 4:6 using lapply and get the sum. We use Reduce with + to get the elementwise sum of the lapply output (which is similar to rowSums), divide each sum by the Reduce output within Map, and assign the output back to columns 4:6.

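library(data.table)
# f1 is not defined in the text above; assumed here to turn a sum into a percentage of the total
f1 <- function(x, y) 100 * x / y
setDT(df1)[, (4:6) := lapply(.SD, as.numeric), .SDcols = 4:6][,
   (4:6) := {tmp <- lapply(.SD, sum, na.rm = TRUE)
             Map(f1, tmp, Reduce(`+`, tmp))}, by = customer_id, .SDcols = 4:6]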
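The Reduce/Map step can be tried outside data.table on a plain list. In this sketch 'tmp' stands in for the per-customer sums that lapply(.SD, sum) would return for a single group; the values are made up, and the division operator is passed to Map directly.

```r
# Stand-in for one customer's per-category sums (made-up values)
tmp <- list(cat1 = 3, cat2 = 1, cat3 = 0)

# Reduce folds `+` over the list elements, giving the total across categories
total <- Reduce(`+`, tmp)

# Map divides each category sum by that total; names of 'tmp' are kept
shares <- Map(`/`, tmp, total)

# Scale the proportions to percentages
pct <- lapply(shares, function(x) 100 * x)
str(pct)
```

Inside the data.table call the same logic runs once per 'customer_id', and the resulting list is recycled over that customer's rows.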

data

df1 <- structure(list(transaction = c(55L, 55L, 56L, 57L),
  customer_id = c(1L, 1L, 2L, 3L),
  email = c("email@address.com", "email@address.com",
    "email2@address.com", "email3@address.com"),
  cat1 = c(1L, 1L, 0L, 3L), cat2 = c(0L, 0L, 0L, 0L),
  cat3 = c(0L, 0L, 2L, 0L)),
  .Names = c("transaction", "customer_id", "email", "cat1", "cat2", "cat3"),
  class = "data.frame", row.names = c(NA, -4L))
