python - How to use Pearson Correlation as distance metric in Scikit-learn Agglomerative clustering


I have the following data:

state           murder  assault  urbanpop    rape
alabama         13.200      236        58  21.200
alaska          10.000      263        48  44.500
arizona          8.100      294        80  31.000
arkansas         8.800      190        50  19.500
california       9.000      276        91  40.600
colorado         7.900      204        78  38.700
connecticut      3.300      110        77  11.100
delaware         5.900      238        72  15.800
florida         15.400      335        80  31.900
georgia         17.400      211        60  25.800
hawaii           5.300       46        83  20.200
idaho            2.600      120        54  14.200
illinois        10.400      249        83  24.000
indiana          7.200      113        65  21.000
iowa             2.200       56        57  11.300
kansas           6.000      115        66  18.000
kentucky         9.700      109        52  16.300
louisiana       15.400      249        66  22.200
maine            2.100       83        51   7.800
maryland        11.300      300        67  27.800
massachusetts    4.400      149        85  16.300
michigan        12.100      255        74  35.100
minnesota        2.700       72        66  14.900
mississippi     16.100      259        44  17.100
missouri         9.000      178        70  28.200
montana          6.000      109        53  16.400
nebraska         4.300      102        62  16.500
nevada          12.200      252        81  46.000
new hampshire    2.100       57        56   9.500
new jersey       7.400      159        89  18.800
new mexico      11.400      285        70  32.100
new york        11.100      254        86  26.100
north carolina  13.000      337        45  16.100
north dakota     0.800       45        44   7.300
ohio             7.300      120        75  21.400
oklahoma         6.600      151        68  20.000
oregon           4.900      159        67  29.300
pennsylvania     6.300      106        72  14.900
rhode island     3.400      174        87   8.300
south carolina  14.400      279        48  22.500
south dakota     3.800       86        45  12.800
tennessee       13.200      188        59  26.900
texas           12.700      201        80  25.500
utah             3.200      120        80  22.900
vermont          2.200       48        32  11.200
virginia         8.500      156        63  20.700
washington       4.000      145        73  26.200
west virginia    5.700       81        39   9.300
wisconsin        2.600       53        66  10.800
wyoming          6.800      161        60  15.600

which I use to perform hierarchical clustering based on state. The full working code:

    import pandas as pd
    from sklearn.cluster import AgglomerativeClustering

    df = pd.read_table("http://dpaste.com/031vzpm.txt")
    samples = df["state"].tolist()
    ndf = df[["murder", "assault", "urbanpop", "rape"]]
    # .values works across pandas versions; as_matrix() was removed in pandas 1.0
    x = ndf.values

    # Newer scikit-learn releases rename the `affinity` parameter to `metric`
    cluster = AgglomerativeClustering(n_clusters=3,
                                      linkage='complete',
                                      affinity='euclidean').fit(x)
    label = cluster.labels_
    outclust = list(zip(label, samples))
    outclust_df = pd.DataFrame(outclust, columns=["clusters", "samples"])

    # Print each cluster id together with the states assigned to it
    for clust in outclust_df.groupby("clusters"):
        print(clust)

Notice that the method above uses Euclidean distance. I want to use 1 - Pearson correlation as the distance instead. In R it looks like this:

    dat <- read.table("http://dpaste.com/031vzpm.txt", sep="\t", header=TRUE)
    dist2 = function(x) as.dist(1 - cor(t(x), method="pearson"))
    dat = dat[c("murder","assault","urbanpop","rape")]
    hclust(dist2(dat), method="ward.D")

How can I achieve this using scikit-learn's AgglomerativeClustering? I understand that there is a 'precomputed' argument for affinity, but I am not sure how to use it to address my problem.

You can define a custom affinity function that takes in your data and returns the affinity matrix:

    from scipy.stats import pearsonr
    import numpy as np

    def pearson_affinity(M):
        # 1 - Pearson correlation for every pair of rows (samples) in M
        return 1 - np.array([[pearsonr(a, b)[0] for a in M] for b in M])
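The nested comprehension above calls pearsonr once per pair of rows. As a minimal sketch, assuming the input is a plain numeric 2-D array with no constant rows, the same row-by-row matrix can be obtained in one call with np.corrcoef. The name pearson_affinity_vectorized is made up for this illustration, and the three sample rows are alabama, alaska and arizona from the table in the question.

```python
import numpy as np

def pearson_affinity_vectorized(M):
    # np.corrcoef treats each row as one variable, so the result is the
    # (n_rows x n_rows) matrix of Pearson correlations between samples.
    M = np.asarray(M, dtype=float)
    return 1.0 - np.corrcoef(M)

# Rows: murder, assault, urbanpop, rape for alabama, alaska, arizona
M = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
])
D = pearson_affinity_vectorized(M)
print(D.shape)
print(D)
```

The result is symmetric with a zero diagonal, and every entry lies between 0 (perfectly correlated rows) and 2 (perfectly anti-correlated rows).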

Then you can call the agglomerative clustering with this as the affinity function (you have to change the linkage, since 'ward' only works with Euclidean distance):

    cluster = AgglomerativeClustering(n_clusters=3, linkage='average',
                                      affinity=pearson_affinity)
    cluster.fit(x)

Note that it doesn't seem to work well with your data for some reason:

    cluster.labels_
    Out[107]:
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
           0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 1, 0])
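The 'precomputed' option mentioned in the question is another route: build the 1 - Pearson distance matrix yourself and pass it to fit in place of the raw data. The following is a rough sketch rather than a definitive recipe. The helper name cluster_by_correlation and the toy rows are invented for the illustration, and the signature check assumes that your scikit-learn version accepts the option under either `affinity` (older releases) or `metric` (newer ones).

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_by_correlation(X, n_clusters=3, linkage="average"):
    # Square matrix of 1 - Pearson correlation between the rows of X
    dist = 1.0 - np.corrcoef(np.asarray(X, dtype=float))
    np.fill_diagonal(dist, 0.0)  # remove floating-point noise on the diagonal

    # Assumption: the option is called `metric` in newer scikit-learn
    # releases and `affinity` in older ones, so pick whichever exists.
    params = inspect.signature(AgglomerativeClustering).parameters
    key = "metric" if "metric" in params else "affinity"

    model = AgglomerativeClustering(n_clusters=n_clusters, linkage=linkage,
                                    **{key: "precomputed"})
    # With 'precomputed', fit_predict receives the distance matrix itself
    return model.fit_predict(dist)

# Toy data (hypothetical): three rising rows and three falling rows
toy = np.array([
    [1, 2, 3, 4],
    [2, 4, 6, 8.5],
    [10, 20, 31, 40],
    [4, 3, 2, 1],
    [8, 6, 4, 2.5],
    [40, 30, 21, 10],
])
labels = cluster_by_correlation(toy, n_clusters=2)
print(labels)
```

As with the callable version, the linkage has to be something other than 'ward' here, since 'ward' only works with Euclidean distance.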
