python - How to use Pearson Correlation as distance metric in Scikit-learn Agglomerative clustering
I have the following data:
state            murder   assault  urbanpop  rape
alabama          13.200   236      58        21.200
alaska           10.000   263      48        44.500
arizona          8.100    294      80        31.000
arkansas         8.800    190      50        19.500
california       9.000    276      91        40.600
colorado         7.900    204      78        38.700
connecticut      3.300    110      77        11.100
delaware         5.900    238      72        15.800
florida          15.400   335      80        31.900
georgia          17.400   211      60        25.800
hawaii           5.300    46       83        20.200
idaho            2.600    120      54        14.200
illinois         10.400   249      83        24.000
indiana          7.200    113      65        21.000
iowa             2.200    56       57        11.300
kansas           6.000    115      66        18.000
kentucky         9.700    109      52        16.300
louisiana        15.400   249      66        22.200
maine            2.100    83       51        7.800
maryland         11.300   300      67        27.800
massachusetts    4.400    149      85        16.300
michigan         12.100   255      74        35.100
minnesota        2.700    72       66        14.900
mississippi      16.100   259      44        17.100
missouri         9.000    178      70        28.200
montana          6.000    109      53        16.400
nebraska         4.300    102      62        16.500
nevada           12.200   252      81        46.000
new hampshire    2.100    57       56        9.500
new jersey       7.400    159      89        18.800
new mexico       11.400   285      70        32.100
new york         11.100   254      86        26.100
north carolina   13.000   337      45        16.100
north dakota     0.800    45       44        7.300
ohio             7.300    120      75        21.400
oklahoma         6.600    151      68        20.000
oregon           4.900    159      67        29.300
pennsylvania     6.300    106      72        14.900
rhode island     3.400    174      87        8.300
south carolina   14.400   279      48        22.500
south dakota     3.800    86       45        12.800
tennessee        13.200   188      59        26.900
texas            12.700   201      80        25.500
utah             3.200    120      80        22.900
vermont          2.200    48       32        11.200
virginia         8.500    156      63        20.700
washington       4.000    145      73        26.200
west virginia    5.700    81       39        9.300
wisconsin        2.600    53       66        10.800
wyoming          6.800    161      60        15.600
which I use to perform hierarchical clustering based on state. This is the full working code:
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Load the tab-separated table and keep the state names as sample labels
df = pd.read_table("http://dpaste.com/031vzpm.txt")
samples = df["state"].tolist()
ndf = df[["murder", "assault", "urbanpop", "rape"]]
# .values returns the underlying NumPy array (as_matrix() is gone from recent pandas)
x = ndf.values
cluster = AgglomerativeClustering(n_clusters=3, linkage='complete', affinity='euclidean').fit(x)
label = cluster.labels_
outclust = list(zip(label, samples))
outclust_df = pd.DataFrame(outclust, columns=["clusters", "samples"])
# Print each cluster together with the states assigned to it
for clust in outclust_df.groupby("clusters"):
    print(clust)
Notice that in the method above I use Euclidean distance. What I want is to use 1 - Pearson correlation as the distance. In R it looks like this:
dat <- read.table("http://dpaste.com/031vzpm.txt", sep="\t", header=TRUE)
dist2 = function(x) as.dist(1 - cor(t(x), method="pearson"))
dat = dat[c("murder", "assault", "urbanpop", "rape")]
hclust(dist2(dat), method="ward.D")
How can I achieve this using scikit-learn's AgglomerativeClustering? I understand there is a 'precomputed' argument for affinity, but I am not sure how to use it to address my problem.
You can define a custom affinity function that takes in your data and returns the affinity matrix:
from scipy.stats import pearsonr
import numpy as np

def pearson_affinity(M):
    # Entry (i, j) is 1 - Pearson correlation between rows i and j of M
    return 1 - np.array([[pearsonr(a, b)[0] for a in M] for b in M])
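The nested loop above calls pearsonr once per pair of rows. A minimal sketch of the same matrix computed in one step is shown here, assuming the rows of the input are the samples; the function name pearson_affinity_vectorized is made up for illustration and is not part of any library.

```python
import numpy as np

def pearson_affinity_vectorized(m):
    """Return the matrix of 1 - Pearson correlation between the rows of m."""
    m = np.asarray(m, dtype=float)
    # np.corrcoef treats each row as one variable, so entry (i, j)
    # is the Pearson correlation between row i and row j.
    return 1.0 - np.corrcoef(m)
```

Rows that rise and fall together get a distance near 0, and rows that move in opposite directions get a distance near 2.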
Then you can call agglomerative clustering with pearson_affinity as the affinity function (you have to change the linkage, since 'ward' only works with Euclidean distance):
cluster = AgglomerativeClustering(n_clusters=3, linkage='average', affinity=pearson_affinity)
cluster.fit(x)
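The 'precomputed' option mentioned in the question is another route: build the distance matrix yourself and pass it to fit in place of the raw data. The following is a sketch under a few assumptions, not a definitive implementation. The helper name cluster_by_pearson is hypothetical, and because newer scikit-learn releases rename the affinity parameter to metric, the sketch checks which name the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_by_pearson(x, n_clusters=3, linkage="average"):
    """Cluster the rows of x using 1 - Pearson correlation as the distance."""
    # Square distance matrix between rows: 0 for perfectly correlated rows,
    # 2 for perfectly anti-correlated rows.
    dist = 1.0 - np.corrcoef(np.asarray(x, dtype=float))
    # Older scikit-learn calls the parameter `affinity`, newer calls it `metric`.
    params = inspect.signature(AgglomerativeClustering.__init__).parameters
    key = "metric" if "metric" in params else "affinity"
    model = AgglomerativeClustering(
        n_clusters=n_clusters, linkage=linkage, **{key: "precomputed"}
    )
    # With 'precomputed', fit_predict expects the distance matrix, not the data.
    return model.fit_predict(dist)
```

As with the callable version, linkage has to be something other than 'ward', which only accepts Euclidean distance.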
Note that clustering with pearson_affinity doesn't seem to work well on your data for some reason:
cluster.labels_
Out[107]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0])
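For comparison with the R snippet in the question, SciPy can do the same kind of clustering directly, since its 'correlation' metric is defined as 1 - Pearson correlation between rows. This is only a rough sketch: the function name correlation_clusters is invented for illustration, and it uses average linkage rather than R's "ward.D", because SciPy's 'ward' method assumes Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def correlation_clusters(x, n_clusters=3, method="average"):
    """Hierarchically cluster the rows of x with correlation distance."""
    x = np.asarray(x, dtype=float)
    # Condensed vector of pairwise distances, each 1 - Pearson correlation
    condensed = pdist(x, metric="correlation")
    # Build the merge tree, then cut it into at most n_clusters flat clusters
    tree = linkage(condensed, method=method)
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

The returned labels start at 1 rather than 0, so they are not directly comparable to cluster.labels_ without relabeling.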