Python 2.7: Remove subdomains from list
I have a list of 1,300,000 items, for example ['.a', '.b.a', '.c.b', '.f.c.b'].
I'd like to remove the subdomains (e.g. '.b.a' and '.f.c.b' in the list above).
I'm a newbie trying to learn about speed. The following attempts seem slow. Any suggestions?
# create separate lists, perhaps faster
a1 = []
b2 = []
c3 = []
d4 = []
e5 = []
f6 = []
for i in dupesgone:
    j = i.count('.')
    if j == 1:
        a1.append(i)
    elif j == 2:
        b2.append(i)
    elif j == 3:
        c3.append(i)
    elif j == 4:
        d4.append(i)
    elif j == 5:
        e5.append(i)
    else:
        f6.append(i)

for a in a1:
    la = -len(a)
    for b in b2:
        if a == b[la:]:
            b2.remove(b)
    for c in c3:
        if a == c[la:]:
            c3.remove(c)
    for d in d4:
        if a == d[la:]:
            d4.remove(d)
--snip--

# how about this, is it faster?
[b2.remove(b) for b in b2 for a in a1 if a == b[-len(a):]]
[c3.remove(c) for c in c3 for a in a1 if a == c[-len(a):]]
[d4.remove(d) for d in d4 for a in a1 if a == d[-len(a):]]
[e5.remove(e) for e in e5 for a in a1 if a == e[-len(a):]]
[f6.remove(f) for f in f6 for a in a1 if a == f[-len(a):]]
Should I create a dictionary? Would that be faster?
Thanks for the help.
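As a practical matter, I think the fastest algorithm is to:
- Reverse every item (so ".b.c" becomes "c.b.").
- Sort the list, so that every domain comes directly ahead of its own subdomains.
- Loop through the list with the idea of a "current" item. If the next item on the list does not start with the current item (i.e. it is not a subdomain of it), the next item is added to the output list and becomes the current item. Otherwise it is a subdomain and is skipped.
- Reverse each item on the output list.
Here is an untested sketch of the code: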
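def reverse(s):
    return s[::-1]

def remove_subdomains(domains):
    # Reversed, a domain is a string prefix of each of its subdomains,
    # so sorting places it directly ahead of them.
    r = sorted(map(reverse, domains))
    ci = None  # current item: the last domain kept
    out = []
    for ni in r:
        if not ci or not ni.startswith(ci):
            out.append(ni)
            ci = ni
    return [reverse(s) for s in out]

result = remove_subdomains(dupesgone)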
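On the question of whether a dictionary would be faster: a set (a dictionary with keys only) is another option, though whether it beats the sort on 1,300,000 items is something to measure rather than assume. The following is only a sketch. It assumes every entry starts with a dot, as in the example list, and that duplicates are already gone. The name remove_subdomains_set is made up for illustration.

```python
def remove_subdomains_set(domains):
    """Keep only the entries that have no parent domain in the input.

    Assumes every entry begins with '.', as in the question's example.
    """
    seen = set(domains)  # constant-time membership tests on average
    out = []
    for d in domains:
        # Check each proper suffix that starts at a dot:
        # '.f.c.b' -> '.c.b', then '.b'.
        pos = d.find('.', 1)
        while pos != -1:
            if d[pos:] in seen:
                break  # a parent is in the list, so d is a subdomain
            pos = d.find('.', pos + 1)
        else:
            out.append(d)  # loop ended without finding a parent
    return out

print(remove_subdomains_set(['.a', '.b.a', '.c.b', '.f.c.b']))
# prints ['.a', '.c.b']
```

Unlike the sorted version, this keeps the entries in their original order, and each entry costs one set lookup per label it contains.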