Python: Problems sending 'list' of urls to scrapy spider to scrape
I'm trying to send a 'list' of urls to scrapy crawl via the spider by passing one long string, then splitting the string inside the crawler. I've tried copying the format given in this answer.
The list I'm trying to send to the crawler is future_urls:
>>> print future_urls
set(['https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=tfw.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=dltr&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=agnc&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=hmsy&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=bats.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'])
Then I'm sending it to the crawler through:
command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv -a future_urls={1}").format(input_file, str(','.join(list(future_urls))))

>>> print command4
scrapy crawl future -o future_portfolios_input_10062008_10062012_ver_1.csv -t csv -a future_urls=https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=tfw.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=dltr&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=agnc&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=hmsy&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=bats.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m

>>> type(command4)
<type 'str'>
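(An aside: if command4 is executed through a shell, e.g. with os.system, the unescaped & characters are also shell control operators and would cut the command short at the first one. A minimal sketch of sidestepping shell parsing entirely by handing subprocess an argument list; future_urls and input_file are assumed to be the same objects as above:)

import subprocess

# Each argument goes straight to scrapy with no shell in between,
# so '&' and '?' in the urls are delivered verbatim.
subprocess.call([
    "scrapy", "crawl", "future",
    "-o", "future_portfolios_{0}".format(input_file),
    "-t", "csv",
    "-a", "future_urls=" + ",".join(future_urls),
])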
My crawler (partial):
class FutureSpider(scrapy.Spider):
    name = "future"
    allowed_domains = ["finance.yahoo.com", "ca.finance.yahoo.com"]
    start_urls = ['https://ca.finance.yahoo.com/q/hp?s=%5eixic']

    def __init__(self, *args, **kwargs):
        super(FutureSpider, self).__init__(*args, **kwargs)
        # Rebuild the list from the comma-separated -a argument.
        self.future_urls = kwargs.get('future_urls').split(',')
        self.rate_returns_len_min = 12
        self.required_amount_of_returns = 12

        for x in self.future_urls:
            print 'going to scrape:', x

    def parse(self, response):
        if self.future_urls:
            for x in self.future_urls:
                yield scrapy.Request(x, self.stocks1)
However, what gets printed out by print 'going to scrape:', x is:

going to scrape: https://ca.finance.yahoo.com/q/hp?s=alxn

Only 1 url, and it's only the portion of the first url in future_urls before the first ampersand, which is problematic.
I can't seem to figure out why the crawler won't scrape all of the urls in future_urls.
...
I think it's stopping when it hits the ampersand (&), which you can escape using urllib.quote.

For example:
import urllib

escapedurl = urllib.quote('https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m')
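For reference, urllib.quote percent-encodes everything except letters, digits, and a few safe characters (/ is left alone by default), so escapedurl comes out as:

>>> print escapedurl
https%3A//ca.finance.yahoo.com/q/hp%3Fs%3Dalxn%26a%3D06%26b%3D10%26c%3D2012%26d%3D06%26e%3D10%26f%3D2015%26g%3Dm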
Then to get it back to normal you can do:
>>> urllib.unquote(escapedurl)
'https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'
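Putting the two halves together, a sketch of applying this to your code: quote each url before joining, then unquote after splitting (assuming the same future_urls, input_file, and spider as in the question):

# When building the command: escape each url so it contains no raw '&'.
import urllib
quoted_urls = ','.join(urllib.quote(u) for u in future_urls)
command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv "
            "-a future_urls={1}").format(input_file, quoted_urls)

# Inside the spider's __init__: split on commas first, then unquote
# each piece back into a usable url.
self.future_urls = [urllib.unquote(u)
                    for u in kwargs.get('future_urls').split(',')]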