Python: Problems sending a 'list' of URLs to a Scrapy spider to scrape


I'm trying to send a 'list' of URLs to a scrapy crawl by passing one long string to the spider and splitting the string inside the crawler. I've tried copying the format given in this answer.

The list I'm trying to send to the crawler is future_urls:

    >>> print future_urls
    set(['https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=tfw.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=dltr&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=agnc&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=hmsy&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=bats.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'])

Then I send it to the crawler with:

    command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv -a future_urls={1}").format(input_file, str(','.join(list(future_urls))))
    >>> print command4
    scrapy crawl future -o future_portfolios_input_10062008_10062012_ver_1.csv -t csv -a future_urls=https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=tfw.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=dltr&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=agnc&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=hmsy&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=bats.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m
    >>> type(command4)
    <type 'str'>
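For context, Scrapy passes each -a name=value pair on the command line to the spider as a constructor keyword argument, which is why the crawler below splits future_urls back apart on commas. A toy illustration of that hand-off (the values here are made up):

    # Scrapy turns "-a future_urls=url1,url2,url3" into roughly
    # FutureSpider(future_urls='url1,url2,url3'); the spider splits it back.
    kwargs = {'future_urls': 'url1,url2,url3'}  # stand-in for what -a delivers
    future_urls = kwargs.get('future_urls').split(',')
    print future_urls  # ['url1', 'url2', 'url3']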

My crawler (partial):

    import scrapy

    class FutureSpider(scrapy.Spider):
        name = "future"
        allowed_domains = ["finance.yahoo.com", "ca.finance.yahoo.com"]
        start_urls = ['https://ca.finance.yahoo.com/q/hp?s=%5eixic']

        def __init__(self, *args, **kwargs):
            super(FutureSpider, self).__init__(*args, **kwargs)
            self.future_urls = kwargs.get('future_urls').split(',')
            self.rate_returns_len_min = 12
            self.required_amount_of_returns = 12
            for x in self.future_urls:
                print "going to scrape:"
                print x

        def parse(self, response):
            if self.future_urls:
                for x in self.future_urls:
                    yield scrapy.Request(x, self.stocks1)

However, all that the print statements in __init__ output is:

    going to scrape:
    https://ca.finance.yahoo.com/q/hp?s=alxn

Only one URL, and it's just a portion of the first URL in future_urls, which is the problem.

I can't seem to figure out why the crawler won't scrape all of the URLs in future_urls...

I think it's stopping when it hits an ampersand (&), which you can escape using urllib.quote.
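To see why the ampersand matters, here is a minimal sketch. It assumes command4 is eventually handed to a shell (e.g. via os.system, which the question doesn't show): sh treats a bare & as its background operator and cuts the command off at the first one, matching the truncated URL above.

    import os

    # In sh, an unquoted '&' ends the command and runs it in the background;
    # what follows ('a=06', 'b=10') is parsed as separate commands, so only
    # the part of the URL before the first '&' survives.
    os.system("echo https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10")
    # prints only: https://ca.finance.yahoo.com/q/hp?s=alxn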

For example:

    import urllib

    escapedurl = urllib.quote('https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m')

Then to get it back to normal you can do:

    >>> urllib.unquote(escapedurl)
    'https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'
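Putting it together, one way to apply this to the original pipeline is to quote each URL before joining it into command4, then unquote after splitting inside the spider's __init__. A sketch under those assumptions (the variable names are illustrative):

    import urllib

    future_urls = set([
        'https://ca.finance.yahoo.com/q/hp?s=alxn&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m',
        'http://finance.yahoo.com/q/hp?s=tfw.l&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m',
    ])

    # Before building command4: quote each URL so the shell never sees a
    # bare '&'. Any literal ',' inside a URL would also be encoded (%2C),
    # so the comma we join on stays an unambiguous separator.
    arg = ','.join(urllib.quote(u) for u in future_urls)

    # Inside the spider's __init__, after splitting on ',':
    recovered = [urllib.unquote(u) for u in arg.split(',')]
    assert set(recovered) == future_urls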
