javascript - Scrapy - making selections from a dropdown (e.g. date) on a webpage
I'm new to Scrapy and Python, and I am trying to scrape data off the following start URL.
After login, the start URL is:
start_urls = ["http://www.flightstats.com/go/historicalflightstatus/flightstatusbyflight.do?"]
(a) On that page I need to interact with the webpage to select ---By Airport---
and then make the ---airport, date and time period--- selections.
How can I do that? I want to loop over the time periods and past dates.
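Since date and time-period dropdowns usually just set form fields, one common approach is to generate one formdata dict per (date, period) combination with the standard library and submit each one through FormRequest.from_response. This is only a sketch: the field names 'departureDate' and 'timePeriod' and the period values are placeholders, not the site's real ones, so check the actual <select>/<input> names with Firebug first.

```python
from datetime import date, timedelta

def formdata_combinations(days_back=3):
    """Build one formdata dict per (past date, time period) combination.

    'departureDate' and 'timePeriod' are hypothetical field names -- read the
    real form field names from the page source, then in the spider pass each
    dict to FormRequest.from_response(response, formdata=..., callback=...).
    """
    periods = ["0-6", "6-12", "12-18", "18-24"]   # hypothetical dropdown values
    today = date.today()
    combos = []
    for offset in range(1, days_back + 1):        # loop over past dates
        day = today - timedelta(days=offset)
        for period in periods:                    # loop over time periods
            combos.append({
                "departureDate": day.strftime("%Y-%m-%d"),
                "timePeriod": period,
            })
    return combos

print(len(formdata_combinations()))  # prints 12 (3 days x 4 periods)
```

Yielding one FormRequest per dict from the logged-in callback is what makes Scrapy crawl every date/period selection.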
I have used Firebug to see the source, but I cannot show it here as I do not have enough points to post images.
I read a post mentioning the use of Splinter.
(b) After the selections I am led to a page with links to the eventual pages holding the information I want. How do I populate those links and make Scrapy visit every one of them to extract the information?
Using Rules? Where should I insert the Rules / LinkExtractor function?
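One note on (b): CrawlSpider Rules only fire on responses reached by following links, so pages reached via form POSTs are usually handled instead by extracting the hrefs in a callback, absolutizing them, and yielding a Request per link to a second callback. The Scrapy-independent part of that, turning extracted hrefs into absolute URLs, is just urllib.parse.urljoin; the link paths below are hypothetical examples, not the site's real URLs:

```python
from urllib.parse import urljoin

def build_detail_urls(base_url, hrefs):
    """Turn hrefs scraped with response.xpath('//a/@href').extract() into
    absolute URLs. In the spider callback you would then do
    `yield scrapy.Request(url, callback=self.parse)` for each one."""
    return [urljoin(base_url, href) for href in hrefs]

# Hypothetical relative links as they might appear on the results page:
base = "http://www.flightstats.com/go/historicalflightstatus/flightstatusbyflight.do?"
links = build_detail_urls(base, ["/go/FlightStatus/flightStatusByFlight.do?id=1",
                                 "flightStatusByFlight.do?id=2"])
print(links)
```

urljoin handles both root-relative and page-relative hrefs correctly, which is why it is preferable to string concatenation here.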
I am willing to try this myself and hope I can be pointed to posts that can guide me. I am a student and have spent more than a week on this. I have done the Scrapy tutorial and the Python tutorial, read the Scrapy documentation, and searched previous posts on Stack Overflow, but I did not manage to find any that cover this.
A million thanks.
My code so far, handling the log-in and the items to scrape via XPath from the eventual target site:
```python
import scrapy
from scrapy.http import Request, FormRequest  # Request import added; it was missing
from tutorial.items import FlightItem


class FlightSpider(scrapy.Spider):
    # Note: init_request()/initialized() are only honored by InitSpider
    # (scrapy.contrib.spiders.init.InitSpider), not plain scrapy.Spider.
    name = "flight"
    allowed_domains = ["flightstats.com"]
    login_page = 'https://www.flightstats.com/go/login/login_input.do;jsessionid=0dd6083a334aade3fd6923acb8ddcaa2.web1:8009?'
    start_urls = [
        "http://www.flightstats.com/go/historicalflightstatus/flightstatusbyflight.do?"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate the login request."""
        return FormRequest.from_response(
            response,
            formdata={'loginform_email': 'marvxxxxxx@hotmail.com',
                      'password': 'xxxxxxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by the login request to see if we are
        successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()  # ****This line fixed the last problem*****
        else:
            self.log("\n\n\nFailed, bad times :(\n\n\n")
            # Something went wrong; we couldn't log in, so nothing happens.
```
```python
    def parse(self, response):
        for sel in response.xpath('/html/body/div[2]/div[2]/div'):
            # Use relative paths (leading './') so each XPath is evaluated
            # against the selected <div>, not the document root.
            item = FlightItem()
            item['flight_number'] = sel.xpath('./div[1]/div[1]/h2').extract()
            item['aircraft_make'] = sel.xpath('./div[4]/div[2]/div[2]/div[2]').extract()
            item['dep_date'] = sel.xpath('./div[2]/div[1]/div').extract()
            item['dep_airport'] = sel.xpath('./div[1]/div[2]/div[2]/div[1]').extract()
            item['arr_airport'] = sel.xpath('./div[1]/div[2]/div[2]/div[2]').extract()
            item['dep_gate_scheduled'] = sel.xpath('./div[2]/div[2]/div[1]/div[2]/div[2]').extract()
            item['dep_gate_actual'] = sel.xpath('./div[2]/div[2]/div[1]/div[3]/div[2]').extract()
            item['dep_runway_actual'] = sel.xpath('./div[2]/div[2]/div[2]/div[3]/div[2]').extract()
            item['dep_terminal'] = sel.xpath('./div[2]/div[2]/div[3]/div[2]/div[1]').extract()
            item['dep_gate'] = sel.xpath('./div[2]/div[2]/div[3]/div[2]/div[2]').extract()
            item['arr_gate_scheduled'] = sel.xpath('./div[3]/div[2]/div[1]/div[2]/div[2]').extract()
            item['arr_gate_actual'] = sel.xpath('./div[3]/div[2]/div[1]/div[3]/div[2]').extract()
            item['arr_terminal'] = sel.xpath('./div[3]/div[2]/div[3]/div[2]/div[1]').extract()
            item['arr_gate'] = sel.xpath('./div[3]/div[2]/div[3]/div[2]/div[2]').extract()
            yield item
```