python - Extracting data within multiple links using SCRAPY
I'm trying to scrape this website. As you can see, the URL above shows 10 names, which are links (Alex, Michele, etc.).
import scrapy


class Italy1Spider(scrapy.Spider):
    name = "italyspider"

    def start_requests(self):
        urls = [
            'http://www.odceccastrovillari.it/portale/albo_vista?page=1&field_cognome_value=&field_nome_value=',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'italy2-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
I've written the code above. If I type response.css('a::text').extract() I receive the 10 names mentioned above.
However, what I need is the email address contained within each linked name. And I need this for all the pages, not just the one displayed above.
What do I need to add to the code to achieve this? I've tried various things but can't seem to get it working.
Any help is appreciated!
You need to crawl every page individually. First connect to the listing page, find the person URLs, and then crawl those for the individual person data (name, email, etc.):
import scrapy


class Italy1Spider(scrapy.Spider):
    name = "italyspider"
    start_urls = [
        'http://www.odceccastrovillari.it/portale/albo_vista?page=1&field_cognome_value=&field_nome_value=',
    ]

    def parse(self, response):
        # Find the URLs that point to the individual people
        people_urls = response.css('.view-albo td a::attr(href)').extract()
        people_urls = list(set(people_urls))  # make them unique

        for url in people_urls:
            # Crawl every person's page; urljoin handles relative hrefs
            yield scrapy.Request(response.urljoin(url), callback=self.parse_person)

    def parse_person(self, response):
        # Parse your stuff here.
        # To find the email, locate the <th> node with the 'email' text and
        # navigate to the following <td> node that contains the addresses:
        emails = response.xpath(
            "//th[contains(text(),'email')]/following-sibling::td/text()"
        ).extract()
        # [u'studioragalexaurelio@libero.it', u'alexaurelio@odceccvlegalmail.it']
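One thing the spider above does not handle yet is pagination: the question also wants the other listing pages, not just page=1. A common Scrapy pattern is to have parse() follow the pager's "next" link back into itself. Here is a minimal sketch of a drop-in replacement for the parse() method above, assuming the site renders a Drupal-style pager (the li.pager-next selector is my guess; inspect the actual markup and adjust it):

    def parse(self, response):
        people_urls = set(response.css('.view-albo td a::attr(href)').extract())
        for url in people_urls:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_person)

        # Follow the listing's "next" link, if any, and parse that page the
        # same way. li.pager-next is an assumption about the pager markup.
        next_page = response.css('li.pager-next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Alternatively, since the listing URL carries an explicit page= query parameter, you could generate all the page URLs up front in start_requests() instead of following the pager.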