python - Extracting data within multiple links using SCRAPY -

I'm trying to scrape this website. As you can see, the page at the URL has 10 names, each of which is a link (Alex, Michele, etc.).

    import scrapy


    class Italy1Spider(scrapy.Spider):
        name = "italyspider"

        def start_requests(self):
            urls = [
                'http://www.odceccastrovillari.it/portale/albo_vista?page=1&field_cognome_value=&field_nome_value=',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Save the raw listing page to disk
            page = response.url.split("/")[-2]
            filename = 'italy2-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('saved file %s' % filename)

I've written the code above. I can type response.css('a::text').extract() and receive the 10 names mentioned above.

However, I need the email address that is contained within the link behind each name. I also need this for all pages, not just the one displayed above.

What do I need to add to the code to achieve this? I've tried various things but can't seem to get it working.

Any help is appreciated!

You need to crawl every page individually. First connect to the listing page, find the URLs of the people, and then crawl each of them for the individual person's data (name, email, etc.).

    import scrapy


    class Italy1Spider(scrapy.Spider):
        name = "italyspider"
        start_urls = ['http://www.odceccastrovillari.it/portale/albo_vista?page=1&field_cognome_value=&field_nome_value=']

        def parse(self, response):
            # Find the URLs that point to people
            people_urls = response.css('.view-albo td a::attr(href)').extract()
            people_urls = list(set(people_urls))  # make unique
            for url in people_urls:
                # Go to every person's page; urljoin resolves relative hrefs
                yield scrapy.Request(response.urljoin(url), self.parse_person)

        def parse_person(self, response):
            # Parse stuff here.
            # To find the email, locate the node that has the 'email' text;
            # from there you can navigate to the td node that contains the emails:
            emails = response.xpath("//th[contains(text(),'email')]/following-sibling::td/text()").extract()
            # [u'studioragalexaurelio@libero.it', u'alexaurelio@odceccvlegalmail.it']
            yield {'emails': emails}
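
The question also asks for every listing page, not just one. Here is a minimal sketch of one way to build the other listing URLs, assuming the site paginates through the `page` query parameter visible in the URL. The helper name `listing_urls` and the page range are hypothetical, so check the site's real first and last page numbers before relying on it.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

BASE_URL = ('http://www.odceccastrovillari.it/portale/albo_vista'
            '?page=1&field_cognome_value=&field_nome_value=')


def listing_urls(base_url, first_page, last_page):
    """Return one listing URL per page number in [first_page, last_page].

    Hypothetical helper: it only rewrites the `page` query parameter and
    leaves every other parameter untouched.
    """
    parts = urlsplit(base_url)
    # keep_blank_values=True preserves the empty field_* parameters
    params = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in range(first_page, last_page + 1):
        query = [(k, str(page)) if k == 'page' else (k, v) for k, v in params]
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


if __name__ == '__main__':
    # The range 1..3 is an assumption made for illustration only.
    for url in listing_urls(BASE_URL, 1, 3):
        print(url)
```

The resulting list could be assigned to `start_urls` in the spider, so that `parse` runs once per listing page and follows the people links found on each one.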
