python - Scrapy fails to crawl recursively when two rules are set
I've written a script in Scrapy to crawl a website recursively, but for some reason it isn't able to. I've tested the XPaths in Sublime and they work perfectly, so at this point I can't figure out what I've done wrong.
"items.py" includes:
import scrapy

class CraigpItem(scrapy.Item):
    name = scrapy.Field()
    grading = scrapy.Field()
    address = scrapy.Field()
    phone = scrapy.Field()
    website = scrapy.Field()
The spider, named "craigsp.py", includes:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CraigspSpider(CrawlSpider):
    name = "craigsp"
    allowed_domains = ["craigperler.com"]
    start_urls = ['https://www.americangemsociety.org/en/find-a-jeweler']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//area')),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="jeweler__link"]'), callback='parse_items'),
    ]

    def parse_items(self, response):
        page = response.xpath('//div[@class="page__content"]')
        for titles in page:
            aa = titles.xpath('.//h1[@class="page__heading"]/text()').extract()
            bb = titles.xpath('.//p[@class="appraiser__grading"]/strong/text()').extract()
            cc = titles.xpath('.//p[@class="appraiser__hours"]/text()').extract()
            dd = titles.xpath('.//p[@class="appraiser__phone"]/text()').extract()
            ee = titles.xpath('.//p[@class="appraiser__website"]/a[@class="appraiser__link"]/@href').extract()
            yield {'name': aa, 'grading': bb, 'address': cc, 'phone': dd, 'website': ee}
The command I'm running is:
scrapy crawl craigsp -o items.csv
I hope someone can lead me in the right direction.
Filtered offsite request

This error means that a URL queued by Scrapy did not pass the allowed_domains setting.
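Roughly, the offsite check behaves like the following sketch (my own simplified illustration of the matching rule, not Scrapy's actual OffsiteMiddleware code):

from urllib.parse import urlparse

allowed_domains = ["craigperler.com"]

def is_offsite(url):
    # A request is kept only if its host is an allowed domain
    # or a subdomain of one; everything else is filtered.
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

print(is_offsite("https://www.craigperler.com/about"))       # False - kept
print(is_offsite("https://www.americangemsociety.org/en/"))  # True - filtered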
You have:
allowed_domains = ["craigperler.com"]
but your spider is trying to crawl http://www.americangemsociety.org. You either need to add that domain to the allowed_domains list or get rid of the setting entirely.
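A minimal sketch of the first option (assuming you want to keep the existing domain and also allow the site your start URL actually lives on):

# Add the domain of the start URL; subdomains such as "www."
# are matched automatically, so the bare domain is enough.
allowed_domains = ["craigperler.com", "americangemsociety.org"]

With that change, the requests produced by both rules should no longer be filtered out as offsite.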