python - Scrapy CrawlSpider only crawls start_urls
I found that my CrawlSpider only crawls start_urls and does not go any further.
This is the code:
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class ExampleSpider(CrawlSpider):
        name = 'example'
        allowed_domains = ['holy-bible-eng']
        start_urls = ['file:///g:/holy-bible-eng/oebps/bible-toc.xhtml']

        rules = (
            Rule(LinkExtractor(allow=r'oebps'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            return response

Below is file:///g:/holy-bible-eng/oebps/bible-toc.xhtml, the page in start_urls:
<?xml version="1.0" encoding="utf-8"?> <!doctype html public "-//w3c//dtd xhtml 1.1//en" "http://www.w3.org/tr/xhtml11/dtd/xhtml11.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><meta http-equiv="content-type" content="text/html; charset=utf-8" /><title>holy bible</title><link href="lds_epub_scriptures.css" rel="stylesheet" type="text/css" /></head><body class="bible-toc"><div class="titleblock"><h1 class="toc-title">the names , order of <br /><span class="dominant">books of old , <br />new testaments</span></h1></div><div class="bible-toc"><p><a href="bible_dedication.xhtml">epistle dedicatory</a> | <a href="quad_abbreviations.xhtml">abbreviations</a></p><h2 class="toc-title"><a href="ot.xhtml">the books of old testament</a></h2><p><a href="gen.xhtml">genesis</a> | <a href="ex.xhtml">exodus</a> | <a href="lev.xhtml">leviticus</a> | <a href="num.xhtml">numbers</a> | <a href="deut.xhtml">deuteronomy</a> | <a href="josh.xhtml">joshua</a> | <a href="judg.xhtml">judges</a> | <a href="ruth.xhtml">ruth</a> | <a href="1-sam.xhtml">1 samuel</a> | <a href="2-sam.xhtml">2 samuel</a> | <a href="1-kgs.xhtml">1 kings</a> | <a href="2-kgs.xhtml">2 kings</a> | <a href="1-chr.xhtml">1 chronicles</a> | <a href="2-chr.xhtml">2 chronicles</a> | <a href="ezra.xhtml">ezra</a> | <a href="neh.xhtml">nehemiah</a> | <a href="esth.xhtml">esther</a> | <a href="job.xhtml">job</a> | <a href="ps.xhtml">psalms</a> | <a href="prov.xhtml">proverbs</a> | <a href="eccl.xhtml">ecclesiastes</a> | <a href="song.xhtml">song of solomon</a> | <a href="isa.xhtml">isaiah</a> | <a href="jer.xhtml">jeremiah</a> | <a href="lam.xhtml">lamentations</a> | <a href="ezek.xhtml">ezekiel</a> | <a href="dan.xhtml">daniel</a> | <a href="hosea.xhtml">hosea</a> | <a href="joel.xhtml">joel</a> | <a href="amos.xhtml">amos</a> | <a href="obad.xhtml">obadiah</a> | <a href="jonah.xhtml">jonah</a> | <a href="micah.xhtml">micah</a> | <a href="nahum.xhtml">nahum</a> | <a href="hab.xhtml">habakkuk</a> | <a 
href="zeph.xhtml">zephaniah</a> | <a href="hag.xhtml">haggai</a> | <a href="zech.xhtml">zechariah</a> | <a href="mal.xhtml">malachi</a></p><h2 class="toc-title"><a href="nt.xhtml">the books of new testament</a></h2><p><a href="matt.xhtml">matthew</a> | <a href="mark.xhtml">mark</a> | <a href="luke.xhtml">luke</a> | <a href="john.xhtml">john</a> | <a href="acts.xhtml">acts</a> | <a href="rom.xhtml">romans</a> | <a href="1-cor.xhtml">1 corinthians</a> | <a href="2-cor.xhtml">2 corinthians</a> | <a href="gal.xhtml">galatians</a> | <a href="eph.xhtml">ephesians</a> | <a href="philip.xhtml">philippians</a> | <a href="col.xhtml">colossians</a> | <a href="1-thes.xhtml">1 thessalonians</a> | <a href="2-thes.xhtml">2 thessalonians</a> | <a href="1-tim.xhtml">1 timothy</a> | <a href="2-tim.xhtml">2 timothy</a> | <a href="titus.xhtml">titus</a> | <a href="philem.xhtml">philemon</a> | <a href="heb.xhtml">hebrews</a> | <a href="james.xhtml">james</a> | <a href="1-pet.xhtml">1 peter</a> | <a href="2-pet.xhtml">2 peter</a> | <a href="1-jn.xhtml">1 john</a> | <a href="2-jn.xhtml">2 john</a> | <a href="3-jn.xhtml">3 john</a> | <a href="jude.xhtml">jude</a> | <a href="rev.xhtml">revelation</a></p><h2 class="toc-title"><a href="bible-helps_title-page.xhtml">appendix</a></h2><p><a href="tg.xhtml">topical guide</a> | <a href="bd.xhtml">bible dictionary</a> | <a href="bible-chron.xhtml">bible chronology</a> | <a href="harmony.xhtml">harmony of gospels</a> | <a href="jst.xhtml">joseph smith translation</a> | <a href="bible-maps.xhtml">bible maps</a> | <a href="bible-photos.xhtml">bible photographs</a></p></div></body></html>

And below is the console output:
    (crawl) g:\kjvbible>scrapy crawl example
    ......
    ......
    2017-04-08 09:24:59 [scrapy.core.engine] INFO: Spider opened
    2017-04-08 09:24:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-04-08 09:24:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
    2017-04-08 09:24:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///g:/holy-bible-eng/oebps/bible-toc.xhtml> (referer: None)
    2017-04-08 09:24:59 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-04-08 09:24:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 237,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 3693,

It doesn't go any deeper.
Any suggestions are welcome.
From the CrawlSpider documentation:
"follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False."
So when a rule has a callback, follow defaults to False: the spider only runs the callback and won't go any further unless follow=True is passed explicitly, as your rule already does.
So the main idea behind CrawlSpider's rules is that you define which links to follow and which links to extract data from.
Now, Scrapy isn't the best tool for checking "local" files; for that, just write a simple script.
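Such a script could look roughly like the sketch below, which uses only the standard library. The names LinkCollector and crawl_local are hypothetical, and it assumes the pages are UTF-8 files that link to each other with relative hrefs, as the table of contents in the question does.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urldefrag

PAGE_SUFFIXES = {'.xhtml', '.html', '.htm'}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def crawl_local(start_file):
    """Visit start_file and every local page reachable from it.

    Returns the visited paths in breadth-first order.
    """
    start = Path(start_file).resolve()
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        page = queue.popleft()
        visited.append(page)
        parser = LinkCollector()
        parser.feed(page.read_text(encoding='utf-8'))
        for href in parser.hrefs:
            target, _fragment = urldefrag(href)  # drop any "#anchor" part
            if not target or '://' in target:
                continue  # skip pure anchors and remote URLs
            candidate = (page.parent / unquote(target)).resolve()
            if (candidate.suffix.lower() in PAGE_SUFFIXES
                    and candidate.is_file()
                    and candidate not in seen):
                seen.add(candidate)
                queue.append(candidate)
    return visited
```

Calling crawl_local('g:/holy-bible-eng/oebps/bible-toc.xhtml') would then return the table of contents followed by every book page it can reach; any per-page extraction can go where the page is appended to visited.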
Another problem is setting the allowed_domains class variable, which specifies the domains the spider should accept. All others are rejected, and it only works for links on the internet. Remove that variable if you don't want to reject any domains, or if you are not using domains at all (your case).
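To see why a file:// link can never pass such a check, here is a simplified stand-in for a host-based filter. It is not Scrapy's actual code; host_allowed is a hypothetical helper that assumes the filter compares the URL's host name against allowed_domains and also accepts subdomains.

```python
import re
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    # file:// URLs have no network location, so hostname is None here.
    host = urlparse(url).hostname or ''
    alternatives = '|'.join(re.escape(domain) for domain in allowed_domains)
    pattern = r'^(.*\.)?(%s)$' % alternatives
    return re.search(pattern, host) is not None


# A web URL on an allowed domain passes.
print(host_allowed('http://www.example.com/page', ['example.com']))
# A file:// URL has an empty host, so no domain name can match it.
print(host_allowed('file:///g:/holy-bible-eng/oebps/gen.xhtml', ['holy-bible-eng']))
```

The folder name holy-bible-eng is part of the path, not the host, which is why listing it in allowed_domains rejects every extracted link instead of accepting them.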