Python - Web scraping, unable to scroll down a web page using Selenium WebDriver
I am trying to extract links from a forum (https://www.pakwheels.com/forums/c/travel-n-tours), but my scraper class stops after scrolling down once.
    from bs4 import BeautifulSoup

    sourceurl='https://www.pakwheels.com/forums/c/travel-n-tours'

    # Source of the code below:
    # http://stackoverflow.com/questions/32391303/how-to-scroll-to-the-end-of-the-page-using-selenium-in-python

    # ----------------------- scrolling to the bottom of the page ----------------------- #
    from selenium import webdriver
    import time

    chrome_path=r"c:\users\shani\desktop\chromedriver.exe"
    driver=webdriver.Chrome(chrome_path)
    driver.get(sourceurl)
    updatedlenofpage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenofpage=document.body.scrollHeight;return lenofpage;")
    scrollcomplete=False
    while(scrollcomplete==False):
        currentlenofpage = updatedlenofpage
        updatedlenofpage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenofpage=document.body.scrollHeight;return lenofpage;")
        print('scrolling down')
        time.sleep(5)
        if currentlenofpage==updatedlenofpage:
            scrollcomplete=True
    time.sleep(10)
    pagesource=driver.page_source

    # ------------------------------- getting links ------------------------------- #
    soup = BeautifulSoup(pagesource, 'lxml')
    # print(soup)
    blogurls=[]
    for url in soup.find_all('a'):
        if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
            blogurls.append(url.get('href'))
            print(url.get('href'))
    print(len(blogurls))

It gives the following error:
    Traceback (most recent call last):
      File "d:\liclipsworkspace\nlktlib\scrapping\scrolling.py", line 32, in <module>
        if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
    AttributeError: 'NoneType' object has no attribute 'find'

Please help.
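The traceback says that url.get('href') returned None for at least one tag, which is what BeautifulSoup does when an <a> element has no href attribute. A minimal sketch of a guarded filter, assuming the same three substring rules as in the question; the helper names is_topic_link and filter_topic_links are made up for illustration:

```python
def is_topic_link(href):
    """Return True only for forum topic links; None or empty hrefs are rejected."""
    if not href:  # <a> tags without an href attribute yield None
        return False
    return ('/forums/t/' in href
            and 'about-the-travel-n-tours-category' not in href
            and '/forums/t/topic/' not in href)


def filter_topic_links(hrefs):
    """Keep only the hrefs that pass is_topic_link, preserving order."""
    return [h for h in hrefs if is_topic_link(h)]


# Hypothetical sample values, not taken from the real page.
sample = [None, '/forums/t/some-trip/123', '/forums/t/topic/9', '/about']
print(filter_topic_links(sample))
```

In the question's loop this could be used as blogurls = filter_topic_links(a.get('href') for a in soup.find_all('a')). It only addresses the AttributeError, not the scrolling problem.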
You don't need Selenium; you can get the links from the JSON response. This code gets the URLs from the first 5 pages (to get all pages, change the 5 in range(0, 5) to 264).
    import requests

    # Each page of the category listing is also available as JSON.
    for i in range(0, 5):
        r = requests.get(
            'https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'.format(i)).json()
        topics = r['topic_list']['topics']
        for topic in topics:
            print('https://www.pakwheels.com/forums/t/{}/{}'.format(topic['slug'], topic['id']))
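As a rough sketch of the same idea without a hard-coded page count, the URL building can be split from the fetching so the loop stops at the first page with no topics. This assumes the endpoint returns an empty topics list past the last page, which I have not verified. The names topic_urls, collect_topic_urls, fake_fetch and the sample data are hypothetical; only the topic_list / topics / slug / id layout comes from the code above.

```python
BASE = 'https://www.pakwheels.com/forums/t/{}/{}'


def topic_urls(payload):
    """Build topic URLs from one decoded JSON page."""
    topics = payload.get('topic_list', {}).get('topics', [])
    return [BASE.format(t['slug'], t['id']) for t in topics]


def collect_topic_urls(fetch_page, max_pages):
    """Call fetch_page(0), fetch_page(1), ... and stop at the first empty page."""
    urls = []
    for page in range(max_pages):
        page_urls = topic_urls(fetch_page(page))
        if not page_urls:  # assumption: an empty page means no more topics
            break
        urls.extend(page_urls)
    return urls


# Made-up pages standing in for the real JSON responses.
fake_pages = {
    0: {'topic_list': {'topics': [{'slug': 'trip-a', 'id': 1},
                                  {'slug': 'trip-b', 'id': 2}]}},
    1: {'topic_list': {'topics': [{'slug': 'trip-c', 'id': 3}]}},
}


def fake_fetch(page):
    """Stand-in fetcher so the example runs offline."""
    return fake_pages.get(page, {'topic_list': {'topics': []}})


print(collect_topic_urls(fake_fetch, max_pages=5))
```

Against the real site, fetch_page would be something like lambda i: requests.get('https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'.format(i)).json(), with max_pages as an upper bound.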