python - Web Scraping, unable to scroll down web page using Selenium WebDriver

May 15, 2014

I am trying to extract links from a forum (https://www.pakwheels.com/forums/c/travel-n-tours), but my scraper stops after scrolling down once.

    from bs4 import BeautifulSoup

    sourceurl = 'https://www.pakwheels.com/forums/c/travel-n-tours'

    # Source of the scrolling code below:
    # http://stackoverflow.com/questions/32391303/how-to-scroll-to-the-end-of-the-page-using-selenium-in-python
    # ----------------------- scrolling to the bottom of the page -----------------------

    from selenium import webdriver
    import time

    chrome_path = r"c:\users\shani\desktop\chromedriver.exe"
    driver = webdriver.Chrome(chrome_path)
    driver.get(sourceurl)
    updatedlenofpage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenofpage=document.body.scrollHeight;return lenofpage;")
    scrollcomplete = False
    while(scrollcomplete == False):
        currentlenofpage = updatedlenofpage
        updatedlenofpage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenofpage=document.body.scrollHeight;return lenofpage;")
        print('scrolling down')
        time.sleep(5)
        if currentlenofpage == updatedlenofpage:
            scrollcomplete = True
    time.sleep(10)
    pagesource = driver.page_source

    # ----------------------- getting the links -----------------------
    soup = BeautifulSoup(pagesource, 'lxml')
    # print(soup)

    blogurls = []
    for url in soup.find_all('a'):
        if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
            blogurls.append(url.get('href'))
            print(url.get('href'))

    print(len(blogurls))

It gives the following error:

    Traceback (most recent call last):
      File "d:\liclipsworkspace\nlktlib\scrapping\scrolling.py", line 32, in <module>
        if((url.get('href').find('/forums/t/')!=-1) and (url.get('href').find('about-the-travel-n-tours-category')==-1) and (url.get('href').find('/forums/t/topic/')==-1)):
    AttributeError: 'NoneType' object has no attribute 'find'

Please help.
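
On the AttributeError itself: `url.get('href')` returns None for any `<a>` tag that has no href attribute, and None has no `find` method. Below is a minimal sketch of a guard, assuming the three substring checks from the question are the intended filter. The helper names `is_topic_link` and `filter_topic_links` are made up for illustration and are not part of BeautifulSoup.

```python
from typing import Iterable, List, Optional


def is_topic_link(href: Optional[str]) -> bool:
    """Return True for forum topic links, skipping anchors without an href."""
    if href is None:  # <a> tags with no href make .get('href') return None
        return False
    return (
        '/forums/t/' in href
        and 'about-the-travel-n-tours-category' not in href
        and '/forums/t/topic/' not in href
    )


def filter_topic_links(hrefs: Iterable[Optional[str]]) -> List[str]:
    """Keep only the hrefs that pass is_topic_link, in their original order."""
    return [h for h in hrefs if is_topic_link(h)]
```

With the soup from the question this could be used as `blogurls = filter_topic_links(a.get('href') for a in soup.find_all('a'))`. Another option is `soup.find_all('a', href=True)`, which only returns anchors that actually have an href.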

You don't need Selenium; you can get the links from the JSON response. This code gets the URLs from the first 5 pages (to get all the pages, change the last 5 to 264).

    import requests

    for i in range(0, 5):
        r = requests.get(
            'https://www.pakwheels.com/forums/c/travel-n-tours/l/latest.json?page={}'.format(i)).json()
        topics = r['topic_list']['topics']
        for topic in topics:
            print('https://www.pakwheels.com/forums/t/{}/{}'.format(topic['slug'], topic['id']))
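
If you would rather keep Selenium, one likely reason the loop in the question stops after a single scroll is that it reads the page height immediately after scrolling, before the newly loaded topics have been added, so the two heights compare equal. Here is a rough sketch that waits before measuring. It assumes the page grows `document.body.scrollHeight` as it loads more topics; the function name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are my own, not Selenium's.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until document.body.scrollHeight stops growing.

    `driver` is any object with a Selenium-style execute_script method.
    Returns the number of scrolls performed.
    """
    get_height = "return document.body.scrollHeight;"
    scroll = "window.scrollTo(0, document.body.scrollHeight);"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script(scroll)
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:  # nothing new was loaded
            return rounds
        last_height = new_height
    return max_rounds  # safety cap so the loop cannot run forever
```

Call it as `scroll_to_bottom(driver)` after `driver.get(sourceurl)` and then read `driver.page_source`. A fixed pause is a blunt instrument; on a slow connection it may need to be longer.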
