html - Python BeautifulSoup returning wrong list of inputs from find_all()

May 15, 2014

I have Python 2.7.3 and bs4 version 4.4.1.

For some reason, this code

    from bs4 import BeautifulSoup

    # HTML to parse
    html = """
    <html>
    <head id="head1"><title>title</title></head>
    <body>
        <form id="form" action="login.php" method="post">
            <input type="text" name="fname">
            <input type="text" name="email" >
            <input type="button" name="submit" value="submit">
        </form>
    </body>
    </html>
    """

    html_proc = BeautifulSoup(html, 'html.parser')

    # Print every input found inside every form
    for form in html_proc.find_all('form'):
        for input in form.find_all('input'):
            print("input:" + str(input))

returns the wrong list of inputs:

    input:<input name="fname" type="text">
    <input name="email" type="text">
    <input name="submit" type="button" value="submit">
    </input></input></input>
    input:<input name="email" type="text">
    <input name="submit" type="button" value="submit">
    </input></input>
    input:<input name="submit" type="button" value="submit">
    </input>

It's supposed to return:

    input: <input name="fname" type="text">
    input: <input type="text" name="email">
    input: <input type="button" name="submit" value="submit">

What happened?

To me, this looks like an artifact of the HTML parser. Using the 'lxml' parser instead of 'html.parser' seems to make it work. The downside is that you (or your users) need to install lxml; the upside is that lxml is a better/faster parser ;-).
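One way to compare parsers is to count, for each input, how many other inputs ended up nested inside it. This is only a sketch: `nested_input_counts` is a hypothetical helper name, the markup is a trimmed copy of the form from the question, and the code is written for Python 3. A correctly parsed form should give all zeros; what 'html.parser' reports depends on the installed bs4 version.

```python
from bs4 import BeautifulSoup, FeatureNotFound

HTML = """
<form id="form" action="login.php" method="post">
    <input type="text" name="fname">
    <input type="text" name="email">
    <input type="button" name="submit" value="submit">
</form>
"""


def nested_input_counts(markup, parser):
    """Hypothetical helper: for each <input>, count the <input> tags inside it.

    An <input> cannot have children, so a correct parse yields only zeros.
    """
    soup = BeautifulSoup(markup, parser)
    return [len(tag.find_all("input")) for tag in soup.find_all("input")]


for parser in ("html.parser", "lxml"):
    try:
        print(parser, nested_input_counts(HTML, parser))
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is not installed.
        print(parser, "is not installed")
```

Any non-zero count means that parser treated a later input as a child of an earlier one, which is the symptom shown in the question.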

As for why 'html.parser' doesn't seem to work correctly in this case, I think it has something to do with the fact that the input tags are self-closing (they are written without an end tag). If you explicitly close your inputs, it works:

    <input type="text" name="fname" ></input>
    <input type="text" name="email" ></input>
    <input type="button" name="submit" value="submit" ></input>

I was curious to see whether I could modify the source code to handle this case ... A little experiment that monkey-patches bs4 indicates that it can be done:

    from bs4 import BeautifulSoup
    from bs4.builder import _htmlparser

    # Monkey-patch the Beautiful Soup HTML parser to close input tags automatically.
    BeautifulSoupHTMLParser = _htmlparser.BeautifulSoupHTMLParser

    class FixedParser(BeautifulSoupHTMLParser):
        def handle_starttag(self, name, attrs):
            # Old-style class... no super :-(
            result = BeautifulSoupHTMLParser.handle_starttag(self, name, attrs)
            # An input never gets an end tag, so close it right away.
            if name.lower() == 'input':
                self.handle_endtag(name)
            return result

    _htmlparser.BeautifulSoupHTMLParser = FixedParser

    html = """
    <html>
    <head id="head1"><title>title</title></head>
    <body>
        <form id="form" action="login.php" method="post">
            <input type="text" name="fname" >
            <input type="text" name="email" >
            <input type="button" name="submit" value="submit" >
        </form>
    </body>
    </html>
    """

    html_proc = BeautifulSoup(html, 'html.parser')

    for form in html_proc.find_all('form'):
        for input in form.find_all('input'):
            print("input:" + str(input))

Obviously, this isn't a true fix (I wouldn't submit it as a patch to the bs4 folks), but it does demonstrate the problem. Since there is no end tag, the handle_endtag method never gets called. If we call it ourselves, things tend to work out (as long as the HTML doesn't also have a closing input tag ...).

I'm not sure whose responsibility this bug should be, but I suppose you could start by submitting it to bs4. They might then forward you on to report a bug on the Python tracker, I'm not sure...
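The missing handle_endtag call can be seen without bs4 at all, using the standard-library parser that 'html.parser' is built on. The following is a minimal sketch for Python 3 (the code above targets Python 2); `EventLogger` and `log_events` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser  # Python 3 standard library


class EventLogger(HTMLParser):
    """Hypothetical helper: records which handler fires for each tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_events(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# Inputs without an end tag: no ("end", "input") event is ever recorded.
unclosed = log_events('<form><input name="a"><input name="b"></form>')

# An explicitly closed input: the end event shows up.
closed = log_events('<form><input name="a"></input></form>')

print(unclosed)
print(closed)
```

The low-level parser only reports the tags it actually sees, so anything built on top of it has to decide for itself when an input ends.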
