html - Python BeautifulSoup returning wrong list of inputs from find_all() -


I have Python 2.7.3 and bs4 version 4.4.1.

For some reason, this code

    from bs4 import BeautifulSoup

    # Markup to parse
    html = """
    <html>
    <head id="head1"><title>title</title></head>
    <body>
        <form id="form" action="login.php" method="post">
            <input type="text" name="fname">
            <input type="text" name="email" >
            <input type="button" name="submit" value="submit">
        </form>
    </body>
    </html>
    """

    html_proc = BeautifulSoup(html, 'html.parser')

    for form in html_proc.find_all('form'):
        for input in form.find_all('input'):
            print("input:" + str(input))

returns the wrong list of inputs:

    input:<input name="fname" type="text"> <input name="email" type="text"> <input name="submit" type="button" value="submit"> </input></input></input>
    input:<input name="email" type="text"> <input name="submit" type="button" value="submit"> </input></input>
    input:<input name="submit" type="button" value="submit"> </input>

It is supposed to return:

    input: <input name="fname" type="text">
    input: <input type="text" name="email">
    input: <input type="button" name="submit" value="submit">

What happened?

To me, this looks like an artifact of the HTML parser. Using the 'lxml' parser instead of 'html.parser' seems to make it work. The downside is that you (or your users) need to install lxml; the upside is that lxml is a better/faster parser ;-).

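As for why 'html.parser' doesn't seem to work correctly in this case, I think it has something to do with the fact that the input tags are self-closing (they are written without an end tag). If you explicitly close your inputs, it works: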

    <input type="text" name="fname" ></input>
    <input type="text" name="email" ></input>
    <input type="button" name="submit" value="submit" ></input>
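To see the difference at the parser level, here is a minimal sketch that uses only the standard library's HTMLParser, which is the parser 'html.parser' builds on. It assumes Python 3 (on Python 2 the module is named HTMLParser), and the EventLogger class and input_events function are hypothetical names made up for this illustration; they are not part of bs4.

```python
from html.parser import HTMLParser  # Python 3 stdlib; on Python 2 the module is HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records which callbacks fire for <input> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.events.append("start")

    def handle_endtag(self, tag):
        if tag == "input":
            self.events.append("end")


def input_events(markup):
    """Return the list of start/end events seen for <input> tags."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


unclosed = '<form><input type="text" name="fname"><input type="text" name="email"></form>'
closed = '<form><input type="text" name="fname"></input><input type="text" name="email"></input></form>'

print(input_events(unclosed))  # only "start" events are reported
print(input_events(closed))    # "start" and "end" events alternate
```

With the unclosed markup the parser reports no end event for the inputs at all, so anything building a tree on top of these callbacks has to decide for itself when an input is finished.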

I was curious to see if I could modify the source code to handle this case ... Doing a little experiment to monkey-patch bs4 indicates that we can do this:

    from bs4 import BeautifulSoup
    from bs4.builder import _htmlparser

    # Monkey-patch the Beautiful Soup HTML parser to close input tags automatically.
    BeautifulSoupHTMLParser = _htmlparser.BeautifulSoupHTMLParser

    class FixedParser(BeautifulSoupHTMLParser):
        def handle_starttag(self, name, attrs):
            # Old-style class... no super :-(
            result = BeautifulSoupHTMLParser.handle_starttag(self, name, attrs)
            if name.lower() == 'input':
                self.handle_endtag(name)
            return result

    _htmlparser.BeautifulSoupHTMLParser = FixedParser

    html = """
    <html>
    <head id="head1"><title>title</title></head>
    <body>
        <form id="form" action="login.php" method="post">
            <input type="text" name="fname" >
            <input type="text" name="email" >
            <input type="button" name="submit" value="submit" >
        </form>
    </body>
    </html>
    """

    html_proc = BeautifulSoup(html, 'html.parser')

    for form in html_proc.find_all('form'):
        for input in form.find_all('input'):
            print("input:" + str(input))

Obviously, this isn't a true fix (I wouldn't submit this as a patch to the bs4 folks), but it does demonstrate the problem. Since there is no end tag, the handle_endtag method is never getting called. If we call it ourselves, things tend to work out (as long as the HTML doesn't also have a closing input tag ...).
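The nesting effect can be reproduced without bs4. The following is a rough sketch, assuming Python 3 and only the standard library; DepthTracker, input_depths, and the auto_close_input flag are hypothetical names for this illustration. It is a deliberately naive tree builder (start tags push onto a stack, end tags pop), not how bs4 is actually implemented.

```python
from html.parser import HTMLParser  # Python 3 stdlib; on Python 2 the module is HTMLParser


class DepthTracker(HTMLParser):
    """Hypothetical naive tree builder: start tags push, end tags pop."""

    def __init__(self, auto_close_input=False):
        HTMLParser.__init__(self)
        self.auto_close_input = auto_close_input
        self.stack = []         # tags that are currently open
        self.input_depths = []  # nesting depth at which each <input> was opened

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.input_depths.append(len(self.stack))
        self.stack.append(tag)
        if tag == "input" and self.auto_close_input:
            # Same idea as the monkey-patch: close the tag ourselves.
            self.handle_endtag(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def input_depths(markup, auto_close_input=False):
    """Return the stack depth recorded for each <input> in the markup."""
    tracker = DepthTracker(auto_close_input)
    tracker.feed(markup)
    tracker.close()
    return tracker.input_depths


markup = '<form><input name="fname"><input name="email"><input name="submit"></form>'

print(input_depths(markup))                         # each input opens inside the previous one
print(input_depths(markup, auto_close_input=True))  # all inputs sit at the same depth
```

Without the extra handle_endtag call each input is still open when the next one starts, which matches the nested output in the question; with it, the three inputs end up as siblings inside the form.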

I'm not sure whose responsibility this bug should be, but I suppose you could start by submitting it to bs4. They might forward you on to report a bug on the Python tracker, I'm not sure...

