html - Python BeautifulSoup returning wrong list of inputs from find_all()
I have Python 2.7.3 and bs4 version 4.4.1.
For some reason, this code
    from bs4 import BeautifulSoup  # parsing

    html = """
    <html>
    <head id="head1"><title>title</title></head>
    <body>
    <form id="form" action="login.php" method="post">
    <input type="text" name="fname">
    <input type="text" name="email" >
    <input type="button" name="submit" value="submit">
    </form>
    </body>
    </html>
    """

    html_proc = BeautifulSoup(html, 'html.parser')

    # Walk every form and print each input tag found inside it.
    for form in html_proc.find_all('form'):
        for input in form.find_all('input'):
            print "input:" + str(input)
returns the wrong list of inputs:
    input:<input name="fname" type="text">
    <input name="email" type="text">
    <input name="submit" type="button" value="submit">
    </input></input></input>
    input:<input name="email" type="text">
    <input name="submit" type="button" value="submit">
    </input></input>
    input:<input name="submit" type="button" value="submit">
    </input>
It's supposed to return:
    input: <input name="fname" type="text">
    input: <input type="text" name="email">
    input: <input type="button" name="submit" value="submit">
What happened?
To me, this looks like an artifact of the HTML parser. Using 'lxml' as the parser instead of 'html.parser' seems to make it work. The downside is that you (or your users) then need to install lxml; the upside is that lxml is a better/faster parser ;-).
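A minimal sketch of the parser swap, assuming lxml is installed alongside bs4 (for example via `pip install lxml`). The variable names `soup`, `inputs`, `names` and `nested` are only for illustration; the markup is the same as in the question:

```python
from bs4 import BeautifulSoup

html = """
<html>
<head id="head1"><title>title</title></head>
<body>
<form id="form" action="login.php" method="post">
<input type="text" name="fname">
<input type="text" name="email" >
<input type="button" name="submit" value="submit">
</form>
</body>
</html>
"""

# Assumption: the lxml package is available; bs4 raises an error otherwise.
soup = BeautifulSoup(html, "lxml")

# Collect every input tag inside every form.
inputs = [inp for form in soup.find_all("form") for inp in form.find_all("input")]

# The name attribute of each input, in document order.
names = [inp["name"] for inp in inputs]

# True for any input that wrongly contains another input tag.
nested = [inp.find("input") is not None for inp in inputs]

print(names)
print(nested)
```

If the parser treats the inputs as siblings, `names` lists the three fields once each and no entry in `nested` is true.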
As for why 'html.parser' doesn't seem to work correctly in this case, I think it has something to do with the fact that input tags are self-closing. If you explicitly close your inputs, it works:
    <input type="text" name="fname" ></input>
    <input type="text" name="email" ></input>
    <input type="button" name="submit" value="submit" ></input>
I was curious to see if I could modify the source code to handle this case ... A little experiment with monkey-patching bs4 indicates that we can do this:
    from bs4 import BeautifulSoup
    from bs4.builder import _htmlparser

    # Monkey-patch the Beautiful Soup HTML parser to close input tags automatically.
    BeautifulSoupHTMLParser = _htmlparser.BeautifulSoupHTMLParser

    class FixedParser(BeautifulSoupHTMLParser):
        def handle_starttag(self, name, attrs):
            # Old-style class... no super :-(
            result = BeautifulSoupHTMLParser.handle_starttag(self, name, attrs)
            # An input never gets an end tag, so emit one right away.
            if name.lower() == 'input':
                self.handle_endtag(name)
            return result

    _htmlparser.BeautifulSoupHTMLParser = FixedParser

    html = """
    <html>
    <head id="head1"><title>title</title></head>
    <body>
    <form id="form" action="login.php" method="post">
    <input type="text" name="fname" >
    <input type="text" name="email" >
    <input type="button" name="submit" value="submit" >
    </form>
    </body>
    </html>
    """

    html_proc = BeautifulSoup(html, 'html.parser')

    for form in html_proc.find_all('form'):
        for input in form.find_all('input'):
            print "input:" + str(input)
Obviously, this isn't a true fix (I wouldn't submit this as a patch to the bs4 folks), but it does demonstrate the problem. Since there is no end tag, the handle_endtag method never gets called. If we call it ourselves, things tend to work out (as long as the HTML doesn't also have a closing input tag ...).
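The missing handle_endtag call can be seen without bs4 at all, using only the standard library parser that 'html.parser' is built on. This is a rough sketch written for Python 3 (on Python 2.7 the module is named HTMLParser instead of html.parser); `EventRecorder` and `record_events` are hypothetical names made up for the illustration:

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records which handler methods the parser calls."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record_events(markup):
    """Feed markup to a fresh EventRecorder and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# Input tag without an end tag, as in the question.
unclosed = record_events('<form><input type="text" name="fname"></form>')

# Input tag closed explicitly, as in the workaround.
closed = record_events('<form><input type="text" name="fname"></input></form>')

print(unclosed)  # [('start', 'form'), ('start', 'input'), ('end', 'form')]
print(closed)    # [('start', 'form'), ('start', 'input'), ('end', 'input'), ('end', 'form')]
```

With the unclosed input there is no ('end', 'input') event, so a tree builder that relies only on these callbacks has no signal that the input is finished. That is consistent with the later tags ending up nested inside the first one.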
I'm not sure whose responsibility this bug should be, but I suppose you could start by submitting it to bs4; they might then forward you on to report a bug on the Python tracker, I'm not sure...