Python - BeautifulSoup - how to deal with missing closing tags
I am scraping the HTML of a table using BeautifulSoup. A snippet of the HTML is shown below. When I call table.find_all('tr'),
I get the entire table back, not the individual rows (probably because the closing tags are missing from the HTML?).
<table cols=9 border=0 cellspacing=3 cellpadding=0> <tr><td><b>artikelbezeichnung</b> <td><b>anbieter</b> <td><b>menge</b> <td><b>taxe-ek</b> <td><b>taxe-vk</b> <td><b>empf.-vk</b> <td><b>fb</b> <td><b>pzn</b> <td><b>nachfolge</b> <tr><td>actiq 200 mikrogramm lutschtabl.m.integr.appl. <td>orifarm <td id=r> 30 st <td id=r> 266,67 <td id=r> 336,98 <td> <td> <td>12516714 <td> </table>
Here is the Python code I am struggling with:

    from bs4 import BeautifulSoup

    # data holds the HTML shown above
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find_all("table")[0]
    rows = table.find_all('tr')
    for tr in rows:
        print(tr.text)
As stated in the documentation, html5lib parses the document the way a web browser does (as does lxml
in this case). It will try to fix your document tree by adding or closing tags when needed.
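To see what there is to fix, here is a minimal sketch using only Python's built-in html.parser module, which BeautifulSoup's "html.parser" option builds on. It logs the tags that are literally present in a shortened version of the table from the question. The class name TagLogger and the shortened DATA string are made up for illustration.

```python
from html.parser import HTMLParser

# Shortened version of the table from the question; </tr> and </td> are omitted.
DATA = (
    "<table><tr><td><b>artikelbezeichnung</b> <td><b>anbieter</b> "
    "<tr><td>actiq 200 mikrogramm <td>orifarm </table>"
)


class TagLogger(HTMLParser):
    """Hypothetical helper that records every start and end tag it is given."""

    def __init__(self):
        super().__init__()
        self.opened = []  # tag names seen as start tags, in order
        self.closed = []  # tag names seen as end tags, in order

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


logger = TagLogger()
logger.feed(DATA)
logger.close()

print("opened:", logger.opened)
print("closed:", logger.closed)
```

No end tag is ever reported for tr or td, so a tree built from these events alone keeps nesting each new cell and row inside the previous one. That would explain why the first tr appears to contain the whole table.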
For your example I used the lxml parser, which returned each row separately:

    soup = BeautifulSoup(data, "lxml")
    table = soup.find_all("table")[0]
    rows = table.find_all('tr')
    for tr in rows:
        print(tr.get_text(strip=True))
Note that lxml
added html and body tags because they weren't present in the source (it will try to create a well-formed document, as stated above).
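If installing lxml or html5lib is not an option, the implicit-closing rule can also be applied by hand. The following is only a rough sketch of that idea, not what BeautifulSoup or lxml do internally. It assumes a single, non-nested table; RowCollector and the shortened DATA string are hypothetical names for illustration.

```python
from html.parser import HTMLParser

# Shortened version of the table from the question; </tr> and </td> are omitted.
DATA = (
    "<table><tr><td><b>artikelbezeichnung</b> <td><b>anbieter</b> "
    "<tr><td>actiq 200 mikrogramm <td>orifarm </table>"
)


class RowCollector(HTMLParser):
    """Hypothetical helper: every <tr>/<td> implicitly closes the previous one."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per row
        self.in_cell = False  # True while text belongs to the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new <tr> ends the previous row, closed or not.
            self.rows.append([])
            self.in_cell = False
        elif tag in ("td", "th") and self.rows:
            # A new cell ends the previous cell, closed or not.
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


collector = RowCollector()
collector.feed(DATA)
collector.close()

# Strip the whitespace left over between cells.
rows = [[cell.strip() for cell in row] for row in collector.rows]
for row in rows:
    print(row)
```

Nested tables and attributes such as id=r are ignored here. For anything beyond a simple flat table, switching the parser as shown above is the more robust route.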