Python - BeautifulSoup - how to deal with missing closing tags


I scrape an HTML table using BeautifulSoup; a snippet of the HTML is shown below. When I use table.find_all('tr') I get the entire table back as one row, not the individual rows (probably because the closing tags are missing from the HTML?).

    <table cols=9 border=0 cellspacing=3 cellpadding=0>
    <tr><td><b>artikelbezeichnung</b>
    <td><b>anbieter</b>
    <td><b>menge</b>
    <td><b>taxe-ek</b>
    <td><b>taxe-vk</b>
    <td><b>empf.-vk</b>
    <td><b>fb</b>
    <td><b>pzn</b>
    <td><b>nachfolge</b>
    <tr><td>actiq 200 mikrogramm lutschtabl.m.integr.appl.
    <td>orifarm
    <td id=r>     30 st
    <td id=r>  266,67
    <td id=r>  336,98
    <td>&nbsp;
    <td>&nbsp;
    <td>12516714
    <td>&nbsp;
    </table>

Here is the Python code I am struggling with:

    soup = BeautifulSoup(data, "html.parser")
    table = soup.find_all("table")[0]
    rows = table.find_all('tr')
    for tr in rows:
        print(tr.text)

As stated in the documentation, html5lib parses the document the way a web browser does (as lxml does in this case). It will try to repair the document tree by adding or closing tags where needed.

In this example I've used the lxml parser, and it gave the following result:

    soup = BeautifulSoup(data, "lxml")
    table = soup.find_all("table")[0]
    rows = table.find_all('tr')
    for tr in rows:
        print(tr.get_text(strip=True))

Note that lxml also added html and body tags because they weren't present in the source (it tries to produce a well-formed document).
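To see the parser difference in isolation, here is a minimal sketch using a made-up fragment with the same kind of missing closing tags (the fragment and variable names are illustrative, not from the original question):

```python
from bs4 import BeautifulSoup

# Tiny fragment with no closing </td> or </tr> tags, like the scraped table
fragment = "<table><tr><td>a<td>b<tr><td>c</table>"

# html.parser keeps the tags nested exactly as written, so the first <tr>
# ends up containing everything that follows it - row one "swallows" row two
first_row = BeautifulSoup(fragment, "html.parser").find("tr")
print("c" in first_row.get_text())

# lxml inserts the implied closing tags the way a browser would,
# so each <tr> holds only its own cells
first_row = BeautifulSoup(fragment, "lxml").find("tr")
print("c" in first_row.get_text())
```

With html.parser the first row's text includes the second row's cells, which matches the symptom described above; with lxml the rows are separated correctly.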

