Python - BeautifulSoup - how to deal with missing closing tags

April 15, 2015

I am scraping a table from HTML code using BeautifulSoup. A snippet of the HTML is shown below. When I use table.find_all('tr') I get the entire table, not just the rows (probably because closing tags are missing from the HTML code?).

    <table cols=9 border=0 cellspacing=3 cellpadding=0>
    <tr><td><b>artikelbezeichnung</b>
    <td><b>anbieter</b>
    <td><b>menge</b>
    <td><b>taxe-ek</b>
    <td><b>taxe-vk</b>
    <td><b>empf.-vk</b>
    <td><b>fb</b>
    <td><b>pzn</b>
    <td><b>nachfolge</b>

    <tr><td>actiq 200 mikrogramm lutschtabl.m.integr.appl.
    <td>orifarm
    <td id=r>     30 st
    <td id=r>  266,67
    <td id=r>  336,98
    <td>&nbsp;
    <td>&nbsp;
    <td>12516714
    <td>&nbsp;

    </table>

Here is the Python code that shows what I am struggling with:

    from bs4 import BeautifulSoup

    # data holds the HTML source containing the table shown above
    soup = BeautifulSoup(data, "html.parser")
    table = soup.find_all("table")[0]
    rows = table.find_all("tr")
    for tr in rows:
        print(tr.text)

As stated in the documentation, html5lib parses the document the way a web browser does (as does lxml in this case). It will try to fix the document tree by adding or closing tags when needed.
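To see why "html.parser" behaves differently, it can help to look at what the standard library parser underneath it reports. The following is a minimal sketch, not BeautifulSoup's own code: it assumes a hypothetical miniature table (`sample`) with the same defect as the one above and a hypothetical helper class `TagCounter` that simply counts the start and end tags the parser sees.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start and end tags exactly as they appear in the markup."""

    def __init__(self):
        super().__init__()
        self.starts = Counter()
        self.ends = Counter()

    def handle_starttag(self, tag, attrs):
        self.starts[tag] += 1

    def handle_endtag(self, tag):
        self.ends[tag] += 1


# Hypothetical miniature of the table above: no </td> or </tr> anywhere.
sample = "<table><tr><td>a<td>b<tr><td>c</table>"

counter = TagCounter()
counter.feed(sample)
counter.close()

print(dict(counter.starts))  # {'table': 1, 'tr': 2, 'td': 3}
print(dict(counter.ends))    # {'table': 1}
```

The parser reports three opening td tags and two opening tr tags but never a closing one, so a tree builder that does not repair the markup nests every later cell and row inside the first row.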

In this example I've used the lxml parser, and it gave the following result:

    from bs4 import BeautifulSoup

    # data holds the same HTML source as in the question
    soup = BeautifulSoup(data, "lxml")
    table = soup.find_all("table")[0]
    rows = table.find_all("tr")
    for tr in rows:
        print(tr.get_text(strip=True))

Note that lxml added html and body tags because they weren't present in the source (it will try to create a well-formed document, as stated previously).
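The difference between the two parsers can be checked side by side. This is a rough sketch that assumes the bs4 and lxml packages are installed. The `sample` string and the `summarize` helper are made up for illustration and stand in for the real table.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table above: no </td> or </tr> anywhere.
sample = "<table><tr><td>a<td>b<tr><td>c</table>"


def summarize(parser):
    """Return (were html/body tags added?, text of the first row)."""
    soup = BeautifulSoup(sample, parser)
    wrapped = soup.html is not None and soup.body is not None
    first_row_text = soup.find("tr").get_text(" ", strip=True)
    return wrapped, first_row_text


for parser in ("html.parser", "lxml"):
    print(parser, summarize(parser))
```

With "html.parser" the first row swallows the text of every later cell and no html or body tag appears. With "lxml" the first row contains only its own two cells and the fragment is wrapped in html and body.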
