python - How to extract all the hrefs and titles from several `<a href="" title="">` tags?

Given this file:

<a data-parent="#accordion1" data-toggle="collapse" href="# fruitname1" title="click expand drug name"> <span class="list-unstyled" style="text-decoration: none;"></span> glipizide           </a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114223" title="click view lemons (lemons) | poq  #114223 | box;67 pz | presentation | fruit company 1 ">                               lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114226" title="click view lemons (lemons) | poq  #114226 | box;67 pz | presentation | fruit company 2 ">                               lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114305" title="click view lemons (lemons) | poq  #114305 | box;67 pz | presentation | fruit company 3 ">                               lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114370" title="click view lemons (lemons) | poq  #114370 | box;67 pz | discontinued | fruit company 1 ">                               lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114378" title="click view lemons (lemons) | poq  #114378 | box;67 pz | discontinued | fruit company 4 ">                               lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114387" title="click view lemons (lemons) | poq  #114387 | box;67 pz | discontinued | fruit company 5 ">                               lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114438" title="click view lemons (lemons) | poq  #114438 | box;67 pz | presentation | fruit company 2 ">                               lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114497" title="click view lemons (lemons) | poq  #114497 | box;67 pz | presentation | fruit company 5 ">                               
lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114542" title="click view lemons (lemons) | poq  #114542 | box;67 pz | discontinued | fruit company 3 ">                               lemons (lemons)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114550" title="click view lemons (lemons) | poq  #114550 |           </a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=117270" title="click view grapes (green grapes ; aus) | poq  #117270 | box;67 pz | presentation | fruit company 10  ">                               grapes (green grapes ; aus)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=117511" title="click view grapes (green grapes ; aus) | poq  #117511 | box;67 pz | presentation | fruit company 11 ">                               grapes (green grapes ; aus)</a> <a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=117620" title="click view grapes (green grapes ; aus) | poq  #117620 | box;67 pz | presentation | fruit company 12 ">

Using regex or Beautiful Soup, how can I extract the href and title of each <a href="" title=""> tag and add www.example.com before each href, to get output like this:

www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114223 |  title= | click view lemons (lemons) | poq  #114223 | box;67 pz | presentation | fruit company 1 | lemons (lemons) www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114226 |  title= | click view lemons (lemons) | poq  #114226 | box;67 pz | presentation | fruit company 2 | lemons (lemons) www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114305 |  title= | click view lemons (lemons) | poq  #114305 | box;67 pz | presentation | fruit company 3 | lemons (lemons) www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114370 |  title= | click view lemons (lemons) | poq  #114370 | box;67 pz | discontinued | fruit company 1 | lemons (lemons)

I tried:

for a in soup.tbody.find_all('a', href=True):
    r = re.compile('(?<=href=").*?(?=")')
    r.findall(str(a))

and:

for a in soup.tbody.find_all('a', href=True):
    print(a.find('a')['href'])
    print(a.find('a')['title'])

However, I do not know how to rearrange the titles and hrefs.

Update: based on odradek's answer, I tried this:

soup = BeautifulSoup(open('file.htm'), 'lxml')
for a in soup.tbody.find_all('a', href=True):
    html = a
    prefix = 'www.example.com'
    template = '{prefix}{url} | {title}'.format
    links = [template(prefix=prefix, url=e['href'], title=e['title'])
             for e in html.find_all('a', href=True)]
    print(links)

However, I got:

[] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []

You can use BeautifulSoup's parsing methods instead of a complicated regexp, like this:

# URL you want to add at the beginning
prefix = 'www.example.com'

# template of the desired output
template = '{prefix}{url} | {title}'.format

# resulting list; note that the "html" variable is the
# parsed version of the source code you were given
links = [template(prefix=prefix, url=e.get('href'), title=e.get('title'))
         for e in html.find_all('a', href=True)]

When run against the last two <a> tags of your list:

$ python get_all_a.py
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&applno=117511 | click view grapes (green grapes ; aus) | poq  #117511 | box;67 pz | presentation | fruit company 11
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&applno=117620 | click view grapes (green grapes ; aus) | poq  #117620 | box;67 pz | presentation | fruit company 12
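
The output above stops at the title, while the format requested in the question also ends with the link text. Below is a minimal sketch of one way to append it. It assumes the literal `title=` marker from the question is wanted, and that only hrefs starting with `/` are of interest (which skips the accordion link). `SAMPLE`, `format_link` and `extract_links` are illustrative names, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Two anchors taken from the question's file: the accordion link and one data link.
SAMPLE = '''
<a data-parent="#accordion1" data-toggle="collapse" href="# fruitname1" title="click expand drug name"> glipizide </a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=114223" title="click view lemons (lemons) | poq  #114223 | box;67 pz | presentation | fruit company 1 ">
    lemons (lemons)</a>
'''

PREFIX = 'www.example.com'


def format_link(tag, prefix=PREFIX):
    """Build 'prefix+href | title= | title | link text' for one <a> tag."""
    title = tag['title'].strip()
    text = tag.get_text(strip=True)
    return '{}{} | title= | {} | {}'.format(prefix, tag['href'], title, text)


def extract_links(markup):
    soup = BeautifulSoup(markup, 'html.parser')
    # Require both attributes so tag['title'] cannot raise KeyError,
    # and keep only site-relative hrefs (assumption: those start with '/').
    return [format_link(a)
            for a in soup.find_all('a', href=True, title=True)
            if a['href'].startswith('/')]


for line in extract_links(SAMPLE):
    print(line)
```

Note that the parser decodes `&amp;` in the href to a plain `&`, as in the answer's output above.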

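Based on your update: you shouldn't put this code inside a loop over the <a> tags. Each <a> contains no nested <a>, so calling find_all on it returns an empty list every time. Instead, call it once on the whole document: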

from bs4 import BeautifulSoup

html = BeautifulSoup(open('file.htm'), 'html.parser')

prefix = 'www.example.com'

template = '{prefix}{url} | {title}'.format

# inside the list comprehension the loop is implied
links = [template(prefix=prefix, url=e.get('href'), title=e.get('title'))
         for e in html.find_all('a', href=True)]
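
Since the question also asks about regex, here is a rough standard-library sketch for comparison. It assumes every wanted tag is written exactly as in the sample (a double-quoted `href` starting with `/`, immediately followed by a double-quoted `title`), which is why the parser-based approach above is the more robust choice. `ANCHOR_RE`, `SAMPLE` and `extract_links` are illustrative names.

```python
import html
import re

# Assumes href comes directly before title, both double-quoted, as in the sample.
ANCHOR_RE = re.compile(
    r'<a\s+href="(?P<href>/[^"]*)"\s+title="(?P<title>[^"]*)"\s*>'
    r'(?P<text>.*?)</a>',
    re.DOTALL,
)

# Two anchors taken from the question's file.
SAMPLE = (
    '<a data-parent="#accordion1" data-toggle="collapse" href="# fruitname1" '
    'title="click expand drug name"> glipizide </a> '
    '<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;applno=117511" '
    'title="click view grapes (green grapes ; aus) | poq  #117511 | box;67 pz '
    '| presentation | fruit company 11 ">      grapes (green grapes ; aus)</a>'
)


def extract_links(markup, prefix='www.example.com'):
    """Return 'prefix+href | title | text' strings for matching anchors."""
    links = []
    for m in ANCHOR_RE.finditer(markup):
        href = html.unescape(m.group('href'))  # turn &amp; back into &
        title = html.unescape(m.group('title')).strip()
        text = html.unescape(m.group('text')).strip()
        links.append('{}{} | {} | {}'.format(prefix, href, title, text))
    return links


for line in extract_links(SAMPLE):
    print(line)
```

The accordion link is skipped only because its `href` is not the first attribute; any change in attribute order or quoting on the real page would silently drop matches.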
