python - How to extract all the hrefs and titles from several <a href="" title=""> tags?
Given this file:
<a data-parent="#accordion1" data-toggle="collapse" href="# fruitname1" title="click expand drug name"> <span class="list-unstyled" style="text-decoration: none;"></span> glipizide </a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114223" title="click view lemons (lemons) | poq #114223 | box;67 pz | presentation | fruit company 1 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114226" title="click view lemons (lemons) | poq #114226 | box;67 pz | presentation | fruit company 2 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114305" title="click view lemons (lemons) | poq #114305 | box;67 pz | presentation | fruit company 3 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114370" title="click view lemons (lemons) | poq #114370 | box;67 pz | discontinued | fruit company 1 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114378" title="click view lemons (lemons) | poq #114378 | box;67 pz | discontinued | fruit company 4 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114387" title="click view lemons (lemons) | poq #114387 | box;67 pz | discontinued | fruit company 5 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114438" title="click view lemons (lemons) | poq #114438 | box;67 pz | presentation | fruit company 2 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114497" title="click view lemons (lemons) | poq #114497 | box;67 pz | presentation | fruit company 5 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114542" title="click view lemons (lemons) | poq #114542 | box;67 pz | discontinued | fruit company 3 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114550" title="click view lemons (lemons) | poq #114550 | </a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=117270" title="click view grapes (green grapes ; aus) | poq #117270 | box;67 pz | presentation | fruit company 10 "> grapes (green grapes ; aus)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=117511" title="click view grapes (green grapes ; aus) | poq #117511 | box;67 pz | presentation | fruit company 11 "> grapes (green grapes ; aus)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=117620" title="click view grapes (green grapes ; aus) | poq #117620 | box;67 pz | presentation | fruit company 12 ">
Using regex or Beautiful Soup, how can I extract each <a href="" title="">, adding www.example.com before the href values, into:
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&applno=114223 | title= | click view lemons (lemons) | poq #114223 | box;67 pz | presentation | fruit company 1 | lemons (lemons)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&applno=114226 | title= | click view lemons (lemons) | poq #114226 | box;67 pz | presentation | fruit company 2 | lemons (lemons)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&applno=114305 | title= | click view lemons (lemons) | poq #114305 | box;67 pz | presentation | fruit company 3 | lemons (lemons)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&applno=114370 | title= | click view lemons (lemons) | poq #114370 | box;67 pz | discontinued | fruit company 1 | lemons (lemons)
I tried:
for a in soup.tbody.findAll('a', href=True):
    r = re.compile('(?<=href=").*?(?=")')
    r.findall(str(a))
and:
for a in soup.tbody.findAll('a', href=True):
    print(a.find('a')['href'])
    print(a.find('a')['title'])
However, I do not know how to rearrange the titles and hrefs.
Update
Based on odradek's answer, I tried this:
soup = BeautifulSoup(open('file.htm'), 'lxml')
for a in soup.tbody.findAll('a', href=True):
    html = a
    prefix = 'www.example.com'
    template = '{prefix}{url} | {title}'.format
    links = [template(prefix=prefix, url=e['href'], title=e['title']) for e in html.find_all('a', href=True)]
    print(links)
However, I got:
[] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []
You can use BeautifulSoup's parsing methods instead of a complicated regexp, like this:
# The URL you want to add at the beginning
prefix = 'www.example.com'
# The template of the desired output
template = '{prefix}{url} | {title}'.format
# The resulting list; please note that the "html" variable
# is the parsed source code you were given.
links = [template(prefix=prefix, url=e.get('href'), title=e.get('title'))
         for e in html.find_all('a', href=True)]
When I ran it against two of the a tags from your list:
$ python get_all_a.py
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&applno=117511 | click view grapes (green grapes ; aus) | poq #117511 | box;67 pz | presentation | fruit company 11
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&applno=117620 | click view grapes (green grapes ; aus) | poq #117620 | box;67 pz | presentation | fruit company 12
Based on your update, you shouldn't put that piece of code inside a loop, but rather:
from bs4 import BeautifulSoup

html = BeautifulSoup(open('file.htm'), 'html.parser')
prefix = 'www.example.com'
template = '{prefix}{url} | {title}'.format
# Inside the list comprehension the loop is implied
links = [template(prefix=prefix, url=e.get('href'), title=e.get('title'))
         for e in html.find_all('a', href=True)]
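If you also want the link text and the literal "title=" column from the desired output in the question, something along these lines might work. This is only a sketch under a few assumptions: the names extract_links, SAMPLE and PREFIX are made up for illustration, the markup is a three-anchor excerpt of the file in the question, and skipping hrefs that do not start with "/" is just one guess at how to leave out the accordion link.

```python
from bs4 import BeautifulSoup

# Excerpt of the file from the question: the accordion link plus two
# product links. SAMPLE, PREFIX and extract_links are illustrative names.
SAMPLE = '''
<a data-parent="#accordion1" data-toggle="collapse" href="# fruitname1" title="click expand drug name"> glipizide </a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114223" title="click view lemons (lemons) | poq #114223 | box;67 pz | presentation | fruit company 1 "> lemons (lemons)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&applno=114226" title="click view lemons (lemons) | poq #114226 | box;67 pz | presentation | fruit company 2 "> lemons (lemons)</a>
'''

PREFIX = 'www.example.com'


def extract_links(markup, prefix=PREFIX):
    """Return one 'url | title= | title | text' string per product link."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    # Only anchors that carry both attributes are considered.
    for a in soup.find_all('a', href=True, title=True):
        href = a['href']
        # Assumption: product links are site-relative paths, so anything
        # else (such as the "# fruitname1" accordion link) is skipped.
        if not href.startswith('/'):
            continue
        title = a['title'].strip()      # drop the trailing space in the title
        text = a.get_text(strip=True)   # visible link text
        rows.append('{0}{1} | title= | {2} | {3}'.format(prefix, href, title, text))
    return rows


if __name__ == '__main__':
    for row in extract_links(SAMPLE):
        print(row)
```

To run it on the real file, pass the file contents instead of SAMPLE, for example extract_links(open('file.htm').read()).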