python - Encoding Emojis with Beautiful Soup


I'm looking for help. I'm working on a project that scrapes specific Craigslist posts using Beautiful Soup in Python. I can display emojis found in the post title, but I have been unsuccessful with the post body. I've tried different variations, but nothing has worked so far. Any help would be appreciated.

Code:

    f = open("clcondensed.txt", "w")
    html2 = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
    soup = BeautifulSoup(html2.content, "html.parser")

    # Post title
    title = soup.find(id="titletextonly")
    title1 = soup.title.string.encode("ascii", "xmlcharrefreplace")
    f.write(title1)

    # Post body
    body = soup.find(id="postingbody")
    body = str(body)
    body = body.encode("ascii", "xmlcharrefreplace")
    f.write(body)

Error received for the body:

'ascii' codec can't decode byte 0xef in position 273: ordinal not in range(128) 

You should use unicode:

body = unicode(body) 

Please refer to the Beautiful Soup documentation on NavigableString.
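
For illustration, here is a minimal sketch of what appears to go wrong in the question's code. It is written for Python 3, while the code in this thread is Python 2, so it approximates the failure rather than reproducing the exact call. In Python 2, calling .encode() on a byte string first decodes it with the ascii codec, and that implicit step is what raises the error. The sample character and the helper name `try_ascii_decode` are illustrative assumptions, not taken from the post.

```python
# Python 3 sketch of the implicit ASCII decode that Python 2 performs
# when .encode() is called on a byte string.

# U+FE0F (a variation selector that often follows emoji) is used as a sample;
# its UTF-8 encoding starts with the byte 0xef.
raw = "\ufe0f".encode("utf-8")


def try_ascii_decode(data):
    """Hypothetical helper: return the decoded text, or the error message on failure."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


# Decoding UTF-8 bytes as ASCII fails, like the error in the question.
message = try_ascii_decode(raw)
print(message)

# Decoding with the real encoding first lets xmlcharrefreplace do its job.
ncr = raw.decode("utf-8").encode("ascii", "xmlcharrefreplace")
print(ncr)
```

The point of the sketch is only that the bytes must become Unicode text before they are encoded again, which is what unicode(body) does in Python 2.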


Update:

Sorry for the quick answer above. It's not right.

Here you should use the lxml parser instead of html.parser, because html.parser does not handle NCR (numeric character reference) emoji well.

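In my test, when the NCR emoji has a decimal value greater than 65535, such as the emoji 🚢 (&#128674;) in your HTML, html.parser decodes it to the wrong Unicode character, \ufffd, instead of u"\U0001F6A2". I cannot find an accurate Beautiful Soup reference for this, but the lxml parser handles it correctly.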
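
As a quick check of the numbers involved, this small Python 3 sketch uses only the standard library (not Beautiful Soup) to show that the reference &#128674; is code point U+1F6A2, which is above 65535 and so needs two UTF-16 code units. That this is why a parser might mishandle it is a plausible guess, not something confirmed here.

```python
import html

# The numeric character reference for the ship emoji.
ncr = "&#128674;"

# html.unescape resolves the reference to the actual character.
ship = html.unescape(ncr)
code_point = ord(ship)

# Characters above 0xFFFF take two 16-bit units in UTF-16 (a surrogate pair).
utf16_units = len(ship.encode("utf-16-le")) // 2

# xmlcharrefreplace turns the character back into the same reference.
round_trip = ship.encode("ascii", "xmlcharrefreplace")

print(hex(code_point), utf16_units, round_trip)
```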

Below is the tested code:

    import requests
    from bs4 import BeautifulSoup

    f = open("clcondensed.txt", "w")
    html = requests.get("https://raleigh.craigslist.org/wan/6078682335.html")
    # lxml decodes numeric character references above 65535 correctly
    soup = BeautifulSoup(html.content, "lxml")

    # Post title
    title = soup.find(id="titletextonly")
    title = unicode(title)
    f.write(title.encode('utf-8'))

    # Post body
    body = soup.find(id="postingbody")
    body = unicode(body)
    f.write(body.encode('utf-8'))

    f.close()
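
A related sketch: instead of encoding each string by hand before f.write, the file can be opened with an explicit encoding. The version below assumes Python 3 (io.open accepts the same arguments in Python 2), writes to a temporary directory rather than the working directory, and uses a hard-coded sample string in place of the scraped post. Those are all illustrative choices.

```python
import io
import os
import tempfile

# Stand-in for the scraped post body (an assumption, not real Craigslist data).
sample_body = u"Boat for sale \U0001F6A2"

# Write into a temporary directory so the sketch has no side effects elsewhere.
path = os.path.join(tempfile.mkdtemp(), "clcondensed.txt")

# With an explicit encoding the file object accepts Unicode text directly.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(sample_body)

# Reading back with the same encoding returns the original text, emoji included.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == sample_body)
```

Using a with block also closes the file even if a write fails, which the explicit f.close() does not guarantee.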

You can refer to the lxml documentation on entity handling for more details.

If you have not installed lxml, refer to the lxml installation instructions.

Hope this helps.

