Python - Beautiful Soup: Data Values Not Matching Headings
I'm new to Python, and I'm working on a learning project in which I'm attempting to scrape data on college football players. The source code of the website looks like this:
    </thead>
    <tbody>
    <tr ><th scope="row" class="right " data-stat="year_id" ><a href="/cfb/years/1957.html">1957</a></th>
    <td class="left " data-stat="school_name" csk="san jose state.1957" ><a href="/cfb/schools/san-jose-state/1957.html">san jose state</a></td>
    <td class="left " data-stat="conf_abbr" ><a href="/cfb/conferences/independent/1957.html">ind</a></td>
    <td class="center " data-stat="class" ></td>
    <td class="center " data-stat="pos" >rb</td>
    <td class="right " data-stat="g" >10</td>
    <td class="right " data-stat="rec" >1</td>
    <td class="right " data-stat="rec_yds" >6</td>
    <td class="right " data-stat="rec_yds_per_rec" >6.0</td>
    <td class="right " data-stat="rec_td" >0</td>
    <td class="right " data-stat="rush_att" >1</td>
    <td class="right " data-stat="rush_yds" >3</td>
    <td class="right " data-stat="rush_yds_per_att" >3.0</td>
    <td class="right " data-stat="rush_td" >0</td>
    <td class="right " data-stat="scrim_att" >2</td>
    <td class="right " data-stat="scrim_yds" >9</td>
    <td class="right " data-stat="scrim_yds_per_att" >4.5</td>
    <td class="right " data-stat="scrim_td" >0</td></tr>

Here is how far I've gotten with my code:
    headers = [item["data-stat"] for item in soup.find_all(attrs={"data-stat": True})]
    cellstrings = [cell.find(text=True) for cell in soup.find_all('td')]
    print headers, cellstrings

This prints out the following:
    [u'', u'header_receiving', u'header_rushing', u'header_scrimmage', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td']
    [u'san jose state', u'ind', None, u'rb', u'10', u'1', u'6', u'6.0', u'0', u'1', u'3', u'3.0', u'0', u'2', u'9', u'4.5', u'0', u'san jose state', None, None, None, None, u'1', u'6', u'6.0', u'0', u'1', u'3', u'3.0', u'0', u'2', u'9', u'4.5', u'0']

The problem is that some of the headings appear earlier in the source code, so the two lists, data and headings, do not match. The headings list also picks up every element with a data-stat attribute, including the th cells, while the data list only covers td cells.
My question is: how can I pull each 'data-stat' along with its associated value, instead of pulling them separately? Ideally, I would pull them into a dictionary.
If I'm understanding you correctly, you want a dictionary of the form {'data-stat-value': 'value of td'}. You can do this:

    data_stats = {e['data-stat']: e.get_text().strip() for e in html.find_all(attrs={'data-stat': True})}

This way you will surely pull the text associated with each data-stat tag.
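One caveat with a single dictionary: the headings in the question's output repeat for every row, so later rows would overwrite earlier ones under the same key. A minimal sketch of a per-row variant follows, assuming the table layout shown in the question. The sample HTML is a trimmed-down, hypothetical stand-in (the 1958 row and the header labels are made up for illustration), and rows_as_dicts is just an illustrative helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down table modeled on the markup in the question.
# The 1958 row and the header labels are invented for illustration only.
SAMPLE_HTML = """
<table>
<thead>
<tr><th data-stat="year_id">Year</th><th data-stat="school_name">School</th>
<th data-stat="class">Class</th><th data-stat="pos">Pos</th><th data-stat="g">G</th></tr>
</thead>
<tbody>
<tr><th scope="row" data-stat="year_id"><a href="/cfb/years/1957.html">1957</a></th>
<td data-stat="school_name"><a href="/cfb/schools/san-jose-state/1957.html">san jose state</a></td>
<td data-stat="class"></td><td data-stat="pos">rb</td><td data-stat="g">10</td></tr>
<tr><th scope="row" data-stat="year_id"><a href="/cfb/years/1958.html">1958</a></th>
<td data-stat="school_name"><a href="/cfb/schools/san-jose-state/1958.html">san jose state</a></td>
<td data-stat="class"></td><td data-stat="pos">rb</td><td data-stat="g">9</td></tr>
</tbody>
</table>
"""


def rows_as_dicts(html):
    """Return one {data-stat: text} dict per row found inside <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tbody in soup.find_all("tbody"):
        for tr in tbody.find_all("tr"):
            # Take both <th> and <td> cells so year_id is kept with its row.
            cells = tr.find_all(["th", "td"], attrs={"data-stat": True})
            # Empty cells (such as "class" here) become empty strings.
            rows.append({c["data-stat"]: c.get_text(strip=True) for c in cells})
    return rows


rows = rows_as_dicts(SAMPLE_HTML)
for row in rows:
    print(row)
```

Restricting the search to tbody skips the header rows (header_receiving and so on), and building the dictionary row by row keeps each value paired with its own heading even when a cell is empty.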