python - Pandas dataframe: how to summarize columns containing value -
here dataframe:
df= pd.dataframe( {"mat" : ['a' ,'a', 'a', 'a', 'b'], "ppl" : ['p', 'p', 'p', '', 'p'], "ia1" : ['', 'x', 'x', '', 'x'], "ia2" : ['x', '', '', 'x', 'x']}, index = [1, 2, 3, 4, 5])
i want select unique values on 2 first columns. do:
df2 = df.loc[:,['mat','ppl']].drop_duplicates(subset=['mat','ppl']).sort_values(by=['mat','ppl'])
i get, expected:
mat ppl 4 1 p 5 b p
what want is, df3 be:
mat ppl ia1 ia2 x p x x b p x x
that is: in df3
row a+p, in column ia1, got x because there x in column ia1 in 1 of row of df
, a+p
solutions aggregate
, unique
, if multiple unique values joined ,
:
df = df.groupby(['mat','ppl']).agg(lambda x: ','.join(x[x != ''].unique())).reset_index() print (df) mat ppl ia1 ia2 0 x 1 p x x 2 b p x x
explanation:
aggregation working series
, aggregation function, output scalar. use custom function first filter out empty spaces boolean indexing (x[x != '']
, unique values. scalar output used join
- works if empty series (all values empty strings) , second advantage if multiple unique values 1 joined value ,
.
for testing possible use custom function same lambda function:
def f(x): = ''.join(x[x != ''].unique().tolist()) return df = df.groupby(['mat','ppl']).agg(f).reset_index() print (df) mat ppl ia1 ia2 0 x 1 p x x 2 b p x x
as comment of op mentioned:
instead of using lambda x: ','.join(x[x != ''].unique()), used lambda x: ','.join(set(x)-set([''])). went 13min 5s 43.2 s
Comments
Post a Comment