python - Pandas dataframe: how to summarize columns containing value -
here dataframe:
df= pd.dataframe( {"mat" : ['a' ,'a', 'a', 'a', 'b'], "ppl" : ['p', 'p', 'p', '', 'p'], "ia1" : ['', 'x', 'x', '', 'x'], "ia2" : ['x', '', '', 'x', 'x']}, index = [1, 2, 3, 4, 5]) i want select unique values on 2 first columns. do:
df2 = df.loc[:,['mat','ppl']].drop_duplicates(subset=['mat','ppl']).sort_values(by=['mat','ppl']) i get, expected:
mat ppl 4 1 p 5 b p what want is, df3 be:
mat ppl ia1 ia2 x p x x b p x x that is: in df3 row a+p, in column ia1, got x because there x in column ia1 in 1 of row of df, a+p
solutions aggregate , unique, if multiple unique values joined ,:
df = df.groupby(['mat','ppl']).agg(lambda x: ','.join(x[x != ''].unique())).reset_index() print (df) mat ppl ia1 ia2 0 x 1 p x x 2 b p x x explanation:
aggregation working series , aggregation function, output scalar. use custom function first filter out empty spaces boolean indexing (x[x != ''], unique values. scalar output used join - works if empty series (all values empty strings) , second advantage if multiple unique values 1 joined value ,.
for testing possible use custom function same lambda function:
def f(x): = ''.join(x[x != ''].unique().tolist()) return df = df.groupby(['mat','ppl']).agg(f).reset_index() print (df) mat ppl ia1 ia2 0 x 1 p x x 2 b p x x as comment of op mentioned:
instead of using lambda x: ','.join(x[x != ''].unique()), used lambda x: ','.join(set(x)-set([''])). went 13min 5s 43.2 s
Comments
Post a Comment