python - Pandas dataframe: how to summarize columns containing value -


here dataframe:

df= pd.dataframe( {"mat" : ['a' ,'a', 'a', 'a', 'b'],  "ppl" : ['p', 'p', 'p', '',  'p'],  "ia1" : ['',  'x', 'x', '',  'x'],  "ia2" : ['x', '',  '',  'x', 'x']}, index = [1, 2, 3, 4, 5]) 

i want select unique values on 2 first columns. do:

df2 = df.loc[:,['mat','ppl']].drop_duplicates(subset=['mat','ppl']).sort_values(by=['mat','ppl']) 

i get, expected:

  mat ppl 4       1     p 5   b   p 

what want is, df3 be:

 mat ppl ia1 ia2              x      p   x   x    b   p   x   x 

that is: in df3 row a+p, in column ia1, got x because there x in column ia1 in 1 of row of df, a+p

solutions aggregate , unique, if multiple unique values joined ,:

df = df.groupby(['mat','ppl']).agg(lambda x: ','.join(x[x != ''].unique())).reset_index() print (df)   mat ppl ia1 ia2 0             x 1     p   x   x 2   b   p   x   x 

explanation:

aggregation working series , aggregation function, output scalar. use custom function first filter out empty spaces boolean indexing (x[x != ''], unique values. scalar output used join - works if empty series (all values empty strings) , second advantage if multiple unique values 1 joined value ,.

for testing possible use custom function same lambda function:

def f(x):     = ''.join(x[x != ''].unique().tolist())     return  df = df.groupby(['mat','ppl']).agg(f).reset_index() print (df)   mat ppl ia1 ia2 0             x 1     p   x   x 2   b   p   x   x 

as comment of op mentioned:

instead of using lambda x: ','.join(x[x != ''].unique()), used lambda x: ','.join(set(x)-set([''])). went 13min 5s 43.2 s


Comments

Popular posts from this blog

Command prompt result in label. Python 2.7 -

javascript - How do I use URL parameters to change link href on page? -

amazon web services - AWS Route53 Trying To Get Site To Resolve To www -