python - Pandas dataframe: how to summarize columns containing value

April 15, 2015

Here is my dataframe:

    import pandas as pd

    df = pd.DataFrame({"mat": ['a', 'a', 'a', 'a', 'b'],
                       "ppl": ['p', 'p', 'p', '',  'p'],
                       "ia1": ['',  'x', 'x', '',  'x'],
                       "ia2": ['x', '',  '',  'x', 'x']},
                      index=[1, 2, 3, 4, 5])

I want to select the unique values of the first 2 columns. I do:

df2 = df.loc[:,['mat','ppl']].drop_duplicates(subset=['mat','ppl']).sort_values(by=['mat','ppl'])

I get, as expected:

      mat ppl
    4   a
    1   a   p
    5   b   p

What I want is for df3 to be:

    mat ppl ia1 ia2
      a           x
      a   p   x   x
      b   p   x   x

That is: in df3, the row a+p gets an x in column ia1 because at least one of the rows of df with a+p has an x in column ia1.

The solution is to aggregate and take the unique values; if a group has multiple unique values, they are joined with ',':

    df = df.groupby(['mat','ppl']).agg(lambda x: ','.join(x[x != ''].unique())).reset_index()
    print (df)
      mat ppl ia1 ia2
    0   a           x
    1   a   p   x   x
    2   b   p   x   x

Explanation:

Aggregation works on each Series of a group, and the aggregation function has to output a scalar. So use a custom function that first filters out the empty strings with boolean indexing (x[x != '']) and then takes the unique values. The scalar output comes from join, which also works if the filtered Series is empty (all values were empty strings). A second advantage is that multiple unique values are joined into 1 value with ','.
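To see what the aggregation function does to a single column of a single group, here is a minimal sketch that applies the same expression to small hand-made Series. The helper name join_unique_nonempty and the sample values are only for illustration; they are not part of the answer above.

```python
import pandas as pd


def join_unique_nonempty(values: pd.Series) -> str:
    """Join the distinct non-empty strings of a Series with commas."""
    # Boolean indexing drops the empty strings, unique() drops repeats,
    # and join() turns whatever is left into one scalar string.
    return ','.join(values[values != ''].unique())


# One value repeated: the duplicates collapse to a single entry.
print(repr(join_unique_nonempty(pd.Series(['', 'x', 'x']))))       # 'x'

# Only empty strings: the filtered Series is empty, so join returns ''.
print(repr(join_unique_nonempty(pd.Series(['', '']))))             # ''

# Several distinct values: joined in order of first appearance.
print(repr(join_unique_nonempty(pd.Series(['x', '', 'y', 'x']))))  # 'x,y'
```

The second call is the case where every value in the group is an empty string: join on an empty sequence still returns a string, so the aggregation always yields a scalar.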

For testing, it is possible to use a custom function that does the same as the lambda function:

    def f(x):
        joined = ','.join(x[x != ''].unique().tolist())
        return joined

    df = df.groupby(['mat','ppl']).agg(f).reset_index()
    print (df)
      mat ppl ia1 ia2
    0   a           x
    1   a   p   x   x
    2   b   p   x   x

As the OP mentioned in a comment:

Instead of using lambda x: ','.join(x[x != ''].unique()), I used lambda x: ','.join(set(x)-set([''])). The run time went from 13min 5s to 43.2 s.
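A minimal sketch of the set-based variant on the sample data from the question, compared with the unique()-based version. The sorted() call is an addition that is not in the quoted comment: a plain set has no guaranteed order, so without it the order of several joined values could vary. The names join_set, by_unique and by_set are only for illustration, and nothing is timed here.

```python
import pandas as pd

df = pd.DataFrame({"mat": ['a', 'a', 'a', 'a', 'b'],
                   "ppl": ['p', 'p', 'p', '',  'p'],
                   "ia1": ['',  'x', 'x', '',  'x'],
                   "ia2": ['x', '',  '',  'x', 'x']},
                  index=[1, 2, 3, 4, 5])


def join_set(values):
    # Set difference removes the empty string and the duplicates in one step.
    # sorted() is an assumption added here to make the output order stable.
    return ','.join(sorted(set(values) - {''}))


by_unique = (df.groupby(['mat', 'ppl'])
               .agg(lambda x: ','.join(x[x != ''].unique()))
               .reset_index())
by_set = df.groupby(['mat', 'ppl']).agg(join_set).reset_index()

print(by_set)
# Both variants give the same rows for this data.
print(by_unique.values.tolist() == by_set.values.tolist())
```

One difference to keep in mind: unique() keeps values in order of first appearance, while the sorted set gives alphabetical order, so the two can differ when a group has several distinct values.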
