python - Removing duplicates in a pandas DataFrame if one column differs, but is in a given list
I have a dataframe with duplicate entries coming from 2 sources. The values should be unique, but 1 column is not formatted the same way in both sources, hence I should remove duplicates that have different names in that 1 column, if those names are within a given list.
Technically, I want to remove a row in a pandas dataframe if there exists another row with the same a and b values, if this row's z value is 'bar' and the other's z is 'foo'.
An example might be clearer:
I have the given dataframe df

    a    b    z
    'a'  'a'  'foo'
    'a'  'a'  'bar'
    'b'  'a'  'bar'
    'c'  'c'  'foo'
    'd'  'd'  'blb'

and would like to get

    a    b    z
    'a'  'a'  'foo'
    'b'  'a'  'bar'
    'c'  'c'  'foo'
    'd'  'd'  'blb'

Note that:
- The rows with values other than 'foo', 'bar' in the z column should not be touched.
- It's not important whether 'foo', 'bar' stay the same, because they will be changed to the same value afterwards.
- It would be great to generalize the duo 'foo', 'bar' to a list.
Attempts so far: here is my best guess, which doesn't work though. I don't understand what groupby returns. I'm sure there is a magical pandas one-liner that I can't find.

    new_df = []
    for row in df.groupby('a'):
        if row.loc['z'].isin('foo'):
            if not row['z'].isin('bar'):
                new_df.append(row)

Thanks!
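On the groupby part of the question: iterating over `df.groupby('a')` yields `(key, sub-DataFrame)` pairs rather than single rows, which is one reason the attempt above fails (another is that `isin` expects a list-like such as `['foo']`, not a bare string). A minimal sketch, rebuilding the example data from the question; the name `groups` is only for illustration:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    'a': ['a', 'a', 'b', 'c', 'd'],
    'b': ['a', 'a', 'a', 'c', 'd'],
    'z': ['foo', 'bar', 'bar', 'foo', 'blb'],
})

# Each iteration gives the group key and the sub-DataFrame holding that group's rows.
groups = {key: group for key, group in df.groupby('a')}

for key, group in groups.items():
    print(key, group['z'].tolist())
```

Each `group` is itself a DataFrame, so column access such as `group['z']` returns a Series for that group.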
I think you can get the expected result by concatenating 2 subsets of the original dataframe:
- one where the z values are neither foo nor bar
- and the other one (z is foo or bar) with the duplicates according to a, b dropped
Here's an example that gives me the expected output:

    from io import StringIO
    import pandas as pd

    data = """a b z
    a a foo
    a a bar
    b a bar
    c c foo
    d d blb"""
    df = pd.read_csv(StringIO(data), sep=r'\s+')

    ls = ['foo', 'bar']
    df1 = pd.concat((
        df.loc[~df.z.isin(ls)],                                    # no foos or bars here
        df.loc[df.z.isin(ls)].drop_duplicates(subset=['a', 'b']),  # foos and bars, deduplicated on a, b
    )).sort_index()

A simpler option might be to replace foo with bar everywhere in z, then drop duplicates:

    df1 = df.replace({'z': {'foo': 'bar'}}).drop_duplicates()

You could also replace both foo and bar with whatever other value you're going to use:

    df1 = df.replace({'z': {'foo': 'xyz', 'bar': 'xyz'}}).drop_duplicates()
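To generalize the foo/bar pair to an arbitrary list, the same idea can be wrapped in a small helper. This is only a sketch: the function name `drop_equivalent_duplicates` and its parameters are made up for illustration, and it assumes the dataframe has a unique index.

```python
import pandas as pd

def drop_equivalent_duplicates(df, equivalent, key_cols, col='z'):
    """Drop rows repeating key_cols, but only among rows whose `col` is in `equivalent`."""
    in_list = df[col].isin(equivalent)
    # duplicated() runs on the listed rows only, so rows with other values are never dropped.
    dup = df[in_list].duplicated(subset=key_cols)
    # Assumes a unique index: the labels of duplicated rows are removed from df.
    return df.drop(index=dup[dup].index)

# Example data from the question.
df = pd.DataFrame({
    'a': ['a', 'a', 'b', 'c', 'd'],
    'b': ['a', 'a', 'a', 'c', 'd'],
    'z': ['foo', 'bar', 'bar', 'foo', 'blb'],
})

result = drop_equivalent_duplicates(df, ['foo', 'bar'], ['a', 'b'])
print(result)
```

Unlike the replace approach, this leaves the z values untouched and keeps the first occurrence of each a, b pair among the listed values.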