python - Removing duplicates in a pandas DataFrame if one column differs, but is in a given list
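I have a dataframe with duplicate entries coming from two sources. The values should be unique, but one column is not formatted the same way in both sources, so I need to remove rows that are duplicates apart from having different names in that one column, provided the names are within a given list.
Technically: remove a row from a pandas dataframe if there exists another row with the same a, b values, where this row's z value is 'bar' and the other row's z value is 'foo'.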
An example might be clearer. I have the given dataframe df:

a    b    z
'a'  'a'  'foo'
'a'  'a'  'bar'
'b'  'a'  'bar'
'c'  'c'  'foo'
'd'  'd'  'blb'

and I would like to get:

a    b    z
'a'  'a'  'foo'
'b'  'a'  'bar'
'c'  'c'  'foo'
'd'  'd'  'blb'
Note that:
- The rows with values other than 'foo', 'bar' in the z column should not be touched.
- It's not important whether 'foo', 'bar' stay the same, because they will be changed to the same value afterwards.
- It would be great to generalize the duo 'foo', 'bar' to a list.
My attempts so far: here is my best guess, which doesn't work though… I don't understand what groupby returns. I'm sure there is a magical pandas one-liner that I can't find.

new_df = []
for row in df.groupby('a'):
    if row.loc['z'].isin('foo'):
        if not row['z'].isin('bar'):
            new_df.append(row)

Thanks!
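On the groupby question: a minimal sketch of what iterating over a groupby yields, using the example dataframe from the question. Each iteration produces a (key, sub-DataFrame) pair rather than a single row, and isin expects a list-like and returns one boolean per row, so the result has to be reduced (for example with .any()) before it can be used in an if. The names group_sizes and has_foo are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", "b", "c", "d"],
    "b": ["a", "a", "a", "c", "d"],
    "z": ["foo", "bar", "bar", "foo", "blb"],
})

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs, not rows.
for key, group in df.groupby("a"):
    print(repr(key), list(group["z"]))

# isin() takes a list-like and returns a boolean Series (one value per row),
# so it is reduced with .any() to get a single True/False per group.
group_sizes = {key: len(group) for key, group in df.groupby("a")}
has_foo = {
    key: bool(group["z"].isin(["foo"]).any())
    for key, group in df.groupby("a")
}
```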
I think you can get the expected result by concatenating two subsets of the original dataframe:
- one where the z values are neither foo nor bar
- and the other one, where z is foo or bar, with the duplicates according to a, b dropped
Here's an example that gives me the expected output:
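
import pandas as pd
from io import StringIO

data = """
a b z
a a foo
a a bar
b a bar
c c foo
d d blb"""
df = pd.read_csv(StringIO(data), sep=r'\s+')
ls = ['foo', 'bar']
df1 = pd.concat((
    df.loc[~df.z.isin(ls)],  # no foos or bars here
    df.loc[df.z.isin(ls)].drop_duplicates(subset=['a', 'b']),  # foos and bars, deduplicated on a, b
)).sort_index()  # restore the original row order
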
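The same idea can be sketched as a reusable function that drops the repeated rows by index instead of concatenating and re-sorting. The function name drop_equivalent_duplicates and its parameters are hypothetical, and the sketch assumes the dataframe has a unique index.

```python
import pandas as pd


def drop_equivalent_duplicates(df, keys, column, equivalent):
    """Drop rows that repeat an earlier row on `keys`, looking only at rows
    whose `column` value is in `equivalent`. Other rows are left untouched.

    Hypothetical helper; assumes `df` has a unique index.
    """
    in_list = df[column].isin(equivalent)
    # duplicated() marks every repeat after the first occurrence as True.
    repeats = df.loc[in_list].duplicated(subset=keys)
    return df.drop(index=repeats[repeats].index)


df = pd.DataFrame({
    "a": ["a", "a", "b", "c", "d"],
    "b": ["a", "a", "a", "c", "d"],
    "z": ["foo", "bar", "bar", "foo", "blb"],
})
result = drop_equivalent_duplicates(
    df, keys=["a", "b"], column="z", equivalent=["foo", "bar"]
)
print(result)
```

Because the rows are removed in place by index label, the original row order is kept without a sort_index call.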
A simpler option might be to replace foo with bar everywhere in z, and then drop duplicates:
df1 = df.replace({'z':{'foo':'bar'}}).drop_duplicates()
You could also replace both foo and bar with some other value that you're going to use:
df1 = df.replace({'z':{'foo':'xyz', 'bar':'xyz'}}).drop_duplicates()
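To generalize the replacement to a list of names, one possible sketch masks every listed value in z with a single placeholder before dropping duplicates. The placeholder 'xyz' is an arbitrary assumption, as in the line above. Note that drop_duplicates() without a subset also removes fully identical rows whose z value is outside the list, which differs slightly from the concat approach.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", "b", "c", "d"],
    "b": ["a", "a", "a", "c", "d"],
    "z": ["foo", "bar", "bar", "foo", "blb"],
})
ls = ["foo", "bar"]
placeholder = "xyz"  # arbitrary stand-in value (assumption)

# Series.mask replaces the values where the condition is True.
normalized = df.assign(z=df["z"].mask(df["z"].isin(ls), placeholder))
df1 = normalized.drop_duplicates()
print(df1)
```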