python - Filtering/Grouping a list of dictionaries by a list of keys -


i need define function, group_dictionaries, take list of dictionaries , return list of dictionaries contain same values each key of list of keys. "lonely" dictionaries deleted.

here example:

my_list=[     {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z},     {'id':'id3', 'key2 :value_u, 'key3': value_v},     {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t},     {'id':'id4', 'key1':value_w, 'key2':value_s, 'key3':value_v} ]  group_dictionary(my_list, list_of_keys=['key1', 'key3'])  #result: dictionaries have key1 , key3 in common are: [     {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z, 'group':0},     {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t, 'group':0} ]  group_dictionary(my_list, list_of_keys=['key3'])  #result dictionaries have key3 in common divided in 2 groups #of different values: group 0 has value_z , group1 has value_v  [     {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z, 'group':0},     {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t, 'group':0},     {'id':'id3', 'key2 :value_u, 'key3': value_v, 'group':1},     {'id':'id4', 'key1':value_w, 'key2':value_s, 'key3':value_v, 'group':1} ] 

as can see:

  1. the function creates key labeled 'group' integer starting 0. key assigned each "group" of dictionaries (by group mean dictionaries keys corresponding list of keys match each key)
  2. the function deletes dictionaries don't have "groups".
  3. the existing data set on working contains unique id each dictionary. may useful create function.
  4. non existing keys prevent dictionary being candidate.

i concerned run-time; actual list contains 80,000 dictionaries of 35 keys each in average. complexity of algorithm n² (80,000²). optimization in code welcome.

i believe work, it's written in python3, haven't optimized it, starting point if it's not fast enough.

list_of_dicts = [ {'id':'id1', 'key1':'value_x', 'key2': 'value_y', 'key3':'value_z'}, {'id':'id3', 'key2' :'value_u', 'key3': 'value_v'}, {'id':'id2', 'key1':'value_x', 'key3':'value_z', 'key4': 'value_t'}, {'id':'id4', 'key1':'value_w', 'key2':'value_s', 'key3':'value_v'} ]  # since can't have objects keys, make values we're looking string, , use key. def make_value_key(d, list_of_keys):     res = ""     k in list_of_keys:         res += str(d[k])      return res  def group_dictionary(list_of_dicts, list_of_keys):     group_vals = {}     current_max_group = 0     dicts_to_remove = []     i,d in enumerate(list_of_dicts):         # if dict doesn't have keys mark removal.         if not all(k in d k in list_of_keys):             dicts_to_remove.append(i)         else:             value_key = make_value_key(d, list_of_keys)             # if value key exists assign group otherwise make new group.             if value_key in group_vals:                 d['group'] = group_vals[value_key]             else:                 group_vals[value_key] = current_max_group                 d['group'] = current_max_group                 current_max_group += 1      list_of_dicts = [i j, in enumerate(list_of_dicts) if j not in dicts_to_remove]     return list_of_dicts  list_of_keys=['key1','key3']  print(group_dictionary(list_of_dicts, list_of_keys)) print() list_of_keys=['key3']  print(group_dictionary(list_of_dicts, list_of_keys)) 

output:

[{'key3': 'value_z', 'key1': 'value_x', 'group': 0, 'key2': 'value_y', 'id': 'id1'},  {'key3': 'value_z', 'key1': 'value_x', 'key4': 'value_t', 'group': 0, 'id': 'id2'},  {'key3': 'value_v', 'key1': 'value_w', 'group': 1, 'key2': 'value_s', 'id': 'id4'}]  [{'key3': 'value_z', 'key1': 'value_x', 'group': 0, 'key2': 'value_y', 'id': 'id1'},  {'group': 1, 'key3': 'value_v', 'key2': 'value_u', 'id': 'id3'},  {'key3': 'value_z', 'key1': 'value_x', 'key4': 'value_t', 'group': 0, 'id': 'id2'},  {'key3': 'value_v', 'key1': 'value_w', 'group': 1, 'key2': 'value_s', 'id': 'id4'}] 

optimization 1:

instead of iterating keys check if exist, instead can fail while making value-key , return empty string, mark dict deletion:

def make_value_key(d, list_of_keys):     res = ""     k in list_of_keys:         if not k in d:             return ""         res += str(d[k])      return res  def group_dictionary(list_of_dicts, list_of_keys):     group_vals = {}     current_max_group = 0     dicts_to_remove = []     i,d in enumerate(list_of_dicts):         value_key = make_value_key(d, list_of_keys)         if value_key == "":             dicts_to_remove.append(i)             continue         if value_key in group_vals:             d['group'] = group_vals[value_key]          else:             group_vals[value_key] = current_max_group             d['group'] = current_max_group             current_max_group += 1      list_of_dicts = [i j, in enumerate(list_of_dicts) if j not in dicts_to_remove]     return list_of_dicts 

groups have bigger 1:

this uses second dict keep track of group sizes, , checks whether groups smaller 2 mark them removal.

def make_value_key(d, list_of_keys):     res = ""     k in list_of_keys:         if not k in d:             return ""         res += str(d[k])      return res  def group_dictionary(list_of_dicts, list_of_keys):     group_vals = {}     group_count = {}     current_max_group = 0     indices_to_remove = []     i,d in enumerate(list_of_dicts):         value_key = make_value_key(d, list_of_keys)         if value_key == "":             indices_to_remove.append(i)             continue         if value_key in group_vals:             d['group'] = group_vals[value_key]             # second group member seen, remove count dict.              group_count.pop(d['group'], none)         else:             group_vals[value_key] = current_max_group             d['group'] = current_max_group             # first time seen, add count dict.             group_count[current_max_group] =             current_max_group += 1      indices_to_remove.extend(group_count.values())     return [i j, in enumerate(list_of_dicts) if j not in indices_to_remove] 

output:

[{'key2': 'value_y', 'group': 0, 'id': 'id1', 'key1': 'value_x', 'key3': 'value_z'},  {'key4': 'value_t', 'group': 0, 'id': 'id2', 'key1': 'value_x', 'key3': 'value_z'}]  [{'key2': 'value_y', 'group': 0, 'id': 'id1', 'key1': 'value_x', 'key3': 'value_z'}, {'group': 1, 'id': 'id3', 'key2': 'value_u', 'key3': 'value_v'}, {'key4': 'value_t', 'group': 0, 'id': 'id2', 'key1': 'value_x', 'key3': 'value_z'}, {'key2': 'value_s', 'group': 1, 'id': 'id4', 'key1': 'value_w', 'key3': 'value_v'}] 

optimization 2:

you can go o(n^2) (loop through dict list once calculate , once delete) o(n*m log m) (loop through dict list once , loop through sorted removed indices):

def make_value_key(d, list_of_keys):     res = ""     k in list_of_keys:         if not k in d:             return ""         res += str(d[k])      return res  def group_dictionary(list_of_dicts, list_of_keys):     group_vals = {}     group_count = {}     current_max_group = 0     indices_to_remove = []     i,d in enumerate(list_of_dicts):         value_key = make_value_key(d, list_of_keys)         if value_key == "":             indices_to_remove.append(i)             continue         if value_key in group_vals:             d['group'] = group_vals[value_key]             # second group member seen, remove count dict.              group_count.pop(d['group'], none)         else:             group_vals[value_key] = current_max_group             d['group'] = current_max_group             # first time seen, add count dict.             group_count[current_max_group] =             current_max_group += 1      indices_to_remove.extend(group_count.values())     index in sorted(indices_to_remove, reverse=true):         del list_of_dicts[index]      return list_of_dicts 

Comments

Popular posts from this blog

How to understand 2 main() functions after using uftrace to profile the C++ program? -

c# - Update a combobox from a presenter (MVP) -

How to put a lock and transaction on table using spring 4 or above using jdbcTemplate and annotations like @Transactional? -