python - Filtering/Grouping a list of dictionaries by a list of keys -
i need define function, group_dictionaries, take list of dictionaries , return list of dictionaries contain same values each key of list of keys. "lonely" dictionaries deleted.
here example:
my_list=[ {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z}, {'id':'id3', 'key2 :value_u, 'key3': value_v}, {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t}, {'id':'id4', 'key1':value_w, 'key2':value_s, 'key3':value_v} ] group_dictionary(my_list, list_of_keys=['key1', 'key3']) #result: dictionaries have key1 , key3 in common are: [ {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z, 'group':0}, {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t, 'group':0} ] group_dictionary(my_list, list_of_keys=['key3']) #result dictionaries have key3 in common divided in 2 groups #of different values: group 0 has value_z , group1 has value_v [ {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z, 'group':0}, {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t, 'group':0}, {'id':'id3', 'key2 :value_u, 'key3': value_v, 'group':1}, {'id':'id4', 'key1':value_w, 'key2':value_s, 'key3':value_v, 'group':1} ] as can see:
- the function creates key labeled 'group' integer starting 0. key assigned each "group" of dictionaries (by group mean dictionaries keys corresponding list of keys match each key)
- the function deletes dictionaries don't have "groups".
- the existing data set on working contains unique id each dictionary. may useful create function.
- non existing keys prevent dictionary being candidate.
i concerned run-time; actual list contains 80,000 dictionaries of 35 keys each in average. complexity of algorithm n² (80,000²). optimization in code welcome.
i believe work, it's written in python3, haven't optimized it, starting point if it's not fast enough.
list_of_dicts = [ {'id':'id1', 'key1':'value_x', 'key2': 'value_y', 'key3':'value_z'}, {'id':'id3', 'key2' :'value_u', 'key3': 'value_v'}, {'id':'id2', 'key1':'value_x', 'key3':'value_z', 'key4': 'value_t'}, {'id':'id4', 'key1':'value_w', 'key2':'value_s', 'key3':'value_v'} ] # since can't have objects keys, make values we're looking string, , use key. def make_value_key(d, list_of_keys): res = "" k in list_of_keys: res += str(d[k]) return res def group_dictionary(list_of_dicts, list_of_keys): group_vals = {} current_max_group = 0 dicts_to_remove = [] i,d in enumerate(list_of_dicts): # if dict doesn't have keys mark removal. if not all(k in d k in list_of_keys): dicts_to_remove.append(i) else: value_key = make_value_key(d, list_of_keys) # if value key exists assign group otherwise make new group. if value_key in group_vals: d['group'] = group_vals[value_key] else: group_vals[value_key] = current_max_group d['group'] = current_max_group current_max_group += 1 list_of_dicts = [i j, in enumerate(list_of_dicts) if j not in dicts_to_remove] return list_of_dicts list_of_keys=['key1','key3'] print(group_dictionary(list_of_dicts, list_of_keys)) print() list_of_keys=['key3'] print(group_dictionary(list_of_dicts, list_of_keys)) output:
[{'key3': 'value_z', 'key1': 'value_x', 'group': 0, 'key2': 'value_y', 'id': 'id1'}, {'key3': 'value_z', 'key1': 'value_x', 'key4': 'value_t', 'group': 0, 'id': 'id2'}, {'key3': 'value_v', 'key1': 'value_w', 'group': 1, 'key2': 'value_s', 'id': 'id4'}] [{'key3': 'value_z', 'key1': 'value_x', 'group': 0, 'key2': 'value_y', 'id': 'id1'}, {'group': 1, 'key3': 'value_v', 'key2': 'value_u', 'id': 'id3'}, {'key3': 'value_z', 'key1': 'value_x', 'key4': 'value_t', 'group': 0, 'id': 'id2'}, {'key3': 'value_v', 'key1': 'value_w', 'group': 1, 'key2': 'value_s', 'id': 'id4'}] optimization 1:
instead of iterating keys check if exist, instead can fail while making value-key , return empty string, mark dict deletion:
def make_value_key(d, list_of_keys): res = "" k in list_of_keys: if not k in d: return "" res += str(d[k]) return res def group_dictionary(list_of_dicts, list_of_keys): group_vals = {} current_max_group = 0 dicts_to_remove = [] i,d in enumerate(list_of_dicts): value_key = make_value_key(d, list_of_keys) if value_key == "": dicts_to_remove.append(i) continue if value_key in group_vals: d['group'] = group_vals[value_key] else: group_vals[value_key] = current_max_group d['group'] = current_max_group current_max_group += 1 list_of_dicts = [i j, in enumerate(list_of_dicts) if j not in dicts_to_remove] return list_of_dicts groups have bigger 1:
this uses second dict keep track of group sizes, , checks whether groups smaller 2 mark them removal.
def make_value_key(d, list_of_keys): res = "" k in list_of_keys: if not k in d: return "" res += str(d[k]) return res def group_dictionary(list_of_dicts, list_of_keys): group_vals = {} group_count = {} current_max_group = 0 indices_to_remove = [] i,d in enumerate(list_of_dicts): value_key = make_value_key(d, list_of_keys) if value_key == "": indices_to_remove.append(i) continue if value_key in group_vals: d['group'] = group_vals[value_key] # second group member seen, remove count dict. group_count.pop(d['group'], none) else: group_vals[value_key] = current_max_group d['group'] = current_max_group # first time seen, add count dict. group_count[current_max_group] = current_max_group += 1 indices_to_remove.extend(group_count.values()) return [i j, in enumerate(list_of_dicts) if j not in indices_to_remove] output:
[{'key2': 'value_y', 'group': 0, 'id': 'id1', 'key1': 'value_x', 'key3': 'value_z'}, {'key4': 'value_t', 'group': 0, 'id': 'id2', 'key1': 'value_x', 'key3': 'value_z'}] [{'key2': 'value_y', 'group': 0, 'id': 'id1', 'key1': 'value_x', 'key3': 'value_z'}, {'group': 1, 'id': 'id3', 'key2': 'value_u', 'key3': 'value_v'}, {'key4': 'value_t', 'group': 0, 'id': 'id2', 'key1': 'value_x', 'key3': 'value_z'}, {'key2': 'value_s', 'group': 1, 'id': 'id4', 'key1': 'value_w', 'key3': 'value_v'}] optimization 2:
you can go o(n^2) (loop through dict list once calculate , once delete) o(n*m log m) (loop through dict list once , loop through sorted removed indices):
def make_value_key(d, list_of_keys): res = "" k in list_of_keys: if not k in d: return "" res += str(d[k]) return res def group_dictionary(list_of_dicts, list_of_keys): group_vals = {} group_count = {} current_max_group = 0 indices_to_remove = [] i,d in enumerate(list_of_dicts): value_key = make_value_key(d, list_of_keys) if value_key == "": indices_to_remove.append(i) continue if value_key in group_vals: d['group'] = group_vals[value_key] # second group member seen, remove count dict. group_count.pop(d['group'], none) else: group_vals[value_key] = current_max_group d['group'] = current_max_group # first time seen, add count dict. group_count[current_max_group] = current_max_group += 1 indices_to_remove.extend(group_count.values()) index in sorted(indices_to_remove, reverse=true): del list_of_dicts[index] return list_of_dicts
Comments
Post a Comment