R - Merge data files
I have the following data frames in R:
    id   class
    @a   64
    @b   7
    @c   98
and a second data frame:
    source   target
    @d       @b
    @c       @a
This describes the nodes and edges in a social network. The users (all with @ in front) each belong to a specific community, whose number is listed in the column class. To analyse the connections between the columns, I want to merge the data frames and create a new data frame looking like this:
    source   target   source.class   target.class
    @a       @i       56             2
    @f       @k       90             49
When I try merge(), R stops responding and I need to terminate R. The data frames consist of 20000 (node file) and 30000 (edge file) rows.
Then I want to know how many records in a given source class have the same target class, and the percentage of connections between classes.
I would be happy if someone could help me, since I'm new to R.
Edit: I think I managed to create the columns with code using match() instead of merge() (rt_nodes contains the columns "id", "class" and rt_edges contains the columns "source", "target"):
    # match source in rt_edges to id in rt_nodes
    match(rt_edges$source, rt_nodes$id)
    # match target in rt_edges to id in rt_nodes
    match(rt_edges$target, rt_nodes$id)
    # create source_class
    rt_nodes$modularity_class[match(rt_edges$source, rt_nodes$id)]
    rt_edges$source_class = rt_nodes$modularity_class[match(rt_edges$source, rt_nodes$id)]
    # create target_class
    rt_nodes$modularity_class[match(rt_edges$target, rt_nodes$id)]
    rt_edges$target_class = rt_nodes$modularity_class[match(rt_edges$target, rt_nodes$id)]
Now I need to figure out how I can find the percentage of connections within each class and the percentage of connections to other classes. Any tips on how to do that?
Question 1: Merge
This requires two separate join operations: an initial join of rt_edges with rt_nodes on target, and a subsequent join of the intermediate result with rt_nodes on source. In addition, all rows of rt_edges should appear in the result.
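To make the two steps concrete, here is a minimal base R sketch of the same logic using merge() with all.x = TRUE, which keeps every edge even when a node id has no match. It is only an illustration, not the approach this answer goes on to use, and the toy data frames are assumptions that mirror the shape of the OP's tables.

```r
# Toy data (assumed for illustration; same shape as the OP's tables)
rt_nodes <- data.frame(id = c("@a", "@b", "@c", "@d"),
                       class = c(64, 7, 98, 23),
                       stringsAsFactors = FALSE)
rt_edges <- data.frame(source = c("@d", "@c", "@a"),
                       target = c("@b", "@a", "@e"),
                       stringsAsFactors = FALSE)

# Step 1: left join on target; all.x = TRUE keeps every row of rt_edges
step1 <- merge(rt_edges, rt_nodes, by.x = "target", by.y = "id", all.x = TRUE)
names(step1)[names(step1) == "class"] <- "target.class"

# Step 2: left join of the intermediate result on source
merged <- merge(step1, rt_nodes, by.x = "source", by.y = "id", all.x = TRUE)
names(merged)[names(merged) == "class"] <- "source.class"

# Put the columns in the order asked for in the question
merged <- merged[, c("source", "target", "source.class", "target.class")]
print(merged)
```

Note that merge() sorts the result by the join key by default, so the row order differs from rt_edges. The edge @a -> @e is kept, with NA as its target.class.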
The approach below uses data.table. (I've adopted the naming of variables and columns the OP has used in the edited code of the Q, but note that it is inconsistent with the sample data given by the OP.)
Reading data
    library(data.table)
    rt_nodes <- fread(
      "id   class
      @a     64
      @b     7
      @c     98
      @d     23
      @f     59")
    rt_edges <- fread(
      "source  target
      @d        @b
      @c        @a
      @a        @e")
Note that additional rows have been added to the sample data provided by the OP to demonstrate the effect of
- a node (@f) which is not involved in any edge, and
- an edge (@a -> @e) where one id is missing in rt_nodes.
Twofold join
By default, joins in data.table are right joins. Therefore, rt_edges appears on the right side.
    result <- rt_nodes[rt_nodes[rt_edges, on = c(id = "target")], on = c(id = "source")]
    # rename columns
    setnames(result, c("source", "source.class", "target", "target.class"))
    result
    #   source source.class target target.class
    #1:     @d           23     @b            7
    #2:     @c           98     @a           64
    #3:     @a           64     @e           NA
All 3 edges appear in the result. NA indicates that @e is missing in rt_nodes.
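As an alternative sketch (not the method used above), the class columns can also be added to rt_edges directly with two data.table update joins, which modify rt_edges by reference and leave its row order untouched. The toy data below are assumed and simply repeat the shape of the sample data.

```r
library(data.table)

# Toy data (assumed for illustration; same shape as the sample data)
rt_nodes <- data.table(id = c("@a", "@b", "@c", "@d", "@f"),
                       class = c(64L, 7L, 98L, 23L, 59L))
rt_edges <- data.table(source = c("@d", "@c", "@a"),
                       target = c("@b", "@a", "@e"))

# Update joins: look up each id in rt_nodes and add its class to rt_edges.
# i.class refers to the class column of rt_nodes (the table inside the brackets).
rt_edges[rt_nodes, on = .(source = id), source.class := i.class]
rt_edges[rt_nodes, on = .(target = id), target.class := i.class]

print(rt_edges)

# Edges whose target has no entry in rt_nodes keep NA
print(rt_edges[is.na(target.class)])
```

Because no new table is built, this variant may be convenient when rt_edges is large, though it changes rt_edges in place rather than returning a separate result.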
Question 2
The OP has included a second question (and has created a new post in the meantime):
Then I want to know how many records in a given source class have the same target class, and the percentage of connections between classes.
    result[, .(.N, share_of_occurrence_in_target.class = sum(source.class == target.class) / .N), by = source.class]
    #   source.class N share_of_occurrence_in_target.class
    #1:           23 1                                   0
    #2:           98 1                                   0
    #3:           64 1                                  NA
The counts are 1 and the shares are 0 here because the sample data don't include enough cases of matching classes. However, the code has been verified to work with the data provided in the other post of the OP.
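For the second half of the question, the percentage of connections between classes, a cross-tabulation is one possible sketch. The example below uses base R table() and prop.table() on a small made-up result (hypothetical data, not the OP's) so that some source and target classes actually match.

```r
# Hypothetical joined result (made up for illustration, not the OP's data)
result <- data.frame(
  source.class = c(1, 1, 1, 2, 2, 3),
  target.class = c(1, 1, 2, 2, 3, 3)
)

# Use the same set of classes for rows and columns so the table is square
classes <- sort(unique(c(result$source.class, result$target.class)))

# Number of edges for every source class / target class pair
counts <- table(
  source = factor(result$source.class, levels = classes),
  target = factor(result$target.class, levels = classes)
)

# Row-wise shares: for each source class, the fraction of its edges
# that go to each target class (each row sums to 1)
shares <- prop.table(counts, margin = 1)
print(round(100 * shares, 1))

# The diagonal holds the share of edges that stay within the same class
within_class <- diag(shares)
print(within_class)
```

Edges with an NA class are dropped by table() by default; passing useNA = "ifany" would count them in a separate row or column instead.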