R - Merge data files


I have the following data frames in R:

    id   class
    @a    64
    @b    7
    @c    98

And a second data frame:

    source    target
    @d        @b
    @c        @a

This describes the nodes and edges in a social network. The users (all with @ in front) each belong to a specific community, whose number is listed in the column class. To analyse the connections, I want to merge the two data frames and create a new data frame that looks like this:

    source    target    source.class    target.class
    @a        @i        56              2
    @f        @k        90              49

When I try merge(), R stops responding and I need to terminate R. The data frames consist of 20000 (node file) and 30000 (edge file) rows.

Then I want to know how many records in a given source class have the same target class, and the percentage of connections between the classes.

I would be happy if someone could help me, since I'm new to R.

Edit: I think I managed to create the columns with the code below, using match() instead of merge() (rt_nodes contains the columns "id" and "class", and rt_edges contains the columns "source" and "target"):

    # match source in rt_edges with id in rt_nodes
    match(rt_edges$source, rt_nodes$id)

    # match target in rt_edges with id in rt_nodes
    match(rt_edges$target, rt_nodes$id)

    # create source_class
    rt_nodes$modularity_class[match(rt_edges$source, rt_nodes$id)]
    rt_edges$source_class = rt_nodes$modularity_class[match(rt_edges$source, rt_nodes$id)]

    # create target_class
    rt_nodes$modularity_class[match(rt_edges$target, rt_nodes$id)]
    rt_edges$target_class = rt_nodes$modularity_class[match(rt_edges$target, rt_nodes$id)]

Now I need to figure out how I can find the percentage of connections within each class and the percentage of connections to other classes. Any tips on how to do that?

Question 1: Merge

This requires two separate join operations: an initial join of rt_edges with rt_nodes on target, and a subsequent join of the intermediate result with rt_nodes on source. In addition, all rows of rt_edges should appear in the result.
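For readers who want to see the two steps spelled out, here is a minimal base R sketch of the same idea using merge() with all.x = TRUE, so that every edge is kept. The toy values and the helper names step1, step2 and merged are my own illustration, not the OP's data, and I have not checked how this behaves on the OP's full files.

```r
# Toy data using the column names from the question (values are illustrative)
rt_nodes <- data.frame(id = c("@a", "@b", "@c", "@d"),
                       class = c(64, 7, 98, 23),
                       stringsAsFactors = FALSE)
rt_edges <- data.frame(source = c("@d", "@c"),
                       target = c("@b", "@a"),
                       stringsAsFactors = FALSE)

# Step 1: attach the class of the target node; all.x = TRUE keeps every edge
step1 <- merge(rt_edges, rt_nodes, by.x = "target", by.y = "id", all.x = TRUE)
names(step1)[names(step1) == "class"] <- "target.class"

# Step 2: attach the class of the source node to the intermediate result
step2 <- merge(step1, rt_nodes, by.x = "source", by.y = "id", all.x = TRUE)
names(step2)[names(step2) == "class"] <- "source.class"

# Put the columns into the order asked for in the question
merged <- step2[, c("source", "target", "source.class", "target.class")]
print(merged)
```

Note that merge() sorts the result by the key column by default, so the row order may differ from the order in rt_edges.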

The approach below uses data.table. (I've adopted the naming of variables and columns which the OP has used in the edited code of the Q, but note that it is inconsistent with the sample data given by the OP.)

Reading data

    library(data.table)
    rt_nodes <- fread(
      "id   class
      @a    64
      @b    7
      @c    98
      @d    23
      @f    59")
    rt_edges <- fread(
      "source    target
      @d        @b
      @c        @a
      @a        @e")

Note that additional rows have been added to the sample data provided by the OP to demonstrate the effect of

  • a node (@f) which is not involved in any edge, and
  • an edge (@a -> @e) where one id is missing from rt_nodes.

Twofold join

By default, joins in data.table are right joins. Therefore, rt_edges appears on the right side.
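To illustrate what "right join" means here, the following small sketch uses two made-up tables, X and Y (hypothetical names and values). It assumes a reasonably recent data.table, where nomatch = NULL is accepted.

```r
library(data.table)

# Hypothetical lookup table X and a table Y whose rows must all be kept
X <- data.table(id = c("@a", "@b"), class = c(64, 7))
Y <- data.table(target = c("@b", "@e"))

# X[Y, on = ...] looks up each row of Y in X, so every row of Y appears
right_join <- X[Y, on = c(id = "target")]
print(right_join)   # "@e" has no match in X, so its class is NA

# nomatch = NULL drops the unmatched rows of Y instead (an inner join)
inner_join <- X[Y, on = c(id = "target"), nomatch = NULL]
print(inner_join)
```

In other words, the table inside the square brackets decides which rows survive, which is why rt_edges has to go there.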

    result <- rt_nodes[rt_nodes[rt_edges, on = c(id = "target")], on = c(id = "source")]

    # rename columns
    setnames(result, c("source", "source.class", "target", "target.class"))

    result
    #   source source.class target target.class
    #1:     @d           23     @b            7
    #2:     @c           98     @a           64
    #3:     @a           64     @e           NA

All three edges appear in the result. NA indicates that @e is missing from rt_nodes.

Question 2
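If it helps to find such cases up front, ids that occur in the edge list but not in the node list can be listed with plain set operations before joining. This is only a sketch on the extended sample data; missing_ids and isolated_ids are names I made up for the illustration.

```r
# Extended sample data from this answer, rebuilt as plain data frames
rt_nodes <- data.frame(id = c("@a", "@b", "@c", "@d", "@f"),
                       class = c(64, 7, 98, 23, 59),
                       stringsAsFactors = FALSE)
rt_edges <- data.frame(source = c("@d", "@c", "@a"),
                       target = c("@b", "@a", "@e"),
                       stringsAsFactors = FALSE)

# All ids that take part in at least one edge
edge_ids <- union(rt_edges$source, rt_edges$target)

# ids used in an edge but without a row in rt_nodes (these produce NA classes)
missing_ids <- setdiff(edge_ids, rt_nodes$id)

# nodes that are not involved in any edge
isolated_ids <- setdiff(rt_nodes$id, edge_ids)

print(missing_ids)
print(isolated_ids)
```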

The OP has included a second question (and has created a new post in the meantime):

"Then I want to know how many records in a given source class have the same target class, and the percentage of connections between the classes."

    result[, .(.N, share_of_occurrence_in_target.class =
                 sum(source.class == target.class) / .N),
           by = source.class]
    #   source.class N share_of_occurrence_in_target.class
    #1:           23 1                                   0
    #2:           98 1                                   0
    #3:           64 1                                  NA

The counts are 1 and the shares are 0 here because the sample data don't include enough cases of matching classes. However, the code has been verified to work with the data provided in the other post of the OP.
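For the "percentage of connections between the classes" part, a cross-tabulation is one possible way to get all class-to-class shares at once. The sketch below uses base R on made-up class values (not the OP's data); classes, counts, shares and within_share are hypothetical names.

```r
# Hypothetical merged result with a few within-class and between-class edges
result <- data.frame(source.class = c(1, 1, 1, 2, 2),
                     target.class = c(1, 1, 2, 2, 1))

# Use the same set of levels on both sides so that the table is square
classes <- sort(unique(c(result$source.class, result$target.class)))
counts <- table(source = factor(result$source.class, levels = classes),
                target = factor(result$target.class, levels = classes))

# Row-wise shares: each row (source class) sums to 1
shares <- prop.table(counts, margin = 1)
print(round(100 * shares, 1))

# The diagonal holds the share of connections that stay within a class
within_share <- diag(shares)
print(within_share)
```

By default table() drops rows with NA classes, so edges with an unknown node are left out of these percentages, and a source class without any edges would give NaN in its row.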

