r - Merge data files -

May 15, 2010

I have the following data frames in R:

id   class
@a    64
@b    7
@c    98

And a second data frame:

source    target
@d        @b
@c        @a

This describes the nodes and edges in a social network. The users (all with @ in front) belong to a specific community, and its number is listed in the column class. To analyse the connections between the columns, I want to merge the data frames and create a new data frame looking like this:

source    target    source.class    target.class
@a        @i        56               2
@f        @k        90               49

When I try merge(), R stops responding and I need to terminate R. The data frames consist of 20000 (node file) and 30000 (edge file) rows.

Then I want to know how many records in a given source class have the same target class, and the percentage of connections between the classes.

I would be happy if someone could help me, since I'm new to R.

Edit: I think I managed to create the columns with this code, using match() instead of merge() (rt_nodes contains the columns "id" and "class", and rt_edges contains the columns "source" and "target"):

# match source in rt_edges to id in rt_nodes
match(rt_edges$source, rt_nodes$id)

# match target in rt_edges to id in rt_nodes
match(rt_edges$target, rt_nodes$id)

# create source_class
rt_nodes$modularity_class[match(rt_edges$source, rt_nodes$id)]
rt_edges$source_class = rt_nodes$modularity_class[match(rt_edges$source, rt_nodes$id)]

# create target_class
rt_nodes$modularity_class[match(rt_edges$target, rt_nodes$id)]
rt_edges$target_class = rt_nodes$modularity_class[match(rt_edges$target, rt_nodes$id)]

Now I need to figure out how I can find the percentage of connections within each class and the percentage of connections to other classes. Any tips on how to do that?

question 1: merge

This requires two separate join operations: an initial join of rt_edges with rt_nodes on target, and a subsequent join of the intermediate result with rt_nodes on source. In addition, all rows of rt_edges should appear in the result.

The approach below uses data.table. (I've adopted the naming of variables and columns the OP has used in the edited code of the Q, but note that it is inconsistent with the sample data given by the OP.)

reading data

library(data.table)
rt_nodes <- fread(
  "id   class
  @a    64
  @b    7
  @c    98
  @d    23
  @f    59")
rt_edges <- fread(
  "source    target
  @d        @b
  @c        @a
  @a        @e")

Note that additional rows have been added to the sample data provided by the OP to demonstrate the effect of

a node (@f) which is not involved in any edge, and
an edge (@a -> @e) where one id is missing from rt_nodes.
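
Both kinds of special cases can be detected before joining. The following is a minimal base R sketch, assuming the sample data is held in plain data frames; the names nodes, edges, edge_ids, isolated_nodes and missing_ids are illustrative and do not appear in the original post.

```r
# Illustrative copies of the sample data as plain data frames (assumed names)
nodes <- data.frame(id = c("@a", "@b", "@c", "@d", "@f"),
                    class = c(64, 7, 98, 23, 59),
                    stringsAsFactors = FALSE)
edges <- data.frame(source = c("@d", "@c", "@a"),
                    target = c("@b", "@a", "@e"),
                    stringsAsFactors = FALSE)

# All ids that occur in any edge, either as source or as target
edge_ids <- union(edges$source, edges$target)

# Nodes that are not involved in any edge
isolated_nodes <- setdiff(nodes$id, edge_ids)

# Ids used in edges that are missing from the node table
missing_ids <- setdiff(edge_ids, nodes$id)
```

With the sample data, isolated_nodes holds @f and missing_ids holds @e.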

twofold join

By default, joins in data.table are right joins. Therefore, rt_edges appears on the right side.

result <- rt_nodes[rt_nodes[rt_edges, on = c(id = "target")], on = c(id = "source")]

# rename columns
setnames(result, c("source", "source.class", "target", "target.class"))

result
#   source source.class target target.class
#1:     @d           23     @b            7
#2:     @c           98     @a           64
#3:     @a           64     @e           NA

All three edges appear in the result. NA indicates that @e is missing from rt_nodes.
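
For comparison, the same two-step join can be sketched in base R with merge(). This is only an illustration under assumptions: the data frames nodes and edges and the intermediate names step1 and merged are made up here. The key columns are named explicitly through by.x and by.y because merge() returns the Cartesian product of both inputs when it finds no key columns, and the two tables share no column names. With 20000 and 30000 rows that may be what made the OP's session hang, though that is a guess.

```r
# Illustrative copies of the sample data as plain data frames (assumed names)
nodes <- data.frame(id = c("@a", "@b", "@c", "@d", "@f"),
                    class = c(64, 7, 98, 23, 59),
                    stringsAsFactors = FALSE)
edges <- data.frame(source = c("@d", "@c", "@a"),
                    target = c("@b", "@a", "@e"),
                    stringsAsFactors = FALSE)

# First left join: look up the class of each source id;
# all.x = TRUE keeps every edge even if the id is unknown
step1 <- merge(edges, nodes, by.x = "source", by.y = "id", all.x = TRUE)
names(step1)[names(step1) == "class"] <- "source.class"

# Second left join: look up the class of each target id
merged <- merge(step1, nodes, by.x = "target", by.y = "id", all.x = TRUE)
names(merged)[names(merged) == "class"] <- "target.class"

# Put the columns into the order requested in the question
merged <- merged[, c("source", "target", "source.class", "target.class")]
```

Note that merge() sorts the result by the key column, so the row order can differ from the data.table result.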

question 2

The OP has included a second question (and has created a new post in the meantime):

Then I want to know how many records in a given source class have the same target class, and the percentage of connections between the classes.

result[, .(.N, share_of_occurrence_in_target.class = sum(source.class == target.class)/.N),
       by = source.class]
#   source.class N share_of_occurrence_in_target.class
#1:           23 1                                   0
#2:           98 1                                   0
#3:           64 1                                  NA

The counts are all 1 and the shares are 0 (or NA) here because the sample data don't include any cases of matching classes. However, the code has been verified to work with the data provided in the other post of the OP.
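
The percentage of connections between every pair of classes can also be sketched as a cross-tabulation in base R. The data frame classes below is hypothetical and is not the OP's data; the names lv, counts, shares and within are illustrative as well. Both columns are converted to factors with the same levels so that the table is square and its diagonal corresponds to connections within a class.

```r
# Hypothetical edge list that already carries both class labels
classes <- data.frame(source.class = c(1, 1, 1, 2, 2),
                      target.class = c(1, 1, 2, 2, 1))

# Shared set of class labels so rows and columns line up
lv <- sort(union(classes$source.class, classes$target.class))

# Count edges for every combination of source and target class
counts <- table(source = factor(classes$source.class, levels = lv),
                target = factor(classes$target.class, levels = lv))

# Row-wise percentages: for each source class, the share of its
# edges that goes to each target class
shares <- round(100 * prop.table(counts, margin = 1), 1)

# Percentage of edges that stay within the same class
within <- diag(shares)
```

Each row of shares adds up to 100 (apart from rounding), and the off-diagonal entries give the percentage of connections to other classes.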
