r - How to randomly sample dataframe rows with unique column values
The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) and of avg_score across 10 trials of incrementally sized random samples per word, using a dataset similar to:
    library(data.table)
    set.seed(1)

    df <- data.frame(
      word_id = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
      word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5),
               rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)),
      true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
      occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
      item_score = runif(40),
      avg_score = rnorm(40),
      line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,
               87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
      validity = sample(c("t", "f"), 40, replace = TRUE)
    )

    dt <- data.table(df)
    # running count of detections per word
    dt[ , denominator := 1:.N, by=word_id]
    # running count per word and validity, zeroed for invalid detections
    dt[ , numerator := 1:.N, by=c("word_id", "validity")]
    dt$numerator[df$validity=="f"] <- 0
    df <- dt

    > df
        word_id  word true_count occurrences item_score   avg_score line validity denominator numerator
     1:       1   cat        234         234 0.25497614  0.15268651   71        f           1         0
     2:       1   cat        234         234 0.18662407  1.77376261  234        f           2         0
     3:       1   cat        234         234 0.74554352 -0.64807093   71        t           3         1
     4:       1   cat        234         234 0.93296878 -0.19981748   34        t           4         2
     5:       2 house         39          34 0.49471189  0.68924373   25        f           1         0
     6:       2 house         39          34 0.64499368  0.03614551   32        t           2         1
     7:       2 house         39          34 0.17580259  1.94353631  573        f           3         0
     8:       3  sung        876         876 0.60299465  0.73721373    3        t           1         1
     9:       3  sung        876         876 0.88775767  2.32133393  673        f           2         0
    10:       4  door          4           4 0.49020940  0.34890935  899        t           1         1
    11:       4  door          4           4 0.01838357 -1.13391666  904        t           2         2

The data represents each detection of a word in a document, and it's possible for a word to appear on the same line more than once. The task is to have the sample size represent unique column values (line), but to return all instances where the line number is the same, meaning the actual number of rows returned may be more than the specified sample size. So, for one trial with a sample size of two for "cat", the form of the desired result would be:
       word_id word true_count occurrences item_score   avg_score line validity denominator numerator
    1:       1  cat        234         234 0.25497614  0.15268651   71        f           1         0
    2:       1  cat        234         234 0.18662407  1.77376261  234        f           2         0
    3:       1  cat        234         234 0.74554352 -0.64807093   71        t           3         1

My basic iteration (found on this site) looks like:
    # result lists must exist before assigning into them by index
    a2 <- list()
    b3 <- list()
    for (i in 1:10) {
      a2[[i]] <- lapply(split(df, df$word_id), function(x) x[sample(nrow(x), 2, replace = TRUE), ])
      b3[[i]] <- lapply(split(df, df$word_id), function(x) x[sample(nrow(x), 3, replace = TRUE), ])
    }

So, I can do standard random sample sizes, but I'm unsure (and couldn't find anything similar, or wasn't looking the right way) how to approach the goal stated above. Is there a straightforward way to approach this?

Thanks,
Here is a data.table solution that uses a join on a sampled data.table.
    set.seed(1234)
    df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]

The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value of line. The values of line are returned by sample, which is fed the unique values of line and is run separately for each word (using by=word). You can vary the number of unique line values by changing 2 to your desired value. This data.table is then joined onto the main data.table in order to select the desired rows.
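The inner table and the join described above can also be written as two separate steps, which makes it easier to inspect what was sampled. The following is only a sketch on a small made-up table: `toy` and the helper `sample_unique` are hypothetical names, not part of the question or of data.table. The helper samples by index because `sample()` treats a single number `x` as the range `1:x`, which would matter for a word with only one distinct line, and it caps the sample at the number of distinct lines available.

```r
library(data.table)

# Toy data (hypothetical, smaller than the question's df): one row per word detection.
toy <- data.table(
  word = c("cat", "cat", "cat", "cat", "house", "house", "house"),
  line = c(71, 234, 71, 34, 25, 32, 573)
)

# Hypothetical helper: draw up to n distinct values from x.
# Sampling positions with sample.int() avoids sample()'s special case
# where a single number x is treated as the range 1:x.
sample_unique <- function(x, n) {
  u <- unique(x)
  u[sample.int(length(u), min(n, length(u)))]
}

set.seed(1234)

# Step 1: the inner table, with up to 2 sampled line values per word.
picked <- toy[, .(line = sample_unique(line, 2)), by = word]

# Step 2: join back, so every row sharing a sampled (word, line) pair is returned.
result <- toy[picked, on = .(word, line)]
print(result)
```

If line 71 is among the values drawn for "cat", both of its rows come back, so `result` can hold more rows than `picked`.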
In this instance, you get
        word_id   word true_count occurrences item_score   avg_score line validity
     1:       1    cat        234         234 0.26550866  0.91897737   71        f
     2:       1    cat        234         234 0.57285336  0.07456498   71        t
     3:       1    cat        234         234 0.37212390  0.78213630  234        t
     4:       2  house         39          34 0.89838968 -0.05612874   32        t
     5:       2  house         39          34 0.94467527 -0.15579551  573        f
     6:       3   sung        876         876 0.62911404 -0.47815006  673        t
     7:       3   sung        876         876 0.66079779 -1.47075238    3        t
     8:       4   door          4           4 0.06178627  0.41794156  899        f
     9:       4   door          4           4 0.38410372 -0.05380504   55        f
    10:       5 pretty         67          65 0.71761851 -0.39428995  100        f
    11:       5 pretty         67          65 0.38003518  1.10002537  100        f
    12:       5 pretty         67          65 0.49769924 -0.41499456 1003        f
    13:       6  towel         81          81 0.21214252 -0.25336168  326        f
    14:       6  towel         81          81 0.93470523 -0.16452360   87        f
    15:       7    car         90          90 0.12555510  0.55666320   32        t
    16:       7    car         90          90 0.26722067 -0.68875569   54        f
    17:       8 island         43          43 0.01339033  0.36458196   87        t
    18:       8 island         43          43 0.38238796  0.76853292  988        f
    19:       8 island         43          43 0.86969085 -0.11234621  988        t
    20:       9    ran         54          54 0.59956583 -0.61202639  754        f
    21:       9    ran         54          54 0.82737332  1.43302370 4276        f
    22:      10  pizza         53          51 0.79423986 -0.36722148   93        f
    23:      10  pizza         53          51 0.41127443 -0.13505460   49        t
        word_id   word true_count occurrences item_score   avg_score line validity
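To get closer to the question's goal of 10 trials at several sample sizes, the same join could be wrapped in a function and repeated. This is a rough sketch under assumptions: `toy`, `sample_lines`, `sizes`, `per_trial` and `spread` are hypothetical names, the data is made up, and the mean of item_score stands in as a placeholder for whatever statistic you actually want to compare.

```r
library(data.table)

# Toy stand-in for the question's df (hypothetical values).
set.seed(1)
toy <- data.table(
  word = rep(c("cat", "door"), times = c(4, 5)),
  line = c(71, 234, 71, 34, 899, 904, 2, 4, 55),
  item_score = runif(9)
)

# Hypothetical helper: one trial, sampling up to n unique lines per word
# and returning every row that shares a sampled (word, line) pair.
sample_lines <- function(dt, n) {
  picked <- dt[, {
    u <- unique(line)
    .(line = u[sample.int(length(u), min(n, length(u)))])
  }, by = word]
  dt[picked, on = .(word, line)]
}

# 10 trials for each sample size, stacked with trial and size labels.
sizes <- 2:3
trials <- rbindlist(lapply(sizes, function(n) {
  rbindlist(lapply(1:10, function(i) sample_lines(toy, n)), idcol = "trial")
}), idcol = "size_idx")
trials[, size := sizes[size_idx]][, size_idx := NULL]

# Placeholder statistic: mean item_score per word within each trial,
# then its variance and standard deviation across the 10 trials.
per_trial <- trials[, .(stat = mean(item_score)), by = .(size, word, trial)]
spread <- per_trial[, .(variance = var(stat), std_dev = sd(stat)), by = .(size, word)]
print(spread)
```

In this toy data "cat" has only three distinct lines, so at size 3 every trial returns all of its rows and the spread is essentially zero. That is a reminder that the sample size is effectively capped by the number of unique lines per word.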