r - How to randomly sample dataframe rows with unique column values

The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) and of avg_score across 10 trials of incrementally sized random samples per word, from a dataset similar to:

    library(data.table)
    set.seed(1)
    df <- data.frame(
      word_id = c(rep(1,4), rep(2,3), rep(3,2), rep(4,5), rep(5,5), rep(6,3), rep(7,4), rep(8,4), rep(9,6), rep(10,4)),
      word = c(rep("cat",4), rep("house",3), rep("sung",2), rep("door",5), rep("pretty",5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza",4)),
      true_count = c(rep(234,4), rep(39,3), rep(876,2), rep(4,5), rep(67,5), rep(81,3), rep(90,4), rep(43,4), rep(54,6), rep(53,4)),
      occurrences = c(rep(234,4), rep(34,3), rep(876,2), rep(4,5), rep(65,5), rep(81,3), rep(90,4), rep(43,4), rep(54,6), rep(51,4)),
      item_score = runif(40),
      avg_score = rnorm(40),
      line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
      validity = sample(c("t", "f"), 40, replace = TRUE)
    )
    dt <- data.table(df)
    # running row count within each word
    dt[, denominator := 1:.N, by = word_id]
    # running count within each word/validity group, zeroed for invalid rows
    dt[, numerator := 1:.N, by = c("word_id", "validity")]
    dt$numerator[df$validity == "f"] <- 0
    df <- dt

    > df
        word_id  word true_count occurrences item_score   avg_score line validity denominator numerator
     1:       1   cat        234         234 0.25497614  0.15268651   71        f           1         0
     2:       1   cat        234         234 0.18662407  1.77376261  234        f           2         0
     3:       1   cat        234         234 0.74554352 -0.64807093   71        t           3         1
     4:       1   cat        234         234 0.93296878 -0.19981748   34        t           4         2
     5:       2 house         39          34 0.49471189  0.68924373   25        f           1         0
     6:       2 house         39          34 0.64499368  0.03614551   32        t           2         1
     7:       2 house         39          34 0.17580259  1.94353631  573        f           3         0
     8:       3  sung        876         876 0.60299465  0.73721373    3        t           1         1
     9:       3  sung        876         876 0.88775767  2.32133393  673        f           2         0
    10:       4  door          4           4 0.49020940  0.34890935  899        t           1         1
    11:       4  door          4           4 0.01838357 -1.13391666  904        t           2         2

The data represents each detection of a word in a document, so it's possible for a word to appear on the same line more than once. The task is for the sample size to represent unique column values (line), but to return all instances where the line number is the same, meaning the actual number of rows returned may be more than the specified sample size. So, for one trial with a sample size of two for "cat", the form of the desired result would be:

        word_id  word true_count occurrences item_score   avg_score line validity denominator numerator
     1:       1   cat        234         234 0.25497614  0.15268651   71        f           1         0
     2:       1   cat        234         234 0.18662407  1.77376261  234        f           2         0
     3:       1   cat        234         234 0.74554352 -0.64807093   71        t           3         1

My basic iteration (found on this site) looks like:

    # result lists must exist before they are filled
    a2 <- list()
    b3 <- list()
    for (i in 1:10) {
      a2[[i]] <- lapply(split(df, df$word_id), function(x) x[sample(nrow(x), 2, replace = TRUE), ])
      b3[[i]] <- lapply(split(df, df$word_id), function(x) x[sample(nrow(x), 3, replace = TRUE), ])
    }

So, I can do standard random sample sizes, but I'm unsure (and couldn't find anything similar, or wasn't looking the right way) how to approach the goal stated above. Is there a straightforward way to approach this?

Thanks,

Here is a data.table solution that uses a join on a sampled data.table.

    set.seed(1234)
    df[df[, .(line = sample(unique(line), 2)), by = word], on = .(word, line)]

The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value of line. The values of line are returned by sample, which is fed the unique values of line and is performed separately for each word (using by=word). You can vary the number of unique line values by changing 2 to the desired value. This data.table is then joined onto the main data.table in order to select the desired rows.

In this instance, you get

        word_id   word true_count occurrences item_score   avg_score line validity
     1:       1    cat        234         234 0.26550866  0.91897737   71        f
     2:       1    cat        234         234 0.57285336  0.07456498   71        t
     3:       1    cat        234         234 0.37212390  0.78213630  234        t
     4:       2  house         39          34 0.89838968 -0.05612874   32        t
     5:       2  house         39          34 0.94467527 -0.15579551  573        f
     6:       3   sung        876         876 0.62911404 -0.47815006  673        t
     7:       3   sung        876         876 0.66079779 -1.47075238    3        t
     8:       4   door          4           4 0.06178627  0.41794156  899        f
     9:       4   door          4           4 0.38410372 -0.05380504   55        f
    10:       5 pretty         67          65 0.71761851 -0.39428995  100        f
    11:       5 pretty         67          65 0.38003518  1.10002537  100        f
    12:       5 pretty         67          65 0.49769924 -0.41499456 1003        f
    13:       6  towel         81          81 0.21214252 -0.25336168  326        f
    14:       6  towel         81          81 0.93470523 -0.16452360   87        f
    15:       7    car         90          90 0.12555510  0.55666320   32        t
    16:       7    car         90          90 0.26722067 -0.68875569   54        f
    17:       8 island         43          43 0.01339033  0.36458196   87        t
    18:       8 island         43          43 0.38238796  0.76853292  988        f
    19:       8 island         43          43 0.86969085 -0.11234621  988        t
    20:       9    ran         54          54 0.59956583 -0.61202639  754        f
    21:       9    ran         54          54 0.82737332  1.43302370 4276        f
    22:      10  pizza         53          51 0.79423986 -0.36722148   93        f
    23:      10  pizza         53          51 0.41127443 -0.13505460   49        t
        word_id   word true_count occurrences item_score   avg_score line validity
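
To repeat the join above over several trials and sample sizes, it could be wrapped in a small function. The following is only a minimal sketch under a few assumptions: `sample_by_line` is a hypothetical helper name, the small table is made-up stand-in data rather than the question's dataset, and capping the sample size at the number of unique lines for a word is one guess at the desired behavior when a word appears on fewer lines than requested. It also indexes into the unique values with `sample.int`, because `sample(x, ...)` treats a single number `x` as `1:x`, which would matter for a word that appears on only one line.

```r
library(data.table)

# Small stand-in table (made-up values, not the question's data).
dt <- data.table(
  word = c("cat", "cat", "cat", "cat", "sung", "sung", "door"),
  line = c(71, 234, 71, 34, 3, 673, 899),
  avg_score = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
)

# Hypothetical helper: draw up to n unique `line` values per word, then
# join back so that every row on a sampled line is returned.
sample_by_line <- function(dt, n) {
  picked <- dt[, {
    u <- unique(line)
    # Index into u instead of calling sample(u, ...): sample() expands a
    # single number x to 1:x and could return lines that do not exist.
    list(line = u[sample.int(length(u), min(n, length(u)))])
  }, by = word]
  dt[picked, on = .(word, line)]
}

set.seed(1234)
# Ten independent trials at sample size 2, stacked with a trial id column.
trials <- rbindlist(lapply(1:10, function(i) sample_by_line(dt, 2)),
                    idcol = "trial")

# Spread of the per-trial mean of avg_score, by word.
per_trial <- trials[, .(mean_avg = mean(avg_score)), by = .(word, trial)]
spread <- per_trial[, .(var_mean = var(mean_avg), sd_mean = sd(mean_avg)),
                    by = word]
```

Calling `sample_by_line(dt, 3)` in the same loop would give the three-line trials; the summary at the end is just one way the trial results might be compared.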
