scala - How to do a self cartesian product over the different partitions of a Spark dataset?


I need to compare the rows of a dataset two by two. Ideally, I would take the self cartesian product of the dataset, remove the duplicated comparisons (since (a, b) is the same as (b, a)), and map over the pairs to decide whether each pair of rows is equal or not. However, this results in a huge number of rows and I can't afford the computational cost it has.

In order to bring the resulting number of rows down as much as possible, I want to sort the rows and apply the self cartesian product only over subsets of the whole dataset. For example, the subsets would be the following ones:

  • from row 0 to 100
  • from row 50 to 150
  • from row 100 to 200
  • ...

This way, each row is compared only to its neighbours, and the final number of rows to compare is much smaller than with a self cartesian product over the whole dataset.
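The windowing idea above can be sketched with plain Scala collections (a toy stand-in for the Spark version, with hypothetical helper names): each half-overlapping window of size `divisionSize` yields candidate index pairs, and keeping only `i < j` drops the (b, a) duplicates.

```scala
object WindowedPairs {
  // Generate candidate index pairs from half-overlapping windows of `size`
  // over `n` rows, keeping only i < j so each pair is compared once.
  def candidatePairs(n: Int, size: Int): Set[(Int, Int)] = {
    val starts = 0 until n by (size / 2) // 0, 50, 100, ... for size = 100
    starts.flatMap { s =>
      val window = s until math.min(s + size, n)
      for (i <- window; j <- window if i < j) yield (i, j)
    }.toSet
  }
}
```

For example, with 200 rows and windows of 100, rows 0 and 99 share the first window and become a candidate pair, while rows 0 and 150 never share a window and are never compared, so the candidate set is strictly smaller than the full n·(n−1)/2 self product.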

My attempt

I've implemented a solution, but for some reason it takes a lot of time even if the dataset is small.

Firstly, I sort and zip the dataset in order to identify each row by its index:

val sortedByTitle = journalArticles.orderBy("title")
val withIndex = sortedByTitle.rdd.zipWithIndex().toDF("article", "index").as[IndexArticle]

Then, I've made a function to do the division and the self cartesian product:

def divideAndCartesian(data: Dataset[IndexArticle], fromIndex: Long, divisionSize: Int): Dataset[CartessianIndexArticles] = {
  val division = data.filter(x => x.index >= fromIndex && x.index < fromIndex + divisionSize)
  if (division.count() == 0)
    Seq.empty[(JournalArticle, Long, JournalArticle, Long)]
      .toDF("article1", "index1", "article2", "index2")
      .as[CartessianIndexArticles]
  else
    division.crossJoin(division)
      .toDF("article1", "index1", "article2", "index2")
      .as[CartessianIndexArticles]
      .union(divideAndCartesian(data, fromIndex + (divisionSize / 2), divisionSize))
}
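One likely source of the slowness: every recursive call runs `division.count()`, which is a full Spark action, so the recursion itself costs one job per window. The window starts can instead be precomputed from a single total row count, then iterated. A minimal sketch of that computation in plain Scala (assuming the total comes from one `data.count()` up front):

```scala
object Windows {
  // Compute every half-overlapping window start in one pass,
  // instead of discovering them recursively with one count() per call.
  def windowStarts(total: Long, divisionSize: Int): List[Long] =
    (0L until total by (divisionSize / 2).toLong).toList
}
```

With the starts known in advance, each window's cross join can be built in a simple loop (or a `map` followed by a `reduce(_ union _)`) without any per-window `count()`.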

Any ideas?

Thank you!

I suggest you read about approximate similarity join using locality sensitive hashing (LSH). Per the documentation:

The general idea of LSH is to use a family of functions ("LSH families") to hash data points into buckets, so that data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets.

And specifically, about approximate similarity join:

Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.

In short, LSH bucketizes the rows to avoid comparing all possible pairs. Afterwards, you run an approximate similarity join; for instance, if you use bucketed random projection for Euclidean distance:

val joined = model.approxSimilarityJoin(data, data, 2.5)

All the pairs in joined are within a distance of 2.5 of each other. It is up to you to decide whether this approximation is good enough after filtering out the duplicate pairs, or whether you want to calculate the exact similarity between the remaining rows.
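The bucketing idea behind LSH can be illustrated with plain Scala collections; this is a toy stand-in hash (coarse quantization of doubles), not a real LSH family, but it shows why only rows sharing a bucket get compared and how keeping `i < j` removes the duplicate pairs a self-join produces:

```scala
object LshSketch {
  // Toy stand-in for an LSH hash: quantize values coarsely, so values
  // within roughly bucketWidth of each other land in the same bucket.
  def bucket(x: Double, bucketWidth: Double): Long =
    math.floor(x / bucketWidth).toLong

  // Compare only rows sharing a bucket, keeping each pair once (i < j).
  def candidatePairs(values: Vector[Double], bucketWidth: Double): Set[(Int, Int)] = {
    val byBucket = values.indices.groupBy(i => bucket(values(i), bucketWidth))
    byBucket.values.flatMap { idxs =>
      for (i <- idxs; j <- idxs if i < j) yield (i, j)
    }.toSet
  }
}
```

A real LSH family uses several randomized hash tables so that near pairs falling on a bucket boundary are still found with high probability; a single quantization as above would miss them.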

