scala - How to do a self cartesian product over the different partitions of a Spark dataset?
I need to compare the different rows of a dataset two by two. Ideally, I would do a self cartesian product of the dataset, remove the duplicated comparisons (as (a, b) is the same as (b, a)) and then map over the pairs to decide whether the two rows of each pair are equal or not. However, this would result in a huge amount of rows and I can't afford the computational cost.
In order to bring the resultant amount of rows down as much as possible, I would like to sort the rows and apply the self cartesian product only on different subsets of the whole dataset. For example, the subsets would be the following ones:
- from row 0 to 100
- from row 50 to 150
- from row 100 to 200
- ...
This way, each row would be compared only with its neighbours, and the final amount of rows to compare would be much smaller than with a self cartesian product over the whole dataset.
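The overlapping-windows idea can be sketched with plain Scala collections (a toy illustration only, not Spark code; `rows`, the window size and the step are made-up values):

```scala
// Toy sketch: pair each row only with rows in overlapping windows of size 100,
// stepping by 50, instead of a full N x N cartesian product.
val rows = (0 until 300).toVector          // stand-ins for the sorted rows
val windowSize = 100
val step = 50

val pairs = rows.indices
  .sliding(windowSize, step)               // overlapping index windows
  .flatMap { w =>
    for {
      i <- w
      j <- w if i < j                      // keep (a, b), drop (b, a) and (a, a)
    } yield (rows(i), rows(j))
  }
  .toSet                                   // overlapping windows repeat pairs; dedupe

// far fewer pairs than the full cartesian product would give
assert(pairs.size < rows.size * (rows.size - 1) / 2)
```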
Attempt
I've implemented this solution, but for some reason it takes a lot of time even if the dataset is small.
Firstly, I sort the dataset and zip it with an index in order to identify each row:
val sortedByTitle = journalArticles.orderBy("title")
val withIndex = sortedByTitle.rdd.zipWithIndex().toDF("article", "index").as[IndexArticle]
Then, I've made a function to do the division and the self cartesian product:
def divideAndCartesian(data: Dataset[IndexArticle], fromIndex: Long, divisionSize: Int): Dataset[CartessianIndexArticles] = {
  val division = data.filter(x => x.index >= fromIndex && x.index < fromIndex + divisionSize)
  if (division.count() == 0)
    Seq.empty[(JournalArticle, Long, JournalArticle, Long)].toDF("article1", "index1", "article2", "index2").as[CartessianIndexArticles]
  else
    division.crossJoin(division)
      .toDF("article1", "index1", "article2", "index2")
      .as[CartessianIndexArticles]
      .union(divideAndCartesian(data, fromIndex + (divisionSize / 2), divisionSize))
}
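One likely cause of the slowness: every recursive call runs `division.count()`, which triggers a separate Spark job per window. A common non-recursive alternative (a sketch only, assuming the same `withIndex` Dataset as above; the window size 100 / step 50 and all column names are illustrative) is to assign each row to its two overlapping windows with integer division and then self-join on the window id:

```scala
import org.apache.spark.sql.functions._

// A row with index i belongs to windows i/50 and i/50 - 1 (window size 100,
// step 50), so rows that fall in a common 100-row window share a window id.
val withWindows = withIndex
  .withColumn("window", explode(array(floor(col("index") / 50), floor(col("index") / 50) - 1)))

val pairs = withWindows.as("a")
  .join(withWindows.as("b"),
        col("a.window") === col("b.window") && col("a.index") < col("b.index"))
  .select(col("a.article").as("article1"), col("a.index").as("index1"),
          col("b.article").as("article2"), col("b.index").as("index2"))
  .distinct()   // overlapping windows can produce the same pair twice
```

The `col("a.index") < col("b.index")` condition also removes the (b, a) duplicates and the self-pairs in one step.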
Any ideas?
Thank you!
I would suggest that you read about approximate similarity join using locality sensitive hashing (LSH). Per the documentation:
The general idea of LSH is to use a family of functions ("LSH families") to hash data points into buckets, so that data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets.
Specifically, about approximate similarity join:
Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
In short, LSH will bucketize your rows so that you avoid comparing all possible pairs. After that you do an approximate similarity join. For instance, if you use bucketed random projection with euclidean distance:
val joined = model.approxSimilarityJoin(data, data, 2.5)
all pairs in joined that are within a distance of 2.5 of each other will be returned. Then you can decide if that approximation is enough to filter out the duplicates, or if you want to calculate the exact similarity between the rows.
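Putting it together, a minimal sketch of the full pipeline could look as follows. Note that LSH operates on a numeric vector column, so the article rows must first be converted into feature vectors (e.g. term-frequency vectors of the titles); the `id` and `features` column names here are assumptions, not part of the original question:

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.sql.functions.col

// Assumes `data` has a unique "id" column and a "features" vector column
// (e.g. produced by HashingTF over the article titles) -- illustrative names.
val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)      // smaller buckets => fewer candidate pairs
  .setNumHashTables(3)       // more tables => higher recall, more computation
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(data)

// Self-join: candidate pairs within euclidean distance 2.5
val joined = model.approxSimilarityJoin(data, data, 2.5, "EuclideanDistance")

// Drop self-pairs and keep only one of each (a, b) / (b, a) duplicate
val pairs = joined.filter(col("datasetA.id") < col("datasetB.id"))
```

The result columns `datasetA`, `datasetB` and the distance column come from `approxSimilarityJoin` itself; the filter on ids is the standard way to deduplicate a self-join.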