scala - How to do a self cartesian product over the different partitions of a Spark dataset?
I need to compare the different rows of a dataset two by two. Ideally, I would do a self cartesian product of the dataset, remove the duplicated comparisons (as (a, b) is the same as (b, a)) and then map over the pairs to decide whether the two rows of each pair are equal or not. However, this would result in a huge amount of rows and I can't afford the computational cost.
In order to bring the resultant amount of rows down as much as possible, I would like to sort the rows and apply the self cartesian product only on different subsets of the whole dataset. For example, the subsets would be the following ones:
- from row 0 to 100
- from row 50 to 150
- from row 100 to 200
- ...
This way, each row would be compared only with its neighbours, and the final amount of rows to compare would be much smaller than with a self cartesian product over the whole dataset.
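The overlapping-windows idea can be sketched with plain Scala collections (a toy illustration only, not Spark code; `rows`, the window size and the step are made-up values):

```scala
// Toy sketch: pair each row only with rows in overlapping windows of size 100,
// stepping by 50, instead of a full N x N cartesian product.
val rows = (0 until 300).toVector          // stand-ins for the sorted rows
val windowSize = 100
val step = 50

val pairs = rows.indices
  .sliding(windowSize, step)               // overlapping index windows
  .flatMap { w =>
    for {
      i <- w
      j <- w if i < j                      // keep (a, b), drop (b, a) and (a, a)
    } yield (rows(i), rows(j))
  }
  .toSet                                   // overlapping windows repeat pairs; dedupe

// far fewer pairs than the full cartesian product would give
assert(pairs.size < rows.size * (rows.size - 1) / 2)
```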
Attempt
I've implemented this solution, but for some reason it takes a lot of time even if the dataset is small.
Firstly, I sort the dataset and zip it with an index in order to identify each row:
val sortedByTitle = journalArticles.orderBy("title")
val withIndex = sortedByTitle.rdd.zipWithIndex().toDF("article", "index").as[IndexArticle]
Then, I've made a function to do the division and the self cartesian product:
def divideAndCartesian(data: Dataset[IndexArticle], fromIndex: Long, divisionSize: Int): Dataset[CartessianIndexArticles] = {
  val division = data.filter(x => x.index >= fromIndex && x.index < fromIndex + divisionSize)
  if (division.count() == 0)
    Seq.empty[(JournalArticle, Long, JournalArticle, Long)].toDF("article1", "index1", "article2", "index2").as[CartessianIndexArticles]
  else
    division.crossJoin(division)
      .toDF("article1", "index1", "article2", "index2")
      .as[CartessianIndexArticles]
      .union(divideAndCartesian(data, fromIndex + (divisionSize / 2), divisionSize))
}
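One likely cause of the slowness: every recursive call runs `division.count()`, which triggers a separate Spark job per window. A common non-recursive alternative (a sketch only, assuming the same `withIndex` Dataset as above; the window size 100 / step 50 and all column names are illustrative) is to assign each row to its two overlapping windows with integer division and then self-join on the window id:

```scala
import org.apache.spark.sql.functions._

// A row with index i belongs to windows i/50 and i/50 - 1 (window size 100,
// step 50), so rows that fall in a common 100-row window share a window id.
val withWindows = withIndex
  .withColumn("window", explode(array(floor(col("index") / 50), floor(col("index") / 50) - 1)))

val pairs = withWindows.as("a")
  .join(withWindows.as("b"),
        col("a.window") === col("b.window") && col("a.index") < col("b.index"))
  .select(col("a.article").as("article1"), col("a.index").as("index1"),
          col("b.article").as("article2"), col("b.index").as("index2"))
  .distinct()   // overlapping windows can produce the same pair twice
```

The `col("a.index") < col("b.index")` condition also removes the (b, a) duplicates and the self-pairs in one step.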
Any ideas?
Thank you!
I would suggest that you read about approximate similarity join using locality sensitive hashing (LSH). Per the documentation:
The general idea of LSH is to use a family of functions ("LSH families") to hash data points into buckets, so that data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets.
Specifically, about approximate similarity join:
Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
In short, LSH will bucketize your rows so that you avoid comparing all possible pairs. After that you do an approximate similarity join. For instance, if you use bucketed random projection with euclidean distance:
val joined = model.approxSimilarityJoin(data, data, 2.5)
all pairs in joined that are within a distance of 2.5 of each other will be returned. Then you can decide if that approximation is enough to filter out the duplicates, or if you want to calculate the exact similarity between the rows.
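Putting it together, a minimal sketch of the full pipeline could look as follows. Note that LSH operates on a numeric vector column, so the article rows must first be converted into feature vectors (e.g. term-frequency vectors of the titles); the `id` and `features` column names here are assumptions, not part of the original question:

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.sql.functions.col

// Assumes `data` has a unique "id" column and a "features" vector column
// (e.g. produced by HashingTF over the article titles) -- illustrative names.
val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)      // smaller buckets => fewer candidate pairs
  .setNumHashTables(3)       // more tables => higher recall, more computation
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(data)

// Self-join: candidate pairs within euclidean distance 2.5
val joined = model.approxSimilarityJoin(data, data, 2.5, "EuclideanDistance")

// Drop self-pairs and keep only one of each (a, b) / (b, a) duplicate
val pairs = joined.filter(col("datasetA.id") < col("datasetB.id"))
```

The result columns `datasetA`, `datasetB` and the distance column come from `approxSimilarityJoin` itself; the filter on ids is the standard way to deduplicate a self-join.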