pyspark - Spark output when submitting via YARN cluster vs. client

February 15, 2010

I am new to Spark and just got it running on our cluster (Spark 2.0.1 on a 9-node cluster running the community version of MapR). I submit the wordcount example via

./bin/spark-submit --master yarn --jars ~/hadoopperma/jars/hadoop-lzo-0.4.21-snapshot.jar examples/src/main/python/wordcount.py ./readme.md

and get the following output:

17/04/07 13:21:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
: 68
help: 1
when: 1
hadoop: 3
...

So it looks like it is working properly. When I add --deploy-mode cluster, I get the following output:

17/04/07 13:23:52 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

There are no errors, but I am not seeing the wordcount results. What am I missing? I can see the job in the history server, and it says it completed successfully. I also checked my user directory in DFS, but no new files were written except for an empty directory: /user/myuser/.sparkStaging

The code (the wordcount.py example shipped with Spark):

from __future__ import print_function

import sys
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    # Read the input file as an RDD of lines.
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    # Split each line into words and sum the occurrences per word.
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    # collect() brings the results back to the driver, which prints them.
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()

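The reason the output is not printed:

When you run in client mode, the node from which you initiate the job is the driver, so when you collect the result it is collected on that node and you can print it there.

In yarn-cluster mode the driver runs on some other node, not the one from which you initiated the job. So when you call .collect(), the result is collected on that node and printed there. You can find the printed result in the stdout of the driver (for example with yarn logs -applicationId <application id>, if log aggregation is enabled). A better approach is to write the output somewhere in HDFS.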
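As a rough sketch of the "write the output to HDFS" approach, the example could be changed to save the counts instead of collecting and printing them. The second command-line argument, the helper name format_count, and the app name are assumptions made for illustration; they are not part of the wordcount.py shipped with Spark.

```python
from __future__ import print_function

import sys
from operator import add


def format_count(pair):
    """Render one (word, count) pair the same way the original print loop does."""
    word, count = pair
    return "%s: %i" % (word, count)


def run(input_path, output_path):
    # Imported here so the helper above can be used without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PythonWordCountToFile").getOrCreate()

    lines = spark.read.text(input_path).rdd.map(lambda r: r[0])
    counts = (lines.flatMap(lambda x: x.split(' '))
                   .map(lambda x: (x, 1))
                   .reduceByKey(add))

    # The executors write part files under output_path, so the result
    # no longer depends on where the driver's stdout ends up.
    counts.map(format_count).saveAsTextFile(output_path)

    spark.stop()


if __name__ == "__main__":
    # Hypothetical usage: <script> <input> <output-dir>
    if len(sys.argv) == 3:
        run(sys.argv[1], sys.argv[2])
    else:
        print("Usage: wordcount_to_file <input> <output-dir>", file=sys.stderr)
```

Submitted with an output directory on the cluster's filesystem as the second argument, this should leave the counts in part files under that directory in both client and cluster mode. Note that saveAsTextFile fails if the output directory already exists.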

The reason for the spark.yarn.jars warning:

To run a Spark job, YARN needs the Spark binaries to be available on all nodes of the cluster. If these binaries are not available, then as part of job preparation Spark creates a zip file with all the jars under $SPARK_HOME/jars and uploads it to the distributed cache.

To solve this:

By default, Spark on YARN uses the Spark jars installed locally, but the Spark jars can also be placed in a world-readable (chmod 777) location on HDFS. This allows YARN to cache them on the nodes so that they do not need to be distributed each time an application runs. To point to jars on HDFS, for example, set spark.yarn.jars to hdfs:///some/path.

After placing the jars there, run your code like this (the --conf option has to come before the application file, otherwise it is passed to wordcount.py as an argument):

./bin/spark-submit --master yarn --conf spark.yarn.jars="hdfs:///some/path" --jars ~/hadoopperma/jars/hadoop-lzo-0.4.21-snapshot.jar examples/src/main/python/wordcount.py ./readme.md
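Note that spark-submit treats everything after the application file as arguments to the application, so options such as --conf need to come before wordcount.py. If you launch jobs from a script, a small helper can keep that ordering straight. This is only a sketch: build_submit_command is a hypothetical name, and the paths are the placeholders used above.

```python
def build_submit_command(app_file, app_args, jars_location, deploy_mode="client"):
    """Assemble a spark-submit argv with all options ahead of the application file."""
    return [
        "./bin/spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--conf", "spark.yarn.jars=" + jars_location,
        app_file,            # everything after this is passed to the application
    ] + list(app_args)


if __name__ == "__main__":
    cmd = build_submit_command(
        "examples/src/main/python/wordcount.py",
        ["./readme.md"],
        "hdfs:///some/path",  # placeholder location from the text above
    )
    print(" ".join(cmd))
```

The resulting list can be handed to subprocess.check_call(cmd) from the Spark home directory to actually submit the job.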

source : http://spark.apache.org/docs/latest/running-on-yarn.html
