pyspark - Spark output when submitting via YARN cluster vs. client
I am new to Spark and got it running on a cluster (Spark 2.0.1 on a 9-node cluster running the community version of MapR). I submit the wordcount example via

    ./bin/spark-submit --master yarn --jars ~/hadoopperma/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md

and get the following output:

    17/04/07 13:21:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    : 68
    help: 1
    when: 1
    hadoop: 3
    ...

So it looks like it is working properly. When I add --deploy-mode cluster, I get the following output:

    17/04/07 13:23:52 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

So there are no errors, but I am not seeing the wordcount results. What am I missing? I can see the job in the history server, and it says it completed successfully. I also checked my user directory in DFS, but no new files were written except for an empty directory: /user/myuser/.sparkStaging
Code (the wordcount.py example shipped with Spark):

    from __future__ import print_function

    import sys
    from operator import add

    from pyspark.sql import SparkSession


    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print("Usage: wordcount <file>", file=sys.stderr)
            exit(-1)

        spark = SparkSession\
            .builder\
            .appName("PythonWordCount")\
            .getOrCreate()

        # Read the file as a DataFrame of lines and unwrap each Row to a string
        lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
        counts = lines.flatMap(lambda x: x.split(' ')) \
                      .map(lambda x: (x, 1)) \
                      .reduceByKey(add)
        # collect() brings all results back to the driver process
        output = counts.collect()
        for (word, count) in output:
            print("%s: %i" % (word, count))

        spark.stop()
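The reason the output is not printed:

When you run in client mode, the node from which you submit the job is the driver. When you call collect, the result is collected on that node, so the print output appears in your terminal.

In cluster mode, the driver runs on some other node in the cluster, not the one from which you submitted the job. When you call collect, the result is collected on that driver node and printed there. You can find the result in the driver's stdout, which ends up in the YARN container logs (for example with yarn logs -applicationId <application id>, if log aggregation is enabled). A better approach is to write the output somewhere in HDFS.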
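Writing the result to HDFS instead of collecting it could look roughly like the following. This is a minimal sketch rather than the shipped example: it assumes the script takes a second argument for the output directory, and the names format_count, run and the app name PythonWordCountToFile are made up for illustration. saveAsTextFile writes one part file per partition and fails if the output directory already exists.

```python
import sys
from operator import add


def format_count(pair):
    """Turn a (word, count) tuple into the same 'word: count' line the example prints."""
    word, count = pair
    return "%s: %i" % (word, count)


def run(input_path, output_path):
    # Imported here so format_count can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PythonWordCountToFile").getOrCreate()

    lines = spark.read.text(input_path).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)

    # The executors write part files under output_path, so the result
    # does not depend on which node happens to run the driver.
    counts.map(format_count).saveAsTextFile(output_path)

    spark.stop()


# Hypothetical usage: wordcount_to_file.py <input file> <output directory>
if __name__ == "__main__" and len(sys.argv) == 3:
    run(sys.argv[1], sys.argv[2])
```

With this variant the same submit command should behave the same in client and cluster mode, and the counts can be read back from the output directory afterwards.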
The reason for the spark.yarn.jars warning:

To run a Spark job, YARN needs the Spark binaries to be available on all nodes of the cluster. If they are not already available there, then as part of job preparation Spark creates a zip file with all the jars under $SPARK_HOME/jars and uploads it to the distributed cache.
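The warning also names spark.yarn.archive as an alternative to spark.yarn.jars. According to the Spark documentation, such an archive should contain the jar files in its root directory. As a rough illustration of what that packaging step looks like, here is a small sketch that builds one locally; the function name build_spark_archive and the file name spark-libs.zip are assumptions for this example, and the archive would still have to be copied to HDFS and referenced from spark.yarn.archive.

```python
import os
import zipfile


def build_spark_archive(jars_dir, archive_path):
    """Zip every .jar in jars_dir into archive_path, with the jars at the archive root.

    Returns the sorted list of jar file names that were added.
    """
    names = sorted(n for n in os.listdir(jars_dir) if n.endswith(".jar"))
    with zipfile.ZipFile(archive_path, "w") as archive:
        for name in names:
            # arcname=name drops the directory part, so the jar sits at the root
            archive.write(os.path.join(jars_dir, name), arcname=name)
    return names


# Hypothetical usage, assuming SPARK_HOME is set:
#   build_spark_archive(os.path.join(os.environ["SPARK_HOME"], "jars"), "spark-libs.zip")
```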
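To solve this:

By default, Spark on YARN uses the Spark jars installed locally, but the Spark jars can also be placed in a world-readable location on HDFS (for example, chmod 777, although read access for all users is sufficient). This allows YARN to cache them on the nodes so that they do not need to be distributed each time an application runs. To point to jars on HDFS, for example, set spark.yarn.jars to hdfs:///some/path.

After placing the jars there, run your code like this:

    ./bin/spark-submit --master yarn --conf spark.yarn.jars="hdfs:///some/path" --jars ~/hadoopperma/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md

Note that options such as --conf must come before the application file; anything after it is passed to the script as an argument.

Source: http://spark.apache.org/docs/latest/running-on-yarn.html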
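If you launch jobs from a script, a small helper can keep the argument order right: spark-submit options first, then the application file, then the application's own arguments. The sketch below only assembles the argument list; the helper name spark_submit_args and its parameters are made up for this example, while the flags themselves (--master, --deploy-mode, --jars, --conf) are standard spark-submit options.

```python
def spark_submit_args(app, app_args, master="yarn", deploy_mode="client",
                      conf=None, jars=None):
    """Build a spark-submit argument list with all options before the application file."""
    args = ["./bin/spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if jars:
        # --jars takes a single comma-separated list
        args += ["--jars", ",".join(jars)]
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", "%s=%s" % (key, value)]
    # Everything after the application file is handed to the script itself
    args.append(app)
    args += list(app_args)
    return args


cmd = spark_submit_args(
    "examples/src/main/python/wordcount.py",
    ["./README.md"],
    deploy_mode="cluster",
    conf={"spark.yarn.jars": "hdfs:///some/path"},
)
# To actually submit: import subprocess; subprocess.check_call(cmd)
```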