Tag Archives: Spark

How to Solve Spark Writing Files to ODPS Error

Spark read/write ODPS exception

Error messages (a summary of the errors reported across multiple submissions):

ERROR ApplicationMaster: User class threw exception: java.io.IOException: GetFileMeta PANGU_CAPABILITY_NO_PERMISSION PANGU_CAPABILITY_NO_PERMISSION PanguPermissionException When GetFileMeta
Exception in thread "main" org.apache.hadoop.yarn.exceptions.YarnException: com.aliyun.odps.cupid.CupidException: subprocess exit: 512, stderr content: ERROR: ld.so: object '${LD_PRELOAD' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object '${LD_PRELOAD' from LD_PRELOAD cannot be preloaded: ignored.
Caused by: com.aliyun.odps.cupid.CupidException: subprocess exit: 512, stderr content: ERROR: ld.so: object '${LD_PRELOAD' from LD_PRELOAD cannot be preloaded: ignored.
ERROR: ld.so: object '${LD_PRELOAD' from LD_PRELOAD cannot be preloaded: ignored
21/12/09 14:05:23 INFO ShutdownHookManager: Shutdown hook called
, stdout content:
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:180)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:174)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1170)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1552)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/12/09 14:19:11 INFO ShutdownHookManager: Shutdown hook called
, stdout content:
at com.aliyun.odps.cupid.CupidUtil.errMsg2SparkException(CupidUtil.java:43)
at com.aliyun.odps.cupid.CupidUtil.getResult(CupidUtil.java:123)
at com.aliyun.odps.cupid.requestcupid.YarnClientImplUtil.transformAppCtxAndStartAM(YarnClientImplUtil.java:287)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:178)
… 8 more

 

How to Solve:
The --class attribute was wrong when I submitted the jar package.

Incorrect (the --class value should use the package separator "." rather than the path separator "/"):

spark-submit --master yarn-cluster \
--conf spark.hadoop.odps.cupid.history.server.address='XX' \
--conf spark.hadoop.odps.cupid.proxy.domain.name='XX' \
--conf spark.hadoop.odps.moye.trackurl.host='XX' \
--conf spark.hadoop.odps.cupid.proxy.end.point='XX' \
--conf spark.hadoop.odps.cupid.volume.paths='Just store the address directory, no need to specify a specific file name' \
--class com/cctv/bigdata/recall/rank.video.LRRankModel \
/Users/keino/Desktop/recorecall-1.0-SNAPSHOT-shaded.jar 10 10 10 20210701

Correct:

spark-submit --master yarn-cluster \
--conf spark.hadoop.odps.cupid.history.server.address='XX' \
--conf spark.hadoop.odps.cupid.proxy.domain.name='XX' \
--conf spark.hadoop.odps.moye.trackurl.host='XX' \
--conf spark.hadoop.odps.cupid.proxy.end.point='XX' \
--conf spark.hadoop.odps.cupid.volume.paths='Just store the address directory, no need to specify a specific file name' \
--class com.cctv.bigdata.recall.rank.video.LRRankModel \
/Users/keino/Desktop/recorecall-1.0-SNAPSHOT-shaded.jar 10 10 10 20210701

Spark Error: java.lang.StackOverflowError [How to Solve]

Broadcasting a class in Spark throws java.lang.StackOverflowError

Background: a 167 MB tree object needs to be broadcast, so the default stack memory is not enough.
Solution: add the following to spark-submit (for now it runs in local mode, since the data volume is small):

spark-submit \
--class bp_beauty_op.beauty_op \
--master local[*] \
--driver-java-options "-Xss256m" \
test-1.0-SNAPSHOT.jar

Or add spark.driver.extraJavaOptions=-Xss256m in spark-defaults.conf and test the run.
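For reference, a sketch of the equivalent spark-defaults.conf entry (standard Spark property name; the stack size is just the value used above):

spark.driver.extraJavaOptions   -Xss256m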

PySpark error: AttributeError: 'NoneType' object has no attribute '_jvm'

Possible reason 1: using from pyspark.sql.functions import * to bring in the PySpark functions replaces Python built-in functions used inside a UDF with Spark column functions; import only the functions you need, or import the module under an alias, instead.
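A minimal sketch of reason 1 (the sample data and column name are made up for illustration): with a wildcard import, names such as abs and round refer to Spark column functions, and calling them on plain Python values inside a UDF can surface as the _jvm error; importing under an alias keeps the Python built-ins intact.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F          # alias instead of "from pyspark.sql.functions import *"
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Inside the UDF, abs below is the Python built-in, not F.abs,
# because the wildcard import did not shadow it.
@F.udf(returnType=IntegerType())
def plus_abs(x):
    return abs(x) + 1

df = spark.createDataFrame([(-3,), (4,)], ["x"])
df.select(plus_abs("x")).show()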

Possible reason 2: a user-defined UDF is not placed inside the main function (i.e. it is defined before the SparkSession/SparkContext exists), which also causes this error.

Apple M1: How to Solve Spark Running Error

snappy-java-1.1.8.3 (2021-01-20)

Could not initialize class org.xerial.snappy.Snappy
On the M1: no native library is found for os.name=Mac and os.arch=aarch64

Solution:

<dependency>
    <groupId>org.xerial.snappy</groupId>
    <artifactId>snappy-java</artifactId>
    <version>1.1.8.3</version>
    <scope>provided</scope>
</dependency>

The latest snappy-java package supports the M1 chip.

SparkContext: Error Initializing SparkContext [Workaround]


Spark reports an error when configuring a highly available cluster:
ERROR SparkContext: Error initializing SparkContext. java.net.ConnectException: Call From hadoop102/192.168.10.102 to hadoop102:8020 failed on connection exception: java.net.ConnectException: Connection refused

This is because we configured the Spark event log to be stored in HDFS, but Hadoop (HDFS) was not started after the Spark cluster was started, causing an error when submitting tasks.

Solution:

    Option 1: stop storing the event log. Open the spark-defaults.conf file under the Spark installation directory's conf/ folder and comment out the corresponding event log settings.
    Option 2: store the event log locally instead of in HDFS. Replace the event log directory with a local Linux directory (see the spark-defaults.conf sketch after this list).
    Option 3: address the root cause. Since the event log was configured to be stored in HDFS but HDFS was not running, simply start the Hadoop cluster (i.e. the HDFS service).
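For reference, a minimal sketch of the relevant spark-defaults.conf entries (the property names are the standard Spark ones; the directory paths are placeholders for your environment):

# Option 1: disable the event log by commenting these lines out
# spark.eventLog.enabled   true
# spark.eventLog.dir       hdfs://hadoop102:8020/spark-logs

# Option 2: keep the event log but write it to a local directory
spark.eventLog.enabled     true
spark.eventLog.dir         file:///opt/module/spark/eventLog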

[Solved] ERROR PythonRunner: Python worker exited unexpectedly (crashed)

Some time ago, I received a private message from a reader about an error when running in PyCharm: ERROR PythonRunner: Python worker exited unexpectedly (crashed).

In a test run, print(input_rdd.first()) prints fine, but print(input_rdd.count()), which triggers a job, reports the error:

print(input_rdd.count())

ERROR PythonRunner: Python worker exited unexpectedly (crashed) means exactly what it says: the Python worker process crashed.

21/10/24 10:24:48 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
	at java.io.DataOutputStream.writeInt(DataOutputStream.java:199)
	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:476)
	at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:297)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:307)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
21/10/24 10:24:48 ERROR PythonRunner: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
	at java.io.DataOutputStream.writeInt(DataOutputStream.java:199)
	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:476)
	at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:297)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:307)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
21/10/24 10:24:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset by peer: socket write error
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
	at java.io.DataOutputStream.writeInt(DataOutputStream.java:199)
	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:476)
	at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:297)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:307)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
21/10/24 10:24:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOP-RK2V2UMB executor driver): java.net.SocketException: Connection reset by peer: socket write error
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
	at java.io.DataOutputStream.writeInt(DataOutputStream.java:199)
	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:476)
	at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:297)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:307)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)

21/10/24 10:24:48 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "D:/Code/pycode/exercise/pyspark-study/pyspark-learning/pyspark-day04/main/01_web_analysis.py", line 28, in <module>
    print(input_rdd.first())
  File "D:\opt\Anaconda3-2020.11\lib\site-packages\pyspark\rdd.py", line 1586, in first
    rs = self.take(1)
  File "D:\opt\Anaconda3-2020.11\lib\site-packages\pyspark\rdd.py", line 1566, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "D:\opt\Anaconda3-2020.11\lib\site-packages\pyspark\context.py", line 1233, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "D:\opt\Anaconda3-2020.11\lib\site-packages\py4j\java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "D:\opt\Anaconda3-2020.11\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (LAPTOP-RK2V2UMB executor driver): java.net.SocketException: Connection reset by peer: socket write error
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
	at java.io.DataOutputStream.writeInt(DataOutputStream.java:199)
	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:476)
	at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:297)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:307)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
	at java.io.DataOutputStream.writeInt(DataOutputStream.java:199)
	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:476)
	at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:297)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:307)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:621)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)


Process finished with exit code 1

As for the solution, I looked around online: this error can be caused by many different situations. In the case I helped with here, Spark was running locally on a Windows machine and the problem was with the software: the data volume was a bit large, and running it in PyCharm can report this error.

Without further ado, the fix for this reader was very simple: close PyCharm, open it again, and rerun. Note that if that does not work, close it and rerun once more.

[Solved] org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow

The following error is reported when running a Spark task:

21/10/09 14:56:32 ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 4)
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 93. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:299)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:265)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
21/10/09 14:56:32 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 5, localhost, partition 2,NODE_LOCAL, 2262 bytes)
21/10/09 14:56:32 INFO Executor: Running task 2.0 in stage 2.0 (TID 5)
21/10/09 14:56:32 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
21/10/09 14:56:32 WARN TaskSetManager: Lost task 1.0 in stage 2.0 (TID 4, localhost): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 93. To avoid this, increase spark.kryoserializer.buffer.max value.
	at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:299)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:265)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

21/10/09 14:56:32 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
21/10/09 14:56:32 ERROR TaskSetManager: Task 1 in stage 2.0 failed 1 times; aborting job

Solution:

# Add these parameters when submitting the task
--conf spark.kryoserializer.buffer.max=2048m --conf spark.kryoserializer.buffer=512m

[Solved] Hive on Spark: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask

Problem Description:
When deploying Hive on Spark, the test reports an error: creating a table succeeds, but the following error occurs when running an insert:

Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session 2df0eb9a-15b4-4d81-aea1-24b12094bf44)'
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 2df0eb9a-15b4-4d81-aea1-24b12094bf44

Check the Hive log (under the /tmp/xiaobai path) around the time of the failure:

Cause analysis:
The log shows "timed out waiting for client connection", which indicates that the connection between Hive and Spark timed out.

Solution
1). Rename the spark-env.sh.template file in the /opt/module/spark/conf/ directory to spark-env.sh, then add: export SPARK_DIST_CLASSPATH=$(hadoop classpath)
2). Edit hive-site.xml in the /opt/module/hive/conf directory to increase the connection timeout between Hive and Spark (a sketch follows below).
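A minimal sketch of the hive-site.xml change, assuming the standard Hive client-connection timeout properties (the property names exist in Hive, but the values here are only illustrative; check them against your Hive version):

<!-- Give Hive more time to create/connect to the Spark client (illustrative values) -->
<property>
    <name>hive.spark.client.connect.timeout</name>
    <value>10000ms</value>
</property>
<property>
    <name>hive.spark.client.server.connect.timeout</name>
    <value>300000ms</value>
</property>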

Execute the insert statement again. Success! Tears of joy.

I hit this error last night and spent the whole night checking without solving it; it was finally solved today.

How to Solve Error Importing Scala Word2VecModel

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
Model save:
Link: http://spark.apache.org/docs/2.3.4/api/scala/index.html#org.apache.spark.mllib.feature.Word2VecModel

model.save(spark.sparkContext, config.model_path)

Model load:
Link: http://spark.apache.org/docs/2.3.4/api/scala/index.html#org.apache.spark.mllib.feature.Word2VecModel$

var model = Word2VecModel.load(spark.sparkContext, config.model_path)

Error when loading:

Exception in thread "main" java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:312)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
	at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1337)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
	at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1372)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.first(RDD.scala:1371)
	at org.apache.spark.mllib.util.Loader$.loadMetadata(modelSaveLoad.scala:129)
	at org.apache.spark.mllib.feature.Word2VecModel$.load(Word2Vec.scala:699)
	at job.ml.embeddingModel.graphEmbedding$.run(graphEmbedding.scala:40)
	at job.ml.embeddingModel.graphEmbedding$.main(graphEmbedding.scala:24)
	at job.ml.embeddingModel.graphEmbedding.main(graphEmbedding.scala)
	

Add to the POM file:

    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>15.0</version>
    </dependency>

Run OK again!

Zeppelin starts successfully, but an error is reported

Error message

The Zeppelin service started successfully and the UI was accessible normally, but running code reported an error.

org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:241)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:225)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:229)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:229)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:171)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Cause: this error occurs because Zeppelin fails to connect to the corresponding Spark (and other related) interpreters. If the Spark and Hadoop services are running normally, the cause is version incompatibility.

Solution

Replace with a compatible Zeppelin version.