spark-streaming fails to connect to Kafka with java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition

Latest recommended article published 2023-09-20 19:04:24

This content is licensed under CC 4.0 BY-SA.

Tags

#spark #kafka #bigdata

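While trying to set up a streaming connection between Spark 2.4.7 and Kafka 0.10, I hit the error `java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition`. The cause was a mismatched jar. The fix was to use the jar that suits this environment, spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar. After swapping the jar, the job ran normally.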

My versions:

Linux: 16.04
Kafka: 0.10
Scala: 2.11
Spark: 2.4.7
jar: spark-streaming-kafka-0-8_2.11-2.4.7.jar

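When I tried to have Spark receive data from Kafka using the direct approach, the job failed. The Python code is below (it only tests the environment, so the details can be ignored):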

# Spark receives Kafka data through the direct approach; the only goal is to check that the environment works

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils


def start():
    spark_conf = SparkConf().setMaster("local").setAppName("KafkaDirect")
    sc = SparkContext(conf=spark_conf)
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval

    brokers = "localhost:9092"
    topic = 'ratingTopic'
    kafkaStreams = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams={"metadata.broker.list": brokers})
    # kafkaStreams = KafkaUtils.createStream(ssc, brokers, 1, {topic: 1})
    # Count the distribution of the generated random numbers (one count per message key)
    result = kafkaStreams.map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
    # Print the offsets; they could also be written to ZooKeeper here
    # You can use transform() instead of foreachRDD() as your
    # first method call in order to access offsets, then call further Spark methods.
    kafkaStreams.transform(storeOffsetRanges).foreachRDD(printOffsetRanges)
    result.pprint()
    ssc.start()             # Start the computation
    ssc.awaitTermination()  # Wait for the computation to terminate

# Offset ranges of the most recent batch, shared between the two callbacks below
offsetRanges = []

def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    # Prints: topic, partition, start offset, end offset, message count
    for o in offsetRanges:
        print("%s %s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset, o.untilOffset - o.fromOffset))

if __name__ == '__main__':
    start()

The following error appears:

$ python3 SparkDirect.py 
21/05/07 17:03:48 WARN util.Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.164.129 instead (on interface ens33)
21/05/07 17:03:48 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/05/07 17:03:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/05/07 17:03:51 WARN streaming.StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receivingException in thread "Thread-4" java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
	at java.lang.Class.getDeclaredMethods0(Native Method)
	at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
	at java.lang.Class.privateGetPublicMethods(Class.java:2902)
	at java.lang.Class.getMethods(Class.java:1615)
	at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: kafka.common.TopicAndPartition
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 12 more

Traceback (most recent call last):
  File "SparkDirect.py", line 39, in <module>
    start()
  File "SparkDirect.py", line 15, in start
    kafkaStreams = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams={"metadata.broker.list": brokers})
  File "/usr/local/spark/python/pyspark/streaming/kafka.py", line 146, in createDirectStream
    ssc._jssc, kafkaParams, set(topics), jfromOffsets)
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o24.createDirectStreamWithoutMessageHandler

The traceback shows that the failure is mainly at line 15:

kafkaStreams = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams={"metadata.broker.list": brokers})

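Combined with the error itself (the missing class kafka.common.TopicAndPartition belongs to the Kafka library rather than to Spark), my first guess was a mismatched jar, so I looked for a more suitable one.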
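
A suspicion like this can be checked directly, because a jar is an ordinary zip archive and a class such as kafka.common.TopicAndPartition is stored in it as kafka/common/TopicAndPartition.class. The sketch below is only an illustration: the helper name `jar_contains_class` is made up for this post, and the jar paths in the usage note are placeholders.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar at jar_path holds the given fully qualified class."""
    # Inside the archive, dots in the class name become directory separators.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Calling `jar_contains_class("<path-to-jar>", "kafka.common.TopicAndPartition")` on each candidate jar should show which one actually ships the class. The expectation, not verified here, is that the plain spark-streaming-kafka-0-8 jar returns False and the assembly jar returns True.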

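Jars can be downloaded from Maven Central (the search link below is filtered to v:2.1.0, so pick the version that matches your Spark, 2.4.7 in my case): https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.1.0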

In my tests, spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar works with the versions listed above. After swapping in this jar, the job runs normally.
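
The post does not say how the jar was put on the classpath. One option when the script is started with plain `python3`, as in the log above, is the `PYSPARK_SUBMIT_ARGS` environment variable, which PySpark reads when it launches the JVM. The sketch below assumes that route; the helper name `submit_args_with_jars` and the jar location are hypothetical.

```python
import os


def submit_args_with_jars(jar_paths):
    """Build a PYSPARK_SUBMIT_ARGS value that passes the jars via --jars."""
    # --jars takes a comma-separated list. The value should end with
    # "pyspark-shell" when the script is run with plain python.
    return "--jars {} pyspark-shell".format(",".join(jar_paths))


# Hypothetical location; point this at wherever the jar was actually saved.
ASSEMBLY_JAR = "/usr/local/spark/jars/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar"

# This has to happen before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args_with_jars([ASSEMBLY_JAR])
```

With these lines at the top of SparkDirect.py, before `start()` is called, the assembly jar would be handed to the JVM without changing how the script is started.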

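P.S. If you are not sure which jar version is suitable, you can first run without the jar, or with some other version. The resulting error says that the Kafka libraries cannot be found and suggests the version to download: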

Spark Streaming's Kafka libraries not found in class path. Try one of the following.

  1. Include the Kafka library and its dependencies with in the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.7 ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.7.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...

Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.7 is the jar that is needed.
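
The message leaves out the Scala suffix, while the jar file names in this post include it (`_2.11`). As a rough sketch, the full Maven coordinate and jar file name can be assembled from the Scala and Spark versions. The helper name `kafka08_artifact` is made up here, and the naming pattern is inferred from the jar names above.

```python
def kafka08_artifact(scala_version, spark_version, assembly=True):
    """Return (maven_coordinate, jar_file_name) for the Kafka 0.8 integration."""
    name = "spark-streaming-kafka-0-8"
    if assembly:
        # The assembly variant is the jar that worked in this post.
        name += "-assembly"
    artifact_id = "{}_{}".format(name, scala_version)
    coordinate = "org.apache.spark:{}:{}".format(artifact_id, spark_version)
    jar_file = "{}-{}.jar".format(artifact_id, spark_version)
    return coordinate, jar_file


coordinate, jar_file = kafka08_artifact("2.11", "2.4.7")
print(coordinate)  # org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.7
print(jar_file)    # spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar
```

The file name is what would go after `--jars`. With `assembly=False` the function yields the non-assembly names instead, including the spark-streaming-kafka-0-8_2.11-2.4.7.jar listed under "My versions".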