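My versions:
- Linux: 16.04
- Kafka: 0.10
- Scala: 2.11
- Spark: 2.4.7
- jar: spark-streaming-kafka-0-8_2.11-2.4.7.jar
While trying to have Spark receive data from Kafka using the direct approach, I hit an error. The Python code is below (it only exercises the test environment, so feel free to skip it):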
# Receive Kafka data in Spark via the direct approach; the only goal is to check that the test environment works
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
def start():
    spark_conf = SparkConf().setMaster("local").setAppName("KafkaDirect")
    sc = SparkContext(conf=spark_conf)
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 1)
    brokers = "localhost:9092"
    topic = 'ratingTopic'
    kafkaStreams = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams={"metadata.broker.list": brokers})
    #kafkaStreams = KafkaUtils.createStream(ssc,brokers,1,{topic:1})
    # Count the distribution of the generated random numbers
    result = kafkaStreams.map(lambda x: (x[0], 1)).reduceByKey(lambda x, y: x + y)
    # Print the offsets; they could also be written to ZooKeeper here
    # You can use transform() instead of foreachRDD() as your
    # first method call in order to access offsets, then call further Spark methods.
    kafkaStreams.transform(storeOffsetRanges).foreachRDD(printOffsetRanges)
    result.pprint()
    ssc.start()  # Start the computation
    ssc.awaitTermination()  # Wait for the computation to terminate
offsetRanges = []
def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd
def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("%s %s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset, o.untilOffset - o.fromOffset))
if __name__ == '__main__':
    start()
The following error appeared:
$ python3 SparkDirect.py
21/05/07 17:03:48 WARN util.Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.164.129 instead (on interface ens33)
21/05/07 17:03:48 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/05/07 17:03:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/05/07 17:03:51 WARN streaming.StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
response = connection.send_command(command)
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receivingException in thread "Thread-4" java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetPublicMethods(Class.java:2902)
at java.lang.Class.getMethods(Class.java:1615)
at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: kafka.common.TopicAndPartition
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 12 more
Traceback (most recent call last):
File "SparkDirect.py", line 39, in <module>
start()
File "SparkDirect.py", line 15, in start
kafkaStreams = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams={"metadata.broker.list": brokers})
File "/usr/local/spark/python/pyspark/streaming/kafka.py", line 146, in createDirectStream
ssc._jssc, kafkaParams, set(topics), jfromOffsets)
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o24.createDirectStreamWithoutMessageHandler
The traceback shows that the failure is at line 15 of SparkDirect.py:
kafkaStreams = KafkaUtils.createDirectStream(ssc, [topic], kafkaParams={"metadata.broker.list": brokers})
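Combined with the error message (java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition, meaning the Kafka class could not be found on the classpath), I suspected that the jar did not match and decided to switch to a suitable one.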
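One way to check this kind of suspicion is to look inside the jar for the class named in the error. The snippet below is a minimal sketch using only the Python standard library; the helper name jar_contains_class is made up for illustration and is not part of Spark or Kafka.

```python
import zipfile

def jar_contains_class(jar_path, class_name):
    """Return True if the jar (a zip archive) contains the given class.

    Hypothetical helper for illustration only.
    """
    # A class a.b.C is stored inside the jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, calling jar_contains_class(path_to_jar, "kafka.common.TopicAndPartition") on a candidate jar tells you whether that jar could satisfy the class lookup that failed above.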
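Jar download address (Maven Central search; adjust the version filter in the query as needed): https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.1.0
In my environment, spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar turned out to work. After switching to this jar, the program ran normally.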
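If the script is launched with plain python3, as above, one possible way to put the assembly jar on the classpath is the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it starts the JVM. This is only a sketch: the jar path is a placeholder, the helper name is hypothetical, and this post does not say how the jar was actually installed.

```python
import os

def submit_args_for_jar(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that adds one jar (hypothetical helper)."""
    # The trailing "pyspark-shell" token is needed when the script is
    # started with python3 rather than through spark-submit.
    return "--jars {} pyspark-shell".format(jar_path)

# Placeholder path (an assumption); point it at wherever the jar was saved.
ASSEMBLY_JAR = "/path/to/spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar"

# This has to be set before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args_for_jar(ASSEMBLY_JAR)
```

These lines would go at the top of the script, before SparkContext(conf=spark_conf) runs.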
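PS: If you are not sure which jar version is appropriate, you can first run without the jar, or with a jar of a different version. The error then reports that the library was not found and suggests the appropriate version to download: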
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.7 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.7.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.7 is the version required.
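The artifact id and version in that message map onto the jar file name once the Scala binary version is added. Here is a small sketch of that naming pattern; the function names are invented for illustration.

```python
def kafka08_assembly_jar(scala_version, spark_version):
    """File name of the Kafka 0.8 assembly jar (hypothetical helper)."""
    # Pattern: <artifactId>_<Scala binary version>-<Spark version>.jar
    return "spark-streaming-kafka-0-8-assembly_{}-{}.jar".format(
        scala_version, spark_version)

def kafka08_assembly_coordinate(scala_version, spark_version):
    """Maven coordinate of the same artifact (hypothetical helper)."""
    # Pattern: <groupId>:<artifactId>_<Scala binary version>:<version>
    return "org.apache.spark:spark-streaming-kafka-0-8-assembly_{}:{}".format(
        scala_version, spark_version)

# With the versions listed in this post (Scala 2.11, Spark 2.4.7):
print(kafka08_assembly_jar("2.11", "2.4.7"))
# prints: spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar
```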
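Summary: connecting Spark 2.4.7 to Kafka 0.10 for streaming failed with `java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition`. The cause was a jar mismatch. The fix was to use spark-streaming-kafka-0-8-assembly_2.11-2.4.7.jar, which suits this environment. After switching jars, the program ran normally.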
