Notes on Running Spark

This content is licensed under CC BY-SA 4.0.

1. Spark runtime environment

Spark is written in Scala and runs on the JVM, so it requires Java 7 or later.

If you use the Python API, you need Python 2.6+ or Python 3.4+.

Spark 1.6.2 uses Scala 2.10; Spark 2.0.0 uses Scala 2.11.
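
The Python requirement above (2.6+ on the Python 2 line, 3.4+ on the Python 3 line) can be checked before launching anything. The following is a minimal sketch; `python_supported` is a hypothetical helper name, and the thresholds are simply the ones quoted in this section.

```python
import sys


def python_supported(version_info=None):
    """Return True if the interpreter meets the minimums quoted above.

    Python 2 needs 2.6 or later; Python 3 needs 3.4 or later.
    """
    # Compare only (major, minor); fall back to the running interpreter.
    version = tuple((version_info or sys.version_info)[:2])
    if version[0] == 2:
        return version >= (2, 6)
    return version >= (3, 4)


# Report whether the interpreter running this script qualifies.
print(python_supported())
```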

2. Downloading Spark

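Download page:

http://spark.apache.org/downloads.html

Spark does not require Hadoop to run. If you already have a Hadoop cluster, download the build that matches your Hadoop version.

After downloading, extract the archive.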
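
The extraction step is usually done with `tar`, but it can also be scripted. The sketch below uses Python's standard `tarfile` module; `extract_spark` is a hypothetical helper, and the archive name and target directory in the usage comment are assumptions inferred from the install path that appears later in this post, not values stated by the download page.

```python
import os
import tarfile


def extract_spark(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive into dest_dir.

    Returns the top-level paths that the archive created.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        # Collect the first path component of every member.
        top_level = sorted({m.name.split("/")[0] for m in tar.getmembers()})
        tar.extractall(dest_dir)
    return [os.path.join(dest_dir, name) for name in top_level]


# Example usage (assumed file name and target directory):
# extract_spark("spark-2.0.0-bin-hadoop2.7.tgz", "/opt")
```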

3. Spark directory layout

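bin contains the executables used to interact with Spark, such as the Spark shell.

core, streaming, python, ... contain the source code of the main components.

examples contains some single-machine Spark jobs that you can study and run.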

4. The Spark shell

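The Spark shell lets you work with data that is distributed across a cluster.

Spark loads the data into the memory of the nodes, so distributed processing can often finish within seconds.

Fast iterative computation, real-time queries, and analysis can generally all be done in the shell.

Spark provides both a Python shell and a Scala shell.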

5. Starting the Python shell

[root@master bin]# ./pyspark
Python 2.7.2 (default, Jan  6 2018, 08:58:52)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux3
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py", line 28, in <module>
    import py4j
zipimport.ZipImportError: can't decompress data; zlib not available
>>> exit();
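
The traceback above shows the shell failing while importing py4j with "zlib not available", which suggests that this Python build lacks the zlib module needed to read zipped packages. A quick way to confirm that, as a hedged sketch (`has_zlib` is a hypothetical helper name):

```python
def has_zlib():
    """Return True if this interpreter can import the zlib module."""
    try:
        import zlib  # noqa: F401 (imported only to test availability)
        return True
    except ImportError:
        return False


# False here would be consistent with the ZipImportError shown above.
print(has_zlib())
```

If this prints False, a common remedy is to install the zlib development package and rebuild Python, though the exact steps depend on the system and are not covered in this post.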

6. Starting the Scala shell

[root@master bin]# ./spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/02/04 18:40:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/04 18:40:45 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.0.110:4040
Spark context available as 'sc' (master = local[*], app id = local-1517740844632).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.
scala>

7. Hands-on example

[root@master ~]# cat helloSpark.txt
go to home hello java
so many to hello word kafka java
go to so
scala> val lines = sc.textFile("/root/helloSpark.txt")
lines: org.apache.spark.rdd.RDD[String] = /root/helloSpark.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> lines.count()
res0: Long = 3
scala> lines.first()
res1: String = go to home hello java
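
The same two actions can also be expressed through the Python API. The sketch below is a PySpark counterpart of the Scala session above, not something taken from the original post: `summarize` and `run_with_spark` are hypothetical helper names, and it assumes PySpark is importable and that the file path shown above exists.

```python
def summarize(lines):
    """Return (number of lines, first line) for an RDD of text lines."""
    # count() and first() are RDD actions, so both trigger a Spark job.
    return lines.count(), lines.first()


def run_with_spark(path):
    """Load a text file as an RDD and summarize it."""
    # Imported lazily so that summarize() can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    return summarize(sc.textFile(path))


# Example usage (path taken from the session above):
# run_with_spark("/root/helloSpark.txt")
```

Inside the pyspark shell, where `sc` already exists, calling `summarize(sc.textFile("/root/helloSpark.txt"))` should give a count of 3 and the first line `go to home hello java`, matching the Scala output.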

8. Changing the log level

[root@master conf]# cat log4j.properties
log4j.rootCategory=WARN, console
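
The root logger line shown above controls how verbose the console output is. As a minimal sketch, assuming the file keeps the one-line `log4j.rootCategory=LEVEL, console` form shown above, the level can be rewritten programmatically. `set_root_log_level` is a hypothetical helper, not part of Spark.

```python
import re

# Matches the root logger line and captures everything before the level.
_ROOT_LINE = re.compile(r"^(log4j\.rootCategory\s*=\s*)\w+", re.MULTILINE)


def set_root_log_level(properties_text, level):
    """Return properties_text with the root logger level replaced."""
    return _ROOT_LINE.sub(lambda m: m.group(1) + level, properties_text)


# Rewrite the line from this section; the appender list is left untouched.
print(set_root_log_level("log4j.rootCategory=WARN, console", "ERROR"))
# prints: log4j.rootCategory=ERROR, console
```

For a shell that is already running, the startup banner shown earlier points to `sc.setLogLevel(newLevel)` as the way to change the level for that session only.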