1. Spark SQL基础
1.1 什么是Spark SQL
Spark SQL是Spark用来处理结构化数据(二维表数据)的一个模块,它提供一个编程抽象DataFrame并作为分布式SQL查询引擎的作用。
为什么要学习Spark SQL? Hive(离线)是将HQL转换成MR然后提交到集群上执行,大大简化了编写MR程序的复杂性,但是Hive(离线)如果数据量大,执行很慢,所以Spark SQL诞生了,它将Spark SQL转换成RDD,然后提交到集群执行,执行效率飞快。同时Spark SQL也支持从Hive中读取数据。
特点:
容易集成、统一的数据访问方式、兼容Hive、标准的数据连接
1.2 核心
DataFrame(表) = schema(表结构)+ Data(表数据,RDD)
DataFrame表现形式就是个RDD
1.3 建表(DataFrame)
1.3.1 使用case class定义表(过时了已经)
A.定义case class(表结构schema)
case class Student1(id:Long,name:String,weight:Double,age:Int)
B.导入数据(可以是文件)
val lines: RDD[String] = sc.parallelize(List("1,tom,99,28","2,mary,87,22","3,jack,100,36"))
C.生成表
var studentRDD: RDD[Student1] = lines.map(line=>{
val fields = line.split(",")
val id = fields(0).toLong
val name = fields(1)
val weight = fields(2).toDouble
val age = fields(3).toInt
Student1(id,name,weight,age)
})
由studentRDD直接生成表
val df = studentRDD.toDF
附代码:
package com.bigdata.spark.day1021
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
object SQLDemo {
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
val conf = new SparkConf().setAppName("SQLDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
//SQLContext是SparkContext的一个包装(增强功能,可以处理结构化数据)
val sqlContext = new SQLContext(sc)
//DataFrame = RDD + Schema
//将RDD添加额外信息,变成DF
val lines: RDD[String] = sc.parallelize(List("1,tom,99,28","2,mary,87,22","3,jack,100,36"))
var studentRDD: RDD[Student1] = lines.map(line=>{
val fields = line.split(",")
val id = fields(0).toLong
val name = fields(1)
val weight = fields(2).toDouble
val age = fields(3).toInt
Student1(id,name,weight,age)
})
//将RDD转换成DataFrame
//导入隐式转换
import sqlContext.implicits._
val df = studentRDD.toDF
//对DataFrame进行操作
//使用sql风格的API
df.registerTempTable("student")
//不会执行,因为sql是一个transformation
val result: DataFrame =sqlContext.sql("select name,weight from student order by weight desc,age asc")
//触发action
result.show()
sc.stop()
}
}
//定义一个case class,将数据保存到case class
//case class :不用new,实现序列化,模式匹配
case class Student1(id:Long,name:String,weight:Double,age:Int)
D.操作
result.show()
df.printSchema()


1.3.2 使用SparkSession对象
根据SparkShell启动日志,Spark session available as ‘spark’.
2.0以后,Spark的新访问接口提供统一的访问方式(Spark SQL的特点)
通过Spark Session可以访问Spark的所有模块
package com.bigdata.spark.day1021
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
/*
Spark SQL 2.X
*/
object SQLDemo3 {
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
// val conf = new SparkConf().setAppName("SQLDemo").setMaster("local[*]")
// val sc = new SparkContext(conf)
// //SQLContext是SparkContext的一个包装(增强功能,可以处理结构化数据)
// val sqlContext = new SQLContext(sc)
//在Spark 2.0以后的编程API的入口为Spark Session
val session: SparkSession = SparkSession.builder()
.appName("SQLDemo3")
.master("local[*]")
.getOrCreate()
//DataFrame = RDD + Schema
//将RDD添加额外信息,变成DF
val lines: RDD[String] = session.sparkContext.parallelize(List("1,tom,99,28","2,mary,87,22","3,jack,100,36"))
//整理数据,将RDD管理schema信息
val rowRDD: RDD[Row] = lines.map(line=>{
val fields = line.split(",")
val id = fields(0).toLong
val name = fields(1)
val weight = fields(2).toDouble
val age = fields(3).toInt
Row(id,name,weight,age)
})
val schema = StructType(
List(
StructField("id",LongType),
StructField("name",StringType),
StructField("weight",DoubleType),
StructField("age",IntegerType)
)
)
val df: DataFrame = session.createDataFrame(rowRDD,schema)
//对df进行处理,注册表(视图)
df.createTempView("student")
val result: DataFrame = session.sql("select * from student order by age desc,weight asc")
result.show()
session.stop()
}
}
1.3.3 直接读取一个带格式的文件(json)(最简单)
前提:数据文件本身要有一定的格式
使用Spark Example提供的测试数据(/root/app/spark-2.2.1-bin-hadoop2.7/examples/src/main/resources/people.json)

A.把文件复制一份到我们方便操作的目录中
cp people.json /textfile/
B.使用SparkSession直接读取文件
val peopleDF = spark.read.json("/textfile/people.json")
peopleDF.show

1.4 操作表(DataFrame):支持两种语言 SQL、DSL
1.4.1 SQL
前提条件:把DataFrame注册成一个Table或View
peopleDF.createOrReplaceTempView("people")
spark.sql("select * from people").show
spark.sql("select * from people where age = 30 ").show

1.4.2 DSL(一般不用)
A.peopleDF.show
B.peopleDF.select(“name”).show
C.peopleDF.select("age","age","age",“name”,$“age”+10).show

D.peopleDF.groupBy($“age”).count.show

1.5 视图
spark分为两种视图:
1.5.1 普通视图
上文创建的peopleDF.createOrReplaceTempView(“people”)为普通视图
只在当前的会话中有效(建立新的session后会报错)

1.5.2 全局视图
peopleDF.createGlobalTempView("people")
spark.sql("select * from global_temp.people1").show
作用域:在全局(不同的会话中)有效
注意:全局视图是创建在Spark Sql中的全局“数据库”,加前缀:global_temp

1.6 Spark SQL WordCount
1.6.1 SQL实现WordCount
package com.bigdata.spark.day1021
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
/*
SQL语法实现WC
*/
object SQLWC {
val filePath = "D:\\testFile\\wc.txt"
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
val spark: SparkSession = SparkSession.builder()
.appName("SQLDemo3")
.master("local[*]")
.getOrCreate()
//指定以后从哪里读取数据,DataSet返回的是只有一列的DataFrame
val lines: Dataset[String] = spark.read.textFile(filePath)
//整理数据,切分压平
//DataSet(lines)调用RDD的方法(flatMap),必须导入隐式转换
import spark.implicits._
val words: Dataset[String] = lines.flatMap(_.split(" "))
//注册临时视图
words.createTempView("w_words")
val result: DataFrame = spark.sql("select value word ,count(*) counts from w_words group by word order by counts")
result.show()
spark.stop()
}
}

1.6.2 DSL实现WordCount
package com.bigdata.spark.day1021
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{DataFrame, Dataset, RelationalGroupedDataset, SparkSession}
/**
* DSL实现WC
*/
object DSLWC {
val filePath = "D:\\testFile\\wc.txt"
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
val spark: SparkSession = SparkSession.builder()
.appName("SQLDemo3")
.master("local[*]")
.getOrCreate()
//指定以后从哪里读取数据,DataSet返回的是只有一列的DataFrame
val lines: Dataset[String] = spark.read.textFile(filePath)
//整理数据,切分压平
//DataSet(lines)调用RDD的方法(flatMap),必须导入隐式转换
import spark.implicits._
val words: Dataset[String] = lines.flatMap(_.split(" "))
//分组计算
val group: RelationalGroupedDataset = words.groupBy($"value" as "word")
val count: DataFrame = group.count()
count.show()
spark.stop()
}
}


2884

被折叠的 条评论
为什么被折叠?



