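Preface
This article walks through the source code behind Spark on Yarn job submission. The goal is to understand the overall submission flow and the code involved, so that when a problem occurs you can quickly identify which stage caused it and resolve it.
Source Flow Overview
Spark on Yarn job submission flow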

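- When a Spark job is submitted locally in cluster mode, the main method of SparkSubmit starts first. It loads the Client class by reflection and runs its main method, which creates yarnClient, the client used to communicate with the Yarn cluster.
- Client.run then submits the application to the ResourceManager of the Yarn cluster by calling submitApplication(). The content sent to the ResourceManager mainly includes the container, the Java launch command, and the ApplicationMaster launch command.
- Once the Client has contacted the ResourceManager and requested an ApplicationMaster, the ResourceManager starts a container and the ApplicationMaster process on a suitable NodeManager.
- ApplicationMaster.main starts the Driver thread, which runs the main method of the user-specified class (initializing SparkContext, splitting the job into Stages, and so on). At the same time it creates a YarnRMClient, which is used to communicate with the ResourceManager and request resources. Communication between server processes goes through the RPC framework.
- The ApplicationMaster then registers with the ResourceManager and requests resources. The ResourceManager checks which resources are available in the Yarn cluster.
- The ApplicationMaster receives the list of available containers returned by the ResourceManager and begins allocating them (allocation principle: moving computation is cheaper than moving data, so process-local placement is preferred).
- The ApplicationMaster then creates an NMClient (NodeManager client) to connect to the NodeManagers and tells each one to start a container and a CoarseGrainedExecutorBackend (Executor).
- CoarseGrainedExecutorBackend.run starts the backend, which registers back with the Driver (in cluster mode the Driver runs inside the ApplicationMaster). This tells the Driver that the Executor is ready, and lets resources be requested again to rerun tasks if an Executor dies.
- When the registration-success message comes back, the Executor is started and begins executing tasks.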
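Source Code in Detail
Tip: it is best to read this alongside a local copy of the source code and the flow diagram above.
Submitting the Spark Job locally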
org.apache.spark.deploy.SparkSubmit
/**
* @Author: Small_Ran
* @Date: 2022/5/24
* @param args the arguments passed in, for example: ./bin/spark-submit --master yarn-cluster --name SparkTast ....
* @Description:
* Entry point of the program
*/
override def main(args: Array[String]): Unit = {
// Wrap the arguments that were passed in
val appArgs = new SparkSubmitArguments(args)
if (appArgs.verbose) {
// scalastyle:off println
printStream.println(appArgs)
// scalastyle:on println
}
appArgs.action match {
// Start submitting the job
case SparkSubmitAction.SUBMIT => submit(appArgs)
case SparkSubmitAction.KILL => kill(appArgs)
case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
}
}
Execution starts from the main method of SparkSubmit.
org.apache.spark.deploy.SparkSubmitArguments
/**
* @Author: Small_Ran
* @Date: 2022/5/24
* @Description:
* Parses and encapsulates the arguments from the spark-submit script
*/
private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, String] = sys.env)
extends SparkSubmitArgumentsParser {
var master: String = null
var deployMode: String = null
var executorMemory: String = null
var executorCores: String = null
var totalExecutorCores: String = null
var propertiesFile: String = null
var driverMemory: String = null
var driverExtraClassPath: String = null
var driverExtraLibraryPath: String = null
var driverExtraJavaOptions: String = null
var queue: String = null
var numExecutors: String = null
var files: String = null
var archives: String = null
var mainClass: String = null
var primaryResource: String = null
var name: String = null
................. some code omitted ....................
// If --class was not given, fall back to the Main-Class attribute in the JAR manifest
mainClass = jar.getManifest.getMainAttributes.getValue("Main-Class")
The SparkSubmitArguments class mainly parses and encapsulates the arguments of the spark-submit script.
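To make the idea of wrapping command-line options into fields concrete, here is a minimal sketch. It is a simplified model, not Spark's actual parser: the SubmitArgs case class and the parse function are hypothetical names, and only three options are recognized.

```scala
// Simplified model of spark-submit option parsing; this is NOT Spark's real parser.
object SubmitArgsSketch {
  // Hypothetical container for a few of the fields listed in SparkSubmitArguments.
  final case class SubmitArgs(
      master: Option[String] = None,
      deployMode: Option[String] = None,
      mainClass: Option[String] = None,
      primaryResource: Option[String] = None,
      childArgs: List[String] = Nil)

  // Consume "--option value" pairs until the first non-option token, which is
  // treated as the primary resource; everything after it goes to the application.
  @scala.annotation.tailrec
  def parse(tokens: List[String], acc: SubmitArgs = SubmitArgs()): SubmitArgs = tokens match {
    case "--master" :: value :: rest      => parse(rest, acc.copy(master = Some(value)))
    case "--deploy-mode" :: value :: rest => parse(rest, acc.copy(deployMode = Some(value)))
    case "--class" :: value :: rest       => parse(rest, acc.copy(mainClass = Some(value)))
    case resource :: rest if !resource.startsWith("--") =>
      acc.copy(primaryResource = Some(resource), childArgs = rest)
    case unknown :: _ =>
      throw new IllegalArgumentException(s"Unrecognized or incomplete option: $unknown")
    case Nil => acc
  }

  def main(args: Array[String]): Unit = {
    val parsed = parse(List("--master", "yarn", "--deploy-mode", "cluster",
      "--class", "com.example.SparkTest", "app.jar", "input.txt"))
    println(parsed)
  }
}
```

The real class supports many more options and also merges values from the properties file and the environment, which this sketch leaves out.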
org.apache.spark.deploy.SparkSubmit
/**
* @Author: Small_Ran
* @Date: 2022/5/24
* @Description:
* Submits the application using the provided arguments
*/
private def submit(args: SparkSubmitArguments): Unit = {
// Prepare the submission environment, such as the main class (childMainClass)
val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
def doRunMain(): Unit = {
if (args.proxyUser != null) {
val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
UserGroupInformation.getCurrentUser())
try {
proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
override def run(): Unit = {
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
}
})
} catch {
case e: Exception =>
// Hadoop's AuthorizationException suppresses the exception's stack trace, which
// makes the message printed to the output by the JVM not very helpful. Instead,
// detect exceptions with empty stack traces here, and treat them differently.
if (e.getStackTrace().length == 0) {
// scalastyle:off println
printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
// scalastyle:on println
exitFn(1)
} else {
throw e
}
}
} else {
// Start running the main class
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
}
}
org.apache.spark.deploy.SparkSubmit
/**
* @Author: Small_Ran
* @Date: 2022/5/24
* @Description:
* Prepares the submission environment, such as the main class (childMainClass)
*/
private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
: (Seq[String], Seq[String], Map[String, String], String) = {
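................. some code omitted ....................
// In client deploy mode the main class is the user-specified class (--class SparkTest);
// yarn-cluster mode also enters this branch, but childMainClass is overridden further down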
if (deployMode == CLIENT || isYarnCluster) {
childMainClass = args.mainClass
if (isUserJar(args.primaryResource)) {
childClasspath += args.primaryResource
}
if (args.jars != null) { childClasspath ++= args.jars.split(",") }
}
................. some code omitted ....................
// In yarn-cluster mode, use org.apache.spark.deploy.yarn.Client
if (isYarnCluster) {
childMainClass = "org.apache.spark.deploy.yarn.Client"
if (args.isPython) {
childArgs += ("--primary-py-file", args.primaryResource)
childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
} else if (args.isR) {
val mainFile = new Path(args.primaryResource).getName
childArgs += ("--primary-r-file", mainFile)
childArgs += ("--class", "org.apache.spark.deploy.RRunner")
} else {
if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
childArgs += ("--jar", args.primaryResource)
}
childArgs += ("--class", args.mainClass)
}
if (args.childArgs != null) {
args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
}
}
The prepareSubmitEnvironment() method mainly prepares the environment for submitting the application.
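The branch logic above can be summarized in a small sketch. This is a simplified model that covers only the two cases shown in the excerpt (client deploy mode and yarn-cluster mode for a JVM application). The object and function names are hypothetical, and other cluster managers are deliberately not modelled.

```scala
// Simplified model of how the child main class and its arguments are chosen.
object ChildMainClassSketch {
  val YarnClusterClient = "org.apache.spark.deploy.yarn.Client"

  // Returns the class whose main method would be invoked next in this model.
  def childMainClass(master: String, deployMode: String, userMainClass: String): String = {
    val isYarnCluster = master.startsWith("yarn") && deployMode == "cluster"
    if (isYarnCluster) YarnClusterClient          // the Yarn Client submits the app on our behalf
    else if (deployMode == "client") userMainClass // the user class runs directly in this JVM
    else throw new UnsupportedOperationException(s"Not modelled: $master / $deployMode")
  }

  // In yarn-cluster mode the user class and its arguments are forwarded as options
  // of the Yarn Client, mirroring the --jar / --class / --arg lines in the excerpt.
  def yarnClusterChildArgs(
      primaryResource: String,
      userMainClass: String,
      appArgs: Seq[String]): Seq[String] =
    Seq("--jar", primaryResource, "--class", userMainClass) ++
      appArgs.flatMap(arg => Seq("--arg", arg))

  def main(args: Array[String]): Unit = {
    println(childMainClass("yarn", "cluster", "com.example.SparkTest"))
    println(yarnClusterChildArgs("app.jar", "com.example.SparkTest", Seq("in", "out")))
  }
}
```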
org.apache.spark.deploy.SparkSubmit
/**
* @Author: Small_Ran
* @Date: 2022/5/24
* @Description:
* Loads the main class by reflection and runs its main method
*/
private def runMain(
childArgs: Seq[String],
childClasspath: Seq[String],
sysProps: Map[String, String],
childMainClass: String,
verbose: Boolean): Unit = {
................. some code omitted ....................
// Class loader
Thread.currentThread.setContextClassLoader(loader)
................. some code omitted ....................
var mainClass: Class[_] = null
try {
// Load the class by reflection
mainClass = Utils.classForName(childMainClass)
} catch {
................. some code omitted ....................
System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
}
................. some code omitted ....................
// Look up the main method of the specified class (throws if it does not exist)
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
if (!Modifier.isStatic(mainMethod.getModifiers)) {
throw new IllegalStateException("The main method in the given main class must be static")
}
................. some code omitted ....................
try {
// Invoke the main method of the specified class
mainMethod.invoke(null, childArgs.toArray)
} catch {
................. some code omitted ....................
}
}
The runMain() method runs the main method of the child class using the launch environment provided.
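The load, check, invoke sequence can be tried in isolation with plain Java reflection. The sketch below is an illustration, not Spark code: invokeStatic is a hypothetical helper, and java.lang.Integer.parseInt stands in for a user main method so that the example has no dependencies.

```scala
import java.lang.reflect.Modifier

// Minimal illustration of loading a class by name and invoking a static method on it.
object ReflectiveInvokeSketch {
  def invokeStatic(
      className: String,
      methodName: String,
      paramType: Class[_],
      arg: AnyRef): AnyRef = {
    val cls = Class.forName(className)                // throws ClassNotFoundException if missing
    val method = cls.getMethod(methodName, paramType) // throws NoSuchMethodException if missing
    if (!Modifier.isStatic(method.getModifiers)) {
      throw new IllegalStateException(s"$className.$methodName must be static")
    }
    // The receiver is null because static methods have no instance.
    method.invoke(null, arg)
  }

  def main(args: Array[String]): Unit = {
    val result = invokeStatic("java.lang.Integer", "parseInt", classOf[String], "42")
    println(result)
  }
}
```

For a real main method, paramType would be classOf[Array[String]] and arg would be the argument array, passed as a single object.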
org.apache.spark.deploy.yarn.Client
/** Step 5
* @Author: Small_Ran
* @Date: 2022/5/25
* @param argStrings
* @Description: The Yarn Client class
*/
def main(argStrings: Array[String]) {
if (!sys.props.contains("SPARK_SUBMIT")) {
logWarning("WARNING: This client is deprecated and will be removed in a " +
"future version of Spark. Use ./bin/spark-submit with \"--master yarn\"")
}
// Set an env variable indicating we are running in YARN mode.
// Note that any env variable with the SPARK_ prefix gets propagated to all (remote) processes
System.setProperty("SPARK_YARN_MODE", "true")
val sparkConf = new SparkConf
// SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
// so remove them from sparkConf here for yarn mode.
sparkConf.remove("spark.jars")
sparkConf.remove("spark.files")
val args = new ClientArguments(argStrings)
// Create the Client, which wraps the YarnClient used to connect to the Yarn cluster
new Client(args, sparkConf).run()
}
Submitting the Application request
org.apache.spark.deploy.yarn.Client
def run(): Unit = {
// Start submitting the application
this.appId = submitApplication()
if (!launcherBackend.isConnected() && fireAndForget) {
val report = getApplicationReport(appId)
val state = report.getYarnApplicationState
logInfo(s"Application report for $appId (state: $state)")
logInfo(formatReportDetails(report))
if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
throw new SparkException(s"Application $appId finished with status: $state")
}
} else {
val (yarnApplicationState, finalApplicationStatus) = monitorApplication(appId)
if (yarnApplicationState == YarnApplicationState.FAILED ||
finalApplicationStatus == FinalApplicationStatus.FAILED) {
throw new SparkException(s"Application $appId finished with failed status")
}
if (yarnApplicationState == YarnApplicationState.KILLED ||
finalApplicationStatus == FinalApplicationStatus.KILLED) {
throw new SparkException(s"Application $appId is killed")
}
if (finalApplicationStatus == FinalApplicationStatus.UNDEFINED) {
throw new SparkException(s"The final status of application $appId is undefined")
}
}
}
The run() method mainly submits the application to the ResourceManager.
org.apache.spark.deploy.yarn.Client
def submitApplication(): ApplicationId = {
var appId: ApplicationId = null
try {
launcherBackend.connect()
// Setup the credentials before doing anything else,
// so we have don't have issues at any point.
setupCredentials()
yarnClient.init(yarnConf)
yarnClient.start()
logInfo("Requesting a new application from cluster with %d NodeManagers"
.format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))
................. some code omitted ....................
// Set up the appropriate contexts to launch our AM
// Build the content to submit, including the container, the Java environment, the ApplicationMaster command, and so on
val containerContext = createContainerLaunchContext(newAppResponse)
val appContext = createApplicationSubmissionContext(newApp, containerContext)
// Finally, submit and monitor the application
// Submit the Application through YarnClient
logInfo(s"Submitting application $appId to ResourceManager")
yarnClient.submitApplication(appContext)
launcherBackend.setAppId(appId.toString)
reportLauncherState(SparkAppHandle.State.SUBMITTED)
appId
} catch {
................. some code omitted ....................
}
}
The submitApplication() method submits the application that runs the ApplicationMaster to the ResourceManager.
private def createContainerLaunchContext(newAppResponse: GetNewApplicationResponse)
: ContainerLaunchContext = {
................. some code omitted ....................
val useConcurrentAndIncrementalGC = launchEnv.get("SPARK_USE_CONC_INCR_GC").exists(_.toBoolean)
if (useConcurrentAndIncrementalGC) {
// In our expts, using (default) throughput collector has severe perf ramifications in
// multi-tenant machines
javaOpts += "-XX:+UseConcMarkSweepGC"
javaOpts += "-XX:MaxTenuringThreshold=31"
javaOpts += "-XX:SurvivorRatio=8"
javaOpts += "-XX:+CMSIncrementalMode"
javaOpts += "-XX:+CMSIncrementalPacing"
javaOpts += "-XX:CMSIncrementalDutyCycleMin=0"
javaOpts += "-XX:CMSIncrementalDutyCycle=10"
}
................. some code omitted ....................
val userClass =
if (isClusterMode) {
Seq("--class", YarnSparkHadoopUtil.escapeForShell(args.userClass))
} else {
Nil
}
val userJar =
if (args.userJar != null) {
Seq("--jar", args.userJar)
} else {
Nil
}
val primaryPyFile =
if (isClusterMode && args.primaryPyFile != null) {
Seq("--primary-py-file", new Path(args.primaryPyFile).getName())
} else {
Nil
}
val primaryRFile =
if (args.primaryRFile != null) {
Seq("--primary-r-file", args.primaryRFile)
} else {
Nil
}
val amClass =
if (isClusterMode) {
// Yarn Cluster submission mode
Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
} else {
// Yarn Client submission mode
Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
}
................. some code omitted ....................
// Command that launches the ApplicationMaster
val commands = prefixEnv ++
Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
javaOpts ++ amArgs ++
Seq(
"1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
"2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
// TODO: it would be nicer to just make sure there are no null commands here
val printableCommands = commands.map(s => if (s == null) "null" else s).toList
amContainer.setCommands(printableCommands.asJava)
................. some code omitted ....................
}
The createContainerLaunchContext() method sets up the launch environment, the Java options, and the command that starts the ApplicationMaster.
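The sketch below shows, under stated assumptions, how such a launch command is assembled as a list of strings. It is not the Client implementation. buildAmCommand is a hypothetical function, the placeholders {{JAVA_HOME}} and <LOG_DIR> are written out literally as stand-ins for the values Yarn expands on the NodeManager, and the composition of amArgs (which the excerpt elides) is an assumption limited to the class name, --class and --jar.

```scala
// Simplified model of assembling the ApplicationMaster launch command.
object AmCommandSketch {
  // Stand-ins for Environment.JAVA_HOME.$$() and ApplicationConstants.LOG_DIR_EXPANSION_VAR.
  val JavaHome = "{{JAVA_HOME}}"
  val LogDir = "<LOG_DIR>"

  def buildAmCommand(
      isClusterMode: Boolean,
      javaOpts: Seq[String],
      userClass: String,
      userJar: Option[String]): List[String] = {
    // Cluster mode runs the full ApplicationMaster; client mode only needs the ExecutorLauncher.
    val amClass =
      if (isClusterMode) "org.apache.spark.deploy.yarn.ApplicationMaster"
      else "org.apache.spark.deploy.yarn.ExecutorLauncher"
    // The user class is only passed along in cluster mode, where the AM hosts the Driver.
    val userClassArgs = if (isClusterMode) Seq("--class", userClass) else Nil
    val userJarArgs = userJar.toSeq.flatMap(jar => Seq("--jar", jar))
    // Assumed composition of the elided amArgs value.
    val amArgs = Seq(amClass) ++ userClassArgs ++ userJarArgs
    (Seq(s"$JavaHome/bin/java", "-server") ++ javaOpts ++ amArgs ++
      Seq("1>", s"$LogDir/stdout", "2>", s"$LogDir/stderr")).toList
  }

  def main(args: Array[String]): Unit = {
    val command = buildAmCommand(true, Seq("-Xmx1g"), "com.example.SparkTest", Some("app.jar"))
    println(command.mkString(" "))
  }
}
```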
Starting the ApplicationMaster
org.apache.spark.deploy.yarn.ApplicationMaster
def main(args: Array[String]): Unit = {
SignalUtils.registerLogger(log)
// Wrap the arguments that were passed in
val amArgs = new ApplicationMasterArguments(args)
// Load the properties file with the Spark configuration and set entries as system properties,
// so that user code run inside the AM also has access to them.
// Note: we must do this before SparkHadoopUtil instantiated
if (amArgs.propertiesFile != null) {
Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
sys.props(k) = v
}
}
SparkHadoopUtil.get.runAsSparkUser { () =>
// Create a YarnRMClient to connect to the ResourceManager
master = new ApplicationMaster(amArgs, new YarnRMClient)
System.exit(master.run())
}
}
org.apache.spark.deploy.yarn.ApplicationMaster
final def run(): Int = {
try {
val appAttemptId = client.getAttemptId()
var attemptID: Option[String] = None
................. some code omitted ....................
// Start the Driver
if (isClusterMode) {
// Yarn Cluster mode
runDriver(securityMgr)
} else {
// Yarn Client mode
runExecutorLauncher(securityMgr)
}
} catch {
................. some code omitted ....................
}
exitCode
}
Requesting resources from the ResourceManager
org.apache.spark.deploy.yarn.ApplicationMaster
private def runDriver(securityMgr: SecurityManager): Unit = {
addAmIpFilter()
// Start the user-specified class
userClassThread = startUserApplication()
................. some code omitted ....................
try {
val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
Duration(totalWaitTime, TimeUnit.MILLISECONDS))
if (sc != null) {
rpcEnv = sc.env.rpcEnv
val driverRef = runAMEndpoint(
sc.getConf.get("spark.driver.host"),
sc.getConf.get("spark.driver.port"),
isClusterMode = true)
registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.webUrl), securityMgr)
} else {
// Sanity check; should never happen in normal operation, since sc should only be null
// if the user app did not create a SparkContext.
if (!finished) {
throw new IllegalStateException("SparkContext is null but app is still running!")
}
}
// join blocks here until the user thread has finished
userClassThread.join()
} catch {
................. some code omitted ....................
}
}
org.apache.spark.deploy.yarn.ApplicationMaster
/**
* @Author: Small_Ran
* @Date: 2022/5/24
* @Description:
* The ApplicationMaster mainly interacts with the ResourceManager and NodeManagers (for resources) and with the Driver
*/
private def startUserApplication(): Thread = {
logInfo("Starting the user application in a separate Thread")
................. some code omitted ....................
// Look up the main method of the class given by --class
val mainMethod = userClassLoader.loadClass(args.userClass)
.getMethod("main", classOf[Array[String]])
// Start the Driver thread and run the main method of the specified class
val userThread = new Thread {
override def run() {
try {
mainMethod.invoke(null, userArgs.toArray)
finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
logDebug("Done running users class")
} catch {
................. some code omitted ....................
}
sparkContextPromise.tryFailure(e.getCause())
} finally {
................. some code omitted ....................
}
}
}
userThread.setContextClassLoader(userClassLoader)
userThread.setName("Driver")
userThread.start()
userThread
}
The startUserApplication() method mainly starts the Driver thread.
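The hand-off between the Driver thread and runDriver (which waits on sparkContextPromise before registering the AM) can be modelled with a plain Promise. The following is a minimal sketch, not the ApplicationMaster code: FakeContext is a hypothetical stand-in for SparkContext, and the user "main method" is passed in as a function that completes the promise.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

// Simplified model: a "Driver" thread produces a context, the caller waits for it.
object DriverThreadSketch {
  // Hypothetical stand-in for SparkContext.
  final case class FakeContext(appName: String)

  def runDriver(userMain: Promise[FakeContext] => Unit, timeout: FiniteDuration): FakeContext = {
    val contextPromise = Promise[FakeContext]()
    val userThread = new Thread {
      override def run(): Unit = {
        try {
          userMain(contextPromise) // the user code is expected to complete the promise
        } catch {
          case NonFatal(e) =>
            contextPromise.tryFailure(e) // propagate user failures to the waiting side
            ()
        }
      }
    }
    userThread.setName("Driver")
    userThread.start()
    // Block until the context exists (or the timeout expires); registration would follow here.
    val context = Await.result(contextPromise.future, timeout)
    userThread.join() // wait for the user code to finish
    context
  }

  def main(args: Array[String]): Unit = {
    val context = runDriver(promise => { promise.success(FakeContext("demo")); () }, 5.seconds)
    println(s"Context ready: ${context.appName}")
  }
}
```

If the user code never completes the promise, this sketch simply times out. The real ApplicationMaster handles that case in code the excerpt omits.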
org.apache.spark.deploy.yarn.ApplicationMaster
private def registerAM(
_sparkConf: SparkConf,
_rpcEnv: RpcEnv,
driverRef: RpcEndpointRef,
uiAddress: Option[String],
securityMgr: SecurityManager) = {
................. some code omitted ....................
// The ApplicationMaster registers with the ResourceManager and starts requesting resources
allocator = client.register(driverUrl,
driverRef,
yarnConf,
_sparkConf,
uiAddress,
historyAddress,
securityMgr,
localResources)
// Allocate the available resources and start containers
allocator.allocateResources()
reporterThread = launchReporterThread()
}
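The registerAM() method registers the ApplicationMaster with the ResourceManager and starts requesting the resources the job needs.
ResourceManager returns the available containers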
org.apache.spark.deploy.yarn.YarnAllocator
def allocateResources(): Unit = synchronized {
updateResourceRequests()
val progressIndicator = 0.1f
// Poll the ResourceManager. This doubles as a heartbeat if there are no pending container
// requests.
val allocateResponse = amClient.allocate(progressIndicator)
// Get the containers that were allocated
val allocatedContainers = allocateResponse.getAllocatedContainers()
if (allocatedContainers.size > 0) {
logDebug("Allocated containers: %d. Current executor count: %d. Cluster resources: %s."
.format(
allocatedContainers.size,
numExecutorsRunning,
allocateResponse.getAvailableResources))
// Start handling the allocated containers
handleAllocatedContainers(allocatedContainers.asScala)
}
The allocateResources() method sends a resource request to the ResourceManager, which then returns a list of available resources.
Starting containers and Executors
org.apache.spark.deploy.yarn.YarnAllocator
def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
val containersToUse = new ArrayBuffer[Container](allocatedContainers.size)
................. some code omitted ....................
// Run the allocated containers
runAllocatedContainers(containersToUse)
logInfo("Received %d containers from YARN, launching executors on %d of them."
.format(allocatedContainers.size, containersToUse.size))
}
The handleAllocatedContainers() method handles the containers granted by the RM by launching Executors in them.
org.apache.spark.deploy.yarn.YarnAllocator
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
for (container <- containersToUse) {
................. some code omitted ....................
// Launch the Executor on a thread from the launcher pool
if (numExecutorsRunning < targetNumExecutors) {
if (launchContainers) {
launcherPool.execute(new Runnable {
override def run(): Unit = {
try {
new ExecutorRunnable(
Some(container),
conf,
sparkConf,
driverUrl,
executorId,
executorHostname,
executorMemory,
executorCores,
appAttemptId.getApplicationId.toString,
securityMgr,
localResources
).run()
updateInternalState()
} catch {
case NonFatal(e) =>
logError(s"Failed to launch executor $executorId on container $containerId", e)
// Assigned container should be released immediately to avoid unnecessary resource
// occupation.
amClient.releaseAssignedContainer(containerId)
}
}
})
................. some code omitted ....................
}
}
The runAllocatedContainers() method runs the program inside each allocated container.
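The pattern of launching each container on a pool thread and releasing it if the launch fails can be sketched with java.util.concurrent alone. This is a simplified model and not YarnAllocator: launchAll and LaunchResult are hypothetical names, the launch function stands in for ExecutorRunnable.run(), and the target check is done on a local counter before submission rather than on the running-executor count.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import scala.util.control.NonFatal

// Simplified model of launching executors in allocated containers on a thread pool.
object ContainerLaunchSketch {
  final case class LaunchResult(running: Int, released: List[String], skipped: List[String])

  def launchAll(containerIds: Seq[String], targetNumExecutors: Int)(
      launch: String => Unit): LaunchResult = {
    val pool = Executors.newFixedThreadPool(4)
    val running = new AtomicInteger(0)
    val released = new ConcurrentLinkedQueue[String]()
    var submitted = 0
    var skipped = List.empty[String]
    for (containerId <- containerIds) {
      if (submitted < targetNumExecutors) {
        submitted += 1
        pool.execute(new Runnable {
          override def run(): Unit = {
            try {
              launch(containerId)       // stand-in for starting the executor process
              running.incrementAndGet() // stand-in for updating internal state
              ()
            } catch {
              case NonFatal(_) =>
                released.add(containerId) // give the container back so it is not wasted
                ()
            }
          }
        })
      } else {
        skipped = containerId :: skipped // target already reached, nothing is launched
      }
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    LaunchResult(running.get(), released.toArray(new Array[String](0)).toList.sorted, skipped.reverse)
  }

  def main(args: Array[String]): Unit = {
    val result = launchAll(Seq("c1", "c2", "c3", "c4"), targetNumExecutors = 3) { id =>
      if (id == "c2") throw new RuntimeException("launch failed")
    }
    println(result)
  }
}
```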
org.apache.spark.deploy.yarn.ExecutorRunnable
def run(): Unit = {
logDebug("Starting Executor Container")
nmClient = NMClient.createNMClient()
nmClient.init(conf)
nmClient.start()
// Start the container
startContainer()
}
def startContainer(): java.util.Map[String, ByteBuffer] = {
val ctx = Records.newRecord(classOf[ContainerLaunchContext])
.asInstanceOf[ContainerLaunchContext]
val env = prepareEnvironment().asJava
ctx.setLocalResources(localResources.asJava)
ctx.setEnvironment(env)
val credentials = UserGroupInformation.getCurrentUser().getCredentials()
val dob = new DataOutputBuffer()
credentials.writeTokenStorageToStream(dob)
ctx.setTokens(ByteBuffer.wrap(dob.getData()))
// Command that starts the CoarseGrainedExecutorBackend process
val commands = prepareCommand()
ctx.setCommands(commands.asJava)
................. some code omitted ....................
}
private def prepareCommand(): List[String] = {
// Extra options for the JVM
val javaOpts = ListBuffer[String]()
................. some code omitted ....................
javaOpts += ("-Dspark.yarn.app.container.log.dir=" + ApplicationConstants.LOG_DIR_EXPANSION_VAR)
val userClassPath = Client.getUserClasspath(sparkConf).flatMap { uri =>
val absPath =
if (new File(uri.getPath()).isAbsolute()) {
Client.getClusterPath(sparkConf, uri.getPath())
} else {
Client.buildPath(Environment.PWD.$(), uri.getPath())
}
Seq("--user-class-path", "file:" + absPath)
}.toSeq
YarnSparkHadoopUtil.addOutOfMemoryErrorArgument(javaOpts)
// Build the CoarseGrainedExecutorBackend launch command
val commands = prefixEnv ++
Seq(Environment.JAVA_HOME.$$() + "/bin/java", "-server") ++
javaOpts ++
Seq("org.apache.spark.executor.CoarseGrainedExecutorBackend",
"--driver-url", masterAddress,
"--executor-id", executorId,
"--hostname", hostname,
"--cores", executorCores.toString,
"--app-id", appId) ++
userClassPath ++
Seq(
s"1>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stdout",
s"2>${ApplicationConstants.LOG_DIR_EXPANSION_VAR}/stderr")
// TODO: it would be nicer to just make sure there are no null commands here
commands.map(s => if (s == null) "null" else s).toList
}
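The prepareCommand() method builds the command that launches CoarseGrainedExecutorBackend. The Executor shown in the flow diagram is in fact the CoarseGrainedExecutorBackend process: "Executor" is only the name under which the endpoint is registered for inter-process communication, while the object actually instantiated is CoarseGrainedExecutorBackend. A task is first sent to CoarseGrainedExecutorBackend and is then executed by its executor field, an Executor instance.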
org.apache.spark.executor.CoarseGrainedExecutorBackend
private def run(
driverUrl: String,
executorId: String,
hostname: String,
cores: Int,
appId: String,
workerUrl: Option[String],
userClassPath: Seq[URL]) {
................. some code omitted ....................
val env = SparkEnv.createExecutorEnv(
driverConf, executorId, hostname, port, cores, cfg.ioEncryptionKey, isLocal = false)
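// "Executor" is only the endpoint name used for inter-process communication; the object actually created is a CoarseGrainedExecutorBackend.
// A task is first sent to CoarseGrainedExecutorBackend and is then executed by its executor field.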
env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
workerUrl.foreach { url =>
env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
}
env.rpcEnv.awaitTermination()
SparkHadoopUtil.get.stopCredentialUpdater()
}
}
The run() method starts CoarseGrainedExecutorBackend (run() is called from the main method of CoarseGrainedExecutorBackend).
Executor registers back with the Driver
org.apache.spark.executor.CoarseGrainedExecutorBackend
override def onStart() {
logInfo("Connecting to driver: " + driverUrl)
rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
// This is a very fast action so we can use "ThreadUtils.sameThread"
// Register back with the Driver to tell it that this Executor is up; if an Executor later dies, the Driver can promptly request resources again to rerun the tasks
driver = Some(ref)
ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
}(ThreadUtils.sameThread).onComplete {
// This is a very fast action so we can use "ThreadUtils.sameThread"
case Success(msg) =>
// Always receive `true`. Just ignore it
case Failure(e) =>
exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
}(ThreadUtils.sameThread)
}
The onStart() method starts the registration with the Driver. Because CoarseGrainedExecutorBackend extends ThreadSafeRpcEndpoint, it overrides the lifecycle methods defined there (lifecycle: constructor -> onStart -> receive -> onStop).
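The lifecycle order constructor -> onStart -> receive -> onStop can be illustrated with a toy endpoint. The sketch below does not use Spark's RPC framework: Endpoint, runEndpoint and RecordingEndpoint are hypothetical names, and messages are delivered from a plain sequence on the calling thread.

```scala
import scala.collection.mutable.ListBuffer

// Toy model of an endpoint lifecycle; Spark's RpcEnv is not involved.
object EndpointLifecycleSketch {
  trait Endpoint {
    def onStart(): Unit = ()
    def receive: PartialFunction[Any, Unit]
    def onStop(): Unit = ()
  }

  // Drives one endpoint through its lifecycle: onStart, then each message, then onStop.
  def runEndpoint(endpoint: Endpoint, inbox: Seq[Any]): Unit = {
    endpoint.onStart()
    try {
      inbox.foreach { message =>
        // Messages the endpoint does not handle are dropped in this model.
        if (endpoint.receive.isDefinedAt(message)) endpoint.receive(message)
      }
    } finally {
      endpoint.onStop()
    }
  }

  // Records every lifecycle event so the order can be inspected afterwards.
  class RecordingEndpoint extends Endpoint {
    val events: ListBuffer[String] = ListBuffer("constructor")
    override def onStart(): Unit = { events += "onStart"; () }
    override def receive: PartialFunction[Any, Unit] = {
      case text: String => events += s"receive($text)"; ()
    }
    override def onStop(): Unit = { events += "onStop"; () }
  }

  def main(args: Array[String]): Unit = {
    val endpoint = new RecordingEndpoint
    runEndpoint(endpoint, Seq("hello", 42))
    println(endpoint.events.mkString(" -> "))
  }
}
```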
Assigning Tasks
org.apache.spark.executor.CoarseGrainedExecutorBackend
override def receive: PartialFunction[Any, Unit] = {
// Registration-success message
case RegisteredExecutor =>
logInfo("Successfully registered with driver")
try {
// After the Executor has registered, the Driver replies with a confirmation; the Executor is then created and gets ready to compute
executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
} catch {
case NonFatal(e) =>
exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
}
case RegisterExecutorFailed(message) =>
exitExecutor(1, "Slave registration failed: " + message)
// Task launch message
case LaunchTask(data) =>
if (executor == null) {
exitExecutor(1, "Received LaunchTask command but executor was null")
} else {
val taskDesc = TaskDescription.decode(data.value)
logInfo("Got assigned task " + taskDesc.taskId)
// Launch the task
executor.launchTask(this, taskDesc)
}
case KillTask(taskId, _, interruptThread, reason) =>
if (executor == null) {
exitExecutor(1, "Received KillTask command but executor was null")
} else {
executor.killTask(taskId, interruptThread, reason)
}
case StopExecutor =>
stopping.set(true)
logInfo("Driver commanded a shutdown")
// Cannot shutdown here because an ack may need to be sent back to the caller. So send
// a message to self to actually do the shutdown.
self.send(Shutdown)
case Shutdown =>
stopping.set(true)
new Thread("CoarseGrainedExecutorBackend-stop-executor") {
override def run(): Unit = {
................. some code omitted ....................
executor.stop()
}
}.start()
}
The receive() method handles the registration-success message returned by the Driver, creates the Executor, and then runs the Tasks assigned to it.
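The ordering constraint in receive (a task can only be launched after registration has succeeded and the Executor exists) can be shown with a small message handler. This is a simplified sketch, not the Spark classes: the message types below only reuse the names from the excerpt, LaunchTask carries a bare task id instead of serialized data, and the sketch throws where the real backend exits the process.

```scala
import scala.collection.mutable.ListBuffer

// Simplified model of the backend's message handling; the types are local stand-ins.
object BackendMessagesSketch {
  sealed trait Message
  case object RegisteredExecutor extends Message
  final case class LaunchTask(taskId: Long) extends Message
  case object StopExecutor extends Message

  class Backend {
    private var executorReady = false // becomes true once registration is confirmed
    private var stopped = false
    val launched: ListBuffer[Long] = ListBuffer.empty[Long]

    val receive: PartialFunction[Message, Unit] = {
      case RegisteredExecutor =>
        executorReady = true // the real backend creates the Executor here
      case LaunchTask(taskId) =>
        if (!executorReady) {
          throw new IllegalStateException("Received LaunchTask but executor was not created")
        }
        launched += taskId // the real backend hands the task to the Executor
        ()
      case StopExecutor =>
        stopped = true
    }

    def isStopped: Boolean = stopped
  }

  def main(args: Array[String]): Unit = {
    val backend = new Backend
    Seq(RegisteredExecutor, LaunchTask(1L), LaunchTask(2L), StopExecutor).foreach(backend.receive)
    println(s"launched=${backend.launched.toList}, stopped=${backend.isStopped}")
  }
}
```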
Reference: https://www.bilibili.com/video/BV1Si4y1M7N6?p=2&spm_id_from=pageDriver
