I vaguely remember that when I first started learning the Spark framework, spark-shell was the very first thing I touched. But what exactly does it do behind the scenes? Let's find out:

  • Open spark-shell: it is actually a shell script, and its core content is the following:
function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}

As we can see, it calls spark-submit and launches a Spark application with org.apache.spark.repl.Main as its main class.
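
In fact, you can launch the same REPL by hand with exactly that spark-submit invocation; a minimal sketch (the --master value is an illustrative addition, not something the script itself passes):

# Hand-rolled equivalent of what spark-shell runs internally.
# --master "local[*]" is illustrative; spark-shell itself passes no master.
"${SPARK_HOME}"/bin/spark-submit \
  --class org.apache.spark.repl.Main \
  --name "Spark shell" \
  --master "local[*]"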

  • Open spark-submit: it is another shell script, and its core content is the following:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

As we can see, it in turn calls spark-class, passing org.apache.spark.deploy.SparkSubmit as the main class. Note the exec: the spark-submit shell process is replaced by spark-class rather than forking a child process.
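
If you want to watch this hand-off yourself, tracing the script works; a quick sketch (the path in the trace output is illustrative):

# Trace spark-submit; the final xtrace line shows the exec into spark-class.
# --version makes SparkSubmit print its version and exit immediately.
bash -x "${SPARK_HOME}"/bin/spark-submit --version 2>&1 | grep spark-class
# + exec /opt/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --version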

  • Open spark-class: it is yet another shell script, whose key function builds the final command to run:
build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

Here $RUNNER is the java executable, and org.apache.spark.launcher.Main prints the final java command line to execute. Since spark-submit already prepended org.apache.spark.deploy.SparkSubmit to the arguments before calling spark-class, spark-shell ultimately starts a JVM process whose main class is org.apache.spark.deploy.SparkSubmit.
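
For completeness, here is a simplified sketch of how spark-class consumes build_command's output: the NUL-delimited arguments are read into an array, the trailing exit code emitted by printf "%d\0" $? is stripped, and the assembled command is exec'd (the real script adds error handling around this):

# Simplified from spark-class: collect the NUL-delimited command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

# The last element is the launcher's exit code; strip it before exec'ing.
LAST=$((${#CMD[@]} - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"   # e.g. java -cp ... org.apache.spark.deploy.SparkSubmit ...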

  • Inspect the call stack of the main thread:

(Stack trace of the main thread: org.apache.spark.deploy.SparkSubmit invokes org.apache.spark.repl.Main, which ends up in org.apache.spark.repl.SparkILoop.process.)
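
To reproduce this, the JDK's jps and jstack tools are enough; a minimal sketch (the pid is illustrative):

# Find the SparkSubmit JVM started by spark-shell...
jps -l | grep org.apache.spark.deploy.SparkSubmit
# 12345 org.apache.spark.deploy.SparkSubmit      <- pid is illustrative
# ...then dump its main thread's stack.
jstack 12345 | grep -A 20 '"main"'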

  • To summarize, spark-shell does the following:
    • Startup flow: spark-shell -> spark-submit -> spark-class
    • Call stack: the Spark framework starts a job with org.apache.spark.repl.Main as the main class, and that job is launched by Spark's org.apache.spark.deploy.SparkSubmit. In short: org.apache.spark.deploy.SparkSubmit -> org.apache.spark.repl.Main -> org.apache.spark.repl.SparkILoop.process
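
As a final sanity check, with the shell running you can see the resolved JVM command line from the OS (output abbreviated; exact flags vary by installation):

# In another terminal while spark-shell is running:
ps -ef | grep '[S]parkSubmit'
# ... java -cp ... org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name "Spark shell" ...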