Spark - Getting Started
Last Updated: 2023-09-03
spark-shell
To launch Spark on YARN:
- Make sure HADOOP_CONF_DIR or YARN_CONF_DIR is set correctly, so that Spark can find configuration files such as hdfs-site.xml.
$ spark-shell --master yarn --deploy-mode client
Master
- yarn: use YARN
- local[2]: run locally with 2 worker threads
- local[*]: run locally with as many worker threads as there are logical cores
Deploy Modes
- cluster: the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
- client: the driver runs in the client process, and the application master is only used for requesting resources from YARN.
spark-submit
Create a SparkConf. local[2] means local mode with 2 worker threads.
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("myAppName").setMaster("local[2]")
Create SparkContext
val sc = new SparkContext(conf)
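Create SQLContext (since Spark 2.0, SparkSession is the preferred entry point; SQLContext is kept for backward compatibility)
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)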
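Put together, the steps above could look like the following minimal application. This is only a sketch: the object name MyApp and the helper sumOfRange are hypothetical and chosen purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application object; the name is illustrative only.
object MyApp {
  // Hypothetical helper: sums 1..n on the cluster using an RDD.
  def sumOfRange(sc: SparkContext, n: Int): Int =
    sc.parallelize(1 to n).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // local[2]: run locally with 2 worker threads.
    val conf = new SparkConf().setAppName("myAppName").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"sum = ${sumOfRange(sc, 100)}")
    } finally {
      // Always release the context, even if the job fails.
      sc.stop()
    }
  }
}
```

Values set directly on SparkConf take precedence over flags passed to spark-submit, so a hard-coded setMaster("local[2]") would override --master yarn. Leaving out the setMaster call lets the launcher choose the master.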
RDD
An RDD is the lower-level abstraction and the building block for higher-level ones such as DataFrame and Dataset.
val rdd = sc.textFile("src/main/resources/Titanic/train.csv")
or
val datafile = scala.io.Source.fromFile("src/test/resources/test.csv").getLines().toList
val rdd = sc.parallelize(datafile)
println(rdd.collect.mkString("\n"))
Print class
println(rdd.getClass)
Print some basic info
println(rdd.count())
//892
println(rdd.first())
//PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
To print all the lines, use .foreach instead of .map: .map is a transformation and is not evaluated until an action is called.
rdd.foreach(println)
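The difference between a transformation and an action can be made visible with a small experiment. The sketch below is an illustration, not part of the original notes: the object name, the sample lines, and the accumulator name are all hypothetical. It counts how often the function passed to .map has run, once before and once after an action is called. In a local run without task retries, the count is expected to be zero before the action and equal to the number of lines afterwards.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names and sample data are illustrative only.
object LazyMapSketch {
  // Returns how many times the map function had run (before, after) the first action.
  def mapCallsBeforeAndAfterAction(sc: SparkContext): (Long, Long) = {
    val lines = sc.parallelize(Seq("a,1", "b,2", "c,3"))
    val calls = sc.longAccumulator("mapCalls")

    // Transformation: only recorded in the lineage, nothing runs yet.
    val mapped = lines.map { line => calls.add(1); line }
    val before: Long = calls.value

    // Action: triggers the job, so the map function runs for every line
    // and each line is printed (order may vary across partitions).
    mapped.foreach(println)
    val after: Long = calls.value

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazyMapSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = mapCallsBeforeAndAfterAction(sc)
      println(s"map calls before action: $before, after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```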
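To join the elements of each row into one string, use .mkString (records is assumed here to be an RDD whose rows are collections, e.g. the result of rdd.map(_.split(","))).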
records.foreach(row => println(row.mkString(",")))
Supported File Formats
- Text file formats
  - CSV
  - JSON
- Avro (row-based format)
- Parquet (columnar format)
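As a rough illustration of the formats listed above, the sketch below writes a tiny DataFrame as CSV, JSON, and Parquet and reads each copy back. It assumes Spark 2.x or later (it uses SparkSession); the object name, the sample data, and the temporary directory are hypothetical. Avro is left out because, in recent Spark versions, it needs the separate spark-avro module on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; names and sample data are illustrative only.
object FileFormatsSketch {
  // Writes the same rows in three formats and returns the row counts read back
  // as (csv, json, parquet).
  def roundTripCounts(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    val people = Seq(("Ann", 31), ("Bob", 25)).toDF("name", "age")
    val dir = Files.createTempDirectory("spark-formats").toString

    // Text formats: CSV (with a header row) and JSON (one object per line).
    people.write.option("header", "true").csv(s"$dir/csv")
    people.write.json(s"$dir/json")
    // Columnar format: Parquet.
    people.write.parquet(s"$dir/parquet")

    val fromCsv = spark.read.option("header", "true").csv(s"$dir/csv")
    val fromJson = spark.read.json(s"$dir/json")
    val fromParquet = spark.read.parquet(s"$dir/parquet")

    (fromCsv.count(), fromJson.count(), fromParquet.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fileFormatsSketch")
      .master("local[2]")
      .getOrCreate()
    try {
      println(roundTripCounts(spark))
    } finally {
      spark.stop()
    }
  }
}
```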
Standalone vs YARN/Mesos
- Standalone: the Spark jar has to be deployed manually to each node
- YARN/Mesos: the driver program talks to the resource manager, and the jar is sent to each node (executor) at run time
YARN mode
In YARN mode, Spark connects to YARN's ResourceManager and submits the application to it. To locate the server, either HADOOP_CONF_DIR or YARN_CONF_DIR must be defined and point to the directory where Spark can find config files such as yarn-site.xml, hdfs-site.xml, and core-site.xml. If neither variable is set, launching fails:
$ spark-shell --master yarn
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
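A quick way to catch this before launching is to check the environment up front. The helper below is a hypothetical sketch in plain Scala, not part of Spark: the object and function names are made up, and both the lookup order and the decision to treat an empty value as unset are assumptions of this example.

```scala
// Hypothetical pre-flight check; not part of Spark itself.
object YarnConfCheck {
  // Returns the first of HADOOP_CONF_DIR / YARN_CONF_DIR that is set and non-empty.
  // The lookup order and the non-empty rule are assumptions of this sketch.
  def yarnConfDir(env: Map[String, String]): Option[String] =
    Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR")
      .flatMap(key => env.get(key))
      .find(_.nonEmpty)

  def main(args: Array[String]): Unit =
    yarnConfDir(sys.env) match {
      case Some(dir) =>
        println(s"YARN configuration directory: $dir")
      case None =>
        println("Neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set; --master yarn will fail.")
    }
}
```

Passing the environment in as a Map keeps the function easy to test without touching the real process environment.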