Spark - Getting Started

spark-shell

To launch Spark on YARN:

Make sure HADOOP_CONF_DIR or YARN_CONF_DIR is set correctly, so that Spark can find configuration files such as hdfs-site.xml.

$ spark-shell --master yarn --deploy-mode client

Master

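yarn: run on a YARN cluster
local[2]: run locally with 2 worker threads
local[*]: run locally with as many worker threads as there are logical cores on the machine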
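
As a rough illustration of how the local master strings map to a thread count, the sketch below parses them in plain Scala. MasterUrl and localThreads are hypothetical names made up for these notes and are not part of Spark's API; Spark does its own parsing internally.

```scala
// Hypothetical helper for illustration only; not part of Spark.
object MasterUrl {
  // Matches local[N], where N is a number or *
  private val LocalN = """local\[(\d+|\*)\]""".r

  /** Worker threads for a local master string, or None for non-local masters such as yarn. */
  def localThreads(master: String): Option[Int] = master match {
    case "local"     => Some(1) // plain "local" means a single worker thread
    case LocalN("*") => Some(Runtime.getRuntime.availableProcessors())
    case LocalN(n)   => Some(n.toInt)
    case _           => None
  }
}
```

For example, MasterUrl.localThreads("local[2]") yields Some(2), while MasterUrl.localThreads("yarn") yields None because the thread count is then decided by the cluster resources rather than the master string.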

Deploy Modes

cluster: the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
client: the driver runs in the client process, and the application master is only used for requesting resources from YARN.

spark-submit

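Create a SparkConf. local[2] means local mode with 2 worker threads. When the application is launched with spark-submit, the master is usually passed with --master instead of being hardcoded.

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("myAppName").setMaster("local[2]")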

Create SparkContext

val sc = new SparkContext(conf)

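Create SQLContext (since Spark 2.0, SparkSession is the preferred entry point; SQLContext remains for backward compatibility)

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)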
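
On Spark 2.0 and later, the three steps above are commonly collapsed into a single SparkSession. The following is a minimal sketch of an application that could be packaged into a jar and passed to spark-submit. The object name MyApp and the helper buildSession are hypothetical; the fallback to local[2] is an assumption that makes the sketch runnable outside spark-submit.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical application object, for illustration only.
object MyApp {
  def buildSession(): SparkSession = {
    val conf = new SparkConf().setAppName("myAppName")
    // Fall back to local mode only when no master was supplied
    // (for example via spark-submit --master yarn).
    conf.setIfMissing("spark.master", "local[2]")
    SparkSession.builder().config(conf).getOrCreate()
  }

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    val sc = spark.sparkContext // the underlying SparkContext
    val numbers = sc.parallelize(1 to 10)
    println(numbers.count())
    spark.stop()
  }
}
```

Leaving the master out of the code keeps the same jar usable both locally and on YARN; the deploy target is then chosen on the spark-submit command line.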

RDD

An RDD is the lower-level abstraction and the building block for the higher-level ones such as DataFrame and Dataset.

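Read a text file into an RDD, one element per line:

val rdd = sc.textFile("src/main/resources/Titanic/train.csv")

Or read a local file on the driver and distribute its lines with parallelize:

val datafile = scala.io.Source.fromFile("src/test/resources/test.csv").getLines().toList
val testRdd = sc.parallelize(datafile)
println(testRdd.collect.mkString("\n"))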

Print class

println(rdd.getClass)

Print some info

println(rdd.count())
//892

println(rdd.first())
//PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked

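To print all the lines, use .foreach instead of .map: .map is a transformation and is not evaluated until an action runs. On a cluster, foreach(println) writes to the executors' stdout rather than the driver's, so collect() or take(n) the data first if it has to appear on the driver console.

rdd.foreach(println)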
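
The difference between a transformation and an action can be made visible with a small local run. This is a sketch under a few assumptions: a local[2] master, in-memory sample lines instead of the Titanic file, and an accumulator used purely as a counter. LazyDemo and its sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the sample lines are made up for illustration.
object LazyDemo {
  /** Returns the counter value before and after the action is run. */
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("lazyDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq("a,1", "b,2", "c,3"))
      val seen = sc.longAccumulator("lines seen by map")

      // Transformation: only recorded in the lineage, nothing runs yet.
      val mapped = rdd.map { line => seen.add(1L); line.toUpperCase }
      val beforeAction = seen.sum

      // Action: triggers the job, so the map function is finally applied.
      mapped.collect().foreach(println)
      val afterAction = seen.sum

      (beforeAction, afterAction)
    } finally sc.stop()
  }
}
```

In this local run the counter stays at 0 until collect() is called and reaches 3 afterwards. Accumulator updates made inside transformations can be applied more than once if a task is retried, so this pattern is meant as a demonstration rather than a reliable way to count.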

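To join the elements of a collection into one string, use .mkString. Here records is assumed to be an RDD of rows already split into fields, for example rdd.map(_.split(",")).

records.foreach(row => println(row.mkString(",")))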
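
As a plain-Scala illustration of the same idea (no Spark needed), the header line printed earlier can be split into fields and joined back with a different separator. The naive split on commas is an assumption that holds for this header; data rows may contain quoted fields with embedded commas, which would need a real CSV parser. MkStringDemo is a hypothetical name.

```scala
// Hypothetical demo object, for illustration only.
object MkStringDemo {
  val header = "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked"

  // Naive split: fine for this header, not for quoted fields containing commas.
  val fields: Array[String] = header.split(",")

  // mkString(sep) joins the elements with a separator.
  val firstThree: String = fields.take(3).mkString(" | ")

  // mkString(start, sep, end) also adds a prefix and a suffix.
  val bracketed: String = fields.take(3).mkString("[", ", ", "]")
}
```

Here firstThree is "PassengerId | Survived | Pclass" and bracketed is "[PassengerId, Survived, Pclass]".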

Supported File Formats

Text file formats
- CSV
- JSON
Row format
- Avro
Columnar format
- Parquet
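
A minimal sketch of writing and reading some of these formats through the DataFrame reader and writer, assuming Spark 2.0 or later and a local[2] master. FormatsDemo, the sample rows and the temporary directory are hypothetical. Avro is left out here because it relies on the separate spark-avro module, which is not on the classpath by default.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical demo: writes a tiny DataFrame in three formats and reads each back.
object FormatsDemo {
  /** Returns the number of rows read back per format. */
  def run(): Map[String, Long] = {
    val spark = SparkSession.builder().appName("formatsDemo").master("local[2]").getOrCreate()
    import spark.implicits._
    try {
      // Temporary output directory; it is not cleaned up by this sketch.
      val dir = Files.createTempDirectory("spark-formats").toString
      val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

      df.write.option("header", "true").csv(s"$dir/csv")
      df.write.json(s"$dir/json")
      df.write.parquet(s"$dir/parquet")

      Map(
        "csv"     -> spark.read.option("header", "true").csv(s"$dir/csv").count(),
        "json"    -> spark.read.json(s"$dir/json").count(),
        "parquet" -> spark.read.parquet(s"$dir/parquet").count()
      )
    } finally spark.stop()
  }
}
```

CSV and JSON carry no embedded schema, so column types are either read as strings or inferred, whereas Parquet stores the schema alongside the data.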

Standalone vs YARN/Mesos

Standalone: Spark has to be deployed manually on every node.
YARN/Mesos: the driver program talks to the resource manager, and the application jar is shipped to each node (executor) at run time.

YARN mode

In YARN mode, Spark connects to YARN's ResourceManager and submits the application to it. To locate the ResourceManager, either HADOOP_CONF_DIR or YARN_CONF_DIR must be defined and point to the directory containing configuration files such as yarn-site.xml, hdfs-site.xml and core-site.xml. If neither is set, spark-shell fails at startup:

$ spark-shell --master yarn
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
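
To catch this before launching, a small pre-flight check can look for the two environment variables. The sketch below is plain Scala and only mirrors the intent of the error message; it is not how Spark itself performs the check. YarnEnvCheck and confDir are hypothetical names.

```scala
// Hypothetical pre-flight check, for illustration only; not Spark's own logic.
object YarnEnvCheck {
  /** First configuration directory that is set and non-empty, if any. */
  def confDir(env: Map[String, String] = sys.env): Option[String] =
    Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR")
      .flatMap(env.get)
      .find(_.nonEmpty)

  def main(args: Array[String]): Unit =
    confDir() match {
      case Some(dir) => println(s"Hadoop configuration directory: $dir")
      case None      => println("Neither HADOOP_CONF_DIR nor YARN_CONF_DIR is set")
    }
}
```

Passing the environment as a parameter keeps the check easy to try out with a hand-built Map instead of the real process environment.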