start:development:datascience:spark3

Spark 3

Modules

Several Spark modules are available:

$ module spider spark

will list them. If you require an update or an updated combination, please submit an installation request via the software install form.

Security of a Spark Cluster

Security in Spark is OFF by default, which means an unconfigured cluster may be vulnerable to attack. The official Spark security documentation covers the subject comprehensively.

The following script walks through the basic steps. It can be customized and used to generate a security configuration.

Please be aware: the script overwrites ~/.spark-config/spark-defaults.conf.
securing_standalone_spark_cluster.sh
#!/bin/bash
#
# -- Secure a Standalone Spark Cluster:
# -- Generate a keystore and truststore and create a spark config file
#
 
# Generate a random password (users should never need to know it).
# For simplicity, a single password is used throughout (as keystore password,
# key password, and Spark secret).
 
echo ""
echo " Generating Secure Spark config"
echo ""
 
module purge
module load lang/Java/1.8.0_202
 
SPARK_PASSWORD=$(tr -dc A-Za-z0-9_ < /dev/urandom | head -c 12)
 
# echo "Generating Trust and Key store files"
 
# Generate the keystore
keytool -genkey -alias spark \
    -keyalg RSA -keystore spark-keystore.jks \
    -dname "cn=spark, ou=MPCDF, o=MPG, c=DE" \
    -storepass $SPARK_PASSWORD -keypass $SPARK_PASSWORD
 
# Export the public cert
keytool -export -alias spark -file spark.cer -keystore spark-keystore.jks -storepass $SPARK_PASSWORD
 
# Import public cert into truststore
keytool -import -noprompt -alias spark -file spark.cer -keystore spark-truststore.ts -storepass $SPARK_PASSWORD
 
# Move files to config dir and clean up
MY_SPARK_CONF_DIR=~/.spark-config
mkdir -p $MY_SPARK_CONF_DIR
chmod 700 $MY_SPARK_CONF_DIR
 
# echo "Moving generated trust and key store files to $MY_SPARK_CONF_DIR"
 
mv -f spark-keystore.jks $MY_SPARK_CONF_DIR
mv -f spark-truststore.ts $MY_SPARK_CONF_DIR
 
# clean up intermediate file
rm -f spark.cer
 
# Create the spark default conf file with the secure configs
 
# echo "Creating SPARK Config files in $MY_SPARK_CONF_DIR"
 
cat << EOF > $MY_SPARK_CONF_DIR/spark-defaults.conf
spark.ui.enabled false
spark.authenticate true
spark.authenticate.secret $SPARK_PASSWORD
spark.ssl.enabled true
spark.ssl.needClientAuth true
spark.ssl.protocol TLS
spark.ssl.keyPassword $SPARK_PASSWORD
spark.ssl.keyStore $MY_SPARK_CONF_DIR/spark-keystore.jks
spark.ssl.keyStorePassword $SPARK_PASSWORD
spark.ssl.trustStore $MY_SPARK_CONF_DIR/spark-truststore.ts
spark.ssl.trustStorePassword $SPARK_PASSWORD
EOF
echo " Secure Spark config created in - $MY_SPARK_CONF_DIR"
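The secret-generation step can be illustrated in isolation; this is a minimal sketch of the same tr/head pipeline used in the script above:

```shell
# Draw bytes from /dev/urandom, keep only the characters [A-Za-z0-9_],
# and take the first 12 of them as the secret.
SECRET=$(tr -dc 'A-Za-z0-9_' < /dev/urandom | head -c 12)
echo "secret length: ${#SECRET}"    # prints "secret length: 12"
```

Restricting the alphabet with tr -dc keeps the secret safe to embed in the generated spark-defaults.conf without quoting issues.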

One-node job

If your interactive SLURM job uses only one node, you can start a spark-shell directly on the node. By default, spark-shell starts with the master “local[*]”, which means it uses all available resources of the local node. The spark-shell allows you to work with your data interactively using the Scala language.

$ spark-shell 
21/08/06 18:21:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://login21:4040
Spark context available as 'sc' (master = local[*], app id = local-1628266920018).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
         
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Job-Based Usage

If you already have a packaged application, it should be submitted as a job. The following script serves as a template to start a Scala application (packaged into myJar.jar in this example) whose entry point is the class Main in the package main.

#!/bin/bash
 
# further SLURM job settings go here
 
# load module
module load devel/Spark
 
# start application
spark-submit --driver-memory 8G --master local[*] --class main.Main myJar.jar

The option --master local[*] lets Spark use all cores of the node and choose the number of worker threads on its own. The option --driver-memory sets the driver memory; the memory of the workers typically requires no changes.
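If you prefer to pin the thread count to your SLURM allocation instead of using local[*], something like the following hypothetical sketch could be used (SLURM_CPUS_PER_TASK is set by SLURM inside a job; the fallback to * outside a job is an assumption of this sketch):

```shell
# Derive the local[N] master string from the SLURM allocation,
# falling back to local[*] (all cores) when not running inside a job.
CORES="${SLURM_CPUS_PER_TASK:-*}"
MASTER_OPT="local[$CORES]"
echo "would run: spark-submit --master $MASTER_OPT ..."
```

Pinning N to the allocation avoids oversubscribing cores when the node is shared with other job steps.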

Spawning a Spark cluster

First of all, start an interactive SLURM job.

Within the SLURM job, the Spark cluster can only be started in standalone mode.

Setting the important environment variables for the Spark configuration

For example, if the current directory should be used as the working directory:

CLUSTERWORKDIR="$PWD"
export SPARK_LOG_DIR="$CLUSTERWORKDIR/log"
export SPARK_WORKER_DIR="$CLUSTERWORKDIR/run"

If a customized configuration file should be used (for example, for security settings), SPARK_CONF_DIR must be set.

Example:

export SPARK_CONF_DIR="$HOME/.spark-config"
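Note that the shell does not expand a tilde inside double quotes, so $HOME is the safer choice here. A small sketch of the difference:

```shell
# A tilde inside double quotes is NOT expanded by the shell; $HOME is.
BAD="~/.spark-config"          # stays as the literal string ~/.spark-config
GOOD="$HOME/.spark-config"     # expands to an absolute path
echo "BAD:  $BAD"
echo "GOOD: $GOOD"
```

A literal ~ in SPARK_CONF_DIR would make Spark look for a directory named "~" relative to the working directory instead of your home directory.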

Master process

The Spark directories must exist:

mkdir -p $SPARK_LOG_DIR $SPARK_WORKER_DIR

If the infrastructure is ready, the master process can be started on the head node of your SLURM job:

$ export MASTER=$(hostname -f):7077
$ start-master.sh

Starting the master process can take a while.

Workers

Now the worker processes can be spawned on the nodes reserved for your job:

$ srun spark-class org.apache.spark.deploy.worker.Worker spark://$MASTER -d $SPARK_WORKER_DIR &
create_spark_cluster.sh
#!/bin/bash
 
CLUSTERWORKDIR="$PWD"
export SPARK_LOG_DIR="$CLUSTERWORKDIR/log"
export SPARK_WORKER_DIR="$CLUSTERWORKDIR/run"
# export SPARK_CONF_DIR="$HOME/.spark-config"   # uncomment this line to create a secure cluster
export MASTER=$(hostname -f):7077
export MASTER_URL=spark://$MASTER
WAITTIME=10s
 
echo Starting master on $MASTER
mkdir -p $SPARK_LOG_DIR $SPARK_WORKER_DIR
start-master.sh
echo "wait $WAITTIME to allow master to start"
sleep $WAITTIME
echo Starting workers
srun spark-class org.apache.spark.deploy.worker.Worker $MASTER_URL -d $SPARK_WORKER_DIR &
echo "wait $WAITTIME to allow workers to start"
sleep $WAITTIME

Make sure that all workers have started before you submit any Spark jobs.
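Instead of a fixed sleep, one could poll the master log until it reports readiness. The helper below is a hypothetical sketch (the exact log message emitted by the master is an assumption, and the demo uses a stand-in log file rather than a real $SPARK_LOG_DIR log):

```shell
# Hypothetical helper: poll a log file until a pattern appears, or time out.
wait_for_line() {
    file=$1; pattern=$2; tries=${3:-30}
    while [ "$tries" -gt 0 ]; do
        grep -q "$pattern" "$file" 2>/dev/null && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Demo with a stand-in log file; in create_spark_cluster.sh one would pass
# the master log under $SPARK_LOG_DIR and a pattern such as "Started"
# (the real log text should be checked against your Spark version).
LOG=$(mktemp)
( sleep 2; echo "INFO Master: Started" >> "$LOG" ) &
if wait_for_line "$LOG" "Started" 10; then
    echo "master is up"
fi
```

Polling is more robust than a fixed WAITTIME on a busy cluster, where daemon startup times vary.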

Example:

$ spark-submit --total-executor-cores 20 --executor-memory 5G /path/example.py file:///path/Spark/Data/project.txt 

SLURM submission script example

In the following example, the previously described example script create_spark_cluster.sh is used.

spark_cluster_job_in_slurm.sh
#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00
#SBATCH --mem 20000
#SBATCH --ntasks-per-node 8
#SBATCH --cpus-per-task 5
#SBATCH -p parallel
#SBATCH -C anyarch
 
echo ""
echo " Starting the Spark Cluster "
echo ""
 
./create_spark_cluster.sh
 
echo $MASTER
 
echo ""
echo " About to run the spark job"
echo ""
 
spark-submit --total-executor-cores 16 --executor-memory 4G /path/example.py file:///path/Spark/Data/project.txt 
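As a sanity check for --total-executor-cores, the number of cores available to the standalone cluster follows from the SBATCH settings above (nodes × ntasks-per-node × cpus-per-task); the requested 16 executor cores must not exceed it:

```shell
# Cores available to the standalone cluster spawned by the job above:
NODES=2                # SBATCH -N 2
TASKS_PER_NODE=8       # SBATCH --ntasks-per-node 8
CPUS_PER_TASK=5        # SBATCH --cpus-per-task 5
TOTAL=$((NODES * TASKS_PER_NODE * CPUS_PER_TASK))
echo "$TOTAL cores in the allocation"    # prints "80 cores in the allocation"
```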
  • Last modified: 2021/08/10 09:52
  • by noskov