Setting up a Spark cluster within the M3 cluster
Installation
Different Spark versions have different version requirements for Python and Java. We will use Spark 3.5, Python 3.10 (the JupyterLab environment), and Java 8 (the default on M3). There are two main things we need to set up:
- Miniconda
Conda can be installed using the conda-install module on M3. The JupyterLab environment currently uses Python 3.10, which meets the prerequisite for Spark 3.5. The same environment will later be used to run a notebook in STRUDEL. For those interested in customising their own conda environment on M3, detailed instructions can be found on this page.
module load conda-install
conda-install
- Spark Engine
Since M3 does not include a built-in module for Spark, we need to download and extract it ourselves. Other versions can be found here.
# Spark 3.5.0
wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar xzfv spark-3.5.0-bin-hadoop3.tgz
rm -rf spark-3.5.0-bin-hadoop3.tgz
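To confirm that both prerequisites are in place before going further, a quick check along these lines can help (a sketch: the Miniconda install path is an assumption, so adjust it to wherever conda-install placed it):
# Sanity check -- the Miniconda path below is an assumption, adjust to your installation
source ~/miniconda3/bin/activate jupyterlab   # activate the built-in jupyterlab environment
python --version                              # should report Python 3.10.x
java -version                                 # Java 8, the default on M3
/path/to/spark-3.5.0-bin-hadoop3/bin/spark-submit --version   # confirms the Spark engine was extracted correctly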
Getting the cluster running
Requesting nodes from M3
We can refer to the M3 chapter to request an interactive session, e.g.:
# Interactive session with 4 nodes, 4 tasks (4 CPUs each), 8 GB of memory per node and 1 hour of walltime on the m3i partition
smux new-session --jobname=testing_spark --ntasks=4 --time=0-01:00:00 --partition=m3i --mem=8GB --nodes=4 --cpuspertask=4
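Once the session starts, it is worth noting the first allocated hostname, since that node will act as the Spark master and is the {master_node} referenced later on (a minimal check, run from inside the smux session):
# The first host in the allocation is the one the Spark master will run on
scontrol show hostnames | head -n 1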
Setting up workers
- Write the names of the requested nodes to the workers config file within Spark
# Get the names of all allocated nodes
# Keep the 2nd node through to the last node (the 1st node is the master node)
# Write them to the workers config file inside the Spark engine directory
scontrol show hostnames | sed -n '2,$p' > /path/to/spark-3.5.0-bin-hadoop3/conf/workers
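A quick look at the resulting file should show every allocated node except the master:
# Each line of conf/workers is a host that will run a Spark worker
cat /path/to/spark-3.5.0-bin-hadoop3/conf/workers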
Starting the Spark cluster
- Start the cluster
source /path/to/spark-3.5.0-bin-hadoop3/sbin/start-all.sh
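start-all.sh launches a master process on the current node and a worker on every host listed in conf/workers. If something does not come up, the daemon logs are the first place to look (a sketch; the log location and the quoted message follow Spark's defaults):
# The master log records the URL (spark://...:7077) that workers and the notebook connect to
grep "Starting Spark master" /path/to/spark-3.5.0-bin-hadoop3/logs/*Master*.out
ls /path/to/spark-3.5.0-bin-hadoop3/logs/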
Connecting Jupyter Lab (STRUDEL) to the Cluster
Spark’s shell offers a straightforward way to learn the API and is a solid tool for interactive data analysis. For more involved work, however, a friendlier interface helps, so we'll utilise JupyterLab, which is made available through STRUDEL.
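For a quick interactive test without a notebook, the bundled PySpark shell can be pointed straight at the standalone master (a sketch; replace {master_node} with the first node allocated to the job):
# Launch the interactive PySpark shell against the running cluster
/path/to/spark-3.5.0-bin-hadoop3/bin/pyspark --master spark://{master_node}.massive.org.au:7077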
Requesting a notebook through STRUDEL
When Miniconda is installed on M3, it comes with a built-in environment named jupyterlab. This environment is automatically added to STRUDEL, and we can launch a JupyterLab session from it.
Exporting environment variables
import sys
import os
# Specify paths, which will be set as environment variables and read by the Spark engine
SPARK_HOME = "/path/to/spark-3.5.0-bin-hadoop3"
os.environ["SPARK_HOME"] = SPARK_HOME
PREV_PATH = os.environ["PATH"]
os.environ["PATH"] = f"{SPARK_HOME}/bin:{PREV_PATH}"
# Specify which Python interpreter the driver and workers should use
os.environ["PYSPARK_PYTHON"] = sys.executable
# Add pyspark to the PYTHONPATH
PYSPARK_HOME = f"{SPARK_HOME}/python"
sys.path.append(PYSPARK_HOME)
# Different Spark versions bundle different py4j versions (see $SPARK_HOME/python/lib)
sys.path.append(f"{PYSPARK_HOME}/lib/py4j-0.10.9.7-src.zip")
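The same configuration can also be expressed as shell exports, which is handy when launching PySpark from a terminal rather than a notebook (a sketch assuming the same paths and py4j version as above):
# Shell equivalent of the notebook cell above
export SPARK_HOME=/path/to/spark-3.5.0-bin-hadoop3
export PATH="$SPARK_HOME/bin:$PATH"
export PYSPARK_PYTHON="$(which python)"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"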
Connecting to the Spark master
import pyspark
from pyspark.sql import SparkSession
# Initialise a Spark session connected to the cluster's master node
# (replace {master_node} with the first node allocated by Slurm)
spark = SparkSession \
    .builder \
    .master("spark://{master_node}.massive.org.au:7077") \
    .appName("test") \
    .config("spark.workers.show", "True") \
    .getOrCreate()
SSH Tunnel SparkUI to Localhost
We can use SSH tunnelling to forward the Spark UI to our local machine; the UI will then be available as a web application at http://localhost:4040.
# {jupyterlab_node} is the node on which the JupyterLab session (the Spark driver) is currently running
ssh -L 4040:{jupyterlab_node}.massive.org.au:4040 {username}@m3.massive.org.au
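Port 4040 serves the UI of the running application (the notebook acting as the driver). The standalone master also exposes a cluster-overview UI, by default on port 8080, which can be forwarded the same way if desired:
# Optional: forward the master's web UI as well (default port 8080)
ssh -L 8080:{master_node}.massive.org.au:8080 {username}@m3.massive.org.au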
Overview of what we have just done
The overall architecture can be illustrated with the following diagram:
- Each circle is a computing node of M3.
- A mini-cluster managed by Spark is running inside the big M3 cluster.
- We also have a notebook (from STRUDEL) connecting to the master node of the Spark cluster.