Apache Spark
Apache Spark is an open-source, distributed computing system that has gained immense popularity for its speed, ease of use, and versatility in handling large-scale data processing tasks. Developed to overcome the limitations of the MapReduce paradigm, Spark offers a unified platform for various data processing workloads, including batch processing, real-time data streaming, machine learning, and graph processing.
Spark provides high-level APIs in languages like Scala, Java, Python, and R, making it accessible to a wide range of developers with different programming backgrounds.
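As an informal illustration of these high-level APIs, below is a minimal PySpark sketch that starts a local session and runs a small DataFrame aggregation. The local master setting, application name, and sample data are assumptions for illustration only; they are not part of the M3 cluster setup covered later in this chapter.

```python
# A minimal PySpark sketch (assumes pyspark is installed and a local
# Spark runtime is available); data and names are illustrative only.
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session; "local[*]" uses all available cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("spark-intro-example") \
    .getOrCreate()

# Build a small DataFrame in memory and run a simple aggregation.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.groupBy().avg("age").show()

spark.stop()
```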
In this chapter, we will:
- Set up a mini Spark cluster in M3.
- Take a closer look at Spark's core data structure, the Resilient Distributed Dataset (RDD); a minimal sketch follows this list.
- Explore data processing in Spark and JupyterLab.
- Submit batch jobs utilizing both Slurm and Spark.
- Engage in some challenges.
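As a small taste of the RDD material mentioned above, here is a hedged sketch of the RDD API run against a local master; the sample numbers and the squaring step are purely illustrative, and the full treatment of RDDs comes later in the chapter.

```python
# A minimal RDD sketch (assumes pyspark is installed and a local master);
# the input range and the squaring step are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Distribute a small Python list as an RDD, transform it lazily,
# then trigger computation with an action (reduce).
rdd = sc.parallelize(range(1, 11))
total_of_squares = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total_of_squares)  # 385

spark.stop()
```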
Notes:
- The material covered in this chapter draws heavily from the official documentation of Spark 3.5.0.
- The contents of Setting up a Spark Cluster within M3 via Slurm and Submit Spark Job inside Slurm Job are both derived from a trial-and-error approach and do not follow any official documentation, so they may not reflect best practice. If you have discovered alternative methods, more effective approaches, or even security vulnerabilities, please don't hesitate to submit a pull request.