Apache Spark Challenges

Overview

Note: Tasks 1, 2, and 3 closely resemble a typical workflow when working with Apache Spark:

  • Step 1: Interactively work with a small sample of the problem
  • Step 2: Solve and optimize the sample problem
  • Step 3: Submit the entire larger problem as a batch job
  • Step 4: Analyze the result and, if necessary, repeat steps 1 to 4

You should apply this workflow to Task 4 and Task 5 as well.

Task 1 - Classic Distributed Problem: Token Counting

Given a string of tokens, count the number of times each token appears. You should do this task in an interactive JupyterLab notebook connected to a Spark cluster. This is a canonical problem in distributed data processing and often serves as an introductory example of the MapReduce programming model.

Hint: Have a look at map() and reduceByKey()
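
For orientation, here is a minimal sketch of the token-counting pattern using map() and reduceByKey(); the input file name and master URL are placeholders, not part of the exercise material:

```python
from pyspark.sql import SparkSession

# Build (or connect to) a SparkSession; the master URL is a placeholder.
spark = SparkSession.builder \
    .master("spark://master:7077") \
    .appName("TokenCount") \
    .getOrCreate()
sc = spark.sparkContext

# Split each line into tokens, map every token to (token, 1),
# then sum the counts per token with reduceByKey().
counts = (
    sc.textFile("tokens.txt")              # hypothetical input file
      .flatMap(lambda line: line.split())
      .map(lambda token: (token, 1))
      .reduceByKey(lambda a, b: a + b)
)

print(counts.collect())
```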

Task 2 - Cluster Set-up Bash Scripts

Write Bash scripts to streamline the process of installing Spark and running the cluster.

Hint: Try to combine the steps from the set-up subchapter.

Task 3 - Spark and Slurm

Submit Task 1 as a Spark job using Slurm. This should be similar to the job batching subchapter.

Hint:

  • You will need to convert the notebook into a Python file.
  • Compare the content of $SPARK_HOME/examples/src/main/python/pi.py with our Monte Carlo Pi Estimation. Both solve the same problem; however, there are things we don't need to include when submitting directly with spark-submit. Why? (See the sketch below.)
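
As a point of comparison, a spark-submit-ready script can stay quite bare, because spark-submit supplies the master URL and deployment settings on the command line. A minimal sketch, assuming the same token-counting problem as Task 1 and a placeholder input file:

```python
# token_count.py -- run with spark-submit, e.g. from a Slurm batch script.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # No master URL here: spark-submit passes it via --master,
    # so the same script runs locally or on the cluster unchanged.
    spark = SparkSession.builder.appName("TokenCount").getOrCreate()
    sc = spark.sparkContext

    counts = (
        sc.textFile("tokens.txt")          # hypothetical input file
          .flatMap(lambda line: line.split())
          .map(lambda token: (token, 1))
          .reduceByKey(lambda a, b: a + b)
    )

    print(counts.collect())
    spark.stop()
```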

Task 4 - Data Processing

In this task, we will start working with a DataFrame and try to process a given real-world dataset.

The dataset, at around 100 MB, is small and not particularly well suited to Spark (opting for Pandas might be more efficient). Nevertheless, working with this dataset serves as an exercise to understand more about Spark concepts and capabilities.
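
As a starting point, a typical DataFrame workflow might look like the sketch below; the file path, format, and column names are assumptions to be replaced with those of the actual dataset:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("DataProcessing").getOrCreate()

# Hypothetical CSV path and columns; replace with the real dataset.
df = spark.read.csv("data/dataset.csv", header=True, inferSchema=True)

df.printSchema()                     # inspect the inferred schema first

# Common processing steps: drop bad rows, derive a column, aggregate.
summary = (
    df.dropna(subset=["value"])
      .withColumn("value_scaled", F.col("value") / 100.0)
      .groupBy("category")
      .agg(F.count("*").alias("rows"),
           F.avg("value_scaled").alias("avg_value"))
)

summary.show()
```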

Task 5 - Spark Machine Learning

We will use the data from Task 4 to build an introductory machine learning model, linear regression, with MLlib.
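
A minimal MLlib linear regression pipeline could look like the following sketch; the input path and the feature/label column names are assumptions standing in for whatever comes out of Task 4:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Hypothetical output of Task 4; column names are placeholders.
df = spark.read.parquet("data/processed.parquet")

# MLlib expects the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["feature1", "feature2"],
                            outputCol="features")
data = assembler.transform(df).select("features", "label")

train, test = data.randomSplit([0.8, 0.2], seed=42)

lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)

# Evaluate on the held-out split.
test_summary = model.evaluate(test)
print("RMSE:", test_summary.rootMeanSquaredError)
print("R2:  ", test_summary.r2)
```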