Apache Spark Challenges
Overview
Note: Tasks 1, 2, and 3 closely resemble a typical workflow when working with Apache Spark:
- Step 1: Interactively work with a small sample of the problem
- Step 2: Solve and optimize the sample problem
- Step 3: Submit the entire larger problem as a batch job
- Step 4: Analyze the result and, if necessary, repeat steps 1 to 4
You should employ this workflow in Task 4 and Task 5.
Task 1 - Classic Distributed Problem: Token Counting
Given a string of tokens, count the number of times each token appears. You should do this task in an interactive JupyterLab notebook connected to a Spark cluster. This is a canonical problem in distributed data processing and often serves as an introductory example for the MapReduce programming model.
Hint: Have a look at map() and reduceByKey()
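Below is a minimal sketch of the token-counting pattern with the RDD API, assuming a SparkSession named spark already exists in the notebook (as it would once you are connected to the cluster); the input string is a toy placeholder, not part of the task.
```python
# Minimal token-counting sketch using the RDD API.
# Assumes a SparkSession called `spark` already exists (e.g. created in the notebook).
text = "the quick brown fox jumps over the lazy dog the fox"

tokens = spark.sparkContext.parallelize(text.split())  # distribute the tokens
counts = (
    tokens.map(lambda t: (t, 1))            # map each token to a (token, 1) pair
          .reduceByKey(lambda a, b: a + b)  # sum the counts for each token
)
print(counts.collect())  # e.g. [('the', 3), ('fox', 2), ...]
```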
Task 2 - Cluster Set-up Bash Scripts
Write Bash scripts to streamline the process of installing Spark and running the cluster.
Hint: Try to combine the steps from the set-up subchapter.
Task 3 - Spark and Slurm
Submit Task 1 as a Spark job using Slurm. This should be similar to the job batching subchapter.
Hint:
- You will need to convert the notebook into a Python file.
- Compare the content of $SPARK_HOME/examples/src/main/python/pi.py with our Monte Carlo Pi Estimation. They both solve the same problem; however, there are some things we do not need to add when directly using spark-submit. Why? (See the sketch after this list.)
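As a hedged illustration of one possible shape for the converted file, the sketch below reuses the Task 1 token counting; the file name token_count.py is hypothetical. Like pi.py, a standalone script builds (and stops) its own SparkSession, while the cluster details are supplied by spark-submit rather than hard-coded in the code.
```python
# token_count.py -- hypothetical standalone version of the Task 1 notebook,
# suitable for `spark-submit token_count.py` (e.g. from a Slurm batch script).
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Unlike the notebook, a standalone script creates its own SparkSession;
    # the master URL and resources come from spark-submit, not from this file.
    spark = SparkSession.builder.appName("TokenCount").getOrCreate()

    text = "the quick brown fox jumps over the lazy dog the fox"
    counts = (
        spark.sparkContext.parallelize(text.split())
             .map(lambda t: (t, 1))
             .reduceByKey(lambda a, b: a + b)
    )
    print(counts.collect())

    spark.stop()  # release cluster resources when the job finishes
```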
Task 4 - Data Processing
In this task, we will start working with a DataFrame and try to process a given real-world dataset.
The dataset, at around 100 MB, is small and not especially well-suited to Spark (opting for Pandas might be more efficient). Nevertheless, working with this dataset serves as an exercise to understand more about Spark concepts and capabilities.
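As a starting point, here is a hedged sketch of loading a dataset into a DataFrame and running a simple aggregation; the file path and the column names some_category and some_value are placeholders, not the actual dataset's schema.
```python
# Hypothetical sketch of loading and inspecting a dataset as a DataFrame.
# The file path and column names are placeholders for the real dataset.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataProcessing").getOrCreate()

df = (
    spark.read
         .option("header", True)       # first row contains column names
         .option("inferSchema", True)  # let Spark guess column types
         .csv("data/dataset.csv")      # placeholder path
)

df.printSchema()
df.describe().show()

# Example transformation: drop rows with nulls, then aggregate by a placeholder column.
cleaned = df.dropna()
cleaned.groupBy("some_category").agg(F.avg("some_value").alias("avg_value")).show()
```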
Task 5 - Spark Machine Learning
We will use the data from Task 4 to build an introductory machine learning model, Linear Regression, with MLlib.
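A minimal sketch of a Linear Regression workflow with MLlib follows; the small in-memory DataFrame and the column names feature_1, feature_2, and label are placeholders standing in for the processed Task 4 data.
```python
# Hypothetical MLlib Linear Regression sketch; the DataFrame and column names
# are placeholders and should be replaced with the processed Task 4 data.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Placeholder data standing in for the cleaned Task 4 DataFrame.
cleaned = spark.createDataFrame(
    [(1.0, 2.0, 3.5), (2.0, 1.0, 4.0), (3.0, 0.5, 5.5), (4.0, 3.0, 8.0)],
    ["feature_1", "feature_2", "label"],
)

# MLlib expects the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
data = assembler.transform(cleaned).select("features", "label")

lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)

print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
print("Training RMSE:", model.summary.rootMeanSquaredError)
```
On the real dataset you would typically hold out a test split (e.g. with randomSplit) instead of reporting only the training error as this toy sketch does.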