Apache Spark on ARC: Difference between revisions
Ian.percel (talk | contribs) (insert sleep→Instantiate Spark Cluster) |
Ian.percel (talk | contribs) |
||
Line 30: | Line 30: | ||
import atexit | import atexit | ||
import sys | import sys | ||
import time | |||
import pyspark | import pyspark |
Revision as of 15:03, 18 March 2021
Overview
This guide gives an overview of running Apache Spark clusters under the existing scheduling system of the ARC cluster at the University of Calgary.
When activated by a Spark module, a python module is added to the path which allows provisioning of an Apache Spark cluster as a job on a Slurm cluster. This class is instantiated right inside of the Python code and submits a job to the cluster to gather resources during execution. Although this "Driver" process can be run on the login node for smaller examples, using a node in the "single" partition is recommended.
How to get a node in the single partition on ARC for 3 hours:
$ salloc -p single -N 1 -n 8 -c 1 --mem=0 -t 3:00:00
Procedure
Jupyter Notebook
- Point your browser to jupyter.ucalgary.ca and login using your IT username and password
- Create a new Python 3 notebook
- Paste the below code block into a cell in your new notebook
- Once you run the cell you will have an sc object in your environment. This is the "Spark Context". Executing methods on this object allows you to interact with the spark cluster you have just created.
From the Command Line
- Login to arc.ucalgary.ca using your IT username and password
- Load the spark module with "module load spark/2.2.0" BEFORE starting your Python interpreter.
- For interactive work, simply start your preferred Python. Examples: Jupyter Notebook, ipython, python.
Instantiate Spark Cluster
In your Python file or terminal load the appropriate python modules and instantiate the cluster:
import os
import atexit
import sys
import time
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
import findspark
from sparkhpc import sparkjob
#Exit handler to clean up the Spark cluster if the script exits or crashes
def exitHandler(sj,sc):
try:
print('Trapped Exit cleaning up Spark Context')
sc.stop()
except:
pass
try:
print('Trapped Exit cleaning up Spark Job')
sj.stop()
except:
pass
findspark.init()
#Parameters for the Spark cluster
nodes=1
tasks_per_node=24
memory_per_task=4096 #4 gig per process, adjust accordingly
# Please estimate walltime carefully to keep unused Spark clusters from sitting
# idle so that others may use the resources.
walltime="3:00" #3 hours
#os.environ['SBATCH_PARTITION']='cpu2019' #Set the appropriate ARC partition
sj = sparkjob.sparkjob(
ncores=nodes*tasks_per_node,
cores_per_executor=tasks_per_node,
memory_per_core=memory_per_task,
walltime=walltime
)
sj.wait_to_start()
time.sleep(60)
sc = sj.start_spark()
#Register the exit handler
atexit.register(exitHandler,sj,sc)
#You need this line if you want to use SparkSQL
sqlCtx=SQLContext(sc)
You now have a sc (Spark Context) and sqlCtx (SQL Context) objects to operate on. Please remember to cal sc.stop() and sj.stop() when you are finished.
There are many Spark tutorials out there. Here are some good places to look:
- https://spark.apache.org/docs/latest/quick-start.html
- https://spark.apache.org/docs/latest/rdd-programming-guide.html
- https://spark.apache.org/docs/latest/sql-programming-guide.html
HINT: It helps to google "pyspark" as that returns Python results instead of Scala which is another common language used to interact with Spark.