Apache Spark on ARC

From RCSWiki
Revision as of 20:30, 8 March 2022

Overview

This guide gives an overview of running Apache Spark clusters under the existing scheduling system of the ARC cluster at the University of Calgary.

If you want to use Spark with Jupyter via the OOD (Open OnDemand) system, skip to the section titled "Jupyter Notebook".

When a Spark module is loaded, a Python module is added to the path that allows an Apache Spark cluster to be provisioned as a job on a Slurm cluster. Its class is instantiated directly in your Python code and submits a job to the cluster to gather resources during execution. Although this "Driver" process can be run on the login node for smaller examples, using a node in the "single" partition is recommended.

To get a node in the single partition on ARC for 3 hours:

$ salloc -p single -N 1 -n 8 -c 1 --mem=0 -t 3:00:00

The first time you start Spark on ARC, you may need to install some additional packages locally, depending on which libraries you use. At a minimum, graphframes:graphframes is needed to use basic Spark data structures. To obtain it, run the following command in an interactive job:

$ pyspark --packages=graphframes:graphframes:0.3.0-spark2.0-s_2.11 --repositories=https://repos.spark-packages.org

You may need to install several packages to resolve all of the missing third-party Spark modules, depending on what you are doing. You can also correct the repositories list by adding the relevant string to the PYSPARK_SUBMIT_ARGS environment variable. After the first run, the jar files and configuration files are stored in your ~/.ivy2 directory, so you should not have to pass these arguments again.
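As an illustration of the PYSPARK_SUBMIT_ARGS approach, the following is a minimal sketch that builds the variable from Python before any Spark context is created. The helper name build_submit_args is hypothetical, and the package and repository strings are simply the ones from the command above. PySpark reads this variable when it launches, and the value is expected to end with "pyspark-shell".

```python
import os


def build_submit_args(packages, repositories):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper for illustration)."""
    parts = []
    if packages:
        # Multiple packages are passed as one comma-separated list
        parts.append("--packages " + ",".join(packages))
    if repositories:
        parts.append("--repositories " + ",".join(repositories))
    # The value must end with "pyspark-shell" for PySpark to start correctly
    parts.append("pyspark-shell")
    return " ".join(parts)


# Set the variable BEFORE creating a SparkContext, otherwise it has no effect
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["graphframes:graphframes:0.3.0-spark2.0-s_2.11"],
    ["https://repos.spark-packages.org"],
)
```

Setting the variable in the same process that later creates the Spark context keeps the package list next to the code that depends on it.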

Procedure

Note: The jupyter.ucalgary.ca system is being decommissioned and will soon redirect to the new system.

Jupyter Notebook

  1. Point your browser to ood-arc.rcs.ucalgary.ca and log in using the IT portal. If you have recently signed in to your email, you may not be prompted to log in.
  2. On the Home screen, click the "Jupyter + Spark Cluster" app.
  3. Fill out the form with the number of nodes, workers per node, and memory per CPU. These define the size of your Spark cluster.
  4. If you prefer JupyterLab, check the box to use it.
  5. Create a new Python 3 notebook.
  6. Paste the code block from the "Instantiate Spark Cluster" section into a cell in your new notebook.
  7. Once you run the cell, you will have an sc object in your environment. This is the "Spark Context". Calling methods on this object lets you interact with the Spark cluster you have just created.

From the Command Line

  1. Log in to arc.ucalgary.ca using your IT username and password.
  2. Load the Spark module with "module load spark/jupyterhub" BEFORE starting your Python interpreter.
  3. For interactive work, simply start your preferred Python environment, for example Jupyter Notebook, ipython, or python.
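The code in the next section reads three environment variables: SPARK_CONFIG_FILE, SPARK_MASTER_HOST, and SPARK_MASTER_PORT. This guide does not say which step sets them, so the sketch below simply assumes they should already be present and reports any that are missing before you try to connect. The helper name missing_spark_vars is hypothetical.

```python
import os

# Variables read by the cluster-instantiation code in this guide
REQUIRED_VARS = ("SPARK_CONFIG_FILE", "SPARK_MASTER_HOST", "SPARK_MASTER_PORT")


def missing_spark_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_spark_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("Spark environment looks complete")
```

Running this check first turns a KeyError deep inside the setup code into a readable message about what is missing.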

Instantiate Spark Cluster

In your Python file or terminal, import the appropriate Python modules and instantiate the cluster:

import os

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SQLContext

# Read "key value" pairs from the Spark config file, skipping blank lines
with open(os.environ['SPARK_CONFIG_FILE']) as f:
    conflines = [tuple(line.strip().split(None, 1)) for line in f if line.strip()]

# Build the configuration and point it at the running Spark master
conf = SparkConf()
conf.setAll(conflines)
conf.setMaster("spark://%s:%s" % (os.environ['SPARK_MASTER_HOST'], os.environ['SPARK_MASTER_PORT']))
sc = SparkContext(conf=conf)

#You need this line if you want to use SparkSQL
sqlCtx = SQLContext(sc)

#YOUR CODE GOES HERE

You now have sc (Spark Context) and sqlCtx (SQL Context) objects to operate on. Please remember to return to the OOD screen and terminate the Jupyter + Spark app when you are finished.
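As a quick check that the cluster responds, a small computation can be pushed through the Spark Context. The following is a minimal sketch, assuming sc is the object created above; the function name sum_of_squares is hypothetical and only standard RDD methods (parallelize, map, sum) are used.

```python
def sum_of_squares(sc, n=1000):
    """Distribute the integers 0..n-1 across the cluster and sum their squares."""
    numbers = sc.parallelize(range(n))
    # map runs on the workers; sum gathers the result back to the driver
    return numbers.map(lambda x: x * x).sum()


# In a notebook or session where sc already exists:
# print(sum_of_squares(sc))
```

If the call returns a number, the driver can reach the workers and the cluster is ready for real work.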

There are many Spark tutorials out there. Here are some good places to look:

HINT: It helps to search for "pyspark", as that returns Python results instead of Scala, another common language used to interact with Spark.