· Get the notebooks from:
  o https://github.com/cerndb/SparkTraining
  o https://gitlab.cern.ch/db/SparkTraining
· How to run the notebooks:
  o CERN SWAN (recommended option)
  o Local/private Jupyter notebook
  o See also the SWAN gallery and the video
· Introduction and objectives: slides and video
· Session 1: Apache Spark fundamentals
  o Lecture “Spark architecture and intro to DataFrames”: slides and video
  o Notebooks:
    o Tutorial on DataFrames with exercises – video
· Session 2: Working with Spark DataFrames and SQL
  o Lecture “Introduction to Spark SQL”: slides and video
  o Notebooks:
    o Tutorial on Spark SQL – video
· Session 3: Building on top of the DataFrame API
  o Lecture “Spark as a Data Platform”: slides and video
  o Lecture “Spark Streaming”: slides and video
  o Lecture “Spark and Machine Learning”: slides and video
  o Notebooks:
    o Tutorial on Spark Streaming – video
    o Tutorial on Spark Machine Learning – regression task – video
    o Tutorial on Spark Machine Learning – classification task with the Higgs dataset
    o Demo of the Spark JDBC data source: how to read Oracle tables from Spark
    o Note on Spark and Parquet format
· Session 4: How to scale out Spark jobs
  o Lecture “Running Spark on CERN resources”: slides and video
  o Notebooks:
    o Demo on using SWAN with Spark on Hadoop – video
    o Demo of Spark processing Physics data using CERN private Cloud resources – video
    o Example notebook for the NXCALS project
· Bonus material:
  o How to monitor Spark execution: slides and video
  o Spark as a library, examples of how to use Spark in Scala and Python programs: code and video
  o Next steps: reading material and links, miscellaneous Spark notes
· Read and watch at your pace:
  o Download the course material for offline use: slides.zip, github_repo.zip, videos.zip
  o Watch the videos on YouTube
Author and contact for feedback and questions: Luca Canali - Luca.Canali@cern.ch
CERN-IT Spark and data analytics services
Former contributors: Riccardo Castellotti, Prasanth Kothuri
Many thanks to CERN Technical Training for their collaboration and support
License: CC BY-SA 4.0
Published in November 2022