· Get the notebooks from:
  o https://github.com/cerndb/SparkTraining
  o https://gitlab.cern.ch/hadoop/training/SparkTraining
· How to run the notebooks:
  o CERN SWAN (recommended option)
  o Local/private Jupyter notebook
  o See also the SWAN gallery and the video
· Introduction and objectives: slides and video
· Session 1: Apache Spark fundamentals
  o Lecture "Spark architecture and intro to DataFrames": slides and video
  o Notebooks:
  o Tutorial on Spark DataFrames with exercises: video
· Session 2: Working with Spark DataFrames and SQL
  o Lecture "Introduction to Spark SQL": slides and video
  o Notebooks:
  o Tutorial on Spark SQL: video
· Session 3: Building on top of the DataFrame API
  o Lecture "Spark as a Data Platform": slides and video
  o Lecture "Spark Streaming": slides and video
  o Lecture "Spark and Machine Learning": slides and video
  o Notebooks:
  o Tutorial on Spark Streaming: video
  o Tutorial on Spark Machine Learning, regression task: video
  o Tutorial on Spark Machine Learning, classification task with the Higgs dataset
  o Demo of the Spark JDBC data source: how to read Oracle tables from Spark
  o Note on Spark and the Parquet format
· Session 4: How to scale out Spark jobs
  o Lecture "Running Spark on CERN resources": slides and video
  o Notebooks:
  o Demo on using SWAN with Spark on Hadoop: video
  o Demo of Spark processing Physics data using CERN private Cloud resources: video
  o Example notebook for the NXCALS project
· How to monitor Spark execution: slides and video
· Spark as a library, examples of how to use Spark in Scala and Python programs: code and video
· Next steps: reading material and links, miscellaneous Spark notes
· Read and watch at your pace:
  o Download the course material for offline use: slides.zip, github_repo.zip, videos.zip
  o Watch the videos on YouTube
Author and contact for feedback and questions: Luca Canali - Luca.Canali@cern.ch
CERN-IT Spark and data analytics services
Former contributors: Riccardo Castellotti, Prasanth Kothuri
Many thanks to CERN Technical Training for their collaboration and support
License: CC BY-SA 4.0
Published: November 2022
Last modified: December 2023