· Get the notebooks from:
o https://github.com/cerndb/SparkTraining
o https://gitlab.cern.ch/hadoop/training/SparkTraining
· How to run the notebooks:
o CERN SWAN (recommended option):
o Local/private Jupyter notebook
o See also the SWAN gallery and the video:
· Introduction and objectives: slides and video
· Session 1: Apache Spark fundamentals
o Lecture Spark architecture and intro to DataFrames: slides and video
o Notebooks:
o Tutorial on Spark DataFrames with exercises video
· Session 2: Working with Spark DataFrames and SQL
o Lecture Introduction to Spark SQL: slides and video
o Notebooks:
o Tutorial on Spark SQL video
· Session 3: Building on top of the DataFrame API
o Lecture Spark as a Data Platform: slides and video
o Lecture Spark Streaming: slides and video
o Lecture Spark and Machine Learning: slides and video
o Notebooks:
o Tutorial on Spark Streaming video
o Tutorial on Spark Machine Learning regression task video
o Tutorial on Spark Machine Learning classification task with the Higgs dataset
o Demo of the Spark JDBC data source: how to read Oracle tables from Spark
o Note on Spark and Parquet format
· Session 4: How to scale out Spark jobs
o Lecture Running Spark on CERN resources: slides and video
o Notebooks:
o Demo on using SWAN with Spark on Hadoop video
o Demo of Spark processing Physics data using CERN private Cloud resources video
o Notebook example on how to run the TPC-DS benchmark at scale with Spark and Hadoop
o Notebook for the NXCALS project (data extraction)
o Example for NXCALS project (vector data and timestamps)
· How to monitor Spark execution: slides and video
o TPCDS_PySpark - workload generator
§ Watch TPCDS-PySpark demo and tutorial
o SparkMeasure - metrics collection
§ Watch sparkMeasure's getting started demo and tutorial
o Spark-Dashboard - real-time dashboards
§ Watch Spark-Dashboard demo and tutorial
· Spark as a library, examples of how to use Spark in Scala and Python programs: code and video
· Next steps: reading material and links, miscellaneous Spark notes
· Read and watch at your pace:
o Download the course material for offline use: slides.zip, github_repo.zip, videos.zip
o Watch the videos on YouTube
Author and contact: Luca Canali - Luca.Canali@cern.ch
CERN-IT Spark and Data Analytics Services
Former contributors: R. Castellotti, P. Kothuri
Many thanks to CERN Technical Training for their collaboration and support
License: CC BY-SA 4.0
Published: November 2022
Last modified: April 2024