Text Box:
Introduction to Apache Spark APIs for Data Processing

Welcome to the website of the course on Apache Spark by CERN IT. The course is self-paced and open, it is a short introduction to the architecture and key abstractions used by Spark. Theory and demos cover the main Spark APIs: DataFrame API, Spark SQL, Streaming, Machine Learning. You will also learn how to deploy Spark on CERN computing resources, notably using the CERN SWAN service. Most tutorials and exercises are in Python and run on Jupyter notebooks.

Apache Spark is a popular engine for data processing at scale. Spark provides an expressive API and a scalable engine that integrates very well with the Hadoop ecosystem as well as with Cloud resources. Spark is currently used by several projects at CERN, notably by IT monitoring, by the security team, by the BE NXCALS project, by teams in ATLAS and CMS. Moreover, Spark is integrated with the CERN Hadoop service, the CERN Cloud service, and the CERN SWAN web notebooks service.

Accompanying notebooks

·       Get the notebooks from:

o   https://github.com/cerndb/SparkTraining

o   https://gitlab.cern.ch/db/SparkTraining

·       How to run the notebooks:

o   CERN SWAN (recommended option):

o   Colab , Binder

o   Local/private Jupyter notebook

o   See also the SWAN gallery and the video:

 

Course lectures and tutorials

·       Introduction and objectives: slides and video

         


·       Session 1: Apache Spark fundamentals

o   Lecture “Spark architecture and intro to DataFrames”: slides and video

Graphical user interface, diagram

Description automatically generated

o   Notebooks:

o   Tutorial on DataFrames with exercisesvideo Icon

Description automatically generated

o   Solutions to the exercises

·       Session 2: Working with Spark DataFrames and SQL

o   Lecture “Introduction to Spark SQL”: slides and video



o   Notebooks:

o   Tutorial on Spark SQLvideo Icon

Description automatically generated

o   Exercises on Spark SQL

o   Solutions to the exercises

·       Session 3: Building on top of the DataFrame API

o   Lecture “Spark as a Data Platform”: slides and video



o   Lecture “Spark Streaming”: slides and video



o   Lecture “Spark and Machine Learning”: slides and video



o   Notebooks:

o   Tutorial on Spark Streamingvideo Icon

Description automatically generated

o   Tutorial on Spark Machine Learning – regression taskvideo Icon

Description automatically generated

o   Tutorial on Spark Machine Learning – classification task with the Higgs dataset

o   Demo of the Spark JDBC data source how to read Oracle tables from Spark

o   Note on Spark and Parquet format

·       Session 4: How to scale out Spark jobs

o   Lecture “Running Spark on CERN resources”: slides and video



o   Notebooks:

o   Demo on using SWAN with Spark on Hadoopvideo Icon

Description automatically generated

o   Demo of Spark processing Physics data using CERN private Cloud resourcesvideo Icon

Description automatically generated

o   Example notebook for the NXCALS project

 

·       Bonus material:

o   How to monitor Spark execution: slides and video Icon

Description automatically generated

o   Spark as a library, examples of how to use Spark in Scala and Python programs: code and video Icon

Description automatically generated

o   Next steps: reading material and links, miscellaneous Spark notes

 

·       Read and watch at your pace:

o   Download the course material for offline use:
 slides.zip, github_repo.zip, videos.zip

o   Watch the videos on YouTube Logo, icon

Description automatically generated

 

 

Acknowledgements and feedback

Author and contact for feedback and questions: Luca Canali - Luca.Canali@cern.ch

CERN-IT Spark and data analytics services

Former contributors: Riccardo Castellotti, Prasanth Kothuri

Many thanks to CERN Technical Training for their collaboration and support

 

License: CC BY-SA 4.0

Published in November 2022