Introduction to Apache Spark APIs for Data Processing

Welcome to our Apache Spark course, presented by CERN IT! This self-paced course, accessible to all, gives a concise tour of Apache Spark's architecture and its fundamental components. It covers the core Spark APIs (DataFrame API, Spark SQL, Streaming, and Machine Learning), combining lectures with practical demonstrations.

Key highlights of the course:

Hands-On Learning: Engage in interactive tutorials and exercises predominantly in Python, utilizing Jupyter notebooks for a seamless learning experience.

Real-World Application: Discover how to efficiently deploy Spark on CERN’s computing infrastructure, especially leveraging the CERN SWAN service.

Relevance: Understand how Spark, a leading engine for large-scale data processing, is used at CERN in IT monitoring, security, and the BE NXCALS project, and by teams in ATLAS and CMS, and how it integrates with the Hadoop service and Cloud resources at CERN.

By the end of this course, you’ll have a solid understanding of how Spark operates within complex systems like those at CERN, and how it can be applied to a range of data processing challenges. Whether you are a beginner or looking to enhance your existing knowledge, this course is tailored to provide a foundational understanding of Spark's capabilities and real-world applications.

Accompanying notebooks

·       Get the notebooks from:

o   https://github.com/cerndb/SparkTraining

o   https://gitlab.cern.ch/hadoop/training/SparkTraining

·       How to run the notebooks:

o   CERN SWAN (recommended option)

o   Colab, Binder

o   Local/private Jupyter notebook

o   See also the SWAN gallery and the video:

 

Course lectures and tutorials

·       Introduction and objectives: slides and video

         


·       Session 1: Apache Spark fundamentals

o   Lecture “Spark architecture and intro to DataFrames”: slides and video


o   Notebooks:

o   Tutorial on Spark DataFrames with exercises – video

o   Solutions to the exercises

o   Examples of Pandas on Spark

·       Session 2: Working with Spark DataFrames and SQL

o   Lecture “Introduction to Spark SQL”: slides and video



o   Notebooks:

o   Tutorial on Spark SQL – video

o   Exercises on Spark SQL

o   Solutions to the exercises

·       Session 3: Building on top of the DataFrame API

o   Lecture “Spark as a Data Platform”: slides and video



o   Lecture “Spark Streaming”: slides and video



o   Lecture “Spark and Machine Learning”: slides and video



o   Notebooks:

o   Tutorial on Spark Streaming – video

o   Tutorial on Spark Machine Learning – regression task – video

o   Tutorial on Spark Machine Learning – classification task with the Higgs dataset

o   Demo of the Spark JDBC data source: how to read Oracle tables from Spark

o   Note on Spark and Parquet format

·       Session 4: How to scale out Spark jobs

o   Lecture “Running Spark on CERN resources”: slides and video



o   Notebooks:

o   Demo on using SWAN with Spark on Hadoop – video

o   Demo of Spark processing Physics data using CERN private Cloud resources – video

o   Notebook example on how to run the TPC-DS benchmark at scale with Spark and Hadoop

o   Notebook for the NXCALS project (data extraction)

o   Example for NXCALS project (vector data and timestamps)

 

Bonus material

·       How to monitor Spark execution: slides and video

·       Spark Performance Lab:

o   TPCDS_PySpark - workload generator

§  Watch the TPCDS-PySpark demo and tutorial

o   SparkMeasure - metrics collection

§  Watch sparkMeasure's getting started demo and tutorial

o   Spark-Dashboard - real-time dashboards

§  Watch the Spark-Dashboard demo and tutorial

·       Spark as a library, examples of how to use Spark in Scala and Python programs: code and video

·       Next steps: reading material and links, miscellaneous Spark notes

 

·       Read and watch at your own pace:

o   Download the course material for offline use:
 slides.zip, github_repo.zip, videos.zip

o   Watch the videos on YouTube

 

 

Acknowledgements and feedback

Author and contact: Luca Canali - Luca.Canali@cern.ch

CERN-IT Spark and Data Analytics Services

Former contributors: R. Castellotti, P. Kothuri

Many thanks to CERN Technical Training for their collaboration and support

 

License: CC BY-SA 4.0

Published: November 2022

Last modified: April 2024