Apache Spark APIs for Data Processing

Welcome

Welcome to Apache Spark APIs for Data Processing, a self-paced, interactive course developed by CERN IT. This course offers a practical introduction to Spark’s powerful architecture and essential capabilities, combining theoretical insights with engaging demonstrations and hands-on exercises.

Comprehensive API Coverage: Dive into Spark DataFrames, SQL queries, real-time Streaming, and Machine Learning to harness Spark's full potential.
Interactive Hands-On Tutorials: Build practical experience with guided exercises in Python using Jupyter notebooks.
Real-World Deployments: Explore how you can deploy Apache Spark at scale and see examples from CERN's accelerator logging systems, IT infrastructure monitoring, and physics research.

By course end, you will be equipped to confidently apply Apache Spark to large-scale data processing scenarios.

Accompanying Notebooks

Download Notebooks: Access all materials on the Spark Training GitHub Repository.
Run the Notebooks:
- CERN SWAN
- Colab
- Binder
- GitHub Codespaces
- Local Jupyter Notebook
Explore the SWAN Gallery and watch the intro video for more details.

Course Lectures & Tutorials

Introduction & Objectives:
Slides – Video
Session 1: Apache Spark Fundamentals
- Lecture: "Spark Architecture & DataFrames" – Slides – Video
- Notebooks:
Session 2: Spark DataFrames & SQL
- Lecture: "Introduction to Spark SQL" – Slides – Video
- Notebooks:
Session 3: Data Platform, Streaming & Machine Learning
- Lecture: "Spark as a Data Platform" – Slides – Video
- Lecture: "Spark Streaming" – Slides – Video
- Lecture: "Spark & Machine Learning" – Slides – Video
- Notebooks:
- Demo: Read Oracle Tables Using Spark JDBC
- Note: Understanding Spark with Parquet
Session 4: Scaling Out Spark Jobs
- Lecture: "Running Spark on CERN Resources" – Slides – Video
- Notebooks:

Bonus Material

Monitor Spark Execution: Slides – Video
Spark Performance Lab:
- TPCDS_PySpark – Benchmark workload generator.
- SparkMeasure – Metrics collection tool.
- Spark-Dashboard – Real-time dashboards.
Spark as a Library: Explore examples in Scala and Python: Code Repository – Video Demo
Next Steps: Dive deeper with Reading Materials & References and explore additional Spark Notes.
Offline Materials: Download the complete course package: slides.zip, github_repo.zip, videos.zip or watch on YouTube.

Credits

Author & Contact: Luca.Canali@cern.ch
Presented by: CERN-IT Data Analytics Services
Contributors: R. Castellotti, P. Kothuri
License: CC BY-SA 4.0
Published: November 2022 | Last modified: March 2025