BIG DATA COMPUTING

Course objectives

General goals: The course trains students in fundamental algorithmic and programming techniques for big-data computing, tackling a variety of data mining problems on computational models used for managing massive information structures.
Specific goals: Ability to analyze, model, and solve typical "Big Data" tasks by implementing machine learning pipelines using PySpark over distributed environments.
Knowledge and understanding: At the end of the course, students will have a deep understanding of programming models for distributed data analysis on large clusters of computers, as well as of advanced computational models for processing massive amounts of data (e.g., data streaming, MapReduce-style parallelism, and I/O-efficient algorithms).
Applying knowledge and understanding: Students will be able to design and analyze algorithms in different big data settings, to write efficient code that takes into account the architectural features of modern computing platforms (including distributed systems), and to make use of good programming practices and advanced programming frameworks, such as Hadoop.
Critical and judgmental skills: Students will be able to identify the settings in which each computational paradigm for big data analysis is appropriate, to evaluate the advantages and disadvantages of each model, and to face challenges arising in the design and implementation of diverse big data applications.
Communication skills: Students will be able to communicate effectively, clearly summarizing the main ideas in the design of big data systems and algorithms and presenting accurate technical information.
Learning skills: The class aims to be broad and to touch upon a variety of techniques, introducing standard practices as well as cutting-edge research topics in this area, so that students can extend their knowledge independently as technology changes and evolves.

Channel 1
DANIELE DE SENSI

Course program
Introduction
- The Big Data Phenomenon
- The Big Data Infrastructure
- Datacenters and their relevance to Big Data workloads
Datacenter Architecture: Compute
- Introduction to GPUs, TPUs, and other computer architectures
Datacenter Architecture: Network
- Limitations of TCP for high-performance and big data workloads
- RDMA
- Datacenter network topologies
- Congestion control and routing algorithms
- In-network compute: SmartNICs and programmable switches, and use cases involving big data processing
Datacenter Architecture: Storage
- Brief introduction
Big Data Frameworks
- Distributed File Systems (HDFS)
- MapReduce (Hadoop)
- Spark
- PySpark + Google Colaboratory
Unsupervised Learning: Clustering
- Similarity Measures
- Algorithms: K-means
- Example: Document Clustering
Dimensionality Reduction
- Feature Extraction
- Algorithms: Principal Component Analysis (PCA)
- Example: PCA + Handwritten Digit Recognition
Supervised Learning
- Basics of Machine Learning
- Regression/Classification
- Algorithms: Linear Regression/Logistic Regression/Random Forest
- Examples:
  - Linear Regression: House Pricing Prediction (i.e., predict the price at which a house will be sold)
  - Logistic Regression/Random Forest: Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank term deposit)
Recommender Systems
- Content-based vs. Collaborative filtering
- Algorithms: k-NN, Matrix Factorization (MF)
- Example: Movie Recommender System (MovieLens)
Graph Analysis
- Link Analysis
- Algorithms: PageRank
- Example: Ranking (a sample of) the Google Web Graph
Real-time Analytics
- Streaming Data Processing
- Example: Twitter Hate Speech Detector
Prerequisites
The course assumes that students are familiar with the basics of data analysis, machine learning, computer architecture, and computer networks, supported by a strong knowledge of the foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably in Python).
Books
- The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
- RDMA Aware Networks Programming User Manual
- Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] (available online)
- Big Data Analysis with Python [Marin, Shukla, VK]
- Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
- Spark: The Definitive Guide [Chambers, Zaharia]
- Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
- Hadoop: The Definitive Guide [White]
- Python for Data Analysis [McKinney]
Frequency
Not mandatory.
Exam mode
Oral examination consisting of the presentation of a project and/or a scientific paper on topics covered in the course. The oral exam also includes questions on any subject presented during lectures.
GABRIELE TOLOMEI

Course program
Introduction
- The Big Data Phenomenon
- The Big Data Infrastructure
- Distributed File Systems (HDFS)
- MapReduce (Hadoop)
- Spark
- PySpark + Google Colaboratory
Unsupervised Learning: Clustering
- Similarity Measures
- Algorithms: K-means
- Example: Document Clustering
Dimensionality Reduction
- Feature Extraction
- Algorithms: Principal Component Analysis (PCA)
- Example: PCA + Handwritten Digit Recognition
Supervised Learning
- Basics of Machine Learning
- Regression/Classification
- Algorithms: Linear Regression/Logistic Regression/Random Forest
- Examples:
  - Linear Regression: House Pricing Prediction (i.e., predict the price at which a house will be sold)
  - Logistic Regression/Random Forest: Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank term deposit)
Recommender Systems
- Content-based vs. Collaborative filtering
- Algorithms: k-NN, Matrix Factorization (MF)
- Example: Movie Recommender System (MovieLens)
Graph Analysis
- Link Analysis
- Algorithms: PageRank
- Example: Ranking (a sample of) the Google Web Graph
Real-time Analytics
- Streaming Data Processing
- Example: Twitter Hate Speech Detector
Prerequisites
The course assumes that students are familiar with the basics of data analysis and machine learning, supported by a strong knowledge of the foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably in Python).
Books
- Mining of Massive Datasets [Leskovec, Rajaraman, Ullman] (available online)
- Big Data Analysis with Python [Marin, Shukla, VK]
- Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
- Spark: The Definitive Guide [Chambers, Zaharia]
- Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
- Hadoop: The Definitive Guide [White]
- Python for Data Analysis [McKinney]
Teaching mode
Classes in which both the theoretical and practical aspects of each course topic are discussed and analyzed in depth.
Frequency
Not mandatory.
Exam mode
Development of a software project on the topics covered in the course. The project must be agreed upon with and approved by the professor, and it will be discussed and evaluated during an oral examination session, which may include questions on any subject presented during lectures.
Lesson mode
Classes are held in person.
  • Lesson code: 1041764
  • Academic year: 2025/2026
  • Course: Computer Science
  • Curriculum: Single curriculum
  • Year: 1st year
  • Semester: 1st semester
  • SSD: INF/01
  • CFU: 6