BIG DATA COMPUTING

Course objectives

General objectives:
- Knowledge of the main application scenarios in Big Data Computing.
- Knowledge and understanding of the main algorithms and approaches in Big Data Computing.
- Knowledge of the main tools to implement them.
- Understanding of the theoretical foundations underlying the main techniques of analysis.
- Ability to implement the aforementioned algorithms, approaches and techniques and to apply them to specific problems and scenarios.
- Knowledge of the main evaluation techniques and their application to practical scenarios.

Specific objectives:

Ability to:
- identify the most suitable techniques to address a data analysis problem in which data dimensionality is a concern;
- implement the proposed solution, identifying the most appropriate design and implementation tools among those available;
- design and implement experiments to evaluate proposed solutions in realistic settings.

Knowledge and understanding:
- knowledge of the main application scenarios;
- knowledge of the main techniques of analysis;
- understanding of the methodological and theoretical foundations of the main analysis techniques;
- knowledge and understanding of the main evaluation techniques and the corresponding performance indices.

Apply knowledge and understanding:
- being able to translate application needs into specific data analysis problems;
- being able to identify aspects of the problem for which data dimensionality might play a critical role;
- being able to identify the most suitable techniques and tools to address the aforementioned problems;
- being able to estimate in advance, at least qualitatively, the degree of scalability of proposed solutions.

Critical and judgment skills: Being able to evaluate, also experimentally, the effectiveness and efficiency of proposed solutions.

Communication skills: Being able to effectively describe the requirements of a problem and to provide third parties with the corresponding specifications, the design choices, and the reasons underlying these choices.

Learning ability: The course will facilitate the development of skills for the independent study of topics related to the course. It will also allow students to identify and critically examine material contained in advanced manuals and/or scientific literature, allowing them to face new application scenarios and/or apply alternative techniques to known ones.

Channel 1
ARISTIDIS ANAGNOSTOPOULOS Lecturers' profile
LUCA BECCHETTI Lecturers' profile

Program - Frequency - Exams

Course program
Large-scale computation and mining large graphs
- Standard approaches and the MapReduce-like paradigm
- Solving basic problems using Apache Spark
- MapReduce/Hadoop-like algorithms for triangle counting and connected components
- A quick tour of community detection in large graphs

Hashing and sampling techniques for neighborhood search
- Neighborhood search and the curse of dimensionality in Euclidean spaces
- Reducing dimensions in Euclidean spaces via hashing
- Extensions to different metrics
- Efficient neighborhood search via hashing and bucketing
- Reducing search space via sampling

Dimensionality reduction
- SVD and basic approach
- Sparsification techniques and CUR
- Random projections

Sketching and sampling for streams of data
- Estimating frequency moments in sliding windows
- Sketching algorithms for heavy hitters tracking
- Sketching techniques for join size estimation
- Sketching techniques for large graph mining, with application to neighborhood search and analysis of community structure
- Distributed algorithms following the MapReduce paradigm
- Graph semi-streaming algorithms
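To give a flavour of one topic in the program (MapReduce-like algorithms for triangle counting), here is a minimal sketch in plain Python. It only imitates the two-round map/reduce structure on a single machine and is not taken from the course material: the function name `count_triangles`, the edge-list input format, and the assumption that node labels are mutually comparable are all illustrative choices.

```python
from collections import defaultdict
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected simple graph, MapReduce style.

    `edges` is an iterable of (u, v) pairs (hypothetical input format).
    Node labels must be comparable with `<`.
    """
    # Normalise: drop self-loops and duplicates, orient each edge
    # from its smaller endpoint to its larger one.
    edge_set = {(min(u, v), max(u, v)) for u, v in edges if u != v}

    # Round 1 ("map" + "shuffle"): group the larger endpoints by the
    # smaller endpoint, which plays the role of the key.
    higher_neighbours = defaultdict(set)
    for u, v in edge_set:
        higher_neighbours[u].add(v)

    # Round 1 ("reduce") and round 2: for each key u, every pair (v, w)
    # of its larger neighbours is an open wedge; the wedge closes into
    # a triangle exactly when the edge (v, w) is present.
    triangles = 0
    for u, higher in higher_neighbours.items():
        for v, w in combinations(sorted(higher), 2):
            if (v, w) in edge_set:
                triangles += 1
    return triangles


if __name__ == "__main__":
    # Complete graph on 4 nodes.
    k4 = [(a, b) for a, b in combinations(range(4), 2)]
    print(count_triangles(k4))
```

Because every edge is oriented from the smaller to the larger endpoint, each triangle is discovered only at its smallest node and is therefore counted exactly once. In a real distributed setting the grouping step would be a shuffle across machines and the wedge check a join against the edge list.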
Prerequisites
- Linear algebra
- Calculus and basic knowledge of probability theory and statistics
- Programming, fundamental algorithms and data structures
Books
- Selected sections of "Foundations of Data Science", by Avrim Blum, John Hopcroft, and Ravindran Kannan, available at https://www.cs.cornell.edu/jeh/book.pdf
- Selected sections and chapters of "Mining of massive datasets" (2nd edition), by Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Cambridge University Press, 2014.
- Scientific papers and online resources. Pointers will be given by the instructor when needed.
Frequency
Attendance at theoretical and practical lectures is not mandatory, but it is strongly advised.
Exam mode
- Theoretical and practical homework on the topics covered during the course
- Written exam
- Oral exam
Lesson mode
Lectures will be held in person. Part of the lectures will be hands-on, with students working together with the instructor to implement the notions and concepts introduced in the course.
  • Lesson code: 1044406
  • Academic year: 2025/2026
  • Course: Data Science
  • Curriculum: Single curriculum
  • Year: 2nd year
  • Semester: 1st semester
  • SSD: ING-INF/05
  • CFU: 6