Data Mining

Course objectives

The goal of the course is twofold. First, it will present the main theory behind the analysis of data. Second, it will be hands-on and at the end students will become familiar with various state-of-the-art tools and techniques for analyzing data. We will use Python for downloading data as well as its rich machine-learning libraries, the R environment for statistical processing, and the MapReduce framework for mining of large-scale data.

Channel 1
ARISTIDIS ANAGNOSTOPOULOS Lecturers' profile
LUCA BECCHETTI Lecturers' profile

Program - Frequency - Exams

Course program
Large scale computation and mining large graphs Standard approaches and Map Reduce - like paradigm. Solving basic problems using Apache Spark MapReduce/Hadoop - like algorithms for triangle counting and connected components A quick tour of community detection in large graphs Hashing and sampling techniques for neighborhood search Neighborhood search and the curse of dimensionality in euclidean spaces Reducing dimensions in euclidean spaces via hashing Extensions to different metrics Efficient neighborhood search via hashing and bucketing Reducing search space via sampling Dimensionality reduction SVD and basic approach Sparsification techniques and CUR Random projections Sketching and sampling for streams of data Estimating frequency moments in sliding windows Sketching algorithms for heavy hitters tracking Sketching techniques for join size estimation Sketching techniques for large graph mining, with application to neighborhood search, and analysis of community structure Distributed algorithms in following MapReduce paradigm Graph semi-streaming algorithms
Prerequisites
- Linear algebra - Calculus and basic knowledge of probability theory and statistics - Programming, fundamental algorithms and data structures
Books
- Selected sections of "Foundations of Data Science", by Avrim Blum, John Hopcroft, and Ravindran Kannan, available at https://www.cs.cornell.edu/jeh/book.pdf - Selected sections and chapters of "Mining of massive datasets" (2nd edition). Cambridge University Press, 2014, by Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. - Scientific papers and on line resources. Pointers will be given by the instructor when needed
Frequency
Attendance of theoretical and practical lectures is not mandatory, but it is strongly advised.
Exam mode
- Theoretical and practical homeworks on the topics covered during the course - Written exam - Oral exam
Lesson mode
Lectures will be held in presence. Part of the lectures will be applied, with the students involved, together with the instructor, in implementing notions and concepts introduced in the course
  • Lesson code1038138
  • Academic year2025/2026
  • CourseEngineering in Computer Science and Artificial Intelligence
  • CurriculumSingle curriculum
  • Year2nd year
  • Semester1st semester
  • SSDING-INF/05
  • CFU6