Data Science

Natural Language Processing and Text Mining

Course objectives

General Objectives

1. Knowledge of the main application scenarios in analyzing collections of textual data using NLP techniques.
2. Knowledge and understanding of the main methodological and analytical challenges.
3. Knowledge of the main data analysis and machine learning techniques for natural language and the primary tools available to implement them.
4. Understanding of the theoretical foundations underlying advanced techniques for textual data analysis and natural language learning.
5. Ability to translate acquired notions into programs that solve specific problems.
6. Knowledge of the main evaluation techniques and their application to practical scenarios.

Specific Objectives

Abilities
- Identify the most suitable text-mining and/or NLP techniques to address a given problem.
- Implement the proposed solution by selecting the most appropriate tools.
- Design and conduct experiments to evaluate proposed solutions under realistic conditions.

Knowledge and Understanding
- Knowledge of the main application scenarios.
- Knowledge of the main analysis techniques.
- Understanding of the theoretical and methodological assumptions underlying the main techniques.
- Knowledge and understanding of the main evaluation techniques and corresponding performance indices.

Applying Knowledge and Understanding
- Translate application requirements into concrete data-analysis problems.
- Identify the most suitable techniques and tools to address those problems.
- Qualitatively estimate the scalability of the proposed solutions in advance.

Critical and Judgment Skills
- Evaluate experimentally the effectiveness, efficiency, and scalability of proposed solutions.

Communication Skills
- Effectively describe the requirements of a problem and communicate the chosen solutions and their rationale to others.

Learning Ability
- Develop independent-study skills on course-related topics and critically consult advanced manuals and scientific literature to tackle new scenarios or apply alternative techniques.

Channel 1

FABRIZIO SILVESTRI

Program - Frequency - Exams

Course program

Section I – Ranking and similarity search

1. Problems of interest. Document ranking. Link analysis: review of PageRank as a query-independent ranking algorithm. Context-dependent link analysis: Topic-sensitive and Personalized PageRank. Hubs and authorities: the HITS algorithm.
2. Similarity search in high dimensions: top-k and approximate near(est)-neighbour search. Similarity between sets and Jaccard similarity/distance. Minwise independent permutations. Ideal case and its analysis. Minwise signatures and Jaccard similarity estimation. Implementation with universal (pairwise independent) hash families. Improving accuracy: the banding technique. Estimation of false positive and false negative rates.
3. General properties of the banding technique. Locality Sensitive Hashing for other distance measures: Hamming and cosine distances.
4. Other techniques for efficient similarity search. Clustering and vector quantization: properties and limits. Product quantization: definition, properties, implementation. Efficiency of product quantization.
5. Graph-based methods. Navigable small world networks. Kleinberg's navigable small world network for 2-dimensional lattices. Navigable small world networks on Euclidean point sets. Search and network construction algorithms.

Section II – Dimensionality reduction and clustering

1. A motivating application: recommender systems and collaborative filtering.
2. A review of the Singular Value Decomposition: properties of the SVD (and PCA). Explained variance, best low-rank approximation in Frobenius norm. Using the SVD for classification and recommendation. Spectral embeddings using truncated SVD.

Section III – Deep learning and Natural Language Processing

1. Neural networks and Large Language Models.
2. Neural Information Retrieval.
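As an illustration of the minwise-signature and banding topics listed in Section I, the following is a minimal sketch, not course material. It assumes sets of integer identifiers and approximates minwise independent permutations with hash functions of the form (a*x + b) mod p. All names (`make_hash_family`, `minhash_signature`, `candidate_probability`), the prime, and the toy data are hypothetical choices made for this example.

```python
import random

# Assumption: a Mersenne prime larger than every element identifier used below.
PRIME = 2_147_483_647

def make_hash_family(num_hashes, seed=0):
    """Draw (a, b) pairs defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(items, family):
    """Signature = minimum hash value of the set under each hash function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in family]

def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)

def jaccard(s1, s2):
    """Exact Jaccard similarity |s1 & s2| / |s1 | s2|."""
    return len(s1 & s2) / len(s1 | s2)

def candidate_probability(s, bands, rows):
    """Probability that two sets with Jaccard similarity s agree on all rows
    of at least one band: 1 - (1 - s^rows)^bands."""
    return 1 - (1 - s ** rows) ** bands

# Toy data (hypothetical): two sets of 80 ids sharing 40 of them.
rng = random.Random(42)
ids = rng.sample(range(1_000_000), 120)
A, B = set(ids[:80]), set(ids[40:])

family = make_hash_family(256)
exact = jaccard(A, B)  # 40 shared ids out of 120 distinct ids
estimate = estimate_jaccard(minhash_signature(A, family),
                            minhash_signature(B, family))
print(f"exact={exact:.3f} estimate={estimate:.3f}")

# Banding with 20 bands of 5 rows: similar pairs are almost always candidates,
# dissimilar pairs rarely are.
print(candidate_probability(0.8, bands=20, rows=5))
print(candidate_probability(0.2, bands=20, rows=5))
```

The estimate improves as the number of hash functions grows. With b bands of r rows each, the false negative rate at similarity s is (1 - s^r)^b, and the false positive rate is the candidate probability evaluated at similarities below the chosen threshold.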

Prerequisites

- Linear algebra
- Calculus and basic knowledge of probability theory and statistics
- Programming, fundamental algorithms and data structures

Books

- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
- J. Leskovec, A. Rajaraman, and J. Ullman, Mining of Massive Datasets, Cambridge University Press.
- Dan Jurafsky and James H. Martin, Speech and Language Processing (3rd ed. draft).
- Lecture notes and scientific papers on the topics covered in the course.

Frequency

No mandatory attendance

Exam mode

Evaluation is based on:
- Homework assigned during the course and valid for the current academic year
- Written exam on the entire study plan, or on the parts of it corresponding to homework that the student did not deliver
- Oral exam

It is always possible to take the written exam plus an oral one.

Lesson mode

Lectures and exercises solved in the classroom

LUCA BECCHETTI

Lesson code: 10621173
Academic year: 2025/2026
Course: Data Science
Curriculum: Single curriculum
Year: 1st year
Semester: 2nd semester
SSD: ING-INF/05
CFU: 6
