Methods of Text Data Analysis
Michal Vašinek

Course Information

Lectures and additional information are available on the website of doc. Dvorský (link).

Grading

Graded credit
70 points for active completion of exercises
30 points for completing and submitting a project

Exercises

18.2 – Introductory Exercise
25.2 – Automatic Word Completion (N-gram Model) – 7 points

In this task, you will create a simple language model. A language model is a statistical model that estimates the probability of a word occurring given its context. Here we focus on the n-gram model, which is based on the probability of n consecutive words occurring together. You are allowed to use artificial intelligence to any extent while solving this task.
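
For orientation, here is a minimal sketch of a bigram (n = 2) completion model; the file name corpus.txt and the whitespace tokenization are placeholder assumptions, not part of the assignment.

```python
# Minimal bigram completion sketch; "corpus.txt" is a placeholder corpus.
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count word pairs and store, for each word, the words that follow it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def complete(following, prev, k=3):
    """Return the k most probable continuations of `prev` with P(next | prev)."""
    counts = following[prev]
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(k)]

tokens = open("corpus.txt", encoding="utf-8").read().lower().split()
print(complete(train_bigram(tokens), "the"))
```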

Control Questions for the Task: Automatic Word Completion (N-gram Model)

4.3 – Pattern Matching in Text Data – 7 points

In this exercise, you will analyze different algorithms for pattern searching in text. You will focus on comparing three algorithms: brute force, Knuth-Morris-Pratt (KMP), and Boyer-Moore-Horspool (BMH). The goal is to understand when each algorithm is more advantageous and how they behave with different types of texts and patterns. You are allowed to use artificial intelligence to any extent while solving this task.
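
As a starting point, here is a sketch of one of the three algorithms, Boyer-Moore-Horspool, whose bad-character shift table is what distinguishes it from brute force:

```python
# Sketch of Boyer-Moore-Horspool pattern search.
def bmh_search(text, pattern):
    """Yield start indices of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return
    # Bad-character table: distance from each character's last occurrence
    # (excluding the final position) to the end of the pattern.
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            yield i
        i += shift.get(text[i + m - 1], m)  # shift by the last window character

print(list(bmh_search("abracadabra", "abra")))  # -> [0, 7]
```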

🎯 Bonus Task (+2 extra points)

18.3 – Automatic Word Correction and Fuzzy Search – 7 points

In this exercise, you will implement an algorithm for automatic word correction and analyze the efficiency of different approaches to fuzzy word search. The primary inspiration for the implementation is the well-known algorithm by Peter Norvig. Your goal is to implement the computation of edit distance and then create a system for automatic word correction based on the probability of word occurrence in a dictionary. You are allowed to use artificial intelligence to any extent while solving this task.
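
A possible starting point is the classic dynamic-programming computation of Levenshtein distance, sketched below; Norvig's corrector then ranks candidate words within a small edit distance by their frequency in a corpus.

```python
# Sketch of edit (Levenshtein) distance via dynamic programming,
# keeping only the previous row of the DP table.
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # -> 3
```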

🎯 Bonus Task (+2 extra points)

25.3 – Boolean Information Retrieval – Inverted Index and Queries – 7 points

In this assignment, you will explore the basic principles of Boolean search in textual data. You will create an inverted index with token normalization, parse and evaluate queries with parentheses and various logical operators. Then, you will extend your system with a compact representation of the index and analyze its efficiency. This task is designed to provide a deeper understanding of classical IR model principles. You are allowed to use artificial intelligence tools to any extent when solving this task.
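
Below is a minimal sketch of index construction and set-based evaluation of the three operators; the example documents are made up, and parsing parenthesized queries is left to the assignment.

```python
# Sketch of an inverted index with simple token normalization;
# AND / OR / NOT map directly onto set operations over posting sets.
import re
from collections import defaultdict

docs = {1: "Information retrieval is fun.",
        2: "Boolean retrieval uses an inverted index.",
        3: "An index maps tokens to documents."}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())  # lowercase, strip punctuation

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)

all_ids = set(docs)
print(index["retrieval"] & index["index"])  # AND -> {2}
print(index["retrieval"] | index["fun"])    # OR  -> {1, 2}
print(all_ids - index["index"])             # NOT -> {1}
```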

1.4 – Vector Space Model and tf-idf Computation – 7 points

In this exercise, you will explore practical work with the vector space model for document representation. You will manually compute tf-idf weights, compare documents using cosine similarity, and reflect on the limitations of this method. The task is designed to help you understand the principles of term weighting and document similarity, not just to use built-in functions. You may use artificial intelligence for implementation and design consultation, but the output must be your own work and interpretation; you are expected to be able to explain the topic, not merely present a tool's output.
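
The sketch below uses one common weighting, raw term frequency times idf = log(N / df); several variants exist, so match whichever definition the lecture uses.

```python
# Sketch of tf-idf weights and cosine similarity on toy documents.
import math
from collections import Counter

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "cat", "ate", "the", "mouse"],
        ["dogs", "chase", "cats"]]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) \
        * math.sqrt(sum(w * w for w in v.values()))
    return dot / den if den else 0.0

print(cosine(tfidf(docs[0]), tfidf(docs[1])))  # the two docs share "the", "cat"
```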

15.4 – Word Translation Using Vector Representations – 11 points

In this exercise, you will explore transferring meaning between languages using word embeddings and linear transformation. First, you will obtain bilingual data (vector representations and translation pairs), then implement a method to learn a transformation matrix using gradient descent, and finally evaluate the translation quality using accuracy. The exercise combines practical programming with understanding the mathematical foundations.
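
A sketch of the gradient-descent step for a mapping matrix W minimizing the mean squared Frobenius loss ||XW - Y||² / n; the random X and Y below stand in for the bilingual embedding pairs loaded in the task.

```python
# Learning a translation matrix W by plain gradient descent (NumPy sketch).
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 50                       # number of training pairs, embedding dim
X = rng.normal(size=(n, d))          # source-language word vectors (stand-in)
Y = rng.normal(size=(n, d))          # vectors of their translations (stand-in)

W = np.zeros((d, d))
lr = 0.05
for step in range(201):
    diff = X @ W - Y
    grad = 2.0 * X.T @ diff / n      # gradient of the mean squared loss
    W -= lr * grad
    if step % 50 == 0:
        print(step, np.mean(diff ** 2))

# Translation step (not shown): map a source vector v to v @ W, take the
# nearest target-language vector by cosine similarity, and report accuracy
# as the fraction of pairs whose nearest neighbour is the correct translation.
```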

22.4 – CBOW Model for Your Language – 20 or 30 points (exercise and project)

In this exercise, you will create your own Continuous Bag of Words (CBOW) model for your native language. The goal is to train the model to predict a word based on its context, thereby learning high-quality distributed word vector representations (embeddings). The task has two variants: using existing libraries (20 points) or full implementation from scratch (30 points).
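
For the library variant, one option (an assumption, not a prescribed tool) is gensim, whose Word2Vec trains a CBOW model when sg=0; corpus.txt is a placeholder for your own corpus.

```python
# Library-variant sketch: gensim Word2Vec in CBOW mode (sg=0).
from gensim.models import Word2Vec

# One sentence per line, whitespace-tokenized; real preprocessing is up to you.
sentences = [line.lower().split()
             for line in open("corpus.txt", encoding="utf-8")]

model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=5, sg=0, epochs=5)   # sg=0 selects CBOW
print(model.wv.most_similar("word", topn=5))    # try any frequent corpus word
```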

29.4 – Transformer for Dialogue Summarization – 26 points (exercise and project)

In this exercise, you will implement an encoder-decoder model based on the Transformer architecture for summarizing short dialogues from the SAMSum dataset. The goal is to create a system capable of automatically generating a summary of a given dialogue.
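
A minimal PyTorch skeleton of such an encoder-decoder is sketched below; the vocabulary size, model dimensions and the SAMSum tokenization pipeline are placeholders to be replaced by your own preprocessing.

```python
# Skeleton of a seq2seq Transformer built on torch.nn.Transformer.
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(512, d_model)     # learned positional encoding
        self.transformer = nn.Transformer(
            d_model, nhead, num_layers, num_layers,
            dim_feedforward=1024, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        def add_pos(x):
            return self.embed(x) + self.pos(
                torch.arange(x.size(1), device=x.device))
        # Causal mask: each summary position attends only to earlier ones.
        mask = self.transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(tgt.device)
        h = self.transformer(add_pos(src), add_pos(tgt), tgt_mask=mask)
        return self.out(h)                        # per-token vocabulary logits

model = Seq2SeqTransformer(vocab_size=8000)
src = torch.randint(0, 8000, (2, 64))   # batch of tokenized dialogues
tgt = torch.randint(0, 8000, (2, 16))   # shifted summary tokens
print(model(src, tgt).shape)            # torch.Size([2, 16, 8000])
```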