Course Information
Grading
70 points for active completion of exercises
- 10 exercises worth 7 points each
- Submission:
- Exercises are submitted during the class in which they are assigned or during the following week's class.
- Late submissions (two or more weeks later) earn at most 3 points.
- Building a system for working with text documents.
Exercises
- Course description
In this task, you should try to create a simple language model. A language model is a statistical model that attempts to estimate the probability of a word occurring based on its context. In this task, we will focus on the n-gram model, which is based on the probability of n consecutive words occurring together. You are allowed to use artificial intelligence in any capacity while solving this task.
- Basic introduction to n-grams (easier) - 1 point
- Task:
- Load any text file in your language, e.g., from opus.nlpl.eu.
- Tokenize the text (split it into words). You can decide whether to remove punctuation (if you keep commas and periods in the text, treat them as separate words) and whether to convert the text to lowercase.
- Create unigram (single-word), bigram (two-word), and trigram (three-word) frequency models.
- Display the most frequent n-grams.
- Goal: Learn basic text processing and understand what n-grams are.
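A minimal sketch of the counting step for the task above (the file name corpus.txt and whitespace tokenization are placeholder assumptions; adapt them to your data):

```python
from collections import Counter

# Load and tokenize: lowercase, whitespace split (punctuation handling is up to you).
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

def ngrams(tokens, n):
    """Consecutive n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams, bigrams, trigrams = (Counter(ngrams(tokens, n)) for n in (1, 2, 3))

# The most frequent n-grams of each order.
for name, counts in (("unigrams", unigrams), ("bigrams", bigrams), ("trigrams", trigrams)):
    print(name, counts.most_common(10))
```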
- N-gram probability (intermediate) - 1 point
- Task:
- Based on your data, build bigram and trigram models and convert the frequency counts to probabilities.
- Calculate the probability of a word appearing after a given n-gram.
- Learn about Laplace smoothing and adjust the probability calculation accordingly.
- Goal: Get familiar with the probabilistic n-gram model and basic smoothing techniques.
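With add-one (Laplace) smoothing, the bigram estimate becomes P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size. A small self-contained sketch (the toy sentence is only an illustration):

```python
from collections import Counter

def bigram_probability(w1, w2, unigram_counts, bigram_counts, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(w2 | w1)."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[(w1,)] + vocab_size)

# Toy counts; in the exercise they come from your corpus.
tokens = "the cat sat on the mat".split()
unigram_counts = Counter((t,) for t in tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_probability("the", "cat", unigram_counts, bigram_counts, len(unigram_counts)))
```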
- Word prediction using a bigram model (intermediate) - 1 point
- Task:
- Implement a simple autocomplete function that suggests the most probable next word based on an input word.
- Test it on various inputs and compare the results.
- Goal: Understand the concept of word prediction and be able to create a basic autocomplete model.
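One possible shape of the autocomplete function, reusing plain bigram counts (the toy sentence is only an illustration):

```python
from collections import Counter

def suggest_next(word, bigram_counts, k=3):
    """The k words most frequently observed right after `word`."""
    followers = Counter({w2: c for (w1, w2), c in bigram_counts.items() if w1 == word})
    return [w for w, _ in followers.most_common(k)]

tokens = "i like tea i like coffee i drink tea".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(suggest_next("i", bigram_counts))   # ['like', 'drink']
```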
- Creating a text generator (more difficult) - 2 points
- Task:
- Use a trigram model to create a simple text generator.
- The model should generate additional words based on an input word and form several sentences.
- Goal: Apply the n-gram model to text generation.
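A minimal sketch of the generation loop: keep the last two generated words as state and sample a follower from the trigram table (the toy corpus is a placeholder):

```python
import random
from collections import defaultdict

def build_trigram_followers(tokens):
    """Map each word pair (w1, w2) to the list of words that followed it in the corpus."""
    followers = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        followers[(w1, w2)].append(w3)
    return followers

def generate(followers, w1, w2, max_words=30):
    out = [w1, w2]
    for _ in range(max_words):
        options = followers.get((w1, w2))
        if not options:               # dead end: no trigram starts with this pair
            break
        w1, w2 = w2, random.choice(options)
        out.append(w2)
    return " ".join(out)

tokens = "the cat sat on the mat and the cat slept on the mat".split()
print(generate(build_trigram_followers(tokens), "the", "cat"))
```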
- Model evaluation using perplexity (advanced) - 2 points
- Task:
- Study the concept of perplexity.
- Implement a method to compute the perplexity of a trained n-gram model.
- Compare the perplexity of different n-gram models (unigrams, bigrams, trigrams).
- Evaluate the model's quality on a test dataset.
- Goal: Learn how to measure the quality of a language model and understand the significance of perplexity.
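Perplexity over a test sequence with N predictions is exp(-(1/N) * sum of log P(w_i | history)). A sketch for the Laplace-smoothed bigram model from the previous parts (toy data):

```python
import math
from collections import Counter

def bigram_perplexity(test_tokens, unigram_counts, bigram_counts, vocab_size):
    """Perplexity of an add-one smoothed bigram model on a token sequence."""
    total_log_prob, n = 0.0, 0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        p = (bigram_counts[(w1, w2)] + 1) / (unigram_counts[(w1,)] + vocab_size)
        total_log_prob += math.log(p)
        n += 1
    return math.exp(-total_log_prob / n)

train = "the cat sat on the mat".split()
test = "the cat sat on the rug".split()
unigram_counts = Counter((t,) for t in train)
bigram_counts = Counter(zip(train, train[1:]))
print(bigram_perplexity(test, unigram_counts, bigram_counts, len(unigram_counts)))
```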
- Basic Introduction to N-grams
- What is an N-gram?
- How are unigram, bigram, and trigram frequencies calculated?
- Probability of N-grams
- How is the probability of a word occurrence calculated within an N-gram model?
- What is Laplace smoothing, and why is it used?
- Word Prediction Using a Bigram Model
- How can a bigram model be used for automatic word completion?
- How is the most probable next word determined?
- Creating a Text Generator
- How can a trigram model be used for text generation?
- How can the quality of generated text be improved?
- Model Evaluation Using Perplexity
- What is perplexity, and how is it calculated?
- What is the interpretation of the perplexity value for a language model?
- How does perplexity change with an increasing N-gram size?
In this exercise, you will analyze different algorithms for pattern searching in text. You will focus on comparing three algorithms: brute force, Knuth-Morris-Pratt (KMP), and Boyer-Moore-Horspool (BMH). The goal is to understand when each algorithm is more advantageous and how they behave with different types of texts and patterns. You are allowed to use artificial intelligence to any extent while solving this task.
- Implementation Preparation (easier) - 2 points
- Task:
- Prepare implementations of the three algorithms: brute force, KMP, and BMH (you may use AI).
- Modify the algorithms to return not only the found pattern positions but also the number of character comparisons.
- Goal: Obtain implementations of the algorithms and ensure they provide statistics on character comparisons.
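An illustrative way to instrument the comparison counter, shown here for brute force; KMP and BMH can expose the same (positions, comparisons) interface:

```python
def brute_force_search(text, pattern):
    """Return (match positions, number of character comparisons performed)."""
    positions, comparisons = [], 0
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:
            positions.append(i)
    return positions, comparisons

print(brute_force_search("AGCTAGCTAGCT", "AGCT"))   # ([0, 4, 8], ...)
```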
- Testing on Different Data (medium) - 2 points
- Task:
- Test the implementations on three different types of texts:
- Short text (~100 characters)
- Long text (~1000 characters)
- Randomly generated text (e.g., a DNA sequence “AGCTAGCT…”)
- For each type of text, conduct tests with at least three different patterns.
- Goal: Verify how the algorithms perform on different data sets.
- Comparison of Character Comparisons (medium) - 1 point
- Task:
- Record the number of character comparisons for each algorithm and each test case.
- Create a table with the results.
- Goal: Quantify the efficiency of the algorithms.
- Visualization of Algorithm Performance (medium) - 1 point
- Task:
- Create a graph that illustrates the efficiency of the algorithms based on text and pattern length.
- Goal: Visually represent the performance of the algorithms.
- Analysis and Decision Making on Algorithm Suitability (advanced) - 1 point
- Task:
- Answer the following questions:
- When does KMP perform better than BMH?
- When is BMH faster than Brute Force?
- When is KMP disadvantageous to use?
- How do the algorithms perform on texts with repeating patterns?
- Goal: Understand the strengths and weaknesses of each algorithm.
🎯 Bonus Task (+2 extra points)
- Propose a hybrid approach:
- Develop a heuristic that selects the most suitable algorithm based on the length and properties of the pattern and text.
- Compare the performance of this strategy with individual algorithms.
In this exercise, you will implement an algorithm for automatic word correction and analyze the efficiency of different approaches to fuzzy word search. The primary inspiration for the implementation is the well-known algorithm by Peter Norvig. Your goal is to implement the computation of edit distance and then create a system for automatic word correction based on the probability of word occurrence in a dictionary. You are allowed to use artificial intelligence to any extent while solving this task.
- Computation of Edit Distance (easier) - 3 points
- Task:
- Implement an algorithm for computing the Levenshtein distance between two words.
- Test your implementation on your own data – select several word pairs and verify whether the computed distance meets expectations.
- Goal: Understand the concept of edit distance and ensure a correct implementation.
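A compact two-row dynamic-programming sketch of the Levenshtein distance (insertions, deletions, and substitutions, each with cost 1):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute / match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3
```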
- Implementation of Automatic Word Correction (medium) - 4 points
- Dictionary Preparation - 1 point
- Create a dictionary of words based on a chosen dataset.
- Store the frequency of individual words to determine their probability of occurrence.
- Generating Word Variants - 1 point
- For an input word, generate all possible word variants with an edit distance of at most 2.
- Consider insertion, deletion, substitution, and adjacent character swaps.
- Determine the number of generated variants.
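A sketch of the variant generation in the spirit of Norvig's spelling corrector; the lowercase English alphabet is an assumption here and would need to be extended for your language:

```python
import string

ALPHABET = string.ascii_lowercase   # extend with accented letters for your language

def edits1(word):
    """All strings at edit distance 1: deletions, transpositions, substitutions, insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """Strings reachable within two edits (apply edits1 twice)."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

print(len(edits1("word")), len(edits2("word")))
```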
- Selecting the Most Probable Word - 1 point
- From the generated variants, select the most probable word based on its frequency in the dictionary.
- Correct the following sentence: I well go to the restorant tonigth and order a delicshus desert befor going home.
- Alternative Approach and Efficiency Comparison - 1 point
- Instead of generating word variants, compute the edit distance for all words in the dictionary and select the closest candidates.
- Compare the computational complexity and result quality of both approaches.
- For a word of length n, determine the number of generated variants.
🎯 Bonus Task (+2 extra points)
- Improving Probability Calculation Using an N-gram Model:
- Design and implement an improved system that utilizes n-grams and conditional probabilities to determine the most probable corrected word.
- Compare the results with the original approach and analyze the reasons for improvement or possible issues.
In this assignment, you will explore the basic principles of Boolean search in textual data. You will create an inverted index with token normalization, parse and evaluate queries with parentheses and various logical operators. Then, you will extend your system with a compact representation of the index and analyze its efficiency. This task is designed to provide a deeper understanding of classical IR model principles. You are allowed to use artificial intelligence tools to any extent when solving this task.
- Inverted index with normalization – 1 point
- Task:
- Create a text corpus (at least 50 documents) and build an inverted index.
- Remove stopwords during index construction.
- The index should also store the frequency of each word in each document.
- Goal: Understand how to represent a text corpus using an inverted index and how to preprocess textual data.
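A minimal index-building sketch (the stopword list and the punctuation stripping are simplifications you will want to replace with your own normalization):

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to"}   # use your own list

def build_index(documents):
    """Inverted index: term -> {doc_id: frequency of the term in that document}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for token in text.lower().split():
            token = token.strip(".,!?;:\"'()")
            if token and token not in STOPWORDS:
                index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

docs = ["The cat sat on the mat.", "A dog chased the cat."]
index = build_index(docs)
print(index["cat"])   # {0: 1, 1: 1}
```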
- Parsing and evaluating complex Boolean queries – 2 points
- Task:
- Implement support for Boolean queries using the operators AND, OR, and NOT.
- Evaluate the queries using set operations over the inverted index.
- Goal: Learn how to correctly parse and efficiently evaluate more complex logical queries.
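A sketch of the set-based evaluation for a flat query without parentheses or operator precedence; handling nesting is the parsing part of the exercise. The tiny hard-coded index is only for illustration:

```python
def postings(index, term):
    """Document IDs whose posting list contains the term."""
    return set(index.get(term, {}))

def evaluate_flat(query, index, all_doc_ids):
    """Left-to-right evaluation of AND / OR / NOT without parentheses."""
    result, op, negate = None, None, False
    for tok in query.lower().split():
        if tok in ("and", "or"):
            op = tok
        elif tok == "not":
            negate = True
        else:
            docs = postings(index, tok)
            if negate:
                docs, negate = all_doc_ids - docs, False
            if result is None:
                result = docs
            elif op == "and":
                result &= docs
            else:
                result |= docs
    return result if result is not None else set()

index = {"cat": {0: 1, 1: 1}, "dog": {1: 2}}
print(evaluate_flat("cat AND NOT dog", index, all_doc_ids={0, 1}))   # {0}
```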
- Index efficiency and size – 1 point
- Task:
- Analyze the size of the created index: number of tokens, average posting list length, and total number of entries.
- Goal: Understand the storage demands of the inverted index.
- Query interface and query comparison – 1 point
- Task:
- Create a simple interface (e.g., console or script-based) that allows the user to input arbitrary queries and view results.
- Display at least document IDs or the first sentence from each matched document.
- Goal: Make working with the search engine easier and demonstrate how different queries impact results.
- Extended Boolean model with weighting – 2 points
- Task:
- Design and implement a simple extended Boolean model that assigns scores to documents.
- The input is a query, and the output is a ranked list of documents by relevance score.
- Compare the result quality with that of the standard Boolean approach.
- Goal: Get familiar with the concept of weighted Boolean models and the basics of relevance ranking.
In this exercise, you will explore practical work with the vector space model for document representation. You will manually compute tf-idf weights, compare documents using cosine similarity, and reflect on the limitations of this method. The task is designed to help you understand the principles of term weighting and document similarity, not just to use built-in functions. You may use artificial intelligence for implementation and design consultation, but the output must be your own work and interpretation, and you are expected to be able to explain the topic, not merely present a tool's output.
- Text Preprocessing – 1 point
- Task:
- Select a dataset with at least 20 documents; you may use NLTK datasets such as Gutenberg, Twitter, reviews, etc.
- For the given documents, perform: lowercasing, punctuation removal, tokenization, and stop word removal.
- Create your own list of stopwords (at least 5 entries).
- The output should be a list of terms for each document.
- Goal: Become familiar with manual text preprocessing and prepare the data for vector representation.
- Computing tf and idf – 2 points
- Task:
- Compute the term frequency (tf) of each word t in all documents. Use some form of normalization (e.g., relative frequency), or justify why you chose the non-normalized version.
- Compute the inverse document frequency (idf) using the formula idf(t) = log(N / df(t)).
- Compute tf-idf weights: tf-idf(t,d) = tf(t,d) * idf(t)
- Compute scores for the terms in a query q: Score(q,d) = sum over t in q of tf-idf(t,d)
- Return the documents sorted by score.
- Goal: Understand how to compute tf-idf components and their meaning in the context of a text corpus.
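A small sketch of the three components and the query score, using relative term frequency and idf(t) = log(N / df(t)) exactly as above (the toy documents are placeholders):

```python
import math

def tf(term, doc_tokens):
    """Relative frequency of `term` in one tokenized document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, tokenized_docs):
    """idf(t) = log(N / df(t)); df(t) = number of documents containing t."""
    df = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(len(tokenized_docs) / df) if df else 0.0

def rank(query_terms, tokenized_docs):
    """Score(q, d) = sum of tf-idf(t, d) over query terms; highest score first."""
    scores = [(i, sum(tf(t, doc) * idf(t, tokenized_docs) for t in query_terms))
              for i, doc in enumerate(tokenized_docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

docs = [["cat", "sat", "mat"], ["dog", "chased", "cat"], ["dog", "barked"]]
print(rank(["cat", "dog"], docs))
```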
- tf-idf and Cosine Similarity – 2 points
- Task:
- Compute the cosine similarity between all document pairs.
- Identify the two most similar documents and explain why.
- How would the results change if you used only tf without idf?
- Goal: Apply the vector space model in practice and understand how document similarity is computed.
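Cosine similarity compares two weight vectors by the cosine of the angle between them: the dot product divided by the product of the vector norms. A sketch over tf-idf vectors stored as term -> weight dictionaries (toy numbers):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dictionaries."""
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc_a = {"cat": 0.4, "mat": 0.9}
doc_b = {"cat": 0.4, "dog": 0.7}
print(cosine(doc_a, doc_b))
```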
- The Meaning of idf in Specific Domains – 1 point
- Task:
- Provide an example of a domain or topic where frequently occurring words may still be highly important.
- Explain why using the standard idf in such a case may be inappropriate.
- Propose a modification of the idf computation that could address this issue.
- Goal: Critically assess the limitations of the vector space model and propose adjustments for specific situations.
- Proposing an Alternative Weighting Scheme – 1 point
- Task:
- Design a term weighting scheme suitable for short texts (e.g. tweets) that could better capture word importance than standard tf-idf.
- Could hashtags and user mentions be utilized in some way?
- Describe how your scheme would treat words that are:
- very frequent across the corpus,
- occurring only once,
- present in only some documents.
- Goal: Encourage creative thinking in designing custom models and understanding the function of individual weighting components.
In this exercise, you will explore transferring meaning between languages using word embeddings and linear transformation. First, you will obtain bilingual data (vector representations and translation pairs), then implement a method to learn a transformation matrix using gradient descent, and finally evaluate the translation quality using accuracy. The exercise combines practical programming with understanding the mathematical foundations.
- Data Preparation – 2 points
- Entry question: What is an embedding?
- Task:
- Download two embeddings from the FastText website (e.g., Czech and English) in the text .vec format.
- Download translation pairs from the MUSE – bilingual dictionaries project, specifically the cs-en train and cs-en test files.
- Load the embeddings and create matrices X (source language) and Y (target language) for the training and test sets.
- Goal: Obtain consistent embeddings and translation pairs for experiments without the need to create your own dictionary.
- Implementation of the Training Algorithm – 6 points
- Entry question: What does gradient descent do?
- Task:
- Implement a function to compute the squared Frobenius norm ||X W^T - Y||_F^2 (1 point)
- Compute the expression X W^T - Y (0.5 points)
- Derive the gradient of the loss function with respect to the matrix W^T (2 points)
- Implement the gradient of the loss function with respect to the matrix W^T (1 point)
- Implement gradient descent with a learning-rate parameter alpha (1 point)
- Set up training in a loop: a fixed number of steps plus a convergence check (e.g., the loss must have decreased over the last ten iterations) (0.5 points)
- Goal: Understand and implement the training process for learning a linear transformation between languages.
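A NumPy sketch of the whole loop, assuming X (n x d1) and Y (n x d2) are already built from the translation pairs; the random matrices below only stand in for real embeddings. The loss is the squared Frobenius norm ||X W^T - Y||_F^2, whose gradient with respect to W^T is 2 X^T (X W^T - Y):

```python
import numpy as np

def train_mapping(X, Y, alpha=1e-4, steps=2000):
    """Gradient descent on ||X @ WT - Y||_F^2 with a simple convergence check."""
    WT = np.zeros((X.shape[1], Y.shape[1]))
    losses = []
    for _ in range(steps):
        residual = X @ WT - Y                   # X W^T - Y
        losses.append(np.sum(residual ** 2))    # squared Frobenius norm
        grad = 2 * X.T @ residual               # d(loss) / d(W^T)
        WT -= alpha * grad
        # stop early once the loss has barely moved over the last ten iterations
        if len(losses) > 10 and losses[-11] - losses[-1] < 1e-8:
            break
    return WT, losses

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(100, 5)), rng.normal(size=(100, 4))   # stand-ins for embedding matrices
WT, losses = train_mapping(X, Y)
print(losses[0], losses[-1])
```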
- Translation and Evaluation – 3 points
- Entry question: How would we have to change the matrices X and Y if we used the matrix W instead of W^T?
- Task:
- Based on the learned matrix W, implement a function that translates a given word – the output should be the 5 most similar words in the target language according to cosine similarity (1 point)
- Test the model on the test words and compute the translation accuracy (top-1 accuracy, top-5 accuracy) (2 points)
- Goal: Verify that the learned transformation enables meaningful translation between languages.
- Recommended resources:
In this exercise, you will create your own Continuous Bag of Words (CBOW) model for your native language. The goal is to train the model to predict a word based on its context, thereby learning high-quality distributed word vector representations (embeddings). The task has two variants: using existing libraries (20 points) or a full implementation from scratch (30 points).
- Data Preparation and Processing – 5 points
- Entry question: What is the CBOW model and how does it work?
- Task:
- Download or use a prepared corpus in your language (e.g., from huggingface.co or a Wikipedia excerpt).
- Tokenize the text and build a vocabulary (e.g., the 10,000 most frequent words).
- Create training pairs: context words (e.g., 2 to the left + 2 to the right) → target word.
- Mark out-of-vocabulary words as <UNK>.
- Goal: Prepare data for training the CBOW model in a format suitable for learning.
- Model Training – 10 points (Variant A) or 20 points (Variant B)
- Entry question: What does the CBOW model learn and how are its weights optimized?
- Variant A – Using a library (max. 20 points total):
- Use PyTorch, Keras, TensorFlow, or another library with an Embedding layer.
- Implement the CBOW architecture and train the model using CrossEntropyLoss and an optimizer (SGD/Adam).
- Variant B – Full implementation (max. 30 points total):
- Initialize the matrices E, W, b using, e.g., a normal distribution.
- Compute the average embedding for the context.
- Implement softmax, the loss function (cross-entropy), gradient computation, and the SGD update manually (e.g., in NumPy).
- Train over multiple epochs and display the loss progression.
- Goal: Understand the CBOW principle and learn how to implement and train it.
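A Variant A sketch in PyTorch: an Embedding layer averaged over the context, a linear output layer, and one optimization step on random stand-in data (the vocabulary size, window, and batch size are arbitrary here):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    """Predict the target word from the mean of its context embeddings."""
    def __init__(self, vocab_size, embedding_dim=100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_ids):                       # (batch, 2 * window)
        context_vectors = self.embeddings(context_ids)    # (batch, 2 * window, dim)
        return self.output(context_vectors.mean(dim=1))   # (batch, vocab) logits

model = CBOW(vocab_size=10_000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

context = torch.randint(0, 10_000, (32, 4))   # stand-in batch: window 2 + 2
target = torch.randint(0, 10_000, (32,))
optimizer.zero_grad()
loss = loss_fn(model(context), target)
loss.backward()
optimizer.step()
print(loss.item())
```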
- Model Evaluation – 5 points
- Task:
- For selected words (e.g., “pes”, “škola”, “krásný”), find the 5 nearest neighbors based on cosine similarity.
- Visualize the embedding space using t-SNE or PCA.
- Evaluate whether the embeddings capture semantic similarity.
- Goal: Verify the quality of the learned embeddings and their interpretability.
- Recommended Resources:
In this exercise, you will implement an encoder-decoder model based on the Transformer architecture for the task of summarizing short dialogues from the Samsum dataset. The goal is to create a system capable of automatically generating a summary of a given dialogue.
- Data Preparation and Processing – 2 points
- Answer: What is the structure of the Samsum dataset and why is it suitable for summarization tasks?
- Task:
- Download the Samsum dataset (e.g., using the datasets library from Hugging Face).
- Split the data into training (and optionally validation) and test sets.
- Goal: Prepare training and test data suitable for summarization modeling.
- Preprocessing and Tokenization – 2 points
- Answer: How does tokenization work and why is it important when working with text models?
- Task:
- Select any pretrained tokenizer.
- Tokenize both the input texts (dialogues) and outputs (summaries).
- Goal: Convert text data into a tokenized format suitable for model training.
- Transformer Model Implementation – 8 points
- Answer: How does the encoder-decoder architecture based on the Transformer work?
- Answer: Why do we need padding in the encoder and decoder?
- Task:
- Implement an encoder-decoder model for summary generation using a Transformer.
- You may use nn.Transformer from PyTorch, or the TransformerEncoder and TransformerDecoder layers in Keras.
- Add token embedding and positional encoding for both input and target sequences.
- Implement proper masking:
- Padding mask – to ignore padding tokens,
- Causal mask – to prevent looking at future tokens in the decoder.
- Goal: Build a functional Transformer model ready for training.
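A sketch of the two masks with PyTorch's nn.Transformer (PAD_ID, the model sizes, and the random tensors standing in for embedded and positionally encoded inputs are all assumptions):

```python
import torch
import torch.nn as nn

PAD_ID = 0   # assumed padding token id from your tokenizer

def make_masks(src_ids, tgt_ids):
    """Boolean padding masks (True = ignore) and a causal mask for the decoder."""
    src_padding_mask = src_ids == PAD_ID                     # (batch, src_len)
    tgt_padding_mask = tgt_ids == PAD_ID                     # (batch, tgt_len)
    causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
    return src_padding_mask, tgt_padding_mask, causal_mask

model = nn.Transformer(d_model=256, nhead=8, batch_first=True)
src_ids = torch.randint(1, 100, (2, 12))
tgt_ids = torch.randint(1, 100, (2, 7))
src_emb = torch.randn(2, 12, 256)   # stands in for embedded + positionally encoded dialogue
tgt_emb = torch.randn(2, 7, 256)    # stands in for embedded + positionally encoded summary
src_pad, tgt_pad, causal = make_masks(src_ids, tgt_ids)
out = model(src_emb, tgt_emb, tgt_mask=causal,
            src_key_padding_mask=src_pad, tgt_key_padding_mask=tgt_pad)
print(out.shape)   # torch.Size([2, 7, 256])
```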
- Model Training – 8 points
- Answer: How do you train a sequence model for a sequence-to-sequence task?
- Task:
- Define a loss function (for example, CrossEntropyLoss for token prediction).
- Select an appropriate optimizer (such as Adam/AdamW).
- Train the model on the training data (validate during training).
- Goal: Train a model capable of generating summaries.
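The core of a sequence-to-sequence training step is teacher forcing with a shifted target: the decoder reads the summary without its last token and is trained to predict the same summary shifted one position to the left, with padding excluded from the loss. A minimal sketch in which a random tensor stands in for the model's output logits and the token ids are made up:

```python
import torch
import torch.nn as nn

PAD_ID = 0
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_ID)   # padded positions do not contribute

tgt_ids = torch.tensor([[2, 15, 7, 3, 0]])           # <SOS> w1 w2 <EOS> <PAD> (toy ids)
decoder_input = tgt_ids[:, :-1]                      # fed to the decoder
labels = tgt_ids[:, 1:]                              # what the decoder should predict

vocab_size = 100
logits = torch.randn(1, decoder_input.size(1), vocab_size, requires_grad=True)  # model output stand-in
loss = loss_fn(logits.reshape(-1, vocab_size), labels.reshape(-1))
loss.backward()
print(loss.item())
```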
- Inference – Summary Generation – 5 points
- Answer: How can the output sequence be generated in a decoder Transformer?
- Hint: What are the purposes of special tokens <SOS> and <EOS>?
- Task:
- Implement a function that generates a summary based on an input dialogue.
- Study and use greedy decoding or beam search.
- Goal: Be able to obtain an output summary from the trained model.
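A greedy-decoding sketch. The interface model(src, generated) returning logits of shape (1, generated_length, vocab) is an assumption about how you wrap your Transformer; the fake model in the usage line exists only so the snippet runs on its own:

```python
import torch

def greedy_decode(model, src, sos_id, eos_id, max_len=60):
    """Generate token ids one at a time, always taking the most probable next token."""
    generated = torch.tensor([[sos_id]])
    for _ in range(max_len):
        logits = model(src, generated)              # assumed interface, see note above
        next_id = logits[0, -1].argmax().item()     # most probable continuation
        generated = torch.cat([generated, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                       # stop at the end-of-sequence token
            break
    return generated[0].tolist()

# Fake model that always prefers token 3 (treated as <EOS> here), just to demonstrate the loop.
fake_model = lambda src, gen: torch.zeros(1, gen.size(1), 10).index_fill_(2, torch.tensor([3]), 1.0)
print(greedy_decode(fake_model, src=None, sos_id=2, eos_id=3))   # [2, 3]
```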
- Model Evaluation – 1 point
- Task:
- Evaluate the quality of the generated summaries using ROUGE-1 and ROUGE-2 metrics.
- Compare the generated summaries to the reference summaries in the test set.
- Goal: Quantitatively verify the model's summarization quality.
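ROUGE-N counts overlapping n-grams between the generated and reference summaries. A simplified self-contained version (no stemming or other normalization) is sketched below; established packages such as rouge-score or Hugging Face's evaluate can be used instead:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, F1) of n-gram overlap between two texts."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(zip(*[tokens[i:] for i in range(n)]))
    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=2))
```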
- Recommended Resources: