Basic course information
Information and materials are available on Professor Dvorsky's website
Classification
Graded credit
30 pts presence and activity in class
70 pts semestral project
Labs
19.2 - String Searching using Brute Force and through Programming Language Libraries (2 points - AI allowed)
- Implement brute force search - Example.
- Using your favorite language and one of its functions, find all occurrences of a search string in a text.
- Find out what algorithm is used in the implementation of the function for locating the searched string. (1 point)
- Download a test set of texts: Dataset
- Search for any word in the test data and measure how long it takes to find it using brute force and how long it takes with the implementation in your favorite programming language. (1 point)
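The two approaches above can be sketched as follows; this is a minimal illustration, assuming the text is loaded as a plain Python string, with `str.find` standing in for "a function of your favorite language":

```python
def brute_force_search(text, pattern):
    """Return positions of all occurrences of pattern in text, brute force."""
    positions = []
    for i in range(len(text) - len(pattern) + 1):
        # Compare the pattern against the window starting at i.
        if text[i:i + len(pattern)] == pattern:
            positions.append(i)
    return positions

def library_search(text, pattern):
    """Find all occurrences using the standard library's str.find."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)  # continue after the last hit
    return positions
```

For the timing comparison, wrap each call in `time.perf_counter()` and run it on the downloaded dataset.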
26.2 - Searching with a Deterministic Finite Automaton (2 points, 1 bonus point)
- Study the algorithm described in Handbook of Exact String Matching Algorithms - page 25, Chapter: Search with an automaton
- Implement a deterministic finite automaton that represents the pattern to be searched. Then, search for this pattern in text and return the positions of the given pattern in the text. (1 point)
- Measure the time it takes to search for any string and compare the achieved times with the brute force search algorithm. (1 point)
- Bonus task - Regular expression search. Consider a regular expression that allows the use of parentheses, the operator | (representing or), and the operator * (denoting any number of occurrences). For example, a(b|c)* signifies a word starting with "a" and continuing with any number of repetitions of "b" or "c". The procedure should be as follows: convert the regular expression into a nondeterministic finite automaton, then convert that into a deterministic finite automaton, and use the deterministic one to perform the search. (1 point - AI allowed)
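A possible shape of the automaton-based search: build a transition table over states 0..len(pattern) (state = number of pattern characters matched so far) and scan the text once. This is a sketch using the KMP-style DFA construction, assuming a non-empty pattern; it is not the only valid construction:

```python
def build_dfa(pattern, alphabet):
    """Transition table dfa[state][c] for states 0..len(pattern)."""
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # restart state: where the automaton would be after the same input minus its first symbol
    for state in range(1, m + 1):
        for c in alphabet:
            dfa[state][c] = dfa[x][c]      # mismatch: fall back to the restart state's move
        if state < m:
            dfa[state][pattern[state]] = state + 1  # match: advance
            x = dfa[x][pattern[state]]
    return dfa

def dfa_search(text, pattern):
    """Return positions of all (possibly overlapping) occurrences."""
    alphabet = set(text) | set(pattern)
    dfa = build_dfa(pattern, alphabet)
    positions, state, m = [], 0, len(pattern)
    for i, c in enumerate(text):
        state = dfa[state][c]
        if state == m:                     # accepting state reached
            positions.append(i - m + 1)
    return positions
```

After the table is built, the search itself is a single linear pass, which is what the timing comparison against brute force should show.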
4.3 - String Searching: Boyer-Moore-Horspool (2 points)
- Study and implement the Boyer-Moore-Horspool algorithm: Handbook of Exact String Matching Algorithms - chapter 18
- Select one of the 50 MB files from the dataset below. Randomly generate 1000 words from the selected dataset, then search for them using the Naive, Finite Automaton, and BMH algorithms and compare the search times.
- Test data:
- Literature:
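A compact sketch of the Horspool variant for reference: the only table needed is the bad-character shift, indexed by the character under the last position of the current window. Assumes plain Python strings:

```python
def bmh_search(text, pattern):
    """Return positions of all occurrences of pattern in text (Horspool)."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # For each character of the pattern (except the last), how far the
    # window may slide when that character sits under the window's end.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    positions = []
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            positions.append(i)
        # Characters absent from the table allow the maximal shift m.
        i += shift.get(text[i + m - 1], m)
    return positions
```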
11.3 - Error-tolerant Search (4 points)
- Implement a nondeterministic finite automaton for error-tolerant search for up to k errors.
- By sequentially scanning the text (English), find all occurrences of a word with a preset number of errors. (4 points)
- Test the code.
- Literature:
- Main source: Overview of methods, Navarro 1999 - page 25-26
- Additional methods used: Introduction to Information Retrieval - Chapter 3
- Notes on implementation:
- The automaton is nondeterministic => you will need to try all paths through the automaton.
- Recursion can be used in the solution.
- From the diagram, you can see that if we number each state, then the whole process can be handled using only integer operations:
- To verify that the current state is an accepting state, you can use the modulo operation: state_number % (len(pattern) + 1) == 0
- The transition from one state to another can be handled by adding a precalculated value:
- Symbol match: next_state = current_state + 1
- Symbol mismatch or missing symbol, diagonal move: next_state = current_state + len(pattern) + 2
- Excess symbol, move down: next_state = current_state + len(pattern) + 1
- If we allow the algorithm to compute with up to k errors, then the number of states will be: (k+1)*(len(pattern)+1)
- When choosing which paths to explore, prefer paths with fewer errors first => we seek the path with the fewest errors.
- Putting all of this together: to implement the algorithm, we only need the (possibly erroneous) word specified by the user. The function comparing against the current position in the text then maintains a list of pairs (current state number, position in the text relative to the starting position), which it gradually expands (removing the current state and adding pairs for the newly reachable states) as it walks through the automaton.
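The notes above can be sketched as follows, assuming states are numbered row by row starting from 1 (row = errors used, column = pattern characters consumed), so that the accepting test is `state % (len(pattern) + 1) == 0` as stated:

```python
from collections import deque

def approx_match_at(text, start, pattern, k):
    """True if pattern matches at text[start:] with at most k errors."""
    m = len(pattern)
    width = m + 1                          # states per row
    agenda = deque([(1, 0)])               # pairs (state number, relative position)
    seen = set()
    while agenda:
        state, pos = agenda.popleft()      # FIFO: fewer-error states tend to come first
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if state % width == 0:             # whole pattern consumed: accept
            return True
        row = (state - 1) // width         # errors used so far
        col = (state - 1) % width          # position in the pattern
        has_char = start + pos < len(text)
        if has_char and text[start + pos] == pattern[col]:
            agenda.append((state + 1, pos + 1))              # symbol match
        if row < k:
            if has_char:
                agenda.append((state + width + 1, pos + 1))  # mismatch: diagonal move
                agenda.append((state + width, pos + 1))      # excess symbol: move down
            agenda.append((state + width + 1, pos))          # missing symbol: diagonal move
    return False

def approx_search(text, pattern, k):
    """Starting positions where pattern occurs with at most k errors."""
    return [i for i in range(len(text)) if approx_match_at(text, i, pattern, k)]
```

Note that all transitions only add precalculated constants to the state number, exactly as described; the `seen` set merely avoids re-exploring the same (state, position) pair.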
18.3 - Text Preprocessing (2 points)
- Tasks
- Choose one of the document collections below, or prepare your own document collection.
- Study the collection: annotations and structuring of documents. Prepare methods for text extraction.
- Remove punctuation and other special characters, leaving only alphanumeric symbols in the collection. Convert uppercase letters to lowercase.
- Implement a stop list: Introduction to Information Retrieval - Chapter 2, page 26
- Save the preprocessed documents into separate files.
- Test Collections:
- Reuters newsletter
- Wikipedia dump
- English books from Project Gutenberg
- English books from Project Gutenberg - download script
- Porter Stemmer
- Use the preprocessed collection.
- Implement or add to your code the Porter Stemmer: Algorithm description (Source codes)
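The cleaning steps (strip non-alphanumeric characters, lowercase, apply a stop list) can be sketched as below; the stop list here is a small illustrative sample, not the one from the book:

```python
import re

# Illustrative stop list only; replace with the list from the literature.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def preprocess(text):
    """Return lowercased tokens of text with stop words removed."""
    # Replace every run of non-alphanumeric characters with a space.
    text = re.sub(r"[^0-9a-zA-Z]+", " ", text).lower()
    return [t for t in text.split() if t not in STOP_WORDS]
```

Each preprocessed document can then be written out to its own file, ready for stemming and indexing in the following labs.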
25.3 - Inverted Index (4 points)
- Prepare an inverted index:
- Create an array of pairs (term, docID).
- Sort the array by term and docID.
- Remove duplicate pairs (term,docID).
- For each term, create a list of documents in which it appears: (term: docID1, docID2,...).
- For study: Introduction to Information Retrieval - Chapter 1 pages 7-9 for more details.
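The four steps above map directly onto a short sketch, assuming each document is already a list of preprocessed terms:

```python
def build_inverted_index(documents):
    """documents[doc_id] is a list of terms; returns {term: [docIDs]}."""
    # 1. Create (term, docID) pairs.
    pairs = [(term, doc_id)
             for doc_id, terms in enumerate(documents)
             for term in terms]
    # 2. Sort by term and docID; 3. remove duplicate pairs.
    pairs = sorted(set(pairs))
    # 4. For each term, collect the sorted list of documents containing it.
    index = {}
    for term, doc_id in pairs:
        index.setdefault(term, []).append(doc_id)
    return index
```

Because the pairs are sorted before grouping, every postings list comes out in increasing docID order, which the intersection algorithm in the query system relies on.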
- Query system:
- Prepare methods for processing AND queries using the list intersection algorithm: see Introduction to Information Retrieval - Chapter 1, Figures 1.6 and 1.7, pages 11-12.
- Question: for the query word1 AND word2 AND word3 with |p1| > |p2| > |p3|, where p1, p2, and p3 are the postings lists of the individual words and |p| denotes a list's length, which of these evaluation orders runs faster:
- (word1 AND word2) AND word3
- word1 AND (word2 AND word3)
- Test your implementation.
- [Optionally] Implement OR and NOT queries.
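The merge-style intersection of two sorted postings lists (Figure 1.6) and an AND query that processes the shortest lists first (the optimization behind Figure 1.7) can be sketched as:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists by merging (Figure 1.6 style)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def and_query(index, terms):
    """Evaluate term1 AND term2 AND ..., shortest postings lists first,
    so intermediate results stay as small as possible."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result
```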