
ECTS Course Overview



Methods of Analysis of Textual Data

* Exchange students do not have to consider this information when selecting suitable courses for an exchange stay.

Course Unit Code: 460-4074/02
Number of ECTS Credits Allocated: 4 ECTS credits
Type of Course Unit *: Optional
Level of Course Unit *: Second Cycle
Year of Study *
Semester when the Course Unit is delivered: Summer Semester
Mode of Delivery: Face-to-face
Language of Instruction: English
Prerequisites and Co-Requisites: The course follows on from the compulsory courses of the previous semester.
Name of Lecturer(s): doc. Mgr. Jiří Dvorský, Ph.D. (Personal ID: DVO26)
Summary
The course covers the basic principles of analyzing text documents, which are treated as a typical example of weakly structured data. It presents the individual areas of text data processing for documents and web pages. Topics include pattern-matching algorithms for text, the design of index systems for text data, and working with the natural languages in which texts are written. Various approaches to searching text data, including latent semantic analysis, are also described. The course concludes with web search.
Learning Outcomes of the Course Unit
The aim of the course is to introduce students to basic and advanced techniques for the analysis of textual data. After finishing the course, the student will be able to:
describe different methods of analysis of textual data,
understand these methods,
implement these methods, or use existing libraries,
incorporate these methods into their own design for the analysis of specific data.
Course Contents
A brief outline of the lectures' topics:
1. Introduction to information systems. The history and evolution of text retrieval. Differences between database systems and information retrieval (IR) systems. The general model of an information retrieval system.
2. Pattern matching. Single-pattern matching. Aho-Corasick algorithm. Regular expressions, finite automata. Algorithms for approximate pattern matching.
3. Suffix trees. DAWG. Patricia and similar data structures.
4. Text preprocessing. Lexical analysis. Stemming. Lemmatization. Stop words.
5. Construction of index systems. Zipf's law and estimating the size of the index system. Indexing based on classification. Positional index systems. Methods for weighting terms. TF-IDF term weighting. Methods for compressing index systems. Methods for encoding natural numbers.
6. Query languages. Document relevance. The degree of similarity between a document and a query. Relevance vs. similarity. Query structure and evaluation. Boolean DIS (document information system). IR system evaluation (precision, recall, F-measure).
7. Signature methods. Chained and layered signature coding. Efficient evaluation of queries.
8. Latent semantics. Methods for dimension reduction. Methods based on matrix decomposition. Random projection. Vector DIS. Construction and evaluation of the query vector. Other types of DIS (extended Boolean). Indexing, query structure, query evaluation.
9. Web search. Analysis of hypertext documents, structural methods. PageRank and HITS. Metasearch and cooperative search. Application of computational intelligence and soft computing to text processing and search.
10. Methods for automatic summarization: abstraction and extraction. Topic detection and evolution. Sentiment analysis, classification, and clustering of documents.
11. Parallel and distributed search. Decentralized and P2P search.
12. Semantic and contextual search: the Hummingbird, Snapshot (Satori), and Graph Search technologies.
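As an illustration of the term-weighting topic in lecture 5, the following is a minimal sketch of TF-IDF. It assumes raw term counts for tf and idf = ln(N / df); other variants exist, and the course may use a different one. The function name `tf_idf` and the toy documents are hypothetical and serve only as an example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes tf = raw count of the term in the document and
    idf = ln(N / df), where N is the number of documents and
    df is the number of documents containing the term.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection.
docs = ["text retrieval systems", "text compression", "web search"]
weights = tf_idf(docs)
```

With this weighting, a term that occurs in every document gets weight 0, while a term confined to a single document gets the highest idf.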
Recommended or Required Reading
Required Reading:
1. Kopecký M., Pokorný J.: Dokumentografické informační systémy, Karolinum, 2006, ISBN 8024611481
2. Manning, C. D.; Raghavan, P. & Schütze, H. Introduction to Information Retrieval, Cambridge University Press, 2008
3. Witten I. H., Moffat A., Bell T. C.: Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers Inc., 1999, ISBN 1-55860-570-3
4. Baeza-Yates R. A., Ribeiro-Neto B.: Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., 1999, ISBN 020139829X
5. Feldman R., Sanger J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2006, ISBN 978-0521836579
6. Berry M. W., Kogan J.: Text Mining: Applications and Theory, Wiley, 2010, ISBN 978-0470749821
7. Weiss S. M., Indurkhya N., Zhang T.: Fundamentals of Predictive Text Mining, Springer, 2010, ISBN 978-1849962254
8. Langville, A. N. & Meyer, C. D. Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, 2006
9. Korfhage, R. R. Information Storage and Retrieval, John Wiley & Sons, 1997

Recommended Reading:
1. Witten, I. H.; Gori, M. & Numerico, T. Web Dragons: Inside the Myths of Search Engine Technology, Morgan Kaufmann, 2006
Planned learning activities and teaching methods
Lectures, Tutorials
Assessment methods and criteria
Tasks are not defined