CSCE 410/810

Supplement 6:  Text Operations

                  

September 25, 2003

 

This supplement is based on Chapter 7 of Baeza-Yates, R. and B. Ribeiro-Neto (1999).  Modern Information Retrieval, New York: Addison-Wesley.

 

Introduction

 

Not all words are equally significant for representing the semantics of a document.  In written language, some words carry more meaning than others.  Usually, nouns (or groups of nouns) are the most representative of a document's content, and these terms can be used as index terms.  Several preprocessing steps are applied to obtain index terms.

 

The preprocessing of the documents in the collection might be viewed simply as a process of controlling the size of the vocabulary (i.e., the number of distinct words used as index terms).

 

IMPORTANT:  Despite a potential improvement in retrieval performance, text transformations done at preprocessing time might make it more difficult for the user to interpret the retrieval task.  In fact, some Web search engines are giving up text operations entirely and simply indexing all the words in the text; that is, they use full-text search.

 

Text operations include (1) text normalization, (2) building of a thesaurus, (3) text compression, and (4) encryption.

 

Text normalization and the building of a thesaurus are strategies aimed at improving the precision of the documents retrieved. 

 

To reduce query response time, one might consider text compression. A good compression algorithm is able to reduce the text to 30-35% of its original size.  The compressed text requires less storage space and takes less time to be transmitted over a communication link.  The main disadvantage is the time spent compressing and decompressing the text.
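
As a rough illustration of such ratios, the sketch below compresses a text with Python's standard zlib module (an LZ77-based, i.e., dictionary, coder).  The sample string is made up for illustration; highly repetitive text like this compresses far below the 30-35% typical of ordinary prose.

    import zlib

    # Any reasonably long text; repetitive prose compresses especially well.
    text = ("Not all words are equally significant for representing "
            "the semantics of a document.  " * 50).encode("utf-8")

    compressed = zlib.compress(text, level=9)
    ratio = len(compressed) / len(text)
    print(f"original: {len(text)} bytes, "
          f"compressed: {len(compressed)} bytes ({ratio:.0%} of original)")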

 

Besides compression, another text operation is encryption.  Due to the rapid popularization of services on the Web (including all types of e-commerce), key (and old) questions regarding security and privacy have surfaced again.

 

Document Preprocessing

 

Document preprocessing can be divided into five main text operations (or transformations); a minimal end-to-end sketch follows the list:

(1)   Lexical analysis of the text with the objective of treating digits, hyphens, punctuation marks, and the case of letters.

(2)   Elimination of stopwords with the objective of filtering out words with very low discrimination values for retrieval purposes.

(3)   Stemming of the remaining words with the objectives of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.).

(4)   Selection of index terms to determine which words/stems (or groups of words) will be used as indexing elements.  Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word.  In fact, nouns frequently carry more semantics than adjectives, adverbs, and verbs.

(5)   Construction of term categorization structures, such as a thesaurus, or extraction of structure directly represented in the text, to allow expansion of the original query with related terms (a usually useful procedure).
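
The sketch below chains the first three operations into a single preprocessing pipeline.  It is deliberately naive (the function names, the tiny stopword list, and the suffix list are all illustrative, not from the textbook); the sections that follow refine each stage.

    import re

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "or", "in", "is"}

    def tokenize(text):                        # (1) lexical analysis
        return re.findall(r"[a-z]+", text.lower())

    def remove_stopwords(tokens):              # (2) stopword elimination
        return [t for t in tokens if t not in STOPWORDS]

    def stem(token):                           # (3) naive suffix stripping
        for suffix in ("ing", "ed", "ion", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token

    def preprocess(text):                      # steps (4)-(5) are omitted here
        return [stem(t) for t in remove_stopwords(tokenize(text))]

    print(preprocess("Connecting the network components to the servers"))
    # -> ['connect', 'network', 'component', 'server']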

 

Lexical Analysis of the Text

 

Lexical analysis is the process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms).  Thus, one of the major objectives of the lexical analysis phase is the identification of the words in the text and the recognition of the following (a small analyzer illustrating these decisions follows the list):

·        Spaces as word separators

·        Digits: These are usually not good index terms because, without a surrounding context, they are inherently vague.

·        Hyphens:  Breaking up hyphenated words might be useful due to the inconsistency of usage.  For example, this allows treating “state-of-the-art” and “state of the art” identically.  However, there are words which include hyphens as an integral part: gilt-edge, B-49, etc.

·        Punctuation marks: Normally, punctuation marks are removed entirely in the process of lexical analysis.  While some punctuation marks are an integral part of the word (e.g., ‘510B.C’), removing them does not seem to have an impact on retrieval performance because the risk of misinterpretation in this case is minimal.

·        The case of the letters (upper and lower case):  The case of letters is usually not important for the identification of index terms.  As a result, a lexical analyzer usually converts all the text to either lower or upper case. 
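
A sketch of a lexical analyzer that implements the decisions above: fold case, treat hyphens as word separators, strip punctuation, and drop bare digits.  These are policy choices, and a real system would keep an exception list for terms like B-49 whose hyphen is integral.

    import re

    def lexical_analysis(text):
        # Fold case and treat hyphens as spaces, so "state-of-the-art"
        # and "state of the art" produce the same tokens.
        text = text.lower().replace("-", " ")
        tokens = []
        for raw in text.split():
            word = re.sub(r"[^\w]", "", raw)   # strip punctuation marks
            if word and not word.isdigit():    # drop bare digits
                tokens.append(word)
        return tokens

    print(lexical_analysis("The state-of-the-art system, built in 1999."))
    # -> ['the', 'state', 'of', 'the', 'art', 'system', 'built', 'in']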

 

Elimination of Stopwords

 

Words which are too frequent among the documents in the collection are not good discriminators.  In fact, a word which occurs in 80% of the documents is useless for purposes of retrieval.  Such words are referred to as stopwords and are filtered out as potential index terms.  (A sketch for detecting such words from the collection itself follows the list below.)

 

·        Articles, prepositions, and conjunctions are natural candidates for a list of stopwords.

·        Some verbs, adverbs, and adjectives could be treated as stopwords.

·        Domain-specific terms could be treated as stopwords.
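
Besides fixed lists of function words, overly frequent words can be detected from the collection itself.  The sketch below flags every word whose document frequency exceeds 80% of the collection; the toy corpus and the exact threshold are for illustration only.

    from collections import Counter

    docs = [
        "to be or not to be that is the question",
        "the question is hard to answer",
        "to answer is to be committed",
    ]

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))

    threshold = 0.8 * len(docs)
    stopwords = {word for word, n in df.items() if n > threshold}
    print(stopwords)   # -> {'to', 'is'} for this toy collection

Note how removing words such as 'to' and 'be' from the first document would cripple a phrase query like ‘to be or not to be’, which is exactly the recall problem described below.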

 

Advantages            Besides filtering out poor discriminators, this process reduces the size of the indexing structure considerably.  It is typical to obtain a reduction of 40% or more in the size of the indexing structure solely through the elimination of stopwords.

 

Disadvantages                         Elimination of stopwords might reduce recall.  For example, say the user is looking for documents containing the phrase ‘to be or not to be’.  Elimination of stopwords might leave only the term be, making it almost impossible to properly recognize the documents which contain the specified phrase.

 

Stemming

 

Frequently, the user specifies a word in a query, but only a variant of this word is present in a relevant document.  Plurals, gerund forms, and past tense suffixes are examples of syntactic variations which prevent a perfect match between a query word and a respective document word.  This problem can be partially overcome by substituting the words with their respective stems.

 

A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes).  A typical example is the stem connect, which is the stem of the variants connected, connecting, connection, and connections.

 

Four types of stemming strategies:

·        Table lookup: This is based on looking up the stem of a word in a table.  It is simple but depends on stem data for the whole language.  Since such data are not readily available and might require considerable storage space, this type of stemming algorithm might not be practical.

·        Successor variety stemming: This is based on the determination of morpheme boundaries, using knowledge from structural linguistics, and is more complex than affix removal stemming algorithms.

·        N-grams stemming:  This is based on the identification of digrams and trigrams and is more a term clustering procedure than a stemming one.

·        Affix removal: This is intuitive, simple, and can be implemented efficiently.  The most important part is suffix removal, because most variants of a word are generated by the introduction of suffixes (a minimal sketch follows the list).
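
A minimal suffix-removal sketch, stripping one suffix per word with a longest-match-first rule.  Production stemmers such as Porter's apply ordered rule sets with conditions on the remaining stem; this version is only meant to show the idea.

    SUFFIXES = ["ations", "ation", "ions", "ion", "ings", "ing", "ed", "s"]

    def stem(word):
        # Strip the longest matching suffix, but keep a stem of
        # at least three characters.
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    for w in ["connected", "connecting", "connection", "connections"]:
        print(w, "->", stem(w))
    # All four variants reduce to the common stem 'connect'.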

 

Advantages            Stems are thought to be useful for improving retrieval performance because they reduce variants of the same root word to a common concept.  Furthermore, stemming has the secondary effect of reducing the size of the indexing structure because the number of distinct index terms is reduced.

 

Index Terms Selection

 

If a full-text representation of the text is adopted, then all words in the text are used as index terms.  The alternative is to adopt a more abstract view in which not all words are used as index terms.

 

This implies that the set of terms used as indices must be selected.  Two possible ways to accomplish this:

 

·        Keywords are selected by a specialist

·        Keywords are selected automatically using a program: 

o       A sentence in natural language text is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives.  While the words in each grammatical class are used with a particular purpose, it can be argued that most of the semantics is carried by the noun words.  Thus, an intuitively promising strategy for selecting index terms automatically is to use the nouns in the text.

o       Since it is common to combine two or three nouns into a single component (e.g., computer science), it makes sense to cluster nouns which appear nearby in the text into a single indexing component (or concept).  So, we use noun groups.  A noun group is a set of nouns whose syntactic distance in the text (measured as the number of words between two nouns) does not exceed a predefined threshold (for instance, 3); a sketch follows.
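
A sketch of noun-group extraction, assuming the text has already been part-of-speech tagged (the tagger itself is outside the scope here; the tag set and sentence below are invented for illustration).  Nouns separated by more than the threshold number of intervening words start a new group.

    def noun_groups(tagged, max_gap=3):
        # `tagged` is a list of (word, pos) pairs; pos == "N" marks a noun.
        # Nouns with more than max_gap words between them start a new group.
        groups, current, last_noun = [], [], None
        for i, (word, pos) in enumerate(tagged):
            if pos == "N":
                if last_noun is not None and i - last_noun - 1 > max_gap:
                    groups.append(current)
                    current = []
                current.append(word)
                last_noun = i
        if current:
            groups.append(current)
        return groups

    sentence = [("the", "D"), ("computer", "N"), ("science", "N"),
                ("department", "N"), ("is", "V"), ("funded", "V"),
                ("generously", "A"), ("by", "P"), ("industry", "N")]
    print(noun_groups(sentence))
    # -> [['computer', 'science', 'department'], ['industry']]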

 

Thesauri (Advanced Topic)

 

Document Clustering (Advanced Topic)

 

Text Compression

 

Text compression is about finding ways to represent the text in fewer bits or bytes.  The amount of space required to store text on computers can be reduced significantly using compression techniques.  

·        Compression methods create a reduced representation by identifying and using structures that exist in the text.

·        From the compressed version, the original text can be reconstructed exactly.

·        Compression is becoming more important due to the widespread use of digital libraries, office automation systems, and document databases.

 

Advantages                        Text compression can be used to reduce the costs associated with space requirements, I/O overhead, and communication delays.  Searching the compressed text can also take less time.

 

Disadvantages          The price is the time needed to encode and decode the text.  A major obstacle to storing text in compressed form is the need for IR systems to access text randomly.  To access a given word in a compressed text, it is usually necessary to decode the entire text from the beginning until the desired word is reached.  It could be argued that a large text could be divided into blocks that are compressed independently, thus allowing fast random access to each block (see the sketch below).  However, efficient compression methods need to process some text before compression becomes effective (usually more than 10 kilobytes).  The smaller the blocks, the less effective compression is expected to be.  Decompression speed is more important than compression speed!
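
A sketch of the block idea using Python's zlib: the text is split into fixed-size blocks that are compressed independently, so any block can be decompressed without touching the others.  The block size (16 KB here, an arbitrary choice) trades compression effectiveness against access granularity.

    import zlib

    BLOCK_SIZE = 16 * 1024   # small enough for fast access, large enough
                             # for the compressor to be effective

    def compress_blocks(text: bytes):
        return [zlib.compress(text[i:i + BLOCK_SIZE])
                for i in range(0, len(text), BLOCK_SIZE)]

    def read_block(blocks, i):
        # Random access: only block i is decompressed.
        return zlib.decompress(blocks[i])

    text = b"some long document text ... " * 4000
    blocks = compress_blocks(text)
    assert read_block(blocks, 2) == text[2 * BLOCK_SIZE : 3 * BLOCK_SIZE]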


Research issues            Another important characteristic of a compression method is the possibility of performing compressed pattern matching, defined as the task of performing pattern matching in a compressed text without decompressing it.  In this case, sequential searching can be sped up by compressing the search key rather than decoding the compressed text being searched. 

 

Two general approaches to text compression are:

 

·        Statistical methods rely on generating good probability estimates (of appearance in the text) for each symbol.  The more accurate the estimates are, the better the compression obtained.  Two well-known statistical coding strategies are (1) Huffman coding and (2) arithmetic coding (a small Huffman sketch follows the list).

·        Dictionary methods substitute a sequence of symbols with a pointer to a previous occurrence of that sequence.  The pointer representations are references to entries in a dictionary composed of a list of symbols (often called phrases) that are expected to occur frequently.  The best-known dictionary methods form the Ziv-Lempel family.
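
A compact Huffman-coding sketch over character frequencies, showing the statistical idea: more probable symbols receive shorter codes.  Practical text systems usually take words rather than characters as symbols and also need decoding machinery; both are omitted here.

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Each heap entry is [total frequency, [symbol, code], ...].
        heap = [[freq, [sym, ""]] for sym, freq in Counter(text).items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)           # two least frequent subtrees
            hi = heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]        # left branch prepends 0
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]        # right branch prepends 1
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return dict(pair for pair in heap[0][1:])

    codes = huffman_codes("abracadabra")
    for sym in sorted(codes, key=lambda s: len(codes[s])):
        print(repr(sym), codes[sym])
    # 'a' (the most frequent symbol) gets the shortest code.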