Fundamentals of Database Systems
Seventh Edition
Chapter 27
Introduction to
Information Retrieval and
Web Search
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.1 Information Retrieval (IR) Concepts (1 of 4)
• Information retrieval
– Process of retrieving documents from a collection in
response to a query (search request)
– Deals mainly with unstructured data
▪Example: homebuying contract documents
• Unstructured information
– Does not have a well-defined formal model
– Based on an understanding of natural language
– Stored in a wide variety of standard formats
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Information Retrieval (IR) Concepts (2 of 4)
• Information retrieval field predates database field
– Academic programs in Library and Information
Science
• RDBMS vendors providing new capabilities to support
various data types
– Extended RDBMSs or object-relational database
management systems
• User’s information need expressed as free-form search
request
– Keyword search query
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Information Retrieval (IR) Concepts (3 of 4)
• Characterizing an IR system
– Types of users
▪ Expert
▪ Layperson
– Types of data
▪Domain-specific
– Types of information needs
▪Navigational search
▪Informational search
▪Transactional search
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Information Retrieval (IR) Concepts (4 of 4)
• Enterprise search systems
– Limited to an intranet
• Desktop search engines
– Searches an individual computer system
• Databases have fixed schemas
– IR system has no fixed data model
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Comparing Databases and IR Systems
Table 27.1 A comparison of databases and IR systems
Databases IR Systems
• Structured data • Unstructured data
• Schema driven • No fixed schema; various data models
(e.g., vector space model)
• Relational (or object, hierarchical, and
network) model is predominant
• Free-form query models
• Structured query model • Rich data operations
• Rich metadata operations • Search request returns list or pointers to
documents
• Query returns data Blank
• Results are based on exact matching
(always correct)
• Results are based on approximate
matching and measures of effectiveness
(may be imprecise and ranked)
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
A Brief History of IR
• Stone tablets and papyrus scrolls
• Printing press
• Public libraries
• Computers and automated storage systems
– Inverted file organization based on keywords and their
weights as indexing method
• Search engine
• Crawler
• Challenge: provide high quality, pertinent, timely information
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Modes of Interactions in IR Systems
• Primary modes of interaction
– Retrieval
▪Extract relevant information from document
repository
– Browsing
▪Exploratory activity based on user’s assessment of
relevance
• Web search combines both interaction modes
– Rank of a web page measures its relevance to query
that generated the result set
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Generic IR Pipeline (1 of 2)
• Statistical approach
– Documents analyzed and broken down into chunks of
text
– Each word or phrase is counted, weighted, and
measured for relevance or importance
• Types of statistical approaches
– Boolean
– Vector space
– Probabilistic
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Generic IR Pipeline (2 of 2)
• Semantic approaches
– Use knowledge-based retrieval techniques
– Rely on syntactic, lexical, sentential, discourse-based,
and pragmatic levels of knowledge understanding
– Also apply some form of statistical analysis
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Figure 27.1 Generic IR Framework
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Figure 27.2 Simplified IR Process Pipeline
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.2 Retrieval Models (1 of 5)
• Boolean model
– One of earliest and simplest IR models
– Documents represented as a set of terms
– Queries formulated using AND, OR, and NOT
– Retrieved documents are an exact match
▪No notion of ranking of documents
– Easy to associate metadata information and write
queries that match contents of documents
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Retrieval Models (2 of 5)
• Vector space model
– Weighting, ranking, and determining relevance are
possible
– Uses individual terms as dimensions
– Each document represented by an n-dimensional
vector of values
– Features
▪Subset of terms in a document set that are deemed
most relevant to an IR search for the document set
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Retrieval Models (3 of 5)
• Vector space model
– Different similarity assessment functions can be used
• Term frequency-inverse document frequency (TF-IDF)
– Statistical weight measure used to evaluate the
importance of a document word in a collection of
documents
– A discriminating term must occur in only a few
documents in the general population
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Retrieval Models (4 of 5)
• Probabilistic model
– Involves ranking documents by their estimated
probability of relevance with respect to the query and
the document
– IR system must decide whether a document belongs
to the relevant set or nonrelevant set for a query
▪Calculate probability that document belongs to the
relevant set
– BM25: a popular ranking algorithm
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Retrieval Models (5 of 5)
• Semantic model
– Morphological analysis
▪Analyze roots and affixes to determine parts of speech
of search words
– Syntactic analysis
▪Parse and analyze complete phrases in documents
– Semantic analysis
▪Resolve word ambiguities and generate relevant
synonyms based on semantic relationships
– Uses techniques from artificial intelligence and expert
systems
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.3 Types of Queries in IR Systems (1 of 4)
• Keyword queries
– Simplest and most commonly used
– Keyword terms implicitly connected by logical AND
• Boolean queries
– Allow use of AND, OR, NOT, and other operators
– Exact matches returned
▪No ranking possible
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Types of Queries in IR Systems (2 of 4)
• Phrase queries
– Sequence of words that make up a phrase
– Phrase enclosed in double quotes
– Each retrieved document must contain at least one
instance of the exact phrase
• Proximity queries
– How close within a record multiple search terms are
to each other
– Phrase search is most commonly used proximity
query
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Types of Queries in IR Systems (3 of 4)
• Proximity queries
– Specify order of search terms
– NEAR, ADJ (adjacent), or AFTER operators
– Sequence of words with maximum allowed distance
between them
– Computationally expensive
▪Suitable for smaller document collections rather
than the Web
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Types of Queries in IR Systems (4 of 4)
• Wildcard queries
– Supports regular expressions and pattern-based
matching
▪Example ‘data*’ would retrieve data, database,
dataset, etc.
– Not generally implemented by Web search engines
• Natural language queries
– Definitions of textual terms or common facts
– Semantic models can support
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.4 Text Preprocessing (1 of 3)
• Stopword removal must be performed before indexing
• Stopwords
– Words that are expected to occur in 80% or more of
the documents of a collection
▪Examples: the, of, to, a, and, said, for, that
– Do not contribute much to relevance
• Queries preprocessed for stopword removal before
retrieval process
– Many search engines do not remove stopwords
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Text Preprocessing (2 of 3)
• Stemming
– Trims suffix and prefix
– Reduces the different forms of the word to a common
stem
– Martin Porter’s stemming algorithm
• Utilizing a thesaurus
– Important concepts and main words that describe
each concept for a particular knowledge domain
– Collection of synonyms
– UMLS
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Figure 27.3 A Portion of the UMLS Semantic
Network: “Biologic Function” Hierarchy
Source: UMLS Reference Manual, National Library of Medicine
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Text Preprocessing (3 of 3)
• Other preprocessing steps
– Digits
▪May or may not be removed during preprocessing
– Hyphens and punctuation marks
▪Handled in different ways
– Cases
▪Most search engines use case-insensitive search
• Information extraction tasks
– Identifying noun phrases, facts, events, people,
places, and relationships
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.5 Inverted Indexing (1 of 3)
• Inverted index structure
– Vocabulary information
▪Set of distinct query terms in the document set
– Document information
– Data structure that attaches distinct terms with a list
of all documents that contain the term
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Inverted Indexing (2 of 3)
• Construction of an inverted index
– Break documents into vocabulary terms
▪Tokenizing, cleansing, removing stopwords,
stemming, and/or using a thesaurus
– Collect document statistics
▪Store statistics in document lookup table
– Invert the document-term stream into a term-
document stream
▪Add additional information such as term
frequencies, term positions, and term weights
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Figure 27.4 Example of an Inverted Index
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Inverted Indexing (3 of 3)
• Searching for relevant documents from an inverted index
– Vocabulary search
– Document information retrieval
– Manipulation of retrieved information
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Introduction to Lucene
• Lucene: open source indexing/search engine
– Indexing is primary focus
• Document composed of set of fields
– Chunks of untokenized text
– Series of processed lexical units called token streams
▪Created by tokenization and filtering algorithms
• Highly-configurable search API
• Ease of indexing large, unstructured document
collections
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.6 Evaluation Measures of Search
Relevance (1 of 4)
• Topical relevance
– Measures result topic match to query topic
• User relevance
– Describes ‘goodness’ of retrieved result with regard to
user’s information need
• Web information retrieval
– No binary classification made for relevance or
nonrelevance
– Ranking of documents
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Evaluation Measures of Search
Relevance (2 of 4)
• Recall
– Number of relevant documents retrieved by a search
divided by the total number of actually relevant
documents existing in the database
• Precision
– Number of relevant documents retrieved by a search
divided by total number of documents retrieved by
that search
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Retrieved Versus Relevant Search Results
• TP: true positive
• FP: false positive
• TN: true negative
• FN: false negative
Figure 27.5 Retrieved versus relevant search results
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Evaluation Measures of Search
Relevance (3 of 4)
• Recall can be increased by presenting more results to the user
– May decrease the precision
Doc. No. Rank Position i Relevant Precision(i) Recall(i)
10 1 Yes 1/1 = 100% 1/10 = 10%
2 2 Yes 2/2 = 100% 2/10 = 20%
3 3 Yes 3/3 = 100% 3/10 = 30%
5 4 No 3/4 = 75% 3/10 = 30%
17 5 No 3/5 = 60% 3/10 = 30%
34 6 No 3/6 = 50% 3/10 = 30%
215 7 Yes 4/7 = 57.1% 4/10 = 40%
33 8 Yes 5/8 = 62.5% 5/10 = 50%
45 9 No 5/9 = 55.5% 5/10 = 50%
16 10 Yes 6/10 = 60% 6/10 = 60%
Table 27.2 Precision and recall for ranked retrieval
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Evaluation Measures of Search
Relevance (4 of 4)
• Average precision
– Computed based on the precision at each relevant
document in the ranking
• Recall/precision curve
– Based on the recall and precision values at each rank
position
▪x-axis is recall and y-axis is precision
• F-score
– Harmonic mean of the precision (p) and recall (r)
values
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.7 Web Search and Analysis (1 of 8)
• Search engines must crawl and index Web sites and
document collections
– Regularly update indexes
– Link analysis used to identify page importance
• Vertical search engines
– Customized topic-specific search engines that crawl
and index a specific collection of documents on the
Web
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Web Search and Analysis (2 of 8)
• Metasearch engines
– Query different search engines simultaneously and
aggregate information
• Digital libraries
– Collections of electronic resources and services for
the delivery of materials in a variety of formats
• Web analysis
– Applies data analysis techniques to discover and
analyze useful information from the Web
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Web Search and Analysis (3 of 8)
• Goals of Web analysis
– Finding relevant information
– Personalization of the information
– Finding information of social value
• Categories of Web analysis
– Web structure analysis
– Web content analysis
– Web usage analysis
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Web Search and Analysis (4 of 8)
• Web structure analysis
– Hyperlink
– Destination page
– Anchor text
– Hub
– Authority
• PageRank ranking algorithm
– Used by Google
– Analyzes forward links and backlinks
▪Highly linked pages are more important
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Web Search and Analysis (5 of 8)
• Web content analysis tasks
– Structured data extraction
▪Wrapper
– Web information integration
▪Web query interface integration
▪Schema matching
▪Ontology-based information integration
– Building concept hierarchies
– Segmenting web pages and detecting noise
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Web Search and Analysis (6 of 8)
• Approaches to Web content analysis
– Agent-based
▪Intelligent Web agents
▪Personalized Web agents
▪Information filtering/categorization
– Database-based
▪Attempts to organize a Web site as a database
▪Object Exchange Model
▪Multilevel database
▪Web query system
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Web Search and Analysis (7 of 8)
• Web usage analysis attempts to discover usage patterns
from Web data
– Preprocessing
▪Usage, content, structure
– Pattern discovery
▪Statistical analysis, association rules, clustering,
classification, sequential patterns, dependency
modeling
– Pattern analysis
▪Filter out patterns not of interest
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Web Search and Analysis (8 of 8)
• Practical applications of Web analysis
– Web analytics
▪Understand and optimize the performance of Web
usage
– Web spamming
▪Deliberate activity to promote a page by
manipulating search engine results
– Web security
▪Allow design of more robust Web sites
– Web crawlers
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.8 Trends in Information Retrieval (1 of 3)
• Faceted search
– Classifying content
• Social search
– Collaborative social search
• Conversational information access
– Intelligent agents perform intent extraction to provide
information relevant to a conversation
• Probabilistic topic modeling
– Automatically organize large collections of documents into
relevant themes
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Trends in Information Retrieval (2 of 3)
Figure 27.6 A document D and its topic proportions
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Trends in Information Retrieval (3 of 3)
• Question-answering systems
– Factoid questions
– List questions
– Definition questions
– Opinion questions
– Composed of question analysis, query generation,
search, candidate answer generation, and answer
scoring
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.9 Summary
• Information retrieval mainly targeted at unstructured data
• Query and browsing modes of interaction
• Retrieval models
– Boolean, vector space, probabilistic, and semantic
• Text preprocessing
• Web search
• Web ranking
• Trends
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Copyright

Chapter27 distributed database syst.pptx

  • 1.
    Fundamentals of DatabaseSystems Seventh Edition Chapter 27 Introduction to Information Retrieval and Web Search Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
  • 2.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.1 Information Retrieval (IR) Concepts (1 of 4) • Information retrieval – Process of retrieving documents from a collection in response to a query (search request) – Deals mainly with unstructured data ▪Example: homebuying contract documents • Unstructured information – Does not have a well-defined formal model – Based on an understanding of natural language – Stored in a wide variety of standard formats
  • 3.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Information Retrieval (IR) Concepts (2 of 4) • Information retrieval field predates database field – Academic programs in Library and Information Science • RDBMS vendors providing new capabilities to support various data types – Extended RDBMSs or object-relational database management systems • User’s information need expressed as free-form search request – Keyword search query
  • 4.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Information Retrieval (IR) Concepts (3 of 4) • Characterizing an IR system – Types of users ▪ Expert ▪ Layperson – Types of data ▪Domain-specific – Types of information needs ▪Navigational search ▪Informational search ▪Transactional search
  • 5.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Information Retrieval (IR) Concepts (4 of 4) • Enterprise search systems – Limited to an intranet • Desktop search engines – Searches an individual computer system • Databases have fixed schemas – IR system has no fixed data model
  • 6.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Comparing Databases and IR Systems Table 27.1 A comparison of databases and IR systems Databases IR Systems • Structured data • Unstructured data • Schema driven • No fixed schema; various data models (e.g., vector space model) • Relational (or object, hierarchical, and network) model is predominant • Free-form query models • Structured query model • Rich data operations • Rich metadata operations • Search request returns list or pointers to documents • Query returns data Blank • Results are based on exact matching (always correct) • Results are based on approximate matching and measures of effectiveness (may be imprecise and ranked)
  • 7.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved A Brief History of IR • Stone tablets and papyrus scrolls • Printing press • Public libraries • Computers and automated storage systems – Inverted file organization based on keywords and their weights as indexing method • Search engine • Crawler • Challenge: provide high quality, pertinent, timely information
  • 8.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Modes of Interactions in IR Systems • Primary modes of interaction – Retrieval ▪Extract relevant information from document repository – Browsing ▪Exploratory activity based on user’s assessment of relevance • Web search combines both interaction modes – Rank of a web page measures its relevance to query that generated the result set
  • 9.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Generic IR Pipeline (1 of 2) • Statistical approach – Documents analyzed and broken down into chunks of text – Each word or phrase is counted, weighted, and measured for relevance or importance • Types of statistical approaches – Boolean – Vector space – Probabilistic
  • 10.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Generic IR Pipeline (2 of 2) • Semantic approaches – Use knowledge-based retrieval techniques – Rely on syntactic, lexical, sentential, discourse-based, and pragmatic levels of knowledge understanding – Also apply some form of statistical analysis
  • 11.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Figure 27.1 Generic IR Framework
  • 12.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Figure 27.2 Simplified IR Process Pipeline
  • 13.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.2 Retrieval Models (1 of 5) • Boolean model – One of earliest and simplest IR models – Documents represented as a set of terms – Queries formulated using AND, OR, and NOT – Retrieved documents are an exact match ▪No notion of ranking of documents – Easy to associate metadata information and write queries that match contents of documents
  • 14.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieval Models (2 of 5) • Vector space model – Weighting, ranking, and determining relevance are possible – Uses individual terms as dimensions – Each document represented by an n-dimensional vector of values – Features ▪Subset of terms in a document set that are deemed most relevant to an IR search for the document set
  • 15.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieval Models (3 of 5) • Vector space model – Different similarity assessment functions can be used • Term frequency-inverse document frequency (TF-IDF) – Statistical weight measure used to evaluate the importance of a document word in a collection of documents – A discriminating term must occur in only a few documents in the general population
  • 16.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieval Models (4 of 5) • Probabilistic model – Involves ranking documents by their estimated probability of relevance with respect to the query and the document – IR system must decide whether a document belongs to the relevant set or nonrelevant set for a query ▪Calculate probability that document belongs to the relevant set – BM25: a popular ranking algorithm
  • 17.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieval Models (5 of 5) • Semantic model – Morphological analysis ▪Analyze roots and affixes to determine parts of speech of search words – Syntactic analysis ▪Parse and analyze complete phrases in documents – Semantic analysis ▪Resolve word ambiguities and generate relevant synonyms based on semantic relationships – Uses techniques from artificial intelligence and expert systems
  • 18.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.3 Types of Queries in IR Systems (1 of 4) • Keyword queries – Simplest and most commonly used – Keyword terms implicitly connected by logical AND • Boolean queries – Allow use of AND, OR, NOT, and other operators – Exact matches returned ▪No ranking possible
  • 19.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Types of Queries in IR Systems (2 of 4) • Phrase queries – Sequence of words that make up a phrase – Phrase enclosed in double quotes – Each retrieved document must contain at least one instance of the exact phrase • Proximity queries – How close within a record multiple search terms are to each other – Phrase search is most commonly used proximity query
  • 20.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Types of Queries in IR Systems (3 of 4) • Proximity queries – Specify order of search terms – NEAR, ADJ (adjacent), or AFTER operators – Sequence of words with maximum allowed distance between them – Computationally expensive ▪Suitable for smaller document collections rather than the Web
  • 21.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Types of Queries in IR Systems (4 of 4) • Wildcard queries – Supports regular expressions and pattern-based matching ▪Example ‘data*’ would retrieve data, database, dataset, etc. – Not generally implemented by Web search engines • Natural language queries – Definitions of textual terms or common facts – Semantic models can support
  • 22.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.4 Text Preprocessing (1 of 3) • Stopword removal must be performed before indexing • Stopwords – Words that are expected to occur in 80% or more of the documents of a collection ▪Examples: the, of, to, a, and, said, for, that – Do not contribute much to relevance • Queries preprocessed for stopword removal before retrieval process – Many search engines do not remove stopwords
  • 23.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Text Preprocessing (2 of 3) • Stemming – Trims suffix and prefix – Reduces the different forms of the word to a common stem – Martin Porter’s stemming algorithm • Utilizing a thesaurus – Important concepts and main words that describe each concept for a particular knowledge domain – Collection of synonyms – UMLS
  • 24.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Figure 27.3 A Portion of the UMLS Semantic Network: “Biologic Function” Hierarchy Source: UMLS Reference Manual, National Library of Medicine
  • 25.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Text Preprocessing (3 of 3) • Other preprocessing steps – Digits ▪May or may not be removed during preprocessing – Hyphens and punctuation marks ▪Handled in different ways – Cases ▪Most search engines use case-insensitive search • Information extraction tasks – Identifying noun phrases, facts, events, people, places, and relationships
  • 26.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.5 Inverted Indexing (1 of 3) • Inverted index structure – Vocabulary information ▪Set of distinct query terms in the document set – Document information – Data structure that attaches distinct terms with a list of all documents that contain the term
  • 27.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Inverted Indexing (2 of 3) • Construction of an inverted index – Break documents into vocabulary terms ▪Tokenizing, cleansing, removing stopwords, stemming, and/or using a thesaurus – Collect document statistics ▪Store statistics in document lookup table – Invert the document-term stream into a term- document stream ▪Add additional information such as term frequencies, term positions, and term weights
  • 28.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Figure 27.4 Example of an Inverted Index
  • 29.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Inverted Indexing (3 of 3) • Searching for relevant documents from an inverted index – Vocabulary search – Document information retrieval – Manipulation of retrieved information
  • 30.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Introduction to Lucene • Lucene: open source indexing/search engine – Indexing is primary focus • Document composed of set of fields – Chunks of untokenized text – Series of processed lexical units called token streams ▪Created by tokenization and filtering algorithms • Highly-configurable search API • Ease of indexing large, unstructured document collections
  • 31.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.6 Evaluation Measures of Search Relevance (1 of 4) • Topical relevance – Measures result topic match to query topic • User relevance – Describes ‘goodness’ of retrieved result with regard to user’s information need • Web information retrieval – No binary classification made for relevance or nonrelevance – Ranking of documents
  • 32.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Evaluation Measures of Search Relevance (2 of 4) • Recall – Number of relevant documents retrieved by a search divided by the total number of actually relevant documents existing in the database • Precision – Number of relevant documents retrieved by a search divided by total number of documents retrieved by that search
  • 33.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieved Versus Relevant Search Results • TP: true positive • FP: false positive • TN: true negative • FN: false negative Figure 27.5 Retrieved versus relevant search results
  • 34.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Evaluation Measures of Search Relevance (3 of 4) • Recall can be increased by presenting more results to the user – May decrease the precision Doc. No. Rank Position i Relevant Precision(i) Recall(i) 10 1 Yes 1/1 = 100% 1/10 = 10% 2 2 Yes 2/2 = 100% 2/10 = 20% 3 3 Yes 3/3 = 100% 3/10 = 30% 5 4 No 3/4 = 75% 3/10 = 30% 17 5 No 3/5 = 60% 3/10 = 30% 34 6 No 3/6 = 50% 3/10 = 30% 215 7 Yes 4/7 = 57.1% 4/10 = 40% 33 8 Yes 5/8 = 62.5% 5/10 = 50% 45 9 No 5/9 = 55.5% 5/10 = 50% 16 10 Yes 6/10 = 60% 6/10 = 60% Table 27.2 Precision and recall for ranked retrieval
  • 35.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Evaluation Measures of Search Relevance (4 of 4) • Average precision – Computed based on the precision at each relevant document in the ranking • Recall/precision curve – Based on the recall and precision values at each rank position ▪x-axis is recall and y-axis is precision • F-score – Harmonic mean of the precision (p) and recall (r) values
  • 36.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.7 Web Search and Analysis (1 of 8) • Search engines must crawl and index Web sites and document collections – Regularly update indexes – Link analysis used to identify page importance • Vertical search engines – Customized topic-specific search engines that crawl and index a specific collection of documents on the Web
  • 37.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Web Search and Analysis (2 of 8) • Metasearch engines – Query different search engines simultaneously and aggregate information • Digital libraries – Collections of electronic resources and services for the delivery of materials in a variety of formats • Web analysis – Applies data analysis techniques to discover and analyze useful information from the Web
  • 38.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Web Search and Analysis (3 of 8) • Goals of Web analysis – Finding relevant information – Personalization of the information – Finding information of social value • Categories of Web analysis – Web structure analysis – Web content analysis – Web usage analysis
  • 39.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Web Search and Analysis (4 of 8) • Web structure analysis – Hyperlink – Destination page – Anchor text – Hub – Authority • PageRank ranking algorithm – Used by Google – Analyzes forward links and backlinks ▪Highly linked pages are more important
  • 40.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Web Search and Analysis (5 of 8) • Web content analysis tasks – Structured data extraction ▪Wrapper – Web information integration ▪Web query interface integration ▪Schema matching ▪Ontology-based information integration – Building concept hierarchies – Segmenting web pages and detecting noise
  • 41.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Web Search and Analysis (6 of 8) • Approaches to Web content analysis – Agent-based ▪Intelligent Web agents ▪Personalized Web agents ▪Information filtering/categorization – Database-based ▪Attempts to organize a Web site as a database ▪Object Exchange Model ▪Multilevel database ▪Web query system
  • 42.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Web Search and Analysis (7 of 8) • Web usage analysis attempts to discover usage patterns from Web data – Preprocessing ▪Usage, content, structure – Pattern discovery ▪Statistical analysis, association rules, clustering, classification, sequential patterns, dependency modeling – Pattern analysis ▪Filter out patterns not of interest
  • 43.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Web Search and Analysis (8 of 8) • Practical applications of Web analysis – Web analytics ▪Understand and optimize the performance of Web usage – Web spamming ▪Deliberate activity to promote a page by manipulating search engine results – Web security ▪Allow design of more robust Web sites – Web crawlers
  • 44.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.8 Trends in Information Retrieval (1 of 3) • Faceted search – Classifying content • Social search – Collaborative social search • Conversational information access – Intelligent agents perform intent extraction to provide information relevant to a conversation • Probabilistic topic modeling – Automatically organize large collections of documents into relevant themes
  • 45.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Trends in Information Retrieval (2 of 3) Figure 27.6 A document D and its topic proportions
  • 46.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Trends in Information Retrieval (3 of 3) • Question-answering systems – Factoid questions – List questions – Definition questions – Opinion questions – Composed of question analysis, query generation, search, candidate answer generation, and answer scoring
  • 47.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved 27.9 Summary • Information retrieval mainly targeted at unstructured data • Query and browsing modes of interaction • Retrieval models – Boolean, vector space, probabilistic, and semantic • Text preprocessing • Web search • Web ranking • Trends
  • 48.
    Copyright © 2016,2011, 2007 Pearson Education, Inc. All Rights Reserved Copyright

Editor's Notes

  • #1 If this PowerPoint presentation contains mathematical equations, you may need to check that your computer has the following installed: 1) MathType Plugin 2) Math Player (free versions available) 3) NVDA Reader (free versions available)