Beyond document retrieval using semantic annotations 
Roi Blanco (roi@yahoo-inc.com) 
http://labs.yahoo.com/Yahoo_Labs_Barcelona
Yahoo! Research Barcelona 
• Established January, 2006 
• Led by Ricardo Baeza-Yates 
• Research areas 
• Web Mining 
• Social Media 
• Distributed Web retrieval 
• Semantic Search 
• NLP and Semantics
Contributions 
• Hugo Zaragoza, Michael Matthews, Jordi Atserias, Roi Blanco, Peter Mika 
• Sebastiano Vigna, Paolo Boldi (U. Milan): indexing (MG4J) 
• "Every time I fire a linguist my performance goes up…" (Fred Jelinek) 
  Great strategy until you've fired them all… but what then?
Agenda 
• Search: this was then, this is now 
• Natural language processing and search 
• Semantic Search 
• Search over annotated documents 
• Time Explorer
Natural Language Retrieval 
• How to exploit the structure and meaning of 
natural language text to improve search 
• Current search engines perform only limited NLP 
(tokenization, stemming) 
• Automated tools exist for deeper analysis 
• Applications to diversity-aware search 
• Source, Location, Time, Language, Opinion, 
Ranking… 
• Search over semi-structured data, semantic 
search 
• Roll out user experiences that use higher layers 
of the NLP stack
WEB SEARCH
Structured data - Web search 
• Top-1 entity with structured data 
• Related entities 
• Structured data extracted from HTML
New devices 
• Different interaction (e.g. voice) 
• Different capabilities (e.g. display) 
• More Information (geo-localization) 
• More personalized
Yahoo! Axis 
• Smarter, faster search: instant answers, visual 
previews, infinite browsing 
• Connected experience across devices: iPhone, iPad, 
Firefox, Safari, Internet Explorer, Chrome 
• Personalized home page: sign in with Yahoo!, Google, 
or Facebook; direct access to favorite sites, 
saved articles and bookmarks
SEMANTIC SEARCH
Semantic Search 
• What kinds of search and applications are 
possible beyond string matching or 
returning 10 blue links? 
• Can we have a better understanding of 
documents and queries? 
• New devices open new possibilities, new 
experiences 
• Is current technology in natural language 
understanding mature enough?
Semantic Search (II) 
• Matching the user’s query with the Web’s content at a 
conceptual level, often with the help of world knowledge 
– Natural Language Search 
• Exploiting the (implicit) structure and semantics of natural language 
• Intersection of IR and NLP 
– Semantic Web Search 
• Exploiting the (explicit) meaning of data 
• Intersection of IR and Semantic Web 
• As a field 
– ISWC/ESWC/ASWC, WWW, SIGIR, VLDB, CIKM 
– Exploring Semantic Annotations in Information Retrieval 
(ECIR08, WSDM09) 
– Semantic Search Workshop (ESWC08, WWW09, WWW10) 
– Future of Web Search: Semantic Search (FoWS09)
State of search 
• "We are at the beginning of search." (Marissa Mayer) 
• Old battles are won 
– Marginal returns on investments in crawling, indexing, 
ranking 
– Solved large classes of queries (e.g. navigational) 
– Lots of tempting, but high-hanging fruit 
• Currently, the biggest bottlenecks in IR are not 
computational, but in modeling user cognition 
– If only we could find a computationally expensive way to 
solve the problem… 
• In particular, solving queries that require a deeper understanding of the 
query, the content and/or the world at large 
– Corollary: go beyond string matching!
Some examples… 
• Ambiguous searches 
– paris hilton 
• Multimedia search 
– paris hilton sexy 
• Imprecise or overly precise searches 
– jim hendler 
– pictures of strong adventures people 
• Searches for descriptions 
– 33 year old computer scientist living in barcelona 
– reliable digital camera under 300 dollars 
• Searches that require aggregation 
– height eiffel tower 
– harry potter movie review 
– world temperature 2020
Is NLU that complex? 
"A child of five would understand this. 
Send someone to fetch a child of five." 
(Groucho Marx)
Language is Ambiguous 
The man saw the girl with the telescope 
(who has the telescope: the man or the girl?)
Paraphrases 
• "This parrot is dead" 
• "This parrot has kicked the bucket" 
• "This parrot has passed away" 
• "This parrot is no more" 
• "He's expired and gone to meet his maker" 
• "His metabolic processes are now history"
Not just search…
Semantics at every step of the IR process 
[Diagram: the IR engine mediates between the user's query and the Web; 
semantics can enter at every stage of the pipeline: query interpretation, 
document processing, indexing, ranking θ(q,d), and result presentation.]
Understanding Queries 
• Query logs are a big source of information & 
knowledge 
  – To rank results better (what you click) 
  – To understand queries better 
  – e.g. session reformulations: Paris → Paris Flights, 
    Paris → Paris Hilton (a toy sketch follows)
"Understand" Documents 
NLU is still an open issue
NLP for IR 
• Full NLU is AI-complete and does not scale to web 
size (parsing the web is really hard). 
• BUT… what about shallower NLP techniques? 
• Hypotheses/requirements: 
  • Linear extraction/parsing time 
  • Error-prone output (e.g. 60-90% accuracy) 
• Highly redundant information 
• Explore new ways of browsing 
• Support your answers
Usability 
We also fail at using the technology… sometimes
Support your answers 
Errors happen: choose the right ones! 
• Humans need to “verify” unknown facts 
• Multiple sources of evidence 
• Common sense vs. Contradictions 
• are you sure? is this spam? Interesting! 
• Tolerance to errors greatly increases if users can 
verify things fast 
• Importance of snippets, image search 
• Often the context is as important as the fact 
• E.g. “S discovered the transistor in X” 
• There are different kinds of errors 
• Ridiculous result (decreases overall confidence in system) 
• Reasonably wrong result (makes us feel good)
SEARCH OVER ANNOTATED 
DOCUMENTS
Annotated documents 
"Barack Obama visited Tokyo this Monday as part of an extended Asian trip. 
He is expected to deliver a speech at the ASEAN conference next Tuesday." 
After annotation, the relative time expressions are resolved to concrete 
dates against the publication date (here: 20 May 2009 and 28 May 2009).
How does it work? 
[Diagram: documents are indexed twice: an inverted index at the 
sentence/document level for query matching, and a forward index at the 
entity level for annotation lookup. Example entities: Monty Python, 
Flying Circus, John Cleese, Brian.]
Efficient element retrieval 
• Goal 
– Given an ad-hoc query, return a list of documents and 
annotations ranked according to their relevance to the query 
• Simple Solution 
– For each document that matches the query, retrieve its 
annotations and return the ones with the highest counts 
• Problems 
– If there are many documents in the result set this will take too 
long - too many disk seeks, too much data to search through 
– What if counting isn’t the best method for ranking elements? 
• Solution 
– Special compressed data structures designed specifically for 
annotation retrieval
Forward Index 
• Access metadata and document contents 
– Length, terms, annotations 
• Compressed (in memory) forward indexes 
– Gamma, Delta, Nibble, Zeta codes (power laws) 
• Retrieving and scoring annotations 
– Sort terms by frequency 
• Random access using an extra compressed 
pointer list (Elias-Fano)
Parallel Indices 
• Standard index contains only tokens 
• Parallel indices contain annotations on the tokens; the 
annotation indices must be aligned with the main token index 
• For example: given the sentence "New York has great 
pizza", where New York has been annotated as a LOCATION 
  – Token index has five entries 
    ("new", "york", "has", "great", "pizza") 
  – The annotation index has five entries 
    ("LOC", "LOC", "O", "O", "O") 
  – Can optionally encode BIO format (e.g. LOC-B, LOC-I) 
• To search for the New York location entity, we search for: 
  "token:New ^ entity:LOC token:York ^ entity:LOC" (toy sketch below)
Parallel Indices (II) 
Doc #3: The last time Peter exercised was in the XXth century. 
Doc #5: Hope claims that in 1994 she run to Peter Town. 
Peter → D3:1, D5:9 
Town → D5:10 
Hope → D5:1 
1994 → D5:5 
… 
Possible Queries: 
“Peter AND run” 
“Peter AND WNS:N_DATE” 
“(WSJ:CITY ^ *) AND run” 
“(WSJ:PERSON ^ Hope) AND run” 
WSJ:PERSON → D3:1, D5:1 
WSJ:CITY → D5:9 
WNS:V_DATE → D5:5 
(Bracketing can also be dealt with)
Pipelined Architecture
TIME EXPLORER
Time(ly) opportunities 
Can we create new user experiences based on a deeper analysis and 
exploration of the time dimension? 
• Goals: 
– Build an application that helps users explore, 
interact with, and ultimately understand existing 
information about the past and the future. 
– Help the user cope with the information overload 
and eventually find/learn about what she’s looking 
for.
Original Idea 
• R. Baeza-Yates, Searching the Future, MF/IR 2005 
– On December 1st 2003, on Google News, there were more than 100K 
references to 2004 and beyond. 
– E.g. 2034: 
• The ownership of Dolphin Square in London must revert to an insurance company. 
• Voyager 2 should run out of fuel. 
• Long-term care facilities may have to house 2.1 million people in the USA. 
• A human base on the Moon would be in operation.
Time Explorer 
• Public demo since August 2010 
• Winner of HCIR NYT Challenge 
• Goal: explore news through time and into 
the future 
• Using a customized Web crawl from news 
and blog feeds 
• http://fbmya01.barcelonamedia.org:8080/future/
Time Explorer
Time Explorer - Motivation 
• Time is important to search 
  – Recency, particularly in news, is highly related to relevance 
• But what about evolution over time? 
  – How has a topic evolved over time? 
  – How did the entities (people, places, etc.) evolve with respect to the topic over time? 
  – How will this topic continue to evolve in the future? 
  – How does bias and sentiment in blogs and news change over time? 
• Related tools: Google Trends, Yahoo! Clues, RecordedFuture… 
• Great research playground 
• Open source!
Time Explorer
Analysis Pipeline 
• Tokenization, sentence splitting, part-of-speech 
tagging, chunking with OpenNLP (sketched below) 
• Entity extraction with the SuperSense tagger 
• Time expressions extracted with TimeML 
  – Explicit dates (August 23rd, 2008) 
  – Relative dates (next year, resolved with the publication date) 
• Sentiment analysis with LivingKnowledge 
• Ontology matching with Yago 
• Image analysis: sentiment and face detection
Indexing/Search 
• Lucene/Solr search platform to index and search 
– Sentence level 
– Document level 
• Facets for annotations (multiple fields for faster 
entity-type access) 
• Index publication date and content date: the extracted 
dates if they exist, otherwise the publication date 
• Solr faceting supports ranking entities for a query and 
aggregating counts over time (see the sketch below) 
• Content date enables search into the future
Timeline
Timeline - Document
Facets
Timeline – Facet Trend
Timeline – Future
Opinions
Quotes
Other challenges 
– Large scale processing 
• Distributed computing 
• Shift from batch (Hadoop) to online (S4) 
– Efficient extraction/retrieval, algorithmic/data 
structures 
• Critical for interactive exploration 
– Connection with the user experience 
• Measures! User engagement? 
– Personalization 
– Integration with Knowledge Bases (Semantic Web) 
– Multilingual
