Ranking the Web with Spark
Apache Big Data Europe 2016
sylvain@sylvainzimmer.com
@sylvinus
/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)
transparency
reproducibility
https://uidemo.commonsearch.org
https://explain.commonsearch.org/?q=python&g=en
Ranking
Disclaimer: IANASRE
(I Am Not A Search Relevance Engineer)
What's in a score
score = fn( doc, query, language, user, time )
What's in a score
score = fn( doc, query )
What's in a score
score = fn( static_score, dynamic_score ( query ))
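A minimal sketch of this decomposition, assuming a simple linear blend. The slides do not say how the two parts are combined, so `final_score`, `static_weight`, and the example URLs are hypothetical and only illustrate a query-independent score computed offline being mixed with a query-dependent score computed at search time.

```python
def final_score(static_score: float, dynamic_score: float,
                static_weight: float = 0.3) -> float:
    """Blend a query-independent score with a query-dependent one.

    The linear blend and the 0.3 weight are assumptions for illustration.
    """
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score


# Hypothetical documents: "static" would be precomputed at index time,
# "dynamic" would be computed for the incoming query.
docs = {
    "a.example/python": {"static": 0.9, "dynamic": 0.4},
    "b.example/python-tutorial": {"static": 0.2, "dynamic": 0.8},
}

ranked = sorted(
    docs,
    key=lambda url: final_score(docs[url]["static"], docs[url]["dynamic"]),
    reverse=True,
)
print(ranked)
```

With these weights, a strong text match can outrank a page with a higher static score.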
Static score
Static features
• Scopes:
• Page: URL depth, markup stats, ...
• Domain: Age, page count, blacklists, ...
• WebGraph: PageRank, ...
http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Indexer
Database
Searcher
Ranker
Dynamic score
Dynamic features
• Text match: TF-IDF, BM25, proximity, topic, ...
• Query-level: number of words, popularity, ...
• Usage: clicks, dwell time, reformulations, ...
• Time
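As an illustration of the text-match features listed above, here is a small sketch of BM25 over tokenized documents. It uses the common Okapi formulation with the non-negative IDF variant and the usual defaults `k1=1.2`, `b=0.75`. The function name and the toy corpus are assumptions, and this is not presented as what Common Search or Elasticsearch computes internally.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document (docs are non-empty lists of tokens)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    ["python", "spark", "ranking"],
    ["python", "python", "tutorial"],
    ["cooking", "recipes"],
]
scores = bm25_scores(["python"], corpus)
print(scores)
```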
Scoring function
Users
Database
Elasticsearch
Indexer
Python, Spark
Data sources
Common Crawl, Alexa top 1M, ...
words, static score
query top 10 docs, final scores
Offline
Online
Searcher
Go
https://explain.commonsearch.org/?q=python&g=en
Issues with this architecture
• Static & dynamic scoring are in different codebases
• No control over result diversity
• Hard to optimize
• Very dependent on Elasticsearch
Rescoring
Users
Database
Indexer
words, static score, features
query
Searcher
top 1k docs, features
Rescorer
final 10 docs
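The rescoring step above can be sketched as follows, assuming the searcher returns candidates as `(doc_id, features)` pairs and the rescorer is a weighted sum of features. The feature names, weights, and the `rescore` function are hypothetical; a real rescorer could be any model.

```python
import heapq


def rescore(candidates, weights, k=10):
    """Re-rank candidate docs by a weighted sum of their features; keep the top k.

    candidates: list of (doc_id, {feature_name: value}) pairs.
    weights: {feature_name: weight}; features without a weight are ignored.
    """
    def score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())

    top = heapq.nlargest(k, candidates, key=lambda c: score(c[1]))
    return [doc_id for doc_id, _ in top]


# Hypothetical first-pass result: 1,000 candidates with two features each.
candidates = [
    (f"doc{i}", {"bm25": i % 7, "pagerank": i / 1000})
    for i in range(1000)
]
final_docs = rescore(candidates, {"bm25": 1.0, "pagerank": 2.0})
print(final_docs)
```

Because only the first-pass candidates are rescored, a document the searcher ranks below the cutoff can never reach the final page, which is one source of the pagination issue.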
Issues with rescoring
• Latency
• Pagination
• Harder to explain
Learning to rank
LTR Model
• Features
• Training dataset
• Evaluation: NDCG, ERR, ...
• Algorithms: AdaRank, ListNet, LambdaMART, ...
• Learning with Spark!
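For the evaluation bullet, a short sketch of NDCG@k using the exponential-gain form of DCG. Input is the list of graded relevance labels in the order the ranker returned the documents; the function names and the sample labels are illustrative.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k results."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


print(ndcg([3, 2, 1, 0]))  # ideal ordering
print(ndcg([0, 1, 2, 3]))  # worst ordering of the same labels
```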
The right questions
• What do users expect?
• What features?
• How to evaluate and fine-tune in the real world?
PageRank with Spark
http://commoncrawl.org
https://github.com/commonsearch/cosr-back
Common Search Pipeline
Doc sources
Common Crawl, WARC files, URLs ...
Filter plugins
Document parsing
Output plugins
Data output
Database, file, HDFS, S3, ...
Most popular Wikipedia pages
Dumping the web graph
Naive pyspark PageRank
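A minimal plain-Python sketch of the power iteration that a naive PageRank job distributes. This is not the pyspark code from the talk: it runs on an in-memory dict so it stays self-contained. The damping factor of 0.85, the uniform redistribution of dangling-node rank, and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {source: [destinations]}. Returns {node: rank}, ranks sum to 1."""
    nodes = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        contribs = {node: 0.0 for node in nodes}
        dangling = 0.0
        for node in nodes:
            outlinks = links.get(node, [])
            if not outlinks:
                # Nodes without outlinks spread their rank over all nodes.
                dangling += ranks[node]
                continue
            share = ranks[node] / len(outlinks)
            for dst in outlinks:
                contribs[dst] += share
        ranks = {
            node: (1 - damping) / n + damping * (contribs[node] + dangling / n)
            for node in nodes
        }
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))
```

In a distributed version, the inner loop corresponds to joining the link table with the rank table and summing contributions per destination.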
GraphFrames
SparkSQL PageRank
https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py
Tests
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py
https://about.commonsearch.org/developer/get-started
Top 10
Spam
Spamdexing
• Keyword stuffing, hidden text
• Scraper sites, Mirrors
• Link farms
• Splogs, Comment spam
• Domaining
• Cloaking
• Link bombing (Google bombing)
Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
contact@commonsearch.org
slack.commonsearch.org
