You have W billion URLs. How do you detect the duplicate documents?In this case, assume that "duplicate" means that the URLs are identical.
- If all URLs can put in memory, we could simply use hash table.
- Two pass method. First, allocate 4000 buckets in disk, and hash the URLs into [0,4000]. Then check each bucket, because same url will match to the same bucket.
- If in distributed systems, we can do it in parallel in 4000 different machines.