cc150/ch10/6.md at master · KickAlgorithm/cc150 · GitHub

8 lines (5 loc) · 455 Bytes

You have W billion URLs. How do you detect the duplicate documents?In this case, assume that "duplicate" means that the URLs are identical.

If all URLs can put in memory, we could simply use hash table.
Two pass method. First, allocate 4000 buckets in disk, and hash the URLs into [0,4000]. Then check each bucket, because same url will match to the same bucket.
If in distributed systems, we can do it in parallel in 4000 different machines.