Finding useful information in a sea of garbage

Web data can be massive, but finding nuggets of useful information in it is hard. For example, suppose you have a language model trained on newswire. How can you efficiently identify web sentences that, when added to the newswire language model, reduce its perplexity? One way to tackle this problem is to treat every newswire sentence as a query and perform a nearest-neighbour search over the web data for each query.
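The nearest-neighbour idea above can be sketched in a few lines. This is only a minimal, brute-force illustration under stated assumptions: sentences are represented as bag-of-words counts and compared by cosine similarity, neither of which is specified by the project description, and the function names (vectorise, cosine, nearest_web_sentences) are hypothetical.

```python
import math
from collections import Counter


def vectorise(sentence):
    """Bag-of-words counts for one sentence (assumed representation)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_web_sentences(queries, web_sentences, k=1):
    """For each newswire query, return the k most similar web sentences.

    Brute force: every query is compared against every web sentence.
    """
    web_vectors = [(s, vectorise(s)) for s in web_sentences]
    results = {}
    for query in queries:
        q = vectorise(query)
        ranked = sorted(web_vectors,
                        key=lambda item: cosine(q, item[1]),
                        reverse=True)
        results[query] = [s for s, _ in ranked[:k]]
    return results


# Toy data, invented purely for illustration.
newswire = ["the central bank raised interest rates"]
web = [
    "click here to win a free phone",
    "the bank raised rates again today",
    "lol cats are funny",
]
matches = nearest_web_sentences(newswire, web, k=1)
```

The cost of this sketch grows with the number of queries times the number of web sentences, so it would not be practical at the scale of the full web data without some form of indexing, approximation, or parallelism.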

This project will have access to 110 billion words of web data. (Smaller versions exist.)

A sequential approach to this problem can be found here