An Approximate L^p Difference Algorithm for Massive Data Streams
Authors: Jessica H. Fong ^{1}; Martin Strauss ^{2}
NULL##NULL
Jessica H. Fong;Martin Strauss
1 Department of Computer Science
2 AT&T Laboratories
Several recent papers have shown how to approximate the difference ∑ _i|a_i-b_i| or ∑ |a_i-b_i|^2 between two functions, when the function values a_i and b_i are given in a data stream, and their order is chosen by an adversary. These algorithms use little space (much less than would be needed to store the entire stream) and little time to process each item in the stream. They approximate with small relative error. Using different techniques, we show how to approximate the L^p-difference ∑ _i|a_i-b_i|^p for any rational-valued p∈(0,2], with comparable efficiency and error. We also show how to approximate ∑ _i|a_i-b_i|^p for larger values of p but with a worse error guarantee. Our results fill in gaps left by recent work, by providing an algorithm that is precisely tunable for the application at hand. These results can be used to assess the difference between two chronologically or physically separated massive data sets, making one quick pass over each data set, without buffering the data or requiring the data source to pause. For example, one can use our techniques to judge whether the traffic on two remote network routers are similar without requiring either router to transmit a copy of its traffic. A web search engine could use such algorithms to construct a library of small ''sketches,'' one for each distinct page on the web; one can approximate the extent to which new web pages duplicate old ones by comparing the sketches of the web pages. Such techniques will become increasingly important as the enormous scale, distributional nature, and one-pass processing requirements of data sets become more commonplace.