An Approximate L^p Difference Algorithm for Massive Data Streams

Jessica H. Fong; Martin Strauss

doi:10.46298/dmtcs.290

Jessica H. Fong ; Martin Strauss - An Approximate L^p Difference Algorithm for Massive Data Streams

dmtcs:290 - Discrete Mathematics & Theoretical Computer Science, January 1, 2001, Vol. 4 no. 2 - https://doi.org/10.46298/dmtcs.290

An Approximate L^p Difference Algorithm for Massive Data StreamsArticle

Authors: Jessica H. Fong ¹; Martin Strauss ²

1 Department of Computer Science
2 AT&T Laboratories

Several recent papers have shown how to approximate the difference ∑ _i|a_i-b_i| or ∑ |a_i-b_i|^2 between two functions, when the function values a_i and b_i are given in a data stream, and their order is chosen by an adversary. These algorithms use little space (much less than would be needed to store the entire stream) and little time to process each item in the stream. They approximate with small relative error. Using different techniques, we show how to approximate the L^p-difference ∑ _i|a_i-b_i|^p for any rational-valued p∈(0,2], with comparable efficiency and error. We also show how to approximate ∑ _i|a_i-b_i|^p for larger values of p but with a worse error guarantee. Our results fill in gaps left by recent work, by providing an algorithm that is precisely tunable for the application at hand. These results can be used to assess the difference between two chronologically or physically separated massive data sets, making one quick pass over each data set, without buffering the data or requiring the data source to pause. For example, one can use our techniques to judge whether the traffic on two remote network routers are similar without requiring either router to transmit a copy of its traffic. A web search engine could use such algorithms to construct a library of small ''sketches,'' one for each distinct page on the web; one can approximate the extent to which new web pages duplicate old ones by comparing the sketches of the web pages. Such techniques will become increasingly important as the enormous scale, distributional nature, and one-pass processing requirements of data sets become more commonplace.

https://doi.org/10.46298/dmtcs.290

Source: HAL:hal-00958964v1

Volume: Vol. 4 no. 2

Published on: January 1, 2001

Imported on: March 26, 2015

Keywords: [INFO.INFO-DM]Computer Science [cs]/Discrete Mathematics [cs.DM], [en] streaming algorithms, data streams, Lp norms

Licence: Hal authorisation v1

Bibliographic References

Share and export

Consultation statistics

This page has been seen 527 times.

This article's PDF has been downloaded 643 times.