A new algorithm to seriate noisy data

Archaeologists and other scientists sometimes have to seriate data, i.e., arrange the data in a sequence according to prescribed criteria.

In his research, Professor Aaron Smith studies a concept named ‘noisy seriation’: Given a collection of types of objects (e.g., the collection of all types of pottery glaze used during a Sumerian dynasty) and information about when these objects co-occurred (e.g., when two different types of glaze are found at the same dig site), approximately reconstruct the relative ages of the objects (e.g., which glazes were developed first, second, etc.).

Archaeologists have been using seriation since 1898, and over the last 40 years computer scientists have developed algorithms for efficiently solving many variants of the problem. However, none of these algorithms has all of the following properties: (i) giving reasonable answers for noisy data, (ii) being computationally tractable (a computationally intractable algorithm is one for which no computer could calculate the outcome in a reasonable amount of time), and (iii) giving substantially more accurate estimates than the usual "spectral embedding" approach. Since all data is noisy and since it is not possible to run computationally intractable algorithms, there had not been, until Prof. Smith’s work, any substantial improvement on the well-known "spectral embedding" approach.

Prof. Smith developed a new "clean-up" procedure (algorithm) and showed that it gives reasonable answers for most models of noisy data, that it is always computationally tractable, and that it is much more accurate than "spectral embedding" approaches for a wide variety of models. Furthermore, he demonstrated that "spectral embedding"-based approaches are something of a dead end: any embedding-based approach is substantially less accurate than his algorithm.
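For readers unfamiliar with the baseline, the sketch below illustrates the classic "spectral embedding" approach the article refers to: items are embedded on a line using the Fiedler vector (the eigenvector of the second-smallest eigenvalue) of the graph Laplacian built from a co-occurrence matrix, and then sorted along that line. This is a minimal sketch of the standard baseline, not Prof. Smith's algorithm, whose details the article does not give; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def spectral_seriation(similarity):
    """Order items by the Fiedler vector of the graph Laplacian.

    similarity: symmetric (n, n) array; entry (i, j) measures how often
    items i and j co-occur (e.g., two glazes found at the same site).
    Returns a permutation of item indices approximating their true order
    (possibly reversed, since the ordering has no preferred direction).
    """
    A = np.asarray(similarity, dtype=float)
    degrees = A.sum(axis=1)
    laplacian = np.diag(degrees) - A            # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                     # eigenvector of 2nd-smallest eigenvalue
    return np.argsort(fiedler)                  # sort items along the 1-D embedding

# Hypothetical toy example: five "glazes" whose co-occurrence decays
# with the difference in their (unknown) ages, plus a little noise.
true_order = np.arange(5)
sim = np.exp(-np.abs(true_order[:, None] - true_order[None, :]))
rng = np.random.default_rng(0)
noisy = sim + 0.05 * rng.standard_normal(sim.shape)
noisy = (noisy + noisy.T) / 2                   # keep the matrix symmetric
print(spectral_seriation(noisy))                # close to 0..4 or its reversal
```

The key design choice in this baseline is that all structure is forced through a one-dimensional continuous embedding before sorting, which is exactly the bottleneck Prof. Smith's lower bounds target.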

Professor Aaron Smith

The immediate impact of Prof. Smith’s work comes from the algorithm he developed. Scientists who need to seriate pairwise-similarity data (such as neuroscientists, people interested in ranking choices, etc.) can use his algorithm instead of, or as a supplement to, traditional embedding-based approaches. In the long term, Prof. Smith expects that his algorithm will be supplanted and that his theoretical results will have the larger impact. The most popular approaches to seriation, ranking, and similar problems all rely on embedding points in a continuous latent space. Prof. Smith showed that there are fundamental limits to the accuracy of any such embedding-based approach. He hopes that his work will redirect researchers' attention away from the hopeless task of sharpening their embeddings and towards more promising algorithms.
