Speaker: Professor Amin Karbasi (Yale
University)
Title: Cracking Big Data with Small Data
Abstract: For the last several years, we have witnessed the emergence of
datasets of an unprecedented scale across different scientific disciplines. The
large volume of such datasets presents new computational challenges as the
diverse, feature-rich, and usually high-resolution data does not allow for
effective data-intensive inference. In this regard, data summarization is a
compelling (and sometimes the only) approach that aims at both exploiting the
richness of large-scale data and being computationally tractable; Instead of
operating on complex and large data directly, carefully constructed summaries
not only enable the execution of various data analytics tasks but also improve
their efficiency and scalability.
A systematic way for data summarization is to turn the problem into selecting a
subset of data elements optimizing a utility function that quantifies
ÒrepresentativenessÓ of the selected set. Often-times, these objective
functions satisfy submodularity, an intuitive notion
of diminishing returns stating that selecting any given element earlier helps
more than selecting it later. Thus, many problems in data summarization require
maximizing submodular set functions subject to cardinality and massive data
means we have to solve these problems at scale.
In this talk, I will present our recent efforts in developing practical schemes
for data summarization. In particular, I will first discuss the fastest
centralized solution whose query complexity is only linear in data size.
However, to truly summarize massive data we need to opt for scalable methods.
I will then present a streaming algorithm that with a single pass over
the data provides a constant-factor approximation guarantee to the optimum
solution. Finally, I will talk about a distributed approach that summarizes
tens of millions of data points in a timely fashion. I will also demonstrate
experiments on several applications, including sparse Gaussian process
inference and exemplar-based clustering using Apache-Spark.
Bio: Amin Karbasi is currently an assistant professor
of Electrical Engineering and Computer Science at Yale University. He has
been the recipient of the Grainger grant 2017 from National Academy of
Engineering, DARPA Young Faculty Award 2016, Simons-Berkeley fellowship 2016,
Google Faculty Award 2015, and ETH fellowship 2013. His work has been
recognized with a variety of paper awards from AISTATS 2015, IEEE Data Storage
2013, ICASSP 2011, ACM SIGMETRICS 2010, and ISIT 2010 (runner-up). His Ph.D.
work received the Patrick Denantes Memorial Prize
2013 for the best doctoral thesis in the School of Computer and Communication
Sciences at EPFL.